otherwise a bunch of jobs will fail like following:
```
[WARNING]: Unable to parse /home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-ubuntu-container-stable-3.2-bluestore_lvm_osds/tests/functional/bs-lvm-osds/container/hosts-ubuntu as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
```
As of testinfra 2.0.0, the binary name is `py.test`.
But let's pin the version to 1.19.0.
Indeed, migrating to 2.0.0 requires our current testing to be reworked a bit.
Since we don't have the bandwidth ATM for this, it's better to simply
keep testing with testinfra 1.19.0.
Note that I've replaced all `testinfra` occurences by `py.test` anyway.
removing the content of this directory seems a bit agressive and cause a
redeployment to fail after a purge on debian based distrubition.
Typical error:
```
fatal: [mon0]: FAILED! => changed=false
attempts: 3
msg: No package matching 'ceph' is available
```
The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
apt:
update_cache: yes
cache_valid_time: 3600
register: result
until: result is succeeded
```
since the task installing ceph packages has a `update_cache: no` it
fails:
```
- name: install ceph for debian
apt:
name: "{{ debian_ceph_pkgs | unique }}"
update_cache: no
state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
register: result
until: result is succeeded
```
/tmp/* isn't specific to ceph as well, so we shouldn't remove everything
in this directory.
using `shell` module seems to be the only way to make this task working
on rhel based distribution AND debian based distributions.
on ubuntu, using `command` ansible module fails like following
(not due to `sudo` usage or not):
```
ok: [osd1] => changed=false
cmd: command -v ceph-volume
failed_when_result: false
msg: '[Errno 2] No such file or directory: ''command'': ''command'''
rc: 2
```
Kevin Coakley [Thu, 28 Feb 2019 20:57:03 +0000 (12:57 -0800)]
Add changed_when: false to the "get osd ids" statement
The "get osd ids" statement only registers the osd_ids_non_container variable. Running "ls /var/lib/ceph/osd/ | sed 's/.*-//'" should never produce a change on the system. Adding changed_when: false prevents irrelevant change messages from Ansible.
Dimitri Savineau [Wed, 27 Feb 2019 16:07:38 +0000 (11:07 -0500)]
mon: Add mds permissions to client.admin
The administrator keyring needs full capabilities on mds like mon,
osd and mgr.
Whithout this, the client.admin key won't be able to run commands
against mds (like ceph tell mds.0 session ls)
Kevin Coakley [Tue, 26 Feb 2019 17:30:31 +0000 (09:30 -0800)]
Set permissions on monitor directory to u=rwX,g=rX,o=rX recursive
Set directories to 755 and files to 644 to /var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }} recursively instead of setting files and directories to 755 recursively. The ceph mon process writes files to this path with permissions 644. This update stops ansible from updating the permissions in /var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }} every time ceph mon writes a file and increases idempotency.
the previous approach was wrong.
checking if `item.key` is in `osd_auto_discovery_exclude` (`['dm-',
'loop']`) is incorrect because it will obviously not match. Therefore,
the condition will return `True` whatever the device we are checking.
when the following failure is thrown
```
rhel-8.0.0-beta-1.7- [=== ] --- B/s | 0 B --:-- ETArhel-8.0.0-beta-1.7-appstream 0.0 B/s | 0 B 00:00
rhel-8.0.0-beta-1.7- [=== ] --- B/s | 0 B --:-- ETArhel-8.0.0-beta-1.7-baseos 0.0 B/s | 0 B 00:00
rhel-8.0.0-beta-1.7- [ === ] --- B/s | 0 B --:-- ETArhel-8.0.0-beta-1.7-builder 0.0 B/s | 0 B 00:00
Failed to synchronize cache for repo 'rhel-8.0.0-beta-1.7-appstream', ignoring this repo.
Failed to synchronize cache for repo 'rhel-8.0.0-beta-1.7-baseos', ignoring this repo.
Failed to synchronize cache for repo 'rhel-8.0.0-beta-1.7-builder', ignoring this repo.
No match for argument: python3
Error: Unable to find a match
```
dnf returns 0 anyway.
Let's ensure the pattern 'Failed' isn't present in the output.
I didn't use the `ceph/ubuntu-bionic` image because it's broken at the
time of writing this commit. I'll switch back to `ceph/ubuntu-bionic` as
soon as it will be fixed.
osd: make the 'wait for all osd to be up' task configurable
introduce two new variables to make the check that 'wait for all osd to
be up' configurable.
It's possible that for some deployments, OSDs can take longer to be seen
as UP and IN.
TASK [ceph-mgr : fetch ceph mgr keyring] ***************************************************************************************************************************************************************************
skipping: [mon-000] => {
"changed": false,
"skip_reason": "Conditional result was False"
}
skipping: [mon-002] => {
"changed": false,
"skip_reason": "Conditional result was False"
}
skipping: [mon-001] => {
"changed": false,
"skip_reason": "Conditional result was False"
}
TASK [ceph-mgr : copy ceph keyring(s) if needed] *******************************************************************************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
failed: [mon-002] (item={'name': '/etc/ceph/ceph.mgr.li895-17.keyring', 'dest': '/var/lib/ceph/mgr/ceph-li895-17/keyring', 'copy_key': True}) => {
"changed": false,
"item": {
"copy_key": true,
"dest": "/var/lib/ceph/mgr/ceph-li895-17/keyring",
"name": "/etc/ceph/ceph.mgr.li895-17.keyring"
}
}
MSG:
Could not find or access 'fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring'
Searched in:
/root/ceph-ansible/roles/ceph-mgr/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring
/root/ceph-ansible/roles/ceph-mgr/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring
/root/ceph-ansible/roles/ceph-mgr/tasks/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring
/root/ceph-ansible/roles/ceph-mgr/tasks/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring
/root/ceph-ansible/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring
/root/ceph-ansible/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li895-17.keyring on the Ansible Controller.
If you are using a module and expect the file to exist on the remote, see the remote_src option
skipping: [mon-002] => (item={'name': '/etc/ceph/ceph.client.admin.keyring', 'dest': '/etc/ceph/ceph.client.admin.keyring', 'copy_key': False}) => {
"changed": false,
"item": {
"copy_key": false,
"dest": "/etc/ceph/ceph.client.admin.keyring",
"name": "/etc/ceph/ceph.client.admin.keyring"
},
"skip_reason": "Conditional result was False"
}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
failed: [mon-001] (item={'name': '/etc/ceph/ceph.mgr.li985-128.keyring', 'dest': '/var/lib/ceph/mgr/ceph-li985-128/keyring', 'copy_key': True}) => {
"changed": false,
"item": {
"copy_key": true,
"dest": "/var/lib/ceph/mgr/ceph-li985-128/keyring",
"name": "/etc/ceph/ceph.mgr.li985-128.keyring"
}
}
MSG:
Could not find or access 'fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring'
Searched in:
/root/ceph-ansible/roles/ceph-mgr/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring
/root/ceph-ansible/roles/ceph-mgr/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring
/root/ceph-ansible/roles/ceph-mgr/tasks/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring
/root/ceph-ansible/roles/ceph-mgr/tasks/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring
/root/ceph-ansible/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring
/root/ceph-ansible/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li985-128.keyring on the Ansible Controller.
If you are using a module and expect the file to exist on the remote, see the remote_src option
skipping: [mon-001] => (item={'name': '/etc/ceph/ceph.client.admin.keyring', 'dest': '/etc/ceph/ceph.client.admin.keyring', 'copy_key': False}) => {
"changed": false,
"item": {
"copy_key": false,
"dest": "/etc/ceph/ceph.client.admin.keyring",
"name": "/etc/ceph/ceph.client.admin.keyring"
},
"skip_reason": "Conditional result was False"
}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
failed: [mon-000] (item={'name': '/etc/ceph/ceph.mgr.li1166-30.keyring', 'dest': '/var/lib/ceph/mgr/ceph-li1166-30/keyring', 'copy_key': True}) => {
"changed": false,
"item": {
"copy_key": true,
"dest": "/var/lib/ceph/mgr/ceph-li1166-30/keyring",
"name": "/etc/ceph/ceph.mgr.li1166-30.keyring"
}
}
MSG:
Could not find or access 'fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring'
Searched in:
/root/ceph-ansible/roles/ceph-mgr/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring
/root/ceph-ansible/roles/ceph-mgr/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring
/root/ceph-ansible/roles/ceph-mgr/tasks/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring
/root/ceph-ansible/roles/ceph-mgr/tasks/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring
/root/ceph-ansible/files/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring
/root/ceph-ansible/fetch//bb401e2a-c524-428e-bba9-8977bc96f04b//etc/ceph/ceph.mgr.li1166-30.keyring on the Ansible Controller.
If you are using a module and expect the file to exist on the remote, see the remote_src option
skipping: [mon-000] => (item={'name': '/etc/ceph/ceph.client.admin.keyring', 'dest': '/etc/ceph/ceph.client.admin.keyring', 'copy_key': False}) => {
"changed": false,
"item": {
"copy_key": false,
"dest": "/etc/ceph/ceph.client.admin.keyring",
"name": "/etc/ceph/ceph.client.admin.keyring"
},
"skip_reason": "Conditional result was False"
}
NO MORE HOSTS LEFT *************************************************************************************************************************************************************************************************
to retry, use: --limit @/root/ceph-linode/linode.retry
David Waiting [Mon, 10 Dec 2018 14:54:18 +0000 (09:54 -0500)]
ensure at least one osd is up
The existing task checks that the number of OSDs is equal to the number of up OSDs before continuing.
The problem is that if none of the OSDs have been discovered yet, the task will exit immediately and subsequent pool creation will fail (num_osds = 0, num_up_osds = 0).
In this change, we also check that at least one OSD is present. In our testing, this results in the task correctly waiting for all OSDs to come up before continuing.
Signed-off-by: David Waiting <david_waiting@comcast.com>
switch_to_containers: use ceph binary from container
use the ceph binary from the container instead of the host.
If the ceph CLI version isn't compatible between host and container
image, it can cause the CLI to hang.
instead of using `RuntimeDirectory` parameter in systemd unit files,
let's use a systemd `tmpfiles.d` to ensure `/run/ceph`.
Explanation:
`podman` doesn't create the `/var/run/ceph` if it doesn't exist the time
where the container is run while `docker` used to create it.
In case of `switch_to_containers` scenario, `/run/ceph` gets created by
a tmpfiles.d systemd file; when switching to containers, the systemd
unit file complains because `/run/ceph` already exists
The better fix would be to ensure `/usr/lib/tmpfiles.d/ceph-common.conf`
is removed and only rely on `RuntimeDirectory` from systemd unit file parameter
but we come from a non-containerized environment which is already running,
it means `/run/ceph` is already created and when starting the unit to
start the container, systemd will still complain and we can't simply
remove the directory if daemons are collocated.
switch_to_containers: do not try to redeploy monitors
`ceph-mon` tries to redeploy monitors because it assumes it was not yet
deployed since `mon_socket_stat` and `ceph_mon_container_stat` are
undefined (indeed, we stop the daemon before calling `ceph-mon` in the
switch_to_containers playbook).
Rishabh Dave [Tue, 12 Feb 2019 06:55:13 +0000 (12:25 +0530)]
fix mistake in task that aborts when ntpd is chosen on Atomic
Since it's already confusing whether ntp_daemon_type should be "ntp" or
"ntpd", fix the mistake in the title of the task that aborts if
ntp_daemon_type is set to "ntpd" and OS being used is Atomic.
Sébastien Han [Fri, 8 Feb 2019 15:05:20 +0000 (16:05 +0100)]
mon: do not hardcode ceph uid
167 is the ceph uid for Red Hat based system, thus trying to deploy a
monitor on Debian fail since the ceph user id on that system is 64045.
This commit uses the ceph_uid variable which contains the right uid
based on system/container detection.
Closes: https://github.com/ceph/ceph-ansible/issues/3589 Signed-off-by: Sébastien Han <seb@redhat.com>
Leah Neukirchen [Thu, 7 Feb 2019 17:09:21 +0000 (18:09 +0100)]
Fix uses of default(omit) with string concatenation
When {{omit}} is concatenated with another string, it expands to something
like __omit_place_holder__63eea0d96dd6ed867b95405e11d87dddf61f448d.
However, in these use-cases we need an empty string.
Typical error:
```
fatal: [iscsi-gw0]: FAILED! =>
msg: 'an error occurred while trying to read the file ''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'': [Errno 13] Permission denied: b''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'''
```
`become: True` is not needed on the following task:
`copy crt file(s) to gateway nodes`.
Since it's already set in the main playbook (site.yml/site-container.yml)
The thing is that the files get generated in the 'fetch_directory' with
root user because there is a 'delegate_to' + we run the playbook with
`become: True` (from main playbook).
The idea here is to create files under ansible user so we can open them
later to copy them on the remote machine.
John Fulton [Thu, 31 Jan 2019 21:17:20 +0000 (16:17 -0500)]
Fix CNI error when net=host is not used in some podman calls
With 'podman version 1.0.0' on RHEL8 beta the 'get ceph version' and
'ceph monitor mkfs' commands fail [1] with "error configuring network
namespace for container Missing CNI default network".
When net=host is added these errors are resolved. net=host is used in
many other calls (grep -R net=host | wc -l --> 38).
when ceph-container-common notifies handlers because a new container
image has been pulled, ceph-handler will throw an error because of
undefined variables since they are set in ceph-facts role.
- also add `--foreground` which seems to fix some issue we are facing when
using timeout with `podman`.
- use this fact in the `is ceph running already?` task.
John Fulton [Fri, 1 Feb 2019 13:32:14 +0000 (08:32 -0500)]
Make python print statements python3 compatible
The restart_osd_daemon.sh generated from the j2 template
contains a python call which uses 'print x' instead of
'print(x)'. Add the missing parentheses to make this call
compatible with both 2 and 3.
Also add parentheses to other python print calls found
in roles/ceph-client/defaults/main.yml and
infrastructure-playbooks/cluster-os-migration.yml.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1671721 Signed-off-by: John Fulton <fulton@redhat.com>