ToprHarley [Mon, 18 Feb 2019 18:02:03 +0000 (19:02 +0100)]
Convert interface names to underscores
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540881 Signed-off-by: Tomas Petr <tpetr@redhat.com>
(cherry picked from commit 573adce7dd4f306c384b3308c8049ae49ef59716)
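A minimal sketch of the kind of conversion involved: ansible fact names use underscores, so any dashes in a configured interface name have to be converted before the hostvars lookup (the task and variable names below are illustrative, not the role's exact code).
```
- name: convert interface name to underscores
  set_fact:
    _monitor_interface: "{{ monitor_interface | replace('-', '_') }}"
```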
Kevin Coakley [Tue, 26 Feb 2019 17:30:31 +0000 (09:30 -0800)]
Set permissions on monitor directory to u=rwX,g=rX,o=rX recursive
Set directories to 755 and files to 644 in
/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }} recursively, instead of
setting both files and directories to 755 recursively. The ceph mon
process writes files to this path with permissions 644, so this update stops
ansible from changing the permissions in
/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }} every time ceph mon writes
a file, which improves idempotency.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1683997 Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d327681b99915578fc8b389fda69556966db905f)
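A minimal sketch of the permissions described above, assuming the file module is used (the exact task layout in the role may differ):
```
- name: set permissions on the monitor directory
  file:
    path: "/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }}"
    state: directory
    owner: ceph
    group: ceph
    mode: "u=rwX,g=rX,o=rX"  # X applies the execute bit to directories only
    recurse: true
```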
Dimitri Savineau [Wed, 27 Feb 2019 16:07:38 +0000 (11:07 -0500)]
mon: Add mds permissions to client.admin
The administrator keyring needs full capabilities on mds like mon,
osd and mgr.
Without this, the client.admin key won't be able to run commands
against mds (like `ceph tell mds.0 session ls`).
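Illustrative caps only, expressed with the ceph_key module; the role builds the keyring elsewhere, but the point is the added mds entry alongside mon, osd and mgr:
```
- name: create client.admin keyring (sketch)
  ceph_key:
    name: client.admin
    cluster: "{{ cluster }}"
    caps:
      mon: "allow *"
      osd: "allow *"
      mgr: "allow *"
      mds: "allow *"
```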
osd: make the 'wait for all osd to be up' task configurable
Introduce two new variables to make the 'wait for all osd to
be up' check configurable.
It's possible that for some deployments, OSDs can take longer to be seen
as UP and IN.
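A minimal sketch of what a tunable check could look like; the two variable names below are hypothetical, shown only to illustrate the mechanism:
```
- name: wait for all osd to be up
  command: "{{ docker_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd stat -f json"
  register: osd_stat
  retries: "{{ wait_for_all_osds_up_retries | default(60) }}"
  delay: "{{ wait_for_all_osds_up_delay | default(10) }}"
  until: (osd_stat.stdout | from_json)["num_osds"] == (osd_stat.stdout | from_json)["num_up_osds"]
```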
David Waiting [Mon, 10 Dec 2018 14:54:18 +0000 (09:54 -0500)]
ensure at least one osd is up
The existing task checks that the number of OSDs is equal to the number of up OSDs before continuing.
The problem is that if none of the OSDs have been discovered yet, the task will exit immediately and subsequent pool creation will fail (num_osds = 0, num_up_osds = 0).
In this change, we also check that at least one OSD is present. In our testing, this results in the task correctly waiting for all OSDs to come up before continuing.
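A sketch of the adjusted check: the same task as above, now also requiring at least one discovered OSD before the equality test can pass (field names follow `ceph osd stat -f json` output):
```
- name: wait for all osd to be up
  command: "{{ docker_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd stat -f json"
  register: osd_stat
  retries: 60
  delay: 10
  until:
    - (osd_stat.stdout | from_json)["num_osds"] | int > 0
    - (osd_stat.stdout | from_json)["num_osds"] == (osd_stat.stdout | from_json)["num_up_osds"]
```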
setup_ntp: call handler to disable ntpd if chronyd used
The task setup chronyd called the handler disable chronyd, which of
course defeats the purpose.
Changing the task to disable ntpd instead fixes the issue of chronyd
being disabled after it got enabled.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1673664 Fixes: #3582 Signed-off-by: Patrick C. F. Ernzer <pcfe@redhat.com>
(cherry picked from commit c605ff6a68720ab43b63086c3ac1d529a651f585)
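A sketch of the corrected notification, with illustrative task and handler names: the chronyd setup task should notify the handler that disables ntpd, not the one that disables chronyd.
```
- name: setup chronyd
  package:
    name: chrony
    state: present
  notify: disable ntpd
```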
Typical error:
```
fatal: [iscsi-gw0]: FAILED! =>
msg: 'an error occurred while trying to read the file ''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'': [Errno 13] Permission denied: b''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'''
```
`become: True` is not needed on the following task:
`copy crt file(s) to gateway nodes`, since it's already set in the main
playbook (site.yml/site-container.yml).
The problem is that the files get generated in the 'fetch_directory' as the
root user, because there is a 'delegate_to' and we run the playbook with
`become: True` (from the main playbook).
The idea here is to create the files as the ansible user so we can open them
later to copy them to the remote machine.
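A sketch of the idea, generating the file on the controller as the unprivileged ansible user so later copy tasks can read it from fetch_directory (the module and paths shown are illustrative, not the role's exact tasks):
```
- name: generate iscsi-gateway.key
  openssl_privatekey:
    path: "{{ fetch_directory }}/{{ fsid }}/iscsi-gateway.key"
  delegate_to: localhost
  become: false
```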
Leah Neukirchen [Thu, 7 Feb 2019 17:09:21 +0000 (18:09 +0100)]
Fix uses of default(omit) with string concatenation
When {{omit}} is concatenated with another string, it expands to something
like __omit_place_holder__63eea0d96dd6ed867b95405e11d87dddf61f448d.
However, in these use-cases we need an empty string.
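Illustrative pattern only: `default(omit)` works when it is the whole module argument, but inside a concatenation it leaves the placeholder shown above, so `default('')` is what we want (the variable names are made up for the example):
```
- name: concatenation-safe default
  debug:
    msg: "db={{ item.db_vg | default('') }}/{{ item.db | default('') }}"
  with_items: "{{ lvm_volumes | default([]) }}"
```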
Sébastien Han [Mon, 26 Nov 2018 16:58:49 +0000 (17:58 +0100)]
osd: expose udev into the container
In order to be able to retrieve udev information, we must expose its
socket. As per https://github.com/ceph/ceph/pull/25201, ceph-volume will
start consuming udev output.
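A sketch of the extra bind mount, assuming it is injected through a docker run arguments variable (the variable name below is hypothetical):
```
ceph_osd_docker_extra_volumes: "-v /run/udev:/run/udev:ro"
```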
Noah Watkins [Tue, 6 Nov 2018 16:49:39 +0000 (08:49 -0800)]
Fixup shrink_osd[_container] scenario config
** configuration seems to be for filestore:
[ERROR]: [ceph-osd0] Validation failed for variable: lvm_volumes
** Removing `radosgw_interface: eth1` to resolve:
The task includes an option with an undefined variable. The error was:
'ansible.vars.hostvars.HostVarsVars object' has no attribute
u'ansible_eth1'
The error appears to have been in
'/home/nwatkins/src/ceph-ansible/roles/ceph-defaults/tasks/set_radosgw_address.yml':
line 21, column 5, but may be elsewhere in the file depending on the
exact syntax problem.
The offending line appears to be:
- name: set_fact _radosgw_address to radosgw_interface - ipv4
^ here
This condition is useless and it also creates issues we don't see in
our CI. ceph_release is set by either ceph-common or ceph-docker-common,
so let's keep it this way.
Sébastien Han [Mon, 14 Jan 2019 15:31:45 +0000 (16:31 +0100)]
switch: do not fail on missing key
Some people use the switch playbook to perform upgrades, so they end up in
the same situation as https://bugzilla.redhat.com/show_bug.cgi?id=1650572
This applies the same fix as 729744c6a8c69f5fdf66b67fb28063297996e30a.
We don't want to fail on keys that are not present, since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".
Also please note that commit 488281187e8ac6c587db74961db9e075f31c8eae was merged (in PR 3477)
"as is" (despite merge conflicts), which ideally was not supposed to be
the case. As a side effect, the feature of supporting
multiple NTP daemons (the new ones being chronyd and timesyncd) was
also backported, which is itself against the convention. For
consistency's sake the feature was backported to stable-3.1 as well.
Sometimes we play the whole `ceph-defaults` role just to access the
default value of some variables. It means we play the `facts.yml` part
of this role while it's not desired. Splitting this role will speed up
the playbook.
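A sketch of the intent, assuming the split produces a dedicated facts role (the role name `ceph-facts` is an assumption here): plays that only need default values can skip fact collection entirely.
```
- hosts: osds
  gather_facts: false
  tasks:
    - import_role:
        name: ceph-defaults   # default variables only
    - import_role:
        name: ceph-facts      # fact collection, played only where needed
```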
The OSD part of the purge delegates commands to the monitor node, so we need
to gather the monitors' facts to know the `ansible_hostname` fact that is used
in the `docker_exec_cmd` fact.
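A minimal sketch of such a fact-gathering play placed ahead of the OSD purge, so the monitors' hostvars carry `ansible_hostname` when `docker_exec_cmd` is built:
```
- hosts: "{{ mon_group_name | default('mons') }}"
  become: true
  gather_facts: true
  tasks: []
```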
Sébastien Han [Mon, 3 Dec 2018 21:59:17 +0000 (22:59 +0100)]
purge-cluster: add support for mon/mgr collocation
Recently we introduced the default collocation of mon/mgr without the
need for a dedicated mgrs section. This means we have to stop the mgr
process on that machine too.
Sébastien Han [Mon, 3 Dec 2018 21:46:52 +0000 (22:46 +0100)]
purge-docker-cluster: add support for mgr/mon collocation
Recently we introduced the collocation of mon and mgr by default, so we
don't need to have an explicit mgrs section for this. This means we have
to remove the mgr container on the mon machines too.
Sébastien Han [Thu, 4 Oct 2018 15:40:25 +0000 (17:40 +0200)]
purge-docker-cluster: add ceph-volume support
This commit adds support for purging clusters that were deployed
with ceph-volume. It also nicely separates, with a block instruction, the
work to do depending on whether lvm is used or not.
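An illustrative sketch of the block structure only; the real zap/purge tasks are more involved:
```
- name: purge ceph-volume prepared osds
  block:
    - name: zap and destroy osd lvm volumes
      command: "ceph-volume lvm zap --destroy {{ item.data }}"
      with_items: "{{ lvm_volumes | default([]) }}"
  when: osd_scenario == 'lvm'
```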
Bruceforce [Fri, 4 Jan 2019 16:26:02 +0000 (17:26 +0100)]
The nfs_ganesha_dev_apt_repo variable was set incorrectly in the task
"fetch nfs-ganesha development repository".
This has to be pushed directly to stable-3.2 since master has diverged.
Rishabh Dave [Wed, 12 Dec 2018 11:15:00 +0000 (16:45 +0530)]
ceph-infra: merge ntp_debian.yml and ntp_rpm.yml
Merge ntp_debian.yml and ntp_rpm.yml into one (the new file is called
setup_ntp.yml) since they are almost identical. Also avoid repetition
of the common setup step for ntpd and chronyd services.
update: do not enforce `serial: 1` on client nodes
There is no need to enforce `serial: 1` on client nodes.
Let's make it parameterizable by introducing a new *extra* variable
`client_update_batch`; if not set, this will default to `{{
ansible_forks }}`.
NOTE: this is only usable as an extra variable passed with
`-e client_update_batch=<num>`
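A sketch of the client play header with the new variable; the batch size is only honoured when passed as an extra variable:
```
- hosts: clients
  serial: "{{ client_update_batch | default(ansible_forks) }}"
  gather_facts: false
  tasks: []
```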
Rishabh Dave [Mon, 17 Dec 2018 10:34:46 +0000 (16:04 +0530)]
set any_errors_fatal to true for all host sections
Add `any_errors_fatal: true` to all host sections in `site.yml.sample`
and `site-container.yml.sample` so that playbook execution
stops immediately when an error occurs.
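A sketch of what one host section looks like with the new setting (roles trimmed for brevity):
```
- hosts: mons
  any_errors_fatal: true
  become: true
  roles:
    - ceph-defaults
    - ceph-mon
```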
- reintroduce `purge_cluster_container` and `purge_cluster_non_container`
on `stable-3.2`,
- remove all purge scenarios based on ceph-disk,
- remove purge_lvm_osds_* scenarios.
Sébastien Han [Tue, 4 Dec 2018 09:44:28 +0000 (10:44 +0100)]
test: disable nfs for containers
Based on https://github.com/ceph/ceph-container/pull/1269 and given
there are no stable packages and no reliable repository, we disable
nfs-ganesha temporarily.
Sébastien Han [Fri, 30 Nov 2018 10:20:03 +0000 (11:20 +0100)]
osd: discover osd_objectstore on the fly
Applying and passing the OSD_BLUESTORE/FILESTORE flag on the fly is wrong for
existing clusters, as their config will be changed.
Typically, if an OSD was prepared with ceph-disk on filestore and we
change the default objectstore to bluestore, the activation will fail.
The osd_objectstore flag should only be used for preparation, not
activation. Activation in this case detects the OSD objectstore, which
prevents failures like the one described above.
Sébastien Han [Tue, 27 Nov 2018 16:50:44 +0000 (17:50 +0100)]
ceph-osd: change jinja condition
If an existing cluster runs this config and has ceph-disk OSDs, the
`expose_partitions` call won't be expected by jinja since it's inside the
'old' if. We need it as part of the osd_scenario != 'lvm' condition.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1640273 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bef522627e1e9827b86710c7a54f35a0cd596fbb)
Sébastien Han [Thu, 29 Nov 2018 13:26:41 +0000 (14:26 +0100)]
rolling_update: do not fail on missing keys
We don't want to fail on keys that are not present, since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ebc901c6af67300f7b7b8da1b2d0a74147798da5)
Noah Watkins [Fri, 30 Nov 2018 23:46:42 +0000 (15:46 -0800)]
rgw: use correct default rgw frontend address
Since 0.0.0.0 is the default radosgw address (not 'address'), not
configuring an address explicitly and instead configuring the radosgw
interface would result in 0.0.0.0 being used, instead of falling
through to the section that inspects the interface config option.
Backport note: this cannot be cherry-picked from master since this code
doesn't exist in master.
Sébastien Han [Wed, 28 Nov 2018 23:27:49 +0000 (00:27 +0100)]
rolling_update: default ceph json output to empty dict
So we can avoid the following failure:
The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded
We just need to set a default; the next iteration will have a more
complete json since the command won't fail.
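A sketch of the defaulted lookup, so `from_json` never parses an empty string on the first, failing iteration (the task shape is illustrative):
```
- name: waiting for the monitor to join the quorum
  command: "{{ docker_exec_cmd | default('') }} ceph --cluster {{ cluster }} -s -f json"
  register: ceph_health_raw
  retries: 10
  delay: 10
  until: >
    hostvars[mon_host]['ansible_hostname'] in
    (ceph_health_raw.stdout | default('{}') | from_json).get('quorum_names', [])
```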
Add real default value for osd pool size customization.
Ceph itself has an `osd_pool_default_size` default value of `3`.
If users don't specify a pool size in various pools definition within
ceph-ansible, we should default to `3`.
By the way, this kind of condition isn't really clear:
```
when:
- rbd_pool_size | default ("")
```
We should try to get the customized value, then default to what is in
`osd_pool_default_size` (which itself defaults to
`ceph_osd_pool_default_size`, i.e. `3`), and compare it to
`ceph_osd_pool_default_size`.
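A sketch of the clearer comparison described above; the variable names are taken from the message, the exact task differs:
```
- name: customize rbd pool size (sketch)
  command: "ceph --cluster {{ cluster }} osd pool set rbd size {{ rbd_pool_size | default(osd_pool_default_size) }}"
  when: rbd_pool_size | default(osd_pool_default_size) | int != ceph_osd_pool_default_size | int
```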
mon: move `osd_pool_default_pg_num` in `ceph-defaults`
`osd_pool_default_pg_num` parameter is set in `ceph-mon`.
When using ceph-ansible with `--limit` on a specifc group of nodes, it
will fail when trying to access this variables since it wouldn't be
defined.
tests: do not fully override previous ceph_conf_overrides
We run an initial deployment with `osd_pool_default_size: 1` in
`ceph_conf_overrides`.
When re-running the playbook to test idempotency and handlers, we reset
`ceph_conf_overrides`; we must append a new value instead of just
overwriting it, otherwise this can lead to errors in the CI.
Sébastien Han [Wed, 21 Nov 2018 15:18:58 +0000 (16:18 +0100)]
rolling_update: create rbd and rbd-mirror keyrings
During an upgrade, ceph won't create keys that did not exist on the
previous version. So after the upgrade from, say, Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fail, especially for the rbd-mirror
key, which only appears in Nautilus.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f9263b9ac3b5649f1e3cf3cbaf12d10)
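A rough sketch of the idea only; the key names and caps below are assumptions, not copied from the playbook:
```
- name: create potentially missing keys (rbd and rbd-mirror)
  command: >
    {{ docker_exec_cmd | default('') }} ceph --cluster {{ cluster }}
    auth get-or-create client.bootstrap-{{ item }}
    mon "allow profile bootstrap-{{ item }}"
  with_items:
    - rbd
    - rbd-mirror
  delegate_to: "{{ mon_host }}"
```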
Sébastien Han [Wed, 21 Nov 2018 15:17:04 +0000 (16:17 +0100)]
ceph_key: add a get_key function
When checking if a key exists, we also have to ensure that the key exists
on the filesystem; the key can change on Ceph but still have an outdated
version on the filesystem. This solves that issue.
Sébastien Han [Fri, 16 Nov 2018 15:15:24 +0000 (16:15 +0100)]
switch: disable all ceph units
Prior to this commit we were only disabling the ceph-osd units, but forgot
ceph.target, which controls everything and will restart the
ceph-osd units at each reboot.
Now that everything gets disabled, there won't be any conflicts between
the old non-container and the new container units.
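A minimal sketch of the added step: disable ceph.target as well, so nothing re-enables or restarts the old non-container units at the next reboot.
```
- name: disable ceph.target
  systemd:
    name: ceph.target
    enabled: false
```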
Sébastien Han [Wed, 28 Nov 2018 23:10:29 +0000 (00:10 +0100)]
osd: re-introduce disk_list check
This commit
https://github.com/ceph/ceph-ansible/commit/4cc1506303739f13bb7a6e1022646ef90e004c90#diff-51bbe3572e46e3b219ad726da44b64ebL13
accidentally removed this check.
This is a must-have for ceph-disk based containerized OSDs.