Sébastien Han [Tue, 4 Dec 2018 17:41:36 +0000 (18:41 +0100)]
tox: add missing ceph_docker_image_tag in shrink
When calling shrink on containerized deployment, we were first doing the
setup with `latest-master` and then when calling the playbook we were
using the default value for `ceph_docker_image_tag` that comes from
ceph-defaults. Now we pass
`ceph_docker_image_tag={env:CEPH_DOCKER_IMAGE_TAG:latest-master}` so the
play will run the right container image.
Sébastien Han [Tue, 4 Dec 2018 09:44:28 +0000 (10:44 +0100)]
test: disable nfs for containers
Based on https://github.com/ceph/ceph-container/pull/1269 and given
there are no stable packages and reliable repository, we disable nfs
ganesha temporarly.
Sébastien Han [Wed, 28 Nov 2018 23:10:29 +0000 (00:10 +0100)]
osd: re-introduce disk_list check
This commit
https://github.com/ceph/ceph-ansible/commit/4cc1506303739f13bb7a6e1022646ef90e004c90#diff-51bbe3572e46e3b219ad726da44b64ebL13
accidentally removed this check.
This is a must have for ceph-disk based containerized OSDs.
Sébastien Han [Fri, 30 Nov 2018 10:20:03 +0000 (11:20 +0100)]
osd: discover osd_objectstore on the fly
Applying and passing the OSD_BLUESTORE/FILESTORE on the fly is wrong for
existing clusters as their config will be changed.
Typically, if an OSD was prepared with ceph-disk on filestore and we
change the default objectstore to bluestore, the activation will fail.
The flag osd_objectstore should only be used for the preparation, not
activation. The activate in this case detects the osd objecstore which
prevents failures like the one described above.
Sébastien Han [Tue, 27 Nov 2018 16:50:44 +0000 (17:50 +0100)]
ceph-osd: change jinja condition
If an existing cluster runs this config, and has ceph-disk OSD, the
`expose_partitions` won't be expected by jinja since it's inside the
'old' if. We need it as part of the osd_scenario != 'lvm' condition.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1640273 Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Mon, 3 Dec 2018 10:59:49 +0000 (11:59 +0100)]
ceph_key: allow setting 'dest' to a file
This is useful in situations where you fetch the key from the mon store
and want to write the file with a different name to a dedicated
directory. This is important when fetching the mgr key, they are created
as mgr.ceph-mon2 but we want them in /var/lib/ceph/mgr/ceph-ceph-mon0/keyring
Sébastien Han [Fri, 16 Nov 2018 09:50:38 +0000 (10:50 +0100)]
mon: do not serialized container bootstrap
This commit unifies the container and non-container code, which in the
meantime gives use the ability to deploy N mon container at the same
time without having to serialized the deployment. This will drastically
reduces the time needed to bootstrap the cluster.
Note, this is only possible since Nautilus because the monitors are
bootstrap the initial keys on their own once they reach quorum. In the
Nautilus version of the ceph-container mon, we stopped generating the
keys 'manually' from inside the container, for more detail see: https://github.com/ceph/ceph-container/pull/1238
Sébastien Han [Fri, 26 Oct 2018 12:32:49 +0000 (14:32 +0200)]
mgr: only copy keys with dedicated mgr
When collocating mon and mgr, the mgr container will attempt to create
its own key since it has the admin key at its disposal. Also at this
point there is nothing to fetch since the key is not created by the
mons, as mentionned above the mgr creates the key on its own.
Sébastien Han [Tue, 16 Oct 2018 13:40:35 +0000 (15:40 +0200)]
site: collocated mon and mgr by default
This will speed up the deployment and also deploy mon and mgr collocated
just as recommended.
This won't prevent you of adding more and dedicaded machines for mgr if
needed.
Sébastien Han [Mon, 26 Nov 2018 10:08:27 +0000 (11:08 +0100)]
mon: default ceph_health_raw to json
During the first iteration, the command won't return anything, or can
simply fail and might not return a valid json structure. Ansible will
fail parsing it in the filter `from_json` so let's default that variable
to empty dictionary.
Sébastien Han [Mon, 26 Nov 2018 09:56:14 +0000 (10:56 +0100)]
container-common: remove old check
This removes a bit of unnecessary code, the check was always wrong
because of the condition 'not ceph_current_status.get('rc', 1) == 0'
It will never match since `Not` is used for bool and we are checking for
an rc.
Also, even though the check would work, this will be a major blocker for
a complete meltdown. If the whole platform is shutdown then nothing will
be up but files will be present, so this check is definitely wrong.
Sébastien Han [Fri, 26 Oct 2018 12:13:43 +0000 (14:13 +0200)]
rolling-update: remove old condition
This failure condition was only valid at the time where clusters didn't
have ceph-mgr activated. Now since we collocate the ceph-mgr with the
mon by default, if the daemon wasn't present it will be created during
the upgrade.
Sébastien Han [Fri, 16 Nov 2018 09:46:10 +0000 (10:46 +0100)]
ceph_key: apply permissions using ansible code module
Instead of applying file permissions from our code, let's rely on the
ansible code 'file' module for this. This is now handled at the task
declaration level instead of inside the module.
Sébastien Han [Mon, 26 Nov 2018 10:06:10 +0000 (11:06 +0100)]
sites: fail the playbook on any failure
We need to apply any_errors_fatal: true to every play so it can take
effect, not only on the initial pass. With this flag, any error in the
playbook will cause the playbook to stop.
rolling_update: create missing keyring only on running mon
try to create the potentially missing keys only on monitors that are
actually running.
The current node being played is stopped before this task.
By the way, delegating the command on all nodes but the current node
being played ensures that the generated keys will be present on all
monitors.
Sébastien Han [Wed, 28 Nov 2018 23:27:49 +0000 (00:27 +0100)]
rolling_update: default ceph json output to empty dict
So we can avoid the following failure:
The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded
We just need to set a default, the next iteration will have a more
complete json since the command won't fail.
validate: change default value for `radosgw_address`
change default value of `radosgw_address` to keep consistency with
`monitor_address`.
Moreover, `ceph-validate` checks if the value is '0.0.0.0' to determine
if it has to run `check_eth_rgw.yml`.
when upgrading from RHCS 2.5 to 3.2, it fails because the task `create
ceph mgr keyring(s) when mon is containerized` has a when condition
`inventory_hostname == groups[mon_group_name]|last`.
First, this is incorrect because `inventory_hostname` is referring to a
mgr node, it means this condition would have never been satisfied.
Then, this condition + `serial: 1` makes the mgr keyring creating skipped on
the first node. Further, the `ceph-mgr` role tries to copy the mgr
keyring (it's not aware we are running `serial: 1`) this leads to a
failure like the following:
```
TASK [ceph-mgr : copy ceph keyring(s) if needed] ***************************************************************************************************************************************************************************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10
Tuesday 27 November 2018 12:03:34 +0000 (0:00:00.296) 0:11:01.290 ******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'
failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"}
```
The ceph_key module is idempotent, so there is no need to have such a
condition.
The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_eth1'
```
Indeed, `radosgw_interface` is the network interface on rgw only. It is
expected that this same interface doesn't exist on `localhost`, so, when
running `shrink_mon.yml`, the role `ceph-defaults` is called in
`hosts: localhost` and causes the playbook to fail.
Sébastien Han [Tue, 20 Nov 2018 10:28:02 +0000 (11:28 +0100)]
ceph-defaults: use podman on Fedora only
It seems Atomic 7.5 has podman already, however this is an old version
(0.4). The podman integration is targetting RHEL 8, so Fedora is
currently the closest to that.
Sébastien Han [Mon, 19 Nov 2018 13:12:43 +0000 (14:12 +0100)]
iscsi: expose /dev/log in the container
During its initialisation both rbd-target-api and rbd-target-gw try to
open /dev/log for their syslog handler. If the device is not present the
service fails to start. Thus expose /dev/log from the host in the
container solves that problem.
Sébastien Han [Mon, 19 Nov 2018 10:03:45 +0000 (11:03 +0100)]
testinfra: add support for podman
Since we are now testing on docker and podman our functionnal tests must
reflect that. So now, if we detect the podman binary we will use it,
otherwise we default to docker.
Sébastien Han [Fri, 16 Nov 2018 09:41:12 +0000 (10:41 +0100)]
test_build_key_path_bootstrap_osd: fix
The entity name is client.bootstrap-osd (as returned by Ceph), and not
bootstrap-osd. The build_key_path function split 'client.bootstrap-osd'
on the '.' so using bootstrap-osd fails with index out of range.
Sébastien Han [Fri, 16 Nov 2018 09:37:07 +0000 (10:37 +0100)]
ceph_key: remove set-uid support
The support of set-uid was remove from Ceph during the Nautilus cycle by
the following commit: d6def8ba1126209f8dcb40e296977dc2b09a376e so this
will not work anymore when deploying Nautilus clusters and above.
Sébastien Han [Sat, 17 Nov 2018 18:58:54 +0000 (19:58 +0100)]
client: do not use a dummy container anymore
Since 84fcf4639140c390a7f1fcd790ba190503713f86 we now use the container
binary cli to create ceph keys instead of creating a container and
'docker execing' into it.
Sébastien Han [Fri, 16 Nov 2018 09:46:10 +0000 (10:46 +0100)]
ceph_key: rework container support
Previously, we were doing a 'docker exec' inside a mon container, this
worked but this wasn't ideal since it required a mon to be up to
generate keys. We must be able to generate a key without a running mon,
e.g, when we create the initial key or simply when you want to generate
a key from any node that is not a mon.
Now, just like the ceph_volume module we use a 'docker run' command with
the right binary as an entrypoint to perform the choosen action, this is
more elegant and also only requires an env variable to be set in the
playbook: CEPH_CONTAINER_IMAGE.
Andrew Schoen [Tue, 20 Nov 2018 20:28:58 +0000 (14:28 -0600)]
ceph-volume: be idempotent when the batch strategy changes
If you deploy with 2 HDDs and 1 SDD then each subsequent deploy both
HDD drives will be filtered out, because they're already used by ceph.
ceph-volume will report this as a 'strategy change' because the device
list went from a mixed type of HDD and SDD to a single type of only SDD.
This situation results in a non-zero exit code from ceph-volume. We want
to handle this situation gracefully and report that nothing will be changed.
A similar json structure to what would have been given by ceph-volume is
returned in the 'stdout' key.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650306 Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Sébastien Han [Mon, 26 Nov 2018 16:58:49 +0000 (17:58 +0100)]
osd: expose udev into the container
In order to be able to retrieve udev information, we must expose its
socket. As per, https://github.com/ceph/ceph/pull/25201 ceph-volume will
start consuming udev output.
tests: do not fully override previous ceph_conf_overrides
We run an initial deployment with `osd_pool_default_size: 1` in
`ceph_conf_overrides`.
When re-running the playbook to test idempotency and handlers, we reset
`ceph_conf_overrides`, we must append a new value instead of just
overwritting it, otherwise, this can lead to error in the CI.
Sébastien Han [Wed, 21 Nov 2018 15:18:58 +0000 (16:18 +0100)]
rolling_update: create rbd and rbd-mirror keyrings
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572 Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Wed, 21 Nov 2018 15:17:04 +0000 (16:17 +0100)]
ceph_key: add a get_key function
When checking if a key exists we also have to ensure that the key exists
on the filesystem, the key can change on Ceph but still have an outdated
version on the filesystem. This solves this issue.
Sébastien Han [Fri, 16 Nov 2018 15:15:24 +0000 (16:15 +0100)]
switch: disable all ceph units
Prior to this commit we were only disabling ceph-osd units, but forgot
the ceph.target which is controlling everything and will restart the
ceph-osd units at each reboot.
Now that everything gets disabled there won't be any conflicts between
the old non-container and the new container units.
Add real default value for osd pool size customization.
Ceph itself has an `osd_pool_default_size` default value to `3`.
If users don't specify a pool size in various pools definition within
ceph-ansible, we should default to `3`.
By the way, this kind of condition isn't really clear:
```
when:
- rbd_pool_size | default ("")
```
we should try to get the customized value then default to what is in
`osd_pool_default_size` (which has its default value pointing to
`ceph_osd_pool_default_size` (`3`) as well) and compare it to
`ceph_osd_pool_default_size`.
mon: move `osd_pool_default_pg_num` in `ceph-defaults`
`osd_pool_default_pg_num` parameter is set in `ceph-mon`.
When using ceph-ansible with `--limit` on a specifc group of nodes, it
will fail when trying to access this variables since it wouldn't be
defined.
Typical error seen:
```
$ sudo ceph daemon osd.2 config get osd_memory_target
Can't get admin socket path: unable to get conf option admin_socket for osd.2:
parse error setting 'osd_memory_target' to '7823740108,8' (strict_si_cast:
unit prefix not recognized)
```
This commit ensures the value inserted in ceph.conf will be an integer.
Sébastien Han [Tue, 20 Nov 2018 17:03:14 +0000 (18:03 +0100)]
site: resync container playbook
This PR https://github.com/ceph/ceph-ansible/pull/3251 forgot to create a symlink from site-docker.yml.sample to site-container.yml.sample.
This commit resyncs and put the symlink in place.
Boris Ranto [Tue, 20 Nov 2018 10:46:08 +0000 (11:46 +0100)]
defaults/facts: Use list instead of keys
It is safer to use the list filter than the keys() method since the keys
method does have some interoperability issues between python2 and
python3 based ansible/jinja.
Boris Ranto [Mon, 19 Nov 2018 23:45:40 +0000 (00:45 +0100)]
start_osds: Use list instead of keys
If you use python3 based ansible then keys() returns a dict_keys object,
not a list of keys. This breaks the installation on such a system. Using
the list filter provides a more robust solution that should work on both
python2 and python3 based ansible. You can find some more information
about the issue, here:
Dan Mick [Sat, 17 Nov 2018 00:28:54 +0000 (16:28 -0800)]
validate plugin: handle missing exception fields without traceback
"missing variable" errors introduced by PR3058 would attempt to
be reported, but since the exception contained no "path" definition,
would cause a second exception in the Invalid exception handler.
Make the exception handler verify that any field it tries to use
exists, clean up its message formatting, and reduce the verbose
level to see the literal error from notario in case more goes
wrong in future.
Sébastien Han [Sun, 18 Nov 2018 20:48:47 +0000 (21:48 +0100)]
ceph.ceph-container-common remove symlink
This error was introduced in the recent refactor of ceph-docker-common
in https://github.com/ceph/ceph-ansible/pull/3251. However, the Ansible
galaxy linter is not happy about it and fails importing the role.
Removing this since it's not used anymore.