Setting this value to 1 lets the CI cover the related code in the
playbook without breaking the upgrade scenarios.
Those scenarios were broken because of the `TASK [waiting for
clean pgs...]` check in rolling_update.yml: the pool size for
`cephfs_metadata` and `cephfs_data` is raised to `2` in
`ceph-override.json`, and since there are not enough OSDs to honor this size,
some PGs stay degraded and make that check fail.
Since the introduction of `ceph-volume`, there is no need to split those tasks.
Let's refactor this part of the code so it's clearer.
Incidentally, this was breaking rolling_update.yml when `openstack_config:
true` was set, because nothing in the ceph-osd role ensured OSDs were started
(`openstack_config.yml` contains a check ensuring all OSDs are up, which was
obviously failing), and the OSDs on the last OSD node ended up not being
started at all.
When upgrading from RHCS 2.5 to 3.2, the upgrade fails because the task
`create ceph mgr keyring(s) when mon is containerized` has the when condition
`inventory_hostname == groups[mon_group_name]|last`.
First, this is incorrect because `inventory_hostname` refers to a mgr node
here, so this condition can never be satisfied.
Second, this condition combined with `serial: 1` causes the mgr keyring
creation to be skipped on the first node. The `ceph-mgr` role then tries to
copy the mgr keyring (it is not aware we are running with `serial: 1`), which
leads to a failure like the following:
```
TASK [ceph-mgr : copy ceph keyring(s) if needed] ***************************************************************************************************************************************************************************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10
Tuesday 27 November 2018 12:03:34 +0000 (0:00:00.296) 0:11:01.290 ******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'
failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"}
```
The ceph_key module is idempotent, so there is no need to have such a
condition.
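As a hypothetical minimal reproduction (the `mgrs` and `mons` group names are
assumptions, not the real inventory), a play targeting the mgr group can never
satisfy a condition comparing `inventory_hostname` against the last monitor:
```
# inventory_hostname is a mgr node here, so the comparison against the last
# mon is always false and the task is silently skipped.
- hosts: mgrs
  gather_facts: false
  tasks:
    - name: create ceph mgr keyring(s) when mon is containerized (stand-in)
      debug:
        msg: "the keyring would be created here"
      when: inventory_hostname == groups['mons'] | last
```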
Andrew Schoen [Tue, 20 Nov 2018 20:28:58 +0000 (14:28 -0600)]
ceph-volume: be idempotent when the batch strategy changes
If you deploy with 2 HDDs and 1 SSD, then on each subsequent deploy both
HDD drives will be filtered out because they're already used by Ceph.
ceph-volume reports this as a 'strategy change' because the device
list went from a mixed type of HDD and SSD to a single type of only SSD.
This situation results in a non-zero exit code from ceph-volume. We want
to handle it gracefully and report that nothing will be changed.
A JSON structure similar to what ceph-volume would have produced is
returned in the 'stdout' key.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650306
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit e13f32c1c5be2e4007714f704297827b16488ec6)
Typical error seen:
```
$ sudo ceph daemon osd.2 config get osd_memory_target
Can't get admin socket path: unable to get conf option admin_socket for osd.2:
parse error setting 'osd_memory_target' to '7823740108,8' (strict_si_cast:
unit prefix not recognized)
```
This commit ensures the value inserted in ceph.conf will be an integer.
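A minimal sketch of the idea (the variable names and the formula are
placeholders, not the actual role code): the trailing `int` filter is what
keeps a non-integer value like the one shown above out of ceph.conf.
```
# Cast the computed memory target so ceph.conf only ever gets an integer.
- name: compute osd_memory_target as an integer
  set_fact:
    osd_memory_target: "{{ (ansible_memtotal_mb * 1048576 * 0.7 / osd_count) | int }}"
```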
VasishtaShastry [Sun, 28 Oct 2018 17:37:21 +0000 (23:07 +0530)]
ceph-validate: added functions to accept true and false
ceph-validate used to throw an error when flags were set to the strings
'true' or 'false' instead of the booleans True and False.
Users can now set the flags `dmcrypt` and `osd_auto_discovery` to 'true' or 'false'.
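For example, both of the following forms now pass validation, where previously
only the unquoted booleans were accepted (the file path in the comment is
illustrative):
```
# e.g. group_vars/osds.yml – plain booleans and quoted strings both validate
dmcrypt: true
osd_auto_discovery: 'false'
```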
Rishabh Dave [Wed, 31 Oct 2018 14:46:13 +0000 (10:46 -0400)]
remove configuration files for ceph packages on ubuntu clusters
For apt-get, the purge command needs to be used instead of the remove
command in order to also remove the related configuration files. Otherwise,
packages might still be shown as installed by dpkg even after removing them.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1640061
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 640cad3fd810f0aacd41fc35b96f0be3f85fbd0d)
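A minimal sketch of the change (the package list is illustrative): with
Ansible's apt module, purging rather than removing drops the configuration
files as well.
```
# "purge: true" only takes effect together with "state: absent"; it removes
# the packages' configuration files so dpkg no longer reports them.
- name: purge ceph packages on Ubuntu
  apt:
    name: ['ceph', 'ceph-common', 'ceph-base']
    state: absent
    purge: true
```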
rgw: add a dedicated variable for multisite endpoint
We should give users the possibility to set whichever IP they want as the
multisite endpoint, while defaulting the value to `{{ ansible_fqdn }}` so
they are not forced to set this variable.
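Sketched as a defaults entry (the variable name is an assumption; only the
defaulting pattern matters):
```
# Let users override the multisite endpoint address, falling back to the
# node's FQDN so nothing has to be set explicitly.
rgw_multisite_endpoint_addr: "{{ ansible_fqdn }}"
```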
Ali Maredia [Mon, 18 Sep 2017 22:33:23 +0000 (18:33 -0400)]
rgw: update rgw multisite tasks
- remove destroy tasks
- cleanup conditionals and syntax
- remove unnecessary realm pulls
- enable multisite to be tested in automated
testing infra
- add multisite related vars to main.yml and
group_vars
- update README-MULTISITE
- ensure all `radosgw-admin` commands are run
on a mon (see the sketch after this list)
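A sketch of that last point (the task name and command are illustrative;
`mon_group_name` is the usual ceph-ansible group variable): delegating to the
first monitor keeps `radosgw-admin` off the rgw nodes.
```
# Run radosgw-admin on a monitor regardless of which hosts the play targets.
- name: get the current period
  command: radosgw-admin period get
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  changed_when: false
```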
Sébastien Han [Tue, 30 Oct 2018 11:18:16 +0000 (12:18 +0100)]
travis: add ansible-galaxy integration
This instructs Travis to notify Galaxy when a build completes. Since 3.0,
Ansible Galaxy has the ability to build and push roles from repos
containing multiple roles.
Closes: https://github.com/ceph/ceph-ansible/issues/3165
Signed-off-by: Sébastien Han <seb@redhat.com>
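The integration boils down to a webhook notification in `.travis.yml`; a
sketch of the kind of stanza involved (the exact notification URL should be
taken from the Galaxy documentation):
```
# Ping Galaxy when a build completes so it can re-import the roles from
# this repository.
notifications:
  webhooks: https://galaxy.ansible.com/api/v1/notifications/
```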
Neha Ojha [Thu, 25 Oct 2018 17:45:00 +0000 (17:45 +0000)]
roles: do not limit docker_memory_limit for various daemons
Since we do not have enough data to put valid upper bounds for the memory
usage of these daemons, do not put artificial limits by default. This will
help us avoid failures like OOM kills due to low default values.
Whenever required, these limits can be manually enforced by the user.
More details in
https://bugzilla.redhat.com/show_bug.cgi?id=1638148
Since we set `configure_firewall: true` in
`ceph-defaults/defaults/main.yml`, there is no need to explicitly set it
in the `centos7_cluster` and `docker_cluster` testing scenarios.
Sébastien Han [Thu, 9 Aug 2018 09:32:53 +0000 (11:32 +0200)]
rolling_update: fix upgrade when using fqdn
Clusters that were deployed using `mon_use_fqdn` have a different unit
name, so during the upgrade this name must be used, otherwise the upgrade
will fail looking for a unit that does not exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
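A minimal sketch of the idea (task names are illustrative; `mon_use_fqdn`
comes from the commit message): derive the unit instance name from how the
cluster was deployed before touching the service.
```
# FQDN-deployed clusters use the FQDN as the systemd instance name, other
# clusters use the short hostname.
- name: set the monitor unit instance name
  set_fact:
    monitor_name: "{{ ansible_fqdn if mon_use_fqdn | bool else ansible_hostname }}"

- name: restart the ceph monitor
  systemd:
    name: "ceph-mon@{{ monitor_name }}"
    state: restarted
```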
Andrew Schoen [Tue, 16 Oct 2018 15:20:54 +0000 (10:20 -0500)]
validate: check the version of python-notario
If the version of python-notario is < 0.0.13, an error message such as
"TypeError: validate() got an unexpected keyword argument
'defined_keys'" is given, which is not helpful in figuring
out that you have an incorrect version of python-notario.
This check will avoid that situation by telling the user that they need
to upgrade python-notario before they hit that error.
- fix a typo in vagrant_variables that causes a networking issue for the
containerized scenario.
- add containerized_deployment: true
- remove a useless block of code: the fact docker_exec_cmd is set in
ceph-defaults, which is played right after.
Sébastien Han [Thu, 24 May 2018 17:47:29 +0000 (10:47 -0700)]
infra: add a gather-ceph-logs.yml playbook
Add a gather-ceph-logs.yml playbook which will log onto all the machines from
your inventory and gather their ceph logs. This is not intended to work
in containerized environments since the logs are stored in journald.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582280
Signed-off-by: Sébastien Han <seb@redhat.com>
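A minimal sketch of what such a playbook can look like (the log path and the
local destination are assumptions): find the ceph logs on every host and fetch
them back to the ansible controller.
```
- hosts: all
  gather_facts: false
  tasks:
    - name: find ceph log files
      find:
        paths: /var/log/ceph
        patterns: '*.log'
      register: ceph_log_files

    - name: fetch ceph logs to the ansible controller
      fetch:
        src: "{{ item.path }}"
        dest: "gathered-ceph-logs/{{ inventory_hostname }}/"
      loop: "{{ ceph_log_files.files }}"
```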
Sébastien Han [Thu, 27 Sep 2018 14:31:22 +0000 (16:31 +0200)]
infra: rename osd-configure to add-osd and improve it
The playbook has various improvements:
* run ceph-validate role before doing anything
* run ceph-fetch-keys only on the first monitor of the inventory list
* set noup flag so PGs get distributed once all the new OSDs have been
added to the cluster and unset it when they are up and running (see the
sketch after this entry)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
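A sketch of the noup handling from the list above (task names are
illustrative; `mon_group_name` is the usual ceph-ansible group variable):
```
# Set noup before the new OSDs are created so PGs are not remapped one OSD
# at a time, then unset it once all of them are up.
- name: set the noup flag
  command: ceph osd set noup
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  changed_when: false

# ... OSD creation happens here ...

- name: unset the noup flag
  command: ceph osd unset noup
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  changed_when: false
```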
Sébastien Han [Thu, 27 Sep 2018 14:29:22 +0000 (16:29 +0200)]
ceph-fetch-keys: refact
This commit simplifies the usage of the ceph-fetch-keys role. The role
now has a nicer way to find the various ceph keys and fetch them onto the
ansible server.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
Andy McCrae [Fri, 5 Oct 2018 13:36:36 +0000 (14:36 +0100)]
Add ability to use a different client container
Currently a throw-away container is built to run ceph client
commands to set up users, pools & auth keys. It utilises
the same base ceph container, which has all the ceph services
inside it.
This PR allows the use of a separate container if the deployer
wishes, but defaults to using the same full ceph container.
This can be used for different architectures or distributions
which may support the Ceph client but not the Ceph server,
and allows the deployer to build and specify a separate client
container if need be.
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
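Sketched as defaults (the client-specific variable names are assumptions;
`ceph_docker_image` and `ceph_docker_image_tag` are the existing image
variables): default the client image to the full ceph image so existing
deployments are unchanged.
```
# Deployers can point these at a lighter, client-only image; by default the
# regular ceph image is reused.
ceph_client_docker_image: "{{ ceph_docker_image }}"
ceph_client_docker_image_tag: "{{ ceph_docker_image_tag }}"
```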
infra: fix wrong condition on firewalld start task
A task that is not skipped won't have the `skipped` attribute, so the `start
firewalld` task will complain about it.
Indeed, the `skipped` and `rc` attributes won't both exist, since the first
task, `check firewalld installation on redhat or suse`, won't be skipped in
the case of a non-containerized deployment.
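A sketch of a safer condition (the register name is illustrative): never
access `skipped` or `rc` directly, default them instead.
```
# A task that actually ran has no "skipped" attribute and a skipped task has
# no "rc", so both accesses need a default.
- name: start firewalld
  service:
    name: firewalld
    state: started
    enabled: true
  when:
    - not (firewalld_pkg_query.skipped | default(false))
    - firewalld_pkg_query.rc | default(1) == 0
```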
binhong.hua [Wed, 10 Oct 2018 15:24:30 +0000 (23:24 +0800)]
vagrantfile: remove disk path of OSD nodes
The OSD nodes' disks remain on the vagrant host when running "vagrant destroy",
because we use the time as part of the disk path, and the time at destroy does
not equal the time at create.
We already use random_hostname with the libvirt backend, so it will create the
disks using the hostname as part of the disk name, for example:
vagrant_osd2_1539159988_065f15e3e1fa6ceb0770-hda.qcow2.
`ceph_osd_container_stat` might not be set on the other OSD nodes.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.
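Sketched as a guard (the debug task is a stand-in for the real consumer of
the variable): check that we are on the last OSD node before dereferencing the
registered result.
```
# The first condition short-circuits on every node except the last one, so
# ceph_osd_container_stat is never evaluated where it may be undefined.
- name: act on the container stat (stand-in task)
  debug:
    msg: "osd containers already checked on this node"
  when:
    - inventory_hostname == groups[osd_group_name] | last
    - ceph_osd_container_stat is defined
```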
Sébastien Han [Tue, 2 Oct 2018 15:37:06 +0000 (17:37 +0200)]
ceph-handler: change osd container check
Now that the container is named ceph-osd@<id>, looking for something that
contains a hostname is no longer necessary. This is also backward compatible
as it will continue to match container names with a hostname in them.
Sébastien Han [Fri, 28 Sep 2018 16:07:08 +0000 (18:07 +0200)]
osd: ceph-volume activate, just pass the OSD_ID
We don't need to pass the device and discover the OSD ID. We have a
task that gathers all the OSD IDs present on that machine, so we simply
re-use them and activate them. This also handles the situation where you
have multiple OSDs running on the same device.
Sébastien Han [Fri, 28 Sep 2018 11:06:18 +0000 (13:06 +0200)]
ceph_volume: add container support for batch command
The batch option was added recently, and while rebasing this patch it was
necessary to implement it. So now the batch option also works in
containerized environments.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630977
Signed-off-by: Sébastien Han <seb@redhat.com>