Andrew Schoen [Wed, 6 Jun 2018 19:59:47 +0000 (14:59 -0500)]
tests: increase ssh timeout and retries in ansible.cfg
We see quite a few failures in the CI related to testing nodes losing
ssh connection. This modification allows ansible to retry more times and
wait longer before timing out. This seems to really affect testing
scenarios that use a large amount of testing nodes. The centos7_cluster
scenario specifically has 12 nodes and suffered from these failures
often.
Potential error if someone doesnt pass the mode in `keys` dict for
client nodes:
```
fatal: [client2]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'mode'
The error appears to have been in '/home/guits/ceph-ansible/roles/ceph-client/tasks/create_users_keys.yml': line 117, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: get client cephx keys
^ here
exception type: <class 'ansible.errors.AnsibleUndefinedVariable'>
exception: 'dict object' has no attribute 'mode'
```
adding a default value will avoid the deployment failing for this.
Functional tests are broken when testing against 'dev' release (ceph).
Adding a dummy value here will make it possible to run ceph-ansible CI
against dev ceph release.
Typical error:
```
> if request.node.get_marker("from_luminous") and ceph_release_num[ceph_stable_release] < ceph_release_num['luminous']:
E KeyError: 'dev'
```
Andrew Schoen [Mon, 4 Jun 2018 15:36:32 +0000 (10:36 -0500)]
ceph-common: move firewall checks after package installation
We need to do this because on dev or rhcs installs ceph_stable_release
is not mandatory and the firewall check tasks have a task that is
conditional based off the installed version of ceph. If we perform those
checks after package install then they will not fail on dev or rhcs
installs.
client: use dummy created container when there is no mon in inventory
the `docker_exec_cmd` fact set in client role when there is no monitor
in inventory is wrong, `ceph-client-{{ hostname }}` is never created so
it will fail anyway.
41b4632 has introduced a change in functionnals tests.
Since the admin keyring isn't copied on rgw nodes anymore in tests, let's use
the rgw keyring to achieve them.
Sébastien Han [Tue, 5 Jun 2018 03:56:55 +0000 (11:56 +0800)]
test: do not always copy admin key
The admin key must be copied on the osd nodes only when we test the
shrink scenario. Shrink relies on ceph-disk commands that require the
admin key on the node where it's being executed.
Now we only copy the key when running on the shrink-osd scenario.
Refact of 8704144e3157aa253fb7563fe701d9d434bf2f3e
There is no need to have duplicated tasks for this. The rgw pools
creation should be delegated on a monitor node se we don't have to care
if the admin keyring is present on rgw node.
By the way, only one task is needed to create the pools, we just need to
use the `docker_exec_cmd` fact already defined in `ceph-defaults` to
achieve it.
Sébastien Han [Mon, 4 Jun 2018 02:40:14 +0000 (10:40 +0800)]
ceph-common: add firewall rules for ceph-mgr
Prior to this commit the firewall tasks were not opening the ceph-mgr
ports. This would lead to unclean configuration since the ceph-mgr
daemons can not connect to the OSDs.
Thi commit opens the right ports on the ceph-mgr nodes to talk with the
OSDs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526400 Signed-off-by: Sébastien Han <seb@redhat.com>
Erwan Velu [Fri, 1 Jun 2018 16:53:10 +0000 (18:53 +0200)]
ceph-defaults: Enable local epel repository
During the tests, the remote epel repository is generating a lots of
errors leading to broken jobs (issue #2666)
This patch is about using a local repository instead of a random one.
To achieve that, we make a preliminary install of epel-release, remove
the metalink and enforce a baseurl to our local http mirror.
That should speed up the build process but also avoid the random errors
we face.
This patch is part of a patch series that tries to remove all possible yum failures.
Since the openstack_config.yml has been moved to `ceph-osd` we must move
this `set_fact` in ceph-osd otherwise the tasks in
`openstack_config.yml` using `openstack_keys` will actually use the
defaults value from `ceph-defaults`.
osds: wait for osds to be up before creating pools
This is a follow up on #2628.
Even with the openstack pools creation moved later in the playbook,
there is still an issue because OSDs are not all UP when trying to
create pools.
Adding a task which checks for all OSDs to be UP with a `retries/until`
condition should definitively fix this issue.
Fix a typo in `tag` target, double quote are missing here.
Without them, the `make tag` command fails like this:
```
if [[ "v3.0.35" == ]]; then \
echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \
exit 1; \
fi
/bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator
/bin/sh: -c: line 0: syntax error near `;'
/bin/sh: -c: line 0: `if [[ "v3.0.35" == ]]; then echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; exit 1; fi'
make: *** [tag] Error 2
```
When playing ceph-mds role, mon nodes have set a fact with the default
pg num for osd pools, we can simply default to this value for cephfs
pools (`cephfs_pools` variable).
At the moment the variable definition for `cephfs_pools` looks like:
and we have a task in `ceph-validate` to ensure `pgs` has been set to a
valid value.
We could simply avoid this check by setting the default value of `pgs`
to `hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num']` and
let to users the possibility to override this value.
in `ceph-osd` there is no need to set `docker_exec_cmd` since the only
place where this fact is used is in `openstack_config.yml` which
delegate all docker command to a monitor node. It means we need the
`docker_exec_cmd` fact that has been set referring to `ceph-mon-*`
containers, this fact is already set earlier in `ceph-defaults`.
By the way, when collocating an OSD with a MON it fails because the container
`ceph-osd-{{ ansible_hostname }}` doesn't exist.
Removing this task will allow to collocate an OSD with a MON.
For a few moment we can see failures in the CI for containerized
scenarios because VMs are running out of space at some point.
The default in the images used is to have only 3Gb for root partition
which doesn't sound like a lot.
Typical error seen:
```
STDERR:
failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```
Indeed, on the machine we can see:
```
Every 2.0s: df -h Tue May 29 17:21:13 2018
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/atomicos-root 3.0G 3.0G 14M 100% /
```
The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.
```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
Logical volume atomicos/root successfully resized.
```
In the CI we can see at many times failures like following:
`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`
It seems the fastest mirror detection is sometimes counterproductive and
leads yum to fail.
This fix has been added in the `setup.yml`.
This playbook was used until now only just before playing `testinfra`
and could be used before running ceph-ansible so we can add some
provisionning tasks.
When collocating mds on monitor node, the cephpfs will fail
because `docker_exec_cmd` is reset to `ceph-mds-monXX` which is
incorrect because we need to delegate the task on `ceph-mon-monXX`.
In addition, it wouldn't have worked since `ceph-mds-monXX` container
isn't started yet.
Moving the task earlier in the `ceph-mds` role will fix this issue.
Sébastien Han [Mon, 16 Apr 2018 13:57:23 +0000 (15:57 +0200)]
rgw: container add option to configure multi-site zone
You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host from your
inventory and assign them a value. Once the rgw container starts it'll
pick the info and add itself to the right zone.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637 Signed-off-by: Sébastien Han <seb@redhat.com>
Since we fixed the `gather and delegate facts` task, this exception is
not needed anymore. It's a leftover that should be removed to save some
time when deploying a cluster with a large client number.
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move cephfs pools creation in `ceph-mds` role.
let's move this variable in group_vars/all.yml in all testing scenarios
accordingly to this commit 1f15a81c480f60bc82bfc3a1aec3fe136e6d3bc4 so
we keep consistency between the playbook and the tests.
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move openstack pools creation at the end of `ceph-osd` role.
Luigi Toscano [Tue, 22 May 2018 09:46:33 +0000 (11:46 +0200)]
ceph-radosgw: disable NSS PKI db when SSL is disabled
The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.
It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.
Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a122b7ce8ec6a5c27a70bc19589d177.
Now profiles drives the setting of rgw keystone *.
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
Vishal Kanaujia [Wed, 16 May 2018 09:58:31 +0000 (15:28 +0530)]
Skip GPT header creation for lvm osd scenario
The LVM lvcreate fails if the disk already has a GPT header.
We create GPT header regardless of OSD scenario. The fix is to
skip header creation for lvm scenario.
Sébastien Han [Tue, 22 May 2018 23:52:40 +0000 (16:52 -0700)]
rolling_update: fix get fsid for containers
When running ansible2.4-update_docker_cluster there is an issue on the
"get current fsid" task. The current task only works for
non-containerized deployment but will run all the time (even for
containerized). This currently results in the following error:
This is not really representative on the real error since the 'ceph' cli is available on that machine.
On other environments we will have something like "command not found: ceph".
Fix restarting OSDs twice during a rolling update.
During a rolling update, OSDs are restarted twice currently. Once, by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
Alfredo Deza [Mon, 21 May 2018 12:09:00 +0000 (08:09 -0400)]
validate: split schema for lvm osd scenario per objecstore
The bluestore lvm osd scenario does not require a journal entry. For
this reason we need to have a separate schema for that and filestore or
notario will fail validation for the bluestore lvm scenario because the
journal key does not exist in lvm_volumes.
Andrew Schoen [Mon, 21 May 2018 15:11:22 +0000 (10:11 -0500)]
ceph-validate: do not check ceph version on dev or rhcs installs
A dev or rhcs install does not require ceph_stable_release to be set and
instead generates that by looking at the installed ceph-version.
However, at this point in the playbook ceph may not have been installed
yet and ceph-common has not be run.
Fixes: https://github.com/ceph/ceph-ansible/issues/2618 Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Andrew Schoen [Tue, 8 May 2018 14:26:07 +0000 (09:26 -0500)]
ceph-defaults: remove backwards compat for containerized_deployment
The validation module does not get config options with the template
syntax rendered, so we're gonna remove that and just default it to
False. The backwards compat was schedule to be removed in 3.1 anyway.
Andrew Schoen [Wed, 2 May 2018 17:55:33 +0000 (12:55 -0500)]
adds a requiremnts.txt file for the project
With the addition of the validate module we need to ensure
that notario is installed. This will be done with the use
of this requirments.txt file and pip.
Andrew Schoen [Tue, 1 May 2018 16:22:31 +0000 (11:22 -0500)]
ceph-defaults: fix failing tasks when osd_scenario was not set correctly
When devices is not defined because you want to use the 'lvm'
osd_scenario but you've made a mistake selecting that scenario these
tasks should not fail.
Sébastien Han [Fri, 18 May 2018 12:43:57 +0000 (14:43 +0200)]
defaults: restart_osd_daemon unit spaces
Extra space in systemctl list-units can cause restart_osd_daemon.sh to
fail
It looks like if you have more services enabled in the node space
between "loaded" and "active" get more space as compared to one space
given in command the command[1].
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317 Signed-off-by: Sébastien Han <seb@redhat.com>
take-over: fix bug when trying to override variable
A customer has been facing an issue when trying to override
`monitor_interface` in inventory host file.
In his use case, all nodes had the same interface for
`monitor_interface` name except one. Therefore, they tried to override
this variable for that node in the inventory host file but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of undefined variable.
Typical error:
```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```
Including variables like this `include_vars: group_vars/all.yml` prevent
us from overriding anything in inventory host file because it
overwrites everything you would have defined in inventory.
Andy McCrae [Mon, 19 Feb 2018 16:57:18 +0000 (16:57 +0000)]
Fix template reference for ganesha.conf
We can simply reference the template name since it exists within the
role that we are calling. We don't need to check the ANSIBLE_ROLE_PATH
or playbooks directory for the file.
Sébastien Han [Wed, 16 May 2018 15:37:10 +0000 (17:37 +0200)]
switch: disable ceph-disk units
During the transition from jewel non-container to container old ceph
units are disabled. ceph-disk can still remain in some cases and will
appear as 'loaded failed', this is not a problem although operators
might not like to see these units failing. That's why we remove them if
we find them.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846 Signed-off-by: Sébastien Han <seb@redhat.com>
there is some leftover on devices when purging osds because of a invalid
device list construction.
typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
"changed": true,
"cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
"delta": "0:00:00.015188",
"end": "2018-05-16 12:41:40.408597",
"item": "/dev/sda sda1",
"rc": 0,
"start": "2018-05-16 12:41:40.393409"
}
Error: Could not stat device /dev/sda sda1 - No such file or directory.
```
the devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output
For instance, it will result with a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
Sébastien Han [Wed, 16 May 2018 14:02:41 +0000 (16:02 +0200)]
rolling_update: move osd flag section
During a minor update from a jewel to a higher jewel version (10.2.9 to
10.2.10 for example) osd flags don't get applied because they were done
in the mgr section which is skipped in jewel since this daemons does not
exist.
Moving the set flag section after all the mons have been updated solves
that problem.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1548071 Co-authored-by: Tomas Petr <tpetr@redhat.com> Signed-off-by: Sébastien Han <seb@redhat.com>
Ken Dreyer [Thu, 10 May 2018 23:08:05 +0000 (17:08 -0600)]
Makefile: add "make tag" command
Add a new "make tag" command. This automates some common operations:
1) Automatically determine the next Git tag version number to create.
For example:
"3.2.0beta1 -> "3.2.0beta2"
"3.2.0rc1 -> "3.2.0rc2"
"3.2.0" -> "3.2.1"
2) Create the Git tag, and print instructions for the user to push it to
GitHub.
3) Sanity check that HEAD is a stable-* branch or master (bail on
everything else).
4) Sanity check that HEAD is not already tagged.
Note, we will still need to tag manually once each time we change the
format, for example when moving from tagging "betas" to tagging "rcs",
or "rcs" to "stable point releases".
Signed-off-by: Ken Dreyer <kdreyer@redhat.com> Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>