Matthew Vernon [Wed, 19 Sep 2018 12:26:26 +0000 (13:26 +0100)]
restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.
On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".
Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
(could this be backported to at least stable-3.0 please?)
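A minimal shell sketch of the adjusted check (hypothetical: the real template parses `ceph -s` output, and the JSON layout assumed here is the luminous-era one):
```
# count pgs whose state contains "active+clean" (scrubbing, snaptrim, ... included)
num_pgs=$(ceph --cluster "${CLUSTER:-ceph}" -s -f json | python -c 'import json,sys; print(json.load(sys.stdin)["pgmap"]["num_pgs"])')
good_pgs=$(ceph --cluster "${CLUSTER:-ceph}" -s -f json | python -c 'import json,sys; print(sum(s["count"] for s in json.load(sys.stdin)["pgmap"]["pgs_by_state"] if "active+clean" in s["state_name"]))')
# integer comparison against num_pgs in pgmap, as described above
test "$good_pgs" -eq "$num_pgs" && echo "cluster is happy"
```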
Matthew Vernon [Fri, 21 Sep 2018 16:55:01 +0000 (17:55 +0100)]
restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs
Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the cluster's pgs to be happy again before continuing.
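A minimal sketch of the idea, with hypothetical helper names (`pgs_are_clean`, `DELAY`):
```
check_pgs() {
    local retries=$RETRIES        # reset the counter on every invocation
    until [ "$retries" -eq 0 ]; do
        pgs_are_clean && return 0 # hypothetical helper wrapping the pg check
        sleep "$DELAY"
        retries=$((retries - 1))
    done
    echo "Timed out waiting for clean pgs" >&2
    exit 1
}
```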
```
Exception occurred:
File "/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py", line 26, in <module>
from sphinx.ext import doctest
SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```
The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'
The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: set_fact ceph_current_status (convert to json)
^ here
```
From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a
What's happening here is that all the hosts except the clients are running atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
The condition skips all the nodes except the clients; thus when running ceph-defaults, the task "is ceph running already?" is skipped, but the task above needs the rc of that skipped task.
This is not an error from the playbook, it's a CI setup issue.
Sébastien Han [Tue, 21 Aug 2018 09:15:44 +0000 (11:15 +0200)]
rolling_upgrade: set sortbitwise properly
Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. While the OSDs are being updated, even though the package is updated
they won't report their new version (12) and will stick with 10 until the
command is applied. So we have to check whether OSDs are reporting version
10 and then run the command to unlock the OSDs.
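A hedged sketch of that check (the playbook's exact detection may differ; `ceph versions` is available once the mons run luminous):
```
# unlock the OSDs once jewel (v10) OSDs are detected
if ceph --cluster "${CLUSTER:-ceph}" versions | grep -q '"ceph version 10\.'; then
    ceph --cluster "${CLUSTER:-ceph}" osd set sortbitwise
fi
```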
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2e6e885bb75156c74735a65c05b4757b031041bb)
Sébastien Han [Mon, 20 Aug 2018 13:53:03 +0000 (15:53 +0200)]
iscsi group name preserve backward compatibility
Recently we renamed the group_name for iscsi gateways to iscsigws, where
previously it was named iscsi-gws. Existing deployments with a host file
section using iscsi-gws must continue to work.
This commit adds the old group name back for backward compatibility. No
error from Ansible should be expected: if the hostgroup is not found,
nothing is played.
Sébastien Han [Mon, 20 Aug 2018 12:41:06 +0000 (14:41 +0200)]
take-over-existing-cluster: do not call var_files
We were using var_files long ago when default variables were not in
ceph-defaults; now that the role exists this is no longer needed. Moreover,
having these two var files added caused problems:
it was giving a new failure later during the rolling_update process.
Basically, it was modifying the list of devices and started impacting
ceph-osd itself. The modification to accommodate the osd_auto_discovery
parameter should happen outside of ceph-osd.
Also, we are trying not to play the ceph-osd role during the rolling_update
process so we can speed up the upgrade.
Sébastien Han [Mon, 13 Aug 2018 13:59:25 +0000 (15:59 +0200)]
rolling_update: register container osd units
Before running the upgrade, let's call systemd to collect unit names
instead of relying on the device list. This is more accurate and fixes
the osd_auto_discovery scenario too.
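For instance, a shell one-liner in the same spirit (a sketch, not the exact task):
```
# collect the ceph-osd@<id> unit names known to systemd
systemctl list-units --no-legend 'ceph-osd@*' | grep -oE 'ceph-osd@[^[:space:]]+\.service'
```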
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit dad10e8f3f67f1e0c6a14ef3e0b1f51f90d9d962)
Sébastien Han [Wed, 6 Jun 2018 06:41:46 +0000 (14:41 +0800)]
contrib: fix generate group_vars samples
For the ceph-iscsi-gw and ceph-rbd-mirror roles the group_name is named
differently (by default) than the role name, so we have to change the
script to generate the correct name.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1602327
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 315ab08b1604e655eee4b493eb2c1171a67df506)
Sébastien Han [Fri, 10 Aug 2018 09:08:14 +0000 (11:08 +0200)]
mon: fix calamari initialisation
If calamari is already installed and ceph has been upgraded to a higher
version, the initialisation will fail later. So if we detect that the
calamari-server is too old compared to ceph_rhcs_version, we try to update
it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1601755
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4c9e24a90fca2271978a066e38dfadead88d8167)
Sébastien Han [Thu, 9 Aug 2018 13:18:34 +0000 (15:18 +0200)]
osd: generate device list for osd_auto_discovery on rolling_update
rolling_update relies on the list of devices when performing the restart
of the OSDs. The task that is building the devices list out of the
ansible_devices dict only runs when there are no partitions on the
drives. However, during an upgrade the OSDs are already configured: they
have been prepared and have partitions, so this task won't run and the
devices list will be empty, skipping the restart during
rolling_update. We now run the same task under different requirements
when rolling_update is true and build a list when (a shell equivalent is sketched after this list):
* osd_auto_discovery is true
* rolling_update is true
* ansible_devices exists
* no dm/lv are part of the discovery
* the device is not removable
* the device has more than 1 sector
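A hypothetical shell equivalent of that discovery (the playbook filters `ansible_devices`; `lsblk` flags approximate the same conditions):
```
# whole non-removable disks (no dm/lv), larger than one 512-byte sector
lsblk -dnb -o NAME,TYPE,RM,SIZE | awk '$2 == "disk" && $3 == 0 && $4 > 512 {print "/dev/" $1}'
```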
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit e84f11e99ef42057cd1c3fbfab41ef66cda27302)
Fix the regular expression matching the OSD ID on non-containerized
deployments.
restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do so, the script loops over the list of ceph-osd@ services on the
system. This commit fixes a bug in the regular expression responsible
for extracting the OSD IDs: the prior version used the `[0-9]{1,2}`
expression, which ignores all OSDs whose IDs are greater than 99 (thus
longer than 2 digits). The fix removes the upper limit on the number of digits.
This problem existed in two places in the script.
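Illustratively (a sketch of the pattern change, not the exact script lines):
```
# before: matches at most two digits, so OSD ids >= 100 are missed
grep -oE 'ceph-osd@[0-9]{1,2}\.service'
# after: no upper bound on the id length
grep -oE 'ceph-osd@[0-9]+\.service'
```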
defaults: backward compatibility with fqdn deployments
This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployments to be done with shortnames, we
must keep backward compatibility with clusters already deployed with an
fqdn configuration.
Sébastien Han [Mon, 23 Jul 2018 12:56:20 +0000 (14:56 +0200)]
rolling_update: set osd sortbitwise
An upgrade from RHCS 2 to RHCS 3 will fail if the cluster still uses
nibblewise sorting (sortbitwise unset):
it stays stuck on "TASK [waiting for clean pgs...]" as RHCS 3 OSDs will
not start if nibblewise is set.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b3266c5be2f88210589cfa56a5fe0a5092f79ee6)
Sébastien Han [Fri, 27 Jul 2018 14:52:19 +0000 (16:52 +0200)]
osd: do not remove expose_partition container
The container runs with --rm which means it will be deleted by Docker
when exiting. Also 'docker rm -f' is not idempotent and returns 1 if the
container does not exist.
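A guarded removal would look something like this (a sketch; the container name is illustrative):
```
# only force-remove the container if it still exists
if docker inspect expose_partitions >/dev/null 2>&1; then
    docker rm -f expose_partitions
fi
```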
Since no latest-bis-jewel exists, it's using latest-bis, which points to
ceph mimic. In our testing, using it for idempotency/handlers tests
means upgrading from jewel to mimic, which is not what we want to do.
Sébastien Han [Wed, 6 Jun 2018 04:07:33 +0000 (12:07 +0800)]
ceph-iscsi: rename group iscsi_gws
Let's try to avoid using dashes, as testinfra needs to be able to read
the groups.
Typically, with iscsi-gws we can't add a marker for these iscsi nodes;
using an underscore fixes the issue.
We must check the generated fact `_disabled_ceph_mgr_modules` to
enable disabled mgr modules.
Otherwise, this task will be skipped because it's not comparing the
right list.
tests: fix broken tests on collocated daemons scenarios
At the moment, a lot of tests are skipped when daemons are collocated.
Our tests considered a node to belong to only one group, while in certain
scenarios it can belong to multiple groups.
Also, pin pytest to 3.6.1 so we can use `request.node.iter_markers()`.
tests: skip rgw_tuning_pools_are_set when rgw_create_pools is not defined
Since the ooo_collocation scenario is supposed to be the same scenario as
the one tested by OSP, and OSP does not pass `rgw_create_pools`, the test
`test_docker_rgw_tuning_pools_are_set` will fail:
```
> pools = node["vars"]["rgw_create_pools"]
E KeyError: 'rgw_create_pools'
```
Skipping this test when `node["vars"]["rgw_create_pools"]` is not defined
fixes this failure.
tests: factorize docker tests using docker_exec_cmd logic
Avoid duplicating tests unnecessarily just because of the docker exec syntax.
Using the same logic as in the playbook with `docker_exec_cmd` allows us
to execute the same test on both containerized and non-containerized environments.
The idea is to set a variable `docker_exec_cmd` to the
'docker exec <container-name>' string when containerized and
to '' when not containerized.
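For example (a sketch of the idea; the container name is an assumption):
```
if [ "${CONTAINERIZED_DEPLOYMENT:-false}" = "true" ]; then
    docker_exec_cmd="docker exec ceph-mon-$(hostname -s)"
else
    docker_exec_cmd=""
fi
# the same test body then works in both environments:
${docker_exec_cmd} ceph --cluster ceph -s
```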
This part has changed in mimic and became:
```
"servicemap": {
"epoch": 2,
"modified": "2018-07-04 09:54:36.164786",
"services": {
"rbd-mirror": {
"daemons": {
"summary": "",
"14151": {
"start_epoch": 2,
"start_stamp": "2018-07-04 09:54:35.541272",
"gid": 14151,
"addr": "192.168.1.80:0/240942528",
"metadata": {
"arch": "x86_64",
"ceph_release": "mimic",
"ceph_version": "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)",
"ceph_version_short": "13.2.0",
"cpu": "Intel(R) Xeon(R) CPU X5650 @ 2.67GHz",
"distro": "centos",
"distro_description": "CentOS Linux 7 (Core)",
"distro_version": "7",
"hostname": "ceph-rbd-mirror0",
"id": "ceph-rbd-mirror0",
"instance_id": "14151",
"kernel_description": "#1 SMP Wed May 9 18:05:47 UTC 2018",
"kernel_version": "3.10.0-862.2.3.el7.x86_64",
"mem_swap_kb": "1572860",
"mem_total_kb": "1015548",
"os": "Linux"
}
}
}
}
}
}
```
This patch modifies the function `test_rbd_mirror_is_up()` in
`test_rbd_mirror.py` so it works with `mimic` and keeps backward compatibility
with `luminous`.
tests: fix `_get_osd_id_from_host()` in TestOSDs()
We must initialize the `children` variable in `_get_osd_id_from_host()`;
otherwise, if for any reason the deployment has failed and results in an
OSD host with no OSD registered, we won't enter the condition,
`children` is never set, and the function tries to return something
undefined.
Typical error:
```
E UnboundLocalError: local variable 'children' referenced before assignment
```
These tests are skipped on bluestore OSD scenarios.
They were going to fail anyway since they are run on mon nodes while
`devices` is defined in the inventory for each OSD node, so
`num_devices * num_osd_hosts` returns `0` and the test expects to have
0 OSDs up.
The idea here is to move these tests so they are run on OSD nodes.
Each OSD node checks that its respective OSDs are up: if an OSD node has 2
devices defined in the `devices` variable, it means we are checking for 2
OSDs to be up on that node; if each node has all its OSDs up, we can say
all OSDs are up.
On containerized deployments, if a mon is stopped, the socket is not
purged, which can cause a failure when a cluster is redeployed after the
purge playbook has been run.
Typical error:
```
fatal: [osd0]: FAILED! => {}
MSG:
'dict object' has no attribute 'osd_pool_default_pg_num'
```
the fact is not set because of this previous failure earlier in the run.
Vishal Kanaujia [Wed, 16 May 2018 09:58:31 +0000 (15:28 +0530)]
Skip GPT header creation for lvm osd scenario
The LVM lvcreate fails if the disk already has a GPT header.
We were creating a GPT header regardless of the OSD scenario; the fix is
to skip header creation for the lvm scenario.
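A sketch of the guard, with `OSD_SCENARIO` and `DEV` standing in for the playbook's variables:
```
# only lay down a GPT label for non-lvm scenarios; lvcreate handles lvm itself
if [ "${OSD_SCENARIO}" != "lvm" ]; then
    parted -s "${DEV}" mklabel gpt
fi
```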
purge-docker: added conditionals needed to successfully re-run purge
Added 'ignore_errors: true' to multiple lines which run docker commands, since these fail in cases where docker is no longer installed. Without this, certain tasks in purge-docker-cluster.yml cause the playbook to fail if re-run and stop the purge, leaving behind a dirty environment and a playbook which can no longer be run.
Fix regex on line 275: sometimes 'list-units' will output 4 spaces between 'loaded' and 'active'; the update accounts for both cases.
purge fetch_directory: in other roles, fetch_directory is joined with a path, e.g. "{{ fetch_directory }}"/"{{ somedir }}". That being said, fetch_directory will never have a trailing slash in all.yml, so this task was never being run (causing failures when trying to re-deploy).
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
(cherry picked from commit d1f2d64b15479524d9e347303c8d5e4bfe7c15d8)
Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Thu, 28 Jun 2018 07:53:03 +0000 (09:53 +0200)]
systemd: remove changed_when: false
When using a module there is no need to apply this Ansible option: the
module handles idempotency on its own and decides whether or not the task
has changed during the execution.
Sébastien Han [Thu, 28 Jun 2018 07:54:24 +0000 (09:54 +0200)]
ceph-osd: trigger osd container restart on script change
The script ceph-osd-run.sh holds the config options used to start the
container; if one of these options is modified we must restart the
container. This was not the case before because the 'notify' flag
wasn't present.
Sébastien Han [Wed, 27 Jun 2018 09:23:00 +0000 (11:23 +0200)]
mon: honour mon_docker_net_host option
--net=host was hardcoded in the startup line, so even when
mon_docker_net_host was set to False the net option would always be
activated.
mon_docker_net_host is set to True by default, so this commit does not
change the default behaviour.
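A sketch of the now-honoured option (variable and image names are illustrative):
```
NET_OPT=""
[ "${MON_DOCKER_NET_HOST:-true}" = "true" ] && NET_OPT="--net=host"
docker run ${NET_OPT} --name ceph-mon ceph/daemon mon
```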
Sébastien Han [Fri, 15 Jun 2018 19:39:34 +0000 (15:39 -0400)]
mon/osd: bump container memory limit
As discussed with the core team, the current limits are too low and should
be bumped to higher values.
So now, by default, monitors get 3GB and OSDs get 5GB.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591876
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit a9ed3579ae680949b4f53aee94003ca50d1ae721)
Signed-off-by: Sébastien Han <seb@redhat.com>
tests: increase memory to 1024MB for centos7_cluster scenario
We see more and more failures like `fatal: [mon0]: UNREACHABLE! => {}` in
the `centos7_cluster` scenario. Since we have 30GB of RAM on the hypervisors,
we can give the monitors a bit more RAM. Nodes in the containerized cluster
testing scenario already have 1024MB of memory allocated.
Since we fixed the `gather and delegate facts` task, this exception is
not needed anymore. It's a leftover that should be removed to save some
time when deploying a cluster with a large number of clients.
Erwan Velu [Fri, 1 Jun 2018 16:53:10 +0000 (18:53 +0200)]
ceph-defaults: Enable local epel repository
During the tests, the remote epel repository generates a lot of
errors, leading to broken jobs (issue #2666).
This patch is about using a local repository instead of a random one.
To achieve that, we make a preliminary install of epel-release, remove
the metalink and enforce a baseurl pointing to our local http mirror.
That should speed up the build process and also avoid the random errors
we face.
This patch is part of a patch series that tries to remove all possible yum failures.
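Roughly, as a sketch (the mirror URL is a placeholder):
```
yum install -y epel-release
# drop the metalink and pin a local baseurl
sed -i -e 's/^metalink=/#metalink=/' \
       -e 's|^#baseurl=.*|baseurl=http://local-mirror.example.com/epel/7/$basearch/|' \
       /etc/yum.repos.d/epel.repo
```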
Fix a typo in the `tag` target: double quotes were missing.
Without them, the `make tag` command fails like this:
```
if [[ "v3.0.35" == ]]; then \
echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \
exit 1; \
fi
/bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator
/bin/sh: -c: line 0: syntax error near `;'
/bin/sh: -c: line 0: `if [[ "v3.0.35" == ]]; then echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; exit 1; fi'
make: *** [tag] Error 2
```
Ken Dreyer [Thu, 10 May 2018 23:08:05 +0000 (17:08 -0600)]
Makefile: add "make tag" command
Add a new "make tag" command. This automates some common operations:
1) Automatically determine the next Git tag version number to create.
For example:
"3.2.0beta1 -> "3.2.0beta2"
"3.2.0rc1 -> "3.2.0rc2"
"3.2.0" -> "3.2.1"
2) Create the Git tag, and print instructions for the user to push it to
GitHub.
3) Sanity check that HEAD is a stable-* branch or master (bail on
everything else).
4) Sanity check that HEAD is not already tagged.
Note, we will still need to tag manually once each time we change the
format, for example when moving from tagging "betas" to tagging "rcs",
or "rcs" to "stable point releases".
Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fcea56849578bd47e65b130ab6884e0b96f9d89d)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Sébastien Han [Mon, 16 Apr 2018 13:57:23 +0000 (15:57 +0200)]
rgw: container add option to configure multi-site zone
You can now set RGW_ZONE and RGW_ZONEGROUP on each rgw host in your
inventory and assign them a value. Once the rgw container starts, it'll
pick up the info and add itself to the right zone.
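For instance (hypothetical zone values, shown as they would reach the container):
```
docker run -d --name ceph-rgw \
    -e RGW_ZONE=us-east -e RGW_ZONEGROUP=us \
    ceph/daemon rgw
```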
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1c084efb3cb7e48d96c9cbd6bd05ca4f93526853)
Signed-off-by: Sébastien Han <seb@redhat.com>
For some time now we have seen failures in the CI for containerized
scenarios because VMs are running out of space at some point.
The default in the images used is to have only 3GB for the root partition,
which doesn't sound like a lot.
Typical error seen:
```
STDERR:
failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```
Indeed, on the machine we can see:
```
Every 2.0s: df -h Tue May 29 17:21:13 2018
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/atomicos-root 3.0G 3.0G 14M 100% /
```
The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.
```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
Logical volume atomicos/root successfully resized.
```
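Assuming the root filesystem is XFS (as on Atomic Host), the matching filesystem grow step would then be:
```
-bash-4.2# xfs_growfs /
```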
In the CI we regularly see failures like the following:
`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`
It seems the fastest-mirror detection is sometimes counterproductive and
leads yum to fail.
This fix has been added to `setup.yml`.
Until now this playbook was only played just before `testinfra`, but it
can also be used before running ceph-ansible so we can add some
provisioning tasks.
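One common way to avoid this, sketched here (the actual change in `setup.yml` may differ), is to disable yum's fastestmirror plugin:
```
sed -i 's/^enabled=1/enabled=0/' /etc/yum/pluginconf.d/fastestmirror.conf
```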
Paul Cuzner [Fri, 25 May 2018 00:13:20 +0000 (12:13 +1200)]
Add privilege escalation to iscsi purge tasks
Without the escalation, invocation from non-root
users will fail when accessing the rados config
object, or when attempting to log to /var/log.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit 2890b57cfc2e1ef9897a791ce60f4a5545011907)
Signed-off-by: Sébastien Han <seb@redhat.com>
Luigi Toscano [Tue, 22 May 2018 09:46:33 +0000 (11:46 +0200)]
ceph-radosgw: disable NSS PKI db when SSL is disabled
The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.
It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.
Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a122b7ce8ec6a5c27a70bc19589d177.
Now profiles drive the setting of the rgw keystone * options.
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
(cherry picked from commit 43e96c1f98312734e2f12a1ea5ef29981e9072bd)
Signed-off-by: Sébastien Han <seb@redhat.com>
Fix restarting OSDs twice during a rolling update.
During a rolling update, OSDs are currently restarted twice: once by the
handler in roles/ceph-defaults/handlers/main.yml, and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient, as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
Sébastien Han [Fri, 18 May 2018 12:43:57 +0000 (14:43 +0200)]
defaults: restart_osd_daemon unit spaces
An extra space in the systemctl list-units output can cause
restart_osd_daemon.sh to fail.
It looks like, when more services are enabled on the node, the gap
between "loaded" and "active" can contain more than the single space
the command assumed[1].
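A whitespace-tolerant match, as a sketch:
```
# accept one or more spaces between the LOAD and ACTIVE columns
systemctl list-units | grep -E "loaded +active" | grep -oE 'ceph-osd@[0-9]+\.service'
```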
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2f43e9dab5f077276162069f449978ea97c2e9c0)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is some leftover on devices when purging osds because of an invalid
device list construction.
typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
"changed": true,
"cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
"delta": "0:00:00.015188",
"end": "2018-05-16 12:41:40.408597",
"item": "/dev/sda sda1",
"rc": 0,
"start": "2018-05-16 12:41:40.393409"
}
Error: Could not stat device /dev/sda sda1 - No such file or directory.
```
The devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output.
For instance, it will result in a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
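One robust way to resolve a partition's parent device (a sketch; the playbook's actual command may differ):
```
lsblk -no pkname /dev/sda1   # prints: sda
```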
Sébastien Han [Wed, 16 May 2018 15:37:10 +0000 (17:37 +0200)]
switch: disable ceph-disk units
During the transition from jewel non-container to container, old ceph
units are disabled. ceph-disk units can still remain in some cases and will
appear as 'loaded failed'; this is not a problem, although operators
might not like to see these units failing. That's why we remove them if
we find them.
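A possible shape of that cleanup, as a sketch:
```
# disable any leftover ceph-disk units so they no longer show as 'loaded failed'
for unit in $(systemctl list-units --all --no-legend | grep -oE 'ceph-disk@[^[:space:]]+\.service'); do
    systemctl disable "$unit"
done
```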
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 49a47124859e6577fb99e6dd680c5244ccd6f38f)
Signed-off-by: Sébastien Han <seb@redhat.com>
take-over: fix bug when trying to override variable
A customer has been facing an issue when trying to override
`monitor_interface` in the inventory host file.
In their use case, all nodes had the same interface name for
`monitor_interface` except one. Therefore, they tried to override this
variable for that node in the inventory host file, but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of an undefined variable.
Typical error:
```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```
Including variables like this (`include_vars: group_vars/all.yml`) prevents
us from overriding anything in the inventory host file because it
overwrites everything you would have defined in inventory.
Sébastien Han [Wed, 16 May 2018 14:02:41 +0000 (16:02 +0200)]
rolling_update: move osd flag section
During a minor update from one jewel version to a higher one (10.2.9 to
10.2.10 for example), osd flags don't get applied because they were set
in the mgr section, which is skipped in jewel since this daemon does not
exist there.
Moving the set-flag section after all the mons have been updated solves
that problem.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1548071
Co-authored-by: Tomas Petr <tpetr@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit d80a871a078a175d0775e91df00baf625dc39725)
Sébastien Han [Mon, 14 May 2018 07:21:48 +0000 (09:21 +0200)]
iscsi: add python-rtslib repository
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 8c7c11b774f54078b32b652481145699dbbd79ff)
Signed-off-by: Sébastien Han <seb@redhat.com>
Trying to mask target when `/etc/systemd/system/target.service` doesn't
exist seems to be a bug: there is no need to mask a unit file which
doesn't exist.
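A guarded version, sketched:
```
# only mask the unit if systemd actually knows about it
if systemctl list-unit-files | grep -q '^target\.service'; then
    systemctl mask target.service
fi
```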
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a145caf947aec64467150a007b7aafe57abe2891)
Signed-off-by: Sébastien Han <seb@redhat.com>
Gregory Meno [Wed, 9 May 2018 18:17:26 +0000 (11:17 -0700)]
adds missing state needed to upgrade nfs-ganesha
In the tasks for os_family Red Hat we were missing this state.
Fixes: bz1575859
Signed-off-by: Gregory Meno <gmeno@redhat.com>
(cherry picked from commit 26f6a650425517216fb57c08e1a8bda39ddcf2b5)
Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Tue, 8 May 2018 14:11:14 +0000 (07:11 -0700)]
common: enable Tools repo for rhcs clients
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574458
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 07ca91b5cb7e213545687b8a62c421ebf8dd741d)
Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Thu, 3 May 2018 14:54:53 +0000 (16:54 +0200)]
common: copy iso files if rolling_update
If we are in the middle of an update, we want to get the new package
version being installed, so the task that copies the repo files should
not be skipped.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572032
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4a186237e6fdc98f779c2e25985da4325b3b16cd)
Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Wed, 4 Apr 2018 14:23:54 +0000 (16:23 +0200)]
add .vscode/ to gitignore
I personally develop in vscode and I have some preferences to save when it
comes to running the python unit tests, so ignoring this directory is
actually useful.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3c4319ca4b5355d69b2925e916420f86d29ee524)
Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Fri, 20 Apr 2018 09:13:51 +0000 (11:13 +0200)]
shrink-osd: ability to shrink NVMe drives
Now if the service name contains nvme we know we need to remove the last
2 characters instead of 1.
If nvme, then osd_to_kill_disks is nvme0n1 and we need nvme0.
If ssd or hdd, then osd_to_kill_disks is sda1 and we need sda.
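The trimming logic, sketched in shell:
```
disk=nvme0n1
case "$disk" in
    nvme*) parent="${disk%??}" ;;   # nvme0n1 -> nvme0 (strip 2 chars)
    *)     parent="${disk%?}"  ;;   # sda1    -> sda   (strip 1 char)
esac
echo "$parent"
```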
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1561456
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 66c1ea8cd561fce6cfe5cdd1ecaa13411c824e3a)
Signed-off-by: Sébastien Han <seb@redhat.com>