git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log
7 years ago tests: improve mds tests
Guillaume Abrioux [Wed, 6 Jun 2018 19:56:38 +0000 (21:56 +0200)]
tests: improve mds tests

The expected number of mds daemons consists of the number of daemons
that are 'up' plus the number of daemons in 'up:standby'.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c94ada69e80d7a1ddfbd2de2b13086d57a6fdfcd)

7 years ago mon: copy openstack keys over to all mon
Guillaume Abrioux [Thu, 7 Jun 2018 07:09:38 +0000 (09:09 +0200)]
mon: copy openstack keys over to all mon

When configuring openstack, the created keyrings aren't copied over to
all monitor nodes.

This should have been backported from
433ecc7cbcc1ac91cab509dabe5c647d58c18c7f, but that would imply too many
changes.
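
A minimal sketch of the kind of copy task involved, assuming the keyrings
were generated on the first monitor and fetched to the usual
`fetch_directory` layout (paths and variable names here are illustrative):

```
- name: copy openstack keyrings to all monitor nodes (sketch)
  copy:
    src: "{{ fetch_directory }}/{{ fsid }}/etc/ceph/{{ cluster }}.{{ item.name }}.keyring"
    dest: "/etc/ceph/{{ cluster }}.{{ item.name }}.keyring"
    owner: "ceph"
    group: "ceph"
    mode: "0600"
  with_items: "{{ openstack_keys }}"
```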

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1588093
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago rolling_update: fix facts gathering delegation
Guillaume Abrioux [Tue, 5 Jun 2018 14:30:12 +0000 (16:30 +0200)]
rolling_update: fix facts gathering delegation

This is a follow-up to what was done in #2560.
See #2560 and #2553 for details.

Closes: #2708
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 232a16d77ff1048a2d3c4aa743c44e864fa2b80b)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago playbook: follow up on #2553
Guillaume Abrioux [Thu, 24 May 2018 13:07:56 +0000 (15:07 +0200)]
playbook: follow up on #2553

Since we fixed the `gather and delegate facts` task, this exception is
not needed anymore. It's a leftover that should be removed to save some
time when deploying a cluster with a large number of clients.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 828848017cefd981e14ca9e4690dd7d1320f0eef)

7 years ago rgws: renames create_pools variable to rgw_create_pools.
jtudelag [Thu, 31 May 2018 15:01:44 +0000 (17:01 +0200)]
rgws: renames create_pools variable to rgw_create_pools.

Renamed to be consistent with the role (rgw) and have a meaningful name.
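
For illustration, a `group_vars` entry using the new name could look like
this sketch (pool names and pg counts are made up):

```
rgw_create_pools:
  defaults.rgw.buckets.data:
    pg_num: 16
  defaults.rgw.buckets.index:
    pg_num: 8
```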

Signed-off-by: Jorge Tudela <jtudelag@redhat.com>
(cherry picked from commit 600e1e2c2680e8102f4ef17855d4bcd89d6ef733)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Adds RGWs pool creation to containerized installation.
jtudelag [Sun, 4 Mar 2018 22:06:48 +0000 (23:06 +0100)]
Adds RGWs pool creation to containerized installation.

The ceph command has to be executed from one of the monitor containers
when no admin keyring is present on the RGW nodes; the task then has to
be delegated.
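
A sketch of the delegation pattern described here (container name and
variables are assumptions):

```
- name: create rgw pools from the first monitor container (sketch)
  command: >
    docker exec ceph-mon-{{ hostvars[groups[mon_group_name][0]]['ansible_hostname'] }}
    ceph --cluster {{ cluster }} osd pool create {{ item.key }} {{ item.value.pg_num }}
  with_dict: "{{ rgw_create_pools }}"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  changed_when: false
```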

Adds test to check proper RGW pool creation for Docker container scenarios.

Signed-off-by: Jorge Tudela <jtudelag@redhat.com>
(cherry picked from commit 8704144e3157aa253fb7563fe701d9d434bf2f3e)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago tests: skip disabling fastest mirror detection on atomic host
Guillaume Abrioux [Tue, 5 Jun 2018 07:31:42 +0000 (09:31 +0200)]
tests: skip disabling fastest mirror detection on atomic host

There is no need to execute this task on atomic hosts.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f0cd4b065144843762b9deca667e05a1903b2121)

7 years ago ceph-defaults: Enable local epel repository
Erwan Velu [Fri, 1 Jun 2018 16:53:10 +0000 (18:53 +0200)]
ceph-defaults: Enable local epel repository

During the tests, the remote epel repository is generating a lot of
errors, leading to broken jobs (issue #2666).

This patch is about using a local repository instead of a random one.
To achieve that, we make a preliminary install of epel-release, remove
the metalink and enforce a baseurl to our local http mirror.
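
Roughly, those three steps as tasks (the mirror URL is a placeholder):

```
- name: install epel-release (sketch)
  package:
    name: epel-release
    state: present

- name: remove the epel metalink (sketch)
  ini_file:
    dest: /etc/yum.repos.d/epel.repo
    section: epel
    option: metalink
    state: absent

- name: enforce a baseurl pointing at the local mirror (sketch)
  ini_file:
    dest: /etc/yum.repos.d/epel.repo
    section: epel
    option: baseurl
    value: "http://local-mirror.example.com/epel/7/$basearch"
```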

That should speed up the build process but also avoid the random errors
we face.

This patch is part of a patch series that tries to remove all possible yum failures.

Signed-off-by: Erwan Velu <erwan@redhat.com>
(cherry picked from commit 493f615eae3510021687e8cfc821364cc26a71ac)

7 years ago Makefile: followup on #2585 v3.0.36
Guillaume Abrioux [Thu, 31 May 2018 09:25:49 +0000 (11:25 +0200)]
Makefile: followup on #2585

Fix a typo in the `tag` target: double quotes are missing here.

Without them, the `make tag` command fails like this:

```
if [[ "v3.0.35" ==  ]]; then \
            echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \
            exit 1; \
        fi
/bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator
/bin/sh: -c: line 0: syntax error near `;'
/bin/sh: -c: line 0: `if [[ "v3.0.35" ==  ]]; then     echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35";     exit 1; fi'
make: *** [tag] Error 2
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0b67f42feb95594fb403908d61383dc25d6cd342)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago Makefile: add "make tag" command
Ken Dreyer [Thu, 10 May 2018 23:08:05 +0000 (17:08 -0600)]
Makefile: add "make tag" command

Add a new "make tag" command. This automates some common operations:

1) Automatically determine the next Git tag version number to create.
   For example:
   "3.2.0beta1 -> "3.2.0beta2"
   "3.2.0rc1 -> "3.2.0rc2"
   "3.2.0" -> "3.2.1"

2) Create the Git tag, and print instructions for the user to push it to
   GitHub.

3) Sanity check that HEAD is a stable-* branch or master (bail on
   everything else).

4) Sanity check that HEAD is not already tagged.

Note, we will still need to tag manually once each time we change the
format, for example when moving from tagging "betas" to tagging "rcs",
or "rcs" to "stable point releases".

Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fcea56849578bd47e65b130ab6884e0b96f9d89d)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago rgw: container add option to configure multi-site zone
Sébastien Han [Mon, 16 Apr 2018 13:57:23 +0000 (15:57 +0200)]
rgw: container add option to configure multi-site zone

You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host from your
inventory and assign them a value. Once the rgw container starts, it'll
pick up the info and add itself to the right zone.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1c084efb3cb7e48d96c9cbd6bd05ca4f93526853)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago tests: resize root partition when atomic host
Guillaume Abrioux [Wed, 30 May 2018 07:17:09 +0000 (09:17 +0200)]
tests: resize root partition when atomic host

For a while now we can see failures in the CI for containerized
scenarios because VMs are running out of space at some point.

The default in the images used is to have only 3 GB for the root
partition, which doesn't sound like a lot.

Typical error seen:

```
STDERR:

failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```

Indeed, on the machine we can see:
```
Every 2.0s: df -h                                          Tue May 29 17:21:13 2018
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  3.0G  3.0G   14M 100% /
```

The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.
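
In task form, that expansion could look like the following sketch (device
path as in the outputs below):

```
- name: expand the root LV with the remaining free space (sketch)
  command: lvresize -l +100%FREE /dev/atomicos/root

- name: grow the root XFS filesystem (sketch)
  command: xfs_growfs /
```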

```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
  Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
  Logical volume atomicos/root successfully resized.
```

```
-bash-4.2# xfs_growfs /
meta-data=/dev/mapper/atomicos-root isize=512    agcount=4, agsize=192000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=768000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 768000 to 2543616
```

```
-bash-4.2# df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  9.7G  1.4G  8.4G  14% /
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 34f70428521ab30414ce8806c7e2967a7387ff00)

7 years ago tests: avoid yum failures
Guillaume Abrioux [Mon, 28 May 2018 10:02:49 +0000 (12:02 +0200)]
tests: avoid yum failures

In the CI we can often see failures like the following:

`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`

It seems the fastest mirror detection is sometimes counterproductive and
leads yum to fail.
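
A sketch of how the plugin can be switched off (the path is the standard
yum plugin config file):

```
- name: disable fastest mirror detection (sketch)
  ini_file:
    dest: /etc/yum/pluginconf.d/fastestmirror.conf
    section: main
    option: enabled
    value: "0"
```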

This fix has been added in `setup.yml`.
Until now this playbook was only played just before `testinfra`, but it
can also be used before running ceph-ansible so we can add some
provisioning tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Erwan Velu <evelu@redhat.com>
(cherry picked from commit 98cb6ed8f602d9c54b63c5381a17dbca75df6bc2)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago Add privilege escalation to iscsi purge tasks v3.0.35
Paul Cuzner [Fri, 25 May 2018 00:13:20 +0000 (12:13 +1200)]
Add privilege escalation to iscsi purge tasks

Without the escalation, invocations from non-root
users will fail when accessing the rados config
object, or when attempting to log to /var/log.
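
The shape of the fix, as a sketch (service names are illustrative):

```
- name: stop the iscsi gateway daemons (sketch)
  service:
    name: "{{ item }}"
    state: stopped
  become: true
  with_items:
    - rbd-target-api
    - rbd-target-gw
    - tcmu-runner
```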

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit 2890b57cfc2e1ef9897a791ce60f4a5545011907)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ceph-radosgw: disable NSS PKI db when SSL is disabled
Luigi Toscano [Tue, 22 May 2018 09:46:33 +0000 (11:46 +0200)]
ceph-radosgw: disable NSS PKI db when SSL is disabled

The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.

It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.

Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a122b7ce8ec6a5c27a70bc19589d177.
Now profiles drive the setting of the rgw keystone * options.

Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
(cherry picked from commit 43e96c1f98312734e2f12a1ea5ef29981e9072bd)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Fix restarting OSDs twice during a rolling update.
Subhachandra Chandra [Fri, 16 Mar 2018 17:10:14 +0000 (10:10 -0700)]
Fix restarting OSDs twice during a rolling update.

During a rolling update, OSDs are currently restarted twice: once by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
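
A sketch of the kind of guard added to the handler (variable names assumed):

```
- name: restart ceph osds (handler, sketch)
  command: /usr/bin/env bash /tmp/restart_osd_daemon.sh
  when: not rolling_update | default(false) | bool
```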

(cherry picked from commit c7e269fcf5620a49909b880f57f5cbb988c27b07)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago defaults: restart_osd_daemon unit spaces
Sébastien Han [Fri, 18 May 2018 12:43:57 +0000 (14:43 +0200)]
defaults: restart_osd_daemon unit spaces

An extra space in the `systemctl list-units` output can cause
restart_osd_daemon.sh to fail.

It looks like when more services are enabled on the node, the space
between "loaded" and "active" gets wider than the single space
expected by the command.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2f43e9dab5f077276162069f449978ea97c2e9c0)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago purge_cluster: fix dmcrypt purge
Guillaume Abrioux [Fri, 18 May 2018 15:56:03 +0000 (17:56 +0200)]
purge_cluster: fix dmcrypt purge

dmcrypt devices aren't closed properly; therefore, redeploying after a
purge may fail.

Typical errors:

```
ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command
'/sbin/blkid' returned non-zero exit status 2
```

```
ceph-disk: Error: unable to read dm-crypt key:
/var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf:
/etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key
```

Properly closing dmcrypt devices allows redeploying without error.
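
A sketch of the closing step (the list of dm-crypt mappings is an assumed
variable):

```
- name: close dmcrypt devices (sketch)
  command: cryptsetup luksClose "{{ item }}"
  with_items: "{{ encrypted_osd_partuuids | default([]) }}"
  failed_when: false
```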

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9801bde4d4ce501208fc297d5cb0ab2e0aa28702)

7 years ago purge_cluster: wipe all partitions
Guillaume Abrioux [Wed, 16 May 2018 15:34:38 +0000 (17:34 +0200)]
purge_cluster: wipe all partitions

In order to ensure there is no leftover after having purged a cluster,
we must wipe all partitions properly.
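
A sketch of such a wipe task (the partition list variable is assumed):

```
- name: wipe all ceph partitions (sketch)
  shell: |
    wipefs --all "{{ item }}"
    dd if=/dev/zero of="{{ item }}" bs=1M count=10
  with_items: "{{ ceph_partitions | default([]) }}"
```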

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a9247c4de78dec8a63f17400deb8b06ce91e7267)

7 years ago purge_cluster: fix bug when building device list
Guillaume Abrioux [Wed, 16 May 2018 14:04:25 +0000 (16:04 +0200)]
purge_cluster: fix bug when building device list

There are leftovers on devices when purging osds because of an invalid
device list construction.

typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
    "changed": true,
    "cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
    "delta": "0:00:00.015188",
    "end": "2018-05-16 12:41:40.408597",
    "item": "/dev/sda sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:40.393409"
}

STDOUT:

sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600

STDERR:

Error: Could not stat device /dev/sda sda1 - No such file or directory.
```

The devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output.

eg:

```
changed: [osd3] => (item=/dev/sda1) => {
    "changed": true,
    "cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
    "delta": "0:00:00.013634",
    "end": "2018-05-16 12:41:09.068166",
    "item": "/dev/sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:09.054532"
}

STDOUT:

/dev/sda sda1
```

For instance, it will result in a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
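
A sketch of the corrected resolution, using `--nodeps` so only the parent
device name is printed (register and list names assumed):

```
- name: resolve parent device (sketch)
  command: lsblk --nodeps -no pkname "{{ item }}"
  register: resolved_parent_device
  with_items: "{{ devices_partitions | default([]) }}"
```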

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9cad113e2f22132d08208cd58462f11056c41305)

7 years ago switch: fix ceph_uid fact for osd
Guillaume Abrioux [Wed, 25 Apr 2018 12:20:35 +0000 (14:20 +0200)]
switch: fix ceph_uid fact for osd

In addition to b324c17, this commit fixes the ceph uid for the osd role in
the switch from non-containerized to containerized playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit adeecc51f8adf7834b936b7cf6a1be7e6bb82d27)

7 years ago switch: fix ceph_uid fact
Sébastien Han [Thu, 19 Apr 2018 08:28:56 +0000 (10:28 +0200)]
switch: fix ceph_uid fact

The `latest` tag now points to CentOS, not Ubuntu anymore, so the condition was wrong.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 767abb5de02c0ecdf81a18f6ca63f2e978d3d7a4)

7 years ago switch: disable ceph-disk units
Sébastien Han [Wed, 16 May 2018 15:37:10 +0000 (17:37 +0200)]
switch: disable ceph-disk units

During the transition from jewel non-container to container, old ceph
units are disabled. ceph-disk can still remain in some cases and will
appear as 'loaded failed'; this is not a problem, although operators
might not like to see these units failing. That's why we remove them if
we find them.
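
A sketch of the cleanup (the unit list variable is assumed):

```
- name: disable and mask leftover ceph-disk units (sketch)
  systemd:
    name: "{{ item }}"
    state: stopped
    enabled: no
    masked: yes
  with_items: "{{ ceph_disk_units | default([]) }}"
  failed_when: false
```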

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 49a47124859e6577fb99e6dd680c5244ccd6f38f)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago take-over: fix bug when trying to override variable
Guillaume Abrioux [Thu, 17 May 2018 15:29:20 +0000 (17:29 +0200)]
take-over: fix bug when trying to override variable

A customer has been facing an issue when trying to override
`monitor_interface` in the inventory host file.
In their use case, all nodes had the same `monitor_interface` name
except one. Therefore, they tried to override this variable for that
node in the inventory host file, but the take-over-existing-cluster
playbook was failing when trying to generate the new ceph.conf file
because of an undefined variable.

Typical error:

```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```

Including variables like this, `include_vars: group_vars/all.yml`, prevents
us from overriding anything in the inventory host file because it
overwrites everything you would have defined in the inventory.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 415dc0a29b10b28cbd047fe28eb4dd38419ea5dc)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago rolling_update: move osd flag section
Sébastien Han [Wed, 16 May 2018 14:02:41 +0000 (16:02 +0200)]
rolling_update: move osd flag section

During a minor update from a jewel to a higher jewel version (10.2.9 to
10.2.10 for example) osd flags don't get applied because they were set
in the mgr section, which is skipped in jewel since this daemon does
not exist.
Moving the set flag section after all the mons have been updated solves
that problem.
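
A sketch of the flag-setting section once all the mons are updated (the
delegation target is assumed to be the first mon):

```
- name: set osd flags (sketch)
  command: ceph --cluster {{ cluster }} osd set {{ item }}
  with_items:
    - noout
    - noscrub
    - nodeep-scrub
  delegate_to: "{{ groups[mon_group_name][0] }}"
```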

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1548071
Co-authored-by: Tomas Petr <tpetr@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit d80a871a078a175d0775e91df00baf625dc39725)

7 years ago iscsi: add python-rtslib repository
Sébastien Han [Mon, 14 May 2018 07:21:48 +0000 (09:21 +0200)]
iscsi: add python-rtslib repository

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 8c7c11b774f54078b32b652481145699dbbd79ff)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago iscsi-gw: fix issue when trying to mask target
Guillaume Abrioux [Mon, 14 May 2018 15:39:25 +0000 (17:39 +0200)]
iscsi-gw: fix issue when trying to mask target

Trying to mask target when `/etc/systemd/system/target.service` doesn't
exist seems to be a bug.
There is no need to mask a unit file which doesn't exist.
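
A sketch of the guarded mask:

```
- name: check for a target.service unit file (sketch)
  stat:
    path: /etc/systemd/system/target.service
  register: target_unit

- name: mask the target service only when the unit file exists (sketch)
  systemd:
    name: target
    masked: yes
  when: target_unit.stat.exists
```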

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a145caf947aec64467150a007b7aafe57abe2891)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago FIX: run restart scripts in `noexec` /tmp v3.0.34
Arano-kai [Mon, 6 Nov 2017 14:02:47 +0000 (16:02 +0200)]
FIX: run restart scripts in `noexec` /tmp

- One cannot run scripts directly in place on a filesystem mounted with the
`noexec` option. But one can run scripts as arguments to `bash`/`sh`.
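
So instead of executing the script directly, the task can run it through
bash, for example:

```
- name: run the restart script even when /tmp is mounted noexec (sketch)
  command: /usr/bin/env bash /tmp/restart_osd_daemon.sh
```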

Signed-off-by: Arano-kai <captcha.is.evil@gmail.com>
(cherry picked from commit 5cde3175aede783feb89cbbc4ebb5c2f05649b99)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago osd: clean legacy syntax in ceph-osd-run.sh.j2
Guillaume Abrioux [Wed, 9 May 2018 01:10:30 +0000 (03:10 +0200)]
osd: clean legacy syntax in ceph-osd-run.sh.j2

Quick cleanup of legacy syntax introduced by e0a264c7e

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7b387b506a21fd71eedd7aabab9f114353b63abc)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago adds missing state needed to upgrade nfs-ganesha v3.0.33
Gregory Meno [Wed, 9 May 2018 18:17:26 +0000 (11:17 -0700)]
adds missing state needed to upgrade nfs-ganesha

In the tasks for os_family Red Hat we were missing this state.

fixes: bz1575859
Signed-off-by: Gregory Meno <gmeno@redhat.com>
(cherry picked from commit 26f6a650425517216fb57c08e1a8bda39ddcf2b5)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago common: make the delegate_facts feature optional
Guillaume Abrioux [Tue, 31 Oct 2017 13:39:29 +0000 (14:39 +0100)]
common: make the delegate_facts feature optional

Since we encountered issues with this on Ansible 2.2, this commit provides
the ability to enable or disable it depending on which Ansible version we
are running.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4596fbaac1322a4c670026bc018e3b5b061b072b)

7 years ago playbook: improve facts gathering
Guillaume Abrioux [Thu, 3 May 2018 16:41:16 +0000 (18:41 +0200)]
playbook: improve facts gathering

There is no need to gather facts in an O(N^2) way.
Only one node should gather facts from the other nodes.
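
A sketch of the delegated pattern: a single node runs `setup` against every
host and stores the facts on its behalf, instead of every node gathering
facts for every other node:

```
- name: gather and delegate facts (sketch)
  setup:
  delegate_to: "{{ item }}"
  delegate_facts: true
  with_items: "{{ groups['all'] }}"
  run_once: true
```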

Fixes: #2553
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 75733daf23d56008b246d8c05c5069303edd4197)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Make sure the restart_mds_daemon script is created with the correct MDS name
Simone Caronni [Thu, 5 Apr 2018 14:14:23 +0000 (16:14 +0200)]
Make sure the restart_mds_daemon script is created with the correct MDS name

(cherry picked from commit b12bf62c36955d1e502552f8fddb03f44d7d6fc7)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago common: enable Tools repo for rhcs clients
Sébastien Han [Tue, 8 May 2018 14:11:14 +0000 (07:11 -0700)]
common: enable Tools repo for rhcs clients

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574458
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 07ca91b5cb7e213545687b8a62c421ebf8dd741d)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago common: copy iso files if rolling_update
Sébastien Han [Thu, 3 May 2018 14:54:53 +0000 (16:54 +0200)]
common: copy iso files if rolling_update

If we are in the middle of an update, we want the new package
version to be installed, so the task that copies the repo files should
not be skipped.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572032
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4a186237e6fdc98f779c2e25985da4325b3b16cd)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoRevert "add .vscode/ to gitignore" v3.0.32
Sébastien Han [Fri, 27 Apr 2018 11:21:16 +0000 (13:21 +0200)]
Revert "add .vscode/ to gitignore"

This reverts commit ce67b05292e224d640738bf506ce873680ff9b97.

7 years ago add .vscode/ to gitignore
Sébastien Han [Wed, 4 Apr 2018 14:23:54 +0000 (16:23 +0200)]
add .vscode/ to gitignore

I personally develop in vscode and I have some preferences to save when it
comes to running the python unit tests. So ignoring this directory is
actually useful.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3c4319ca4b5355d69b2925e916420f86d29ee524)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago shrink-osd: ability to shrink NVMe drives
Sébastien Han [Fri, 20 Apr 2018 09:13:51 +0000 (11:13 +0200)]
shrink-osd: ability to shrink NVMe drives

Now if the service name contains nvme we know we need to remove the last
2 characters instead of 1.

If nvme then osd_to_kill_disks is nvme0n1, we need nvme0
If ssd or hdd then osd_to_kill_disks is sda1, we need sda

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1561456
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 66c1ea8cd561fce6cfe5cdd1ecaa13411c824e3a)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago tox: use container latest tag for upgrades
Sébastien Han [Thu, 5 Apr 2018 08:28:51 +0000 (10:28 +0200)]
tox: use container latest tag for upgrades

Currently tag-build-master-luminous-ubuntu-16.04 is not used anymore.
Also, 'latest' now points to CentOS so we need to make that switch here
too.

We now have latest tags for each stable release so let's use them and
point tox at them to deploy the right version.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 14eff6b571eb760e8afcdfefc063f1af06342809)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ceph-defaults: fix ceph_uid fact on container deployments
Randy J. Martinez [Thu, 29 Mar 2018 04:17:02 +0000 (23:17 -0500)]
ceph-defaults: fix ceph_uid fact on container deployments

Red Hat is now using the tags [3,latest] for the image rhceph/rhceph-3-rhel7.
Because of this, the ceph_uid conditional passes for Debian
when 'ceph_docker_image_tag: latest' is set on RH deployments.
I've added an additional task to check for the rhceph image specifically,
and also updated the RH family task for the ceph/daemon [centos|fedora] tags.
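
A sketch of the extra check (uid 167 is the ceph uid in Red Hat based
images, 64045 in Ubuntu based ones; the exact condition is an assumption):

```
- name: set ceph_uid for red hat based container images (sketch)
  set_fact:
    ceph_uid: 167
  when: "'rhceph' in ceph_docker_image or 'centos' in ceph_docker_image_tag"
```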

Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
(cherry picked from commit 127a643fd0ce4d66a5243b789ab0905e54e9d960)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago rhcs: re-add apt-pinning
Sébastien Han [Tue, 17 Apr 2018 13:59:52 +0000 (15:59 +0200)]
rhcs: re-add apt-pinning

When installing rhcs on Debian systems the red hat repos must have the
highest priority so we avoid package conflicts and install the rhcs
version.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1565850
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit a98885a71ec63ff129d7001301a0323bfaadad8a)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago defaults: check only 1 time if there is a running cluster
Guillaume Abrioux [Mon, 9 Apr 2018 16:07:31 +0000 (18:07 +0200)]
defaults: check only 1 time if there is a running cluster

There is no need to check for a running cluster n*nodes times in
`ceph-defaults`, so let's add a `run_once: true` to save some resources
and time.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 899b0eb4514a9b1e6929dd5abf415195085c4e1d)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago setup cephx keys when not nfs_obj_gw
Patrick Donnelly [Sat, 10 Mar 2018 19:27:10 +0000 (11:27 -0800)]
setup cephx keys when not nfs_obj_gw

Copy the admin key when nfs_file_gw is configured (but not nfs_obj_gw). Also,
copy/set up RGW-related directories only when configured as nfs_obj_gw.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 7f91547304349199bf10a636b4e10ccaf20a4212)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago common: add tools repo for iscsi gw
Sébastien Han [Thu, 12 Apr 2018 10:15:35 +0000 (12:15 +0200)]
common: add tools repo for iscsi gw

To install iscsi gw packages we need to enable the tools repo.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1547849
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 37117071ebb7ab3cf68b607b6760077a2b46a00d)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago nfs: ensure nfs-server service is stopped v3.0.31
Ali Maredia [Mon, 2 Apr 2018 17:47:31 +0000 (13:47 -0400)]
nfs: ensure nfs-server service is stopped

NFS-ganesha cannot start if the nfs-server service
is running. This commit stops nfs-server, in case it
is running, on (debian, redhat, suse) nodes before
the nfs-ganesha service starts up.
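
A sketch of the stop task (the kernel NFS server unit is nfs-kernel-server
on Debian and nfs-server elsewhere):

```
- name: ensure the kernel NFS server is stopped (sketch)
  service:
    name: "{{ 'nfs-kernel-server' if ansible_os_family == 'Debian' else 'nfs-server' }}"
    state: stopped
  failed_when: false
```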

fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1508506

Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 01c58695fc344d65876b3acaea4f915f896401ac)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago mds: to support copy_admin_keyring
vasishta p shastry [Tue, 10 Apr 2018 12:39:43 +0000 (18:09 +0530)]
mds: to support copy_admin_keyring

(cherry picked from commit db3a5ce6d917e399236163f7de097f1b40a9a26c)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Fixed a typo (extra space)
vasishta p shastry [Tue, 10 Apr 2018 13:37:35 +0000 (19:07 +0530)]
Fixed a typo (extra space)

(cherry picked from commit 020e66c1b4374956a4bd8882d729eed65a3e3f90)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago osd: to support copy_admin_key
vasishta p shastry [Tue, 10 Apr 2018 13:21:50 +0000 (18:51 +0530)]
osd: to support copy_admin_key

(cherry picked from commit e1a1f81b6fdab41ac051cbf5f29eb101df3b50da)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago nfs: to support copy_admin_key - containerized
vasishta p shastry [Tue, 10 Apr 2018 12:37:11 +0000 (18:07 +0530)]
nfs: to support copy_admin_key - containerized

(cherry picked from commit 6b59416f7596d7c62c46b8a607f9a1eb9988689e)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago defaults: fix backward compatibility
Guillaume Abrioux [Mon, 9 Apr 2018 11:02:44 +0000 (13:02 +0200)]
defaults: fix backward compatibility

Backward compatibility with `ceph_mon_docker_interface` and
`ceph_mon_docker_subnet` was not working since there was no lookup on
`monitor_interface` and `public_network`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 66c4118dcd0c8e7a7081bce5c8d6ba7752b959fd)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago common: upgrade/install ceph-test RPM first v3.0.30
Ken Dreyer [Thu, 5 Apr 2018 19:40:15 +0000 (13:40 -0600)]
common: upgrade/install ceph-test RPM first

Prior to this change, if a user had ceph-test-12.2.1 installed, and
upgraded to ceph v12.2.3 or newer, the RPM upgrade process would
fail.

The problem is that the ceph-test RPM did not depend on an exact version
of ceph-common until v12.2.3.

In Ceph v12.2.3, ceph-{osdomap,kvstore,monstore}-tool binaries moved
from ceph-test into ceph-base. When ceph-test is not yet up-to-date, Yum
encounters package conflicts between the older ceph-test and newer
ceph-base.

When all users have upgraded beyond Ceph < 12.2.3, this is no longer
relevant.

(cherry picked from commit 3752cc6f38dbf476845e975e6448225c0e103ad6)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Deploying without managed monitors failed v3.0.29
Attila Fazekas [Wed, 4 Apr 2018 13:30:55 +0000 (15:30 +0200)]
Deploying without managed monitors failed

TripleO deployment failed when the monitors are not managed
by TripleO itself, with:
    FAILED! => {"msg": "list object has no element 0"}

The failing play item was introduced by
f46217b69ae18317cb0c1cc3e391a0bca5767eb6.

fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1552327

Signed-off-by: Attila Fazekas <afazekas@redhat.com>
(cherry picked from commit ecd3563c2128553d4145a2f9c940ff31458c33b4)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ceph-iscsi: fix certificates generation and distribution
Sébastien Han [Tue, 3 Apr 2018 13:20:06 +0000 (15:20 +0200)]
ceph-iscsi: fix certificates generation and distribution

Prior to this patch, the certificates were being generated on a single
node only (because of the run_once: true). Thus certificates were not
distributed to all the gateway nodes.

This would require a second ansible run to work. This patch fixes the
creation and the keys' distribution to all the nodes.
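
A sketch of the generate-once-then-distribute pattern (file names and
variables are assumptions):

```
- name: generate the certificates on the first gateway only (sketch)
  command: >
    openssl req -newkey rsa:2048 -nodes -x509 -days 365
    -keyout /etc/ceph/iscsi-gateway.key -out /etc/ceph/iscsi-gateway.crt
    -subj "/CN=iscsi-gws"
  run_once: true

- name: fetch the generated files to the controller (sketch)
  fetch:
    src: "/etc/ceph/{{ item }}"
    dest: "{{ fetch_directory }}/{{ item }}"
    flat: yes
  run_once: true
  with_items: ['iscsi-gateway.key', 'iscsi-gateway.crt']

- name: distribute them to every gateway node (sketch)
  copy:
    src: "{{ fetch_directory }}/{{ item }}"
    dest: "/etc/ceph/{{ item }}"
  with_items: ['iscsi-gateway.key', 'iscsi-gateway.crt']
```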

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540845
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit f3caee84605e17f1fdfa4add634f0bf2c2cd510e)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago do not delegate facts on client nodes
Guillaume Abrioux [Wed, 21 Mar 2018 18:01:51 +0000 (19:01 +0100)]
do not delegate facts on client nodes

This commit is a workaround for
https://bugzilla.redhat.com/show_bug.cgi?id=1550977

We iterate over all nodes on each node and we delegate the facts gathering.
This consumes a lot of memory when there is a large number of nodes in
the inventory.
That way of gathering is not necessary for client nodes, so we can simply
gather local facts for these nodes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5b73be254d249a23ac2eb2f86c4412ef296352a9)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ceph-mds: delete duplicate tasks which cause multimds container deployments to fail.
Randy J. Martinez [Thu, 29 Mar 2018 00:15:19 +0000 (19:15 -0500)]
ceph-mds: delete duplicate tasks which cause multimds container deployments to fail.

This update will resolve the error ['cephfs' is undefined] in multimds container deployments.
See: roles/ceph-mon/tasks/create_mds_filesystems.yml. The same last two tasks are present there, and actually need to happen in that role since "{{ cephfs }}" gets defined in
roles/ceph-mon/defaults/main.yml, and not in roles/ceph-mds/defaults/main.yml.

Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
(cherry picked from commit ca572a11f1eb7ded5583c8d8b810a42db61cd98f)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago cleanup osd.conf.j2 in ceph-osd
Ning Yao [Fri, 23 Mar 2018 15:48:16 +0000 (23:48 +0800)]
cleanup osd.conf.j2 in ceph-osd

The osd crush location is set by ceph_crush in the library;
osd.conf.j2 is not used anymore.

Signed-off-by: Ning Yao <yaoning@unitedstack.com>
(cherry picked from commit 691ddf534989b4d27dc41997630b3307436835ea)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ceph-osd note that some scenarios use ceph-disk vs. ceph-volume
Alfredo Deza [Wed, 28 Mar 2018 20:40:04 +0000 (16:40 -0400)]
ceph-osd note that some scenarios use ceph-disk vs. ceph-volume

Signed-off-by: Alfredo Deza <adeza@redhat.com>
(cherry picked from commit 3fcf966803e35d7ba30e7c1b0ba78db94c664594)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ceph-defaults: set is_atomic variable
Andrew Schoen [Tue, 20 Mar 2018 19:13:28 +0000 (14:13 -0500)]
ceph-defaults: set is_atomic variable

This variable is needed for containerized clusters and is required for
the ceph-docker-common role. Typically the is_atomic variable is set in
site-docker.yml.sample, though, so if ceph-docker-common is used outside
of that playbook it needs to be set in another way. Moving the creation
of the variable inside this role means playbooks don't need to worry
about setting it.
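
The check itself can be as simple as looking for the ostree marker file; a
sketch:

```
- name: check whether the host is an atomic host (sketch)
  stat:
    path: /run/ostree-booted
  register: stat_ostree

- name: set is_atomic fact (sketch)
  set_fact:
    is_atomic: "{{ stat_ostree.stat.exists }}"
```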

fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1558252

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 6cffbd5409353fc1ce05b3a4a6246d6ef244e731)

7 years ago Fix config_template to consistently order sections
Andy McCrae [Fri, 16 Mar 2018 15:24:53 +0000 (15:24 +0000)]
Fix config_template to consistently order sections

In ec042219e64a321fa67fce0384af76eeb238c645 we added OrderedDict and
sorted to be able to preserve order for config_template k,v pairs inside
a section.

This patch adds a similar ordering for the sections themselves, which
could still change order and initiate handler restarts.

OrderedDict isn't needed because we use .items() to return a list that
can then be sorted().

(cherry picked from commit fe4ba9d1353abb49775d5541060a55919978f45f)

7 years ago common: run updatedb task on debian systems only v3.0.28
Sébastien Han [Thu, 1 Mar 2018 16:33:33 +0000 (17:33 +0100)]
common: run updatedb task on debian systems only

The command doesn't exist on Red Hat systems so it's better to skip it
instead of ignoring the error.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit cb0f598965d0619dd4f44a8f991af539b67c6f38)

7 years ago rgw: add cluster name option to the handler
Sébastien Han [Thu, 1 Mar 2018 15:50:06 +0000 (16:50 +0100)]
rgw: add cluster name option to the handler

If the cluster name is different from 'ceph', the command will fail, so
we need to pass the cluster name.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 7f19df81964c669f649d9f6eb5104022b421eea3)

7 years ago ci: add copy_admin_key test to container scenario
Sébastien Han [Thu, 1 Mar 2018 15:47:37 +0000 (16:47 +0100)]
ci: add copy_admin_key test to container scenario

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fd94840a6ef130c6e142e9b5c5138bb11c621d37)

7 years ago rgw: ability to copy ceph admin key on containerized
Sébastien Han [Thu, 1 Mar 2018 15:47:22 +0000 (16:47 +0100)]
rgw: ability to copy ceph admin key on containerized

If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied to the node.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9c85280602142fa1fb60c6f15c6d0c9e8c62d401)

7 years ago rgw: run the handler on a mon host
Sébastien Han [Thu, 1 Mar 2018 15:46:01 +0000 (16:46 +0100)]
rgw: run the handler on a mon host

In case the admin key wasn't copied over to the node, this command would
fail. So it's safer to run it from a monitor directly.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 67f46d8ec362b7b8aacb91e009e528b5e62d48ac)

7 years ago tests: make CI jobs use 'ansible.cfg'
Guillaume Abrioux [Mon, 26 Feb 2018 13:35:36 +0000 (14:35 +0100)]
tests: make CI jobs use 'ansible.cfg'

The jobs launched by the CI are not using 'ansible.cfg'.
It contains some parameters that should avoid the SSH failures we are
used to seeing in the CI so far.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1e283bf69be8b9efbc1a7a873d91212ad57c7351)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago client: use `ceph_uid` fact to set uid/gid on admin key v3.0.27
Guillaume Abrioux [Fri, 16 Feb 2018 08:04:23 +0000 (09:04 +0100)]
client: use `ceph_uid` fact to set uid/gid on admin key

That task is failing on containerized deployments because `ceph:ceph`
doesn't exist.
The idea here is to use `{{ ceph_uid }}` to set the ownership of
the admin keyring when containerized_deployment is set.
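
A sketch of the ownership task:

```
- name: set admin keyring ownership on containerized deployments (sketch)
  file:
    path: "/etc/ceph/{{ cluster }}.client.admin.keyring"
    owner: "{{ ceph_uid }}"
    group: "{{ ceph_uid }}"
  when: containerized_deployment | bool
```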

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d35bc9bde6502ffa81f3c77679cf3f418cd62ca)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago mds: fix ansible_service_mgr typo
Grant Slater [Sun, 25 Feb 2018 01:44:07 +0000 (01:44 +0000)]
mds: fix ansible_service_mgr typo

This commit fixes a typo introduced by 4671b9e74e657988137f6723ef12e38c66d9cd40

(cherry picked from commit 1e1b26ca4d6f4ede84756003b9ffad851530e956)

7 years ago Make rule_name optional when defining items in openstack_pools
Giulio Fidente [Thu, 22 Feb 2018 18:57:47 +0000 (19:57 +0100)]
Make rule_name optional when defining items in openstack_pools

Previously it was necessary to provide a value (possibly an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to an
empty string when not given.
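
A sketch of how the default can be applied at the point of use:

```
- name: create openstack pools with an optional rule name (sketch)
  command: >
    ceph --cluster {{ cluster }} osd pool create {{ item.name }}
    {{ item.pg_num }} {{ item.rule_name | default('') }}
  with_items: "{{ openstack_pools | unique }}"
```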

(cherry picked from commit a83e1aeea39b9c7ae2757b166f3def7d4f67f161)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago tests: change ceph_docker_image_tag for 2nd run
Guillaume Abrioux [Fri, 23 Feb 2018 10:23:13 +0000 (11:23 +0100)]
tests: change ceph_docker_image_tag for 2nd run

The ceph-ansible upstream CI runs several tests, including an
'idempotency/handlers' test. It means the playbook is run a first time
and then a second time with another container image version to ensure the
handlers run properly and the containers are well restarted.
This can cause issues.
For instance, in the specific case which drove me to submit this commit,
I hit the case where the `latest` image ships ceph 12.2.3 while the
`stable-3.0` image (used for the second run) ships ceph 12.2.2.

The goal of this test is not to verify we can upgrade from a specific
version to another, but to ensure handlers are working, even if it's a
valid failure here.
It should be caught by a test dedicated to that use case.

For the upstream CI we just need a container image which has a different
id: the same content in the container image but a different image id in
the registry, since the test relies on the image id to decide whether
the container should be restarted.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a8986459f2ed7e077390162d0df431a3321a478)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago ci: add tripleo scenario testing
Guillaume Abrioux [Fri, 16 Feb 2018 12:53:52 +0000 (13:53 +0100)]
ci: add tripleo scenario testing

This should help catch any failure in a tripleo deployment scenario earlier.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 707458c979f17632d97c205e29524cadc9dec5b3)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Restart services if handler called
Andy McCrae [Wed, 20 Dec 2017 03:49:16 +0000 (13:49 +1000)]
Restart services if handler called

This patch fixes an issue where, if hosts have different service lists,
restarts will be missed for services that run later on.

For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
configuration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first host's.

To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.

Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.

Includes a minor fix to remove the unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will already be in
play_hosts.
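
A sketch of the resulting handler pattern (flag and script names are
assumptions):

```
- name: remember that this host called the handler (sketch)
  set_fact:
    rgw_handler_called: true

- name: restart rgw on every host that called the handler (sketch)
  command: /usr/bin/env bash /tmp/restart_rgw_daemon.sh
  run_once: true
  delegate_to: "{{ item }}"
  with_items: "{{ groups[rgw_group_name] }}"
  when: hostvars[item]['rgw_handler_called'] | default(false) | bool
```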

(cherry picked from commit 59a4335a5639c9be12ee8a23805aaa14882b077e)
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1548357
7 years ago Adjust /etc/updatedb.conf to not parse /var/lib/ceph
Andy McCrae [Mon, 19 Feb 2018 18:13:21 +0000 (18:13 +0000)]
Adjust /etc/updatedb.conf to not parse /var/lib/ceph

Using `updatedb -e` doesn't make a permanent change, but runs updatedb
once without the passed path.

To make this change more permanent we should update the
/etc/updatedb.conf file to exclude /var/lib/ceph.
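
A sketch of the permanent change, appending the path to the PRUNEPATHS
line:

```
- name: exclude /var/lib/ceph from updatedb scans (sketch)
  replace:
    dest: /etc/updatedb.conf
    regexp: '^(PRUNEPATHS\s*=\s*")(?!.*/var/lib/ceph)(.*)"$'
    replace: '\1/var/lib/ceph \2"'
```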

(cherry picked from commit 2779d2a850265d01b62b9d8b4db7c2b4ce8b8fec)

7 years ago update: look for short and fqdn in ceph_health_raw v3.0.26
Guillaume Abrioux [Fri, 16 Feb 2018 12:45:26 +0000 (13:45 +0100)]
update: look for short and fqdn in ceph_health_raw

Depending on the hostname configuration, the task waiting for mons to be
in quorum might fail.
The idea here is to look for both the shortname and the fqdn in
`ceph_health_raw` instead of just `ansible_hostname`.
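
A sketch of the widened check (the quorum_names field comes from
`ceph -s --format json`; retries/delay values are illustrative):

```
- name: wait for the monitor to join the quorum (sketch)
  command: ceph --cluster {{ cluster }} -s --format json
  register: ceph_health_raw
  until: >
    ansible_hostname in (ceph_health_raw.stdout | from_json)["quorum_names"]
    or ansible_fqdn in (ceph_health_raw.stdout | from_json)["quorum_names"]
  retries: 5
  delay: 10
```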

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c04e67347c284c2c127f09b201e8a293c5192e1f)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago container: osd remove run_once v3.0.25
Sébastien Han [Wed, 14 Feb 2018 00:44:18 +0000 (01:44 +0100)]
container: osd remove run_once

When used along with delegation, run_once does not behave well. Thus,
using `| last` always brings the desired result.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit c816a9282c8f778f18249827397901e04c040019)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago docker-common: fix container restart on new image
Sébastien Han [Thu, 8 Feb 2018 16:35:05 +0000 (17:35 +0100)]
docker-common: fix container restart on new image

We now look for any existing containers; if there are any, we compare
their running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs; this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit d47d02a5eb20067b5ae997ab18aeebe40b27cff0)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago default: remove duplicate code
Sébastien Han [Tue, 13 Feb 2018 08:37:14 +0000 (09:37 +0100)]
default: remove duplicate code

This is already defined in ceph-defaults.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ebc195487c2a2c8764594403b388c3d4624443fe)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago test: add test for containers resources changes
Sébastien Han [Fri, 9 Feb 2018 17:15:25 +0000 (18:15 +0100)]
test: add test for containers resources changes

We change the ceph_mon_docker_memory_limit on the second run; this
should trigger a restart of services.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 7d690878df4e34f2003996697e8f623b49282578)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago test: add test for restart on new container image
Sébastien Han [Fri, 9 Feb 2018 17:11:07 +0000 (18:11 +0100)]
test: add test for restart on new container image

Since we have a task to test the handlers we can test a new container to
validate the service restart on a new container image.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 79864a8936e8c25ac66bba3cee48d7721453a6af)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Set application for OpenStack pools
Andy McCrae [Fri, 9 Feb 2018 14:12:35 +0000 (14:12 +0000)]
Set application for OpenStack pools

Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.

We should assign the OpenStack pools their default application, which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.

7 years ago infra: do not include host_vars/* in take-over-existing-cluster.yml
Andrew Schoen [Fri, 9 Feb 2018 20:02:07 +0000 (14:02 -0600)]
infra: do not include host_vars/* in take-over-existing-cluster.yml

These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 7c7017ebe66c70b1f3e06ee71466f30beb4eb2b0)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago rolling update: fix undefined jewel_minor_update failure
Andrew Schoen [Mon, 12 Feb 2018 20:52:27 +0000 (14:52 -0600)]
rolling update: fix undefined jewel_minor_update failure

Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.

The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 699c777e680655be12f53cabed626b28623f8160)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago containers: bump memory limit
Sébastien Han [Mon, 8 Jan 2018 15:41:42 +0000 (16:41 +0100)]
containers: bump memory limit

A default value of 4 GB for MDS is more appropriate, and 3 GB for OSD also.
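
As defaults, this would look like the following sketch (variable names
follow the existing *_docker_memory_limit pattern):

```
ceph_mds_docker_memory_limit: 4g
ceph_osd_docker_memory_limit: 3g
```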

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1531607
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 97f520bc7488b8e09d4057783049c8975fbc336e)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago infra: fix take-over-existing-cluster.yml playbook
Caleb Boylan [Fri, 3 Nov 2017 16:54:54 +0000 (09:54 -0700)]
infra: fix take-over-existing-cluster.yml playbook

The ansible inventory could have more than just ceph-ansible hosts, so
we shouldn't use "hosts: all". Also, only grab one file when getting
the ceph cluster name instead of failing when there is more than one
file in /etc/ceph. Also fix the location of the ceph.conf template.

(cherry picked from commit 41d10a2f6496c216eaad87112a0794e51204c578)

7 years ago osd: fix osd restart when dmcrypt
Guillaume Abrioux [Thu, 8 Feb 2018 12:27:45 +0000 (13:27 +0100)]
osd: fix osd restart when dmcrypt

This commit fixes a bug that occurs especially for dmcrypt scenarios.

There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.

If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :

```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```

It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because it won't have
filled the `$DOCKER_ENV` environment variable properly.

Adding `--net=host` to the 'disk_list' container fixes this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e537779bb3cf73c569ce6c29ab8b20169cc5ffae)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago default: define 'osd_scenario' variable
Sébastien Han [Thu, 8 Feb 2018 13:51:15 +0000 (14:51 +0100)]
default: define 'osd_scenario' variable

osd_scenario does not exist in the ceph-defaults role, so if we try to
play ceph-defaults on an OSD node, the playbook will fail with an
undefined variable.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 22f843e3d4e7fa32f8cd74eaf36772445ed20c0d)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Check for docker sockets named after both _hostname or _fqdn v3.0.24
Giulio Fidente [Fri, 2 Feb 2018 08:45:07 +0000 (09:45 +0100)]
Check for docker sockets named after both _hostname or _fqdn

While hostname -f will always return a hostname including its
domain part, and -s one without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
then not match the instance name.

(cherry picked from commit bdcc52b96dc1f9c99ce490117170f644623d4846)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago mon: Fixed crush_rule_config for containerised deployment.
Greg Charot [Fri, 2 Feb 2018 14:12:18 +0000 (15:12 +0100)]
mon: Fixed crush_rule_config for containerised deployment.

It was called too early: the container was not yet started, so the commands failed.
The section was moved after the include of docker/main.yml.

Signed-off-by: Greg Charot <gcharot@redhat.com>
(cherry picked from commit a6d1922a2e70c36036ff130dc6b6b942101379ba)

7 years ago Convert interface names to underscores for facts
Major Hayden [Mon, 11 Dec 2017 15:56:56 +0000 (09:56 -0600)]
Convert interface names to underscores for facts

If a deployer uses an interface name with a dash/hyphen in it, such
as 'br-storage' for the monitor_interface group_var, the ceph.conf.j2
template fails to find the right facts. It looks for
'ansible_br-storage' but only 'ansible_br_storage' exists.

This patch converts the interface name to underscores when the
template does the fact lookup.
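
A sketch of the converted lookup inside the ceph.conf.j2 template (`host`
stands for the loop variable iterating over the monitor hosts):

```
{% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
mon addr = {{ hostvars[host][interface]['ipv4']['address'] }}
```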

(cherry picked from commit 5676fa23b169e0ca3af7d4f9b804bbe90d1cccc6)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago purge-docker: fix ceph-osd-zap container name v3.0.23
Guillaume Abrioux [Fri, 2 Feb 2018 10:55:18 +0000 (11:55 +0100)]
purge-docker: fix ceph-osd-zap container name

The `zap ceph osd disks` task should iterate over `resolved_parent_device`,
which contains only the base device names, instead of
`combined_devices_list` (full path names).

This fixes the issue where docker complains about the container name
because of illegal characters such as `/`:
```
"/usr/bin/docker-current: Error response from daemon: Invalid container
name (ceph-osd-zap-magna074-/dev/sdb1), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed.","See '/usr/bin/docker-current run --help'."
""
```

Having the basename of the device path is enough for the container
name.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3b2f6c34e42eae4033a209d819620211dc68c34b)

7 years ago ceph-osd: respect nvme partitions when device is a disk.
Konstantin Shalygin [Tue, 28 Nov 2017 14:27:09 +0000 (21:27 +0700)]
ceph-osd: respect nvme partitions when device is a disk.

(cherry picked from commit d7dadc3e7b9d2e218d85784df72e4cd008ecb1ee)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago syntax: change local_action syntax v3.0.22
Guillaume Abrioux [Wed, 31 Jan 2018 08:23:28 +0000 (09:23 +0100)]
syntax: change local_action syntax

Use a nicer syntax for `local_action` tasks.
We used to have one-liners like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and keeps consistency across the whole
playbook.

This also fixes a potential issue with missing quotation:

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)
  File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it complains about missing quotes; this fix solves the issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit deaf273b25601991fc16712cc03820207125554f)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago common: do not use `shell` module when it is not needed
Guillaume Abrioux [Wed, 31 Jan 2018 08:31:11 +0000 (09:31 +0100)]
common: do not use `shell` module when it is not needed

There is no need here to use the `shell` module instead of `command`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit dd0c98c5a2e9e26bca60e00564ea2018984545f6)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago config: remove any spaces in public_network or cluster_network
Sébastien Han [Tue, 30 Jan 2018 13:39:58 +0000 (14:39 +0100)]
config: remove any spaces in public_network or cluster_network

With two public networks configured, we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" the install process consistently became
broken, trying to find the docker registry on the second network, and not
finding the mon container.

But without spaces,
"NETWORK_ADDR_1,NETWORK_ADDR_2", the install succeeds;
so the containerized install is more particular about the formatting of this line.
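
A defensive sketch that strips any whitespace before the value is used:

```
- name: remove spaces from public_network (sketch)
  set_fact:
    public_network: "{{ public_network | regex_replace('\\s+', '') }}"
```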

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 6f9dd26caab18c4e4e98a78bc834f2fa5c255bc7)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago purge: fix resolve parent device task
Guillaume Abrioux [Tue, 30 Jan 2018 16:27:53 +0000 (17:27 +0100)]
purge: fix resolve parent device task

This is a typo caused by a leftover.
It was previously written like this :
`shell: echo /dev/$(lsblk -no pkname "{{ item }}") }}")`
and has been rewritten to :
`shell: $(lsblk --nodeps -no pkname "{{ item }}")`
because we are appending the '/dev/' later in the next task.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f372a4232e830856399a25e55c2ce239ac086614)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Do not search osd ids if ceph-volume
Sébastien Han [Mon, 29 Jan 2018 13:28:23 +0000 (14:28 +0100)]
Do not search osd ids if ceph-volume

Description of problem: The 'get osd id' task goes through all the 10 times (and its respective timeouts) to make sure that the number of OSDs in the osd directory match the number of devices.

This happens always, regardless if the setup and deployment is correct.

Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.

How reproducible: 100%

Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster

Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
    "attempts": 10,
    "changed": false,
    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
    "delta": "0:00:00.002717",
    "end": "2018-01-21 18:10:31.237933",
    "failed": true,
    "failed_when_result": false,
    "rc": 0,
    "start": "2018-01-21 18:10:31.235216"
}

STDOUT:

0
1
2

Expected results:
There aren't any (or just a few) timeouts while the OSDs are found

Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.

Basically this line:

    until: osd_id.stdout_lines|length == devices|unique|length

Means in this 2 OSD case it is trying to ensure the following incorrect condition:

    until: 2 == 0
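
A sketch of a guard that avoids the bogus retries when `devices` is empty
(the ceph-volume case):

```
- name: get osd id (sketch)
  shell: ls /var/lib/ceph/osd/ | sed 's/.*-//'
  register: osd_id
  until: osd_id.stdout_lines | length == devices | unique | length
  retries: 10
  delay: 10
  when: devices | default([]) | length > 0
```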

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
(cherry picked from commit 5132cc3de4780fdfb4fdeab7535c3bc50151aa6b)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago Add default for radosgw_keystone_ssl
Andy McCrae [Sat, 27 Jan 2018 19:40:09 +0000 (19:40 +0000)]
Add default for radosgw_keystone_ssl

This should default to False. The default for Keystone is not to use PKI
keys; additionally, anybody using this setting had to have been manually
setting it before.
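
As a default, simply:

```
radosgw_keystone_ssl: false
```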

Fixes: #2111
(cherry picked from commit 481173f20377b09d781ee6bc2d5b26c9d8637519)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoRevert "monitor_interface: document need to use monitor_address when using IPv6"
Guillaume Abrioux [Wed, 24 Jan 2018 13:06:47 +0000 (14:06 +0100)]
Revert "monitor_interface: document need to use monitor_address when using IPv6"

This reverts commit 10b91661ceef7992354032030c7c2673a90d40f4.

This also reverts the same comment added in
1359869497a44df0c3b4157f41453b84326b58e7.

(cherry picked from commit f1232b33fd7a8da53aa2e1ad2b11ee16109633b3)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago config: add host-specific ceph_conf_overrides evaluation and generation.
Eduard Egorov [Thu, 9 Nov 2017 11:49:00 +0000 (11:49 +0000)]
config: add host-specific ceph_conf_overrides evaluation and generation.

This allows us to use host-specific variables in the ceph_conf_overrides variable. For example, this fixes the usage of such variables (e.g. 'nss db path' having {{ ansible_hostname }} inside) in ceph_conf_overrides for the rados gateway configuration (see profiles/rgw-keystone-v3) - issue #2157.
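
For example, a host-specific value inside ceph_conf_overrides might look
like this sketch (the path is illustrative):

```
ceph_conf_overrides:
  "client.rgw.{{ ansible_hostname }}":
    "nss db path": "/var/lib/ceph/radosgw/{{ ansible_hostname }}/nss"
```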

Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
(cherry picked from commit 93e9f3723bb4bcf8004bbcea3213d72d11588899)

7 years ago ceph-common: Don't check for ceph_stable_release for distro packages v3.0.21
Markos Chandras [Fri, 13 Oct 2017 09:18:27 +0000 (10:18 +0100)]
ceph-common: Don't check for ceph_stable_release for distro packages

When we consume the distribution packages, we don't have a choice of
which version to install, so we shouldn't require that variable to be
set. Distributions normally provide only one version of Ceph in the
official repositories, so we get whatever they provide.

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit dd6ee72547a4eca22c8c9b8691b910c2cfa821d3)

7 years ago upgrade: skip luminous tasks for jewel minor update v3.0.20
Guillaume Abrioux [Thu, 25 Jan 2018 15:57:45 +0000 (16:57 +0100)]
upgrade: skip luminous tasks for jewel minor update

These tasks are needed only when upgrading to luminous.
They are not needed in a Jewel minor update and, in any case, they fail
because the `ceph versions` command doesn't exist there.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1535382
(cherry picked from commit c7ec12d49ca3c3f936f4c7a34ef15c042ab0f699)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>