mds: add filter | int on condition

This seems to break the update scenario CI testing in stable-3.2.

Typical error:
```
The conditional check 'mds_max_mds > 1' failed. The error was: Unexpected templating type error occurred on ({% if mds_max_mds > 1 %} True {% else %} False {% endif %}): '>' not supported between instances of 'str' and 'int'
```
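
The same type mismatch can be reproduced in plain Python. This is a
minimal sketch, assuming `mds_max_mds` reaches the condition as the
string "2"; the helper names are illustrative and are not part of the
role. `int()` stands in for the `| int` filter here.

```python
def unfiltered_check_fails(mds_max_mds):
    """Return True if `mds_max_mds > 1` raises a TypeError."""
    try:
        mds_max_mds > 1  # str vs int comparison is rejected by Python 3
    except TypeError:
        return True
    return False


def max_mds_gt_one(mds_max_mds):
    """Mimic `mds_max_mds | int > 1` by casting before comparing."""
    return int(mds_max_mds) > 1
```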

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: pg num should be a power of two number

This patch changes the pg_num value of the rgw pools foo and bar to be
a power of two number.
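
For reference, a quick way to check a candidate pg_num is sketched
below. The helper names are hypothetical and the values used to call
them are up to the reader; this is not code from the test suite.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()
```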

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: reduce max_mds from 3 to 2

Having a max_mds value equal to the number of mds nodes generates a
warning in the ceph cluster status:

  cluster:
    id:     6d3e49a4-ab4d-4e03-a7d6-58913b8ec00a
    health: HEALTH_WARN
            insufficient standby MDS daemons available
  (...)
  services:
    mds:     cephfs:3 {0=mds1=up:active,1=mds0=up:active,2=mds2=up:active}

Let's use 2 active and 1 standby mds.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4a6d19dae296969954e5101e9bd53443fddde03d)

mds: allow multi mds when ceph is >= jewel

Otherwise, the task to set max_mds will fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add multimds coverage

This commit makes the all_daemons scenario deploy 3 mds in order to
cover the multimds case.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: update tox due to pipeline removal

This commit reflects the recent changes in ceph/ceph-build#1406

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bcaf8cedeec0f06eb8641b0038569a5cd3a3e7be)

tests: set gateway_ip_list dynamically

so we don't have to hardcode this in the tests

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d9f6b37ae62308e5204b1325ee7b53f47b96e6fd)

Vagrantfile: support more than 9 nodes per daemon type

Because of the current IP address assignment, it's not possible to
deploy more than 9 nodes per daemon type.
This commit refactors a bit and allows us to get around this limitation.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 778c51a0ff7a8c66c464f4828a0f87dd290e1c3e)

tests: add inventory host file for ubuntu job

this commit adds the missing inventory host file for ubuntu/all_daemons
based scenarios.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: Update ansible ssh_args variable

Because we're using vagrant, an ssh config file is created for
each node with options like user, host, port, identity, etc.
But via tox we override ANSIBLE_SSH_ARGS to use this file, which
removes the default value set in ansible.cfg.

Also add PreferredAuthentications=publickey because CentOS/RHEL
servers are configured with GSSAPIAuthentication enabled for the ssh
server, forcing the client to make a PTR DNS query.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34f9d51178f4cd37a7df1bb74897dff7eb5c065f)

ansible.cfg: Add library path to configuration

Ceph module path needs to be configured if we want to avoid issues
like:

no action detected in task. This often indicates a misspelled module
name, or incorrect module path

Currently the ansible-lint command in Travis CI complains about that.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1668478
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a1a871cadee5e86d181e1306c985e620b81fccac)
(cherry picked from commit c056ae7b8cf24b78382cda397e796413c56c06d5)

mgr: wait for all mgr to be available

before managing mgr modules, we must ensure all mgr are available,
otherwise we can hit a failure like the following:

```
stdout:Error ENOENT: all mgr daemons do not support module 'restful', pass --force to force enablement
```

It happens because not all mgr are available yet when trying to manage
mgr modules.
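
The waiting logic amounts to a bounded polling loop. Here is a minimal
sketch, assuming a caller-supplied `check` callable that returns the
number of mgr daemons currently available; the function name, its
parameters and the defaults are hypothetical and do not reflect the
actual task.

```python
import time


def wait_for_all(check, expected, retries=30, delay=2.0, sleep=time.sleep):
    """Poll check() until it returns `expected` or retries run out.

    Returns True on success, False if the budget is exhausted.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for _ in range(retries):
        if check() == expected:
            return True
        sleep(delay)
    return False
```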

This should have been cherry-picked from
41f7518c1ba0fea74a61e4207999ec622a052270 but there are too many changes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

nfs: add coverage on `ganesha_conf_overrides`

This commit adds `ganesha_conf_overrides` variable in CI testing.
This fixes the test `test_nfs_config_override`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: fix purge scenarios names

This commit fixes the purge_* scenario names in stable-3.1

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add missing variables in collocation scenario

add :

ceph_origin: repository
ceph_repository: community

in all.yml for collocation scenario (non container)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: fix path to inventory host file in tox-update.ini

the path had `/{env:CONTAINER_DIR:}` which is already added in
`changedir=` section. That led to a wrong path so the initial deployment
couldn't complete.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: split update in a dedicated tox.ini file

This commit splits the update scenario into a dedicated tox.ini file.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: use INVENTORY env variable in tox

let's use the `INVENTORY` variable to run against the right inventory
host file depending on which OS we are running on.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add back testinfra testing

136bfe0 removed testinfra testing on all scenarios except all_daemons

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8d106c2c58d354e10335ca017fd8df4c427e38a6)

tests: pin pytest-xdist to 1.27.0

looks like newer versions of pytest-xdist require pytest>=4.4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ba0a95211cc00b2cae14b018722f437c0091a2ef)

tox: Fix container purge jobs

On containerized CI jobs the playbook executed is purge-cluster.yml
but it should be set to purge-docker-cluster.yml

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd0869cd01090e135a9312a6890ed7611f8e3a1c)

tests: fix shrink_mon scenario

since the node names have changed recently (the 'ceph-' prefix has been
removed), we must change the name in the shrink_mon playbook command
here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: fix shrink_osd scenario

the wrong image version was used to run shrink_osd playbook.
in stable-3.1 we should use a luminous image, not nautilus which doesn't
have ceph-disk binary anymore.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: disable nfs scenario

The packages are broken, so let's remove it until this is solved.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: test idempotency only on all_daemons job

there's no need to test this on all scenarios.
testing idempotency on all_daemons should be enough and allow us to save
precious resources for the CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 136bfe096c5e97c5c983d02882919d4af2af48a6)

osd: backward compatibility with old disk_list.sh location

Since all files in the container image have moved to `/opt/ceph-container`
this check must look for the new AND the old path so it's backward
compatible. Otherwise it could end up templating an inconsistent
`ceph-osd-run.sh`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 987bdac963cee8d8aba1f10659f23bb68c2b1d1b)

iscsi-gws: remove a leftover

remove leftover introduced by 9d590f4

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d4b3c1d409fb3b837bcbf61eed7917a8b7b7681d)

iscsi: fix permission denied error

Typical error:
```
fatal: [iscsi-gw0]: FAILED! =>
msg: 'an error occurred while trying to read the file ''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'': [Errno 13] Permission denied: b''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'''
```

`become: True` is not needed on the following task:

`copy crt file(s) to gateway nodes`.

Since it's already set in the main playbook (site.yml/site-container.yml)

The thing is that the files get generated in the 'fetch_directory' with
root user because there is a 'delegate_to' + we run the playbook with
`become: True` (from main playbook).

The idea here is to create files under ansible user so we can open them
later to copy them on the remote machine.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d590f4339a4d758f07388bf97b7eabdcbca6043)

tests: rename all nodes name

remove the 'ceph-' prefix in order to have the same names in all
branches.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: use memory backend for cache fact

force ansible to generate facts for each run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a1bafdc2181b3a951991fcc9a5108edde757615)

tests: remove lvm_batch scenario

this scenario doesn't exist in stable-3.1

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: refact all stable-3.1 testing

refact the testing on stable-3.1 the same way it has been done for
stable-3.2 and master.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

use shortname in keyring path

socket.gethostname may return a FQDN. Problem found in Linode.
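
A minimal sketch of deriving the short name, assuming the short name is
everything before the first dot; the helper name is illustrative and
not taken from the module.

```python
import socket


def short_hostname(name=None):
    """Return the host name without any domain part."""
    if name is None:
        name = socket.gethostname()  # may be a FQDN on some providers
    return name.split(".")[0]
```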

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 8cd0308f5f570635d66295c442ea49dc2c043194)

ceph-common: disable unrequired NTP services

When one of the currently supported NTP services has been set up,
disable the rest of the NTP services on Ceph nodes.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1651875
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 6fa757d34358e90ae3a2f035b50d319193521ec5)

ceph-common: merge ntp_debian.yml and ntp_rpm.yml

Merge ntp_debian.yml and ntp_rpm.yml into one (the new file is called
setup_ntp.yml) since they are almost identical.

Since this is an "as is" backport of the original commit, it also
adds the feature of supporting multiple NTP daemons (namely, chronyd &
timesyncd). This is to maintain consistency across all branches
since the backport for stable-3.2 was auto-merged by mergify despite
conflicts.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b03ab607422eda0094d74223d52024a373b7ee9a)

Add support for different NTP daemons

Allow user to choose between timesyncd, chronyd and ntpd
Installation will default to timesyncd since it is distributed as
part of the systemd installation for most distros.
Added note indicating NTP daemon type is not used for containerized
deployments.

Fixes issue #3086 on Github

Signed-off-by: Benjamin Cherian <benjamin_cherian@amat.com>
(cherry picked from commit 85071e6e530ddd80df35920d9fbe63047478d66b)

rolling_update: do not fail on missing keys

We don't want to fail on keys that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>

update: fix a typo

`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d8f0daa05ed8da987984d638af3a794)

rolling_update: refact set_fact `mon_host`

each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact and causes
failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018 14:02:30 +0000 (0:00:07.493) 0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```
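
The selection rule can be sketched as follows. This is only an
illustration of "pick a monitor that isn't yourself", assuming a plain
list of monitor names; the function name is hypothetical and the real
logic lives in a `set_fact` task.

```python
def pick_mon_host(mons, current):
    """Return the first monitor in `mons` that is not `current`."""
    for mon in mons:
        if mon != current:
            return mon
    return None  # single-monitor group: there is nobody else to pick
```

With `mons = ["mon0", "mon1", "mon2"]`, every node gets a value, so no
monitor is left without the fact.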

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584f1b3a99515e9b94f450be22420c545)

rolling_update: create rbd and rbd-mirror keyrings

During an upgrade ceph won't create keys that did not exist on the
previous version. So after an upgrade from, let's say, Jewel to Luminous,
once all the monitors have the new version they should get or create the
keys. It's ok to have the task fail, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f9263b9ac3b5649f1e3cf3cbaf12d10)

ceph_key: add a get_key function

When checking if a key exists we also have to ensure that the key exists
on the filesystem, the key can change on Ceph but still have an outdated
version on the filesystem. This solves this issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 691f373543d96d26b1af61c4ff7731fd888a9ce9)

Fix problem with ceph_key in python3

Pretty basic problem of iteritems removal.
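
For context, `dict.iteritems()` was removed in Python 3 while
`dict.items()` works on both major versions. A minimal sketch, assuming
a caps dictionary like the one below; the function is illustrative and
is not the module's actual code.

```python
caps = {"mon": "allow rw", "osd": "allow rw"}


def format_caps(caps):
    """Render caps as 'entity=capability' strings in a stable order."""
    # .items() is available on Python 2 and 3; .iteritems() is Python 2 only.
    return ["%s=%s" % (k, v) for k, v in sorted(caps.items())]
```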

Signed-off-by: Jairo Llopis <yajo.sk8@gmail.com>
(cherry picked from commit fc20973c2b9a89a15f8940cd100c34df4caca030)

tox: fix a typo

the line setting `ANSIBLE_CONFIG` obviously contains a typo introduced
by 1e283bf69be8b9efbc1a7a873d91212ad57c7351

`ANSIBLE_CONFIG` has to point to a path only (path to an ansible.cfg)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a0cceb3e44f17f417f1c7d86c51f915dbaf0bd2f)

rolling_update: fix upgrade when using fqdn

Clusters that were deployed using 'mon_use_fqdn' have a different unit
name, so during the upgrade this must be used otherwise the upgrade will
fail, looking for a unit that does not exist.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 44d0da0dd497bfab040c1e64fb406e4c13150028)

tests: do not install lvm2 on atomic host

we need to detect whether we are running on atomic host to not try to
install lvm2 package.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d2ca24eca8849a8f2df748c3f7c4e0d6885b6298)

tests: install lvm2 before setting up ceph-volume/LVM tests

Signed-off-by: Alfredo Deza <adeza@redhat.com>
(cherry picked from commit 3e488e8298a0c8ec4ec98317a1d7f1efc4926257)

Stringify ceph_docker_image_tag

This could be a numeric input, but is treated like a string leading to
runtime errors.
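
One way such an error can arise is sketched below in plain Python,
assuming the tag was parsed as a float (for example an unquoted 3.1 in
a vars file). The function names and the image name are illustrative
only.

```python
def naive_image_ref(image, tag):
    """Concatenate without casting: raises TypeError for a numeric tag."""
    return image + ":" + tag


def image_ref(image, tag):
    """Cast the tag to a string first, so 3.1 and "3.1" both work."""
    return "%s:%s" % (image, str(tag))
```

Note that a float such as 3.10 has already lost its trailing zero by
the time it is stringified, so quoting the tag remains the safest input.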

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 8dcc8d1434dbe2837d91162f4647246c54826e97)

Avoid using tests as filter

Fixes the deprecation warning:

[DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
using `result|search` use `result is search`.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 306e308f133c9b9757d6cae5f88d2c39903cae2f)

Sync config_template with upstream for Ansible 2.6

The original_basename option in the copy module changed to be
_original_basename in Ansible 2.6+, this PR resyncs the config_template
module to allow this to work with both Ansible 2.6+ and before.

Additionally, this PR removes the _v1_config_template.py file, since
ceph-ansible no longer supports versions of Ansible before version 2,
and so we shouldn't continue to carry that code.

Closes: #2843
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
(cherry picked from commit a1b3d5b7c3eed146372cbda5941653809504bb10)

switch: copy initial mon keyring

We need to copy this key into /etc/ceph so when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't
failing because `fail_on_missing` was False before 2.5; now it's True,
hence the failure.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bae0f41705e4ca25492a7ff0169490331b897874)

switch: support migration when cluster is scrubbing

Similar to c13a3c3 we must allow scrubbing when running this playbook.

In a cluster with a large number of PGs, some of them can be expected
to be scrubbing at any time; it's a normal operation.
Requiring that no PG is scrubbing forces the operator to set the
noscrub flag.

This commit allows switching from a non containerized to a containerized
environment even while PGs are scrubbing.

Closes: #3182
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54b02fe1876cb6a33aebd8fcba04bd426f30b967)

defaults: fix osd containers handler

`ceph_osd_container_stat` might not be set on other osd node.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.

This should have been backported, but it's part of a refactor in
master that is too large to be backported.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

switch: allow switch big clusters (more than 99 osds)

The current regex had a limitation of 99 OSDs. This limit has been
removed, and regardless of the number of OSDs they will all be collected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9fccffa1cac2e2b527ad35e7398db6f20b79b835)
(cherry picked from commit d5e57af23df2c5155975bb7c693d7c2700d101da)

defaults: fix osd handlers that are never triggered

`run_once: true` + `inventory_hostname == groups.get(osd_group_name) |
last` is a bad combination: if the only node being run isn't the
last, the task will definitely be skipped.
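
A simplified model of that interaction, assuming `run_once` evaluates
the task only on the first host of the play; this is a sketch with
hypothetical names, not Ansible's scheduler.

```python
def handler_runs(play_hosts, osd_group, run_once):
    """Return the hosts on which the 'last OSD node' condition passes."""
    last = osd_group[-1]
    # With run_once, only the first host in the play is considered.
    candidates = play_hosts[:1] if run_once else play_hosts
    return [h for h in candidates if h == last]
```

With three OSD nodes and `run_once=True` the result is empty, which is
why the handler was never triggered.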

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

config: look up for monitor_address_block in hostvars

`monitor_address_block` should be read from hostvars[host] instead of
current node being played.

eg:

Let's assume we have:

```
[mons]
ceph-mon0 monitor_address=192.168.1.10
ceph-mon1 monitor_interface=eth1
ceph-mon2 monitor_address_block=192.168.1.0/24
```

the ceph.conf generation task will end up with:

```
fatal: [ceph-mon0]: FAILED! => {}

MSG:

'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```

The reason is that the template assumes `monitor_address_block` isn't
defined, even for ceph-mon2, because it looks up `monitor_address_block`
instead of `hostvars[host]['monitor_address_block']`. It therefore falls
through to the default branch:

```
    {%- else -%}
      {% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
      {% if ip_version == 'ipv4' -%}
        {{ hostvars[host][interface][ip_version]['address'] }}
      {%- elif ip_version == 'ipv6' -%}
        [{{ hostvars[host][interface][ip_version][0]['address'] }}]
      {%- endif %}
    {%- endif %}
```

`monitor_interface` is set with the default value `'interface'`, so the
`interface` variable is built as 'ansible_' + 'interface'. This makes
ansible throw a confusing message about `'ansible_interface'`.
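
The difference between the two lookups can be modelled with a plain
dictionary. A minimal sketch using the inventory from the example
above; the function and its parameters are hypothetical.

```python
hostvars = {
    "ceph-mon0": {"monitor_address": "192.168.1.10"},
    "ceph-mon1": {"monitor_interface": "eth1"},
    "ceph-mon2": {"monitor_address_block": "192.168.1.0/24"},
}


def uses_address_block(host, current_host, per_host_lookup):
    """Is `monitor_address_block` seen as defined for `host`?

    per_host_lookup=False reads the variable from the node being played
    (the bug); per_host_lookup=True reads it from hostvars[host] (the fix).
    """
    source = hostvars[host] if per_host_lookup else hostvars[current_host]
    return "monitor_address_block" in source
```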

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1635303
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6130bc841dd25adf9a1ae26e6f82aef6b33328d8)

purge: actually remove /var/lib/ceph/*

38dc20e74b89c1833d45f677f405fe758fd10c04 introduced a bug in the purge
playbooks because using `*` in `command` module doesn't work.
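
The `command` module does not pass its arguments through a shell, so
`*` is never expanded. The effect can be illustrated in Python with a
temporary directory; the directory layout below is made up for the
example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in ("mon", "osd"):
        os.mkdir(os.path.join(root, name))
    pattern = os.path.join(root, "*")

    # Without shell expansion, "*" is just a character in a file name,
    # and no file with that literal name exists.
    literal_exists = os.path.exists(pattern)

    # Expanding the pattern explicitly yields the real entries.
    expanded = sorted(os.path.basename(p) for p in glob.glob(pattern))
```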

`/var/lib/ceph/*` files are not purged, which means there are leftovers.

When trying to redeploy a cluster, it failed because monitor daemon was
detecting existing keyring, therefore, it assumed a cluster already
existed.

Typical error (from container output):

```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16 /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.932393 7f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23 /entrypoint.sh:
SUCCESS
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 144c92b21ff151cd490fc9f47f7d90a19021e4c6)

restart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex

`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.
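
The three quantifiers can be compared with Python's `re` module. This
is a sketch only: the unit names and the pattern are hypothetical, and
the script itself uses shell tools rather than Python.

```python
import re

names = ["ceph-osd@7", "ceph-osd@42", "ceph-osd@123"]


def matched_ids(quantifier):
    """Return the OSD ids matched when the digit class uses `quantifier`."""
    pattern = re.compile(r"ceph-osd@([0-9]%s)$" % quantifier)
    return [m.group(1) for m in map(pattern.match, names) if m]
```

`{1,}` and `+` match the same ids, while `{1,2}` silently drops any id
with three or more digits.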

Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 806461ac6edd6aada39173df9d9163239fd82555)

restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK

After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
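
The check can be sketched in Python as below, assuming a pgmap shaped
like the one in the JSON status output (`num_pgs` plus a `pgs_by_state`
list of `state_name`/`count` entries). The function name and sample
numbers are illustrative; the script itself is a shell template.

```python
def all_pgs_clean(pgmap):
    """True when every pg is in a state containing 'active+clean'."""
    good = sum(
        state["count"]
        for state in pgmap["pgs_by_state"]
        if "active+clean" in state["state_name"]
    )
    # Integer comparison against the total number of pgs.
    return good == int(pgmap["num_pgs"])


scrubbing = {
    "num_pgs": 8,
    "pgs_by_state": [
        {"state_name": "active+clean", "count": 6},
        {"state_name": "active+clean+scrubbing+deep", "count": 2},
    ],
}
degraded = {
    "num_pgs": 8,
    "pgs_by_state": [
        {"state_name": "active+clean", "count": 6},
        {"state_name": "active+undersized+degraded", "count": 2},
    ],
}
```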

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 04f4991648568e079f19f8e531a11a5fddd45c87)

rolling_update: ensure pgs_by_state has at least 1 entry

Previous commit c13a3c3 has removed a condition.

This commit brings back this condition which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 179c4d00d702ff9f7a10a3eaa513c289dd75d038)

upgrade: consider all 'active+clean' states as valid pgs

In a cluster with a large number of PGs, some of them can be expected
to be scrubbing at any time; it's a normal operation.
Requiring that no PG is scrubbing forces the operator to set the
noscrub flag before a rolling update, which is a problem because it
pauses an important data integrity operation until the end of the
rolling upgrade.

This commit allows an upgrade even while PGs are scrubbing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c13a3c34929f34af11fbd746e9c0502a70f84b97)

Fix version check in ceph.conf template

We need to look for ceph_release when comparing with release names,
not ceph_version.
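
To illustrate why the release name is the right key, here is a small
sketch. The mapping lists only a few releases and the version string is
an assumed example; neither is copied from the role defaults.

```python
ceph_release_num = {"jewel": 10, "kraken": 11, "luminous": 12, "mimic": 13}


def at_least(ceph_release, minimum):
    """Compare two release names through their major numbers."""
    return ceph_release_num[ceph_release] >= ceph_release_num[minimum]


ceph_release = "luminous"
ceph_version = "12.2.8"  # a version string is not a key of the mapping
```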

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1631789
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit 6126210e0e426a4dc96ef78f90c8c6473f4c5b7c)

restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs

Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the cluster's pgs to be happy again before continuing.
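
The effect of sharing one retry budget across all OSDs can be modelled
as follows. This is a simplified sketch with hypothetical names, where
each entry of `osd_wait_rounds` is the number of polling rounds one OSD
restart needs.

```python
def restart_all(osd_wait_rounds, retries=40, reset_between=True):
    """Return True if every OSD's pg check finishes within its budget."""
    remaining = retries
    for rounds_needed in osd_wait_rounds:
        if reset_between:
            remaining = retries  # fresh budget for each OSD
        if rounds_needed > remaining:
            return False  # would give up while waiting on this OSD
        remaining -= rounds_needed
    return True
```

Three OSDs needing 15 rounds each exhaust a shared budget of 40 on the
third restart, but pass when the budget is reset between calls.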

Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit aa97ecf0480c1075187b38038463f2f52144c754)

config: set default _rgw_hostname value to respective host

the default value for _rgw_hostname was taken from the current node being
played while it should be taken from the respective node in the loop.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d6fd514e0cbfb8283c349353582966938cd1c10)

tests: followup on b89cc1746f

Update network subnets in group_vars/all

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a88bccf8707137728c1d94e8a2424e63522293c)

shrink-osd: fix purge osd on containerized deployment

ce1dd8d introduced the purge osd on containers but it was incorrect.

`resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
They were being run on the ansible node, which means parent devices
were resolved on that node when this should be done on the OSD
nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4159326a182d15376bf5e5913da4bb6281e27957)

tests: fix monitor_address for shrink_osd scenario

b89cc1746 introduced a typo. This commit fixes it

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3382c5226c8e6e974dee7be39392652a203bb280)

nfs: ignore error on semanage command for ganesha_t

As of rhel 7.6, it has been decided it doesn't make sense to confine
`ganesha_t` anymore. It means this domain won't exist anymore.

Let's add a `failed_when: false` so that the deployment doesn't
fail when trying to run this command.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1626070
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a6f77340fd942c7ce1a969347215cc5e3b18b1b2)

tests: pin sphinx version to 1.7.9

using sphinx 1.8.0 breaks our doc test CI job.

Typical error:

```
Exception occurred:
  File
  "/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py",  line 26, in <module>
      from sphinx.ext import doctest
      SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```

See: https://github.com/sphinx-doc/sphinx/issues/5417

Pinning to 1.7.9 to fix our CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f2c660d2566b9d8772c7dbee7cbcd005a61bfc2)

defaults: add a default value to rgw_hostname

let's add ansible_hostname as a default value for rgw_hostname if no
hostname in servicemap matches ansible_fqdn.
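
The fallback can be sketched as below, assuming "matches" means string
equality and that the servicemap hostnames are available as a list; the
function name and arguments are hypothetical.

```python
def pick_rgw_hostname(servicemap_hostnames, ansible_fqdn, ansible_hostname):
    """Use the servicemap entry equal to the fqdn, else the short name."""
    for name in servicemap_hostnames:
        if name == ansible_fqdn:
            return name
    return ansible_hostname  # default when nothing matches
```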

Fixes: #3063
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9ff26e80f2a628b34372edac65931df87b01a763)

tests: do not upgrade ceph release for switch_to_containers scenario

Using `UPDATE_*` environment variables here will make an upgrade of the
ceph release when running switch_to_containers scenario which is not
correct.

Eg:
If ceph luminous was first deployed, then we should switch to ceph
luminous containers, not to mimic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Revert "client: add quotes to the dict values"

This commit is adding quotes that make the keyring unusable

eg:

```
client.john
        key: AQAN0RdbAAAAABAAH5D3WgMN9Rxw3M8jkpMIfg==
        caps: [mds] ''
        caps: [mgr] 'allow *'
        caps: [mon] 'allow rw'
        caps: [osd] 'allow rw'
```

Trying to import such a keyring and use it will result in:

```
Error EACCES: access denied
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1623417
This reverts commit 424815501a0c6072234a8e1311a0fefeb5bcc222.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ecbd3e45584791678e590172c5a0ceda7bd83623)

purge: only purge /var/lib/ceph content

Sometimes /var/lib/ceph is mounted on a device so we won't be able to
remove it (device busy), so let's remove its content only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 38dc20e74b89c1833d45f677f405fe758fd10c04)

run rados cmd in container if containerized deployment

When ceph-nfs is deployed containerized and ceph-common is not
installed on the host the start_nfs task fails because the rados
command is missing on the host.

Run rados commands from a ceph container instead so that
they will succeed.

Signed-off-by: Tom Barron <tpb@dyncloud.net>
(cherry picked from commit bf8f589958450ce07ec19d01fb98176ab50ab71f)

roles: ceph-rgw: Enable the ceph-radosgw target

If the ceph-radosgw target is not enabled, then enabling the
ceph-radosgw@ service has no effect since nothing will pull
it on the next reboot. As such, we need to ensure that the
target is enabled.

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit 217f35dbdb5036274be4674e9b0be2127b8875d7)

Dont run client dummy container on non-x86_64 hosts

The dummy client container currently won't work on non-x86_64 hosts.
This PR creates a filtered client group that contains only hosts
that are x86_64 - which can then be the group to run the
dummy container against.

This is for the specific case of a containerized_deployment where
there is a mixture of non-x86_64 hosts and x86_64 hosts. As such
the filtered group will contain all hosts when running with
containerized_deployment: false.

Currently ppc64le is not supported for Ceph server components.

Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
(cherry picked from commit 772e6b9be20ce82d3b8f9ffdf6b7bc4f6be842b8)

doc: remove old statement

We have been supporting multiple devices for journal in containerized
deployments for a while now and forgot about this.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622393
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 124fc727f472551ab2a14a8d6b9d3d54159a1b08)

remove warning for unsupported variables

As promised, these will go unsupported for 3.1 so let's actually remove
them :).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622729
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9ba670567e97b7ad16e6f623ae99a5ad3ee6d880)

sites: fix conditional

Same problem again... ceph_release_num[ceph_release] is only set in
ceph-docker-common/common roles so putting the condition on that role
will never work. Removing the condition.

The downside of this is we will be installing packages and then skip the
role on the node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622210
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ae5ebeeb00214d9ea27929b4670c6de4ad27d829)

site-docker.yml: remove useless condition

If we play site-docker.yml, we are already in a
containerized_deployment. So the condition is not needed.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 30cfeb5427535cd8dc98370ee33205be3b67bde0)

ci: stop using different images on the same run

There is no point in using both atomic AND centos hosts in the same
run. So let's run containerized scenarios on Atomic only.

This solves this error here:

```
fatal: [client2]: FAILED! => {
"failed": true
}

MSG:

The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'

The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: set_fact ceph_current_status (convert to json)
^ here
```

From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a

What's happening here is that all the hosts except the clients are running atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
The condition will skip all the nodes except the clients, thus when running ceph-defaults, the task "is ceph running already?" is skipped but the task above needs the rc of the skipped task.
This is not an error from the playbook, it's a CI setup issue.
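
As an illustration only (plain Python, not the playbook itself), the
failure mode can be sketched as a lookup on a registered result that has
no `rc` because its task was skipped. The dictionary shapes and the
`is_cluster_running` helper are assumptions made for this sketch:

```python
def is_cluster_running(result):
    """Return True only if the registered task really ran and exited 0.

    `result` is a hypothetical stand-in for a registered task result:
    a skipped task carries no 'rc' key at all, so it is checked first.
    """
    if result.get("skipped", False):
        return False
    return result.get("rc") == 0


skipped_result = {"skipped": True, "changed": False}
successful_result = {"rc": 0, "stdout": "{}"}
```

Reading `result["rc"]` directly on `skipped_result` would raise a
KeyError, which is the Python analogue of the "'dict object' has no
attribute 'rc'" message above.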

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 7012835d2b1880e7a6ef9a224df456b2dd1024cc)

release-note: stable-3.1

stable-3.1 is approaching, so let's write our first release note.

Signed-off-by: Sébastien Han <seb@redhat.com>

defaults: fix rgw_hostname

A couple of things were wrong in the initial commit:

* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
never work since the ceph_release fact is only set later, by either the
ceph-common or the ceph-docker-common role

* we can easily re-use the initial command to check if a cluster is
running, it's more elegant than running it twice.

* set the fact rgw_hostname on rgw nodes only

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 6d7fa99ff74b3ec25d1a6010b1ddb25e00c123be)

rolling_upgrade: set sortbitwise properly

Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. While OSDs are being updated, even though the package is updated,
they won't report their updated version (12) and will stick with 10 if
the command is not applied. So we have to check whether OSDs are
reporting version 10 and then run the command to unlock the OSDs.
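
A minimal sketch of that decision in plain Python, for illustration
only. It assumes the major versions reported by the OSDs have already
been collected into a list; the `needs_sortbitwise` helper and its input
format are hypothetical and not taken from the playbook:

```python
def needs_sortbitwise(reported_major_versions):
    """Return True if at least one OSD still reports major version 10.

    The check is made on what the OSDs report, not on the installed
    package version, since the two can differ during the upgrade.
    """
    return any(version == 10 for version in reported_major_versions)
```

For example, `needs_sortbitwise([10, 10, 12])` is true, while a cluster
where every OSD already reports 12 does not need the command.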

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2e6e885bb75156c74735a65c05b4757b031041bb)

iscsi group name preserve backward compatibility

Recently we renamed the iSCSI group_name to iscsigws; previously it was
named iscsi-gws. Existing deployments with a host file section named
iscsi-gws must continue to work.

This commit adds the old group name for backward compatibility. No error
from Ansible should be expected: if the host group is not found, nothing
is played.
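
The intended behaviour can be sketched in plain Python, for illustration
only. The inventory is modelled as a dict of group name to host list,
and the `iscsi_hosts` helper is hypothetical:

```python
def iscsi_hosts(inventory, group_names=("iscsigws", "iscsi-gws")):
    """Collect hosts from the new and the old group name.

    A missing group contributes nothing and raises no error, so an
    inventory using either name (or neither) is handled the same way.
    """
    hosts = []
    for name in group_names:
        for host in inventory.get(name, []):
            if host not in hosts:
                hosts.append(host)
    return hosts
```

An inventory that only defines the old `iscsi-gws` section still yields
its hosts, and an inventory with neither section yields an empty list.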

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 77a3a682f358c8e9a40c5b50e980b5e9ec5f6d60)

osd: fix ceph_release

We need ceph_release in the condition, not ceph_stable_release

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1619255
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 8c70a5b1975c31992cdfa0a46a04bd9afbc1a806)

take-over-existing-cluster: do not call var_files

We were using var_files long ago when default variables were not in
ceph-defaults. Now that the role exists this is not needed. Moreover,
having these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

will create collisions and override necessary variables.
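
The collision can be sketched in plain Python, for illustration only.
The sketch assumes the usual Ansible ordering in which files loaded at
play level outrank both role defaults and inventory group_vars; the
variable name `some_setting` and the helper are hypothetical:

```python
# Lowest to highest precedence in this simplified model.
PRECEDENCE = ("role_defaults", "inventory_group_vars", "play_vars_files")


def effective_vars(layers):
    """Merge variable layers so that higher-precedence layers win."""
    merged = {}
    for layer in PRECEDENCE:
        merged.update(layers.get(layer, {}))
    return merged


layers = {
    "role_defaults": {"some_setting": "default"},
    "inventory_group_vars": {"some_setting": "operator choice"},
    # Loading the defaults file again at play level re-applies it on top.
    "play_vars_files": {"some_setting": "default"},
}
```

With the play-level layer present, the operator's value is silently
replaced by the default; dropping that layer restores it.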

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b7387068109a521796f8e423a61449541043f4e6)

roles: ceph-defaults: Delegate cluster information task to monitor node

Since commit f422efb1d6b56ce56a7d39a21736a471e4ed357 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployments with OpenStack-Ansible:

fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"

This is because the task executes 'ceph' but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role which runs after the 'ceph-defaults' one.

Since we are looking to obtain cluster information, the task should be
delegated to a monitor node, similar to other tasks in that role.

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit 37e50114dedf6a7aec0f1b2e1b9d2dd997a11d8e)

roles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname

If there are no services on the cluster, then the 'rgw' attribute could
be missing and the task fails with the following error:

msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'

We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.
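
A minimal Python sketch of the same guard, for illustration only. The
nesting of the status document used here (`servicemap` then `services`
then `rgw`) is an assumption made for the example, and `rgw_daemons` is
a hypothetical helper:

```python
def rgw_daemons(status):
    """Return the rgw daemons mapping, or None when no rgw is registered.

    Returning None lets the caller skip the override and keep the
    default rgw_hostname instead of failing on a missing key.
    """
    services = status.get("servicemap", {}).get("services", {})
    rgw = services.get("rgw")
    if rgw is None:
        return None
    return rgw.get("daemons", {})
```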

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit 126e2e3f92475a17f9a04e1e792ee6eb69fbfab0)

mgr: improve/fix disabled modules check

Follow up on 36942af6983d60666f3f8a1a06b352a440a6c0da

"disabled_modules" is always a list; it's the items in the list that
can be dicts in mimic. There are many ways to fix this, here's one.
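
One way to normalize such a list, sketched in plain Python for
illustration only. It assumes that dict items expose the module name
under a `name` key, which is an assumption of this sketch rather than
something stated above:

```python
def disabled_module_names(disabled_modules):
    """Return module names whether items are plain strings or dicts."""
    names = []
    for item in disabled_modules:
        if isinstance(item, dict):
            names.append(item["name"])  # assumed key for dict-style items
        else:
            names.append(item)
    return names
```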

Signed-off-by: Dardo D Kleiner <dardokleiner@gmail.com>
(cherry picked from commit f6519e4003404e10ae1f5e86298cffd4405591da)

lv-create: use copy instead of the template module

The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 04df3f0802c0bc903172314d05a38e869f0eee6a)
Signed-off-by: Sébastien Han <seb@redhat.com>

tests: cat the contents of lv-create.log in infra_lv_create

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit f5a4c8986982f277f6fd5bcd5b28c6099f655d79)
Signed-off-by: Sébastien Han <seb@redhat.com>

lv-create: add an example logfile_path config option in lv_vars.yml

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 131796f2750f1209a019ae75a500e6f1a1ab37f8)
Signed-off-by: Sébastien Han <seb@redhat.com>

tests: adds a testing scenario for lv-create and lv-teardown

Using an explicitly named testing environment allows us to have a
specific [testenv] block for this test. This greatly simplifies how it will
work as it doesn't really need anything from the ceph cluster tests.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 810cc47892e53701485c540ff51c00c860ea0a00)
Signed-off-by: Sébastien Han <seb@redhat.com>

lv-teardown: fail silently if lv_vars.yml is not found

This allows users to opt out of using lv_vars.yml and load configuration
from other sources.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit b0bfc173510ec7d5da715c0048e633a8fe3d2a4d)
Signed-off-by: Sébastien Han <seb@redhat.com>

lv-teardown: set become: true at the playbook level

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 8424858b40bafebe3569b33279e4d8d824e2276b)
Signed-off-by: Sébastien Han <seb@redhat.com>

lv-create: fail silently if lv_vars.yml is not found

If a user decides not to use the lv_vars.yml file then it should fail
silently so that configuration can be picked up from other places.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit e43eec57bb44bf5f7a10da8548ca22a8772c2d92)
Signed-off-by: Sébastien Han <seb@redhat.com>

lv-create: set become: true at the playbook level

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit fde47be13cc753218153b3dbc0db5a4daa752b21)
Signed-off-by: Sébastien Han <seb@redhat.com>

lv-create: use the template module to write log file

The copy module will not expand the template and render the variables
included, so we must use template.

Creating a temp file and using it locally means that you must run the
playbook with sudo privileges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.
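
The defaulting behaviour can be sketched in plain Python, for
illustration only. The `resolve_logfile` helper is hypothetical, and the
file name is borrowed from the lv-create.log mentioned in the tests:

```python
import os


def resolve_logfile(logfile_path=None, name="lv-create.log"):
    """Return where the log is written, falling back to the cwd."""
    directory = logfile_path or os.getcwd()
    return os.path.join(directory, name)
```

Writing into a user-chosen directory, or the current one by default,
avoids needing elevated privileges just to create the log.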

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 35301b35af4e71713edb944eb54654b587710527)
Signed-off-by: Sébastien Han <seb@redhat.com>

infrastructure-playbooks/vars/lv_vars.yaml: minor fixes

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 909b38da829485b2ec56b61bf2b2fa0df02b0ed4)
Signed-off-by: Sébastien Han <seb@redhat.com>

infrastructure-playbooks/lv-create.yml: use tempfile to create logfile

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit f65f3ea89fdba98057172e32e1a43ee6370c04d9)
Signed-off-by: Sébastien Han <seb@redhat.com>