git-server-git.apps.pok.os.sepia.ceph.com Git - ceph-ansible.git/log
ceph-ansible.git
7 years agorolling_update: do not fail on missing keys v3.1.12
Sébastien Han [Thu, 29 Nov 2018 13:26:41 +0000 (14:26 +0100)]
rolling_update: do not fail on missing keys

We don't want to fail on keys that are not present, since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoupdate: fix a typo v3.1.11
Guillaume Abrioux [Mon, 26 Nov 2018 13:10:19 +0000 (14:10 +0100)]
update: fix a typo

`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d8f0daa05ed8da987984d638af3a794)

7 years agorolling_update: refact set_fact `mon_host`
Guillaume Abrioux [Thu, 22 Nov 2018 16:52:58 +0000 (17:52 +0100)]
rolling_update: refact set_fact `mon_host`

Each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact, which
causes a failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018  14:02:30 +0000 (0:00:07.493)       0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584f1b3a99515e9b94f450be22420c545)
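
The refactored selection can be sketched as an Ansible task along these lines (variable names such as `mon_group_name` are assumptions, not the exact backported code): each monitor removes itself from the mon group before picking a peer, so every node in the group ends up with the fact set.

```yaml
- name: set_fact mon_host (hypothetical sketch)
  set_fact:
    # exclude the current node, then pick any remaining monitor
    mon_host: "{{ groups[mon_group_name] | difference([inventory_hostname]) | first }}"
```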

7 years agorolling_update: create rbd and rbd-mirror keyrings
Sébastien Han [Wed, 21 Nov 2018 15:18:58 +0000 (16:18 +0100)]
rolling_update: create rbd and rbd-mirror keyrings

During an upgrade ceph won't create keys that did not exist on the
previous version. So after an upgrade from, let's say, Jewel to Luminous,
once all the monitors have the new version they should get or create the
keys. It's ok to have the task fail, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f9263b9ac3b5649f1e3cf3cbaf12d10)

7 years agoceph_key: add a get_key function
Sébastien Han [Wed, 21 Nov 2018 15:17:04 +0000 (16:17 +0100)]
ceph_key: add a get_key function

When checking if a key exists we also have to ensure that the key exists
on the filesystem: the key can change in Ceph but still have an outdated
version on the filesystem. This solves that issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 691f373543d96d26b1af61c4ff7731fd888a9ce9)

7 years agoFix problem with ceph_key in python3
Jairo Llopis [Thu, 4 Oct 2018 05:48:03 +0000 (07:48 +0200)]
Fix problem with ceph_key in python3

A pretty basic problem: `iteritems` was removed in Python 3.

Signed-off-by: Jairo Llopis <yajo.sk8@gmail.com>
(cherry picked from commit fc20973c2b9a89a15f8940cd100c34df4caca030)
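
The portable replacement is `dict.items()`, which exists in both Python 2 and 3. A minimal illustration (the function is hypothetical, not taken from the ceph_key module):

```python
def format_caps(caps):
    # dict.iteritems() was removed in Python 3; dict.items() works in
    # both Python 2 and 3, so it is the portable spelling.
    return ", ".join("%s=%s" % (k, v) for k, v in sorted(caps.items()))

print(format_caps({"mon": "allow r", "osd": "allow rw"}))
# → mon=allow r, osd=allow rw
```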

7 years agotox: fix a typo
Guillaume Abrioux [Thu, 25 Oct 2018 12:42:54 +0000 (14:42 +0200)]
tox: fix a typo

the line setting `ANSIBLE_CONFIG` obviously contains a typo introduced
by 1e283bf69be8b9efbc1a7a873d91212ad57c7351

`ANSIBLE_CONFIG` has to point to a path only (path to an ansible.cfg)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a0cceb3e44f17f417f1c7d86c51f915dbaf0bd2f)

7 years agorolling_update: fix upgrade when using fqdn v3.1.10
Sébastien Han [Thu, 9 Aug 2018 09:32:53 +0000 (11:32 +0200)]
rolling_update: fix upgrade when using fqdn

Clusters that were deployed using 'mon_use_fqdn' have a different unit
name, so during the upgrade this must be used, otherwise the upgrade will
fail looking for a unit that does not exist.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 44d0da0dd497bfab040c1e64fb406e4c13150028)

7 years agotests: do not install lvm2 on atomic host v3.1.9
Guillaume Abrioux [Tue, 9 Oct 2018 20:45:05 +0000 (16:45 -0400)]
tests: do not install lvm2 on atomic host

We need to detect whether we are running on an Atomic host so that we do
not try to install the lvm2 package.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d2ca24eca8849a8f2df748c3f7c4e0d6885b6298)

7 years agotests: install lvm2 before setting up ceph-volume/LVM tests
Alfredo Deza [Tue, 9 Oct 2018 17:40:38 +0000 (13:40 -0400)]
tests: install lvm2 before setting up ceph-volume/LVM tests

Signed-off-by: Alfredo Deza <adeza@redhat.com>
(cherry picked from commit 3e488e8298a0c8ec4ec98317a1d7f1efc4926257)

7 years agoStringify ceph_docker_image_tag
Noah Watkins [Fri, 5 Oct 2018 22:56:45 +0000 (15:56 -0700)]
Stringify ceph_docker_image_tag

This could be a numeric input, but it is treated like a string, leading to
runtime errors.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 8dcc8d1434dbe2837d91162f4647246c54826e97)
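
The normalization boils down to casting through Jinja's `string` filter, roughly as follows (task name assumed), so that a tag such as `3.1` is not parsed as a float:

```yaml
- name: ensure ceph_docker_image_tag is a string
  set_fact:
    ceph_docker_image_tag: "{{ ceph_docker_image_tag | string }}"
```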

7 years agoAvoid using tests as filter
Noah Watkins [Fri, 5 Oct 2018 22:53:40 +0000 (15:53 -0700)]
Avoid using tests as filter

Fixes the deprecation warning:

  [DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
  using `result|search` use `result is search`.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 306e308f133c9b9757d6cae5f88d2c39903cae2f)
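
The change amounts to swapping the filter pipe for the `is` test syntax; for example (the conditions are illustrative, not the exact ones touched by the commit):

```yaml
# deprecated: a Jinja test used as a filter
- when: ceph_current_status.stdout | search('active')
# preferred since Ansible 2.5: tests use `is`
- when: ceph_current_status.stdout is search('active')
```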

7 years agoSync config_template with upstream for Ansible 2.6
Andy McCrae [Thu, 12 Jul 2018 11:24:15 +0000 (12:24 +0100)]
Sync config_template with upstream for Ansible 2.6

The original_basename option in the copy module changed to
_original_basename in Ansible 2.6+; this PR resyncs the config_template
module to allow this to work with both Ansible 2.6+ and earlier.

Additionally, this PR removes the _v1_config_template.py file, since
ceph-ansible no longer supports versions of Ansible before version 2,
and so we shouldn't continue to carry that code.

Closes: #2843
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
(cherry picked from commit a1b3d5b7c3eed146372cbda5941653809504bb10)

7 years agoswitch: copy initial mon keyring
Sébastien Han [Wed, 3 Oct 2018 11:39:35 +0000 (13:39 +0200)]
switch: copy initial mon keyring

We need to copy this key into /etc/ceph so that when ceph-docker-common
runs it can fetch it to the ansible server. Previously the task wasn't
failing because `fail_on_missing` was False before 2.5; now it's True,
hence the failure.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bae0f41705e4ca25492a7ff0169490331b897874)

7 years agoswitch: support migration when cluster is scrubbing
Guillaume Abrioux [Tue, 2 Oct 2018 15:31:49 +0000 (17:31 +0200)]
switch: support migration when cluster is scrubbing

Similar to c13a3c3, we must allow scrubbing when running this playbook.

In clusters with a large number of PGs, it can be expected that some of
them will be scrubbing; it's a normal operation.
Preventing the scrubbing operation would force us to set the noscrub flag.

This commit allows switching from a non-containerized to a containerized
environment even while PGs are scrubbing.

Closes: #3182
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54b02fe1876cb6a33aebd8fcba04bd426f30b967)

7 years agodefaults: fix osd containers handler
Guillaume Abrioux [Sat, 13 Oct 2018 08:45:59 +0000 (10:45 +0200)]
defaults: fix osd containers handler

`ceph_osd_container_stat` might not be set on the other OSD nodes.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.

This should have been a backport, but it's part of a refactor in master
that is too large to backport.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years agoswitch: allow switching big clusters (more than 99 osds)
Sébastien Han [Wed, 26 Sep 2018 12:24:26 +0000 (14:24 +0200)]
switch: allow switching big clusters (more than 99 osds)

The current regex had a limitation of 99 OSDs; now this limit has been
removed and, regardless of the number of OSDs, they will all be collected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9fccffa1cac2e2b527ad35e7398db6f20b79b835)
(cherry picked from commit d5e57af23df2c5155975bb7c693d7c2700d101da)

7 years agodefaults: fix osd handlers that are never triggered v3.1.8
Guillaume Abrioux [Tue, 2 Oct 2018 21:58:38 +0000 (23:58 +0200)]
defaults: fix osd handlers that are never triggered

`run_once: true` + `inventory_hostname == groups.get(osd_group_name) |
last` is a bad combination: if the only node being run isn't the last,
the task will definitely be skipped.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
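
A sketch of the problematic pattern (names assumed): `run_once: true` makes Ansible execute the task on the first host of the batch, while the `when` tests for the last member of the group, so a limited run that doesn't include that last member never triggers the handler.

```yaml
- name: restart containerized ceph osds
  command: /usr/bin/env bash /tmp/restart_osd_daemon.sh
  run_once: true                # executes on the first host of the batch...
  when:
    # ...but only matches when that host happens to be the group's last member
    - inventory_hostname == groups.get(osd_group_name) | last
```
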
7 years agoconfig: look up for monitor_address_block in hostvars
Guillaume Abrioux [Tue, 2 Oct 2018 13:55:47 +0000 (15:55 +0200)]
config: look up for monitor_address_block in hostvars

`monitor_address_block` should be read from hostvars[host] instead of
current node being played.

eg:

Let's assume we have:

```
[mons]
ceph-mon0 monitor_address=192.168.1.10
ceph-mon1 monitor_interface=eth1
ceph-mon2 monitor_address_block=192.168.1.0/24
```

the ceph.conf generation task will end up with:

```
fatal: [ceph-mon0]: FAILED! => {}

MSG:

'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```

The reason is that it assumes `monitor_address_block` isn't defined even on
ceph-mon2, because it looks for `monitor_address_block` instead of
`hostvars[host]['monitor_address_block']`; therefore it falls into the default branch of the condition:

```
    {%- else -%}
      {% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
      {% if ip_version == 'ipv4' -%}
        {{ hostvars[host][interface][ip_version]['address'] }}
      {%- elif ip_version == 'ipv6' -%}
        [{{ hostvars[host][interface][ip_version][0]['address'] }}]
      {%- endif %}
    {%- endif %}
```

`monitor_interface` is set with the default value `'interface'`, so the
`interface` variable is built as 'ansible_' + 'interface'. This makes Ansible
throw a confusing message about `'ansible_interface'`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1635303
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6130bc841dd25adf9a1ae26e6f82aef6b33328d8)
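
The fix amounts to reading the variable through `hostvars[host]` inside the template, along these lines (a sketch; the `ipaddr` filter usage is an assumption, not the exact template code):

```jinja
{% if hostvars[host]['monitor_address_block'] is defined %}
  {# look the block up on the host being templated, not on the current node #}
  {{ hostvars[host]['ansible_all_ipv4_addresses'] | ipaddr(hostvars[host]['monitor_address_block']) | first }}
{% endif %}
```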

7 years agopurge: actually remove /var/lib/ceph/*
Guillaume Abrioux [Thu, 27 Sep 2018 09:33:51 +0000 (11:33 +0200)]
purge: actually remove /var/lib/ceph/*

38dc20e74b89c1833d45f677f405fe758fd10c04 introduced a bug in the purge
playbooks because using `*` with the `command` module doesn't work.

The `/var/lib/ceph/*` files were not purged, which means there were leftovers.

When trying to redeploy a cluster, it failed because the monitor daemon
detected an existing keyring and therefore assumed a cluster already
existed.

Typical error (from container output):

```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16  /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.932393 7f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23  /entrypoint.sh:
SUCCESS
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 144c92b21ff151cd490fc9f47f7d90a19021e4c6)
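
The underlying rule: the `command` module does not go through a shell, so a glob like `*` is passed literally to rm. A hedged sketch of the fix (task name assumed):

```yaml
# `command: rm -rf /var/lib/ceph/*` would hand the literal string
# "/var/lib/ceph/*" to rm; `shell` lets the glob expand while keeping
# the mountpoint itself in place.
- name: purge /var/lib/ceph content
  shell: rm -rf /var/lib/ceph/*
```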

7 years agorestart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex
Matthew Vernon [Wed, 19 Sep 2018 13:25:15 +0000 (14:25 +0100)]
restart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex

`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.

Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 806461ac6edd6aada39173df9d9163239fd82555)
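
For illustration (not taken from the template itself), the two quantifiers are equivalent under extended regular expressions:

```shell
# `[0-9]+` and `[0-9]{1,}` both mean "one or more digits";
# `+` is simply the idiomatic spelling.
echo "osd.12" | grep -Eo 'osd\.[0-9]+'
echo "osd.12" | grep -Eo 'osd\.[0-9]{1,}'
# both print: osd.12
```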

7 years agorestart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
Matthew Vernon [Wed, 19 Sep 2018 12:26:26 +0000 (13:26 +0100)]
restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK

After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 04f4991648568e079f19f8e531a11a5fddd45c87)
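
A hypothetical sketch of the comparison, with a made-up pgs_by_state sample standing in for the real `ceph pg stat` output: sum every state whose name contains active+clean and compare that integer with num_pgs.

```shell
num_pgs=5
good=$(printf 'active+clean 3\nactive+clean+scrubbing 1\nactive+clean+scrubbing+deep 1\nactive+undersized+degraded 0\n' \
  | awk '/active\+clean/ {sum += $2} END {print sum}')
# the active+clean* variants (including scrubbing) sum to 5, matching num_pgs
[ "$good" -eq "$num_pgs" ] && echo "PGs are OK"
# → PGs are OK
```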

7 years agorolling_update: ensure pgs_by_state has at least 1 entry v3.1.7
Guillaume Abrioux [Tue, 25 Sep 2018 12:21:44 +0000 (14:21 +0200)]
rolling_update: ensure pgs_by_state has at least 1 entry

The previous commit c13a3c3 removed a condition.

This commit brings back this condition, which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 179c4d00d702ff9f7a10a3eaa513c289dd75d038)

7 years agoupgrade: consider all 'active+clean' states as valid pgs
Guillaume Abrioux [Mon, 24 Sep 2018 12:21:24 +0000 (14:21 +0200)]
upgrade: consider all 'active+clean' states as valid pgs

In clusters with a large number of PGs, it can be expected that some of
them will be scrubbing; it's a normal operation.
Preventing the scrubbing operation forces us to set the noscrub flag before a
rolling update, which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.

This commit allows an upgrade even while PGs are scrubbing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c13a3c34929f34af11fbd746e9c0502a70f84b97)

7 years agoFix version check in ceph.conf template v3.1.6
Giulio Fidente [Mon, 24 Sep 2018 08:17:02 +0000 (10:17 +0200)]
Fix version check in ceph.conf template

We need to look for ceph_release when comparing with release names,
not ceph_version.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1631789
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit 6126210e0e426a4dc96ef78f90c8c6473f4c5b7c)

7 years agorestart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs
Matthew Vernon [Fri, 21 Sep 2018 16:55:01 +0000 (17:55 +0100)]
restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs

Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the cluster's PGs to be happy again before continuing.

Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit aa97ecf0480c1075187b38038463f2f52144c754)
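
The shape of the fix can be sketched as follows; `all_pgs_active_clean` is a stand-in for the real `ceph pg stat`-based check, not a function from the script:

```shell
all_pgs_active_clean() { return 0; }   # stub for the real health check

check_pgs() {
  local retries=40   # initialized inside the function: reset on every call,
                     # i.e. each restarted OSD gets its own 40 x 30s budget
  until all_pgs_active_clean; do
    retries=$((retries - 1))
    [ "$retries" -gt 0 ] || return 1
    sleep 30
  done
}

check_pgs && echo "cluster healthy"
# → cluster healthy
```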

7 years agoconfig: set default _rgw_hostname value to respective host v3.1.5
Guillaume Abrioux [Tue, 18 Sep 2018 16:10:57 +0000 (18:10 +0200)]
config: set default _rgw_hostname value to respective host

The default value for _rgw_hostname was taken from the current node being
played, while it should be taken from the respective node in the loop.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d6fd514e0cbfb8283c349353582966938cd1c10)

7 years agotests: followup on b89cc1746f
Guillaume Abrioux [Tue, 24 Jul 2018 14:35:42 +0000 (16:35 +0200)]
tests: followup on b89cc1746f

Update network subnets in group_vars/all

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a88bccf8707137728c1d94e8a2424e63522293c)

7 years agoshrink-osd: fix purge osd on containerized deployment
Guillaume Abrioux [Thu, 13 Sep 2018 09:18:56 +0000 (11:18 +0200)]
shrink-osd: fix purge osd on containerized deployment

ce1dd8d introduced purging OSDs on containerized deployments, but it was
incorrect.

The `resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
Indeed, they were run on the ansible node, which means it was trying to
resolve parent devices from that node while it should be done on the OSD
nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4159326a182d15376bf5e5913da4bb6281e27957)

7 years agotests: fix monitor_address for shrink_osd scenario
Guillaume Abrioux [Thu, 13 Sep 2018 13:22:07 +0000 (15:22 +0200)]
tests: fix monitor_address for shrink_osd scenario

b89cc1746 introduced a typo. This commit fixes it

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3382c5226c8e6e974dee7be39392652a203bb280)

7 years agonfs: ignore error on semanage command for ganesha_t
Guillaume Abrioux [Wed, 12 Sep 2018 13:02:06 +0000 (15:02 +0200)]
nfs: ignore error on semanage command for ganesha_t

As of rhel 7.6, it has been decided it doesn't make sense to confine
`ganesha_t` anymore. It means this domain won't exist anymore.

Let's add a `failed_when: false` so that the deployment does not fail
when trying to run this command.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1626070
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a6f77340fd942c7ce1a969347215cc5e3b18b1b2)

7 years agotests: pin sphinx version to 1.7.9
Guillaume Abrioux [Thu, 13 Sep 2018 10:48:25 +0000 (12:48 +0200)]
tests: pin sphinx version to 1.7.9

using sphinx 1.8.0 breaks our doc test CI job.

Typical error:

```
Exception occurred:
  File
  "/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py",  line 26, in <module>
      from sphinx.ext import doctest
      SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```

See: https://github.com/sphinx-doc/sphinx/issues/5417

Pinning to 1.7.9 to fix our CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f2c660d2566b9d8772c7dbee7cbcd005a61bfc2)

7 years agodefaults: add a default value to rgw_hostname v3.1.4
Guillaume Abrioux [Wed, 5 Sep 2018 11:20:47 +0000 (13:20 +0200)]
defaults: add a default value to rgw_hostname

Let's add ansible_hostname as a default value for rgw_hostname if no
hostname in the servicemap matches ansible_fqdn.

Fixes: #3063
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9ff26e80f2a628b34372edac65931df87b01a763)

7 years agotests: do not upgrade ceph release for switch_to_containers scenario
Guillaume Abrioux [Fri, 7 Sep 2018 17:38:41 +0000 (19:38 +0200)]
tests: do not upgrade ceph release for switch_to_containers scenario

Using `UPDATE_*` environment variables here would upgrade the ceph
release when running the switch_to_containers scenario, which is not
correct.

Eg:
If ceph luminous was first deployed, then we should switch to ceph
luminous containers, not to mimic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years agoRevert "client: add quotes to the dict values" v3.1.3
Guillaume Abrioux [Fri, 7 Sep 2018 14:54:42 +0000 (16:54 +0200)]
Revert "client: add quotes to the dict values"

The reverted commit added quotes that make the keyring unusable.

eg:

```
client.john
        key: AQAN0RdbAAAAABAAH5D3WgMN9Rxw3M8jkpMIfg==
        caps: [mds] ''
        caps: [mgr] 'allow *'
        caps: [mon] 'allow rw'
        caps: [osd] 'allow rw'
```

Trying to import such a keyring and use it will result:

```
Error EACCES: access denied
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1623417
This reverts commit 424815501a0c6072234a8e1311a0fefeb5bcc222.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ecbd3e45584791678e590172c5a0ceda7bd83623)

7 years agopurge: only purge /var/lib/ceph content
Sébastien Han [Mon, 27 Aug 2018 18:02:59 +0000 (11:02 -0700)]
purge: only purge /var/lib/ceph content

Sometimes /var/lib/ceph is mounted on a device, so we won't be able to
remove it (device busy); let's remove only its content.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 38dc20e74b89c1833d45f677f405fe758fd10c04)

7 years agorun rados cmd in container if containerized deployment v3.1.2
Tom Barron [Sat, 1 Sep 2018 14:32:51 +0000 (10:32 -0400)]
run rados cmd in container if containerized deployment

When ceph-nfs is deployed containerized and ceph-common is not
installed on the host, the start_nfs task fails because the rados
command is missing on the host.

Run rados commands from a ceph container instead so that
they will succeed.

Signed-off-by: Tom Barron <tpb@dyncloud.net>
(cherry picked from commit bf8f589958450ce07ec19d01fb98176ab50ab71f)

7 years agoroles: ceph-rgw: Enable the ceph-radosgw target
Markos Chandras [Wed, 29 Aug 2018 10:56:16 +0000 (11:56 +0100)]
roles: ceph-rgw: Enable the ceph-radosgw target

If the ceph-radosgw target is not enabled, then enabling the
ceph-radosgw@ service has no effect since nothing will pull
it on the next reboot. As such, we need to ensure that the
target is enabled.

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit 217f35dbdb5036274be4674e9b0be2127b8875d7)
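
With Ansible's systemd module the fix is roughly a one-task change (a sketch, not the exact backported task):

```yaml
- name: ensure ceph-radosgw.target is enabled
  systemd:
    name: ceph-radosgw.target
    enabled: true
```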

7 years agoDont run client dummy container on non-x86_64 hosts
Andy McCrae [Thu, 30 Aug 2018 07:53:36 +0000 (08:53 +0100)]
Dont run client dummy container on non-x86_64 hosts

The dummy client container currently won't work on non-x86_64 hosts.
This PR creates a filtered client group that contains only hosts
that are x86_64 - which can then be the group to run the
dummy container against.

This is for the specific case of a containerized_deployment where
there is a mixture of non-x86_64 hosts and x86_64 hosts. As such
the filtered group will contain all hosts when running with
containerized_deployment: false.

Currently ppc64le is not supported for Ceph server components.

Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
(cherry picked from commit 772e6b9be20ce82d3b8f9ffdf6b7bc4f6be842b8)

7 years agodoc: remove old statement v3.1.1
Sébastien Han [Mon, 27 Aug 2018 21:04:35 +0000 (14:04 -0700)]
doc: remove old statement

We have been supporting multiple devices for journaling in containerized
deployments for a while now and forgot about this.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622393
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 124fc727f472551ab2a14a8d6b9d3d54159a1b08)

7 years agoremove warning for unsupported variables
Sébastien Han [Mon, 27 Aug 2018 20:58:20 +0000 (13:58 -0700)]
remove warning for unsupported variables

As promised, these will go unsupported for 3.1 so let's actually remove
them :).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622729
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9ba670567e97b7ad16e6f623ae99a5ad3ee6d880)

7 years agosites: fix conditional
Sébastien Han [Mon, 27 Aug 2018 17:20:32 +0000 (10:20 -0700)]
sites: fix conditional

Same problem again... `ceph_release_num[ceph_release]` is only set in the
ceph-docker-common/common roles, so putting the condition on that role
will never work. Removing the condition.

The downside of this is that we will be installing packages and then skip the
role on the node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622210
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ae5ebeeb00214d9ea27929b4670c6de4ad27d829)

7 years agosite-docker.yml: remove useless condition
Sébastien Han [Thu, 23 Aug 2018 09:28:03 +0000 (11:28 +0200)]
site-docker.yml: remove useless condition

If we play site-docker.yml, we are already in a
containerized_deployment. So the condition is not needed.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 30cfeb5427535cd8dc98370ee33205be3b67bde0)

7 years agoci: stop using different images on the same run
Sébastien Han [Thu, 23 Aug 2018 09:21:54 +0000 (11:21 +0200)]
ci: stop using different images on the same run

There is no point in mixing Atomic hosts AND CentOS hosts in the same run. So
let's run containerized scenarios on Atomic only.

This solves this error here:

```
fatal: [client2]: FAILED! => {
    "failed": true
}

MSG:

The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'

The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: set_fact ceph_current_status (convert to json)
  ^ here
```

From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a

What's happening here is that all the hosts except the clients are running Atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
the condition will skip all the nodes except the clients; thus when running ceph-defaults, the task "is ceph running already?" is skipped, but the task above needs the rc of the skipped task.
This is not an error in the playbook, it's a CI setup issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 7012835d2b1880e7a6ef9a224df456b2dd1024cc)

7 years agorelease-note: stable-3.1 3095/head v3.1.0
Sébastien Han [Thu, 9 Aug 2018 09:13:50 +0000 (11:13 +0200)]
release-note: stable-3.1

stable-3.1 is approaching, so let's write our first release note.

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agodefaults: fix rgw_hostname v3.1.0rc21
Sébastien Han [Tue, 21 Aug 2018 18:50:31 +0000 (20:50 +0200)]
defaults: fix rgw_hostname

A couple of things were wrong in the initial commit:

* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
never work since the ceph_release fact is set in the roles afterwards, so
either ceph-common or ceph-docker-common sets it

* we can easily re-use the initial command to check if a cluster is
running; it's more elegant than running it twice.

* set the fact rgw_hostname on rgw nodes only

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 6d7fa99ff74b3ec25d1a6010b1ddb25e00c123be)

7 years agorolling_upgrade: set sortbitwise properly v3.1.0rc20
Sébastien Han [Tue, 21 Aug 2018 09:15:44 +0000 (11:15 +0200)]
rolling_upgrade: set sortbitwise properly

Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSDs are being updated, even though the package is updated
they won't report their updated version (12) and will stick with 10 if the
command is not applied. So we have to check whether OSDs are reporting
version 10 and then run the command to unlock them.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2e6e885bb75156c74735a65c05b4757b031041bb)

7 years agoiscsi group name preserve backward compatibility v3.1.0rc19
Sébastien Han [Mon, 20 Aug 2018 13:53:03 +0000 (15:53 +0200)]
iscsi group name preserve backward compatibility

Recently we renamed the group_name for iscsi gateways to iscsigws; previously
it was named iscsi-gws. Existing deployments with a host file section
using iscsi-gws must continue to work.

This commit adds the old group name for backward compatibility; no error
from Ansible should be expected, and if the hostgroup is not found nothing
is played.

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 77a3a682f358c8e9a40c5b50e980b5e9ec5f6d60)
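
One way to express the compatibility (a sketch; the real playbook wiring may differ): target both group names, relying on Ansible skipping a play whose hostgroup doesn't exist.

```yaml
- hosts:
    - iscsigws        # current group name
    - iscsi-gws       # legacy group name, kept for backward compatibility
  gather_facts: false
  roles:
    - ceph-iscsi-gw
```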

7 years agoosd: fix ceph_release
Sébastien Han [Mon, 20 Aug 2018 14:03:59 +0000 (16:03 +0200)]
osd: fix ceph_release

We need ceph_release in the condition, not ceph_stable_release

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1619255
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 8c70a5b1975c31992cdfa0a46a04bd9afbc1a806)

7 years agotake-over-existing-cluster: do not call var_files
Sébastien Han [Mon, 20 Aug 2018 12:41:06 +0000 (14:41 +0200)]
take-over-existing-cluster: do not call var_files

We were using var_files long ago when default variables were not in
ceph-defaults; now that the role exists, this is not needed. Moreover, having
these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

will create collisions and override necessary variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b7387068109a521796f8e423a61449541043f4e6)

7 years agoroles: ceph-defaults: Delegate cluster information task to monitor node
Markos Chandras [Tue, 14 Aug 2018 06:52:04 +0000 (09:52 +0300)]
roles: ceph-defaults: Delegate cluster information task to monitor node

Since commit f422efb1d6b56ce56a7d39a21736a471e4ed357 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployments with OpenStack-Ansible:

fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"

This is because the task executes 'ceph', but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role, which runs after the 'ceph-defaults' one.

Since we are looking to obtain cluster information, the task should be
delegated to a monitor node, similar to other tasks in that role.

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit 37e50114dedf6a7aec0f1b2e1b9d2dd997a11d8e)

7 years agoroles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname
Markos Chandras [Wed, 15 Aug 2018 06:55:49 +0000 (09:55 +0300)]
roles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname

If there are no services in the cluster, then the 'rgw' attribute could be
missing and the task fails with the following problem:

msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'

We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.

Signed-off-by: Markos Chandras <mchandras@suse.de>
(cherry picked from commit 126e2e3f92475a17f9a04e1e792ee6eb69fbfab0)

7 years agomgr: improve/fix disabled modules check
Dardo D Kleiner [Wed, 15 Aug 2018 12:50:19 +0000 (08:50 -0400)]
mgr: improve/fix disabled modules check

Follow up on 36942af6983d60666f3f8a1a06b352a440a6c0da

"disabled_modules" is always a list, it's the items in the list that
can be dicts in mimic.  Many ways to fix this, here's one.

Signed-off-by: Dardo D Kleiner <dardokleiner@gmail.com>
(cherry picked from commit f6519e4003404e10ae1f5e86298cffd4405591da)

7 years agolv-create: use copy instead of the template module v3.1.0rc18
Andrew Schoen [Thu, 9 Aug 2018 13:09:41 +0000 (08:09 -0500)]
lv-create: use copy instead of the template module

The copy module does in fact do variable interpolation, so we do not
need to use the template module or keep a template in the source.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 04df3f0802c0bc903172314d05a38e869f0eee6a)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agotests: cat the contents of lv-create.log in infra_lv_create
Andrew Schoen [Thu, 9 Aug 2018 12:26:58 +0000 (07:26 -0500)]
tests: cat the contents of lv-create.log in infra_lv_create

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit f5a4c8986982f277f6fd5bcd5b28c6099f655d79)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agolv-create: add an example logfile_path config option in lv_vars.yml
Andrew Schoen [Thu, 9 Aug 2018 12:26:22 +0000 (07:26 -0500)]
lv-create: add an example logfile_path config option in lv_vars.yml

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 131796f2750f1209a019ae75a500e6f1a1ab37f8)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agotests: adds a testing scenario for lv-create and lv-teardown
Andrew Schoen [Wed, 8 Aug 2018 22:12:30 +0000 (17:12 -0500)]
tests: adds a testing scenario for lv-create and lv-teardown

Using an explicitly named testing environment allows us to have a
specific [testenv] block for this test. This greatly simplifies how it
will work, as it doesn't really need anything from the ceph cluster tests.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 810cc47892e53701485c540ff51c00c860ea0a00)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agolv-teardown: fail silently if lv_vars.yml is not found
Andrew Schoen [Wed, 8 Aug 2018 22:04:29 +0000 (17:04 -0500)]
lv-teardown: fail silently if lv_vars.yml is not found

This allows the user to opt out of using lv_vars.yml and load the
configuration from other sources.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit b0bfc173510ec7d5da715c0048e633a8fe3d2a4d)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agolv-teardown: set become: true at the playbook level
Andrew Schoen [Wed, 8 Aug 2018 22:04:07 +0000 (17:04 -0500)]
lv-teardown: set become: true at the playbook level

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 8424858b40bafebe3569b33279e4d8d824e2276b)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agolv-create: fail silently if lv_vars.yml is not found
Andrew Schoen [Wed, 8 Aug 2018 21:49:34 +0000 (16:49 -0500)]
lv-create: fail silently if lv_vars.yml is not found

If a user decides not to use the lv_vars.yml file, then it should fail
silently so that the configuration can be picked up from other places.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit e43eec57bb44bf5f7a10da8548ca22a8772c2d92)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agolv-create: set become: true at the playbook level
Andrew Schoen [Wed, 8 Aug 2018 21:48:42 +0000 (16:48 -0500)]
lv-create: set become: true at the playbook level

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit fde47be13cc753218153b3dbc0db5a4daa752b21)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agolv-create: use the template module to write log file
Andrew Schoen [Wed, 8 Aug 2018 21:43:55 +0000 (16:43 -0500)]
lv-create: use the template module to write log file

The copy module will not expand the template and render the variables
included, so we must use template.

Creating a temp file and using it locally means that you must run the
playbook with sudo privileges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 35301b35af4e71713edb944eb54654b587710527)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks/vars/lv_vars.yaml: minor fixes
Neha Ojha [Tue, 7 Aug 2018 20:08:38 +0000 (20:08 +0000)]
infrastructure-playbooks/vars/lv_vars.yaml: minor fixes

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 909b38da829485b2ec56b61bf2b2fa0df02b0ed4)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks/lv-create.yml: use tempfile to create logfile
Neha Ojha [Tue, 7 Aug 2018 16:54:29 +0000 (16:54 +0000)]
infrastructure-playbooks/lv-create.yml: use tempfile to create logfile

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit f65f3ea89fdba98057172e32e1a43ee6370c04d9)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks/lv-create.yml: add lvm_volumes to suggested paste
Neha Ojha [Mon, 6 Aug 2018 18:14:37 +0000 (18:14 +0000)]
infrastructure-playbooks/lv-create.yml: add lvm_volumes to suggested paste

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 65fdad072386698ab9b6f7962107b6000f1f8378)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks/lv-create.yml: copy without using a template file
Neha Ojha [Mon, 6 Aug 2018 17:40:58 +0000 (17:40 +0000)]
infrastructure-playbooks/lv-create.yml: copy without using a template file

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 50a6d8141ced888f9c1ce5e90f9461e6a101d5bc)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks/lv-create.yml: don't use action to copy
Neha Ojha [Fri, 3 Aug 2018 20:32:58 +0000 (20:32 +0000)]
infrastructure-playbooks/lv-create.yml: don't use action to copy

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 186c4e11c7832eeb676409171d401a0ba2864f2a)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks: standardize variable usage with a space after brackets
Neha Ojha [Fri, 3 Aug 2018 20:08:31 +0000 (20:08 +0000)]
infrastructure-playbooks: standardize variable usage with a space after brackets

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 9d43806df9fe44496dd48c2b7d3bef2e59d92365)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agovars/lv_vars.yaml: remove journal_device
Neha Ojha [Fri, 3 Aug 2018 19:17:13 +0000 (19:17 +0000)]
vars/lv_vars.yaml: remove journal_device

Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit e0293de3e72e11f3aeba2d84b24bba1f7839ab57)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoinfrastructure-playbooks: playbooks for creating LVs for bucket indexes and journals
Ali Maredia [Tue, 24 Jul 2018 13:33:09 +0000 (13:33 +0000)]
infrastructure-playbooks: playbooks for creating LVs for bucket indexes and journals

These playbooks create and tear down logical
volumes for OSD data on HDDs and for a bucket index and
journals on 1 NVMe device.

Users should follow the guidelines set in vars/lv_vars.yaml

After the lv-create.yml playbook is run, output is
sent to /tmp/logfile.txt for copy and paste into
osds.yml

Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 1f018d861267a2bde7d8f2179d47d673bfcfb13f)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoRevert "osd: generate device list for osd_auto_discovery on rolling_update"
Sébastien Han [Mon, 13 Aug 2018 13:54:37 +0000 (15:54 +0200)]
Revert "osd: generate device list for osd_auto_discovery on rolling_update"

This reverts commit e84f11e99ef42057cd1c3fbfab41ef66cda27302.

This commit was causing a new failure later during the rolling_update
process. Basically, it was modifying the list of devices and started
impacting ceph-osd itself. The modification to accommodate the
osd_auto_discovery parameter should happen outside of the ceph-osd role.

Also, we are trying not to play the ceph-osd role during the
rolling_update process so we can speed up the upgrade.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3149b2564fb89f2352820d83be02c09f658bdf60)

7 years agorolling_update: register container osd units
Sébastien Han [Mon, 13 Aug 2018 13:59:25 +0000 (15:59 +0200)]
rolling_update: register container osd units

Before running the upgrade, let's call systemd to collect unit names
instead of relying on the device list. This is more accurate and fixes
the osd_auto_discovery scenario too.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit dad10e8f3f67f1e0c6a14ef3e0b1f51f90d9d962)

7 years agocontrib: fix generate group_vars samples
Sébastien Han [Mon, 13 Aug 2018 16:06:06 +0000 (18:06 +0200)]
contrib: fix generate group_vars samples

For the ceph-iscsi-gw and ceph-rbd-mirror roles the group names are (by
default) different from the role names, so we have to change the script
to generate the correct names.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1602327
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 315ab08b1604e655eee4b493eb2c1171a67df506)

7 years agoUse /var/lib/ceph/osd folder to filter osd mount point
Jeffrey Zhang [Mon, 13 Aug 2018 05:23:48 +0000 (13:23 +0800)]
Use /var/lib/ceph/osd folder to filter osd mount point

In some cases, a user may mount a partition on /var/lib/ceph; unmounting
it will fail, and there is no need to do so anyway.

Signed-off-by: Jeffrey Zhang <zhang.lei.fly@gmail.com>
(cherry picked from commit 85cc61a6d9f23cc98a817ea988c8b50e6c55698f)
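
The filtering change can be illustrated with a small sketch (the mount
list below is hypothetical): only mount points under /var/lib/ceph/osd
are selected, so a partition mounted directly on /var/lib/ceph is left
alone.

```python
# Hypothetical mount points reported on an OSD host.
mounts = [
    "/",
    "/var/lib/ceph",             # user-mounted partition: must not be unmounted
    "/var/lib/ceph/osd/ceph-0",
    "/var/lib/ceph/osd/ceph-1",
]

# Filter on the OSD folder prefix instead of /var/lib/ceph itself.
osd_mounts = [m for m in mounts if m.startswith("/var/lib/ceph/osd")]
print(osd_mounts)  # ['/var/lib/ceph/osd/ceph-0', '/var/lib/ceph/osd/ceph-1']
```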

7 years agostable 3.1 igw: add api setting support
Mike Christie [Mon, 13 Aug 2018 16:03:59 +0000 (11:03 -0500)]
stable 3.1 igw: add api setting support

Port the parts of this upstream commit:

commit 91bf53ee932a6748c464bea762f8fb6f07f11347
Author: Sébastien Han <seb@redhat.com>
Date:   Fri Mar 23 11:24:56 2018 +0800

   ceph-iscsi: support for containerize deployment

that allows configuration of
API settings in roles/ceph-iscsi-gw/templates/iscsi-gateway.cfg.j2
using the iscsi-gws.yml.

This fixes Red Hat BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1613963

Signed-off-by: Mike Christie <mchristi@redhat.com>
7 years agostable 3.1 igw: enable and start rbd-target-api
Mike Christie [Wed, 8 Aug 2018 16:49:38 +0000 (11:49 -0500)]
stable 3.1 igw: enable and start rbd-target-api

Backport
https://github.com/ceph/ceph-ansible/pull/2984
to stable 3.1.

From upstream commit:

commit 1164cdc002cccb9dc1c6f10fb6b4370eafda3c4b
Author: Guillaume Abrioux <gabrioux@redhat.com>
Date:   Thu Aug 2 11:58:47 2018 +0200

    iscsigw: install ceph-iscsi-cli package

installs the cli package but does not start and enable the
rbd-target-api daemon needed for gwcli to communicate with the igw
nodes. This just enables and starts it.

This fixes Red Hat BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1613963.

Signed-off-by: Mike Christie <mchristi@redhat.com>
7 years agogroup_vars: resync missing options
Sébastien Han [Mon, 13 Aug 2018 09:41:44 +0000 (11:41 +0200)]
group_vars: resync missing options

resync group_vars file with the defaults/main.yml files.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2dd75a1e6ed6d8b323ef647cbb5cf1d2a2f90154)

7 years agofail if fqdn deployment attempted
Guillaume Abrioux [Mon, 13 Aug 2018 13:41:08 +0000 (15:41 +0200)]
fail if fqdn deployment attempted

The fqdn configuration possibility caused a lot of trouble; it adds a
lot of complexity because of the multiple cases and the relation between
ceph-ansible and ceph-container. Moreover, there is no benefit to such
a feature.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613155
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years agoconfig: ensure rgw section has the correct name
Guillaume Abrioux [Thu, 9 Aug 2018 09:03:32 +0000 (11:03 +0200)]
config: ensure rgw section has the correct name

the ceph.conf.j2 template always assumes the hostname used to register
the radosgw in the servicemap is equivalent to `{{ ansible_hostname }}`,
which returns the shortname form.

We need to detect which form of the hostname was used in the case of an
already deployed cluster and update the ceph.conf accordingly.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1580408
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f422efb1d6b56ce56a7d39a21736a471e4ed357c)
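
A minimal sketch of the detection logic, assuming a hypothetical list of
hostnames already registered in the servicemap (the function name and
inputs are illustrative, not the role's actual task):

```python
def detect_rgw_hostname(registered_hosts, ansible_fqdn):
    # If the already-deployed cluster registered the radosgw under its
    # fqdn, keep that form; otherwise fall back to the shortname that
    # `ansible_hostname` would return.
    shortname = ansible_fqdn.split(".")[0]
    return ansible_fqdn if ansible_fqdn in registered_hosts else shortname

print(detect_rgw_hostname(["rgw0.example.com"], "rgw0.example.com"))  # fqdn kept
print(detect_rgw_hostname(["rgw0"], "rgw0.example.com"))              # shortname
```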

7 years agomgr: backward compatibility for module management
Guillaume Abrioux [Tue, 7 Aug 2018 12:46:07 +0000 (14:46 +0200)]
mgr: backward compatibility for module management

Follow up on 3abc253fecc91f29c90e23ae95e1b83f8ffd3de6

The structure had even changed within `luminous` release.
It was first:

```
{
    "enabled_modules": [
        "balancer",
        "dashboard",
        "restful",
        "status"
    ],
    "disabled_modules": [
        "influx",
        "localpool",
        "prometheus",
        "selftest",
        "zabbix"
    ]
}
```
Then it changed for:

```
{
  "enabled_modules": [
      "status"
  ],
  "disabled_modules": [
      "balancer",
      "dashboard",
      "influx",
      "localpool",
      "prometheus",
      "restful",
      "selftest",
      "zabbix"
  ]
}
```

and finally:
```
{
  "enabled_modules": [
      "status"
  ],
  "disabled_modules": [
      {
          "name": "balancer",
          "can_run": true,
          "error_string": ""
      },
      {
          "name": "dashboard",
          "can_run": true,
          "error_string": ""
      }
  ]
}
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 36942af6983d60666f3f8a1a06b352a440a6c0da)
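
A compatibility helper covering the three shapes above can be sketched
like this (a minimal illustration under those assumptions, not the
role's actual task):

```python
def disabled_module_names(disabled_modules):
    # Pre-Mimic the list items are plain strings; in Mimic each item is
    # a dict carrying "name", "can_run" and "error_string".
    return [m["name"] if isinstance(m, dict) else m for m in disabled_modules]

# Luminous-style output.
print(disabled_module_names(["influx", "localpool"]))
# Mimic-style output.
print(disabled_module_names([{"name": "balancer", "can_run": True, "error_string": ""}]))
```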

7 years agotests: resync iscsigw group name with master
Guillaume Abrioux [Fri, 10 Aug 2018 07:04:32 +0000 (09:04 +0200)]
tests: resync iscsigw group name with master

let's align the name of that group in stable-3.1 with the master branch.

Not having the same group name on different branches is confusing and
makes some nightly jobs fail in the CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years agotests: fix a typo in testinfra for iscsigws and jewel scenario
Guillaume Abrioux [Mon, 30 Jul 2018 13:58:27 +0000 (15:58 +0200)]
tests: fix a typo in testinfra for iscsigws and jewel scenario

group name for iscsi-gw nodes in testing is `iscsi-gws`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years agoosd: generate device list for osd_auto_discovery on rolling_update
Sébastien Han [Thu, 9 Aug 2018 13:18:34 +0000 (15:18 +0200)]
osd: generate device list for osd_auto_discovery on rolling_update

rolling_update relies on the list of devices when performing the restart
of the OSDs. The task that builds the devices list out of the
ansible_devices dict only runs when there are no partitions on the
drives. However, during an upgrade the OSDs are already configured: they
have been prepared and have partitions, so this task won't run and thus
the devices list will be empty, skipping the restart during
rolling_update. We now run the same task under different requirements
when rolling_update is true and build a list when:

* osd_auto_discovery is true
* rolling_update is true
* ansible_devices exists
* no dm/lv are part of the discovery
* the device is not removable
* the device has more than 1 sector

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit e84f11e99ef42057cd1c3fbfab41ef66cda27302)

7 years agorolling_update: add role ceph-iscsi-gw v3.1.0rc17
Sébastien Han [Thu, 10 May 2018 22:57:59 +0000 (15:57 -0700)]
rolling_update: add role ceph-iscsi-gw

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575829
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit e91648a7afab88e84aea64c6bb7627580d420466)

7 years agomon: fix calamari initialisation
Sébastien Han [Fri, 10 Aug 2018 09:08:14 +0000 (11:08 +0200)]
mon: fix calamari initialisation

If calamari is already installed and ceph has been upgraded to a higher
version, the initialisation will fail later. So if we detect that the
calamari-server is too old compared to ceph_rhcs_version, we try to
update it.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1601755
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4c9e24a90fca2271978a066e38dfadead88d8167)

7 years agorgw: remove useless condition v3.1.0rc16
Sébastien Han [Tue, 7 Aug 2018 12:15:23 +0000 (14:15 +0200)]
rgw: remove useless condition

The include does not need a condition on containerized_deployment since
we are already in an include that has the same condition.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 5a89479abe759844eb59bac190105d9ba34ed0b1)

7 years agorgw: remove unused file
Sébastien Han [Tue, 7 Aug 2018 12:53:04 +0000 (14:53 +0200)]
rgw: remove unused file

copy_configs.yml was not included anywhere and is a leftover, so let's remove it.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3bce117de2165b8cf3e47805bd80871da7361001)

7 years agorgw: ability to use ceph-ansible vars into containers
Sébastien Han [Fri, 27 Jul 2018 15:46:38 +0000 (17:46 +0200)]
rgw: ability to use ceph-ansible vars into containers

Since the container now simply reads the ceph.conf, we remove all the
unnecessary options.

Also, this PR is the foundation to support multiple backends, such as
the new 'beast' backend from Ceph Mimic.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582411
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4d64dd468696ed98e4c63d07d9f39216c6a7d3cb)

# Conflicts:
# roles/ceph-rgw/tasks/docker/main.yml

7 years agocommon: upgrade/install ceph-test deb first v3.1.0rc15
Ken Dreyer [Wed, 8 Aug 2018 17:07:14 +0000 (11:07 -0600)]
common: upgrade/install ceph-test deb first

When we deploy a Jewel cluster on Ubuntu with ceph_test: True, we're
unable to upgrade that cluster to Luminous.

"apt-get install ceph-common" fails to upgrade to luminous if a jewel ceph-test package is installed:

  Some packages could not be installed. This may mean that you have
  requested an impossible situation or if you are using the unstable
  distribution that some required packages have not yet been created
  or been moved out of Incoming.
  The following information may help to resolve the situation:

  The following packages have unmet dependencies:
   ceph-base : Breaks: ceph-test (< 12.2.2-14) but 10.2.11-1xenial is to be installed
   ceph-mon : Breaks: ceph-test (< 12.2.2-14) but 10.2.11-1xenial is to be installed

In ceph-ansible master, we resolve this whole class of problem by
installing all the packages in one operation (see
b338fafd90bbe489726b92d703bf4bc29d1caf6d).

For the stable-3.1 branch, take a less-invasive approach, and upgrade
ceph-test prior to any other package. This matches the approach I took
for RPMs in 3752cc6f38dbf476845e975e6448225c0e103ad6, before we had the
better solution in b338fafd90bbe489726b92d703bf4bc29d1caf6d.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610997
Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
7 years agoAllow mgr bootstrap keyring to be defined
Graeme Gillies [Mon, 30 Jul 2018 23:24:21 +0000 (09:24 +1000)]
Allow mgr bootstrap keyring to be defined

In environments where we wish to have manual/greater control over
how the bootstrap keyrings are used, we need to be able to externally
define what the mgr keyring secret will be and have ceph-ansible
use it, instead of it being autogenerated.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610213
Signed-off-by: Graeme Gillies <ggillies@akamai.com>
(cherry picked from commit a46025820d363dc3e91c380fd6b60fb6152b998b)

7 years agoResync rhcs_edits.txt v3.1.0rc14
Sébastien Han [Wed, 8 Aug 2018 13:51:34 +0000 (15:51 +0200)]
Resync rhcs_edits.txt

We were missing an option so let's add it back.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1519835
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 19518656a7966470842743da3c1c3bda0fa8c0f8)

7 years agotest: remove osd_crush_location from shrink scenarios
Sébastien Han [Tue, 7 Aug 2018 11:36:44 +0000 (13:36 +0200)]
test: remove osd_crush_location from shrink scenarios

This is not needed since this is already covered by docker_cluster and
centos_cluster scenarios.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 50be3fd9e8c0944cdddbd88bc8287e65765b0c63)

7 years agotest: follow up on osd_crush_location for containers
Sébastien Han [Tue, 7 Aug 2018 11:38:36 +0000 (13:38 +0200)]
test: follow up on osd_crush_location for containers

This was fixed by
https://github.com/ceph/ceph-ansible/commit/578aa5c2d54a680912e4e015b6fb3dbbc94d4fd0
for non-container deployments; we need to apply the same fix for containers.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 77d4023fbe2d8a57affb65ba05d1a987308a576e)

7 years agoiscsigw: install ceph-iscsi-cli package
Guillaume Abrioux [Thu, 2 Aug 2018 09:58:47 +0000 (11:58 +0200)]
iscsigw: install ceph-iscsi-cli package

Install ceph-iscsi-cli in order to provide the `gwcli` command tool.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1602785
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1164cdc002cccb9dc1c6f10fb6b4370eafda3c4b)

7 years agoFix in regular expression matching OSD ID on non-containerized
Artur Fijalkowski [Thu, 2 Aug 2018 11:28:44 +0000 (13:28 +0200)]
Fix in regular expression matching OSD ID on non-containerized
deployment.
restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do so, the script loops over the list of ceph-osd@ services in
the system. This commit fixes a bug in the regular expression
responsible for extracting the OSD IDs: the prior version used the
`[0-9]{1,2}` expression, which ignores all OSDs whose numbers are
greater than 99 (i.e. longer than two digits). The fix removes the
upper limit on the number of digits. This problem existed in two places
in the script.

Closes: #2964
Signed-off-by: Artur Fijalkowski <artur.fijalkowski@ing.com>
(cherry picked from commit 52d9d406b107c4926b582905b3d442feabf1fafc)
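
The regression can be reproduced in a few lines; the surrounding
pattern below is a hypothetical reconstruction of the script's match,
and only the `[0-9]{1,2}` vs `[0-9]+` difference comes from the commit:

```python
import re

unit = "ceph-osd@123.service"  # OSD ID with more than two digits

# Prior pattern: caps the ID at two digits, so OSD 123 is never matched.
old = re.search(r"ceph-osd@([0-9]{1,2})\.service", unit)
# Fixed pattern: no upper limit on the number of digits.
new = re.search(r"ceph-osd@([0-9]+)\.service", unit)

print(old)           # None
print(new.group(1))  # '123'
```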

7 years agodefaults: backward compatibility with fqdn deployments
Guillaume Abrioux [Tue, 31 Jul 2018 13:18:28 +0000 (15:18 +0200)]
defaults: backward compatibility with fqdn deployments

This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployments to be done with the shortname,
we must keep backward compatibility with clusters already deployed with
an fqdn configuration.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a6ff6bbf8a7b4ba4ab5236eca93325d8ee61b1b)

7 years agorolling_update: set osd sortbitwise
Sébastien Han [Mon, 23 Jul 2018 12:56:20 +0000 (14:56 +0200)]
rolling_update: set osd sortbitwise

Upgrading RHCS 2 -> RHCS 3 will fail if the cluster still has
sortnibblewise set; it stays stuck on "TASK [waiting for clean pgs...]"
as RHCS 3 OSDs will not start if nibblewise is set.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b3266c5be2f88210589cfa56a5fe0a5092f79ee6)

7 years agoconfig: enforce socket name
Sébastien Han [Mon, 30 Jul 2018 16:29:00 +0000 (18:29 +0200)]
config: enforce socket name

This was introduced by
https://github.com/ceph/ceph/commit/59ee2e8d3b14511e8d07ef8325ac8ca96e051784
and made our socket checks impossible to run. The PID could be found,
but the cctid could not. This happens during upgrades to Mimic and on
clusters running Mimic.

So let's force the admin socket name back to the way it was so we can
properly check for existing instances. The
$cluster-$name.$pid.$cctid.asok form is only needed when running
multiple instances of the same daemon, something ceph-ansible cannot do
at the time of writing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610220
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ea9e60d48d6631631ac9294d4ef291f8d7a30d78)
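
The difference between the two naming schemes can be shown with a short
sketch (the pid and cctid values below are made up):

```python
cluster, name = "ceph", "osd.0"
pid, cctid = 1234, 94027634

# Mimic's default: includes pid and cctid, which a health-check script
# cannot predict, so a fixed-path existence test no longer works.
mimic_sock = f"/var/run/ceph/{cluster}-{name}.{pid}.{cctid}.asok"

# Enforced (pre-Mimic) form: fully determined by cluster and daemon
# name, so the check can test a single known path.
legacy_sock = f"/var/run/ceph/{cluster}-{name}.asok"

print(mimic_sock)   # /var/run/ceph/ceph-osd.0.1234.94027634.asok
print(legacy_sock)  # /var/run/ceph/ceph-osd.0.asok
```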

7 years agotests: support update scenarios in test_rbd_mirror_is_up()
Guillaume Abrioux [Wed, 18 Jul 2018 09:07:49 +0000 (11:07 +0200)]
tests: support update scenarios in test_rbd_mirror_is_up()

`test_rbd_mirror_is_up()` is failing on update scenarios because it
assumes `ceph_stable_release` is still set to the value of the original
ceph release, which means it won't enter the right branch of the
condition and fails.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d8281e50f12eb299e2b419befab41cb7a0f39de2)

7 years agoigw: fix image removal during purge v3.1.0rc13
Mike Christie [Wed, 25 Jul 2018 18:13:17 +0000 (13:13 -0500)]
igw: fix image removal during purge

We were not passing the ceph conf info into the rbd image removal
command, so if the cluster name was not the default, igw purge would
fail due to the rbd rm command failing.

This just fixes the bug by passing in the ceph conf info, which has the
cluster name to use.

This fixes Red Hat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1601949

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d572a9a6020592607d98aa30cea75940428506b0)