This commit changes the bind mount option for the mount point
`/var/lib/ceph` in the systemd template for mon and mgr containers. This
is needed when collocating mon/mgr with OSDs in a dmcrypt scenario.
Once the mon/mgr nodes have been converted to containers, the dmcrypt layer sub mount is still visible in `/var/lib/ceph`. For some reason it keeps the corresponding devices busy, so no other container can open/close them. As a result, it prevents the OSDs from starting properly.
Since it only happens on the nodes converted before the OSD play, the idea is to bind mount `/var/lib/ceph` on the mon and mgr containers with the `rshared` option, so that once the sub mount is unmounted on the host, the unmount is propagated inside the container and the stale mount point is no longer seen.
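As a rough sketch of what that option change looks like (the variable names below are purely illustrative; the real change lives in the `-v` argument rendered into the mon/mgr systemd unit templates):
```
# Illustrative only: the bind mount passed to the container engine.
var_lib_ceph_bind_mount_before: "-v /var/lib/ceph:/var/lib/ceph:z"
var_lib_ceph_bind_mount_after: "-v /var/lib/ceph:/var/lib/ceph:z,rshared"
```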
Dimitri Savineau [Mon, 16 Nov 2020 15:31:11 +0000 (10:31 -0500)]
switch2container: chown symlink in mon/mgr plays
fa2bb3a only fixes the symlink owner/group issue in the OSD play. If the OSDs are collocated with other services like MONs and MGRs, the chown command in the mon/mgr plays will still fail.
When using `monitor_interface`, if the nodes don't have the same interface names then this task will fail like the following:
```
fatal: [argo010]: FAILED! => {
"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_enp1s0f0'\n\nThe error appears to have been in '/usr/share/ceph-ansible/roles/ceph-mon/tasks/docker/main.yml': line 19, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: ipv4 - force peer addition as potential bootstrap peer for cluster bringup - monitor_interface\n ^ here\n"
}
90f3f61 introduced the docker-to-podman.yml playbook but the ceph-osd-run.sh.j2 template still has some docker commands hardcoded instead of using the container_binary variable.
If the OSD directory uses symlinks to reference devices (like block, db and wal for bluestore, or journal for filestore) then the chown command could fail to change the owner:group on some systems.
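A minimal sketch of this kind of fix (assumed task shape and example paths, not necessarily the exact change): pass `-h` so chown changes the owner of the symlink itself instead of following it to the underlying device node.
```
- name: change ownership of OSD symlinks without following them
  command: chown -h ceph:ceph {{ item }}
  loop:
    # example paths only
    - /var/lib/ceph/osd/ceph-0/block
    - /var/lib/ceph/osd/ceph-0/block.db
```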
When running the switch2container playbook on a Debian based system, the systemd unit path isn't the same as on a Red Hat based system.
Because the old systemd unit files aren't removed, the new container systemd unit isn't taken into account.
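A hedged sketch of the kind of cleanup this implies (the tasks themselves are illustrative; the per-distribution paths are the usual packaging locations): remove the old non-container unit file from the distribution path and reload systemd so the new container unit in /etc/systemd/system is taken into account.
```
- name: remove the old non-container ceph-osd unit file
  file:
    path: "{{ '/lib/systemd/system' if ansible_os_family == 'Debian' else '/usr/lib/systemd/system' }}/ceph-osd@.service"
    state: absent

- name: reload systemd so the new container unit is picked up
  systemd:
    daemon_reload: true
```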
RPietrzak [Thu, 20 Aug 2020 13:17:22 +0000 (15:17 +0200)]
Remove 'run_once: true' from the 'wait for all osd to be up' task in the ceph-osd/tasks/main.yml role.
Together with the 'ansible_play_hosts_all | last' condition, 'run_once: true' made the task evaluate only on the first host of the play, where that condition is false, so the task ended up being skipped.
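A sketch of why the combination broke (the exact `when` expression is assumed): with `run_once: true` the task is only evaluated on the first host of the play, where the 'last host' condition is false, so it never runs; dropping `run_once` lets the condition alone select the last host.
```
# Before (always skipped on multi-node plays):
- name: wait for all osd to be up
  command: ceph --cluster {{ cluster }} osd stat   # retries/until omitted for brevity
  run_once: true
  when: inventory_hostname == ansible_play_hosts_all | last

# After: the 'last host' condition already restricts the task to one host.
- name: wait for all osd to be up
  command: ceph --cluster {{ cluster }} osd stat   # retries/until omitted for brevity
  when: inventory_hostname == ansible_play_hosts_all | last
```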
This node was needed for the upgrade job in stable-4.0.
Since we moved the erasure code pool testing into lvm_osds, we don't need to fire up that node anymore.
This commit moves the systemd rendering task into a `systemd.yml` file.
Otherwise, when running the docker-to-podman playbook, the systemd unit file isn't updated as it should be.
This commit makes the bindmount a bit more generic, otherwise it currently causes the OSDs to fail to start in an OSP FFU upgrade (with a RHEL7 to RHEL8 OS upgrade).
The docker2podman playbook is run from the ceph-ansible stable-3.2 branch against RHEL7 nodes, where `/var/run/lvmetad.socket` exists, but once the system is upgraded to RHEL8 this socket doesn't exist anymore and prevents the OSDs from starting after the reboot.
As a workaround we can make this bindmount a bit more generic, like what is done in the `stable-4.0` branch, by mounting `/run/lvm` instead.
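A rough before/after sketch of that bind mount (illustrative strings only, not the exact template content):
```
# RHEL7-only socket, gone after the upgrade to RHEL8:
lvm_bind_mount_before: "-v /var/run/lvmetad.socket:/var/run/lvmetad.socket"
# More generic mount, matching what stable-4.0 does:
lvm_bind_mount_after: "-v /run/lvm:/run/lvm"
```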
When using non-lvm scenarios (collocated or non-collocated), the disk_list variable isn't set because this is done in the ceph-osd role (start_osds.yml), which isn't executed by the docker2podman playbook.
Dimitri Savineau [Tue, 30 Jun 2020 14:13:42 +0000 (10:13 -0400)]
facts: explicitly disable facter and ohai
By default, Ansible gathers facts from facter and ohai if they are installed on the remote nodes. Given we don't need them, let's exclude these facts from our fact gathering.
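A minimal sketch of how that exclusion can be expressed (using the setup module's gather_subset negation; the actual change may live elsewhere, e.g. in ansible.cfg):
```
- name: gather facts without facter and ohai
  setup:
    gather_subset:
      - '!facter'
      - '!ohai'
```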
Dimitri Savineau [Fri, 26 Jun 2020 17:49:17 +0000 (13:49 -0400)]
ceph-osd: exit gracefully when no data partition
When using the collocated or non-collocated osd_scenarios (ceph-disk) and trying to determine the OSD_DEVICE from the OSD_ID passed to the systemd unit, we can be in a situation where the OSD hasn't been activated but the OSD ID exists.
This means the data partition isn't in the activated state and the ceph-disk list command won't show the OSD ID on the data partition.
This isn't backported from master because there are too many changes
between stable-3.2 and other newer branches.
NOTE:
This playbook *doesn't* add podman support in stable-3.2 at all.
This is a TripleO dedicated playbook which is intended to be run early during the FFU workflow in order to prepare the OS upgrade.
```
Warning, treated as error:
/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/source/day-2/upgrade.rst:2:Title underline too short.
```
The workflow in this playbook should be the same as in rolling_update: we should first set the noout and nodeep-scrub flags before migrating the first OSD, and unset them after the last OSD is migrated.
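A minimal sketch of that flag handling (task shape, delegation and the non-containerized ceph command are assumed): set the flags once from a monitor before the first OSD is touched, and unset them once at the end.
```
- name: set osd flags before migrating the first osd
  command: ceph --cluster {{ cluster }} osd set {{ item }}
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  loop:
    - noout
    - nodeep-scrub

# ... migrate the OSDs ...

- name: unset osd flags after the last osd is migrated
  command: ceph --cluster {{ cluster }} osd unset {{ item }}
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  loop:
    - noout
    - nodeep-scrub
```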
Dimitri Savineau [Mon, 22 Jun 2020 17:58:10 +0000 (13:58 -0400)]
docker: Add Requires on docker service
When using the docker container engine, the systemd unit scripts only declare a dependency on the docker daemon via the After parameter.
But if docker is restarted on a live system, the ceph systemd units should wait for the docker daemon to be fully restarted.
The workflow in this playbook should be the same as in rolling_update: we should first set the noout and nodeep-scrub flags before migrating the first OSD, and unset them after the last OSD is migrated.
This commit is the first of a series describing all the day-2 operations that are possible via ceph-ansible using the set of playbooks provided in the `infrastructure-playbooks` directory.
Rishabh Dave [Tue, 7 Apr 2020 11:50:35 +0000 (17:20 +0530)]
library/ceph_volume: look for error messages in stderr
Error messages were moved from stdout to stderr here -
https://github.com/ceph/ceph/commit/b8d6dcbe9f803c96c0af68da54f1262e9b6a9e77#diff-20f7c578a4e69ec61a5869d706567a24R137.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1793542
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 4249d1e02d6da07466a4ddf1282cf4600a131773)
We were not testing the right ansible_distribution fact value for the RHEL distribution.
This commit also updates the minimal RHEL version supported by RHCS.
add-osd: unset noup flag after last osd is deployed
This commit fixes a bug when using the `add-osd.yml` playbook.
The `noup` flag is set early but never gets unset before the "wait for pgs clean" check, so the playbook always fails because the OSDs are never seen UP.
With this change, the state `present` is enough to update a keyring.
If the keyring already exists, it will be updated if the caps or secret passed to the module are different.
If the keyring doesn't exist, it will be created.
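A hedged example of what this means for playbook authors (parameter names recalled from ceph-ansible's ceph_key module and not guaranteed to be exact): a single `state: present` task now both creates the keyring and reconciles its caps/secret.
```
- name: create or update a client keyring
  ceph_key:
    name: client.myapp        # example entity name
    state: present
    caps:
      mon: "allow r"
      osd: "allow rw pool=myapp"
```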
osd: support changing default rule even when osd_crush_location isn't defined
Creating crush rules even with no crush hierarchy configuration is a valid scenario, so we shouldn't be bound to the result of the first task (which configures the crush hierarchy) to be able to add new crush rules.
John Fulton [Thu, 6 Feb 2020 02:23:54 +0000 (21:23 -0500)]
The _filtered_clients list should intersect with ansible_play_batch
Client configuration with --limit fails without this patch because certain tasks are only run on the first host in the _filtered_clients list, and it's likely that first host will not be included in what's specified with --limit. To fix this, the _filtered_clients list should be built from all clients in the inventory that are also in the running play.
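A minimal sketch of that intersection (the group variable and exact expression are assumed):
```
- name: build the client list from hosts that are actually in the running play
  set_fact:
    _filtered_clients: "{{ groups[client_group_name] | default([]) | intersect(ansible_play_batch) }}"
  run_once: true
```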
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798781
Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit e4bf4857f556465c60f89d32d5f2a92d25d5c90f)
Benoît Knecht [Mon, 20 Jan 2020 10:36:27 +0000 (11:36 +0100)]
ceph-rgw: Fix customize pool size "when" condition
In 3c31b19ab39f297635c84edb9e8a5de6c2da7707, I fixed the `customize pool
size` task by replacing `item.size` with `item.value.size`. However, I
missed the same issue in the `when` condition.
for instance. However, doing so would create pools of size
`osd_pool_default_size` regardless of the `size` value. This was due to
the fact that the Ansible task used
Dimitri Savineau [Mon, 10 Feb 2020 18:43:31 +0000 (13:43 -0500)]
ceph-{mon,osd}: move default crush variables
Since ed36a11 we moved the crush rules creation code from the ceph-mon to the ceph-osd role.
To keep backward compatibility we kept the possibility to set the crush variables on the mons side, but we didn't move the default values.
As a result, when crush_rule_config is set to true and the default values for crush_rules are expected, the crush rule creation task will fail:
"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute
'crush_rules'"
This patch moves the default crush variables from the ceph-mon to the ceph-osd role, but also uses those default values when nothing is defined on the mons side.
When using the ceph aliases with commands that require manual intervention to stop (like pressing Ctrl+C), the command keeps running inside the container.
To handle this, we should use the interactive session options (-it) with the docker commands.
Mike Christie [Tue, 28 Jan 2020 22:31:55 +0000 (16:31 -0600)]
iscsi: Fix crashes during rolling update
During a rolling update we will run the ceph iscsigw tasks that start the daemons, then run the configure_iscsi.yml tasks which can create iscsi objects like targets, disks, clients, etc. The problem is that once the daemons are started they will accept configuration requests, or may want to update the system themselves. Those operations can then conflict with the configure_iscsi.yml tasks that set up objects, and we can end up with crashes due to the kernel being in an unsupported state.
This could also happen during initial creation, but it is less likely because no objects have been set up yet, so there are no watchers or users accessing the gateways yet. The fix in this patch works for both update and initial setup.
validate: allow running ceph-ansible 3.2 against ansible 2.7
This commit allows ceph-ansible 3.2 to be run against ansible 2.7
However, note that running stable-3.2 against ansible 2.7 doesn't get any testing upstream and might break the playbook; only ansible 2.6 is officially supported.
Dimitri Savineau [Tue, 28 Jan 2020 15:27:34 +0000 (10:27 -0500)]
ceph-defaults: remove rgw from ceph_conf_overrides
The [rgw] section, whether set in the ceph.conf file or via the ceph_conf_overrides variable, isn't a valid section and has no effect.
To apply overrides to all radosgw instances we should use either the
[global] or [client] sections.
Overrides per radosgw instance should still use the
[client.rgw.{instance-name}] section.
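A hedged example of the valid sections in ceph_conf_overrides (the option values and the instance name are illustrative and depend on the deployment):
```
ceph_conf_overrides:
  # applies to all radosgw instances (and other clients)
  client:
    rgw_enable_usage_log: true
  # applies to one specific radosgw instance
  "client.rgw.{{ ansible_hostname }}":
    rgw_frontends: "beast port=8080"
```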
To avoid confusion, let's change the default value from `0.0.0.0` to
`x.x.x.x`.
Users might think setting `0.0.0.0` will make the daemon bind on all interfaces.
This commit adds a playbook to be played before we run the purge playbook; it first creates an rbd image, then maps an rbd device on client0 so the purge playbook will try to unmap it.
In a containerized context, using the binary provided by the Atomic OS won't work because it's an old version provided by ceph-common based on 10.2.5.
Using a container could be an option, but for a large cluster with hundreds of client nodes that would require pulling the image on each of them just to unmap the rbd devices.
Let's use the sysfs method in order to avoid any issue related to the ceph version shipped on the host.
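A minimal sketch of the sysfs approach (assumed task, relying on the classic /sys/bus/rbd interface): each mapped device has an id under /sys/bus/rbd/devices, and writing that id to /sys/bus/rbd/remove unmaps it without needing the rbd binary.
```
- name: unmap rbd devices through sysfs
  shell: |
    for id in $(ls /sys/bus/rbd/devices 2>/dev/null); do
      echo "$id" > /sys/bus/rbd/remove
    done
```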
Dimitri Savineau [Wed, 27 Nov 2019 14:29:06 +0000 (09:29 -0500)]
ceph-osd: wait for all osds once
cf8c6a3 moves the 'wait for all osds' task from openstack_config to the main tasks list.
But the openstack_config code was executed only on the last OSD node.
We don't need to run this check on every OSD node, so we need to set run_once to true on that task.
Dimitri Savineau [Tue, 26 Nov 2019 16:09:11 +0000 (11:09 -0500)]
ceph-osd: wait for all osd before crush rules
When creating crush rules with the device class parameter we need to be sure that all OSDs are up and running, because the device class list is populated with this information.
This is now enabled for all scenarios, not only openstack_config.
Dimitri Savineau [Thu, 31 Oct 2019 20:24:12 +0000 (16:24 -0400)]
ceph-osd: add device class to crush rules
This adds device class support to crush rules when using the class key
in the rule dict via the create-replicated sub command.
If the class key isn't specified then we use the create-simple sub
command for backward compatibility.
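A hedged illustration of the rule dict (key names assumed from the message above): with a `class` key ceph-ansible would go through `ceph osd crush rule create-replicated`, without it through `create-simple`.
```
crush_rules:
  - name: replicated_hdd_rule
    root: default
    type: host
    class: hdd          # device class -> create-replicated
  - name: replicated_rule
    root: default
    type: host          # no class key -> create-simple
```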
Dimitri Savineau [Thu, 31 Oct 2019 20:17:33 +0000 (16:17 -0400)]
move crush rule creation from mon to osd role
If we want to create crush rules with the create-replicated sub command and a device class, then we need to have the OSDs created before the crush rules, otherwise the device classes won't exist.
Dimitri Savineau [Mon, 16 Dec 2019 21:41:20 +0000 (16:41 -0500)]
switch_to_containers: set GUID on lockbox part
The ceph lockbox partition (partition number 5) used with non-lvm scenarios and in non-containerized deployments doesn't have a valid PARTUUID. The value is set to 00000000-0000-0000-0000-000000000000 for each OSD device.
When switching to a containerized deployment we manually mount the lockbox partition by using the PARTUUID.
Unfortunately, because we usually have multiple OSDs on the same node, we can't have the right symlink in /dev/disk/by-partuuid because it will point to only one partition.
After the switch_to_containers playbook, only one OSD will restart correctly and the others will try to access the wrong device, causing errors like 'xxxx is still in use'.
When deploying with containers and dmcrypt OSDs we force a PARTUUID value during the ceph-disk prepare task.
When using `osd_auto_discovery`, `devices` is built multiple times due to multiple runs of the `ceph-facts` role. It ends up with duplicate instances of the same device in the list.
Using the `unique` filter when building the list fixes this issue.
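A minimal sketch of the deduplication (the device filtering done by ceph-facts is omitted; only the `unique` part matters here):
```
- name: build the devices list without duplicates
  set_fact:
    devices: "{{ (devices | default([]) + ['/dev/' + item.key]) | unique }}"
  with_dict: "{{ ansible_devices }}"
```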
When using fqdn in the inventory, that playbook fails because some tasks use the result of `ceph osd tree` (which returns shortnames) to get data from hostvars[].
This commit adds support for the ceph-iscsi stable repository when `ceph_repository` is set to community, instead of always using the devel repositories.
We're still using the devel repositories for rtslib and tcmu-runner in both cases (dev and community).
ansible.cfg: do not enforce PreferredAuthentications
There's no need to enforce PreferredAuthentications by default.
Users can still choose to override the ansible.cfg with any additional
parameter like this one to fit their infrastructure.
The systemd unit script wasn't updated with the new container name
format (without the hostname).
We now have the same start/stop docker commands for all scenarios.
During the device-to-id OSD migration we need to be sure that the old containers named with the hostname are stopped.
The previous approach was wrong: checking if `item.key` is in `osd_auto_discovery_exclude` (`['dm-', 'loop']`) is incorrect because an exact membership test against those prefixes will obviously never match. Therefore, the condition will return `True` whatever device we are checking.
Dimitri Savineau [Wed, 27 Nov 2019 16:27:09 +0000 (11:27 -0500)]
switch_to_containers: fix umount ceph partitions
When a container is already running on a non containerized node, the umount ceph partition task is skipped. This is due to the container ps command, which always returns 0 even if the filter matches nothing.
We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0
Also we should not fail on the ceph directory listing.
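A hedged sketch of those two conditions (the mount point list and exact commands are illustrative):
```
- name: check for running ceph-osd containers
  command: "{{ container_binary }} ps -q --filter name=ceph-osd"
  register: osd_containers
  failed_when: false
  changed_when: false

- name: umount ceph osd data partitions
  command: umount {{ item }}
  loop: "{{ ceph_osd_mount_points | default([]) }}"   # hypothetical list of /var/lib/ceph/osd/* mounts
  when: osd_containers.rc != 0 or osd_containers.stdout | length > 0
```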
Dimitri Savineau [Wed, 20 Nov 2019 19:40:52 +0000 (14:40 -0500)]
rolling_update: don't enable ceph-mon unit
On non containerized deployments the ceph-mon hostname/fqdn systemd services are stopped at the beginning of the mon upgrade.
But the enabled parameter is set to true for both tasks, so even if we're not using the fqdn it will enable the systemd unit based on it.
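A hedged sketch of the stop tasks once enabling is no longer forced (task shape assumed): stop both the shortname and fqdn flavours, but don't touch their enabled state.
```
- name: stop ceph-mon systemd units (shortname and fqdn flavours)
  systemd:
    name: "ceph-mon@{{ item }}"
    state: stopped
  failed_when: false
  loop:
    - "{{ ansible_hostname }}"
    - "{{ ansible_fqdn }}"
```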