git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log

rgw: multisite refact

Add the possibility to deploy rgw multisite configuration with a mix of
secondary and primary zones on a same rgw node.
Before that, on a same node, all instances were either primary
zones *OR* secondary.

Now you can define a rgw instance like following:

```
rgw_instances:
  - instance_name: 'rgw0'
    rgw_zonemaster: false
    rgw_zonesecondary: true
    rgw_zonegroupmaster: false
    rgw_realm: 'france'
    rgw_zonegroup: 'zonegroup-france'
    rgw_zone: paris-00
    radosgw_address: "{{ _radosgw_address }}"
    radosgw_frontend_port: 8080
    rgw_zone_user: jacques.chirac
    rgw_zone_user_display_name: "Jacques Chirac"
    system_access_key: P9Eb6S8XNyo4dtZZUUMy
    system_secret_key: qqHCUtfdNnpHq3PZRHW5un9l0bEBM812Uhow0XfB
    endpoint: http://192.168.101.12:8080
```

Basically it's now possible to define `rgw_zonemaster`,
`rgw_zonesecondary` and `rgw_zonegroupmaster` at the intsance
level instead of the whole node level.

Also, this commit adds an option `deploy_secondary_zones` (default True)
which can be set to `False` in order to explicitly ask the playbook to
not deploy secondary zones in case where the corresponding endpoint are
not deployed yet.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1915478
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 71a5e666e39b11cd7945afa28a9f6fbe7de8c2b7)

library: fix bug in radosgw_zone.py

If for some reason `get_zonegroup()` returns a failure, we must handle
and make the module exit properly instead of failing with the following
python trace:

```
Traceback (most recent call last):
  File "./AnsiballZ_radosgw_zone.py", line 247, in <module>
    _ansiballz_main()
  File "./AnsiballZ_radosgw_zone.py", line 234, in _ansiballz_main
    exitcode = debug(sys.argv[1], zipped_mod, ANSIBALLZ_PARAMS)
  File "./AnsiballZ_radosgw_zone.py", line 202, in debug
    runpy.run_module(mod_name='ansible.modules.radosgw_zone', init_globals=None, run_name='__main__', alter_sys=True)
  File "/usr/lib64/python3.6/runpy.py", line 205, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vagrant/.ansible/tmp/ansible-tmp-1610728441.41-685133-218973990589597/debug_dir/ansible/modules/radosgw_zone.py", line 467, in <module>
    main()
  File "/home/vagrant/.ansible/tmp/ansible-tmp-1610728441.41-685133-218973990589597/debug_dir/ansible/modules/radosgw_zone.py", line 463, in main
    run_module()
  File "/home/vagrant/.ansible/tmp/ansible-tmp-1610728441.41-685133-218973990589597/debug_dir/ansible/modules/radosgw_zone.py", line 425, in run_module
    zonegroup = json.loads(_out)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fedb36688dbf05521966ee22e831283f0967d17a)

ceph-defaults: change default ceph container tag

The "latest" ceph container tag references the latest stable release
(octopus at the moment). "latest" is an alias on "latest-octopus".
On the devel branch we should use "latest-master" tag instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7d567719756199f34d6cb05347435905a8c4aece)

module_utils: don't add newline to the data

When executing a command via the run_command method and passing some
data with stdin then the default behavior is to add append a newline.
This breaks the value of password used by our modules.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 661690857726de9a63abc3c9503f6bcae52cb975)

dashboard: manage password backward compatibility

The ceph dashboard changed the way the password are provided via the
CLI.
This breaks the backward compatibility when using a recent ceph-ansible
version with ceph release without that feature.
This patch adds tasks for legacy workflow (ceph release without that
feature) in both ceph-dashboard role and ceph_dashboard_user module.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

dashboard: configure passwords via stdin

Due to recent changes in ceph, the few dashboard passwors
must be passed via `-i`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ef975ef5eacb14b66a8982ccb27a38ce552abcc4)

library: refact ceph_dashboard_user

refact this module due to recent changes in ceph pacific.
The password must be passed with `-i` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2725db3e9f42472fdf63c81e5b0236bc6449cf4b)

mon: fix cephx disabled deployment

Due to missing condition on `cephx` variable, cephx disabled deployments
are broken.
This commit fixes this.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1910151
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4af084570274c2ffdc601c9113242df4808cb726)

fs2bs: skip migration when a mix of fs and bs is detected

Since the default of `osd_objectstore` has changed as of 3.2, some
deployments might have a mix of filestore and bluestore OSDs on a same
node. In some specific cases, there's a possibility that a filestore OSD
shares a journal/db device with a bluestore OSD. We shouldn't try to
redeploy in this context because ceph-volume will complain. (either
because in lvm batch you can't pass partition or about gpt header).
The safest option is to skip the migration on the node when such a mix
is detected or force all osds including those already using bluestore
(option `force_filestore_to_bluestore=True` has to be passed as an extra var).
If all OSDs are using filestore, then they will be migrated to
bluestore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875777
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e66f12d1387e7fa86138ae18d3026a1f31328b6b)

switch2container: fix mon quorum check

The current check makes no sense because it checks any of other monitor
than the one being played (either a previous one already converted or a
next that isn't yet converted) is present on the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1909011
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 175ffa1b882960e8127ada7f6a4b1e6c9a9b8fba)

Path for ceph config missing in crash template

The path where ceph.conf is located (/etc/ceph) missing in the Docker container bind mounts, this throws errors

Signed-off-by: Mike Currin <currin@gmail.com>
(cherry picked from commit 4cbc9a48c9cd025df7dcd438c5bcf975868638c0)

rgw: support switching from single-site to multisite

When collocating rgw with either a mon, mgr or osd, switching from
single site to a multisite rgw setup failed because of the handlers
triggered between the ansible play of the collocated daemon and the play
of the rgw. Since the multisite changes are not yet applied the handlers
fail.
The idea here is to ensure we run the multisite configuration from the
ceph-handler role before the restart happens, this way it won't complain
because of non existing multisite configuration.

(Note: this is also valid when simply changing a multisite configuration)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1888630
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 513c8cfe551da78ea89d2513dffebd9649bfeb44)

library: remove containerized parameter from cv

The ceph-volume module relies on environment variables to determine if
the command should be executed within a container or not.
The containerized parameter isn't used anymore and we can remove it.

Fixes: #6153
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 613ab11b9be8cfe2c9ef4f1e1ff12510d794c6c7)

cephadm: remove loop on host add tasks

Instead of iterate over the host list for adding the node/label to the
host orchestrator configuration then we can do it parallelly.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5b6f907a72e78e2d5424f0a59e336efc2f36fffc)

library: add cephadm_bootstrap module

This adds cephadm_bootstrap ansible module for replacing the command module
usage with the cephadm bootstrap command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c3ed124d310298d894ed340730c5d4dd265629ed)

library: add cephadm_adopt module

This adds cephadm_adopt ansible module for replacing the command module
usage with the cephadm adopt command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 08f118077fde6706c31a1a953ea1b0d5812f7201)

library: add missing `target_size_ratio` parameter support in ceph_pool module

When creating a new pool, target_size_ratio was ignored by ansible module ceph_pool.py.
target_size_ratio is now used when pg_autoscale_mode is on.
Tests added to library tests.
This adds too the use in the role ceph-rgw.

Signed-off-by: Fabien Brachere <fabien.brachere@celeste.fr>
(cherry picked from commit 4026ba9da136fef03d0070a02bed066e021d362a)

ceph-config: fix ceph-volume lvm batch report

Since the major ceph-volume lvm batch refactoring, the report value
is different.
Before the refact, the report was a dict with the OSDs list to be created
under the "osds" key.
After the refact, the report is a list of dict.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 827b23353ff389660f393cb36fb36540c362e8cf)

library: add ceph_osd_flag module

This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a50605b228107752588a907169eb381)

monitoring: use config_template module for config

The alertmanager, grafana and prometheus configuration file are
generated with the template module which doesn't allow for using
config overrides.
Instead we could use the config_template plugin action and add a
new variable for overrides (one for each component).

With this patch, one should be able to add configuration to
prometheus with the following:

---
alertmanager_conf_overrides:
global:
smtp_smarthost: 'localhost:25'
...

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1902999
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5a410263470932f2f1a22572e0f2c42939591402)

ceph-osd: use global crush_device_class in lvm_volumes

Use global crush_device_class variable if it's not set per OSD

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
(cherry picked from commit 5e9444fa5c283587d38e823012c5bfdb434a8f2c)

tests: force box removal

This avoids interactive mode for `vagrant box remove`.
This can happen for some reason when there's leftover from previous
deployment (VMs not destroyed as expected)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 011c97786bc08e306267b6cc4374e0f9df6cb416)

tests: rgw_multisite playbook test refactor

Currently we create an object from the primary sites but we try to read
that object still from the master which doesn't make sense, we should
try to read it from a secondary site.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e2ea403d5ef938a3ea12276004eaa71f9919a4a3)

fix broken ceph-fetch-keys role

set fetch_directory variable in default/main.yml instead of using the
defaults jinja filter in tasks/main.yml.

Fixes: #6072
Signed-off-by: Karl-Heinz Preuß <karl-heinz.preuss@cms.hu-berlin.de>
(cherry picked from commit 6ce34ef59fe33874ae7876d27b991e23c6281fdb)

Revert "config: Always use osd_memory_target if set"

This reverts commit 4d1fdd2b05d55f8028fb5593d41fa61dbddd7095.

This breaks the backward compatibility with previous osd_memory_target
calculation and we could have a value lower than the minimum value allowed
(896M) which causes some ceph commands to fail (like ceph assimilate-conf).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit aa6e1f20eaa5272ff0bb5e8b3cded16273aa120c)

purge-container-cluster: always prune force

Since podman 2.x, there's now a confirmation when running podman
container prune command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0108c9f941c3b66faa3f2463cde4d4f0587ca9e4)

tests/vagrant: update box version to CentOS 8.3

This updates the CentOS libvirt box version to 8.3

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 801e7a29cfd708751b066cc44f2bfd614d40c196)

ceph-mon: No become during gen mon initial keyring

Since the backing generate_secret() just hands out urandom output,
running as privileged doesn't seem to be required. It's not
desireable to provide sudo in some Ansible runner environments.

Signed-off-by: Jukka Nousiainen <jukka.nousiainen@csc.fi>
(cherry picked from commit eb7473491b25c5f899a110f6ae1076ef5096d6d5)

rhcs: drop fetch_directory override

Since the fetch_directory variable has been dropped then we don't need
the override in rhcs file.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a2cbab16a41dfa8508ebac6fb4c1c8053a4eb6ff)

common: do not use pipefail when not needed

Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`

Fixes: #6090
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86a8889ee3adffc81d3435895a7a117824e779ad)

osd: add tag on 'wait for all osd to be up' task

This allows skipping this task if really desired.
Use it carefully. Use it at your own risk.

Fixes: #6073
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5c4ae5356d9ae10573d699236897969052f77823)

iscsigw: remove `--cap-add=all` from `podman run` cmd

As of podman `2.0.5`, `--cap-add` and `--privileged` are exclusive
options.

```
Nov 30 13:56:30 magna089 podman[171677]: Error: invalid config provided: CapAdd and privileged are mutually exclusive options
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1902149
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d40dd764e004f9765e5d4e12507cdf3c707a3271)

container: remove `--ignore` from `podman rm` command

As of podman 2.0.5, `--ignore` param conflicts with `--storage`.
```
Nov 30 13:53:10 magna089 podman[164443]: Error: --storage conflicts with --volumes, --all, --latest, --ignore and --cidfile
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c68b124ba89e0e4e7c4845b6dd1ce98be8e074d4)

switch2containers: do not stop ceph.target in osd play

`ceph.target` should be disabled only. Otherwise, in collocation
scenario you stop other collocated services in the OSD play which isn't
what we want to do. Each daemon has its corresponding play for managing
the transition to container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901865
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0b05620597121c6388b7fbf227fb01f8efb2bda6)

alertmanager/prometheus: fix owner/group

Set the owner/group on alertmanager and prometheus directories and
files to nobody and nogroup (uid and gid 65534) to avoid permission
issues.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901543
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit eb452d35bc7bae076ed7727494dc52e6528b21a3)

mon: refact initial keyring generation

adding monitor is no longer possible because we generate a new mon
keyring each time the playbook is run.

Fixes: #5864
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 970c6a4ee6923588adb81d8c49185ff8e340d52e)

mon: replace `command` task by `copy`

We can achieve this task using `copy` module.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ff2ca270f6e3c28c66950d8bb07aeddc373c0ae)

ceph-iscsi: set the pool name in the config file

When using a custom pool for iSCSI gateway then we need to set the pool
name in the configuration otherwise the default rbd pool name will be
used.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 40a87c4b92c658f601b7170400d8b2cb3f849673)

tests: use github workflow for nbsp char check

Let's use a github workflow instead of travis for this.

With this commit we can get rid of Travis.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 94c37b9de89ffd93449e77f7a90ad50b700fd0db)

lint: ignore 302,303,505 errors

ignore 302,303 and 505 errors

[302] Using command rather than an argument to e.g. file
[303] Using command rather than module
[505] referenced files must exist

they aren't relevant on these tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 195d88fcda60e28e246245a1bf566b6e412b2ab3)

lint: do not use 'local_action'

Fix ansible-lint 504 error:

[504] Do not use 'local_action', use 'delegate_to: localhost'

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c948b668ebe7cf7af61090ac6f9b93e0a16e14ee)

lint: trailing whitespace

Fix ansible-lint 201 error:

[201] Trailing whitespace

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit dfc7e6e4bdd9edf49080b59ff03ac560bd990f86)

lint: all tasks should be named

Fix ansible-lint 502 error:

[502] All tasks should be named

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 97dd9218dd0493edc68bc9e69cb39fae8924f6dc)

lint: use shell only when shell functionality is required

Fix ansible-lint 305 error:

[305] Use shell only when shell functionality is required

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 11b4bf5083639abea66a874ba86ac38a1b706ca6)

lint: don't compare to literal true/false

Fix ansible lint 601 error:

[601] Don't compare to literal True/False

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2011e4dbc8e4c90dfb60ecd4c59325782d95986c)

lint: variables should have spaces before and after

Fix ansible lint 206 error:

[206] Variables should have spaces before and after: {{ var_name }}

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9fba6eecfa5de7415ebe6ba6a1ab496412de8eed)

lint: commands should not change things

Fix ansible lint 301 error:

[301] Commands should not change things if nothing needs doing

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5450de58b37cb169adecd87f547551b976d4b9bf)

lint: set pipefail on shell tasks

Fix ansible lint 306 error:

[306] Shells that use pipes should set the pipefail option

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9ed46db9adfa6803ef4a03b15b6914b)

tests: use github workflow for ansible-lint

let's use github workflow instead of travis.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d4400f911aee4f49f775cc04630d37e127ddd6cd)

osd: ensure /var/lib/ceph/osd/{cluster}-{id} is present

This commit ensures that the `/var/lib/ceph/osd/{{ cluster }}-{{ osd_id }}` is
present before starting OSDs.

This is needed specificly when redeploying an OSD in case of OS upgrade
failure.
Since ceph data are still present on its devices then the node can be
redeployed, however those directories aren't present since they are
initially created by ceph-volume. We could recreate them manually but
for better user experience we can ask ceph-ansible to recreate them.

NOTE:
this only works for OSDs that were deployed with ceph-volume.
ceph-disk deployed OSDs would have to get those directories recreated
manually.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1898486
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 873fc8ec0ff12fa1d1b45c5400050f15d0417480)

ceph-facts: fix read osd pool default crush fact

We don't need to use run_once on that task when having running monitors
otherwise the read task could be skip and the set task will fail.

The conditional check 'crush_rule_variable.rc == 0' failed. The error
was: error while evaluating conditional (crush_rule_variable.rc == 0):
'dict object' has no attribute 'rc'

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1898856
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e150df789edb966549ba2a8f2415a844ce612d46)

tests: use github workflow for pytest

Move the pytest testing from TravisCI to Github workflow.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3e79f0322a703003ab1af51104b69a1d2162951e)

tests: enforce pytest-rerunfailures version

This commit enforces the pytest-rerunfailures installed so it's <9.0

This is to avoid the following error:

```
ERROR: pytest-rerunfailures 9.0 has requirement pytest>=5.0, but you'll have pytest 4.6.11 which is incompatible.
```

latest version of pytest-rerunfailures isn't compatible with the version
of pytest we are using.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 19097026fbda71752a500119b0c99c1a9f8d523d)

containers: modify bindmount option

This commit changes the bind mount option for the mount point
`/var/lib/ceph` in the systemd template for mon and mgr containers. This
is needed in case of collocating mon/mgr with osds using dmcrypt
scenario.
Once mon/mgr got converted to containers, the dmcrypt layer sub mount is
still seen in `/var/lib/ceph`. For some reason it makes the
corresponding devices busy so any other container can't open/close it.
As a result, it prevents osds from starting properly.

Since it only happens on the nodes converted before the OSD play, the idea is
to bind mount `/var/lib/ceph` on mon and mgr with the `rshared` option
so once the sub mount is unmounted, it is propagated inside the
container so it doesn't see that mount point.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896392
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f5ba6d9b0117d283c44cc96af1810bf4cbb29b0a)

container: force rm --storage on ExecStartPre

This is a workaround to avoid error like following:
```
Error: error creating container storage: the container name "ceph-mgr-magna022" is already in use by "4a5f674e113f837a0cc561dea5d2cd55d16ca159a647b7794ab06c4c276ef701"
```

that doesn't seem to be 100% reproducible but it shows up after a
reboot. The only workaround we came up with at the moment is to run
`podman rm --storage <container>` before starting it.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1887716
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ba7824c55e4a5d6732208859293ac3f47bb54a2)

switch2container: chown symlink in mon/mgr plays

fa2bb3a only fix the symlink owner/group issue in the OSD play. If the
OSDs are collocated with other services like MONs and MGRs then the
chown command will fail.

$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896448
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 35ed9977aac9afbcad4f726a865891f0e84b4680)

ceph-facts: Fix osd_pool_default_crush_rule fact

The `osd_pool_default_crush_rule` is set based on `crush_rule_variable`, which
is the output of a `grep` command.

However, two consecutive tasks can set that variable, and if the second task is
skipped, it still overwrites the `crush_rule_variable`, leading the
`osd_pool_default_crush_rule` to be set to `ceph_osd_pool_default_crush_rule`
instead of the output of the first task.

This commit ensures that the fact is set right after the `crush_rule_variable`
is assigned, before it can be overwritten.

Closes #5912

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit c5f7343a2f696ab3bfef77e735eafdeae4e4883b)

config: Always use osd_memory_target if set

The osd_memory_target variable was only used if it was higher than the
calculated value based on the number of OSDs. This is changed to always
use the value if it is set in the configuration. This allows this value
to be intentionally set lower so that it does not have to be changed
when more OSDs are added later.

Signed-off-by: Gaudenz Steinlin <gaudenz.steinlin@cloudscale.ch>
(cherry picked from commit 4d1fdd2b05d55f8028fb5593d41fa61dbddd7095)

switch2container: disable ceph-osd enabled-runtime

When deploying the ceph OSD via the packages then the ceph-osd@.service
unit is configured as enabled-runtime.
This means that each ceph-osd service will inherit from that state.
The enabled-runtime systemd state doesn't survive after a reboot.
For non containerized deployment the OSD are still starting after a
reboot because there's the ceph-volume@.service and/or ceph-osd.target
units that are doing the job.

$ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t
ceph-osd@.service     enabled-runtime
ceph-volume@.service  enabled
ceph-osd.target       enabled

When switching to containerized deployment we are stopping/disabling
ceph-osd@XX.servive, ceph-volume and ceph.target and then removing the
systemd unit files.
But the new systemd units for containerized ceph-osd service will still
inherit from ceph-osd@.service unit file.

As a consequence, if an OSD host is rebooting after the playbook execution
then the ceph-osd service won't come back because they aren't enabled at
boot.

This patch also adds a reboot and testinfra run after running the switch
to container playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fa2bb3af86b48befd3901939d38eda20dff6f5e5)

main: followup on pr 6012

This tag can be set at the play level.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2fa17520c425117f87048a1d555c2e73c9e6cf6e)

Add ceph_client tag to execute or skip the playbook

There are some use cases where there's a need to skip the execution
of the ceph-ansible client role even though the client section of the
inventory isn't empty.
This can happen in contexts where the services are colocated or when
a all-in-one deployment is performed.
The purpose of this change is adding a 'ceph_client' tag to avoid
altering the ceph-ansible execution flow but at the same time be able
to include or exclude a set of tasks using this tag.

Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit fafd5f871a81f5e8cdba6e531e499a9678b2dcad)

dashboard: change dashboard_grafana_api_no_ssl_verify default value

This sets the `dashboard_grafana_api_no_ssl_verify` default value
according to the length of `dashboard_crt` and `dashboard_key`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5cadfea42e8dd31e019568cdfe1b0f3d64f5dcc4)

dashboard: enable https by default

see linked bz for details

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1889426
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 767d3c898e2d8f7dddb655fd98827d5da8b338e8)

osd: Fix number of OSD calculation

If some OSDs are to be created and others already exist the calculation
only counted the to be created OSDs. This changes the calculation to
take all OSDs into account.

Signed-off-by: Gaudenz Steinlin <gaudenz.steinlin@cloudscale.ch>
(cherry picked from commit 15044da03052fcb4a3c45f344f41e06b0d418e4d)

rolling_update: fix mgr start with mon collocation

cec994b introduced a regression when a mgr is collocated with a mon.
During the mon upgrade, the mgr service is masked to avoid to be
restarted on packages update.
Then the start mgr task is failing because the service is still masked.
Instead we should unmask it.

Fixes: #5983
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3d3ce263274d648f8fb376716f52b8b91b6f1313)

infrastructure: consume ceph_fs module

bd611a7 introduced the new ceph_fs module but missed some tasks in
rolling_update and shrink-mds playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 16afe90806ea24503302bf6cf85b75b744def275)

rolling_update: use ceph health instead of ceph -s

The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb679f5f5251b3414793680042ee3be394)

rgw/rbdmirror: use service dump instead of ceph -s

The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the rgw/rbdmirror services status, we're only using the
servicmap structure in the ceph status output.
To optimize this, we could use the ceph service dump command which contains
the same needed information.
This command returns less information and is slightly faster than the ceph
status command.

$ ceph status -f json | wc -c
2001
$ ceph service dump -f json | wc -c
1105
$ time ceph status -f json > /dev/null

real 0m0.557s
user 0m0.516s
sys 0m0.040s
$ time ceph service dump -f json > /dev/null

real 0m0.454s
user 0m0.434s
sys 0m0.020s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f9081931f8a369b075060083cdb225e3477f99a)

monitor: use quorum_status instead of ceph status

The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.

$ ceph status -f json | wc -c
2001
$ ceph quorum_status -f json | wc -c
957
$ time ceph status -f json > /dev/null

real 0m0.577s
user 0m0.538s
sys 0m0.029s
$ time ceph quorum_status -f json > /dev/null

real 0m0.544s
user 0m0.527s
sys 0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12169e08fc299dbd2fcaecc9d42dedca)

osds: use pg stat command instead of ceph status

The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real 0m0.529s
user 0m0.503s
sys 0m0.024s
$ time ceph pg stat -f json > /dev/null

real 0m0.426s
user 0m0.409s
sys 0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee505885908ac2ae15bf201a638359faaf78d251)

osds: use ceph osd stat instead of ceph status

Improve the checked way of the OSD created checking process.
This replaces the ceph status command by the ceph osd stat command.
The osdmap structure isn't needed anymore.

$ ceph status -f json | wc -c
2001
$ ceph osd stat -f json | wc -c
132
$ time ceph status -f json > /dev/null

real    0m0.563s
user    0m0.526s
sys     0m0.036s
$ time ceph osd stat -f json > /dev/null

real 0m0.457s
user 0m0.411s
sys 0m0.045s

Signed-off-by: wangxiaotong <wangxiaotong@fiberhome.com>
(cherry picked from commit b9cb0f12e9e79600f1a974dd88ba1ed1d833211f)

common: follow up on #5948

In addition to f7e2b2c608eef4bbba47586f1e24d6ade1572758

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 371d854a5c03cfb30d27d5cdbaaad61f7f8d6c58)

openstack: use ceph_keyring_permissions by default

Otherwise this task fails if no permission is set on the item.
Previously the code omited the mode parameter if it was not set, but
this was lost with commit ab370b6ad823e551cfc324fd9c264633a34b72b5.

Signed-off-by: Gaudenz Steinlin <gaudenz.steinlin@cloudscale.ch>
(cherry picked from commit 79ff79c422e88e5ec848bec880ef01a87ceeb298)

podman: force log driver to journald

Since we've changed to podman configuration using the detach mode and
systemd type to forking then the container logs aren't present in the
journald anymore.
The default conmon log driver is using k8s-file.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1890439
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 16cd183b9cb827156ab83cba6c9b85d341d681be)

ceph-mon: Don't set monitor directory mode recursively

After rolling updates performed with
`infrastructure-playbooks/rolling_updates.yml`, files located in
`/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }}` had mode 0755 (including
the keyring), making them world-readable.

This commit separates the task that configured permissions recursively on
`/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }}` into two separate tasks:

1. Set the ownership and mode of the directory itself;
2. Recursively set ownership in the directory, but don't modify the mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 0d76826bbb7b0b9303583c31147ebad9e5c420f9)

ceph-handler: fix curl ipv6 command with rgw

When using the curl command with ipv6 address and brackets then we need
to use the -g option otherwise the command fails.

$ curl http://[fdc2:328:750b:6983::6]:8080
curl: (3) [globbing] error: bad range specification after pos 9

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cdb7b09cd7631eb1af0c70360c0f9959526bc795)

common: drop `fetch_directory` feature

This commit drops the `fetch_directory` feature.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1cc9666c09c616cc743b4f2f05d2b52cfd6c32cb)

ceph-config: ceph.conf rendering refactor

This commit cleans up the `main.yml` task file of `ceph-config`.
It drops the local ceph.conf generation.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 900c0f44925ec0c6c1acb16433044ac40717e00e)

iscsi: fix ownership on iscsi-gateway.cfg

This file is currently deployed with '0644' ownership making this file
readable by any user on the system.
Since it contains sensitive information it should be readable by the
owner only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1890119
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a822f773002a010ebedddcc2c8cd8f5a03dc786a)

crash: refact caps definition

there is no need to use `{{ }}` syntax here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a8bd947c7dbd62b8acb9ac4fc1a2aad08a06546f)

ceph-volume: refresh lvm metadata cache

When running rhel8 containers on a rhel7 host, after zapping an OSD
there's a discrepancy with the lvmetad cache that needs to be refreshed.
Otherwise, the host still sees the lv and can makes the user confused.
If user tries to redeploy an OSD, it will fail because the LV isn't
present and need to be recreated.

ie:

```
stderr: lsblk: ceph-block-8/block-8: not a block device
stderr: blkid: error: ceph-block-8/block-8: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm prepare [-h] --data DATA [--data-size DATA_SIZE]
                               [--data-slots DATA_SLOTS] [--filestore]
                               [--journal JOURNAL]
                               [--journal-size JOURNAL_SIZE] [--bluestore]
                               [--block.db BLOCK_DB]
                               [--block.db-size BLOCK_DB_SIZE]
                               [--block.db-slots BLOCK_DB_SLOTS]
                               [--block.wal BLOCK_WAL]
                               [--block.wal-size BLOCK_WAL_SIZE]
                               [--block.wal-slots BLOCK_WAL_SLOTS]
                               [--osd-id OSD_ID] [--osd-fsid OSD_FSID]
                               [--cluster-fsid CLUSTER_FSID]
                               [--crush-device-class CRUSH_DEVICE_CLASS]
                               [--dmcrypt] [--no-systemd]
ceph-volume lvm prepare: error: Unable to proceed with non-existing device: ceph-block-8/block-8
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886534
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0bb106045ee10e08c157134b6e00ab846ce26e1f)

ceph-osd: Fix check mode for start osds tasks

Correctly set `osd_ids_non_container.stdout_lines` to an empty list if it's
undefined (i.e. in check mode).

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8b0023cb77ce79ab4669783f16a5c295d54ce247)

ceph-mon: Fix check mode for deploy monitor tasks

Skip the `get initial keyring when it already exists` task when both commands
whose `stdout` output it requires have been skipped (e.g. when running in check
mode).

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8f436ab5d80c924d8841215307c17e38a70fb4bd)

ceph-crash: Only deploy key to targeted hosts

The current task installs the ceph-crash key to "most" hosts via
"delegate_to". This key is only used by the ceph-crash daemon and should
just be installed on all hosts targeted by this role. There is no need
for using a delegated task.

Signed-off-by: Gaudenz Steinlin <gaudenz.steinlin@cloudscale.ch>
(cherry picked from commit 68cc93fb18d516a04e288418811787355fb0582e)

flake8: run the workflow conditionally

We don't need to run flake8 on ansible modules and their tests if we
don't have any modifitions.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 00b7ee27df59fb0d5a537f6c0ad11c910695126d)

ceph-osd: start osd after systemd overrides

The service should be started after the ceph-osd systemd overrides has
been added, otherwise, the latter isn't considered.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1860739
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 59d0f0199243de40bde714d1a9019b1715c57dbf)

ceph-osd: don't start the OSD services twice

Using the + operation on two lists doesn't filter out the duplicate
keys.
Currently each OSDs is started (via systemd) twice.
Instead we could use the union filter.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4eaa65c36256189b352c88c1058e888550adbd0f)

handler: refact check_socket_non_container

the `stat --printf=%n` returns something like following:

```
ok: [osd0] => changed=false
  cmd: |-
    stat --printf=%n /var/run/ceph/ceph-osd*.asok
  delta: '0:00:00.009388'
  end: '2020-10-06 06:18:28.109500'
  failed_when_result: false
  rc: 0
  start: '2020-10-06 06:18:28.100112'
  stderr: ''
  stderr_lines: <omitted>
  stdout: /var/run/ceph/ceph-osd.2.asok/var/run/ceph/ceph-osd.5.asok
  stdout_lines: <omitted>
```

it makes the next task "check if the ceph osd socket is in-use" grep
like this:

```
ok: [osd0] => changed=false
  cmd:
  - grep
  - -q
  - /var/run/ceph/ceph-osd.2.asok/var/run/ceph/ceph-osd.5.asok
  - /proc/net/unix
```

which will obviously fail because this path never exists. It makes the
OSD handler broken.

Let's use `find` module instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 46d4d97da9c6078a6c5ff60a39db4b4072fb902b)

Fix Ansible check mode for site.yml.sample playbook

Make sure the `site.yml.sample` playbook can be run in check mode by skipping
tasks that try to read the output of commands that have been skipped.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 54ba38e35ea67c1c342b008be675103e120982d0)

tests: change cephfs pool size

`all_daemons` scenario can't handle pools with `size: 3` because we have
1 osd node in root=HDD and two nodes in root=default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5713ea5d51868855df368e285faff484342ef04)

library: add radosgw_zone module

This adds radosgw_zone ansible module for replacing the command module
usage with the radosgw-admin zone command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1281e8bcc810508c259acf537feef6c6d8677a6f)

library: add radosgw_zonegroup module

This adds radosgw_zonegroup ansible module for replacing the command
module usage with the radosgw-admin zonegroup command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 65dbe0782e29ac177113b7b6a408ee2da7113942)

library: add radosgw_realm module

This adds radosgw_realm ansible module for replacing the command module
usage with the radosgw-admin realm command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d171f4068d5dccdcfbf970607342d36550b011b0)

library: add radosgw_user module

This adds radosgw_user ansible module for replacing the command module
usage with the radosgw-admin user command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 235c7e27cca5ae8bde92465c46808402e8e5e597)

library: add ceph_fs module

This adds the ceph_fs ansible module for replacing the command module
usage with the ceph fs command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd611a785b52eaad0b31d913dfc62b75aef157bc)

ceph_key: support using different keyring

Currently the `ceph_key` module doesn't support using a different
keyring than `client.admin`.
This commit adds the possibility to use a different keyring.

Usage:
```
      ceph_key:
        name: "client.rgw.myrgw-node.rgw123"
        cluster: "ceph"
        user: "client.bootstrap-rgw"
        user_key: /var/lib/ceph/bootstrap-rgw/ceph.keyring
        dest: "/var/lib/ceph/radosgw/ceph-rgw.myrgw-node.rgw123/keyring"
        caps:
          osd: 'allow rwx'
          mon: 'allow rw'
          import_key: False
        owner: "ceph"
        group: "ceph"
        mode: "0400"
```

Where:
`user` corresponds to `-n (--name)`
`user_key` corresponds to `-k (--keyring)`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 12e6260266dec04b4b2d25f3508aa7149fd16714)

rgw: fix multi instances scaleout in baremetal

When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook in baremetal deployments.

When ceph-osd notifies handlers, it means rgw handlers are triggered
too. The issue with this is that they are triggered before the role
ceph-rgw is run.
In the case a scaleout operation is expected on `radosgw_num_instances`
it causes an issue because keyrings haven't been created yet so the new
instances won't start.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881313
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a802fa2810e50e87f61e3a64c27f8826ba6aa250)

tests: reboot and test idempotency on collocation

test reboot and idempotency on collocation scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f83f798206566b714adbc55e2543cbd9529897fa)

ceph_key: remove backward compatibility

It's time to remove this backward compatibility. Users had enough time
to convert their openstack_keys and key values.
We now fail in ceph-validate if the caps key isn't set.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c9603626399501e91dca66fdf8ea434cfe93d2d7)

infrastructure-playbooks: drop add-osd playbook

This playbook isn't needed anymore, we can achieve this operation by
running main playbook with `--limit` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 20718582da841f9efa8a88776da508d0632a1281)