git-server-git.apps.pok.os.sepia.ceph.com Git

podman: force log driver to journald

Since we've changed to podman configuration using the detach mode and
systemd type to forking then the container logs aren't present in the
journald anymore.
The default conmon log driver is using k8s-file.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1890439
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 16cd183b9cb827156ab83cba6c9b85d341d681be)

ceph-handler: fix curl ipv6 command with rgw

When using the curl command with ipv6 address and brackets then we need
to use the -g option otherwise the command fails.

$ curl http://[fdc2:328:750b:6983::6]:8080
curl: (3) [globbing] error: bad range specification after pos 9

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cdb7b09cd7631eb1af0c70360c0f9959526bc795)

iscsi: fix ownership on iscsi-gateway.cfg

This file is currently deployed with '0644' ownership making this file
readable by any user on the system.
Since it contains sensitive information it should be readable by the
owner only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1890119
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a822f773002a010ebedddcc2c8cd8f5a03dc786a)

ceph-osd: Fix check mode for start osds tasks

Correctly set `osd_ids_non_container.stdout_lines` to an empty list if it's
undefined (i.e. in check mode).

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8b0023cb77ce79ab4669783f16a5c295d54ce247)

ceph-mon: Fix check mode for deploy monitor tasks

Skip the `get initial keyring when it already exists` task when both commands
whose `stdout` output it requires have been skipped (e.g. when running in check
mode).

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8f436ab5d80c924d8841215307c17e38a70fb4bd)

crash: refact caps definition

there is no need to use `{{ }}` syntax here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a8bd947c7dbd62b8acb9ac4fc1a2aad08a06546f)

ceph-volume: refresh lvm metadata cache

When running rhel8 containers on a rhel7 host, after zapping an OSD
there's a discrepancy with the lvmetad cache that needs to be refreshed.
Otherwise, the host still sees the lv and can makes the user confused.
If user tries to redeploy an OSD, it will fail because the LV isn't
present and need to be recreated.

ie:

```
stderr: lsblk: ceph-block-8/block-8: not a block device
stderr: blkid: error: ceph-block-8/block-8: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm prepare [-h] --data DATA [--data-size DATA_SIZE]
                               [--data-slots DATA_SLOTS] [--filestore]
                               [--journal JOURNAL]
                               [--journal-size JOURNAL_SIZE] [--bluestore]
                               [--block.db BLOCK_DB]
                               [--block.db-size BLOCK_DB_SIZE]
                               [--block.db-slots BLOCK_DB_SLOTS]
                               [--block.wal BLOCK_WAL]
                               [--block.wal-size BLOCK_WAL_SIZE]
                               [--block.wal-slots BLOCK_WAL_SLOTS]
                               [--osd-id OSD_ID] [--osd-fsid OSD_FSID]
                               [--cluster-fsid CLUSTER_FSID]
                               [--crush-device-class CRUSH_DEVICE_CLASS]
                               [--dmcrypt] [--no-systemd]
ceph-volume lvm prepare: error: Unable to proceed with non-existing device: ceph-block-8/block-8
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886534
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0bb106045ee10e08c157134b6e00ab846ce26e1f)

ceph-crash: Only deploy key to targeted hosts

The current task installs the ceph-crash key to "most" hosts via
"delegate_to". This key is only used by the ceph-crash daemon and should
just be installed on all hosts targeted by this role. There is no need
for using a delegated task.

Signed-off-by: Gaudenz Steinlin <gaudenz.steinlin@cloudscale.ch>
(cherry picked from commit 68cc93fb18d516a04e288418811787355fb0582e)

flake8: run the workflow conditionally

We don't need to run flake8 on ansible modules and their tests if we
don't have any modifitions.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 00b7ee27df59fb0d5a537f6c0ad11c910695126d)

ceph-osd: start osd after systemd overrides

The service should be started after the ceph-osd systemd overrides has
been added, otherwise, the latter isn't considered.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1860739
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 59d0f0199243de40bde714d1a9019b1715c57dbf)

ceph-osd: don't start the OSD services twice

Using the + operation on two lists doesn't filter out the duplicate
keys.
Currently each OSDs is started (via systemd) twice.
Instead we could use the union filter.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4eaa65c36256189b352c88c1058e888550adbd0f)

handler: refact check_socket_non_container

the `stat --printf=%n` returns something like following:

```
ok: [osd0] => changed=false
  cmd: |-
    stat --printf=%n /var/run/ceph/ceph-osd*.asok
  delta: '0:00:00.009388'
  end: '2020-10-06 06:18:28.109500'
  failed_when_result: false
  rc: 0
  start: '2020-10-06 06:18:28.100112'
  stderr: ''
  stderr_lines: <omitted>
  stdout: /var/run/ceph/ceph-osd.2.asok/var/run/ceph/ceph-osd.5.asok
  stdout_lines: <omitted>
```

it makes the next task "check if the ceph osd socket is in-use" grep
like this:

```
ok: [osd0] => changed=false
  cmd:
  - grep
  - -q
  - /var/run/ceph/ceph-osd.2.asok/var/run/ceph/ceph-osd.5.asok
  - /proc/net/unix
```

which will obviously fail because this path never exists. It makes the
OSD handler broken.

Let's use `find` module instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 46d4d97da9c6078a6c5ff60a39db4b4072fb902b)

library: Fix new-style modules check mode

Running the `ceph_crush.py`, `ceph_key.py` or `ceph_volume.py` modules in check
mode resulted in the following error:

```
New-style module did not handle its own exit
```

This was due to the fact that they simply returned a `dict` in that case,
instead of calling `module.exit_json()`.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 85dd4058145436e86a12ad9f015f5228189437d5)

Fix Ansible check mode for site.yml.sample playbook

Make sure the `site.yml.sample` playbook can be run in check mode by skipping
tasks that try to read the output of commands that have been skipped.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 54ba38e35ea67c1c342b008be675103e120982d0)

tests: change cephfs pool size

`all_daemons` scenario can't handle pools with `size: 3` because we have
1 osd node in root=HDD and two nodes in root=default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5713ea5d51868855df368e285faff484342ef04)

ceph_key: support using different keyring

Currently the `ceph_key` module doesn't support using a different
keyring than `client.admin`.
This commit adds the possibility to use a different keyring.

Usage:
```
      ceph_key:
        name: "client.rgw.myrgw-node.rgw123"
        cluster: "ceph"
        user: "client.bootstrap-rgw"
        user_key: /var/lib/ceph/bootstrap-rgw/ceph.keyring
        dest: "/var/lib/ceph/radosgw/ceph-rgw.myrgw-node.rgw123/keyring"
        caps:
          osd: 'allow rwx'
          mon: 'allow rw'
          import_key: False
        owner: "ceph"
        group: "ceph"
        mode: "0400"
```

Where:
`user` corresponds to `-n (--name)`
`user_key` corresponds to `-k (--keyring)`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 12e6260266dec04b4b2d25f3508aa7149fd16714)

rgw: fix multi instances scaleout in baremetal

When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook in baremetal deployments.

When ceph-osd notifies handlers, it means rgw handlers are triggered
too. The issue with this is that they are triggered before the role
ceph-rgw is run.
In the case a scaleout operation is expected on `radosgw_num_instances`
it causes an issue because keyrings haven't been created yet so the new
instances won't start.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881313
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a802fa2810e50e87f61e3a64c27f8826ba6aa250)

tests: reboot and test idempotency on collocation

test reboot and idempotency on collocation scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f83f798206566b714adbc55e2543cbd9529897fa)

flake8: fix pep8 syntax on tests/functional/tests/

tests/conftest.py and tests present in tests/functional/tests/ has been
missed from previous commit

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8596f1d52c27d22d0e8b7052c9b56c0c350e7fc7)

# Conflicts:
# .github/workflows/flake8.yml

flake8: fix all tests/library/*.py files

This commit modifies all *.py files in ./tests/library/ so flake8
passes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e49a5241f007083e7ec8618115738b3d3d8873cc)

tests: refact flake8 workflow

drop ricardochaves/python-lint action and use `run` steps instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f2d3432cad0d8256356a9b64bf9bb2827948f40c)

(cherry picked from commit 8f6e0b2a186dbb2a25bc7eec93d4dde3b94e3b15)

tests: add github workflows

Add github workflow. Especially for flake8 for now.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1ee626a1b3c277c3b33e1c30ab65416a776c7b96)
(cherry picked from commit ed6ae6815d665069c9b6f7d790390dceb0f037c9)

library: flake8 ceph-ansible modules

This commit ensure all ceph-ansible modules pass flake8 properly.

Signed-off-by: Wong Hoi Sing Edison <hswong3i@gmail.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 268a39ca0e8698dff9faec1b558d1d99006215aa)
(cherry picked from commit 32a2f04cbc288d79e9a228593be854469a0ecefb)

library: remove legacy file

This file is a leftover and should have been removed when we dropped the
validate module.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8603cba9abb6adcc4c0bd8d8fab1a5772584a159)

fs2bs: support `osd_auto_discovery` scenario

This commit adds the `osd_auto_discovery` scenario support in the
filestore-to-bluestore playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881523
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8b1eeef18a4ae953cf568e750c5cf1e3e96e0b78)

ceph-facts: add get default crush rule from running monitor

In case of deploying new monitor node to an existing cluster,
osd_pool_default_crush_rule should be taken from running monitor because
ceph-osd role won't be run and the new monitor will have different
osd_pool_default_crush_role from other monitors.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
(cherry picked from commit ff9f4d138f988d908b5f5583e0c1fcf5dd72e36d)

rgw multisite: check connection for realm endpoint

This commit adds connection checks before realm pulls
Curls are performed on the endpoint being pulled from
the mons and the rgws

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1731158
Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 902575369c73dfdb0e94d898bbabb9ebd5b12c93)

ceph-handler: set handler on xxx_stat result

In non containerized deployment we check if the service is running
via the socket file presence.
This is done via the xxx_socket_stat variable that check the file
socket in the /var/run/ceph/ directory.
In some scenarios, we could have the socket file still present in
that directory but not used by any process.
That's why we have the xxx_stat variable which clean those leftovers.

The problem here is that we're set the variable for the handlers status
(like handler_mon_status) based on xxx_socket_stat instead of xxx_stat.
That means we will trigger the handlers if there's an old socket file
present on the system without any process associated.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1866834
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 733596582d0788a52795bc40b1a5cd94ddef0446)

ceph-facts: check for mon socket in its own host

delegate to its own host after checking mon socket to findout if mon socket is in-use or not.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
(cherry picked from commit 69f7e353823fffd9dff505679f9c9dbeb4fd810d)

mds: support enabling pg autoscaler on rerun

This commit add the pg autoscaler enablement support on ceph-ansible
rerun.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1836431
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-config: remove ceph_release from ceph.conf.j2

We don't use ceph_release variable in the ceph.conf jinja template.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 62bd41f0d4e20d73793d8e8d6a8db722aac146c2)

ansible.cfg: remove cfg file in infrastructure-playbooks

There's no need ot have a copy of this file in infrastructure-playbooks
directory.
playbooks in that directory can be run from the root dir of
ceph-ansible.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f906caa6dada4a88bfaca1ab9a42ed63358e8ae3)

ansible.cfg: set force_valid_group_names param

As of 2.10, group names containing a dash are invalid.
However, setting this option makes it still possible to use a dash in
group names and prevent this warning to show up.
It might need to be definitely addressed in a future ansible release.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1880476
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6938ed13021201cc1cfcd3f3087849fed3157a69)

library/ceph_key: set no_log on secret

We don't need to show this information during the module execution.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a3f4e2b4d11d3185f4064be5ab2969f0df894ff2)

Remove libjemalloc1 installation task

libjemalloc1 package is not required neither for ganesha dependency nor
for the package build process. So this task can be simply dropped.

Signed-off-by: Dmitriy Rabotyagov <noonedeadpunk@ya.ru>
(cherry picked from commit 297532ca411dbdc6ec96258875058b323008abfe)

README-MULTISITE: Fix syntax issues from markdownlint

This commit makes the following changes:

- Remove trailing whitespace;
- Use consistent header levels;
- Fix code blocks;
- Remove hard tabs;
- Fix ordered lists;
- Fix bare URLs;
- Use markdown list of sections.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 2c244425ecef8752e434fb7920ab64d8534dd59c)

docs: update URLs to point to the RTD links

Fixes #5798
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
(cherry picked from commit f3a78371d9e1336595f5ce8ae7932a5f97004bbe)

facts: refact `ceph_uid` fact

There's no need to set this fact with a `set_fact`
We can achieve this in `ceph-defaults`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875058
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bcc673f66c22364766beb4b5ebb971bd3f517693)

ceph-facts: move facts to defaults value

There's no need to define a variable via a fact if we can do it via a
default value. Using a fact could be interesseting to override the
default value on some condition.

- ceph_uid could be set to 167 by default because it's only different on
non containerized deployment on Debian/Ubuntu.
- rbd_client_directory_{owner,group,mode} could be set to ceph,ceph,0770
by default install of null as we are doing in the facts.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875058
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7f997e623a7171fa6f00c43cd5b60f3882f8ed04)

container: quote registry password

When using a quote in the registry password then we have the following
error:

The error was: ValueError: No closing quotation

To fix this we need to use the quote filter.

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1880252

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 6dcfdf17d43635fcd0dc658c199702945a1228dd)

facts: fix 'set_fact rgw_instances with rgw multisite'

the current condition doesn't work, as soon as the first iteration is
done the condition makes next iterations skip since `rgw_instances` got
set with the first iteration.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859872
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ff19c1d851ebf0bd7bd8a744367e8893dc6103a8)

ceph-infra: include iscsi nodes for logrotate

The iscsi nodes aren't included in the logrotate condition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 85643edfe382d34a77fabbd97a4d937b8b74d4e6)

infra: support log rotation for tcmu-runner

This commit adds the log rotation support for tcmu-runner.

ceph-container related PR: ceph/ceph-container#1726

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1873915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f576c02ff7b15c207b77b3f206a3213184b89889)

container: add optional http(s) proxy option

When using a http(s) proxy with either docker or podman we can rely on
the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables.
But with ansible, even if those variables are defined in a source file
then they aren't loaded during the container pull/login tasks.
This implements the http(s) proxy support with docker/podman.
Both implementations are different:
1/ docker doesn't rely en the environment variables with the CLI.
Thos are needed by the docker daemon via systemd.
2/ podman uses the environment variables so we need to add them to
the login/pull tasks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1876692
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bda3581294c8f29eda598522c331a4c009243884)

ceph-prometheus: update pool stat counter

Since [1] The bytes_used pool counter in prometheus has been renamed
to stored.

Closes: #5781
[1] https://github.com/ceph/ceph/commit/71fe9149

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e54b924eaf05a7223ec7525657d14e8892ce8957)

switch2container: chown symlink for devices

If the OSD directory is using symlinks for referencing devices (like
block, db, wal for bluestore and journal for filestore) then the chown
command could fail to change the owner:group on some system.

$ ls -hl /var/lib/ceph/osd/ceph-0/
total 28K
lrwxrwxrwx 1 ceph ceph 92 Sep 15 01:53 block -> /dev/ceph-45113532-95ca-471b-bd75-51de46f1339c/osd-data-570a1aee-60c0-44c9-8036-ffed7d67a4e6
-rw------- 1 ceph ceph 37 Sep 15 01:53 ceph_fsid
-rw------- 1 ceph ceph 37 Sep 15 01:53 fsid
-rw------- 1 ceph ceph 55 Sep 15 01:53 keyring
-rw------- 1 ceph ceph  6 Sep 15 01:53 ready
-rw------- 1 ceph ceph  3 Sep 15 02:00 require_osd_release
-rw------- 1 ceph ceph 10 Sep 15 01:53 type
-rw------- 1 ceph ceph  2 Sep 15 01:53 whoami
$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied
$ find /var/lib/ceph/osd/ceph-0 -not -user 167
/var/lib/ceph/osd/ceph-0/block

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da4280e243f50114e1ae6455a46360012feb8f3d)

switch2container: remove deb systemd units

When running the switch2container playbook on a Debian based system
then the systemd unit path isn't the same than Red Hat based system.
Because the systemd unit files aren't removed then the new container
systemd unit isn't take in count.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c1af69a7e79a5909903490028f7ae13e519c98e0)

ansible: bump to ansible 2.9

Prior this commit we were supporting both ansible 2.8 and 2.9.
Let's drop 2.8 now.

Closes: #5459
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1879178
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

purge: remove potential socket leftover

This commit ensure we remove any socket left by ceph and the
`ceph-osd-run.sh` script.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861755
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5e91e0f3e24da0492b6f5dd2bc808215b5066ddc)

tests: do not run node_exporter test on clients

We need to skip these tests on client nodes since we don't deploy
node_exporter on them anymore

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5650a6d7d0a0e2b2fa0ceb080e7d582dc9ceb447)

node-exporter: exclude client nodes

We don't need to install node-exporter on client node because there's
no ceph services running on them.
This also makes sure we use the group name variables in the prometheus
service template instead of hardcoding the values.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b105549ed858eb034d97f5fcad4890e17ee2ebfd)

Revert "Make 'disable ssl for dashboard task' idempotent."

This reverts commit f607857f2a58b2ed14faf49f2b10d056a7f96b30.

> That commit [1] introduced a regression in the dashboard configuration
> because the ceph config get mgr xxxx command doesn't work with
> nautilus.
> In that release the get operation needs an entity.

> [1] f607857

Signed-off-by: Dimitri Savineau dsavinea@redhat.com

facts: refact and optimize memory consumption

there's no need to run this task on all nodes.
This uses too much memory for nothing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1856981
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f0fe193d8ec48414447aa4a7d50b1a9859c71295)

config: only add related rgw section

there's no need to add each rgw section on all rgw nodes.
With this commit, only related rgw section are rendered.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a581a6e6007812cdad935e4f65909b4306046b2)

ceph-iscsi: remove python rtslib shaman repository

The rtslib python library is now available in the distribution so we
shouldn't have to use the shaman repository

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 254ab54f8038c7af2f730dc5abc213490aa60b71)

Add CentOS 8 support for rpm deployment

We were only supporting CentOS 8 for containerized deployment.
Since Nautilus 14.2.10 we now have el8 rpm packages so we should be
able to deploy a nautilus ceph cluster with el8.
Note that the nfs-ganesha isn't supported because there's no el8 rpm
packages for nfs-ganesha V2.8.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

Enable HAProxy backend checks for Ceph RGW

Add the `check` option to server definitions to enable basic HAProxy health
checks for Ceph RADOS gateway backends.

Currently traffic will be forwarded to unhealthly `radosgw.service` servers.
These changes resolve the issue.

Signed-off-by: Niko Smeds nikosmeds@gmail.com
(cherry picked from commit a951c1a3f0a34e086964f52b0bbf7a8d89481aad)

dashboard: refact admin user creation task

this commit splits this task in order to avoid using a `shell` module.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54d3e9650f77466ae4207502e0a2da638d82954d)

Make 'disable ssl for dashboard task' idempotent.

This should reduce number of 'changed' tasks during convergence test.

Signed-off-by: George Shuklin <george.shuklin@gmail.com>
(cherry picked from commit 73d4bb6bd6b560de7f2b3042bdc7d17c901e815a)

Comment out ceph_custom_key

Since there is a check if ceph_custom_key is defined, there is no reason
to define it by default.

Signed-off-by: Rafał Wądołowski <rwadolowski@cloudferro.com>
(cherry picked from commit 55cd6e83e475ab9ad8d684b88da5325d869e9d1c)

ceph_custom_repo: define apt and rpm key for custom repo

This commit also remove the notify on new added debian repo,
force update_cache to yes and define sample ceph_custom_key vars.

Signed-off-by: Anthony Rusdi <33247310+antrusd@users.noreply.github.com>
(cherry picked from commit 4c592066b7c1caaec700af347fc9edf2109c1659)

ceph-rgw: allow specifying crush rule on pool

We already support specifiying a custom crush rule during pool creation
in ceph-osd role but not in ceph-rgw role.
This patch adds the missing code to implement this feature.
Note this is only available for replicated pool not erasure. The rule
must also exist prior the pool creation.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1855439
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cb8f0237e1fe7b890d20d47b5d023a6c618cbd4c)

container: run engine/common roles on first client

We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede7e26d053f87acde99986fddb0fe070)

container: don't install the engine on all clients

We only need the container engine to be installed on the first clients
node in order to execute the pools/keys operation. We already do the
same worflow with the ceph-container-common role which pull the ceph
container image.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9805589ef94230c67439787cb19ffa7e3d5f2b3d)

Allow updating crush rule on existing pool

The crush rule value was only set once during the pool creation. It was
not possible to update the crush rule value by updating the value in the
configuration.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1847166
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rgw: allow rgws to be concurrently with or without multisite

Allows rgws in a ceph cluster to be run with
multisite and without multisite at the same time.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 5c1f4b1a1eff8c77c4bdc816debbbc4043efc644)

purge-cluster: use sysfs method for unmapping rbd devices

This way we keep consistency with purge-container-cluster.yml playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f77fa6e2a4d4c2b4522582f53713c2e49fecbe12)

pytest: register ceph_crash mark

Otherwise we see some pytest warning.

PytestUnknownMarkWarning: Unknown pytest.mark.ceph_crash - is this a typo?
You can register custom marks to avoid this warning - for details,
see https://docs.pytest.org/en/latest/mark.html

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 03d46202691514639ff10666a488169bf8b4d150)

ceph-handler: add missing condition on ceph-crash

The ceph-crash tasks present in the ceph-handler role don't need to be
executed on all nodes.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 18e3c7a0a2f5ff1f2482e519178a00cec0c81420)

crash: rm container in ExecPreStart even with docker

We should ensure the container is removed in `ExecPreStart` even when
`{{ container_binary }}` is docker.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 39bb279a53deaf87053cae4014c7143780016536)

ceph-crash: introduce new role ceph-crash

This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1c9b6ae42b3133bb9ac37d4765e5e07)

ceph-facts: only get fsid when monitor are present

When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec701dadc28359b1a4978f8a7ab00e03)

tests: use grafana from quay.io

This changes the grafana container image regitry from docker.io to
quay.io to avoid rate limit.
This also adds the missing container image values for docker2podman
and podman scenarios.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit dd05d8ba9056eb1fbb92419d5c1dd8d14cfbd350)

tests: migrate to quay.ceph.io registry

in order to avoid docker.io rate limiting

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 218aedaab66b1d07a69e635951baceb83e15cd78)

Add --cluster option on ceph require-osd-release command

On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b687d95704bac76ed0b4f83dfc3ca992)

Fix hosts field in rolling_update playbook when mds are processed

In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c720eeeef72b6eef59bb239e6ed04cdbe)

tests: move erasure pool testing in lvm_osds

This commit moves the erasure pool creation testing from `all_daemons`
to `lvm_osds` so we can decrease the number of osd nodes we spawn so the
OVH Jenkins slaves aren't less overwhelmed when a `all_daemons` based
scenario is being tested.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8476beb5b1f673d8b0925293d9273041c99a9bac)

Set default permission for prometheus config files

Regardless of the outcome of Ansible 2.9.12 issue 71200
we can set a default permission for these files.

Closes: https://github.com/ceph/ceph-ansible/issues/5677
Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit 95dee6f1cad71cddb69f7bcddbd199ebcad45d8c)

shrink-mds: use mds_to_kill_hostname instead

When using fqdn in inventory host file, this task will fail because the
mds is registered with its shortname.

It means we must use `mds_to_kill_hostname` in this task.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51c382677dfa5db8fc39ca9c3c4898e017f3c189)

mgr: enable pg_autoscaler by default

Otherwise, even though we set the pg autoscaler attribute on a pool, the
feature won't be working as expected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1836431
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

infra: only install logrotate on right nodes

For intsance, there is no need to install logrotate on clients nodes.

This also ensure logrotate is installed only for containerized
deployments since the packaging has an explicit dependency to logrotate

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8ed11ea3ee9ea006ecf723c070fa18d3b318f580)

travis: enforce ansible-lint 4.2.0

Let's pin to 4.2.0

(because of ansible/ansible-lint/issues/966)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 04d77dcaebb52734a1c6d1838ecfa669bf8f3c67)

infra: add missing tag

This commit adds the missing `with_pkg` tag on the logrotate
installation task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e1cb385740b8b32600eda90910dcff20208f8945)

purge: import ceph-defaults in purge osd play

Otherwise, `ceph_volume_debug` variable is undefined

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 33a544644a671c8b9ffd7c5e761276c1a1ac574d)

infra: add log rotation support (containers)

This commit adds the log rotation support via logrotate in containerized
deployments.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1848388
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f1aa6cea21ca5423bb0404eae6437a19eaae2653)

common: don't enable debug log on ceph-volume calls by default

ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7919ac7d19d854a92e3ed367b361ccc)

nfs: do not copy rgw keyring when `nfs_obj_gw` is true

This keyring shouldn't be copied when `nfs_obj_gw` is `True` if the
cluster doesn't contain a rgw node, which can be the case given we are
using `nfs_obj_gw` instead of `nfs_file_gw` (cephfs vs. object), the
deployment will fail trying to copy a key that doesn't exist.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit dd4b5b0328d585d62103a84e02ca728b588a50f3)

rgw: support 1+ rgw instance in `radosgw_frontend_port`

Change the radosgw_frontend_port to take in account more than 1 RGW instance,
in it's original form `radosgw_frontend_port: radosgw_frontend_port | int`,
it configured the 8080 port to all instances, with the following modification
`radosgw_frontend_port: radosgw_frontend_port | int + item|int` we increase in
1 the port count.

Co-authored-by: Daniel Parkes <dparkes@redhat.com>
Signed-off-by: raul <rmahique@redhat.com>
(cherry picked from commit 110eaf5f9f8a2fe26993e2e663849a74531da9d2)

purge-cluster: check if rbdmap exists

When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.

This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit a57fd7a0900e5c3d04e8b6c997c819d340565967)

Prometheus APIs are only available through plain http

Trying to access these APIs through TLS produces "Could not reach
external API" errors in Ceph dashboard.

Signed-off-by: Paulo Matias <matias@ufscar.br>
(cherry picked from commit dac8e1d0a965125b9fa19616ded89254744581fc)

Allow user to specify grafana_server_fqdn

This is needed to get a TLS certificate to validate correctly.

If unspecified, auto-detected grafana_server_addr is used.

Signed-off-by: Paulo Matias <matias@ufscar.br>
(cherry picked from commit 38ce02c2eacee20395b4d7ad6fc5b7b2c4470a30)

Remove ceph-radosgw.target when switching to containerize daemons

The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.

Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b227c621b6c794db4f571151e5bbf9c4)

shrink_osd: remove osd data directory

Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33b8aa6ad6f0a0f29531699922d9bf75)

tests: refact shrink_osd scenario

This adds more coverage on the shrink_osd scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7efea219d62792c599b9d66035395323334beeaa)

tox: split shrink_osd scenario

Let's split this scenario with a dedicated tox ini file.

This is for testing in two ways:

1/ shrinking OSDs one by one
2/ shrinking multiple OSDs with a single call of the playbook

ceph-build related PR: ceph/ceph-build#1629

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 78e4faf077e2710b8245acb3b63dd49f0875291a)

shrink-osd: various fixes

This handles missing /etc/ceph/osd, by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.

This also add a missing `| default(False)` to avoid fowlloing error:

```
fatal: [ceph01]: FAILED! =>
msg: |-
The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2d877c9ca3b08412a8b12f64a111c18)

tox: remove dashboard file

This tox configuration file isn't used anymore as the dashboard scenario
is included with all_daemons.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

dashboard: allow remote TLS cert/key copy

When using TLS on the ceph dashboard or grafana services, we can provide
the TLS certificate and key.
Those files should be present on the ansible controller and they will be
copyied to the right node(s).
In some situation, the TLS certificate and key could be already present
on the target node and not on the ansible controller.
For this scenario, we just need to copy the files locally (on each remote
host).

This patch adds the dashboard_tls_external variable (with default to
false) to allow users to achieve this scenario when configuring this
variable to true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1860815
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0d0f1e71df33484d6619aeaa97eb21d7dfc0ea48)

rolling_update: restart mds after the upgrade

In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74ffbefcce42582c57c6726cc001f98ab)

rolling_update: refact dashboard workflow

The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957e40c9c42370eab73ae99e73d53c8a4)