git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log

infra: support log rotation for tcmu-runner

This commit adds the log rotation support for tcmu-runner.

ceph-container related PR: ceph/ceph-container#1726

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1873915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f576c02ff7b15c207b77b3f206a3213184b89889)

container: add optional http(s) proxy option

When using a http(s) proxy with either docker or podman we can rely on
the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables.
But with ansible, even if those variables are defined in a source file
then they aren't loaded during the container pull/login tasks.
This implements the http(s) proxy support with docker/podman.
Both implementations are different:
1/ docker doesn't rely en the environment variables with the CLI.
Thos are needed by the docker daemon via systemd.
2/ podman uses the environment variables so we need to add them to
the login/pull tasks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1876692
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bda3581294c8f29eda598522c331a4c009243884)

ceph-prometheus: update pool stat counter

Since [1] The bytes_used pool counter in prometheus has been renamed
to stored.

Closes: #5781
[1] https://github.com/ceph/ceph/commit/71fe9149

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e54b924eaf05a7223ec7525657d14e8892ce8957)

switch2container: chown symlink for devices

If the OSD directory is using symlinks for referencing devices (like
block, db, wal for bluestore and journal for filestore) then the chown
command could fail to change the owner:group on some system.

$ ls -hl /var/lib/ceph/osd/ceph-0/
total 28K
lrwxrwxrwx 1 ceph ceph 92 Sep 15 01:53 block -> /dev/ceph-45113532-95ca-471b-bd75-51de46f1339c/osd-data-570a1aee-60c0-44c9-8036-ffed7d67a4e6
-rw------- 1 ceph ceph 37 Sep 15 01:53 ceph_fsid
-rw------- 1 ceph ceph 37 Sep 15 01:53 fsid
-rw------- 1 ceph ceph 55 Sep 15 01:53 keyring
-rw------- 1 ceph ceph  6 Sep 15 01:53 ready
-rw------- 1 ceph ceph  3 Sep 15 02:00 require_osd_release
-rw------- 1 ceph ceph 10 Sep 15 01:53 type
-rw------- 1 ceph ceph  2 Sep 15 01:53 whoami
$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied
$ find /var/lib/ceph/osd/ceph-0 -not -user 167
/var/lib/ceph/osd/ceph-0/block

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da4280e243f50114e1ae6455a46360012feb8f3d)

switch2container: remove deb systemd units

When running the switch2container playbook on a Debian based system
then the systemd unit path isn't the same than Red Hat based system.
Because the systemd unit files aren't removed then the new container
systemd unit isn't take in count.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c1af69a7e79a5909903490028f7ae13e519c98e0)

ansible: bump to ansible 2.9

Prior this commit we were supporting both ansible 2.8 and 2.9.
Let's drop 2.8 now.

Closes: #5459
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1879178
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

purge: remove potential socket leftover

This commit ensure we remove any socket left by ceph and the
`ceph-osd-run.sh` script.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861755
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5e91e0f3e24da0492b6f5dd2bc808215b5066ddc)

tests: do not run node_exporter test on clients

We need to skip these tests on client nodes since we don't deploy
node_exporter on them anymore

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5650a6d7d0a0e2b2fa0ceb080e7d582dc9ceb447)

node-exporter: exclude client nodes

We don't need to install node-exporter on client node because there's
no ceph services running on them.
This also makes sure we use the group name variables in the prometheus
service template instead of hardcoding the values.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b105549ed858eb034d97f5fcad4890e17ee2ebfd)

Revert "Make 'disable ssl for dashboard task' idempotent."

This reverts commit f607857f2a58b2ed14faf49f2b10d056a7f96b30.

> That commit [1] introduced a regression in the dashboard configuration
> because the ceph config get mgr xxxx command doesn't work with
> nautilus.
> In that release the get operation needs an entity.

> [1] f607857

Signed-off-by: Dimitri Savineau dsavinea@redhat.com

facts: refact and optimize memory consumption

there's no need to run this task on all nodes.
This uses too much memory for nothing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1856981
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f0fe193d8ec48414447aa4a7d50b1a9859c71295)

config: only add related rgw section

there's no need to add each rgw section on all rgw nodes.
With this commit, only related rgw section are rendered.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a581a6e6007812cdad935e4f65909b4306046b2)

ceph-iscsi: remove python rtslib shaman repository

The rtslib python library is now available in the distribution so we
shouldn't have to use the shaman repository

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 254ab54f8038c7af2f730dc5abc213490aa60b71)

Add CentOS 8 support for rpm deployment

We were only supporting CentOS 8 for containerized deployment.
Since Nautilus 14.2.10 we now have el8 rpm packages so we should be
able to deploy a nautilus ceph cluster with el8.
Note that the nfs-ganesha isn't supported because there's no el8 rpm
packages for nfs-ganesha V2.8.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

Enable HAProxy backend checks for Ceph RGW

Add the `check` option to server definitions to enable basic HAProxy health
checks for Ceph RADOS gateway backends.

Currently traffic will be forwarded to unhealthly `radosgw.service` servers.
These changes resolve the issue.

Signed-off-by: Niko Smeds nikosmeds@gmail.com
(cherry picked from commit a951c1a3f0a34e086964f52b0bbf7a8d89481aad)

dashboard: refact admin user creation task

this commit splits this task in order to avoid using a `shell` module.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54d3e9650f77466ae4207502e0a2da638d82954d)

Make 'disable ssl for dashboard task' idempotent.

This should reduce number of 'changed' tasks during convergence test.

Signed-off-by: George Shuklin <george.shuklin@gmail.com>
(cherry picked from commit 73d4bb6bd6b560de7f2b3042bdc7d17c901e815a)

Comment out ceph_custom_key

Since there is a check if ceph_custom_key is defined, there is no reason
to define it by default.

Signed-off-by: Rafał Wądołowski <rwadolowski@cloudferro.com>
(cherry picked from commit 55cd6e83e475ab9ad8d684b88da5325d869e9d1c)

ceph_custom_repo: define apt and rpm key for custom repo

This commit also remove the notify on new added debian repo,
force update_cache to yes and define sample ceph_custom_key vars.

Signed-off-by: Anthony Rusdi <33247310+antrusd@users.noreply.github.com>
(cherry picked from commit 4c592066b7c1caaec700af347fc9edf2109c1659)

ceph-rgw: allow specifying crush rule on pool

We already support specifiying a custom crush rule during pool creation
in ceph-osd role but not in ceph-rgw role.
This patch adds the missing code to implement this feature.
Note this is only available for replicated pool not erasure. The rule
must also exist prior the pool creation.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1855439
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cb8f0237e1fe7b890d20d47b5d023a6c618cbd4c)

container: run engine/common roles on first client

We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede7e26d053f87acde99986fddb0fe070)

container: don't install the engine on all clients

We only need the container engine to be installed on the first clients
node in order to execute the pools/keys operation. We already do the
same worflow with the ceph-container-common role which pull the ceph
container image.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9805589ef94230c67439787cb19ffa7e3d5f2b3d)

Allow updating crush rule on existing pool

The crush rule value was only set once during the pool creation. It was
not possible to update the crush rule value by updating the value in the
configuration.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1847166
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rgw: allow rgws to be concurrently with or without multisite

Allows rgws in a ceph cluster to be run with
multisite and without multisite at the same time.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 5c1f4b1a1eff8c77c4bdc816debbbc4043efc644)

purge-cluster: use sysfs method for unmapping rbd devices

This way we keep consistency with purge-container-cluster.yml playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f77fa6e2a4d4c2b4522582f53713c2e49fecbe12)

pytest: register ceph_crash mark

Otherwise we see some pytest warning.

PytestUnknownMarkWarning: Unknown pytest.mark.ceph_crash - is this a typo?
You can register custom marks to avoid this warning - for details,
see https://docs.pytest.org/en/latest/mark.html

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 03d46202691514639ff10666a488169bf8b4d150)

ceph-handler: add missing condition on ceph-crash

The ceph-crash tasks present in the ceph-handler role don't need to be
executed on all nodes.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 18e3c7a0a2f5ff1f2482e519178a00cec0c81420)

crash: rm container in ExecPreStart even with docker

We should ensure the container is removed in `ExecPreStart` even when
`{{ container_binary }}` is docker.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 39bb279a53deaf87053cae4014c7143780016536)

ceph-crash: introduce new role ceph-crash

This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1c9b6ae42b3133bb9ac37d4765e5e07)

ceph-facts: only get fsid when monitor are present

When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec701dadc28359b1a4978f8a7ab00e03)

tests: use grafana from quay.io

This changes the grafana container image regitry from docker.io to
quay.io to avoid rate limit.
This also adds the missing container image values for docker2podman
and podman scenarios.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit dd05d8ba9056eb1fbb92419d5c1dd8d14cfbd350)

tests: migrate to quay.ceph.io registry

in order to avoid docker.io rate limiting

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 218aedaab66b1d07a69e635951baceb83e15cd78)

Add --cluster option on ceph require-osd-release command

On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b687d95704bac76ed0b4f83dfc3ca992)

Fix hosts field in rolling_update playbook when mds are processed

In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c720eeeef72b6eef59bb239e6ed04cdbe)

tests: move erasure pool testing in lvm_osds

This commit moves the erasure pool creation testing from `all_daemons`
to `lvm_osds` so we can decrease the number of osd nodes we spawn so the
OVH Jenkins slaves aren't less overwhelmed when a `all_daemons` based
scenario is being tested.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8476beb5b1f673d8b0925293d9273041c99a9bac)

Set default permission for prometheus config files

Regardless of the outcome of Ansible 2.9.12 issue 71200
we can set a default permission for these files.

Closes: https://github.com/ceph/ceph-ansible/issues/5677
Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit 95dee6f1cad71cddb69f7bcddbd199ebcad45d8c)

shrink-mds: use mds_to_kill_hostname instead

When using fqdn in inventory host file, this task will fail because the
mds is registered with its shortname.

It means we must use `mds_to_kill_hostname` in this task.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51c382677dfa5db8fc39ca9c3c4898e017f3c189)

mgr: enable pg_autoscaler by default

Otherwise, even though we set the pg autoscaler attribute on a pool, the
feature won't be working as expected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1836431
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

infra: only install logrotate on right nodes

For intsance, there is no need to install logrotate on clients nodes.

This also ensure logrotate is installed only for containerized
deployments since the packaging has an explicit dependency to logrotate

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8ed11ea3ee9ea006ecf723c070fa18d3b318f580)

travis: enforce ansible-lint 4.2.0

Let's pin to 4.2.0

(because of ansible/ansible-lint/issues/966)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 04d77dcaebb52734a1c6d1838ecfa669bf8f3c67)

infra: add missing tag

This commit adds the missing `with_pkg` tag on the logrotate
installation task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e1cb385740b8b32600eda90910dcff20208f8945)

purge: import ceph-defaults in purge osd play

Otherwise, `ceph_volume_debug` variable is undefined

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 33a544644a671c8b9ffd7c5e761276c1a1ac574d)

infra: add log rotation support (containers)

This commit adds the log rotation support via logrotate in containerized
deployments.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1848388
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f1aa6cea21ca5423bb0404eae6437a19eaae2653)

common: don't enable debug log on ceph-volume calls by default

ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7919ac7d19d854a92e3ed367b361ccc)

nfs: do not copy rgw keyring when `nfs_obj_gw` is true

This keyring shouldn't be copied when `nfs_obj_gw` is `True` if the
cluster doesn't contain a rgw node, which can be the case given we are
using `nfs_obj_gw` instead of `nfs_file_gw` (cephfs vs. object), the
deployment will fail trying to copy a key that doesn't exist.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit dd4b5b0328d585d62103a84e02ca728b588a50f3)

rgw: support 1+ rgw instance in `radosgw_frontend_port`

Change the radosgw_frontend_port to take in account more than 1 RGW instance,
in it's original form `radosgw_frontend_port: radosgw_frontend_port | int`,
it configured the 8080 port to all instances, with the following modification
`radosgw_frontend_port: radosgw_frontend_port | int + item|int` we increase in
1 the port count.

Co-authored-by: Daniel Parkes <dparkes@redhat.com>
Signed-off-by: raul <rmahique@redhat.com>
(cherry picked from commit 110eaf5f9f8a2fe26993e2e663849a74531da9d2)

purge-cluster: check if rbdmap exists

When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.

This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit a57fd7a0900e5c3d04e8b6c997c819d340565967)

Prometheus APIs are only available through plain http

Trying to access these APIs through TLS produces "Could not reach
external API" errors in Ceph dashboard.

Signed-off-by: Paulo Matias <matias@ufscar.br>
(cherry picked from commit dac8e1d0a965125b9fa19616ded89254744581fc)

Allow user to specify grafana_server_fqdn

This is needed to get a TLS certificate to validate correctly.

If unspecified, auto-detected grafana_server_addr is used.

Signed-off-by: Paulo Matias <matias@ufscar.br>
(cherry picked from commit 38ce02c2eacee20395b4d7ad6fc5b7b2c4470a30)

Remove ceph-radosgw.target when switching to containerize daemons

The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.

Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b227c621b6c794db4f571151e5bbf9c4)

shrink_osd: remove osd data directory

Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33b8aa6ad6f0a0f29531699922d9bf75)

tests: refact shrink_osd scenario

This adds more coverage on the shrink_osd scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7efea219d62792c599b9d66035395323334beeaa)

tox: split shrink_osd scenario

Let's split this scenario with a dedicated tox ini file.

This is for testing in two ways:

1/ shrinking OSDs one by one
2/ shrinking multiple OSDs with a single call of the playbook

ceph-build related PR: ceph/ceph-build#1629

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 78e4faf077e2710b8245acb3b63dd49f0875291a)

shrink-osd: various fixes

This handles missing /etc/ceph/osd, by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.

This also add a missing `| default(False)` to avoid fowlloing error:

```
fatal: [ceph01]: FAILED! =>
msg: |-
The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2d877c9ca3b08412a8b12f64a111c18)

tox: remove dashboard file

This tox configuration file isn't used anymore as the dashboard scenario
is included with all_daemons.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

dashboard: allow remote TLS cert/key copy

When using TLS on the ceph dashboard or grafana services, we can provide
the TLS certificate and key.
Those files should be present on the ansible controller and they will be
copyied to the right node(s).
In some situation, the TLS certificate and key could be already present
on the target node and not on the ansible controller.
For this scenario, we just need to copy the files locally (on each remote
host).

This patch adds the dashboard_tls_external variable (with default to
false) to allow users to achieve this scenario when configuring this
variable to true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1860815
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0d0f1e71df33484d6619aeaa97eb21d7dfc0ea48)

rolling_update: restart mds after the upgrade

In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74ffbefcce42582c57c6726cc001f98ab)

rolling_update: refact dashboard workflow

The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957e40c9c42370eab73ae99e73d53c8a4)

rolling_update: stop/start instead of restart

During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

This following comment isn't valid because we should have backported
ceph-crash implementation in stable-4.0 before this commit, which was not
possible because of the needed tag v4.0.25.1 (async release for 4.1z1):

~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d54ea29ccbf5414cb93cdc748c516e79)

ceph-handler: remove iscsigws restart scripts

The iscsigws restart scripts for tcmu-runner and rbd-target-{api,gw}
services only call the systemctl restart command.
We don't really need to copy a shell script to do it when we can use
the ansible service module instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cbe79428e687e383f9764668a56171e5582451be)

podman: always remove container on start

In case of failure, the systemd ExecStop isn't executed so the container
isn't removed. After a reboot of a failed node, the container doesn't
start because the old container is still present in created state.
We should always try to remove the container in ExecStartPre for this
situation.
A normal reboot doesn't trigger this issue and this also doesn't affect
nodes running containers via docker.
This behaviour was introduced by d43769d.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1858865
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 47b7c00287f310ab38e442ba2a147e9f7faab1ee)

update: use tasks_from when including ceph-facts

When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2bae47ab20650dada8e66307a060962)

rgw: set container memory limit to 4g

This commit changes the container memory limit for rgw daemons.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1707488
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86edae724f98036ada845a138c7f586df395cd3a)

tests: lvm_setup.yml, add carriage return

This commit adds crlf between each task.
It makes the playbook more readable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8ef9fb68bc975f92069f362775f4f281c3b03531)

tests: (lvm_setup.yml), don't shrink lvol

when rerunning lvm_setup.yml on existing cluster with OSDs already
deployed, it fails like following:

```
fatal: [osd0]: FAILED! => changed=false
msg: Sorry, no shrinking of data-lv2 to 0 permitted.
```

because we are asking `lvol` module to create a volume on an empty VG
with size extents = `100%FREE`.

The default behavior of `lvol` is to shrink the volume if the LV's current
size is greater than the requested size.

Given the requested size is calculated like this:

`size_requested = size_percent * this_vg['free'] / 100`

in our case, it is similar to:

`size_requested = 100 * 0 / 100` which basically means `0`

So the current LV size is well greater than the requested size which
leads the module to attempt to shrink it to 0 which isn't obviously now
allowed.

Adding `shrink: false` to the module calls fixes this issue.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 218f4ae361b53fd7632d17495ba54fb6e2afed10)

ceph_key: fix bug in 'info' feature

Fix 'info' feature from ceph_key.py module

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9417ecf0c5c05da2b12572ef36304edbe2cf0ae1)

facts: fix broken facts when using --limit

This commit fixes these tasks when --limit is used.

It makes sure the fact is set on right nodes even when the playbook is
run with `--limit`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f8a951f50c6a64ab3d60a1bf66ca9d2db2f6bc35)

ceph-dashboard: copy TLS cert/key on monitor

The ceph-dashboard role is executed on the mgr nodes so the TLS cert/key
files are copied to those nodes.
But we are running importing the cert/key files into the ceph
configuration on the monitor.

Closes: #5557
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2b8ebf14574e927bfabd939cc6263eb27a65afb3)

ceph_volume: fix regression

do not skip zapping if osd_fsid is passed

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f402ab2b87813f0f9c3fba661a52f5afebc19723)

facts: explicitly disable facter and ohai

By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b8be6f9f9b1ba8568072daf39da7a2c)

radosgw: remove INST_PORT environment variable

This variable isn't consumed by the container so we can remove it.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1361e84a4e5f339018762e1468f2bfddece7b38e)

rgw: fix multi instances scaleout

When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook.

The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from `ceph-osd`
role which is run before `ceph-rgw`, therefore it tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet and fails like following:

```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```

This commit moves the tasks generating this file in `ceph-config` role
so it is generated early.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1086e3c15cfad3812f1c141eadd80c0)

dashboard: configure mgr backend before restart

We need to set the mgr dashboard server ip address before restarting the
dashboard module otherwise we can try to bind the dashboard module on an
already used address.
We already do this configuration for the dashboard port value and ssl
setup so we should do the same for server address too.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851455
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 03cd75845fee4d7d51bf5ce999e489d6f943e283)

ceph-volume.py: add support for batch refactored code

See https://github.com/ceph/ceph/pull/34740 for the batch changes.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
(cherry picked from commit d90834b77f31f186ab72f41680c1f15357b7cdba)

rolling_update: add any_errors_fatal

If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10438ea8cc351c4a06447e88d96222b9)

ceph-dashboard: update create/get rgw user tasks

Since [1] if a rgw user already exists then the radosgw-admin user create
command will return an error instead of modifying the current user.
We were already doing separated tasks for create and get operation but
only for multisite configuration but it's not enough.
Instead we should do the get task first and depending on the result
execute the create.
This commit also adds missing run_once and delegate_to statement.

[1] https://github.com/ceph/ceph/commit/269e9b9

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ac0f68ccf06dafe3c5b1321b81d80e2dc9d29015)

tests: add docker hub authentication in jobs

This commit makes all jobs authenticating to docker hub in order to
avoid the rate limit.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 40307f810c76f22b7152cb1f4113089a22a84274)

doc: add a note about deprecated branches

This commit adds a note about `stable-3.0` `stable-3.1` branches which
are deprecated and not maintained anymore.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bbe30bcc69ffcf117ee97e8500f5247b4542f186)

doc: add a note about containerized deployments

This commit updates the documentation to add a note about containerized
deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e61488507b400b8fb2eedab99889871da27eef12)

doc: fix warning treated as an error

Typical error:

```
Warning, treated as error:
/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/source/day-2/upgrade.rst:2:Title underline too short.
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5c254861bdc146d3ef73dc99d6f52f7d03e22deb)

lvm_setup: lookup device from inventory, default to /dev/sd* names

This fixes a long standing fail in ceph-volumes lvm test suite.
Otherwise the default behaviour should not change.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
(cherry picked from commit 1fe8e819f90a6447ea25741c90b15578ac315ecd)

podman: Add Type and PIDFile value to unit files

This changes the way we are running the podman containers via systemd.
They are now in dettached mode and Type/PIDFile set.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1834974
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d43769dc2aa62059ac17098648d933d26192f67f)

ceph-osd: remove ceph-osd-run.sh script

Since we only have one scenario since nautilus then we can just move
the container start command from ceph-osd-run.sh to the systemd unit
service.
As a result, the ceph-osd-run.sh.j2 template and the
ceph_osd_docker_run_script_path variable are removed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 829990e60d8569198e3fc849624131a7cf6ddf84)

dashboard: copy self-signed generated crt to mons

This commit makes the playbook copying self-signed generated certificate
to monitors.
When mons and mgrs are deployed on dedicated nodes the playbook will
fail when trying to import certificate and key files since they are
generated on mgrs whereas we try to import them from a monitor.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1846995
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b7539eb275ccf947cd6122cdbfa062d20ad2472a)

ceph_volume: make zap function idempotent

This commit makes the zap function idempotent, especially when using
lvm_volumes variable.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1845668
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3f47236470e1571963850e8bed68fa2d26f05b66)

docker: Add Requires on docker service

When using docker container engine then the systemd unit scripts only
use a dependency on the docker daemon via the After parameter.
But if docker is restarted on a live system then the ceph systemd units
should wait for the docker daemon to be fully restarted.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1846830
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd22f1d1ec8c692848aee5337cd0d682a3a058b7)

docker2podman: make images pulling optional

This commit makes the images pulling skipped if podman isn't installed
on the machine.

In OSP context, the podman installation is done later in the workflow,
it means all `podman pull` commands will fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1849559
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 37b20b6525a217008b07624d40b1ac95577c7fe3)

docs: Add upgrade operation.

This commit adds a chapter about the ceph upgrade process.

Closes: #5393
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e41487dbce9dd5e9d754270bec426bea920406be)

switch-to-containers: set and unset osd flags

The workflow in this playbook should be the same than in rolling_update,
we should first set noout and nodeep-scrub flags before migrating the
first osd and unset osd flags after the last osd is migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfaa056e020615bb99eb9db1520a977e5ac3ef4)

switch_to_containers: don't set noup flag

We shouldn't set this flag when running switch_to_containers playbook.
Otherwise the playbook fails waiting for pgs to be clean.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b91d60d38456f9e316bee3daeb2f72dda0315cae)

container: inspect Id field instead of RepoDigests

When a container image managed by podman isn't tag anymore then the
RepoDigests field when inspecting the image doesn't return any value.
This is different from docker workflow and it breaks the ceph-ansible
container upgrade when collocated multiple services and using a non
fix container tag (like latest or 4).

$ podman images
REPOSITORY              TAG      IMAGE ID       CREATED        SIZE
docker.io/ceph/daemon   latest   680c9c0d38c3   8 days ago     957 MB
<none>                  <none>   011ee108bfc9   2 months ago   1.01 GB

$ podman inspect 680c9c0d38c3 | jq .[0].RepoDigests[0]
"docker.io/ceph/daemon@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e"
$ podman inspect 011ee108bfc9 | jq .[0].RepoDigests[0]
null

Because this field returns "null" then the ansible task trying to
determine this value is failing

-----------------------------
fatal: [foo]: FAILED! =>
  msg: |-
    The task includes an option with an undefined variable. The error
    was: None has no element 0

    The error appears to be in
    'roles/ceph-container-common/tasks/fetch_image.yml': line 137,
    column 3, but may be elsewhere in the file depending on the exact
    syntax problem.

    The offending line appears to be:

    - name: set_fact ceph_osd_image_repodigest_before_pulling
      ^ here
-----------------------------

We don't have this behaviour with docker.

$ docker images
REPOSITORY              TAG      IMAGE ID       CREATED        SIZE
docker.io/ceph/daemon   latest   680c9c0d38c3   8 days ago     928 MB
docker.io/ceph/daemon   <none>   011ee108bfc9   2 months ago   986 MB

$ docker inspect 680c9c0d38c3 | jq .[0].RepoDigests[0]
"docker.io/ceph/daemon@sha256:45e6f28bb67c81b826acb64fad5c0da1cac3dffb41a88992fe4ca2be79575fa6"
$ docker inspect 011ee108bfc9 | jq .[0].RepoDigests[0]
"docker.io/ceph/daemon@sha256:b393a73309d72e43ca7d65cd3519036007947671e373eb59aa75a46185c52231"

Instead we should just get the Id field.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1844496
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cdb30bd125e5128328c3ccef15006acb23494d9c)

switch_to_container: fix osd systemd regex

The systemd LOAD and ACTIVE fileds could have more than one space between
both values.
This update the systemd regex the same way we're using it in different
part of the code.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 50140c9b5dfd4c865b36b333db8d2f725a905a5a)

dashboard: allow disabling grafana api ssl verify

When using an untrusted TLS certificate (like self-signed) on grafana
then the grafana dashboards update subcommand will fail.
One solution could be to trust the TLS certificate.
The other one is to disable the TLS verification on the grafana API.

Closes: #5324
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b20519efd0b9af4f2467daa311b9dca6086d4f87)

rgw multisite: add master zone endpoints to zonegroup

We were only adding the endpoints to the master zone but not to the
zonegroup.
This patch fixes the issue.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1839228
Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 0175c205fa16c05e2bbf5b4d8111092555aefa66)

common: fix target_size_ratio task enablement

The condition on this task is wrong, we have to check whether
`target_size_ratio` is set in the pool definition instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8c7a48832cd62524982c9ebe193a5ca6ea2c7bfa)

facts: always set ceph_run_cmd and ceph_admin_command

always set these facts on monitor nodes whatever we run with `--limit`.
Otherwise, playbook will fail when using `--limit` on nodes where these
facts are used on a delegated task to monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5e81843e918ed9aa57a5675af2888499700eac2)

osd: add a default value for 'default' in crush_rules

Let's default to `False` for the `default` attribute in `crush_rules`
variable.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1797774
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1b0b7af119929eca86d8f4684e4dbd228d6509f4)

docker2podman: manage dashboard nodes

The dashboard nodes (alertmanager, grafana, node-exporter, and prometheus)
were not manage during the docker to podman migration.

This adds the systemd container template of those services to a dedicated
file (systemd.yml) in order to include it in the docker2podman playbook.

This also adds the dashboard container images pull from docker to podman.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829389
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 252e78b4e4e90bc1c21d9dfd4a7c9bd132e94730)

docker2podman: pull images from docker daemon

The docker2podman playbook only installs the podman package and updates
the systemd units with the right container_binary value.

We never pull the container image so if one service is restarted then
the container image will be pulled first before the service can start
which could cause longer downstream.

To avoid to download the container image from internet again we can just
pull it from the local docker daemon.

The container_{binding,package,service}_name variables are removed
because they are only used in the ceph-container-engine role which
isn't call in this playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d38f21aeba29f341dc737b8cdeaa9fdaa9f55408)

rolling_update: fix rbdmirror group name

The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f9284eac733de99ebcc7f18b1ebdf8f115)