git-server-git.apps.pok.os.sepia.ceph.com Git

clients: do not run ceph-facts entirely

DNM/WIP

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

facts: isolate is_atomic task

This commit isolates the task setting the fact `is_atomic`.
So we can call ceph-facts with `tasks_from` when we only need this fact.
This avoid having to run the whole role just for this.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

debug

debug

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

common: support OSDs with more than 2 digits

When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

shrink-osd: support shrinking ceph-disk prepared osds

This commit adds the ceph-disk prepared osds support

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796453
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

shrink-osd: don't run ceph-facts entirely

We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

filestore-to-bluestore: reuse dedicated journal

If the filestore configuration was using a dedicated journal with either
a partition or a LV/VG then we need to reuse this for bluestore DB.

When filestore is using a raw devices then we shouldn't destroy
everything (data + journal) but only data otherwise the journal
partition won't exist anymore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

doc: update infra playbooks statements

We don't need to copy the infrastructure playbooks in the root
ceph-ansible directory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-rgw: increase connection timeout to 10

5s as a connection timeout could be low in some setup. Let's increase
it to 10s.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

Configure ceph dashboard backend and dashboard_frontend_vip

This change introduces a new set of tasks to configure the
ceph dashboard backend and listen just on the mgr related
subnet (and not on '*'). For the same reason the proper
server address is added in both prometheus and alertmanger
systemd units.
This patch also adds the "dashboard_frontend_vip" parameter
to make sure we're able to support the HA model when multiple
grafana instances are deployed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792230
Signed-off-by: Francesco Pantano <fpantano@redhat.com>

infrastructure-playbooks: Run shrink-osd tasks on monitor

Instead of running shring-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.

This has the added advantage of becoming root on the monitor only, not
on localhost.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>

ceph-dashboard: update create/get rgw user tasks

Since [1] if a rgw user already exists then the radosgw-admin user create
command will return an error instead of modifying the current user.
We were already doing separated tasks for create and get operation but
only for multisite configuration but it's not enough.
Instead we should do the get task first and depending on the result
execute the create.
This commit also adds missing run_once and delegate_to statement.

[1] https://github.com/ceph/ceph/commit/269e9b9

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-rgw: allow SSL certificate content to supplied

Allow SSL certificate & key contents to be written to the path
specified by radosgw_frontend_ssl_certificate. This permits a
certificate to be deployed & renewal of expired certificates
through ceph-ansible.

Signed-off-by: Sam Choraria <sam.choraria@bbc.co.uk>

ceph-defaults: remove bootstrap_dirs_xxx vars

Both bootstrap_dirs_owner and bootstrap_dirs_group variables aren't
used anymore in the code.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rgw: extend automatic rgw pool creation capability

Add support for erasure code pools.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1731148
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-rgw-loadbalancer: Fix SSL newline issue

The ad7a5da commit introduced a regression when using TLS on haproxy
via the haproxy_frontend_ssl_certificate variable.
This cause the "stats socket" and the "tune.ssl.default-dh-param"
parameters to be on the same line resulting haproxy failing to start.

[ALERT] 351/140240 (21388) : parsing [xxxxx] : 'stats socket' : unknown
keyword 'tune.ssl.default-dh-param'. Registered
[ALERT] 351/140240 (21388) : Fatal errors found in configuration.

Fixes: #4869
Signed-off-by: Florian Faltermeier <florian.faltermeier@uibk.ac.at>

rgw: don't create user on secondary zones

The rgw user creation for the Ceph dashboard integration shouldn't be
created on secondary rgw zones.

Closes: #4707
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794351
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

purge-cluster: update package list to remove

We only support python3 so renaming all ceph python packages.
Some ceph packages were missing from the list (ceph-mon, ceph-osd or
rbd-mirror) or didn't exist anymore (ceph-fs-common, libcephfs1).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

Revert "vagrant: temp workaround for CentOS 8 cloud image"

The CentOS 8 vagrant image download is now fixed.

This reverts commit a5385e104884a3692954e4691f3348847a35c7fa.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

The _filtered_clients list should intersect with ansible_play_batch

Client configuration with --limit fails without this patch
because certain tasks are only done to the first host in the
_filtered_clients list and it's likely that first host will
not be included in what's sepcified with --limit. To fix this
the _filtered_clients list should be built from all clients
in the inventory that are also in the running play.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798781
Signed-off-by: John Fulton <fulton@redhat.com>

tests: don't install s3cmd on containerized setup

The s3cmd package should only be installed on non containerized
deployment.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-iscsi: don't use ceph_dev_xxx variables

Using ceph_dev_branch and ceph_dev_sha1 for configuring ceph-iscsi
repositories from shaman doesn't make sense because the ceph devel
branches and sha1 aren't compatible with ceph-iscsi devel.
Instead we could rely on the master branch and the latest sha1.
Currently it's not possible to using a custom ceph branch/sha1 value
with iscsi setup otherwise the repository setup will fail.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-nfs: fix ceph_nfs_ceph_user variable

The ceph_nfs_ceph_user variable is a string for the ceph-nfs role but a
list in ceph-client role.
6a6785b introduced a confusion between both variable type in the ceph-nfs
role for external ceph with ganesha.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1801319
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-nfs: add nfs-ganesha-rados-urls package

Since nfs-ganesha 2.8.3 the rados-urls library has been move to a
dedicated package.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-{mon,osd}: move default crush variables

Since ed36a11 we move the crush rules creation code from the ceph-mon to
the ceph-osd role.
To keep the backward compatibility we kept the possibility to set the
crush variables on the mons side but we didn't move the default values.
As a result, when using crush_rule_config set to true and wanted to use
the default values for crush_rules then the crush rule ansible task
creation will fail.

"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute
'crush_rules'"

This patch move the default crush variables from ceph-mon to ceph-osd
role but also use those default values when nothing is defined on the
mons side.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798864
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-grafana: fix grafana_{crt,key} condition

The grafana_{crt,key} aren't boolean variables but strings. The default
value is an empty string so we should do the conditional on the string
length instead of the bool filter

Closes: #5053
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-prometheus: add alertmanager HA config

When using multiple alertmanager nodes (via the grafana-server group)
then we need to specify the other peers in the configuration.

https://prometheus.io/docs/alerting/alertmanager/#high-availability
https://github.com/prometheus/alertmanager#high-availability

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792225
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

containers: add KillMode=none to systemd templates

Because we are relying on docker|podman for managing containers then we
don't need systemd to manage the process (like kill).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

dashboard: allow configuring multiple grafana host

When using multiple grafana hosts then we push set the grafana and
prometheus URL and push the dashboard layout to a single node.

grafana_server_addrs is the list of all grafana nodes and used during
the ceph-dashboard role (on mgr/mon nodes).
grafana_server_addr is the current grafana node used during the
ceph-grafana and ceph-prometheus role (on grafana-server nodes).

We don't have the grafana_server_addr fact duplication code between
external vs collocated nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1784011
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

switch_to_containers: increase health check values

This commit increases the default values for the following variable
consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables in `ceph-defaults` role so the user can
set different values if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Revert "rhcs: update container image name"

This wasn't necesarry. The container image was fixed on the
RedHat's registry

This reverts commit 3bd250c7422a6eaaffb2d8c6a4750b232e6b1c7e.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rhcs: update container image name

The RHCS 4 container image is rhceph/rhceph-4

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1797743
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: remove legacy `osd_scenario` variable

As of stable-4.0 most of these references aren't needed anymore.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-facts: set devices osd_auto_discovery on OSDs

We only need to set the devices fact with osd_auto_discovery on OSD
nodes.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-facts: remove is_podman fact

This was used before the CentOS 8 requirement when using CentOS 7
atomic which has both docker and podman installed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

purge: fix purge cluster failed

Fix purge cluster failed when local container images does not exist.

Purge node-exporter and grafana-server only when dashboard_enabled is set to True.

Signed-off-by: wujie1993 qq594jj@gmail.com

iscsi: Fix crashes during rolling update

During a rolling update we will run the ceph iscsigw tasks that start
the daemons then run the configure_iscsi.yml tasks which can create
iscsi objects like targets, disks, clients, etc. The problem is that
once the daemons are started they will accept confifguration requests,
or may want to update the system themself. Those operations can then
conflict with the configure_iscsi.yml tasks that setup objects and we
can end up in crashes due to the kernel being in a unsupported state.

This could also happen during creation, but is less likely due to no
objects being setup yet, so there are no watchers or users accessing the
gws yet. The fix in this patch works for both update and initial setup.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1795806

Signed-off-by: Mike Christie <mchristi@redhat.com>

ceph-common: rhcs 4 repositories for rhel 7

RHCS 4 is available for both RHEL 7 and 8 so we should also enable the
cdn repositories for that distribution.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796853
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

config: fix external client scenario

When no monitor group is present in the inventory, this task fails.
This affects only non-containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add external_clients scenario

This commit adds a new 'external ceph clients' scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-container-engine: lvm2 on OSD nodes only

Since de8f2a9 the lvm2 package installation has been moved from ceph-osd
role to ceph-container-engine role.
But the scope wasn't limited to the OSD nodes only.
This commit fixes this behaviour.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-defaults: remove rgw from ceph_conf_overrides

The [rgw] section in the ceph.conf file or via the ceph_conf_overrides
variable doesn't exist and has no effect.
To apply overrides to all radosgw instances we should use either the
[global] or [client] sections.
Overrides per radosgw instance should still use the
[client.rgw.{instance-name}] section.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794552
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

dashboard: add quotes when passing password to the CLI

Otherwise, if the variables contains a '$' it will be interpreted as a BASH
variable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: set dashboard|grafana_admin_password

Set these 2 variables in all test scenarios where `dashboard_enabled` is
`True`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

validate: fail if dashboard|grafana_admin_password aren't set

This commit adds a task to make sure user set a custom password for
`grafana_admin_password` and `dashboard_admin_password` variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1795509
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-facts: fix _container_exec_cmd fact value

When using different name between the inventory_hostname and the
ansible_hostname then the _container_exec_cmd fact will get a wrong
value based on the inventory_hostname instead of the ansible_hostname.
This happens when the ceph cluster is already running (update/upgrade).

Later the container exec commands will fail because the container name
is wrong.

We should always set the _container_exec_cmd based on the
ansible_hostname fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1795792
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tox: set extras vars for filestore-to-bluestore

The ansible extra variables aren't set with the ansible-playbook
command running the filestore-to-bluestore playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

filestore-to-bluestore: fix undefine osd_fsid_list

If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).

TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
msg: '''osd_fsid_list'' is undefined

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: add 'all_in_one' scenario

Add new scenario 'all_in_one' in order to catch more collocated related
issues.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

fix calls to `container_exec_cmd` in ceph-osd role

We must call `container_exec_cmd` from the right monitor node otherwise
the value of the fact might mistmatch between the delegated node and the
node being played.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794900
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

filestore-to-bluestore: skip bluestore osd nodes

If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSD for nothing.
Instead we warn the user.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

filestore-to-bluestore: don't fail when with no PV

When the PV is already removed from the devices then we should not fail
to avoid errors like:

stderr: No PV found on device /dev/sdb.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

Ensure that ganesha log directory exists

Some ganesha packages do not create ganesha log directories
while it's expected to be created while changing it's permissions.
Additionally it's no much sense in doing that as a separate task,
so directory is created as correct permissions are set with creation of
the rest required directories.

Signed-off-by: Dmitriy Rabotyagov <drabotyagov@vexxhost.com>

handler: read container_exec_cmd value from first mon

Given that we delegate to the first monitor, we must read the value of
`container_exec_cmd` from this node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792320
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-facts: Fix for 'running_mon is undefined' error, so that
fact 'running_mon' is set once 'grep' successfully exits with 'rc == 0'

Signed-off-by: Vytenis Sabaliauskas <vytenis.sabaliauskas@protonmail.com>

site-container: don't skip ceph-container-common

On HCI environment the OSD and Client nodes are collocated. Because we
aren't running the ceph-container-common role on the client nodes except
the first one (for keyring purpose) then the ceph-role execution fails
due to undefined variables.

Closes: #4970
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794195
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node

When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

filestore-to-bluestore: fix osd_auto_discovery

When osd_auto_discovery is set then we need to refresh the
ansible_devices fact between after the filestore OSD purge
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices is empty at this point for osd_auto_discovery.
Adding the bool filter when needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

common: add a default value for ceph_directories_mode

Since this variable makes it possible to customize the mode for ceph
directories, let's make it a bit more explicit by adding a default value
in ceph-defaults.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

filestore-to-bluestore: --destroy with raw devices

We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.

Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
/dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-osd: set container objectstore env variables

Because we need to manage legacy ceph-disk based OSD with ceph-volume
then we need a way to know the osd_objectstore in the container.
This was done like this previously with ceph-disk so we should also
do it with ceph-volume.
Note that this won't have any impact for ceph-volume lvm based OSD.

Rename docker_env_args fact to container_env_args and move the container
condition on the include_tasks call.
Remove OSD_DMCRYPT env variable from the ceph-osd template because it's
now included in the container_env_args variable.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792122
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-rgw: Fix customize pool size "when" condition

In 3c31b19ab39f297635c84edb9e8a5de6c2da7707, I fixed the `customize pool
size` task by replacing `item.size` with `item.value.size`. However, I
missed the same issue in the `when` condition.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>

handler: fix call to container_exec_cmd in handler_osds

When unsetting the noup flag, we must call container_exec_cmd from the
delegated node (first mon member)
Also, adding a `run_once: true` because this task needs to be run only 1
time.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792320
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Fix undefined running_mon

Since commit [1] running_mon introduced, it can be not defined
which results in fatal error [2]. This patch defines default value which
was used before patch [1]

Signed-off-by: Dmitriy Rabotyagov <drabotyagov@vexxhost.com>
[1] https://github.com/ceph/ceph-ansible/commit/8dcbcecd713b0cd7769d3b4d04ef5c2f15881377
[2] https://zuul.opendev.org/t/openstack/build/c82a73aeabd64fd583694ed04b947731/log/job-output.txt#14011

Fix application for openstack_cephfs pools

RBD is invalid application for cephfs pools, so it was change to cephfs.

Signed-off-by: Dmitriy Rabotyagov <drabotyagov@vexxhost.com>

ceph-facts: move facts to defaults value

There's no need to define a variable via a fact if we can do it via a
default value. Using a fact could be interesseting to override the
default value on some condition.

- ceph_uid could be set to 167 by default because it's only different on
non containerized deployment on Debian/Ubuntu.
- rbd_client_directory_{owner,group,mode} could be set to ceph,ceph,0770
by default install of null as we are doing in the facts.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

group_vars: remove useless files

Delete legacy files that aren't used anymore.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

containers: use --cpus instead --cpu-quota

When using docker 1.13.1, the current condition:

```
{% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%}
```

is wrong because it compares the first digit (1) whereas it should
compare the second one.
It means we always use `--cpu-quota` although documentation recommend
using `--cpus` when docker version is 1.13.1 or higher.

From the doc:
> --cpu-quota=<value> Impose a CPU CFS quota on the container. The number of
> microseconds per --cpu-period that the container is limited to before
> throttled. As such acting as the effective ceiling.
> If you use Docker 1.13 or higher, use --cpus instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

remove container_exec_cmd_mgr fact

Iterating over all monitors in order to delegate a `
{{ container_binary }}` fails when collocating mgrs with mons, because
ceph-facts reset `container_exec_cmd` to point to the first member of
the monitor group.

The idea is to force `container_exec_cmd` to be reset in ceph-mgr.
This commit also removes the `container_exec_cmd_mgr` fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1791282
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tox: use vagrant_up.sh instead of vagrant up

We should use the same vagrant wrapper everywhere instead of the vagrant
command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

vagrant: temp workaround for CentOS 8 cloud image

The CentOS cloud infrastructure storing the vagrant CentOS 8 image
changed the directory path and remove the old 8.0 image so the vagrant
box add centos/8 fails returning a 404 http error.
As a workaround we can pull the image from CentOS instead of letting
vagrant doing the resolution.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

drop use_fqdn variables

This has been deprecated in the previous releases. Let's drop it.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

travis: drop python2 support

Since python2 is EOL we can drop it from travis CI matrix.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

shrink-mds: fix condition on fs deletion

the new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds given that it's done after we shrink the mds,
it means the condition is wrong. Also adding a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-iscsi: don't use bracket with trusted_ip_list

The trusted_ip_list parameter for the rbd-target-api service doesn't
support ipv6 address with bracket.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787531
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

osd: use _devices fact in lvm batch scenario

since fd1718f3796312e29cd5fd64fcc46826741303d2, we must use `_devices`
when deploying with lvm batch scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

update: remove legacy

This task is a code duplicate, probably a legacy, let's remove it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

facts: fix osp/ceph external use case

d6da508a9b6829d2d0633c7200efdffce14f403f broke the osp/ceph external use case.

We must skip these tasks when no monitor is present in the inventory.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790508
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-facts: move grafana fact to dedicated file

We don't need to executed the grafana fact everytime but only during
the dashboard deployment.
Especially for ceph-grafana, ceph-prometheus and ceph-dashboard roles.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790303
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

osd: ensure osd ids collected are well restarted

This commit refact the condition in the loop of that task so all
potential osd ids found are well started.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790212
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

osd: do not run openstack_config during upgrade

There is no need to run this part of the playbook when upgrading the
cluter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: use main playbook for add_osds job

This commit replaces the playbook used for add_osds job given
accordingly to the add-osd.yml playbook removal

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

osd: support scaling up using --limit

This commit lets add-osd.yml in place but mark the deprecation of the
playbook.
Scaling up OSDs is now possible using --limit

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests/setup: update mount options on EL 8

The nobarrier mount flag doesn't exist anymoer on XFS in the EL 8
kernel. That's why the task wasn't working on those systems.
We can still use the other options instead of skipping the task.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-validate: fail on CentOS 7

The Ceph Octopus release is only supported on CentOS 8

Closes: #4918
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: add a docker2podman scenario

This commit adds a new scenario in order to test docker-to-podman.yml
migration playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

docker2podman: use set_fact to override variables

play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

docker2podman: force systemd to reload config

This is needed after a change is made in systemd unit files.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

docker2podman: install podman

This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

purge-iscsi-gateways: don't run all ceph-facts

We only need to have the container_binary fact. Because we're not
gathering the facts from all nodes then the purge fails trying to get
one of the grafana fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

config: exclude ceph-disk prepared osds in lvm batch report

We must exclude the devices already used and prepared by ceph-disk when
doing the lvm batch report. Otherwise it fails because ceph-volume
complains about GPT header.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786682
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

rolling_update: run registry auth before upgrading

There's some tasks using the new container image during the rolling
upgrade playbook that needs to execute the registry login first otherwise
the nodes won't be able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

shrink-rgw: refact global workflow

Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

mon: support replacing a mon

We must pick up a mon which actually exists in ceph-facts in order to
detect if a cluster is running. Otherwise, it will state no cluster is
already running which will end up deploying a new monitor isolated in a
new quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622688
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

handler: fix bug

411bd07d54fc3f585296b68f2fd04484328399b5 introduced a bug in handlers

using `handler_*_status` instead of `hostvars[item]['handler_*_status']`
causes handlers to be triggered in anycase even though
`handler_*_status` was set to `False` on a specific node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622688
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-rgw: Fix custom pool size setting

RadosGW pools can be created by setting

```yaml
rgw_create_pools:
  .rgw.root:
    pg_num: 512
    size: 2
```

for instance. However, doing so would create pools of size
`osd_pool_default_size` regardless of the `size` value. This was due to
the fact that the Ansible task used

```
{{ item.size | default(osd_pool_default_size) }}
```

as the pool size value, but `item.size` is always undefined; the
correct variable is `item.value.size`.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>

ceph-iscsi: manage ipv6 in trusted_ip_list

Only the ipv4 addresses from the nodes running the dashboard mgr module
were added to the trusted_ip_list configuration file on the iscsigws
nodes.
This also add the iscsi gateways with ipv6 configuration to the ceph
dashboard.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787531
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

shrink-mds: do not play ceph-facts entirely

We only need to set `container_binary`.
Let's use `tasks_from` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

shrink-mds: use fact from delegated node

The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

facts: use correct python interpreter

that task is delegated on the first mon so we should always use the
`discovered_interpreter_python` from that node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>