git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log

osd: add 'osd blacklist' cap for osp keyrings

This commits adds the `osd blacklist` cap on all OSP clients keyrings.

Fixes: #2296
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2d955757ee9324a018374f628664e2e15dcb7903)

shrink-osd: Stop ceph-disk container based on ID

Since bedc0ab we now manage ceph-osd systemd unit scripts based on ID
instead of device name but it was not present in the shrink-osd
playbook (ceph-disk version).
To keep backward compatibility on deployment that didn't do yet the
transition on OSD id then we should stop unit scripts for both device
and ID.
This commit adds the ulimit nofile container option to get better
performance on ceph-disk commands.
It also fixes an issue when the OSD id matches multiple OSD ids with
the same first digit.

$ ceph-disk list | grep osd.1
/dev/sdb1 ceph data, prepared, cluster ceph, osd.1, block /dev/sdb2
/dev/sdg1 ceph data, prepared, cluster ceph, osd.12, block /dev/sdg2

Finally removing the shrinked OSD directory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rgw: add beast frontend

Allow to configure the rgw beast frontend in addition to civetweb
(default value).
Add rgw_thread_pool_size variable with 512 as default value and keep
backward compatibility with num_threads option when using civetweb.
Update radosgw_civetweb_num_threads to reflect rgw_thread_pool_size
change.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1733406
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d17b1b48b6d4259f88445a0752e1c13b4522ced0)

ceph-osd: check container engine rc for pools

When creating OpenStack pools, we only check if the return code from
the pool list command isn't 0 (ie: if it doesn't exist). In that case,
the return code will be 2. That's why the next condition is rc != 0 for
the pool creation.
But in containerized deployment, the return code could be different if
there's a failure on the container engine command (like container not
running). In that case, the return code could but either 1 (docker) or
125 (podman) so we should fail at this point and not in the next tasks.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1732157

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d549fffdd24d21661b64b31bda20b4e8c6aa82b6)

tests: Update ooo-collocation scenario

The ooo-collocation scenario was still using an old container image and
doesn't match the requirement on latest stable-3.2 code. We need to use
at least the container image v3.2.5.
Also updating the OSD tests to reflect the changes introduced by the
commit bedc0ab because we don't have the OSD systemd unit script using
device name anymore.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

Remove NBSP characters

Some NBSP are still present in the yaml files.
Adding a test in travis CI.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 07c6695d16bc3e8f6d8d5fdc17bd9830b1d94619)

ceph-osd: use OSD id with systemd ceph-disk

When using containerized deployment we have to create the systemd
service unit based on a template.
The current implementation with ceph-disk is using the device name
as paramater to the systemd service and for the container name too.

$ systemctl start ceph-osd@sdb
$ docker ps --filter 'name=ceph-osd-*'
CONTAINER ID IMAGE                        NAMES
065530d0a27f ceph/daemon:latest-luminous  ceph-osd-strg0-sdb

This is the only scenario (compared to non containerized or
ceph-volume based deployment) that isn't using the OSD id.

$ systemctl start ceph-osd@0
$ docker ps --filter 'name=ceph-osd-*'
CONTAINER ID IMAGE                        NAMES
d34552ec157e ceph/daemon:latest-luminous  ceph-osd-0

Also if the device mapping doesn't persist to system reboot (ie sdb
might be remapped to sde) then the OSD service won't come back after
the reboot.

This patch allows to use the OSD id with the ceph-osd systemd service
but requires to activate the OSD manually with ceph-disk first in
order to affect the ID to that OSD.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670734
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-infra: update handler with daemon variable

Both ntp and chrony daemon use variable for the service name because it
could be different depending on the GNU/Linux distribution.
This has been update in 9d88d3199 for chrony but only for the start part
not for the handler.
The commit fixes this for both ntp and chrony.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0ae0193144897676b56e1d4142565e759531fd35)

Install nfs-ganesha stable v2.7

nfs-ganesha v2.5 and 2.6 have hit EOL. Install nfs-ganesha v2.7
stable that is currently being maintained.

Signed-off-by: Ramana Raja <rraja@redhat.com>
(cherry picked from commit dfff89ce67c4e29007e8cbe21e74d94d85b7c0bc)

validate: improve message printed in check_devices.yml

The message prints the whole content of the registered variable in the
playbook, this is not needed and makes the message pretty unclear and
unreadable.

```
"msg": "{'_ansible_parsed': True, 'changed': False, '_ansible_no_log': False, u'err': u'Error: Could not stat device /dev/sdf - No such file or directory.\\n', 'item': u'/dev/sdf', '_ansible_item_result': True, u'failed': False, '_ansible_item_label': u'/dev/sdf', u'msg': u\"Error while getting device information with parted script: '/sbin/parted -s -m /dev/sdf -- unit 'MiB' print'\", u'rc': 1, u'invocation': {u'module_args': {u'part_start': u'0%', u'part_end': u'100%', u'name': None, u'align': u'optimal', u'number': None, u'label': u'msdos', u'state': u'info', u'part_type': u'primary', u'flags': None, u'device': u'/dev/sdf', u'unit': u'MiB'}}, 'failed_when_result': False, '_ansible_ignore_errors': None, u'out': u''} is not a block special file!"
```

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1719023
(cherry picked from commit e6dc3ebd8c8161e56c0a4a1fdb409272b9fd5342)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

shrink-osd: (ceph-disk only) remove prepare container

When shrinking an OSD, its corresponding 'prepare container' should be
removed otherwise it prevent from redeploying a new osd because of this
leftover.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

shrink-osd: (ceph-disk only) remove gpt header

Removing the gpt header on devices will ease ceph-disk to ceph-volume
migration when using shrink-osd + add-osd playbooks.
ceph-disk requires GPT header where ceph-volume will complain if GPT
header is present.
That won't break ceph-disk (re)deployment since we check and add the GPT
header if needed when deploying ceph-disk ODs.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613735
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-handler: Fix rgw socket in restart script

If the SOCKET variable isn't defined in the script then the test
command won't fail because the return code is 0

$ test -S
$ echo $?
0

There multiple issues in that script:
  - The default SOCKET value isn't defined.
  - Update the wget parameters because the command is doing a loop.
We now use the same option than curl.
  - The check_rest function doesn't test the radosgw at all due to
a wrong test command (test against a string) and always returns 0.
This needs to use the DOCKER_EXEC variable in order to execute the
command.

$ test 'wget http://192.168.100.11:8080'
$ echo $?
0

Resolves: #3926

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c90f605b5148d179790cec545d02db1086579994)

ceph-handler: Fix radosgw_address default value

The rgw restart script set the RGW_IP variable depending on ansible
variables:
  - radosgw_address
  - radosgw_address_block
  - radosgw_interface

Those variables have default values defined in ceph-defaults role:

radosgw_interface: interface
radosgw_address: 0.0.0.0
radosgw_address_block: subnet

But in the rgw restart script we always use the radosgw_address value
instead of the radosgw_interface when defined because we aren't testing
the right default value.
As a consequence, the RGW_IP variable will be set to 0.0.0.0 even if
the ip address associated to the radosgw_interface variable is set
correctly. This causes the check_rest function to fail.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

validate.py: Fix alphabetical order on uca

Alphabetized ceph_repository_uca keys due to errors validating when
using UCA/queens repository on Ubuntu 16.04

An exception occurred during task execution. To see the full
traceback, use -vvv. The error was:
SchemaError: -> ceph_stable_repo_uca schema item is not
alphabetically ordered

Closes: #4154
Signed-off-by: Gabriel Ramirez <gabrielramirez1109@gmail.com>
(cherry picked from commit 82262c6e8c22eca98ffba5d0c65fa65e83a62793)

ceph-nfs: use template module for configuration

789cef7 introduces a regression in the ganesha configuration file
generation. The new config_template module version broke it.
But the ganesha.conf file isn't an ini file and doesn't really
need to use the config_template module. Instead we can use the
classic template module.

Resolves: #4045

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 616c4846980bc01144417416d60fd9bb46aa14a9)

purge: ensure no ceph kernel thread is present

This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation, otherwise if you try to redeploy
a cluster it could end up by getting confused by leftover from previous
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 20e4852888ecc76d8d0fa194a438fa2a90e1cde3)

add-osd: fix error in validate execution role

ceph-facts should be run before we play ceph-validate since it has
reference to facts that are set in ceph-facts role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add nfs-ganesha testing

This was removed because of broken repositories which made the CI
failing. That doesn't make sense anymore so adding back it

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-disk: Set max open files limit on container

Same behaviour than ceph-volume (b987534). The ceph-disk command runs
faster when using ulimit nofile with container cli.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-handler: Fix OSD restart script

There's two big issues with the current OSD restart script.

1/ We try to test if the ceph osd daemon socket exists but we use a
wildcard for the socket name : /var/run/ceph/*.asok.
This fails because we usually have multiple ceph osd sockets (or
other ceph daemon collocated) present in /var/run/ceph directory.
Currently the test fails with:

bash: line xxx: [: too many arguments

But it doesn't stop the script execution.
Instead we can specify the full ceph osd socket name because we
already know the OSD id.

2/ The container filter pattern is wrong and could matches multiple
containers resulting the script to fail.
We use the filter with two different patterns. One is with the device
name (sda, sdb, ..) and the other one is with the OSD id (ceph-osd-0,
ceph-osd-15, ..).
In both case we could match more than needed.

$ docker container ls
CONTAINER ID IMAGE NAMES
958121a7cc7d ceph-daemon:latest ceph-osd-strg0-sda
589a982d43b5 ceph-daemon:latest ceph-osd-strg0-sdb
46c7240d71f3 ceph-daemon:latest ceph-osd-strg0-sdaa
877985ec3aca ceph-daemon:latest ceph-osd-strg0-sdab
$ docker container ls -q -f "name=sda"
958121a7cc7d
46c7240d71f3
877985ec3aca

$ docker container ls
CONTAINER ID IMAGE NAMES
2db399b3ee85 ceph-daemon:latest ceph-osd-5
099dc13f08f1 ceph-daemon:latest ceph-osd-13
5d0c2fe8f121 ceph-daemon:latest ceph-osd-17
d6c7b89db1d1 ceph-daemon:latest ceph-osd-1
$ docker container ls -q -f "name=ceph-osd-1"
099dc13f08f1
5d0c2fe8f121
d6c7b89db1d1

Adding an extra '$' character at the end of the pattern solves the
problem.

Finally removing the get_container_osd_id function because it's not
used in the script at all.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 45d46541cb60818e8ad9a6b2d24fea91c0315525)

ceph-volume: Set max open files limit on container

The ceph-volume lvm list command takes ages to complete when having
a lot of LV devices on containerized deployment.
For instance, with 25 OSDs on a node it takes 3 mins 44s to list the
OSD.
Adding the max open files limit to the container engine cli when
executing the ceph-volume command seems to improve a lot thee
execution time ~30s.

This was impacting the OSDs creation with ceph-volume (both filestore
and bluestore) when using multiple LV devices.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1702285
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b98753488110b04cd2071c2b103493235dfc0c80)

ceph-osd: do not relabel /run/udev in containerized context

Otherwise content in /run/udev is mislabeled and prevent some services
like NetworkManager from starting.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 80875adba791b732713f686a4e4eba182758dc9d)

ceph-infra: make chronyd default NTP daemon

Since timesyncd is not available on RHEL-based OSs, change the default
to chronyd for RHEL-based OSs. Also, chronyd is chrony on Ubuntu, so
set the Ansible fact accordingly.

Fixes: https://github.com/ceph/ceph-ansible/issues/3628
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 9d88d3199fd8c6548a56bf9e95cd9239481baa39)

don't install NTPd on Atomic

Since Atomic doesn't allow any installations and NTPd is not present
on Atomic image we are using, abort when ntp_daemon_type is set to ntpd.

https://github.com/ceph/ceph-ansible/issues/3572
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit bdff3e48fd95820383f3e5816be5a985cbdeab38)

remove ceph-agent role and references

The ceph-agent role was used only for RHCS 2 (jewel) so it's not
usefull anymore.
The current code will fail on CentOS distribution because the rhscon
package is only avaible on Red Hat with the RHCS 2 repository and
this ceph release is supported on stable-3.0 branch.

Resolves: #4020

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7503098ca079b200b3adcb1faf2e255d9c74a581)

tests: Update ansible ssh_args variable

Because we're using vagrant, a ssh config file will be created for
each nodes with options like user, host, port, identity, etc...
But via tox we're override ANSIBLE_SSH_ARGS to use this file. This
remove the default value set in ansible.cfg.

Also adding PreferredAuthentications=publickey because CentOS/RHEL
servers are configured with GSSAPIAuthenticationis enabled for ssh
server forcing the client to make a PTR DNS query.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34f9d51178f4cd37a7df1bb74897dff7eb5c065f)

iscsi: assign application (rbd) to pool 'rbd'

if we don't assign the rbd application tag on this pool,
the cluster will get `HEALTH_WARN` state like following:

```
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
application not enabled on pool 'rbd'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4cf17a6fddc052c944026ae1d138263131e677f8)

ceph-handler: replace fuser by /proc/net/unix

We're using fuser command to see if a process is using a ceph unix
socket file. But the fuser command runs through every PID present in
/proc/<PID> to see if one of them is using the file.
On a system running thousands processes, the fuser command can take
a long time to finish.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1717011

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da9891da1e8b9a8c91077c74e54a9df8ebb7070d)

validate: fail in check_devices at the right task

see https://bugzilla.redhat.com/show_bug.cgi?id=1648168#c17 for details.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1648168#c17
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 771648304d7d867e053f8b8fe3ce5b36e061f100)

spec: bring back possibility to install ceph with custom repo

This can be seen as a regression for customers who were used to deploy
in offline environment with custom repositories.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1673254
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c933645bf7015e08c97058186954483c40ecbfbd)

update default rhcs values and docs

The RHCS documentation mentionned in the default values and
group_vars directory are referring to RHCS 2.x while it should be
3.x.

Revolves: https://bugzilla.redhat.com/show_bug.cgi?id=1702732

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

vagrant: Default box to centos/7

We don't use ceph/ubuntu-xenial anymore but only centos/7 and
centos/atomic-host.
Changing the default to centos/7.

Resolves: #4036

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 24d0fd70030e3014405bf3bf2d628ede4cee6466)

tox: Refact lvm_osds scenario

The current lvm_osds only tests filestore on one OSD node.
We also have bs_lvm_osds to test bluestore and encryption.
Let's use only one scenario to test filestore/bluestore and with or
without dmcrypt on four OSD nodes.
Also use validate_dmcrypt_bool_value instead of types.boolean on
dmcrypt validation via notario.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 52b9f3fb2886d703b25f650221ea973147c68ed6)

igw: Fix rolling update service ordering

We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d7ef12910e7b583fa42f84a7173a87e7c679e79e)

Revert "Revert "cv: support zap by osd fsid""

This reverts commit addcc1e61abb50f53bb82ddac22c643c5ce636b7.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Revert "Revert "shrink_osd: use cv zap by fsid to remove parts/lvs""

This reverts commit 043ee8c1584147665b1a38f27f43e599fc2a775f.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

osds: allow passing devices by path

ceph-volume didn't work when the devices where passed by path.
Since it now support it, let's allow this feature in ceph-ansible

Closes: #3812
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f2c45dfd3d1d3875a480247ca047aa52d7cd1b1)

Revert "cv: support zap by osd fsid"

This reverts commit 8454f0144af10834da0cddb508a5dea11bda3c72.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Revert "shrink_osd: use cv zap by fsid to remove parts/lvs"

This reverts commit be59e0b451df6028c71eca54754d4d1464a8cc83.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

osd: set default bluestore_wal_devices empty

We only need to set the wal dedicated device when there's three tiers
of storage used.
Currently the block.wal partition will also be created on the same
device than block.db.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1685253
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

rolling_update: restart all ceph-iscsi services

Currently only rbd-target-gw service is restarted during an update.
We also need to restart tcmu-runner and rbd-target-api services
during the ceph iscsi upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1048627eaab27563511011fa3cc31b525e2f4c9)

ceph-mds: Increase cpu limit to 4

In containerized deployment the default mds cpu quota is too low
for production environment.
This is causing performance degradation compared to bare-metal.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1695850
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1999cf3d1902456aa123ed3c96116c21e88799bb)

ceph-osd: Fix merge conflict from mergify

The PR #3916 was merged automatically by mergify even if there was a
confict in the ceph-osd-run.sh.j2 template.
This commit resolves the conflict.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-osd: Increase cpu limit to 4

In containerized deployment the default osd cpu quota is too low
for production environment using NVMe devices.
This is causing performance degradation compared to bare-metal.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1695880
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c17106874c29f3eafb196a30b97fd1f8fd52e768)

# Conflicts:
# roles/ceph-osd/templates/ceph-osd-run.sh.j2

ansible.cfg: Add library path to configuration

Ceph module path needs to be configured if we want to avoid issues
like:

no action detected in task. This often indicates a misspelled module
name, or incorrect module path

Currently the ansible-lint command in Travis CI complains about that.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1668478
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a1a871cadee5e86d181e1306c985e620b81fccac)

ceph-mon: increase timeout waiting for admin and bootstrap keys

With a large and/or busy cluster, it can take significantly more than
30s for a restarted monitor to get to the point where
`ceph-create-keys` returns successfully. A recent upgrade of our
production cluster failed here because it took a couple of minutes for
the newly-upgraded `mon` to be ready. So increase the timeout
significantly.

This patch is applied to stable-3.2, because the affected code is
refactored in stable-4.0 and ceph-create-keys is no longer called.

Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>

tests: Add debug to ceph-override.json

It's usefull to have logs in debug mode enabled in order to have
more information for developpers.
Also reindent to json file.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d25af1b872628607e37741c760aa31b88229f3da)

tests/functional: use ceph-override.json symlink

We don't need to have multiple ceph-override.json copies. We
currently already have symlink to all_daemons/ceph-override.json so
we can do it for all scenarios.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a19054be18d748a5133216a2339f1989a0e0b27b)

ceph-mds: Set application pool to cephfs

We don't need to use the cephfs variable for the application pool
name because it's always cephfs.
If the cephfs variable is set to something else than the default
value it will break the appplication pool task.

Resolves: #3790

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d2efb7f02b9e6f6888449dbaeba0e2435606ca43)

remove all NBSPs char in stable-3.2 branch

this can cause issues, let's replace all of these chars with real
spaces.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

UCA: Uncomment UCA variables in defaults, fix consequent breakage

The Ubuntu Cloud Archive-related (UCA) defaults in
roles/ceph-defaults/defaults/main.yml were commented out, which means
if you set `ceph_repository` to "uca", you get undefined variable
errors, e.g.

```
The task includes an option with an undefined variable. The error was: 'ceph_stable_repo_uca' is undefined

The error appears to have been in '/nfs/users/nfs_m/mv3/software/ceph-ansible/roles/ceph-common/tasks/installs/debian_uca_repository.yml': line 6, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: add ubuntu cloud archive repository
^ here

```

Unfortunately, uncommenting these results in some other breakage,
because further roles were written that use the fact of
`ceph_stable_release_uca` being defined as a proxy for "we're using
UCA", so try and install packages from the bionic-updates/queens
release, for example, which doesn't work. So there are a few `apt` tasks
that need modifying to not use `ceph_stable_release_uca` unless
`ceph_origin` is `repository` and `ceph_repository` is `uca`.

Closes: #3475
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 9dd913cf8a3dcc12683b55ae13d95bca6f15cd32)

ceph-osd: Drop memory flag with bluestore

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit dc1c0dcee21deefa359cae30d3a733a4348bfd2f)

mon/rgw: use last ipv6 address

When using monitor_address_block or radosgw_address_block variables
to configure the mon/rgw address we're getting the first ip address
from the ansible facts present in that cidr.
When there's VIP on that network the first filter could return the
wrong value.
This seems to affect only IPv6 setup because the VIP addresses are
added to the ansible facts at the beginning of the list. This is the
opposite (at the end) when using IPv4.
This causes the mon/rgw processes to bind on the VIP address.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1680155
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: fix update job

jenkins sets CEPH_ANSIBLE_BRANCH to stable-3.2, this makes all
nightly job failing.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

rgw multisite: add more than 1 rgw to the master or secondary zone

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1664869
Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 37f46a8c5de9585c2639cc4741ee8f62bc2c854b)

tests: run lvm_setup.yml on secondary cluster

otherwise ceph-osd fails:

```
ceph-volume lvm prepare: error: Unable to proceed with non-existing device: test_group/data-lv2
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

radosgw: Raise cpu limit to 8

In containerized deployment the default radosgw quota is too low
for production environment.
This is causing performance degradation compared to bare-metal.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1680171
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d3ae9fd05fe46933a1437501b0d8a5edb4ca2056)

tests: do not deploy ceph@master in rgw_multisite

deploying ceph@master in stable-3.2 is not possible.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add back testinfra testing

136bfe0 removed testinfra testing on all scenario excepted all_daemons

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8d106c2c58d354e10335ca017fd8df4c427e38a6)

tests: pin pytest-xdist to 1.27.0

looks like newer version of pytest-xdist requires pytest>=4.4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ba0a95211cc00b2cae14b018722f437c0091a2ef)

purge: fix lvm-batch purge osd

`lvm_volumes` and/or `devices` variable(s) can be undefined depending on
the scenario chosen.

These tasks should be run only if these variable are defined, otherwise
it ends up with undefined variable errors.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 01807383132c2897e331dcc665f062f8be0feeb8)

tests: test idempotency only on all_daemons job

there's no need to test this on all scenarios.
testing idempotency on all_daemons should be enough and allow us to save
precious resources for the CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 136bfe096c5e97c5c983d02882919d4af2af48a6)

rolling_update: Update systemd unit regex for nvme

The systemd unit regex doesn't handle nvme devices (/dev/nvmeXn1).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1687828
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c8442f3705d0bd7b64fe2b14d925a82d52a052e4)

tests: refact update scenario (stable-3.2)

refact the update scenario like it has been made in master.
(see f0e616962)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

purge-docker-cluster: Remove ceph-osd service

The systemd ceph-osd@.service file used for starting the ceph osd
containers is used in all osd_scenarios.
Currently purging a containerized deployment using the lvm scenario
didn't remove the ceph-osd systemd service.
If the next deployment is a non-containerized deployment, the OSDs
won't be online because the file is still present and override the
one from the package.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7cc626b72dbb242a00f714d925b6aea6b4524c37)

tox: Fix container purge jobs

On containerized CI jobs the playbook executed is purge-cluster.yml
but it should be set to purge-docker-cluster.yml

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd0869cd01090e135a9312a6890ed7611f8e3a1c)

tests: add mgr and nfs nodes in all_daemons

even not used, we need to fire up those VMs to be able to perform the
upgrade in the CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Add uca to ceph_repository choices validation

Ubuntu cloud archive is configurable via ceph_repository variable but
the uca choice isn't accepted.
This commit fixes this issue and also validates the associated uca
repository variables.

Resolves: #3739

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 94505a3af264fc0e847e71cbf633dc02ce58e6ff)

defaults: change default value for ceph_docker_image_tag

Since nautilus has been released, it's now the latest stable release, it
means the tag `latest` now refers to nautilus.

`stable-3.2` isn't intended to deploy nautilus, therefore, we should
change the default value for this variable to the latest release
stable-3.2 is able to deploy (mimic).

Closes: #3734
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-osd: Ensure lvm2 is installed

When using osd_scenario lvm, we never check if the lvm2 package is
present on the host.
When using containerized deployment and docker on CentOS/RedHat this
package will be automatically installed as a dependency but not for
Ubuntu distribution.
OSD deployed via ceph-volume require the lvmetad.socket to be active
and running.

Resolves: #3728

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 179fdfbc19ab7001fdd185f131039043d690bbe8)

ceph_crush: fix rstrip for python 3
Removing bytes literals since rstrip only supports type String or None.

Please backport to stable-3.2

Signed-off-by: Bruceforce <markus.greis@gmx.de>
(cherry picked from commit 6d506dba1a6fb3a827460d3a7090517cf3241c39)

ceph_volume: fix rstrip for python 3
Removing bytes literals since rstrip only supports type String or None.

Signed-off-by: Bruceforce <markus.greis@gmx.de>

Remove trailing forward slash in ceph_docker_registry variable from group_vars/rhcs.yml.sample file.

Also fixed rhcs_edits.txt for variable ceph_docker_registry.

Moved namespace to ceph_docker_image variable.

Signed-off-by: Phuong Nguyen <pnguyen@redhat.com>
(cherry picked from commit 3305309e87b16c42af5f7faf35fd322241e8e964)

osd: backward compatibility with old disk_list.sh location

Since all files in container image have moved to `/opt/ceph-container`
this check must look for new AND the old path so it's backward
compatible. Otherwise it could end up by templating an inconsistent
`ceph-osd-run.sh`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 987bdac963cee8d8aba1f10659f23bb68c2b1d1b)

ceph-validate: fail if there's no ipaddr available in monitor_address_block subnet

When using monitor_address_block to determine the ip address of the
monitor node, we need an ip address available in that cidr to be
present in the ansible facts (ansible_all_ipv[46]_addresses).
Currently we don't check if there's an ip address available during
the ceph-validate role.
As a result, the ceph-config role fails due to an empty list during
ceph.conf template creation but the error isn't explicit.

TASK [ceph-config : generate ceph.conf configuration file] *****
fatal: [0]: FAILED! => {"msg": "No first item, sequence was empty."}

With this patch we will fail before the ceph deployment with an
explicit failure message.

Resolves: rhbz#1673687

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5c39735be530b2c7339510486bc4078687236bbb)

Change docker_container parameter network to network_mode

Addressing "populate kv_store with custom ceph.conf":
Unsupported parameters for (docker_container) module. Looking at
https://docs.ansible.com/ansible/latest/modules/docker_container_module.html
shows that the correct parameter is network_mode, not network.

Signed-off-by: Gregory Orange <gregoryo2014@users.noreply.github.com>

Set the default crush rule in ceph.conf

Currently the default crush rule value is added to the ceph config
on the mon nodes as an extra configuration applied after the template
generation via the ansible ini module.

This implies two behaviors:

1/ On each ceph-ansible run, the ceph.conf will be regenerated via
ceph-config+template and then ceph-mon+ini_file. This leads to a
non necessary daemons restart.

2/ When other ceph daemons are collocated on the monitor nodes
(like mgr or rgw), the default crush rule value will be erased by
the ceph.conf template (mon -> mgr -> rgw).

This patch adds the osd_pool_default_crush_rule config to the ceph
template and only for the monitor nodes (like crush_rules.yml).
The default crush rule id is read (if exist) from the current ceph
configuration.
The default configuration is -1 (ceph default).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1638092
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d8538ad4e16fe76d63e607491d41793303f929b1)

add-osd.yml: Add become flag for ceph-validate

The check_devices task fails if the ceph-validate role isn't executed
as a privileged user (Permission denied).

failed: [osd0] (item=/dev/sdb) => {"changed": false, "err": "Error:
Error opening /dev/sdb: Permission denied\n", "item": "/dev/sdb",
"msg": "Error while getting device information with parted script:
'/sbin/parted -s -m /dev/sdb -- unit 'MiB' print'", "out": "", "rc": 1}

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b23c05ae5200255fb8452c26834de1e9db1497cc)

ceph-osd: Install numactl package when needed

With 3e32dce we can run OSD containers with numactl support.
When using numactl command in a containerized deployment we need to
be sure that the corresponding package is installed on the host.
The package installation is only executed when the
ceph_osd_numactl_opts variable isn't empty.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b7f4e3e7c7d73d931fd7fec4c940384c890dee42)

osd: support numactl options on OSD activate

This commit adds OSD containers activate with numactl support.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1684146
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b3eb9206fada05df811602217d8770db854e0adf)

tests: add mgrs section in non_container-collocation

No mgrs are deployed in this scenario, causing the testinfra jobs to
fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: fix collocation scenario

ceph_origin and ceph_repository are mandatory variables.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: use memory backend for cache fact

force ansible to generate facts for each run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a1bafdc2181b3a951991fcc9a5108edde757615)

tests: pin testinfra version

As of testinfra 2.0.0, the binary name is `py.test`.

But let's pin the version to 1.19.0.
Indeed, migrating to 2.0.0 requires our current testing to be reworked a bit.
Since we don't have the bandwidth ATM for this, it's better to simply
keep testing with testinfra 1.19.0.

Note that I've replaced all `testinfra` occurences by `py.test` anyway.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b42250332a0c9afadcb5a670cc7153d72fd8daec)

add-osd: gather facts in second part of playbook

otherwise, it will end up with error like following:

```
FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}
```

because facts won't have been gathered.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670663
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a4408785332512f6fab6134ee488a4cec18639c1)

purge: fix rbd-mirror group name

the default is rbdmirrors in ceph-defaults

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 47ebef374ffbcaed496ff42b3f13bfd21951c333)

purge: fix rbd mirror purge

as of b70d54ac809a92cd88e39e3efa7ed3fee864a866 the service launched isn't
ceph-rbd-mirror@admin.service.

it's now `ceph-rbd-mirror@rbd-mirror.{{ ansible_hostname }}`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a9153084778d5bc219d33e3769485cbea8d7a6a9)

purge: do not remove /var/lib/apt/lists/*

removing the content of this directory seems a bit agressive and cause a
redeployment to fail after a purge on debian based distrubition.

Typical error:
```
fatal: [mon0]: FAILED! => changed=false
  attempts: 3
  msg: No package matching 'ceph' is available
```

The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
  apt:
    update_cache: yes
    cache_valid_time: 3600
  register: result
  until: result is succeeded
```

since the task installing ceph packages has a `update_cache: no` it
fails:

```
- name: install ceph for debian
  apt:
    name: "{{ debian_ceph_pkgs | unique }}"
    update_cache: no
    state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
    default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
  register: result
  until: result is succeeded
```

/tmp/* isn't specific to ceph as well, so we shouldn't remove everything
in this directory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3849f30f58e35d559bfa137fa301214192db993b)

purge: fix purge of lvm devices

using `shell` module seems to be the only way to make this task working
on rhel based distribution AND debian based distributions.

on ubuntu, using `command` ansible module fails like following
(not due to `sudo` usage or not):
```
ok: [osd1] => changed=false
  cmd: command -v ceph-volume
  failed_when_result: false
  msg: '[Errno 2] No such file or directory: ''command'': ''command'''
  rc: 2
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 89f77589fa0a431490933379c0781e1df2b95440)

Extends check_devices tasks to non-collocated an lvm-batch scenarios

Tuned name of a task and error message to make it more user understandable

Fixes BZ 1648168 - ceph-validate : devices are not validated in non-collocated and lvm_batch scenario

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1648168
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
(cherry picked from commit 34c25ef49b10ef6c789447e785a4bf6938c2a804)

Convert interface names to underscores

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540881
Signed-off-by: Tomas Petr <tpetr@redhat.com>
(cherry picked from commit 573adce7dd4f306c384b3308c8049ae49ef59716)

osd: add ipc=host in systemd template for containers

in addition to 15812970f033206b8680cc68351952d49cc18314

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d5be83e5042a5e22ace6250234ccd81acaffb0a2)

tests: update ceph_volume tests

accordingly to change introduced by b5548ea9412cd7741bee993dddcbfd9daa34cb02

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f2dcb02d213e862c5a5498c2d12cd86b22676c84)

cv: expose host ipc namespace to ceph-volume container

this is needed to properly handle semaphore synchronization for udev
actions via dmcrypt/cryptsetup.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1683770
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 15812970f033206b8680cc68351952d49cc18314)

# Conflicts:
# library/ceph_volume.py

tests: add lvm bluestore dmcrypt support

Add coverage for container / non container lvm bluestore dmcrypt OSDs

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 207fae38d480f7de369106c5bda1dfe0f1b6033c)

Removed not needed mountpoint and removed ubuntu section

Referring to BZ#1683290, as dsavineau suggests, being this
bug tripleO specific, removed the ubuntu section and removed
useless mountpoints.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1683290
Signed-off-by: fpantano <fpantano@redhat.com>
(cherry picked from commit 21fad7ced344e441ffcd5c4010d634b81ead517f)

Added to the ceph-radosgw service template the ca-trust
volume avoiding to expose useless information.
This bug is referred to the following bugzilla:

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1683290
Signed-off-by: fpantano <fpantano@redhat.com>
(cherry picked from commit 0c1944236bfb397e9dff6ef436569556bc00379d)

Set permissions on monitor directory to u=rwX,g=rX,o=rX recursive

Set directories to 755 and files to 644 to
/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }} recursively instead of
setting files and directories to 755 recursively. The ceph mon
process writes files to this path with permissions 644. This update stops
ansible from updating the permissions in
/var/lib/ceph/mon/{{ cluster }}-{{ monitor_name }} every time ceph mon writes
a file and increases idempotency.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1683997
Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d327681b99915578fc8b389fda69556966db905f)

mon: Move client admin variable to defaults

There's no need to set the client_admin_ceph_authtool_cap variable
via a set_fact task.
Instead we can set this in the role defaults.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 58a9d310d5651171214dc2a621cf2ba197229951)