git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log

doc: add day-2 operations documentation

This commit is the first of a series describing all day-2 operations
that are possible via ceph-ansible using the set of playbooks provided in
the `infrastructure-playbooks` directory.

Fixes: #5061
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7e800303e9933cb61a5288b608e8d4f2cfdd7746)

library/ceph_volume: look for error messages in stderr

Error messages were moved from stdout to stderr here:
https://github.com/ceph/ceph/commit/b8d6dcbe9f803c96c0af68da54f1262e9b6a9e77#diff-20f7c578a4e69ec61a5869d706567a24R137.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1793542
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 4249d1e02d6da07466a4ddf1282cf4600a131773)

ceph-validate: update RHEL requirement for RHCS

We were not testing the right ansible_distribution fact value for RHEL
distribution.
This commit also updates the minimal RHEL version supported by RHCS.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5de74fe512575b2873b5863f5817f676954d3469)

add-osd: refact the playbook

There's no need to have two plays anymore since we now set/unset osd
flags in `ceph-osd` role.

Also, this commit makes the `ceph-facts` role run after
`ceph-defaults`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

add-osd: fix fact gathering in add-osd

This commit makes this playbook gather facts from all nodes except
clients.
When OSDs are collocated on other nodes, it can fail as follows:

```
fatal: [vm252-11]: FAILED! => {
"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"
}
```

In that case, a fact from a RGW node is called when rendering the
`ceph.conf.j2` but it fails because facts are gathered only from mon and
osd nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1806765
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

add-osd: unset noup flag after last osd is deployed

This commit fixes a bug when using the `add-osd.yml` playbook.
The `noup` flag is set early but was never unset before the "wait for pgs
clean" check, so the playbook always failed because OSDs were never
seen UP.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1816023
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph_key: fetch key when needed

Fetch the key when it is present in the cluster but not on the node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ccfa249919b338197daec353cb5d4e535b3fb734)

ceph_key: fix idempotency when no secret is passed

553584cbd0d014429e665f998776e8d198f72d2b introduced a regression: when no
secret is passed, the secret is overwritten each time the task is run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 003defec0311af0f03da861f80d596852bdb9cf5)

ceph_key: remove 'update' state

With this change, the state `present` is enough to update a keyring.
If the keyring already exists, it will be updated if caps or secret
passed to the module are different.
If the keyring doesn't exist, it will be created.
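
The decision described above can be sketched in a few lines. This is only an
illustration, not the module's code: `plan_keyring_action`, the shape of
`existing`, and the choice to treat a missing secret as "keep the current
one" are all assumptions.

```python
def plan_keyring_action(existing, caps, secret=None):
    """Return 'create', 'update' or 'unchanged' for state=present.

    `existing` is None when the keyring is absent, otherwise a dict with
    'caps' and 'secret' keys (hypothetical shape, for illustration only).
    """
    if existing is None:
        return "create"
    caps_differ = existing["caps"] != caps
    # Assumption: no secret passed means "keep the current secret".
    secret_differs = secret is not None and existing["secret"] != secret
    return "update" if caps_differ or secret_differs else "unchanged"
```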

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1808367
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 553584cbd0d014429e665f998776e8d198f72d2b)

tests: add mgr nodes to shrink_mon inventory

Since 306ce82 we explicitly fail when there's no mgr node present in the
inventory.

fatal: [mon0]: FAILED! => {
"changed": false
}

MSG:

Please add a mgr host to your inventory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

osd: support changing default rule even when osd_crush_location isn't defined

Creating crush rules even with no crush hierarchy configuration is a
valid scenario, so adding new crush rules shouldn't depend on the result
of the first task (which configures the crush hierarchy).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1816989
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5b0476385ccb00a9edb9092a183c18e2637afd5d)

Add site-container.yml symlink

This adds a symlink to the site-docker.yml.sample playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

switch_to_containers: exclude clients nodes from facts gathering

Just like site.yml and rolling_update, let's exclude client nodes from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b45375a2c0566406b49896ae316a293)
(cherry picked from commit 5c3ba0787cf346c7e5eb5df74a1da4984c16e7aa)

main: exclude client nodes from facts gathering when delegate_facts_host

This commit excludes client nodes from facts gathering: their facts are
not needed, and skipping them speeds up this task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 865d2eac9ba81bdb9ecbd841e4b73608648dfae2)

The _filtered_clients list should intersect with ansible_play_batch

Client configuration with --limit fails without this patch
because certain tasks are only done to the first host in the
_filtered_clients list and it's likely that first host will
not be included in what's specified with --limit. To fix this,
the _filtered_clients list should be built from all clients
in the inventory that are also in the running play.
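
As a rough sketch of the intersection described above (the host names and
the `filter_clients` helper are made up for illustration, and this is plain
Python rather than the playbook's own expression):

```python
def filter_clients(inventory_clients, play_batch):
    """Keep inventory order, drop clients that are not in the running play."""
    running = set(play_batch)
    return [host for host in inventory_clients if host in running]


clients = ["client0", "client1", "client2"]   # all clients in the inventory
batch = ["client1", "client2", "mon0"]        # hosts left after --limit
filtered = filter_clients(clients, batch)
first_client = filtered[0]                    # host that runs the one-off tasks
```

With the list built this way, the first element is always a host that is
part of the current run.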

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798781
Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit e4bf4857f556465c60f89d32d5f2a92d25d5c90f)

defaults: remove legacy comment

This is no longer true, let's remove this comment given that this option
is not ignored in containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e551b5ba1a65e653540b5b5c7cb3a4f5d32b2540)

docker-common: remove legacy tasks for ntp configuration

Those tasks aren't needed in docker-common since the introduction of
`ceph-infra` role. They are duplicated tasks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1810376
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit cd0195c5622ecc2d0001eb7c988d7b6a6fac1d5e)

tests: add inventory host for 4.0 upgrade job

This inventory is intended to be used in the upgrade scenario in
stable-4.0 branch.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: modify add-osd job

This commit modifies the way we test add-osd scenario given that the
playbook add-osd.yml is broken at the moment.

As a workaround we can use main playbook with `--limit` to achieve this
operation.

Note: This commit is intended to be reverted once we get a fix.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: pg num should be a power of two number

This patch changes the pg_num value of the rgw pools foo and bar to be
a power of two number.
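
For reference, a small check for this property. The helper names are
hypothetical and the values used with them are examples, not the ones from
the test configuration:

```python
def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be >= 1)."""
    return 1 << (n - 1).bit_length()
```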

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-rgw: Fix customize pool size "when" condition

In 3c31b19ab39f297635c84edb9e8a5de6c2da7707, I fixed the `customize pool
size` task by replacing `item.size` with `item.value.size`. However, I
missed the same issue in the `when` condition.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 3842aa1a30277b5ea3acf78ac1aef37bad5afb14)

ceph-rgw: Fix custom pool size setting

RadosGW pools can be created by setting

```yaml
rgw_create_pools:
  .rgw.root:
    pg_num: 512
    size: 2
```

for instance. However, doing so would create pools of size
`osd_pool_default_size` regardless of the `size` value. This was due to
the fact that the Ansible task used

```
{{ item.size | default(osd_pool_default_size) }}
```

as the pool size value, but `item.size` is always undefined; the
correct variable is `item.value.size`.
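
The difference can be reproduced outside Ansible. The sketch below mimics
the key/value shape that a dict loop hands to the task; `dict_items` is a
hypothetical helper and the default size of 3 is an assumed value:

```python
rgw_create_pools = {".rgw.root": {"pg_num": 512, "size": 2}}
osd_pool_default_size = 3  # assumed default, for illustration only


def dict_items(mapping):
    """Mimic the item shape of a dict loop: {'key': ..., 'value': ...}."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


for item in dict_items(rgw_create_pools):
    # item.size: the item only has 'key' and 'value', so the default wins.
    wrong = item.get("size", osd_pool_default_size)
    # item.value.size: the pool settings live under 'value'.
    right = item["value"].get("size", osd_pool_default_size)
```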

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 3c31b19ab39f297635c84edb9e8a5de6c2da7707)

ceph-{mon,osd}: move default crush variables

Since ed36a11, the crush rules creation code lives in the ceph-osd role
instead of the ceph-mon role.
To keep backward compatibility we kept the possibility to set the
crush variables on the mons side, but we didn't move the default values.
As a result, when crush_rule_config is set to true and the default
values for crush_rules are expected to apply, the crush rule creation
task fails.

"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute
'crush_rules'"

This patch moves the default crush variables from the ceph-mon to the
ceph-osd role and also uses those default values when nothing is defined
on the mons side.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798864
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1fc6b337142efdc76c10340c076653d298e11c68)

ceph-validate: fail if no mgr host is present

We already stop the upgrade playbook (rolling_update.yml) if there's
no mgr node present so we should also do the same for initial
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1788644
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-mon: use interactive session with aliases

When ceph aliases are used with commands that must be stopped manually
(for example with Ctrl+C), the command keeps running inside the
container.
To handle this, we should use the interactive session option (-it)
with the docker commands.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1797874
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

iscsi: Fix crashes during rolling update

During a rolling update we will run the ceph iscsigw tasks that start
the daemons then run the configure_iscsi.yml tasks which can create
iscsi objects like targets, disks, clients, etc. The problem is that
once the daemons are started they will accept configuration requests,
or may want to update the system themselves. Those operations can then
conflict with the configure_iscsi.yml tasks that set up objects and we
can end up in crashes due to the kernel being in an unsupported state.

This could also happen during creation, but is less likely because no
objects are set up yet, so there are no watchers or users accessing the
gws yet. The fix in this patch works for both update and initial setup.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1795806

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 77f3b5d51b84a6338847c5f6a93f22a3a6a683d2)

tests: retry to fire up VMs on vagrant failure

Add a script to retry several times to fire up VMs to avoid vagrant
failures.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 1ecb3a9352d869d8fde694cefae9de8af8f6fee8)

config: fix external client scenario

When no monitor group is present in the inventory, this task fails.
This affects only non-containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e7bc0794054008ac2d6771f0d29d275493319665)

tests: add external_clients scenario

This commit adds a new 'external ceph clients' scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 641729357e5c8bc4dbf90b5c4f7b20e1d3d51f7d)

validate: allow running ceph-ansible 3.2 against ansible 2.7

This commit allows ceph-ansible 3.2 to be run against ansible 2.7

However, note that running stable-3.2 against ansible 2.7 doesn't get
any testing upstream, so this might break the playbook. Only ansible 2.6
is officially supported.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1781635
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add 'all_in_one' scenario

Add new scenario 'all_in_one' in order to catch more collocated related
issues.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3e7dbb4b16dc7de3b46c18db4c00e7f2c2a50453)

update: remove legacy tasks

These tasks should have been removed with backport #4756

Note:
This should have been backported from master but it's not possible
because of too many changes between master and stable-3.2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1740463
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-defaults: remove rgw from ceph_conf_overrides

The [rgw] section in the ceph.conf file or via the ceph_conf_overrides
variable doesn't exist and has no effect.
To apply overrides to all radosgw instances we should use either the
[global] or [client] sections.
Overrides per radosgw instance should still use the
[client.rgw.{instance-name}] section.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794552
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2f07b8513158d3fc36c5b0d29386f46dd28b5efa)

defaults: change monitor|radosgw_address default values

To avoid confusion, let's change the default value from `0.0.0.0` to
`x.x.x.x`.
Users might think setting `0.0.0.0` will make the daemon bind to all
interfaces.

Fixes: #4827
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fc02fc98ebce0f99e81628e76ad28e7bf65435de)

tox: allow copy admin key for purge scenario

This is enabled in the group_vars/clients file but it's overridden in
extra vars by tox.
Let's do it like that for now.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: add coverage on purge playbook

This commit adds a playbook to be played before we run the purge playbook.
It first creates an rbd image, then maps an rbd device on client0 so the
purge playbook will try to unmap it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit db77fbda15bf9f79f8122559b01b6625005ae29c)

purge: use sysfs to unmap rbd devices

In a containerized context, using the binary provided in atomic os won't
work because it's an old version provided by ceph-common based on
10.2.5.
Using a container could be an idea, but for large clusters with hundreds
of client nodes, that would require pulling the image on each of them
just to unmap the rbd devices.

Let's use the sysfs method in order to avoid any issue related to ceph
version that is shipped on the host.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766064
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3cfcc7a105156dfde65b23e9d8662cd848537094)

update: only run post osd upgrade play on 1 mon

There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589bde0eecb8ac72a7ec8bc1f66403eeb)

update: use flags noout and nodeep-scrub only

1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags
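
The ordering of these three steps can be sketched as below. `run` is a
hypothetical callable standing in for whatever executes a step, and the
step strings are labels for illustration, not real commands:

```python
def upgrade_osd_nodes(nodes, run):
    """Drive the sequence: set flags, upgrade node by node, unset flags."""
    flags = ("noout", "nodeep-scrub")
    for flag in flags:
        run(f"set {flag}")
    for node in nodes:
        run(f"upgrade {node}")
        run("wait for active+clean pgs")
    # Flags are cleared once, after the last node, not between nodes.
    for flag in flags:
        run(f"unset {flag}")


steps = []
upgrade_osd_nodes(["osd0", "osd1"], steps.append)
```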

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b9535348dff616665be749503f80c4fca)

ceph-defaults: exclude rbd devices from discovery

The RBD devices aren't excluded from the devices list in the LVM auto
discovery scenario.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783908
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 6f0556f01536932bdf47e8f1aab341b2c6761537)

ceph-osd: wait for all osds once

cf8c6a3 moves the 'wait for all osds' task from openstack_config to the
main tasks list.
But the openstack_config code was executed only on the last OSD node.
We don't need to do this check on all OSD nodes, so we need to set
run_once to true on that task.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5bd1cf40eb5823aab3c4e16b60b37c30600f9283)

ceph-osd: wait for all osd before crush rules

When creating crush rules with the device class parameter we need to be
sure that all OSDs are up and running because the device class list is
populated with this information.
This is now enabled for all scenarios, not only openstack_config.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf8c6a384999be8caedce1121dfd57ae114d5bb6)

rolling_update: create crush rule after osd play

When upgrading from jewel to luminous we can execute the crush rule tasks
only after the 'osd require-osd-release luminous' command has been run.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-osd: add device class to crush rules

This adds device class support to crush rules when using the class key
in the rule dict via the create-replicated sub command.
If the class key isn't specified then we use the create-simple sub
command for backward compatibility.
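
A minimal sketch of that selection, assuming the rule dict carries `name`,
`root` and `type` keys next to the optional `class` key (those three key
names and the `crush_rule_command` helper are assumptions for illustration):

```python
def crush_rule_command(rule):
    """Build the `ceph osd crush rule ...` argument list for one rule dict."""
    base = ["ceph", "osd", "crush", "rule"]
    if "class" in rule:
        # Device class given: use the create-replicated sub command.
        return base + ["create-replicated", rule["name"], rule["root"],
                       rule["type"], rule["class"]]
    # No class key: fall back to create-simple for backward compatibility.
    return base + ["create-simple", rule["name"], rule["root"], rule["type"]]
```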

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1636508
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ef2cb99f739ade80e285d83050ac01184aafc753)

move crush rule creation from mon to osd role

If we want to create crush rules with the create-replicated sub command
and device class then we need to have the OSD created before the crush
rules otherwise the device classes won't exist.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ed36a11eabbdbb040652991300cdfc93d51ed491)

ceph-validate: add rbdmirror validation

When ceph_rbd_mirror_configure is set to true we need to ensure that
the required variables aren't empty.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1760553
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4a065cebd70d259bfd59b6f5f9baa45d516a9c3a)

switch_to_containers: set GUID on lockbox part

The ceph lockbox partition (part number 5) used with non lvm scenarios
and in non containerized deployments doesn't have a valid PARTUUID.
The value is set to 00000000-0000-0000-0000-000000000000 for each OSD
device.

$ blkid -t PARTLABEL="ceph lockbox" -o value -s PARTUUID
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000

When switching to containerized deployment we manually mount the lockbox
partition by using the PARTUUID.
Unfortunately, because we usually have multiple OSDs on the same
node, we can't have the right symlink in /dev/disk/by-partuuid because it
will point to only one partition.

/dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000 -> ../../sdb5

After the switch_to_containers playbook, only one OSD will restart
correctly and the others will try to access the wrong device, causing
errors like 'xxxx is still in use'.

When deploying with containers and dmcrypt OSDs we force a PARTUUID
value during the ceph-disk prepare task.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-mds: allow directory fragmentation

We need to explicitly enable the allow_dirfrags flag on cephfs pool
after upgrading to Luminous.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776233
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

facts: avoid duplicated element in devices list

When using `osd_auto_discovery`, `devices` is built multiple times due
to multiple runs of `ceph-facts` role. It ends up with duplicate
instances of the same device in the list.

Using `unique` filter when building the list fixes this issue.
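
The effect can be illustrated in plain Python. The `unique` helper below is
a stand-in written for this example (an order-preserving de-duplication),
and the device paths are made up:

```python
def unique(items):
    """Order-preserving de-duplication."""
    return list(dict.fromkeys(items))


discovered = ["/dev/sdb", "/dev/sdc"]

without_filter = []
with_filter = []
for _ in range(3):  # the role running three times
    without_filter = without_filter + discovered
    with_filter = unique(with_filter + discovered)
```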

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 23b1f43897db0a03ef94f51e83ed3c562c4584d0)

tests: add shrink-osd-legacy testing

This commit reintroduces testing against ceph-disk deployed osds.

In stable-3.2 which is the most common version used at customers
(downstream pov), a bunch of OSDs are still deployed using ceph-disk.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

shrink-osd: support fqdn in inventory

When using fqdn in inventory, that playbook fails because some tasks
use the result of ceph osd tree (which returns shortname) to get
some data in hostvars[].

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d9ca6b05b52694dec53ce61fdc16bb83c93979d)

ceph-iscsi: add ceph-iscsi stable repositories

This commit adds support for the ceph-iscsi stable repository when
ceph_repository is set to community, instead of always using the devel
repositories.
We're still using the devel repositories for rtslib and tcmu-runner in
both cases (dev and community).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ansible.cfg: do not enforce PreferredAuthentications

There's no need to enforce PreferredAuthentications by default.
Users can still choose to override the ansible.cfg with any additional
parameter like this one to fit their infrastructure.

Fixes: #4826
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d682412e2aa5eeb411cc0dff9a3ffef4b4aa8683)

ceph-osd: update systemd unit script

The systemd unit script wasn't updated with the new container name
format (without the hostname).
We now have the same start/stop docker commands for all scenarios.
During the device to id OSD migration we need to be sure that the
old containers with the hostname are stopped.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1780688
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: add lvm-auto-discovery scenario

This adds the lvm-auto-discovery scenario.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-defaults: exclude md devices from discovery

The md devices (RAID software) aren't excluded from the devices list in
the auto discovery scenario.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1764601
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 014f51c2a42e4922b43da07b97c4b810ede32200)

facts: fix auto_discovery exclude

The previous approach was wrong.
checking if `item.key` is in `osd_auto_discovery_exclude` (`['dm-',
'loop']`) is incorrect because it will obviously not match. Therefore,
the condition will return `True` whatever the device we are checking.
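
To see why the membership test never matches, compare it with a prefix
test. The prefix test is only one possible way to do the comparison and is
an assumption here, not necessarily what the fix does; both helper names
are hypothetical:

```python
osd_auto_discovery_exclude = ["dm-", "loop"]


def excluded_by_membership(name):
    """Previous approach: exact membership test against the list."""
    return name in osd_auto_discovery_exclude


def excluded_by_prefix(name):
    """One possible alternative (assumption): treat each entry as a prefix."""
    return name.startswith(tuple(osd_auto_discovery_exclude))
```

A device key such as `dm-0` is never equal to `dm-`, so the membership test
lets every device through.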

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f420072727441fd6e6a22a15cc9034a3a678cae)

osd: add possibility to exclude device in osd_auto_discovery

Add a new `osd_auto_discovery_exclude` to give the possibility of
excluding some devices in auto_discovery scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 83d7ef777ec19eb5f96c553d869ec8ad90fd5061)

ceph-facts: generate devices when osd_auto_discovery is true

This task used to live in ceph-osd, but we need it defined here so that
ceph-config can use it when trying to determine the number of osds.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 88eda479a9d6df6224a24f56dbcbdb204daab150)

tests: reduce max_mds from 3 to 2

Having a max_mds value equal to the number of mds nodes generates a
warning in the ceph cluster status:

  cluster:
    id:     6d3e49a4-ab4d-4e03-a7d6-58913b8ec00a
    health: HEALTH_WARN
            insufficient standby MDS daemons available
  (...)
  services:
    mds:     cephfs:3 {0=mds1=up:active,1=mds0=up:active,2=mds2=up:active}

Let's use 2 active and 1 standby mds.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4a6d19dae296969954e5101e9bd53443fddde03d)

switch_to_containers: fix umount ceph partitions

When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.

We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0

Also we should not fail on the ceph directory listing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 39cfe0aa65ddd96458ba9d0a031d801efbb0d394)

tests: fix update scenario (container)

The path to the inventory isn't correct because we are missing the variable
`CONTAINER_DIR` here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: revert vagrant_variable file name detection

This commit reverts the following change:

https://github.com/ceph/ceph-ansible/pull/4510/commits/fcf181342a70b78a355d1c985699028012326b5f#diff-23b6f443c01ea2efcb4f36eedfea9089R7-R14

This is causing CI failures, so this commit is intended to unlock the CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5353ab8a23aee92ebab8146eeeeffcfcb25c0865)

rolling_update: don't enable ceph-mon unit

On non containerized deployments the ceph-mon hostname/fqdn systemd
services are stopped at the beginning of the mon upgrade.
But the enabled parameter is set to true for both tasks, so even if we're
not using the fqdn, the systemd unit based on it gets enabled.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1649617
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

container: add always tag on gather fact tasks

If we execute the site-container.yml playbook with specific tags (like
ceph_update_config) then we need to be sure to gather the facts, otherwise
we will see errors like:

The task includes an option with an undefined variable. The error was:
'ansible_hostname' is undefined

This commit also adds missing 'gather_facts: false' to mons plays.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1754432
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d7fd769b6ddcb63086cc414e00cce31433d56673)

Evades validation of ceph_repository_type in containerized scenario

This will prevent failure of site-docker.yml with configs in doc.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1769760
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
Co-Authored-By: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9a1f1626c3e57e64bdcd8d37ae600c21f3ea2a24)

ceph_key: restore file mode after a key is fetched

When `import_key` is enabled and the key already exists, it is only
fetched using the ceph cli. If the mode specified in the `ceph_key` task is
different from what is applied by the ceph cli, the mode isn't restored
because we don't call `module.set_fs_attributes_if_different()` before
`module.exit_json(**result)`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1734513
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b717b5f736448903c69882392e0691fba60893aa)

Remove outdated documentation

Fixes BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1640525

Signed-off-by: Noah Watkins <nwatkins@redhat.com>

mergify: remove mergify config on stable-3.2

This commit removes the mergify config on stable-3.2

At the moment there is no need to have a mergify config on this branch
given that we don't use it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-osd: fix fs.aio-max-nr sysctl condition

[1] introduced a regression on the fs.aio-max-nr sysctl value condition.
The enable key isn't a boolean but a string because the expression isn't
evaluated.
This string output "(osd_objectstore == 'bluestore')" is always true
because the item.enable condition only matches a non-empty string. So the
sysctl value was applied for both filestore and bluestore backends.

[2] added the bool filter to the condition but the filter always returns
false on such a string and the sysctl wasn't applied at all.

This commit fixes the enable key value by evaluating the value instead
of using the string.
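
The three behaviours can be reproduced with a small sketch. `to_bool` is a
simplified stand-in for a bool filter written for this example (only a few
literals count as true); it is not the filter's actual implementation:

```python
def to_bool(value):
    """Simplified stand-in for a 'bool' filter: only a few literals are true."""
    if isinstance(value, bool):
        return value
    return str(value).lower() in ("yes", "on", "1", "true")


unevaluated = "(osd_objectstore == 'bluestore')"  # the expression kept as text
osd_objectstore = "filestore"
evaluated = (osd_objectstore == "bluestore")      # the expression evaluated

truthy_as_string = bool(unevaluated)    # case [1]: non-empty string, always on
filtered_string = to_bool(unevaluated)  # case [2]: never on
```

Only the evaluated form follows the objectstore backend.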

[1] https://github.com/ceph/ceph-ansible/commit/08a2b58
[2] https://github.com/ceph/ceph-ansible/commit/ab54fe2

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ece46d33be566994d6ce799fdc4547299b352429)

Support comma-delimited subnets in firewall

ceph.conf supports a comma-separated list of
subnet CIDRs for the public_network and the
cluster network. ceph-ansible should support
setting up the firewall for this configuration.
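
As an illustration of what handling such a value involves, the sketch below
splits a comma-separated string into one source network per firewall rule.
The `split_networks` helper and the subnet values are made up for this
example:

```python
import ipaddress


def split_networks(value):
    """Split 'a/24, b/24' into a list of validated network objects."""
    return [ipaddress.ip_network(part.strip())
            for part in value.split(",") if part.strip()]


public_network = "192.168.1.0/24, 192.168.2.0/24"  # example value
sources = [str(net) for net in split_networks(public_network)]
```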

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1767392
Closes: #4425
Related: #4333
https://docs.ceph.com/docs/nautilus/rados/configuration/network-config-ref/#network-config-settings

Signed-off-by: Harald Jensås <hjensas@redhat.com>
(cherry picked from commit d94229204d84fc27c5997d273dff577af0ab1684)

ceph-infra: Remove restart firewalld handler

There's no need to restart firewalld service when a new rule is
added due to the usage of the immediate flag.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b7338d438a18b6de9083e4302d809ead7a46b6c6)

ceph-osd: Remove ulimit nofile on container start

Even if this improves ceph-disk/ceph-volume performance, it also
impacts the ceph-osd process.
The ceph-osd process shouldn't use the 1024:4096 value for the max open
files.
This commit removes the ulimit option from the container engine; this kind
of change is done on the container side instead [1].

[1] https://github.com/ceph/ceph-container/pull/1497

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1702285
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9a996aef7f79d5018e6999362fd025e9c04c9b3f)

update: add default values when setting fact

This commit adds a default value in the with_dict because when using
python 2.7, if a task using a with_dict has a condition, it is
evaluated anyway whereas in python 3 it isn't.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

rolling_update: remove default filter on mds group

There's no need to use the default filter on active/standby groups
because if the group doesn't exist then the play is just skipped.

Currently this generates warnings like:

[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2ca79fcc99bcff6f73478f11e67ba7edb178b029)

rolling_update: fix active mds host value

The active mds host should be based on the inventory hostname and not on
the ansible hostname.
The value returned under the mdsmap structure is based on the OS hostname,
so we need to find the right node in the inventory with this value when
doing operations on inventory nodes.

Otherwise we could see errors like:

The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined
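
A sketch of that lookup, using a made-up inventory in which the inventory
names differ from the OS hostnames (`inventory_host_for` is a hypothetical
helper, not code from the playbook):

```python
def inventory_host_for(os_hostname, hostvars):
    """Return the inventory name whose gathered hostname matches, else None."""
    for inventory_name, facts in hostvars.items():
        if facts.get("ansible_hostname") == os_hostname:
            return inventory_name
    return None


hostvars = {  # hypothetical inventory
    "mds0.example.com": {"ansible_hostname": "foobar"},
    "mds1.example.com": {"ansible_hostname": "baz"},
}
active_mds = inventory_host_for("foobar", hostvars)
```

Indexing `hostvars` directly with the OS hostname fails here because
`foobar` is not an inventory name.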

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1f2352c7974f0839b5e74cb23849e943a1131c6)

update: skip mds deactivation when no mds in inventory

Let's skip this part of the code if there's no mds node in the
inventory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ec906c3af2d188de23cc354ecb9ddcfc0af9d90)

openstack_config: fix docker exec command

container_exec_cmd should be replaced by docker_exec_cmd.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1765110
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

update: follow new recommandation to upgrade mds cluster

Refactor the mds cluster upgrade code in order to follow the documented
recommendation.
See: https://github.com/ceph/ceph/blob/luminous/doc/cephfs/upgrading.rst

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1569689
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 71cebf80a623388c64dfcb190133eb5f54a524f9)

tests: fix the size on the second data LV

This commit replaces the pv/vg/lv commands run through the ansible command
module with the lvg and lvol modules.
This also fixes the size of the second data LV because we were only using
50% of the remaining space instead of 100%.

With a 50G device, the result was:
  - data-lv1 was 25G
  - data-lv2 was 12.5G
Instead of:
  - data-lv1 was 25G
  - data-lv2 was 25G
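
The arithmetic behind those numbers, as a small sketch. The `lv_sizes`
helper is hypothetical, and it assumes the first LV takes half of the
device, which matches the 25G figure above:

```python
def lv_sizes(device_size, second_lv_share_of_free):
    """Sizes of two data LVs carved from one device (same unit as the input)."""
    lv1 = device_size * 0.5                   # first LV takes half the device
    free = device_size - lv1                  # space left after the first LV
    lv2 = free * second_lv_share_of_free      # second LV takes a share of it
    return lv1, lv2


before = lv_sizes(50, 0.5)   # second LV gets 50% of the remaining space
after = lv_sizes(50, 1.0)    # second LV gets 100% of the remaining space
```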

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2c03c6fcd33ba6b3a3daf73ecd011e87ab41c0a0)

common: do not override ceph_release when using custom repo

Otherwise it fails as follows:

```
TASK [ceph-mds : allow multimds] **************************************************************************************************************************************************
Monday 22 July 2019 16:37:38 +0800 (0:00:03.269) 0:13:25.651 ***********
fatal: [rhel7u6clone1]: FAILED! => {"msg": "The conditional check 'ceph_release_num[ceph_release] == ceph_release_num.luminous' failed. The error was: error while evaluating conditional (ceph_release_num[ceph_release] == ceph_release_num.luminous): 'dict object' has no attribute u'dummy'\n\nThe error appears to have been in '/usr/share/ceph-ansible/roles/ceph-mds/tasks/create_mds_filesystems.yml': line 43, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: allow multimds\n ^ here\n"}
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1645379
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4e9504c939a1daddceef3806f7165844952a6618)

tests: add multimds coverage

This commit makes the all_daemons scenario deploy 3 mds in order to
cover the multimds case.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

rbd-mirror: fail if the peer is not added

Due to the 'failed_when: false' statement present in the peer task, the
playbook continues to run even if the peer task fails (for example,
because of an incorrect remote peer format):

"stderr": "rbd: invalid spec 'admin@cluster1'"

This patch adds a task to list the peers present and adds the peer only if
it's not already added. With this we don't need the failed_when statement
anymore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1665877
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0b1e9c0737ca84c2e4a34f827cf91e1a11007b16)

Remove validate action and notario dependency

The current ceph-validate role is using both validate action and fail
module tasks to validate the ceph configuration.
The validate action is based on the notario python library. When one of
the notario validations fails, a python stack trace is reported to the
ansible task. This output isn't understandable by users.

This patch removes the validate action and the notario dependency. The
validation is now done with only the fail ansible module.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1654790
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

tests: fix rgw multisite vagrant variables

The secondary vagrant variables didn't have the grafana vm variable
set, which creates a vagrant error:

There was an error loading a Vagrantfile. The file being loaded
and the error message are shown below. This is usually caused by
an invalid or undefined variable.

This patch also changes the ssh-extra-args parameter to ssh-common-args
to get the same values for ssh/sftp/scp. Otherwise we can see warnings
from ansible and some tasks are failing.

[WARNING]: sftp transfer mechanism failed on [mon0]. Use ANSIBLE_DEBUG=1
to see detailed information

It also updates the ssh-common-args value for the rgw-multisite scenario
to reflect the ANSIBLE_SSH_ARGS environment variable value.

Finally, this changes the IP addresses due to the Vagrant refactor done in
commit 778c51a.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 010158ff847bb59920f6a5bbf383a1cb7056c0cf)

switch_to_containers: optimize ownership change

As per https://github.com/ceph/ceph-ansible/pull/4323#issuecomment-538420164

using `find` command should be faster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1757400
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-Authored-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit c5d0c90bb7d8382fde2f07820c2d8547c8a3603e)

validate: prevent from installing OSD on same disk as the OS

This commit adds a validation task to prevent installing an OSD on
the same disk as the OS.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1623580
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 80e2d00b16629c4815019e8bc58c4539bd109710)

tests: update tox due to pipeline removal

This commit reflects the recent changes in ceph/ceph-build#1406

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bcaf8cedeec0f06eb8641b0038569a5cd3a3e7be)

switch_to_containers: umount osd lockbox partition

When switching from a baremetal deployment to a containerized deployment
we only umount the OSD data partition.
If the OSD is encrypted (dmcrypt: true) then there's an additional
partition (part number 5) used for the lockbox and mounted in the
/var/lib/ceph/osd-lockbox/ directory.
Because this partition isn't unmounted, the containerized OSDs aren't
able to start. The partition is still mounted by the system and can't be
remounted from the container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 19edf707a50c2e86110b2ba0231091b6bd355bd1)

ceph-config: remove container_binary variable

9e7972a introduced a regression via the container_binary variable
which is undefined.
The CEPH_CONTAINER_BINARY environment variable isn't used at all.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-mgr: fix ceph_key module with container

556052b changed the way the mgr keyrings are created, but the ceph_key
module needs the containerized parameter when the deployment is using
containers.
This module doesn't support CEPH_CONTAINER_[BINARY|IMAGE] environment
variables.

Closes: #4547
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

nfs: stop nfs server service in all context

This commit moves this task in order to stop the nfs server service
regardless of the deployment type desired (containerized or
non-containerized).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1508506
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6c6a512a720de0268e2d099926413e2816c65174)

nfs: stop nfs server service

The syntax here wasn't working; this refactor fixes the task.
Also, remove the `ignore_errors: true` which was hiding the failure.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1508506
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 47034effe0bb7de14442b0ba884ff4abe793b4b7)

playbook: add missing tags

Add missing tag on ceph-handler role call.
Otherwise, we can't use `--tags='ceph_update_config'` for updating the
ceph configuration file.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1754432
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f59dad620d43a740466a26f7fb8eba1ffc5ba0af)

ceph-mgr: create keys for MGRs

Add code in ceph-mgr for creating a keyring for managers so that
managers can be deployed on a separate node too.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1552210
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 56bfec7c58407e269f6e6fa7b4c8a5928953dc6f)

ceph-handler: don't restart all OSDs with limit

When using the ansible --limit option on one or a few OSD nodes, if the
handler is triggered then we restart the OSD service on all OSD nodes
instead of only the hosts selected by the limit value.
Even if the play is limited by the --limit value, we are using all OSD
nodes from the OSD group:

with_items: '{{ groups[osd_group_name] }}'

Instead we should iterate only on the nodes present in both OSD group and
limit list.
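
The intended selection can be sketched in Python as an order-preserving
intersection. This is an illustration only: `osds_to_restart` and the host
names are hypothetical, and the actual change in the handler may be
expressed differently.

```python
def osds_to_restart(osd_group, limited_hosts):
    """Keep only the OSD nodes that are also part of the limited play."""
    limited = set(limited_hosts)
    # Iterate over the OSD group so its ordering is preserved.
    return [host for host in osd_group if host in limited]


osd_group = ["osd0", "osd1", "osd2", "osd3"]   # all hosts in the OSD group
limited_hosts = ["osd1", "osd3", "mon0"]       # hosts selected by --limit

print(osds_to_restart(osd_group, limited_hosts))  # ['osd1', 'osd3']
```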

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0346871fb5c46fb1fedfb24ffe5a8c02108c244e)

Vagrantfile: support more than 9 nodes per daemon type

Because of the current IP address assignment, it's not possible to
deploy more than 9 nodes per daemon type.
This commit refactors it a bit and allows us to get around this limitation.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 778c51a0ff7a8c66c464f4828a0f87dd290e1c3e)

tests: set gateway_ip_list dynamically

so we don't have to hardcode this in the tests

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

osd: refact 'wait for all osd to be up' task

Let's use `until` instead of doing the test in bash using a python one-liner.
Also, use `command` instead of `shell`.
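
As a rough illustration of the retry pattern that `until` provides (run the
check, and if the condition isn't met, wait and try again a bounded number
of times), here is a Python sketch. `wait_until`, its parameters and the
fake OSD counts are hypothetical and are not taken from the role.

```python
import time


def wait_until(check, retries, delay, sleep=time.sleep):
    """Call check() up to `retries` times, pausing `delay` seconds between tries."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False


# Fake successive cluster states: (total OSDs, OSDs that are up).
samples = iter([(3, 1), (3, 2), (3, 3)])


def all_osds_up():
    total, up = next(samples)
    return total > 0 and total == up


# The sleep function is replaced by a no-op so the example returns immediately.
print(wait_until(all_osds_up, retries=5, delay=2, sleep=lambda seconds: None))  # True
```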

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c76cd5ad844b71a41d898febafda326ad067cc05)

validate: fix gpt header check

Check for gpt header when osd scenario is lvm or lvm batch.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1731310
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>