Fix the regular expression matching OSD IDs on non-containerized
deployments.
restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do so, the script loops over the list of ceph-osd@ services on the
system. This commit fixes a bug in the regular expression responsible for
extracting the OSD IDs: the prior version used the `[0-9]{1,2}` expression,
which ignores all OSDs whose IDs are greater than 99 (i.e. longer than
two digits). The fix removes the upper limit on the number of digits.
This problem existed in two places in the script.
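As a minimal shell illustration (not the script itself; the unit-name pattern is assumed for the example), the bounded quantifier truncates IDs with three or more digits while the unbounded one does not:
```
echo "ceph-osd@123.service" | grep -oE 'ceph-osd@[0-9]{1,2}'   # -> ceph-osd@12 (wrong)
echo "ceph-osd@123.service" | grep -oE 'ceph-osd@[0-9]+'       # -> ceph-osd@123
```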
defaults: backward compatibility with fqdn deployments
This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployments to be done with the short hostname, we
must keep backward compatibility with clusters already deployed with an
fqdn configuration.
Sébastien Han [Mon, 23 Jul 2018 12:56:20 +0000 (14:56 +0200)]
rolling_update: set osd sortbitwise
An upgrade from RHCS 2 to RHCS 3 will fail if the cluster still has
sortnibblewise set:
it stays stuck on "TASK [waiting for clean pgs...]" because RHCS 3 OSDs will
not start while nibblewise is set.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b3266c5be2f88210589cfa56a5fe0a5092f79ee6)
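For reference, the flag itself is set with a single monitor command (shown here from the shell; the playbook runs the equivalent as a task):
```
# Switch the cluster to sortbitwise ordering before upgrading the OSDs.
ceph osd set sortbitwise
```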
Sébastien Han [Mon, 30 Jul 2018 16:29:00 +0000 (18:29 +0200)]
config: enforce socket name
This was introduced by
https://github.com/ceph/ceph/commit/59ee2e8d3b14511e8d07ef8325ac8ca96e051784
and made our socket checks impossible to run. The PID could be found,
but the cctid could not.
This happens during upgrades to mimic and on clusters running mimic.
So let's force the admin socket back to the way it was, so we can properly check
for existing instances. The $cluster-$name.$pid.$cctid.asok form
is only needed when running multiple instances of the same daemon,
something ceph-ansible cannot do at the time of writing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610220 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ea9e60d48d6631631ac9294d4ef291f8d7a30d78)
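A hedged illustration of why the predictable name matters (the cluster and daemon names below are examples only): the socket check can derive the path without knowing the PID or cctid:
```
# With the enforced $cluster-$name.asok form the socket path is deterministic.
sock=/var/run/ceph/ceph-osd.0.asok
test -S "$sock" && ceph --admin-daemon "$sock" version
```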
tests: support update scenarios in test_rbd_mirror_is_up()
`test_rbd_mirror_is_up()` is failing on update scenarios because it
assumes `ceph_stable_release` is still set to the value of the
original Ceph release; as a result it doesn't enter the right branch of the
condition and fails.
Mike Christie [Wed, 25 Jul 2018 18:13:17 +0000 (13:13 -0500)]
igw: fix image removal during purge
We were not passing the ceph conf info to the rbd image removal
command, so if the cluster name was not the default, igw purge would fail
because the rbd rm command failed.
This just fixes the bug by passing in the ceph conf info, which carries the
cluster name to use.
This fixes Red Hat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1601949
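A hedged sketch of the kind of invocation involved (pool, image and cluster names are illustrative, not taken from the purge script):
```
# Pass the cluster name explicitly so rbd resolves the right conf/keyring
# even when the cluster is not called "ceph".
rbd --cluster mycluster rm rbd/ansible-disk-1
```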
Mike Christie [Thu, 26 Jul 2018 18:52:44 +0000 (13:52 -0500)]
igw: do not fail purge on rbd removal errors
Instead of failing the entire purge operation when the rbd command fails,
just log an error. This allows the higher-level target and config
cleanup to complete, and the user only has to delete the rbd
images manually.
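In shell terms the behaviour change amounts to something like the following sketch (the purge itself is driven by Ansible, not this snippet; names are illustrative):
```
# Log a warning on rbd failures instead of aborting the whole purge.
rbd --cluster mycluster rm rbd/ansible-disk-1 \
  || echo "WARNING: could not remove rbd/ansible-disk-1, delete it manually" >&2
```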
Sébastien Han [Fri, 27 Jul 2018 14:52:19 +0000 (16:52 +0200)]
osd: do not remove expose_partition container
The container runs with --rm, which means it will be deleted by Docker
when it exits. Also, 'docker rm -f' is not idempotent and returns 1 if the
container does not exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1609007 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2ca8c519066555e06a261d5dee3fb46ce5daad0b)
jewel used to create a default `rbd` pool in the default crush root
`default`; we need at least 1 OSD to satisfy the PGs of this
pool, otherwise the cluster will be in HEALTH_ERR state because
of `pgs stuck unclean`/`pgs stuck inactive`.
Mike Christie [Thu, 26 Jul 2018 18:30:36 +0000 (13:30 -0500)]
ceph ansible 3.1 igw: fix rbd-target-gw startup
The problem is that rbd-target-gw needs the rbd pool to be created, the keyring
to be copied over, and iscsi-gateway.cfg to be set up before the
rbd-target-gw service is started.
In the master branch this is fixed by this commit:
since no latest-bis-jewel exists, it's using latest-bis, which points to
ceph mimic. In our testing, using it for idempotency/handlers tests
means upgrading from jewel to mimic, which is not what we want to do.
client: do not rely on copy_admin_key to import keys
Relying on `copy_admin_key` to import the created keys on client nodes
forces us to copy the admin key to those nodes, which is not something we
necessarily want.
We should use the fact `condition_copy_admin_key`, which will be set to
`True` when the delegated node is a mon, which means we can import keys
without needing the admin keyring.
We must check the generated fact `_disabled_ceph_mgr_modules` to
enable a disabled mgr module.
Otherwise, this task will be skipped because it's not comparing the
right list.
tests: skip rgw_tuning_pools_are_set when rgw_create_pools is not defined
Since the ooo_collocation scenario is supposed to be the same scenario as the
one tested by OSP, and it does not pass `rgw_create_pools`, the test
`test_docker_rgw_tuning_pools_are_set` will fail:
```
> pools = node["vars"]["rgw_create_pools"]
E KeyError: 'rgw_create_pools'
```
Skipping this test if `node["vars"]["rgw_create_pools"]` is not defined
fixes this failure.
These tests are skipped on bluestore OSD scenarios.
They were going to fail anyway since they are run on mon nodes while
`devices` is defined in the inventory for each OSD node, which means
`num_devices * num_osd_hosts` returns `0`.
The result is that the test expects 0 OSDs to be up.
The idea here is to move these tests so they run on OSD nodes.
Each OSD node checks its own OSDs: if an OSD node has 2 devices defined
in the `devices` variable, it means we check for 2 OSDs to be up on that
node; if each node has all its OSDs up, we can say all OSDs are up.
tests: fix broken tests in collocated daemons scenarios
At the moment, a lot of tests are skipped when daemons are collocated.
Our tests assume a node belongs to only one group, while in some scenarios
it can belong to multiple groups.
Also pinning to pytest 3.6.1 so we can use `request.node.iter_markers()`.
tests: fix `_get_osd_id_from_host()` in TestOSDs()
We must initialize the `children` variable in `_get_osd_id_from_host()`;
otherwise, if for any reason the deployment has failed and results in
an OSD host with no OSD registered, we won't enter the condition,
`children` is never set, and the function tries to return
something undefined.
Typical error:
```
E UnboundLocalError: local variable 'children' referenced before assignment
```
tests: factorize docker tests using docker_exec_cmd logic
Avoid duplicating tests unnecessarily just because of the docker exec syntax.
Using the same logic as in the playbook with `docker_exec_cmd` allows us
to run the same test on both containerized and non-containerized environments.
The idea is to set a variable `docker_exec_cmd` to the
'docker exec <container-name>' string when containerized and
to '' when non-containerized.
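A minimal sketch of the pattern (the container name and command are examples; the tests build the prefix from testinfra variables):
```
# The same command template serves both deployment types: the prefix is either
# "docker exec <container-name>" or an empty string.
docker_exec_cmd="docker exec ceph-mon-mon0"   # containerized
# docker_exec_cmd=""                          # non containerized
$docker_exec_cmd ceph --cluster ceph -s
```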
This part has changed in mimic and became:
```
"servicemap": {
    "epoch": 2,
    "modified": "2018-07-04 09:54:36.164786",
    "services": {
        "rbd-mirror": {
            "daemons": {
                "summary": "",
                "14151": {
                    "start_epoch": 2,
                    "start_stamp": "2018-07-04 09:54:35.541272",
                    "gid": 14151,
                    "addr": "192.168.1.80:0/240942528",
                    "metadata": {
                        "arch": "x86_64",
                        "ceph_release": "mimic",
                        "ceph_version": "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)",
                        "ceph_version_short": "13.2.0",
                        "cpu": "Intel(R) Xeon(R) CPU X5650 @ 2.67GHz",
                        "distro": "centos",
                        "distro_description": "CentOS Linux 7 (Core)",
                        "distro_version": "7",
                        "hostname": "ceph-rbd-mirror0",
                        "id": "ceph-rbd-mirror0",
                        "instance_id": "14151",
                        "kernel_description": "#1 SMP Wed May 9 18:05:47 UTC 2018",
                        "kernel_version": "3.10.0-862.2.3.el7.x86_64",
                        "mem_swap_kb": "1572860",
                        "mem_total_kb": "1015548",
                        "os": "Linux"
                    }
                }
            }
        }
    }
}
```
This patch modifies the function `test_rbd_mirror_is_up()` in
`test_rbd_mirror.py` so it works with `mimic` and keeps backward compatibility
with `luminous`.
On containerized deployments, if a mon is stopped, its socket is not
purged and can cause a failure when a cluster is redeployed after the
purge playbook has been run.
Typical error:
```
fatal: [osd0]: FAILED! => {}
MSG:
'dict object' has no attribute 'osd_pool_default_pg_num'
```
the fact is not set because of this earlier failure:
Sébastien Han [Thu, 5 Jul 2018 12:10:33 +0000 (14:10 +0200)]
ceph-config: do not log cluster log on container
The container image recently merged both cluster and mon log into a
single stream. Following this, we now see this warning coming from the
container image:
```
2018-06-19 13:44:01.542990 7ff75b024700 1 mon.vm02@1(peon).log
v57928205 unable to write to '/var/log/ceph/ceph.log' for channel
'cluster': (2) No such file or directory
```
So we now tell the mon not to write the cluster log to the filesystem.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591771 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 713b9fcf9b825ba84b07781d05c967238ee96c14)
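The knob involved is presumably the standard mon option shown below (a hedged illustration; the role may pass it on the container command line rather than through ceph.conf):
```
# Stop the monitor from writing the cluster log to a file;
# "mon cluster log to file" is the documented upstream option name.
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
mon cluster log to file = false
EOF
```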
Sébastien Han [Fri, 29 Jun 2018 10:10:16 +0000 (12:10 +0200)]
ceph-client: do not kill the dummy container
The container runs for 300 seconds, then dies and removes itself thanks to
the '--rm' option, so there is no point in removing it. Also, doing so causes
failures under some circumstances.
Sébastien Han [Thu, 28 Jun 2018 07:53:03 +0000 (09:53 +0200)]
systemd: remove changed_when: false
When using a module there is no need to apply this Ansible option. The
module handles idempotency on its own, so the module decides
whether or not the task has changed during the execution.
Sébastien Han [Thu, 28 Jun 2018 07:54:24 +0000 (09:54 +0200)]
ceph-osd: trigger osd container restart on script change
The script ceph-osd-run.sh holds the config options used to start the
container; if one of these options is modified we must restart the
container. This was not the case before because the 'notify' flag
wasn't present.
Sébastien Han [Wed, 27 Jun 2018 09:23:00 +0000 (11:23 +0200)]
mon: honour mon_docker_net_host option
--net=host was hardcoded in the startup line, so even though
mon_docker_net_host was set to False the net option would always be
activated.
mon_docker_net_host is set to True by default, so this commit does not
change the behaviour.
Sébastien Han [Fri, 15 Jun 2018 19:53:47 +0000 (15:53 -0400)]
common: start firewalld if configure_firewall
Currently we expect that if configure_firewall is set to True, firewalld
is enabled and running. Let's enforce that.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1589146 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bea4027f0c2b5beebf409a00e7b61e923f4fea0c) Signed-off-by: Sébastien Han <seb@redhat.com>
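The shell equivalent of what the role now enforces (the playbook does this through the systemd module rather than a raw command):
```
# Make sure firewalld is enabled at boot and running before any rules are added.
systemctl enable --now firewalld
```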
Sébastien Han [Fri, 15 Jun 2018 19:39:34 +0000 (15:39 -0400)]
mon/osd: bump container memory limit
As discussed with the cores, the current limits are too low and should
be bumped to higher values.
So now, by default, monitors get 3GB and OSDs get 5GB.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591876 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit a9ed3579ae680949b4f53aee94003ca50d1ae721) Signed-off-by: Sébastien Han <seb@redhat.com>
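For context, a container memory ceiling ends up as a standard Docker flag; a hedged sketch with the values from this commit (image name, container names and arguments are illustrative only):
```
# Memory limits applied to the daemon containers after this change.
docker run -d --name ceph-mon-mon0 --memory=3g ceph/daemon mon   # monitors: 3GB
docker run -d --name ceph-osd-0    --memory=5g ceph/daemon osd   # OSDs: 5GB
```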
client: try to kill dummy container only on first client node
The 'dummy' container is created only on the first client node, which means
we must destroy this container only on that node, otherwise it
can cause a failure like the following:
```
fatal: [192.168.24.8]: FAILED! => {"changed": false, "cmd": ["docker", "rm",
"-f", "ceph-create-keys"], "delta": "0:00:00.023692", "end": "2018-06-12
20:56:07.261278", "msg": "non-zero return code", "rc": 1, "start":
"2018-06-12 20:56:07.237586", "stderr": "Error response from daemon: No such
container: ceph-create-keys", "stderr_lines": ["Error response from daemon: No
such container: ceph-create-keys"], "stdout": "", "stdout_lines": []}
```
ceph-osd: set 'openstack_keys_tmp' only when 'openstack_config' is defined.
If 'openstack_config' is false this task shouldn't be executed.
Signed-off-by: Konstantin Shalygin <k0ste@k0ste.ru>
(cherry picked from commit 3a07568496f718ffe44077eac23d7397a17b3b09) Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Mon, 11 Jun 2018 12:51:58 +0000 (14:51 +0200)]
common: ability to enable/disable fw configuration
Prior to this patch, if you were running on a Red Hat system,
ceph-ansible would try to configure firewalld for you without the
operator's consent.
Now you can enable or disable the fw configuration by setting
configure_firewall to either true or false.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1589146 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2e8412734a81077c2b11de316e08bbecbed2de96)
tests: increase memory to 1024Mb for centos7_cluster scenario
we see more and more failures like `fatal: [mon0]: UNREACHABLE! => {}` in
the `centos7_cluster` scenario. Since we have 30GB of RAM on the hypervisors, we
can give monitors a bit more RAM. Nodes in the containerized cluster
testing scenario already have 1024MB of memory allocated.
client: keyrings aren't created when single client node
Combining `run_once: true` with `inventory_hostname ==
groups.get(client_group_name) | first` might cause a bug when the only
node being run is not the first in the group.
In a deployment with a single client node it can cause an issue because
the keyring sometimes won't be created, since the task can end up being
skipped entirely.
Potential error if someone doesn't pass the mode in the `keys` dict for
client nodes:
```
fatal: [client2]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'mode'
The error appears to have been in '/home/guits/ceph-ansible/roles/ceph-client/tasks/create_users_keys.yml': line 117, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: get client cephx keys
^ here
exception type: <class 'ansible.errors.AnsibleUndefinedVariable'>
exception: 'dict object' has no attribute 'mode'
```
Adding a default value avoids the deployment failing because of this.
client: use dummy created container when there is no mon in inventory
the `docker_exec_cmd` fact set in the client role when there is no monitor
in the inventory is wrong: `ceph-client-{{ hostname }}` is never created, so
it will fail anyway.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7b156deb67a9e137962161829e008bcc32835fe8) Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Tue, 5 Jun 2018 03:56:55 +0000 (11:56 +0800)]
test: do not always copy admin key
The admin key must be copied to the OSD nodes only when we test the
shrink scenario. Shrink relies on ceph-disk commands that require the
admin key on the node where they are executed.
Now we only copy the key when running the shrink-osd scenario.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 41b4632abca51b4f1ab052e8b47d0bebd2e838e8) Signed-off-by: Sébastien Han <seb@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1583020 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 91f9da530f139cc6f378d1fc549870cbbc45d460) Signed-off-by: Sébastien Han <seb@redhat.com>
41b4632 introduced a change in the functional tests.
Since the admin keyring isn't copied to rgw nodes anymore in the tests, let's use
the rgw keyring to run them.
Refactor of 8704144e3157aa253fb7563fe701d9d434bf2f3e.
There is no need to have duplicated tasks for this. The rgw pools
creation should be delegated to a monitor node so we don't have to care
whether the admin keyring is present on the rgw node.
Besides, only one task is needed to create the pools; we just need to
use the `docker_exec_cmd` fact already defined in `ceph-defaults` to
achieve it.
Erwan Velu [Fri, 1 Jun 2018 16:53:10 +0000 (18:53 +0200)]
ceph-defaults: Enable local epel repository
During the tests, the remote epel repository generates a lot of
errors leading to broken jobs (issue #2666).
This patch is about using a local repository instead of a random one.
To achieve that, we make a preliminary install of epel-release, remove
the metalink and enforce a baseurl pointing to our local http mirror.
That should speed up the build process and also avoid the random errors
we face.
This patch is part of a patch series that tries to remove all possible yum failures.
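A rough shell equivalent of the steps described above (the mirror URL is a placeholder; the actual change lives in the test tooling):
```
# Install epel-release first, then pin the repo to a local mirror instead of
# letting yum pick a random remote mirror through the metalink.
yum install -y epel-release
sed -i -e 's/^metalink=/#metalink=/' \
       -e 's|^#baseurl=.*|baseurl=http://local-mirror.example.com/epel/7/$basearch/|' \
       /etc/yum.repos.d/epel.repo
```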
Andy McCrae [Mon, 19 Feb 2018 16:57:18 +0000 (16:57 +0000)]
Fix template reference for ganesha.conf
We can simply reference the template name since it exists within the
role that we are calling. We don't need to check the ANSIBLE_ROLE_PATH
or playbooks directory for the file.
Since openstack_config.yml has been moved to `ceph-osd`, we must move
this `set_fact` into ceph-osd as well, otherwise the tasks in
`openstack_config.yml` using `openstack_keys` will actually use the
default value from `ceph-defaults`.
osds: wait for osds to be up before creating pools
This is a follow-up on #2628.
Even with the openstack pools creation moved later in the playbook,
there is still an issue because OSDs are not all UP when we try to
create pools.
Adding a task which waits for all OSDs to be UP with a `retries/until`
condition should definitively fix this issue.
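A hedged shell sketch of the same wait loop (the playbook expresses this as an Ansible task with retries/until; the `ceph osd stat` output format assumed here is the luminous-era "N osds: M up, ..." line and may differ on other releases):
```
# Retry (up to 60 times) until the number of OSDs reported up equals the total.
tries=0
while :; do
  read -r total up <<< "$(ceph osd stat | sed -n 's/.*\b\([0-9]\{1,\}\) osds: \([0-9]\{1,\}\) up.*/\1 \2/p')"
  if [ -n "$total" ] && [ "$total" -eq "$up" ]; then
    break
  fi
  tries=$((tries + 1))
  [ "$tries" -ge 60 ] && { echo "OSDs never came up" >&2; exit 1; }
  sleep 10
done
```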
Fix a typo in the `tag` target: double quotes are missing.
Without them, the `make tag` command fails like this:
```
if [[ "v3.0.35" == ]]; then \
echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \
exit 1; \
fi
/bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator
/bin/sh: -c: line 0: syntax error near `;'
/bin/sh: -c: line 0: `if [[ "v3.0.35" == ]]; then echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; exit 1; fi'
make: *** [tag] Error 2
```
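A hedged illustration of the quoting issue (variable names are hypothetical, not taken from the Makefile): with an unquoted right-hand operand an empty expansion leaves `[[ ... == ]]`, a syntax error, whereas a quoted operand degrades to a comparison against the empty string:
```
new_tag="v3.0.35"
head_tag=""                                # e.g. HEAD is not tagged yet
if [[ "$new_tag" == "$head_tag" ]]; then   # the quotes keep the expression valid
  echo "already tagged"
  exit 1
fi
```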
Ken Dreyer [Thu, 10 May 2018 23:08:05 +0000 (17:08 -0600)]
Makefile: add "make tag" command
Add a new "make tag" command. This automates some common operations:
1) Automatically determine the next Git tag version number to create.
For example:
"3.2.0beta1" -> "3.2.0beta2"
"3.2.0rc1" -> "3.2.0rc2"
"3.2.0" -> "3.2.1"
2) Create the Git tag, and print instructions for the user to push it to
GitHub.
3) Sanity check that HEAD is a stable-* branch or master (bail on
everything else).
4) Sanity check that HEAD is not already tagged.
Note, we will still need to tag manually once each time we change the
format, for example when moving from tagging "betas" to tagging "rcs",
or "rcs" to "stable point releases".
Signed-off-by: Ken Dreyer <kdreyer@redhat.com> Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fcea56849578bd47e65b130ab6884e0b96f9d89d) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
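The version-bump rule in step 1 can be sketched in plain shell as follows (a hypothetical illustration of the numbering behaviour only, not the Makefile's actual implementation):
```
# Increment the trailing number of the latest tag, whatever suffix style it uses:
# 3.2.0beta1 -> 3.2.0beta2, 3.2.0rc1 -> 3.2.0rc2, 3.2.0 -> 3.2.1
last="3.2.0rc1"
num="${last##*[!0-9]}"        # trailing digits, e.g. "1"
prefix="${last%$num}"         # everything before them, e.g. "3.2.0rc"
echo "${prefix}$((num + 1))"  # -> 3.2.0rc2
```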
Sébastien Han [Mon, 16 Apr 2018 13:57:23 +0000 (15:57 +0200)]
rgw: container add option to configure multi-site zone
You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host in your
inventory and assign them a value. Once the rgw container starts, it will
pick up this info and add itself to the right zone.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1c084efb3cb7e48d96c9cbd6bd05ca4f93526853) Signed-off-by: Sébastien Han <seb@redhat.com>
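In container terms the two settings end up as environment variables on the rgw container, roughly like this (image name, zone names and the rest of the command line are illustrative):
```
# Each rgw host can carry its own zone/zonegroup; the container reads them at startup.
docker run -d --net=host \
  -e RGW_ZONE=us-east-1 \
  -e RGW_ZONEGROUP=us \
  ceph/daemon rgw
```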
It should have been backported from 29a9dff but for better clarity I
think it's better to create a new commit for this.
c68126d6 aims to make the `pgs` attribute no longer mandatory for each element of
`cephfs_pools`. Therefore, we must remove the check in
`roles/ceph-mon/tasks/check_mandatory_vars.yml`.
This task was removed by 29a9dff, but I've chosen not to backport
that commit since it's part of a bunch of commits belonging to a PR
implementing the `ceph-validate` role.
When playing the ceph-mds role, mon nodes have already set a fact with the default
pg num for osd pools, so we can simply default to this value for cephfs
pools (the `cephfs_pools` variable).
At the moment the variable definition for `cephfs_pools` looks like:
and we have a task in `ceph-validate` to ensure `pgs` has been set to a
valid value.
We could simply avoid this check by setting the default value of `pgs`
to `hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num']` and
leave users the possibility to override this value.
In `ceph-osd` there is no need to set `docker_exec_cmd` since the only
place where this fact is used is `openstack_config.yml`, which
delegates all docker commands to a monitor node. It means we need the
`docker_exec_cmd` fact that refers to the `ceph-mon-*`
containers, and this fact is already set earlier in `ceph-defaults`.
Besides, when collocating an OSD with a MON it fails because the container
`ceph-osd-{{ ansible_hostname }}` doesn't exist.
Removing this task allows collocating an OSD with a MON.
For a while now we can see failures in the CI for containerized
scenarios because VMs are running out of space at some point.
The default in the images used is to have only 3GB for the root partition,
which doesn't sound like a lot.
Typical error seen:
```
STDERR:
failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```
Indeed, on the machine we can see:
```
Every 2.0s: df -h                                     Tue May 29 17:21:13 2018
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  3.0G  3.0G   14M 100% /
```
The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.
```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
Logical volume atomicos/root successfully resized.
```
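For completeness, the filesystem grow that follows the `lvresize` shown above (the root filesystem on these images is XFS, so it is grown online against its mount point):
```
# Grow the XFS filesystem to fill the freshly extended logical volume.
xfs_growfs /
```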
In the CI we often see failures like the following:
`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`
It seems the fastest-mirror detection is sometimes counterproductive and
leads yum to fail.
This fix has been added in `setup.yml`.
This playbook was until now used only just before playing `testinfra`,
but it could also be run before ceph-ansible, so we can add some
provisioning tasks to it.
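A hedged guess at the kind of change this implies (the actual task may differ): disabling yum's fastestmirror plugin so repository resolution no longer depends on the mirror-ranking step:
```
# Turn off the fastestmirror yum plugin on the CentOS 7 test nodes.
sed -i 's/^enabled=1/enabled=0/' /etc/yum/pluginconf.d/fastestmirror.conf
```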
When collocating mds on a monitor node, the cephfs creation will fail
because `docker_exec_cmd` is reset to `ceph-mds-monXX`, which is
incorrect because we need to delegate the task to `ceph-mon-monXX`.
In addition, it wouldn't have worked since the `ceph-mds-monXX` container
isn't started yet.
Moving the task earlier in the `ceph-mds` role fixes this issue.
Paul Cuzner [Fri, 25 May 2018 00:13:20 +0000 (12:13 +1200)]
Add privilege escalation to iscsi purge tasks
Without the escalation, invocation by non-root
users will fail when accessing the rados config
object, or when attempting to log to /var/log.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit 2890b57cfc2e1ef9897a791ce60f4a5545011907) Signed-off-by: Sébastien Han <seb@redhat.com>
Since we fixed the `gather and delegate facts` task, this exception is
not needed anymore. It's a leftover that should be removed to save some
time when deploying a cluster with a large number of clients.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 828848017cefd981e14ca9e4690dd7d1320f0eef) Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Wed, 23 May 2018 19:44:24 +0000 (12:44 -0700)]
group_vars: resync group_vars
The previous commit changed the content of roles/$ROLE/default/main.yml,
so we have to regenerate the group_vars files.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3c32280ca1093f6c3abe0038f524ee3b88dd3672) Signed-off-by: Sébastien Han <seb@redhat.com>
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move cephfs pool creation into the `ceph-mds` role.
Let's move this variable into group_vars/all.yml in all testing scenarios,
in line with commit 1f15a81c480f60bc82bfc3a1aec3fe136e6d3bc4, so
we keep consistency between the playbook and the tests.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a10e73d78d07179ff20ea7cabc2f2ccd1b1b967f) Signed-off-by: Sébastien Han <seb@redhat.com>
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move openstack pool creation to the end of the `ceph-osd` role.
Luigi Toscano [Tue, 22 May 2018 09:46:33 +0000 (11:46 +0200)]
ceph-radosgw: disable NSS PKI db when SSL is disabled
The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.
It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.
Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a122b7ce8ec6a5c27a70bc19589d177.
Now profiles drive the setting of rgw keystone *.
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
(cherry picked from commit 43e96c1f98312734e2f12a1ea5ef29981e9072bd) Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Fri, 4 May 2018 23:41:49 +0000 (01:41 +0200)]
rhcs: bump version to 3.0 for stable 3.1
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1519835 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bf9593bcedea6bd0220eeeb4029c6632f5a8e6f6) Signed-off-by: Sébastien Han <seb@redhat.com>
Vishal Kanaujia [Wed, 16 May 2018 09:58:31 +0000 (15:28 +0530)]
Skip GPT header creation for lvm osd scenario
The LVM lvcreate fails if the disk already has a GPT header.
We create GPT header regardless of OSD scenario. The fix is to
skip header creation for lvm scenario.
Sébastien Han [Tue, 22 May 2018 23:52:40 +0000 (16:52 -0700)]
rolling_update: fix get fsid for containers
When running ansible2.4-update_docker_cluster there is an issue with the
"get current fsid" task. The current task only works for
non-containerized deployments but runs all the time (even for
containerized ones). This currently results in the following error:
This is not really representative of the real error, since the 'ceph' cli is available on that machine.
On other environments we would get something like "command not found: ceph".
Fix restarting OSDs twice during a rolling update.
During a rolling update, OSDs are currently restarted twice: once by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient, as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
Sébastien Han [Wed, 16 May 2018 15:37:10 +0000 (17:37 +0200)]
switch: disable ceph-disk units
During the transition from jewel non-container to container, old ceph
units are disabled. ceph-disk can still remain in some cases and will
appear as 'loaded failed'; this is not a problem, although operators
might not like to see these units failing. That's why we remove them if
we find them.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 49a47124859e6577fb99e6dd680c5244ccd6f38f) Signed-off-by: Sébastien Han <seb@redhat.com>