git-server-git.apps.pok.os.sepia.ceph.com Git

tests: test dashboard deployment with podman scenario

This commit adds a grafana-server section in order to test dashboard
deployment with podman.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3c2fd337d9f5791e0082b5f0f6753f14005969ae)

validate: add checks for grafana-server group definition

this commit adds two checks:
- check that the `[grafana-server]` group is defined
- check that the `[grafana-server]` contains at least one node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 02beb0091612c8cd339e1404c8c44c081a46b807)

mgr: fix a typo

this tasks isn't using the right container_exec_cmd, that's delegating
to the wrong node.
Let's use the right fact to fix this command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ec33ee7574172c6f946f1030c26b8b4a6eb285dc)

dashboard: remove cfg80211 module installation

According to this comment [1], this seems to be needed to detect wifi
devices.

In node exporter we can see this:

```
--collector.wifi Enable the wifi collector (default: disabled).
```

since it's enabled by default and we don't even change this in our
systemd templates for node-exporter, we can easily assume in the end
it's not needed. Therefore, let's remove this.

[1] https://github.com/ceph/ceph-ansible/commit/dbf81b6b5bd6d7e977706a93f9e75b38efe32305#diff-961545214e21efed3b84a9e178927a08L21-L23

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b9cdf341beaf5a83bf5eec3b64c2c29ee78185ac)

dashboard: use dedicated group only

There's no need to add complexity and trying to fallback on other group.
Let's deploy dashboard on all nodes present in grafana-server group.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d67230b2a26b40651c1c1dbee68a92b0e851f3d5)

dashboard: move code into a dedicated playbook

Move dashboard, grafana/prometheus and node-exporter plays into a
dedicated playbook in infrastructure-playbook directory.
To avoid using 'dashboard_enabled | bool' condition multiple time
in the main playbook we can just import the dashboard playbook or
not.
This patch also allows to use an unique dashboard playbook for
both baremetal and container playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 43135840b1570d35314eee3115c2ec826cff4d8d)

dashboard: enable dashboard by default

This commit enables dashboard deployment by default.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1726739
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fb1b5b3251be71ae7058de358ca17d4982cc6f81)

# Conflicts:
# tox-dashboard.ini

Remove NBSP characters

Some NBSP are still present in the yaml files.
Adding a test in travis CI.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 07c6695d16bc3e8f6d8d5fdc17bd9830b1d94619)

container: rename docker directories

Those 2 directories should be renamed to be more generic (docker vs.
podman).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 19950b5170950d8eae7360ebf54adaad978e9024)

Avoid to setup provisioners in a fully containerized environment

This commit adds a when clause to avoid the setup of grafana
provisioners in a fully containerized scenario.
This is needed when the ceph-grafana-dashboards package is not
installed and this task could result in a wrong grafana
configuration that let the container crash.

Signed-off-by: fmount <fpantano@redhat.com>
(cherry picked from commit fac1b030cbc67abfe1af83ec1ffdb648db4bda70)

tests: bump nfs-ganesha version

in stable-4.0, nfs-ganesha 2.8 should be used.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

ceph-dashboard: enable rgw options conditionally

The dashboard rgw frontend options only need to be applied when there's
some nodes present in the rgw ansible group.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5383c2f7f357ce6c8adb88422d757874d2f53c05)

tests/dashboard: use the dedicated grafana node

The Vagrant dashboard scenario creates a dedicated grafana node but
was not use in the ansible inventory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a9a1f633a9756dff08b1067def8effa25d8c1348)

dashboard: use variables for port value

The current port value for alertmanager, grafana, node-exporter and
prometheus is hardcoded in the roles so it's not possible to change the
port binding of those services.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ab9b719fa42fe493035595612f132b741dde641)

Fix backward compat with old cephfs_pools format

Previously cephfs_pools items used to have a pgs: key but not
pgp_num: nor pg_num:

Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit edd1420217d96d87ce84700317b50256648e1fa3)

handler: fix bug in osd handlers

fbf4ed42aee8fa5fd18c4c289cbb80ffeda8f72e introduced a bug when
container binary is podman.
podman doesn't support ps -f using regular expression, the container id
is never set in the restart script causing the handler to fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1721536
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 618dbf271d04723bf35483a3aa62ef9019f2d390)

validate: fail if gpt header found on unprepared devices

ceph-volume will complain if gpt headers are found on devices.
This commit checks whether a gpt header is present on devices passed in
`devices` variable and fail early.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1730541
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 487d7016850524af64254aa3a9ec61222ec926a4)

tests: remove useless setting

this setting is not needed here since we explicitely set it for
container and non container context.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 87b173d02252f3781c36ac3ed26f575397e9c410)

shrink-rbdmirror: check if rbdmirror is well removed from cluster

This commits adds a check to ensure the daemon has been removed from the
cluster.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 916dc1f52f372d52eec045a7780291bef143e037)

tests/functional: add a test for shrink-rbdmirror.yml

Add a new functional test that deploys Ceph cluster with three nodes for
MON, OSD and RBD Mirror and, then, runs shrink-rbdmirror.yml to test it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f80521f773ac8d40afecb617cff19fd8cec497f0)

# Conflicts:
# tox.ini

add a playbook that removes rbd-mirror from a node

Add a playbook named "shrink-rbdmirror.yml" in infrastructure-playbooks/
that removes a RBD Mirror from a node in an already deployed Ceph
cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit c4824acb19053cc6bfee519f3be057c6ecddabf0)

ceph-infra: update handler with daemon variable

Both ntp and chrony daemon use variable for the service name because it
could be different depending on the GNU/Linux distribution.
This has been update in 9d88d3199 for chrony but only for the start part
not for the handler.
The commit fixes this for both ntp and chrony.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0ae0193144897676b56e1d4142565e759531fd35)

ceph-infra: Open prometheus port

The Prometheus porrt 9090 isn't open in the firewall configuration.
Also the dashboard task on the grafana node was not required because
it's already present on the mgr node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 41b44dde85d8a4e6070511a723fa874c7e49876e)

handler: remove legacy condition

since everything is already in a block with the same condition, it's not
needed to leave all of them on these tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ee29f7370add57bb016b693f662b1a34a0a7c607)

validate: improve message printed in check_devices.yml

The message prints the whole content of the registered variable in the
playbook, this is not needed and makes the message pretty unclear and
unreadable.

```
"msg": "{'_ansible_parsed': True, 'changed': False, '_ansible_no_log': False, u'err': u'Error: Could not stat device /dev/sdf - No such file or directory.\\n', 'item': u'/dev/sdf', '_ansible_item_result': True, u'failed': False, '_ansible_item_label': u'/dev/sdf', u'msg': u\"Error while getting device information with parted script: '/sbin/parted -s -m /dev/sdf -- unit 'MiB' print'\", u'rc': 1, u'invocation': {u'module_args': {u'part_start': u'0%', u'part_end': u'100%', u'name': None, u'align': u'optimal', u'number': None, u'label': u'msdos', u'state': u'info', u'part_type': u'primary', u'flags': None, u'device': u'/dev/sdf', u'unit': u'MiB'}}, 'failed_when_result': False, '_ansible_ignore_errors': None, u'out': u''} is not a block special file!"
```

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1719023
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e6dc3ebd8c8161e56c0a4a1fdb409272b9fd5342)

dashboard: Use upstream default port

We are currently using incorrect dashboard default port. The upstream
uses 8443 instead of 8234 by default. This should get us closer to the
upstream project.

Signed-off-by: Boris Ranto <branto@redhat.com>
(cherry picked from commit 21758fcee847fac1146cc784ff93088139afed6b)

ceph-dashboard: remove bool filter for rgw vars

Some dashboard_rgw_api_* variables are using the bool filter but those
variables are strings with an empty string as default value.
So we should test the variable against an empty string instead of a
bool.

dashboard_rgw_api_host: ''
dashboard_rgw_api_port: ''
dashboard_rgw_api_scheme: ''
dashboard_rgw_api_admin_resource: ''

Resolves: #4179

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5413274412394c314df578ddf5208b79af75f222)

ceph-iscsi: Update gateway config/template

- Remove gateway_keyring from the configuration file because it's
not used in ceph-iscsi 3.x release.
- Use config_template instead of template module for iscsi-gateway
configuration file. Because the file is an ini file and we might want
to override more parameters than those present in ceph-ansible.
- Because we can now set the pool name in the configuration, we should
use a variable for that. This is refact with the iscsi_pool_* variables
also used to configure the pool size.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1f2a4f1910d75241deb345228227a3ad7711c145)

tests/functional: add a test for shrink-mgr.yml

Add a new functional test that deploys a Ceph cluster with three nodes
for MON, OSD and MGR and then runs shrink-mgr.yml to test it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 5c95c34d4be5ef545c09d03f4bd1124901119f04)

# Conflicts:
# tox.ini

add a playbook that removes manager from a node

Add a playbook, named "shrink-mgr.yml", in infrastructure-playbooks/
that removes a MGR from a node in an already deployed Ceph cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f4ea75051b238cc41e2bb3414ff1aa572a16083b)

shrink-mds: refact post tasks

This commit refacts the way we check the "mds_to_kill" node is well
stopped.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 7df62fde3476aa0a79821f26775660de1ec0c68e)

tests/functional: add a test for shrink-mds.yml

Add a new functional test that deploys a Ceph cluster with three nodes
for MON, OSD and MDS and then runs shrink-mds.yml to test it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 324b3b4a6c32e7e5a084cf39497f823f7b41d578)

# Conflicts:
# tox.ini

add a playbook that removes mds from a node

Add a playbook, named "shrink-mds.yml", in infrastructure-playbooks/
that removes a MDS from a node in an already deployed Ceph cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 235b1fccc6b4e1c9c40fa44cb71e995703d84a49)

ceph-handler: fix cluster name in socket path

c90f605b5 introduces the default ceph cluster name value in the rgw
socket path for the rgw restart script. But this should use the
`cluster` variable instead.
This commit also fixes this in the osd restart script.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit de7f948b75fef7d839905f731898da04f281f19d)

ceph-mon: Fix cluster name parameter

The ability to add nodes with the monitor role to an existing cluster
whose name differs from the default name is fixed.

Signed-off-by: ilyashestopalov <usr.tester@yandex.ru>
(cherry picked from commit 904532c5e231a85b9cd5fe1e7f789d07cf6995a5)

Add package-install tag on ceph-grafana-dashboard pkg install.

According to the OSP pattern, we need the package-install tag
to control what is installed on the host. This commit just add
the missing tag to meet the TripleO requirements.

See: /issues/4197 for details

Fixes: #4197
Signed-off-by: fmount <fpantano@redhat.com>
(cherry picked from commit 95bd002b351bc54c5d25c8cbae79e1df6e39b71f)

ceph-iscsi-gw: Update log directories bind mount

On containerized deployment we need to bind mount the ceph-iscsi
directory to avoid writing the logs in the container.
The /var/log/ceph directory isn't use by rbd-targe-api/gw services
because they have their own log directories.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 91bef94b6c1ee159173a64cbca36385c67a186de)

iscsi: refact deprecated variables

This commit moves some old variables into ceph-defaults so we can move
the `use_new_ceph_iscsi` fact in ceph-facts role in order.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a781ce881c977d4300ff04be0bce405a32fe4665)

igw: Add check for missing iqn

If the user is still using the older packages and does not setup
the target iqn you will just get a vague error message later on.
This adds a check during the validate task, so it is clear to the
user.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 08a6d10c32beb5c2736b36487efb1860e868d74e)

igw: Add check for mismatch ceph-iscsi and iscsigws.yml settings

If the user has manually installed ceph-iscsi but is trying to setup a
iscsi object in iscsigws.yml you will just a python crash. This patch
adds a check and more user friendly error message for the case.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 2e1a4328a85f4cee5d4e7144b8fd9e96324b5f53)

igw: Update tests to use ceph-iscsi package

gateway_ip_list is depreciated and is only used when using the old
ceph-iscsi-config/cli packages that are no longer being developed
(GH repos are archived). Because ceph-iscsi-config/cli is no longer
being worked on, this modifies the tests to stress the ceph-iscsi
based installs.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 1e64efc2f07af9f99f33a28af31e3186899139b9)

igw: Update iscsigws.yml.sample for ceph-iscsi support

Update iscsigws.yml.sample to document that we cannot use ansible to
setup iSCSI objects and use the new ceph-iscsi package.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 75fee55d19e1e2bec49d9bc4996787ba45f8a349)

igw: Support ceph-iscsi package for install

This adds support for the ceph-iscsi package during install. ceph-iscsi
does not support setting up targets/gws, luns and clients with the
current library/igw_* code. Going forward those tasks should be done with
gwcli or dashboard. ceph-iscsi will only be used if the user has no iscsi
objects setup so we do not break existing setups.

The next patch will update the iscsigws.yml.sample to document that
users must not setup any iscsi object if they want to use the new
package and tools.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit cbe66cec5239504f5532bfac09fcc0a295b22271)

igw: drop gateway_ip_list for container setups

The gateway_ip_list is not used in container setups, so drop it
for that case.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit b7b2213be18cf4fd827389fbc88613bf3e242b82)

igw: move gateway_ip_list check to validate role

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d89d3e7cd6868dda59c5eec721a5d27f095a84df)

igw: Support new ceph-iscsi package during purge

The ceph-iscsi-config and ceph-iscsi-cli packages were combined into
ceph-iscsi and its APIs changed. This fixes up the iscsi purge task to
support the new API and old one.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit b163206db7215b58b067628798af7a29c9e6bbee)

tests: wait 30sec before running testinfra

adding back a sleep 30s after nodes have rebooted before running
testinfra.
This was removed accidentally by d5be83e

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ca84a5359fb13c244b7b865fa850cb6a28564e15)

ceph-handler: Fix rgw socket in restart script

Since Mimic the radosgw socket has two extra fields in the socket
name (before the .asok suffix): <pid>.<ctid>

Before:
  /var/run/ceph/ceph-client.rgw.cephaio-1.asok
After:
  /var/run/ceph/ceph-client.rgw.cephaio-1.16913.23928832.asok

The radosgw restart script doesn't handle this and could fail during
an upgrade.
If the SOCKETS variable isn't defined in the script then the test
command won't fail because the return code is 0

$ test -S
$ echo $?
0

There multiple issues in that script:
  - The default SOCKETS value isn't defined due to a typo
SOCKET vs SOCKETS.
  - Because the socket name uses the pid then we need to check the
socket name after the service restart.
  - After restarting the radosgw service we need to wait few seconds
otherwise the socket won't be created.
  - Update the wget parameters because the command is doing a loop.
We now use the same option than curl.
  - The check_rest function doesn't test the radosgw at all due to
a wrong test command (test against a string) and always returns 0.
This needs to use the DOCKER_EXECS variable in order to execute the
command.

$ test 'wget http://192.168.100.11:8080'
$ echo $?
0

Also remove the test based on the ansible_fqdn because we only use
the ansible_hostname + rgw instance name.

Finally group all for loop into a single one.

Resolves: #3926

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c90f605b5148d179790cec545d02db1086579994)

Add radosgw_frontend_ssl_certificate parameter

This is necessary when configuring RGW with SSL because
in addition to passing specific frontend options, civetweb
appends the 's' character to the binding port and beast uses
ssl_endpoint instead of endpoint.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1722071
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit d526803c6cbb6b04f181479d2087ed25bc0b2d31)

containers: improve logging

bindmount /var/log/ceph on all containers so it's possible to retrieve
logs from the host.

related ceph-container PR: ceph/ceph-container#1408

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1710548
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 33eed78d1774e5d2f87612899352e32552d4eb13)

tox-podman.ini: add missing reruns option

The first py.test call didn't have the reruns option. The commit
fixes it.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f92323f289e29322122d490de83100c8d687a40)

Add condition on dashboard installer phase

Even if dashboard feature is disabled then the installer status will
still report dashboard, grafana and node-exporter roles timing.

INSTALLER STATUS **********************************
Install Ceph Monitor           : Complete (0:01:21)
Install Ceph Manager           : Complete (0:00:49)
Install Ceph Dashboard         : Complete (0:00:00)
Install Ceph Grafana           : Complete (0:00:02)

When need to set the dashboard_enabled condition on those installer
phase pre/post tasks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0604275c112503279e173303418a4daa28ca3478)

nfs: clean template

remove legacy options

```
ganesha.nfsd-115[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:13): Unknown parameter (Dir_Max)
ganesha.nfsd-115[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:14): Unknown parameter (Cache_FDs)

```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b725b3077efda146eee596737a51ae9c3fe11e82)

ceph-osd: Add CONTAINER_IMAGE env variable

This environment variable was added in cb381b4 but was removed in
4d35e9e.
This commit reintroduces the change.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 02fbe76e62881377a4f46ee74e8ac37299f99c59)

dashboard: move ceph-grafana-dashboards package installation

This commit moves the package installation into ceph-dashboard role.
This is needed to install ceph dasboard json file in
`/etc/grafana/dashboards/ceph-dashboard/`.

Closes: #4026
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6e2e30db54bafb271f7f5bd087f426e5762d9e7e)

infra: refact dashboard firewall rules

- There is no need to open ports 3000, 8234, 9283 on all nodes.
- Add missing rule for alertmanager (port 9093)

Closes: #4023
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 14f5fc3c86bcda875a1b3a989efca4cc9188d93e)

dashboard: append mgr modules to ceph_mgr_modules

when `dashboard_enabled` is `True`, let's append `dashboard` and
`prometheus` modules to `ceph_mgr_modules` so they are automatically
loaded.

Closes: #4026
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a2b6f446652b6f043a81f68417fd832996c75b54)

Set grafana_server_addr fact for ipv6 scenarios.

As the bz1721914 describes, the grafana_server_addr
fact is not defined if ip_version used is ipv6.
This commit adds the ip_version condition to set
correctly this fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1721914
Signed-off-by: fmount <fpantano@redhat.com>
(cherry picked from commit e655038743a16e75cb993d1a26c7be755a65ca45)

facts: fix bug in grafana_server_addr fact setting

If no grafana-server group is defined while an mgr group is, that task
will fail because `hostvars[groups[grafana_server_group_name][0]` can't
return anything since `groups['grafana-server']` will be a non existing
key.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 366b309c12adfcef7dba3548ad3f05000148e867)

validate.py: Fix alphabetical order on uca

Alphabetized ceph_repository_uca keys due to errors validating when
using UCA/queens repository on Ubuntu 16.04

An exception occurred during task execution. To see the full
traceback, use -vvv. The error was:
SchemaError: -> ceph_stable_repo_uca schema item is not
alphabetically ordered

Closes: #4154
Signed-off-by: Gabriel Ramirez <gabrielramirez1109@gmail.com>
(cherry picked from commit 82262c6e8c22eca98ffba5d0c65fa65e83a62793)

tests: clean nfs_ganesha variables

- clean some leftover.
- move nfs_ganesha_[stable|dev] in group_vars so dev_setup.yml can modify them.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 45041f52fda35dd368824b485c50117866c514c2)

nfs: add missing | bool filters

To address this warning:
```
[DEPRECATION WARNING]: evaluating nfs_ganesha_dev as a bare variable, this
behaviour will go away and you might need to add |bool to the expression in the
future
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2b9fb377a88d0ad5d888e57ee1929390bcb873c0)

nfs: remove duplicate task

This task is already present in pre_requisite_non_container.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit edb8d42596e1ab0625735460e86757184ffd7c5a)

tests: test nfs-ganesha deployment

Add back the nfs-ganesha deployment testing which was removed because of
broken dependencies.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 013ae62177cc8892d651349ffecab99552a8665e)

tests: deploy nfs-ganesha in container-all_daemons

this commit bring back the nfs-ganesha testing in containerized
deployment.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9201674b5b8238e89c6158838f34415320419998)

purge: ensure no ceph kernel thread is present

This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation, otherwise if you try to redeploy
a cluster it could end up by getting confused by leftover from previous
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 20e4852888ecc76d8d0fa194a438fa2a90e1cde3)

ceph-handler: Fix OSD restart script

There's two big issues with the current OSD restart script.

1/ We try to test if the ceph osd daemon socket exists but we use a
wildcard for the socket name : /var/run/ceph/*.asok.
This fails because we usually have multiple ceph osd sockets (or
other ceph daemon collocated) present in /var/run/ceph directory.
Currently the test fails with:

bash: line xxx: [: too many arguments

But it doesn't stop the script execution.
Instead we can specify the full ceph osd socket name because we
already know the OSD id.

2/ The container filter pattern is wrong and could matches multiple
containers resulting the script to fail.
We use the filter with two different patterns. One is with the device
name (sda, sdb, ..) and the other one is with the OSD id (ceph-osd-0,
ceph-osd-15, ..).
In both case we could match more than needed.

$ docker container ls
CONTAINER ID IMAGE NAMES
958121a7cc7d ceph-daemon:latest ceph-osd-strg0-sda
589a982d43b5 ceph-daemon:latest ceph-osd-strg0-sdb
46c7240d71f3 ceph-daemon:latest ceph-osd-strg0-sdaa
877985ec3aca ceph-daemon:latest ceph-osd-strg0-sdab
$ docker container ls -q -f "name=sda"
958121a7cc7d
46c7240d71f3
877985ec3aca

$ docker container ls
CONTAINER ID IMAGE NAMES
2db399b3ee85 ceph-daemon:latest ceph-osd-5
099dc13f08f1 ceph-daemon:latest ceph-osd-13
5d0c2fe8f121 ceph-daemon:latest ceph-osd-17
d6c7b89db1d1 ceph-daemon:latest ceph-osd-1
$ docker container ls -q -f "name=ceph-osd-1"
099dc13f08f1
5d0c2fe8f121
d6c7b89db1d1

Adding an extra '$' character at the end of the pattern solves the
problem.

Finally removing the get_container_osd_id function because it's not
used in the script at all.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 45d46541cb60818e8ad9a6b2d24fea91c0315525)

Change ansible_lsb by ansible_distribution_release

The ansible_lsb fact is based on the lsb package (lsb-base,
lsb-release or redhat-lsb-core).
If the package isn't installed on the remote host then the fact isn't
populated.

--------
"ansible_lsb": {},
--------

Switching to the ansible_distribution_release fact instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit dc187ea6faf708b1674abd5aee917f79a95f849e)

upgrade: accept HEALTH_OK and HEALTH_WARN as valid state

3a100cfa5265b3a5327ef6a8d382a8059391b903 introduced a check which is a
bit too restrictive, let's accept HEALTH_OK and HEALTH_WARN.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6dce51183bda78e2d7ce689c5e0a7bb2f9629f47)

Add higher retry/delay defaults to check the quorum status.

As per bz1718981, this commit adds higher values to check
the quorum status. This is helpful for several OSP deployments
that fail during the scale up.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1718981
Signed-off-by: fpantano <fpantano@redhat.com>
(cherry picked from commit ba73dc7b218c9c0bdac59079e3162b40082cbc1d)

ceph-volume: Set max open files limit on container

The ceph-volume lvm list command takes ages to complete when having
a lot of LV devices on containerized deployment.
For instance, with 25 OSDs on a node it takes 3 mins 44s to list the
OSD.
Adding the max open files limit to the container engine cli when
executing the ceph-volume command seems to improve a lot thee
execution time ~30s.

This was impacting the OSDs creation with ceph-volume (both filestore
and bluestore) when using multiple LV devices.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1702285
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b98753488110b04cd2071c2b103493235dfc0c80)

roles: Remove useless become (true) flag

We already set the become flag to true at a play level in the site*
playbooks so we don't need to set it at a task level.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7c3640177b6f6bdab15eb89866d8905ba38eda47)

remove ceph restapi references

The ceph restapi configuration was only available until Luminous
release so we don't need those leftovers for nautilus+.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da8b7ab7fb8f9a7ff8346585bd7ffa137bcb6ba0)

facts: add a retry on get current fsid task

sometimes it can happen the following task fails:

```
TASK [ceph-facts : get current fsid] *******************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-centos-container-update/roles/ceph-facts/tasks/facts.yml:78
Wednesday 19 June 2019  18:12:49 +0000 (0:00:00.203)       0:02:39.995 ********
fatal: [mon2 -> mon1]: FAILED! => changed=true
  cmd:
  - timeout
  - --foreground
  - -s
  - KILL
  - 600s
  - docker
  - exec
  - ceph-mon-mon1
  - ceph
  - --cluster
  - ceph
  - daemon
  - mon.mon1
  - config
  - get
  - fsid
  delta: '0:00:00.239339'
  end: '2019-06-19 18:12:49.812099'
  msg: non-zero return code
  rc: 22
  start: '2019-06-19 18:12:49.572760'
  stderr: 'admin_socket: exception getting command descriptions: [Errno 2] No such file or directory'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
```

not sure exactly why since just before this task, mon1 seems to be well
UP otherwise it wouldn't have passed the task `waiting for the
containerized monitor to join the quorum`.

As a quick fix/workaround, let's add a retry which allows us to get
around this situation:

```
TASK [ceph-facts : get current fsid] *******************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible-scenario/roles/ceph-facts/tasks/facts.yml:78
Thursday 20 June 2019  15:35:07 +0000 (0:00:00.201)       0:03:47.288 *********
FAILED - RETRYING: get current fsid (3 retries left).
changed: [mon2 -> mon1] => changed=true
  attempts: 2
  cmd:
  - timeout
  - --foreground
  - -s
  - KILL
  - 600s
  - docker
  - exec
  - ceph-mon-mon1
  - ceph
  - --cluster
  - ceph
  - daemon
  - mon.mon1
  - config
  - get
  - fsid
  delta: '0:00:00.290252'
  end: '2019-06-20 15:35:13.960188'
  rc: 0
  start: '2019-06-20 15:35:13.669936'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    {
        "fsid": "153e159d-7ade-42a7-842c-4d04348b901e"
    }
  stdout_lines: <omitted>
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 46a268394490cb37bdb56fb839ecc8711bda1ec0)

osd: remove legacy task

`parted_results` isn't used anymore in the playbook.

By the way, `parted` seems to cause issue because it changes the
ownership on devices:

```
root@osd0 ~]# ls -l /dev/sdc*
brw-rw----. 1 root disk 8, 32 Jun 11 08:53 /dev/sdc
brw-rw----. 1 ceph ceph 8, 33 Jun 11 08:53 /dev/sdc1
brw-rw----. 1 ceph ceph 8, 34 Jun 11 08:53 /dev/sdc2

[root@osd0 ~]# parted -s /dev/sdc print
Model: ATA QEMU HARDDISK (scsi)
Disk /dev/sdc: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name           Flags
1      1049kB  1075MB  1074MB               ceph block.db
2      1075MB  2149MB  1074MB               ceph block.db

[root@osd0 ~]# #We can see ownerships have changed from ceph:ceph to root:disk:
[root@osd0 ~]# ls -l /dev/sdc*
brw-rw----. 1 root disk 8, 32 Jun 11 08:57 /dev/sdc
brw-rw----. 1 root disk 8, 33 Jun 11 08:57 /dev/sdc1
brw-rw----. 1 root disk 8, 34 Jun 11 08:57 /dev/sdc2
[root@osd0 ~]#
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eece362b38e246a0e6f5f4a487d33409657c2fe8)

rolling_update: fail early if cluster state is not OK

starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check for the cluster status before trying to upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3a100cfa5265b3a5327ef6a8d382a8059391b903)

rolling_update: only mask and stop unit in mgr part

Otherwise it fails like following:

```
fatal: [mon0]: FAILED! => changed=false
msg: |-
Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51b2813e0483b042fb52dd3056464eaa4a4b1a3c)

Add installer phase for dashboard roles

This commits adds the support of the installer phase for dashboard,
grafana and node-exporter roles.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c7a5967a6fa60e3f0d81a03bc148834c3d4f9b59)

align cephfs pool creation

The definitions of cephfs pools should match openstack pools.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
Co-Authored-by: Simone Caronni <simone.caronni@teralytics.net>
(cherry picked from commit 67071c3169f40621baec2aa51504e9f361eaf890)

remove ceph-agent role and references

The ceph-agent role was used only for RHCS 2 (jewel) so it's not
usefull anymore.
The current code will fail on CentOS distribution because the rhscon
package is only avaible on Red Hat with the RHCS 2 repository and
this ceph release is supported on stable-3.0 branch.

Resolves: #4020

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7503098ca079b200b3adcb1faf2e255d9c74a581)

dashboard: fix hosts sections in main playbook

ceph-dashboard should be deployed on either a dedicated mgr node or a
mon if they are collocated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bdc870cbf5771f2f56ec4b0d64eb702b23f9969a)

tests: Update ansible ssh_args variable

Because we're using vagrant, a ssh config file will be created for
each nodes with options like user, host, port, identity, etc...
But via tox we're override ANSIBLE_SSH_ARGS to use this file. This
remove the default value set in ansible.cfg.

Also adding PreferredAuthentications=publickey because CentOS/RHEL
servers are configured with GSSAPIAuthenticationis enabled for ssh
server forcing the client to make a PTR DNS query.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34f9d51178f4cd37a7df1bb74897dff7eb5c065f)

tests: increase docker pull timeout

CI is facing issues where docker pull reach the timeout, let's increase
this to avoid CI failures.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1019e3b3dce971cd81bcea854022fc929d42c139)

ceph-infra: make chronyd default NTP daemon

Since timesyncd is not available on RHEL-based OSs, change the default
to chronyd for RHEL-based OSs. Also, chronyd is chrony on Ubuntu, so
set the Ansible fact accordingly.

Fixes: https://github.com/ceph/ceph-ansible/issues/3628
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 9d88d3199fd8c6548a56bf9e95cd9239481baa39)

iscsi: assign application (rbd) to pool 'rbd'

if we don't assign the rbd application tag on this pool,
the cluster will get `HEALTH_WARN` state like following:

```
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
application not enabled on pool 'rbd'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4cf17a6fddc052c944026ae1d138263131e677f8)

ceph-infra: update cache for Ubuntu

Ubuntu-based CI jobs often fail with error code 404 while installing
NTP daemons. Updating cache beforehand should fix the issue.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit d1c266e6c7f9906f6455e71583496c4d902b8565)

mon: enforce mon0 delegation for initial_mon_key register

since this task is designed to be always run on the first monitor, let's
enforce the container name accordingly otherwise it could fail like
following:

```
fatal: [mon1 -> mon0]: FAILED! => changed=true
  cmd:
  - docker
  - exec
  - ceph-mon-mon1
  - ceph
  - --cluster
  - ceph
  - --name
  - mon.
  - -k
  - /var/lib/ceph/mon/ceph-mon0/keyring
  - auth
  - get-key
  - mon.
  delta: '0:00:00.085025'
  end: '2019-06-12 06:12:27.677936'
  msg: non-zero return code
  rc: 1
  start: '2019-06-12 06:12:27.592911'
  stderr: 'Error response from daemon: No such container: ceph-mon-mon1'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 905c2256bdf16fe19aa6e5ead641a30ee559f27d)

ceph-node-exporter: Fix systemd template

069076b introduced a bug in the systemd unit script template. This
commit fixes the options used by the node-exporter container.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d0840217f3e8ffd3fe9d45fbd41b60444b70a290)

dashboard: add allow_embedding support

Add a variable to support the allow_embedding support.

See ceph/ceph-ansible/issues/4084 for details.

Fixes: #4084
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 27856cc49959de0f7cfaef63f4a674fb3264a232)

dashboard: fix dashboard_url setting

This setting must be set to something resolvable.

See: ceph/ceph-ansible/issues/4085 for details

Fixes: #4085
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c9cd9d9e758f7ed2525643f33fe1427d1a857cd)

ceph-handler: replace fuser by /proc/net/unix

We're using fuser command to see if a process is using a ceph unix
socket file. But the fuser command runs through every PID present in
/proc/<PID> to see if one of them is using the file.
On a system running thousands processes, the fuser command can take
a long time to finish.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1717011

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da9891da1e8b9a8c91077c74e54a9df8ebb7070d)

tox-dashboard: update for nautilus

We don't need to use dev_setup playbook on stable branch. We also
need to remove the dev container image variables and update the
value to match nautilus.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ceph-node-exporter: use modprobe ansible module

Instead of using the modprobe command from the path in the systemd
unit script, we can use the modprobe ansible module.
That way we don't have to manage the binary path based on the linux
distribution.

Resolves: #4072

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit dbf81b6b5bd6d7e977706a93f9e75b38efe32305)

Fix units and add ability to have a dedicated instance

Few fixes on systemd unit templates for node_exporter and
alertmanager container parameters.
Added the ability to use a dedicated instance to deploy the
dashboard components (prometheus and grafana).
This commit also introduces the grafana_group_name variable
to refer grafana group and keep consistency with the other
groups.
During the integration with TripleO some grafana/prometheus
template variables resulted undefined. This commit adds the
ability to check if the group exist and create, accordingly,
different job groups in prometheus template.

Signed-off-by: fmount <fpantano@redhat.com>
(cherry picked from commit 069076bbfdef30c64b33dfcdd063b1e31d65d617)

validate: fail in check_devices at the right task

see https://bugzilla.redhat.com/show_bug.cgi?id=1648168#c17 for details.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1648168#c17
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 771648304d7d867e053f8b8fe3ce5b36e061f100)

spec: bring back possibility to install ceph with custom repo

This can be seen as a regression for customers who were used to deploy
in offline environment with custom repositories.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1673254
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c933645bf7015e08c97058186954483c40ecbfbd)

container-common: support podman on Ubuntu

Currently we're only able to use podman on ubuntu if podman's
installation is done manually before the ceph-ansible execution
because the deb package is present in an external repository.
We already manage the docker-ce installation via an external
repository so we should be able to allow the podman installation
with the same mechanism too.

https://github.com/containers/libpod/blob/master/install.md#ubuntu

Resolves: #3947

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 518ab794fb0965c6ca8af56f18e0c54529eca8d5)

podman: Add systemd dependency on network.target

When using podman, the systemd unit scripts don't have a dependency
on the network. So we're not sure that the network is up and running
when the containers are starting.
With docker this behaviour is already handled because the systemd
unit scripts depend on docker service which is started after the
network.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f49090df7ef82419c69dfd7a22250a79c17de42f)

ansible: use 'bool' filter on boolean conditionals

By running ceph-ansible there are a lot ``[DEPRECATION WARNING]`` like these:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```

Now appended ``| bool`` on a lot of the affected variables.

Sometimes the coding style from ``variable|bool`` changed to ``variable | bool`` *(with spaces at the pipe)*.

Closes: #4022
Signed-off-by: L3D <l3d@c3woc.de>
(cherry picked from commit ab54fe20ec2e3bf16e4544c39548d1e21dacf0d5)

purge-cluster: clean all ceph repo files

We currently only purge rh_storage yum repository file but depending
on the ceph_repository value we are using, the ceph repository file
could have a different name.

Resolves: #4056

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 44c63903cacb06fd6a32fcc591d31b2be3c7e82a)