restart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex
Matthew Vernon [Wed, 19 Sep 2018 13:25:15 +0000 (14:25 +0100)]

`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.
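
Both quantifiers accept the same strings; a quick demonstration with a
made-up unit name:

```
$ echo 'ceph-osd@12.service' | grep -oE 'ceph-osd@[0-9]{1,}'
ceph-osd@12
$ echo 'ceph-osd@12.service' | grep -oE 'ceph-osd@[0-9]+'
ceph-osd@12
```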

Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 806461ac6edd6aada39173df9d9163239fd82555)

restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
Matthew Vernon [Wed, 19 Sep 2018 12:26:26 +0000 (13:26 +0100)]

After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by counting as good any pg whose state contains "active+clean",
and comparing that count, as an integer, to num_pgs in the pgmap.
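
A sketch of the check (jq is used here for brevity; the actual script
parses the `ceph -s -f json` output with its own tooling):

```
# Count pgs whose state contains "active+clean" (this also covers the
# scrubbing, deep-scrubbing and snaptrim variants) and compare the sum
# with the total number of pgs reported in the pgmap.
num_pgs=$(ceph -s -f json | jq '.pgmap.num_pgs')
good_pgs=$(ceph -s -f json | jq '[.pgmap.pgs_by_state[]
  | select(.state_name | contains("active+clean")) | .count] | add // 0')
[ "$good_pgs" -eq "$num_pgs" ] && echo "pgs are healthy, moving on"
```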

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 04f4991648568e079f19f8e531a11a5fddd45c87)

restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs
Matthew Vernon [Fri, 21 Sep 2018 16:55:01 +0000 (17:55 +0100)]

Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the cluster's pgs to be happy again before continuing.
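
A minimal sketch of the fix (the `pgs_all_active_clean` helper is
hypothetical, standing in for the real health check):

```
check_pgs() {
  # Re-arm the counter on every call so each OSD restart gets the full
  # wait budget, instead of sharing one global budget across all OSDs.
  local retries=${RETRIES:-40}
  until pgs_all_active_clean; do
    retries=$((retries - 1))
    [ "$retries" -eq 0 ] && { echo "PGs not clean, giving up" >&2; return 1; }
    sleep "${DELAY:-30}"
  done
}
```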

Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit aa97ecf0480c1075187b38038463f2f52144c754)

nfs: ignore error on semanage command for ganesha_t
Guillaume Abrioux [Wed, 12 Sep 2018 13:02:06 +0000 (15:02 +0200)]

As of RHEL 7.6, it has been decided that it no longer makes sense to
confine `ganesha_t`, which means this domain won't exist anymore.

Let's add `failed_when: false` so that the deployment does not fail
when trying to run this command.
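
The shell equivalent of what `failed_when: false` achieves here (the
exact semanage invocation is assumed from the task's purpose):

```
# Run the command but never let its exit status fail the play.
semanage permissive -a ganesha_t || true
```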

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1626070
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a6f77340fd942c7ce1a969347215cc5e3b18b1b2)

tests: pin sphinx version to 1.7.9
Guillaume Abrioux [Thu, 13 Sep 2018 10:48:25 +0000 (12:48 +0200)]

using sphinx 1.8.0 breaks our doc test CI job.

Typical error:

```
Exception occurred:
  File "/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py", line 26, in <module>
    from sphinx.ext import doctest
SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```

See: https://github.com/sphinx-doc/sphinx/issues/5417

Pinning to 1.7.9 to fix our CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f2c660d2566b9d8772c7dbee7cbcd005a61bfc2)

ci: stop using different images on the same run
Sébastien Han [Thu, 23 Aug 2018 09:21:54 +0000 (11:21 +0200)]

There is no point in mixing hosts running on Atomic AND CentOS in the
same run, so let's run containerized scenarios on Atomic only.

This solves this error here:

```
fatal: [client2]: FAILED! => {
    "failed": true
}

MSG:

The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'

The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: set_fact ceph_current_status (convert to json)
  ^ here
```

From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a

What's happening here is that all the hosts except the clients are running Atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
the condition skips all the nodes except the clients; thus when running ceph-defaults, the task "is ceph running already?" is skipped, but the set_fact task shown above needs the rc of that skipped task.
This is not an error in the playbook; it's a CI setup issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 7012835d2b1880e7a6ef9a224df456b2dd1024cc)

ceph-mds: do not enable multimds on jewel (tag: v3.0.45)
Patrick Donnelly [Mon, 4 Jun 2018 19:52:47 +0000 (12:52 -0700)]

Multiple active MDS became stable in Luminous.

Introduced-by: c8573fe0d745e4667b5d757433efec9dac0150bc
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 9ce81ae845510cb70eee76ceef4412aa8155713a)

rolling_upgrade: set sortbitwise properly
Sébastien Han [Tue, 21 Aug 2018 09:15:44 +0000 (11:15 +0200)]

Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When the OSDs are getting updated, even though the package is
updated they won't report their updated version (12) and will stick
with 10 if the command is not applied. So we have to check whether the
OSDs are reporting version 10 and then run the command to unlock them.
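
A rough sketch of the idea (`ceph tell osd.* version` is one way to see
what version each OSD reports; the playbook's actual condition differs):

```
# Only set the flag while OSDs still report a jewel (10.x) version.
if ceph tell osd.* version 2>/dev/null | grep -q 'ceph version 10\.'; then
  ceph osd set sortbitwise
fi
```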

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2e6e885bb75156c74735a65c05b4757b031041bb)

iscsi group name preserve backward compatibility (tag: v3.0.44)
Sébastien Han [Mon, 20 Aug 2018 13:53:03 +0000 (15:53 +0200)]

Recently we renamed the iscsi group_name to iscsigws, where previously
it was named iscsi-gws. Existing deployments with a host file section
using iscsi-gws must continue to work.

This commit adds the old group name back for backward compatibility; no
error from Ansible should be expected, and if the hostgroup is not
found nothing is played.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 77a3a682f358c8e9a40c5b50e980b5e9ec5f6d60)

take-over-existing-cluster: do not call var_files
Sébastien Han [Mon, 20 Aug 2018 12:41:06 +0000 (14:41 +0200)]

We were using var_files long ago, when default variables were not in
ceph-defaults; now that the role exists this is no longer needed.
Moreover, having these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

will create collisions and override necessary variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b7387068109a521796f8e423a61449541043f4e6)

mgr: improve/fix disabled modules check
Dardo D Kleiner [Wed, 15 Aug 2018 12:50:19 +0000 (08:50 -0400)]

Follow up on 36942af6983d60666f3f8a1a06b352a440a6c0da

"disabled_modules" is always a list, it's the items in the list that
can be dicts in mimic.  Many ways to fix this, here's one.

Signed-off-by: Dardo D Kleiner <dardokleiner@gmail.com>
(cherry picked from commit f6519e4003404e10ae1f5e86298cffd4405591da)

Revert "osd: generate device list for osd_auto_discovery on rolling_update" (tag: v3.0.43)
Sébastien Han [Mon, 13 Aug 2018 13:54:37 +0000 (15:54 +0200)]

This reverts commit e84f11e99ef42057cd1c3fbfab41ef66cda27302.

This commit was causing a new failure later during the rolling_update
process. Basically, it was modifying the list of devices and started
impacting the ceph-osd role itself. The modification to accommodate the
osd_auto_discovery parameter should happen outside of ceph-osd.

Also, we are trying not to play the ceph-osd role during the
rolling_update process so we can speed up the upgrade.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3149b2564fb89f2352820d83be02c09f658bdf60)

rolling_update: register container osd units
Sébastien Han [Mon, 13 Aug 2018 13:59:25 +0000 (15:59 +0200)]

Before running the upgrade, let's call systemd to collect unit names
instead of relying on the device list. This is more accurate and fixes
the osd_auto_discovery scenario too.
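
One way to collect those unit names (a sketch; the playbook does this
through an Ansible task):

```
# Ask systemd for the ceph-osd units actually present on the host
# instead of deriving them from the devices list.
systemctl list-units --type=service --no-legend 'ceph-osd@*' | awk '{print $1}'
```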

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit dad10e8f3f67f1e0c6a14ef3e0b1f51f90d9d962)

Use /var/lib/ceph/osd folder to filter osd mount point
Jeffrey Zhang [Mon, 13 Aug 2018 05:23:48 +0000 (13:23 +0800)]

In some cases, a user may mount a partition at /var/lib/ceph itself;
unmounting it would then fail, and there is no need to do so anyway, so
filter on the /var/lib/ceph/osd folder instead.
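
The idea, sketched in shell:

```
# Match only real OSD mount points, not a filesystem that happens to be
# mounted on /var/lib/ceph itself.
awk '$2 ~ "^/var/lib/ceph/osd" {print $2}' /proc/mounts
```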

Signed-off-by: Jeffrey Zhang <zhang.lei.fly@gmail.com>
(cherry picked from commit 85cc61a6d9f23cc98a817ea988c8b50e6c55698f)

contrib: fix generate group_vars samples
Sébastien Han [Wed, 6 Jun 2018 06:41:46 +0000 (14:41 +0800)]

For the ceph-iscsi-gw and ceph-rbd-mirror roles the group_name is named
differently (by default) than the role name, so we have to change the
script to generate the correct name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1602327
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 315ab08b1604e655eee4b493eb2c1171a67df506)

mgr: backward compatibility for module management
Guillaume Abrioux [Tue, 7 Aug 2018 12:46:07 +0000 (14:46 +0200)]

Follow up on 3abc253fecc91f29c90e23ae95e1b83f8ffd3de6

The structure had even changed within `luminous` release.
It was first:

```
{
    "enabled_modules": [
        "balancer",
        "dashboard",
        "restful",
        "status"
    ],
    "disabled_modules": [
        "influx",
        "localpool",
        "prometheus",
        "selftest",
        "zabbix"
    ]
}
```
Then it changed to:

```
{
  "enabled_modules": [
      "status"
  ],
  "disabled_modules": [
      "balancer",
      "dashboard",
      "influx",
      "localpool",
      "prometheus",
      "restful",
      "selftest",
      "zabbix"
  ]
}
```

and finally:
```
{
  "enabled_modules": [
      "status"
  ],
  "disabled_modules": [
      {
          "name": "balancer",
          "can_run": true,
          "error_string": ""
      },
      {
          "name": "dashboard",
          "can_run": true,
          "error_string": ""
      }
  ]
}
```
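
A sketch of shape-tolerant parsing with jq (the playbook itself
normalizes this with Jinja filters):

```
# disabled_modules entries may be plain strings (early luminous) or
# objects with a "name" key (later releases); normalize both shapes.
ceph mgr module ls | jq '[.disabled_modules[] | if type == "object" then .name else . end]'
```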

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 36942af6983d60666f3f8a1a06b352a440a6c0da)

mon: fix calamari initialisation (tag: v3.0.42)
Sébastien Han [Fri, 10 Aug 2018 09:08:14 +0000 (11:08 +0200)]

If calamari is already installed and ceph has been upgraded to a higher
version, the initialisation will fail later. So if we detect that the
calamari-server is too old compared to ceph_rhcs_version, we try to
update it.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1601755
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4c9e24a90fca2271978a066e38dfadead88d8167)

osd: generate device list for osd_auto_discovery on rolling_update
Sébastien Han [Thu, 9 Aug 2018 13:18:34 +0000 (15:18 +0200)]

rolling_update relies on the list of devices when performing the restart
of the OSDs. The task that builds the devices list out of the
ansible_devices dict only runs when there are no partitions on the
drives. However, during an upgrade the OSDs are already configured: they
have been prepared and have partitions, so this task won't run, the
devices list will be empty, and the restart will be skipped during
rolling_update. We now run the same task under different requirements
when rolling_update is true and build a list when:

* osd_auto_discovery is true
* rolling_update is true
* ansible_devices exists
* no dm/lv are part of the discovery
* the device is not removable
* the device has more than 1 sector

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit e84f11e99ef42057cd1c3fbfab41ef66cda27302)

iscsigw: install ceph-iscsi-cli package (tag: v3.0.41)
Guillaume Abrioux [Thu, 2 Aug 2018 09:58:47 +0000 (11:58 +0200)]

Install ceph-iscsi-cli in order to provide the `gwcli` command tool.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1602785
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1164cdc002cccb9dc1c6f10fb6b4370eafda3c4b)

Fix regular expression matching OSD ID on non-containerized deployment
Artur Fijalkowski [Thu, 2 Aug 2018 11:28:44 +0000 (13:28 +0200)]

restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do this the script loops over the list of ceph-osd@ services
on the system. This commit fixes a bug in the regular expression
responsible for extracting OSD IDs: the prior version used the
`[0-9]{1,2}` expression, which ignores all OSDs whose numbers are
greater than 99 (i.e. longer than 2 digits). The fix removes the upper
limit on the number of digits. This problem existed in two places in
the script.
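
A quick demonstration of the bug (made-up unit name):

```
$ echo 'ceph-osd@123.service' | grep -oE 'ceph-osd@[0-9]{1,2}'
ceph-osd@12
$ echo 'ceph-osd@123.service' | grep -oE 'ceph-osd@[0-9]+'
ceph-osd@123
```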

Closes: #2964
Signed-off-by: Artur Fijalkowski <artur.fijalkowski@ing.com>
(cherry picked from commit 52d9d406b107c4926b582905b3d442feabf1fafc)

defaults: backward compatibility with fqdn deployments
Guillaume Abrioux [Tue, 31 Jul 2018 13:18:28 +0000 (15:18 +0200)]

This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployments to be done with shortnames,
we must keep backward compatibility with clusters already deployed
with an fqdn configuration.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a6ff6bbf8a7b4ba4ab5236eca93325d8ee61b1b)

rolling_update: set osd sortbitwise (tag: v3.0.40)
Sébastien Han [Mon, 23 Jul 2018 12:56:20 +0000 (14:56 +0200)]

Upgrading RHCS 2 -> RHCS 3 will fail if the cluster still has
nibblewise sorting set: it stays stuck on "TASK [waiting for clean
pgs...]", as RHCS 3 osds will not start while nibblewise is set.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit b3266c5be2f88210589cfa56a5fe0a5092f79ee6)

tox: test upgrades to luminous only on stable-3.0
Sébastien Han [Tue, 31 Jul 2018 14:16:04 +0000 (16:16 +0200)]

stable-3.0 must not attempt to run an upgrade from Luminous to Mimic, so
if running on stable-3.0 we should only do:

* Jewel to Jewel minor
* Jewel to Luminous
* Luminous to Luminous minor

Signed-off-by: Sébastien Han <seb@redhat.com>

osd: do not remove expose_partition container
Sébastien Han [Fri, 27 Jul 2018 14:52:19 +0000 (16:52 +0200)]

The container runs with --rm which means it will be deleted by Docker
when exiting. Also 'docker rm -f' is not idempotent and returns 1 if the
container does not exist.

(641f141c0fd070cdf982401f4529da41614d237e should have been backported to
avoid conflict merge but it's too many changes to be backported.)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1609007
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2ca8c519066555e06a261d5dee3fb46ce5daad0b)

rgw: add more config option for civetweb frontend
Sébastien Han [Tue, 24 Jul 2018 16:27:12 +0000 (18:27 +0200)]

In containerized deployments we now inherit the
radosgw_civetweb_options options when bootstrapping the container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582411
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit e2ea5bac5111c7633640b66f2570dc83893bae7a)

common: enforce ansible version check
Sébastien Han [Tue, 24 Jul 2018 09:38:51 +0000 (11:38 +0200)]

Make sure we fail if the Ansible version is >= 2.5; stable-3.0 only
supports Ansible between 2.3.x and 2.4.x.

Signed-off-by: Sébastien Han <seb@redhat.com>

tests: do not deploy all daemons for shrink osds scenarios
Guillaume Abrioux [Mon, 23 Jul 2018 14:40:49 +0000 (16:40 +0200)]

Let's create a dedicated environment for these scenarios; there is no
need to deploy everything. Besides, doing so saves some time.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b89cc1746f8652b67d95410ed80473d1a2c3d312)

shrink-osd: purge osd on containerized deployment
Sébastien Han [Wed, 18 Jul 2018 14:20:47 +0000 (16:20 +0200)]

Prior to this commit we were only stopping the container, but now we
also purge the devices.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ce1dd8d2b3986d3bd08b4d73efd88b4de72fcc00)

tests: stop hardcoding ansible version
Guillaume Abrioux [Thu, 19 Jul 2018 11:52:36 +0000 (13:52 +0200)]

In addition to ceph/ceph-build#1082

Let's set the ansible version in each ceph-ansible branch's respective
requirements.txt.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: add latest-bis-jewel for jewel tests
Guillaume Abrioux [Tue, 17 Jul 2018 08:47:28 +0000 (10:47 +0200)]

Since no latest-bis-jewel exists, it's using latest-bis, which points
to ceph mimic. In our testing, using it for idempotency/handlers tests
means upgrading from jewel to mimic, which is not what we want to do.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 05852b03013d15f6f400fe6728c24b11f22b75de)

ceph-iscsi: rename group iscsi_gws
Sébastien Han [Wed, 6 Jun 2018 04:07:33 +0000 (12:07 +0800)]

Let's try to avoid using dashes, as testinfra needs to be able to read
the groups. With iscsi-gws, for instance, we can't add a marker for
these iscsi nodes; using an underscore fixes the issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 20c8065e48b6ea3027669bd5fa2f79779eac570a)

mgr: fix condition to add modules to ceph-mgr
Guillaume Abrioux [Wed, 11 Jul 2018 14:34:09 +0000 (16:34 +0200)]

Follow up on #2784

We must check the generated fact `_disabled_ceph_mgr_modules` when
enabling a disabled mgr module.
Otherwise, this task will be skipped because it's not comparing the
right list.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600155
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ce5ac930c5b91621a46fc69ddd0dcafb2a24947d)

mgr: fix enabling of mgr module on mimic
Guillaume Abrioux [Mon, 18 Jun 2018 15:26:21 +0000 (17:26 +0200)]

The data structure has slightly changed on mimic.

Prior to mimic, it used to be:

```
{
    "enabled_modules": [
        "status"
    ],
    "disabled_modules": [
        "balancer",
        "dashboard",
        "influx",
        "localpool",
        "prometheus",
        "restful",
        "selftest",
        "zabbix"
    ]
}
```

From mimic it looks like this:

```
{
    "enabled_modules": [
        "status"
    ],
    "disabled_modules": [
        {
            "name": "balancer",
            "can_run": true,
            "error_string": ""
        },
        {
            "name": "dashboard",
            "can_run": true,
            "error_string": ""
        }
    ]
}
```

This means we can't simply check if `item` is in
`_ceph_mgr_modules.disabled_modules`.

The idea here is to use the `map(attribute='name')` filter to build a
list of module names when deploying mimic.

Fixes: #2766
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3abc253fecc91f29c90e23ae95e1b83f8ffd3de6)

Add ability to enable ceph mgr modules.
Fabien Brachere [Mon, 16 Oct 2017 13:04:23 +0000 (15:04 +0200)]

(Couldn't find actual author email to add the signoff accordingly so I've
added it with my email to make the CI happy.)
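
For reference, the ceph CLI this feature drives looks like this (the
module name is just an example):

```
ceph mgr module enable balancer
```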

(cherry picked from commit 3a587575d76ecfec050d3d12cd416853baf590af)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: skip tests for node iscsi-gw when deploying jewel
Guillaume Abrioux [Fri, 22 Jun 2018 09:56:24 +0000 (11:56 +0200)]

The CI provisions an iscsigw node anyway, but nothing is deployed on
it, so let's skip the tests accordingly.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2d560b562a2d439fbed34b3066de93dfdff650ff)

tests: fix broken tests in collocated daemons scenarios
Guillaume Abrioux [Wed, 20 Jun 2018 11:44:08 +0000 (13:44 +0200)]

At the moment, a lot of tests are skipped when daemons are collocated.
Our tests consider a node to belong to only one group, while in
certain scenarios it can belong to multiple groups.

Also pinning to pytest 3.6.1 so we can use `request.node.iter_markers()`.

Co-Authored-by: Alfredo Deza <adeza@redhat.com>
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d83b24d27121591dafcba8297288c6f3a7ede42e)

tests: skip rgw_tuning_pools_are_set when rgw_create_pools is not defined
Guillaume Abrioux [Fri, 22 Jun 2018 09:20:33 +0000 (11:20 +0200)]

Since the ooo_collocation scenario is supposed to be the same scenario
as the one tested by OSP, and they are not passing `rgw_create_pools`,
the test `test_docker_rgw_tuning_pools_are_set` will fail:
```
>       pools = node["vars"]["rgw_create_pools"]
E       KeyError: 'rgw_create_pools'
```

Skipping this test when `node["vars"]["rgw_create_pools"]` is not
defined fixes this failure.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1c3dae4a90816ac6503967779b7fd77ff84900b5)

tests: factorize docker tests using docker_exec_cmd logic
Guillaume Abrioux [Mon, 25 Jun 2018 15:10:37 +0000 (17:10 +0200)]

Avoid duplicating tests unnecessarily just because of the docker exec
syntax. Using the same logic as in the playbook with `docker_exec_cmd`
allows us to execute the same test in both containerized and
non-containerized environments.

The idea is to set a variable `docker_exec_cmd` with the
'docker exec <container-name>' string when containerized and
set it to '' when non containerized.
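
The pattern, sketched in shell (container name hypothetical):

```
docker_exec_cmd="docker exec ceph-mon-mon0"  # containerized nodes
docker_exec_cmd=""                           # non containerized nodes
$docker_exec_cmd ceph --cluster ceph health  # same test either way
```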

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f2e57a56db2801818135ba85479fedfc00eae30c)

tests: add mimic support for test_rbd_mirror_is_up()
Guillaume Abrioux [Thu, 5 Jul 2018 13:16:19 +0000 (15:16 +0200)]

Prior to mimic, the data structure returned by `ceph -s -f json`, used
to gather information about rbd-mirror daemons, looked like below:

```
  "servicemap": {
    "epoch": 8,
    "modified": "2018-07-05 13:21:06.207483",
    "services": {
      "rbd-mirror": {
        "daemons": {
          "summary": "",
          "ceph-nano-luminous-faa32aebf00b": {
            "start_epoch": 8,
            "start_stamp": "2018-07-05 13:21:04.668450",
            "gid": 14107,
            "addr": "172.17.0.2:0/2229952892",
            "metadata": {
              "arch": "x86_64",
              "ceph_version": "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)",
              "cpu": "Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz",
              "distro": "centos",
              "distro_description": "CentOS Linux 7 (Core)",
              "distro_version": "7",
              "hostname": "ceph-nano-luminous-faa32aebf00b",
              "instance_id": "14107",
              "kernel_description": "#1 SMP Wed Mar 14 15:12:16 UTC 2018",
              "kernel_version": "4.9.87-linuxkit-aufs",
              "mem_swap_kb": "1048572",
              "mem_total_kb": "2046652",
              "os": "Linux"
            }
          }
        }
      }
    }
  }
```

This part has changed from mimic onward and became:
```
  "servicemap": {
    "epoch": 2,
    "modified": "2018-07-04 09:54:36.164786",
    "services": {
      "rbd-mirror": {
        "daemons": {
          "summary": "",
          "14151": {
            "start_epoch": 2,
            "start_stamp": "2018-07-04 09:54:35.541272",
            "gid": 14151,
            "addr": "192.168.1.80:0/240942528",
            "metadata": {
              "arch": "x86_64",
              "ceph_release": "mimic",
              "ceph_version": "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)",
              "ceph_version_short": "13.2.0",
              "cpu": "Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz",
              "distro": "centos",
              "distro_description": "CentOS Linux 7 (Core)",
              "distro_version": "7",
              "hostname": "ceph-rbd-mirror0",
              "id": "ceph-rbd-mirror0",
              "instance_id": "14151",
              "kernel_description": "#1 SMP Wed May 9 18:05:47 UTC 2018",
              "kernel_version": "3.10.0-862.2.3.el7.x86_64",
              "mem_swap_kb": "1572860",
              "mem_total_kb": "1015548",
              "os": "Linux"
            }
          }
        }
      }
    }
  }
```

This patch modifies the function `test_rbd_mirror_is_up()` in
`test_rbd_mirror.py` so it works with `mimic` and keeps backward compatibility
with `luminous`
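
The difference boils down to the keys of the daemons map, as a jq
sketch shows:

```
# Hostnames before mimic ("ceph-nano-luminous-faa32aebf00b"), gids
# ("14151") from mimic onward.
ceph -s -f json | jq '.servicemap.services["rbd-mirror"].daemons | keys - ["summary"]'
```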

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 09d795b5b737a05164772f5e3ba469577d605344)

tests: fix `_get_osd_id_from_host()` in TestOSDs()
Guillaume Abrioux [Mon, 9 Jul 2018 09:51:24 +0000 (11:51 +0200)]

We must initialize the `children` variable in `_get_osd_id_from_host()`;
otherwise, if for any reason the deployment has failed and results in
an osd host with no OSD registered, we won't enter the condition,
`children` is never set, and the function tries to return something
undefined.

Typical error:
```
E       UnboundLocalError: local variable 'children' referenced before assignment
```

Fixes: #2860
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9a65ec231d7f76655caa964e1d9228ba7a910dea)

tests: refact test_all_*_osds_are_up_and_in
Guillaume Abrioux [Fri, 22 Jun 2018 23:43:49 +0000 (01:43 +0200)]

These tests are skipped on bluestore osds scenarios.
They were going to fail anyway, since they are run on mon nodes while
`devices` is defined in the inventory for each osd node. It means
`num_devices * num_osd_hosts` returns `0`, so the test expects to have
0 OSDs up.

The idea here is to move these tests so they are run on OSD nodes.
Each OSD node checks its respective OSDs to be up: if an OSD node has 2
devices defined in the `devices` variable, it means we are checking for
2 OSDs to be up on that node; if each node has all its OSDs up, we can
say all OSDs are up.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fe79a5d24086fc61ae84a59bd61a053cddb62941)

common: switch from docker module to docker_container
Guillaume Abrioux [Mon, 9 Jul 2018 13:50:52 +0000 (15:50 +0200)]

As of ansible 2.4, the `docker` module has been removed (it had been
deprecated since ansible 2.1). We must switch to `docker_container`
instead.

See: https://docs.ansible.com/ansible/latest/modules/docker_module.html#docker-module

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d0746e08586556b7d44ba4b20cd553a27860f30b)

mon: ensure socket is purged when mon is stopped
Guillaume Abrioux [Tue, 10 Jul 2018 09:56:17 +0000 (11:56 +0200)]

On containerized deployment, if a mon is stopped, the socket is not
purged and can cause failure when a cluster is redeployed after the
purge playbook has been run.

Typical error:

```
fatal: [osd0]: FAILED! => {}

MSG:

'dict object' has no attribute 'osd_pool_default_pg_num'
```

the fact is not set because of this previous failure earlier:

```
ok: [mon0] => {
    "changed": false,
    "cmd": "docker exec ceph-mon-mon0 ceph --cluster test daemon mon.mon0 config get osd_pool_default_pg_num",
    "delta": "0:00:00.217382",
    "end": "2018-07-09 22:25:53.155969",
    "failed_when_result": false,
    "rc": 22,
    "start": "2018-07-09 22:25:52.938587"
}

STDERR:

admin_socket: exception getting command descriptions: [Errno 111] Connection refused

MSG:

non-zero return code
```

This failure happens when the ceph-mon service is stopped: since the
socket isn't purged, it's a leftover which confuses the process.
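
A sketch of the cleanup (the default socket path layout is assumed;
cluster and mon names vary per deployment):

```
# Remove the leftover admin socket once the mon container is stopped.
rm -f "/var/run/ceph/${CLUSTER:-ceph}-mon.$(hostname -s).asok"
```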

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9f54b3b4a7c1a7015def3a7987e3e9e426385251)

ceph-common: fix typo in task (tag: v3.0.39)
Sébastien Han [Thu, 5 Jul 2018 07:44:31 +0000 (09:44 +0200)]

Somehow one option moved up and broke the yaml. Putting it back in the
right place.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1598185
Signed-off-by: Sébastien Han <seb@redhat.com>

ceph-common: fix rhcs condition (tag: v3.0.38)
Sébastien Han [Wed, 4 Jul 2018 14:39:33 +0000 (16:39 +0200)]

We forgot to add mgr_group_name when checking for the mon repo, thus the
conditional on the next task was failing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1598185
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fcf11ecc3567398f92b9f91e1a0749edb921131f)

ceph-mon: Generate initial keyring
Ha Phan [Thu, 21 Jun 2018 08:08:39 +0000 (16:08 +0800)]

Minor fix so that the initial keyring can be generated using python3.
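
For context, a python3-compatible way to generate the monitor secret (a
sketch following the key format ceph documents; the role's actual task
may differ slightly):

```
python3 -c '
import os, struct, time, base64
key = os.urandom(16)
hdr = struct.pack("<hiih", 1, int(time.time()), 0, len(key))
print(base64.b64encode(hdr + key).decode())
'
```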

Signed-off-by: Ha Phan <thanhha.work@gmail.com>
(cherry picked from commit a7b7735b6fd23985d24a492f1bf4c5be7f1961b2)

tests: skip iscsi-gw nodes on jewel
Guillaume Abrioux [Tue, 3 Jul 2018 07:35:38 +0000 (09:35 +0200)]

On stable-3.0 we can't test iscsi-gw nodes on jewel because it's not
implemented yet.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Enable monitor repo for mgr nodes and Tools repo for iscsi/nfs/clients (tag: v3.0.37)
Vasu Kulkarni [Tue, 26 Jun 2018 21:41:14 +0000 (14:41 -0700)]

Signed-off-by: Vasu Kulkarni <vasu@redhat.com>
(cherry picked from commit 1d454b611f9ec5403a474fcb45a6333ca6d36715)

tests: no need to remove partitions in lvm_setup.yml
Andrew Schoen [Mon, 12 Mar 2018 19:06:39 +0000 (14:06 -0500)]

Now that we are using ceph_volume_zap, the partitions are kept around
and should be reusable.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 98e237d234872a66b064ef9e3284a8763dda9186)

test: when creating the /dev/sdc2 partition specify label as gpt
Andrew Schoen [Mon, 30 Oct 2017 20:31:04 +0000 (15:31 -0500)]

ansible==2.4 requires that the label be set to gpt, or it will default
to msdos.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 37a48209ccf702a817b445e8b6164e38b60ea708)

Skip GPT header creation for lvm osd scenario
Vishal Kanaujia [Wed, 16 May 2018 09:58:31 +0000 (15:28 +0530)]

The LVM lvcreate fails if the disk already has a GPT header. We used to
create a GPT header regardless of the OSD scenario; the fix is to skip
header creation for the lvm scenario.

fixes: https://github.com/ceph/ceph-ansible/issues/2592

Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
(cherry picked from commit ef5f52b1f36188c3cab40337640a816dec2542fa)
Signed-off-by: Sébastien Han <seb@redhat.com>

purge-docker: added conditionals needed to successfully re-run purge
Randy J. Martinez [Wed, 28 Mar 2018 23:46:54 +0000 (18:46 -0500)]

Added 'ignore_errors: true' to multiple tasks which run docker
commands, even in cases where docker is no longer installed. Without
this, certain tasks in purge-docker-cluster.yml cause the playbook to
fail when re-run, stopping the purge. This leaves behind a dirty
environment and a playbook which can no longer be run.

Fix regex on line 275: sometimes 'list-units' will output 4 spaces
between "loaded" and "active". The update accounts for both scenarios.

purge fetch_directory: in other roles fetch_directory is joined with a
hard-coded slash, e.g. "{{ fetch_directory }}"/"{{ somedir }}". Since
fetch_directory will never have a trailing slash in all.yml, this task
was never being run (causing failures when trying to re-deploy).

Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
(cherry picked from commit d1f2d64b15479524d9e347303c8d5e4bfe7c15d8)
Signed-off-by: Sébastien Han <seb@redhat.com>

systemd: remove changed_when: false
Sébastien Han [Thu, 28 Jun 2018 07:53:03 +0000 (09:53 +0200)]

When using a module there is no need to apply this Ansible option; the
module handles idempotency on its own and decides whether or not the
task has changed during the execution.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit f6239972716b013bafcb9313c9c83723615aa7d6)

# Conflicts:
# roles/ceph-iscsi-gw/tasks/container/containerized.yml

ceph-osd: trigger osd container restart on script change
Sébastien Han [Thu, 28 Jun 2018 07:54:24 +0000 (09:54 +0200)]

The script ceph-osd-run.sh holds the config options to start the
container; if one of these options is modified we must restart the
container. This was not the case before, because the 'notify' flag
wasn't present.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1596061
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit abdb53e16a7f46ceebbf4de65ed1add04da0d543)

tests: reduce the amount of time we wait
Guillaume Abrioux [Tue, 26 Jun 2018 11:42:27 +0000 (13:42 +0200)]

This sleep 120 looks a bit long, let's reduce this to 30sec and see if
things go faster.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 081600842ff3758910109d9f636b54cb12a85ed9)

mon: honour mon_docker_net_host option
Sébastien Han [Wed, 27 Jun 2018 09:23:00 +0000 (11:23 +0200)]

--net=host was hardcoded in the startup line so even though
mon_docker_net_host was set to False the net option would always be
activated.
mon_docker_net_host is set to True by default so this commit does not
change the behaviour.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 322e2de7d27c498a9d89b31b96729200bed56e19)

tests: add more nodes in ooo testing scenario
Guillaume Abrioux [Wed, 13 Jun 2018 14:46:40 +0000 (16:46 +0200)]

Adding more nodes in this scenario could help to get better coverage,
so we can catch more potential bugs.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 481c14455aa3bcd05cc1c190458148a4d516e991)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

mon/osd: bump container memory limit
Sébastien Han [Fri, 15 Jun 2018 19:39:34 +0000 (15:39 -0400)]

As discussed with the cores, the current limits are too low and should
be bumped to higher values. So now, by default, monitors get 3GB and
OSDs get 5GB.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591876
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit a9ed3579ae680949b4f53aee94003ca50d1ae721)
Signed-off-by: Sébastien Han <seb@redhat.com>

tests: fix *_has_correct_value tests
Guillaume Abrioux [Tue, 19 Jun 2018 16:08:10 +0000 (18:08 +0200)]

It might happen that the lists of ips/hosts in the following lines
(ceph.conf)
- `mon initial members = <hosts>`
- `mon host = <ips>`

are not ordered the same way depending on the deployment.

This patch makes the tests look for each ip or hostname in the
respective lines.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f68936ca7e7f556c8d8cee8b2c4565a3c94f72f9)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: keep same ceph release during handlers/idempotency test
Guillaume Abrioux [Fri, 15 Jun 2018 08:44:25 +0000 (10:44 +0200)]

Since `latest` points to `mimic`, we need to force the test to keep the
same ceph release when testing anything other than `mimic`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 21894655a7eaf0db6c99a46955e0e8ebc59a83af)

tests: increase memory to 1024Mb for centos7_cluster scenario
Guillaume Abrioux [Mon, 11 Jun 2018 08:49:39 +0000 (10:49 +0200)]

We see more and more failures like `fatal: [mon0]: UNREACHABLE! => {}`
in the `centos7_cluster` scenario. Since we have 30GB of RAM on the
hypervisors, we can give the monitors a bit more RAM. Nodes in the
containerized cluster testing scenario already have 1024MB of memory
allocated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bbb869133563c3b0ddd0388727b894306b6b8b26)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

tests: improve mds tests
Guillaume Abrioux [Wed, 6 Jun 2018 19:56:38 +0000 (21:56 +0200)]

The expected number of mds daemons is the number of daemons that are
'up' plus the number of daemons that are 'up:standby'.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c94ada69e80d7a1ddfbd2de2b13086d57a6fdfcd)

mon: copy openstack keys over to all mon
Guillaume Abrioux [Thu, 7 Jun 2018 07:09:38 +0000 (09:09 +0200)]

When configuring openstack, the created keyrings aren't copied over to
all monitor nodes.

This should have been backported from
433ecc7cbcc1ac91cab509dabe5c647d58c18c7f but that would imply too many
changes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1588093
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

rolling_update: fix facts gathering delegation
Guillaume Abrioux [Tue, 5 Jun 2018 14:30:12 +0000 (16:30 +0200)]

This is a kind of follow-up on what has been done in #2560.
See #2560 and #2553 for details.

Closes: #2708
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 232a16d77ff1048a2d3c4aa743c44e864fa2b80b)
Signed-off-by: Sébastien Han <seb@redhat.com>

playbook: follow up on #2553
Guillaume Abrioux [Thu, 24 May 2018 13:07:56 +0000 (15:07 +0200)]

Since we fixed the `gather and delegate facts` task, this exception is
not needed anymore. It's a leftover that should be removed to save some
time when deploying a cluster with a large number of clients.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 828848017cefd981e14ca9e4690dd7d1320f0eef)

rgws: rename create_pools variable to rgw_create_pools
jtudelag [Thu, 31 May 2018 15:01:44 +0000 (17:01 +0200)]

Renamed to be consistent with the role (rgw) and have a meaningful name.

Signed-off-by: Jorge Tudela <jtudelag@redhat.com>
(cherry picked from commit 600e1e2c2680e8102f4ef17855d4bcd89d6ef733)
Signed-off-by: Sébastien Han <seb@redhat.com>

Adds RGWs pool creation to containerized installation.
jtudelag [Sun, 4 Mar 2018 22:06:48 +0000 (23:06 +0100)]

The ceph command has to be executed from one of the monitor containers
if no copy of the admin keyring is present on the RGWs; the task has to
be delegated in that case.

Adds a test to check proper RGW pool creation for Docker container
scenarios.

Signed-off-by: Jorge Tudela <jtudelag@redhat.com>
(cherry picked from commit 8704144e3157aa253fb7563fe701d9d434bf2f3e)
Signed-off-by: Sébastien Han <seb@redhat.com>

tests: skip disabling fastest mirror detection on atomic host
Guillaume Abrioux [Tue, 5 Jun 2018 07:31:42 +0000 (09:31 +0200)]

There is no need to execute this task on atomic hosts.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f0cd4b065144843762b9deca667e05a1903b2121)

ceph-defaults: Enable local epel repository
Erwan Velu [Fri, 1 Jun 2018 16:53:10 +0000 (18:53 +0200)]

During the tests, the remote epel repository generates a lot of errors,
leading to broken jobs (issue #2666).

This patch is about using a local repository instead of a random one.
To achieve that, we make a preliminary install of epel-release, remove
the metalink and enforce a baseurl pointing to our local http mirror.
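
A sketch of those steps (the mirror URL is hypothetical):

```
yum install -y epel-release
# Disable the metalink and pin a fixed baseurl to the local mirror.
sed -i -e 's/^metalink=/#metalink=/' \
    -e 's|^#baseurl=.*|baseurl=http://local-mirror.example.com/epel/7/$basearch/|' \
    /etc/yum.repos.d/epel.repo
```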

That should speed up the build process but also avoid the random errors
we face.

This patch is part of a patch series that tries to remove all possible yum failures.

Signed-off-by: Erwan Velu <erwan@redhat.com>
(cherry picked from commit 493f615eae3510021687e8cfc821364cc26a71ac)

Makefile: followup on #2585 (tag: v3.0.36)
Guillaume Abrioux [Thu, 31 May 2018 09:25:49 +0000 (11:25 +0200)]

Fix a typo in the `tag` target: double quotes are missing there.

Without them, the `make tag` command fails like this:

```
if [[ "v3.0.35" ==  ]]; then \
            echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \
            exit 1; \
        fi
/bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator
/bin/sh: -c: line 0: syntax error near `;'
/bin/sh: -c: line 0: `if [[ "v3.0.35" ==  ]]; then     echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35";     exit 1; fi'
make: *** [tag] Error 2
```
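
With the expansion quoted on both sides, the test is valid even when
the variable is empty:

```
if [[ "v3.0.35" == "" ]]; then echo "already tagged"; fi
```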

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0b67f42feb95594fb403908d61383dc25d6cd342)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Makefile: add "make tag" command
Ken Dreyer [Thu, 10 May 2018 23:08:05 +0000 (17:08 -0600)]

Add a new "make tag" command. This automates some common operations:

1) Automatically determine the next Git tag version number to create.
   For example:
   "3.2.0beta1 -> "3.2.0beta2"
   "3.2.0rc1 -> "3.2.0rc2"
   "3.2.0" -> "3.2.1"

2) Create the Git tag, and print instructions for the user to push it to
   GitHub.

3) Sanity check that HEAD is a stable-* branch or master (bail on
   everything else).

4) Sanity check that HEAD is not already tagged.
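
Roughly what the version-bump logic automates, as a shell sketch (the
real Makefile recipe differs):

```
latest=$(git describe --tags --abbrev=0)   # e.g. 3.2.0rc1
next=$(echo "$latest" | awk 'match($0, /[0-9]+$/) {
  print substr($0, 1, RSTART - 1) substr($0, RSTART) + 1 }')
git tag "$next" && echo "now run: git push origin $next"
```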

Note, we will still need to tag manually once each time we change the
format, for example when moving from tagging "betas" to tagging "rcs",
or "rcs" to "stable point releases".

Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fcea56849578bd47e65b130ab6884e0b96f9d89d)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

rgw: container add option to configure multi-site zone
Sébastien Han [Mon, 16 Apr 2018 13:57:23 +0000 (15:57 +0200)]

You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host from your
inventory and assign them a value. Once the rgw container starts, it'll
pick up the info and add itself to the right zone.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1c084efb3cb7e48d96c9cbd6bd05ca4f93526853)
Signed-off-by: Sébastien Han <seb@redhat.com>

tests: resize root partition when atomic host
Guillaume Abrioux [Wed, 30 May 2018 07:17:09 +0000 (09:17 +0200)]

For a while now we have been seeing failures in the CI for
containerized scenarios because VMs are running out of space at some
point.

The default in the images used is to have only 3GB for the root
partition, which doesn't sound like a lot.

Typical error seen:

```
STDERR:

failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```

Indeed, on the machine we can see:
```
Every 2.0s: df -h                                Tue May 29 17:21:13 2018
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  3.0G  3.0G   14M 100% /
```

The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.

```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
  Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
  Logical volume atomicos/root successfully resized.
```

```
-bash-4.2# xfs_growfs /
meta-data=/dev/mapper/atomicos-root isize=512    agcount=4, agsize=192000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=768000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 768000 to 2543616
```

```
-bash-4.2# df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  9.7G  1.4G  8.4G  14% /
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 34f70428521ab30414ce8806c7e2967a7387ff00)

tests: avoid yum failures
Guillaume Abrioux [Mon, 28 May 2018 10:02:49 +0000 (12:02 +0200)]

In the CI we often see failures like the following:

`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`

It seems the fastest mirror detection is sometimes counterproductive and
leads yum to fail.

This fix has been added in `setup.yml`.
Until now this playbook was only played just before running
`testinfra`, but it could also be used before running ceph-ansible so
we can add some provisioning tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Erwan Velu <evelu@redhat.com>
(cherry picked from commit 98cb6ed8f602d9c54b63c5381a17dbca75df6bc2)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

Add privilege escalation to iscsi purge tasks (tag: v3.0.35)
Paul Cuzner [Fri, 25 May 2018 00:13:20 +0000 (12:13 +1200)]

Without the escalation, invocation from non-root users will fail when
accessing the rados config object, or when attempting to log to
/var/log.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit 2890b57cfc2e1ef9897a791ce60f4a5545011907)
Signed-off-by: Sébastien Han <seb@redhat.com>

ceph-radosgw: disable NSS PKI db when SSL is disabled
Luigi Toscano [Tue, 22 May 2018 09:46:33 +0000 (11:46 +0200)]

The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.

It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.

Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a122b7ce8ec6a5c27a70bc19589d177.
Now profiles drive the rgw keystone * settings.

Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
(cherry picked from commit 43e96c1f98312734e2f12a1ea5ef29981e9072bd)
Signed-off-by: Sébastien Han <seb@redhat.com>

Fix restarting OSDs twice during a rolling update.
Subhachandra Chandra [Fri, 16 Mar 2018 17:10:14 +0000 (10:10 -0700)]

During a rolling update, OSDs are restarted twice currently. Once, by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.

(cherry picked from commit c7e269fcf5620a49909b880f57f5cbb988c27b07)
Signed-off-by: Sébastien Han <seb@redhat.com>

defaults: restart_osd_daemon unit spaces
Sébastien Han [Fri, 18 May 2018 12:43:57 +0000 (14:43 +0200)]

Extra spaces in the systemctl list-units output can cause
restart_osd_daemon.sh to fail.

It looks like if you have more services enabled on the node, the gap
between "loaded" and "active" gets more spaces than the single space
assumed by the grep in the command[1].
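
The fix amounts to matching one or more spaces between the columns
(a sketch):

```
# Tolerate any amount of whitespace between "loaded" and "active".
systemctl list-units | grep -E "loaded +active" | grep -oE 'ceph-osd@[0-9]+.service'
```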

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2f43e9dab5f077276162069f449978ea97c2e9c0)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

purge_cluster: fix dmcrypt purge
Guillaume Abrioux [Fri, 18 May 2018 15:56:03 +0000 (17:56 +0200)]

dmcrypt devices aren't closed properly; therefore, a redeploy after a
purge may fail.

Typical errors:

```
ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command
'/sbin/blkid' returned non-zero exit status 2
```

```
ceph-disk: Error: unable to read dm-crypt key:
/var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf:
/etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key
```

Closing dmcrypt devices properly allows redeploying without error.
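
A sketch of the cleanup (mapping names vary per deployment):

```
# Close any open dm-crypt mappings before wiping the underlying devices.
for dm in /dev/mapper/*; do
  cryptsetup status "$dm" >/dev/null 2>&1 && cryptsetup luksClose "$dm"
done
```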

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9801bde4d4ce501208fc297d5cb0ab2e0aa28702)

purge_cluster: wipe all partitions
Guillaume Abrioux [Wed, 16 May 2018 15:34:38 +0000 (17:34 +0200)]

In order to ensure there is no leftover after having purged a cluster,
we must wipe all partitions properly.
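
For instance (device name hypothetical, and destructive, of course):

```
# Remove filesystem signatures, then the GPT/MBR structures themselves.
wipefs --all /dev/sdb
sgdisk --zap-all /dev/sdb
```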

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a9247c4de78dec8a63f17400deb8b06ce91e7267)

purge_cluster: fix bug when building device list
Guillaume Abrioux [Wed, 16 May 2018 14:04:25 +0000 (16:04 +0200)]

There are some leftovers on devices when purging osds because of an
invalid device list construction.

Typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
    "changed": true,
    "cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
    "delta": "0:00:00.015188",
    "end": "2018-05-16 12:41:40.408597",
    "item": "/dev/sda sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:40.393409"
}

STDOUT:

sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600

STDERR:

Error: Could not stat device /dev/sda sda1 - No such file or directory.
```

The devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output, e.g.:
```
changed: [osd3] => (item=/dev/sda1) => {
    "changed": true,
    "cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
    "delta": "0:00:00.013634",
    "end": "2018-05-16 12:41:09.068166",
    "item": "/dev/sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:09.054532"
}

STDOUT:

/dev/sda sda1
```

For instance, it will result in a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
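
One robust way to build such a list (a sketch; the actual fix in the
playbook may differ):

```
# lsblk can print one pkname per nested device; keep only the first
# line to get the parent disk.
parent=$(lsblk -no pkname /dev/sda1 | head -n1)   # -> sda
echo "/dev/$parent"
```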

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9cad113e2f22132d08208cd58462f11056c41305)

switch: fix ceph_uid fact for osd
Guillaume Abrioux [Wed, 25 Apr 2018 12:20:35 +0000 (14:20 +0200)]

In addition to b324c17 this commit fixes the ceph uid for the osd role
in the switch from non-containerized to containerized playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit adeecc51f8adf7834b936b7cf6a1be7e6bb82d27)

switch: fix ceph_uid fact
Sébastien Han [Thu, 19 Apr 2018 08:28:56 +0000 (10:28 +0200)]

The latest image is now centos-based, not ubuntu anymore, so the
condition was wrong.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 767abb5de02c0ecdf81a18f6ca63f2e978d3d7a4)

7 years agoswitch: disable ceph-disk units
Sébastien Han [Wed, 16 May 2018 15:37:10 +0000 (17:37 +0200)]
switch: disable ceph-disk units

During the transition from jewel non-container to container, old ceph
units are disabled. ceph-disk units can still remain in some cases and
will appear as 'loaded failed'. This is not a problem, although
operators might not like to see these units failing. That's why we
remove them if we find them.
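
A sketch of that removal logic (the unit discovery is illustrative):

```
- name: list leftover ceph-disk systemd units (illustrative)
  shell: systemctl list-units --all --plain --no-legend 'ceph-disk@*' | awk '{print $1}'
  register: ceph_disk_units
  changed_when: false
  failed_when: false

- name: stop and disable them so they no longer show up as 'loaded failed'
  systemd:
    name: "{{ item }}"
    state: stopped
    enabled: no
  with_items: "{{ ceph_disk_units.stdout_lines }}"
```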

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 49a47124859e6577fb99e6dd680c5244ccd6f38f)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agotake-over: fix bug when trying to override variable
Guillaume Abrioux [Thu, 17 May 2018 15:29:20 +0000 (17:29 +0200)]
take-over: fix bug when trying to override variable

A customer has been facing an issue when trying to override
`monitor_interface` in the inventory host file.
In their use case, all nodes had the same interface name for
`monitor_interface` except one. Therefore, they tried to override this
variable for that node in the inventory host file, but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of an undefined variable.

Typical error:

```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```

Including variables like this, `include_vars: group_vars/all.yml`,
prevents us from overriding anything in the inventory host file because
it overwrites everything you would have defined in the inventory.
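
A sketch of the difference (the inventory line is illustrative):

```
# Problematic: include_vars sits very high in Ansible's variable precedence,
# so it silently clobbers per-host values coming from the inventory:
- include_vars: group_vars/all.yml

# Preferred: let Ansible load group_vars/all.yml automatically; an inventory
# override such as the following then still wins:
#   [mons]
#   srvcto103cnodep01 monitor_interface=bond0.15
```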

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 415dc0a29b10b28cbd047fe28eb4dd38419ea5dc)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years agorolling_update: move osd flag section
Sébastien Han [Wed, 16 May 2018 14:02:41 +0000 (16:02 +0200)]
rolling_update: move osd flag section

During a minor update from one jewel version to a higher jewel version
(10.2.9 to 10.2.10, for example), osd flags don't get applied because
they were set in the mgr section, which is skipped in jewel since this
daemon does not exist.
Moving the set-flag section to after all the mons have been updated
solves that problem.
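
A sketch of the relocated section, assuming the usual flags and ceph-ansible variable names:

```
- name: set osd flags once every monitor runs the new version (illustrative)
  command: ceph --cluster {{ cluster }} osd set {{ item }}
  with_items:
    - noout
    - noscrub
    - nodeep-scrub
  delegate_to: "{{ groups[mon_group_name][0] }}"
```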

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1548071
Co-authored-by: Tomas Petr <tpetr@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit d80a871a078a175d0775e91df00baf625dc39725)

7 years agoiscsi: add python-rtslib repository
Sébastien Han [Mon, 14 May 2018 07:21:48 +0000 (09:21 +0200)]
iscsi: add python-rtslib repository

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 8c7c11b774f54078b32b652481145699dbbd79ff)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoiscsi-gw: fix issue when trying to mask target
Guillaume Abrioux [Mon, 14 May 2018 15:39:25 +0000 (17:39 +0200)]
iscsi-gw: fix issue when trying to mask target

Trying to mask the target when `/etc/systemd/system/target.service`
doesn't exist seems to be a bug.
There is no need to mask a unit file which doesn't exist.
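
A minimal sketch of the guarded version (task names are illustrative):

```
- name: check whether target.service exists (illustrative)
  stat:
    path: /etc/systemd/system/target.service
  register: target_service

- name: mask the unit only when the file is actually there
  systemd:
    name: target
    masked: yes
  when: target_service.stat.exists
```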

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a145caf947aec64467150a007b7aafe57abe2891)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoFIX: run restart scripts in `noexec` /tmp v3.0.34
Arano-kai [Mon, 6 Nov 2017 14:02:47 +0000 (16:02 +0200)]
FIX: run restart scripts in `noexec` /tmp

- One cannot run scripts directly in place when that place is mounted
with the `noexec` option, but one can run scripts as arguments to
`bash`/`sh`.
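
A sketch of the difference (the script path is illustrative):

```
# Fails on a noexec /tmp, since the kernel refuses to execute the file directly:
#   command: /tmp/restart_osd_daemon.sh
# Works, because only the bash binary itself needs to live on an exec filesystem:
- name: run the restart script despite noexec /tmp (illustrative)
  command: bash /tmp/restart_osd_daemon.sh
```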

Signed-off-by: Arano-kai <captcha.is.evil@gmail.com>
(cherry picked from commit 5cde3175aede783feb89cbbc4ebb5c2f05649b99)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoosd: clean legacy syntax in ceph-osd-run.sh.j2
Guillaume Abrioux [Wed, 9 May 2018 01:10:30 +0000 (03:10 +0200)]
osd: clean legacy syntax in ceph-osd-run.sh.j2

Quick cleanup of a legacy syntax left over from e0a264c7e.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7b387b506a21fd71eedd7aabab9f114353b63abc)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoadds missing state needed to upgrade nfs-ganesha v3.0.33
Gregory Meno [Wed, 9 May 2018 18:17:26 +0000 (11:17 -0700)]
adds missing state needed to upgrade nfs-ganesha

In tasks for os_family Red Hat, we were missing this state.
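
Presumably something along these lines, with the package name and state inferred from the subject rather than taken from the change itself:

```
- name: install nfs-ganesha (illustrative)
  yum:
    name: nfs-ganesha
    # the missing piece: an explicit state, so upgrades actually pull the new version
    state: latest
```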

fixes: bz1575859
Signed-off-by: Gregory Meno <gmeno@redhat.com>
(cherry picked from commit 26f6a650425517216fb57c08e1a8bda39ddcf2b5)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agocommon: make the delegate_facts feature optional
Guillaume Abrioux [Tue, 31 Oct 2017 13:39:29 +0000 (14:39 +0100)]
common: make the delegate_facts feature optional

Since we encountered issues with this on Ansible 2.2, this commit
provides the ability to enable or disable it depending on which Ansible
version we are running.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4596fbaac1322a4c670026bc018e3b5b061b072b)

7 years agoplaybook: improve facts gathering
Guillaume Abrioux [Thu, 3 May 2018 16:41:16 +0000 (18:41 +0200)]
playbook: improve facts gathering

There is no need to gather facts in an O(N^2) way.
Only one node should gather facts from the other nodes.
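
Combined with the toggle from the previous commit, the pattern looks roughly like this (variable names follow ceph-ansible conventions but are assumptions here):

```
- name: gather facts from every node exactly once (illustrative)
  setup:
  delegate_to: "{{ item }}"
  delegate_facts: true
  with_items: "{{ groups['all'] }}"
  run_once: true
  when: delegate_facts_host | bool
```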

Fixes: #2553
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 75733daf23d56008b246d8c05c5069303edd4197)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoMake sure the restart_mds_daemon script is created with the correct MDS name
Simone Caronni [Thu, 5 Apr 2018 14:14:23 +0000 (16:14 +0200)]
Make sure the restart_mds_daemon script is created with the correct MDS name

(cherry picked from commit b12bf62c36955d1e502552f8fddb03f44d7d6fc7)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agocommon: enable Tools repo for rhcs clients
Sébastien Han [Tue, 8 May 2018 14:11:14 +0000 (07:11 -0700)]
common: enable Tools repo for rhcs clients

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574458
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 07ca91b5cb7e213545687b8a62c421ebf8dd741d)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agocommon: copy iso files if rolling_update
Sébastien Han [Thu, 3 May 2018 14:54:53 +0000 (16:54 +0200)]
common: copy iso files if rolling_update

If we are in the middle of an update, we want to get the new package
version being installed, so the task that copies the repo files should
not be skipped.
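
Presumably the condition was widened along these lines (names are assumptions):

```
- name: copy ceph iso repo file (illustrative)
  copy:
    src: "{{ ceph_rhcs_iso_path }}"
    dest: /tmp/
  # no longer skipped when an update is in progress
  when: use_iso_repo | bool or rolling_update | default(false) | bool
```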

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572032
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4a186237e6fdc98f779c2e25985da4325b3b16cd)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoRevert "add .vscode/ to gitignore" v3.0.32
Sébastien Han [Fri, 27 Apr 2018 11:21:16 +0000 (13:21 +0200)]
Revert "add .vscode/ to gitignore"

This reverts commit ce67b05292e224d640738bf506ce873680ff9b97.

7 years agoadd .vscode/ to gitignore
Sébastien Han [Wed, 4 Apr 2018 14:23:54 +0000 (16:23 +0200)]
add .vscode/ to gitignore

I personally develop in VS Code and I have some preferences to save
when it comes to running the Python unit tests, so ignoring this
directory is actually useful.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3c4319ca4b5355d69b2925e916420f86d29ee524)
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years agoshrink-osd: ability to shrink NVMe drives
Sébastien Han [Fri, 20 Apr 2018 09:13:51 +0000 (11:13 +0200)]
shrink-osd: ability to shrink NVMe drives

Now, if the service name contains nvme, we know we need to remove the
last 2 characters instead of 1.

If NVMe, then osd_to_kill_disks is nvme0n1 and we need nvme0;
if SSD or HDD, then osd_to_kill_disks is sda1 and we need sda.
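
A sketch of the trimming logic in Jinja2 (the fact name is illustrative):

```
- name: derive the parent device from each entry (illustrative)
  set_fact:
    # nvme0n1 -> nvme0 (strip 2 chars); sda1 -> sda (strip 1 char)
    osd_to_kill_disk: "{{ item[:-2] if 'nvme' in item else item[:-1] }}"
  with_items: "{{ osd_to_kill_disks }}"
```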

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1561456
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 66c1ea8cd561fce6cfe5cdd1ecaa13411c824e3a)
Signed-off-by: Sébastien Han <seb@redhat.com>