git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log
ceph-ansible.git
6 years ago  ceph-config: allow the batch --report to fail when getting the OSD num
Andrew Schoen [Tue, 2 Oct 2018 18:56:09 +0000 (13:56 -0500)]
ceph-config: allow the batch --report to fail when getting the OSD num

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-volume: if --report fails to load json, fail with better info
Andrew Schoen [Tue, 2 Oct 2018 18:50:01 +0000 (13:50 -0500)]
ceph-volume: if --report fails to load json, fail with better info

This gracefully handles the case where --report does not return any JSON
because a validator might have failed.
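
A minimal sketch of that error path (illustrative, not the module's actual
code), assuming `module` is an AnsibleModule and `out` holds the raw output
of the report command:

```
import json

def parse_batch_report(module, out):
    # Return the parsed report, or fail with the raw output attached so the
    # user can see why ceph-volume did not produce JSON (e.g. a validator error).
    try:
        return json.loads(out)
    except ValueError:
        module.fail_json(
            msg="ceph-volume lvm batch --report did not return valid JSON",
            report_output=out,
        )
```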

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  tests: remove journal_size from lvm-batch testing scenario
Andrew Schoen [Mon, 1 Oct 2018 20:06:50 +0000 (15:06 -0500)]
tests: remove journal_size from lvm-batch testing scenario

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-volume: make the batch action idempotent
Andrew Schoen [Mon, 1 Oct 2018 17:51:47 +0000 (12:51 -0500)]
ceph-volume: make the batch action idempotent

The command is run with --report first to see if any OSDs will be
created or not. If they will be, then the command is run. If not, then
changed is set to False and the module exits.
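
A sketch of that flow (`module` is assumed to be an AnsibleModule and
`batch_cmd` a pre-built command list; the `--format=json` flag and the report
handling are illustrative, not the module's real code):

```
import json

def run_batch(module, batch_cmd):
    # Dry run first: ask ceph-volume what it would do.
    rc, out, err = module.run_command(batch_cmd + ['--report', '--format=json'])
    report = json.loads(out)

    if not report:
        # Nothing would be created: report no change and stop here.
        module.exit_json(changed=False, rc=rc, stdout=out, stderr=err)

    # OSDs would be created: run the batch command for real.
    rc, out, err = module.run_command(batch_cmd)
    module.exit_json(changed=True, rc=rc, stdout=out, stderr=err)
```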

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-config: use 'lvm list' to find num_osds for an existing cluster
Andrew Schoen [Tue, 25 Sep 2018 20:25:40 +0000 (15:25 -0500)]
ceph-config: use 'lvm list' to find num_osds for an existing cluster

This makes finding num_osds idempotent for clusters that were deployed
using 'lvm batch'.
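
For illustration, counting existing OSDs from the `ceph-volume lvm list`
JSON output could look like the standalone sketch below (the module wraps
the equivalent logic; the cluster name and flags shown are assumptions):

```
import json
import subprocess

# `ceph-volume lvm list --format=json` returns a mapping keyed by OSD id,
# so the number of existing OSDs is simply the number of keys.
out = subprocess.check_output(
    ['ceph-volume', '--cluster', 'ceph', 'lvm', 'list', '--format=json'])
print(len(json.loads(out)))
```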

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-volume: adds `lvm list` support to the ceph_volume module
Andrew Schoen [Tue, 25 Sep 2018 20:05:08 +0000 (15:05 -0500)]
ceph-volume: adds `lvm list` support to the ceph_volume module

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-config: use the ceph_volume module to get num_osds for lvm batch
Andrew Schoen [Thu, 20 Sep 2018 18:32:00 +0000 (13:32 -0500)]
ceph-config: use the ceph_volume module to get num_osds for lvm batch

This gives us an accurate count of how many OSDs will be created.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph_volume: adds the report parameter
Andrew Schoen [Thu, 20 Sep 2018 18:17:29 +0000 (13:17 -0500)]
ceph_volume: adds the report parameter

Will pass the --report flag to ceph-volume lvm batch.

Results will be returned in JSON format.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-osd: use journal_size and block_db_size for lvm batch
Andrew Schoen [Thu, 20 Sep 2018 17:26:24 +0000 (12:26 -0500)]
ceph-osd: use journal_size and block_db_size for lvm batch

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-defaults: add the block_db_size option
Andrew Schoen [Thu, 20 Sep 2018 17:24:07 +0000 (12:24 -0500)]
ceph-defaults: add the block_db_size option

This is used in the lvm osd scenario for the 'lvm batch' subcommand
of ceph-volume.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  ceph-volume: add the journal_size and block_db_size options
Andrew Schoen [Thu, 20 Sep 2018 17:18:53 +0000 (12:18 -0500)]
ceph-volume: add the journal_size and block_db_size options

These can be used for the --journal-size and --block-db-size options
of `lvm batch`.
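
Illustrative only: how options like these typically end up as `lvm batch`
flags (the helper is hypothetical; the flag names are the ones above):

```
def batch_flags(journal_size=None, block_db_size=None):
    # Translate the module options into `ceph-volume lvm batch` flags.
    flags = []
    if journal_size:
        flags.extend(['--journal-size', str(journal_size)])
    if block_db_size:
        flags.extend(['--block-db-size', str(block_db_size)])
    return flags
```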

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
6 years ago  site: use default value for 'cluster' variable
Sébastien Han [Mon, 8 Oct 2018 13:45:58 +0000 (09:45 -0400)]
site: use default value for 'cluster' variable

If someone's cluster name is 'ceph' then the playbook will fail (with no
errors shown, because of ignore_errors) saying it cannot find the variable.
So let's declare the default. If the cluster name is different then it'll
be in group_vars and thus there won't be any failure.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1636962
Signed-off-by: Sébastien Han <seb@redhat.com>
6 years ago  rhcs: add helpers for the containerized deployment
Sébastien Han [Fri, 5 Oct 2018 12:05:11 +0000 (14:05 +0200)]
rhcs: add helpers for the containerized deployment

We give more assistance to consultants during deployment by setting the
registry and the image name.

Signed-off-by: Sébastien Han <seb@redhat.com>
6 years ago  common: remove check_firewall code
Guillaume Abrioux [Fri, 5 Oct 2018 12:33:04 +0000 (14:33 +0200)]
common: remove check_firewall code

The check_firewall code isn't working as expected and might break deployments.
This part of the code will be reworked soon.

Let's focus on configure_firewall code for now.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1541840
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
6 years ago  follow up on b5d2ea2
Guillaume Abrioux [Thu, 4 Oct 2018 08:02:24 +0000 (10:02 +0200)]
follow up on b5d2ea2

Add some missing statements.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
6 years ago  rolling_update: add ceph-handler role
Guillaume Abrioux [Fri, 5 Oct 2018 11:15:54 +0000 (13:15 +0200)]
rolling_update: add ceph-handler role

Since the introduction of ceph-handler, it has to be added to the
rolling_update playbook as well.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
6 years ago  don't use "static" field while including tasks
Rishabh Dave [Fri, 10 Aug 2018 12:16:30 +0000 (08:16 -0400)]
don't use "static" field while including tasks

Instead, use "import_tasks" and "include_tasks" to tell whether tasks
must be included statically or dynamically.

Fixes: https://github.com/ceph/ceph-ansible/issues/2998
Signed-off-by: Rishabh Dave <ridave@redhat.com>
6 years ago  switch: copy initial mon keyring
Sébastien Han [Wed, 3 Oct 2018 11:39:35 +0000 (13:39 +0200)]
switch: copy initial mon keyring

We need to copy this key into /etc/ceph so that when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't failing
because `fail_on_missing` defaulted to False before Ansible 2.5; it now
defaults to True, hence the failure.

Signed-off-by: Sébastien Han <seb@redhat.com>
6 years ago  switch: add missing call to ceph-handler role
Guillaume Abrioux [Tue, 2 Oct 2018 17:22:20 +0000 (19:22 +0200)]
switch: add missing call to ceph-handler role

Add the missing call to the ceph-handler role, otherwise we can't reference
variables registered by ceph-handler from other roles.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
6 years ago  switch: support migration when cluster is scrubbing
Guillaume Abrioux [Tue, 2 Oct 2018 15:31:49 +0000 (17:31 +0200)]
switch: support migration when cluster is scrubbing

Similar to c13a3c3, we must allow scrubbing when running this playbook.

In a cluster with a large number of PGs, it can be expected that some of them
will be scrubbing at any given time; it's a normal operation.
Preventing the scrubbing operation would force us to set the noscrub flag.

This commit allows switching from a non-containerized to a containerized
environment even while PGs are scrubbing.

Closes: #3182
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
6 years ago  config: look up for monitor_address_block in hostvars
Guillaume Abrioux [Tue, 2 Oct 2018 13:55:47 +0000 (15:55 +0200)]
config: look up for monitor_address_block in hostvars

`monitor_address_block` should be read from hostvars[host] instead of from
the current node being played.

e.g.:

Let's assume we have:

```
[mons]
ceph-mon0 monitor_address=192.168.1.10
ceph-mon1 monitor_interface=eth1
ceph-mon2 monitor_address_block=192.168.1.0/24
```

the ceph.conf generation task will end up with:

```
fatal: [ceph-mon0]: FAILED! => {}

MSG:

'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```

The reason is that it assumes `monitor_address_block` isn't defined even on
ceph-mon2, because it looks for `monitor_address_block` instead of
`hostvars[host]['monitor_address_block']`; therefore it falls through to the
default branch of the condition:

```
    {%- else -%}
      {% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
      {% if ip_version == 'ipv4' -%}
        {{ hostvars[host][interface][ip_version]['address'] }}
      {%- elif ip_version == 'ipv6' -%}
        [{{ hostvars[host][interface][ip_version][0]['address'] }}]
      {%- endif %}
    {%- endif %}
```

`monitor_interface` is set to its default value `'interface'`, so the
`interface` variable is built as 'ansible_' + 'interface'. This makes Ansible
throw a confusing message about `'ansible_interface'`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1635303
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
6 years ago  Add support for different NTP daemons (tag: v3.2.0beta3)
Benjamin Cherian [Wed, 5 Sep 2018 16:59:50 +0000 (09:59 -0700)]
Add support for different NTP daemons

Allow the user to choose between timesyncd, chronyd and ntpd.
Installation will default to timesyncd since it is distributed as
part of the systemd installation for most distros.
Added a note indicating that the NTP daemon type is not used for
containerized deployments.

Fixes issue #3086 on GitHub

Signed-off-by: Benjamin Cherian <benjamin_cherian@amat.com>
6 years ago  igw: valid client CHAP settings.
Mike Christie [Fri, 28 Sep 2018 21:23:10 +0000 (16:23 -0500)]
igw: valid client CHAP settings.

The Linux kernel target layer, LIO, does not support an iSCSI target mixing
ACLs that have CHAP enabled and disabled under the same tpg. This patch adds
a check and fails if this type of setup is detected.
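
A rough sketch of such a check, assuming a list of client dicts with an
optional 'chap' entry (the data layout and helper are hypothetical, not the
actual module code):

```
def validate_chap(module, clients):
    # LIO cannot mix CHAP-enabled and CHAP-disabled ACLs under one tpg,
    # so either every client sets credentials or none of them does.
    chap_enabled = [bool(client.get('chap')) for client in clients]
    if any(chap_enabled) and not all(chap_enabled):
        module.fail_json(msg="Mixing clients with and without CHAP under "
                             "the same tpg is not supported by LIO")
```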

This fixes Red Hat BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1615088

Signed-off-by: Mike Christie <mchristi@redhat.com>
6 years ago  doc: redo lvm scenario documentation, improved wording and config descriptions
Alfredo Deza [Fri, 21 Sep 2018 13:28:33 +0000 (09:28 -0400)]
doc: redo lvm scenario documentation, improved wording and config descriptions

Signed-off-by: Alfredo Deza <adeza@redhat.com>
7 years ago  add ceph-handler role
Sébastien Han [Fri, 27 Jul 2018 14:56:09 +0000 (16:56 +0200)]
add ceph-handler role

The role contains all the handlers for Ceph services. We decided to
leave the ceph-defaults role with variables and a few facts only. This is
useful when organizing the site.yml files and also when adding the known
variables to infrastructure-playbooks.

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  purge-cluster: zap devices used with the lvm scenario
Andrew Schoen [Fri, 21 Sep 2018 19:46:30 +0000 (14:46 -0500)]
purge-cluster: zap devices used with the lvm scenario

Fixes: https://github.com/ceph/ceph-ansible/issues/3156
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  purge-cluster: recursively remove ceph-related files, symlinks and directories under...
wumingqiao [Fri, 28 Sep 2018 08:58:56 +0000 (16:58 +0800)]
purge-cluster: recursively remove ceph-related files, symlinks and directories under /etc/systemd/system.

Fixes: https://github.com/ceph/ceph-ansible/issues/3166

Signed-off-by: wumingqiao <wumingqiao@beyondcent.com>
7 years ago  test: use osd_objecstore default value
Sébastien Han [Thu, 27 Sep 2018 15:29:38 +0000 (17:29 +0200)]
test: use osd_objecstore default value

Do not force filestore in our tests; use whatever the default of
osd_objectstore is.

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  defaults: do not disable THP on bluestore
Sébastien Han [Thu, 27 Sep 2018 08:21:17 +0000 (10:21 +0200)]
defaults: do not disable THP on bluestore

As per #1013 it appears that BlueStore will soon use THP to lower TLB misses;
also, disabling THP hasn't demonstrated any gains so far.

Closes: https://github.com/ceph/ceph-ansible/issues/1013
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  default: use bluestore as default object store
Sébastien Han [Thu, 27 Sep 2018 07:57:26 +0000 (09:57 +0200)]
default: use bluestore as default object store

All tooling in Ceph defaults to the bluestore objectstore for provisioning
OSDs; there is no good reason for ceph-ansible to continue to default to
filestore.

Closes: https://github.com/ceph/ceph-ansible/issues/3149
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633508
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  don't use "include" to include tasks
Rishabh Dave [Tue, 21 Aug 2018 14:23:35 +0000 (19:53 +0530)]
don't use "include" to include tasks

Use "import_tasks" or "include_tasks" instead.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
7 years ago  purge: actually remove of /var/lib/ceph/*
Guillaume Abrioux [Thu, 27 Sep 2018 09:33:51 +0000 (11:33 +0200)]
purge: actually remove of /var/lib/ceph/*

38dc20e74b89c1833d45f677f405fe758fd10c04 introduced a bug in the purge
playbooks because using `*` in the `command` module doesn't work.

`/var/lib/ceph/*` files are not purged, which means there is a leftover.

When trying to redeploy a cluster, it failed because the monitor daemon was
detecting an existing keyring and therefore assumed a cluster already
existed.

Typical error (from container output):

```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16  /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.932393 7f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23  /entrypoint.sh:
SUCCESS
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  rolling_update: ensure pgs_by_state has at least 1 entry
Guillaume Abrioux [Tue, 25 Sep 2018 12:21:44 +0000 (14:21 +0200)]
rolling_update: ensure pgs_by_state has at least 1 entry

Previous commit c13a3c3 removed a condition.

This commit brings back this condition, which is essential to ensure we
won't hit a false positive result in the `when` condition for the PG check
task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  upgrade: consider all 'active+clean' states as valid pgs
Guillaume Abrioux [Mon, 24 Sep 2018 12:21:24 +0000 (14:21 +0200)]
upgrade: consider all 'active+clean' states as valid pgs

In a cluster with a large number of PGs, it can be expected that some of them
will be scrubbing at any given time; it's a normal operation.
Preventing the scrubbing operation forces us to set the noscrub flag before a
rolling update, which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.

This commit allows an upgrade even while PGs are scrubbing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  docs: supported validation by the ceph-validate role
Andrew Schoen [Fri, 21 Sep 2018 14:54:43 +0000 (09:54 -0500)]
docs: supported validation by the ceph-validate role

List the osd_scenarios and install options that are validated by the
ceph-validate role in the documentation.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  tests: add an RGW node on osd0 for ooo-collocation
Guillaume Abrioux [Fri, 21 Sep 2018 15:16:00 +0000 (17:16 +0200)]
tests: add an RGW node on osd0 for ooo-collocation

Get more coverage by adding an RGW daemon collocated on osd0.
We've missed a bug in the past which could have been caught earlier in
the CI.
Let's add this additional daemon in order to have better coverage.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  Fix version check in ceph.conf template
Giulio Fidente [Mon, 24 Sep 2018 08:17:02 +0000 (10:17 +0200)]
Fix version check in ceph.conf template

We need to look for ceph_release when comparing with release names,
not ceph_version.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1631789
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
7 years ago  restart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex
Matthew Vernon [Wed, 19 Sep 2018 13:25:15 +0000 (14:25 +0100)]
restart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex

`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.

Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
7 years ago  restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
Matthew Vernon [Wed, 19 Sep 2018 12:26:26 +0000 (13:26 +0100)]
restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK

After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
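
The same test expressed in Python for clarity (the real check is shell in
restart_osd_daemon.sh.j2; the pgmap layout shown is the usual
`ceph status --format json` structure):

```
import json
import subprocess

status = json.loads(subprocess.check_output(
    ['ceph', '--cluster', 'ceph', 'status', '--format', 'json']))
pgmap = status['pgmap']

# Any state containing "active+clean" (scrubbing, snaptrim, ...) counts as good.
good = sum(state['count'] for state in pgmap['pgs_by_state']
           if 'active+clean' in state['state_name'])

print(good == pgmap['num_pgs'])
```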

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
7 years ago  restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs
Matthew Vernon [Fri, 21 Sep 2018 16:55:01 +0000 (17:55 +0100)]
restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs

Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the cluster's pgs to be happy again before continuing.
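
In Python terms (the real script is bash; the callables here are
placeholders supplied by the caller), the change is simply where the retry
budget gets initialized:

```
import time

def restart_osds(osd_ids, restart_osd, pgs_are_ok, retries_per_osd=40, delay=30):
    for osd_id in osd_ids:
        restart_osd(osd_id)
        retries = retries_per_osd   # the fix: reset the budget for every OSD
        while not pgs_are_ok():
            retries -= 1
            if retries <= 0:
                raise RuntimeError("pgs not healthy after restarting osd.%s" % osd_id)
            time.sleep(delay)
```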

Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
7 years ago  vagrantfile: fix references to OpenStack settings
Norbert Illés [Thu, 20 Sep 2018 19:02:14 +0000 (21:02 +0200)]
vagrantfile: fix references to OpenStack settings

In case of an OpenStack "box", the Vagrantfile intends to check the
existence of the os_networks and os_floating_ip_pool settings in
vagrant_variables.yml and pass them to the provider if they are set.
Due to two typos in the Vagrantfile this is not working, as it checks the
wrong variable names.
This commit fixes the typos so these settings can be used.

Signed-off-by: Norbert Illés <illesnorbi@gmail.com>
7 years ago  ceph-config: calculate num_osds for the lvm batch scenario
Andrew Schoen [Tue, 18 Sep 2018 20:12:59 +0000 (15:12 -0500)]
ceph-config: calculate num_osds for the lvm batch scenario

For now our best guess is to count the number of devices and multiply
by osds_per_device. Ideally we'd like to run ceph-volume lvm batch
--report and get the number of OSDs that way, but currently we need
a ceph.conf in place already before we can do that. There is a tracker
ticket that would allow us to get around the need for a ceph.conf:
http://tracker.ceph.com/issues/36088
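
A worked example of that interim estimate (values are illustrative):

```
devices = ['/dev/sdb', '/dev/sdc', '/dev/sdd']   # illustrative device list
osds_per_device = 2

num_osds = len(devices) * osds_per_device        # best guess until --report is usable
print(num_osds)                                  # 6
```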

Fixes: https://github.com/ceph/ceph-ansible/issues/3135
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  config: set default _rgw_hostname value to respective host (tag: v3.2.0beta2)
Guillaume Abrioux [Tue, 18 Sep 2018 16:10:57 +0000 (18:10 +0200)]
config: set default _rgw_hostname value to respective host

The default value for _rgw_hostname was taken from the current node being
played, while it should be taken from the respective node in the loop.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  ceph-config: default devices and lvm_volumes when setting num_osds
Andrew Schoen [Tue, 18 Sep 2018 14:36:08 +0000 (09:36 -0500)]
ceph-config: default devices and lvm_volumes when setting num_osds

This avoids errors when the osd scenario chosen does not require
setting devices or lvm_volumes. The default values for these are not
set because they exist in the ceph-osd role, not ceph-defaults.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  osd: add osd memory target option
Neha Ojha [Mon, 10 Sep 2018 17:23:20 +0000 (17:23 +0000)]
osd: add osd memory target option

BlueStore's cache is sized conservatively by default, so that it does
not overwhelm under-provisioned servers. The default is 1G for HDD, and
3G for SSD.

To replace the page cache, as much memory as possible should be given to
BlueStore. This is required for good performance. Since ceph-ansible
knows how much memory a host has, it can set

`bluestore cache size = max(total host memory / num OSDs on this host * safety
factor, 1G)`

Due to fragmentation and other memory use not included in bluestore's
cache, a safety factor of 0.5 for dedicated nodes and 0.2 for
hyperconverged nodes is recommended.
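
A worked example of that sizing rule (the numbers are illustrative):

```
GiB = 1024 ** 3

total_host_memory = 64 * GiB   # total memory on the host
num_osds_on_host = 6
safety_factor = 0.5            # dedicated node; use 0.2 for hyperconverged

bluestore_cache_size = max(
    total_host_memory / num_osds_on_host * safety_factor,
    1 * GiB,
)
print(int(bluestore_cache_size))   # ~5.3 GiB per OSD in this example
```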

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1595003
Signed-off-by: Neha Ojha <nojha@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  shrink-osd: follow up on 36fb3cde
Guillaume Abrioux [Fri, 14 Sep 2018 11:45:48 +0000 (13:45 +0200)]
shrink-osd: follow up on 36fb3cde

- Adds a loop in bash to satisfy the 1:n relation between `osd_hosts` and the
different device lists.
- Fixes some container names which were using the host hostname instead
of the actual container one.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  site-docker: fix undefined variable error
Guillaume Abrioux [Fri, 14 Sep 2018 05:51:03 +0000 (07:51 +0200)]
site-docker: fix undefined variable error

`mon_group_name` isn't defined here, we must hardcode it.

Typical error:

```
The task includes an option with an undefined variable. The error was: 'mon_group_name' is undefined
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  shrink-osd: purge dedicated devices
Sébastien Han [Thu, 19 Jul 2018 13:45:55 +0000 (15:45 +0200)]
shrink-osd: purge dedicated devices

Once the OSD is destroyed we also have to purge the associated devices;
this means purging journal, db, and wal partitions too.

This now works for both container and non-container deployments.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  igw: enable and start rbd-target-api
Mike Christie [Wed, 8 Aug 2018 16:14:09 +0000 (11:14 -0500)]
igw: enable and start rbd-target-api

The commit:

commit 1164cdc002cccb9dc1c6f10fb6b4370eafda3c4b
Author: Guillaume Abrioux <gabrioux@redhat.com>
Date:   Thu Aug 2 11:58:47 2018 +0200

    iscsigw: install ceph-iscsi-cli package

installs the cli package but does not start and enable the
rbd-target-api daemon needed for gwcli to communicate with the igw
nodes. This patch just enables and starts it for the non-container
setup. The container setup is already doing this.

This fixes bz https://bugzilla.redhat.com/show_bug.cgi?id=1613963

Signed-off-by: Mike Christie <mchristi@redhat.com>
7 years ago  tests: fix monitor_address for shrink_osd scenario
Guillaume Abrioux [Thu, 13 Sep 2018 13:22:07 +0000 (15:22 +0200)]
tests: fix monitor_address for shrink_osd scenario

b89cc1746 introduced a typo. This commit fixes it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  shrink-osd: fix purge osd on containerized deployment
Guillaume Abrioux [Thu, 13 Sep 2018 09:18:56 +0000 (11:18 +0200)]
shrink-osd: fix purge osd on containerized deployment

ce1dd8d introduced the purge osd on containers but it was incorrect.

The `resolve parent device` and `zap ceph osd disks` tasks must be delegated
to their respective OSD nodes.
Indeed, they were run on the ansible node, which means it was trying to
resolve parent devices from that node when it should be done on the OSD
nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  doc: update lvm doc
Guillaume Abrioux [Thu, 13 Sep 2018 15:18:07 +0000 (17:18 +0200)]
doc: update lvm doc

As of e3820a2 the creation of logical volumes is now supported by ceph-ansible.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  nfs: ignore error on semanage command for ganesha_t
Guillaume Abrioux [Wed, 12 Sep 2018 13:02:06 +0000 (15:02 +0200)]
nfs: ignore error on semanage command for ganesha_t

As of rhel 7.6, it has been decided it doesn't make sense to confine
`ganesha_t` anymore. It means this domain won't exist anymore.

Let's add a `failed_when: false` in order to prevent the deployment from
failing when trying to run this command.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1626070
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  tests: pin sphinx version to 1.7.9
Guillaume Abrioux [Thu, 13 Sep 2018 10:48:25 +0000 (12:48 +0200)]
tests: pin sphinx version to 1.7.9

using sphinx 1.8.0 breaks our doc test CI job.

Typical error:

```
Exception occurred:
  File
  "/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py",  line 26, in <module>
      from sphinx.ext import doctest
      SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```

See: https://github.com/sphinx-doc/sphinx/issues/5417

Pinning to 1.7.9 to fix our CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  ceph_volume: adds the osds_per_device parameter
Andrew Schoen [Thu, 6 Sep 2018 19:00:56 +0000 (14:00 -0500)]
ceph_volume: adds the osds_per_device parameter

If this is set to anything other than the default value of 1 then the
--osds-per-device flag will be used by the batch command to define how
many osds will be created per device.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  mon: fix `ExecStartPre` option in systemd unit file
Guillaume Abrioux [Thu, 6 Sep 2018 12:00:13 +0000 (14:00 +0200)]
mon: fix `ExecStartPre` option in systemd unit file

This command line is not supported.
According to official documentation:

```
Note that shell command lines are not directly supported.
If shell command lines are to be used,
they need to be passed explicitly to a shell implementation of some kind.
```

We must run this using /bin/sh instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  defaults: add a default value to rgw_hostname
Guillaume Abrioux [Wed, 5 Sep 2018 11:20:47 +0000 (13:20 +0200)]
defaults: add a default value to rgw_hostname

let's add ansible_hostname as a default value for rgw_hostname if no
hostname in servicemap matches ansible_fqdn.

Fixes: #3063
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  tests: do not upgrade ceph release for switch_to_containers scenario
Guillaume Abrioux [Fri, 7 Sep 2018 17:38:41 +0000 (19:38 +0200)]
tests: do not upgrade ceph release for switch_to_containers scenario

Using `UPDATE_*` environment variables here will trigger an upgrade of the
ceph release when running the switch_to_containers scenario, which is not
correct.

E.g.:
If ceph luminous was first deployed, then we should switch to ceph
luminous containers, not to mimic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  Revert "client: add quotes to the dict values"
Guillaume Abrioux [Fri, 7 Sep 2018 14:54:42 +0000 (16:54 +0200)]
Revert "client: add quotes to the dict values"

The commit being reverted added quotes that make the keyring unusable.

e.g.:

```
client.john
        key: AQAN0RdbAAAAABAAH5D3WgMN9Rxw3M8jkpMIfg==
        caps: [mds] ''
        caps: [mgr] 'allow *'
        caps: [mon] 'allow rw'
        caps: [osd] 'allow rw'
```

Trying to import such a keyring and use it will result:

```
Error EACCES: access denied
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1623417
This reverts commit 424815501a0c6072234a8e1311a0fefeb5bcc222.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
7 years ago  run rados cmd in container if containerized deployment
Tom Barron [Sat, 1 Sep 2018 14:32:51 +0000 (10:32 -0400)]
run rados cmd in container if containerized deployment

When ceph-nfs is deployed containerized and ceph-common is not
installed on the host, the start_nfs task fails because the rados
command is missing on the host.

Run rados commands from a ceph container instead so that
they will succeed.

Signed-off-by: Tom Barron <tpb@dyncloud.net>
7 years ago  roles: ceph-rgw: Enable the ceph-radosgw target
Markos Chandras [Wed, 29 Aug 2018 10:56:16 +0000 (11:56 +0100)]
roles: ceph-rgw: Enable the ceph-radosgw target

If the ceph-radosgw target is not enabled, then enabling the
ceph-radosgw@ service has no effect since nothing will pull
it on the next reboot. As such, we need to ensure that the
target is enabled.

Signed-off-by: Markos Chandras <mchandras@suse.de>
7 years ago  purge: only purge /var/lib/ceph content
Sébastien Han [Mon, 27 Aug 2018 18:02:59 +0000 (11:02 -0700)]
purge: only purge /var/lib/ceph content

Sometimes /var/lib/ceph is mounted on a device, so we won't be able to
remove it (device busy); let's remove its content only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  tests: use new 'num_osds' variable in tests
Alfredo Deza [Fri, 31 Aug 2018 19:32:08 +0000 (12:32 -0700)]
tests: use new 'num_osds' variable in tests

Signed-off-by: Alfredo Deza <adeza@redhat.com>
7 years ago  tests: allow defining arbitrary number of OSDs
Alfredo Deza [Fri, 31 Aug 2018 19:29:17 +0000 (12:29 -0700)]
tests: allow defining arbitrary number of OSDs

Some tests might want to set this since the number of devices will not
necessarily map to the number of OSDs.

Signed-off-by: Alfredo Deza <adeza@redhat.com>
7 years ago  Dont run client dummy container on non-x86_64 hosts
Andy McCrae [Thu, 30 Aug 2018 07:53:36 +0000 (08:53 +0100)]
Dont run client dummy container on non-x86_64 hosts

The dummy client container currently won't work on non-x86_64 hosts.
This PR creates a filtered client group that contains only hosts
that are x86_64 - which can then be the group to run the
dummy container against.

This is for the specific case of a containerized_deployment where
there is a mixture of non-x86_64 hosts and x86_64 hosts. As such
the filtered group will contain all hosts when running with
containerized_deployment: false.

Currently ppc64le is not supported for Ceph server components.

Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
7 years ago  infrastructure-playbooks: add comments for lv_vars.yml
Ali Maredia [Wed, 29 Aug 2018 19:44:55 +0000 (15:44 -0400)]
infrastructure-playbooks: add comments for lv_vars.yml

Add comments telling the user that devices used in the
playbooks must not have GPT/FS/RAID signatures.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
7 years ago  infrastructure playbooks: remove lv-create error msg
Ali Maredia [Wed, 29 Aug 2018 16:06:52 +0000 (12:06 -0400)]
infrastructure playbooks: remove lv-create error msg

Remove the error message shown when PV creation fails.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
7 years ago  doc: remove old statement
Sébastien Han [Mon, 27 Aug 2018 21:04:35 +0000 (14:04 -0700)]
doc: remove old statement

We have been supporting multiple devices for the journal in containerized
deployments for a while now and forgot about this.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622393
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  remove warning for unsupported variables
Sébastien Han [Mon, 27 Aug 2018 20:58:20 +0000 (13:58 -0700)]
remove warning for unsupported variables

As promised, these will go unsupported for 3.1 so let's actually remove
them :).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622729
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  infrastructure-playbooks: failure msg for pvcreate
Ali Maredia [Tue, 28 Aug 2018 19:36:44 +0000 (15:36 -0400)]
infrastructure-playbooks: failure msg for pvcreate

Add a message for when PV creation fails.

This message alerts users that FS/GPT/RAID
signatures could still be on the device, which is
the reason for the failures.

`wipefs -a $device` needs to be run to fix this issue.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
7 years ago  sites: fix conditonnal
Sébastien Han [Mon, 27 Aug 2018 17:20:32 +0000 (10:20 -0700)]
sites: fix conditonnal

Same problem again... ceph_release_num[ceph_release] is only set in the
ceph-docker-common/common roles, so putting the condition on that role
will never work. Removing the condition.

The downside of this is that we will be installing packages and then skipping
the role on the node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622210
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  site-docker.yml: remove useless condition
Sébastien Han [Thu, 23 Aug 2018 09:28:03 +0000 (11:28 +0200)]
site-docker.yml: remove useless condition

If we play site-docker.yml, we are already in a
containerized_deployment. So the condition is not needed.

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  ci: stop using different images on the same run
Sébastien Han [Thu, 23 Aug 2018 09:21:54 +0000 (11:21 +0200)]
ci: stop using different images on the same run

There is no point in mixing hosts running on Atomic AND CentOS hosts. So
let's run containerized scenarios on Atomic only.

This solves this error here:

```
fatal: [client2]: FAILED! => {
    "failed": true
}

MSG:

The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'

The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: set_fact ceph_current_status (convert to json)
  ^ here
```

From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a

What's happening here is that all the hosts except the clients are running Atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
the condition will skip all the nodes except the clients; thus when running ceph-defaults, the task "is ceph running already?" is skipped but the task above needs the rc of the skipped task.
This is not an error in the playbook, it's a CI setup issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  defaults: fix rgw_hostname
Sébastien Han [Tue, 21 Aug 2018 18:50:31 +0000 (20:50 +0200)]
defaults: fix rgw_hostname

A couple of things were wrong in the initial commit:

* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
never work since the ceph_release fact is set in the roles afterwards. So
either ceph-common or ceph-docker-common sets it

* we can easily re-use the initial command to check if a cluster is
running; it's more elegant than running it twice.

* set the fact rgw_hostname on rgw nodes only

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  vagrant: move variable samples to contrib
Sébastien Han [Tue, 21 Aug 2018 14:25:45 +0000 (16:25 +0200)]
vagrant: move variable samples to contrib

Let's clean up the root of the repo a bit

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  rm ceph-aio-no-vagrant.sh
Sébastien Han [Tue, 21 Aug 2018 14:25:08 +0000 (16:25 +0200)]
rm ceph-aio-no-vagrant.sh

The script is outdated.

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  remove monitor_keys_example file
Sébastien Han [Tue, 21 Aug 2018 14:21:14 +0000 (16:21 +0200)]
remove monitor_keys_example file

This file is not needed; if you want to generate a key you can run:

python -c "import os ; import struct ; import time; import base64 ; key = os.urandom(16) ; header = struct.pack('<hiih',1,int(time.time()),0,len(key)) ; print(base64.b64encode(header + key).decode())"

Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  Sync config_template with base plugin
Andy McCrae [Wed, 25 Jul 2018 09:05:30 +0000 (10:05 +0100)]
Sync config_template with base plugin

The config_template plugin exists in the ceph-common role so that
config_template will still work with ansible galaxy.

This PR syncs the config_template module from the base of the repo in
plugins/actions to the ceph-common role.

Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
7 years ago  rolling_upgrade: set sortbitwise properly
Sébastien Han [Tue, 21 Aug 2018 09:15:44 +0000 (11:15 +0200)]
rolling_upgrade: set sortbitwise properly

Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSDs are getting updated, even though the package is updated
they won't send their updated version (12) and will stick with 10 if the
command is not applied. So we have to check if OSDs are sending a version
10 and then run the command to unlock the OSDs.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  iscsi group name preserve backward compatibility
Sébastien Han [Mon, 20 Aug 2018 13:53:03 +0000 (15:53 +0200)]
iscsi group name preserve backward compatibility

Recently we renamed the group_name for iscsi to iscsigws, where previously
it was named iscsi-gws. Existing deployments with a host file section
named iscsi-gws must continue to work.

This commit adds the old group name for backward compatibility; no error
from Ansible should be expected, and if the hostgroup is not found nothing
is played.

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  osd: fix ceph_release
Sébastien Han [Mon, 20 Aug 2018 14:03:59 +0000 (16:03 +0200)]
osd: fix ceph_release

We need ceph_release in the condition, not ceph_stable_release

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1619255
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  take-over-existing-cluster: do not call var_files
Sébastien Han [Mon, 20 Aug 2018 12:41:06 +0000 (14:41 +0200)]
take-over-existing-cluster: do not call var_files

We were using var_files long ago when default variables were not in
ceph-defaults; now that the role exists this is not needed. Moreover, having
these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

will create collisions and override necessary variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
7 years ago  roles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname
Markos Chandras [Wed, 15 Aug 2018 06:55:49 +0000 (09:55 +0300)]
roles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname

If there are no services on the cluster, then the 'rgw' attribute could be
missing and the task fails with the following problem:

msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'

We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.

Signed-off-by: Markos Chandras <mchandras@suse.de>
7 years ago  roles: ceph-defaults: Delegate cluster information task to monitor node
Markos Chandras [Tue, 14 Aug 2018 06:52:04 +0000 (09:52 +0300)]
roles: ceph-defaults: Delegate cluster information task to monitor node

Since commit f422efb1d6b56ce56a7d39a21736a471e4ed357 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployments with OpenStack-Ansible:

fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"

This is because the task executes 'ceph' but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role which runs after the 'ceph-defaults' one.

Since we are looking to obtain cluster information, the task should be
delegated to a monitor node similar to other tasks in that role

Signed-off-by: Markos Chandras <mchandras@suse.de>
7 years ago  mgr: improve/fix disabled modules check
Dardo D Kleiner [Wed, 15 Aug 2018 12:50:19 +0000 (08:50 -0400)]
mgr: improve/fix disabled modules check

Follow up on 36942af6983d60666f3f8a1a06b352a440a6c0da

"disabled_modules" is always a list; it's the items in the list that
can be dicts in mimic. There are many ways to fix this; here's one.
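
One way to normalize that structure, shown in Python for illustration (the
actual fix lives in the role's tasks; this only shows the shape of the data):

```
def disabled_module_names(disabled_modules):
    # pre-mimic: ["module_a", "module_b"]; mimic: [{"name": "module_a", ...}, ...]
    return [m['name'] if isinstance(m, dict) else m for m in disabled_modules]
```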

Signed-off-by: Dardo D Kleiner <dardokleiner@gmail.com>
7 years ago  lv-create: use copy instead of the template module
Andrew Schoen [Thu, 9 Aug 2018 13:09:41 +0000 (08:09 -0500)]
lv-create: use copy instead of the template module

The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  tests: cat the contents of lv-create.log in infra_lv_create
Andrew Schoen [Thu, 9 Aug 2018 12:26:58 +0000 (07:26 -0500)]
tests: cat the contents of lv-create.log in infra_lv_create

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  lv-create: add an example logfile_path config option in lv_vars.yml
Andrew Schoen [Thu, 9 Aug 2018 12:26:22 +0000 (07:26 -0500)]
lv-create: add an example logfile_path config option in lv_vars.yml

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  tests: adds a testing scenario for lv-create and lv-teardown
Andrew Schoen [Wed, 8 Aug 2018 22:12:30 +0000 (17:12 -0500)]
tests: adds a testing scenario for lv-create and lv-teardown

Using an explicitly named testing environment allows us to have a
specific [testenv] block for this test. This greatly simplifies how it will
work, as it doesn't really need anything from the ceph cluster tests.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  lv-teardown: fail silently if lv_vars.yml is not found
Andrew Schoen [Wed, 8 Aug 2018 22:04:29 +0000 (17:04 -0500)]
lv-teardown: fail silently if lv_vars.yml is not found

This allows the user to opt out of using lv_vars.yml and load configuration
from other sources.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  lv-teardown: set become: true at the playbook level
Andrew Schoen [Wed, 8 Aug 2018 22:04:07 +0000 (17:04 -0500)]
lv-teardown: set become: true at the playbook level

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  lv-create: fail silenty if lv_vars.yml is not found
Andrew Schoen [Wed, 8 Aug 2018 21:49:34 +0000 (16:49 -0500)]
lv-create: fail silenty if lv_vars.yml is not found

If a user decides not to use the lv_vars.yml file then it should fail
silently so that configuration can be picked up from other places.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  lv-create: set become: true at the playbook level
Andrew Schoen [Wed, 8 Aug 2018 21:48:42 +0000 (16:48 -0500)]
lv-create: set become: true at the playbook level

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  lv-create: use the template module to write log file
Andrew Schoen [Wed, 8 Aug 2018 21:43:55 +0000 (16:43 -0500)]
lv-create: use the template module to write log file

The copy module will not expand the template and render the variables
included, so we must use template.

Creating a temp file and using it locally means that you must run the
playbook with sudo privileges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
7 years ago  infrastructure-playbooks/vars/lv_vars.yaml: minor fixes
Neha Ojha [Tue, 7 Aug 2018 20:08:38 +0000 (20:08 +0000)]
infrastructure-playbooks/vars/lv_vars.yaml: minor fixes

Signed-off-by: Neha Ojha <nojha@redhat.com>
7 years ago  infrastructure-playbooks/lv-create.yml: use tempfile to create logfile
Neha Ojha [Tue, 7 Aug 2018 16:54:29 +0000 (16:54 +0000)]
infrastructure-playbooks/lv-create.yml: use tempfile to create logfile

Signed-off-by: Neha Ojha <nojha@redhat.com>
7 years ago  infrastructure-playbooks/lv-create.yml: add lvm_volumes to suggested paste
Neha Ojha [Mon, 6 Aug 2018 18:14:37 +0000 (18:14 +0000)]
infrastructure-playbooks/lv-create.yml: add lvm_volumes to suggested paste

Signed-off-by: Neha Ojha <nojha@redhat.com>
7 years ago  infrastructure-playbooks/lv-create.yml: copy without using a template file
Neha Ojha [Mon, 6 Aug 2018 17:40:58 +0000 (17:40 +0000)]
infrastructure-playbooks/lv-create.yml: copy without using a template file

Signed-off-by: Neha Ojha <nojha@redhat.com>
7 years ago  infrastructure-playbooks/lv-create.yml: don't use action to copy
Neha Ojha [Fri, 3 Aug 2018 20:32:58 +0000 (20:32 +0000)]
infrastructure-playbooks/lv-create.yml: don't use action to copy

Signed-off-by: Neha Ojha <nojha@redhat.com>
7 years ago  infrastructure-playbooks: standardize variable usage with a space after brackets
Neha Ojha [Fri, 3 Aug 2018 20:08:31 +0000 (20:08 +0000)]
infrastructure-playbooks: standardize variable usage with a space after brackets

Signed-off-by: Neha Ojha <nojha@redhat.com>