git.apps.os.sepia.ceph.com Git - ceph-ansible.git/log

commit | commitdiff | tree

Greg Charot [Tue, 6 Feb 2018 18:26:54 +0000 (19:26 +0100)]

There is no reason for the crush_rule_hdd rule provided by default by ceph-ansible to be set as rack root + default ruleset

commit | commitdiff | tree

Greg Charot [Tue, 6 Feb 2018 18:20:17 +0000 (19:20 +0100)]

We don't want to automatically move the rbd pool to the new default crush rule. This operation shall be performed by the cluster operator.

commit | commitdiff | tree

Sébastien Han [Wed, 28 Feb 2018 16:08:07 +0000 (17:08 +0100)]

add support for installation checkpoint

This was taken from the openshift ansible repository here:
https://github.com/leseb/openshift-ansible/tree/master/roles/installer_checkpoint

Rationale:

A complete OpenShift cluster installation is comprised of many different
components which can take 30 minutes to several hours to complete. If
the installation should fail, it could be confusing to understand at
which component the failure occurred. Additionally, it may be desired to
re-run only the component which failed instead of starting over from the
beginning. Components which came after the failed component would also
need to be run individually.

Ceph has a similar situation so we can benefit from that
callback_plugin.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Andy McCrae [Mon, 5 Mar 2018 19:06:09 +0000 (19:06 +0000)]

Remove vars that are no longer used

As part of fcba2c801a122b7ce8ec6a5c27a70bc19589d177 these vars were
removed and no longer do anything:

radosgw_dns_name
radosgw_resolve_cname

This patch removes them from the group_vars files and defaults/main.yml

commit | commitdiff | tree

jtudelag [Sun, 4 Mar 2018 21:13:22 +0000 (22:13 +0100)]

Makes use of docker_exec_cmd in ceph-mon role.

Keeps consistency inside the role and among roles.
Makes the code more readable.

commit | commitdiff | tree

Sébastien Han [Thu, 1 Mar 2018 16:33:33 +0000 (17:33 +0100)]

common: run updatedb task on debian systems only

The command doesn't exist on Red Hat systems so it's better to skip it
instead of ignoring the error.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 1 Mar 2018 15:50:06 +0000 (16:50 +0100)]

rgw: add cluster name option to the handler

If the cluster name is different than 'ceph', the command will fail so
we need to pass the cluster name.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 1 Mar 2018 15:47:37 +0000 (16:47 +0100)]

ci: add copy_admin_key test to container scenario

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 1 Mar 2018 15:47:22 +0000 (16:47 +0100)]

rgw: ability to copy ceph admin key on containerized

If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied on the node.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 1 Mar 2018 15:46:01 +0000 (16:46 +0100)]

rgw: run the handler on a mon host

In case the admin key wasn't copied over to the node, this command would
fail. So it's safer to run it from a monitor directly.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Mon, 26 Feb 2018 13:35:36 +0000 (14:35 +0100)]

tests: make CI jobs using 'ansible.cfg'

The jobs launched by the CI are not using 'ansible.cfg'.
It contains some parameters that should avoid the SSH failures we have
been seeing in the CI so far.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Fri, 16 Feb 2018 08:04:23 +0000 (09:04 +0100)]

client: use `ceph_uid` fact to set uid/gid on admin key

That task is failing on containerized deployment because `ceph:ceph`
doesn't exist.
The idea here is to use the `{{ ceph_uid }}` to set the ownerships for
the admin keyring when containerized_deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Grant Slater [Sun, 25 Feb 2018 01:44:07 +0000 (01:44 +0000)]

mds: fix ansible_service_mgr typo

This commit fixes a typo introduced by 4671b9e74e657988137f6723ef12e38c66d9cd40

commit | commitdiff | tree

Andy McCrae [Wed, 21 Feb 2018 08:41:27 +0000 (08:41 +0000)]

Revert "[TEST] Test setting up correct systemd file for nfs-ganesha"

The nfs-ganesha package has been fixed as part of this commit:
https://github.com/nfs-ganesha/nfs-ganesha-debian/commit/963b6681dfac459c27c947cb8decc788bc9e5422

Once the package is rebuilt this should be good to merge.

This reverts commit e88af3c4cb314f1f640447ebdce343f0aca85fb4.

commit | commitdiff | tree

Giulio Fidente [Thu, 22 Feb 2018 18:57:47 +0000 (19:57 +0100)]

Make rule_name optional when defining items in openstack_pools

Previously it was necessary to provide a value (possibly an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to
empty string when not given.
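
The defaulting behaviour described above can be sketched in Python. This is
only an illustration, not the role's actual Jinja2 code: the `pool_rule_name`
helper and the sample pool entries are hypothetical.

```python
# Hypothetical illustration: treat a missing "rule_name" key as an empty string.
def pool_rule_name(pool):
    """Return the pool's rule_name, or "" when the key is absent."""
    return pool.get("rule_name", "")


# Sample data (assumed): one item with rule_name, one without.
openstack_pools = [
    {"name": "images", "pg_num": 128, "rule_name": "replicated_ssd"},
    {"name": "volumes", "pg_num": 128},
]

rule_names = [pool_rule_name(pool) for pool in openstack_pools]
print(rule_names)
```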

commit | commitdiff | tree

Sébastien Han [Thu, 22 Feb 2018 14:05:28 +0000 (15:05 +0100)]

remove kernel.pid_max

This is now managed by Ceph packages.

See: https://github.com/ceph/ceph/pull/18544/files

http://tracker.ceph.com/issues/21929

Closes: https://github.com/ceph/ceph-ansible/issues/2410
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Fri, 23 Feb 2018 10:23:13 +0000 (11:23 +0100)]

tests: change ceph_docker_image_tag for 2nd run

The ceph-ansible upstream CI runs several tests, including an
'idempotency/handlers' test. It means the playbook is run a first time
and then a second time with another container image version to ensure the
handlers run properly and the containers are properly restarted.
This can cause issues.
For instance, in that specific case which drove me to submit this commit,
I've hit the case where `latest` image ships ceph 12.2.3 while the `stable-3.0`
(which is the image used for the second run) ships ceph 12.2.2.

The goal of this test is not to verify we can upgrade from a specific
version to another but to ensure handlers are working, even if it's a valid
failure here.
It should be caught by a test dedicated to that use case.

For the upstream CI, we just need a container image with the same content
but a different image id in the registry, since the test relies on the
image id to decide whether the container should be restarted.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Fri, 16 Feb 2018 12:53:52 +0000 (13:53 +0100)]

ci: add tripleo scenario testing

This should help to see earlier any failure in a tripleo deployment scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Andy McCrae [Mon, 19 Feb 2018 18:13:21 +0000 (18:13 +0000)]

Adjust /etc/updatedb.conf to not parse /var/lib/ceph

Using updatedb -e doesn't make a permanent change; it only runs updatedb
without the passed path.

To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.

commit | commitdiff | tree

Andy McCrae [Mon, 19 Feb 2018 17:23:32 +0000 (17:23 +0000)]

[TEST] Test setting up correct systemd file for nfs-ganesha

Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.

The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).

commit | commitdiff | tree

Guillaume Abrioux [Fri, 16 Feb 2018 12:45:26 +0000 (13:45 +0100)]

update: look for short and fqdn in ceph_health_raw

Depending on the hostname configuration, the task waiting for mons to be
in quorum might fail.
The idea here is to look for both shortname and fqdn in
`ceph_health_raw` instead of just `ansible_hostname`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
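
A minimal sketch of the check, assuming `ceph_health_raw` holds JSON with a
`quorum_names` list (the exact structure is an assumption here, and
`mon_in_quorum` is a hypothetical helper, not code from the playbook):

```python
import json


def mon_in_quorum(ceph_health_raw, short_name, fqdn):
    """Return True if the mon appears in quorum under its short name or FQDN."""
    quorum_names = json.loads(ceph_health_raw).get("quorum_names", [])
    return short_name in quorum_names or fqdn in quorum_names


# Sample output (assumed shape) where mons are registered under their FQDN.
sample = json.dumps({"quorum_names": ["mon0.example.com", "mon1.example.com"]})
print(mon_in_quorum(sample, "mon0", "mon0.example.com"))
```

Checking only the short name would miss `mon0` in this sample, which is the
failure mode the commit describes.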

commit | commitdiff | tree

Paul Bourke [Fri, 16 Feb 2018 16:21:24 +0000 (16:21 +0000)]

Remove redundant task to check if atomic

This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common

Signed-off-by: Paul Bourke <paul.bourke@oracle.com>

commit | commitdiff | tree

Andy McCrae [Wed, 20 Dec 2017 03:49:16 +0000 (13:49 +1000)]

Restart services if handler called

This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.

For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
configuration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first host's.

To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.

Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.

Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.

commit | commitdiff | tree

Sébastien Han [Wed, 14 Feb 2018 00:44:18 +0000 (01:44 +0100)]

container: osd remove run_once

When used along with delegate, run_once does not behave well. Thus,
using | last always brings the desired result.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 8 Feb 2018 16:35:05 +0000 (17:35 +0100)]

docker-common: fix container restart on new image

We now look for any existing containers; if any, we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
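
The comparison can be sketched as follows. This is an illustration of the
idea only; the container names, image ids and the `containers_to_restart`
helper are hypothetical and not taken from the role.

```python
def containers_to_restart(running_images, latest_image_id):
    """Return the names of containers whose image id differs from the latest pulled one."""
    return sorted(
        name
        for name, image_id in running_images.items()
        if image_id != latest_image_id
    )


# Assumed state: the first OSD already runs the new image, the others do not.
running = {
    "ceph-osd-sda": "sha256:new",
    "ceph-osd-sdb": "sha256:old",
    "ceph-osd-sdc": "sha256:old",
}
print(containers_to_restart(running, "sha256:new"))
```

Iterating over every running OSD, rather than inspecting only the first one,
is what catches the partially updated case.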

commit | commitdiff | tree

Sébastien Han [Tue, 13 Feb 2018 08:37:14 +0000 (09:37 +0100)]

default: remove duplicate code

This is already defined in ceph-defaults.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Fri, 9 Feb 2018 17:15:25 +0000 (18:15 +0100)]

test: add test for containers resources changes

We change the ceph_mon_docker_memory_limit on the second run, this
should trigger a restart of services.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Fri, 9 Feb 2018 17:11:07 +0000 (18:11 +0100)]

test: add test for restart on new container image

Since we have a task to test the handlers we can test a new container to
validate the service restart on a new container image.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Mon, 12 Feb 2018 20:52:27 +0000 (14:52 -0600)]

rolling update: fix undefined jewel_minor_update failure

Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.

The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 9 Feb 2018 20:02:07 +0000 (14:02 -0600)]

infra: do not include host_vars/* in take-over-existing-cluster.yml

These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Caleb Boylan [Thu, 28 Dec 2017 16:52:02 +0000 (08:52 -0800)]

osd: Add support for multipath disks

Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs

Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>

commit | commitdiff | tree

Andy McCrae [Fri, 9 Feb 2018 14:12:35 +0000 (14:12 +0000)]

Set application for OpenStack pools

Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.

We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
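
As a rough sketch, the commands involved could be built like this. The pool
names come from the message above and `ceph osd pool application enable` is
the Luminous command for tagging a pool; the helper name and the `cluster`
default are assumptions, and this is not the task code from the role.

```python
OPENSTACK_POOLS = ["vms", "images", "backups", "volumes"]


def application_enable_commands(pools, application="rbd", cluster="ceph"):
    """Build one 'ceph osd pool application enable' argv list per pool."""
    return [
        ["ceph", "--cluster", cluster,
         "osd", "pool", "application", "enable", pool, application]
        for pool in pools
    ]


commands = application_enable_commands(OPENSTACK_POOLS)
for argv in commands:
    print(" ".join(argv))
```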

commit | commitdiff | tree

Sébastien Han [Thu, 8 Feb 2018 16:44:19 +0000 (17:44 +0100)]

site: ability to only generate a ceph.conf on the machines

Now by running the playbook like this:

ansible-playbook site.yml --tags='ceph_update_config'

You can only generate a ceph configuration file on the nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1543434
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 8 Feb 2018 13:51:15 +0000 (14:51 +0100)]

default: define 'osd_scenario' variable

osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Thu, 8 Feb 2018 12:27:45 +0000 (13:27 +0100)]

osd: fix osd restart when dmcrypt

This commit fixes a bug that occurs especially for dmcrypt scenarios.

There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.

If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :

```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```

It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because it won't have
filled the `$DOCKER_ENV` environment variable properly.

Adding `--net=host` to the 'disk_list' container fixes this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Grant [Wed, 7 Feb 2018 12:35:11 +0000 (12:35 +0000)]

Update Documentation example link to 3.0

Update the Documentation link from 2.2 -> 3.0

Helpful for newbies.

commit | commitdiff | tree

Giulio Fidente [Fri, 2 Feb 2018 08:45:07 +0000 (09:45 +0100)]

Check for docker sockets named after both _hostname or _fqdn

While hostname -f will always return a hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.
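
A minimal sketch of looking for a socket under both names. The
`<cluster>-<daemon>.<name>.asok` file layout and the `find_socket` helper are
assumptions made for illustration, not the role's actual task.

```python
import os
import tempfile


def find_socket(run_dir, cluster, daemon, hostname, fqdn):
    """Return the first existing socket path among the short-name and FQDN candidates."""
    for name in (hostname, fqdn):
        path = os.path.join(run_dir, f"{cluster}-{daemon}.{name}.asok")
        if os.path.exists(path):
            return path
    return None


# Demo with a throwaway directory holding a socket named after the FQDN.
with tempfile.TemporaryDirectory() as run_dir:
    open(os.path.join(run_dir, "ceph-mon.mon0.example.com.asok"), "w").close()
    found = find_socket(run_dir, "ceph", "mon", "mon0", "mon0.example.com")
    print(os.path.basename(found))
```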

commit | commitdiff | tree

Greg Charot [Fri, 2 Feb 2018 14:12:18 +0000 (15:12 +0100)]

mon: Fixed crush_rule_config for containerised deployment.

Was called too early, container was not yet started so the commands failed.
Moved the section after include docker/main.yml

Signed-off-by: Greg Charot <gcharot@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Fri, 2 Feb 2018 10:55:18 +0000 (11:55 +0100)]

purge-docker: fix ceph-osd-zap name container

the `zap ceph osd disks` task should iterate over `resolved_parent_device`,
which contains only the base device name, instead of
`combined_devices_list`, which contains the full path name.

this fixes the issue where docker complains about container name because
of illegal characters such as `/` :
```
"/usr/bin/docker-current: Error response from daemon: Invalid container
name (ceph-osd-zap-magna074-/dev/sdb1), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed.","See '/usr/bin/docker-current run --help'."
""
```

having the basename of the device path is enough for the container
name.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
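
The naming problem can be reproduced with a short Python sketch. The pattern
is the one quoted in the docker error above, with a trailing `+` added as an
assumption so that it matches whole names; the helper functions are
hypothetical.

```python
import os
import re

# Character classes quoted in the docker error; the "+" is an assumption.
VALID_NAME = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]+")


def is_valid_container_name(name):
    """Return True when the whole name matches the allowed pattern."""
    return VALID_NAME.fullmatch(name) is not None


def zap_container_name(hostname, device):
    """Build the container name from the basename of the device path."""
    return f"ceph-osd-zap-{hostname}-{os.path.basename(device)}"


print(is_valid_container_name("ceph-osd-zap-magna074-/dev/sdb1"))
print(zap_container_name("magna074", "/dev/sdb1"))
```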

commit | commitdiff | tree

Guillaume Abrioux [Wed, 31 Jan 2018 08:31:11 +0000 (09:31 +0100)]

common: do not use `shell` module when it is not needed

There is no need here to use `shell` instead of `command`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 31 Jan 2018 08:23:28 +0000 (09:23 +0100)]

syntax: change local_action syntax

Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and a way to keep consistency across the whole
playbook.

This also fixes a potential issue about missing quotation:

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)
  File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it complains about missing quotes; this fix solves this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
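
The exception in the traceback comes from Python's `shlex` module, which
refuses to split a string containing an unbalanced quote. The sketch below
reproduces the same error class with made-up command strings; it does not
claim to show the exact input that triggered the bug.

```python
import shlex


def try_split(command):
    """Split a command line like a shell would, reporting quoting errors as text."""
    try:
        return shlex.split(command)
    except ValueError as exc:
        return f"ValueError: {exc}"


# Hypothetical inputs: the first has an unbalanced single quote.
print(try_split("echo 'abc | tee /tmp/out"))
print(try_split("echo 'abc' | tee /tmp/out"))
```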

commit | commitdiff | tree

Sébastien Han [Tue, 30 Jan 2018 13:41:52 +0000 (14:41 +0100)]

osd: resync group_vars file

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Tue, 30 Jan 2018 13:39:58 +0000 (14:39 +0100)]

config: remove any spaces in public_network or cluster_network

With two public networks configured, we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" the install process consistently
broke, trying to find the docker registry on the second network and not
finding the mon container.

Without spaces ("NETWORK_ADDR_1,NETWORK_ADDR_2") the install succeeds,
so the containerized install is more particular about the formatting of
this line.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
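
The normalisation amounts to dropping spaces from the comma-separated value.
A minimal Python sketch, using a hypothetical helper name and sample
addresses (the role itself does this in its templates, not in Python):

```python
def strip_network_spaces(networks):
    """Remove every space from a comma-separated network list."""
    return networks.replace(" ", "")


# Sample value (assumed) written with a space after the comma.
public_network = "192.168.1.0/24, 192.168.2.0/24"
print(strip_network_spaces(public_network))
```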

commit | commitdiff | tree

Guillaume Abrioux [Tue, 30 Jan 2018 16:27:53 +0000 (17:27 +0100)]

purge: fix resolve parent device task

This is a typo caused by a leftover.
It was previously written like this :
`shell: echo /dev/$(lsblk -no pkname "{{ item }}") }}")`
and has been rewritten to :
`shell: $(lsblk --nodeps -no pkname "{{ item }}") }}")`
because we are appending later the '/dev/' in the next task.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Sébastien Han [Mon, 29 Jan 2018 13:28:23 +0000 (14:28 +0100)]

Do not search osd ids if ceph-volume

Description of problem: The 'get osd id' task goes through all 10 retries (and their respective timeouts) to make sure that the number of OSDs in the osd directory matches the number of devices.

This always happens, regardless of whether the setup and deployment are correct.

Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.

How reproducible: 100%

Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster

Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
    "attempts": 10,
    "changed": false,
    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
    "delta": "0:00:00.002717",
    "end": "2018-01-21 18:10:31.237933",
    "failed": true,
    "failed_when_result": false,
    "rc": 0,
    "start": "2018-01-21 18:10:31.235216"
}

STDOUT:

0
1
2

Expected results:
There aren't any (or just a few) timeouts while the OSDs are found

Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.

Basically this line:

    until: osd_id.stdout_lines|length == devices|unique|length

Means in this 2 OSD case it is trying to ensure the following incorrect condition:

    until: 2 == 0

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
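
The failing condition can be mirrored in a few lines of Python. The function
name is hypothetical; the inputs are the STDOUT lines from the report above
and an empty `devices` list, as happens when only ceph-volume (LVM) is used.

```python
def osd_ids_match(osd_ids, devices):
    """Mirror of: osd_id.stdout_lines|length == devices|unique|length"""
    return len(osd_ids) == len(set(devices))


osd_ids = ["0", "1", "2"]  # STDOUT lines from the report
devices = []               # nothing is listed under 'devices' with ceph-volume

print(osd_ids_match(osd_ids, devices))
```

Because the right-hand side is always 0 in this setup, the condition can never
become true and every retry is consumed.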

commit | commitdiff | tree

Andy McCrae [Sat, 27 Jan 2018 19:40:09 +0000 (19:40 +0000)]

Add default for radosgw_keystone_ssl

This should default to False. The default for Keystone is not to use PKI
keys, additionally, anybody using this setting had to have been manually
setting it before.

Fixes: #2111

commit | commitdiff | tree

Guillaume Abrioux [Wed, 24 Jan 2018 13:06:47 +0000 (14:06 +0100)]

Revert "monitor_interface: document need to use monitor_address when using IPv6"

This reverts commit 10b91661ceef7992354032030c7c2673a90d40f4.

This reverts also the same comment added in
1359869497a44df0c3b4157f41453b84326b58e7

commit | commitdiff | tree

Eduard Egorov [Thu, 9 Nov 2017 11:49:00 +0000 (11:49 +0000)]

config: add host-specific ceph_conf_overrides evaluation and generation.

This allows us to use host-specific variables in ceph_conf_overrides variable. For example, this fixes usage of such variables (e.g. 'nss db path' having {{ ansible_hostname }} inside) in ceph_conf_overrides for rados gateway configuration (see profiles/rgw-keystone-v3) - issue #2157.

Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>

commit | commitdiff | tree

Guillaume Abrioux [Thu, 25 Jan 2018 15:57:45 +0000 (16:57 +0100)]

upgrade: skip luminous tasks for jewel minor update

These tasks are needed only when upgrading to luminous.
They are not needed in Jewel minor upgrade and by the way, they fail because
`ceph versions` command doesn't exist.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 24 Jan 2018 17:49:41 +0000 (18:49 +0100)]

defaults: avoid getting stuck (ceph --connect-timeout)

Sometimes the playbook gets stuck because even with the `--connect-timeout=`
option, the connection to the existing ceph cluster never times out.

As a workaround, using `timeout` command provided by coreutils will
actually timeout if we can't connect to the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537003
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
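
The same idea, enforcing the limit from outside the command, can be sketched
in Python with `subprocess`. This is a stand-in for the coreutils `timeout`
wrapper the commit uses, not the playbook code; the "hanging" command is a
placeholder that simply sleeps.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it exceeded the time limit."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


# Placeholder for a command that never returns on its own.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(hanging, 0.5))
```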

commit | commitdiff | tree

Andrew Schoen [Mon, 22 Jan 2018 16:53:40 +0000 (10:53 -0600)]

docs for creating encrypted OSDs with the lvm scenario

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 19 Jan 2018 15:44:59 +0000 (09:44 -0600)]

ceph-osd: adds dmcrypt to the lvm scenario

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 19 Jan 2018 15:43:48 +0000 (09:43 -0600)]

ceph-volume: adds a dmcrypt param to the ceph_volume module

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Tue, 23 Jan 2018 13:38:35 +0000 (14:38 +0100)]

ansible: set ssh retry option to 5

We noticed that sometimes ceph-ansible can fail with the error:

`Failed to connect to the host via ssh:`

It can occur after the task `restart firewalld` has been played.

Setting `retries` to 5 should prevent unexpected ssh failures.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Mon, 22 Jan 2018 13:28:15 +0000 (14:28 +0100)]

osds: change default value for `dedicated_devices`

This is to keep backward compatibility with stable-2.2 and satisfy the
check "verify dedicated devices have been provided" in
`check_mandatory_vars.yml`. This check is looking for
`dedicated_devices` so we need to default its value to
`raw_journal_devices` when `raw_multi_journal` is set to `True`.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1536098
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Thu, 18 Jan 2018 13:57:45 +0000 (07:57 -0600)]

tests: remove crush_device_class from lvm tests

The --crush-device-class flag for ceph-volume is not available in luminous so let's
remove this testing option for now until it's more widely available.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 18 Jan 2018 09:06:34 +0000 (10:06 +0100)]

rgw: disable legacy unit

Some systems that were deployed with old tools can leave units named
"ceph-radosgw@radosgw.gateway.service". As a consequence, they will
prevent the new unit from starting.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1509584
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Wed, 17 Jan 2018 14:18:11 +0000 (15:18 +0100)]

rolling update: add mgr exception for jewel minor updates

When updating from one minor Jewel version to another, the playbook will
fail on the task "fail if no mgr host is present in the inventory".
This can now be worked around by running Ansible with:

-e jewel_minor_update=true

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1535382
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 17 Jan 2018 08:08:16 +0000 (09:08 +0100)]

purge-container: use lsblk to resolve parent device

Using `lsblk` to resolve the parent device is better than just removing the last
char when passing it to the zap container.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 17 Jan 2018 08:06:43 +0000 (09:06 +0100)]

purge-container: remove awk usage in favor of blkid

Avoid using `awk` to get the different devices from the partlabel.
Using `blkid` is more readable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 12 Jan 2018 14:46:30 +0000 (08:46 -0600)]

docs for the crush_device_class option of lvm_volumes

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Thu, 11 Jan 2018 17:00:23 +0000 (11:00 -0600)]

tests: adds crush_device_class to lvm tests

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Thu, 11 Jan 2018 16:59:01 +0000 (10:59 -0600)]

ceph-osd: adds the crush_device_class param to the lvm scenario

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Thu, 11 Jan 2018 16:56:39 +0000 (10:56 -0600)]

ceph_volume: adds the crush_device_class param

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Ken Dreyer [Thu, 11 Jan 2018 17:06:21 +0000 (10:06 -0700)]

Makefile: handle "beta" Git tags

With this change, "make srpm" will generate an RPM with "beta" in the
Release value.

For example, "v3.2.0beta1" will create
"ceph-ansible-3.2.0-0.beta1.1.el7.src.rpm"

commit | commitdiff | tree

Eduard Egorov [Thu, 16 Nov 2017 14:26:27 +0000 (14:26 +0000)]

crush: create rack type buckets and build crush tree according to {{ osd_crush_location }}.

Currently, we can define crush location for each host but only crush roots and crush rules are created. This commit automates other routines for a complete solution:
  1) Creates rack type crush buckets defined in {{ ceph_crush_rack }} of each osd host. If it's not defined by user then a rack named 'default_rack_{{ ceph_crush_root  }}' would be added and used in next steps.
  2) Move rack type crush buckets defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
  3) Move hosts defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.

Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>

commit | commitdiff | tree

Sébastien Han [Tue, 19 Dec 2017 17:54:19 +0000 (18:54 +0100)]

osd: skip devices marked as '/dev/dead'

On a non-collocated scenario, if a drive is faulty we can't really
remove it from the list of 'devices' without messing up or having to
re-arrange the order of the 'dedicated_devices'. We want to keep this
device list ordered. This will prevent the activation failing on a
device that we know is failing but we can't remove it yet to not mess up
the dedicated_devices mapping with devices.

Signed-off-by: Sébastien Han <seb@redhat.com>
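
A minimal sketch of why the placeholder keeps the mapping intact: devices and
dedicated devices are paired by position, and pairs marked `/dev/dead` are
skipped. The helper name and the sample device paths are hypothetical.

```python
DEAD = "/dev/dead"


def active_pairs(devices, dedicated_devices):
    """Pair each device with its dedicated device by position, skipping dead ones."""
    return [
        (device, dedicated)
        for device, dedicated in zip(devices, dedicated_devices)
        if device != DEAD
    ]


# Assumed layout: the second drive is faulty and replaced by the placeholder.
devices = ["/dev/sda", "/dev/dead", "/dev/sdc"]
dedicated_devices = ["/dev/sdx", "/dev/sdx", "/dev/sdy"]
print(active_pairs(devices, dedicated_devices))
```

Removing the faulty entry from `devices` instead would shift `/dev/sdc` onto
the wrong dedicated device.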

commit | commitdiff | tree

Sébastien Han [Thu, 21 Dec 2017 18:57:01 +0000 (19:57 +0100)]

ci: test on ansible 2.4.2

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Mon, 8 Jan 2018 14:00:32 +0000 (15:00 +0100)]

container: trigger handlers on systemd file change

When a systemd unit file is changed we should trigger handlers to
restart the services.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Mon, 8 Jan 2018 09:00:25 +0000 (10:00 +0100)]

handlers: avoid duplicate handler

Having handlers in both ceph-defaults and ceph-docker-common roles can make the
playbook restart services twice. Handlers can be triggered a first
time because of a change in ceph.conf and a second time because a new
image has been pulled.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Sébastien Han [Fri, 15 Dec 2017 18:43:23 +0000 (19:43 +0100)]

container: restart container when there is a new image

There was no good way to implement this.
We had several options and none of them were ideal since handlers can
not be triggered cross-roles.
We could have achieved that by doing:

* option 1 was to add a dependency in the meta of the ceph-docker-common
role. We had that long ago and we decided to stop so everything is
managed via site.yml

* option 2 was to import files from another role. This is messy and we
don't do that anywhere in the current code base. We will keep it that way.

There is option 3 where we pull the image from the ceph-config role.
This is not suitable either since the docker command won't be available
unless you run an Atomic distro. This would also mean that you're trying to
pull twice. First time in ceph-config, second time in ceph-docker-common

The only option I came up with was to duplicate a bit of the ceph-config
handlers code.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 10 Jan 2018 09:18:27 +0000 (10:18 +0100)]

containers: fix bug when looking for existing cluster

In a containerized deployment, `docker_exec_cmd` is not set before the
task which tries to retrieve the current fsid is played. This means it
considers there is no existing fsid and tries to generate a new one.

Typical error:

```
ok: [mon0 -> mon0] => {
    "changed": false,
    "cmd": [
        "ceph",
        "--connect-timeout",
        "3",
        "--cluster",
        "test",
        "fsid"
    ],
    "delta": "0:00:00.179909",
    "end": "2018-01-09 10:36:58.759846",
    "failed": false,
    "failed_when_result": false,
    "rc": 1,
    "start": "2018-01-09 10:36:58.579937"
}

STDERR:

Error initializing cluster client: Error('error calling conf_read_file: errno EINVAL',)
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
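
To illustrate why the ordering matters, here is a small sketch of how an exec prefix changes the fsid command shown in the error. The function name `fsid_command` and the container name `ceph-mon-mon0` are assumptions made for the example, not values taken from the commit.

```python
import shlex


def fsid_command(cluster, docker_exec_cmd=""):
    """Return the argv used to query the fsid, optionally inside a container."""
    # An empty prefix reproduces the failing case: the command runs on the
    # host, where ceph cannot read the cluster configuration.
    prefix = shlex.split(docker_exec_cmd)
    return prefix + ["ceph", "--connect-timeout", "3", "--cluster", cluster, "fsid"]


print(fsid_command("test"))
print(fsid_command("test", "docker exec ceph-mon-mon0"))
```

With the prefix set before the task runs, the same query is executed inside the monitor container instead of on the host.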

commit | commitdiff | tree

Sébastien Han [Tue, 9 Jan 2018 13:34:09 +0000 (14:34 +0100)]

container: change the way we force no logs inside the container

Previously we were using ceph_conf_overrides, however this doesn't play
nicely with software like TripleO that uses ceph_conf_overrides inside its
own code. For now, and since this is the only occurrence of this, we can
ensure no logs through the ceph conf template.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1532619
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 10 Jan 2018 08:08:01 +0000 (09:08 +0100)]

defaults: rename check_socket files for containers

In a containerized deployment, we are not looking for a socket but for a
running container.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Sébastien Han [Tue, 9 Jan 2018 12:54:50 +0000 (13:54 +0100)]

mon: use crush rules for non-container too

There is no reason why we can't use crush rules when deploying
containers. So moving the include into main.yml so it can be called.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Mon, 8 Jan 2018 15:41:42 +0000 (16:41 +0100)]

containers: bump memory limit

A default value of 4GB for MDS is more appropriate and 3GB for OSD also.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1531607
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 5 Jan 2018 19:47:10 +0000 (13:47 -0600)]

test: set UPDATE_CEPH_DOCKER_IMAGE_TAG for jewel tests

We want to be explicit here and update to luminous and not
the 'latest' tag.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 5 Jan 2018 18:42:16 +0000 (12:42 -0600)]

switch-to-containers: do not fail when stopping the nfs-ganesha service

If we're working with a jewel cluster then this service will not exist.

This is mainly a problem with CI testing because our tests are set up to
work with both jewel and luminous, meaning that even though we want to
test jewel we still have a nfs-ganesha host in the test causing these
tasks to run.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 5 Jan 2018 18:37:36 +0000 (12:37 -0600)]

switch-to-containers: do not fail when stopping the ceph-mgr daemon

If we are working with a jewel cluster ceph mgr does not exist
and this makes the playbook fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Andrew Schoen [Fri, 5 Jan 2018 16:06:53 +0000 (10:06 -0600)]

rolling_update: do not fail the playbook if nfs-ganesha is not present

The rolling update playbook was attempting to stop the
nfs-ganesha service on nodes where jewel is still installed.
The nfs-ganesha service did not exist in jewel so the task fails.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

commit | commitdiff | tree

Aviolat Romain [Fri, 29 Dec 2017 23:17:31 +0000 (00:17 +0100)]

doc: corrected a typo

commit | commitdiff | tree

Sébastien Han [Wed, 20 Dec 2017 14:29:02 +0000 (15:29 +0100)]

mon: always run ceph-create-keys

ceph-create-keys is idempotent so it's not an issue to run it each time
we play ansible. This also fixes issues where the 'creates' arg skips the
task and no keys get generated on newer versions, e.g. during an upgrade.

Closes: https://github.com/ceph/ceph-ansible/issues/2228
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 21 Dec 2017 09:19:22 +0000 (10:19 +0100)]

rgw: disable legacy rgw service unit

When upgrading from OSP11 to OSP12 container, ceph-ansible attempts to
disable the RGW service provided by the overcloud image. The task
attempts to stop/disable ceph-rgw@{{ ansible-hostname }} and
ceph-radosgw@{{ ansible-hostname }}.service. The actual service name is
ceph-radosgw@radosgw.$name

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525209
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Tue, 19 Dec 2017 09:55:02 +0000 (10:55 +0100)]

osd: fix check gpt

The gpt label creation doesn't work even with the parted module.
This commit fixes the gpt label creation by using the parted command
instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 13 Dec 2017 14:23:47 +0000 (15:23 +0100)]

purge-cluster: clean some code

Avoid using regexp to match device

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Wed, 13 Dec 2017 14:24:33 +0000 (15:24 +0100)]

purge-cluster: wipe disk using dd

The `bluestore_purge_osd_non_container` scenario is failing because it
keeps old osd_uuid information on devices and causes `ceph-disk activate`
to fail when trying to redeploy a new cluster after a purge.

Typical error seen:

```
2017-12-13 14:29:48.021288 7f6620651d00 -1
bluestore(/var/lib/ceph/tmp/mnt.2_3gh6/block) _check_or_set_bdev_label
bdev /var/lib/ceph/tmp/mnt.2_3gh6/block fsid
770080e2-20db-450f-bc17-81b55f167982 does not match our fsid
f33efff0-2f07-4203-ad8d-8a0844d6bda0
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Sébastien Han [Wed, 20 Dec 2017 12:39:33 +0000 (13:39 +0100)]

fix jewel scenarios on container

When deploying Jewel from master we still need to enable this code since
the container image has such a check. This check still exists because
ceph-disk is not able to create a GPT label on a drive that does not
have one.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Tue, 19 Dec 2017 14:10:05 +0000 (15:10 +0100)]

site-docker: ability to disable fact sharing

When deploying with Ansible at large scale, the delegate_facts method
consumes a lot of memory on the host that is running Ansible. This can
cause various issues like memory exhaustion on that machine.
You can now run Ansible with "-e delegate_facts_host=False" to disable
the fact sharing.

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Mon, 18 Dec 2017 15:43:37 +0000 (16:43 +0100)]

osd: best effort if no device is found during activation

We have a scenario where we switch from non-container to containers. This
means we don't know anything about the ceph partitions associated with an
OSD. Normally in a containerized context we have files containing the
preparation sequence. From these files we can get the capabilities of
each OSD. As a last resort we use a ceph-disk call inside a dummy bash
container to discover the ceph journal on the current osd.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525612
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Fri, 15 Dec 2017 16:39:32 +0000 (17:39 +0100)]

rolling_update: do not require root to answer question

There is no need to ask for root on the local action. This will prompt
for a password if the current user is not part of sudoers. That's
unnecessary anyway.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1516947
Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Sébastien Han [Tue, 19 Dec 2017 10:17:04 +0000 (11:17 +0100)]

nfs: fix package install for debian/suse systems

This resolves the following error:
E: There were unauthenticated packages and -y was used without
--allow-unauthenticated

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Christian Berendt [Tue, 12 Dec 2017 10:06:15 +0000 (11:06 +0100)]

Rename fact docker_version to ceph_docker_version

The name docker_version is very generic and is also used by other
roles. As a result, there may be name conflicts. To avoid this a
ceph_ prefix should be used for this fact. Since it is an internal
fact renaming is not a problem.

commit | commitdiff | tree

Markos Chandras [Thu, 14 Dec 2017 18:13:09 +0000 (18:13 +0000)]

roles: ceph-mgr: Install the ceph-mgr package on SUSE

The ceph-mgr package name is identical to RedHat so add the SUSE family
to the existing task.

commit | commitdiff | tree

Sébastien Han [Thu, 14 Dec 2017 16:23:02 +0000 (17:23 +0100)]

contrib: do not skip ci on backport

Signed-off-by: Sébastien Han <seb@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Tue, 12 Dec 2017 10:28:36 +0000 (11:28 +0100)]

client: don't make `osd_pool_default_pg_num` mandatory

Making `osd_pool_default_pg_num` mandatory is a bit aggressive and is
irrelevant when you just want to create user keyrings.

Closes: #2241
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Tue, 12 Dec 2017 10:25:26 +0000 (11:25 +0100)]

client: don't try to generate keys

The entrypoint used to generate user keyrings is `ceph-authtool`, therefore
it can't expand the `$(ceph-authtool --gen-print-key)` inside the
container. Users must generate a keyring themselves.
This commit also adds a check to ensure keyrings are properly filled when
`user_config: true`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Guillaume Abrioux [Tue, 12 Dec 2017 13:55:02 +0000 (14:55 +0100)]

docker: add missing condition for selinux tasks

In the `client` and `mds` roles, it tries to set selinux even on non-rhel
based distributions.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

commit | commitdiff | tree

Sébastien Han [Thu, 14 Dec 2017 10:31:28 +0000 (11:31 +0100)]

default: look for the right return code on socket stat in-use

As reported in https://github.com/ceph/ceph-ansible/issues/2254, the
check with fuser is not ideal. If fuser is not available the return code
is 127. Here we want to make sure that we are looking for the correct return
code, which is 1.

Closes: https://github.com/ceph/ceph-ansible/issues/2254
Signed-off-by: Sébastien Han <seb@redhat.com>
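
The distinction between the return codes can be sketched as follows. The function name `socket_state` and the state labels are hypothetical. The mapping assumes the usual `fuser` convention (0 when at least one process uses the file, 1 when none does) and the shell's 127 for a command that was not found.

```python
def socket_state(fuser_rc):
    """Classify a socket from the return code of a fuser call on it."""
    if fuser_rc == 0:
        return "in-use"   # some process holds the socket
    if fuser_rc == 1:
        return "stale"    # fuser ran and found no process
    # Anything else (for example 127, fuser missing) tells us nothing
    # about the socket, so it must not be treated as stale.
    return "unknown"


for rc in (0, 1, 127):
    print(rc, socket_state(rc))
```

Testing for `rc == 1` rather than `rc != 0` is what keeps a missing `fuser` binary from being mistaken for an unused socket.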

commit | commitdiff | tree

John Fulton [Mon, 11 Dec 2017 21:17:22 +0000 (16:17 -0500)]

Add flags for OSD 'docker run --cpuset-{cpus,mems}'

Add the variables ceph_osd_docker_cpuset_cpus and
ceph_osd_docker_cpuset_mems, so that a user may specify
the CPUs and memory nodes of NUMA systems on which OSD
containers are run.

Provides an example in osds.yaml.sample to guide the user
based on sample `lscpu` output, since cpuset-mems refers
to the memory by NUMA node only while cpuset-cpus can
refer to individual vCPUs within a NUMA node.
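
As a rough sketch of how the two variables could map onto `docker run` flags: the helper `cpuset_args` is hypothetical, and the CPU and node numbers are made-up sample values rather than the ones from osds.yaml.sample.

```python
def cpuset_args(cpuset_cpus=None, cpuset_mems=None):
    """Translate the two optional settings into docker run arguments."""
    args = []
    if cpuset_cpus:
        # Individual vCPUs, e.g. the ones lscpu lists for one NUMA node.
        args.append(f"--cpuset-cpus={cpuset_cpus}")
    if cpuset_mems:
        # Memory is selected per NUMA node, not per CPU.
        args.append(f"--cpuset-mems={cpuset_mems}")
    return args


# Pin an OSD container to the even-numbered vCPUs and the memory of node 0.
print(cpuset_args(cpuset_cpus="0,2,4,6", cpuset_mems="0"))
```

Leaving both values unset yields no extra arguments, so the container is not restricted in that case.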

commit | commitdiff | tree

Eduard Egorov [Mon, 20 Nov 2017 14:11:38 +0000 (14:11 +0000)]

firewall: add mds, nfs, restapi and iscsi ports, remove 'configure_firewall' variable used for conditional execution. Include the task only on rpm-based systems.

Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
