This seems to break the update scenario CI testing in stable-3.2.
Typical error:
```
The conditional check 'mds_max_mds > 1' failed. The error was: Unexpected templating type error occurred on ({% if mds_max_mds > 1 %} True {% else %} False {% endif %}): '>' not supported between instances of 'str' and 'int'
```
Vagrantfile: support more than 9 nodes per daemon type
because of the current ip address assignation, it's not possible to
deploy more than 9 nodes per daemon type.
This commit refact a bit and allows us to get around this limitation.
Dimitri Savineau [Fri, 14 Jun 2019 21:31:39 +0000 (17:31 -0400)]
tests: Update ansible ssh_args variable
Because we're using vagrant, a ssh config file will be created for
each nodes with options like user, host, port, identity, etc...
But via tox we're override ANSIBLE_SSH_ARGS to use this file. This
remove the default value set in ansible.cfg.
Also adding PreferredAuthentications=publickey because CentOS/RHEL
servers are configured with GSSAPIAuthenticationis enabled for ssh
server forcing the client to make a PTR DNS query.
tests: fix path to inventory host file in tox-update.ini
the path had `/{env:CONTAINER_DIR:}` which is already added in
`changedir=` section. That led to a wrong path so the initial deployment
couldn't complete.
the wrong image version was used to run shrink_osd playbook.
in stable-3.1 we should use a luminous image, not nautilus which doesn't
have ceph-disk binary anymore.
osd: backward compatibility with old disk_list.sh location
Since all files in container image have moved to `/opt/ceph-container`
this check must look for new AND the old path so it's backward
compatible. Otherwise it could end up by templating an inconsistent
`ceph-osd-run.sh`.
Typical error:
```
fatal: [iscsi-gw0]: FAILED! =>
msg: 'an error occurred while trying to read the file ''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'': [Errno 13] Permission denied: b''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'''
```
`become: True` is not needed on the following task:
`copy crt file(s) to gateway nodes`.
Since it's already set in the main playbook (site.yml/site-container.yml)
The thing is that the files get generated in the 'fetch_directory' with
root user because there is a 'delegate_to' + we run the playbook with
`become: True` (from main playbook).
The idea here is to create files under ansible user so we can open them
later to copy them on the remote machine.
Rishabh Dave [Wed, 12 Dec 2018 11:15:00 +0000 (16:45 +0530)]
ceph-common: merge ntp_debian.yml and ntp_rpm.yml
Merge ntp_debian.yml and ntp_rpm.yml into one (the new file is called
setup_ntp.yml) since they are almost identical.
Since this is as a "as it is" backport for the original commit, it also
adds the feature of supporting multiple NTP daemons (namely, chronyd &
timesyncd). This is to maintain consistency across all branches
since the backport for stable-3.2 was auto-merged by mergify despite
of conflicts.
Allow user to choose between timesyncd, chronyd and ntpd
Installation will default to timesyncd since it is distributed as
part of the systemd installation for most distros.
Added note indicating NTP daemon type is not used for containerized
deployments.
Sébastien Han [Thu, 29 Nov 2018 13:26:41 +0000 (14:26 +0100)]
rolling_update: do not fail on missing keys
We don't want to fail on key that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572 Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Wed, 21 Nov 2018 15:18:58 +0000 (16:18 +0100)]
rolling_update: create rbd and rbd-mirror keyrings
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f9263b9ac3b5649f1e3cf3cbaf12d10)
Sébastien Han [Wed, 21 Nov 2018 15:17:04 +0000 (16:17 +0100)]
ceph_key: add a get_key function
When checking if a key exists we also have to ensure that the key exists
on the filesystem, the key can change on Ceph but still have an outdated
version on the filesystem. This solves this issue.
Sébastien Han [Thu, 9 Aug 2018 09:32:53 +0000 (11:32 +0200)]
rolling_update: fix upgrade when using fqdn
CLusters that were deployed using 'mon_use_fqdn' have a different unit
name, so during the upgrade this must be used otherwise the upgrade will
fail, looking for a unit that does not exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 44d0da0dd497bfab040c1e64fb406e4c13150028)
Andy McCrae [Thu, 12 Jul 2018 11:24:15 +0000 (12:24 +0100)]
Sync config_template with upstream for Ansible 2.6
The original_basename option in the copy module changed to be
_original_basename in Ansible 2.6+, this PR resyncs the config_template
module to allow this to work with both Ansible 2.6+ and before.
Additionally, this PR removes the _v1_config_template.py file, since
ceph-ansible no longer supports versions of Ansible before version 2,
and so we shouldn't continue to carry that code.
Sébastien Han [Wed, 3 Oct 2018 11:39:35 +0000 (13:39 +0200)]
switch: copy initial mon keyring
We need to copy this key into /etc/ceph so when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't not
failing because `fail_on_missing` was False before 2.5, so now it's True
hence the failure.
switch: support migration when cluster is scrubbing
Similar to c13a3c3 we must allow scrubbing when running this playbook.
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag.
This commit allows to switch from non containerized to containerized
environment even while PGs are scrubbing.
`ceph_osd_container_stat` might not be set on other osd node.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.
This should have been backported but it's part of a too important
refact in master that can't be backported.
defaults: fix osd handlers that are never triggered
`run_once: true` + `inventory_hostname == groups.get(osd_group_name) |
last` is a bad combination since if the only node being run isn't the
last, the task will be definitly skipped.
'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```
the reason is that it will assume `monitor_address_block` isn't defined even on
ceph-mon2 because looking for `monitor_address_block` instead of
`hostvars[host]['monitor_address_block']`, therefore it enters in the condition as default value:
`monitor_interface` is set with default value `'interface'` so the `interface`
variable is built with 'ansible_' + 'interface'. It makes ansible throwing a
confusing message about `'ansible_interface'`.
Matthew Vernon [Wed, 19 Sep 2018 12:26:26 +0000 (13:26 +0100)]
restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.
On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".
Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
(could this be backported to at least stable-3.0 please?)
This commit brings back this condition which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.
upgrade: consider all 'active+clean' states as valid pgs
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag before a
rolling update which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.
This commit allows an upgrade even while PGs are scrubbing.
Matthew Vernon [Fri, 21 Sep 2018 16:55:01 +0000 (17:55 +0100)]
restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs
Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the clusters' pgs to be happy again before continuing.
shrink-osd: fix purge osd on containerized deployment
ce1dd8d introduced the purge osd on containers but it was incorrect.
`resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
Indeed, they were run on the ansible node, it means it was trying to
resolve parent devices from this node where it should be done on OSD
nodes.
```
Exception occurred:
File
"/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py", line 26, in <module>
from sphinx.ext import doctest
SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```
Tom Barron [Sat, 1 Sep 2018 14:32:51 +0000 (10:32 -0400)]
run rados cmd in container if containerized deployment
When ceph-nfs is deployed containerized and ceph-common is not
installed on the host the start_nfs task fails because the rados
command is missing on the host.
Run rados commands from a ceph container instead so that
they will succeed.
Markos Chandras [Wed, 29 Aug 2018 10:56:16 +0000 (11:56 +0100)]
roles: ceph-rgw: Enable the ceph-radosgw target
If the ceph-radosgw target is not enabled, then enabling the
ceph-radosgw@ service has no effect since nothing will pull
it on the next reboot. As such, we need to ensure that the
target is enabled.
Andy McCrae [Thu, 30 Aug 2018 07:53:36 +0000 (08:53 +0100)]
Dont run client dummy container on non-x86_64 hosts
The dummy client container currently wont work on non-x86_64 hosts.
This PR creates a filtered client group that contains only hosts
that are x86_64 - which can then be the group to run the
dummy container against.
This is for the specific case of a containerized_deployment where
there is a mixture of non-x86_64 hosts and x86_64 hosts. As such
the filtered group will contain all hosts when running with
containerized_deployment: false.
Currently ppc64le is not supported for Ceph server components.
Sébastien Han [Mon, 27 Aug 2018 17:20:32 +0000 (10:20 -0700)]
sites: fix conditonnal
Same problem again... ceph_release_num[ceph_release] is only set in
ceph-docker-common/common roles so putting the condition on that role
will never work. Removing the condition.
The downside of this is we will be installing packages and then skip the
role on the node.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622210 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ae5ebeeb00214d9ea27929b4670c6de4ad27d829)
The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'
The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: set_fact ceph_current_status (convert to json)
^ here
```
From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a
What's happening here is all the hosts excepts the clients are running atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
The condition will skipped all the nodes excepts the clients, thus when running ceph-default, the task "is ceph running already?" is skipped but the task above needs the rc of the skipped task.
This is not an error from the playbook, it's a CI setup issue.
Sébastien Han [Tue, 21 Aug 2018 18:50:31 +0000 (20:50 +0200)]
defaults: fix rgw_hostname
A couple if things were wrong in the initial commit:
* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
never work since the ceph_release fact is set in the roles after. So
either ceph-common or ceph-docker-common set it
* we can easily re-use the initial command to check if a cluster is
running, it's more elegant than running it twice.
* set the fact rgw_hostname on rgw nodes only
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 6d7fa99ff74b3ec25d1a6010b1ddb25e00c123be)
Sébastien Han [Tue, 21 Aug 2018 09:15:44 +0000 (11:15 +0200)]
rolling_upgrade: set sortbitwise properly
Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSD are getting updated, even though the package is updated
they won't send their updated version (12) and will stick with 10 if the
command is not applied. So we have to check if OSD are sending a version
10 and then run the command to unlock the OSDs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943 Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2e6e885bb75156c74735a65c05b4757b031041bb)
Sébastien Han [Mon, 20 Aug 2018 13:53:03 +0000 (15:53 +0200)]
iscsi group name preserve backward compatibility
Recently we renamed the group_name for iscsi iscsigws where previously
it was named iscsi-gws. Existing deployments with a host file section
with iscsi-gws must continue to work.
This commit adds the old group name as a backoward compatility, no error
from Ansible should be expected, if the hostgroup is not found nothing
is played.
Sébastien Han [Mon, 20 Aug 2018 12:41:06 +0000 (14:41 +0200)]
take-over-existing-cluster: do not call var_files
We were using var_files long ago when default variables were not in
ceph-defaults, now the role exists this is not need. Moreover having
these two var files added:
Markos Chandras [Tue, 14 Aug 2018 06:52:04 +0000 (09:52 +0300)]
roles: ceph-defaults: Delegate cluster information task to monitor node
Since commit f422efb1d6b56ce56a7d39a21736a471e4ed357 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployment with OpenStack-Ansible
fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"
This is because the task executes 'ceph' but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role which runs after the 'ceph-defaults' one.
Since we are looking to obtain cluster information, the task should be
delegated to a monitor node similar to other tasks in that role
Markos Chandras [Wed, 15 Aug 2018 06:55:49 +0000 (09:55 +0300)]
roles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname
If there are no services on the cluster, then the 'rgw' could be missing
and the task is failing with the following problem:
msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'
We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.
Andrew Schoen [Thu, 9 Aug 2018 13:09:41 +0000 (08:09 -0500)]
lv-create: use copy instead of the template module
The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 04df3f0802c0bc903172314d05a38e869f0eee6a) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Thu, 9 Aug 2018 12:26:58 +0000 (07:26 -0500)]
tests: cat the contents of lv-create.log in infra_lv_create
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit f5a4c8986982f277f6fd5bcd5b28c6099f655d79) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Thu, 9 Aug 2018 12:26:22 +0000 (07:26 -0500)]
lv-create: add an example logfile_path config option in lv_vars.yml
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 131796f2750f1209a019ae75a500e6f1a1ab37f8) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Wed, 8 Aug 2018 22:12:30 +0000 (17:12 -0500)]
tests: adds a testing scenario for lv-create and lv-teardown
Using an explicitly named testing environment name allows us to have a
specific [testenv] block for this test. This greatly simplifies how it will
work as it doesn't really anything from the ceph cluster tests.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 810cc47892e53701485c540ff51c00c860ea0a00) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Wed, 8 Aug 2018 22:04:29 +0000 (17:04 -0500)]
lv-teardown: fail silently if lv_vars.yml is not found
This allows user to opt out of using lv_vars.yml and load configuration
from other sources.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit b0bfc173510ec7d5da715c0048e633a8fe3d2a4d) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Wed, 8 Aug 2018 22:04:07 +0000 (17:04 -0500)]
lv-teardown: set become: true at the playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 8424858b40bafebe3569b33279e4d8d824e2276b) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Wed, 8 Aug 2018 21:49:34 +0000 (16:49 -0500)]
lv-create: fail silenty if lv_vars.yml is not found
If a user decides to to use the lv_vars.yml file then it should fail
silenty so that configuration can be picked up from other places.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit e43eec57bb44bf5f7a10da8548ca22a8772c2d92) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Wed, 8 Aug 2018 21:48:42 +0000 (16:48 -0500)]
lv-create: set become: true at the playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit fde47be13cc753218153b3dbc0db5a4daa752b21) Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Wed, 8 Aug 2018 21:43:55 +0000 (16:43 -0500)]
lv-create: use the template module to write log file
The copy module will not expand the template and render the variables
included, so we must use template.
Creating a temp file and using it locally means that you must run the
playbook with sudo privledges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 35301b35af4e71713edb944eb54654b587710527) Signed-off-by: Sébastien Han <seb@redhat.com>