Andrew Schoen [Wed, 8 Aug 2018 22:12:30 +0000 (17:12 -0500)]
tests: adds a testing scenario for lv-create and lv-teardown
Using an explicitly named testing environment allows us to have a
specific [testenv] block for this test. This greatly simplifies how it will
work as it doesn't really need anything from the ceph cluster tests.
Andrew Schoen [Wed, 8 Aug 2018 21:43:55 +0000 (16:43 -0500)]
lv-create: use the template module to write log file
The copy module will not expand the template and render the variables
included, so we must use template.
Creating a temp file and using it locally means that you must run the
playbook with sudo privileges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.
Sébastien Han [Mon, 13 Aug 2018 13:59:25 +0000 (15:59 +0200)]
rolling_update: register container osd units
Before running the upgrade, let's call systemd to collect unit names
instead of relying on the device list. This is more accurate and fixes
the osd_auto_discovery scenario too.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626 Signed-off-by: Sébastien Han <seb@redhat.com>
This commit was giving a new failure later during the rolling_update
process. Basically, this was modifying the list of devices and started
impacting the ceph-osd itself. The modification to accommodate the
osd_auto_discovery parameter should happen outside of the ceph-osd.
Also we are trying to not play ceph-osd role during the rolling_update
process so we can speed up the upgrade.
The fqdn configuration possibility caused a lot of trouble. It adds a
lot of complexity because of multiple cases and the relation between
ceph-ansible and ceph-container. Moreover, there is no benefit to such
a feature.
The ceph.conf.j2 template always assumes the hostname used to register the
radosgw in the servicemap is equivalent to `{{ ansible_hostname }}`,
which returns the shortname form.
We need to detect which form of the hostname was used in the case of an already
deployed cluster and update the ceph.conf accordingly.
Sébastien Han [Fri, 10 Aug 2018 09:08:14 +0000 (11:08 +0200)]
mon: fix calamari initialisation
If calamari is already installed and ceph has been upgraded to a higher
version, the initialisation will fail later. So if we detect that the
calamari-server is too old compared to ceph_rhcs_version, we try to update
it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1601755 Signed-off-by: Sébastien Han <seb@redhat.com>
Andrew Schoen [Thu, 9 Aug 2018 15:40:16 +0000 (10:40 -0500)]
lvm: fix condition when selecting which scenario to run
devices and lvm_volumes will always be defined, so we need to instead
check their length before deciding to run the scenario.
This fixes the failure here:
https://2.jenkins.ceph.com/job/ceph-ansible-prs-luminous-bluestore_lvm_osds/86/consoleFull#1667273050b5dd38fa-a56e-4233-a5ca-584604e56e3a
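The difference between "is defined" and "is non-empty" can be sketched in Python. This is only an illustration: the real check is an Ansible condition, and the function name `scenarios_to_run` and its return labels are hypothetical.

```python
def scenarios_to_run(devices, lvm_volumes):
    """Return the names of the inputs that actually contain entries.

    Both arguments always exist (they default to empty lists), so testing
    whether they are defined is always true. Only their length tells us
    whether the matching scenario has anything to do.
    """
    selected = []
    if len(lvm_volumes) > 0:
        selected.append("lvm_volumes")
    if len(devices) > 0:
        selected.append("devices")
    return selected
```

With both lists empty the function selects nothing, which is the case a plain "is defined" test cannot detect.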
Sébastien Han [Thu, 9 Aug 2018 13:18:34 +0000 (15:18 +0200)]
osd: generate device list for osd_auto_discovery on rolling_update
rolling_update relies on the list of devices when performing the restart
of the OSDs. The task that is building the devices list out of the
ansible_devices dict only runs when there are no partitions on the
drives. However during an upgrade the OSDs are already configured, they
have been prepared and have partitions so this task won't run and thus
the devices list will be empty, skipping the restart during
rolling_update. We now run the same task under different requirements
when rolling_update is true and build a list when:
* osd_auto_discovery is true
* rolling_update is true
* ansible_devices exists
* no dm/lv are part of the discovery
* the device is not removable
* the device has more than 1 sector
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626 Signed-off-by: Sébastien Han <seb@redhat.com>
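The conditions listed above can be sketched as a Python filter. This is a minimal illustration, not the playbook task itself: the function name is hypothetical, and it assumes each `ansible_devices` entry exposes `removable` and `sectors` as strings and that dm/lv entries are named with a `dm-` prefix.

```python
def discover_devices(ansible_devices, osd_auto_discovery, rolling_update):
    """Build the /dev paths of the disks that pass every listed condition."""
    # The list is only built when both flags are true and the fact exists.
    if not (osd_auto_discovery and rolling_update and ansible_devices):
        return []
    devices = []
    for name, facts in sorted(ansible_devices.items()):
        if name.startswith("dm-"):  # assumed naming of dm/lv entries
            continue
        if str(facts.get("removable", "0")) != "0":  # skip removable media
            continue
        if int(facts.get("sectors", 0)) <= 1:  # needs more than 1 sector
            continue
        devices.append("/dev/" + name)
    return devices
```

Note that, unlike the original task, this filter does not look at partitions, which is why it still yields devices on hosts whose OSDs are already prepared.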
Andrew Schoen [Mon, 6 Aug 2018 20:14:53 +0000 (15:14 -0500)]
ceph-osd: adds crush_device_class config option
This is used with the lvm osd scenario. When using devices you need the
option to set the crush device class for all of the OSDs that are
created from those devices.
Andrew Schoen [Fri, 3 Aug 2018 16:15:58 +0000 (11:15 -0500)]
ceph-volume: implement the 'lvm batch' subcommand
This adds the action 'batch' to the ceph-volume module so that we can
run the new 'ceph-volume lvm batch' subcommand. A functional test is
also included.
If devices is defined and osd_scenario is lvm then the 'ceph-volume lvm
batch' command will be used to create the OSDs.
In environments where we wish to have manual/greater control over
how the bootstrap keyrings are used, we need to be able to externally
define what the mgr keyring secret will be and have ceph-ansible
use it, instead of it being autogenerated.
Sébastien Han [Tue, 7 Aug 2018 11:38:36 +0000 (13:38 +0200)]
test: follow up on osd_crush_location for containers
This was fixed by
https://github.com/ceph/ceph-ansible/commit/578aa5c2d54a680912e4e015b6fb3dbbc94d4fd0
on non-container, we need to apply the same fix for containers.
* add some missing dots and ``
* add/remove line breaks
* consistent use of shell prompt in console outputs
* fix block indents
* use code blocks
Signed-off-by: Christian Berendt <berendt@b1-systems.de>
Fix the regular expression matching the OSD ID on non-containerized
deployments.
restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do so, the script loops over the list of ceph-osd@ services in the
system. This commit fixes a bug in the regular expression responsible for
extracting the OSD IDs: the prior version used the `[0-9]{1,2}` expression,
which ignores all OSDs whose numbers are greater than 99 (thus
longer than 2 digits). The fix removes the upper limit on the number of digits.
This problem existed in two places in the script.
Closes: #2964 Signed-off-by: Artur Fijalkowski <artur.fijalkowski@ing.com>
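The effect of the digit limit can be reproduced with a short Python sketch. Only the `[0-9]{1,2}` fragment comes from the commit message; the surrounding `ceph-osd@...\.service` pattern, the sample unit names, and the helper `osd_ids` are assumptions made for illustration, and the real script is a shell script.

```python
import re

# Old behaviour: at most two digits are accepted for the OSD ID.
OLD_PATTERN = re.compile(r"ceph-osd@([0-9]{1,2})\.service")
# Fixed behaviour: any number of digits is accepted.
NEW_PATTERN = re.compile(r"ceph-osd@([0-9]+)\.service")


def osd_ids(pattern, unit_names):
    """Return the OSD IDs of the unit names that fully match the pattern."""
    ids = []
    for unit in unit_names:
        match = pattern.fullmatch(unit)
        if match:
            ids.append(int(match.group(1)))
    return ids


units = ["ceph-osd@7.service", "ceph-osd@42.service", "ceph-osd@128.service"]
old_ids = osd_ids(OLD_PATTERN, units)  # OSD 128 is silently dropped
new_ids = osd_ids(NEW_PATTERN, units)  # all three OSDs are found
```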
defaults: backward compatibility with fqdn deployments
This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployment to be done with shortname, we
must keep backward compatibility with clusters already deployed with
fqdn configuration
Sébastien Han [Mon, 30 Jul 2018 16:29:00 +0000 (18:29 +0200)]
config: enforce socket name
This was introduced by
https://github.com/ceph/ceph/commit/59ee2e8d3b14511e8d07ef8325ac8ca96e051784
and made our socket checks impossible to run. The PID could be found,
but the cctid cannot.
This happens during upgrade to mimic and on cluster running on mimic.
So let's force the admin socket the way it was so we can properly check
for existing instances. Also, the line $cluster-$name.$pid.$cctid.asok
is only needed when running multiple instances of the same daemon,
something ceph-ansible cannot do at the time of writing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610220 Signed-off-by: Sébastien Han <seb@redhat.com>
Mike Christie [Thu, 26 Jul 2018 18:52:44 +0000 (13:52 -0500)]
igw: do not fail purge on rbd removal errors
Instead of failing the entire purge operation when the rbd command fails
just log an error. This will allow the higher level target and config
cleanup to complete, and the user only has to manually delete the rbd
images.
Signed-off-by: Mike Christie <mchristi@redhat.com>
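The log-and-continue behaviour described in this commit can be sketched as follows. The names `purge_images` and `remove_image` are hypothetical; the callable stands in for whatever actually invokes the rbd command.

```python
import logging

logger = logging.getLogger("igw_purge_sketch")


def purge_images(images, remove_image):
    """Try to remove every image; log failures instead of aborting.

    Returns the images that could not be removed, so the caller can
    report them for manual deletion while the rest of the cleanup goes on.
    """
    failed = []
    for image in images:
        try:
            remove_image(image)
        except Exception as err:  # any removal error is logged, not raised
            logger.error("could not remove rbd image %s: %s", image, err)
            failed.append(image)
    return failed
```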
Mike Christie [Wed, 25 Jul 2018 18:13:17 +0000 (13:13 -0500)]
igw: fix image removal during purge
We were not passing in the ceph conf info into the rbd image removal
command, so if the clustername was not the default, igw purge would fail
due to the rbd rm command failing.
This just fixes the bug by passing in the ceph conf info which has the
clustername to use.
This fixes Red Hat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1601949
Signed-off-by: Mike Christie <mchristi@redhat.com>
Sébastien Han [Fri, 27 Jul 2018 14:52:19 +0000 (16:52 +0200)]
osd: do not remove expose_partition container
The container runs with --rm which means it will be deleted by Docker
when exiting. Also 'docker rm -f' is not idempotent and returns 1 if the
container does not exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1609007 Signed-off-by: Sébastien Han <seb@redhat.com>
Jewel used to create a default `rbd` pool in the default crush root
`default`, so we need at least 1 osd to satisfy the PGs for this
created pool. Otherwise the cluster will be in HEALTH_ERR state because
of `pgs stuck unclean`/`pgs stuck inactive`.
Check `systempython2.stat` instead of `systempython2.stat.exists`.
Without this change, in the case that python2 is not installed, the `stat`
task fails without defining `systempython2.stat`. As a result, the next
installation tasks fail because `systempython2.stat` is undefined.
```
TASK [install python2 for debian based systems] ************************
Wednesday 25 July 2018 14:51:00 +0900 (0:00:01.742) 0:00:01.926 *
fatal: [ceph-mon2]: FAILED! => {
"msg": "The conditional check 'systempython2.stat.exists is undefined or
systempython2.stat.exists == false' failed. The error was: error while
evaluating conditional (systempython2.stat.exists is undefined or
systempython2.stat.exists == false): 'dict object' has no attribute 'stat'
\n\n The error appears to have been in
'/Users/arata/git/ceph-ansible/site.yml.sample': line 36, column 7, but
may\n be elsewhere in the file depending on the exact syntax problem.\n\n
The offending line appears to be:\n\n\n
- name: install python2 for debian based systems\n
^ here\n
"}
...ignoring
```
Sébastien Han [Mon, 23 Jul 2018 12:56:20 +0000 (14:56 +0200)]
rolling_update: set osd sortbitwise
Upgrading from RHCS 2 to RHCS 3 will fail if the cluster still has
sortnibblewise set:
it stays stuck on "TASK [waiting for clean pgs...]" as RHCS 3 osds will
not start if nibblewise is set.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943 Signed-off-by: Sébastien Han <seb@redhat.com>
tests: add support of 'ooo-collocation' scenario when testing against ceph dev
The group_vars/all file is not available in the 'ooo-collocation' scenario,
which makes `dev_setup.yml` fail because this path is hardcoded.
The idea here is to check if the pattern 'ooo-collocation' is present in
`change_dir` variable so we can set this path properly according to the
scenario being run.
tests: support update scenarios in test_rbd_mirror_is_up()
`test_rbd_mirror_is_up()` is failing on update scenarios because it
assumes the `ceph_stable_release` is still set to the value of the
original ceph release, which means it won't enter the right branch of the
condition and fails.
Since no latest-bis-jewel exists, it's using latest-bis which points to
ceph mimic. In our testing, using it for idempotency/handlers tests
means upgrading from jewel to mimic which is not what we want to do.
client: do not rely on copy_admin_key to import keys
Relying on `copy_admin_key` to import created keys on client nodes
obliges us to copy the admin key to those nodes, which is not something we might
want.
We should use the fact `condition_copy_admin_key` which will be set to
`True` when the delegated node is a mon which means we can import keys
without taking care of admin keyring.
Andy McCrae [Thu, 12 Jul 2018 11:24:15 +0000 (12:24 +0100)]
Sync config_template with upstream for Ansible 2.6
The original_basename option in the copy module changed to be
_original_basename in Ansible 2.6+, this PR resyncs the config_template
module to allow this to work with both Ansible 2.6+ and before.
Additionally, this PR removes the _v1_config_template.py file, since
ceph-ansible no longer supports versions of Ansible before version 2,
and so we shouldn't continue to carry that code.
Closes: #2843 Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
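One way to tolerate both spellings of the option is to look up either key. This is a hedged sketch and not the upstream module code: `get_original_basename` is a hypothetical helper and `task_args` is assumed to be a plain dict of task arguments.

```python
def get_original_basename(task_args):
    """Return the source file name under either spelling of the option.

    Ansible 2.6+ uses `_original_basename`; earlier releases use
    `original_basename`. The newer name is checked first.
    """
    for key in ("_original_basename", "original_basename"):
        if key in task_args:
            return task_args[key]
    return None
```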
We must check in the generated fact `_disabled_ceph_mgr_modules` to
enable disabled mgr modules.
Otherwise, this task will be skipped because it's not comparing the
right list.
On containerized deployment, if a mon is stopped, the socket is not
purged and can cause failure when a cluster is redeployed after the
purge playbook has been run.
Typical error:
```
fatal: [osd0]: FAILED! => {}
MSG:
'dict object' has no attribute 'osd_pool_default_pg_num'
```
The fact is not set because of this earlier failure:
tests: fix `_get_osd_id_from_host()` in TestOSDs()
We must initialize the `children` variable in `_get_osd_id_from_host()`.
Otherwise, if for any reason the deployment has failed and resulted in
an osd host with no OSD registered, we won't enter the condition,
so `children` is never set and the function tries to return
something undefined.
Typical error:
```
E UnboundLocalError: local variable 'children' referenced before assignment
```
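The fix amounts to giving `children` a value before the loop. The sketch below is a simplified stand-in for the test helper, not its actual code: the function name, its signature, and the assumed OSD tree layout (a `nodes` list whose entries carry `name`, `type` and `children`) are illustrative.

```python
def get_osd_ids_from_host(hostname, osd_tree):
    """Return the OSD IDs registered under a host in an OSD tree dump."""
    children = []  # initialised up front so the return is always valid
    for node in osd_tree["nodes"]:
        if node["name"] == hostname and node["type"] == "host":
            children = node["children"]
    return children
```

Without the first assignment, a host that never matches inside the loop leaves `children` unbound and the `return` raises the `UnboundLocalError` shown above.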
Remove zone from zonegroup and update period before deleting the zone to avoid inconsistent period information across other zones.
When you delete a zone without removing it from the zonegroup, the period update would
fail since that command needs to load the zone and zonegroup to be able to
update the master. Period update would fail with an error like this:
radosgw-admin period update --commit
-1 Cannot find zone id= (name=), switching to local zonegroup configuration
-1 Cannot find zone id= (name=)
This part changed in mimic and became:
```
"servicemap": {
  "epoch": 2,
  "modified": "2018-07-04 09:54:36.164786",
  "services": {
    "rbd-mirror": {
      "daemons": {
        "summary": "",
        "14151": {
          "start_epoch": 2,
          "start_stamp": "2018-07-04 09:54:35.541272",
          "gid": 14151,
          "addr": "192.168.1.80:0/240942528",
          "metadata": {
            "arch": "x86_64",
            "ceph_release": "mimic",
            "ceph_version": "ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)",
            "ceph_version_short": "13.2.0",
            "cpu": "Intel(R) Xeon(R) CPU X5650 @ 2.67GHz",
            "distro": "centos",
            "distro_description": "CentOS Linux 7 (Core)",
            "distro_version": "7",
            "hostname": "ceph-rbd-mirror0",
            "id": "ceph-rbd-mirror0",
            "instance_id": "14151",
            "kernel_description": "#1 SMP Wed May 9 18:05:47 UTC 2018",
            "kernel_version": "3.10.0-862.2.3.el7.x86_64",
            "mem_swap_kb": "1572860",
            "mem_total_kb": "1015548",
            "os": "Linux"
          }
        }
      }
    }
  }
}
```
This patch modifies the function `test_rbd_mirror_is_up()` in
`test_rbd_mirror.py` so it works with `mimic` and keeps backward compatibility
with `luminous`
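Reading daemon hostnames out of the mimic layout shown above could look like the sketch below. It is not the code of `test_rbd_mirror_is_up()`: the helper name is hypothetical, it assumes the status has already been parsed into a dict, and it covers only the mimic structure since the luminous one is not shown here.

```python
def rbd_mirror_hostnames(status):
    """Collect the hostnames of the rbd-mirror daemons in a status dict."""
    daemons = status["servicemap"]["services"]["rbd-mirror"]["daemons"]
    hostnames = []
    for key, daemon in daemons.items():
        if key == "summary":  # plain string entry, not a daemon
            continue
        # In mimic the key is a numeric id; the hostname lives in metadata.
        hostnames.append(daemon["metadata"]["hostname"])
    return hostnames
```

A test can then assert that the expected host appears in the returned list instead of looking it up as a key of `daemons`.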
Sébastien Han [Thu, 5 Jul 2018 12:10:33 +0000 (14:10 +0200)]
ceph-config: do not log cluster log on container
The container image recently merged both cluster and mon log into a
single stream. Following this, we now see this warning coming from the
container image:
2018-06-19 13:44:01.542990 7ff75b024700 1 mon.vm02@1(peon).log
v57928205 unable to write to '/var/log/ceph/ceph.log' for channel
'cluster': (2) No such file or directory
So we now tell the mon to not log cluster log on the filesystem.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591771 Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Fri, 29 Jun 2018 10:10:16 +0000 (12:10 +0200)]
ceph-client: do not kill the dummy container
The container runs for 300 sec, then dies and removes itself thanks to
the '--rm' option, so there is no point in removing it. Also this is
causing failure under some circumstances.
Closing: https://bugzilla.redhat.com/show_bug.cgi?id=1568157 Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han [Thu, 28 Jun 2018 07:54:24 +0000 (09:54 +0200)]
ceph-osd: trigger osd container restart on script change
The script ceph-osd-run.sh holds the config options to start the
container, if one of these options is modified we must restart the
container. This was not the case before because the 'notify' flag
wasn't present.
Closing: https://bugzilla.redhat.com/show_bug.cgi?id=1596061 Signed-off-by: Sébastien Han <seb@redhat.com>