Sébastien Han [Tue, 2 Oct 2018 21:54:57 +0000 (23:54 +0200)]
osd: remove ceph-disk support
We no longer support preparing OSDs with ceph-disk; only ceph-volume is
supported. However, starting existing OSDs is still supported.
So if you change a config option, the handlers will still be able to
restart all the OSDs via their respective systemd unit files.
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e2a5aa062eae90b154e98c3c5f6d6a427c28bf97)
We don't need to have multiple ceph-override.json copies. We
already have a symlink to all_daemons/ceph-override.json so
we can do it for all scenarios.
We don't need to use the cephfs variable for the application pool
name because it's always cephfs.
If the cephfs variable is set to something other than the default
value it will break the application pool task.
Dimitri Savineau [Tue, 26 Feb 2019 14:16:37 +0000 (09:16 -0500)]
rgw: change default frontend on nautilus
As discussed in ceph/ceph#26599, beast is now the default frontend
for rados gateway with nautilus release.
Add rgw_thread_pool_size variable with 512 as default value and keep
backward compatibility with num_threads option when using civetweb.
Update radosgw_civetweb_num_threads to reflect rgw_thread_pool_size
change.
Matthew Vernon [Wed, 27 Mar 2019 13:34:47 +0000 (13:34 +0000)]
UCA: Uncomment UCA variables in defaults, fix consequent breakage
The Ubuntu Cloud Archive-related (UCA) defaults in
roles/ceph-defaults/defaults/main.yml were commented out, which means
if you set `ceph_repository` to "uca", you get undefined variable
errors, e.g.
```
The task includes an option with an undefined variable. The error was: 'ceph_stable_repo_uca' is undefined
The error appears to have been in '/nfs/users/nfs_m/mv3/software/ceph-ansible/roles/ceph-common/tasks/installs/debian_uca_repository.yml': line 6, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: add ubuntu cloud archive repository
^ here
```
Unfortunately, uncommenting these causes other breakage: some roles
were written to treat `ceph_stable_release_uca` being defined as a
proxy for "we're using UCA", so they try to install packages from,
for example, the bionic-updates/queens release, which doesn't work.
So there are a few `apt` tasks that need modifying to not use
`ceph_stable_release_uca` unless `ceph_origin` is `repository` and
`ceph_repository` is `uca`.
Dimitri Savineau [Fri, 15 Mar 2019 14:18:48 +0000 (10:18 -0400)]
rolling_update: Remove ceph aliases
ceph aliases were introduced in stable-3.2 during the ceph
deployment. On master they have been removed but we don't handle
this removal in the upgrade from stable-3.2 to master via the
rolling_update playbook.
Also remove the task from purge-docker-cluster missing from d9e7835
When using monitor_address_block or radosgw_address_block variables
to configure the mon/rgw address we're getting the first ip address
from the ansible facts present in that cidr.
When there's a VIP on that network the first filter could return the
wrong value.
This seems to affect only IPv6 setups because the VIP addresses are
added to the ansible facts at the beginning of the list. With IPv4
it is the opposite: they are added at the end.
This causes the mon/rgw processes to bind on the VIP address.
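To illustrate the failure mode only (the fix itself isn't described in
this message), here is a minimal Python sketch of "take the first fact
address inside the cidr". The function name and the sample addresses are
hypothetical; the real selection is done with ansible filters.

```python
import ipaddress


def addresses_in_block(fact_addresses, cidr):
    """Return the fact addresses that fall inside cidr, keeping fact order."""
    network = ipaddress.ip_network(cidr)
    return [a for a in fact_addresses if ipaddress.ip_address(a) in network]


# Hypothetical IPv6 facts: the VIP is listed before the node's own address.
ipv6_facts = ["2001:db8::100", "2001:db8::11", "fe80::1"]
candidates = addresses_in_block(ipv6_facts, "2001:db8::/64")

# Taking the first match picks the VIP, not the node address.
print(candidates[0])  # 2001:db8::100
```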
François Lafont [Sat, 6 Apr 2019 09:44:03 +0000 (11:44 +0200)]
ceph-rgw: Fix bad paths which depend on the clustername
The path of the RGW environment file (in the /var/lib/ceph/radosgw/
directory) depends on the Ceph clustername. It was not taken into
account in the Ansible role `ceph-rgw`.
Check ceph_health_raw.stdout value as string during mon bootstrap
According to rdo testing https://review.rdoproject.org/r/#/c/18721
a check on the output of the ceph_health value is added to
allow the playbook to make several attempts (according to the
retry/delay variables) when waiting for the cluster quorum or
when the container bootstrap has not finished.
It avoids the failure of the command execution when it doesn't
receive a valid json object to decode (because the cluster is too
slow to bootstrap compared to ceph-ansible task execution).
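The idea can be sketched in Python as follows. This is only an
illustration of "check the raw string before decoding, and retry": the
function name, the retry/delay defaults and the "state" key in the sample
output are assumptions, not the actual task.

```python
import json
import time


def wait_for_health(run_health_cmd, retries=5, delay=0.0):
    """Call run_health_cmd until its output decodes as JSON, or give up."""
    for attempt in range(1, retries + 1):
        stdout = run_health_cmd()
        # Look at the raw string first: empty output means "not ready yet".
        if isinstance(stdout, str) and stdout.strip():
            try:
                return json.loads(stdout)
            except ValueError:
                pass  # output is not valid JSON yet, try again
        if attempt < retries:
            time.sleep(delay)
    raise RuntimeError("no valid JSON after %d attempts" % retries)


# Simulated command: two empty results, then a JSON document.
outputs = iter(["", "", '{"state": "leader"}'])
status = wait_for_health(lambda: next(outputs))
print(status["state"])  # leader
```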
In containerized deployments the default radosgw quota is too low
for production environments.
This is causing performance degradation compared to bare-metal.
Since https://github.com/ceph/ceph/commit/77912c0 ceph-volume uses
stdout encoding based on LC_CTYPE and PYTHONIOENCODING environment
variables.
Those variables aren't set when using ansible.
Currently this commit breaks non containerized deployment on Ubuntu.
TASK [use ceph-volume to create bluestore osds] ********************
cmd:
- ceph-volume
- --cluster
- ceph
- lvm
- create
- --bluestore
- --data
- /dev/sdb
rc: 1
stderr: |-
Traceback (most recent call last):
(...)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
position 132: ordinal not in range(128)
Note that the task is failing on ansible side due to the stdout
decoding but the osd creation is successful.
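The decoding error itself is easy to reproduce in plain Python. The byte
string below is a made-up stand-in for ceph-volume output containing a
UTF-8 encoded character (first byte 0xe2); it is not the real output.

```python
# Made-up output containing UTF-8 encoded curly quotes (U+201C / U+201D).
raw = b"created osd \xe2\x80\x9c0\xe2\x80\x9d"

# Decoding as ASCII fails on the first non-ASCII byte, as in the traceback.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding the same bytes as UTF-8 succeeds.
text = raw.decode("utf-8")

print(error)
print(len(text))  # 15
```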
Dimitri Savineau [Wed, 27 Mar 2019 18:11:20 +0000 (14:11 -0400)]
container: Add python3-docker on Ubuntu bionic
Installing python-minimal on Ubuntu bionic adds the
/usr/bin/python symlink to the default python interpreter.
On bionic, this isn't python2 but python3.
$ /usr/bin/python --version
Python 3.6.7
The python docker library is only installed for python2, which causes
issues when running the purge-docker-cluster playbook. This playbook
uses the ansible docker modules and requires the python bindings to be
installed on the remote host.
Without the bindings the docker module reports a python error.
msg: Failed to import docker or docker-py - No module named 'docker'.
Try `pip install docker` or `pip install docker-py` (Python 2.6)
Dimitri Savineau [Fri, 22 Mar 2019 19:03:15 +0000 (15:03 -0400)]
Add uca to ceph_repository choices validation
Ubuntu cloud archive is configurable via ceph_repository variable but
the uca choice isn't accepted.
This commit fixes this issue and also validates the associated uca
repository variables.
Sometimes those tasks fail because of a timeout.
I've hit this several times in the CI; adding this retry might
help and won't hurt in any case.
update: add containerized deployment upgrade support (L->N)
Add a couple of fixes so that containerized deployments can be
upgraded from luminous/mimic to nautilus.
- pass CEPH_CONTAINER_IMAGE and CEPH_CONTAINER_BINARY environment
variables to the ceph_key module,
- fix the docker exec command in 'waiting for the containerized monitor
to join the quorum' task according to the `delegate_to` parameter,
- override `docker_exec_cmd` in `ceph-facts` with `mon_host` when
rolling_update is `True`,
- do not run `create_mds_filesystems.yml` unnecessarily when performing an
upgrade.
Once the cluster is upgraded to nautilus, we can complete the process by
disallowing pre-nautilus OSDs and enabling all new nautilus-only functionality.
update: ensure mgrs are upgraded after ALL monitors
As of 1c760904b0bd1b6b0f49d6ac19d87d79f185c18f, ceph-ansible implicitly
bootstraps managers on monitors.
mgrs must be upgraded only after all monitors; therefore, this commit
refactors the way mgrs are upgraded to be sure we don't upgrade a mgr
during the monitors upgrade.
This commit also ensures we handle the case where managers are split
onto dedicated nodes.
update: ensure /var/lib/ceph/bootstrap-rbd-mirror is present
This directory is created by ceph-config node by node.
In the upgrade context we need it to be created on ALL monitors from the
first iteration, because the task right after it creates and sends
the keyrings to all monitors.
This prevents the packaging from restarting services before we need
to restart them in the rolling update sequence.
We want to handle service restarts in the rolling_update playbook.
update: remove an old parameter in ceph_key module call
The `containerized` parameter in the ceph_key module doesn't exist anymore.
This was making the module fail, but the failure was hidden because of
`ignore_errors: True`.
ceph_key: `lookup_ceph_initial_entities` shouldn't fail on update
As of nautilus, the initial keyrings list has changed. This means that
when upgrading from Luminous or Mimic, a mismatch is expected
between what is found on the cluster and the expected initial keyring
list hardcoded in the ceph_key module. We shouldn't fail when upgrading
to nautilus.
handlers: do not trigger handlers on rolling_update
The rolling_update playbook already takes care of stopping/starting
services during the sequence. There's no need to trigger potentially
unwanted service restarts.
Dimitri Savineau [Wed, 20 Mar 2019 19:30:46 +0000 (15:30 -0400)]
ceph-osd: Ensure lvm2 is installed
When using osd_scenario lvm, we never check if the lvm2 package is
present on the host.
When using containerized deployment and docker on CentOS/RedHat this
package will be automatically installed as a dependency, but not on
Ubuntu.
OSDs deployed via ceph-volume require the lvmetad.socket to be active
and running.
osd: backward compatibility with old disk_list.sh location
Since all files in the container image have moved to `/opt/ceph-container`
this check must look for the new AND the old path so it's backward
compatible. Otherwise it could end up templating an inconsistent
`ceph-osd-run.sh`.
Dimitri Savineau [Thu, 14 Mar 2019 20:22:01 +0000 (16:22 -0400)]
ceph-validate: fail if there's no ipaddr available in monitor_address_block subnet
When using monitor_address_block to determine the ip address of the
monitor node, we need an ip address available in that cidr to be
present in the ansible facts (ansible_all_ipv[46]_addresses).
Currently we don't check if there's an ip address available during
the ceph-validate role.
As a result, the ceph-config role fails due to an empty list during
ceph.conf template creation but the error isn't explicit.
TASK [ceph-config : generate ceph.conf configuration file] *****
fatal: [0]: FAILED! => {"msg": "No first item, sequence was empty."}
With this patch we will fail before the ceph deployment with an
explicit failure message.
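As a rough Python sketch of the check (the real validation lives in the
ceph-validate role, not in Python; the function name and error text here
are hypothetical), failing early with an explicit message looks like:

```python
import ipaddress


def first_address_in_block(fact_addresses, cidr):
    """Return the first fact address inside cidr, or fail with a clear error."""
    network = ipaddress.ip_network(cidr)
    matches = [a for a in fact_addresses if ipaddress.ip_address(a) in network]
    if not matches:
        # Explicit message instead of "No first item, sequence was empty."
        raise ValueError(
            "no ip address found in %s among %s" % (cidr, fact_addresses)
        )
    return matches[0]


print(first_address_in_block(["192.168.1.10", "10.0.0.5"], "10.0.0.0/24"))  # 10.0.0.5
```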
Dimitri Savineau [Fri, 15 Mar 2019 15:30:15 +0000 (11:30 -0400)]
ceph-common: Install yum plugin priorities
When using community repository we need to set the priority on the
ceph repositories because we could have some conflict with EPEL
packages.
In order to set the priority on the ceph repositories, we need to
install the yum-plugin-priorities package.
Rishabh Dave [Mon, 11 Mar 2019 10:49:32 +0000 (16:19 +0530)]
don't append path components while calling os.path.join()
This creates confusion about whether directory/file names are being
formed by appending strings or path components are being appended.
Since the latter should never be done manually, get rid of the statements
creating the confusion.
Rishabh Dave [Mon, 11 Mar 2019 10:20:08 +0000 (15:50 +0530)]
use os.path.join() correctly
os.path.join adds the separator (i.e. '/') between the provided path
components only if needed. Providing a single path component doesn't
lead to any checks.
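A short illustration of the behaviour described above, assuming a POSIX
system (the paths are examples only, not the ones touched by the commit):

```python
import os.path

base = "/var/lib/ceph"

# A single component comes back unchanged: there is nothing to join or check.
single = os.path.join(base)

# The separator is added only when the left-hand component lacks one.
joined = os.path.join(base, "osd")
joined_trailing = os.path.join(base + "/", "osd")

# Build a *name* with string operations, then pass whole components to join().
osd_dir = os.path.join(base, "osd", "ceph-" + str(0))

print(single)           # /var/lib/ceph
print(joined)           # /var/lib/ceph/osd
print(joined_trailing)  # /var/lib/ceph/osd
print(osd_dir)          # /var/lib/ceph/osd/ceph-0
```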
Currently the default crush rule value is added to the ceph config
on the mon nodes as an extra configuration applied after the template
generation via the ansible ini module.
This implies two behaviors:
1/ On each ceph-ansible run, the ceph.conf will be regenerated via
ceph-config+template and then ceph-mon+ini_file. This leads to an
unnecessary daemon restart.
2/ When other ceph daemons are collocated on the monitor nodes
(like mgr or rgw), the default crush rule value will be erased by
the ceph.conf template (mon -> mgr -> rgw).
This patch adds the osd_pool_default_crush_rule config to the ceph
template and only for the monitor nodes (like crush_rules.yml).
The default crush rule id is read (if it exists) from the current ceph
configuration.
The default configuration is -1 (ceph default).
Dimitri Savineau [Mon, 11 Mar 2019 14:44:47 +0000 (10:44 -0400)]
ceph-osd: Install numactl package when needed
With 3e32dce we can run OSD containers with numactl support.
When using numactl command in a containerized deployment we need to
be sure that the corresponding package is installed on the host.
The package installation is only executed when the
ceph_osd_numactl_opts variable isn't empty.
I suspect `./generate_group_vars_sample.sh` wasn't used in b8d580b3f48c69ba9882df773c4d144b73d01c95 because it introduced a typo in
`group_vars/all.yml.sample` and `group_vars/clients.yml.sample`.
We don't need to set After=docker.service when the container_binary
variable isn't set to docker.
It doesn't break anything currently but it could be confusing when
using podman.
Instead of using subscription-manager with command module we can use
the rhsm_repository ansible module.
This module already uses repos list feature to determine if a
repository is enabled or not. That way this module is idempotent so
we don't need changed_when: false anymore.
Because the client name is part of the client key path we can reuse
the user variable to build this path.
Also remove a duplicate user variable declaration.
Because we're still using Linux distributions with python 2.7 (like
CentOS/RHEL 7) it could be useful to run travis tests against python
2.7 even though its support ends in 2020.
The ceph stable community repository only enables the basearch
packages url.
Add the noarch url because, starting with the nautilus release, some
packages that are useful for mgr or grafana are published there.
After b8d580b and e9e5d5a we could have either item.min_size or
osd_pool_default_min_size using string instead of int causing the
condition to be true when it's false.
As a result, the task could try to set the pool min_size value to
0 which leads to:
Error EINVAL: pool min_size must be between 1 and 1
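The exact condition isn't reproduced in this message, but the general
string-vs-int pitfall can be shown in Python (the variable name and value
below are hypothetical):

```python
pool_min_size = "0"  # hypothetical value that arrived as a string

# Compared as-is, the string never equals the integer, so the check passes.
compared_raw = pool_min_size != 0

# A non-empty string is also truthy, unlike the integer 0.
truthy_raw = bool(pool_min_size)

# Casting to int first gives the intended result.
compared_int = int(pool_min_size) != 0

print(compared_raw)  # True
print(truthy_raw)    # True
print(compared_int)  # False
```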
Otherwise a bunch of jobs will fail like the following:
```
[WARNING]: Unable to parse /home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-ubuntu-container-stable-3.2-bluestore_lvm_osds/tests/functional/bs-lvm-osds/container/hosts-ubuntu as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
```
As of testinfra 2.0.0, the binary name is `py.test`.
But let's pin the version to 1.19.0.
Indeed, migrating to 2.0.0 requires our current testing to be reworked a bit.
Since we don't have the bandwidth ATM for this, it's better to simply
keep testing with testinfra 1.19.0.
Note that I've replaced all `testinfra` occurrences by `py.test` anyway.
Removing the content of this directory seems a bit aggressive and causes a
redeployment to fail after a purge on debian based distributions.
Typical error:
```
fatal: [mon0]: FAILED! => changed=false
  attempts: 3
  msg: No package matching 'ceph' is available
```
The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
  apt:
    update_cache: yes
    cache_valid_time: 3600
  register: result
  until: result is succeeded
```
Since the task installing ceph packages has `update_cache: no`, it
fails:
```
- name: install ceph for debian
  apt:
    name: "{{ debian_ceph_pkgs | unique }}"
    update_cache: no
    state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
    default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
  register: result
  until: result is succeeded
```
/tmp/* isn't specific to ceph either, so we shouldn't remove everything
in this directory.
Using the `shell` module seems to be the only way to make this task work
on rhel based distributions AND debian based distributions.
On ubuntu, using the `command` ansible module fails as follows
(whether or not `sudo` is used):
```
ok: [osd1] => changed=false
  cmd: command -v ceph-volume
  failed_when_result: false
  msg: '[Errno 2] No such file or directory: ''command'': ''command'''
  rc: 2
```
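The same difference can be reproduced with Python's subprocess module:
`command` is a shell builtin, so executing it directly only works on hosts
that also ship a standalone `command` executable, while going through a
shell always works. This is a sketch of the behaviour, not of the ansible
modules themselves; `sh` is used as the looked-up program only because it
is present everywhere.

```python
import subprocess

# Direct execution (no shell): fails with ENOENT when no standalone
# `command` binary exists on the host.
try:
    subprocess.run(["command", "-v", "sh"], stdout=subprocess.DEVNULL)
    direct = "ran"
except FileNotFoundError:
    direct = "missing"

# Through a shell, `command` is resolved as a builtin.
via_shell = subprocess.run("command -v sh", shell=True, stdout=subprocess.PIPE)

print(direct)
print(via_shell.returncode)  # 0
```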