Matthew Vernon [Wed, 10 Mar 2021 16:36:52 +0000 (16:36 +0000)]
ceph-osd: add prepare_osd tag to lvm-batch scenario
Sometimes it's useful to be able to skip the OSD creation step when
running ceph-ansible (cf #1777). The lvm scenario has a prepare_osd
tag on the relevant play. This commit adds the same tag to the
lvm-batch scenario.
Alex Schultz [Wed, 3 Mar 2021 14:43:50 +0000 (07:43 -0700)]
Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant. In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.
Related: ansible#73654 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406 Signed-off-by: Alex Schultz <aschultz@redhat.com>
Matthew Vernon [Mon, 22 Feb 2021 14:26:10 +0000 (14:26 +0000)]
Fix typo and broken link for documenting RGW frontends
http://docs.ceph.com/docs/nautilus/radosgw/frontends/ 404s so replace
it with a working "latest" docs link, and correct the spelling of
"additional" while I'm at it.
Florian Haas [Fri, 12 Feb 2021 08:29:00 +0000 (09:29 +0100)]
requirements.txt: Move the six dependency into the general requirements
config_template.py depends on six, which isn't listed in the default
requirements.txt. This previously frequently wasn't a problem, because
six used to be a standard package being installed into a venv, and
lots of other projects depended on it.
It also does get installed for unit and integration tests via
tests/requirements.txt, so any broken dependency on six wouldn't be
detected by tox runs.
However, as other projects and distributions have phased out Python
2.7 support the dependency on six becomes less common. Thus, as long
as ceph-ansible does require it for config_template.py, add it to the
base requirements.
When asking `ceph-volume` to report only in `lvm batch` context, there's
a bug described in bz1896803 [1] when `--yes` is passed (which by the
way isn't necessary with `--report`).
This commit ensure `--yes` isn't passed to `ceph-volume` when `--report`
is used.
switch2container: do not serialize the ceph-crash migration
There's no need to slow down the playbook execution time by migrating
all the `ceph-crash` instances in a serial way. Let's remove the
`serial: 1` so the migration is achieved in a parallel way.
we aren't deploying enough OSD daemon, so it fails like following:
```
stderr: 'Error ERANGE: pool id 10 pg_num 256 size 2 would mean 1536 total pgs, which exceeds max 1500 (mon_max_pg_per_osd 250 * num_in_osds 6)'
```
Let's increase the value of `mon_max_pg_per_osd` in order to get around
this issue in the CI.
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.
ceph-common: enable rhcs tools repo for monitoring
The monitoring node running grafana needs the rhcs tools repostory
enabled in non containerized deployment to be able to install the
ceph-grafana-dashboards rpm package.
Since eefe11d the grafana-server group has been renamed to monitoring
but the dashboard playbook wasn't updated.
This was still working due to the backward compatibility added in the
ceph-facts role.
The CentOS 8 vagrant box has finally been updated [1] with a recent
version (the latest one 2011 which means CentOS 8.3).
We don't need to download the vagrant libvirt box with a direct url
anymore from the CentOS infrastructure.
Due to recent changes in shaman, there's a chance it returns the wrong
repository from architecture point of view.
We can query shaman and ask for the correct architecture to get around
this.
Dimitri Savineau [Thu, 21 Jan 2021 17:12:17 +0000 (12:12 -0500)]
cephadm-adopt: use ceph_osd_flag module
There's no reason to not use the ceph_osd_flag module to set/unset osd
flags.
Also if there's no OSD nodes in the inventory then we don't need to
execute the set/unset play.
Dimitri Savineau [Fri, 22 Jan 2021 17:45:32 +0000 (12:45 -0500)]
library: retrieve realm id for zone/zonegroup
When the zonegroup or the zone doesn't have a realm associated then
it's not possible to modify that ressource.
This patch allows to retrieve the current realm id and compare it to
the realm id from the realm in parameter.
Dimitri Savineau [Thu, 21 Jan 2021 22:42:33 +0000 (17:42 -0500)]
cephadm-adopt: use radosgw modules for idempotency
When rerunning the cephadm-adopt.yml playbook the radosgw realm,
zonegroup and zone tasks will fail because the task isn't
idempotent.
Using the radosgw ansible modules solves that problem.
Dimitri Savineau [Wed, 20 Jan 2021 22:39:44 +0000 (17:39 -0500)]
cephadm-adopt: make the playbook idempotent
If the cephadm-adopt.yml fails during the first execution and some
daemons have already been adopted by cephadm then we can't rerun
the playbook because the old container won't exist anymore.
Error: no container with name or ID ceph-mon-xxx found: no such container
If the daemons are adopted then the old systemd unit doesn't exist anymore
so any call to that unit with systemd will fail.
Dimitri Savineau [Wed, 13 Jan 2021 15:17:56 +0000 (10:17 -0500)]
ceph-mon: add ExecStartPre docker stop to systemd
We already do that in the other systemd templates (mgr, mds, etc..)
and would present to add workaround in other orchestration tool.
This change is for containerized deployment only.
Add the possibility to deploy rgw multisite configuration with a mix of
secondary and primary zones on a same rgw node.
Before that, on a same node, all instances were either primary
zones *OR* secondary.
Basically it's now possible to define `rgw_zonemaster`,
`rgw_zonesecondary` and `rgw_zonegroupmaster` at the intsance
level instead of the whole node level.
Also, this commit adds an option `deploy_secondary_zones` (default True)
which can be set to `False` in order to explicitly ask the playbook to
not deploy secondary zones in case where the corresponding endpoint are
not deployed yet.
If for some reason `get_zonegroup()` returns a failure, we must handle
and make the module exit properly instead of failing with the following
python trace:
```
Traceback (most recent call last):
File "./AnsiballZ_radosgw_zone.py", line 247, in <module>
_ansiballz_main()
File "./AnsiballZ_radosgw_zone.py", line 234, in _ansiballz_main
exitcode = debug(sys.argv[1], zipped_mod, ANSIBALLZ_PARAMS)
File "./AnsiballZ_radosgw_zone.py", line 202, in debug
runpy.run_module(mod_name='ansible.modules.radosgw_zone', init_globals=None, run_name='__main__', alter_sys=True)
File "/usr/lib64/python3.6/runpy.py", line 205, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/vagrant/.ansible/tmp/ansible-tmp-1610728441.41-685133-218973990589597/debug_dir/ansible/modules/radosgw_zone.py", line 467, in <module>
main()
File "/home/vagrant/.ansible/tmp/ansible-tmp-1610728441.41-685133-218973990589597/debug_dir/ansible/modules/radosgw_zone.py", line 463, in main
run_module()
File "/home/vagrant/.ansible/tmp/ansible-tmp-1610728441.41-685133-218973990589597/debug_dir/ansible/modules/radosgw_zone.py", line 425, in run_module
zonegroup = json.loads(_out)
File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Dimitri Savineau [Fri, 22 Jan 2021 15:01:10 +0000 (10:01 -0500)]
ceph-defaults: change default ceph container tag
The "latest" ceph container tag references the latest stable release
(octopus at the moment). "latest" is an alias on "latest-octopus".
On the devel branch we should use "latest-master" tag instead.
Dimitri Savineau [Thu, 14 Jan 2021 02:11:39 +0000 (21:11 -0500)]
module_utils: don't add newline to the data
When executing a command via the run_command method and passing some
data with stdin then the default behavior is to add append a newline.
This breaks the value of password used by our modules.
fs2bs: skip migration when a mix of fs and bs is detected
Since the default of `osd_objectstore` has changed as of 3.2, some
deployments might have a mix of filestore and bluestore OSDs on a same
node. In some specific cases, there's a possibility that a filestore OSD
shares a journal/db device with a bluestore OSD. We shouldn't try to
redeploy in this context because ceph-volume will complain. (either
because in lvm batch you can't pass partition or about gpt header).
The safest option is to skip the migration on the node when such a mix
is detected or force all osds including those already using bluestore
(option `force_filestore_to_bluestore=True` has to be passed as an extra var).
If all OSDs are using filestore, then they will be migrated to
bluestore.
This commit checks the length of `virtual_ips` doesn't exceed the length
of `groups[rgwloadbalancer_group_name]`.
It also ensure this variable is defined when
`groups[rgwloadbalancer_group_name]` contains at least one node.
While 2ca33641 fixed a bug in the way the `keepalived.conf.j2` template matched
hostnames to set the VRRP `MASTER`/`BACKUP` states, it also introduced a
regression in the case where `virtual_ips` is a list of more than one IP
address.
The previous behavior would result in each host in the `rgwloadbalancers` group
to be `MASTER` for one of the `virtual_ips`, but the new behavior caused the
first host to be `MASTER` for all the IP address in `virtual_ips`.
The current check makes no sense because it checks any of other monitor
than the one being played (either a previous one already converted or a
next that isn't yet converted) is present on the quorum.
rgw: support switching from single-site to multisite
When collocating rgw with either a mon, mgr or osd, switching from
single site to a multisite rgw setup failed because of the handlers
triggered between the ansible play of the collocated daemon and the play
of the rgw. Since the multisite changes are not yet applied the handlers
fail.
The idea here is to ensure we run the multisite configuration from the
ceph-handler role before the restart happens, this way it won't complain
because of non existing multisite configuration.
(Note: this is also valid when simply changing a multisite configuration)
Dimitri Savineau [Fri, 18 Dec 2020 15:25:54 +0000 (10:25 -0500)]
library: remove containerized parameter from cv
The ceph-volume module relies on environment variables to determine if
the command should be executed within a container or not.
The containerized parameter isn't used anymore and we can remove it.
Fabien Brachere [Wed, 16 Dec 2020 06:33:36 +0000 (07:33 +0100)]
library: add missing `target_size_ratio` parameter support in ceph_pool module
When creating a new pool, target_size_ratio was ignored by ansible module ceph_pool.py.
target_size_ratio is now used when pg_autoscale_mode is on.
Tests added to library tests.
This adds too the use in the role ceph-rgw.
Dimitri Savineau [Tue, 15 Dec 2020 18:52:43 +0000 (13:52 -0500)]
ceph-config: fix ceph-volume lvm batch report
Since the major ceph-volume lvm batch refactoring, the report value
is different.
Before the refact, the report was a dict with the OSDs list to be created
under the "osds" key.
After the refact, the report is a list of dict.
This avoids interactive mode for `vagrant box remove`.
This can happen for some reason when there's leftover from previous
deployment (VMs not destroyed as expected)
Currently we create an object from the primary sites but we try to read
that object still from the master which doesn't make sense, we should
try to read it from a secondary site.
This breaks the backward compatibility with previous osd_memory_target
calculation and we could have a value lower than the minimum value allowed
(896M) which causes some ceph commands to fail (like ceph assimilate-conf).
Dimitri Savineau [Fri, 11 Dec 2020 18:07:04 +0000 (13:07 -0500)]
monitoring: use config_template module for config
The alertmanager, grafana and prometheus configuration file are
generated with the template module which doesn't allow for using
config overrides.
Instead we could use the config_template plugin action and add a
new variable for overrides (one for each component).
With this patch, one should be able to add configuration to
prometheus with the following:
The conversion fact task was only executed when the grafana_server_group_name
variable was explicitly set in the user configuration. If an user was using
the default value then the conversion wasn't executed.
This also adds back the default grafana_server_group_name value in case user
was using the default value and to avoid undefined variable error.
Instead of hardcoding the "monitoring" group name then we can reuse the
monitoring_group_name variable.
There's no need to override the monitoring_group_name variable, it's either
using the default value or the one defined by the user.
Finally removing the delegate_to statement on the add_host task since it's
always executed on the ansible controller.
ceph-mon: No become during gen mon initial keyring
Since the backing generate_secret() just hands out urandom output,
running as privileged doesn't seem to be required. It's not
desireable to provide sudo in some Ansible runner environments.
Signed-off-by: Jukka Nousiainen <jukka.nousiainen@csc.fi>
Dimitri Savineau [Wed, 18 Nov 2020 22:20:45 +0000 (17:20 -0500)]
consume ceph_volume module when possible
We should always use the ceph_volume ansible module when possible.
This patch replace the ceph-volume inventory and lvm {list,zap} commands
called via the command/shell modules by the corresponding call with the
ceph_volume module.
This adds ceph_crush_rule ansible module for replacing the command
module usage with the ceph osd crush rule commands.
This module can manage both erasure and replicated crush rules.
container: remove `--ignore` from `podman rm` command
As of podman 2.0.5, `--ignore` param conflicts with `--storage`.
```
Nov 30 13:53:10 magna089 podman[164443]: Error: --storage conflicts with --volumes, --all, --latest, --ignore and --cidfile
```