container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.
ceph-facts: only get fsid when monitor are present
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).
This changes the grafana container image regitry from docker.io to
quay.io to avoid rate limit.
This also adds the missing container image values for docker2podman
and podman scenarios.
Add --cluster option on ceph require-osd-release command
On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.
Closes: https://bugzilla.redhat.com/1876447 Signed-off-by: Francesco Pantano <fpantano@redhat.com>
Dimitri Savineau [Thu, 27 Aug 2020 13:57:35 +0000 (09:57 -0400)]
tests: reenable ceph-iscsi testing
This re-adds the ceph-iscsi testing for both non containerized and
containerized deployment since the rados connection error on ceph
dev has been fixed [1].
There's no need to use `ignore_errors: true` on these tasks.
Using a loop on the task stopping mon daemons allows us to avoid
duplicating this task, the `ignore_errors` isn't needed here because it
won't fail the playbook if one of the ID doesn't exist (shortname vs. fqdn)
Using the right condition on the task starting the mgr daemon allows us
to avoid using an `ignore_errors: true` as well.
This commit moves the erasure pool creation testing from `all_daemons`
to `lvm_osds` so we can decrease the number of osd nodes we spawn so the
OVH Jenkins slaves aren't less overwhelmed when a `all_daemons` based
scenario is being tested.
Temporarily disable iscsigw testing for containerized deployments
because it's broken upstream on ceph@master.
non-containerized deployments use stable build for iscsigw to get around
this issue.
Dimitri Savineau [Mon, 17 Aug 2020 17:55:47 +0000 (13:55 -0400)]
ceph-rgw: allow specifying crush rule on pool
We already support specifiying a custom crush rule during pool creation
in ceph-osd role but not in ceph-rgw role.
This patch adds the missing code to implement this feature.
Note this is only available for replicated pool not erasure. The rule
must also exist prior the pool creation.
Dimitri Savineau [Mon, 17 Aug 2020 18:56:17 +0000 (14:56 -0400)]
container: don't install the engine on all clients
We only need the container engine to be installed on the first clients
node in order to execute the pools/keys operation. We already do the
same worflow with the ceph-container-common role which pull the ceph
container image.
raul [Mon, 3 Aug 2020 10:58:50 +0000 (12:58 +0200)]
rgw: support 1+ rgw instance in `radosgw_frontend_port`
Change the radosgw_frontend_port to take in account more than 1 RGW instance,
in it's original form `radosgw_frontend_port: radosgw_frontend_port | int`,
it configured the 8080 port to all instances, with the following modification
`radosgw_frontend_port: radosgw_frontend_port | int + item|int` we increase in
1 the port count.
Co-authored-by: Daniel Parkes <dparkes@redhat.com> Signed-off-by: raul <rmahique@redhat.com>
nfs: do not copy rgw keyring when `nfs_obj_gw` is true
This keyring shouldn't be copied when `nfs_obj_gw` is `True` if the
cluster doesn't contain a rgw node, which can be the case given we are
using `nfs_obj_gw` instead of `nfs_file_gw` (cephfs vs. object), the
deployment will fail trying to copy a key that doesn't exist.
When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.
This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.
PytestUnknownMarkWarning: Unknown pytest.mark.ceph_crash - is this a typo?
You can register custom marks to avoid this warning - for details,
see https://docs.pytest.org/en/latest/mark.html
Kevin Coakley [Mon, 3 Aug 2020 17:03:34 +0000 (10:03 -0700)]
Remove ceph-radosgw.target when switching to containerize daemons
The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.
When using TLS on the ceph dashboard or grafana services, we can provide
the TLS certificate and key.
Those files should be present on the ansible controller and they will be
copyied to the right node(s).
In some situation, the TLS certificate and key could be already present
on the target node and not on the ansible controller.
For this scenario, we just need to copy the files locally (on each remote
host).
This patch adds the dashboard_tls_external variable (with default to
false) to allow users to achieve this scenario when configuring this
variable to true.
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.
Due to [1], ceph-volume has now a dependency on pyyaml but it's not
installed by default via the package dependency.
This patch only add the required package on non containerized
deployment and as temporary workaround for the CI.
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.
During the daemon upgrade we're
- stopping the service when it's not containerized
- running the daemon role
- start the service when it's not containerized
- restart the service when it's containerized
This implementation has multiple issue.
1/ We don't use the same service workflow when using containers
or baremetal.
2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.
3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.
This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.
Finally, this adds the missing service stop task for ceph crash upgrade
workflow.
The iscsigws restart scripts for tcmu-runner and rbd-target-{api,gw}
services only call the systemctl restart command.
We don't really need to copy a shell script to do it when we can use
the ansible service module instead.
In case of failure, the systemd ExecStop isn't executed so the container
isn't removed. After a reboot of a failed node, the container doesn't
start because the old container is still present in created state.
We should always try to remove the container in ExecStartPre for this
situation.
A normal reboot doesn't trigger this issue and this also doesn't affect
nodes running containers via docker.
This behaviour was introduced by d43769d.
Set the cephadm cmd as a fact instead of rewriting the same command
over and over.
This also fix an issue when using docker as container engine because
the --docker cephadm parameter should be use before the subcommand
not after.
The ceph-dashboard role is executed on the mgr nodes so the TLS cert/key
files are copied to those nodes.
But we are running importing the cert/key files into the ceph
configuration on the monitor.
This is a partial revert of b38019e because we don't want to execute
the whole play on the monitor otherwise if we have some empty group
like rgws or mdss then the orchestrator commands will still be
executed.
Instead we should keep the real target group name at play level and
delegate the orchestator commands to the monitor. The whole play
will be skipped is the group is empty.
Print a message at the end of the playbook to inform users that they
don't have to user ceph-ansible playbooks anymore as everything else
need to be done via cephadm (day 2 operation).
When reporting the orchestrator service/daemon list at the end of the
playbook, we can use the --refresh option otherwise we could have
an outdated output.
Like rolling_update or switch2container playbooks, we need to set/unset
some osd flags before and after the OSD daemons adoption.
This also adds a task for waiting for clean pgs at then of an OSd node.
cephadm uses default value for dashboard container images which need to
be customized by ansible for upstream or downstream purpose.
This feature wasn't present when cephadm-adopt.yml has been designed.
Also set the container_image_base variable for upgrade purpose.