the new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds given that it's done after we shrink the mds,
it means the condition is wrong. Also adding a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.
Dimitri Savineau [Mon, 13 Jan 2020 15:24:52 +0000 (10:24 -0500)]
ceph-facts: move grafana fact to dedicated file
We don't need to executed the grafana fact everytime but only during
the dashboard deployment.
Especially for ceph-grafana, ceph-prometheus and ceph-dashboard roles.
Dimitri Savineau [Fri, 10 Jan 2020 20:30:58 +0000 (15:30 -0500)]
tests/setup: update mount options on EL 8
The nobarrier mount flag doesn't exist anymoer on XFS in the EL 8
kernel. That's why the task wasn't working on those systems.
We can still use the other options instead of skipping the task.
Dimitri Savineau [Fri, 10 Jan 2020 14:31:26 +0000 (09:31 -0500)]
purge-iscsi-gateways: don't run all ceph-facts
We only need to have the container_binary fact. Because we're not
gathering the facts from all nodes then the purge fails trying to get
one of the grafana fact.
config: exclude ceph-disk prepared osds in lvm batch report
We must exclude the devices already used and prepared by ceph-disk when
doing the lvm batch report. Otherwise it fails because ceph-volume
complains about GPT header.
rolling_update: run registry auth before upgrading
There's some tasks using the new container image during the rolling
upgrade playbook that needs to execute the registry login first otherwise
the nodes won't be able to pull the container image.
Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.
We must pick up a mon which actually exists in ceph-facts in order to
detect if a cluster is running. Otherwise, it will state no cluster is
already running which will end up deploying a new monitor isolated in a
new quorum.
using `handler_*_status` instead of `hostvars[item]['handler_*_status']`
causes handlers to be triggered in anycase even though
`handler_*_status` was set to `False` on a specific node.
for instance. However, doing so would create pools of size
`osd_pool_default_size` regardless of the `size` value. This was due to
the fact that the Ansible task used
Only the ipv4 addresses from the nodes running the dashboard mgr module
were added to the trusted_ip_list configuration file on the iscsigws
nodes.
This also add the iscsi gateways with ipv6 configuration to the ceph
dashboard.
Some docker commands were hardcoded in tests playbooks and some
conditions were not taking care of the containerized_deployment
variable but only the atomic fact.
Since RHEL 8.1 we need to add the ganesha_t type to the permissive
SELinux list.
Otherwise the nfs-ganesha service won't start.
This was done on RHEL 7 previously and part of the nfs-ganesha-selinux
package on RHEL 8.
Before this patch, the lvm2 package installation was done during the
ceph-osd role.
However we were running ceph-volume command in the ceph-config role
before ceph-osd. If lvm2 wasn't installed then the ceph-volume command
fails:
error checking path "/run/lock/lvm": stat /run/lock/lvm: no such file or
directory
This wasn't visible before because lvm2 was automatically installed as
docker dependency but it's not the same for podman on CentOS 8.
This commit adds CentOS 8 support:
- update vagrant image in tox configurations.
- add CentOS 8 repository for el8 dependencies.
- CentOS 8 container engine is podman (same than RHEL 8).
- don't use the epel mirror on sepia because it's epel7 only.
Currently the keepalived template only works when system hostnames exactly match the Ansible inventory name. If these are different, all generated templates become BACKUP without a MASTER assigned. Using the inventory_hostname in the template file resolves this issue.
Signed-off-by: Stanley Lam stanleylam_604@hotmail.com
Dimitri Savineau [Mon, 16 Dec 2019 16:03:21 +0000 (11:03 -0500)]
ceph-infra: replace hardcoded grafana group name
The grafana-server group name was hardcoded for the grafana/prometheus
firewalld tasks condition.
We should we the associated variable : grafana_server_group_name
Dimitri Savineau [Mon, 16 Dec 2019 16:00:35 +0000 (11:00 -0500)]
ceph-infra: move dashboard into a dedicated file
Instead of using multiple dashboard_enabled condition in the
configure_firewall file we could just have the condition once
and include the dedicated tasks list.
Dimitri Savineau [Mon, 16 Dec 2019 15:48:26 +0000 (10:48 -0500)]
ceph-infra: open dashboard port on monitor
When there's no mgr group defined in the ansible inventory then the
mgrs are deployed implicitly on the mons nodes.
If the dashboard is enabled then we need to open the dashboard port on
the node that is running the ceph mgr process (mgr or mon).
The current code only allow to open that port on the mgr nodes when they
are present explicitly in the inventory but not implicitly.
To avoid confusion, let's change the default value from `0.0.0.0` to
`x.x.x.x`.
Users might think setting `0.0.0.0` will make the daemon binding on all
interfaces.
This commit adds a task to ensure device mappers are well closed when
lvm batch scenario is used.
Otherwise, OSDs can't be redeployed given that devices that are rejected
by ceph-volume because they are locked.
Adding a condition `devices | default([]) | length > 0` to remove these
dm only when using lvm batch scenario.
The zap action from ceph_volume module always implies `--destroy`.
This commit adds the destroy option support so we can ask ceph-volume to
not use `--destroy` when zapping a device.
Dimitri Savineau [Tue, 10 Dec 2019 21:35:34 +0000 (16:35 -0500)]
ceph-facts: set use_new_ceph_iscsi on iscsi nodes
We don't need to set the use_new_ceph_iscsi fact on other nodes than
those present in the iscsigws group.
Also remove the duplicate iscsi_gw_group_name condition already present
on the include_task.
Finally validate the ansible distribution as the first task.
ansible.cfg: do not enforce PreferredAuthentications
There's no need to enforce PreferredAuthentications by default.
Users can still choose to override the ansible.cfg with any additional
parameter like this one to fit their infrastructure.
When using fqdn in inventory, that playbook fails because of some tasks
using the result of ceph osd tree (which returns shortname) to get
some datas in hostvars[].
It doesn't make sense to start validating configuration if the ansible
version isn't the good one.
This commit moves the check_system task as the first task in the
ceph-validate role.
The ansible version test tasks are moved at the top of this file.
Also moving the iscsi kernel tests from check_system to check_iscsi
file.
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task is
just checking the daemon status is started (eg: `state: started`).
This commit also removes the task which ensure services are started
because it's already done in the role ceph-iscsigw.
when upgrading from RHCS 3, dashboard has obviously never been deployed
and it forces us to deploy it later manually.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commits adds the podman support on purging the dashboard resources
in the purge-cluster playbook.
Dimitri Savineau [Wed, 27 Nov 2019 16:27:09 +0000 (11:27 -0500)]
switch_to_containers: fix umount ceph partitions
When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.
We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0
Also we should not fail on the ceph directory listing.
Dimitri Savineau [Wed, 27 Nov 2019 14:29:06 +0000 (09:29 -0500)]
ceph-osd: wait for all osds once
cf8c6a3 moves the 'wait for all osds' task from openstack_config to the
main tasks list.
But the openstack_config code was executed only on the last OSD node.
We don't need to do this check on all OSD node so we need to add set
run_once to true on that task.
When using `osd_auto_discovery`, `devices` is built multiple times due
to multiple runs of `ceph-facts` role. It end up with duplicate
instances of a same device in the list.
Using `unique` filter when building the list fixes this issue.
Dimitri Savineau [Tue, 26 Nov 2019 16:09:11 +0000 (11:09 -0500)]
ceph-osd: wait for all osd before crush rules
When creating crush rules with device class parameter we need to be sure
that all OSDs are up and running because the device class list is
is populated with this information.
This is now enable for all scenario not openstack_config only.
At the moment, we bindmount the dbus socket from the host, this requires
to run the container with --privileged.
Since we now run a dedicated dbus daemon inside the same container, we
can stop running privileged nfs-ganesha containers
Related ceph-container PR : ceph/ceph-container#1517