Using ceph_dev_branch and ceph_dev_sha1 for configuring ceph-iscsi
repositories from shaman doesn't make sense because the ceph devel
branches and sha1 aren't compatible with ceph-iscsi devel.
Instead we could rely on the master branch and the latest sha1.
Currently it's not possible to use a custom ceph branch/sha1 value
with the iscsi setup, otherwise the repository setup will fail.
Dimitri Savineau [Mon, 10 Feb 2020 16:06:48 +0000 (11:06 -0500)]
ceph-nfs: fix ceph_nfs_ceph_user variable
The ceph_nfs_ceph_user variable is a string for the ceph-nfs role but a
list in the ceph-client role. 6a6785b introduced confusion between the two
variable types in the ceph-nfs role for external ceph with ganesha.
Dimitri Savineau [Mon, 10 Feb 2020 18:43:31 +0000 (13:43 -0500)]
ceph-{mon,osd}: move default crush variables
In ed36a11 we moved the crush rules creation code from the ceph-mon to
the ceph-osd role.
To keep backward compatibility we kept the possibility to set the
crush variables on the mons side but we didn't move the default values.
As a result, when crush_rule_config is set to true and the default
values for crush_rules are expected to be used, the crush rule creation
task fails:
"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute
'crush_rules'"
This patch moves the default crush variables from the ceph-mon to the
ceph-osd role and also uses those default values when nothing is defined
on the mons side.
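The fallback described above can be sketched in Python as follows. This is only an illustration: the function name, the host variable layout and the default rule contents are assumptions made for the example, not the role's actual code.

```python
# Hypothetical default value, standing in for the defaults now shipped
# with the ceph-osd role.
DEFAULT_CRUSH_RULES = [{"name": "replicated_rule", "default": True}]


def resolve_crush_rules(mon_hostvars, defaults=DEFAULT_CRUSH_RULES):
    """Return the rules set on the mon side, or the defaults if none are set."""
    return mon_hostvars.get("crush_rules", defaults)


# A mon that defines nothing falls back to the defaults instead of failing
# with "has no attribute 'crush_rules'".
print(resolve_crush_rules({}))
# A mon that defines its own rules keeps them (backward compatibility).
print(resolve_crush_rules({"crush_rules": [{"name": "custom"}]}))
```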
Dimitri Savineau [Wed, 12 Feb 2020 15:38:25 +0000 (10:38 -0500)]
ceph-grafana: fix grafana_{crt,key} condition
The grafana_{crt,key} aren't boolean variables but strings. The default
value is an empty string so we should do the conditional on the string
length instead of the bool filter.
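A small Python sketch of why a boolean conversion is the wrong test for a string such as a certificate path. The `ansible_like_bool` helper only approximates the behaviour of Ansible's `bool` filter, and the path is a made-up value for illustration.

```python
def ansible_like_bool(value):
    """Approximate Ansible's `bool` filter: only a few literals are truthy."""
    if isinstance(value, bool):
        return value
    return str(value).lower() in {"1", "on", "yes", "true"}


def has_value(value):
    """Length-based check, suitable for string variables defaulting to ''."""
    return len(value) > 0


# Hypothetical certificate path, used only for illustration.
grafana_crt = "/etc/grafana/ceph-dashboard.crt"

# A non-empty path is not one of the truthy literals, so a bool-style
# condition would skip the task even though a certificate was provided.
print(ansible_like_bool(grafana_crt))
# The length check distinguishes "set" from the empty default.
print(has_value(grafana_crt), has_value(""))
```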
When using multiple grafana hosts, we set the grafana and prometheus
URLs and push the dashboard layout to a single node.
grafana_server_addrs is the list of all grafana nodes and used during
the ceph-dashboard role (on mgr/mon nodes).
grafana_server_addr is the current grafana node used during the
ceph-grafana and ceph-prometheus role (on grafana-server nodes).
We don't have the grafana_server_addr fact duplication code between
external vs collocated nodes.
switch_to_containers: increase health check values
This commit increases the default values for the following variables
consumed in the switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables into the `ceph-defaults` role so the user
can set different values if needed.
Mike Christie [Tue, 28 Jan 2020 22:31:55 +0000 (16:31 -0600)]
iscsi: Fix crashes during rolling update
During a rolling update we run the ceph iscsigw tasks that start the
daemons, then run the configure_iscsi.yml tasks, which can create
iscsi objects like targets, disks, clients, etc. The problem is that
once the daemons are started they will accept configuration requests,
or may want to update the system themselves. Those operations can then
conflict with the configure_iscsi.yml tasks that set up objects, and we
can end up with crashes due to the kernel being in an unsupported state.
This could also happen during creation, but is less likely because no
objects are set up yet, so there are no watchers or users accessing the
gws yet. The fix in this patch works for both update and initial setup.
Dimitri Savineau [Wed, 29 Jan 2020 03:31:04 +0000 (22:31 -0500)]
ceph-container-engine: lvm2 on OSD nodes only
Since de8f2a9 the lvm2 package installation has been moved from ceph-osd
role to ceph-container-engine role.
But the scope wasn't limited to the OSD nodes only.
This commit fixes this behaviour.
Dimitri Savineau [Tue, 28 Jan 2020 15:27:34 +0000 (10:27 -0500)]
ceph-defaults: remove rgw from ceph_conf_overrides
The [rgw] section in the ceph.conf file or via the ceph_conf_overrides
variable doesn't exist and has no effect.
To apply overrides to all radosgw instances we should use either the
[global] or [client] sections.
Overrides per radosgw instance should still use the
[client.rgw.{instance-name}] section.
Dimitri Savineau [Wed, 29 Jan 2020 02:34:24 +0000 (21:34 -0500)]
ceph-facts: fix _container_exec_cmd fact value
When the inventory_hostname and the ansible_hostname differ, the
_container_exec_cmd fact gets a wrong value based on the
inventory_hostname instead of the ansible_hostname.
This happens when the ceph cluster is already running (update/upgrade).
Later the container exec commands will fail because the container name
is wrong.
We should always set the _container_exec_cmd based on the
ansible_hostname fact.
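As an illustration of the mismatch, the sketch below builds the exec command from each hostname. The `ceph-mon-<hostname>` container naming and the sample host values are assumptions made for the example.

```python
def container_exec_cmd(container_binary, hostname):
    """Build the exec prefix for a mon container named after `hostname`."""
    return f"{container_binary} exec ceph-mon-{hostname}"


# Hypothetical host whose inventory name differs from its system hostname.
host = {"inventory_hostname": "mon0.example.com", "ansible_hostname": "node-a1"}

# The container is assumed to be named after the system hostname, so only
# the second command targets a container that exists.
wrong = container_exec_cmd("podman", host["inventory_hostname"])
right = container_exec_cmd("podman", host["ansible_hostname"])
print(wrong)
print(right)
```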
If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).
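A minimal sketch of the filtering issue, assuming the JSON report maps each OSD id to a list of device entries carrying a `type` field and a `ceph.osd_fsid` tag. The sample data and the helper name are illustrative only.

```python
def collect_osd_fsids(lvm_list, wanted_types):
    """Collect the OSD fsids of devices whose type is in `wanted_types`."""
    fsids = set()
    for devices in lvm_list.values():
        for device in devices:
            if device.get("type") in wanted_types:
                fsids.add(device["tags"]["ceph.osd_fsid"])
    return sorted(fsids)


# Hypothetical report: OSD 0 is bluestore ("block"), OSD 1 is filestore ("data").
sample = {
    "0": [{"type": "block", "tags": {"ceph.osd_fsid": "aaaa"}}],
    "1": [
        {"type": "data", "tags": {"ceph.osd_fsid": "bbbb"}},
        {"type": "journal", "tags": {"ceph.osd_fsid": "bbbb"}},
    ],
}

# Looking only for "data" misses the bluestore OSD entirely.
print(collect_osd_fsids(sample, ("data",)))
# Accepting both types covers filestore and bluestore OSDs.
print(collect_osd_fsids(sample, ("block", "data")))
```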
fix calls to `container_exec_cmd` in ceph-osd role
We must call `container_exec_cmd` from the right monitor node otherwise
the value of the fact might mismatch between the delegated node and the
node being played.
Dimitri Savineau [Thu, 23 Jan 2020 21:58:14 +0000 (16:58 -0500)]
filestore-to-bluestore: skip bluestore osd nodes
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSDs for nothing.
Instead we warn the user.
Some ganesha packages do not create the ganesha log directory, although
it is expected to exist when its permissions are changed.
Additionally, it doesn't make much sense to do that in a separate task,
so the directory is now created with the correct permissions together
with the rest of the required directories.
Dimitri Savineau [Wed, 22 Jan 2020 19:45:38 +0000 (14:45 -0500)]
site-container: don't skip ceph-container-common
In an HCI environment the OSD and Client nodes are collocated. Because we
aren't running the ceph-container-common role on the client nodes except
the first one (for keyring purposes), the ceph-role execution fails
due to undefined variables.
Because we are trying to run `ceph-config` on this node, which doesn't
make sense, we should simply run this play on all groups except
`[grafana-server]`.
Dimitri Savineau [Tue, 21 Jan 2020 21:37:10 +0000 (16:37 -0500)]
filestore-to-bluestore: fix osd_auto_discovery
When osd_auto_discovery is set, we need to refresh the
ansible_devices fact after the filestore OSD purge,
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices list is empty at this point for osd_auto_discovery.
Also add the bool filter where needed.
common: add a default value for ceph_directories_mode
Since this variable makes it possible to customize the mode for ceph
directories, let's make it a bit more explicit by adding a default value
in ceph-defaults.
Dimitri Savineau [Mon, 20 Jan 2020 21:40:58 +0000 (16:40 -0500)]
filestore-to-bluestore: --destroy with raw devices
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.
Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
/dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes
Dimitri Savineau [Mon, 20 Jan 2020 16:24:08 +0000 (11:24 -0500)]
ceph-osd: set container objectstore env variables
Because we need to manage legacy ceph-disk based OSDs with ceph-volume,
we need a way to know the osd_objectstore in the container.
This was done like this previously with ceph-disk so we should also
do it with ceph-volume.
Note that this won't have any impact for ceph-volume lvm based OSD.
Rename docker_env_args fact to container_env_args and move the container
condition on the include_tasks call.
Remove OSD_DMCRYPT env variable from the ceph-osd template because it's
now included in the container_env_args variable.
Benoît Knecht [Mon, 20 Jan 2020 10:36:27 +0000 (11:36 +0100)]
ceph-rgw: Fix customize pool size "when" condition
In 3c31b19ab39f297635c84edb9e8a5de6c2da7707, I fixed the `customize pool
size` task by replacing `item.size` with `item.value.size`. However, I
missed the same issue in the `when` condition.
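To illustrate the `item.size` versus `item.value.size` distinction: when looping over a dictionary the way Ansible's `with_dict` or `dict2items` do, each item exposes a `key` and a `value`, and the pool settings live under `value`. The pool names and settings below are made up for the example.

```python
def dict2items(mapping):
    """Mimic the key/value item shape produced when looping over a dict."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def pools_with_custom_size(pools):
    """Return the pool names whose settings define an explicit size."""
    return [item["key"] for item in dict2items(pools) if "size" in item["value"]]


# Hypothetical pool definitions, for illustration only.
pools = {
    "default.rgw.buckets.data": {"pg_num": 8, "size": 3},
    "default.rgw.buckets.index": {"pg_num": 8},
}

# The item itself only has "key" and "value", so a condition written
# against item["size"] can never match.
print(["size" in item for item in dict2items(pools)])
print(pools_with_custom_size(pools))
```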
handler: fix call to container_exec_cmd in handler_osds
When unsetting the noup flag, we must call container_exec_cmd from the
delegated node (first mon member).
Also add `run_once: true` because this task only needs to run once.
Since commit [1] introduced running_mon, it can be undefined, which
results in a fatal error [2]. This patch defines the default value that
was used before patch [1].
Dimitri Savineau [Thu, 16 Jan 2020 14:38:08 +0000 (09:38 -0500)]
ceph-facts: move facts to defaults value
There's no need to define a variable via a fact if we can do it via a
default value. Using a fact is only worthwhile when the default value
has to be overridden under some condition.
- ceph_uid could be set to 167 by default because it's only different on
non containerized deployments on Debian/Ubuntu.
- rbd_client_directory_{owner,group,mode} could be set to ceph,ceph,0770
by default instead of null as we are doing in the facts.
```
{% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%}
```
is wrong because it compares the first digit (1) whereas it should
compare the second one.
It means we always use `--cpu-quota` although the documentation recommends
using `--cpus` when the docker version is 1.13.1 or higher.
From the doc:
> --cpu-quota=<value> Impose a CPU CFS quota on the container. The number of
> microseconds per --cpu-period that the container is limited to before
> throttled. As such acting as the effective ceiling.
> If you use Docker 1.13 or higher, use --cpus instead.
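The effect of comparing only the first version component can be reproduced with a short Python sketch. The tuple-based comparison below is one possible way to express the intended check, not necessarily how the template was fixed, and the function names are illustrative.

```python
def supports_cpus_first_component(version):
    """Mirror the faulty check: compare only the first component with 13."""
    return int(version.split(".")[0]) >= 13


def supports_cpus(version):
    """Compare (major, minor) against (1, 13) instead."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 13)


# For docker 1.13.1 the first component is 1, so the faulty check is False
# and `--cpu-quota` would be used although `--cpus` is available.
print(supports_cpus_first_component("1.13.1"))
print(supports_cpus("1.13.1"))
# An older release is correctly rejected by the tuple comparison.
print(supports_cpus("1.12.6"))
```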
Iterating over all monitors in order to delegate a
`{{ container_binary }}` command fails when collocating mgrs with mons,
because ceph-facts resets `container_exec_cmd` to point to the first
member of the monitor group.
The idea is to force `container_exec_cmd` to be reset in ceph-mgr.
This commit also removes the `container_exec_cmd_mgr` fact.
Dimitri Savineau [Wed, 15 Jan 2020 15:35:28 +0000 (10:35 -0500)]
vagrant: temp workaround for CentOS 8 cloud image
The CentOS cloud infrastructure storing the vagrant CentOS 8 image
changed the directory path and removed the old 8.0 image, so
`vagrant box add centos/8` fails with a 404 HTTP error.
As a workaround we can pull the image from CentOS instead of letting
vagrant do the resolution.
The new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds, given that it's done after we shrink the mds,
which means the condition is wrong. Also add a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.
Dimitri Savineau [Mon, 13 Jan 2020 15:24:52 +0000 (10:24 -0500)]
ceph-facts: move grafana fact to dedicated file
We don't need to execute the grafana fact every time but only during
the dashboard deployment.
Especially for ceph-grafana, ceph-prometheus and ceph-dashboard roles.
Dimitri Savineau [Fri, 10 Jan 2020 20:30:58 +0000 (15:30 -0500)]
tests/setup: update mount options on EL 8
The nobarrier mount flag doesn't exist anymore on XFS in the EL 8
kernel. That's why the task wasn't working on those systems.
We can still use the other options instead of skipping the task.
Dimitri Savineau [Fri, 10 Jan 2020 14:31:26 +0000 (09:31 -0500)]
purge-iscsi-gateways: don't run all ceph-facts
We only need the container_binary fact. Because we're not
gathering the facts from all nodes, the purge fails trying to get
one of the grafana facts.
config: exclude ceph-disk prepared osds in lvm batch report
We must exclude the devices already used and prepared by ceph-disk when
doing the lvm batch report. Otherwise it fails because ceph-volume
complains about GPT header.
rolling_update: run registry auth before upgrading
Some tasks in the rolling upgrade playbook use the new container image
and need the registry login to be executed first, otherwise
the nodes won't be able to pull the container image.
Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.
We must pick up a mon which actually exists in ceph-facts in order to
detect if a cluster is running. Otherwise, it will state no cluster is
already running which will end up deploying a new monitor isolated in a
new quorum.
Using `handler_*_status` instead of `hostvars[item]['handler_*_status']`
causes handlers to be triggered in any case, even though
`handler_*_status` was set to `False` on a specific node.
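A rough Python sketch of the difference between reading the status of the node currently being played and looking it up per iterated host. The host names, the `handler_osd_status` key and both helpers are assumptions made for the example.

```python
# Hypothetical per-host variables, for illustration only.
hostvars = {
    "osd0": {"handler_osd_status": True},
    "osd1": {"handler_osd_status": False},
}


def restart_targets_current_host(hostvars, current_host, key):
    """Faulty variant: every host is judged by the current host's status."""
    status = hostvars[current_host][key]
    return [host for host in hostvars if status]


def restart_targets_per_host(hostvars, key):
    """Per-host lookup: each host is judged by its own status."""
    return [host for host, variables in hostvars.items() if variables.get(key)]


# Played from osd0, the faulty variant also restarts osd1.
print(restart_targets_current_host(hostvars, "osd0", "handler_osd_status"))
print(restart_targets_per_host(hostvars, "handler_osd_status"))
```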
for instance. However, doing so would create pools of size
`osd_pool_default_size` regardless of the `size` value. This was due to
the fact that the Ansible task used
Only the ipv4 addresses from the nodes running the dashboard mgr module
were added to the trusted_ip_list configuration file on the iscsigws
nodes.
This also adds the iscsi gateways with ipv6 configuration to the ceph
dashboard.
Some docker commands were hardcoded in tests playbooks and some
conditions were not taking care of the containerized_deployment
variable but only the atomic fact.
Since RHEL 8.1 we need to add the ganesha_t type to the permissive
SELinux list.
Otherwise the nfs-ganesha service won't start.
This was done on RHEL 7 previously and part of the nfs-ganesha-selinux
package on RHEL 8.
Before this patch, the lvm2 package installation was done during the
ceph-osd role.
However we were running ceph-volume command in the ceph-config role
before ceph-osd. If lvm2 wasn't installed then the ceph-volume command
fails:
error checking path "/run/lock/lvm": stat /run/lock/lvm: no such file or
directory
This wasn't visible before because lvm2 was automatically installed as
docker dependency but it's not the same for podman on CentOS 8.
This commit adds CentOS 8 support:
- update vagrant image in tox configurations.
- add CentOS 8 repository for el8 dependencies.
- CentOS 8 container engine is podman (same as RHEL 8).
- don't use the epel mirror on sepia because it's epel7 only.
Currently the keepalived template only works when system hostnames exactly match the Ansible inventory name. If these are different, all generated templates become BACKUP without a MASTER assigned. Using the inventory_hostname in the template file resolves this issue.
Signed-off-by: Stanley Lam stanleylam_604@hotmail.com
Dimitri Savineau [Mon, 16 Dec 2019 16:03:21 +0000 (11:03 -0500)]
ceph-infra: replace hardcoded grafana group name
The grafana-server group name was hardcoded for the grafana/prometheus
firewalld tasks condition.
We should use the associated variable: grafana_server_group_name
Dimitri Savineau [Mon, 16 Dec 2019 16:00:35 +0000 (11:00 -0500)]
ceph-infra: move dashboard into a dedicated file
Instead of using multiple dashboard_enabled conditions in the
configure_firewall file we could just have the condition once
and include the dedicated tasks list.
Dimitri Savineau [Mon, 16 Dec 2019 15:48:26 +0000 (10:48 -0500)]
ceph-infra: open dashboard port on monitor
When there's no mgr group defined in the ansible inventory then the
mgrs are deployed implicitly on the mons nodes.
If the dashboard is enabled then we need to open the dashboard port on
the node that is running the ceph mgr process (mgr or mon).
The current code only allows opening that port on the mgr nodes when they
are present explicitly in the inventory but not implicitly.
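The host selection described above can be sketched as follows. The group names `mgrs` and `mons`, the inventory layout and the helper name are assumptions made for the illustration.

```python
def dashboard_port_hosts(groups, mgr_group="mgrs", mon_group="mons"):
    """Return the hosts running the ceph mgr process.

    With an explicit mgr group those hosts are used; otherwise the mgrs
    are deployed implicitly on the mons, so the mon hosts are returned.
    """
    mgrs = groups.get(mgr_group, [])
    return mgrs if mgrs else groups.get(mon_group, [])


# Explicit mgr group: open the dashboard port on the mgr nodes.
print(dashboard_port_hosts({"mons": ["mon0"], "mgrs": ["mgr0"]}))
# No mgr group: the port has to be opened on the mon nodes instead.
print(dashboard_port_hosts({"mons": ["mon0", "mon1"]}))
```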