Health checks
=============
+The cephadm module provides additional healthchecks to supplement the default healthchecks
+provided by the Cluster. These additional healthchecks fall into two categories;
+
+- **cephadm operations**: Healthchecks in this category are always executed when the cephadm module is active.
+- **cluster configuration**: These healthchecks are *optional*, and focus on the configuration of the hosts in
+ the cluster
+
+CEPHADM Operations
+------------------
CEPHADM_PAUSED
---------------
+^^^^^^^^^^^^^^
Cephadm background work has been paused with ``ceph orch pause``. Cephadm
continues to perform passive monitoring activities (like checking
.. _cephadm-stray-host:
CEPHADM_STRAY_HOST
-------------------
+^^^^^^^^^^^^^^^^^^
One or more hosts have running Ceph daemons but are not registered as
hosts managed by *cephadm*. This means that those services cannot
domain names.
CEPHADM_STRAY_DAEMON
---------------------
+^^^^^^^^^^^^^^^^^^^^
One or more Ceph daemons are running but not are not managed by
*cephadm*. This may be because they were deployed using a different
ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
CEPHADM_HOST_CHECK_FAILED
--------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^
One or more hosts have failed the basic cephadm host check, which verifies
that (1) the host is reachable and cephadm can be executed there, and (2)
ceph config set mgr mgr/cephadm/warn_on_failed_host_check false
+Cluster Configuration Checks
+----------------------------
+Cephadm periodically scans each of the hosts in the cluster, to understand the state
+of the OS, disks, NICs etc. These facts can then be analysed for consistency across the hosts
+in the cluster to identify any configuration anomalies.
+
+The configuration checks are an **optional** feature, enabled by the following command
+::
+
+ ceph config set mgr mgr/cephadm/config_checks_enabled true
+
+The configuration checks are triggered after each host scan (1m). The cephadm log entries will
+show the current state and outcome of the configuration checks as follows;
+
+Disabled state (config_checks_enabled false)
+::
+
+ ALL cephadm checks are disabled, use 'ceph config set mgr mgr/cephadm/config_checks_enabled true' to enable
+
+Enabled state (config_checks_enabled true)
+::
+
+ CEPHADM 8/8 checks enabled and executed (0 bypassed, 0 disabled). No issues detected
+
+The configuration checks themselves are managed through several cephadm sub-commands.
+
+To determine whether the configuration checks are enabled, you can use the following command
+::
+
+ ceph cephadm config-check status
+
+This command will return the status of the configuration checker as either "Enabled" or "Disabled".
+
+
+Listing all the configuration checks and their current state
+::
+
+ ceph cephadm config-check ls
+
+ e.g.
+ NAME HEALTHCHECK STATUS DESCRIPTION
+ kernel_security CEPHADM_CHECK_KERNEL_LSM enabled checks SELINUX/Apparmor profiles are consistent across cluster hosts
+ os_subscription CEPHADM_CHECK_SUBSCRIPTION enabled checks subscription states are consistent for all cluster hosts
+ public_network CEPHADM_CHECK_PUBLIC_MEMBERSHIP enabled check that all hosts have a NIC on the Ceph public_netork
+ osd_mtu_size CEPHADM_CHECK_MTU enabled check that OSD hosts share a common MTU setting
+ osd_linkspeed CEPHADM_CHECK_LINKSPEED enabled check that OSD hosts share a common linkspeed
+ network_missing CEPHADM_CHECK_NETWORK_MISSING enabled checks that the cluster/public networks defined exist on the Ceph hosts
+ ceph_release CEPHADM_CHECK_CEPH_RELEASE enabled check for Ceph version consistency - ceph daemons should be on the same release (unless upgrade is active)
+ kernel_version CEPHADM_CHECK_KERNEL_VERSION enabled checks that the MAJ.MIN of the kernel on Ceph hosts is consistent
+
+The name of each configuration check, can then be used to enable or disable a specific check.
+::
+
+ ceph cephadm config-check disable <name>
+
+ eg.
+ ceph cephadm config-check disable kernel_security
+
+CEPHADM_CHECK_KERNEL_LSM
+^^^^^^^^^^^^^^^^^^^^^^^^
+Each host within the cluster is expected to operate within the same Linux Security Module (LSM) state. For example,
+if the majority of the hosts are running with SELINUX in enforcing mode, any host not running in this mode
+would be flagged as an anomaly and a healtcheck (WARNING) state raised.
+
+CEPHADM_CHECK_SUBSCRIPTION
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+This check relates to the status of vendor subscription. This check is only performed for hosts using RHEL, but helps
+to confirm that all your hosts are covered by an active subscription so patches and updates
+are available.
+
+CEPHADM_CHECK_PUBLIC_MEMBERSHIP
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+All members of the cluster should have NICs configured on at least one of the public network subnets. Hosts
+that are not on the public network will rely on routing which may affect performance
+
+CEPHADM_CHECK_MTU
+^^^^^^^^^^^^^^^^^
+The MTU of the NICs on OSDs can be a key factor in consistent performance. This check examines hosts
+that are running OSD services to ensure that the MTU is configured consistently within the cluster. This is
+determined by establishing the MTU setting that the majority of hosts are using, with any anomalies being
+resulting in a Ceph healthcheck.
+
+CEPHADM_CHECK_LINKSPEED
+^^^^^^^^^^^^^^^^^^^^^^^
+Similar to the MTU check, linkspeed consistency is also a factor in consistent cluster performance.
+This check determines the linkspeed shared by the majority of "OSD hosts", resulting in a healthcheck for
+any hosts that are set at a lower linkspeed rate.
+
+CEPHADM_CHECK_NETWORK_MISSING
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The public_network and cluster_network settings support subnet definitions for IPv4 and IPv6. If these
+settings are not found on any host in the cluster a healthcheck is raised.
+
+CEPHADM_CHECK_CEPH_RELEASE
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+Under normal operations, the ceph cluster should be running daemons under the same ceph release (i.e. all
+pacific). This check looks at the active release for each daemon, and reports any anomalies as a
+healthcheck. *This check is bypassed if an upgrade process is active within the cluster.*
+
+CEPHADM_CHECK_KERNEL_VERSION
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The OS kernel version (maj.min) is checked for consistency across the hosts. Once again, the
+majority of the hosts is used as the basis of identifying anomalies.
+
/etc/ceph/ceph.conf
===================
-Cephadm distributes a minimized ``ceph.conf`` that only contains
+Cephadm distributes a minimized ``ceph.conf`` that only contains
a minimal set of information to connect to the Ceph cluster.
To update the configuration settings, instead of manually editing
ceph config set ...
-See :ref:`ceph-conf-database` for details.
+See :ref:`ceph-conf-database` for details.
By default, cephadm does not deploy that minimized ``ceph.conf`` across the
cluster. To enable the management of ``/etc/ceph/ceph.conf`` files on all
ceph config set mgr mgr/cephadm/manage_etc_ceph_ceph_conf true
-To set up an initial configuration before bootstrapping
+To set up an initial configuration before bootstrapping
the cluster, create an initial ``ceph.conf`` file. For example::
cat <<EOF > /etc/ceph/ceph.conf