From: Sage Weil
Date: Tue, 31 Jul 2018 14:22:42 +0000 (-0500)
Subject: doc/rados/operations/health-checks: document DEVICE_HEALTH* messages
X-Git-Tag: v14.0.1~723^2~1
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=7ab8675fdfd3c523bf6cd878bf2dd5395bf56e50;p=ceph.git

doc/rados/operations/health-checks: document DEVICE_HEALTH* messages

Signed-off-by: Sage Weil
---

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index c43e1e57a1ff0..acd4ce8913880 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -251,6 +251,72 @@ You can either raise the pool quota with::

or delete some existing data to reduce utilization.

Device health
-------------

DEVICE_HEALTH
_____________

One or more devices are expected to fail soon; the warning threshold
is controlled by the ``mgr/devicehealth/warn_threshold`` config
option.

This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of the device, and then to remove the
hardware from the system. Note that marking the device out is
normally done automatically if ``mgr/devicehealth/self_heal`` is
enabled, based on the ``mgr/devicehealth/mark_out_threshold``.

Device health can be checked with::

  ceph device info <devid>

Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

  ceph device set-life-expectancy <devid> <from> <to>

You can change the stored life expectancy manually, but that usually
doesn't accomplish anything, as whatever tool originally set it will
probably set it again, and changing the stored value does not affect
the actual health of the hardware device.

DEVICE_HEALTH_IN_USE
____________________

One or more devices are expected to fail soon and have been marked "out"
of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but they
are still participating in one or more PGs. This may be because the
devices were only recently marked "out" and data is still migrating, or
because data cannot be migrated off for some reason (e.g., the cluster
is nearly full, or the CRUSH hierarchy is such that there isn't another
suitable OSD to migrate the data to).

This message can be silenced by disabling the self-heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing what is
preventing data from being migrated off of the ailing device.

DEVICE_HEALTH_TOOMANY
_____________________

Too many devices are expected to fail soon and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
out all of the ailing devices would exceed the cluster's
``mon_osd_min_in_ratio`` ratio, which prevents too many OSDs from being
automatically marked "out".

This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.

The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.


Data health (pools & placement groups)
--------------------------------------
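As a concrete sketch of the device commands referenced in the
``DEVICE_HEALTH`` section above: a device can be listed, inspected, and
have its stored life expectancy overridden as shown below. The device
identifier and the dates are placeholders, not values from a real
cluster; ``ceph device ls`` shows the identifiers that actually exist::

  # list devices known to the cluster along with any predicted life expectancy
  ceph device ls

  # show health information for a single device (placeholder identifier)
  ceph device info SEAGATE_ST31000524NS_9WK1CD3Z

  # manually override the stored life expectancy (placeholder dates)
  ceph device set-life-expectancy SEAGATE_ST31000524NS_9WK1CD3Z 2018-08-01 2019-02-01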
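Likewise, a minimal sketch of tuning the self-heal behavior described
under ``DEVICE_HEALTH`` and ``DEVICE_HEALTH_IN_USE``, assuming the
standard ``ceph config set`` interface for mgr module options; the
values shown are examples, not recommendations::

  # disable automatic marking "out" of failing devices entirely
  ceph config set mgr mgr/devicehealth/self_heal false

  # or adjust how close to predicted failure a device must be before it
  # is automatically marked "out" (example value: 2419200 seconds = 4 weeks)
  ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200

  # a specific OSD can also be marked "out" by hand (placeholder id)
  ceph osd out 12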
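For ``DEVICE_HEALTH_TOOMANY``, the guard ratio can be lowered if you
accept the added risk of data loss noted above; this assumes
``mon_osd_min_in_ratio`` is settable at runtime via ``ceph config
set``, and the value is only an example::

  # permit automatic mark-out to proceed even when a smaller fraction of
  # OSDs would remain "in" (example value only)
  ceph config set mon mon_osd_min_in_ratio 0.5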