doc/rados/operations/health-checks: document DEVICE_HEALTH* messages

author Sage Weil <sage@redhat.com>

Tue, 31 Jul 2018 14:22:42 +0000 (09:22 -0500)

committer Sage Weil <sage@redhat.com>

Tue, 31 Jul 2018 19:08:53 +0000 (14:08 -0500)
author Sage Weil <sage@redhat.com>
Tue, 31 Jul 2018 14:22:42 +0000 (09:22 -0500)
committer Sage Weil <sage@redhat.com>
Tue, 31 Jul 2018 19:08:53 +0000 (14:08 -0500)
diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst

index c43e1e57a1ff056cacc2318d2e7c3e82537cc79c..acd4ce8913880b9eb9a2566094503edd9302e0eb 100644 (file)
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -251,6 +251,72 @@ You can either raise the pool quota with::
  or delete some existing data to reduce utilization.
  
  
+Device health
+-------------
+
+DEVICE_HEALTH
+_____________
+
+One or more devices is expected to fail soon, where the warning
+threshold is controlled by the ``mgr/devicehealth/warn_threshold``
+config option.
+
+This warning only applies to OSDs that are currently marked "in", so
+the expected response to this failure is to mark the device "out" so
+that data is migrated off of the device, and then to remove the
+hardware from the system.  Note that the marking out is normally done
+automatically if ``mgr/devicehealth/self_heal`` is enabled based on
+the ``mgr/devicehealth/mark_out_threshold``.
+
+Device health can be checked with::
+
+  ceph device info <device-id>
+
+Device life expectancy is set by a prediction model run by
+the mgr or an by external tool via the command::
+
+  ceph device set-life-expectancy <device-id> <from> <to>
+
+You can change the stored life expectancy manually, but that usually
+doesn't accomplish anything as whatever tool originally set it will
+probably set it again, and changing the stored value does not affect
+the actual health of the hardware device.
+
+DEVICE_HEALTH_IN_USE
+____________________
+
+One or more devices is expected to fail soon and has been marked "out"
+of the cluster based on ``mgr/devicehalth/mark_out_threshold``, but it
+is still participating in one more PGs.  This may be because it was
+only recently marked "out" and data is still migrating, or because data
+cannot be migrated off for some reason (e.g., the cluster is nearly
+full, or the CRUSH hierarchy is such that there isn't another suitable
+OSD to migrate the data too).
+
+This message can be silenced by disabling the self heal behavior
+(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
+``mgr/devicehealth/mark_out_threshold``, or by addressing what is
+preventing data from being migrated off of the ailing device.
+
+DEVICE_HEALTH_TOOMANY
+_____________________
+
+Too many devices is expected to fail soon and the
+``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
+out all of the ailing devices would exceed the clusters
+``mon_osd_min_in_ratio`` ratio that prevents too many OSDs from being
+automatically marked "out".
+
+This generally indicates that too many devices in your cluster are
+expected to fail soon and you should take action to add newer
+(healthier) devices before too many devices fail and data is lost.
+
+The health message can also be silenced by adjusting parameters like
+``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
+but be warned that this will increase the likelihood of unrecoverable
+data loss in the cluster.
+
+
  Data health (pools & placement groups)
  --------------------------------------
author	Sage Weil <sage@redhat.com>
	Tue, 31 Jul 2018 14:22:42 +0000 (09:22 -0500)
committer	Sage Weil <sage@redhat.com>
	Tue, 31 Jul 2018 19:08:53 +0000 (14:08 -0500)