From: Sage Weil
Date: Tue, 31 Jul 2018 14:22:42 +0000 (-0500)
Subject: doc/rados/operations/health-checks: document DEVICE_HEALTH* messages
X-Git-Tag: v14.0.1~723^2~1
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=7ab8675fdfd3c523bf6cd878bf2dd5395bf56e50;p=ceph.git

doc/rados/operations/health-checks: document DEVICE_HEALTH* messages

Signed-off-by: Sage Weil
---

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index c43e1e57a1ff0..acd4ce8913880 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -251,6 +251,72 @@ You can either raise the pool quota with::

or delete some existing data to reduce utilization.

Device health
-------------

DEVICE_HEALTH
_____________

One or more devices are expected to fail soon; the warning threshold
is controlled by the ``mgr/devicehealth/warn_threshold`` config
option.

This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of the device, and then to remove the
hardware from the system. Note that marking the device out is
normally done automatically if ``mgr/devicehealth/self_heal`` is
enabled, based on the ``mgr/devicehealth/mark_out_threshold``.

Device health can be checked with::

  ceph device info <devid>

Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

  ceph device set-life-expectancy <devid> <from> <to>

You can change the stored life expectancy manually, but that usually
doesn't accomplish anything, as whatever tool originally set it will
probably set it again, and changing the stored value does not affect
the actual health of the hardware device.

DEVICE_HEALTH_IN_USE
____________________

One or more devices are expected to fail soon and have been marked "out"
of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but they
are still participating in one or more PGs. This may be because the
devices were only recently marked "out" and data is still migrating, or
because data cannot be migrated off for some reason (e.g., the cluster
is nearly full, or the CRUSH hierarchy is such that there isn't another
suitable OSD to migrate the data to).

This message can be silenced by disabling the self-heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing what is
preventing data from being migrated off of the ailing device.

DEVICE_HEALTH_TOOMANY
_____________________

Too many devices are expected to fail soon and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
out all of the ailing devices would exceed the cluster's
``mon_osd_min_in_ratio`` ratio, which prevents too many OSDs from being
automatically marked "out".

This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.

The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.


Data health (pools & placement groups)
--------------------------------------
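As a concrete sketch of the device commands referenced in the
``DEVICE_HEALTH`` section above: a device can be listed, inspected, and
have its stored life expectancy overridden as shown below. The device
identifier and the dates are placeholders, not values from a real
cluster; ``ceph device ls`` shows the identifiers that actually exist::

  # list devices known to the cluster along with any predicted life expectancy
  ceph device ls

  # show health information for a single device (placeholder identifier)
  ceph device info SEAGATE_ST31000524NS_9WK1CD3Z

  # manually override the stored life expectancy (placeholder dates)
  ceph device set-life-expectancy SEAGATE_ST31000524NS_9WK1CD3Z 2018-08-01 2019-02-01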
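Likewise, a minimal sketch of tuning the self-heal behavior described
under ``DEVICE_HEALTH`` and ``DEVICE_HEALTH_IN_USE``, assuming the
standard ``ceph config set`` interface for mgr module options; the
values shown are examples, not recommendations::

  # disable automatic marking "out" of failing devices entirely
  ceph config set mgr mgr/devicehealth/self_heal false

  # or adjust how close to predicted failure a device must be before it
  # is automatically marked "out" (example value: 2419200 seconds = 4 weeks)
  ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200

  # a specific OSD can also be marked "out" by hand (placeholder id)
  ceph osd out 12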
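For ``DEVICE_HEALTH_TOOMANY``, the guard ratio can be lowered if you
accept the added risk of data loss noted above; this assumes
``mon_osd_min_in_ratio`` is settable at runtime via ``ceph config
set``, and the value is only an example::

  # permit automatic mark-out to proceed even when a smaller fraction of
  # OSDs would remain "in" (example value only)
  ceph config set mon mon_osd_min_in_ratio 0.5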