doc/mgr/devicehealth: document devicehealth module

author Sage Weil <sage@redhat.com>

Tue, 31 Jul 2018 14:38:39 +0000 (09:38 -0500)

committer Sage Weil <sage@redhat.com>

Tue, 31 Jul 2018 19:08:53 +0000 (14:08 -0500)
diff --git a/doc/mgr/devicehealth.rst b/doc/mgr/devicehealth.rst
new file mode 100644
index 0000000..5e0d001
--- /dev/null
+++ b/doc/mgr/devicehealth.rst
@@ -0,0 +1,52 @@
+Devicehealth plugin
+===================
+
+The *devicehealth* plugin includes code to manage physical devices
+that back Ceph daemons (e.g., OSDs).  This includes scraping health
+metrics (e.g., SMART) and responding to health metrics by migrating
+data away from failing devices.
+
+Enabling
+--------
+
+The *devicehealth* module is enabled with::
+
+  ceph mgr module enable devicehealth
+
+(It is enabled by default.)
+
+Scraping
+--------
+
+Health metrics can be scraped from all devices with::
+
+  ceph device scrape-health-metrics
+
+A single device can be scraped with::
+
+  ceph device scrape-health-metrics <device-id>
+
+Or a single daemon's devices can be scraped with::
+
+  ceph device scrape-daemon-health-metrics <who>
+
+
+Health monitoring
+-----------------
+
+By default, the devicehealth module wakes up periodically and checks
+the health of all devices in the system.  It raises a health alert
+for any device that is expected to fail soon.  This monitoring can be
+disabled by turning off the ``mgr/devicehealth/enable_monitoring`` option.
+
+The ``mgr/devicehealth/warn_threshold`` option controls how soon an
+expected device failure must be before a health warning is generated.
+
+If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
+default), then for devices that are expected to fail soon the module
+will automatically migrate data away from them by marking the devices
+"out".
+
+The ``mgr/devicehealth/mark_out_threshold`` option controls how soon
+an expected device failure must be before the module automatically
+marks an OSD "out".
diff --git a/doc/mgr/index.rst b/doc/mgr/index.rst
index ea8c9d48ca1bf5db229100bc80ff9401cdc62862..e640292f40f4dee9f36776cdb9a25f86b73059ee 100644
--- a/doc/mgr/index.rst
+++ b/doc/mgr/index.rst
@@ -39,3 +39,4 @@ sensible.
      Telemetry plugin <telemetry>
      Iostat plugin <iostat>
      Crash plugin <crash>
+    Devicehealth plugin <devicehealth>
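
The interaction between the four options documented in devicehealth.rst can be sketched roughly as below. This is only an illustration of the documented behaviour, not the module's actual code: the `decide` function, the `Decision` type, and the assumption that turning off `enable_monitoring` also suppresses the automatic mark-out are all hypothetical, and the threshold values in the usage lines are invented for the example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    warn: bool      # raise a health warning for the device
    mark_out: bool  # mark the OSD backed by the device "out"


def decide(predicted_failure: Optional[datetime], now: datetime, *,
           warn_threshold: timedelta, mark_out_threshold: timedelta,
           enable_monitoring: bool = True,
           self_heal: bool = True) -> Decision:
    """Hypothetical model of the documented options, not Ceph code."""
    # Assumption: with monitoring off, or with no failure prediction
    # available for the device, nothing happens.
    if not enable_monitoring or predicted_failure is None:
        return Decision(False, False)
    remaining = predicted_failure - now
    # Warn once the expected failure is within warn_threshold.
    warn = remaining <= warn_threshold
    # Mark out only if self_heal is on and the expected failure is
    # within mark_out_threshold.
    mark_out = self_heal and remaining <= mark_out_threshold
    return Decision(warn, mark_out)


# Example thresholds (invented values for illustration).
now = datetime(2018, 7, 31)
limits = dict(warn_threshold=timedelta(weeks=12),
              mark_out_threshold=timedelta(weeks=4))
soon = decide(now + timedelta(weeks=2), now, **limits)
later = decide(now + timedelta(weeks=8), now, **limits)
```

With these example values, `soon` requests both a warning and a mark-out, while `later` requests only a warning. Keeping the two thresholds separate means a warning can be raised well before any data is migrated.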