doc: update docs for device management

author Sage Weil <sage@redhat.com>

Fri, 14 Sep 2018 19:06:41 +0000 (14:06 -0500)

committer Sage Weil <sage@redhat.com>

Fri, 14 Sep 2018 19:06:41 +0000 (14:06 -0500)
diff --git a/doc/mgr/devicehealth.rst b/doc/mgr/devicehealth.rst
deleted file mode 100644
index 3dc4a08..0000000
--- a/doc/mgr/devicehealth.rst
+++ /dev/null
@@ -1,70 +0,0 @@
-Devicehealth plugin
-===================
-
-The *devicehealth* plugin includes code to manage physical devices
-that back Ceph daemons (e.g., OSDs).  This includes scraping health
-metrics (e.g., SMART) and responding to health metrics by migrating
-data away from failing devices.
-
-Enabling
---------
-
-The *devicehealth* module is enabled with::
-
-  ceph mgr module enable devicehealth
-
-(The module is enabled by default.)
-
-To turn on automatic device health monitoring, including regular (daily)
-scraping of device health metrics like SMART::
-
-  ceph device monitoring on
-
-To disable monitoring,::
-
-  ceph device monitoring off
-
-Scraping
---------
-
-Health metrics can be scraped from all devices with::
-
-  ceph device scrape-health-metrics
-
-A single device can be scraped with::
-
-  ceph device scrape-health-metrics <device-id>
-
-Or a single daemon's devices can be scraped with::
-
-  ceph device scrape-daemon-health-metrics <who>
-
-The stored health metrics for a device can be retrieved (optionally
-for a specific timestamp) with::
-
-  ceph device get-health-metrics <devid> [sample-timestamp]
-
-Health monitoring
------------------
-
-By default, the devicehealth module wakes up periodically and checks
-the health of all devices in the system.  This will raise health
-alerts if devices are expected to fail soon.  This can be disabled by
-turning off the ``mgr/devicehealth/enable_monitoring`` option.
-
-The ``mgr/devicehealth/warn_threshold`` controls how soon an expected
-device failure must be before we generate a health warning.
-
-If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
-default), then for devices that are expected to fail soon the module
-will automatically migrate data away from them by marking the devices
-"out".
-
-The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
-expected device failure must be before we automatically mark an osd
-"out".
-
-The stored life expectancy of all devices can be checked, and any
-appropriate health alerts generated, with::
-
-  ceph device check-health
diff --git a/doc/mgr/index.rst b/doc/mgr/index.rst
index 8b3487dd8070abd306a112c449dc6f68d164bba7..0fd60466e83396a71dda1d845c5d1cd44506e411 100644
--- a/doc/mgr/index.rst
+++ b/doc/mgr/index.rst
@@ -39,7 +39,6 @@ sensible.
      Telemetry plugin <telemetry>
      Iostat plugin <iostat>
      Crash plugin <crash>
-    Devicehealth plugin <devicehealth>
      Orchestrator CLI plugin <orchestrator_cli>
      Rook plugin <rook>
      Insights plugin <insights>
diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst
new file mode 100644
index 0000000..2815abb
--- /dev/null
+++ b/doc/rados/operations/devices.rst
@@ -0,0 +1,98 @@
+Device Management
+=================
+
+Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
+which daemons, and collects health metrics about those devices in order to
+provide tools to predict and/or automatically respond to hardware failure.
+
+Device tracking
+---------------
+
+You can query which storage devices are in use with::
+
+  ceph device ls
+
+You can also list devices by daemon or by host::
+
+  ceph device ls-by-daemon <daemon>
+  ceph device ls-by-host <host>
+
+For any individual device, you can query information about its
+location and how it is being consumed with::
+
+  ceph device info <devid>
+
+
+Enabling monitoring
+-------------------
+
+Ceph can also monitor health metrics associated with your device.  For
+example, SATA hard disks implement a standard called SMART that
+provides a wide range of internal metrics about the device's usage and
+health, like the number of hours powered on, number of power cycles,
+or unrecoverable read errors.  Other device types like SAS and NVMe
+implement a similar set of metrics (via slightly different standards).
+All of these can be collected by Ceph via the ``smartctl`` tool.
+
+You can enable or disable health monitoring with::
+
+  ceph device monitoring on
+
+or::
+
+  ceph device monitoring off
+
+
+Scraping
+--------
+
+If monitoring is enabled, metrics will automatically be scraped at regular intervals.  That interval can be configured with::
+
+  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
+
+The default is to scrape once every 24 hours.
+
+You can manually trigger a scrape of all devices with::
+
+  ceph device scrape-health-metrics
+
+A single device can be scraped with::
+
+  ceph device scrape-health-metrics <device-id>
+
+Or a single daemon's devices can be scraped with::
+
+  ceph device scrape-daemon-health-metrics <who>
+
+The stored health metrics for a device can be retrieved (optionally
+for a specific timestamp) with::
+
+  ceph device get-health-metrics <devid> [sample-timestamp]
+
+Failure prediction
+------------------
+
+TBD
+
+Health alerts
+-------------
+
+The ``mgr/devicehealth/warn_threshold`` option controls how soon an
+expected device failure must be before we generate a health warning.
+
+The stored life expectancy of all devices can be checked, and any
+appropriate health alerts generated, with::
+
+  ceph device check-health
+
+Automatic mitigation
+--------------------
+
+If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
+default), then for devices that are expected to fail soon the module
+will automatically migrate data away from them by marking the devices
+"out".
+
+The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
+expected device failure must be before we automatically mark an OSD
+"out".
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index 6012f757d11e1de8476320666e98b4080d1247d9..8e93886b8085070580783abd5dd4e32ed10a8715 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -60,6 +60,7 @@ with new hardware.
  
         add-or-rm-osds
         add-or-rm-mons
+       devices
         bluestore-migration
         Command Reference <control>
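
Note on the ``scrape_frequency`` setting documented in the new devices.rst: the value is given in seconds, and the stated default of one scrape every 24 hours corresponds to 86400. The following is a minimal sketch, not part of the commit, of converting an interval in hours into the documented ``ceph config set`` invocation. The helper name ``scrape_frequency_command`` is hypothetical, and the function only builds the argument list; it does not run ``ceph``.

```python
def scrape_frequency_command(hours):
    """Build the argv that sets the devicehealth scrape interval.

    `hours` is the desired interval in hours; the documented option
    expects seconds, so the value is converted before formatting.
    This helper is illustrative only and does not execute anything.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    seconds = int(hours * 3600)
    return [
        "ceph", "config", "set", "mgr",
        "mgr/devicehealth/scrape_frequency", str(seconds),
    ]


if __name__ == "__main__":
    # Print the command for the documented 24-hour default.
    print(" ".join(scrape_frequency_command(24)))
```

The resulting list could be passed to ``subprocess.run`` on a host with a working ``ceph`` CLI, assuming the caller has permission to change manager configuration.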