From 3e791d1f9eec94730e470f2a72e73c1ba4c04558 Mon Sep 17 00:00:00 2001
From: Sage Weil <sage@redhat.com>
Date: Fri, 14 Sep 2018 14:06:41 -0500
Subject: [PATCH] doc: update docs for device management

Move this out of mgr section and into rados operations.

Leave a placeholder for failure prediction, which is still a work in
progress.

Signed-off-by: Sage Weil <sage@redhat.com>
---
 doc/mgr/devicehealth.rst         | 70 -----------------------
 doc/mgr/index.rst                |  1 -
 doc/rados/operations/devices.rst | 98 ++++++++++++++++++++++++++++++++
 doc/rados/operations/index.rst   |  1 +
 4 files changed, 99 insertions(+), 71 deletions(-)
 delete mode 100644 doc/mgr/devicehealth.rst
 create mode 100644 doc/rados/operations/devices.rst
diff --git a/doc/mgr/devicehealth.rst b/doc/mgr/devicehealth.rst
deleted file mode 100644
index 3dc4a08bab6f6..0000000000000
--- a/doc/mgr/devicehealth.rst
+++ /dev/null
@@ -1,70 +0,0 @@
-Devicehealth plugin
-===================
-
-The *devicehealth* plugin includes code to manage physical devices
-that back Ceph daemons (e.g., OSDs).  This includes scraping health
-metrics (e.g., SMART) and responding to health metrics by migrating
-data away from failing devices.
-
-Enabling
---------
-
-The *devicehealth* module is enabled with::
-
-  ceph mgr module enable devicehealth
-
-(The module is enabled by default.)
-
-To turn on automatic device health monitoring, including regular (daily)
-scraping of device health metrics like SMART::
-
-  ceph device monitoring on
-
-To disable monitoring,::
-
-  ceph device monitoring off
-
-Scraping
---------
-
-Health metrics can be scraped from all devices with::
-
-  ceph device scrape-health-metrics
-
-A single device can be scraped with::
-
-  ceph device scrape-health-metrics <device-id>
-
-Or a single daemon's devices can be scraped with::
-
-  ceph device scrape-daemon-health-metrics <who>
-
-The stored health metrics for a device can be retrieved (optionally
-for a specific timestamp) with::
-
-  ceph device get-health-metrics <devid> [sample-timestamp]
-
-Health monitoring
------------------
-
-By default, the devicehealth module wakes up periodically and checks
-the health of all devices in the system.  This will raise health
-alerts if devices are expected to fail soon.  This can be disabled by
-turning off the ``mgr/devicehealth/enable_monitoring`` option.
-
-The ``mgr/devicehealth/warn_threshold`` controls how soon an expected
-device failure must be before we generate a health warning.
-
-If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
-default), then for devices that are expected to fail soon the module
-will automatically migrate data away from them by marking the devices
-"out".
-
-The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
-expected device failure must be before we automatically mark an osd
-"out".
-
-The stored life expectancy of all devices can be checked, and any
-appropriate health alerts generated, with::
-
-  ceph device check-health
diff --git a/doc/mgr/index.rst b/doc/mgr/index.rst
index 8b3487dd8070a..0fd60466e8339 100644
--- a/doc/mgr/index.rst
+++ b/doc/mgr/index.rst
@@ -39,7 +39,6 @@ sensible.
     Telemetry plugin <telemetry>
     Iostat plugin <iostat>
     Crash plugin <crash>
-    Devicehealth plugin <devicehealth>
     Orchestrator CLI plugin <orchestrator_cli>
     Rook plugin <rook>
     Insights plugin <insights>
diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst
new file mode 100644
index 0000000000000..2815abbd96f54
--- /dev/null
+++ b/doc/rados/operations/devices.rst
@@ -0,0 +1,98 @@
+Device Management
+=================
+
+Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
+which daemons, and collects health metrics about those devices in order to
+provide tools to predict and/or automatically respond to hardware failure.
+
+Device tracking
+---------------
+
+You can query which storage devices are in use with::
+
+  ceph device ls
+
+You can also list devices by daemon or by host::
+
+  ceph device ls-by-daemon <daemon>
+  ceph device ls-by-host <host>
+
+For any individual device, you can query information about its
+location and how it is being consumed with::
+
+  ceph device info <devid>
+
+
+Enabling monitoring
+-------------------
+
+Ceph can also monitor health metrics associated with your devices.  For
+example, SATA hard disks implement a standard called SMART that
+provides a wide range of internal metrics about the device's usage and
+health, like the number of hours powered on, number of power cycles,
+or unrecoverable read errors.  Other device types like SAS and NVMe
+implement a similar set of metrics (via slightly different standards).
+All of these can be collected by Ceph via the ``smartctl`` tool.
+
+You can enable or disable health monitoring with::
+
+  ceph device monitoring on
+
+or::
+
+  ceph device monitoring off
+
+
+Scraping
+--------
+
+If monitoring is enabled, metrics will automatically be scraped at regular intervals.  That interval can be configured with::
+
+  ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
+
+The default is to scrape once every 24 hours.
+
+You can manually trigger a scrape of all devices with::
+
+  ceph device scrape-health-metrics
+
+A single device can be scraped with::
+
+  ceph device scrape-health-metrics <devid>
+
+Or a single daemon's devices can be scraped with::
+
+  ceph device scrape-daemon-health-metrics <who>
+
+The stored health metrics for a device can be retrieved (optionally
+for a specific timestamp) with::
+
+  ceph device get-health-metrics <devid> [sample-timestamp]
+
+Failure prediction
+------------------
+
+TBD
+
+Health alerts
+-------------
+
+The ``mgr/devicehealth/warn_threshold`` option controls how soon an
+expected device failure must be before we generate a health warning.
+
+The stored life expectancy of all devices can be checked, and any
+appropriate health alerts generated, with::
+
+  ceph device check-health
+
+Automatic mitigation
+--------------------
+
+If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
+default), the devicehealth module automatically migrates data away
+from devices that are expected to fail soon by marking those devices
+"out".
+
+The ``mgr/devicehealth/mark_out_threshold`` option controls how soon an
+expected device failure must be before we automatically mark an OSD
+"out".
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index 6012f757d11e1..8e93886b80850 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -60,6 +60,7 @@ with new hardware.
 
 	add-or-rm-osds
 	add-or-rm-mons
+	devices
 	bluestore-migration
 	Command Reference <control>
 
-- 
2.39.5