From 3e791d1f9eec94730e470f2a72e73c1ba4c04558 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Fri, 14 Sep 2018 14:06:41 -0500
Subject: [PATCH] doc: update docs for device management

Move this out of mgr section and into rados operations. Leave a
placeholder for failure prediction, which is still a work in progress.

Signed-off-by: Sage Weil
---
 doc/mgr/devicehealth.rst         | 70 -----------------------
 doc/mgr/index.rst                |  1 -
 doc/rados/operations/devices.rst | 98 ++++++++++++++++++++++++++++++++
 doc/rados/operations/index.rst   |  1 +
 4 files changed, 99 insertions(+), 71 deletions(-)
 delete mode 100644 doc/mgr/devicehealth.rst
 create mode 100644 doc/rados/operations/devices.rst

diff --git a/doc/mgr/devicehealth.rst b/doc/mgr/devicehealth.rst
deleted file mode 100644
index 3dc4a08bab6f6..0000000000000
--- a/doc/mgr/devicehealth.rst
+++ /dev/null
@@ -1,70 +0,0 @@
-Devicehealth plugin
-===================
-
-The *devicehealth* plugin includes code to manage physical devices
-that back Ceph daemons (e.g., OSDs). This includes scraping health
-metrics (e.g., SMART) and responding to health metrics by migrating
-data away from failing devices.
-
-Enabling
---------
-
-The *devicehealth* module is enabled with::
-
-  ceph mgr module enable devicehealth
-
-(The module is enabled by default.)
-
-To turn on automatic device health monitoring, including regular (daily)
-scraping of device health metrics like SMART::
-
-  ceph device monitoring on
-
-To disable monitoring,::
-
-  ceph device monitoring off
-
-Scraping
---------
-
-Health metrics can be scraped from all devices with::
-
-  ceph device scrape-health-metrics
-
-A single device can be scraped with::
-
-  ceph device scrape-health-metrics
-
-Or a single daemon's devices can be scraped with::
-
-  ceph device scrape-daemon-health-metrics
-
-The stored health metrics for a device can be retrieved (optionally
-for a specific timestamp) with::
-
-  ceph device get-health-metrics [sample-timestamp]
-
-Health monitoring
------------------
-
-By default, the devicehealth module wakes up periodically and checks
-the health of all devices in the system. This will raise health
-alerts if devices are expected to fail soon. This can be disabled by
-turning off the ``mgr/devicehealth/enable_monitoring`` option.
-
-The ``mgr/devicehealth/warn_threshold`` controls how soon an expected
-device failure must be before we generate a health warning.
-
-If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
-default), then for devices that are expected to fail soon the module
-will automatically migrate data away from them by marking the devices
-"out".
-
-The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
-expected device failure must be before we automatically mark an osd
-"out".
-
-The stored life expectancy of all devices can be checked, and any
-appropriate health alerts generated, with::
-
-  ceph device check-health
diff --git a/doc/mgr/index.rst b/doc/mgr/index.rst
index 8b3487dd8070a..0fd60466e8339 100644
--- a/doc/mgr/index.rst
+++ b/doc/mgr/index.rst
@@ -39,7 +39,6 @@ sensible.
     Telemetry plugin
     Iostat plugin
     Crash plugin
-    Devicehealth plugin
     Orchestrator CLI plugin
     Rook plugin
     Insights plugin
diff --git a/doc/rados/operations/devices.rst b/doc/rados/operations/devices.rst
new file mode 100644
index 0000000000000..2815abbd96f54
--- /dev/null
+++ b/doc/rados/operations/devices.rst
@@ -0,0 +1,98 @@
+Device Management
+=================
+
+Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
+which daemons, and collects health metrics about those devices in order to
+provide tools to predict and/or automatically respond to hardware failure.
+
+Device tracking
+---------------
+
+You can query which storage devices are in use with::
+
+  ceph device ls
+
+You can also list devices by daemon or by host::
+
+  ceph device ls-by-daemon
+  ceph device ls-by-host
+
+For any individual device, you can query information about its
+location and how it is being consumed with::
+
+  ceph device info
+
+
+Enabling monitoring
+-------------------
+
+Ceph can also monitor health metrics associated with your device. For
+example, SATA hard disks implement a standard called SMART that
+provides a wide range of internal metrics about the device's usage and
+health, like the number of hours powered on, number of power cycles,
+or unrecoverable read errors. Other device types like SAS and NVMe
+implement a similar set of metrics (via slightly different standards).
+All of these can be collected by Ceph via the ``smartctl`` tool.
+
+You can enable or disable health monitoring with::
+
+  ceph device monitoring on
+
+or::
+
+  ceph device monitoring off
+
+
+Scraping
+--------
+
+If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with::
+
+  ceph config set mgr mgr/devicehealth/scrape_frequency
+
+The default is to scrape once every 24 hours.
+
+You can manually trigger a scrape of all devices with::
+
+  ceph device scrape-health-metrics
+
+A single device can be scraped with::
+
+  ceph device scrape-health-metrics
+
+Or a single daemon's devices can be scraped with::
+
+  ceph device scrape-daemon-health-metrics
+
+The stored health metrics for a device can be retrieved (optionally
+for a specific timestamp) with::
+
+  ceph device get-health-metrics [sample-timestamp]
+
+Failure prediction
+------------------
+
+TBD
+
+Health alerts
+-------------
+
+The ``mgr/devicehealth/warn_threshold`` controls how soon an expected
+device failure must be before we generate a health warning.
+
+The stored life expectancy of all devices can be checked, and any
+appropriate health alerts generated, with::
+
+  ceph device check-health
+
+Automatic Mitigation
+--------------------
+
+If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
+default), then for devices that are expected to fail soon the module
+will automatically migrate data away from them by marking the devices
+"out".
+
+The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
+expected device failure must be before we automatically mark an osd
+"out".
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index 6012f757d11e1..8e93886b80850 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -60,6 +60,7 @@ with new hardware.
 
     add-or-rm-osds
     add-or-rm-mons
+    devices
     bluestore-migration
     Command Reference
 
-- 
2.39.5
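
Illustration (not part of the patch): the new devices.rst describes a manual workflow of enabling monitoring, scraping all devices, and then checking health. The sketch below strings those three documented commands together in Python. The helper names `refresh_plan` and `run_refresh` are hypothetical, and the code assumes the `ceph` CLI is on PATH with admin credentials. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Command strings copied verbatim from the documentation added by the patch.
MONITORING_ON = "ceph device monitoring on"
SCRAPE_ALL = "ceph device scrape-health-metrics"
CHECK_HEALTH = "ceph device check-health"


def refresh_plan():
    """Return the argv lists for a manual refresh, in execution order."""
    return [shlex.split(cmd) for cmd in (MONITORING_ON, SCRAPE_ALL, CHECK_HEALTH)]


def run_refresh(dry_run=True):
    """Print each command and execute it only when dry_run is False.

    Hypothetical helper: assumes the `ceph` CLI is installed and that the
    caller is allowed to run device commands against the cluster.
    """
    plan = refresh_plan()
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)
    return plan


if __name__ == "__main__":
    run_refresh(dry_run=True)
```

Calling `run_refresh(dry_run=False)` on a cluster node would execute the commands in order. The dry-run default is a deliberate choice so the sketch is safe to try on a machine without Ceph installed.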