From dbb1dd33e6cc7b0443aee15602f89da1bb1e0d02 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Tue, 1 Aug 2017 09:25:27 -0400
Subject: [PATCH] doc/rados/operations/health-checks: add PG health check commentary

Include a link to pg-repair.rst, although there is no content there yet.

Signed-off-by: Sage Weil
---
 doc/rados/operations/health-checks.rst | 214 +++++++++++++++++++++++++
 doc/rados/operations/pg-repair.rst     |   4 +
 2 files changed, 218 insertions(+)
 create mode 100644 doc/rados/operations/pg-repair.rst

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index e5156fbe3af..b612995081e 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -218,79 +218,293 @@ You can either raise the pool quota with::
 or delete some existing data to reduce utilization.
 
+
 Data health (pools & placement groups)
 ------------------------------
 
 PG_AVAILABILITY
 _______________
 
+Data availability is reduced, meaning that the cluster is unable to
+service potential read or write requests for some data in the cluster.
+Specifically, one or more PGs is in a state that does not allow IO
+requests to be serviced.  Problematic PG states include *peering*,
+*stale*, *incomplete*, and the lack of *active* (if those conditions
+do not clear quickly).
+
+Detailed information about which PGs are affected is available from::
+
+  ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+  ceph tell <pgid> query
 
 PG_DEGRADED
 ___________
 
+Data redundancy is reduced for some data, meaning the cluster does not
+have the desired number of replicas for all data (for replicated
+pools) or erasure code fragments (for erasure coded pools).
+Specifically, one or more PGs:
+
+* has the *degraded* or *undersized* flag set, meaning there are not
+  enough instances of that placement group in the cluster; or
+* has not had the *clean* flag set for some time.
+
+Detailed information about which PGs are affected is available from::
+
+  ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+  ceph tell <pgid> query
+
 
 PG_DEGRADED_FULL
 ________________
 
+Data redundancy may be reduced or at risk for some data due to a lack
+of free space in the cluster.  Specifically, one or more PGs has the
+*backfill_toofull* or *recovery_toofull* flag set, meaning that the
+cluster is unable to migrate or recover data because one or more OSDs
+is above the *backfillfull* threshold.
+
+See the discussion for *OSD_BACKFILLFULL* or *OSD_FULL* above for
+steps to resolve this condition.
 
 PG_DAMAGED
 __________
 
+Data scrubbing has discovered some problems with data consistency in
+the cluster.  Specifically, one or more PGs has the *inconsistent* or
+*snaptrim_error* flag set, indicating that an earlier scrub operation
+found a problem, or has the *repair* flag set, meaning that a repair
+for such an inconsistency is currently in progress.
+
+See :doc:`pg-repair` for more information.
+
 
 OSD_SCRUB_ERRORS
 ________________
 
+Recent OSD scrubs have uncovered inconsistencies.  This error is generally
+paired with *PG_DAMAGED* (see above).
+
+See :doc:`pg-repair` for more information.
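+
+As a rough illustration (a sketch only, not a complete repair procedure;
+``<pool-name>`` and ``<pgid>`` are placeholders for the affected pool and
+PG), inconsistent PGs and objects can usually be located, and in many
+cases repaired, with a sequence along these lines::
+
+  ceph health detail                                # which PGs are inconsistent
+  rados list-inconsistent-pg <pool-name>            # inconsistent PGs in one pool
+  rados list-inconsistent-obj <pgid> --format=json-pretty   # per-object detail
+  ceph pg repair <pgid>                             # ask the primary OSD to repair
+
+``ceph pg repair`` should be used with some care: for certain classes of
+inconsistency it may overwrite a good copy with a bad one, so it is worth
+reviewing the reported errors first.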
 
 CACHE_POOL_NEAR_FULL
 ____________________
 
+A cache tier pool is nearly full.  Full in this context is determined
+by the ``target_max_bytes`` and ``target_max_objects`` properties on
+the cache pool.  Once the pool reaches the target threshold, write
+requests to the pool may block while data is flushed and evicted
+from the cache, a state that normally leads to very high latencies and
+poor performance.
+
+The cache pool target size can be adjusted with::
+
+  ceph osd pool set <cache-pool-name> target_max_bytes <bytes>
+  ceph osd pool set <cache-pool-name> target_max_objects <objects>
+
+Normal cache flush and evict activity may also be throttled due to reduced
+availability or performance of the base tier, or overall cluster load.
 
 TOO_FEW_PGS
 ___________
 
+The number of PGs in use in the cluster is below the configurable
+threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD.  This can lead
+to suboptimal distribution and balance of data across the OSDs in
+the cluster, and similarly reduce overall performance.
+
+This may be an expected condition if data pools have not yet been
+created.
+
+The PG count for existing pools can be increased or new pools can be
+created.  Please refer to
+:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for
+more information.
 
 TOO_MANY_PGS
 ____________
 
+The number of PGs in use in the cluster is above the configurable
+threshold of ``mon_pg_warn_max_per_osd`` PGs per OSD.  This can lead
+to higher memory utilization for OSD daemons, slower peering after
+cluster state changes (like OSD restarts, additions, or removals), and
+higher load on the Manager and Monitor daemons.
+
+The ``pg_num`` value for existing pools cannot currently be reduced.
+However, the ``pgp_num`` value can, which effectively collocates some
+PGs on the same sets of OSDs, mitigating some of the negative impacts
+described above.  The ``pgp_num`` value can be adjusted with::
+
+  ceph osd pool set <pool-name> pgp_num <value>
+
+Please refer to
+:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for
+more information.
 
 SMALLER_PGP_NUM
 _______________
 
+One or more pools has a ``pgp_num`` value less than ``pg_num``.  This
+is normally an indication that the PG count was increased without
+also increasing ``pgp_num``, which controls the placement behavior.
+
+This is sometimes done deliberately to separate out the *split* step
+when the PG count is adjusted from the data migration that is needed
+when ``pgp_num`` is changed.
+
+This is normally resolved by setting ``pgp_num`` to match ``pg_num``,
+triggering the data migration, with::
+
+  ceph osd pool set <pool-name> pgp_num <pg-num-value>
+
 
 MANY_OBJECTS_PER_PG
 ___________________
 
+One or more pools has an average number of objects per PG that is
+significantly higher than the overall cluster average.  The specific
+threshold is controlled by the ``mon_pg_warn_max_object_skew``
+configuration value.
+
+This is usually an indication that the pool(s) containing most of the
+data in the cluster have too few PGs, and/or that other pools that do
+not contain as much data have too many PGs.  See the discussion of
+*TOO_MANY_PGS* above.
+
+The threshold can be raised to silence the health warning by adjusting
+the ``mon_pg_warn_max_object_skew`` config option on the monitors.
 
 POOL_FULL
 _________
 
+One or more pools has reached (or is very close to reaching) its
+quota.  The threshold to trigger this error condition is controlled by
+the ``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+  ceph osd pool set-quota <pool-name> max_bytes <bytes>
+  ceph osd pool set-quota <pool-name> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
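+
+For example, the current utilization of a pool can be compared against its
+configured quota (a sketch; ``<pool-name>`` is a placeholder for the pool in
+question) with::
+
+  ceph df detail
+  ceph osd pool get-quota <pool-name>
+
+which makes it easier to decide whether to raise the quota, remove it, or
+delete data from the pool.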
 
 POOL_NEAR_FULL
 ______________
 
+One or more pools is approaching its quota.  The threshold to trigger
+this warning condition is controlled by the
+``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+  ceph osd pool set-quota <pool-name> max_bytes <bytes>
+  ceph osd pool set-quota <pool-name> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
 
 OBJECT_MISPLACED
 ________________
 
+One or more objects in the cluster is not stored on the node the
+cluster would like it to be stored on.  This is an indication that
+data migration due to some recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data
+consistency is never at risk, and old copies of objects are never
+removed until the desired number of new copies (in the desired
+locations) are present.
 
 OBJECT_UNFOUND
 ______________
 
+One or more objects in the cluster cannot be found.  Specifically, the
+OSDs know that a new or updated copy of an object should exist, but a
+copy of that version of the object has not been found on OSDs that are
+currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a down OSD that has the more recent copy of the unfound
+object can be brought back online.  Candidate OSDs can be identified from
+the peering state for the PG(s) responsible for the unfound object::
+
+  ceph tell <pgid> query
+
+If the latest copy of the object is not available, the cluster can be
+told to roll back to a previous version of the object.  See
+:doc:`troubleshooting-pg#Unfound-objects` for more information.
 
 REQUEST_SLOW
 ____________
 
+One or more OSD requests is taking a long time to process.  This can
+be an indication of extreme load, a slow storage device, or a software
+bug.
+
+The request queue on the OSD(s) in question can be queried with the
+following command, executed from the OSD host::
+
+  ceph daemon osd.<id> ops
+
+A summary of the slowest recent requests can be seen with::
+
+  ceph daemon osd.<id> dump_historic_ops
+
+The location of an OSD can be found with::
+
+  ceph osd find osd.<id>
 
 REQUEST_STUCK
 _____________
 
+One or more OSD requests has been blocked for an extremely long time.
+This is an indication that either the cluster has been unhealthy for
+an extended period of time (e.g., not enough running OSDs) or there is
+some internal problem with the OSD.  See the discussion of
+*REQUEST_SLOW* above.
 
 PG_NOT_SCRUBBED
 _______________
 
+One or more PGs has not been scrubbed recently.  PGs are normally
+scrubbed every ``mon_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_scrubbed`` such intervals have elapsed
+without a scrub.
+
+PGs will not scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a scrub of a clean PG with::
+
+  ceph pg scrub <pgid>
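+
+The timestamps of the most recent scrub and deep scrub are included in
+each PG's statistics, so (as a rough sketch; the exact output format
+varies by release) the scrub history of a particular PG can be inspected
+with::
+
+  ceph tell <pgid> query | grep scrub_stamp
+
+It is also possible to ask a single OSD to scrub the PGs for which it is
+the primary (``ceph osd deep-scrub <id>`` does the same for deep scrubs)::
+
+  ceph osd scrub <id>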
 
 PG_NOT_DEEP_SCRUBBED
 ____________________
 
+One or more PGs has not been deep scrubbed recently.  PGs are normally
+deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_deep_scrubbed`` such intervals have elapsed
+without a scrub.
+
+PGs will not (deep) scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a deep scrub of a clean PG with::
+
+  ceph pg deep-scrub <pgid>
 
 
 CephFS
 ------
diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst
new file mode 100644
index 00000000000..0d6692a35e9
--- /dev/null
+++ b/doc/rados/operations/pg-repair.rst
@@ -0,0 +1,4 @@
+Repairing PG inconsistencies
+============================
+
+
-- 
2.39.5