From: Venky Shankar <vshankar@redhat.com>
Date: Wed, 13 Oct 2021 05:32:15 +0000 (+0530)
Subject: doc / cephfs: health message codes should be permalinks
X-Git-Tag: v17.1.0~697^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=3d97d6d98f0a72272362c154b5c74d158c572eaa;p=ceph-ci.git

doc / cephfs: health message codes should be permalinks

... so that such links can be included in alert warnings.

Additionally, document some other health warnings. Credit to @pcuzner
for pointing out that not all health warnings have been documented.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
---
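Sphinx takes each section's HTML anchor from docutils' id generation. That step lowercases the heading text and collapses every run of non-alphanumeric characters into a single hyphen, so the ``MDS_TRIM`` heading should become `#mds-trim`.

The sketch below is a simplified port of docutils' `nodes.make_id()` that skips its Unicode normalization steps, because these headings are plain ASCII. It makes two further assumptions. First, `BASE_URL` is an assumed location for the rendered page. Second, `make_anchor()` and `permalink()` are hypothetical helpers and are not part of Ceph or docutils.

```python
import re

# Simplified from docutils' nodes.make_id(): the Unicode normalization and
# digraph translation steps are omitted since these headings are plain ASCII.
_NON_ID_CHARS = re.compile(r"[^a-z0-9]+")
_NON_ID_AT_ENDS = re.compile(r"^[-0-9]+|-+$")

# Assumption: where the rendered health-messages page is published.
BASE_URL = "https://docs.ceph.com/en/latest/cephfs/health-messages/"


def make_anchor(heading: str) -> str:
    """Return the anchor a section heading is expected to receive (hypothetical helper)."""
    # Inline-literal markup (``...``) is not part of the title's text.
    text = heading.replace("``", "")
    # Lowercase, collapse whitespace, then turn non-alphanumeric runs into '-'.
    anchor = _NON_ID_CHARS.sub("-", " ".join(text.lower().split()))
    # Drop leading hyphens/digits and trailing hyphens, as docutils does.
    return _NON_ID_AT_ENDS.sub("", anchor)


def permalink(heading: str) -> str:
    """Build a full link to a heading's section (hypothetical helper)."""
    return BASE_URL + "#" + make_anchor(heading)


if __name__ == "__main__":
    for heading in (
        "``MDS_TRIM``",
        "``MDS_CLIENT_OLDEST_TID``, ``MDS_CLIENT_OLDEST_TID_MANY``",
        "``FS_DEGRADED``",
    ):
        print(permalink(heading))
```

A heading that lists two codes produces one combined anchor. For example, an alert for ``MDS_CLIENT_OLDEST_TID_MANY`` would link to `#mds-client-oldest-tid-mds-client-oldest-tid-many` rather than to a fragment specific to that code.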

diff --git a/doc/cephfs/health-messages.rst b/doc/cephfs/health-messages.rst
index 28ceb704a9d..790fc3fdedb 100644
--- a/doc/cephfs/health-messages.rst
+++ b/doc/cephfs/health-messages.rst
@@ -72,7 +72,8 @@ slow requests.
 This page lists the health checks raised by MDS daemons. For the checks from
 other daemons, please see :ref:`health-checks`.
 
-* ``MDS_TRIM``
+``MDS_TRIM``
+------------
 
   Message
     "Behind on trimming..."
@@ -85,7 +86,9 @@ other daemons, please see :ref:`health-checks`.
     too slowly, or a software bug is preventing trimming, then this health
     message may appear.  The threshold for this message to appear is controlled by
     the config option ``mds_log_warn_factor``, the default is 2.0.
-* ``MDS_HEALTH_CLIENT_LATE_RELEASE``, ``MDS_HEALTH_CLIENT_LATE_RELEASE_MANY``
+
+``MDS_HEALTH_CLIENT_LATE_RELEASE``, ``MDS_HEALTH_CLIENT_LATE_RELEASE_MANY``
+---------------------------------------------------------------------------
 
   Message
     "Client *name* failing to respond to capability release"
@@ -96,7 +99,9 @@ other daemons, please see :ref:`health-checks`.
     is unresponsive or buggy, it might fail to do so promptly or fail to do
     so at all.  This message appears if a client has taken longer than
     ``session_timeout`` (default 60s) to comply.
-* ``MDS_CLIENT_RECALL``, ``MDS_HEALTH_CLIENT_RECALL_MANY``
+
+``MDS_CLIENT_RECALL``, ``MDS_HEALTH_CLIENT_RECALL_MANY``
+--------------------------------------------------------
 
   Message
     "Client *name* failing to respond to cache pressure"
@@ -111,7 +116,9 @@ other daemons, please see :ref:`health-checks`.
     ``mds_recall_warning_threshold`` capabilities (decaying with a half-life of
     ``mds_recall_max_decay_rate``) within the last
     ``mds_recall_warning_decay_rate`` second.
-* ``MDS_CLIENT_OLDEST_TID``, ``MDS_CLIENT_OLDEST_TID_MANY``
+
+``MDS_CLIENT_OLDEST_TID``, ``MDS_CLIENT_OLDEST_TID_MANY``
+---------------------------------------------------------
 
   Message
     "Client *name* failing to advance its oldest client/flush tid"
@@ -124,7 +131,9 @@ other daemons, please see :ref:`health-checks`.
     appears if a client appears to have more than ``max_completed_requests``
     (default 100000) requests that are complete on the MDS side but haven't
     yet been accounted for in the client's *oldest tid* value.
-* ``MDS_DAMAGE``
+
+``MDS_DAMAGE``
+--------------
 
   Message
     "Metadata damage detected"
@@ -135,7 +144,9 @@ other daemons, please see :ref:`health-checks`.
     client accesses to the damaged subtree will return IO errors.  Use
     the ``damage ls`` admin socket command to get more detail on the damage.
     This message appears as soon as any damage is encountered.
-* ``MDS_HEALTH_READ_ONLY``
+
+``MDS_HEALTH_READ_ONLY``
+------------------------
 
   Message
     "MDS in read-only mode"
@@ -145,7 +156,9 @@ other daemons, please see :ref:`health-checks`.
     MDS will go into readonly mode if it encounters a write error while
     writing to the metadata pool, or if forced to by an administrator using
     the *force_readonly* admin socket command.
-* ``MDS_SLOW_REQUEST``
+
+``MDS_SLOW_REQUEST``
+--------------------
 
   Message
     "*N* slow requests are blocked"
@@ -157,7 +170,9 @@ other daemons, please see :ref:`health-checks`.
     Use the ``ops`` admin socket command to list outstanding metadata operations.
     This message appears if any client requests have taken longer than
     ``mds_op_complaint_time`` (default 30s).
-* ``MDS_CACHE_OVERSIZED``
+
+``MDS_CACHE_OVERSIZED``
+-----------------------
 
   Message
     "Too many inodes in cache"
@@ -168,3 +183,58 @@ other daemons, please see :ref:`health-checks`.
     the actual cache size (in memory) is at least 50% greater than
     ``mds_cache_memory_limit`` (default 1GB). Modify ``mds_health_cache_threshold``
     to set the warning ratio.
+
+``FS_WITH_FAILED_MDS``
+----------------------
+
+  Message
+    "Some MDS ranks do not have standby replacements"
+
+  Description
+    Normally, a failed MDS rank will be replaced by a standby MDS. This situation
+    is transient and is not considered critical. However, if there are no standby
+    MDSs available to replace an active MDS rank, this health warning is generated.
+
+``MDS_INSUFFICIENT_STANDBY``
+----------------------------
+
+  Message
+    "Insufficient number of available standby(-replay) MDS daemons than configured"
+
+  Description
+    The minimum number of standby(-replay) MDS daemons can be configured with
+    the ``standby_count_wanted`` setting. This health warning is generated when
+    fewer standby(-replay) MDS daemons are available than this configured
+    value.
+
+``FS_DEGRADED``
+---------------
+
+  Message
+    "Some MDS ranks have been marked failed or damaged"
+
+  Description
+    One or more MDS ranks have ended up in a failed or damaged state due to
+    an unrecoverable error. The file system may be partially or fully
+    unavailable while one (or more) ranks are offline.
+
+``MDS_UP_LESS_THAN_MAX``
+------------------------
+
+  Message
+    "Number of active ranks are less than configured number of maximum MDSs"
+
+  Description
+    The maximum number of MDS ranks can be configured by setting the ``max_mds``
+    configuration variable. This health warning is generated when the number
+    of active MDS ranks falls below this configured value.
+
+``MDS_ALL_DOWN``
+----------------
+
+  Message
+    "None of the MDS ranks are available (file system offline)"
+
+  Description
+    All MDS ranks are unavailable, resulting in the file system being completely
+    offline.