From: Venky Shankar
Date: Wed, 13 Oct 2021 05:32:15 +0000 (+0530)
Subject: doc / cephfs: health message codes should be permalinks
X-Git-Tag: v17.1.0~697^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=3d97d6d98f0a72272362c154b5c74d158c572eaa;p=ceph-ci.git

doc / cephfs: health message codes should be permalinks

... so that such links can be included in alert warnings.
Additionally, document some other health warnings.

Credit to @pcuzner to point out that not all health warnings
have been documented.

Signed-off-by: Venky Shankar
---

diff --git a/doc/cephfs/health-messages.rst b/doc/cephfs/health-messages.rst
index 28ceb704a9d..790fc3fdedb 100644
--- a/doc/cephfs/health-messages.rst
+++ b/doc/cephfs/health-messages.rst
@@ -72,7 +72,8 @@ slow requests.
 This page lists the health checks raised by MDS daemons. For the checks from
 other daemons, please see :ref:`health-checks`.
 
-* ``MDS_TRIM``
+``MDS_TRIM``
+------------
 
   Message
     "Behind on trimming..."
@@ -85,7 +86,9 @@ other daemons, please see :ref:`health-checks`.
     too slowly, or a software bug is preventing trimming, then this health
     message may appear. The threshold for this message to appear is controlled by
     the config option ``mds_log_warn_factor``, the default is 2.0.
-* ``MDS_HEALTH_CLIENT_LATE_RELEASE``, ``MDS_HEALTH_CLIENT_LATE_RELEASE_MANY``
+
+``MDS_HEALTH_CLIENT_LATE_RELEASE``, ``MDS_HEALTH_CLIENT_LATE_RELEASE_MANY``
+---------------------------------------------------------------------------
 
   Message
     "Client *name* failing to respond to capability release"
@@ -96,7 +99,9 @@ other daemons, please see :ref:`health-checks`.
     is unresponsive or buggy, it might fail to do so promptly or fail to do
     so at all. This message appears if a client has taken longer than
     ``session_timeout`` (default 60s) to comply.
-* ``MDS_CLIENT_RECALL``, ``MDS_HEALTH_CLIENT_RECALL_MANY``
+
+``MDS_CLIENT_RECALL``, ``MDS_HEALTH_CLIENT_RECALL_MANY``
+--------------------------------------------------------
 
   Message
     "Client *name* failing to respond to cache pressure"
@@ -111,7 +116,9 @@ other daemons, please see :ref:`health-checks`.
     ``mds_recall_warning_threshold`` capabilities (decaying with a half-life of
     ``mds_recall_max_decay_rate``) within the last
     ``mds_recall_warning_decay_rate`` second.
-* ``MDS_CLIENT_OLDEST_TID``, ``MDS_CLIENT_OLDEST_TID_MANY``
+
+``MDS_CLIENT_OLDEST_TID``, ``MDS_CLIENT_OLDEST_TID_MANY``
+---------------------------------------------------------
 
   Message
     "Client *name* failing to advance its oldest client/flush tid"
@@ -124,7 +131,9 @@ other daemons, please see :ref:`health-checks`.
     appears if a client appears to have more than ``max_completed_requests``
     (default 100000) requests that are complete on the MDS side but haven't
     yet been accounted for in the client's *oldest tid* value.
-* ``MDS_DAMAGE``
+
+``MDS_DAMAGE``
+--------------
 
   Message
     "Metadata damage detected"
@@ -135,7 +144,9 @@ other daemons, please see :ref:`health-checks`.
     client accesses to the damaged subtree will return IO errors. Use the
     ``damage ls`` admin socket command to get more detail on the damage. This
     message appears as soon as any damage is encountered.
-* ``MDS_HEALTH_READ_ONLY``
+
+``MDS_HEALTH_READ_ONLY``
+------------------------
 
   Message
     "MDS in read-only mode"
@@ -145,7 +156,9 @@ other daemons, please see :ref:`health-checks`.
     MDS will go into readonly mode if it encounters a write error while
     writing to the metadata pool, or if forced to by an administrator using
     the *force_readonly* admin socket command.
-* ``MDS_SLOW_REQUEST``
+
+``MDS_SLOW_REQUEST``
+--------------------
 
   Message
     "*N* slow requests are blocked"
@@ -157,7 +170,9 @@ other daemons, please see :ref:`health-checks`.
     Use the ``ops`` admin socket command to list outstanding metadata operations.
     This message appears if any client requests have taken longer than
     ``mds_op_complaint_time`` (default 30s).
-* ``MDS_CACHE_OVERSIZED``
+
+``MDS_CACHE_OVERSIZED``
+-----------------------
 
   Message
     "Too many inodes in cache"
@@ -168,3 +183,58 @@ other daemons, please see :ref:`health-checks`.
     the actual cache size (in memory) is at least 50% greater than
     ``mds_cache_memory_limit`` (default 1GB). Modify ``mds_health_cache_threshold``
     to set the warning ratio.
+
+``FS_WITH_FAILED_MDS``
+----------------------
+
+  Message
+    "Some MDS ranks do not have standby replacements"
+
+  Description
+    Normally, a failed MDS rank will be replaced by a standby MDS. This situation
+    is transient and is not considered critical. However, if there are no standby
+    MDSs available to replace an active MDS rank, this health warning is generated.
+
+``MDS_INSUFFICIENT_STANDBY``
+----------------------------
+
+  Message
+    "Insufficient number of available standby(-replay) MDS daemons than configured"
+
+  Description
+    The minimum number of standby(-replay) MDS daemons can be configured by setting
+    ``standby_count_wanted`` configuration variable. This health warning is generated
+    when the configured value mismatches the number of standby(-replay) MDS daemons
+    available.
+
+``FS_DEGRADED``
+----------------------------
+
+  Message
+    "Some MDS ranks have been marked failed or damaged"
+
+  Description
+    When one or more MDS rank ends up in failed or damaged state due to
+    an unrecoverable error. The file system may be partially or fully
+    unavailable when one (or more) ranks are offline.
+
+``MDS_UP_LESS_THAN_MAX``
+----------------------------
+
+  Message
+    "Number of active ranks are less than configured number of maximum MDSs"
+
+  Description
+    The maximum number of MDS ranks can be configured by setting ``max_mds``
+    configuration variable. This health warning is generated when the number
+    of MDS ranks falls below this configured value.
+
+``MDS_ALL_DOWN``
+----------------------------
+
+  Message
+    "None of the MDS ranks are available (file system offline)"
+
+  Description
+    All MDS ranks are unavailable resulting in the file system to be completely
+    offline.