From: Zac Dover
Date: Sun, 2 Apr 2023 22:03:29 +0000 (+1000)
Subject: doc/rados: edit ops/monitoring.rst (2 of 3)
X-Git-Tag: v17.2.7~498^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=134c63991691bb4ef90a6d2d7db276981b9b6102;p=ceph.git

doc/rados: edit ops/monitoring.rst (2 of 3)

Line-edit the second third of doc/rados/operations/monitoring.rst.

Follows https://github.com/ceph/ceph/pull/50670.

https://tracker.ceph.com/issues/58485

Co-authored-by: Anthony D'Atri
Signed-off-by: Zac Dover
(cherry picked from commit 41684ebd33b5c9fe707f5a33b27c55ed29cd5ede)
---

diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst
index e7922193c8444..5ba5d397443f1 100644
--- a/doc/rados/operations/monitoring.rst
+++ b/doc/rados/operations/monitoring.rst
@@ -178,25 +178,26 @@ to a healthy state:
     2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
     2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
 
-
 Network Performance Checks
 --------------------------
 
-Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We
-also use the response times to monitor network performance.
-While it is possible that a busy OSD could delay a ping response, we can assume
-that if a network switch fails multiple delays will be detected between distinct pairs of OSDs.
+Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
+availability and network performance. If a single delayed response is detected,
+this might indicate nothing more than a busy OSD. But if multiple delays
+between distinct pairs of OSDs are detected, this might indicate a failed
+network switch, a NIC failure, or a layer 1 failure.
 
-By default we will warn about ping times which exceed 1 second (1000 milliseconds).
+By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
+health check (a ``HEALTH_WARN``). For example:
 
 ::
 
     HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
 
-The health detail will add the combination of OSDs are seeing the delays and by how much. There is a limit of 10
-detail line items.
-
-::
+In the output of the ``ceph health detail`` command, you can see which OSDs are
+experiencing delays and how long the delays are. The output of ``ceph health
+detail`` is limited to ten lines. Here is an example of the output you can
+expect from the ``ceph health detail`` command::
 
     [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
         Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
@@ -204,11 +205,15 @@ detail line items.
         Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
         Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
 
-To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be
-sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second
-(1000 milliseconds) can be overridden as an argument in milliseconds.
+To see more detail and to collect a complete dump of network performance
+information, use the ``dump_osd_network`` command. This command is usually sent
+to a Ceph Manager Daemon, but it can be used to collect information about a
+specific OSD's interactions by sending it to that OSD. The default threshold
+for a slow heartbeat is 1 second (1000 milliseconds), but this can be
+overridden by providing a number of milliseconds as an argument.
 
-The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr.
+To show all network performance data with a specified threshold of 0, send the
+following command to the mgr:
 
 .. prompt:: bash $
 
@@ -292,26 +297,26 @@ The following command will show all gathered network performance data by specify
 
 
 
-Muting health checks
+Muting Health Checks
 --------------------
 
-Health checks can be muted so that they do not affect the overall
-reported status of the cluster. Alerts are specified using the health
-check code (see :ref:`health-checks`):
+Health checks can be muted so that they have no effect on the overall
+reported status of the cluster. For example, if the cluster has raised a
+single health check and then you mute that health check, then the cluster will report a status of ``HEALTH_OK``.
+To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and
+run the following command:
 
 .. prompt:: bash $
 
    ceph health mute <code>
 
-For example, if there is a health warning, muting it will make the
-cluster report an overall status of ``HEALTH_OK``. For example, to
-mute an ``OSD_DOWN`` alert,:
+For example, to mute an ``OSD_DOWN`` health check, run the following command:
 
 .. prompt:: bash $
 
    ceph health mute OSD_DOWN
 
-Mutes are reported as part of the short and long form of the ``ceph health`` command.
+Mutes are reported as part of the short and long form of the ``ceph health`` command's output.
 For example, in the above scenario, the cluster would report:
 
 .. prompt:: bash $
@@ -332,7 +337,7 @@ For example, in the above scenario, the cluster would report:
     (MUTED) OSD_DOWN 1 osds down
         osd.1 is down
 
-A mute can be explicitly removed with:
+A mute can be removed by running the following command:
 
 .. prompt:: bash $
 
@@ -344,53 +349,49 @@ For example:
 
    ceph health unmute OSD_DOWN
 
-A health check mute may optionally have a TTL (time to live)
-associated with it, such that the mute will automatically expire
-after the specified period of time has elapsed. The TTL is specified as an optional
-duration argument, e.g.:
+A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive)
+associated with it: this means that the mute will automatically expire
+after a specified period of time. The TTL is specified as an optional
+duration argument, as seen in the following examples:
 
 .. prompt:: bash $
 
    ceph health mute OSD_DOWN 4h   # mute for 4 hours
-   ceph health mute MON_DOWN 15m   # mute for 15 minutes
+   ceph health mute MON_DOWN 15m  # mute for 15 minutes
 
-Normally, if a muted health alert is resolved (e.g., in the example
-above, the OSD comes back up), the mute goes away. If the alert comes
+Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check
+in the example above has come back up), the mute goes away. If the health check comes
 back later, it will be reported in the usual way.
 
-It is possible to make a mute "sticky" such that the mute will remain even if the
-alert clears. For example:
+It is possible to make a health mute "sticky": this means that the mute will remain even if the
+health check clears. For example, to make a health mute "sticky", you might run the following command:
 
 .. prompt:: bash $
 
    ceph health mute OSD_DOWN 1h --sticky   # ignore any/all down OSDs for next hour
 
-Most health mutes also disappear if the extent of an alert gets worse. For example,
-if there is one OSD down, and the alert is muted, the mute will disappear if one
-or more additional OSDs go down. This is true for any health alert that involves
-a count indicating how much or how many of something is triggering the warning or
-error.
+Most health mutes disappear if the unhealthy condition that triggered the health check gets worse.
+For example, suppose that there is one OSD down and the health check is muted. In that case, if
+one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value.
 
-
-Detecting configuration issues
+Detecting Configuration Issues
 ==============================
 
-Although Ceph continuously monitors itself, some configuration issues can be
-detected only with an external tool called `ceph-medic
-`_.
+Although Ceph continuously monitors itself, some configuration issues can be
+detected only with an external tool called ``ceph-medic``.
 
 Checking a Cluster's Usage Stats
 ================================
 
-To check a cluster's data usage and data distribution among pools, you can
-use the ``df`` option. It is similar to Linux ``df``. Execute
-the following:
+To check a cluster's data usage and data distribution among pools, use the
+``df`` command. This option is similar to Linux's ``df`` command. Run the
+following command:
 
 .. prompt:: bash $
 
    ceph df
 
-The output of ``ceph df`` looks like this::
+The output of ``ceph df`` resembles the following::
 
     CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
     ssd    202 GiB  200 GiB  2.0 GiB   2.0 GiB       1.00
@@ -403,10 +404,6 @@ The output of ``ceph df`` looks like this::
     cephfs.a.data       3   32      0 B      0 B      0 B       0     0 B      0 B      0 B      0     99 GiB            N/A          N/A      0         0 B          0 B
     test                4   32   22 MiB   22 MiB   50 KiB     248  19 MiB   19 MiB   50 KiB      0    297 GiB            N/A          N/A    248         0 B          0 B
 
-
-
-
-
 - **CLASS:** for example, "ssd" or "hdd"
 - **SIZE:** The amount of storage capacity managed by the cluster.
 - **AVAIL:** The amount of free space available in the cluster.
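As a supplementary sketch of the workflow covered by the edited text above, the
commands can be combined as follows. The mgr admin-socket path
(``/var/run/ceph/ceph-mgr.x.asok``) and the ``OSD_DOWN`` code are illustrative
placeholders; substitute the daemon name and health check code from your own
cluster::

    # Show current health checks, including any slow-heartbeat details
    # (the detail output is limited to ten lines per check)
    ceph health detail

    # Dump all gathered network performance data by sending dump_osd_network
    # to the mgr admin socket with a threshold of 0 milliseconds
    ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0

    # Mute a health check for four hours, then remove the mute early
    ceph health mute OSD_DOWN 4h
    ceph health unmute OSD_DOWN

    # Check data usage and data distribution among pools
    ceph df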