From: Sage Weil
Date: Wed, 31 Jul 2019 10:04:37 +0000 (-0500)
Subject: doc/rados/operations/health-checks: document PG_SLOW_SNAP_TRIMMING
X-Git-Tag: v15.1.0~1877^2~6
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=7e9ba0a1c12090466a4fa1ca25e77b43e4d944b4;p=ceph-ci.git

doc/rados/operations/health-checks: document PG_SLOW_SNAP_TRIMMING

The mitigation steps are weak, but it's not clear what concrete
guidance to provide.

Signed-off-by: Sage Weil
---

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index 0668aa41845..4238abb9378 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -965,6 +965,28 @@ You can manually initiate a scrub of a clean PG with::
 
   ceph pg deep-scrub <pgid>
 
+PG_SLOW_SNAP_TRIMMING
+_____________________
+
+The snapshot trim queue for one or more PGs has exceeded the
+configured warning threshold. This indicates that either an extremely
+large number of snapshots were recently deleted, or that the OSDs are
+unable to trim snapshots quickly enough to keep up with the rate of
+new snapshot deletions.
+
+The warning threshold is controlled by the
+``mon_osd_snap_trim_queue_warn_on`` option (default: 32768).
+
+This warning may trigger if OSDs are under excessive load and unable
+to keep up with their background work, or if the OSDs' internal
+metadata database is heavily fragmented and performing poorly. It may
+also indicate some other performance issue with the OSDs.
+
+The exact size of the snapshot trim queue is reported by the
+``snaptrimq_len`` field of ``ceph pg ls -f json-detail``.
+
+
+
 
 Miscellaneous
 -------------
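
The warning threshold described above can be inspected or adjusted at
runtime. A minimal sketch using the centralized config interface
available in Nautilus and later (the value 65536 is an arbitrary
example for illustration, not a recommendation)::

  # Show the current threshold (default: 32768)
  ceph config get mon mon_osd_snap_trim_queue_warn_on

  # Raise the threshold, e.g. on a cluster where large batch snapshot
  # deletions are routine and the warning is expected noise
  ceph config set mon mon_osd_snap_trim_queue_warn_on 65536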
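
To see which PGs are contributing to the warning, the
``snaptrimq_len`` field can be pulled out of the JSON output with
``jq``. A sketch, assuming the Nautilus-era layout in which ``ceph pg
ls`` wraps the per-PG records in a ``pg_stats`` array (the exact JSON
layout varies across releases)::

  # List PGs with a non-empty snapshot trim queue, largest queue first
  ceph pg ls -f json-detail |
      jq '[.pg_stats[] | select(.snaptrimq_len > 0)
           | {pgid, snaptrimq_len}] | sort_by(-.snaptrimq_len)'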