From: Sage Weil <sage@newdream.net>
Date: Tue, 6 Mar 2012 23:31:29 +0000 (-0800)
Subject: doc: describe 'stuck' states we check for
X-Git-Tag: v0.44~45^2~7^2~9
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=2bec51a21ecad7aecf69a8c1dea25389031dc82d;p=ceph.git

doc: describe 'stuck' states we check for

Signed-off-by: Sage Weil <sage@newdream.net>
---

diff --git a/doc/ops/manage/failures/osd.rst b/doc/ops/manage/failures/osd.rst
index 739b38a93be27..ddf32392fd20c 100644
--- a/doc/ops/manage/failures/osd.rst
+++ b/doc/ops/manage/failures/osd.rst
@@ -56,6 +56,23 @@ daemons will allow the cluster to recover that PG (and, presumably,
 many others).
 
 
+Stuck PGs
+=========
+
+It is normal for PGs to enter states like "degraded" or "peering"
+following a failure.  These states usually indicate the normal
+progression through the failure recovery process.  However, if a PG
+stays in one of these states for a long time, this may be an indication
+of a larger problem.  For this reason, the monitor will warn when PGs
+get "stuck" in a non-optimal state.  Specifically, we check for:
+
+* ``inactive`` - the PG has not been ``active`` for too long (i.e., hasn't
+  been able to service read/write requests)
+* ``unclean`` - the PG has not been ``clean`` for too long (i.e.,
+  hasn't been able to completely recover from a previous failure)
+* ``stale`` - the PG status hasn't been updated by a ``ceph-osd`` for too
+  long, indicating that all nodes storing this PG may be down
+
 
 PG down (peering failure)
 =========================