doc: describe 'stuck' states we check for

author Sage Weil <sage@newdream.net>

Tue, 6 Mar 2012 23:31:29 +0000 (15:31 -0800)

committer Sage Weil <sage@newdream.net>

Wed, 7 Mar 2012 01:05:29 +0000 (17:05 -0800)
diff --git a/doc/ops/manage/failures/osd.rst b/doc/ops/manage/failures/osd.rst

index 739b38a93be272857338ca61e58355bd736ca3e7..ddf32392fd20c424a1afac00c3961eb067bcd510 100644
--- a/doc/ops/manage/failures/osd.rst
+++ b/doc/ops/manage/failures/osd.rst
@@ -56,6 +56,23 @@ daemons will allow the cluster to recover that PG (and, presumably,
  many others).
  
  
+Stuck PGs
+=========
+
+It is normal for PGs to enter states like "degraded" or "peering"
+following a failure.  These states usually indicate ordinary
+progression through the failure recovery process.  However, if a PG
+stays in one of these states for a long time, this may be an indication
+of a larger problem.  For this reason, the monitor will warn when PGs
+get "stuck" in a non-optimal state.  Specifically, we check for:
+
+* ``inactive`` - the PG has not been ``active`` for too long (i.e., hasn't
+  been able to service read/write requests)
+* ``unclean`` - the PG has not been ``clean`` for too long (i.e.,
+  hasn't been able to completely recover from a previous failure)
+* ``stale`` - the PG status hasn't been updated by a ``ceph-osd``,
+  indicating that all nodes storing this PG may be down
+
  
  PG down (peering failure)
  =========================
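
The three stuck checks described in the added section can be illustrated with a small Python sketch. It is not the monitor's implementation: the `PGTimes` record, its field names, the `stuck_states` helper, and the 300-second threshold are all assumptions made for illustration. The patch itself only says "too long" and does not give a value.

```python
from dataclasses import dataclass

# Assumed threshold in seconds; the patch only says "too long".
STUCK_THRESHOLD_S = 300.0


@dataclass
class PGTimes:
    """Hypothetical per-PG timestamps (seconds since some epoch)."""
    last_active: float  # last time the PG was active
    last_clean: float   # last time the PG was clean
    last_fresh: float   # last time a ceph-osd updated the PG status


def stuck_states(pg, now, threshold=STUCK_THRESHOLD_S):
    """Return the stuck states that apply to this PG at time `now`."""
    checks = [
        ("inactive", pg.last_active),
        ("unclean", pg.last_clean),
        ("stale", pg.last_fresh),
    ]
    # A state counts as stuck once its timestamp is older than the threshold.
    return [name for name, ts in checks if now - ts > threshold]
```

A PG that was active and reported recently, but has not been clean for longer than the threshold, would be flagged only as `unclean` under these assumptions.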