From bd97923ceb25f0f6fbb80c3102fd7bb1468d6ba0 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Tue, 6 Mar 2012 20:35:33 -0800
Subject: [PATCH] doc: fix misc typos, bad phrasing

Signed-off-by: Sage Weil
---
 doc/dev/index.rst                   | 12 +++++++++
 doc/ops/manage/failures/index.rst   |  3 +--
 doc/ops/manage/failures/mon.rst     |  6 ++---
 doc/ops/manage/failures/osd.rst     | 40 ++++++++++++++++++++++-------
 doc/ops/manage/failures/radosgw.rst |  8 +++---
 5 files changed, 51 insertions(+), 18 deletions(-)

diff --git a/doc/dev/index.rst b/doc/dev/index.rst
index 69fda91850c26..628b958ae3fbd 100644
--- a/doc/dev/index.rst
+++ b/doc/dev/index.rst
@@ -16,6 +16,18 @@ You can start a development mode Ceph cluster, after compiling the source, with:
 
 .. todo:: vstart is woefully undocumented and full of sharp sticks to poke yourself with.
 
+.. _mailing-list:
+
+Mailing list
+============
+
+The official development email list is ``ceph-devel@vger.kernel.org``. Subscribe by sending
+a message to ``majordomo@vger.kernel.org`` with the line::
+
+  subscribe ceph-devel
+
+in the body of the message.
+
 .. toctree::
    :glob:
 
diff --git a/doc/ops/manage/failures/index.rst b/doc/ops/manage/failures/index.rst
index 6f14a344635c2..2a8d7c3d38dbb 100644
--- a/doc/ops/manage/failures/index.rst
+++ b/doc/ops/manage/failures/index.rst
@@ -10,8 +10,7 @@ be checked with the ``ceph health`` command. If all is well, you get::
   $ ceph health
   HEALTH_OK
 
-and a success error code. If there are problems, you will see
-something like::
+If there are problems, you will see something like::
 
   $ ceph health
   HEALTH_WARN short summary of problem(s)
diff --git a/doc/ops/manage/failures/mon.rst b/doc/ops/manage/failures/mon.rst
index 3ff650ffce445..2618c9157d02e 100644
--- a/doc/ops/manage/failures/mon.rst
+++ b/doc/ops/manage/failures/mon.rst
@@ -4,13 +4,13 @@
 Any single ceph-mon failure should not take down the entire monitor
 cluster as long as a majority of the nodes are available. If that
-is the case--the remainin nodes are able to form a quorum--the ``ceph
+is the case--the remaining nodes are able to form a quorum--the ``ceph
 health`` command will report any problems::
 
   $ ceph health
   HEALTH_WARN 1 mons down, quorum 0,2
 
-and
+and::
 
   $ ceph health detail
   HEALTH_WARN 1 mons down, quorum 0,2
 
@@ -18,7 +18,7 @@ and
 Generally speaking, simply restarting the affected node will repair
 things.
 
-If there are not enough monitors for form a quorum, the ``ceph``
+If there are not enough monitors to form a quorum, the ``ceph``
 command will block trying to reach the cluster. In this situation,
 you need to get enough ``ceph-mon`` daemons running to form a quorum
 before doing anything else with the cluster.
diff --git a/doc/ops/manage/failures/osd.rst b/doc/ops/manage/failures/osd.rst
index 8dfabc68ba0c1..c25ef263294c8 100644
--- a/doc/ops/manage/failures/osd.rst
+++ b/doc/ops/manage/failures/osd.rst
@@ -6,7 +6,8 @@ Single ceph-osd failure
 =======================
 
 When a ceph-osd process dies, the monitor will learn about the failure
-from its peers and report it via the ``ceph health`` command::
+from surviving ceph-osd daemons and report it via the ``ceph health``
+command::
 
   $ ceph health
   HEALTH_WARN 1/3 in osds are down
@@ -23,7 +24,14 @@ Under normal circumstances, simply restarting the ceph-osd daemon
 will allow it to rejoin the cluster and recover. If there is a disk
 failure or other fault preventing ceph-osd from functioning or
 restarting, an error message should be present in its log file in
-``/var/log/ceph``.
+``/var/log/ceph``.
+
+If the daemon stopped because of a heartbeat failure, the underlying
+kernel file system may be unresponsive; check ``dmesg`` output for disk
+or other kernel errors.
+
+If the problem is a software error (failed assertion or other
+unexpected error), it should be reported to the :ref:`mailing-list`.
 
 
 Full cluster
@@ -31,14 +39,14 @@
 If the cluster fills up, the monitor will prevent new data from being
 written. The system puts ceph-osds in two categories: ``nearfull``
-and ``full`, which configurable threshholds for each (80% and 90% by
+and ``full``, with configurable thresholds for each (80% and 90% by
 default). In both cases, full ceph-osds will be reported by ``ceph
 health``::
 
   $ ceph health
   HEALTH_WARN 1 nearfull osds
   osd.2 is near full at 85%
 
-or
+or::
 
   $ ceph health
   HEALTH_ERR 1 nearfull osds, 1 full osds
@@ -85,18 +93,30 @@ Stuck PGs
 =========
 
 It is normal for PGs to enter states like "degraded" or "peering"
 following a failure. Normally these states indicate the normal
-progression through the failure recovery process. However, is a PG
+progression through the failure recovery process. However, if a PG
 stays in one of these states for a long time this may be an
 indication of a larger problem. For this reason, the monitor will
 warn when PGs get "stuck" in a non-optimal state. Specifically, we
 check for:
 
-* ``inactive`` - the PG is has not ``active`` for too long (i.e., hasn't
+* ``inactive`` - the PG has not been ``active`` for too long (i.e., hasn't
   been able to service read/write requests)
 * ``unclean`` - the PG has not been ``clean`` for too long (i.e., hasn't
   been able to completely recover from a previous failure
-* ``stale`` - the PG status hasn't been updated by a ``ceph-osd``,
+* ``stale`` - the PG status has not been updated by a ``ceph-osd``,
   indicating that all nodes storing this PG may be down
 
+You can explicitly list stuck PGs with one of::
+
+  $ ceph pg dump_stuck stale
+  $ ceph pg dump_stuck inactive
+  $ ceph pg dump_stuck unclean
+
+For stuck stale PGs, it is normally a matter of getting the right ceph-osd
+daemons running again. For stuck inactive PGs, it is usually a peering problem
+(see :ref:`failures-osd-peering`). For stuck unclean PGs, there is usually something
+preventing recovery from completing, like unfound objects (see :ref:`failures-osd-unfound`).
+
+.. _failures-osd-peering:
 
 PG down (peering failure)
 =========================
@@ -141,9 +161,9 @@
 The ``recovery_state`` section tells us that peering is blocked due to
 down ceph-osd daemons, specifically osd.1. In this case, we can
 start that ceph-osd and things will recover.
 
-Alternatively, is there is a catastrophic failure of osd.1 (e.g., disk
+Alternatively, if there is a catastrophic failure of osd.1 (e.g., disk
 failure), we can tell the cluster that it is "lost" and to cope as
-best it case. Note that this is dangerous in that the cluster cannot
+best it can. Note that this is dangerous in that the cluster cannot
 guarantee that the other copies of the data are consistent and up to
 date. To instruct Ceph to continue anyway::
@@ -152,6 +172,8 @@ date. To instruct Ceph to continue anyway::
 
 and recovery will proceed.
 
+.. _failures-osd-unfound:
+
 Unfound objects
 ===============
 
diff --git a/doc/ops/manage/failures/radosgw.rst b/doc/ops/manage/failures/radosgw.rst
index 6951fe8757115..11f00ab0e346e 100644
--- a/doc/ops/manage/failures/radosgw.rst
+++ b/doc/ops/manage/failures/radosgw.rst
@@ -9,7 +9,7 @@ HTTP request errors
 
 Examining the access and error logs for the web server itself is
 probably the first step in identifying what is going on. If there is
 a 500 error, that usually indicates a problem communicating with the
-radosgw daemon. Ensure the daemon is running, it's socket path is
+radosgw daemon. Ensure the daemon is running, its socket path is
 configured, and that the web server is looking for it in the proper
 location.
@@ -30,7 +30,7 @@ Blocked radosgw requests
 
 If some (or all) radosgw requests appear to be blocked, you can get
 some insight into the internal state of the ``radosgw`` daemon via
-it's admin socket. By default, there will be a socket configured to
+its admin socket. By default, there will be a socket configured to
 reside in ``/var/run/ceph``, and the daemon can be queried with::
 
   $ ceph --admin-daemon /var/run/ceph/client.rgw help
@@ -46,6 +46,6 @@ Of particular interest::
 
   ...
 
 will dump information about current in-progress requests with the
-RADOS cluster, which allow one to identify if a request is blocked
-by a non-responsive backend cluster.
+RADOS cluster. This allows one to identify if any requests are blocked
+by a non-responsive ceph-osd.
-- 
2.39.5
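
The triage flow that the osd.rst changes document can be exercised with a
short shell script. This is only an illustrative sketch, not part of the
patch: it assumes the ``ceph`` CLI is installed and can reach the cluster,
and it uses only the commands quoted in the documentation above::

  #!/bin/sh
  # Illustrative sketch (not part of the patch): walk the health checks
  # described in the documentation changes above.
  # Assumes the ceph CLI can reach a running cluster.

  # Overall cluster state; HEALTH_OK means there is nothing further to do.
  ceph health detail

  # List PGs stuck in each non-optimal state the monitor warns about.
  for state in inactive unclean stale; do
      echo "== PGs stuck $state =="
      ceph pg dump_stuck "$state"
  done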
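
The radosgw admin-socket check in the last hunk can be scripted the same
way. Again only a sketch, with assumptions called out: the socket name
``client.rgw`` is taken from the example above but varies per deployment,
and ``objecter_requests`` is assumed to be the in-progress-request command
that the surrounding text describes (the hunk itself elides it)::

  #!/bin/sh
  # Illustrative sketch (not part of the patch).
  # Assumption: the admin socket uses the default path shown in the docs.
  SOCK=/var/run/ceph/client.rgw

  # Confirm the daemon answers on its admin socket at all.
  ceph --admin-daemon "$SOCK" help

  # Assumption: objecter_requests is the command that dumps in-progress
  # RADOS requests, to spot ones blocked on a non-responsive ceph-osd.
  ceph --admin-daemon "$SOCK" objecter_requests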