Any single ceph-mon failure should not take down the entire monitor
cluster as long as a majority of the nodes are available. If that
-is the case--the remainin nodes are able to form a quorum--the ``ceph
+is the case--the remaining nodes are able to form a quorum--the ``ceph
health`` command will report any problems::
$ ceph health
HEALTH_WARN 1 mons down, quorum 0,2
-and
+and::
$ ceph health detail
HEALTH_WARN 1 mons down, quorum 0,2
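+
+If you need more detail on the quorum itself, the monitors can also be
+queried directly (a sketch; the exact output format varies by release)::
+
+ $ ceph quorum_status
+ $ ceph mon stat
+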
Generally speaking, simply restarting the affected node will repair things.
-If there are not enough monitors for form a quorum, the ``ceph``
+If there are not enough monitors to form a quorum, the ``ceph``
command will block trying to reach the cluster. In this situation,
you need to get enough ``ceph-mon`` daemons running to form a quorum
before doing anything else with the cluster.
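+
+How you restart a monitor depends on how Ceph was deployed; as a rough
+sketch, for a hypothetical monitor ``mon.a`` on a systemd-based or
+sysvinit-style host::
+
+ $ sudo systemctl restart ceph-mon@a        # systemd
+ $ sudo service ceph restart mon.a          # sysvinit
+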
=======================
When a ceph-osd process dies, the monitor will learn about the failure
-from its peers and report it via the ``ceph health`` command::
+from surviving ceph-osd daemons and report it via the ``ceph health``
+command::
$ ceph health
HEALTH_WARN 1/3 in osds are down
allow it to rejoin the cluster and recover. If there is a disk
failure or other fault preventing ceph-osd from functioning or
restarting, an error message should be present in its log file in
-``/var/log/ceph``.
+``/var/log/ceph``.
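+
+For example, to look at recent log activity for a hypothetical ``osd.0``
+(the exact file name depends on your configuration)::
+
+ $ tail -n 100 /var/log/ceph/ceph-osd.0.log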
+
+If the daemon stopped because of a heartbeat failure, the underlying
+kernel file system may be unresponsive; check ``dmesg`` output for disk
+or other kernel errors.
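+
+For instance (the messages worth looking for depend on the underlying
+file system and disk controller)::
+
+ $ dmesg | grep -i error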
+
+If the problem is a software error (failed assertion or other
+unexpected error), it should be reported to the :ref:`mailing-list`.
Full cluster
If the cluster fills up, the monitor will prevent new data from being
written. The system puts ceph-osds in two categories: ``nearfull``
-and ``full`, which configurable threshholds for each (80% and 90% by
+and ``full``, with configurable thresholds for each (80% and 90% by
default). In both cases, full ceph-osds will be reported by ``ceph health``::
$ ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
-or
+or::
$ ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
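+
+To see how much headroom remains before the thresholds are hit, you can
+check overall and per-pool utilization (a sketch; the output columns vary
+by release)::
+
+ $ ceph df
+ $ rados df
+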
It is normal for PGs to enter states like "degraded" or "peering"
-following a failure. Normally these states indicate the normal
+following a failure. These states reflect the expected
-progression through the failure recovery process. However, is a PG
+progression through the failure recovery process. However, if a PG
stays in one of these states for a long time this may be an indication
of a larger problem. For this reason, the monitor will warn when PGs
get "stuck" in a non-optimal state. Specifically, we check for:
-* ``inactive`` - the PG is has not ``active`` for too long (i.e., hasn't
+* ``inactive`` - the PG has not been ``active`` for too long (i.e., hasn't
been able to service read/write requests)
* ``unclean`` - the PG has not been ``clean`` for too long (i.e.,
-  hasn't been able to completely recover from a previous failure
+  hasn't been able to completely recover from a previous failure)
-* ``stale`` - the PG status hasn't been updated by a ``ceph-osd``,
+* ``stale`` - the PG status has not been updated by a ``ceph-osd``,
indicating that all nodes storing this PG may be down
+You can explicitly list stuck PGs with one of::
+
+ $ ceph pg dump_stuck stale
+ $ ceph pg dump_stuck inactive
+ $ ceph pg dump_stuck unclean
+
+For stuck stale PGs, it is normally a matter of getting the right ceph-osd
+daemons running again. For stuck inactive PGs, it is usually a peering problem
+(see :ref:`failures-osd-peering`). For stuck unclean PGs, there is usually something
+preventing recovery from completing, like unfound objects (see :ref:`failures-osd-unfound`).
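+
+To dig into an individual stuck PG, you can query it directly; for a
+hypothetical PG ``2.5``::
+
+ $ ceph pg 2.5 query
+
+The output typically includes the PG's recovery state and the OSDs it is
+waiting on.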
+
+.. _failures-osd-peering:
PG down (peering failure)
=========================
down ceph-osd daemons, specifically osd.1. In this case, we can start that ceph-osd
and things will recover.
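+
+How you start the daemon again depends on how Ceph was deployed; as a
+rough sketch, for osd.1 on a systemd-based or sysvinit-style host::
+
+ $ sudo systemctl start ceph-osd@1          # systemd
+ $ sudo service ceph start osd.1            # sysvinit
+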
-Alternatively, is there is a catastrophic failure of osd.1 (e.g., disk
+Alternatively, if there is a catastrophic failure of osd.1 (e.g., disk
failure), we can tell the cluster that it is "lost" and to cope as
-best it case. Note that this is dangerous in that the cluster cannot
+best it can. Note that this is dangerous in that the cluster cannot
guarantee that the other copies of the data are consistent and up to
date. To instruct Ceph to continue anyway::
and recovery will proceed.
+.. _failures-osd-unfound:
+
Unfound objects
===============
Examining the access and error logs for the web server itself is
probably the first step in identifying what is going on. If there is
a 500 error, that usually indicates a problem communicating with the
-radosgw daemon. Ensure the daemon is running, it's socket path is
+radosgw daemon. Ensure the daemon is running, its socket path is
configured, and that the web server is looking for it in the proper
location.
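+
+For example, on a classic FastCGI deployment you might verify the daemon
+and its socket along these lines (the socket path shown is purely
+illustrative and must match your ``ceph.conf`` and web server
+configuration)::
+
+ $ ps aux | grep radosgw
+ $ ls -l /var/run/ceph/radosgw.sock
+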
If some (or all) radosgw requests appear to be blocked, you can get
some insight into the internal state of the ``radosgw`` daemon via
-it's admin socket. By default, there will be a socket configured to
+its admin socket. By default, there will be a socket configured to
reside in ``/var/run/ceph``, and the daemon can be queried with::
$ ceph --admin-daemon /var/run/ceph/client.rgw help
...
will dump information about current in-progress requests with the
-RADOS cluster, which allow one to identify if a request is blocked
-by a non-responsive backend cluster.
+RADOS cluster. This allows one to identify if any requests are blocked
+by a non-responsive ceph-osd.
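+
+For example, assuming the same socket path as above, the
+``objecter_requests`` command lists the requests currently outstanding
+against the OSDs::
+
+ $ ceph --admin-daemon /var/run/ceph/client.rgw objecter_requests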