From bd97923ceb25f0f6fbb80c3102fd7bb1468d6ba0 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Tue, 6 Mar 2012 20:35:33 -0800
Subject: [PATCH] doc: fix misc typos, bad phrasing

Signed-off-by: Sage Weil
---
 doc/dev/index.rst                   | 12 +++++++++
 doc/ops/manage/failures/index.rst   |  3 +--
 doc/ops/manage/failures/mon.rst     |  6 ++---
 doc/ops/manage/failures/osd.rst     | 40 ++++++++++++++++++++++-------
 doc/ops/manage/failures/radosgw.rst |  8 +++---
 5 files changed, 51 insertions(+), 18 deletions(-)

diff --git a/doc/dev/index.rst b/doc/dev/index.rst
index 69fda91850c26..628b958ae3fbd 100644
--- a/doc/dev/index.rst
+++ b/doc/dev/index.rst
@@ -16,6 +16,18 @@ You can start a development mode Ceph cluster, after compiling the source, with:
 
 .. todo:: vstart is woefully undocumented and full of sharp sticks to poke yourself with.
 
+.. _mailing-list:
+
+Mailing list
+============
+
+The official development email list is ``ceph-devel@vger.kernel.org``. Subscribe by sending
+a message to ``majordomo@vger.kernel.org`` with the line::
+
+  subscribe ceph-devel
+
+in the body of the message.
+
 .. toctree::
    :glob:
 
diff --git a/doc/ops/manage/failures/index.rst b/doc/ops/manage/failures/index.rst
index 6f14a344635c2..2a8d7c3d38dbb 100644
--- a/doc/ops/manage/failures/index.rst
+++ b/doc/ops/manage/failures/index.rst
@@ -10,8 +10,7 @@ be checked with the ``ceph health`` command. If all is well, you get::
   $ ceph health
   HEALTH_OK
 
-and a success error code. If there are problems, you will see
-something like::
+If there are problems, you will see something like::
 
   $ ceph health
   HEALTH_WARN short summary of problem(s)
diff --git a/doc/ops/manage/failures/mon.rst b/doc/ops/manage/failures/mon.rst
index 3ff650ffce445..2618c9157d02e 100644
--- a/doc/ops/manage/failures/mon.rst
+++ b/doc/ops/manage/failures/mon.rst
@@ -4,13 +4,13 @@
 Any single ceph-mon failure should not take down the entire monitor
 cluster as long as a majority of the nodes are available. If that
-is the case--the remainin nodes are able to form a quorum--the ``ceph
+is the case--the remaining nodes are able to form a quorum--the ``ceph
 health`` command will report any problems::
 
   $ ceph health
   HEALTH_WARN 1 mons down, quorum 0,2
 
-and
+and::
 
   $ ceph health detail
   HEALTH_WARN 1 mons down, quorum 0,2
 
@@ -18,7 +18,7 @@ and
 Generally speaking, simply restarting the affected node will repair
 things.
 
-If there are not enough monitors for form a quorum, the ``ceph``
+If there are not enough monitors to form a quorum, the ``ceph``
 command will block trying to reach the cluster. In this situation,
 you need to get enough ``ceph-mon`` daemons running to form a quorum
 before doing anything else with the cluster.
diff --git a/doc/ops/manage/failures/osd.rst b/doc/ops/manage/failures/osd.rst
index 8dfabc68ba0c1..c25ef263294c8 100644
--- a/doc/ops/manage/failures/osd.rst
+++ b/doc/ops/manage/failures/osd.rst
@@ -6,7 +6,8 @@ Single ceph-osd failure
 =======================
 
 When a ceph-osd process dies, the monitor will learn about the failure
-from its peers and report it via the ``ceph health`` command::
+from surviving ceph-osd daemons and report it via the ``ceph health``
+command::
 
   $ ceph health
   HEALTH_WARN 1/3 in osds are down
@@ -23,7 +24,14 @@ Under normal circumstances, simply restarting the ceph-osd daemon
 will allow it to rejoin the cluster and recover. If there is a disk
 failure or other fault preventing ceph-osd from functioning or
 restarting, an error message should be present in its log file in
-``/var/log/ceph``.
+``/var/log/ceph``.
+
+If the daemon stopped because of a heartbeat failure, the underlying
+kernel file system may be unresponsive; check ``dmesg`` output for disk
+or other kernel errors.
+
+If the problem is a software error (failed assertion or other
+unexpected error), it should be reported to the :ref:`mailing-list`.
 
 
 Full cluster
@@ -31,14 +39,14 @@
 If the cluster fills up, the monitor will prevent new data from being
 written. The system puts ceph-osds in two categories: ``nearfull``
-and ``full`, which configurable threshholds for each (80% and 90% by
+and ``full``, with configurable thresholds for each (80% and 90% by
 default). In both cases, full ceph-osds will be reported by ``ceph
 health``::
 
   $ ceph health
   HEALTH_WARN 1 nearfull osds
   osd.2 is near full at 85%
 
-or
+or::
 
   $ ceph health
   HEALTH_ERR 1 nearfull osds, 1 full osds
@@ -85,18 +93,30 @@ Stuck PGs
 =========
 
 It is normal for PGs to enter states like "degraded" or "peering"
 following a failure. Normally these states indicate the normal
-progression through the failure recovery process. However, is a PG
+progression through the failure recovery process. However, if a PG
 stays in one of these states for a long time this may be an
 indication of a larger problem. For this reason, the monitor will
 warn when PGs get "stuck" in a non-optimal state. Specifically, we
 check for:
 
-* ``inactive`` - the PG is has not ``active`` for too long (i.e., hasn't
+* ``inactive`` - the PG has not been ``active`` for too long (i.e., hasn't
   been able to service read/write requests)
 * ``unclean`` - the PG has not been ``clean`` for too long (i.e., hasn't
   been able to completely recover from a previous failure
-* ``stale`` - the PG status hasn't been updated by a ``ceph-osd``,
+* ``stale`` - the PG status has not been updated by a ``ceph-osd``,
   indicating that all nodes storing this PG may be down
 
+You can explicitly list stuck PGs with one of::
+
+  $ ceph pg dump_stuck stale
+  $ ceph pg dump_stuck inactive
+  $ ceph pg dump_stuck unclean
+
+For stuck stale PGs, it is normally a matter of getting the right ceph-osd
+daemons running again. For stuck inactive PGs, it is usually a peering problem
+(see :ref:`failures-osd-peering`). For stuck unclean PGs, there is usually something
+preventing recovery from completing, like unfound objects (see :ref:`failures-osd-unfound`).
+
+.. _failures-osd-peering:
 
 PG down (peering failure)
 =========================
@@ -141,9 +161,9 @@
 The ``recovery_state`` section tells us that peering is blocked due to
 down ceph-osd daemons, specifically osd.1. In this case, we can
 start that ceph-osd and things will recover.
 
-Alternatively, is there is a catastrophic failure of osd.1 (e.g., disk
+Alternatively, if there is a catastrophic failure of osd.1 (e.g., disk
 failure), we can tell the cluster that it is "lost" and to cope as
-best it case. Note that this is dangerous in that the cluster cannot
+best it can. Note that this is dangerous in that the cluster cannot
 guarantee that the other copies of the data are consistent and up to
 date. To instruct Ceph to continue anyway::
@@ -152,6 +172,8 @@ date. To instruct Ceph to continue anyway::
 
 and recovery will proceed.
 
+.. _failures-osd-unfound:
+
 Unfound objects
 ===============
 
diff --git a/doc/ops/manage/failures/radosgw.rst b/doc/ops/manage/failures/radosgw.rst
index 6951fe8757115..11f00ab0e346e 100644
--- a/doc/ops/manage/failures/radosgw.rst
+++ b/doc/ops/manage/failures/radosgw.rst
@@ -9,7 +9,7 @@ HTTP request errors
 
 Examining the access and error logs for the web server itself is
 probably the first step in identifying what is going on. If there is
 a 500 error, that usually indicates a problem communicating with the
-radosgw daemon. Ensure the daemon is running, it's socket path is
+radosgw daemon. Ensure the daemon is running, its socket path is
 configured, and that the web server is looking for it in the proper
 location.
@@ -30,7 +30,7 @@ Blocked radosgw requests
 
 If some (or all) radosgw requests appear to be blocked, you can get
 some insight into the internal state of the ``radosgw`` daemon via
-it's admin socket. By default, there will be a socket configured to
+its admin socket. By default, there will be a socket configured to
 reside in ``/var/run/ceph``, and the daemon can be queried with::
 
   $ ceph --admin-daemon /var/run/ceph/client.rgw help
@@ -46,6 +46,6 @@ Of particular interest::
 
   ...
 
 will dump information about current in-progress requests with the
-RADOS cluster, which allow one to identify if a request is blocked
-by a non-responsive backend cluster.
+RADOS cluster. This allows one to identify if any requests are blocked
+by a non-responsive ceph-osd.
-- 
2.39.5
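
The triage flow that the osd.rst changes document can be exercised with a
short shell script. This is only an illustrative sketch, not part of the
patch: it assumes the ``ceph`` CLI is installed and can reach the cluster,
and it uses only the commands quoted in the documentation above::

  #!/bin/sh
  # Illustrative sketch (not part of the patch): walk the health checks
  # described in the documentation changes above.
  # Assumes the ceph CLI can reach a running cluster.

  # Overall cluster state; HEALTH_OK means there is nothing further to do.
  ceph health detail

  # List PGs stuck in each non-optimal state the monitor warns about.
  for state in inactive unclean stale; do
      echo "== PGs stuck $state =="
      ceph pg dump_stuck "$state"
  done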
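
The radosgw admin-socket check in the last hunk can be scripted the same
way. Again only a sketch, with assumptions called out: the socket name
``client.rgw`` is taken from the example above but varies per deployment,
and ``objecter_requests`` is assumed to be the in-progress-request command
that the surrounding text describes (the hunk itself elides it)::

  #!/bin/sh
  # Illustrative sketch (not part of the patch).
  # Assumption: the admin socket uses the default path shown in the docs.
  SOCK=/var/run/ceph/client.rgw

  # Confirm the daemon answers on its admin socket at all.
  ceph --admin-daemon "$SOCK" help

  # Assumption: objecter_requests is the command that dumps in-progress
  # RADOS requests, to spot ones blocked on a non-responsive ceph-osd.
  ceph --admin-daemon "$SOCK" objecter_requests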