From: Zac Dover
Date: Fri, 11 Aug 2023 15:25:32 +0000 (+1000)
Subject: doc/rados: update monitoring-osd-pg.rst
X-Git-Tag: v19.0.0~701^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=73503d4e2665bee9c49a055c11be1ffe2db253a1;p=ceph.git

doc/rados: update monitoring-osd-pg.rst

Ingest Anthony D'Atri's notes from
https://github.com/ceph/ceph/pull/50856#discussion_r1289532902
which should have been included earlier.

Signed-off-by: Zac Dover
---

diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst
index 21def48f4941..5689ac935993 100644
--- a/doc/rados/operations/monitoring-osd-pg.rst
+++ b/doc/rados/operations/monitoring-osd-pg.rst
@@ -10,10 +10,11 @@
 directly to specific OSDs. For this reason, tracking system faults requires
 finding the `placement group`_ (PG) and the underlying OSDs at the root of the
 problem.
 
-.. tip:: A fault in one part of the cluster might prevent you from accessing a
-   particular object, but that doesn't mean that you are prevented from accessing other objects.
-   When you run into a fault, don't panic. Just follow the steps for monitoring
-   your OSDs and placement groups, and then begin troubleshooting.
+.. tip:: A fault in one part of the cluster might prevent you from accessing a
+   particular object, but that doesn't mean that you are prevented from
+   accessing other objects. When you run into a fault, don't panic. Just
+   follow the steps for monitoring your OSDs and placement groups, and then
+   begin troubleshooting.
 
 Ceph is self-repairing. However, when problems persist, monitoring OSDs and
 placement groups will help you identify the problem.
@@ -22,22 +23,18 @@ placement groups will help you identify the problem.
 Monitoring OSDs
 ===============
 
-An OSD's status is either in the cluster (``in``) or out of the cluster
-(``out``); and, it is either up and running (``up``), or it is down and not
-running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
-(you can read and write data) or it is ``out`` of the cluster. If it was
-``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
-placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
-not assign placement groups to the OSD. If an OSD is ``down``, it should also be
-``out``.
-
-.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
-   will not be in a healthy state.
-
-The problem here seems to be with the word "should" above, and then with the
-information in the note that there is a problem when an OSD is "down" and "in".
-We need to replace "should" with something less modal and tentative. Does the
-sentence mean to say "When an OSD is 'down', it is therefore a fortiori 'out'."?
+An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD
+is either running and reachable (``up``), or it is not running and not
+reachable (``down``).
+
+If an OSD is ``up``, it may be either ``in`` service (clients can read and
+write data) or ``out`` of service. If the OSD was ``in`` but was then set
+``out``, whether by a failure or a manual action, Ceph will migrate placement
+groups to other OSDs to maintain the configured redundancy.
+
+If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
+If an OSD is ``down``, it will also be ``out``.
+
+.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
+   indicates that the cluster is not in a healthy state.
 
 .. ditaa::
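As a quick illustration of the states the patch documents (not part of the patch itself), the ``up``/``down`` and ``in``/``out`` status of OSDs can be inspected and toggled with standard Ceph CLI commands; the OSD id ``0`` below is only an example and requires a running cluster:

```shell
# Cluster-wide summary, e.g. "3 osds: 3 up ..., 3 in ..."
ceph osd stat

# Per-OSD up/down status shown in the CRUSH hierarchy
ceph osd tree

# Mark an example OSD out of service (CRUSH stops assigning
# PGs to it and data migrates), then bring it back in
ceph osd out 0
ceph osd in 0
```

Note that ``ceph osd out``/``in`` change only the in/out state; up/down is reported by the OSD daemons and monitors, which is why a ``down`` + ``in`` combination indicates an unhealthy condition rather than an administrative choice.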