Placement Groups (PGs) that remain in the ``active`` status, the
``active+remapped`` status or the ``active+degraded`` status and never achieve
an ``active+clean`` status might indicate a problem with the configuration of
-the Ceph cluster.
+the Ceph cluster.
-In such a situation, review the settings in the `Pool, PG and CRUSH Config
-Reference`_ and make appropriate adjustments.
+In such a situation, review the settings in the :ref:`rados_config_pool_pg_crush_ref`
+and make appropriate adjustments.
As a general rule, run your cluster with more than one OSD and a pool size
of greater than two object replicas.
configuration, in spite of the limitations as described herein.
To create a cluster on a single node, you must change the
-``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
+:confval:`osd_crush_chooseleaf_type` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
-file before you create your monitors and OSDs. This tells Ceph that an OSD is
+file before you create Monitors and OSDs. This tells Ceph that an OSD is
permitted to place another OSD on the same host. If you are trying to set up a
-single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``,
+single-node cluster and :confval:`osd_crush_chooseleaf_type` is greater than ``0``,
Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or datacenter depending on the setting.
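+
+A minimal sketch of such a ``ceph.conf`` fragment for a single-node test
+cluster might look like this (the exact file layout depends on your
+deployment tooling):
+
+.. code-block:: ini
+
+   [global]
+   # allow Ceph to place replicas on the same host
+   osd_crush_chooseleaf_type = 0
+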
Fewer OSDs than Replicas
------------------------
-If two OSDs are in an ``up`` and ``in`` state, but the placement gropus are not
-in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
-to greater than ``2``.
+If some OSDs are ``up`` and ``in`` but the placement groups are not in an
+``active+clean`` state, :confval:`osd_pool_default_size` may be set to a value
+greater than the number of OSDs that are ``up`` and ``in``.
-There are a few ways to address this situation. If you want to operate your
-cluster in an ``active + degraded`` state with two replicas, you can set the
-``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
-``active + degraded`` state. You may also set the ``osd_pool_default_size``
+There are a few ways to address this situation. For example, if you want to
+operate your cluster in an ``active+degraded`` state with two replicas while
+:confval:`osd_pool_default_size` is set to ``3``, you can set
+:confval:`osd_pool_default_min_size` to ``2`` so that you can write objects
+in an ``active+degraded`` state. You may also set the
+:confval:`osd_pool_default_size`
setting to ``2`` so that you have only two stored replicas (the original and
-one replica). In such a case, the cluster should achieve an ``active + clean``
+one replica). In such a case, the cluster should achieve an ``active+clean``
state.
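+
+As a sketch, the defaults can be adjusted with the centralized configuration
+interface. Note that these defaults apply to newly created pools; the size of
+an existing pool is changed with ``ceph osd pool set <poolname> size``:
+
+.. prompt:: bash #
+
+   ceph config set global osd_pool_default_size 2
+   ceph config set global osd_pool_default_min_size 2
+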
.. note:: You can make the changes while the cluster is running. If you make
Pool Size = 1
-------------
-If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
+If you have :confval:`osd_pool_default_size` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
force the first OSD to notice the placement groups it needs by running a
command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd force-create-pg <pgid>
CRUSH Map Errors
----------------
-If any placement groups in your cluster are unclean, then there might be errors
+If any placement groups in your cluster are ``unclean``, then there might be errors
in your CRUSH map.
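+
+One way to review the CRUSH map is to extract and decompile it, as sketched
+below, and then inspect the rules used by the affected pools:
+
+.. prompt:: bash #
+
+   ceph osd getcrushmap > crush.map
+   crushtool --decompile crush.map > crush.txt
+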
+.. _failures-pg-stuck:
Stuck Placement Groups
======================
-It is normal for placement groups to enter "degraded" or "peering" states after
+It is normal for placement groups to enter ``degraded`` or ``peering`` states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:
-* ``inactive`` - The placement group has not been ``active`` for too long (that
+* ``inactive``: The placement group has not been ``active`` for too long (that
is, it hasn't been able to service read/write requests).
-* ``unclean`` - The placement group has not been ``clean`` for too long (that
+* ``unclean``: The placement group has not been ``clean`` for too long (that
is, it hasn't been able to completely recover from a previous failure).
-* ``stale`` - The placement group status has not been updated by a
- ``ceph-osd``. This indicates that all nodes storing this placement group may
- be ``down``.
+* ``stale``: The placement group status has not been updated by an OSD.
+ This indicates that all nodes storing this placement group may be ``down``.
List stuck placement groups by running one of the following commands:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
-- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
- daemons are not running.
+- Stuck ``stale`` placement groups usually indicate that key OSDs are
+ not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
:ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
-  :ref:`failures-osd-unfound`);
+  :ref:`failures-osd-unfound`).
-
.. _failures-osd-peering:
Placement Group Down - Peering Failure
======================================
-In certain cases, the ``ceph-osd`` `peering` process can run into problems,
+In certain cases, the OSD `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the following:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
- { "name": "Started\/Primary\/Peering\/GetInfo",
+ { "name": "Started/Primary/Peering/GetInfo",
"enter_time": "2012-03-06 14:40:16.169679",
"requested_info_from": []},
- { "name": "Started\/Primary\/Peering",
+ { "name": "Started/Primary/Peering",
"enter_time": "2012-03-06 14:40:16.169659",
"probing_osds": [
0,
}
The ``recovery_state`` section tells us that peering is blocked due to down
-``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
-particular ``ceph-osd`` and recovery will proceed.
+OSDs, specifically ``osd.1``. In this case, we can start that
+particular OSD and recovery will proceed.
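+
+How the daemon is started depends on how it was deployed. As a sketch,
+assuming the down OSD is ``osd.1``, use one of the following on a
+package-based (systemd) or cephadm-managed cluster, respectively:
+
+.. prompt:: bash #
+
+   systemctl start ceph-osd@1
+   ceph orch daemon start osd.1
+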
Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd lost 1
Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
pg 2.4 is active+degraded, 78 unfound
This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them. Here is an
-example of how this might come about for a PG whose data is on two OSDS, which
+example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":
-* 1 goes down
-* 2 handles some writes, alone
-* 1 comes up
+* 1 goes down.
+* 2 handles some writes, alone.
+* 1 comes up.
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.
-At this point, 1 knows that these objects exist, but there is no live
-``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
+At this point, 1 knows that these objects exist, but there is no live OSD
+that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.
Identify which objects are unfound by running a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 2.4 list_unfound [starting offset, in json]
-.. code-block:: javascript
+.. code-block:: json
{
"num_missing": 1,
Use of ``query``:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 2.4 query
-.. code-block:: javascript
+.. code-block:: json
"recovery_state": [
- { "name": "Started\/Primary\/Active",
+ { "name": "Started/Primary/Active",
"enter_time": "2012-03-06 15:15:46.713212",
"might_have_unfound": [
{ "osd": 1,
- "status": "osd is down"}]},
+ "status": "osd is down"}]}]
In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 2.5 mark_unfound_lost revert|delete
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.
+
Homeless Placement Groups
=========================
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:
-.. prompt:: bash
+.. prompt:: bash #
ceph health
-::
+.. code-block:: none
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
...
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
-This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
+This output indicates that placement group ``2.5`` (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
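+
+To confirm the mapping of a given placement group before and after restarting
+the OSDs, query it directly (a sketch using the placement group from the
+example above):
+
+.. prompt:: bash #
+
+   ceph pg map 2.5
+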
cluster by the number of OSDs in the cluster, a small number of placement
groups (the remainder, in this operation) are sometimes not distributed across
the cluster. In situations like this, create a pool with a placement group
-count that is a multiple of the number of OSDs. See `Placement Groups`_ for
+count that is a multiple of the number of OSDs. See :ref:`placement groups` for
details. See the :ref:`Pool, PG, and CRUSH Config Reference
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
values used to determine how many placement groups are assigned to each pool.
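+
+For example, on a cluster with ten OSDs, a pool might be created with a
+placement group count that is a multiple of ten (a sketch; the pool name
+``testpool`` and the count ``160`` are only illustrative values):
+
+.. prompt:: bash #
+
+   ceph osd pool create testpool 160
+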
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
-data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
+data. See :confval:`osd_pool_default_min_size` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.
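+
+To check the minimum size in effect for an existing pool (a sketch;
+``testpool`` is only an example name):
+
+.. prompt:: bash #
+
+   ceph osd pool get testpool min_size
+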
PGs Inconsistent
================
-If the command ``ceph health detail`` returns an ``active + clean +
-inconsistent`` state, this might indicate an error during scrubbing. Identify
+If the command ``ceph health detail`` returns an ``active+clean+inconsistent``
+state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
Alternatively, run this command if you prefer to inspect the output in a
programmatic way:
-.. prompt:: bash
+.. prompt:: bash #
rados list-inconsistent-pg rbd
-::
+.. code-block:: none
["0.6"]
object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of
-``rados list-inconsistent-pg rbd`` will look something like this:
+``rados list-inconsistent-obj 0.6`` will look something like this:
-.. prompt:: bash
+.. prompt:: bash #
rados list-inconsistent-obj 0.6 --format=json-pretty
-.. code-block:: javascript
+.. code-block:: json
{
"epoch": 14,
inconsistencies.
* The inconsistencies fall into two categories:
- #. ``errors``: these errors indicate inconsistencies between shards, without
+ #. ``errors``: These errors indicate inconsistencies between shards, without
an indication of which shard(s) are bad. Check for the ``errors`` in the
``shards`` array, if available, to pinpoint the problem.
- * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
+ * ``data_digest_mismatch``: The digest of the replica read from ``OSD.2``
is different from the digests of the replica reads of ``OSD.0`` and
- ``OSD.1``
- * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
+ ``OSD.1``.
+ * ``size_mismatch``: The size of the replica read from ``OSD.2`` is ``0``,
but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.
- #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
+ #. ``union_shard_errors``: The union of all shard-specific ``errors`` in the
``shards`` array. The ``errors`` are set for the shard with the problem.
These errors include ``read_error`` and other similar errors. The
``errors`` ending in ``oi`` indicate a comparison with
``selected_object_info``. Examine the ``shards`` array to determine
which shard has which error or errors.
- * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
+ * ``data_digest_mismatch_info``: The digest stored in the ``object-info``
is not ``0xffffffff``, which is calculated from the shard read from
- ``OSD.2``
- * ``size_mismatch_info``: the size stored in the ``object-info`` is
+ ``OSD.2``.
+ * ``size_mismatch_info``: The size stored in the ``object-info`` is
different from the size read from ``OSD.2``. The latter is ``0``.
.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
inconsistency is likely due to physical storage errors. In cases like this,
- check the storage used by that OSD.
-
+ check the storage used by that OSD.
+
Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
repair.
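+
+For example (a sketch; replace ``/dev/sdc`` with the device that backs the
+affected OSD):
+
+.. prompt:: bash #
+
+   dmesg -T | grep -i error
+   smartctl -a /dev/sdc
+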
To repair the inconsistent placement group, run a command of the following
form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg repair {placement-group-ID}
.. prompt:: bash #
ceph pg repair 1.4
-
+
.. warning:: This command overwrites the "bad" copies with "authoritative"
copies. In most cases, Ceph is able to choose authoritative copies from all
the available replicas by using some predefined criteria. This, however,
command ``ceph osd dump | grep pool`` return a list of pool numbers.
-If you receive ``active + clean + inconsistent`` states periodically due to
+If you receive ``active+clean+inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.
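+
+As a sketch, with ``chrony`` the Monitor hosts can be made to peer with one
+another by adding lines like the following to ``chrony.conf`` on each host
+(``mon2.example.com`` and ``mon3.example.com`` are placeholder names):
+
+.. code-block:: none
+
+   peer mon2.example.com
+   peer mon3.example.com
+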
+
More Information on PG Repair
-----------------------------
+
Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the lead OSD attempts to choose an authoritative
copy from among its replicas. Only one of the possible cases is consistent.
of the authoritative copy means that there is an inconsistency. The discovery
-of these inconsistencies cause a PG's state to be set to ``inconsistent``.
+of these inconsistencies causes a PG's state to be set to ``inconsistent``.
-The ``pg repair`` command attempts to fix inconsistencies of various kinds. When
+The ``pg repair`` command attempts to fix inconsistencies of various kinds. When
``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
the inconsistent copy with the digest of the authoritative copy. When ``pg
repair`` finds an inconsistent copy in a replicated pool, it marks the
beyond the scope of ``pg repair``.
In the case of erasure-coded and BlueStore pools, Ceph will automatically
-perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
-``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
+perform repairs if :confval:`osd_scrub_auto_repair` (default ``false``) is set to
+``true`` and if no more than :confval:`osd_scrub_auto_repair_num_errors` (default
``5``) errors are found.
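+
+For example, automatic repair can be enabled with a command of the following
+form (a sketch; weigh this against your operational policies before enabling
+it):
+
+.. prompt:: bash #
+
+   ceph config set osd osd_scrub_auto_repair true
+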
The ``pg repair`` command will not solve every problem. Ceph does not
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.
+
PG Repair Walkthrough
---------------------
+
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
-contains a walkthrough of the repair of a PG. It is recommended reading if you
-want to repair a PG but have never done so.
+contains a walkthrough of the repair of a PG on the deprecated Filestore OSD
+back end. It is recommended reading if you want to repair a PG on a Filestore
+OSD but have never done so. The walkthrough does not apply to modern BlueStore
+OSDs.
-Erasure Coded PGs are not active+clean
-======================================
+
+Erasure Coded PGs are not ``active+clean``
+==========================================
-If CRUSH fails to find enough OSDs to map to a PG, it will show as a
-``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example::
+If CRUSH fails to find enough OSDs to map to a PG, the PG will show
+``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For example::
[2,1,6,0,5,8,2147483647,7,4]
+
Not enough OSDs
---------------
If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
-OSDs, the cluster will show "Not enough OSDs". In this case, you either create
-another erasure coded pool that requires fewer OSDs, by running commands of the
+OSDs, the cluster will show ``Not enough OSDs``. In this case, either add new
+OSDs that the PG will then use automatically, or create
+another erasure coded pool that requires fewer OSDs by running commands of the
following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd erasure-code-profile set myprofile k=5 m=3
ceph osd pool create erasurepool erasure myprofile
-or add new OSDs, and the PG will automatically use them.
-CRUSH constraints cannot be satisfied
+CRUSH Constraints Cannot Be Satisfied
-------------------------------------
If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd crush rule ls
-::
+.. code-block:: json
[
"replicated_rule",
"erasurepool"]
- $ ceph osd crush rule dump erasurepool
+
+.. prompt:: bash #
+
+ ceph osd crush rule dump erasurepool
+
+.. code-block:: json
+
{ "rule_id": 1,
"rule_name": "erasurepool",
"type": 3,
Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
ceph osd pool create erasurepool erasure myprofile
-CRUSH gives up too soon
+
+CRUSH Gives Up Too Soon
-----------------------
If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
-PG), it is possible that CRUSH gives up before finding a mapping. This problem
-can be resolved by:
+PG), it is possible that CRUSH gives up before finding a mapping. To resolve
+this problem, take one of the following approaches:
-* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
+* Lower the erasure coded pool requirements to use fewer OSDs per PG (this
requires the creation of another pool, because erasure code profiles cannot
be modified dynamically).
-* adding more OSDs to the cluster (this does not require the erasure coded pool
- to be modified, because it will become clean automatically)
+* Add more OSDs to the cluster (this does not require the erasure coded pool
+ to be modified, because it will become clean automatically).
-* using a handmade CRUSH rule that tries more times to find a good mapping.
+* Use a handmade CRUSH rule that tries more times to find a good mapping.
This can be modified for an existing CRUSH rule by setting
- ``set_choose_tries`` to a value greater than the default.
+ ``set_choose_tries`` to a value greater than the default. For more
+ information, see :ref:`rados-crush-map-edits`.
+
+* Use a multi-step retry (MSR) CRUSH rule (Squid or later releases). For more
+ information, see :ref:`rados-crush-msr-rules`.
First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd crush rule dump erasurepool
-::
+.. code-block:: json
{ "rule_id": 1,
"rule_name": "erasurepool",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
- $ ceph osd getcrushmap > crush.map
+
+.. prompt:: bash #
+
+ ceph osd getcrushmap > crush.map
+
+.. code-block:: none
+
got crush map from osdmap epoch 13
- $ crushtool -i crush.map --test --show-bad-mappings \
+
+.. prompt:: bash #
+
+ crushtool -i crush.map --test --show-bad-mappings \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
+
+.. code-block:: none
+
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
placement.
-Changing the value of set_choose_tries
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Changing the Value of ``set_choose_tries``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Decompile the CRUSH map to edit the CRUSH rule by running the following
command:
- .. prompt:: bash
+ .. prompt:: bash #
crushtool --decompile crush.map > crush.txt
For illustrative purposes a simplified CRUSH map will be used in this
- example, simulating a single host with four disks of sizes 3×1TiB and
- 1×200GiB. The settings below are chosen specifically for this example and
+ example, simulating a single host with four disks of sizes 3×1 TiB and
+ 1×200 GiB. The settings below are chosen specifically for this example and
will diverge from the :ref:`CRUSH Map Tunables <crush-map-tunables>`
generally found in production clusters. As defaults may change, please refer
to the correct version of the documentation for your release of Ceph.
-_
-
::
tunable choose_local_tries 0
step set_choose_tries 100
If the line does exist already, as in this example, only modify the value.
- Ensure that the rule in this ``crush.txt`` does resemble this after the
+ Ensure that the rule in your ``crush.txt`` resembles the following after the
change::
rule ec {
#. Recompile and retest the CRUSH rule:
- .. prompt:: bash
+ .. prompt:: bash #
crushtool --compile crush.txt -o better-crush.map
``--show-choose-tries`` option of the ``crushtool`` command, as in the
following example:
- .. prompt:: bash
+ .. prompt:: bash #
crushtool -i better-crush.map --test --show-bad-mappings \
--show-choose-tries \
--num-rep 3 \
--min-x 1 --max-x 10
- ::
+ .. code-block:: none
0: 0
1: 0
-value can be used at the chance of potentially hitting one of the rare cases in
-which placement fails, requiring manual intervention.
+value can be used at the risk of hitting one of the rare cases in which
+placement fails, requiring manual intervention.
-.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
-.. _Placement Groups: ../../operations/placement-groups
-.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref