From: Ville Ojamo <14869000+bluikko@users.noreply.github.com> Date: Thu, 15 Jan 2026 07:22:29 +0000 (+0700) Subject: doc/rados: improve troubleshooting-pg.rst X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=586f64c867302b7f59a48ea44924d9e41bc8e2bf;p=ceph.git doc/rados: improve troubleshooting-pg.rst Note that a link to a walkthrough uses deprecated Filestore. Reported in doc bugs pad. Fix capitalization, use OSD instead of ceph-osd. Improve language in a list. Remove escaping from slashes in PG query output, tested on Quincy. Don't use spaces in states like active+remapped consistently. Add label for incoming links and change them to refs. Use privileged prompt for CLI commands, don't highlight in console output. Use double backticks consistently. Improve markup. Remove spaces at the end of lines. Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com> --- diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst index 770c7b9a73c94..aece617143681 100644 --- a/doc/rados/operations/health-checks.rst +++ b/doc/rados/operations/health-checks.rst @@ -1150,7 +1150,7 @@ or ``snaptrim_error`` flag set, which indicates that an earlier data scrub operation found a problem, or (2) have the *repair* flag set, which means that a repair for such an inconsistency is currently in progress. -For more information, see :doc:`../troubleshooting/troubleshooting-pg`. +For more information, see :ref:`rados_operations_monitoring_osd_pg`. OSD_SCRUB_ERRORS ________________ @@ -1158,7 +1158,7 @@ ________________ Recent OSD scrubs have discovered inconsistencies. This alert is generally paired with *PG_DAMAGED* (see above). -For more information, see :doc:`../troubleshooting/troubleshooting-pg`. +For more information, see :ref:`rados_operations_monitoring_osd_pg`. OSD_TOO_MANY_REPAIRS ____________________ diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst index ba8805941890b..f5d86f3fb191d 100644 --- a/doc/rados/operations/monitoring-osd-pg.rst +++ b/doc/rados/operations/monitoring-osd-pg.rst @@ -197,8 +197,7 @@ the following diagram, we assume a pool with three replicas of the PG: | Peering | The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD -Interaction`_. To troubleshoot peering issues, see `Peering -Failure`_. +Interaction`_. To troubleshoot peering issues, see :ref:`failures-osd-peering`. Monitoring PG States @@ -487,7 +486,7 @@ To identify stuck PGs, run the following command: ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs, -see `Troubleshooting PG Errors`_. +see :ref:`failures-pg-stuck`. Finding an Object Location @@ -554,8 +553,6 @@ performing the migration. For details, see the `Architecture`_ section. .. _mClock backfill: ../../configuration/mclock-config-ref#recovery-backfill-options .. _Architecture: ../../../architecture .. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running -.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors -.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering .. _CRUSH map: ../crush-map .. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ .. 
_Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst index f1355b02c735f..bf60e83999bcf 100644 --- a/doc/rados/troubleshooting/troubleshooting-pg.rst +++ b/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -8,10 +8,10 @@ Placement Groups Never Get Clean Placement Groups (PGs) that remain in the ``active`` status, the ``active+remapped`` status or the ``active+degraded`` status and never achieve an ``active+clean`` status might indicate a problem with the configuration of -the Ceph cluster. +the Ceph cluster. -In such a situation, review the settings in the `Pool, PG and CRUSH Config -Reference`_ and make appropriate adjustments. +In such a situation, review the settings in the :ref:`rados_config_pool_pg_crush_ref` +and make appropriate adjustments. As a general rule, run your cluster with more than one OSD and a pool size of greater than two object replicas. @@ -29,11 +29,11 @@ VMs are used as clients). You can experiment with Ceph in a one-node configuration, in spite of the limitations as described herein. To create a cluster on a single node, you must change the -``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning +:confval:`osd_crush_chooseleaf_type` setting from the default of ``1`` (meaning ``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration -file before you create your monitors and OSDs. This tells Ceph that an OSD is +file before you create Monitors and OSDs. This tells Ceph that an OSD is permitted to place another OSD on the same host. If you are trying to set up a -single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``, +single-node cluster and :confval:`osd_crush_chooseleaf_type` is greater than ``0``, Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on another node, chassis, rack, row, or datacenter depending on the setting. @@ -48,16 +48,16 @@ directories for the data first. Fewer OSDs than Replicas ------------------------ -If two OSDs are in an ``up`` and ``in`` state, but the placement gropus are not -in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set -to greater than ``2``. +If a number of OSDs are in an ``up`` and ``in`` state, but the placement groups are not +in an ``active+clean`` state, you may have an :confval:`osd_pool_default_size` set +to greater than the number of ``up`` and ``in`` state OSDs. -There are a few ways to address this situation. If you want to operate your -cluster in an ``active + degraded`` state with two replicas, you can set the -``osd_pool_default_min_size`` to ``2`` so that you can write objects in an -``active + degraded`` state. You may also set the ``osd_pool_default_size`` +There are a few ways to address this situation. For example, if you want to operate your +cluster with :confval:`osd_pool_default_size` set to ``3`` in an ``active+degraded`` state with two replicas, you can set the +:confval:`osd_pool_default_min_size` to ``2`` so that you can write objects in an +``active+degraded`` state. You may also set the :confval:`osd_pool_default_size` setting to ``2`` so that you have only two stored replicas (the original and -one replica). In such a case, the cluster should achieve an ``active + clean`` +one replica). In such a case, the cluster should achieve an ``active+clean`` state. .. note:: You can make the changes while the cluster is running. If you make @@ -68,7 +68,7 @@ state. 
Pool Size = 1 ------------- -If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy +If you have :confval:`osd_pool_default_size` set to ``1``, you will have only one copy of the object. OSDs rely on other OSDs to tell them which objects they should have. If one OSD has a copy of an object and there is no second copy, then there is no second OSD to tell the first OSD that it should have that copy. For @@ -76,7 +76,7 @@ each placement group mapped to the first OSD (see ``ceph pg dump``), you can force the first OSD to notice the placement groups it needs by running a command of the following form: -.. prompt:: bash +.. prompt:: bash # ceph osd force-create-pg @@ -84,40 +84,40 @@ command of the following form: CRUSH Map Errors ---------------- -If any placement groups in your cluster are unclean, then there might be errors +If any placement groups in your cluster are ``unclean``, then there might be errors in your CRUSH map. +.. _failures-pg-stuck: Stuck Placement Groups ====================== -It is normal for placement groups to enter "degraded" or "peering" states after +It is normal for placement groups to enter ``degraded`` or ``peering`` states after a component failure. Normally, these states reflect the expected progression through the failure recovery process. However, a placement group that stays in one of these states for a long time might be an indication of a larger problem. For this reason, the Ceph Monitors will warn when placement groups get "stuck" in a non-optimal state. Specifically, we check for: -* ``inactive`` - The placement group has not been ``active`` for too long (that +* ``inactive`` The placement group has not been ``active`` for too long (that is, it hasn't been able to service read/write requests). -* ``unclean`` - The placement group has not been ``clean`` for too long (that +* ``unclean`` The placement group has not been ``clean`` for too long (that is, it hasn't been able to completely recover from a previous failure). -* ``stale`` - The placement group status has not been updated by a - ``ceph-osd``. This indicates that all nodes storing this placement group may - be ``down``. +* ``stale`` The placement group status has not been updated by an OSD. + This indicates that all nodes storing this placement group may be ``down``. List stuck placement groups by running one of the following commands: -.. prompt:: bash +.. prompt:: bash # ceph pg dump_stuck stale ceph pg dump_stuck inactive ceph pg dump_stuck unclean -- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd`` - daemons are not running. +- Stuck ``stale`` placement groups usually indicate that key OSDs are + not running. - Stuck ``inactive`` placement groups usually indicate a peering problem (see :ref:`failures-osd-peering`). - Stuck ``unclean`` placement groups usually indicate that something is @@ -125,21 +125,20 @@ List stuck placement groups by running one of the following commands: :ref:`failures-osd-unfound`); - .. _failures-osd-peering: Placement Group Down - Peering Failure ====================================== -In certain cases, the ``ceph-osd`` `peering` process can run into problems, +In certain cases, the OSD `peering` process can run into problems, which can prevent a PG from becoming active and usable. In such a case, running the command ``ceph health detail`` will report something similar to the following: -.. prompt:: bash +.. prompt:: bash # ceph health detail -:: +.. 
code-block:: none HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down ... @@ -150,7 +149,7 @@ the command ``ceph health detail`` will report something similar to the followin Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form: -.. prompt:: bash +.. prompt:: bash # ceph pg 0.5 query @@ -159,10 +158,10 @@ Query the cluster to determine exactly why the PG is marked ``down`` by running { "state": "down+peering", ... "recovery_state": [ - { "name": "Started\/Primary\/Peering\/GetInfo", + { "name": "Started/Primary/Peering/GetInfo", "enter_time": "2012-03-06 14:40:16.169679", "requested_info_from": []}, - { "name": "Started\/Primary\/Peering", + { "name": "Started/Primary/Peering", "enter_time": "2012-03-06 14:40:16.169659", "probing_osds": [ 0, @@ -180,8 +179,8 @@ Query the cluster to determine exactly why the PG is marked ``down`` by running } The ``recovery_state`` section tells us that peering is blocked due to down -``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that -particular ``ceph-osd`` and recovery will proceed. +OSDs, specifically ``osd.1``. In this case, we can start that +particular OSD and recovery will proceed. Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if there has been a disk failure), the cluster can be informed that the OSD is @@ -194,7 +193,7 @@ there has been a disk failure), the cluster can be informed that the OSD is To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery anyway, run a command of the following form: -.. prompt:: bash +.. prompt:: bash # ceph osd lost 1 @@ -209,28 +208,28 @@ Unfound Objects Under certain combinations of failures, Ceph may complain about ``unfound`` objects, as in this example: -.. prompt:: bash +.. prompt:: bash # ceph health detail -:: +.. code-block:: none HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) pg 2.4 is active+degraded, 78 unfound This means that the storage cluster knows that some objects (or newer copies of existing objects) exist, but it hasn't found copies of them. Here is an -example of how this might come about for a PG whose data is on two OSDS, which +example of how this might come about for a PG whose data is on two OSDs, which we will call "1" and "2": -* 1 goes down -* 2 handles some writes, alone -* 1 comes up +* 1 goes down. +* 2 handles some writes, alone. +* 1 comes up. * 1 and 2 re-peer, and the objects missing on 1 are queued for recovery. * Before the new objects are copied, 2 goes down. -At this point, 1 knows that these objects exist, but there is no live -``ceph-osd`` that has a copy of the objects. In this case, IO to those objects +At this point, 1 knows that these objects exist, but there is no live OSD +that has a copy of the objects. In this case, IO to those objects will block, and the cluster will hope that the failed node comes back soon. This is assumed to be preferable to returning an IO error to the user. @@ -240,11 +239,11 @@ This is assumed to be preferable to returning an IO error to the user. Identify which objects are unfound by running a command of the following form: -.. prompt:: bash +.. prompt:: bash # ceph pg 2.4 list_unfound [starting offset, in json] -.. code-block:: javascript +.. code-block:: json { "num_missing": 1, @@ -296,18 +295,18 @@ OSDs that have the status of ``already probed`` are ignored. Use of ``query``: -.. prompt:: bash +.. 
prompt:: bash # ceph pg 2.4 query -.. code-block:: javascript +.. code-block:: json "recovery_state": [ - { "name": "Started\/Primary\/Active", + { "name": "Started/Primary/Active", "enter_time": "2012-03-06 15:15:46.713212", "might_have_unfound": [ { "osd": 1, - "status": "osd is down"}]}, + "status": "osd is down"}]}] In this case, the cluster knows that ``osd.1`` might have data, but it is ``down``. Here is the full range of possible states: @@ -332,7 +331,7 @@ combinations of failures have occurred that allow the cluster to learn about writes that were performed before the writes themselves have been recovered. To mark the "unfound" objects as "lost", run a command of the following form: -.. prompt:: bash +.. prompt:: bash # ceph pg 2.5 mark_unfound_lost revert|delete @@ -346,6 +345,7 @@ either roll back to a previous version of the object or (if it was a new object) forget about the object entirely. Use ``revert`` with caution, as it may confuse applications that expect the object to exist. + Homeless Placement Groups ========================= @@ -355,22 +355,22 @@ placement groups becomes unavailable and the monitor will receive no status updates for those placement groups. The monitor marks as ``stale`` any placement group whose primary OSD has failed. For example: -.. prompt:: bash +.. prompt:: bash # ceph health -:: +.. code-block:: none HEALTH_WARN 24 pgs stale; 3/300 in osds are down Identify which placement groups are ``stale`` and which were the last OSDs to store the ``stale`` placement groups by running the following command: -.. prompt:: bash +.. prompt:: bash # ceph health detail -:: +.. code-block:: none HEALTH_WARN 24 pgs stale; 3/300 in osds are down ... @@ -380,7 +380,7 @@ store the ``stale`` placement groups by running the following command: osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 -This output indicates that placement group 2.5 (``pg 2.5``) was last managed by +This output indicates that placement group ``2.5`` (``pg 2.5``) was last managed by ``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover that placement group. @@ -395,7 +395,7 @@ OSDs in an operation involving dividing the number of placement groups in the cluster by the number of OSDs in the cluster, a small number of placement groups (the remainder, in this operation) are sometimes not distributed across the cluster. In situations like this, create a pool with a placement group -count that is a multiple of the number of OSDs. See `Placement Groups`_ for +count that is a multiple of the number of OSDs. See :ref:`placement groups` for details. See the :ref:`Pool, PG, and CRUSH Config Reference ` for instructions on changing the default values used to determine how many placement groups are assigned to each pool. @@ -408,23 +408,23 @@ If the cluster is up, but some OSDs are down and you cannot write data, make sure that you have the minimum number of OSDs running in the pool. If you don't have the minimum number of OSDs running in the pool, Ceph will not allow you to write data to it because there is no guarantee that Ceph can replicate your -data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH +data. See :confval:`osd_pool_default_min_size` in the :ref:`Pool, PG, and CRUSH Config Reference ` for details. 
PGs Inconsistent ================ -If the command ``ceph health detail`` returns an ``active + clean + -inconsistent`` state, this might indicate an error during scrubbing. Identify +If the command ``ceph health detail`` returns an ``active+clean+inconsistent`` +state, this might indicate an error during scrubbing. Identify the inconsistent placement group or placement groups by running the following command: -.. prompt:: bash +.. prompt:: bash # ceph health detail -:: +.. code-block:: none HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 0.6 is active+clean+inconsistent, acting [0,1,2] @@ -433,11 +433,11 @@ command: Alternatively, run this command if you prefer to inspect the output in a programmatic way: -.. prompt:: bash +.. prompt:: bash # rados list-inconsistent-pg rbd -:: +.. code-block:: none ["0.6"] @@ -446,11 +446,11 @@ different inconsistencies in multiple perspectives found in more than one object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of ``rados list-inconsistent-pg rbd`` will look something like this: -.. prompt:: bash +.. prompt:: bash # rados list-inconsistent-obj 0.6 --format=json-pretty -.. code-block:: javascript +.. code-block:: json { "epoch": 14, @@ -508,40 +508,40 @@ In this case, the output indicates the following: inconsistencies. * The inconsistencies fall into two categories: - #. ``errors``: these errors indicate inconsistencies between shards, without + #. ``errors``: These errors indicate inconsistencies between shards, without an indication of which shard(s) are bad. Check for the ``errors`` in the ``shards`` array, if available, to pinpoint the problem. - * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2`` + * ``data_digest_mismatch``: The digest of the replica read from ``OSD.2`` is different from the digests of the replica reads of ``OSD.0`` and - ``OSD.1`` - * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``, + ``OSD.1``. + * ``size_mismatch``: The size of the replica read from ``OSD.2`` is ``0``, but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``. - #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the + #. ``union_shard_errors``: The union of all shard-specific ``errors`` in the ``shards`` array. The ``errors`` are set for the shard with the problem. These errors include ``read_error`` and other similar errors. The ``errors`` ending in ``oi`` indicate a comparison with ``selected_object_info``. Examine the ``shards`` array to determine which shard has which error or errors. - * ``data_digest_mismatch_info``: the digest stored in the ``object-info`` + * ``data_digest_mismatch_info``: The digest stored in the ``object-info`` is not ``0xffffffff``, which is calculated from the shard read from - ``OSD.2`` - * ``size_mismatch_info``: the size stored in the ``object-info`` is + ``OSD.2``. + * ``size_mismatch_info``: The size stored in the ``object-info`` is different from the size read from ``OSD.2``. The latter is ``0``. .. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the inconsistency is likely due to physical storage errors. In cases like this, - check the storage used by that OSD. - + check the storage used by that OSD. + Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive repair. To repair the inconsistent placement group, run a command of the following form: -.. prompt:: bash +.. prompt:: bash # ceph pg repair {placement-group-ID} @@ -550,7 +550,7 @@ For example: .. 
prompt:: bash # ceph pg repair 1.4 - + .. warning:: This command overwrites the "bad" copies with "authoritative" copies. In most cases, Ceph is able to choose authoritative copies from all the available replicas by using some predefined criteria. This, however, @@ -564,14 +564,16 @@ For example: command ``ceph osd dump | grep pool`` return a list of pool numbers. -If you receive ``active + clean + inconsistent`` states periodically due to +If you receive ``active+clean+inconsistent`` states periodically due to clock skew, consider configuring the `NTP `_ daemons on your monitor hosts to act as peers. See `The Network Time Protocol `_ and Ceph :ref:`Clock Settings ` for more information. + More Information on PG Repair ----------------------------- + Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a PG, the lead OSD attempts to choose an authoritative copy from among its replicas. Only one of the possible cases is consistent. @@ -583,7 +585,7 @@ any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency. The discovery of these inconsistencies cause a PG's state to be set to ``inconsistent``. -The ``pg repair`` command attempts to fix inconsistencies of various kinds. When +The ``pg repair`` command attempts to fix inconsistencies of various kinds. When ``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. When ``pg repair`` finds an inconsistent copy in a replicated pool, it marks the @@ -591,8 +593,8 @@ inconsistent copy as missing. In the case of replicated pools, recovery is beyond the scope of ``pg repair``. In the case of erasure-coded and BlueStore pools, Ceph will automatically -perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to -``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default +perform repairs if :confval:`osd_scrub_auto_repair` (default ``false``) is set to +``true`` and if no more than :confval:`osd_scrub_auto_repair_num_errors` (default ``5``) errors are found. The ``pg repair`` command will not solve every problem. Ceph does not @@ -615,36 +617,41 @@ might not be the uncorrupted replica. Because of this uncertainty, human intervention is necessary when an inconsistency is discovered. This intervention sometimes involves use of ``ceph-objectstore-tool``. + PG Repair Walkthrough --------------------- + https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page -contains a walkthrough of the repair of a PG. It is recommended reading if you -want to repair a PG but have never done so. +contains a walkthrough of the repair of a PG on the deprecated Filestore OSD back end. It is recommended reading if you +want to repair a PG on a Filestore OSD but have never done so. The walkthrough does not +apply to modern BlueStore OSDs. -Erasure Coded PGs are not active+clean -====================================== + +Erasure Coded PGs are not ``active+clean`` +========================================== If CRUSH fails to find enough OSDs to map to a PG, it will show as a ``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example:: [2,1,6,0,5,8,2147483647,7,4] + Not enough OSDs --------------- If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine -OSDs, the cluster will show "Not enough OSDs". 
In this case, you either create -another erasure coded pool that requires fewer OSDs, by running commands of the +OSDs, the cluster will show ``Not enough OSDs``. In this case, either add new +OSDs that the PG will then use automatically, or create +another erasure coded pool that requires fewer OSDs by running commands of the following form: -.. prompt:: bash +.. prompt:: bash # ceph osd erasure-code-profile set myprofile k=5 m=3 ceph osd pool create erasurepool erasure myprofile -or add new OSDs, and the PG will automatically use them. -CRUSH constraints cannot be satisfied +CRUSH Constraints cannot be Satisfied ------------------------------------- If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing @@ -653,16 +660,22 @@ the CRUSH rule requires that no two OSDs from the same host are used in the same PG, the mapping may fail because only two OSDs will be found. Check the constraint by displaying ("dumping") the rule, as shown here: -.. prompt:: bash +.. prompt:: bash # ceph osd crush rule ls -:: +.. code-block:: json [ "replicated_rule", "erasurepool"] - $ ceph osd crush rule dump erasurepool + +.. prompt:: bash # + + ceph osd crush rule dump erasurepool + +.. code-block:: json + { "rule_id": 1, "rule_name": "erasurepool", "type": 3, @@ -679,39 +692,44 @@ constraint by displaying ("dumping") the rule, as shown here: Resolve this problem by creating a new pool in which PGs are allowed to have OSDs residing on the same host by running the following commands: -.. prompt:: bash +.. prompt:: bash # ceph osd erasure-code-profile set myprofile crush-failure-domain=osd ceph osd pool create erasurepool erasure myprofile -CRUSH gives up too soon + +CRUSH Gives up too Soon ----------------------- If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster with a total of nine OSDs and an erasure coded pool that requires nine OSDs per -PG), it is possible that CRUSH gives up before finding a mapping. This problem -can be resolved by: +PG), it is possible that CRUSH gives up before finding a mapping. To resolve +this problem, either: -* lowering the erasure coded pool requirements to use fewer OSDs per PG (this +* Lower the erasure coded pool requirements to use fewer OSDs per PG (this requires the creation of another pool, because erasure code profiles cannot be modified dynamically). -* adding more OSDs to the cluster (this does not require the erasure coded pool - to be modified, because it will become clean automatically) +* Add more OSDs to the cluster (this does not require the erasure coded pool + to be modified, because it will become clean automatically). -* using a handmade CRUSH rule that tries more times to find a good mapping. +* Use a handmade CRUSH rule that tries more times to find a good mapping. This can be modified for an existing CRUSH rule by setting - ``set_choose_tries`` to a value greater than the default. + ``set_choose_tries`` to a value greater than the default. For more + information, see :ref:`rados-crush-map-edits`. + +* Use a multi-step retry (MSR) CRUSH rule (Squid or later releases). For more + information, see :ref:`rados-crush-msr-rules`. First, verify the problem by using ``crushtool`` after extracting the crushmap from the cluster. This ensures that your experiments do not modify the Ceph cluster and that they operate only on local files: -.. prompt:: bash +.. prompt:: bash # ceph osd crush rule dump erasurepool -:: +.. 
code-block:: json { "rule_id": 1, "rule_name": "erasurepool", @@ -724,12 +742,24 @@ cluster and that they operate only on local files: "num": 0, "type": "host"}, { "op": "emit"}]} - $ ceph osd getcrushmap > crush.map + +.. prompt:: bash # + + ceph osd getcrushmap > crush.map + +.. code-block:: none + got crush map from osdmap epoch 13 - $ crushtool -i crush.map --test --show-bad-mappings \ + +.. prompt:: bash # + + crushtool -i crush.map --test --show-bad-mappings \ --rule 1 \ --num-rep 9 \ --min-x 1 --max-x $((1024 * 1024)) + +.. code-block:: none + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] @@ -747,25 +777,23 @@ considered bad, the CRUSH rule can be configured to search longer for a viable placement. -Changing the value of set_choose_tries -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Changing the Value of ``set_choose_tries`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #. Decompile the CRUSH map to edit the CRUSH rule by running the following command: - .. prompt:: bash + .. prompt:: bash # crushtool --decompile crush.map > crush.txt For illustrative purposes a simplified CRUSH map will be used in this - example, simulating a single host with four disks of sizes 3×1TiB and - 1×200GiB. The settings below are chosen specifically for this example and + example, simulating a single host with four disks of sizes 3×1 TiB and + 1×200 GiB. The settings below are chosen specifically for this example and will diverge from the :ref:`CRUSH Map Tunables ` generally found in production clusters. As defaults may change, please refer to the correct version of the documentation for your release of Ceph. -_ - :: tunable choose_local_tries 0 @@ -832,7 +860,7 @@ _ step set_choose_tries 100 If the line does exist already, as in this example, only modify the value. - Ensure that the rule in this ``crush.txt`` does resemble this after the + Ensure that the rule in your ``crush.txt`` does resemble this after the change:: rule ec { @@ -847,7 +875,7 @@ _ #. Recompile and retest the CRUSH rule: - .. prompt:: bash + .. prompt:: bash # crushtool --compile crush.txt -o better-crush.map @@ -856,7 +884,7 @@ _ ``--show-choose-tries`` option of the ``crushtool`` command, as in the following example: - .. prompt:: bash + .. prompt:: bash # crushtool -i better-crush.map --test --show-bad-mappings \ --show-choose-tries \ @@ -864,7 +892,7 @@ _ --num-rep 3 \ --min-x 1 --max-x 10 - :: + .. code-block:: none 0: 0 1: 0 @@ -908,6 +936,3 @@ placements in practice, however if a lower value is desired then the lower value can be used at the chance of potentially hitting one of the rare cases in which placement fails, requiring manual intervention. -.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups -.. _Placement Groups: ../../operations/placement-groups -.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
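A minimal end-to-end sketch of the PG triage steps described above, assuming an
admin keyring on the host and the optional ``jq`` utility; the PG ID ``2.4`` and
the pool name ``rbd`` are placeholders taken from the examples and must be
replaced with values from your own cluster:

.. code-block:: bash

    #!/usr/bin/env bash
    # Triage sketch: stuck, unfound and inconsistent PGs.
    set -u

    pgid=2.4        # replace with a PG ID reported by "ceph health detail"
    pool=rbd        # replace with the affected pool name

    # Overall health, with per-PG detail.
    ceph health detail

    # PGs stuck in non-optimal states.
    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # Why is this PG not active+clean? The recovery_state section usually
    # names the OSDs that block peering or recovery.
    ceph pg "$pgid" query | jq '.recovery_state'

    # Objects the cluster knows about but cannot currently find a copy of.
    ceph pg "$pgid" list_unfound

    # Scrub inconsistencies, per pool and per object.
    rados list-inconsistent-pg "$pool"
    rados list-inconsistent-obj "$pgid" --format=json-pretty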
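For the "Fewer OSDs than Replicas" and "Pool Size = 1" cases, the defaults
discussed above can also be adjusted at runtime. A rough sketch, assuming a test
pool named ``mypool``; the per-pool ``ceph osd pool set`` commands are used here
as the runtime counterparts of the ``osd_pool_default_*`` options, and a pool
size of ``2`` is only sensible for the small test clusters described above:

.. code-block:: bash

    # Cluster-wide defaults, applied to newly created pools.
    ceph config set global osd_pool_default_size 2
    ceph config set global osd_pool_default_min_size 2

    # The same settings for an existing pool ("mypool" is a placeholder).
    ceph osd pool set mypool size 2
    ceph osd pool set mypool min_size 2

    # Verify the result.
    ceph osd pool get mypool size
    ceph osd pool get mypool min_size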
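The offline test loop for the "CRUSH gives up too soon" case can be scripted
roughly as follows. Rule number ``1`` and ``--num-rep 9`` are taken from the
erasure-coded example above and stand in for your own rule ID and pool size;
the final ``ceph osd setcrushmap`` step is left commented out because it
changes the live cluster:

.. code-block:: bash

    #!/usr/bin/env bash
    # Offline CRUSH rule test loop: extract, edit, recompile, test.
    set -euo pipefail

    # Extract and decompile the current CRUSH map; the cluster is not modified.
    ceph osd getcrushmap > crush.map
    crushtool --decompile crush.map > crush.txt

    # Add "step set_choose_tries 100" to the affected rule, or raise the
    # value if the line is already present.
    "${EDITOR:-vi}" crush.txt

    # Recompile and check for bad mappings locally.
    crushtool --compile crush.txt -o better-crush.map
    crushtool -i better-crush.map --test --show-bad-mappings \
        --rule 1 \
        --num-rep 9 \
        --min-x 1 --max-x $((1024 * 1024))

    # Only once no bad mappings are reported, inject the new map:
    # ceph osd setcrushmap -i better-crush.map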