Placement Groups (PGs) that remain in the ``active`` status, the
``active+remapped`` status or the ``active+degraded`` status and never achieve
an ``active+clean`` status might indicate a problem with the configuration of
-the Ceph cluster.
+the Ceph cluster.
-In such a situation, review the settings in the `Pool, PG and CRUSH Config
-Reference`_ and make appropriate adjustments.
+In such a situation, review the settings in the :ref:`rados_config_pool_pg_crush_ref`
+and make appropriate adjustments.
As a general rule, run your cluster with more than one OSD and a pool size
of greater than two object replicas.
configuration, in spite of the limitations as described herein.
To create a cluster on a single node, you must change the
-``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
+:confval:`osd_crush_chooseleaf_type` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
-file before you create your monitors and OSDs. This tells Ceph that an OSD is
+file before you create Monitors and OSDs. This tells Ceph that an OSD is
permitted to place another OSD on the same host. If you are trying to set up a
-single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``,
+single-node cluster and :confval:`osd_crush_chooseleaf_type` is greater than ``0``,
Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or datacenter depending on the setting.
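+
+A minimal sketch of such a ``ceph.conf`` fragment for a single-node test
+cluster might look like this (the exact file layout depends on your
+deployment tooling):
+
+.. code-block:: ini
+
+   [global]
+   # allow Ceph to place replicas on the same host
+   osd_crush_chooseleaf_type = 0
+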
Fewer OSDs than Replicas
------------------------
-If two OSDs are in an ``up`` and ``in`` state, but the placement gropus are not
-in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
-to greater than ``2``.
+If some OSDs are ``up`` and ``in`` but the placement groups are not in an
+``active+clean`` state, :confval:`osd_pool_default_size` may be set to a value
+greater than the number of OSDs that are ``up`` and ``in``.
-There are a few ways to address this situation. If you want to operate your
-cluster in an ``active + degraded`` state with two replicas, you can set the
-``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
-``active + degraded`` state. You may also set the ``osd_pool_default_size``
+There are a few ways to address this situation. For example, if you want to
+operate your cluster in an ``active+degraded`` state with two replicas while
+:confval:`osd_pool_default_size` is set to ``3``, you can set
+:confval:`osd_pool_default_min_size` to ``2`` so that you can write objects
+in an ``active+degraded`` state. You may also set the
+:confval:`osd_pool_default_size`
setting to ``2`` so that you have only two stored replicas (the original and
-one replica). In such a case, the cluster should achieve an ``active + clean``
+one replica). In such a case, the cluster should achieve an ``active+clean``
state.
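+
+As a sketch, the defaults can be adjusted with the centralized configuration
+interface. Note that these defaults apply to newly created pools; the size of
+an existing pool is changed with ``ceph osd pool set <poolname> size``:
+
+.. prompt:: bash #
+
+   ceph config set global osd_pool_default_size 2
+   ceph config set global osd_pool_default_min_size 2
+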
.. note:: You can make the changes while the cluster is running. If you make
Pool Size = 1
-------------
-If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
+If you have :confval:`osd_pool_default_size` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
force the first OSD to notice the placement groups it needs by running a
command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd force-create-pg <pgid>
CRUSH Map Errors
----------------
-If any placement groups in your cluster are unclean, then there might be errors
+If any placement groups in your cluster are ``unclean``, then there might be errors
in your CRUSH map.
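+
+One way to review the CRUSH map is to extract and decompile it, as sketched
+below, and then inspect the rules used by the affected pools:
+
+.. prompt:: bash #
+
+   ceph osd getcrushmap > crush.map
+   crushtool --decompile crush.map > crush.txt
+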
+.. _failures-pg-stuck:
Stuck Placement Groups
======================
-It is normal for placement groups to enter "degraded" or "peering" states after
+It is normal for placement groups to enter ``degraded`` or ``peering`` states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:
-* ``inactive`` - The placement group has not been ``active`` for too long (that
+* ``inactive``: The placement group has not been ``active`` for too long (that
is, it hasn't been able to service read/write requests).
-* ``unclean`` - The placement group has not been ``clean`` for too long (that
+* ``unclean``: The placement group has not been ``clean`` for too long (that
is, it hasn't been able to completely recover from a previous failure).
-* ``stale`` - The placement group status has not been updated by a
- ``ceph-osd``. This indicates that all nodes storing this placement group may
- be ``down``.
+* ``stale``: The placement group status has not been updated by an OSD.
+ This indicates that all nodes storing this placement group may be ``down``.
List stuck placement groups by running one of the following commands:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
-- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
- daemons are not running.
+- Stuck ``stale`` placement groups usually indicate that key OSDs are
+ not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
:ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
-  :ref:`failures-osd-unfound`);
+  :ref:`failures-osd-unfound`).
-
.. _failures-osd-peering:
Placement Group Down - Peering Failure
======================================
-In certain cases, the ``ceph-osd`` `peering` process can run into problems,
+In certain cases, the OSD `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the following:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
- { "name": "Started\/Primary\/Peering\/GetInfo",
+ { "name": "Started/Primary/Peering/GetInfo",
"enter_time": "2012-03-06 14:40:16.169679",
"requested_info_from": []},
- { "name": "Started\/Primary\/Peering",
+ { "name": "Started/Primary/Peering",
"enter_time": "2012-03-06 14:40:16.169659",
"probing_osds": [
0,
}
The ``recovery_state`` section tells us that peering is blocked due to down
-``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
-particular ``ceph-osd`` and recovery will proceed.
+OSDs, specifically ``osd.1``. In this case, we can start that
+particular OSD and recovery will proceed.
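+
+How the daemon is started depends on how it was deployed. As a sketch,
+assuming the down OSD is ``osd.1``, use one of the following on a
+package-based (systemd) or cephadm-managed cluster, respectively:
+
+.. prompt:: bash #
+
+   systemctl start ceph-osd@1
+   ceph orch daemon start osd.1
+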
Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd lost 1
Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
pg 2.4 is active+degraded, 78 unfound
This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them. Here is an
-example of how this might come about for a PG whose data is on two OSDS, which
+example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":
-* 1 goes down
-* 2 handles some writes, alone
-* 1 comes up
+* 1 goes down.
+* 2 handles some writes, alone.
+* 1 comes up.
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.
-At this point, 1 knows that these objects exist, but there is no live
-``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
+At this point, 1 knows that these objects exist, but there is no live OSD
+that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.
Identify which objects are unfound by running a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 2.4 list_unfound [starting offset, in json]
-.. code-block:: javascript
+.. code-block:: json
{
"num_missing": 1,
Use of ``query``:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 2.4 query
-.. code-block:: javascript
+.. code-block:: json
"recovery_state": [
- { "name": "Started\/Primary\/Active",
+ { "name": "Started/Primary/Active",
"enter_time": "2012-03-06 15:15:46.713212",
"might_have_unfound": [
{ "osd": 1,
- "status": "osd is down"}]},
+ "status": "osd is down"}]}]
In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg 2.5 mark_unfound_lost revert|delete
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.
+
Homeless Placement Groups
=========================
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:
-.. prompt:: bash
+.. prompt:: bash #
ceph health
-::
+.. code-block:: none
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
...
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
-This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
+This output indicates that placement group ``2.5`` (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
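+
+To confirm the mapping of a given placement group before and after restarting
+the OSDs, query it directly (a sketch using the placement group from the
+example above):
+
+.. prompt:: bash #
+
+   ceph pg map 2.5
+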
cluster by the number of OSDs in the cluster, a small number of placement
groups (the remainder, in this operation) are sometimes not distributed across
the cluster. In situations like this, create a pool with a placement group
-count that is a multiple of the number of OSDs. See `Placement Groups`_ for
+count that is a multiple of the number of OSDs. See :ref:`placement groups` for
details. See the :ref:`Pool, PG, and CRUSH Config Reference
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
values used to determine how many placement groups are assigned to each pool.
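+
+For example, on a cluster with ten OSDs, a pool might be created with a
+placement group count that is a multiple of ten (a sketch; the pool name
+``testpool`` and the count ``160`` are only illustrative values):
+
+.. prompt:: bash #
+
+   ceph osd pool create testpool 160
+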
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
-data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
+data. See :confval:`osd_pool_default_min_size` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.
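+
+To check the minimum size in effect for an existing pool (a sketch;
+``testpool`` is only an example name):
+
+.. prompt:: bash #
+
+   ceph osd pool get testpool min_size
+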
PGs Inconsistent
================
-If the command ``ceph health detail`` returns an ``active + clean +
-inconsistent`` state, this might indicate an error during scrubbing. Identify
+If the command ``ceph health detail`` returns an ``active+clean+inconsistent``
+state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:
-.. prompt:: bash
+.. prompt:: bash #
ceph health detail
-::
+.. code-block:: none
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
Alternatively, run this command if you prefer to inspect the output in a
programmatic way:
-.. prompt:: bash
+.. prompt:: bash #
rados list-inconsistent-pg rbd
-::
+.. code-block:: none
["0.6"]
object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of
-``rados list-inconsistent-pg rbd`` will look something like this:
+``rados list-inconsistent-obj 0.6`` will look something like this:
-.. prompt:: bash
+.. prompt:: bash #
rados list-inconsistent-obj 0.6 --format=json-pretty
-.. code-block:: javascript
+.. code-block:: json
{
"epoch": 14,
inconsistencies.
* The inconsistencies fall into two categories:
- #. ``errors``: these errors indicate inconsistencies between shards, without
+ #. ``errors``: These errors indicate inconsistencies between shards, without
an indication of which shard(s) are bad. Check for the ``errors`` in the
``shards`` array, if available, to pinpoint the problem.
- * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
+ * ``data_digest_mismatch``: The digest of the replica read from ``OSD.2``
is different from the digests of the replica reads of ``OSD.0`` and
- ``OSD.1``
- * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
+ ``OSD.1``.
+ * ``size_mismatch``: The size of the replica read from ``OSD.2`` is ``0``,
but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.
- #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
+ #. ``union_shard_errors``: The union of all shard-specific ``errors`` in the
``shards`` array. The ``errors`` are set for the shard with the problem.
These errors include ``read_error`` and other similar errors. The
``errors`` ending in ``oi`` indicate a comparison with
``selected_object_info``. Examine the ``shards`` array to determine
which shard has which error or errors.
- * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
+ * ``data_digest_mismatch_info``: The digest stored in the ``object-info``
is not ``0xffffffff``, which is calculated from the shard read from
- ``OSD.2``
- * ``size_mismatch_info``: the size stored in the ``object-info`` is
+ ``OSD.2``.
+ * ``size_mismatch_info``: The size stored in the ``object-info`` is
different from the size read from ``OSD.2``. The latter is ``0``.
.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
inconsistency is likely due to physical storage errors. In cases like this,
- check the storage used by that OSD.
-
+ check the storage used by that OSD.
+
Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
repair.
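+
+For example (a sketch; replace ``/dev/sdc`` with the device that backs the
+affected OSD):
+
+.. prompt:: bash #
+
+   dmesg -T | grep -i error
+   smartctl -a /dev/sdc
+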
To repair the inconsistent placement group, run a command of the following
form:
-.. prompt:: bash
+.. prompt:: bash #
ceph pg repair {placement-group-ID}
.. prompt:: bash #
ceph pg repair 1.4
-
+
.. warning:: This command overwrites the "bad" copies with "authoritative"
copies. In most cases, Ceph is able to choose authoritative copies from all
the available replicas by using some predefined criteria. This, however,
command ``ceph osd dump | grep pool`` return a list of pool numbers.
-If you receive ``active + clean + inconsistent`` states periodically due to
+If you receive ``active+clean+inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.
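+
+As a sketch, with ``chrony`` the Monitor hosts can be made to peer with one
+another by adding lines like the following to ``chrony.conf`` on each host
+(``mon2.example.com`` and ``mon3.example.com`` are placeholder names):
+
+.. code-block:: none
+
+   peer mon2.example.com
+   peer mon3.example.com
+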
+
More Information on PG Repair
-----------------------------
+
Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the lead OSD attempts to choose an authoritative
copy from among its replicas. Only one of the possible cases is consistent.
of the authoritative copy means that there is an inconsistency. The discovery
-of these inconsistencies cause a PG's state to be set to ``inconsistent``.
+of these inconsistencies causes a PG's state to be set to ``inconsistent``.
-The ``pg repair`` command attempts to fix inconsistencies of various kinds. When
+The ``pg repair`` command attempts to fix inconsistencies of various kinds. When
``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
the inconsistent copy with the digest of the authoritative copy. When ``pg
repair`` finds an inconsistent copy in a replicated pool, it marks the
beyond the scope of ``pg repair``.
In the case of erasure-coded and BlueStore pools, Ceph will automatically
-perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
-``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
+perform repairs if :confval:`osd_scrub_auto_repair` (default ``false``) is set to
+``true`` and if no more than :confval:`osd_scrub_auto_repair_num_errors` (default
``5``) errors are found.
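+
+For example, automatic repair can be enabled with a command of the following
+form (a sketch; weigh this against your operational policies before enabling
+it):
+
+.. prompt:: bash #
+
+   ceph config set osd osd_scrub_auto_repair true
+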
The ``pg repair`` command will not solve every problem. Ceph does not
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.
+
PG Repair Walkthrough
---------------------
+
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
-contains a walkthrough of the repair of a PG. It is recommended reading if you
-want to repair a PG but have never done so.
+contains a walkthrough of the repair of a PG on the deprecated Filestore OSD
+back end. It is recommended reading if you want to repair a PG on a Filestore
+OSD but have never done so. The walkthrough does not apply to modern BlueStore
+OSDs.
-Erasure Coded PGs are not active+clean
-======================================
+
+Erasure Coded PGs are not ``active+clean``
+==========================================
-If CRUSH fails to find enough OSDs to map to a PG, it will show as a
-``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example::
+If CRUSH fails to find enough OSDs to map to a PG, the PG will show
+``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For example::
[2,1,6,0,5,8,2147483647,7,4]
+
Not enough OSDs
---------------
If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
-OSDs, the cluster will show "Not enough OSDs". In this case, you either create
-another erasure coded pool that requires fewer OSDs, by running commands of the
+OSDs, the cluster will show ``Not enough OSDs``. In this case, either add new
+OSDs that the PG will then use automatically, or create
+another erasure coded pool that requires fewer OSDs by running commands of the
following form:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd erasure-code-profile set myprofile k=5 m=3
ceph osd pool create erasurepool erasure myprofile
-or add new OSDs, and the PG will automatically use them.
-CRUSH constraints cannot be satisfied
+CRUSH Constraints Cannot Be Satisfied
-------------------------------------
If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd crush rule ls
-::
+.. code-block:: json
[
"replicated_rule",
"erasurepool"]
- $ ceph osd crush rule dump erasurepool
+
+.. prompt:: bash #
+
+ ceph osd crush rule dump erasurepool
+
+.. code-block:: json
+
{ "rule_id": 1,
"rule_name": "erasurepool",
"type": 3,
Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
ceph osd pool create erasurepool erasure myprofile
-CRUSH gives up too soon
+
+CRUSH Gives Up Too Soon
-----------------------
If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
-PG), it is possible that CRUSH gives up before finding a mapping. This problem
-can be resolved by:
+PG), it is possible that CRUSH gives up before finding a mapping. To resolve
+this problem, take one of the following approaches:
-* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
+* Lower the erasure coded pool requirements to use fewer OSDs per PG (this
requires the creation of another pool, because erasure code profiles cannot
be modified dynamically).
-* adding more OSDs to the cluster (this does not require the erasure coded pool
- to be modified, because it will become clean automatically)
+* Add more OSDs to the cluster (this does not require the erasure coded pool
+ to be modified, because it will become clean automatically).
-* using a handmade CRUSH rule that tries more times to find a good mapping.
+* Use a handmade CRUSH rule that tries more times to find a good mapping.
This can be modified for an existing CRUSH rule by setting
- ``set_choose_tries`` to a value greater than the default.
+ ``set_choose_tries`` to a value greater than the default. For more
+ information, see :ref:`rados-crush-map-edits`.
+
+* Use a multi-step retry (MSR) CRUSH rule (Squid or later releases). For more
+ information, see :ref:`rados-crush-msr-rules`.
First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:
-.. prompt:: bash
+.. prompt:: bash #
ceph osd crush rule dump erasurepool
-::
+.. code-block:: json
{ "rule_id": 1,
"rule_name": "erasurepool",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
- $ ceph osd getcrushmap > crush.map
+
+.. prompt:: bash #
+
+ ceph osd getcrushmap > crush.map
+
+.. code-block:: none
+
got crush map from osdmap epoch 13
- $ crushtool -i crush.map --test --show-bad-mappings \
+
+.. prompt:: bash #
+
+ crushtool -i crush.map --test --show-bad-mappings \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
+
+.. code-block:: none
+
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
placement.
-Changing the value of set_choose_tries
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Changing the Value of ``set_choose_tries``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Decompile the CRUSH map to edit the CRUSH rule by running the following
command:
- .. prompt:: bash
+ .. prompt:: bash #
crushtool --decompile crush.map > crush.txt
For illustrative purposes a simplified CRUSH map will be used in this
- example, simulating a single host with four disks of sizes 3×1TiB and
- 1×200GiB. The settings below are chosen specifically for this example and
+ example, simulating a single host with four disks of sizes 3×1 TiB and
+ 1×200 GiB. The settings below are chosen specifically for this example and
will diverge from the :ref:`CRUSH Map Tunables <crush-map-tunables>`
generally found in production clusters. As defaults may change, please refer
to the correct version of the documentation for your release of Ceph.
-_
-
::
tunable choose_local_tries 0
step set_choose_tries 100
If the line does exist already, as in this example, only modify the value.
- Ensure that the rule in this ``crush.txt`` does resemble this after the
+ Ensure that the rule in your ``crush.txt`` resembles the following after the
change::
rule ec {
#. Recompile and retest the CRUSH rule:
- .. prompt:: bash
+ .. prompt:: bash #
crushtool --compile crush.txt -o better-crush.map
``--show-choose-tries`` option of the ``crushtool`` command, as in the
following example:
- .. prompt:: bash
+ .. prompt:: bash #
crushtool -i better-crush.map --test --show-bad-mappings \
--show-choose-tries \
--num-rep 3 \
--min-x 1 --max-x 10
- ::
+ .. code-block:: none
0: 0
1: 0
-value can be used at the chance of potentially hitting one of the rare cases in
-which placement fails, requiring manual intervention.
+value can be used at the risk of hitting one of the rare cases in which
+placement fails, requiring manual intervention.
-.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
-.. _Placement Groups: ../../operations/placement-groups
-.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref