documentation: add troubleshooting erasure coded PGs section

author Loic Dachary <ldachary@redhat.com>

Wed, 17 Dec 2014 15:08:29 +0000 (16:08 +0100)

committer Loic Dachary <ldachary@redhat.com>

Thu, 15 Jan 2015 20:26:53 +0000 (21:26 +0100)
author Loic Dachary <ldachary@redhat.com>
Wed, 17 Dec 2014 15:08:29 +0000 (16:08 +0100)
committer Loic Dachary <ldachary@redhat.com>
Thu, 15 Jan 2015 20:26:53 +0000 (21:26 +0100)
diff --git a/doc/rados/troubleshooting/troubleshooting-pg.rst b/doc/rados/troubleshooting/troubleshooting-pg.rst

index 3c4b4224061f4574d0e8a9220393dcc6ec4e964c..e7af5a54b6d85aac8adba4528665a4fddefa734d 100644 (file)
--- a/doc/rados/troubleshooting/troubleshooting-pg.rst
+++ b/doc/rados/troubleshooting/troubleshooting-pg.rst
@@ -359,6 +359,201 @@ monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
  `Clock Settings`_ for additional details.
  
  
+Erasure Coded PGs are not active+clean
+======================================
+
+When CRUSH fails to find enough OSDs to map to a PG, it will show as a
+``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance::
+
+     [2,1,6,0,5,8,2147483647,7,4]
+
+Not enough OSDs
+---------------
+
+If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
+9, that is what it will show. You can either create another erasure
+coded pool that requires less OSDs::
+
+     ceph osd erasure-code-profile set myprofile k=5 m=3
+     ceph osd pool create erasurepool 16 16 erasure myprofile
+
+or add a new OSDs and the PG will automatically use them.
+
+CRUSH constraints cannot be satisfied
+-------------------------------------
+
+If the cluster has enough OSDs, it is possible that the CRUSH ruleset
+imposes constraints that cannot be satisfied. If there are 10 OSDs on
+two hosts and the CRUSH rulesets require that no two OSDs from the
+same host are used in the same PG, the mapping may fail because only
+two OSD will be found. You can check the constraint by displaying the
+ruleset::
+
+    $ ceph osd crush rule ls
+    [
+        "replicated_ruleset",
+        "erasurepool"]
+    $ ceph osd crush rule dump erasurepool
+    { "rule_id": 1,
+      "rule_name": "erasurepool",
+      "ruleset": 1,
+      "type": 3,
+      "min_size": 3,
+      "max_size": 20,
+      "steps": [
+            { "op": "take",
+              "item": -1,
+              "item_name": "default"},
+            { "op": "chooseleaf_indep",
+              "num": 0,
+              "type": "host"},
+            { "op": "emit"}]}
+
+
+You can resolve the problem by creating a new pool in which PGs are allowed
+to have OSDs residing on the same host with::
+
+     ceph osd erasure-code-profile set myprofile ruleset-failure-domain=osd
+     ceph osd pool create erasurepool 16 16 erasure myprofile
+
+CRUSH gives up too soon
+-----------------------
+
+If the Ceph cluster has just enough OSDs to map the PG (for instance a
+cluster with a total of 9 OSDs and an erasure coded pool that requires
+9 OSDs per PG), it is possible that CRUSH gives up before finding a
+mapping. It can be resolved by:
+
+* lowering the erasure coded pool requirements to use less OSDs per PG
+  (that requires the creation of another pool as erasure code profiles
+  cannot be dynamically modified).
+
+* adding more OSDs to the cluster (that does not require the erasure
+  coded pool to be modified, it will become clean automatically)
+
+* use a hand made CRUSH ruleset that tries more times to find a good
+  mapping.  (that requires the creation of another pool as erasure
+  code profiles cannot be dynamically modified). It can be done by
+  setting ``set_choose_tries`` to a value greater than the default.
+
+You should first verify the problem with ``crushtool`` after
+extracting the crushmap from the cluster so your experiments do not
+modify the Ceph cluster and only work on a local files::
+
+    $ ceph osd crush rule dump erasurepool
+    { "rule_name": "erasurepool",
+      "ruleset": 1,
+      "type": 3,
+      "min_size": 3,
+      "max_size": 20,
+      "steps": [
+            { "op": "take",
+              "item": -1,
+              "item_name": "default"},
+            { "op": "chooseleaf_indep",
+              "num": 0,
+              "type": "host"},
+            { "op": "emit"}]}
+    $ ceph osd getcrushmap > crush.map
+    got crush map from osdmap epoch 13
+    $ crushtool -i crush.map --test --show-bad-mappings \
+       --rule 1 \
+       --num-rep 9 \
+       --min-x 1 --max-x $((1024 * 1024))
+    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
+    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
+    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
+
+Where ``--num-rep`` is the number of OSDs the erasure code crush
+ruleset needs, ``--rule`` is the value of the ``ruleset`` field
+displayed by ``ceph osd crush rule dump``.  The test will try mapping
+one million values (i.e. the range defined by ``[--min-x,--max-x]``)
+and must display at least one bad mapping. If it outputs nothing it
+means all mappings are successfull and you can stop right there: the
+problem is elsewhere.
+
+The crush ruleset can be edited by decompiling the crush map::
+
+    $ crushtool --decompile crush.map > crush.txt
+
+and adding the following line to the ruleset::
+
+    step set_choose_tries 100
+
+The relevant part of of the ``crush.txt`` file should look something
+like::
+
+     rule erasurepool {
+             ruleset 1
+             type erasure
+             min_size 3
+             max_size 20
+             step set_chooseleaf_tries 5
+             step set_choose_tries 100
+             step take default
+             step chooseleaf indep 0 type host
+             step emit
+     }
+
+It can then be compiled and tested again::
+
+    $ crushtool --compile crush.txt -o better-crush.map
+
+When all mappings succeed, an histogram of the number of tries that
+were necessary to find all of them can be displayed with the
+``--show-choose-tries`` option of ``crushtool``::
+
+    $ crushtool -i better-crush.map --test --show-bad-mappings \
+       --show-choose-tries \
+       --rule 1 \
+       --num-rep 9 \
+       --min-x 1 --max-x $((1024 * 1024))
+    ...
+    11:        42
+    12:        44
+    13:        54
+    14:        45
+    15:        35
+    16:        34
+    17:        30
+    18:        25
+    19:        19
+    20:        22
+    21:        20
+    22:        17
+    23:        13
+    24:        16
+    25:        13
+    26:        11
+    27:        11
+    28:        13
+    29:        11
+    30:        10
+    31:         6
+    32:         5
+    33:        10
+    34:         3
+    35:         7
+    36:         5
+    37:         2
+    38:         5
+    39:         5
+    40:         2
+    41:         5
+    42:         4
+    43:         1
+    44:         2
+    45:         2
+    46:         3
+    47:         1
+    48:         0
+    ...
+    102:         0
+    103:         1
+    104:         0
+    ...
+
+It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped).
  
  .. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
  .. _here: ../../configuration/pool-pg-config-ref
author	Loic Dachary <ldachary@redhat.com>
	Wed, 17 Dec 2014 15:08:29 +0000 (16:08 +0100)
committer	Loic Dachary <ldachary@redhat.com>
	Thu, 15 Jan 2015 20:26:53 +0000 (21:26 +0100)