From 0a15144c584d4445bd037340f053aa64b9c13f32 Mon Sep 17 00:00:00 2001
From: Patrick Donnelly
Date: Wed, 5 Jul 2023 21:25:59 -0400
Subject: [PATCH] doc: add information on expediting MDS recovery

Fixes: https://tracker.ceph.com/issues/61865
Signed-off-by: Patrick Donnelly
---
 doc/cephfs/troubleshooting.rst | 127 +++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/doc/cephfs/troubleshooting.rst b/doc/cephfs/troubleshooting.rst
index f4cd5f6c93655..b58d2469f6e62 100644
--- a/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@ -21,6 +21,133 @@ We can get hints about what's going on by dumping the MDS cache ::
 If high logging levels are set on the MDS, that will almost certainly hold the
 information we need to diagnose and solve the issue.
 
+Stuck during recovery
+=====================
+
+Stuck in up:replay
+------------------
+
+If your MDS is stuck in ``up:replay``, it is likely that the journal is very
+long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
+behind on trimming its journal? If the journal has grown very large, it can
+take hours to read. There is no way around this, but there are things you can
+do to speed replay along:
+
+Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
+messages to memory for dumping if a fatal error is encountered. You can avoid
+this overhead with:
+
+.. code:: bash
+
+   ceph config set mds debug_mds 0
+   ceph config set mds debug_ms 0
+   ceph config set mds debug_monc 0
+
+Note that if the MDS then fails, there will be virtually no information
+available to determine why. If you can estimate when ``up:replay`` will
+complete, restore these configs just prior to entering the next state:
+
+.. code:: bash
+
+   ceph config rm mds debug_mds
+   ceph config rm mds debug_ms
+   ceph config rm mds debug_monc
+
+Once you've got replay moving along faster, you can calculate when the MDS will
+complete by examining the journal replay status:
+
+.. code:: bash
+
+   $ ceph tell mds.<fs_name>:0 status | jq .replay_status
+   {
+     "journal_read_pos": 4195244,
+     "journal_write_pos": 4195244,
+     "journal_expire_pos": 4194304,
+     "num_events": 2,
+     "num_segments": 2
+   }
+
+Replay completes when the ``journal_read_pos`` reaches the
+``journal_write_pos``. The write position will not change during replay. Track
+the progression of the read position to compute the expected time to complete.
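+
+For example, a rough sketch of that calculation (this is not part of any Ceph
+tooling; it assumes ``jq`` is installed, that replay is making measurable
+progress, and that ``<fs_name>`` is replaced with your file system name):
+
+.. code:: bash
+
+   # Sample the journal read position twice, 60 seconds apart.
+   pos1=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
+   sleep 60
+   pos2=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
+   end=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_write_pos)
+   # Remaining bytes divided by the observed replay rate (bytes per second).
+   echo "estimated seconds until replay completes: $(( 60 * (end - pos2) / (pos2 - pos1) ))"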
+
+
+Avoiding recovery roadblocks
+----------------------------
+
+When trying to urgently restore your file system during an outage, here are
+some things to do:
+
+* **Deny all reconnect to clients.** This effectively blocklists all existing
+  CephFS sessions so all mounts will hang or become unavailable.
+
+  .. code:: bash
+
+     ceph config set mds mds_deny_all_reconnect true
+
+  Remember to undo this after the MDS becomes active.
+
+  .. note:: This does not prevent new sessions from connecting. For that, see
+     the ``refuse_client_session`` file system setting.
+
+* **Extend the MDS heartbeat grace period.** This avoids replacing an MDS that
+  appears "stuck" doing some operation. Sometimes recovery of an MDS may
+  involve an operation that takes longer than expected (from the programmer's
+  perspective). This is more likely when recovery is already taking longer
+  than normal to complete (indicated by your reading this document). Avoid
+  unnecessary replacement loops by extending the heartbeat grace period:
+
+  .. code:: bash
+
+     ceph config set mds mds_heartbeat_reset_grace 3600
+
+  This has the effect of having the MDS continue to send beacons to the
+  monitors even when its internal "heartbeat" mechanism has not been reset
+  (beat) in one hour. Note that the previous mechanism for achieving this was
+  the ``mds_beacon_grace`` monitor setting.
+
+* **Disable open file table prefetch.** Normally, the MDS will prefetch
+  directory contents during recovery to heat up its cache. During a long
+  recovery, the cache is probably already hot **and large**, so this behavior
+  can be undesirable. Disable it using:
+
+  .. code:: bash
+
+     ceph config set mds mds_oft_prefetch_dirfrags false
+
+* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
+  cause new load on the file system when it's just getting back on its feet.
+  There will likely be some general maintenance to do before workloads should
+  be resumed. For example, expediting journal trim may be advisable if the
+  recovery took a long time because replay was reading an overly large journal.
+
+  You can do this manually or use the new file system tunable:
+
+  .. code:: bash
+
+     ceph fs set <fs_name> refuse_client_session true
+
+  This prevents any clients from establishing new sessions with the MDS.
+
+
+Expediting MDS journal trim
+===========================
+
+If your MDS journal grew too large (maybe your MDS was stuck in ``up:replay``
+for a long time!), you will want to have the MDS trim its journal more
+frequently. You will know the journal is too large because of
+``MDS_HEALTH_TRIM`` warnings.
+
+The main tunable available for this is the MDS tick interval. The "tick"
+interval drives several upkeep activities in the MDS. It is strongly
+recommended that no significant file system load be present when modifying the
+tick interval. This setting only affects an MDS in ``up:active``; the MDS does
+not trim its journal during recovery.
+
+.. code:: bash
+
+   ceph config set mds mds_tick_interval 2
+
+
 RADOS Health
 ============
 
-- 
2.39.5