From: Zac Dover
Date: Fri, 15 Aug 2025 02:12:25 +0000 (+1000)
Subject: Merge pull request #64787 from vshankar/wip-improve-cephfs-dr-doc
X-Git-Tag: v21.0.0~50^2~220
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=52c80fb413a4801f40741406bb6e8594369654c5;p=ceph.git

Merge pull request #64787 from vshankar/wip-improve-cephfs-dr-doc

doc/cephfs: update cephfs disaster recovery documentation

Reviewed-by: Zac Dover
---

52c80fb413a4801f40741406bb6e8594369654c5
diff --cc doc/cephfs/troubleshooting.rst
index 305737c6f2f,ed142492294..79e6682d1ab
--- a/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@@ -5,34 -5,24 +5,36 @@@
  Slow/stuck operations
  =====================
  
 -If you are experiencing apparent hung operations, the first task is to identify
 -where the problem is occurring: in the client, the MDS, or the network connecting
 -them. Start by looking to see if either side has stuck operations
 -(:ref:`slow_requests`, below), and narrow it down from there.
 +Sometimes CephFS operations hang. The first step in troubleshooting them is to
 +locate the problem causing the operations to hang. Problems present in three
 +places:
  
 -We can get hints about what's going on by dumping the MDS cache ::
 +#. in the client
 +#. in the MDS
 +#. in the network that connects the client to the MDS
  
 -    ceph daemon mds. dump cache /tmp/dump.txt
 +First, use the procedure in :ref:`slow_requests` to determine if the client has
 +stuck operations or the MDS has stuck operations.
  
 -.. note:: The file `dump.txt` is on the machine executing the MDS and for systemd
 -   controlled MDS services, this is in a tmpfs in the MDS container.
 -   Use `nsenter(1)` to locate `dump.txt` or specify another system-wide path.
 +Dump the MDS cache. The contents of the MDS cache will be used to diagnose the
 +nature of the problem. Run the following command to dump the MDS cache:
  
 -If high logging levels are set on the MDS, that will almost certainly hold the
 -information we need to diagnose and solve the issue.
 +.. prompt:: bash #
 +
 +   ceph daemon mds. dump cache /tmp/dump.txt
 +
 +.. note:: MDS services that are not controlled by systemd dump the file
 +   ``dump.txt`` to the machine that runs the MDS. MDS services that are
 +   controlled by systemd dump the file ``dump.txt`` to a tmpfs in the MDS
 +   container. Use `nsenter(1)` to locate ``dump.txt`` or specify another
 +   system-wide path.
 +
 +If high logging levels have been set on the MDS, ``dump.txt`` can be expected
 +to hold the information needed to diagnose and solve the issue causing the
 +CephFS operations to hang.
  
+ .. _cephfs_dr_stuck_during_recovery:
+ 
  Stuck during recovery
  =====================
  
@@@ -83,10 -70,14 +85,15 @@@ replay. Examine the journal replay stat
      }
  
  Replay completes when the ``journal_read_pos`` reaches the
 -``journal_write_pos``. The write position will not change during replay. Track
 +``journal_write_pos``. The write position does not change during replay. Track
  the progression of the read position to compute the expected time to complete.
  
+ The MDS emits an `MDS_ESTIMATED_REPLAY_TIME` warning when the act of replaying
+ the journal takes more than 30 seconds. The warning message includes an
+ estimated time to the completion of journal replay::
+ 
+   mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s
 +.. _cephfs_troubleshooting_avoiding_recovery_roadblocks:
  
  Avoiding recovery roadblocks
  ----------------------------
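
The replay text in the diff above says to track the progression of the read position to compute the expected time to complete. A minimal sketch of that arithmetic follows, assuming replay advances at a roughly constant rate between two samples of ``journal_read_pos``. The function names and the sample positions are hypothetical and are not part of Ceph or of this change.

```python
def estimate_replay_seconds(read_then, read_now, write_pos, interval_s):
    """Estimate seconds until journal_read_pos reaches journal_write_pos.

    read_then / read_now: two samples of journal_read_pos (bytes),
    taken interval_s seconds apart. write_pos is journal_write_pos,
    which does not change during replay.
    Assumption: progress is linear between and after the samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rate = (read_now - read_then) / interval_s  # bytes replayed per second
    if rate <= 0:
        return None  # no forward progress observed; no estimate possible
    remaining_bytes = max(write_pos - read_now, 0)
    return remaining_bytes / rate


def remaining_from_progress(percent_complete, elapsed_s):
    """Linear extrapolation from a completion percentage and elapsed time."""
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be in (0, 100]")
    return elapsed_s * (100.0 - percent_complete) / percent_complete


if __name__ == "__main__":
    # Hypothetical samples: 500000 bytes replayed in 10 s, 2500000 bytes left.
    print(estimate_replay_seconds(1_000_000, 1_500_000, 4_000_000, 10))
    # Figures taken from the sample warning line in the diff.
    print(round(remaining_from_progress(50.0446, 582)))
```

With the figures from the sample warning (50.0446% complete after 582s), linear extrapolation gives about 581s, which matches the remaining time shown in that message. This only shows that the sample is consistent with a linear estimate; it does not establish how the MDS computes the value.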