Merge pull request #64787 from vshankar/wip-improve-cephfs-dr-doc

author Zac Dover <zac.dover@proton.me>

Fri, 15 Aug 2025 02:12:25 +0000 (12:12 +1000)

committer GitHub <noreply@github.com>

Fri, 15 Aug 2025 02:12:25 +0000 (12:12 +1000)
diff --cc doc/cephfs/troubleshooting.rst

index 305737c6f2fbdb9b58b42aab453346931ec121fb,ed142492294deaa92b5ca128d4554454b96247bd..79e6682d1abed232b1169f21b7cace71bc9e910d
--- 1/doc/cephfs/troubleshooting.rst
--- 2/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@@ -5,34 -5,24 +5,36 @@@
   Slow/stuck operations
   =====================
   
- -If you are experiencing apparent hung operations, the first task is to identify
- -where the problem is occurring: in the client, the MDS, or the network connecting
- -them. Start by looking to see if either side has stuck operations
- -(:ref:`slow_requests`, below), and narrow it down from there.
+ +Sometimes CephFS operations hang. The first step in troubleshooting them is to
+ +locate the problem causing the operations to hang. Problems present in three
+ +places:
   
- -We can get hints about what's going on by dumping the MDS cache ::
+ +#. in the client
+ +#. in the MDS
+ +#. in the network that connects the client to the MDS
   
- -  ceph daemon mds.<name> dump cache /tmp/dump.txt
+ +First, use the procedure in :ref:`slow_requests` to determine if the client has
+ +stuck operations or the MDS has stuck operations.
   
- -.. note:: The file `dump.txt` is on the machine executing the MDS and for systemd
- -        controlled MDS services, this is in a tmpfs in the MDS container.
- -        Use `nsenter(1)` to locate `dump.txt` or specify another system-wide path.
+ +Dump the MDS cache. The contents of the MDS cache will be used to diagnose the
+ +nature of the problem. Run the following command to dump the MDS cache:
   
- -If high logging levels are set on the MDS, that will almost certainly hold the
- -information we need to diagnose and solve the issue.
+ +.. prompt:: bash #
+ +
+ +   ceph daemon mds.<name> dump cache /tmp/dump.txt
+ +
+ +.. note:: MDS services that are not controlled by systemd dump the file
+ +   ``dump.txt`` to the machine that runs the MDS. MDS services that are
+ +   controlled by systemd dump the file ``dump.txt`` to a tmpfs in the MDS
+ +   container. Use ``nsenter(1)`` to locate ``dump.txt`` or specify another
+ +   system-wide path.
+ +
+ +If high logging levels have been set on the MDS, ``dump.txt`` can be expected
+ +to hold the information needed to diagnose and solve the issue causing the
+ +CephFS operations to hang.
   
+ .. _cephfs_dr_stuck_during_recovery:
+ 
   Stuck during recovery
   =====================
   
@@@ -83,10 -70,14 +85,15 @@@ replay. Examine the journal replay stat
      }
   
   Replay completes when the ``journal_read_pos`` reaches the
- -``journal_write_pos``. The write position will not change during replay. Track
+ +``journal_write_pos``. The write position does not change during replay. Track
   the progression of the read position to compute the expected time to complete.
+ The MDS emits an ``MDS_ESTIMATED_REPLAY_TIME`` warning when the act of replaying
+ the journal takes more than 30 seconds. The warning message includes an
+ estimated time to the completion of journal replay::
+ 
+   mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s
   
+ +.. _cephfs_troubleshooting_avoiding_recovery_roadblocks:
   
   Avoiding recovery roadblocks
   ----------------------------
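
Note on the replay hunk above: the text advises tracking the progression of ``journal_read_pos`` toward ``journal_write_pos`` to compute the expected time to complete. A minimal sketch of that arithmetic follows. It assumes a constant replay rate (plain linear extrapolation), and the function name and parameters are hypothetical, chosen only for illustration. Whether the MDS computes its own ``MDS_ESTIMATED_REPLAY_TIME`` estimate this way is not confirmed by the diff.

```python
def estimate_replay_remaining(start_read_pos, read_pos, write_pos, elapsed_s):
    """Estimate seconds left in journal replay by linear extrapolation.

    Hypothetical helper, not part of Ceph. Assumes the read position
    advances at a constant rate and the write position stays fixed,
    as the documentation says it does during replay.
    """
    if read_pos >= write_pos:
        return 0.0  # replay has already caught up with the write position
    done = read_pos - start_read_pos
    if done <= 0 or elapsed_s <= 0:
        return None  # no progress observed yet, so no rate can be computed
    rate = done / elapsed_s  # journal bytes replayed per second
    return (write_pos - read_pos) / rate


# Positions chosen to mirror the sample warning in the diff:
# 50.0446% complete after 582 seconds.
remaining = estimate_replay_remaining(0, 500446, 1000000, 582)
print(f"estimated time remaining: {remaining:.0f}s")
```

With these illustrative positions the sketch lands on roughly the same remaining time as the sample warning line, which suggests a linear estimate is a reasonable mental model for reading that message.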