From: Zac Dover <zac.dover@proton.me>
Date: Fri, 15 Aug 2025 02:12:25 +0000 (+1000)
Subject: Merge pull request #64787 from vshankar/wip-improve-cephfs-dr-doc
X-Git-Tag: v21.0.0~50^2~220
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=52c80fb413a4801f40741406bb6e8594369654c5;p=ceph.git

Merge pull request #64787 from vshankar/wip-improve-cephfs-dr-doc

doc/cephfs: update cephfs disaster recovery documentation

Reviewed-by: Zac Dover <zac.dover@proton.me>
---

52c80fb413a4801f40741406bb6e8594369654c5
diff --cc doc/cephfs/troubleshooting.rst
index 305737c6f2f,ed142492294..79e6682d1ab
--- a/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@@ -5,34 -5,24 +5,36 @@@
  Slow/stuck operations
  =====================
  
 -If you are experiencing apparent hung operations, the first task is to identify
 -where the problem is occurring: in the client, the MDS, or the network connecting
 -them. Start by looking to see if either side has stuck operations
 -(:ref:`slow_requests`, below), and narrow it down from there.
 +Sometimes CephFS operations hang. The first step in troubleshooting a hang is
 +to locate the problem that causes it. The problem can be in one of three
 +places:
  
 -We can get hints about what's going on by dumping the MDS cache ::
 +#. in the client
 +#. in the MDS
 +#. in the network that connects the client to the MDS
  
 -  ceph daemon mds.<name> dump cache /tmp/dump.txt
 +First, use the procedure in :ref:`slow_requests` to determine whether the
 +client or the MDS has stuck operations.
  
 -.. note:: The file `dump.txt` is on the machine executing the MDS and for systemd
 -	  controlled MDS services, this is in a tmpfs in the MDS container.
 -	  Use `nsenter(1)` to locate `dump.txt` or specify another system-wide path.
 +Next, dump the MDS cache. The contents of the MDS cache can help diagnose the
 +nature of the problem. Run the following command to dump the MDS cache:
  
 -If high logging levels are set on the MDS, that will almost certainly hold the
 -information we need to diagnose and solve the issue.
 +.. prompt:: bash #
 +
 +   ceph daemon mds.<name> dump cache /tmp/dump.txt
 +
 +.. note:: MDS services that are not controlled by systemd dump the file
 +   ``dump.txt`` to the machine that runs the MDS. MDS services that are
 +   controlled by systemd dump the file ``dump.txt`` to a tmpfs in the MDS
 +   container. Use ``nsenter(1)`` to locate ``dump.txt`` or specify another
 +   system-wide path.
 +
 +If high logging levels have been set on the MDS, the MDS log can be expected
 +to hold the information needed to diagnose and solve the issue causing the
 +CephFS operations to hang.
  
+ .. _cephfs_dr_stuck_during_recovery:
+ 
  Stuck during recovery
  =====================
  
@@@ -83,10 -70,14 +85,15 @@@ replay. Examine the journal replay stat
     }
  
  Replay completes when the ``journal_read_pos`` reaches the
 -``journal_write_pos``. The write position will not change during replay. Track
 +``journal_write_pos``. The write position does not change during replay. Track
  the progression of the read position to compute the expected time to complete.
+ The MDS emits an ``MDS_ESTIMATED_REPLAY_TIME`` warning when journal replay
+ takes more than 30 seconds. The warning message includes an estimated time
+ to the completion of journal replay::
+ 
+   mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s
  
 +.. _cephfs_troubleshooting_avoiding_recovery_roadblocks:
  
  Avoiding recovery roadblocks
  ----------------------------