From a9d6b82c04566aa1e6507e0c1fb9fbda3f8ccef1 Mon Sep 17 00:00:00 2001
From: Zac Dover
Date: Tue, 2 Sep 2025 10:31:41 +1000
Subject: [PATCH] doc/cephfs: edit troubleshooting.rst

Update the "Disconnected+Remounted FS" section in
doc/cephfs/troubleshooting.rst, as suggested by Venky Shankar in
https://github.com/ceph/ceph/pull/65129/files#r2312903062

Signed-off-by: Zac Dover
(cherry picked from commit f4b40422fefaa993441396a5c31fbfd3d8714595)
---
 doc/cephfs/troubleshooting.rst | 53 +++++++++++++++++++++++++++--------------------------
 1 file changed, 27 insertions(+), 26 deletions(-)

diff --git a/doc/cephfs/troubleshooting.rst b/doc/cephfs/troubleshooting.rst
index 9368bd152cb..d5542a4c340 100644
--- a/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@ -378,32 +378,33 @@ will switch to doing writes synchronously. Synchronous writes are quite slow.
 
 Disconnected+Remounted FS
 =========================
-Because CephFS has a "consistent cache", your client is forcibly disconnected
-from the cluster when the network connection has been disrupted for a long
-time. When this happens, the kernel client cannot safely write back dirty data
-and many applications will not handle IO errors correctly on ``close()``.
-Currently, the kernel client will remount the file system, but any outstanding
-file-system IO may not be properly handled. If this is the case, reboot the
-client system.
-
-You are in this situation if the output of ``dmesg/kern.log`` contains
-something like the following::
-
-  Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
-  Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
-  Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
-  Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
-  Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
-  Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
-  Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
-  Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0
-
-This is an area of ongoing work to improve the behavior. Kernels will soon be
-reliably issuing error codes to in-progress IO, although your application(s)
-may not deal with them well. In the longer term, we hope to allow reconnection
-and reclamation of data in cases where doing so does not violate POSIX
-semantics (generally, data which hasn't been accessed or modified by other
-clients).
+Because CephFS has a "consistent cache", the MDS will forcibly evict (and
+blocklist) clients from the cluster when the network connection has been
+disrupted for a long time. When this happens, the kernel client cannot safely
+write back dirty (buffered) data, which results in data loss. Note, however,
+that this behavior is appropriate and follows POSIX semantics. The client must
+be remounted before it can access the file system again. This is the default
+behavior, but it can be overridden with the ``recover_session`` mount option.
+See `the "options" section of the "mount.ceph" man page
+<https://docs.ceph.com/en/latest/man/8/mount.ceph/>`_.
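+
+A client that is mounted with the ``recover_session=clean`` option instead
+reconnects automatically after it has been blocklisted, discarding any dirty
+data in its cache. For example, assuming a file system named ``cephfs`` that
+is mounted at ``/mnt/cephfs`` by the ``admin`` client, with the cluster FSID
+taken from the local ``ceph.conf`` (adjust these names for your deployment)::
+
+   mount -t ceph admin@.cephfs=/ /mnt/cephfs -o recover_session=clean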
See `the "options" section of the "mount.ceph" man page +`_ + +You are in this situation if the output of ``dmesg`` contains something like +the following:: + +[Fri Aug 15 02:38:10 2025] ceph: mds0 caps stale +[Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX :6800 socket closed (con state OPEN) +[Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX:6800 session reset +[Fri Aug 15 02:38:28 2025] ceph: mds0 closed our session +[Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect start +[Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect denied Mounting ======== -- 2.39.5