From 4d5d6272b3d3043c3f0b4b7863c89230d8c49c08 Mon Sep 17 00:00:00 2001
From: Jos Collin
Date: Wed, 10 Jul 2019 18:47:06 +0530
Subject: [PATCH] doc: Update troubleshooting.rst

* Updated troubleshooting.rst as kernel-client-troubleshooting.rst is dropped.
* Added troubleshooting references in fuse.rst and kernel.rst

Signed-off-by: Jos Collin
---
 doc/cephfs/fuse.rst            |  2 ++
 doc/cephfs/kernel.rst          |  2 +-
 doc/cephfs/troubleshooting.rst | 41 +++++++++++++++++++++++++++++-----
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/doc/cephfs/fuse.rst b/doc/cephfs/fuse.rst
index 251253703a0..3a6c70bc9d7 100644
--- a/doc/cephfs/fuse.rst
+++ b/doc/cephfs/fuse.rst
@@ -47,6 +47,8 @@ A persistent mount point can be setup via::
 
     sudo systemctl enable ceph-fuse@/mnt.service
 
+For troubleshooting, see :ref:`ceph_fuse_debugging`.
+
 .. _ceph-fuse: ../../man/8/ceph-fuse/
 .. _fstab: ../fstab/#fuse
 .. _CEPHX Config Reference: ../../rados/configuration/auth-config-ref
diff --git a/doc/cephfs/kernel.rst b/doc/cephfs/kernel.rst
index 3b5a75086a5..0b3f3128c2e 100644
--- a/doc/cephfs/kernel.rst
+++ b/doc/cephfs/kernel.rst
@@ -31,7 +31,7 @@ To unmount the Ceph file system, you may use the ``umount`` command. For example
 .. tip:: Ensure that you are not within the file system directories before
    executing this command.
 
-See `mount.ceph`_ for details.
+See `mount.ceph`_ for details. For troubleshooting, see :ref:`kernel_mount_debugging`.
 
 .. _mount.ceph: ../../man/8/mount.ceph/
 .. _User Management: ../../rados/operations/user-management/
diff --git a/doc/cephfs/troubleshooting.rst b/doc/cephfs/troubleshooting.rst
index d13914a1dfd..befd6c21d17 100644
--- a/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@ -10,6 +10,17 @@ where the problem is occurring: in the client, the MDS, or the network connectin
 them. Start by looking to see if either side has stuck operations
 (:ref:`slow_requests`, below), and narrow it down from there.
 
+We can get hints about what's going on by dumping the MDS cache ::
+
+  ceph daemon mds. dump cache /tmp/dump.txt
+
+.. note:: The file `dump.txt` is on the machine executing the MDS and for systemd
+          controlled MDS services, this is in a tmpfs in the MDS container.
+          Use `nsenter(1)` to locate `dump.txt` or specify another system-wide path.
+
+If high logging levels are set on the MDS, that will almost certainly hold the
+information we need to diagnose and solve the issue.
+
 RADOS Health
 ============
 
@@ -24,12 +35,16 @@ If an operation is hung inside the MDS, it will eventually show up in ``ceph hea
 identifying "slow requests are blocked". It may also identify clients as "failing
 to respond" or misbehaving in other ways. If the MDS identifies specific clients
 as misbehaving, you should investigate why they are doing so.
+
 Generally it will be the result of
-1) overloading the system (if you have extra RAM, increase the
-"mds cache size" config from its default 100000; having a larger active file set
-than your MDS cache is the #1 cause of this!)
-2) running an older (misbehaving) client, or
-3) underlying RADOS issues.
+
+#. Overloading the system (if you have extra RAM, increase the "mds cache size"
+   config from its default 100000; having a larger active file set than your MDS
+   cache is the #1 cause of this!).
+
+#. Running an older (misbehaving) client.
+
+#. Underlying RADOS issues.
 
 Otherwise, you have probably discovered a new bug and should report it to
 the developers!
@@ -57,10 +72,12 @@ If there are no slow requests reported on the MDS, and it is not reporting that
 clients are misbehaving, either the client has a problem or its requests are not
 reaching the MDS.
 
+.. _ceph_fuse_debugging:
+
 ceph-fuse debugging
 ===================
 
-ceph-fuse also supports dump_ops_in_flight. See if it has any and where they are
+ceph-fuse also supports ``dump_ops_in_flight``. See if it has any and where they are
 stuck.
 
 Debug output
@@ -74,10 +91,16 @@ with logging to the console (``-d``) and enabling client debug
 If you suspect a potential monitor issue, enable monitor debugging as well
 (``--debug-monc=20``).
 
+.. _kernel_mount_debugging:
 
 Kernel mount debugging
 ======================
 
+If there is an issue with the kernel client, the most important thing is
+figuring out whether the problem is with the kernel client or the MDS. Generally,
+this is easy to work out. If the kernel client broke directly, there will be
+output in ``dmesg``. Collect it and any inappropriate kernel state.
+
 Slow requests
 -------------
 
@@ -158,3 +181,9 @@ If the Ceph Client is behind the Ceph cluster, try to upgrade it::
 You may need to uninstall, autoclean and autoremove ``ceph-common``
 and then reinstall it so that you have the latest version.
 
+Dynamic Debugging
+=================
+
+You can enable dynamic debug against the CephFS module.
+
+Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh
-- 
2.39.5