cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json
-Analyzing Core Dumps
+Capturing Core Dumps
---------------------
-When a Ceph daemon crashes, cephadm supports analyzing core dumps. To enable core dumps, run
+A Ceph cluster that uses cephadm can be configured to capture core dumps.
+Initial capture and processing of the coredump is performed by
+`systemd-coredump <https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.
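+
+On hosts where systemd-coredump is in use, it is registered as the kernel's
+core handler. As an optional check (not required for the steps below), the
+``kernel.core_pattern`` sysctl can be inspected:
+
+.. prompt:: bash #
+
+ # a pipe to systemd-coredump indicates that it will capture core dumps
+ sysctl kernel.core_pattern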
+
+
+To enable coredump handling, run:
.. prompt:: bash #
ulimit -c unlimited
-Core dumps will now be written to ``/var/lib/systemd/coredump``.
+Core dumps will be written to ``/var/lib/systemd/coredump``.
+This setting will persist only until the system is rebooted.
.. note::
they will be written to ``/var/lib/systemd/coredump`` on
the container host.
-Now, wait for the crash to happen again. To simulate the crash of a daemon, run e.g. ``killall -3 ceph-mon``.
+Now, wait for the crash to happen again. To simulate the crash of a daemon,
+run, for example, ``killall -3 ceph-mon``.
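+
+After the daemon has crashed, the capture can be confirmed from the container
+host using ``coredumpctl``, which is part of systemd-coredump. For example:
+
+.. prompt:: bash #
+
+ # list recorded core dumps for the crashed daemon
+ coredumpctl list ceph-mon
+ # show details (including the path of the stored core file) for the most recent dump
+ coredumpctl info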
+
+
+Running the Debugger with cephadm
+----------------------------------
+
+Running a single debugging session
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A debugging session can be initiated using the ``cephadm shell`` command. From
+within the shell container, we need to install the debugger and debuginfo
+packages. To debug a core file captured by systemd, run the following:
+
+.. prompt:: bash #
+
+ # start the shell session
+ cephadm shell --mount /var/lib/systemd/coredump
+ # within the shell:
+ dnf install ceph-debuginfo gdb zstd
+ unzstd /mnt/coredump/core.ceph-*.zst
+ # point gdb at the decompressed core file (its name is the .zst name without the suffix)
+ gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
+
+You can then run debugger commands at gdb's prompt.
+
+.. code-block:: none
+
+ (gdb) bt
+ #0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
+ #1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
+ #2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
+ #3 0x0000563085ca3d7e in main ()
+
+
+Running repeated debugging sessions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using ``cephadm shell``, as in the example above, any changes made to the
+container spawned by the shell command are ephemeral. Once the shell session
+exits, all of the files that were downloaded and installed are no longer
+available. One can simply re-run the same commands every time ``cephadm shell``
+is invoked, but to save time and resources one can instead create a new
+container image and use it for repeated debugging sessions.
+
+In the following example we create a simple Containerfile for constructing the
+container image. The commands below use podman, but they should work correctly
+if ``podman`` is replaced with ``docker``.
+
+.. prompt:: bash
+
+ cat >Containerfile <<EOF
+ ARG BASE_IMG=quay.io/ceph/ceph:v18
+ FROM \${BASE_IMG}
+ # install ceph debuginfo packages, gdb and other potentially useful packages
+ RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
+ EOF
+ podman build -t ceph:debugging -f Containerfile .
+ # pass --build-arg=BASE_IMG=<your image> to customize the base image
+
+The result should be a new local image named ``ceph:debugging``. This image can
+be used on the same machine that built it. Later, the image could be pushed to
+a container repository, or saved and copied to a node running other Ceph
+containers. Please consult the documentation for ``podman`` or ``docker`` for
+more details on the general container workflow.
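+
+As a sketch of one such workflow, the image can be saved to a tar archive,
+copied to the other node, and loaded there (the hostname ``ceph1`` and the use
+of ``scp`` are illustrative assumptions):
+
+.. prompt:: bash #
+
+ # save the local debugging image to a tar archive
+ podman save -o ceph-debugging.tar ceph:debugging
+ # copy the archive to the other node (example hostname)
+ scp ceph-debugging.tar root@ceph1:/tmp/
+ # on the other node, load the image into the local image store
+ podman load -i /tmp/ceph-debugging.tar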
+
+Once the image has been built, it can be used to initiate repeated debugging
+sessions without having to re-install the debug tools and debuginfo packages.
+To debug a core file using this image, in the same way as previously described,
+run:
+
+.. prompt:: bash #
+
+ cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump
+
+
+Debugging live processes
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The gdb debugger can attach to running processes in order to debug them. For a
+containerized process, this can be accomplished by running the debug image in
+the same PID namespace as the process to be debugged.
+
+This requires running a container command with some custom arguments. We can
+generate a script that can be used to debug a process in a running container.
+
+.. prompt:: bash #
+
+ cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh
+
+This creates a script containing the container command cephadm would use to
+create a shell. Now, modify the script by removing the ``--init`` argument and
+replacing it with the argument that joins the PID namespace of a running
+container. For example, let's assume we want to debug the MGR and have
+determined that the MGR is running in a container named
+``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In that case,
+the argument
+``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
+should be used.
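+
+The name of the daemon's container can be found by listing the running
+containers with ``podman ps`` (or ``docker ps``). As a sketch, the edit to the
+script can also be made non-interactively; the ``sed`` invocation below assumes
+the example container name used above:
+
+.. prompt:: bash #
+
+ # list running containers to find the name of the daemon's container
+ podman ps --format '{{.Names}}'
+ # swap the --init argument for the --pid=container:<name> argument
+ sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh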
+
+Now, we can run our debugging container with ``sh /tmp/debug.sh``. Within the
+shell we can run commands such as ``ps`` to get the PID of the MGR process. In
+the following example this will be ``2``. Running gdb, we can now attach to the
+running process:
-Install debug packages including ``ceph-debuginfo`` by entering the cephadm shelll::
+.. prompt:: bash (gdb)
- # cephadm shell --mount /var/lib/systemd/coredump
- [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
- [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
- [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
- (gdb) bt
- #0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
- #1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
- #2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
- #3 0x0000563085ca3d7e in main ()
+ attach 2
+ info threads
+ bt