From c39401b8de7205e2c580f2d0a04e2423ef73bb07 Mon Sep 17 00:00:00 2001
From: Venky Shankar
Date: Fri, 1 Aug 2025 12:32:55 +0530
Subject: [PATCH] doc/cephfs: add a note about estimated replay completion time

Signed-off-by: Venky Shankar
(cherry picked from commit d471748aa0ccd2041a8a4ac5af059009597b5b53)

doc/cephfs: update cephfs disaster recovery procedure

Fixes: http://tracker.ceph.com/issues/71629
Signed-off-by: Venky Shankar

fixup

Signed-off-by: Zac Dover
(cherry picked from commit bf46093470068d6ec4d168c1b24e886f4344b7fc)
---
 doc/cephfs/disaster-recovery-experts.rst | 221 +++++++++++++++++------
 doc/cephfs/troubleshooting.rst           |  18 ++
 2 files changed, 185 insertions(+), 54 deletions(-)

diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst
index 740bf7aabe7..b3312f958de 100644
--- a/doc/cephfs/disaster-recovery-experts.rst
+++ b/doc/cephfs/disaster-recovery-experts.rst
@@ -17,6 +17,12 @@ Advanced: Metadata repair tools
 If you do not have access to professional support for your cluster, consult
 the ceph-users mailing list or the #ceph IRC/Slack channel.
 
+.. note:: The Ceph file system must be offline before metadata repair tools
+   can be used on it. The tools will complain if they are invoked when the
+   file system is online. If any recovery step does not complete successfully,
+   DO NOT proceed to run any more recovery steps; instead, seek help from
+   experts via the mailing lists and the IRC and Slack channels.
+
 Journal export
 --------------
 
@@ -28,10 +34,13 @@ running the following command:
 
    cephfs-journal-tool journal export backup.bin
 
-If the journal is badly corrupted, this command might not work. If the journal
-is badly corrupted, make a RADOS-level copy
-(http://tracker.ceph.com/issues/9902).
+The backed-up journal will come in handy in case the recovery procedure does
+not go as expected. The MDS journal can be restored to its original state by
+importing from the backed-up journal:
+
+.. prompt:: bash #
+
+   cephfs-journal-tool journal import backup.bin
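+
+Note that, like the other ``cephfs-journal-tool`` operations, ``journal
+export`` and ``journal import`` act on the journal of a single MDS rank, which
+can be selected with the ``--rank`` option. A minimal sketch, assuming a file
+system named ``cephfs`` and rank ``0`` (both hypothetical; adjust them to
+match your cluster):
+
+.. prompt:: bash #
+
+   cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
+   cephfs-journal-tool --rank=cephfs:0 journal import backup.bin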
 
 Dentry recovery from journal
 ----------------------------
 
@@ -44,7 +53,8 @@ attempt to recover file metadata by running the following command:
 
    cephfs-journal-tool event recover_dentries summary
 
 By default, this command acts on MDS rank ``0``. Pass the option ``--rank=<fs_name>:<rank>``
-to the ``cephfs-journal-tool`` command to operate on other ranks.
+to the ``cephfs-journal-tool`` command to operate on other ranks, or pass
+``--rank=all`` to operate on all MDS ranks.
 
 This command writes all inodes and dentries recoverable from the journal into
 the backing store, but only if these inodes and dentries are higher-versioned
 
@@ -76,34 +86,90 @@ that an MDS cannot replay:
 
 Specify the filesystem and the MDS rank using the ``--rank`` option when the
 file system has or had multiple active MDS daemons.
 
-.. warning::
-
-   Resetting the journal *will* cause metadata to be lost unless you have
-   extracted it by other means such as ``recover_dentries``. Resetting the
-   journal is likely to leave orphaned objects in the data pool. Resetting
-   the journal may result in the re-allocation of already-written inodes,
-   which means that permissions rules could be violated.
+.. warning:: Resetting the journal *will* cause metadata to be lost unless the
+   journal data has been extracted by other means such as ``recover_dentries``.
+   Resetting the journal is likely to leave orphaned objects in the data pool,
+   and it could result in the re-allocation of already-written inodes, which
+   can lead to faulty behaviour of the file system (bugs, etc.).
 
 MDS table wipes
 ---------------
 
-After the journal has been reset, it may no longer be consistent with respect
-to the contents of the MDS tables (InoTable, SessionMap, SnapServer).
-
-Use the following command to reset the SessionMap (this will erase all
-sessions):
-
-.. prompt:: bash #
-
-   cephfs-table-tool all reset session
-
-This command acts on the tables of all MDS ranks that are ``in``. To operate
-only on a specified rank, replace ``all`` in the above command with an MDS
-rank.
-
-Of all tables, the session table is the table most likely to require a reset.
-If you know that you need also to reset the other tables, then replace
-``session`` with ``snap`` or ``inode``.
+Not every MDS table needs to be reset during the recovery procedure. A reset of
+an MDS table is required only if the corresponding RADOS object is missing
+(say, because some PGs were lost) or if the table is inconsistent with the
+metadata stored in the RADOS back-end. To check for missing table objects, run
+commands of the following forms (a concrete example with a hypothetical pool
+name is given at the end of this section):
+
+Session table:
+
+.. prompt:: bash #
+
+   rados -p <metadata_pool> stat mds0_sessionmap
+
+Inode table:
+
+.. prompt:: bash #
+
+   rados -p <metadata_pool> stat mds0_inotable
+
+Snap table:
+
+.. prompt:: bash #
+
+   rados -p <metadata_pool> stat mds_snaptable
+
+.. note:: The ``sessionmap`` and the ``inotable`` objects are per MDS rank (the
+   object names include the rank number: mds0_inotable, mds1_inotable, and so
+   on).
+
+Even if a table object exists, it can still be inconsistent with the metadata
+stored in the RADOS back-end. Such an inconsistency is hard to detect without
+bringing the file system online; if the metadata is in fact inconsistent or
+corrupted, the MDS marks itself as ``down:damaged``. To reset individual
+tables, run commands of the following forms:
+
+Session table:
+
+.. prompt:: bash #
+
+   cephfs-table-tool 0 reset session
+
+SnapServer:
+
+.. prompt:: bash #
+
+   cephfs-table-tool 0 reset snap
+
+InoTable:
+
+.. prompt:: bash #
+
+   cephfs-table-tool 0 reset inode
+
+The above commands act on the tables of a particular MDS rank. To operate on
+all MDS ranks that are in the ``in`` state, replace the MDS rank in the above
+commands with ``all``, as shown in the following commands:
+
+Session table:
+
+.. prompt:: bash #
+
+   cephfs-table-tool all reset session
+
+SnapServer:
+
+.. prompt:: bash #
+
+   cephfs-table-tool all reset snap
+
+InoTable:
+
+.. prompt:: bash #
+
+   cephfs-table-tool all reset inode
+
+.. note:: Remount or restart all CephFS clients after the session tables have
+   been reset.
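+
+As a concrete example of the object checks described at the start of this
+section, assume a metadata pool named ``cephfs_metadata`` and a single active
+MDS (rank ``0``); both are hypothetical and should be adjusted to match your
+cluster:
+
+.. prompt:: bash #
+
+   rados -p cephfs_metadata stat mds0_sessionmap
+   rados -p cephfs_metadata stat mds0_inotable
+   rados -p cephfs_metadata stat mds_snaptable
+
+A missing object is reported by ``rados stat`` with a "No such file or
+directory" error, which indicates that the corresponding table needs to be
+regenerated.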
 
 MDS map reset
 -------------
 
@@ -134,36 +200,45 @@ Recovery from missing metadata objects
 
 Depending on which objects are missing or corrupt, you may need to run
 additional commands to regenerate default versions of the objects.
 
-::
-
-    # Session table
-    cephfs-table-tool 0 reset session
-    # SnapServer
-    cephfs-table-tool 0 reset snap
-    # InoTable
-    cephfs-table-tool 0 reset inode
-    # Journal
-    cephfs-journal-tool --rank=<fs_name>:0 journal reset --yes-i-really-really-mean-it
-    # Root inodes ("/" and MDS directory)
-    cephfs-data-scan init
-
-Finally, you can regenerate metadata objects for missing files
-and directories based on the contents of a data pool. This is
-a three-phase process:
-
-#. Scanning *all* objects to calculate size and mtime metadata for inodes.
-#. Scanning the first object from every file to collect this metadata and
-   inject it into the metadata pool.
-#. Checking inode linkages and fixing found errors.
-
-::
-
-    cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
-    cephfs-data-scan scan_inodes [<data pool>]
-    cephfs-data-scan scan_links
-
-``scan_extents`` and ``scan_inodes`` commands may take a *very long* time if
-the data pool contains many files or very large files.
+If the root inode or the MDS directory (``~mdsdir``) is missing or corrupt, run
+the following command:
+
+Root inodes ("/" and MDS directory):
+
+.. prompt:: bash #
+
+   cephfs-data-scan init
+
+This is generally safe to run, because the command skips regenerating the root
+inode and the mdsdir inode if they already exist. But if these inodes are
+corrupt, then they must be regenerated. Corruption can be identified by trying
+to bring the file system back online, at which point the MDS transitions into
+``down:damaged`` with an appropriate log message (in the MDS log) pointing at
+potential issues when loading the root inode or the mdsdir inode. Another
+method is to use the ``ceph-dencoder`` tool to decode the inodes; this step is
+a bit more involved.
+
+Finally, you can regenerate metadata objects for missing files and directories
+based on the contents of a data pool. This is a three-phase process:
+
+#. Scan *all* objects to calculate size and mtime metadata for inodes:
+
+   .. prompt:: bash #
+
+      cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
+
+#. Scan the first object from every file to collect this metadata and
+   inject it into the metadata pool:
+
+   .. prompt:: bash #
+
+      cephfs-data-scan scan_inodes [<data pool>]
+
+#. Check inode linkages and fix any errors that are found:
+
+   .. prompt:: bash #
+
+      cephfs-data-scan scan_links
+
+The ``scan_extents`` and ``scan_inodes`` commands may take a *very long* time
+if the data pool contains many files or very large files.
 
 To accelerate the process of running ``scan_extents`` or ``scan_inodes``, run
 multiple instances of the tool:
 
@@ -194,15 +269,27 @@ The example below shows how to run four workers simultaneously:
 
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4
 
 It is **important** to ensure that all workers have completed the
-``scan_extents`` phase before any worker enters the ``scan_inodes phase``.
+``scan_extents`` phase before any worker enters the ``scan_inodes`` phase.
 
 After completing the metadata recovery process, you may want to run a cleanup
-operation to delete ancillary data generated during recovery. Use a command of the following form to run a cleanup operation:
+operation to delete ancillary data generated during recovery. Use a command of
+the following form to run a cleanup operation:
 
 .. prompt:: bash #
 
    cephfs-data-scan cleanup [<data pool>]
 
+The cleanup phase can be run with multiple instances to speed up execution::
+
+    # Worker 0
+    cephfs-data-scan cleanup --worker_n 0 --worker_m 4
+    # Worker 1
+    cephfs-data-scan cleanup --worker_n 1 --worker_m 4
+    # Worker 2
+    cephfs-data-scan cleanup --worker_n 2 --worker_m 4
+    # Worker 3
+    cephfs-data-scan cleanup --worker_n 3 --worker_m 4
+
 .. note::
 
    The data pool parameters for ``scan_extents``, ``scan_inodes`` and
@@ -212,14 +299,40 @@ operation to delete ancillary data generated during recovery. Use a command of t
    ``scan_inodes`` and ``cleanup`` commands require only that you specify the
    main data pool.
 
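+For example, assuming a main data pool named ``cephfs_data`` and an additional
+data pool named ``cephfs_data_ec`` (both names are hypothetical), the pools can
+be specified explicitly as follows:
+
+.. prompt:: bash #
+
+   cephfs-data-scan scan_extents cephfs_data cephfs_data_ec
+   cephfs-data-scan scan_inodes cephfs_data
+   cephfs-data-scan cleanup cephfs_data
+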
+Known Limitations And Pitfalls
+------------------------------
+
+The disaster recovery process can be time consuming and daunting. The recovery
+steps must be executed in the exact order detailed above and with the utmost
+care, so that a failure in any step is clearly understood. You must be able to
+decide with absolute certainty whether it is safe to proceed. If you are
+confident that you can meet these challenges, study this list of limitations
+and pitfalls before attempting disaster recovery:
+
+#. The data-scan commands provide no way of estimating the time to completion
+   of their operation. A feature that will provide such an estimate is under
+   development. See https://tracker.ceph.com/issues/63191 for details.
+#. It is important to perform a file system scrub after recovery, before CephFS
+   clients start using the file system again (a minimal sketch of such a scrub
+   is given after this list).
+#. In general, we do not recommend that you change any MDS-related settings
+   (for example, ``max_mds``) while the file system is broken.
+#. Disaster recovery is currently a manual process. There is a plan to automate
+   the recovery via the Disaster Recovery Super Tool. See
+   https://tracker.ceph.com/issues/71804 for details.
+#. A well-known trick (used by some community users) is to run parts of the
+   disaster recovery procedure (especially the ``recover_dentries`` step) when
+   the MDS is stuck in the ``up:replay`` state because of a long journal.
+   Before resorting to ``recover_dentries`` when the MDS is taking a long time
+   to replay the journal, consider trying to speed up journal replay by
+   following the procedure detailed in
+   :ref:`cephfs_dr_stuck_during_recovery`.
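+
+A minimal sketch of the post-recovery scrub mentioned in the list above,
+assuming a file system named ``cephfs`` (a hypothetical name; adjust it to
+match your cluster):
+
+.. prompt:: bash #
+
+   ceph tell mds.cephfs:0 scrub start / recursive,repair
+   ceph tell mds.cephfs:0 scrub status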
 
 Using an alternate metadata pool for recovery
 ---------------------------------------------
 
-.. warning::
-
-   This procedure has not been extensively tested. It should be undertaken only
-   with great care.
+.. warning:: This procedure has not been extensively tested. We recommend
+   recovering the file system using the recovery procedure detailed above
+   unless there is a good reason not to do so. This procedure should be
+   undertaken with great care.
 
 If an existing CephFS file system is damaged and inoperative, then it is
 possible to create a fresh metadata pool and to attempt the reconstruction the
diff --git a/doc/cephfs/troubleshooting.rst b/doc/cephfs/troubleshooting.rst
index a2698dc3c5c..164c14767df 100644
--- a/doc/cephfs/troubleshooting.rst
+++ b/doc/cephfs/troubleshooting.rst
@@ -33,6 +33,8 @@ If high logging levels have been set on the MDS, ``dump.txt`` can be expected
 to hold the information needed to diagnose and solve the issue causing the
 CephFS operations to hang.
 
+.. _cephfs_dr_stuck_during_recovery:
+
 Stuck during recovery
 =====================
 
@@ -86,6 +88,22 @@ Replay completes when the ``journal_read_pos`` reaches the
 ``journal_write_pos``. The write position does not change during replay. Track
 the progression of the read position to compute the expected time to complete.
 
+.. In Tentacle and later releases, the following text appears. It should not be
+   backported to any branch prior to Tentacle, because the
+   ``MDS_ESTIMATED_REPLAY_TIME`` warning was never a backport candidate for the
+   Squid and the Reef release branches. See
+   https://github.com/ceph/ceph/pull/65058#discussion_r2281247895. Here is the
+   text that should not be backported:
+
+   BEGIN QUOTED TEXT
+
+   The MDS emits an ``MDS_ESTIMATED_REPLAY_TIME`` warning when the act of
+   replaying the journal takes more than 30 seconds. The warning message
+   includes an estimated time to the completion of journal replay::
+
+     mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s
+
+   END QUOTED TEXT
+
 Avoiding recovery roadblocks
 ----------------------------
-- 
2.39.5