From 82f99601629fee5524c03d1ae8cda5d375b15977 Mon Sep 17 00:00:00 2001
From: John Spray
Date: Tue, 10 Jul 2018 10:41:52 +0100
Subject: [PATCH] doc/cephfs: make scary DR bits less prominent

I'm sure people will still find them, but let's at least force people
to click through one more time to get to the commands that can damage
your cluster.

Also, the ".. danger" directive at the top of the page wasn't actually
getting special formatting, so I changed it to a ".. warning", which is
red.

Signed-off-by: John Spray
---
 doc/cephfs/disaster-recovery-experts.rst | 254 +++++++++++++++++++++++
 doc/cephfs/disaster-recovery.rst         | 254 ++---------------------
 doc/cephfs/index.rst                     |   5 +
 3 files changed, 276 insertions(+), 237 deletions(-)
 create mode 100644 doc/cephfs/disaster-recovery-experts.rst

diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst
new file mode 100644
index 0000000000000..23dd6138c0b04
--- /dev/null
+++ b/doc/cephfs/disaster-recovery-experts.rst
@@ -0,0 +1,254 @@
+
+.. _disaster-recovery-experts:
+
+Advanced: Metadata repair tools
+===============================
+
+.. warning::
+
+    If you do not have expert knowledge of CephFS internals, you will
+    need to seek assistance before using any of these tools.
+
+    The tools mentioned here can easily cause damage as well as fix it.
+
+    It is essential to understand exactly what has gone wrong with your
+    filesystem before attempting to repair it.
+
+    If you do not have access to professional support for your cluster,
+    consult the ceph-users mailing list or the #ceph IRC channel.
+
+
+Journal export
+--------------
+
+Before attempting dangerous operations, make a copy of the journal like so:
+
+::
+
+    cephfs-journal-tool journal export backup.bin
+
+Note that this command may not always work if the journal is badly corrupted,
+in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
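+
+If the export fails because the journal objects themselves are too badly
+damaged, a RADOS-level copy of the journal objects can be made instead.
+The commands below are only a sketch: they assume rank 0, whose journal
+objects have names starting with ``200.``, and a metadata pool named
+``cephfs_metadata``, which you should replace with your own pool name.
+
+::
+
+    # Copy each rank-0 journal object out of the (assumed) metadata pool
+    mkdir -p journal-backup
+    for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do
+        rados -p cephfs_metadata get "$obj" "journal-backup/$obj"
+    done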
+
+
+Dentry recovery from journal
+----------------------------
+
+If a journal is damaged or for any reason an MDS is incapable of replaying it,
+attempt to recover what file metadata you can like so:
+
+::
+
+    cephfs-journal-tool event recover_dentries summary
+
+By default this command acts on MDS rank 0; pass --rank=<n> to operate on other ranks.
+
+This command will write any inodes/dentries recoverable from the journal
+into the backing store, if these inodes/dentries are higher-versioned
+than the previous contents of the backing store. If any regions of the journal
+are missing/damaged, they will be skipped.
+
+Note that in addition to writing out dentries and inodes, this command will update
+the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
+are now in use. In simple cases, this will result in an entirely valid backing
+store state.
+
+.. warning::
+
+    The resulting state of the backing store is not guaranteed to be self-consistent,
+    and an online MDS scrub will be required afterwards. The journal contents
+    will not be modified by this command; you should truncate the journal
+    separately after recovering what you can.
+
+Journal truncation
+------------------
+
+If the journal is corrupt or MDSs cannot replay it for any reason, you can
+truncate it like so:
+
+::
+
+    cephfs-journal-tool journal reset
+
+.. warning::
+
+    Resetting the journal *will* lose metadata unless you have extracted
+    it by other means such as ``recover_dentries``. It is likely to leave
+    some orphaned objects in the data pool. It may result in re-allocation
+    of already-written inodes, such that permissions rules could be violated.
+
+MDS table wipes
+---------------
+
+After the journal has been reset, it may no longer be consistent with respect
+to the contents of the MDS tables (InoTable, SessionMap, SnapServer).
+
+To reset the SessionMap (erase all sessions), use:
+
+::
+
+    cephfs-table-tool all reset session
+
+This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
+rank to operate on that rank only.
+
+The session table is the table most likely to need resetting, but if you know you
+also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
+
+MDS map reset
+-------------
+
+Once the in-RADOS state of the filesystem (i.e. the contents of the metadata pool)
+is somewhat recovered, it may be necessary to update the MDS map to reflect
+the contents of the metadata pool. Use the following command to reset the MDS
+map to a single MDS:
+
+::
+
+    ceph fs reset <fs name> --yes-i-really-mean-it
+
+Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
+as a result, this can cause data loss.
+
+One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
+key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
+that it would overwrite any existing root inode on disk and orphan any existing files. In
+contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
+daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
+
+Recovery from missing metadata objects
+--------------------------------------
+
+Depending on what objects are missing or corrupt, you may need to
+run various commands to regenerate default versions of the
+objects.
+
+::
+
+    # Session table
+    cephfs-table-tool 0 reset session
+    # SnapServer
+    cephfs-table-tool 0 reset snap
+    # InoTable
+    cephfs-table-tool 0 reset inode
+    # Journal
+    cephfs-journal-tool --rank=0 journal reset
+    # Root inodes ("/" and MDS directory)
+    cephfs-data-scan init
+
+Finally, you can regenerate metadata objects for missing files
+and directories based on the contents of a data pool. This is
+a three-phase process: first, scanning *all* objects to calculate
+size and mtime metadata for inodes; second, scanning the first
+object from every file to collect this metadata and inject it into
+the metadata pool; third, checking inode linkages and fixing any
+errors found.
+
+::
+
+    cephfs-data-scan scan_extents
+    cephfs-data-scan scan_inodes
+    cephfs-data-scan scan_links
+
+The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
+if there are many files or very large files in the data pool.
+
+To accelerate the process, run multiple instances of the tool.
+
+Decide on a number of workers, and pass each worker a number within
+the range 0-(worker_m - 1).
+
+The example below shows how to run 4 workers simultaneously:
+
+::
+
+    # Worker 0
+    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4
+    # Worker 1
+    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4
+    # Worker 2
+    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4
+    # Worker 3
+    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4
+
+    # Worker 0
+    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4
+    # Worker 1
+    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4
+    # Worker 2
+    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4
+    # Worker 3
+    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4
+
+It is **important** to ensure that all workers have completed the
+scan_extents phase before any workers enter the scan_inodes phase.
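+
+As a convenience (a sketch only, not a feature of the tool itself), the four
+``scan_extents`` workers above can be launched from a single shell and waited
+on together; the same pattern applies to the ``scan_inodes`` workers:
+
+::
+
+    # Launch workers 0-3 in the background, then wait for all of them
+    for n in 0 1 2 3; do
+        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 &
+    done
+    wait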
+
+After completing the metadata recovery, you may want to run a cleanup
+operation to delete ancillary data generated during recovery.
+
+::
+
+    cephfs-data-scan cleanup
+
+
+
+Using an alternate metadata pool for recovery
+---------------------------------------------
+
+.. warning::
+
+    There has not been extensive testing of this procedure. It should be
+    undertaken with great care.
+
+If an existing filesystem is damaged and inoperative, it is possible to create
+a fresh metadata pool and attempt to reconstruct the filesystem metadata
+into this new pool, leaving the old metadata in place. This could be used to
+make a safer attempt at recovery since the existing metadata pool would not be
+overwritten.
+
+.. caution::
+
+    During this process, multiple metadata pools will contain data referring to
+    the same data pool. Extreme caution must be exercised to avoid changing the
+    data pool contents while this is the case. Once recovery is complete, the
+    damaged metadata pool should be deleted.
+
+To begin this process, first create the fresh metadata pool and initialize
+it with empty file system data structures:
+
+::
+
+    ceph fs flag set enable_multiple true --yes-i-really-mean-it
+    ceph osd pool create recovery replicated
+    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
+    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
+    ceph fs reset recovery-fs --yes-i-really-mean-it
+    cephfs-table-tool recovery-fs:all reset session
+    cephfs-table-tool recovery-fs:all reset snap
+    cephfs-table-tool recovery-fs:all reset inode
+
+Next, run the recovery toolset using the --alternate-pool argument to output
+results to the alternate pool:
+
+::
+
+    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name>
+    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init
+    cephfs-data-scan scan_links --filesystem recovery-fs
+
+If the damaged filesystem contains dirty journal data, it may be recovered next
+with:
+
+::
+
+    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
+    cephfs-journal-tool --rank recovery-fs:0 journal reset --force
+
+After recovery, some recovered directories will have incorrect statistics.
+Ensure the parameters mds_verify_scatter and mds_debug_scatterstat are set
+to false (the default) to prevent the MDS from checking the statistics, then
+run a forward scrub to repair them.
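+
+One way to confirm that both parameters are at their default (false) values
+beforehand is to query the running daemon's admin socket; this is only a
+sketch, reusing the ``mds.a`` daemon name from the scrub command below:
+
+::
+
+    # Each of these should report the value false
+    ceph daemon mds.a config get mds_verify_scatter
+    ceph daemon mds.a config get mds_debug_scatterstat
+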
+Ensure you have an MDS running and issue:
+
+::
+
+    ceph daemon mds.a scrub_path / recursive repair
diff --git a/doc/cephfs/disaster-recovery.rst b/doc/cephfs/disaster-recovery.rst
index 9af4884b67f55..dd91f5d9c1ab3 100644
--- a/doc/cephfs/disaster-recovery.rst
+++ b/doc/cephfs/disaster-recovery.rst
@@ -2,188 +2,28 @@ Disaster recovery
 =================
 
-.. danger::
-
-    The notes in this section are aimed at experts, making a best effort
-    to recovery what they can from damaged filesystems. These steps
-    have the potential to make things worse as well as better. If you
-    are unsure, do not proceed.
-
+Metadata damage and repair
+--------------------------
 
-Journal export
---------------
-
-Before attempting dangerous operations, make a copy of the journal like so:
-
-::
+If a filesystem has inconsistent or missing metadata, it is considered
+*damaged*. You may find out about damage from a health message, or in some
+unfortunate cases from an assertion in a running MDS daemon.
 
-    cephfs-journal-tool journal export backup.bin
+Metadata damage can result either from data loss in the underlying RADOS
+layer (e.g. multiple disk failures that lose all copies of a PG), or from
+software bugs.
 
-Note that this command may not always work if the journal is badly corrupted,
-in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
-
-
-Dentry recovery from journal
-----------------------------
-
-If a journal is damaged or for any reason an MDS is incapable of replaying it,
-attempt to recover what file metadata we can like so:
-
-::
-
-    cephfs-journal-tool event recover_dentries summary
-
-This command by default acts on MDS rank 0, pass --rank=<n> to operate on other ranks.
-
-This command will write any inodes/dentries recoverable from the journal
-into the backing store, if these inodes/dentries are higher-versioned
-than the previous contents of the backing store. If any regions of the journal
-are missing/damaged, they will be skipped.
-
-Note that in addition to writing out dentries and inodes, this command will update
-the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
-are now in use. In simple cases, this will result in an entirely valid backing
-store state.
-
-.. warning::
-
-    The resulting state of the backing store is not guaranteed to be self-consistent,
-    and an online MDS scrub will be required afterwards. The journal contents
-    will not be modified by this command, you should truncate the journal
-    separately after recovering what you can.
-
-Journal truncation
-------------------
-
-If the journal is corrupt or MDSs cannot replay it for any reason, you can
-truncate it like so:
-
-::
+CephFS includes some tools that may be able to recover a damaged filesystem,
+but to use them safely requires a solid understanding of CephFS internals.
+The documentation for these potentially dangerous operations is on a
+separate page: :ref:`disaster-recovery-experts`.
 
-    cephfs-journal-tool journal reset
+Data pool damage (files affected by lost data PGs)
+--------------------------------------------------
 
-.. warning::
-
-    Resetting the journal *will* lose metadata unless you have extracted
-    it by other means such as ``recover_dentries``. It is likely to leave
-    some orphaned objects in the data pool. It may result in re-allocation
-    of already-written inodes, such that permissions rules could be violated.
-
-MDS table wipes
----------------
-
-After the journal has been reset, it may no longer be consistent with respect
-to the contents of the MDS tables (InoTable, SessionMap, SnapServer).
-
-To reset the SessionMap (erase all sessions), use:
-
-::
-
-    cephfs-table-tool all reset session
-
-This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
-rank to operate on that rank only.
-
-The session table is the table most likely to need resetting, but if you know you
-also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
-
-MDS map reset
--------------
-
-Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
-is somewhat recovered, it may be necessary to update the MDS map to reflect
-the contents of the metadata pool. Use the following command to reset the MDS
-map to a single MDS:
-
-::
-
-    ceph fs reset <fs name> --yes-i-really-mean-it
-
-Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored:
-as a result it is possible for this to result in data loss.
-
-One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
-key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
-that it would overwrite any existing root inode on disk and orphan any existing files. In
-contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
-daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
-
-Recovery from missing metadata objects
---------------------------------------
-
-Depending on what objects are missing or corrupt, you may need to
-run various commands to regenerate default versions of the
-objects.
-
-::
-
-    # Session table
-    cephfs-table-tool 0 reset session
-    # SnapServer
-    cephfs-table-tool 0 reset snap
-    # InoTable
-    cephfs-table-tool 0 reset inode
-    # Journal
-    cephfs-journal-tool --rank=0 journal reset
-    # Root inodes ("/" and MDS directory)
-    cephfs-data-scan init
-
-Finally, you can regenerate metadata objects for missing files
-and directories based on the contents of a data pool. This is
-a three-phase process. First, scanning *all* objects to calculate
-size and mtime metadata for inodes. Second, scanning the first
-object from every file to collect this metadata and inject it into
-the metadata pool. Third, checking inode linkages and fixing found
-errors.
-
-::
-
-    cephfs-data-scan scan_extents
-    cephfs-data-scan scan_inodes
-    cephfs-data-scan scan_links
-
-'scan_extents' and 'scan_inodes' commands may take a *very long* time
-if there are many files or very large files in the data pool.
-
-To accelerate the process, run multiple instances of the tool.
-
-Decide on a number of workers, and pass each worker a number within
-the range 0-(worker_m - 1).
-
-The example below shows how to run 4 workers simultaneously:
-
-::
-
-    # Worker 0
-    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4
-    # Worker 1
-    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4
-    # Worker 2
-    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4
-    # Worker 3
-    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4
-
-    # Worker 0
-    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4
-    # Worker 1
-    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4
-    # Worker 2
-    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4
-    # Worker 3
-    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4
-
-It is **important** to ensure that all workers have completed the
-scan_extents phase before any workers enter the scan_inodes phase.
-
-After completing the metadata recovery, you may want to run cleanup
-operation to delete ancillary data geneated during recovery.
-
-::
-
-    cephfs-data-scan cleanup
-
-Finding files affected by lost data PGs
----------------------------------------
+If a PG is lost in a *data* pool, then the filesystem will continue
+to operate normally, but some parts of some files will simply
+be missing (reads will return zeros).
 
 Losing a data PG may affect many files. Files are split into many objects,
 so identifying which files are affected by loss of particular PGs requires
@@ -218,63 +58,3 @@ Note that this command acts as a normal CephFS client to find all the
 files in the filesystem and read their layouts, so the MDS must be up
 and running.
 
-Using an alternate metadata pool for recovery
----------------------------------------------
-
-.. warning::
-
-    There has not been extensive testing of this procedure. It should be
-    undertaken with great care.
-
-If an existing filesystem is damaged and inoperative, it is possible to create
-a fresh metadata pool and attempt to reconstruct the filesystem metadata
-into this new pool, leaving the old metadata in place. This could be used to
-make a safer attempt at recovery since the existing metadata pool would not be
-overwritten.
-
-.. caution::
-
-    During this process, multiple metadata pools will contain data referring to
-    the same data pool. Extreme caution must be exercised to avoid changing the
-    data pool contents while this is the case. Once recovery is complete, the
-    damaged metadata pool should be deleted.
-
-To begin this process, first create the fresh metadata pool and initialize
-it with empty file system data structures:
-
-::
-
-    ceph fs flag set enable_multiple true --yes-i-really-mean-it
-    ceph osd pool create recovery replicated
-    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
-    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
-    ceph fs reset recovery-fs --yes-i-really-mean-it
-    cephfs-table-tool recovery-fs:all reset session
-    cephfs-table-tool recovery-fs:all reset snap
-    cephfs-table-tool recovery-fs:all reset inode
-
-Next, run the recovery toolset using the --alternate-pool argument to output
-results to the alternate pool:
-
-::
-
-    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name>
-    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init
-    cephfs-data-scan scan_links --filesystem recovery-fs
-
-If the damaged filesystem contains dirty journal data, it may be recovered next
-with:
-
-::
-
-    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
-    cephfs-journal-tool --rank recovery-fs:0 journal reset --force
-
-After recovery, some recovered directories will have incorrect statistics.
-Ensure the parameters mds_verify_scatter and mds_debug_scatterstat are set
-to false (the default) to prevent the MDS from checking the statistics, then
-run a forward scrub to repair them. Ensure you have an MDS running and issue:
-
-::
-
-    ceph daemon mds.a scrub_path / recursive repair
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 056037dfbf11b..8cdfeaf7b6306 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -103,6 +103,11 @@ authentication keyring.
     Configuring multiple active MDS daemons <multimds>
     Export over NFS <nfs>
 
+.. toctree::
+   :hidden:
+
+    Advanced: Metadata repair <disaster-recovery-experts>
+
 .. raw:: html
 
 
-- 
2.39.5