or a similar string they will prompt you with. Consider these actions carefully
before proceeding; they are placed on especially dangerous activities.
+.. _advanced-cephfs-admin-settings:
Advanced
--------
to bring back the file system. The steps to recover a multiple active MDS file
system or multiple file systems are yet to be identified. Currently, only the steps
to recover a **single active MDS** file system with no additional file systems
-in the cluster have been identified and tested. Briefly the steps are: stop the
-MDSs; recreate the FSMap with basic defaults; and allow MDSs to recover from
+in the cluster have been identified and tested. Briefly, the steps are:
+recreate the FSMap with basic defaults, and allow the MDSs to recover from
the journal/metadata stored in the filesystem's pools. The steps are described
in more detail below.
-First up, stop all the MDSs of the cluster.
-
-Verify that the MDSs have been stopped. Execute the below command and
-check that no active or standby MDS daemons are listed for the file system.
-
-::
-
- ceph fs dump
-
-Recreate the file system using the recovered file system pools. The new FSMap
-will have the filesystem's default settings. However, the user defined file
-system settings such as ``standby_count_wanted``, ``required_client_features``,
+First up, recreate the file system using the recovered file system pools. The
+new FSMap will have the filesystem's default settings. However, the user-defined
+file system settings such as ``standby_count_wanted``, ``required_client_features``,
extra data pools, etc., are lost and need to be reapplied later.
::
- ceph fs new <fs_name> <metadata_pool> <data_pool> --force
+ ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
+
+The ``recover`` flag sets the state of the file system's rank 0 to existing but
+failed. So when an MDS daemon eventually picks up rank 0, the daemon reads the
+existing in-RADOS metadata and doesn't overwrite it. The flag also prevents the
+standby MDS daemons from activating the file system.
The file system cluster ID, fscid, of the file system will not be preserved.
This behaviour may not be desirable for certain applications (e.g., Ceph CSI)
-that expect the file system to be unchanged across recovery. To fix this, pass
-the desired fscid when recreating the file system.
-
-::
+that expect the file system to be unchanged across recovery. To fix this, you
+can optionally set the ``fscid`` option in the above command (see
+:ref:`advanced-cephfs-admin-settings`).
- ceph fs new <fs_name> <metadata_pool> <data_pool> --fscid <fscid> --force
-
-Next, reset the file system. The below command marks the state of the
-file system's rank 0 such that eventually when a MDS daemon picks up rank 0 the
-daemon reads the existing in-RADOS metadata and doesn't overwrite it.
+Allow standby MDS daemons to join the file system.
::
- ceph fs reset <fs_name> --yes-i-really-mean-it
+ ceph fs set <fs_name> joinable true
+
-Restart the MDSs. Check that the file system is no longer in degraded state and
-one of the MDSs is active.
+Check that the file system is no longer in degraded state and has an active
+MDS.
::
- ceph fs dump
+ ceph fs status
Reapply any other custom file system settings.
--- /dev/null
+tasks:
+- cephfs_test_runner:
+ modules:
+ - tasks.cephfs.test_recovery_fs
--- /dev/null
+import logging
+from os.path import join as os_path_join
+
+from tasks.cephfs.cephfs_test_case import CephFSTestCase
+
+log = logging.getLogger(__name__)
+
+class TestFSRecovery(CephFSTestCase):
+ """
+ Tests for recovering FS after loss of FSMap
+ """
+
+ CLIENTS_REQUIRED = 1
+ MDSS_REQUIRED = 3
+
+ def test_recover_fs_after_fsmap_removal(self):
+ data_pool = self.fs.get_data_pool_name()
+ metadata_pool = self.fs.get_metadata_pool_name()
+ # write data in mount, and fsync
+ self.mount_a.create_n_files('file_on_fs', 1, sync=True)
+ # fail MDSs to allow removing the file system in the next step
+ self.fs.fail()
+ # Remove file system to lose FSMap and keep the pools intact.
+ # This mimics the scenario where the monitor store is rebuilt
+ # using OSDs to recover a cluster with corrupt monitor store.
+ # The FSMap is permanently lost, but the FS pools are
+ # recovered/intact
+ self.fs.rm()
+ # Recreate file system with pool and previous fscid
+ self.fs.mon_manager.raw_cluster_cmd(
+ 'fs', 'new', self.fs.name, metadata_pool, data_pool,
+ '--recover', '--force', '--fscid', f'{self.fs.id}')
+ self.fs.set_joinable()
+ # Check status of file system
+ self.fs.wait_for_daemons()
+ # check data in file system is intact
+ filepath = os_path_join(self.mount_a.hostfs_mntpt, 'file_on_fs_0')
+ self.assertEqual(self.mount_a.read_file(filepath), "content")
Filesystem::ref FSMap::create_filesystem(std::string_view name,
int64_t metadata_pool, int64_t data_pool, uint64_t features,
- fs_cluster_id_t fscid)
+ fs_cluster_id_t fscid, bool recover)
{
auto fs = Filesystem::create();
fs->mds_map.epoch = epoch;
next_filesystem_id = std::max(fscid, (fs_cluster_id_t)next_filesystem_id) + 1;
}
+ if (recover) {
+ // Populate rank 0 as existing (so don't go into CREATING)
+ // but failed (so that next available MDS is assigned the rank)
+ fs->mds_map.in.insert(mds_rank_t(0));
+ fs->mds_map.failed.insert(mds_rank_t(0));
+
+ fs->mds_map.set_flag(CEPH_MDSMAP_NOT_JOINABLE);
+ }
+
// File system's ID can be FS_CLUSTER_ID_ANONYMOUS if we're recovering
// a legacy file system by passing FS_CLUSTER_ID_ANONYMOUS as the desired
// file system ID
Filesystem::ref create_filesystem(
std::string_view name, int64_t metadata_pool,
int64_t data_pool, uint64_t features,
- fs_cluster_id_t fscid);
+ fs_cluster_id_t fscid, bool recover);
/**
* Remove the filesystem (it must exist). Caller should already
static_cast<double>(4.0));
mon->osdmon()->propose_pending();
+ bool recover = false;
+ cmd_getval(cmdmap, "recover", recover);
+
// All checks passed, go ahead and create.
auto&& fs = fsmap.create_filesystem(fs_name, metadata, data,
- mon->get_quorum_con_features(), fscid);
+ mon->get_quorum_con_features(), fscid, recover);
ss << "new fs with metadata pool " << metadata << " and data pool " << data;
+ if (recover) {
+ return 0;
+ }
+
// assign a standby to rank 0 to avoid health warnings
auto info = fsmap.find_replacement_for({fs->fscid, 0});
"name=data,type=CephString "
"name=force,type=CephBool,req=false "
"name=allow_dangerous_metadata_overlay,type=CephBool,req=false "
- "name=fscid,type=CephInt,range=0,req=false",
+ "name=fscid,type=CephInt,range=0,req=false "
+ "name=recover,type=CephBool,req=false",
"make new filesystem using named pools <metadata> and <data>",
"fs", "rw")
COMMAND("fs fail "