mon/FSCommands: add 'recover' flag in `fs new` command

author Ramana Raja <rraja@redhat.com>

Wed, 11 Aug 2021 20:34:47 +0000 (16:34 -0400)

committer Ramana Raja <rraja@redhat.com>

Tue, 22 Mar 2022 20:08:45 +0000 (16:08 -0400)
diff --git a/doc/cephfs/administration.rst b/doc/cephfs/administration.rst

index fe55ff8ca3ba4007c57beaec6e47c4f0d75c6a1a..966b45ee5c62a01716313c36ad386f2e0e29f20f 100644 (file)
--- a/doc/cephfs/administration.rst
+++ b/doc/cephfs/administration.rst
@@ -340,6 +340,7 @@ Some flags require you to confirm your intentions with "--yes-i-really-mean-it"
  or a similar string they will prompt you with. Consider these actions carefully
  before proceeding; they are placed on especially dangerous activities.
  
+.. _advanced-cephfs-admin-settings:
  
  Advanced
  --------
diff --git a/doc/cephfs/recover-fs-after-mon-store-loss.rst b/doc/cephfs/recover-fs-after-mon-store-loss.rst

index ca088573162e55aa7ad75eae57a88fc0104eb57f..d44269c831880da15ec91ff946fa01e7146df62b 100644 (file)
--- a/doc/cephfs/recover-fs-after-mon-store-loss.rst
+++ b/doc/cephfs/recover-fs-after-mon-store-loss.rst
@@ -9,51 +9,43 @@ stores don't restore the file system maps ("FSMap"). Additional steps are requir
  to bring back the file system. The steps to recover a multiple active MDS file
  system or multiple file systems are yet to be identified. Currently, only the steps
  to recover a **single active MDS** file system with no additional file systems
-in the cluster have been identified and tested. Briefly the steps are: stop the
-MDSs; recreate the FSMap with basic defaults; and allow MDSs to recover from
+in the cluster have been identified and tested. Briefly the steps are:
+recreate the FSMap with basic defaults; and allow MDSs to recover from
  the journal/metadata stored in the filesystem's pools. The steps are described
  in more detail below.
  
-First up, stop all the MDSs of the cluster.
-
-Verify that the MDSs have been stopped. Execute the below command and
-check that no active or standby MDS daemons are listed for the file system.
-
-::
-
-    ceph fs dump
-
-Recreate the file system using the recovered file system pools. The new FSMap
-will have the filesystem's default settings. However, the user defined file
-system settings such as ``standby_count_wanted``, ``required_client_features``,
+First up, recreate the file system using the recovered file system pools. The
+new FSMap will have the filesystem's default settings. However, the user defined
+file system settings such as ``standby_count_wanted``, ``required_client_features``,
  extra data pools, etc., are lost and need to be reapplied later.
  
  ::
  
-    ceph fs new <fs_name> <metadata_pool> <data_pool> --force
+    ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
+
+The ``recover`` flag sets the state of the file system's rank 0 to existing but
+failed. So when an MDS daemon eventually picks up rank 0, the daemon reads the
+existing in-RADOS metadata and doesn't overwrite it. The flag also prevents the
+standby MDS daemons from activating the file system.
  
  The file system cluster ID, fscid, of the file system will not be preserved.
  This behaviour may not be desirable for certain applications (e.g., Ceph CSI)
-that expect the file system to be unchanged across recovery. To fix this, pass
-the desired fscid when recreating the file system.
-
-::
+that expect the file system to be unchanged across recovery. To fix this, you
+can optionally set the ``fscid`` option in the above command (see
+:ref:`advanced-cephfs-admin-settings`).
  
-    ceph fs new <fs_name> <metadata_pool> <data_pool> --fscid <fscid> --force
-
-Next, reset the file system. The below command marks the state of the
-file system's rank 0 such that eventually when a MDS daemon picks up rank 0 the
-daemon reads the existing in-RADOS metadata and doesn't overwrite it.
+Allow standby MDS daemons to join the file system.
  
  ::
  
-    ceph fs reset <fs_name> --yes-i-really-mean-it
+    ceph fs set <fs_name> joinable true
+
  
-Restart the MDSs. Check that the file system is no longer in degraded state and
-one of the MDSs is active.
+Check that the file system is no longer in degraded state and has an active
+MDS.
  
  ::
  
-    ceph fs dump
+    ceph fs status
  
  Reapply any other custom file system settings.
diff --git a/qa/suites/fs/functional/tasks/recovery-fs.yaml b/qa/suites/fs/functional/tasks/recovery-fs.yaml

new file mode 100644 (file)

index 0000000..d354e9f
--- /dev/null
+++ b/qa/suites/fs/functional/tasks/recovery-fs.yaml
@@ -0,0 +1,4 @@
+tasks:
+- cephfs_test_runner:
+    modules:
+      - tasks.cephfs.test_recovery_fs
diff --git a/qa/tasks/cephfs/test_recovery_fs.py b/qa/tasks/cephfs/test_recovery_fs.py

new file mode 100644 (file)

index 0000000..a7aefed
--- /dev/null
+++ b/qa/tasks/cephfs/test_recovery_fs.py
@@ -0,0 +1,38 @@
+import logging
+from os.path import join as os_path_join
+
+from tasks.cephfs.cephfs_test_case import CephFSTestCase
+
+log = logging.getLogger(__name__)
+
+class TestFSRecovery(CephFSTestCase):
+    """
+    Tests for recovering FS after loss of FSMap
+    """
+
+    CLIENTS_REQUIRED = 1
+    MDSS_REQUIRED = 3
+
+    def test_recover_fs_after_fsmap_removal(self):
+        data_pool = self.fs.get_data_pool_name()
+        metadata_pool = self.fs.get_metadata_pool_name()
+        # write data in mount, and fsync
+        self.mount_a.create_n_files('file_on_fs', 1, sync=True)
+        # fail MDSs to allow removing the file system in the next step
+        self.fs.fail()
+        # Remove file system to lose FSMap and keep the pools intact.
+        # This mimics the scenario where the monitor store is rebuilt
+        # using OSDs to recover a cluster with corrupt monitor store.
+        # The FSMap is permanently lost, but the FS pools are
+        # recovered/intact
+        self.fs.rm()
+        # Recreate file system with pool and previous fscid
+        self.fs.mon_manager.raw_cluster_cmd(
+            'fs', 'new', self.fs.name, metadata_pool, data_pool,
+            '--recover', '--force', '--fscid', f'{self.fs.id}')
+        self.fs.set_joinable()
+        # Check status of file system
+        self.fs.wait_for_daemons()
+        # check data in file system is intact
+        filepath = os_path_join(self.mount_a.hostfs_mntpt, 'file_on_fs_0')
+        self.assertEqual(self.mount_a.read_file(filepath), "content")
diff --git a/src/mds/FSMap.cc b/src/mds/FSMap.cc

index 89de8bb82e031f515ac539095f6197e3a7b56035..d16ffd3f03f49b8cfcdeae61970a43b26c136a8c 100644 (file)
--- a/src/mds/FSMap.cc
+++ b/src/mds/FSMap.cc
@@ -450,7 +450,7 @@ mds_gid_t Filesystem::get_standby_replay(mds_gid_t who) const
  
  Filesystem::ref FSMap::create_filesystem(std::string_view name,
      int64_t metadata_pool, int64_t data_pool, uint64_t features,
-    fs_cluster_id_t fscid)
+    fs_cluster_id_t fscid, bool recover)
  {
    auto fs = Filesystem::create();
    fs->mds_map.epoch = epoch;
@@ -469,6 +469,15 @@ Filesystem::ref FSMap::create_filesystem(std::string_view name,
      next_filesystem_id = std::max(fscid,  (fs_cluster_id_t)next_filesystem_id) + 1;
    }
  
+  if (recover) {
+    // Populate rank 0 as existing (so don't go into CREATING)
+    // but failed (so that next available MDS is assigned the rank)
+    fs->mds_map.in.insert(mds_rank_t(0));
+    fs->mds_map.failed.insert(mds_rank_t(0));
+
+    fs->mds_map.set_flag(CEPH_MDSMAP_NOT_JOINABLE);
+  }
+
    // File system's ID can be FS_CLUSTER_ID_ANONYMOUS if we're recovering
    // a legacy file system by passing FS_CLUSTER_ID_ANONYMOUS as the desired
    // file system ID
diff --git a/src/mds/FSMap.h b/src/mds/FSMap.h

index f985da39b64f9f278bf2912c7284b03f455809a5..9cb2dce45b2b994f85f57aea875889ebc32a5c2c 100644 (file)
--- a/src/mds/FSMap.h
+++ b/src/mds/FSMap.h
@@ -410,7 +410,7 @@ public:
    Filesystem::ref create_filesystem(
        std::string_view name, int64_t metadata_pool,
        int64_t data_pool, uint64_t features,
-      fs_cluster_id_t fscid);
+      fs_cluster_id_t fscid, bool recover);
  
    /**
     * Remove the filesystem (it must exist).  Caller should already
diff --git a/src/mon/FSCommands.cc b/src/mon/FSCommands.cc

index f9aee4cf35f7f2255e31dedd01ee8adff3c783a2..0b1bb2a03b79ad190ad59a2012c68c9730f310b9 100644 (file)
--- a/src/mon/FSCommands.cc
+++ b/src/mon/FSCommands.cc
@@ -287,12 +287,19 @@ class FsNewHandler : public FileSystemCommandHandler
                                    static_cast<double>(4.0));
      mon->osdmon()->propose_pending();
  
+    bool recover = false;
+    cmd_getval(cmdmap, "recover", recover);
+
      // All checks passed, go ahead and create.
      auto&& fs = fsmap.create_filesystem(fs_name, metadata, data,
-        mon->get_quorum_con_features(), fscid);
+        mon->get_quorum_con_features(), fscid, recover);
  
      ss << "new fs with metadata pool " << metadata << " and data pool " << data;
  
+    if (recover) {
+      return 0;
+    }
+
      // assign a standby to rank 0 to avoid health warnings
      auto info = fsmap.find_replacement_for({fs->fscid, 0});
  
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h

index 948b67501864361f561b84f484047369b7422c73..64711299b419ca8abd099ccc0cf029e2be47c733 100644 (file)
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -372,7 +372,8 @@ COMMAND("fs new "
         "name=data,type=CephString "
         "name=force,type=CephBool,req=false "
         "name=allow_dangerous_metadata_overlay,type=CephBool,req=false "
-       "name=fscid,type=CephInt,range=0,req=false",
+       "name=fscid,type=CephInt,range=0,req=false "
+       "name=recover,type=CephBool,req=false",
         "make new filesystem using named pools <metadata> and <data>",
         "fs", "rw")
  COMMAND("fs fail "
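
The recovery sequence documented in this change (recreate the file system with `--force --recover`, optionally pass `--fscid`, set `joinable` to true, then check the status) can be sketched as a small helper that only builds the command lines. This is an illustrative sketch, not part of the commit: the function name `recovery_commands` and the pool and file system names in the usage example are hypothetical, and the helper does not run anything against a cluster.

```python
import shlex
from typing import List, Optional


def recovery_commands(fs_name: str, metadata_pool: str, data_pool: str,
                      fscid: Optional[int] = None) -> List[List[str]]:
    """Build argv lists for the recovery steps described in the doc change.

    Hypothetical helper for illustration; it only assembles the commands
    shown in recover-fs-after-mon-store-loss.rst and does not execute them.
    """
    # Step 1: recreate the FSMap entry. --recover marks rank 0 as existing
    # but failed and leaves the file system not joinable.
    fs_new = ["ceph", "fs", "new", fs_name, metadata_pool, data_pool,
              "--force", "--recover"]
    if fscid is not None:
        # MonCommands.h declares fscid as a CephInt with range=0,
        # so negative values are rejected here as well.
        if fscid < 0:
            raise ValueError("fscid must be non-negative")
        fs_new += ["--fscid", str(fscid)]

    return [
        fs_new,
        # Step 2: allow standby MDS daemons to join the file system.
        ["ceph", "fs", "set", fs_name, "joinable", "true"],
        # Step 3: check that the file system has an active MDS.
        ["ceph", "fs", "status"],
    ]


if __name__ == "__main__":
    # Example names are placeholders, not taken from the commit.
    for argv in recovery_commands("cephfs", "cephfs_metadata",
                                  "cephfs_data", fscid=1):
        print(shlex.join(argv))
```

Keeping each command as an argv list rather than a single string avoids quoting problems if the lists are later passed to something like `subprocess.run`; `shlex.join` is used only for display.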