mon/FSCommands: add 'recover' flag in `fs new` command

author Ramana Raja <rraja@redhat.com>

Wed, 11 Aug 2021 20:34:47 +0000 (16:34 -0400)

committer Ramana Raja <rraja@redhat.com>

Mon, 13 Sep 2021 04:15:39 +0000 (00:15 -0400)
diff --git a/doc/cephfs/administration.rst b/doc/cephfs/administration.rst

index e73dfa51e3775ae9ba01b761d67ce94aa7a3ca83..e5acc9ab870512cc1e00aedd578dcbdee29479a0 100644 (file)
--- a/doc/cephfs/administration.rst
+++ b/doc/cephfs/administration.rst
@@ -356,6 +356,7 @@ Some flags require you to confirm your intentions with "--yes-i-really-mean-it"
  or a similar string they will prompt you with. Consider these actions carefully
  before proceeding; they are placed on especially dangerous activities.
  
+.. _advanced-cephfs-admin-settings:
  
  Advanced
  --------
diff --git a/doc/cephfs/recover-fs-after-mon-store-loss.rst b/doc/cephfs/recover-fs-after-mon-store-loss.rst

index ca088573162e55aa7ad75eae57a88fc0104eb57f..d44269c831880da15ec91ff946fa01e7146df62b 100644 (file)
--- a/doc/cephfs/recover-fs-after-mon-store-loss.rst
+++ b/doc/cephfs/recover-fs-after-mon-store-loss.rst
@@ -9,51 +9,43 @@ stores don't restore the file system maps ("FSMap"). Additional steps are requir
  to bring back the file system. The steps to recover a multiple active MDS file
  system or multiple file systems are yet to be identified. Currently, only the steps
  to recover a **single active MDS** file system with no additional file systems
-in the cluster have been identified and tested. Briefly the steps are: stop the
-MDSs; recreate the FSMap with basic defaults; and allow MDSs to recover from
+in the cluster have been identified and tested. Briefly, the steps are:
+recreate the FSMap with basic defaults, and allow MDSs to recover from
  the journal/metadata stored in the filesystem's pools. The steps are described
  in more detail below.
  
-First up, stop all the MDSs of the cluster.
-
-Verify that the MDSs have been stopped. Execute the below command and
-check that no active or standby MDS daemons are listed for the file system.
-
-::
-
-    ceph fs dump
-
-Recreate the file system using the recovered file system pools. The new FSMap
-will have the filesystem's default settings. However, the user defined file
-system settings such as ``standby_count_wanted``, ``required_client_features``,
+First up, recreate the file system using the recovered file system pools. The
+new FSMap will have the filesystem's default settings. However, the user defined
+file system settings such as ``standby_count_wanted``, ``required_client_features``,
  extra data pools, etc., are lost and need to be reapplied later.
  
  ::
  
-    ceph fs new <fs_name> <metadata_pool> <data_pool> --force
+    ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
+
+The ``recover`` flag sets the state of the file system's rank 0 to existing but
+failed, so that when an MDS daemon eventually picks up rank 0, the daemon reads
+the existing in-RADOS metadata and doesn't overwrite it. The flag also prevents
+the standby MDS daemons from activating the file system.
  
  The file system cluster ID, fscid, of the file system will not be preserved.
  This behaviour may not be desirable for certain applications (e.g., Ceph CSI)
-that expect the file system to be unchanged across recovery. To fix this, pass
-the desired fscid when recreating the file system.
-
-::
+that expect the file system to be unchanged across recovery. To fix this, you
+can optionally set the ``fscid`` option in the above command (see
+:ref:`advanced-cephfs-admin-settings`).
  
-    ceph fs new <fs_name> <metadata_pool> <data_pool> --fscid <fscid> --force
-
-Next, reset the file system. The below command marks the state of the
-file system's rank 0 such that eventually when a MDS daemon picks up rank 0 the
-daemon reads the existing in-RADOS metadata and doesn't overwrite it.
+Allow standby MDS daemons to join the file system.
  
  ::
  
-    ceph fs reset <fs_name> --yes-i-really-mean-it
+    ceph fs set <fs_name> joinable true
+
  
-Restart the MDSs. Check that the file system is no longer in degraded state and
-one of the MDSs is active.
+Check that the file system is no longer in degraded state and has an active
+MDS.
  
  ::
  
-    ceph fs dump
+    ceph fs status
  
  Reapply any other custom file system settings.
diff --git a/qa/suites/fs/functional/tasks/recovery-fs.yaml b/qa/suites/fs/functional/tasks/recovery-fs.yaml

new file mode 100644 (file)

index 0000000..d354e9f
--- /dev/null
+++ b/qa/suites/fs/functional/tasks/recovery-fs.yaml
@@ -0,0 +1,4 @@
+tasks:
+- cephfs_test_runner:
+    modules:
+      - tasks.cephfs.test_recovery_fs
diff --git a/qa/tasks/cephfs/test_recovery_fs.py b/qa/tasks/cephfs/test_recovery_fs.py

new file mode 100644 (file)

index 0000000..a7aefed
--- /dev/null
+++ b/qa/tasks/cephfs/test_recovery_fs.py
@@ -0,0 +1,38 @@
+import logging
+from os.path import join as os_path_join
+
+from tasks.cephfs.cephfs_test_case import CephFSTestCase
+
+log = logging.getLogger(__name__)
+
+class TestFSRecovery(CephFSTestCase):
+    """
+    Tests for recovering FS after loss of FSMap
+    """
+
+    CLIENTS_REQUIRED = 1
+    MDSS_REQUIRED = 3
+
+    def test_recover_fs_after_fsmap_removal(self):
+        data_pool = self.fs.get_data_pool_name()
+        metadata_pool = self.fs.get_metadata_pool_name()
+        # write data in mount, and fsync
+        self.mount_a.create_n_files('file_on_fs', 1, sync=True)
+        # fail MDSs to allow removing the file system in the next step
+        self.fs.fail()
+        # Remove file system to lose FSMap and keep the pools intact.
+        # This mimics the scenario where the monitor store is rebuilt
+        # using OSDs to recover a cluster with corrupt monitor store.
+        # The FSMap is permanently lost, but the FS pools are
+        # recovered/intact
+        self.fs.rm()
+        # Recreate file system with pool and previous fscid
+        self.fs.mon_manager.raw_cluster_cmd(
+            'fs', 'new', self.fs.name, metadata_pool, data_pool,
+            '--recover', '--force', '--fscid', f'{self.fs.id}')
+        self.fs.set_joinable()
+        # Check status of file system
+        self.fs.wait_for_daemons()
+        # check data in file system is intact
+        filepath = os_path_join(self.mount_a.hostfs_mntpt, 'file_on_fs_0')
+        self.assertEqual(self.mount_a.read_file(filepath), "content")
diff --git a/src/mds/FSMap.cc b/src/mds/FSMap.cc

index 7fa5ca9904b5e3d09ffe864e58f7478fb0daea45..ba46ab49b09f78c8b878efe2546f93a5b9e9bf95 100644 (file)
--- a/src/mds/FSMap.cc
+++ b/src/mds/FSMap.cc
@@ -450,7 +450,7 @@ mds_gid_t Filesystem::get_standby_replay(mds_gid_t who) const
  
  Filesystem::ref FSMap::create_filesystem(std::string_view name,
      int64_t metadata_pool, int64_t data_pool, uint64_t features,
-    fs_cluster_id_t fscid)
+    fs_cluster_id_t fscid, bool recover)
  {
    auto fs = Filesystem::create();
    fs->mds_map.epoch = epoch;
@@ -469,6 +469,15 @@ Filesystem::ref FSMap::create_filesystem(std::string_view name,
      next_filesystem_id = std::max(fscid,  (fs_cluster_id_t)next_filesystem_id) + 1;
    }
  
+  if (recover) {
+    // Populate rank 0 as existing (so don't go into CREATING)
+    // but failed (so that next available MDS is assigned the rank)
+    fs->mds_map.in.insert(mds_rank_t(0));
+    fs->mds_map.failed.insert(mds_rank_t(0));
+
+    fs->mds_map.set_flag(CEPH_MDSMAP_NOT_JOINABLE);
+  }
+
    // File system's ID can be FS_CLUSTER_ID_ANONYMOUS if we're recovering
    // a legacy file system by passing FS_CLUSTER_ID_ANONYMOUS as the desired
    // file system ID
diff --git a/src/mds/FSMap.h b/src/mds/FSMap.h

index 7deee53bef747bab17a2dcaa304435860b76f705..36711ca11e08596f3af755b8bd56fd5e19d9b777 100644 (file)
--- a/src/mds/FSMap.h
+++ b/src/mds/FSMap.h
@@ -409,7 +409,7 @@ public:
    Filesystem::ref create_filesystem(
        std::string_view name, int64_t metadata_pool,
        int64_t data_pool, uint64_t features,
-      fs_cluster_id_t fscid);
+      fs_cluster_id_t fscid, bool recover);
  
    /**
     * Remove the filesystem (it must exist).  Caller should already
diff --git a/src/mon/FSCommands.cc b/src/mon/FSCommands.cc

index c3b8293e114c2d19402d69f100aa783d120b149d..b5f8b6ce373c3d5ceac4ce17cefa4fe378b20176 100644 (file)
--- a/src/mon/FSCommands.cc
+++ b/src/mon/FSCommands.cc
@@ -287,12 +287,19 @@ class FsNewHandler : public FileSystemCommandHandler
                                    static_cast<double>(4.0));
      mon->osdmon()->propose_pending();
  
+    bool recover = false;
+    cmd_getval(cmdmap, "recover", recover);
+
      // All checks passed, go ahead and create.
      auto&& fs = fsmap.create_filesystem(fs_name, metadata, data,
-        mon->get_quorum_con_features(), fscid);
+        mon->get_quorum_con_features(), fscid, recover);
  
      ss << "new fs with metadata pool " << metadata << " and data pool " << data;
  
+    if (recover) {
+      return 0;
+    }
+
      // assign a standby to rank 0 to avoid health warnings
      auto info = fsmap.find_replacement_for({fs->fscid, 0});
  
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h

index 9919c723434c7d34cf16ef055e44375ac5201a79..88f2278cfbac2b59f82177a0493eef064ac9041c 100644 (file)
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -336,7 +336,8 @@ COMMAND("fs new "
         "name=data,type=CephString "
         "name=force,type=CephBool,req=false "
         "name=allow_dangerous_metadata_overlay,type=CephBool,req=false "
-       "name=fscid,type=CephInt,range=0,req=false",
+       "name=fscid,type=CephInt,range=0,req=false "
+       "name=recover,type=CephBool,req=false",
         "make new filesystem using named pools <metadata> and <data>",
         "fs", "rw")
  COMMAND("fs fail "
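
The recovery procedure documented in this change can be summarised as an ordered list of CLI invocations. The following is only an illustrative sketch, not part of the commit: the helper name `recovery_commands` and the pool and file system names in the usage example are hypothetical, and the function only builds argument lists without contacting a cluster.

```python
from typing import List, Optional


def recovery_commands(fs_name: str, metadata_pool: str, data_pool: str,
                      fscid: Optional[int] = None) -> List[List[str]]:
    """Build the documented recovery steps as argv lists, in order.

    Hypothetical helper for illustration; it does not run anything.
    """
    # Step 1: recreate the file system on the recovered pools. With
    # --recover, rank 0 is marked existing but failed and the file
    # system is left not joinable.
    fs_new = ["ceph", "fs", "new", fs_name, metadata_pool, data_pool,
              "--force", "--recover"]
    if fscid is not None:
        # Optional: ask for the previous file system cluster ID.
        fs_new += ["--fscid", str(fscid)]
    return [
        fs_new,
        # Step 2: allow standby MDS daemons to join the file system.
        ["ceph", "fs", "set", fs_name, "joinable", "true"],
        # Step 3: check that the file system has an active MDS.
        ["ceph", "fs", "status"],
    ]


if __name__ == "__main__":
    # Placeholder names; substitute the real file system and pools.
    for argv in recovery_commands("cephfs", "cephfs_metadata",
                                  "cephfs_data", fscid=1):
        print(" ".join(argv))
```

Keeping the steps as data makes the ordering explicit: the file system is made joinable only after `fs new --recover` has populated rank 0 as failed, which mirrors the sequence exercised in `test_recover_fs_after_fsmap_removal`.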