doc: add section on new mds_join_fs behavior

author Patrick Donnelly <pdonnell@redhat.com>

Tue, 11 Feb 2020 04:02:29 +0000 (20:02 -0800)

committer Patrick Donnelly <pdonnell@redhat.com>

Thu, 13 Feb 2020 15:51:10 +0000 (07:51 -0800)
diff --git a/doc/cephfs/add-remove-mds.rst b/doc/cephfs/add-remove-mds.rst

index a3190fed2b4993429b168f922f959e9869c5c6f1..545779a6e573f88ba7e3f2eea501b02f3ee266ba 100644
--- a/doc/cephfs/add-remove-mds.rst
+++ b/doc/cephfs/add-remove-mds.rst
@@ -64,11 +64,11 @@ Adding an MDS
  
  #. Create an mds data point ``/var/lib/ceph/mds/ceph-${id}``. The daemon only uses this directory to store its keyring.
  
-#. Create the authentication key, if you use CephX. ::
+#. Create the authentication key, if you use CephX: ::
  
         $ sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring
  
-#. Start the service. ::
+#. Start the service: ::
  
         $ sudo systemctl start ceph-mds@${id}
  
@@ -76,6 +76,11 @@ Adding an MDS
  
         mds: ${id}:1 {0=${id}=up:active} 2 up:standby
  
+#. Optionally, configure the file system the MDS should join (:ref:`mds-join-fs`): ::
+
+    $ ceph config set mds.${id} mds_join_fs ${fs}
+
+
  Removing an MDS
  ===============
  
diff --git a/doc/cephfs/standby.rst b/doc/cephfs/standby.rst

index b50378397aef205145263553d9e75f3a33ef48e8..22216c36f8d2ef570ff8e876632c72901107b8bc 100644
--- a/doc/cephfs/standby.rst
+++ b/doc/cephfs/standby.rst
@@ -105,3 +105,85 @@ standby for the rank that it is following. If another rank fails, this
  standby-replay daemon will not be used as a replacement, even if no other
  standbys are available. For this reason, it is advised that if standby-replay
  is used then every active MDS should have a standby-replay daemon.
+
+.. _mds-join-fs:
+
+Configuring MDS file system affinity
+------------------------------------
+
+You may want a particular MDS to serve a particular file system. Or you may
+have larger MDS daemons on better hardware that should be preferred over a
+last-resort standby on lesser or over-provisioned hardware. To express this
+preference, CephFS provides an MDS configuration option called ``mds_join_fs``,
+which enforces this `affinity`.
+
+As part of any failover, the Ceph monitors will prefer standby daemons with
+``mds_join_fs`` equal to the name of the file system with the failed rank. If no
+standby exists with ``mds_join_fs`` equal to that name, the monitors will
+choose a `vanilla` standby (no setting for ``mds_join_fs``) for the replacement,
+or any other available standby as a last resort. Note that this does not change
+the behavior that ``standby-replay`` daemons are always selected before looking
+at other standbys.
+
+Furthermore, the monitors will regularly examine the CephFS file systems, when
+they are stable, to check whether a standby with stronger affinity is available
+to replace an MDS with lower affinity. This check also applies to standby-replay
+daemons: if a regular standby has stronger affinity than the standby-replay MDS,
+it will replace the standby-replay MDS.
+
+For example, given this stable and healthy file system:
+
+::
+
+    $ ceph fs dump
+    dumped fsmap epoch 399
+    ...
+    Filesystem 'cephfs' (27)
+    ...
+    e399
+    max_mds 1
+    in      0
+    up      {0=20384}
+    failed
+    damaged
+    stopped
+    ...
+    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
+
+    Standby daemons:
+
+    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+
+You may set ``mds_join_fs`` on the standby to enforce your preference: ::
+
+    $ ceph config set mds.b mds_join_fs cephfs
+
+After automatic failover: ::
+
+    $ ceph fs dump
+    dumped fsmap epoch 405
+    e405
+    ...
+    Filesystem 'cephfs' (27)
+    ...
+    max_mds 1
+    in      0
+    up      {0=10420}
+    failed
+    damaged
+    stopped
+    ...
+    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+    Standby daemons:
+
+    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]
+
+Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
+output, the file system name from ``mds_join_fs`` is shown as the file system
+identifier (27). If the file system is recreated with the same name, the
+standby will follow the new file system as expected.
+
+Finally, if the file system is degraded or undersized, no failover will occur
+to enforce ``mds_join_fs``.
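
Note on the patch above: the preference order that the new standby.rst text describes (matching ``mds_join_fs`` first, then a vanilla standby, then any other standby) can be illustrated with a small toy model. This is only a sketch of the documented ordering, not the monitor's actual code. The names Standby, affinity_rank and pick_standby are hypothetical, and the model deliberately ignores standby-replay daemons, which the text says are always considered first.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Standby:
    """Hypothetical stand-in for a standby MDS daemon."""
    name: str
    join_fs: Optional[str] = None  # value of mds_join_fs; None or "" means vanilla


def affinity_rank(standby: Standby, fs_name: str) -> int:
    """Lower is better: 0 = matching affinity, 1 = vanilla, 2 = other file system."""
    if standby.join_fs == fs_name:
        return 0
    if not standby.join_fs:
        return 1
    return 2


def pick_standby(standbys: Sequence[Standby], fs_name: str) -> Optional[Standby]:
    """Return the preferred replacement for a failed rank in fs_name, if any."""
    if not standbys:
        return None
    # min() keeps the first candidate among equally ranked standbys.
    return min(standbys, key=lambda s: affinity_rank(s, fs_name))


if __name__ == "__main__":
    pool = [Standby("a", "otherfs"), Standby("b"), Standby("c", "cephfs")]
    chosen = pick_standby(pool, "cephfs")
    print(chosen.name if chosen else "no standby available")
```

In this model a standby whose ``mds_join_fs`` names a different file system is still eligible, but only as the last resort, which matches the wording "any other available standby as a last resort".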