#. Create an mds data point ``/var/lib/ceph/mds/ceph-${id}``. The daemon only uses this directory to store its keyring.
-#. Create the authentication key, if you use CephX. ::
+#. Create the authentication key, if you use CephX: ::
$ sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring
-#. Start the service. ::
+#. Start the service: ::
$ sudo systemctl start ceph-mds@${id}
mds: ${id}:1 {0=${id}=up:active} 2 up:standby
+#. Optionally, configure the file system the MDS should join (:ref:`mds-join-fs`): ::
+
+ $ ceph config set mds.${id} mds_join_fs ${fs}
+
+
Removing an MDS
===============
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.
+
+.. _mds-join-fs:
+
+Configuring MDS file system affinity
+------------------------------------
+
+You may want to have an MDS used for a particular file system. Or, perhaps you
+have larger MDSs on better hardware that should be preferred over a last-resort
+standby on lesser or over-provisioned hardware. To express this preference,
+CephFS provides a configuration option for MDS called ``mds_join_fs`` which
+enforces this `affinity`.
+
+As part of any failover, the Ceph monitors will prefer standby daemons with
+``mds_join_fs`` equal to the file system name with the failed rank. If no
+standby exists with ``mds_join_fs`` equal to the file system name, it will
+choose a `vanilla` standby (no setting for ``mds_join_fs``) for the replacement
+or any other available standby as a last resort. Note, this does not change the
+behavior that ``standby-replay`` daemons are always selected before looking at
+other standbys.
+
+Even further, the monitors will regularly examine the CephFS file systems when
+stable to check if a standby with stronger affinity is available to replace an
+MDS with lower affinity. This process is also done for standby-replay daemons:
+if a regular standby has stronger affinity than the standby-replay MDS, it will
+replace the standby-replay MDS.
+
+For example, given this stable and healthy file system:
+
+::
+
+ $ ceph fs dump
+ dumped fsmap epoch 399
+ ...
+ Filesystem 'cephfs' (27)
+ ...
+ e399
+ max_mds 1
+ in 0
+ up {0=20384}
+ failed
+ damaged
+ stopped
+ ...
+ [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
+
+ Standby daemons:
+
+ [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+
+You may set ``mds_join_fs`` on the standby to enforce your preference: ::
+
+ $ ceph config set mds.b mds_join_fs cephfs
+
+after automatic failover: ::
+
+ $ ceph fs dump
+ dumped fsmap epoch 405
+ e405
+ ...
+ Filesystem 'cephfs' (27)
+ ...
+ max_mds 1
+ in 0
+ up {0=10420}
+ failed
+ damaged
+ stopped
+ ...
+ [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+ Standby daemons:
+
+ [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]
+
+Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
+output, the file system name from ``mds_join_fs`` is changed to the file system
+identifier (27). If the file system is recreated with the same name, the
+standby will follow the new file system as expected.
+
+Finally, if the file system is degraded or undersized, no failover will occur
+to enforce ``mds_join_fs``.