From 2e916037f4ef99760ff4d97c0319033679d4c329 Mon Sep 17 00:00:00 2001
From: Patrick Donnelly
Date: Mon, 10 Feb 2020 20:02:29 -0800
Subject: [PATCH] doc: add section on new mds_join_fs behavior

Signed-off-by: Patrick Donnelly
---
 doc/cephfs/add-remove-mds.rst |  9 +++-
 doc/cephfs/standby.rst        | 82 +++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/doc/cephfs/add-remove-mds.rst b/doc/cephfs/add-remove-mds.rst
index a3190fed2b4..545779a6e57 100644
--- a/doc/cephfs/add-remove-mds.rst
+++ b/doc/cephfs/add-remove-mds.rst
@@ -64,11 +64,11 @@ Adding an MDS

 #. Create an mds data point ``/var/lib/ceph/mds/ceph-${id}``. The daemon only uses this directory to store its keyring.

-#. Create the authentication key, if you use CephX. ::
+#. Create the authentication key, if you use CephX: ::

     $ sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring

-#. Start the service. ::
+#. Start the service: ::

     $ sudo systemctl start ceph-mds@${id}

@@ -76,6 +76,11 @@ Adding an MDS

     mds: ${id}:1 {0=${id}=up:active} 2 up:standby

+#. Optionally, configure the file system the MDS should join (:ref:`mds-join-fs`): ::
+
+    $ ceph config set mds.${id} mds_join_fs ${fs}
+
+
 Removing an MDS
 ===============

diff --git a/doc/cephfs/standby.rst b/doc/cephfs/standby.rst
index b50378397ae..22216c36f8d 100644
--- a/doc/cephfs/standby.rst
+++ b/doc/cephfs/standby.rst
@@ -105,3 +105,85 @@ standby for the rank that it is following. If another rank fails, this
 standby-replay daemon will not be used as a replacement, even if no other
 standbys are available. For this reason, it is advised that if standby-replay
 is used then every active MDS should have a standby-replay daemon.
+
+.. _mds-join-fs:
+
+Configuring MDS file system affinity
+------------------------------------
+
+You may want to have an MDS used for a particular file system. Or, perhaps you
+have larger MDSs on better hardware that should be preferred over a last-resort
+standby on lesser or over-provisioned hardware. To express this preference,
+CephFS provides a configuration option for MDS called ``mds_join_fs`` which
+enforces this `affinity`.
+
+As part of any failover, the Ceph monitors will prefer standby daemons with
+``mds_join_fs`` equal to the file system name with the failed rank. If no
+standby exists with ``mds_join_fs`` equal to the file system name, it will
+choose a `vanilla` standby (no setting for ``mds_join_fs``) for the replacement
+or any other available standby as a last resort. Note, this does not change the
+behavior that ``standby-replay`` daemons are always selected before looking at
+other standbys.
+
+Even further, the monitors will regularly examine the CephFS file systems when
+stable to check if a standby with stronger affinity is available to replace an
+MDS with lower affinity. This process is also done for standby-replay daemons:
+if a regular standby has stronger affinity than the standby-replay MDS, it will
+replace the standby-replay MDS.
+
+For example, given this stable and healthy file system:
+
+::
+
+    $ ceph fs dump
+    dumped fsmap epoch 399
+    ...
+    Filesystem 'cephfs' (27)
+    ...
+    e399
+    max_mds 1
+    in 0
+    up {0=20384}
+    failed
+    damaged
+    stopped
+    ...
+    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
+
+    Standby daemons:
+
+    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+
+You may set ``mds_join_fs`` on the standby to enforce your preference: ::
+
+    $ ceph config set mds.b mds_join_fs cephfs
+
+after automatic failover: ::
+
+    $ ceph fs dump
+    dumped fsmap epoch 405
+    e405
+    ...
+    Filesystem 'cephfs' (27)
+    ...
+    max_mds 1
+    in 0
+    up {0=10420}
+    failed
+    damaged
+    stopped
+    ...
+    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
+
+    Standby daemons:
+
+    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]
+
+Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
+output, the file system name from ``mds_join_fs`` is changed to the file system
+identifier (27). If the file system is recreated with the same name, the
+standby will follow the new file system as expected.
+
+Finally, if the file system is degraded or undersized, no failover will occur
+to enforce ``mds_join_fs``.
-- 
2.39.5
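
Reviewer note: the preference order described in the new "Configuring MDS file system affinity" section can be sketched as a small ranking function. This is only an illustration of the documented order (standby-replay first, then a matching ``mds_join_fs``, then a "vanilla" standby, then any other standby); it is not the monitor's implementation. The ``Standby`` class, its fields, and both function names are hypothetical and do not exist in Ceph.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Standby:
    """Hypothetical stand-in for one standby daemon; not a Ceph type."""
    name: str
    join_fs: Optional[str] = None      # value of mds_join_fs, None if unset
    follows_failed_rank: bool = False  # standby-replay daemon for the failed rank


def affinity_tier(standby: Standby, fs_name: str) -> int:
    """Rank a candidate; lower is preferred, per the order in the doc text."""
    if standby.follows_failed_rank:
        return 0  # standby-replay daemons are selected before other standbys
    if standby.join_fs == fs_name:
        return 1  # mds_join_fs names the file system with the failed rank
    if standby.join_fs is None:
        return 2  # "vanilla" standby: no mds_join_fs set
    return 3      # last resort: standby that prefers a different file system


def pick_replacement(candidates: Sequence[Standby], fs_name: str) -> Optional[Standby]:
    """Return the most preferred candidate, or None when there are none.

    Assumption: standby-replay daemons following *other* ranks are not
    passed in, since the docs say they are never used as replacements.
    """
    return min(candidates, key=lambda s: affinity_tier(s, fs_name), default=None)
```

For instance, with candidates ``Standby("a")`` and ``Standby("b", join_fs="cephfs")``, ``pick_replacement(..., "cephfs")`` returns the entry named ``b``, which mirrors the ``mds.a`` to ``mds.b`` failover shown in the example output.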