From a256c42d47e3f1455d62cd81478ba86edf396aa5 Mon Sep 17 00:00:00 2001
From: Patrick Donnelly
Date: Fri, 21 Jun 2019 17:17:01 -0700
Subject: [PATCH] doc/cephfs: improve add/remove MDS section

Include hardware details and update language for modern tools.

Fixes: http://tracker.ceph.com/issues/39620
Signed-off-by: Patrick Donnelly
---
 doc/cephfs/add-remove-mds.rst         | 97 +++++++++++++++++++++------
 doc/cephfs/createfs.rst               | 23 +++++--
 doc/cephfs/file-layouts.rst           |  1 +
 doc/cephfs/index.rst                  |  5 +-
 doc/rados/operations/erasure-code.rst |  2 +
 5 files changed, 99 insertions(+), 29 deletions(-)

diff --git a/doc/cephfs/add-remove-mds.rst b/doc/cephfs/add-remove-mds.rst
index 1f95913e94c4..c695fbbb5c2d 100644
--- a/doc/cephfs/add-remove-mds.rst
+++ b/doc/cephfs/add-remove-mds.rst
@@ -1,51 +1,106 @@
 ============================
- Add/Remove Metadata Server
+ Deploying Metadata Servers
 ============================

-You must deploy at least one metadata server daemon to use CephFS. Instructions are given here for setting up an MDS manually, but you might prefer to use another tool such as ceph-deploy or ceph-ansible.
+Each CephFS file system requires at least one MDS. The cluster operator will
+generally use an automated deployment tool to launch MDS servers as needed.
+Rook and Ansible (via the ceph-ansible playbooks) are recommended tools for
+doing this. For clarity, we also show the systemd commands that the
+deployment tool would run on bare metal.

 See `MDS Config Reference`_ for details on configuring metadata servers.

-Add a Metadata Server
-=====================
+Provisioning Hardware for an MDS
+================================

-#. Create an mds data point ``/var/lib/ceph/mds/ceph-{$id}``.
+The present version of the MDS is single-threaded and CPU-bound for most
+activities, including responding to client requests. Even under the most
+aggressive client loads, an MDS uses only about 2 to 3 CPU cores, because
+the miscellaneous upkeep threads work in tandem with the request handler.
+
+Nevertheless, it is recommended that an MDS server be well provisioned with
+a modern CPU with sufficient cores. Development is ongoing to make better
+use of available CPU cores in the MDS; future versions of Ceph are expected
+to improve MDS performance by taking advantage of more cores.
+
+The other dimension of MDS performance is the RAM available for caching. The
+MDS necessarily manages a distributed and cooperative metadata cache among
+all clients and other active MDSs. Therefore it is essential to provide the
+MDS with sufficient RAM to enable faster metadata access and mutation.
+
+Generally, an MDS serving a large cluster of clients (1000 or more) will use
+at least 64GB of cache (see also :doc:`/cephfs/cache-size-limits`). Larger
+caches are not well explored in the largest known community clusters; there
+may be diminishing returns where management of such a large cache negatively
+impacts performance in surprising ways. It is best to analyze the expected
+workloads to determine whether provisioning more RAM is worthwhile.
+
+In a bare-metal cluster, the best practice is to over-provision hardware for
+the MDS server. Even if a single MDS daemon is unable to fully utilize the
+hardware, it may be desirable later on to start more active MDS daemons on
+the same node to fully utilize the available cores and memory. Additionally,
+workloads on the cluster may show that performance improves with multiple
+active MDS daemons on the same node rather than a single over-provisioned
+MDS.
+
+Finally, be aware that CephFS is a highly-available file system: it supports
+standby MDS daemons (see also :ref:`mds-standby`) for rapid failover. To get
+a real benefit from deploying standbys, it is usually necessary to
+distribute MDS daemons across at least two nodes in the cluster. Otherwise,
+a hardware failure on a single node may result in the file system becoming
+unavailable.
+
+Co-locating the MDS with other Ceph daemons (hyperconverged deployment) is
+an effective and recommended way to accomplish this, so long as all daemons
+are configured to use the available hardware within certain limits. For the
+MDS, this generally means limiting its cache size.
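+
+As an illustrative example of such a limit (the 64GB figure is an assumption
+for a large deployment, not a general recommendation), the MDS cache can be
+capped via the ``mds_cache_memory_limit`` setting: ::
+
+    # cap the MDS cache at 64GB; size this to the expected workload
+    $ ceph config set mds mds_cache_memory_limit 68719476736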
+
+
+Adding an MDS
+=============
+
+#. Create an mds data directory ``/var/lib/ceph/mds/ceph-${id}``.
    The daemon only uses this directory to store its keyring.
 
 #. Edit ``ceph.conf`` and add MDS section. ::
 
-     [mds.{$id}]
+     [mds.${id}]
      host = {hostname}
 
 #. Create the authentication key, if you use CephX. ::
 
-     $ sudo ceph auth get-or-create mds.{$id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-{$id}/keyring
+     $ sudo ceph auth get-or-create mds.${id} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-${id}/keyring
 
 #. Start the service. ::
 
-     $ sudo service ceph start mds.{$id}
+     $ sudo systemctl start ceph-mds@${id}
 
-#. The status of the cluster shows: ::
+#. The status of the cluster should show: ::
 
-     mds: cephfs_a-1/1/1 up {0=c=up:active}, 3 up:standby
+     mds: ${fs_name}:1 {0=${id}=up:active} 2 up:standby
 
-Remove a Metadata Server
-========================
-
-.. note:: Ensure that if you remove a metadata server, the remaining metadata
-    servers will be able to service requests from CephFS clients. If that is not
-    possible, consider adding a metadata server before destroying the metadata
-    server you would like to take offline.
+Removing an MDS
+===============
 
 If you have a metadata server in your cluster that you'd like to remove, you may use
 the following method.
 
-#. Create a new Metadata Server as shown in the above section.
+#. (Optional) Create a new replacement Metadata Server. If there is no
+   replacement MDS to take over once this one is removed, the file system
+   will become unavailable to clients. If that is not desirable, consider
+   adding a metadata server before tearing down the metadata server you
+   would like to take offline. A quick way to confirm that a standby is
+   available is sketched after these steps.
+
+#. Stop the MDS to be removed. ::
+
+   $ sudo systemctl stop ceph-mds@${id}
 
-#. Stop the old Metadata Server and start using the new one. ::
+   The MDS will automatically notify the Ceph monitors that it is going down.
+   This enables the monitors to perform instantaneous failover to an
+   available standby, if one exists. It is unnecessary to use administrative
+   commands to effect this failover, such as ``ceph mds fail ${id}``.
 
-     $ ceph mds fail 
+#. Remove the ``/var/lib/ceph/mds/ceph-${id}`` directory on the MDS. ::
 
-#. Remove the ``/var/lib/ceph/mds/ceph-{$id}`` directory on the old Metadata server.
+     $ sudo rm -rf /var/lib/ceph/mds/ceph-${id}
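+
+Before stopping an MDS, it is worth verifying that a standby is actually
+available to take over; a minimal check (assuming the unit naming used
+above) might look like: ::
+
+     $ ceph fs status                      # confirm at least one MDS is up:standby
+     $ sudo systemctl stop ceph-mds@${id}
+     $ ceph fs status                      # confirm a standby has taken over the rank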
 
 .. _MDS Config Reference: ../mds-config-ref

diff --git a/doc/cephfs/createfs.rst b/doc/cephfs/createfs.rst
index d3d487e01510..757bd00acb73 100644
--- a/doc/cephfs/createfs.rst
+++ b/doc/cephfs/createfs.rst
@@ -8,11 +8,19 @@ Creating pools
 A Ceph filesystem requires at least two RADOS pools, one for data and one for metadata.
 When configuring these pools, you might consider:
 
-- Using a higher replication level for the metadata pool, as any data
-  loss in this pool can render the whole filesystem inaccessible.
-- Using lower-latency storage such as SSDs for the metadata pool, as this
-  will directly affect the observed latency of filesystem operations
-  on clients.
+- Using a higher replication level for the metadata pool, as any data loss in
+  this pool can render the whole filesystem inaccessible.
+- Using lower-latency storage such as SSDs for the metadata pool, as this will
+  directly affect the observed latency of filesystem operations on clients.
+- The data pool used to create the file system is the "default" data pool and
+  the location for storing all inode backtrace information, which is used for
+  hard link management and disaster recovery. For this reason, all inodes
+  created in CephFS have at least one object in the default data pool. If
+  erasure-coded pools are planned for the file system, it is usually better to
+  use a replicated pool for the default data pool to improve small-object
+  write and read performance when updating backtraces. Separately, another
+  erasure-coded data pool can be added (see also :ref:`ecpool`) that can be
+  used on an entire hierarchy of directories and files (see also
+  :ref:`file-layouts`).
 
 Refer to :doc:`/rados/operations/pools` to learn more about managing pools.  For
 example, to create two pools with default settings for use with a filesystem, you
@@ -23,6 +31,11 @@ might run the following commands:
 
     $ ceph osd pool create cephfs_data
    $ ceph osd pool create cephfs_metadata
 
+Generally, the metadata pool will hold at most a few gigabytes of data. For
+this reason, a smaller PG count is usually recommended; a count of 64 or 128
+is commonly used in practice for large clusters.
+
+
 Creating a filesystem
 =====================
 
diff --git a/doc/cephfs/file-layouts.rst b/doc/cephfs/file-layouts.rst
index 5334765136a5..b779b16d207a 100644
--- a/doc/cephfs/file-layouts.rst
+++ b/doc/cephfs/file-layouts.rst
@@ -1,3 +1,4 @@
+.. _file-layouts:
+
 File layouts
 ============
 
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index d55701d137d9..b57a005ee7cf 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -50,8 +50,7 @@ least one :term:`Ceph Metadata Server` running.
 .. toctree::
    :maxdepth: 1
 
-   Add/Remove MDS(s)
-   MDS states
+   Provision/Add/Remove MDS(s)
    MDS failover and standby configuration
    MDS Configuration Settings
    Client Configuration Settings
@@ -70,7 +69,7 @@ authentication keyring.
 .. toctree::
    :maxdepth: 1
 
-   Create CephFS
+   Create a CephFS file system
    Mount CephFS
    Mount CephFS as FUSE
    Mount CephFS in fstab
 
diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst
index 7808ad0b424a..9fe83a3f117d 100644
--- a/doc/rados/operations/erasure-code.rst
+++ b/doc/rados/operations/erasure-code.rst
@@ -1,3 +1,5 @@
+.. _ecpool:
+
 =============
  Erasure code
 =============
-- 
2.47.3
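
For reference, the createfs.rst guidance above can be exercised end to end.
The following is a minimal sketch: the pool, file system, and directory names
are illustrative, and the erasure-code profile is left at its default:

    $ ceph osd pool create cephfs_metadata 64            # small PG count for metadata
    $ ceph osd pool create cephfs_data                   # replicated default data pool
    $ ceph fs new cephfs cephfs_metadata cephfs_data
    $ ceph osd pool create cephfs_data_ec 64 64 erasure  # supplementary EC data pool
    $ ceph osd pool set cephfs_data_ec allow_ec_overwrites true
    $ ceph fs add_data_pool cephfs cephfs_data_ec
    # assumes the file system is mounted at /mnt/cephfs
    $ setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/archive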