From: John Spray
Date: Tue, 9 Dec 2014 13:21:44 +0000 (+0000)
Subject: doc: add cephfs ENOSPC and eviction information
X-Git-Tag: v0.91~47^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=refs%2Fpull%2F2836%2Fhead;p=ceph.git

doc: add cephfs ENOSPC and eviction information

Adding this at this time to give us a sensible place to talk about
the epoch barrier stuff.

The eviction stuff will probably get simplified once we add a mon-side
eviction command that handles blacklisting and MDS session eviction in
one go.

Signed-off-by: John Spray
---

diff --git a/doc/cephfs/eviction.rst b/doc/cephfs/eviction.rst
new file mode 100644
index 000000000000..bd201cebbf76
--- /dev/null
+++ b/doc/cephfs/eviction.rst
@@ -0,0 +1,119 @@
+
+Ceph filesystem client eviction
+===============================
+
+When a filesystem client is unresponsive or otherwise misbehaving, it
+may be necessary to forcibly terminate its access to the filesystem.
+This process is called *eviction*.
+
+The process is somewhat thorough in order to protect against data
+inconsistency resulting from misbehaving clients.
+
+OSD blacklisting
+----------------
+
+First, prevent the client from performing any more data operations by
+*blacklisting* it at the RADOS level.  You may be familiar with this
+concept as *fencing* in other storage systems.
+
+Identify the client to evict from the MDS session list:
+
+::
+
+    # ceph daemon mds.a session ls
+    [
+        { "id": 4117,
+          "num_leases": 0,
+          "num_caps": 1,
+          "state": "open",
+          "replay_requests": 0,
+          "reconnecting": false,
+          "inst": "client.4117 172.16.79.251:0\/3271",
+          "client_metadata": { "entity_id": "admin",
+              "hostname": "fedoravm.localdomain",
+              "mount_point": "\/home\/user\/mnt"}}]
+
+In this case the 'fedoravm' client has address ``172.16.79.251:0/3271``,
+so we blacklist it as follows:
+
+::
+
+    # ceph osd blacklist add 172.16.79.251:0/3271
+    blacklisting 172.16.79.251:0/3271 until 2014-12-09 13:09:56.569368 (3600 sec)
+
+OSD epoch barrier
+-----------------
+
+The evicted client is now marked as blacklisted in the central (mon)
+copy of the OSD map, but it is still necessary to ensure that this OSD
+map update has propagated to all daemons involved in subsequent
+filesystem I/O.  To do this, use the ``osdmap barrier`` MDS admin
+socket command.
+
+First read the latest OSD epoch:
+
+::
+
+    # ceph osd dump
+    epoch 12
+    fsid fd61ca96-53ff-4311-826c-f36b176d69ea
+    created 2014-12-09 12:03:38.595844
+    modified 2014-12-09 12:09:56.619957
+    ...
+
+In this case the epoch is 12.  Now request the MDS to barrier on this
+epoch:
+
+::
+
+    # ceph daemon mds.a osdmap barrier 12
+
+MDS session eviction
+--------------------
+
+Finally, it is safe to evict the client's MDS session, such that any
+capabilities it held may be issued to other clients.  The ID here is
+the ``id`` attribute from the ``session ls`` output:
+
+::
+
+    # ceph daemon mds.a session evict 4117
+
+That's it!  The client has now been evicted, and any resources it had
+locked will now be available for other clients.
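
(Illustrative aside, not part of the patch above.) The blacklist entry created
in the first step can be listed, and removed once the evicted client is ready
to mount again, with the ``ceph osd blacklist ls`` / ``ceph osd blacklist rm``
subcommands. A minimal sketch reusing the address from the example session;
exact output wording varies between releases:

::

    # ceph osd blacklist ls
    listed 1 entries
    172.16.79.251:0/3271 2014-12-09 13:09:56.569368

    # ceph osd blacklist rm 172.16.79.251:0/3271
    un-blacklisting 172.16.79.251:0/3271

Letting the entry simply expire (3600 seconds in the example above) is equally
valid; removing it early mainly matters when the same client address needs to
mount again promptly.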
+
+Background: OSD epoch barrier
+-----------------------------
+
+The purpose of the barrier is to ensure that when we hand out any
+capabilities which might allow touching the same RADOS objects, the
+clients receiving those capabilities have a sufficiently recent OSD
+map that they cannot race with cancelled operations (from ENOSPC) or
+blacklisted clients (from evictions).
+
+More specifically, the cases where we set an epoch barrier are:
+
+ * Client eviction (where the client is blacklisted and other clients
+   must wait for a post-blacklist epoch to touch the same objects).
+ * OSD map full flag handling in the client (where the client may
+   cancel some OSD ops from a pre-full epoch, so other clients must
+   wait until the full epoch or later before touching the same objects).
+ * MDS startup, because we don't persist the barrier epoch, so we must
+   assume that the latest OSD map is always required after a restart.
+
+Note that this is a global value for simplicity: we could maintain it
+on a per-inode basis instead.  We don't, because:
+
+ * It would be more complicated.
+ * It would use an extra 4 bytes of memory for every inode.
+ * It would not be much more efficient: almost everyone has the latest
+   OSD map anyway, so in most cases everyone would breeze through this
+   barrier rather than waiting.
+ * We only set this barrier in very rare cases, so any benefit from
+   per-inode granularity would only very rarely be seen.
+
+The epoch barrier is transmitted along with all capability messages,
+and instructs the receiver of the message to avoid sending any more
+RADOS operations to OSDs until it has seen this OSD epoch.  This
+mainly applies to clients (doing their data writes directly to files),
+but also applies to the MDS because things like file size probing and
+file deletion are done directly from the MDS.
+
diff --git a/doc/cephfs/full.rst b/doc/cephfs/full.rst
new file mode 100644
index 000000000000..a58b94c77bbf
--- /dev/null
+++ b/doc/cephfs/full.rst
@@ -0,0 +1,60 @@
+
+Handling a full Ceph filesystem
+===============================
+
+When a RADOS cluster reaches its ``mon_osd_full_ratio`` (default 95%)
+capacity, it is marked with the OSD full flag.  This flag causes most
+normal RADOS clients to pause all operations until it is resolved (for
+example by adding more capacity to the cluster).
+
+The filesystem has some special handling of the full flag, explained
+below.
+
+Hammer and later
+----------------
+
+Since the hammer release, a full filesystem will lead to ENOSPC
+results from:
+
+ * Data writes on the client
+ * Metadata operations other than deletes and truncates
+
+Because the full condition may not be encountered until data is
+flushed to disk (some time after a ``write`` call has already returned
+successfully), the ENOSPC error may not be seen until the application
+calls ``fsync`` or ``fclose`` (or equivalent) on the file handle.
+
+Calling ``fsync`` is guaranteed to reliably indicate whether the data
+made it to disk, and will return an error if it doesn't.  ``fclose``
+will only return an error if buffered data happened to be flushed
+since the last write -- a successful ``fclose`` does not guarantee
+that the data made it to disk, and in a full-space situation, buffered
+data may be discarded after an ``fclose`` if no space is available to
+persist it.
+
+.. warning::
+   If an application appears to be misbehaving on a full filesystem,
+   check that it is performing ``fsync()`` calls as necessary to ensure
+   data is on disk before proceeding.
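
(Illustrative aside, not part of the patch.) Before assuming an application is
at fault, it can help to confirm whether the cluster is actually in the full
state. A quick sketch using standard monitor status commands; output is
abbreviated and varies by release:

::

    # ceph health
    HEALTH_ERR 1 full osd(s)

    # ceph osd dump | grep flags
    flags full

If the ``full`` flag is set, data writes will be returning ENOSPC as described
above, and the first remedy is to free or add capacity rather than to debug
the client.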
+
+Data writes may be cancelled by the client if they are in flight at
+the time the OSD full flag is sent.  Clients update the
+``osd_epoch_barrier`` when releasing capabilities on files affected by
+cancelled operations, in order to ensure that these cancelled
+operations do not interfere with subsequent access to the data objects
+by the MDS or other clients.  For more on the epoch barrier mechanism,
+see :doc:`eviction`.
+
+Legacy (pre-hammer) behavior
+----------------------------
+
+In versions of Ceph earlier than hammer, the MDS would ignore the full
+status of the RADOS cluster, and any data writes from clients would
+stall until the cluster ceased to be full.
+
+There are two dangerous conditions to watch for with this behavior:
+
+* If a client had pending writes to a file, then it was not possible
+  for the client to release the file to the MDS for deletion: this
+  could lead to difficulty clearing space on a full filesystem.
+* If clients continued to create a large number of empty files, the
+  resulting metadata writes from the MDS could lead to total
+  exhaustion of space on the OSDs, such that no further deletions
+  could be performed.
+
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 7821701f5e0b..be55ecfa6f9d 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -81,6 +81,8 @@ authentication keyring.
     libcephfs <../../api/libcephfs-java/>
     cephfs-journal-tool
     File layouts
+    Client eviction
+    Handling full filesystems
     Troubleshooting

 .. raw:: html
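
(Illustrative aside, not part of the patch.) After extending the toctree as
above, the documentation can be rebuilt locally to check that the new pages
are generated. A sketch, assuming the in-tree ``admin/build-doc`` script and
its usual output directory; paths may differ between releases:

::

    $ ./admin/build-doc
    $ ls build-doc/output/html/cephfs/
    eviction.html  full.html  index.html  ...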