--- /dev/null
+
+Ceph filesystem client eviction
+===============================
+
+When a filesystem client is unresponsive or otherwise misbehaving, it
+may be necessary to forcibly terminate its access to the filesystem. This
+process is called *eviction*.
+
+This process is somewhat thorough in order to protect against data inconsistency
+resulting from misbehaving clients.
+
+OSD blacklisting
+----------------
+
+First, prevent the client from performing any more data operations by *blacklisting*
+it at the RADOS level. You may be familiar with this concept as *fencing* in other
+storage systems.
+
+Identify the client to be evicted by examining the MDS session list:
+
+::
+
+ # ceph daemon mds.a session ls
+ [
+ { "id": 4117,
+ "num_leases": 0,
+ "num_caps": 1,
+ "state": "open",
+ "replay_requests": 0,
+ "reconnecting": false,
+ "inst": "client.4117 172.16.79.251:0\/3271",
+ "client_metadata": { "entity_id": "admin",
+ "hostname": "fedoravm.localdomain",
+ "mount_point": "\/home\/user\/mnt"}}]
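+
+If you are scripting this step, the client's address can be extracted from
+the JSON output of ``session ls``. For example (a sketch that assumes the
+``jq`` utility is installed and that the session ID is already known):
+
+::
+
+ # ceph daemon mds.a session ls | jq -r '.[] | select(.id == 4117) | .inst'
+ client.4117 172.16.79.251:0/3271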
+
+In this case the 'fedoravm' client has address ``172.16.79.251:0/3271``, so we blacklist
+it as follows:
+
+::
+
+ # ceph osd blacklist add 172.16.79.251:0/3271
+ blacklisting 172.16.79.251:0/3271 until 2014-12-09 13:09:56.569368 (3600 sec)
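+
+The blacklist entry can be listed, and removed again later if needed, with the
+blacklist management commands (entries also expire on their own after the
+interval shown above, 3600 seconds by default):
+
+::
+
+ # ceph osd blacklist ls
+ # ceph osd blacklist rm 172.16.79.251:0/3271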
+
+OSD epoch barrier
+-----------------
+
+The evicted client is now marked as blacklisted in the central (mon) copy of the OSD
+map, but it is still necessary to ensure that this OSD map update has propagated to all
+daemons involved in subsequent filesystem I/O. To do this, use the ``osdmap barrier`` MDS admin
+socket command.
+
+First read the latest OSD epoch:
+
+::
+
+ # ceph osd dump
+ epoch 12
+ fsid fd61ca96-53ff-4311-826c-f36b176d69ea
+ created 2014-12-09 12:03:38.595844
+ modified 2014-12-09 12:09:56.619957
+ ...
+
+In this case it is 12. Now request the MDS to barrier on this epoch:
+
+::
+
+ # ceph daemon mds.a osdmap barrier 12
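+
+If you are scripting this workflow, the epoch can be read and passed to the
+barrier command in a single step. A sketch, assuming the ``jq`` utility is
+available:
+
+::
+
+ # ceph daemon mds.a osdmap barrier "$(ceph osd dump --format=json | jq .epoch)"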
+
+MDS session eviction
+--------------------
+
+Finally, it is safe to evict the client's MDS session, such that any capabilities it held
+may be issued to other clients. The ID here is the ``id`` attribute from the ``session ls``
+output:
+
+::
+
+ # ceph daemon mds.a session evict 4117
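+
+Running ``session ls`` again should show that the evicted client's session is
+no longer present:
+
+::
+
+ # ceph daemon mds.a session ls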
+
+That's it! The client has now been evicted, and any resources it had locked will
+now be available for other clients.
+
+Background: OSD epoch barrier
+-----------------------------
+
+The purpose of the barrier is to ensure that when we hand out any
+capabilities which might allow touching the same RADOS objects, the
+clients receiving those capabilities are guaranteed to have a sufficiently
+recent OSD map, so that they do not race with cancelled operations (from
+ENOSPC) or with blacklisted clients (from evictions).
+
+More specifically, the cases where we set an epoch barrier are:
+
+ * Client eviction (where the client is blacklisted and other clients
+   must wait for a post-blacklist epoch to touch the same objects).
+ * OSD map full flag handling in the client (where the client may
+   cancel some OSD ops from a pre-full epoch, so other clients must
+   wait until the full epoch or later before touching the same objects).
+ * MDS startup, because we don't persist the barrier epoch, so we must
+   assume that the latest OSD map is always required after a restart.
+
+Note that this is a global value, for simplicity. We could maintain it on
+a per-inode basis, but we don't, because:
+
+ * It would be more complicated
+ * It would use an extra 4 bytes of memory for every inode
+ * It would not be much more efficient, because almost everyone has the latest
+   OSD map anyway; in most cases everyone will breeze through this barrier
+   rather than waiting.
+ * We only do this barrier in very rare cases, so any benefit from per-inode
+ granularity would only very rarely be seen.
+
+The epoch barrier is transmitted along with all capability messages, and
+instructs the receiver of the message to avoid sending any more RADOS
+operations to OSDs until it has seen this OSD epoch. This mainly applies
+to clients (which write their file data directly to the OSDs), but also applies
+to the MDS because things like file size probing and file deletion are
+done directly from the MDS.
+
--- /dev/null
+
+Handling a full Ceph filesystem
+===============================
+
+When a RADOS cluster reaches its ``mon_osd_full_ratio`` (default
+95%) capacity, it is marked with the OSD full flag. This flag causes
+most normal RADOS clients to pause all operations until it is resolved
+(for example by adding more capacity to the cluster).
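+
+Current utilization, and any full or near-full health warnings, can be checked
+with the usual cluster status commands, for example:
+
+::
+
+ # ceph df
+ # ceph status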
+
+The filesystem has some special handling of the full flag, explained below.
+
+Hammer and later
+----------------
+
+Since the hammer release, a full filesystem will lead to ENOSPC
+errors being returned from:
+
+ * Data writes on the client
+ * Metadata operations other than deletes and truncates
+
+Because the full condition may not be encountered until
+data is flushed to disk (some time after a ``write`` call has already
+returned successfully), the ENOSPC error may not be seen until the application
+calls ``fsync`` or ``fclose`` (or equivalent) on the file handle.
+
+Calling ``fsync`` is guaranteed to reliably indicate whether the data
+made it to disk, and will return an error if it did not. ``fclose`` will
+only return an error if buffered data happened to be flushed since
+the last write -- a successful ``fclose`` does not guarantee that the
+data made it to disk, and in a full-space situation, buffered data
+may be discarded after an ``fclose`` if no space is available to persist it.
+
+.. warning::
+ If an application appears to be misbehaving on a full filesystem,
+ check that it is performing ``fsync()`` calls as necessary to ensure
+ data is on disk before proceeding.
+
+Data writes may be cancelled by the client if they are in flight at the
+time the OSD full flag is sent. Clients update the ``osd_epoch_barrier``
+when releasing capabilities on files affected by cancelled operations, in
+order to ensure that these cancelled operations do not interfere with
+subsequent access to the data objects by the MDS or other clients. For
+more on the epoch barrier mechanism, see :doc:`eviction`.
+
+Legacy (pre-hammer) behavior
+----------------------------
+
+In versions of Ceph earlier than hammer, the MDS would ignore
+the full status of the RADOS cluster, and any data writes from
+clients would stall until the cluster ceased to be full.
+
+There are two dangerous conditions to watch for with this behaviour:
+
+* If a client had pending writes to a file, then it was not possible
+ for the client to release the file to the MDS for deletion: this could
+  lead to difficulty clearing space on a full filesystem.
+* If clients continued to create a large number of empty files, the
+ resulting metadata writes from the MDS could lead to total exhaustion
+ of space on the OSDs such that no further deletions could be performed.
+