From: John Spray
Date: Tue, 9 Dec 2014 13:21:44 +0000 (+0000)
Subject: doc: add cephfs ENOSPC and eviction information
X-Git-Tag: v0.91~47^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=refs%2Fpull%2F2836%2Fhead;p=ceph.git

doc: add cephfs ENOSPC and eviction information

Adding this at this time to give us a sensible place to talk about
the epoch barrier stuff.

The eviction stuff will probably get simplified once we add a mon-side
eviction command that handles blacklisting and MDS session eviction in
one go.

Signed-off-by: John Spray
---

diff --git a/doc/cephfs/eviction.rst b/doc/cephfs/eviction.rst
new file mode 100644
index 000000000000..bd201cebbf76
--- /dev/null
+++ b/doc/cephfs/eviction.rst
@@ -0,0 +1,119 @@
+
+Ceph filesystem client eviction
+===============================
+
+When a filesystem client is unresponsive or otherwise misbehaving, it
+may be necessary to forcibly terminate its access to the filesystem.
+This process is called *eviction*.
+
+The process is somewhat thorough in order to protect against data
+inconsistency resulting from misbehaving clients.
+
+OSD blacklisting
+----------------
+
+First, prevent the client from performing any more data operations by
+*blacklisting* it at the RADOS level.  You may be familiar with this
+concept as *fencing* in other storage systems.
+
+Identify the client to evict from the MDS session list:
+
+::
+
+    # ceph daemon mds.a session ls
+    [
+        { "id": 4117,
+          "num_leases": 0,
+          "num_caps": 1,
+          "state": "open",
+          "replay_requests": 0,
+          "reconnecting": false,
+          "inst": "client.4117 172.16.79.251:0\/3271",
+          "client_metadata": { "entity_id": "admin",
+              "hostname": "fedoravm.localdomain",
+              "mount_point": "\/home\/user\/mnt"}}]
+
+In this case the 'fedoravm' client has address ``172.16.79.251:0/3271``,
+so we blacklist it as follows:
+
+::
+
+    # ceph osd blacklist add 172.16.79.251:0/3271
+    blacklisting 172.16.79.251:0/3271 until 2014-12-09 13:09:56.569368 (3600 sec)
+
+OSD epoch barrier
+-----------------
+
+The evicted client is now marked as blacklisted in the central (mon)
+copy of the OSD map, but it is still necessary to ensure that this OSD
+map update has propagated to all daemons involved in subsequent
+filesystem I/O.  To do this, use the ``osdmap barrier`` MDS admin
+socket command.
+
+First read the latest OSD epoch:
+
+::
+
+    # ceph osd dump
+    epoch 12
+    fsid fd61ca96-53ff-4311-826c-f36b176d69ea
+    created 2014-12-09 12:03:38.595844
+    modified 2014-12-09 12:09:56.619957
+    ...
+
+In this case the epoch is 12.  Now request the MDS to barrier on this
+epoch:
+
+::
+
+    # ceph daemon mds.a osdmap barrier 12
+
+MDS session eviction
+--------------------
+
+Finally, it is safe to evict the client's MDS session, such that any
+capabilities it held may be issued to other clients.  The ID here is
+the ``id`` attribute from the ``session ls`` output:
+
+::
+
+    # ceph daemon mds.a session evict 4117
+
+That's it!  The client has now been evicted, and any resources it had
+locked will now be available for other clients.
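
(Illustrative aside, not part of the patch above.) The blacklist entry created
in the first step can be listed, and removed once the evicted client is ready
to mount again, with the ``ceph osd blacklist ls`` / ``ceph osd blacklist rm``
subcommands. A minimal sketch reusing the address from the example session;
exact output wording varies between releases:

::

    # ceph osd blacklist ls
    listed 1 entries
    172.16.79.251:0/3271 2014-12-09 13:09:56.569368

    # ceph osd blacklist rm 172.16.79.251:0/3271
    un-blacklisting 172.16.79.251:0/3271

Letting the entry simply expire (3600 seconds in the example above) is equally
valid; removing it early mainly matters when the same client address needs to
mount again promptly.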
+
+Background: OSD epoch barrier
+-----------------------------
+
+The purpose of the barrier is to ensure that when we hand out any
+capabilities which might allow touching the same RADOS objects, the
+clients receiving those capabilities have a sufficiently recent OSD
+map that they cannot race with cancelled operations (from ENOSPC) or
+blacklisted clients (from evictions).
+
+More specifically, the cases where we set an epoch barrier are:
+
+ * Client eviction (where the client is blacklisted and other clients
+   must wait for a post-blacklist epoch to touch the same objects).
+ * OSD map full flag handling in the client (where the client may
+   cancel some OSD ops from a pre-full epoch, so other clients must
+   wait until the full epoch or later before touching the same objects).
+ * MDS startup, because we don't persist the barrier epoch, so we must
+   assume that the latest OSD map is always required after a restart.
+
+Note that this is a global value for simplicity: we could maintain it
+on a per-inode basis instead.  We don't, because:
+
+ * It would be more complicated.
+ * It would use an extra 4 bytes of memory for every inode.
+ * It would not be much more efficient: almost everyone has the latest
+   OSD map anyway, so in most cases everyone would breeze through this
+   barrier rather than waiting.
+ * We only set this barrier in very rare cases, so any benefit from
+   per-inode granularity would only very rarely be seen.
+
+The epoch barrier is transmitted along with all capability messages,
+and instructs the receiver of the message to avoid sending any more
+RADOS operations to OSDs until it has seen this OSD epoch.  This
+mainly applies to clients (doing their data writes directly to files),
+but also applies to the MDS because things like file size probing and
+file deletion are done directly from the MDS.
+
diff --git a/doc/cephfs/full.rst b/doc/cephfs/full.rst
new file mode 100644
index 000000000000..a58b94c77bbf
--- /dev/null
+++ b/doc/cephfs/full.rst
@@ -0,0 +1,60 @@
+
+Handling a full Ceph filesystem
+===============================
+
+When a RADOS cluster reaches its ``mon_osd_full_ratio`` (default 95%)
+capacity, it is marked with the OSD full flag.  This flag causes most
+normal RADOS clients to pause all operations until it is resolved (for
+example by adding more capacity to the cluster).
+
+The filesystem has some special handling of the full flag, explained
+below.
+
+Hammer and later
+----------------
+
+Since the hammer release, a full filesystem will lead to ENOSPC
+results from:
+
+ * Data writes on the client
+ * Metadata operations other than deletes and truncates
+
+Because the full condition may not be encountered until data is
+flushed to disk (some time after a ``write`` call has already returned
+successfully), the ENOSPC error may not be seen until the application
+calls ``fsync`` or ``fclose`` (or equivalent) on the file handle.
+
+Calling ``fsync`` is guaranteed to reliably indicate whether the data
+made it to disk, and will return an error if it doesn't.  ``fclose``
+will only return an error if buffered data happened to be flushed
+since the last write -- a successful ``fclose`` does not guarantee
+that the data made it to disk, and in a full-space situation, buffered
+data may be discarded after an ``fclose`` if no space is available to
+persist it.
+
+.. warning::
+   If an application appears to be misbehaving on a full filesystem,
+   check that it is performing ``fsync()`` calls as necessary to ensure
+   data is on disk before proceeding.
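
(Illustrative aside, not part of the patch.) Before assuming an application is
at fault, it can help to confirm whether the cluster is actually in the full
state. A quick sketch using standard monitor status commands; output is
abbreviated and varies by release:

::

    # ceph health
    HEALTH_ERR 1 full osd(s)

    # ceph osd dump | grep flags
    flags full

If the ``full`` flag is set, data writes will be returning ENOSPC as described
above, and the first remedy is to free or add capacity rather than to debug
the client.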
+
+Data writes may be cancelled by the client if they are in flight at
+the time the OSD full flag is sent.  Clients update the
+``osd_epoch_barrier`` when releasing capabilities on files affected by
+cancelled operations, in order to ensure that these cancelled
+operations do not interfere with subsequent access to the data objects
+by the MDS or other clients.  For more on the epoch barrier mechanism,
+see :doc:`eviction`.
+
+Legacy (pre-hammer) behavior
+----------------------------
+
+In versions of Ceph earlier than hammer, the MDS would ignore the full
+status of the RADOS cluster, and any data writes from clients would
+stall until the cluster ceased to be full.
+
+There are two dangerous conditions to watch for with this behavior:
+
+* If a client had pending writes to a file, then it was not possible
+  for the client to release the file to the MDS for deletion: this
+  could lead to difficulty clearing space on a full filesystem.
+* If clients continued to create a large number of empty files, the
+  resulting metadata writes from the MDS could lead to total
+  exhaustion of space on the OSDs, such that no further deletions
+  could be performed.
+
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 7821701f5e0b..be55ecfa6f9d 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -81,6 +81,8 @@ authentication keyring.
     libcephfs <../../api/libcephfs-java/>
     cephfs-journal-tool
     File layouts
+    Client eviction
+    Handling full filesystems
     Troubleshooting

 .. raw:: html
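
(Illustrative aside, not part of the patch.) After extending the toctree as
above, the documentation can be rebuilt locally to check that the new pages
are generated. A sketch, assuming the in-tree ``admin/build-doc`` script and
its usual output directory; paths may differ between releases:

::

    $ ./admin/build-doc
    $ ls build-doc/output/html/cephfs/
    eviction.html  full.html  index.html  ...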