From 51c9b86bf1f7be41f568ef1594b305133cd8b118 Mon Sep 17 00:00:00 2001
From: Patrick Donnelly
Date: Wed, 10 Jan 2024 22:08:13 -0500
Subject: [PATCH] doc: add dev docs for quiesce protocol

Signed-off-by: Patrick Donnelly
---
 doc/dev/mds_internals/quiesce.rst | 93 +++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 doc/dev/mds_internals/quiesce.rst

diff --git a/doc/dev/mds_internals/quiesce.rst b/doc/dev/mds_internals/quiesce.rst
new file mode 100644
index 0000000000000..72f3a6ddd27d4
--- /dev/null
+++ b/doc/dev/mds_internals/quiesce.rst
@@ -0,0 +1,93 @@
+MDS Quiesce Protocol
+====================
+
+The MDS quiesce protocol is a mechanism for "quiescing" (quieting) a tree
+in a file system, stopping all write (and most read) I/O.
+
+The purpose of this API is to prevent multiple clients from interleaving reads
+and writes across an eventually consistent snapshot barrier where out-of-band
+communication exists between clients. This communication can lead to clients
+wrongly believing they've reached a checkpoint that is mutually recoverable
+via a snapshot.
+
+Mechanism
+---------
+
+The MDS quiesces I/O using a new ``quiesce_path`` internal request that obtains
+appropriate locks on the root of a tree and then launches a series of
+sub-requests for locking other inodes in the tree. The locks obtained will
+force clients to release caps and in-progress client/MDS requests to complete.
+
+The sub-requests launched are ``quiesce_inode`` internal requests which simply
+lock the inode, if the MDS is authoritative for the inode. Generally, these
+are rdlocks (read locks) on each inode metadata lock but the ``filelock`` is
+xlocked (exclusively locked) because its type allows for multiple readers and
+writers. Additionally, a new ``quiescelock`` is exclusively locked (more on
+that next).
+
+Because the ``quiesce_inode`` request will xlock the ``filelock`` and
+``quiescelock``, it only does so if run on the authoritative MDS. It is
+expected that the glue layer on top of the quiesce protocol will execute the
+same ``quiesce_path`` operation on each MDS rank. This allows each rank which
+may be authoritative for part of the tree to lock all inodes it is
+authoritative for.
+
+
+Inode Quiescelock
+-----------------
+
+The ``quiescelock`` is a new lock for inodes which supports quiescing I/O. It
+is a type of superlock where every client or MDS operation which accesses an
+inode lock will also implicitly acquire the ``quiescelock`` (read-only). In
+general, this lock is never held except for reading. When a subtree is
+quiesced, the ``quiesce_inode`` internal operation will hold ``quiescelock``
+exclusively, thereby denying the **new** acquisition of any other inode lock.
+The ``quiescelock`` must be ordered before all other locks (see
+``src/include/ceph_fs.h`` for ordering) in order to act as this superlock.
+
+The reason for this lock is to prevent an operation from blocking on acquiring
+locks held by ``quiesce_inode`` while still holding locks obtained
+during path traversal. Notably, the important locks are the ``snaplock`` and
+``policylock`` obtained via ``Locker::try_rdlock_snap_layout`` on all parents
+of the root inode of the request (the ``ino`` in the ``filepath`` struct). If
+that operation waits with those locks held, then a future ``mksnap`` on the
+root inode will be impossible.
+
+.. note:: The ``mksnap`` RPC only acquires a wrlock (write lock) on the
+          ``snaplock`` for the inode to be snapshotted.
+
+The way ``quiescelock`` helps prevent this is by being the first **mandatory**
+lock acquired and by its special handling when it cannot be acquired: all locks
+held by the operation are dropped and the operation waits for the
+``quiescelock`` to be available. The lock is mandatory in that all inode locks
+automatically include (add) the ``quiescelock`` when calling
+``Locker::acquire_locks``. So the expected normal flow is that an operation
+like ``getattr`` will perform its path traversal, acquiring parent and dentry
+locks, then attempt to acquire locks on the inode necessary for the requested
+client caps. The operation will fail to acquire the automatically included
+``quiescelock``, add itself to the ``quiescelock`` wait list, and then drop all
+held locks.
+
+There is a divergence in locking behavior for the root of the subvolume. The
+``quiescelock`` is only locked read-only. This allows the inode to be accessed
+by operations like ``mksnap`` which will implicitly acquire the ``quiescelock``
+read-only when locking the ``snaplock`` for writing. Additionally, if
+``Locker::acquire_locks`` will only acquire read locks without waiting, then it
+will skip the read-only lock on ``quiescelock``. This is to allow some forms of
+``lookup`` necessary for snapshot management (e.g. volumes plugin) at higher
+layers.
+
+
+Readable quiesced tree
+----------------------
+
+It may be desirable to allow readers to continue accessing a quiesced
+subvolume. One way to do that is to have a separate superlock (yuck) for read
+access, say ``quiescerlock``. If a "readable" quiesce is performed, then
+``quiescerlock`` is not xlocked by ``quiesce_inode``. Read locks on
+other (non-quiesce) locks will acquire a read lock only on ``quiescerlock`` and
+no longer on ``quiescelock``. Write locks would try to acquire both
+``quiescelock`` and ``quiescerlock`` (since writes may also read).
+
+Ideally, a new lock type could be used to handle both cases but no
+such lock type yet exists.
-- 
2.39.5