From 783bd25f5cf3bb7766e8383c2fe61e5dac9fc877 Mon Sep 17 00:00:00 2001 From: Alex Ainscow Date: Wed, 24 Sep 2025 16:58:18 +0100 Subject: [PATCH] docs: Split Reads design (EC Direct Reads and Replica splits) Signed-off-by: Alex Ainscow --- doc/dev/osd_internals/erasure_coding.rst | 1 + doc/dev/osd_internals/split_reads.rst | 443 +++++++++++++++++++++++ 2 files changed, 444 insertions(+) create mode 100644 doc/dev/osd_internals/split_reads.rst diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst index 7495c521b81..42bf3de3620 100644 --- a/doc/dev/osd_internals/erasure_coding.rst +++ b/doc/dev/osd_internals/erasure_coding.rst @@ -86,3 +86,4 @@ Table of contents Jerasure plugin High level design document Erasure coding enhancements design document + Split reads design document diff --git a/doc/dev/osd_internals/split_reads.rst b/doc/dev/osd_internals/split_reads.rst new file mode 100644 index 00000000000..3f18e08d800 --- /dev/null +++ b/doc/dev/osd_internals/split_reads.rst @@ -0,0 +1,443 @@ +============================================== +Design Document for Erasure Coded Direct Reads +============================================== + +Basic design +------------ + +Small reads +~~~~~~~~~~~ + +Currently, all reads in an erasure-coded pool are directed to the primary +OSD of each PG. For a small (sub-chunk) read, this results in the following +message flow: + +.. ditaa:: + + client -> primary -> OSD (read) -> primary -> client + +The proposed direct read mechanism allows the client to bypass the primary and +interact directly with the target OSD, streamlining the message flow: + +.. ditaa:: + + client -> OSD (read) -> client + +In the event of OSD unavailability or a read error, the client will fall back +to the legacy, primary-mediated read method to ensure data can be rebuilt from +the coding shards. + +Medium sized reads +~~~~~~~~~~~~~~~~~~ + +Reads are classified as 'medium' when they span multiple data shards but are +contained within a single stripe. The standard message flow for such a read is: + +.. ditaa:: + + +-> OSD 1 --+ + | | + client --> primary -+-> OSD 2 --+--> primary --> client + | | + +-> OSD 3 --+ + +With the direct read mechanism, the message flow is optimized as follows: + +.. ditaa:: + + +-> OSD 1 --+ + | | + client -+-> OSD 2 --+--> client + | | + +-> OSD 3 --+ + +Upon successful receipt of all data shards, the client is responsible for +reconstructing the original data by concatenating the buffers in the correct +sequence. + +Large size reads +~~~~~~~~~~~~~~~~ + +Large reads are those that exceed the size of a single data stripe. While each +OSD can still perform a single read operation, the client cannot merely +concatenate the data buffers. It must perform a de-striping operation to +reconstruct the object's data from the various stripe units. + +Objecter +-------- + +The client's Objecter component is responsible for orchestrating direct reads +through the following process: + +Prior to initiating a direct read, the Objecter performs a preliminary, +best-effort check based on its current OSD map to verify that all required +OSDs are available. The possibility of race conditions arising from a stale OSD +map is considered a rare event with minimal performance impact. If any +required OSD is marked down, the read immediately falls back to the primary. + +If all necessary OSDs are available, a new ``SplitRead`` object is instantiated +to manage the direct read operation. 
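+
+The following is a minimal sketch of this eligibility check, written in Python
+for brevity (the real implementation lives in the C++ Objecter). The helper
+names (``acting_set``, ``is_up``) and the ``SplitRead`` construction details
+are illustrative assumptions, not existing interfaces:
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+
+    @dataclass
+    class SplitRead:
+        """Placeholder for the object that tracks one direct (split) read."""
+        pg: object
+        shards: list
+        extent: tuple              # (offset, length)
+
+    def start_read(osdmap, pg, shards_needed, extent, send_direct, send_to_primary):
+        """Best-effort eligibility check before attempting a direct read."""
+        acting = osdmap.acting_set(pg)                 # shard index -> OSD id
+        required = [acting[s] for s in shards_needed]
+
+        # The check uses the client's current, possibly stale, OSD map; a stale
+        # map simply means an occasional unnecessary fallback.
+        if any(not osdmap.is_up(osd) for osd in required):
+            return send_to_primary(extent)             # legacy, primary-mediated read
+
+        # All shard OSDs look available: hand the read to a SplitRead instance,
+        # which issues the per-shard reads in parallel (described below).
+        return send_direct(SplitRead(pg, shards_needed, extent))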
+ +The ``SplitRead`` object calculates the target OSDs and submits read requests to +each one in parallel. + +Upon completion of all parallel reads, the ``SplitRead`` object inspects the +results for failures. + +a. If all reads were successful, it de-stripes (for large reads) or + concatenates (for medium reads) the data and completes the client's + original read request. +b. If any of the reads failed, it will discard any partial data and reissue + the entire read operation to the primary OSD. While client-side recovery + using parity shards is theoretically possible, it is avoided by design. + Implementing the complex EC plugin architecture within every client is + considered undesirable. Given that OSD failures are infrequent events to + which the cluster adapts rapidly, the cost-benefit analysis strongly + favors this simpler, more robust fallback mechanism. + +EC Backend +---------- + +The Erasure Coding backend will be enhanced to implement synchronous reads. This +functionality will be utilised exclusively by the direct-read mechanism and +will be identified by the ``BALANCED_READ`` flag on an EC read operation. There +will be a single attempt at the read; should this attempt fail, the I/O +operation will be failed back to the client without further retries at this +layer. + +Missing objects +--------------- + +In scenarios where an object is temporarily unavailable due to recovery +operations, the OSD will fail the direct read request. A new error code (name +to be determined) will be introduced that instructs the client to not only +retry the I/O but also to disable direct reads for the affected PG until the +next epoch. This mechanism prevents the introduction of excessive latency +during cluster recovery activities. + +Torn Writes +----------- + +A torn write is a potential consistency issue that arises from race conditions +between read and write operations. This condition will be mitigated through a +version-checking mechanism. For any multi-OSD read, the "OI" (Object Info) +attribute on the primary is read *in parallel* with the data reads from the +other shards. A version mismatch is considered an infrequent event. Executing +this check concurrently with the data reads provides significant latency +improvements, especially for workloads that are bandwidth-limited due to large +chunk sizes. If a mismatch is detected between the version list from the +primary and the versions read from the other shards, the operation will fall +back to the legacy read mechanism. + +Changes to default chunk sizes +------------------------------ + +To verify RADOS object versions, medium-sized reads frequently necessitate an +additional I/O operation to the primary OSD. To reduce the frequency of these +multi-shard reads, the default chunk size will be increased. While recent +enhancements to erasure coding have removed many of the performance +limitations of large chunks in Ceph, a key trade-off persists concerning +storage efficiency for small objects. + +The primary limitation of an increased chunk size manifests when the average +object size is smaller than the chunk size. In this case, data chunks are left +partially empty, and storage overhead increases. For small files, a K+M pool +essentially degrades to a 1+M array, wasting space. Consequently, the default +chunk size must be increased judiciously. Our data shows that we cannot +practically increase chunk sizes beyond the current recommendation of 16k. 
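+
+As a rough illustration of this penalty, the following uses the simplification
+stated above (a sub-chunk object effectively behaves like data on a 1+M array);
+the pool geometry and sizes are examples only:
+
+.. code-block:: python
+
+    def effective_overhead(object_size, chunk_size, k, m):
+        """Approximate bytes stored per byte of user data for a single object."""
+        if object_size <= chunk_size:
+            # The object fits in one data chunk but still needs m coding chunks
+            # of the same size, so the pool behaves like a 1+m "array".
+            return (1 + m) / 1.0
+        # Objects much larger than a stripe approach the ideal EC overhead.
+        return (k + m) / k
+
+    # 4+2 pool: a large object sees ~1.5x overhead, but a 16k object stored
+    # with 256k chunks sees ~3x, i.e. no better than 3-way replication.
+    print(effective_overhead(4 * 1024 * 1024, 256 * 1024, k=4, m=2))  # 1.5
+    print(effective_overhead(16 * 1024, 256 * 1024, k=4, m=2))        # 3.0
+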
+However, future designs may allow for packing of multiple small objects into a +single EC stripe. This will mitigate the effect of the small-object penalty, +and we should then increase the chunk size. We believe that 256k is optimal for +HDD and 32k for SSD, but careful performance testing will be required. + +Read-write ordering +------------------- + +If the ``RWORDERING`` flag is not set on an I/O operation, a read may be +reordered with respect to any preceding, uncommitted write operations. + +For example, consider the following sequence of operations on a single object: + +#. Client submits ``Write_1`` containing data 'AAA' to the object. +#. ``Write_1`` completes. The object's content is now 'AAA'. +#. Client submits ``Write_2`` containing data 'BBB' to the same object. +#. Before ``Write_2`` completes, the client submits a ``Read`` operation. + +Under these conditions, the read operation is guaranteed to return the data from +either the completed ``Write_1`` ('AAA') or the subsequent ``Write_2`` ('BBB'), +but never a torn result containing a mixture of both (e.g., 'AAB'). This +provides a consistency guarantee analogous to that of balanced reads in +replicated pools. + +If the ``RWORDERING`` flag is set, the operation must use the legacy, +primary-mediated read path to guarantee strict ordering. This ensures strict +ordering is maintained, particularly in failure scenarios. If a direct read were +attempted and then retried via the fallback path to the primary, the timing +change could violate the ordering guarantee. Forcing the use of the +primary-mediated path from the beginning prevents this ambiguity and is +consistent with the equivalent implementation for replicated pools. + +Read Instability +---------------- + +Read instability is where read data changes after being first read, without a +subsequent write. + +Consider the following sequence: + +#. Write A (completes) +#. Write B (does not complete yet) +#. Read, get data B. +#. Client performs some activity, which assumes B is committed to disk. +#. Write is rolled back (and failed, or never completed). + +Even though the write B had not completed, it is often assumed by the client +that B will never be rolled back. + +Balanced replica reads deal with this scenario by rejecting a non-primary read +if uncommitted writes exist in the log for this object. Such a mechanism is +necessary for EC direct reads too if none of the reads apply to the primary. + +.. NOTE:: + A similar mechanism for replica is implemented here: + https://github.com/ceph/ceph/pull/49380 + +Kernel Client Modifications +--------------------------- + +Kernel-mode clients that use Ceph (such as the ``krbd`` driver for block +devices) will also require modification. The logic described in this document, +which is implemented in the user-space ``librados`` library, must be replicated +within the kernel drivers. + +This involves implementing the ``SplitRead`` mechanism, including the +calculation of target OSDs, the parallel submission of I/O requests, and the +subsequent reassembly (concatenation or de-striping) of data. This is a +non-trivial task, as kernel development brings additional complexity regarding +memory management, asynchronous I/O handling, and ensuring system stability. +The design and implementation of these kernel-level changes are recognised as +a necessary work item, but the specific approach will require a separate, +detailed investigation. 
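+
+For reference, the reassembly logic that any such client must reproduce is
+sketched below in Python for brevity (a kernel implementation would be in C).
+It assumes the usual EC layout of ``k`` data shards with a stripe unit equal
+to the chunk size; the function names are illustrative only:
+
+.. code-block:: python
+
+    def split_extent(offset, length, k, chunk_size):
+        """Yield (shard, shard_offset, length) pieces covering the logical extent."""
+        stripe_width = k * chunk_size
+        pos, end = offset, offset + length
+        while pos < end:
+            stripe_no = pos // stripe_width
+            in_stripe = pos % stripe_width
+            shard = in_stripe // chunk_size
+            in_chunk = in_stripe % chunk_size
+            take = min(chunk_size - in_chunk, end - pos)
+            yield shard, stripe_no * chunk_size + in_chunk, take
+            pos += take
+
+    def destripe(offset, length, k, chunk_size, shard_buffers):
+        """Reassemble a logical extent from per-shard reads.
+
+        ``shard_buffers`` maps shard -> bytes returned by that shard's single
+        contiguous read, starting at the first shard_offset requested from it.
+        """
+        cursors, out = {}, bytearray()
+        for shard, _, take in split_extent(offset, length, k, chunk_size):
+            start = cursors.get(shard, 0)
+            out += shard_buffers[shard][start:start + take]
+            cursors[shard] = start + take
+        return bytes(out)
+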
+Additionally, upstreaming the changes into the kernel itself may prove an
+interesting challenge!
+
+Performance, QoS and throttling
+-------------------------------
+
+The direct read mechanism fundamentally alters the I/O pattern from the
+perspective of the OSDs. Where previously a client read resulted in a single
+I/O operation to the primary OSD, this design will submit multiple, smaller
+I/O operations in parallel to different OSDs.
+
+The balance of network operations will change, and this will need to be
+carefully assessed during performance evaluation so that we can give end users
+sufficient data to decide whether to enable this optional feature.
+
+The splitting will significantly reduce network traffic within the cluster.
+The load on the public (client) network will see a more complex change:
+
+- Small IOs (sub-chunk) will see no increase in traffic at all.
+- Very large IOs will see an increase in the number of messages, but this is
+  likely insignificant compared to the associated data transfers.
+- For intermediate IOs the considerations are more nuanced: splitting the IOs
+  helps to further distribute network traffic, but leads to a larger overhead
+  because the IO count increases.
+
+CPU Load
+~~~~~~~~
+
+As part of the implementation, the impact on client CPU utilisation will be
+measured. The "destripe" mechanism is CPU intensive and results in
+non-contiguous buffers. This may be significant for applications where the
+Ceph client is competing for CPU resources with the client application.
+
+It is important that the CPU "cost" of this feature is negligible when it is
+switched off: the bypass code must be trivially simple.
+
+Throttling Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This change presents a challenge for the existing throttling framework. It is
+critical that the throttling behaviour for a single logical client read remains
+consistent, regardless of how many physical OSD operations it is split into.
+The precise method for ensuring that the throttling scheduler correctly
+accounts for these fragmented operations as a single logical unit is an open
+problem that requires further investigation. As such, the design for this
+component is not yet finalised and will be addressed during the implementation
+phase.
+
+Replica Balanced Reads
+----------------------
+
+For large, bandwidth-constrained I/O patterns, direct EC reads are expected to
+show significant improvements in latency compared to replica reads. Considering
+this, the replica read mechanism will also be adapted to utilise the
+``SplitRead`` object for large, bandwidth-constrained I/O. This will allow the
+read workload to be split across multiple replica OSDs, preventing a single OSD
+from becoming a bottleneck and improving overall performance for these
+workloads. A separate PR will be used for replica reads, but for now we will
+continue to use the same design document.
+
+Splitting Reads
+~~~~~~~~~~~~~~~
+
+Early prototypes suggest the following policy for splitting replica IOs:
+
+- SSD: if a replica IO is >= 2 x 128k, it will be split into I/Os of at least 128k
+- HDD: if a replica IO is >= 2 x 256k, it will be split into I/Os of at least 256k
+
+Further performance measurements will be conducted to verify these thresholds,
+and a user parameter will be provided to adjust them if required; the splitting
+policy is sketched below.
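+
+The sketch below uses the thresholds above. The function name, the media-type
+lookup and the idea of issuing at most one piece per replica are illustrative
+assumptions:
+
+.. code-block:: python
+
+    MIN_SPLIT = {"ssd": 128 * 1024, "hdd": 256 * 1024}
+
+    def split_replica_read(offset, length, media, num_replicas):
+        """Return (offset, length) pieces, one per read actually issued."""
+        unit = MIN_SPLIT[media]
+        if length < 2 * unit:
+            return [(offset, length)]          # below threshold: do not split
+        # Split into at most num_replicas pieces, each at least `unit` bytes.
+        pieces = min(num_replicas, length // unit)
+        base = length // pieces
+        out, pos = [], offset
+        for i in range(pieces):
+            take = base if i < pieces - 1 else offset + length - pos
+            out.append((pos, take))
+            pos += take
+        return out
+
+    # A 1 MiB read on SSD with three replicas becomes three ~341k reads;
+    # a 192k read stays as a single read.
+    print(split_replica_read(0, 1 << 20, "ssd", 3))
+    print(split_replica_read(0, 192 * 1024, "ssd", 3))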
+
+Plugin Support
+--------------
+
+Potentially this could be made to work with all plugins. However, to reduce
+testing overhead, we will restrict support to the following plugins:
+
+- Jerasure
+- ISA-L
+
+Supporting LRC is a stretch goal, but this is dependent on the enhanced EC
+work supporting LRC, which is not currently required for Umbrella.
+
+Supported Ops
+-------------
+
+Further investigation will be performed during implementation, so the following
+may change. The intent is to support enough ops that the vast majority of IO in
+realistic RBD, CephFS and RGW workloads performs EC direct reads. At the time
+of writing, the following seems like a reasonable restriction:
+
+- Objecter ops must contain a single op
+- Read
+- Sparse read
+
+It is possible, but more complex, to permit:
+
+- Multiple reads/sparse reads
+- Attribute reads (when performed with a read or sparse reads)
+
+Sparse Reads
+~~~~~~~~~~~~
+
+Sparse reads are currently treated as full reads by EC. Direct reads will
+support sparse reads: a new synchronous code path through the OSD EC backend is
+required in any case, and extending it to support sparse reads is simple enough
+to be worth implementing.
+
+No attempt will be made to support sparse reads on any other EC read path. This
+means that in failure scenarios the sparseness of the backend data will be
+lost. This would likely be a significant issue for encryption, so it will need
+to be addressed before encryption is supported. No client that supports EC can
+currently depend on sparse reads being fully implemented, so this should not
+cause any regressions in the clients.
+
+Object (user) Version
+---------------------
+
+Clients can (and regularly do) request an "object version". In the OSD this is
+known as the "user version" and is stored in the OI along with the OSD's
+interval version generated by the primary log PG. Clients request it by passing
+a non-null ``objver`` pointer in the read request. The objver is not always
+up-to-date on the non-primary OSDs, which means that any op requesting the
+objver must send a read to the primary, even if such a read is otherwise empty.
+This will have a significant impact on the performance of small reads.
+Requesting objver was previously negligible from a performance point of view;
+as such, a review of RBD, CephFS and RGW will be conducted to determine whether
+objver is sometimes requested unnecessarily.
+
+For replica split reads, the user version is up to date with the data on all
+shards, so the object version will be requested from a single arbitrary read.
+
+Testing
+-------
+
+ceph_test_rados_io_sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Small vs Medium vs Large Reads
+++++++++++++++++++++++++++++++
+
+Already covered by the IO exerciser through the existing design of
+ceph_test_rados_io_sequence. Running through this should give us coverage of
+the various read sizes.
+
+Objecter
+++++++++
+
+Already covered by the IO exerciser, which currently sends all IOs through
+librados and hence the Objecter. Running through this should give us coverage
+of various reads.
+
+Stale Maps/Fallback path
+++++++++++++++++++++++++
+
+The IO exerciser currently has two injects for reads:
+
+#. Type = 0 injects EIO when reading data; the shard parameter indicates which
+   shard to fail. The inject is sent to the primary OSD.
+#. Type = 1 makes reads pretend a shard is missing, which immediately triggers
+   extra reads for a decode; the shard parameter indicates which shard to fail.
+   The inject is sent to the primary OSD.
+
+As it stands, these will not work with direct reads. We will need to modify the
+IO exerciser so that injects are sent to the shard they target, and modify the
+injects themselves so they can be called on the synchronous read path.
+
+Missing Objects
++++++++++++++++
+
+We could add a Type = 2 inject, similar to Type = 0, which returns our new
+error code instead of EIO. We would need either a hook into the Objecter to
+tell it to start sending direct reads again within the same epoch, or to force
+a new epoch afterwards. This would allow us to exercise the fallback path of
+the client.
+
+We might also want to enhance the ``--testrecovery`` flag to accept something
+like ``--testrecovery="read:2"`` so that this new inject can be exercised
+specifically.
+
+Read/Write Ordering
++++++++++++++++++++
+
+A new ConcurrentWriteRead op could be added which asserts that the value read
+afterwards is either entirely the original data or entirely the new write, and
+throws a miscompare otherwise. It would be possible to keep statistics on which
+of the two results we get, to ensure that both outcomes occur when the ordering
+flag is not set, and to assert that the ordering is respected when it is set.
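+
+A minimal sketch of the intended check is shown below. The op name comes from
+the paragraph above; the buffer handling, statistics and ``ordered`` flag
+handling are illustrative assumptions:
+
+.. code-block:: python
+
+    def check_concurrent_read(read_buf, old_data, new_data, stats, ordered):
+        """Verify that a read racing an in-flight write saw a consistent snapshot."""
+        if read_buf == new_data:
+            stats["new"] += 1                  # read ordered after the racing write
+        elif read_buf == old_data:
+            if ordered:
+                # With RWORDERING set, the read must not be reordered before
+                # the write it was submitted after.
+                raise AssertionError("ordering violated: stale data returned")
+            stats["old"] += 1                  # read ordered before the racing write
+        else:
+            raise AssertionError("torn read: mixture of old and new data")
+
+    stats = {"old": 0, "new": 0}
+    check_concurrent_read(b"AAA", b"AAA", b"BBB", stats, ordered=False)
+    check_concurrent_read(b"BBB", b"AAA", b"BBB", stats, ordered=True)
+    print(stats)      # over a long run both outcomes should be observed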
+
+Along with this op, a new sequence that exercises it with various small, medium
+and large writes and reads would probably be needed.
+
+Kernel level testing
+++++++++++++++++++++
+
+This area needs more consideration. It is currently not covered by
+ceph_test_rados_io_sequence, and we need to investigate whether there is a
+kernel-space equivalent of librados which the IO exerciser can use.
+
+If not, then falling back to testing kRBD specifically may require a different
+script which can exercise different object sizes and stale map responses.
+
+Miscellaneous
++++++++++++++
+
+It is possible to set the balanced-reads flag directly on the IO itself, giving
+fine-grained, per-IO control over whether balanced reads are used. The IO
+exerciser can decide, at the point of issuing a read, whether it wants to test
+a balanced read. The ReadOp can take an optional parameter to override this,
+allowing an op in a sequence to always or never be balanced, and an equivalent
+optional parameter can be added to the (failed and non-failed) ReadOps in
+interactive mode for users who wish to specify the value explicitly.
+
+ceph_test_rados
+~~~~~~~~~~~~~~~
+
+ceph_test_rados is an existing tool that was very useful for finding issues
+during the development of optimised EC. There are no enhancements planned in
+this area, but the tool will be used heavily during development to check for
+regressions.
+
-- 
2.39.5