doc: doc/dev/osd_internals/erasure_coding/enhancements.rst

author Bill Scales <bill_scales@uk.ibm.com>

Wed, 3 Jul 2024 13:09:19 +0000 (13:09 +0000)

committer Bill Scales <bill_scales@uk.ibm.com>

Thu, 8 Aug 2024 11:55:17 +0000 (11:55 +0000)
author Bill Scales <bill_scales@uk.ibm.com>
Wed, 3 Jul 2024 13:09:19 +0000 (13:09 +0000)
committer Bill Scales <bill_scales@uk.ibm.com>
Thu, 8 Aug 2024 11:55:17 +0000 (11:55 +0000)
diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst

index 40064961bbaf7a765fb5f2f56f54f7b21ee01a79..7495c521b81b7793cb54d51eb0c07bd233053a4d 100644 (file)
--- a/doc/dev/osd_internals/erasure_coding.rst
+++ b/doc/dev/osd_internals/erasure_coding.rst
@@ -85,3 +85,4 @@ Table of contents
     Developer notes <erasure_coding/developer_notes>
     Jerasure plugin <erasure_coding/jerasure>
     High level design document <erasure_coding/ecbackend>
+   Erasure coding enhancements design document <erasure_coding/enhancements>
diff --git a/doc/dev/osd_internals/erasure_coding/enhancements.rst b/doc/dev/osd_internals/erasure_coding/enhancements.rst

new file mode 100644 (file)

index 0000000..dddf974
--- /dev/null
+++ b/doc/dev/osd_internals/erasure_coding/enhancements.rst
@@ -0,0 +1,1517 @@
+===========================
+Erasure coding enhancements
+===========================
+
+Objectives
+==========
+
+Our objective is to improve the performance of erasure coding, in particular
+for small random accesses to make it more viable to use erasure coding pools
+for storing block and file data.
+
+We are looking to reduce the number of OSD read and write accesses per client
+I/O (sometimes referred to as I/O amplification), reduce the amount of network
+traffic between OSDs (network bandwidth) and reduce I/O latency (time to
+complete read and write I/O operations). We expect the changes will also
+provide modest reductions to CPU overheads.
+
+While the changes are focused on enhancing small random accesses, some
+enhancements will provide modest benefits for larger I/O accesses and for
+object storage.
+
+The following sections give a brief description of the improvements we are
+looking to make. Please see the later design sections for more details
+
+Current Read Implementation
+---------------------------
+
+For reference this is how erasure code reads currently work
+
+.. ditaa::
+
+ RADOS Client
+                            * Current code reads all data chunks
+      ^                     * Discards unneeded data
+      |                     * Returns requested data to client
+ +----+----+
+ | Discard |                If data cannot be read then the coding parity
+ |unneeded |                chunks are read as well and are used to reconstruct
+ |  data   |                the data
+ +---------+
+    ^^^^
+    ||||
+    ||||
+    ||||
+    |||+----------------------------------------------+
+    ||+-------------------------------------+         |
+    |+----------------------------+         |         |
+    |                             |         |         |
+  .-----.                       .-----.   .-----.   .-----.   .-----.   .-----.
+ (       )                     (       ) (       ) (       ) (       ) (       )
+ |`-----'|                     |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       |                     |       | |       | |       | |       | |       |
+ |       |                     |       | |       | |       | |       | |       |
+ (       )                     (       ) (       ) (       ) (       ) (       )
+  `-----'                       `-----'   `-----'   `-----'   `-----'   `-----'
+ Primary                        OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+   OSD
+
+Note: All the diagrams illustrate a K=4 + M=2 configuration, however the
+concepts and techniques can be used for all K+M configurations.
+
+Partial Reads
+-------------
+
+If only a small amount of data is being read it is not necessary to read the
+whole stripe, for small I/Os ideally only a single OSD needs to be involved in
+reading the data. See also larger chunk size below.
+
+.. ditaa::
+
+ RADOS Client
+                            * Optimize by only reading required chunks
+      ^                     * For large chunk sizes and sub-chunk reads only
+      |                     read a sub-chunk
+ +----+----+
+ | Return  |                If data cannot be read then extra data and coding
+ |  data   |                parity chunks are read as well and are used to
+ |         |                reconstruct the data
+ +---------+
+     ^
+     |
+     |
+     |
+     |
+     |
+     +----------------------------+
+                                  |
+  .-----.                       .-----.   .-----.   .-----.   .-----.   .-----.
+ (       )                     (       ) (       ) (       ) (       ) (       )
+ |`-----'|                     |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       |                     |       | |       | |       | |       | |       |
+ |       |                     |       | |       | |       | |       | |       |
+ (       )                     (       ) (       ) (       ) (       ) (       )
+  `-----'                       `-----'   `-----'   `-----'   `-----'   `-----'
+ Primary                        OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+   OSD
+
+Pull Request https://github.com/ceph/ceph/pull/55196 is implementing most of
+this optimization, however it still issues full chunk reads.
+
+Current Overwrite Implementation
+--------------------------------
+
+For reference here is how erasure code overwrites currently work
+
+.. ditaa::
+
+ RADOS Client
+      |                     * Read all data chunks
+      |                     * Merges new data
+ +----v-----+               * Encodes new coding parities
+ | Read old |               * Writes data and coding parities
+ |Merge new |
+ |  Encode  |-------------------------------------------------------------+
+ |  Write   |---------------------------------------------------+         |
+ +----------+                                                   |         |
+    ^|^|^|^|                                                    |         |
+    |||||||+-------------------------------------------+        |         |
+    ||||||+-------------------------------------------+|        |         |
+    |||||+-----------------------------------+        ||        |         |
+    ||||+-----------------------------------+|        ||        |         |
+    |||+---------------------------+        ||        ||        |         |
+    ||+---------------------------+|        ||        ||        |         |
+    |v                            |v        |v        |v        v         v
+  .-----.                       .-----.   .-----.   .-----.   .-----.   .-----.
+ (       )                     (       ) (       ) (       ) (       ) (       )
+ |`-----'|                     |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       |                     |       | |       | |       | |       | |       |
+ |       |                     |       | |       | |       | |       | |       |
+ (       )                     (       ) (       ) (       ) (       ) (       )
+  `-----'                       `-----'   `-----'   `-----'   `-----'   `-----'
+ Primary                        OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+   OSD
+
+Partial Overwrites
+------------------
+
+Ideally we aim to be able to perform updates to erasure coded stripes by only
+updating a subset of the shards (those with modified data or coding
+parities). Avoiding performing unnecessary data updates on the other shards is
+easy, avoiding performing any metadata updates on the other shards is much
+harder (see design section on metadata updates).
+
+.. ditaa::
+
+ RADOS Client
+      |                     * Only read chunks that are not being overwritten
+      |                     * Merge new data
+ +----v-----+               * Encode new coding parities
+ | Read old |               * Only write modified data and parity shards
+ |Merge new |
+ |  Encode  |-------------------------------------------------------------+
+ |  Write   |---------------------------------------------------+         |
+ +----------+                                                   |         |
+    ^  |^ ^                                                     |         |
+    |  || |                                                     |         |
+    |  || +-------------------------------------------+         |         |
+    |  ||                                             |         |         |
+    |  |+-----------------------------------+         |         |         |
+    |  +---------------------------+        |         |         |         |
+    |                              |        |         |         |         |
+    |                              v        |         |         v         v
+  .-----.                       .-----.   .-----.   .-----.   .-----.   .-----.
+ (       )                     (       ) (       ) (       ) (       ) (       )
+ |`-----'|                     |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       |                     |       | |       | |       | |       | |       |
+ |       |                     |       | |       | |       | |       | |       |
+ (       )                     (       ) (       ) (       ) (       ) (       )
+  `-----'                       `-----'   `-----'   `-----'   `-----'   `-----'
+ Primary                        OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+   OSD
+
+This diagram is overly simplistic, only showing the data flows. The simplest
+implementation of this optimization retains a metadata write to every
+OSD. With more effort it is possible to reduce the number of metadata updates
+as well, see design below for more details.
+
+Parity-delta-write
+------------------
+
+A common technique used by block storage controllers implementing RAID5 and
+RAID6 is to implement what is sometimes called a parity delta write. When a
+small part of the stripe is being overwritten it is possible to perform the
+update by reading the old data, XORing this with the new data to create a
+delta and then read each coding parity, apply the delta and write the new
+parity. The advantage of this technique is that it can involve a lot less I/O,
+especially for K+M encodings with larger values of K. The technique is not
+specific to M=1 and M=2, it can be applied with any number of coding parities.
+
+.. ditaa::
+
+                        Parity delta writes
+                        * Read old data and XOR with new data to create a delta
+ RADOS Client           * Read old encoding parities apply the delta and write
+    |                     the new encoding parities
+    |                   
+    |                   For K+M erasure codings where K is larger and M is small
+    |  +-----+    +-----+  this is much more efficient
+    +->| XOR |-+->| GF  |---------------------------------------------------+
+  +-+->|     | |  |     |<------------------------------------------------+ |
+  | |  +-----+ |  +-----+                                                 | |
+  | |          |                                                          | |
+  | |          |  +-----+                                                 | |
+  | |          +->| XOR |-----------------------------------------+       | |
+  | |             |     |<--------------------------------------+ |       | |
+  | |             +-----+                                       | |       | |
+  | |                                                           | |       | |
+  | |                                                           | |       | |
+  | +-------------------------------+                           | |       | |
+  +-------------------------------+ |                           | |       | |
+                                  | |                           | |       | |
+                                  | v                           | v       | v
+  .-----.                       .-----.   .-----.   .-----.   .-----.   .-----.
+ (       )                     (       ) (       ) (       ) (       ) (       )
+ |`-----'|                     |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       |                     |       | |       | |       | |       | |       |
+ |       |                     |       | |       | |       | |       | |       |
+ (       )                     (       ) (       ) (       ) (       ) (       )
+  `-----'                       `-----'   `-----'   `-----'   `-----'   `-----'
+  Primary                        OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+    OSD
+
+Direct Read I/O
+---------------
+
+We want clients to submit small I/Os directly to the OSD that stores the data
+rather than directing all I/O requests to the Primary OSD and have it issue
+requests to the secondary OSDs. By eliminating an intermediate hop this
+reduces network bandwidth and improves I/O latency
+
+.. ditaa::
+
+         RADOS Client
+               ^
+               |
+          +----+----+     Client sends small read requests directly to OSD
+          | Return  |     avoiding extra network hop via Primary
+          |  data   |
+          |         |
+          +---------+
+               ^
+               |
+               |
+               |
+               |
+               |
+               |
+               |
+  .-----.   .-----.   .-----.   .-----.   .-----.   .-----.
+ (       ) (       ) (       ) (       ) (       ) (       )
+ |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       | |       | |       | |       | |       | |       |
+ |       | |       | |       | |       | |       | |       |
+ (       ) (       ) (       ) (       ) (       ) (       )
+  `-----'   `-----'   `-----'   `-----'   `-----'   `-----'
+  Primary    OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+    OSD
+
+
+.. ditaa::
+
+               RADOS Client
+               ^         ^
+               |         |
+          +----+----+ +--+------+  Client breaks larger read
+          | Return  | | Return  |  requests into separate
+          |  data   | |  data   |  requests to multiple OSDs
+          |         | |         |  
+          +---------+ +---------+  Note client loses atomicity
+               ^         ^         guarantees if this optimization
+               |         |         is used as an update could occur
+               |         |         between the two reads
+               |         |
+               |         |
+               |         |
+               |         |
+               |         |
+  .-----.   .-----.   .-----.   .-----.   .-----.   .-----.
+ (       ) (       ) (       ) (       ) (       ) (       )
+ |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       | |       | |       | |       | |       | |       |
+ |       | |       | |       | |       | |       | |       |
+ (       ) (       ) (       ) (       ) (       ) (       )
+  `-----'   `-----'   `-----'   `-----'   `-----'   `-----'
+  Primary    OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+    OSD
+
+Distributed processing of writes
+--------------------------------
+
+The existing erasure code implementation processes write I/Os on the primary
+OSD, issuing both reads and writes to other OSDs to fetch and update data for
+other shards. This is perhaps the simplest implementation, but it uses a lot
+of network bandwidth. With parity-delta-writes it is possible to distribute
+the processing across OSDs to reduce network bandwidth.
+
+.. ditaa::
+
+               Performing the coding parity delta updates on the coding parity
+               OSD instead of the primary OSD reduces network bandwidth
+ RADOS Client
+    |          Note: A naive implementation will increase latency by serializing
+    |          the data and coding parity reads, for best performance these
+    |          reads need to happen in parallel
+    |  +-----+                                                          +-----+
+    +->| XOR |-+------------------------------------------------------->| GF  |
+  +-+->|     | |                                                        |     |
+  | |  +-----+ |                                                        +----++
+  | |          |                                              +-----+     ^ |
+  | |          +--------------------------------------------->| XOR |     | |
+  | |                                                         |     |     | |
+  | |                                                         +---+-+     | |
+  | +-------------------------------+                           ^ |       | |
+  +-------------------------------+ |                           | |       | |
+                                  | |                           | |       | |
+                                  | |                           | |       | |
+                                  | |                           | |       | |
+                                  | |                           | |       | |
+                                  | v                           | v       | v
+  .-----.                       .-----.   .-----.   .-----.   .-----.   .-----.
+ (       )                     (       ) (       ) (       ) (       ) (       )
+ |`-----'|                     |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       |                     |       | |       | |       | |       | |       |
+ |       |                     |       | |       | |       | |       | |       |
+ (       )                     (       ) (       ) (       ) (       ) (       )
+  `-----'                       `-----'   `-----'   `-----'   `-----'   `-----'
+  Primary                        OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+    OSD
+
+Direct Write I/O
+----------------
+
+.. ditaa::
+
+             RADOS Client
+                  |
+                  |  Similarly Clients could direct small write I/Os
+                  |  to the OSD that needs updating
+                  |
+                  |  +-----+                        +-----+
+                  +->| XOR |-+--------------------->| GF  |
+            +-----+->|     | |                      |     | 
+            |     |  +-----+ |                      +----++
+            |     |          |            +-----+     ^ |
+            |     |          +----------->| XOR |     | |
+            |     |                       |     |     | |
+            |     |                       +---+-+     | |
+            |     |                         ^ |       | |
+            |     |                         | |       | |
+            |     |                         | |       | |
+            |     |                         | |       | |
+            |     |                         | |       | |
+            |     |                         | |       | |
+            |     v                         | v       | v
+  .-----.   .-----.   .-----.   .-----.   .-----.   .-----.
+ (       ) (       ) (       ) (       ) (       ) (       )
+ |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
+ |       | |       | |       | |       | |       | |       |
+ |       | |       | |       | |       | |       | |       |
+ (       ) (       ) (       ) (       ) (       ) (       )
+  `-----'   `-----'   `-----'   `-----'   `-----'   `-----'
+  Primary    OSD 2     OSD 3     OSD 4     OSD P     OSD Q
+    OSD
+
+This diagram is overly simplistic, only showing the data flows - direct writes
+are much harder to implement and will need control messages to the Primary to
+ensure writes to the same stripe are ordered correctly
+
+Larger chunk size
+-----------------
+
+The default chunk size is 4K, this is too small and means that small reads
+have to be split up and processed by many OSDs. It is more efficient if small
+I/Os can be serviced by a single OSD. Choosing a larger chunk size such as 64K
+or 256K and implementing partial reads and writes will fix this issue, but has
+the disadvantage that small sized RADOS objects get rounded up in size to a
+whole stripe of capacity.
+
+We would like the code to automatically choose what chunk size to use to
+optimize for both capacity and performance. Small objects should use a small
+chunk size like 4K, larger objects should use a larger chunk size.
+
+Code currently rounds up I/O sizes to multiples of the chunk size, which isn't
+an issue with a small chunk size. With a larger chunk size and partial
+reads/writes we should round up to the page size rather than the chunk size.
+
+Design
+======
+
+We will describe the changes we aim to make in three sections, the first
+section looks at the existing test tools for erasure coding and discusses the
+improvements we believe will be necessary to get good test coverage for the
+changes.
+
+The second section covers changes to the read and write I/O path.
+
+The third section discusses the changes to metadata to avoid the need to
+update metadata on all shards for each metadata update. While it is possible
+to implement many of the I/O path changes without reducing the number of
+metadata updates, there are bigger performance benefits if the number of
+metadata updates can be reduced as well.
+
+Test tools
+----------
+
+A survey of the existing test tools shows that there is insufficient coverage
+of erasure coding to be able to just make changes to the code and expect the
+existing CI pipelines to get sufficient coverage. Therefore one of the first
+steps will be to improve the test tools to be able to get better test
+coverage.
+
+Teuthology is the main test tool used to get test coverage and it relies
+heavily on the following tests for generating I/O:
+
+1. **rados** task - qa/tasks/rados.py. This uses ceph_test_rados
+   (src/test/osd/TestRados.cc) which can generate a wide mixture of different
+   rados operations. There is limited support for read and write I/Os,
+   typically using offset 0 although there is a chunked read command used by a
+   couple of tests.
+
+2. **radosbench** task - qa/tasks/radosbench.py. This uses the **rados bench**
+   (src/tools/rados/rados.cc and src/common/obj_bencher.cc). Can be used to
+   generate sequential and random I/O workloads, offset starts at 0 for
+   sequential I/O. I/O size can be set but is constant for whole test.
+
+3. **rbd_fio** task - qa/tasks/fio.py. This uses **fio** to generate
+   read/write I/O to an rbd image volume
+
+4. **cbt** task - qa/tasks/cbt.py. This uses the Ceph benchmark tool **cbt**
+   to run fio or radosbench to benchmark the performance of a cluster.
+
+5. **rbd bench**. Some of the standalone tests use rbd bench
+   (src/tools/rbd/action/Bench.cc) to generate small amounts of I/O
+   workload. It is also used by the **rbd_pwl_cache_recovery** task.
+
+It is hard to use these tools to get good coverage of I/Os to non-zero (and
+non-stripe aligned) offsets, or to generate a wide variety of offsets and
+lengths of I/O requests including all the boundary cases for chunks and
+stripes. There is scope to improve either rados, radosbench or rbd bench to
+generate much more interesting I/O patterns for testing erasure coding.
+
+For the optimizations described above it is essential that we have good tools
+for checking the consistency of either selected objects or all objects in an
+erasure coded pool by checking that the data and coding parities are
+coherent. There is a test tool **ceph-erasure-code-tool** which can use the
+plugins to encode and decode data provided in a set of files. However there
+does not seem to be any scripting in teuthology to perform consistency checks
+by using objectstore tool to read data and then using this tool to validate
+consistency. We will write some teuthology helpers that use
+ceph-objectstore-tool and ceph-erasure-code-tool to perform offline
+validation.
+
+We would also like an online way of performing full consistency checks, either
+for specific objects or for a whole pool. Inconveniently EC pools do not
+support class methods so it's not possible to use this as a way of
+implementing a full consistency check. We will investigate putting a flag on a
+read request, on the pool or implementing a new request type to perform a full
+consistency check on an object and look at making extensions to the rados CLI
+to be able to perform these tests. See also the discussion on deep scrub
+below.
+
+When there is more than one coding parity and there is an inconsistency
+between the data and the coding parities it is useful to try and analyze the
+cause of the inconsistency. Because the multiple coding parities are providing
+redundancy, there can be multiple ways of reconstructing each chunk and this
+can be used to detect the most like cause of the inconsistency. For example
+with a 4+2 erasure coding and a dropped write to 1st data OSD, the stripe (all
+6 OSDs) will be inconsistent, as will be any selection of 5 OSDs that includes
+the 1st data OSD, but data OSDs 2,3 and 4 and the two coding parity OSDs will
+be still be consistent. While there are many ways a stripe could get into this
+state, a tool could conclude that the most likely cause is a missed update to
+OSD 1. Ceph does not have a tool to perform this type of analysis, but it
+should be easy to extend ceph-erasure-code-tool.
+
+Teuthology seems to have adequate tools for taking OSDs offline and bringing
+them back online again. There are a few tools for injecting read I/O errors
+(without taking an OSD offline) but there is scope to improve these
+(e.g. ability to specify a particular offset in an object that will fail a
+read, more controls over setting and deleting error inject sites).
+
+The general philosophy of teuthology seems to be to randomly inject faults and
+simply through brute force get sufficient coverage of all the error
+paths. This is a good approach for CI testing, however when EC code paths
+become complex and require multiple errors to occur with precise timings to
+cause a particular code path to execute it becomes hard to get coverage
+without running the tests for a very long time. There are some standalone
+tests for EC which do test some of the multiple failure paths, but these tests
+perform very limited amounts of I/O and don't inject failures while there are
+I/Os in flight so miss some of the interesting scenarios.
+
+To deal with these more complex error paths we propose developing a new type
+of thrasher for erasure coding that injects a sequence of errors and makes use
+of debug hooks to capture and delay I/O requests at particular points to
+ensure an error inject hits a particular timing window. To do this we will
+extend the tell osd command to include extra interfaces to inject errors and
+capture and stall I/Os at specific points.
+
+Some parts of erasure coding such as the plugins are stand alone bits of code
+which can be tested with unit tests. There are already some unit tests and
+performance benchmark tools for erasure coding, we will look to extend these
+to get further coverage of code that can be run stand alone.
+
+I/O path changes
+----------------
+
+Avoid unnecessary reads and writes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current code reads too much data for read and overwrite I/Os. For
+overwrites it will also rewrite unmodified data. This occurs because reads and
+overwrites are rounded up to full-stripe operations. This isn’t a problem when
+data is mainly being accessed sequentially but is very wasteful for random I/O
+operations. The code can be changed to only read/write necessary shards. To
+allow the code to efficiently support larger chunk sizes I/Os should be
+rounded to page size I/Os instead of chunk sized I/Os.
+
+The first simple set of optimizations eliminates unnecessary reads and
+unnecessary writes of data, but retains writes of metadata on all shards. This
+avoids breaking the current design which depends on all shards receiving a
+metadata update for every transaction. When changes to the metadata handling
+are completed (see below) then it will be possible to make further
+optimizations to reduce the number of metadata updates for additional savings.
+
+Parity-delta-write
+^^^^^^^^^^^^^^^^^^
+
+The current code implements overwrites by performing a full-stripe read,
+merging the overwritten data, calculating new coding parities and performing a
+full-stripe write. Reading and writing every shard is expensive, there are a
+number of optimizations that can be applied to speed this up. For a K+M
+configuration where M is small, it is often less work to perform a
+parity-delta-write. This is implemented by reading the old data that is about
+to be overwritten and XORing it with the new data to create a delta. The
+coding parities can then be read, updated to apply the delta and
+re-written. With M=2 (RAID6) this can result in just 3 read and 3 writes to
+perform an overwrite of less than one chunk.
+
+Note that where a large fraction of the data in the stripe is being updated,
+this technique can result in more work than performing a partial overwrite,
+however if both update techniques are supported it is fairly easy to calculate
+for a given I/O offset and length which is the optimal technique to use.
+
+Write I/Os submitted to the Primary OSD will perform this calculation to
+decide whether to use a full-stripe update or a parity-delta-write. Note that
+if read failures are encountered while performing a parity-delta-write and it
+is necessary to reconstruct data or a coding parity then it will be more
+efficient to switch to performing a full-stripe read, merge and write.
+
+Not all erasure codings and erasure coding libraries support the capability of
+performing delta updates, however those implemented using XOR and/or GF
+arithmetic should. We have checked jerasure and isa-l and confirmed that they
+support this feature, although the necessary APIs are not currently exposed by
+the plugins. For some erasure codes such as clay and lrc it may be possible to
+apply delta updates, but the delta may need to be applied in so many places
+that this makes it a worthless optimization. This proposal suggests that
+parity-delta-write optimizations are initially implemented only for the most
+commonly used erasure codings. Erasure code plugins will provide a new flag
+indicating whether they support the new interfaces needed to perform delta
+updates.
+
+Direct reads
+^^^^^^^^^^^^
+
+Read I/Os are currently directed to the primary OSD which then issues reads to
+other shards. To reduce I/O latency and network bandwidth it would be better
+if clients could issue direct read requests to the OSD storing the data,
+rather than via the primary. There are a few error scenarios where the client
+may still need to fallback to submitting reads to the primary, a secondary OSD
+will have the option of failing a direct read with -EAGAIN to request the
+client retries the request to the primary OSD.
+
+Direct reads will always be for <= one chunk. For reads of more than one chunk
+the client can issue direct reads to multiple OSDs, however these will no
+longer guaranteed to be atomic because an update (write) may be applied in
+between the separate read requests. If a client needs atomicity guarantees
+they will need to continue to send the read to the primary.
+
+Direct reads will be failed with EAGAIN where a reconstruct and decode
+operation is required to return the data. This means only reads to primary OSD
+will need to handle the reconstruct code path. When an OSD is backfilling we
+don't want the client to have large quantities of I/O failed with EAGAIN,
+therefore we will make the client detect this situation and avoid issuing
+direct I/Os to a backfilling OSD.
+
+For backwards compatibility, for client requests that cannot cope with the
+reduced guarantees of a direct read, and for scenarios where the direct read
+would be to an OSD that is absent or backfilling, reads directed to the
+primary OSD will still be supported.
+
+Direct writes
+^^^^^^^^^^^^^
+
+Write I/Os are currently directed to the primary OSD which then updates the
+other shards. To reduce latency and network bandwidth it would be better if
+clients could direct small overwrites requests directly to the OSD storing the
+data, rather than via the primary. For larger write I/Os and for error
+scenarios and abnormal cases clients will continue to submit write I/Os to the
+primary OSD.
+
+Direct writes will always be for <= one chunk and will use the
+parity-delta-write technique to perform the update. For medium sized writes a
+client may issue direct writes to multiple OSDs, but such updates will no
+longer be guaranteed to be atomic. If a client requires atomicity for a larger
+write they will need to continue to send it to the primary.
+
+For backwards compatibility, and for scenarios where the direct write would be
+to an OSD that is absent, writes directed to the primary OSD will still be
+supported.
+
+I/O serialization, recovery/backfill and other error scenarios
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Direct writes look fairly simple until you start considering all the abnormal
+scenarios. The current implementation of processing all writes on the Primary
+OSD means that there is one central point of control for the stripe that can
+manage things like the ordering of multiple inflight I/Os to the same stripe,
+ensuring that recovery/backfill for an object has been completed before it is
+accessed and assigning the object version number and modification time.
+
+With direct I/Os these become distributed problems. Our approach is to send a
+control path message to the Primary OSD and let it continue to be the central
+point of control. The Primary OSD will issue a reply when the OSD can start
+the direct write and will be informed with another message when the I/O has
+completed. See section below on metadata updates for more details.
+
+Stripe cache
+^^^^^^^^^^^^
+
+Erasure code pools maintain a stripe cache which stores shard data while
+updates are in progress. This is required to allow writes and reads to the
+same stripe to be processed in parallel. For small sequential write workloads
+and for extreme hot spots (e.g. where the same block is repeatedly re-written
+for some kind of crude checkpointing mechanism) there would be a benefit in
+keeping the stripe cache slightly longer than the duration of the I/O. In
+particularly the coding parities are typically read and written for every
+update to a stripe. There is obviously a balancing act to achieve between
+keeping the cache long enough that it reduces the overheads for future I/Os
+versus the memory overheads of storing this data. A small (MiB as opposed to
+GiB sized cache) should be sufficient for most workloads. The stripe cache can
+also help reduce latency for direct write I/Os by allowing prefetch I/Os to
+read old data and coding parities ready for later parts of the write operation
+without requiring more complex interlocks.
+
+The stripe cache is less important when the default chunk size is small
+(e.g. 4K), because even with small write I/O requests there will not be many
+sequential updates to fill a stripe. With a larger chunk size (e.g. 64K) the
+benefits of a good stripe cache become more significant because the stripe
+size will be 100’s KiB to small number of MiB’s and hence it becomes much more
+likely that a sequential workload will issue many I/Os to the same stripe.
+
+Automatically choose chunk size
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The default chunk size of 4K is good for small objects because the data and
+coding parities are rounded up to whole chunks and because if an object has
+less than one data stripe of data then the capacity overheads for the coding
+parities are higher (e.g. a 4K object in a 10+2 erasure coded pool has 4K of
+data and 8K of coding parity, so there is a 200% overhead). However the
+optimizations above all provide much bigger savings if the typical random
+access I/O only reads or writes a single shard. This means that so long as
+objects are big enough that a larger chunk size such as 64K would be better.
+
+Whilst the user can try and predict what their typically object size will be
+and choose an appropriate chunk size, it would be better if the code could
+automatically select a small chunk size for small objects and a larger chunk
+size for larger objects. There will always be scenarios where an object grows
+(or is truncated) and the chosen chunk size becomes inappropriate, however
+reading and re-writing the object with a new chunk size when this happens
+won’t have that much performance impact. This also means that the chunk size
+can be deduced from the object size in object_info_t which is read before the
+objects data is read/modified. Clients already provide a hint as to the object
+size when creating the object so this could be used to select a chunk size to
+reduce the likelihood of having to re-stripe an object
+
+The thought is to support a new chunk size of auto/variable to enable this
+feature, it will only be applicable for newly created pools, there will be no
+way to migrate an existing pool.
+
+Deep scrub support
+^^^^^^^^^^^^^^^^^^
+
+EC Pools with overwrite do not check CRCs because it is too costly to update
+the CRC for the object on every overwrite, instead the code relies on
+Bluestore to maintain and check CRCs. When an EC pool is operating with
+overwrite disabled a CRC is kept for each shard, because it is possible to
+update CRCs as the object is appended to just by calculating a CRC for the new
+data being appended and then doing a simple (quick) calculation to combine the
+old and new CRC together.
+
+In dev/osd_internals/erasure_coding/proposals.rst it discusses the possibility
+of keeping CRCs at a finer granularity (for example per chunk), storing these
+either as an xattr or an omap (omap is more suitable as large objects could
+end up with a lot of CRC metadata) and updating these CRCs when data is
+overwritten (the update would need to perform a read-modify-write at the same
+granularity as the CRC). These finer granularity CRCs can then easily be
+combined to produce a CRC for the whole shard or even the whole erasure coded
+object.
+
+This proposal suggests going in the opposite direction - EC overwrite pools
+have survived without CRCs and relied on Bluestore up until now, so why is
+this feature needed? The current code doesn’t check CRCs if overwrite is
+enabled, but sadly still calculates and updates a CRC in the hinfo xattr, even
+if performing overwrites which mean that the calculated value will be
+garbage. This means we pay all the overheads of calculating the CRC and get no
+benefits.
+
+The code can easily be fixed so that CRCs are calculated and maintained when
+objects are written sequentially, but as soon as the first overwrite to an
+object occurs the hinfo xattr will be discarded and CRCs will no longer be
+calculated or checked. This will improve performance when objects are
+overwritten, and will improve data integrity in cases where they are not.
+
+While the thought is to abandon EC storing CRCs in objects being overwritten,
+there is an improvement that can be made to deep scrub. Currently deep scrub
+of an EC with overwrite pool just checks that every shard can read the object,
+there is no checking to verify that the copies on the shards are consistent. A
+full consistency check would require large data transfers between the shards
+so that the coding parities could be recalculated and compared with the stored
+versions, in most cases this would be unacceptably slow. However for many
+erasure codes (including the default ones used by Ceph) if the contents of a
+chunk are XOR’d together to produce a longitudinal summary value, then an
+encoding of the longitudinal summary values of each data shard should produce
+the same longitudinal summary values as are stored by the coding parity
+shards. This comparison is less expensive than the CRC checks performed by
+replication pools. There is a risk that by XORing the contents of a chunk
+together that a set of corruptions cancel each other out, but this level of
+check is better than no check and will be very successful at detecting a
+dropped write which will be the most common type of corruption.
+
+Metadata changes
+----------------
+
+What metadata do we need to consider?
+
+1. object_info_t. Every Ceph object has some metadata stored in the
+   object_info_t data structure. Some of these fields (e.g. object length) are
+   not updated frequently and we can simply avoid performing partial writes
+   optimizations when these fields need updating. The more problematic fields
+   are the version numbers and the last modification time which are updated on
+   every write. Version numbers of objects are compared to version numbers in
+   PG log entries for peering/recovery and with version numbers on other
+   shards for backfill. Version numbers and modification times can be read by
+   clients.
+
+2. PG log entries. The PG log is used to track inflight transactions and to
+   allow incomplete transactions to be rolled forward/backwards after an
+   outage/network glitch. The PG log is also used to detect and resolve
+   duplicate requests (e.g. resent due to network glitch) from
+   clients. Peering currently assumes that every shard has a copy of the log
+   and that this is updated for every transaction.
+
+3. PG stats entries and other PG metadata. There is other PG metadata (PG
+   stats is the simplest example) that gets updated on every
+   transaction. Currently all OSDs retain a cached and a persistent copy of
+   this metadata.
+
+How many copies of metadata are required?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current implementation keeps K+M replicated copies of metadata, one copy
+on each shard. The minimum number of copies that need to be kept to support up
+to M failures is M+1. In theory metadata could be erasure encoded, however
+given that it is small it is probably not worth the effort. One advantage of
+keeping K+M replicated copies of the metadata is that any fully in sync shard
+can read the local copy of metadata, avoiding the need for inter-OSD messages
+and asynchronous code paths. Specifically this means that any OSD not
+performing backfill can become the primary and can access metadata such as
+object_info_t locally.
+
+M+1 arbitrarily distributed copies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A partial write to one data shard will always involve updates to the data
+shard and all M coding parity shards, therefore for optimal performance it
+would be ideal if the same M+1 shards are updated to track the associated
+metadata update. This means that for small random writes that a different M+1
+shards would get updated for each write. The drawback of this approach is that
+you might need to read K shards to find the most up to date version of the
+metadata.
+
+In this design no shard will have an up to date copy of the metadata for every
+object. This means that whatever shard is picked to be the acting primary that
+it may not have all the metadata available locally and may need to send
+messages to other OSDs to read it. This would add significant extra complexity
+to the PG code and cause divergence between Erasure coded pools and Replicated
+pools. For these reasons we discount this design option.
+
+M+1 copies on known shards
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The next best performance can be achieved by always applying metadata updates
+to the same M+1 shards, for example choosing the 1st data shard and all M
+coding parity shards. Coding parity shards will get updated by every partial
+write so this will result in zero or one extra shard being updated. With this
+approach only 1 shard needs to be read to find the most up to date version of
+the metadata.
+
+We can restrict the acting primary to be one of the M+1 shards, which means
+that once any incomplete updates in the log have been resolved that the
+primary will have an up to date local copy of all the metadata, this means
+that much more of the PG code can be kept unchanged.
+
+Partial Writes and the PG log
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Peering currently assumes that every shard has a copy of the log, however
+because of inflight updates and small term absences it is possible that some
+shards are missing some of the log entries. The job of peering is to combine
+the logs from the set of present shards to form a definitive log of
+transactions that have been committed by all the shards. Any discrepancies
+between a shards log and the definitive log are then resolved, typically by
+rolling backwards transactions (using information held in the log entry) so
+that all the shards are in a consistent state.
+
+To support partial writes the log entry needs to be modified to include the
+set of shards that are being updated. Peering needs to be modified to consider
+a log entry as missing from a shard only if a copy of the log entry on another
+shard indicates that this shard was meant to be updated.
+
+The logs are not infinite in size, and old log entries where it is known that
+the update has been successfully committed on all affected shards are
+trimmed. Log entries are first condensed to a pg_log_dup_t entry which can no
+longer assist in rollback of a transaction but can still be used to detect
+duplicated client requests, and then later completely discarded. Log trimming
+is performed at the same time as adding a new log entry, typically when a
+future write updates the log. With partial writes log trimming will only occur
+on shards that receive updates, which means that some shards may have stale
+log entries that should have been discarded.
+
+TBD: I think the code can already cope with discrepancies in log trimming
+between the shards. Clearly an in flight trim operation may not have completed
+on every shard so small discrepancies can be dealt with, but I think an absent
+OSD can cause larger discrepancies. I believe that this is resolved during
+Peering, with each OSD keeping a record of what the oldest log entry should be
+and this gets shared between OSDs so that they can work out stale log entries
+that were trimmed in absentia. Hopefully this means that only sending log
+trimming updates to shards that are creating new log entries will work without
+code changes.
+
+Backfill
+^^^^^^^^
+
+Backfill is used to correct inconsistencies between OSDs that occur when an
+OSD is absent for a longer period of time and the PG log entries have been
+trimmed. Backfill works by comparing object versions between shards. If some
+shards have out of date versions of an object then a reconstruct is performed
+by the backfill process to update the shard. If the version numbers on objects
+are not updated on all shards then this will break the backfill process and
+cause a huge amount of unnecessary reconstruct work. This is unacceptable, in
+particular for the scenario where an OSD is just absent for maintenance for a
+relatively short time with noout set. The requirement is to be able to
+minimize the amount of reconstruct work needed to complete a backfill.
+
+In dev/osd_internals/erasure_coding/proposals.rst it discusses the idea of
+each shard storing a vector of version numbers that records the most recent
+update that the pair <this shard, other shard> both should have participated
+in. By collecting this information from at least M shards it is possible to
+work out what the expected minimum version number should be for an object on a
+shard and hence deduce whether a backfill is required to update the
+object. The drawback of this approach is that backfill will need to scan M
+shards to collect this information, compared with the current implementation
+that only scans the primary and shard(s) being backfilled.
+
+With the additional constraint that a known M+1 shards will always be updated
+and that the (acting) primary will be one of these shards, it will be possible
+to determine whether a backfill is required just by examining the vector on
+the primary and the object version on the shard being backfilled. If the
+backfill target is one of the M+1 shards the existing version number
+comparison is sufficient, if it is another shard then the version in the
+vector on the primary needs to be compared with the version on the backfill
+target. This means that backfill does not have to scan any more shards than it
+currently does, however the scan of the primary does need to read the vector
+and if there are multiple backfill targets then it may need to store multiple
+entries of the vector per object increasing memory usage during the backfill.
+
+There is only a requirement to keep the vector on the M+1 shards, and the
+vector only needs K-1 entires because we only need to track version number
+differences between any of the M+1 shards (which should have the same version)
+and each of the K-1 shards (which can have a stale version number). This will
+slightly reduce the amount of extra metadata required. The vector of version
+numbers could be stored in the object_info_t structure or stored as a separate
+attribute.
+
+Our preference is to store the vector in the object_info_t structure because
+typically both are accessed together, and because this makes it easier to
+cache both in the same object cache. We will keep metadata and memory
+overheads low by only storing the vector when it is needed.
+
+Care is required to ensure that existing clusters can be upgraded. The absence
+of the vector of version numbers implies that an object has never had a
+partial update and therefore all shards are expected to have the same version
+number for the object and the existing backfill algorithm can be used.
+
+Code references
+"""""""""""""""
+
+PrimaryLogPG::scan_range - this function creates a map of objects and their
+version numbers, on the primary it tries to get this information from the
+object cache, otherwise it reads to OI attribute. This will need changes to
+deal with the vectors. To conserve memory it will need to be provided with the
+set of backfill targets so it can select which part of the vector to keep.
+
+PrimaryLogPG::recover_backfill - this function call scan_range for the local
+(primary) and sends MOSDPGScan to the backfill targets to get them to perform
+the same scan. Once it has collected all the version numbers it compares the
+primary and backfill targets to work out which objects need to be
+recovered. This will also need changes to deal with the vectors when comparing
+version numbers.
+
+PGBackend::run_recovery_op - recovers a single object. For an EC pool this
+involves reconstructing the data for the shards that need backfilling (read
+other shards and use decode to recover). This code shouldn't need any changes.
+
+Version number and last modification time for clients
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Clients can read the object version number and set expectations about what the
+minimum version number is when making updates. Clients can also read the last
+modification time. There are use cases where it is important that these values
+can be read and give consistent results, but there is also a large number of
+scenarios where this information is not required.
+
+If the object version number is only being updated on a known M+1 shards for
+partial writes, then where this information is required it will need to
+involve a metadata access to one of those shards. We have arranged for the
+primary to be one of the M+1 shards so I/Os submitted to the primary will
+always have access to the up to date information.
+
+Direct write I/Os need to update the M+1 shards, so it is not difficult to
+also return this information to the client when completing the I/O.
+
+Direct read I/Os are the problem case, these will only access the local shard
+and will not necessarily have access to the latest version and modification
+time. For simplicity we will require clients that require this information to
+send requests to the primary rather than using the direct I/O
+optimization. Where a client does not need this information they can use the
+direct I/O optimizations.
+
+The direct read I/O optimization will still return a (potentially stale)
+object version number. This may still be of use to clients to help understand
+the ordering of I/Os to a chunk.
+
+Direct Write with Metadata updates
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Here's the full picture of what a direct write performing a parity-delta-write
+looks like with all the control messages:
+
+.. ditaa::
+
+                   RADOS Client
+ 
+                        |  ^
+                        |  |
+                        1  28
+ +-----+                |  |
+ |     |<------27-------+--+
+ |     |                |  |
+ |     |  +-------------|->|
+ |     |  |             |  |
+ |     |<-|----2--------+  |<--------------------------------------------+
+ | Seq |  |             |  |<----------------------------+               |
+ |     |  |    +----3---+  |                             |               |
+ |     |  |    |        +--|-----------------------5-----|---+           |
+ |     |  |    |        +--|-------4---------+           |   |           |
+ |     +--|-10-|------->|  |                 |           |   |           |
+ |     |  |    |    +---+  |                 |           |   |           |
+ |     |  |    |    |   |  |                 |           |   |           |
+ |     |  |    |    v   |  |                 |           |   |           |
+ +----++  |    |  +---+ |  |                 |           |   |           |
+   ^  |   |    |  |XOR+-|--|----------15-----|-----------|---|-----+     |
+   |  |   |    |  |13 +-|--|-------14--------|-----+     |   |     |     |
+   |  |   |    |  +---+ |  |                 |     |     |   |     |     |
+   |  |   |    |    ^   |  |                 |     v     |   |     v     |
+   |  |   |    |    |   |  |                 |  +------+ |   |  +------+ |
+   6  11  |    |    |   |  |                 |  | XOR  | |   |  |  GF  | |
+   |  |   |    |    |   |  |                 |  | 18   | |   |  |  21  | |
+   |  |   |    |    12  16 |                 |  +----+-+ |   |  +----+-+ |
+   |  |   |    |    |   |  |                 |    ^  |   |   |    ^  |   |
+   |  |   |    |    |   |  |                 |    |  |   |   |    |  |   |
+   |  |   |    |    |   |  |                 |    17 19  |   |    20 22  |
+   |  |   |    |    |   |  |                 |    |  |   |   |    |  |   |
+   |  |   |    |    |   v  |                 |    |  v   |   |    |  v   |
+   |  |   |    |  +-+----+ |                 |  +-+----+ |   |  +-+----+ |
+   |  |   |    +->|Extent| |                 +->|Extent| |   +->|Extent| |
+   |  |   23      |Cache | 24                   |Cache | 25     |Cache | 26
+   |  |   |       +----+-+ |                    +----+-+ |      +----+-+ |
+   |  |   |         ^  |   |                      ^  |   |        ^  |   |
+   |  |   |         |  |   |                      |  |   |        |  |   |
+   |  +---+         7  +---+                      8  +---+        9  +---+
+   |  |             |  |                          |  |            |  |
+   |  v             |  v                          |  v            |  v
+  .-----.          .-----.   .-----.   .-----.   .-----.         .-----.
+ (       )        (       ) (       ) (       ) (       )       (       )
+ |`-----'|        |`-----'| |`-----'| |`-----'| |`-----'|       |`-----'|
+ |       |        |       | |       | |       | |       |       |       |
+ |       |        |       | |       | |       | |       |       |       |
+ (       )        (       ) (       ) (       ) (       )       (       )
+  `-----'          `-----'   `-----'   `-----'   `-----'         `-----'
+ Primary            OSD 2     OSD 3     OSD 4     OSD P           OSD Q
+   OSD
+ 
+ * Xattr            * No Xattr                    * Xattr         * Xattr
+ * OI               * Stale OI                    * OI            * OI
+ * PG log           * Partial PG log              * PG log        * PG log
+ * PG stats         * No PG stats                 * PG stats      * PG stats
+
+Note: Only the primary OSD and parity coding OSDs (the M+1 shards) have Xattr,
+up to date object info, PG log and PG stats. Only one of these OSDs is
+permitted to become the (acting) primary. The other data OSDs 2,3 and 4 (the
+K-1 shards) do not have Xattrs or PG stats, may have state object info and
+only have PG log entries for their own updates. OSDs 2,3 and 4 may have stale
+OI with an old version number. The other OSDs have the latest OI and a vector
+with the expected version numbers for OSDs 2,3 and 4.
+
+1. Data message with Write I/O from client (MOSDOp)
+2. Control message to Primary with Xattr (new msg MOSDEcSubOpSequence)
+
+Note: the primary needs to be told about any xattr update so it can update its
+copy, but the main purpose of this message is to allow the primary to sequence
+the write I/O. The reply message at step 10 is what allows the write to start
+and provides the PG stats and new object info including the new version
+number. If necessary the primary can delay this to ensure that
+recovery/backfill of the object is completed first and deal with overlapping
+writes. Data may be read (prefetched) before the reply, but obviously no
+transactions can start.
+
+3. Prefetch request to local extent cache
+4. Control message to P to prefetch to extent cache (new msg
+   MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead)
+5. Control message to Q to prefetch to extent cache (new msg
+   MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead)
+6. Primary reads object info
+7. Prefetch old data
+8. Prefetch old P
+9. Prefetch old Q
+
+Note: The objective of these prefetches is to get the old data, P and Q reads
+started as quickly as possible to reduce the latency of the whole I/O. There
+may be error scenarios where the extent cache is not able to retain this and
+it will need to be re-read. This includes the rare/pathological scenarios
+where there is a mixture of writes sent to the primary and writes sent
+directly to the data OSD for the same object.
+
+10. Control message to data OSD with new object info + PG stats (new msg
+    MOSDEcSubOpSequenceReply)
+11. Transaction to update object info + PG log + PG stats
+12. Fetch old data (hopefully cached)
+
+Note: For best performance we want to pipeline writes to the same stripe. The
+primary assigns the version number to each write and consequently defines the
+order in which writes should be processed. It is important that the data shard
+and the coding parity shards apply overlapping writes in the same order. The
+primary knows what set of writes are in flight so can detect this situation
+and indicate in its reply message at step 10 that an update must wait until an
+earlier update has been applied. This information needs to be forwarded to the
+coding parities (steps 14 and 15) so they can also ensure updates are applied
+in the same order.
+
+13. XOR new and old data to create delta
+14. Data message to P with delta + Xattr + object info + PG log + PG stats
+    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
+15. Data message to Q with delta + Xattr + object info + PG log + PG stats
+    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
+16. Transaction to update data + object info + PG log
+17. Fetch old P (hopefully cached)
+18. XOR delta and old P to create new P
+19. Transaction to update P + Xattr + object info + PG log + PG stats
+20. Fetch old Q (hopefully cached)
+21. XOR delta and old Q to create new Q
+22. Transaction to update Q + Xattr + object info + PG log + PG stats
+23. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
+    equivalent of MOSDEcSubOpWriteReply)
+24. Local commit notification
+25. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
+    equivalent of MOSDEcSubOpWriteReply)
+26. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
+    equivalent of MOSDEcSubOpWriteReply)
+27. Control message to Primary to signal end of write (variant of new msg
+    MOSDEcSubOpSequence)
+28. Control message reply to client (MOSDOpReply)
+
+Upgrade and backwards compatibility
+-----------------------------------
+
+A few of the optimizations can be made just by changing code on the primary
+OSD with no backwards compatibility concerns regarding clients or the other
+OSDs. These optimizations will be enabled as soon as the primary OSD upgrades
+and will replace the existing code paths.
+
+The remainder of the changes will be new I/O code paths that will exist
+alongside the existing code paths.
+
+Similar to EC Overwrites many of the changes will need to ensure that all OSDs
+are running new code and that the EC plugins support new interfaces required
+for parity-delta-writes. A new pool level flag will be required to enforce
+this. It will be possible to enable this flag (and hence enable the new
+performance optimizations) after upgrading an existing cluster. Once set it
+will not be possible to add down level OSDs to the pool. It will not be
+possible to turn this flag off other than by deleting the pool. Downgrade is
+not supported because:
+
+1. It is not trivial to quiesce all I/O to a pool to ensure that none of the
+   new I/O code paths are in use when the flag is cleared.
+
+2. The PG log format for new I/Os will not be understood by down level
+   OSDs. It would be necessary to ensure the log has been trimmed of all new
+   format entries before clearing the flag to ensure that down level OSDs will
+   be able to interpret the log.
+
+3. Additional xattr data will be stored by the new I/O code paths and used by
+   backfill. Down level code will not understand how to backfill a pool that
+   has been running the new I/O paths and will get confused by the
+   inconsistent object version numbers. While it is theoretically possible to
+   disable partial updates and then scan and update all the metadata to return
+   the pool to a state where a downgrade is possible, we have no intention of
+   writing this code.
+
+The direct I/O changes will additionally require clients to be running new
+code. These will require that the pool has the new flag set and that a new
+client is used. Old clients can use pools with the new flag set, just without
+the direct I/O optimization.
+
+Not under consideration
+-----------------------
+
+There is a list of enhancements discussed in
+doc/dev/osd_internals/erasure_coding/proposals.rst, the following are not
+under consideration:
+
+1. RADOS Client Acknowledgement Generation optimization
+
+When updating K+M shards in an erasure coded pool, in theory you don’t have to
+wait for all the updates to complete before completing the update to the
+client, because so long as K updates have completed any viable subset of
+shards should be able to roll forward the update.
+
+For partial writes where only M+1 shards are updated this optimization does
+not apply as all M+1 updates need to complete before the update is completed
+to the client.
+
+This optimization would require changes to the peering code to work out
+whether partially completed updates need to be rolled forwards or
+backwards. To roll an update forwards it would be simplest to mark the object
+as missing and use the recovery path to reconstruct and push the update to
+OSDs that are behind.
+
+2. Avoid sending read request to local OSD via Messenger
+
+The EC backend code has an optimization for writes to the local OSD which
+avoids sending a message and reply via messenger. The equivalent optimization
+could be made for reads as well, although a bit more care is required because
+the read is synchronous and will block the thread waiting for the I/O to
+complete.
+
+Pull request https://github.com/ceph/ceph/pull/57237 is making this
+optimization
+
+Stories
+=======
+
+This is our high level breakdown of the work. Our intention is to deliver this
+work as a series of PRs. The stories are roughly in the order we plan to
+develop. Each story is at least one PR, where possible they will be broken up
+further. The earlier stories can be implemented as stand alone pieces of work
+and will not introduce upgrade/backwards compatibility issues. The later
+stories will start breaking backwards compatibility, here we plan to add a new
+flag to the pool to enable these new features. Initially this will be an
+experimental flag while the later stories are developed.
+
+Test tools - enhanced I/O generator for testing erasure coding
+--------------------------------------------------------------
+
+* Extend rados bench to be able to generate more interesting patterns of I/O
+  for erasure coding, in particular reading and writing at different offsets
+  and for different lengths and making sure we get good coverage of boundary
+  conditions such as the sub-chunk size, chunk size and stripe size
+* Improve data integrity checking by using a seed to generate data patterns
+  and remembering which seed is used for each block that is written so that
+  data can later be validated
+
+Test tools - offline consistency checking tool
+----------------------------------------------
+
+* Test tools for performing offline consistency checks combining use of
+  objectstore_tool with ceph-erasure-code-tool
+* Enhance some of the teuthology standalone erasure code checks to use this
+  tool
+
+Test tools - online consistency checking tool
+---------------------------------------------
+
+* New CLI to be able to perform online consistency checking for an object or a
+  range of objects that reads all the data and coding parity shards and
+  re-encodes the data to validate the coding parities
+
+Switch for JErasure to ISA-L
+----------------------------
+
+The JErasure library has not been updated since 2014, the ISA-L library is
+maintained and exploits newer instructions sets (e.g. AVX512, AVX2) which
+provides faster encoding/decoding
+
+* Change defaults to ISA-L in upstream ceph
+* Benchmark Jerasure and ISA-L
+* Refactor Ceph isa_encode region_xor() to use AVX when M=1
+* Documentation updates
+* Present results at performance weekly
+
+Sub Stripe Reads
+----------------
+
+Ceph currently reads an integer number of stripes and discards unneeded
+data. In particular for small random reads it will be more efficient to just
+read the required data
+
+* Help finish Pull Request https://github.com/ceph/ceph/pull/55196 if not
+  already complete
+* Further changes to issue sub-chunk reads rather than full-chunk reads
+
+Simple Optimizations to Overwrite
+---------------------------------
+
+Ceph overwrites currently read an integer number of stripes, merge the new
+data and write an integer number of stripes. This story makes simple
+improvements by making the same optimizations as for sub stripe reads and for
+small (sub-chunk) updates reducing the amount of data being read/written to
+each shard.
+
+* Only read chunks that are not being fully overwritten (code currently reads
+  whole stripe and then merges new data)
+* Perform sub-chunk reads for sub-chunk updates
+* Perform sub-chunk writes for sub-chunk updates
+
+Eliminate unnecessary chunk writes but keep metadata transactions
+-----------------------------------------------------------------
+
+This story avoids re-writing data that has not been modified. A transaction is
+still applied to every OSD to update object metadata, the PG log and PG stats.
+
+* Continue to create transactions for all chunks but without the new write data
+* Add sub-chunk writes to transactions where data is being modified
+
+Avoid zero padding objects to a full stripe
+-------------------------------------------
+
+Objects are rounded up to an integer number of stripes by adding zero
+padding. These buffers of zeros are then sent in messages to other OSDs and
+written to the OS consuming storage. This story make optimizations to remove
+the need for this padding
+
+* Modifications to reconstruct reads to avoid reading zero-padding at the end
+  of an object - just fill the read buffer with zeros instead
+* Avoid transfers/writes of buffers of zero padding. Still send transactions
+  to all shards and create the object, just don't populate it with zeros
+* Modifications to encode/decode functions to avoid having to pass in buffers
+  of zeros when objects are padded
+
+Erasure coding plugin changes to support distributed partial writes
+-------------------------------------------------------------------
+
+This is preparatory work for future stories, it adds new APIs to the erasure
+code plugins.
+
+* Add a new interface to create a delta by XORing old and new data together
+  and implement this for the ISA-L and JErasure plugins
+* Add a new interface to apply a delta to one coding parity by using XOR/GF
+  and implement this for the ISA-L and JErasure plugins
+* Add a new interface which reports which erasure codes support this feature
+  (ISA-L and JErasure will support it, others will not)
+
+Erasure coding interface to allow RADOS clients to direct I/Os to OSD storing the data
+--------------------------------------------------------------------------------------
+
+This is preparatory work for future stories, its adds a new API for clients
+
+* New interface to convert the pair (pg, offset) to {OSD, remaining chunk
+  length}
+
+We do not want clients to have to dynamically link to the erasure code plugins
+so this code will need to be part of librados. However this interface needs to
+understand how erasure codes distribute data and coding chunks to be able to
+perform this translation.
+
+We will only support ISA-L and JErasure plugins where there is a trivial
+striping of data chunks to OSDs.
+
+Changes to object_info_t
+------------------------
+
+This is preparatory work for future stories.
+
+This adds the vector of version numbers to object_info_t which will be used
+for partial updates. For replicated pools and for erasure coded objects that
+are not overwritten we will avoid storing extra data in object_info_t.
+
+Changes to PGLog and Peering to support updating a subset of OSDs
+-----------------------------------------------------------------
+
+This is preparatory work for future stories.
+
+* Modify the PG log entry to store a record of which OSDs are being updated
+* Modify peering to use this extra data to work out OSDs that are missing
+  updates
+
+Change to selection of (acting) primary
+---------------------------------------
+
+This is preparatory work for future stories.
+
+Constrain the choice of primary to be the first data OSD or one of the erasure
+coding parities. If none of these OSDs are available and up to date then the
+pool must be offline.
+
+Implement parity-delta-write with all computation on the primary
+----------------------------------------------------------------
+
+* Calculate whether its more efficient for an update to perform a full stripe
+  overwrite or a parity-delta-write
+* Implement new code paths to perform the parity-delta-write
+* Test tool enhancements. We want to make sure that both parity-delta-write
+  and full-stripe write are tested. We will add a new conf file option with a
+  choice of 'parity-delta', 'full-stripe', 'mixture for testing' or
+  'automatic' and update teuthology test cases to predominately use a mixture.
+
+Upgrades and backwards compatibility
+------------------------------------
+
+* Add a new feature flag for erasure coded pools
+* All OSDs must be running new code to enable the flag on the pool
+* Clients may only issue direct I/Os if the flag is set
+* OSDs running old code may not join a pool with the flag set
+* Its not possible to turn the feature flag off (other than by deleting the
+  pool)
+
+Changes to Backfill to use the vector in object_info_t
+------------------------------------------------------
+
+This is preparatory work for future stories.
+
+* Modify the backfill process to use the vector of version numbers in
+  object_info_t so that when partial updates occur we do not backfill OSDs
+  which did not participate in the partial update.
+* When there is a single backfill target extract the appropriate version
+  number from the vector (no additional storage required)
+* When there are multiple backfill targets extract the subset of the vector
+  required by the backfill targets and select the appropriate entry when
+  comparing version numbers in PrimaryLogPG::recover_backfill
+
+Test tools - offline metadata validation tool
+---------------------------------------------
+
+* Test tools for performing offline consistency checking of metadata, in
+  particular checking the vector of version numbers in object_info_t matches
+  the versions on each OSD, but also for validating PG log entries
+
+Eliminate transactions on OSDs not updating data chunks
+-------------------------------------------------------
+
+Peering, log recovery and backfill can now all cope with partial updates using
+the vector of version numbers in object_info_t.
+
+* Modify the overwrite I/O path to not bother with metadata only transactions
+  (except to the Primary OSD)
+* Modify the update of the version numbers in object_info_t to use the vector
+  and only update entries that are receiving a transaction
+* Modify the generation of the PG log entry to record which OSDs are being
+  updated
+
+Direct reads to OSDs (single chunk only)
+----------------------------------------
+
+* Modify OSDClient to route single chunk read I/Os to the OSD storing the data
+* Modify OSD to accept reads from non-primary OSD (expand existing changes for
+  replicated pools to work with EC pools as well)
+* If necessary fail the read with EAGAIN if the OSD is unable to process the
+  read directly
+* Modify OSDClient to retry read by submitting to Primary OSD if read is
+  failed with EAGAIN
+* Test tool enhancements. We want to make sure that both direct reads and
+  reads to the primary are tested. We will add a new conf file option with a
+  choice of 'prefer direct', 'primary only' or 'mixture for testing' and
+  update teuthology test cases to predominately use a mixture.
+
+The changes will be made to the OSDC part of the RADOS client so will be
+applicable to rbd, rgw and cephfs.
+
+We will not make changes to other code that has its own version of RADOS
+client code such as krbd, although this could be done in the future.
+
+Direct reads to OSDs (multiple chunks)
+--------------------------------------
+
+* Add a new OSDC flag NONATOMIC which allows OSDC to split a read into
+  multiple requests
+* Modify OSDC to split reads spanning multiple chunks into separate requests
+  to each OSD if the NONATOMIC flag is set
+* Modifications to OSDC to coalesce results (if any sub read fails the whole
+  read needs to fail)
+* Changes to librbd client to set NONATOMIC flag for reads
+* Changes to cephfs client to set NONATOMIC flag for reads
+
+We are only changing a very limited set of clients, focusing on those that
+issue smaller reads and are latency sensitive. Future work could look at
+extending the set of clients (including krbd).
+
+Implement distributed parity-delta-write
+----------------------------------------
+
+* Implement new message MOSDEcSubOpDelta and MOSDEcSubOpDeltaReply
+* Change primary to calculate delta and send MOSDEcSubOpDelta message to
+  coding parity OSDs
+* Modify coding parity OSDs to apply the delta and send MOSDEcSubOpDeltaReply
+  message
+
+Note: This change will increase latency because the coding parity reads start
+after the old data read. Future work will fix this.
+
+Test tools - EC error injection thrasher
+----------------------------------------
+
+* Implement a new type of thrasher that specifically injects faults to stress
+  erasure coded pools
+* Take one or multiple (up to M) OSDs down, more focus on taking different
+  subsets of OSDs down to drive all the different EC recovery paths than
+  stressing out peering/recovery/backfill (the existing OSD thrasher excels at
+  this)
+* Inject read I/O failures to force reconstructs using decode for single and
+  multiple failures
+* Inject delays using osd tell type interface to make it easier to test OSD
+  down at all the interesting stages of EC I/Os
+* Inject delays using osd tell type interface to slow down an OSD transaction
+  or message to expose the less common completion orders for parallel work
+
+Implement prefetch message MOSDEcSubOpPrefetch and modify extent cache
+----------------------------------------------------------------------
+
+* Implement new message MOSDEcSubOpPrefetch
+* Change primary to issue this message to the coding parity OSDs before
+  starting read of old data
+* Change the extent cache so that each OSD caches its own data rather than
+  caching everything on the primary
+* Change coding parity OSDs to handle this message and read the old coding
+  parity into the extent cache
+* Changes to extent cache to retain the prefetched old parity until the
+  MOSDEcSubOpDelta message is received, and to discard this on error paths
+  (e.g. new OSDMap)
+
+Implement sequencing message MOSDEcSubOpSequence
+------------------------------------------------
+
+* Implement new message MODSEcSubOpSequence and MOSDEcSubOpSequenceReply
+* Modify primary code to create these messages and route them locally to
+  itself in preparation for direct writes
+
+Direct writes to OSD (single chunk only)
+----------------------------------------
+
+* Modify OSDC to route single chunk write I/Os to the OSD storing the data
+* Changes to issue MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply between
+  data OSD and primary OSD
+
+Direct writes to OSD (multiple chunks)
+--------------------------------------
+
+* Modifications to OSDC to split multiple chunk writes into separate requests
+  if NONATOMIC flag is set
+* Further changes to coalescing completions (in particular reporting version
+  number correctly)
+* Changes to librbd client to set NONATOMIC flag for reads
+* Changes to cephfs client to set NONATOMIC flag for reads
+
+We are only changing a very limited set of clients, focusing on those that
+issue smaller writes and are latency sensitive. Future work could look at
+extending the set of clients.
+
+Deep scrub / CRC
+----------------
+
+* Disable CRC generation in the EC code for overwrites, delete hinfo Xattr
+  when first overwrite occurs
+* For objects in pool with new feature flag set that have not been overwritten
+  check CRC, even if pool overwrite flag is set. The presence/absence of hinfo
+  can be used to determine if the object has been overwritten
+* For deep scrub requests XOR the contents of the shard to create a
+  longitudinal check (8 bytes wide?)
+* Return the longitudinal check in the scrub reply message, have the primary
+  encode the set of longitudinal replies to check for inconsistencies
+
+Variable chunk size erasure coding
+----------------------------------
+
+* Implement new pool option for automatic/variable chunk size
+* When object size is small use a small chunk size (4K) when the pool is using
+  the new option
+* When object size is large use a large chunk size (64K or 256K?)
+* Convert the chunk size by reading and re-writing the whole object when a
+  small object grows (append)
+* Convert the chunk size by reading and re-writing the whole object when a
+  large object shrinks (truncate)
+* Use the object size hint to avoid creating small objects and then almost
+  immediately converting them to a larger chunk size
+
+CLAY Erasure Codes
+------------------
+
+In theory CLAY erasure codes should be good for K+M erasure codes with larger
+values of M, in particular when these erasure codes are used with multiple
+OSDs in the same failure domain (e.g. an 8+6 erasure code with 5 servers each
+with 4 OSDs). We would like to improve the test coverage for CLAY and perform
+some more benchmarking to collect data to help substantiate when people should
+consider using CLAY.
+
+* Benchmark CLAY erasure codes - in particular the number of I/O required for
+  backfills when multiple OSDs fail
+* Enhance test cases to validate the implementation
+* See also https://bugzilla.redhat.com/show_bug.cgi?id=2004256
author	Bill Scales <bill_scales@uk.ibm.com>
	Wed, 3 Jul 2024 13:09:19 +0000 (13:09 +0000)
committer	Bill Scales <bill_scales@uk.ibm.com>
	Thu, 8 Aug 2024 11:55:17 +0000 (11:55 +0000)
doc/dev/osd_internals/erasure_coding.rst		patch \| blob \| history
doc/dev/osd_internals/erasure_coding/enhancements.rst	[new file with mode: 0644]	patch \| blob