--- /dev/null
+=========================
+Coupled LAYer code plugin
+=========================
+
+Coupled LAYer (CLAY) codes are erasure codes designed to save in terms of network
+bandwidth, disk IO when a failed node/OSD/rack is being repaired. Let:
+
+ d = number of OSDs contacted during repair
+
+If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
+reading from the *d=8* others to repair. And recovery of say a 1GiB needs
+a download of 8 X 1GiB = 8GiB amount of information.
+
+However, in the case of *clay* plugin *d* is configurable such that:
+
+ k+1 <= d <= k+m-1
+
+By default the clay code plugin picks *d=k+m-1* as it gives most savings in terms
+of network bandwidth and disk IO. In the case of *clay* plugin configured with
+*k=8*, *m=4* and *d=11* when a single OSD fails d=11 osds are contacted and
+250MiB is downloaded from each of them resulting in download of 11 X 250MiB = 2.75GiB
+amount of information. More general parameters are shown below. The benefits are huge
+when the repair is being done for a rack that stores information in the order of
+Tera bytes.
+
+ +-------------+---------------------------+
+ | plugin | total amount of disk IO |
+ +=============+===========================+
+ |jerasure,isa | k*S |
+ +-------------+---------------------------+
+ | clay | d*S/(d-k+1) = (k+m-1)*S/m |
+ +-------------+---------------------------+
+
+where *S* is the amount of data stored of single OSD being repaired and
+in the table above, we are using the maximum possible value of d for minimal amount
+of data transmission for recovery.
+
+Erasure code profile examples
+=============================
+
+Reduced bandwidth usage can actually be observed.::
+
+ $ ceph osd erasure-code-profile set CLAYprofile \
+ plugin=clay \
+ k=4 m=2 d=5 \
+ crush-failure-domain=host
+ $ ceph osd pool create claypool 12 12 erasure CLAYprofile
+
+
+Create a clay profile
+=====================
+
+To create a new clay code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=clay \
+ k={data-chunks} \
+ m={coding-chunks} \
+ [d={helper-chunks}] \
+ [scalar_mds={plugin-name}] \
+ [technique={technique-name}] \
+ [crush-failure-domain={bucket-type}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split in **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``d={helper-chunks}``
+
+:Description: Number of OSDs requested to send data during recovery of
+ a single chunk. *d* needs to be chosen such that
+ k+1 <= d <= k+m-1. Larger the *d*, better the savings.
+
+:Type: Integer
+:Required: No.
+:Default: k+m-1
+
+``scalar_mds={jerasure|isa|shec}``
+
+:Description: **scalar_mds** specifies the plugin that is used as a
+ building block in the layered construction. It can be
+ one of *jerasure*, *isa*, *shec*
+
+:Type: String
+:Required: No.
+:Default: jerasure
+
+``technique={technique}``
+
+:Description: **technique** specifies the technique that will be picked
+ within the 'scalar_mds' plugin specified. Supported techniques
+ are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
+ 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van',
+ 'cauchy' for isa and 'single', 'multiple' for shec.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van (for jerasure, isa), single (for shec)
+
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the CRUSH rule. For intance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a CRUSH rule step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+
+Notion of sub-chunks
+====================
+
+Clay code is able to save in terms of disk IO, network bandwidth as it
+is a vector code and it can see a chunk at a finer granularity called
+sub-chunks. Number of sub-chunks within a chunk for a clay code is
+given by:
+
+ sub-chunk count = q\ :sup:`(k+m)/q`, where q=d-k+1
+
+
+During repair of a OSD, the helper information requested
+from an available OSD is only a fraction of a chunk. In fact, the number
+of sub-chunks within a chunk that are accessed during repair is given by:
+
+ repair sub-chunk count = sub-chunk count / q
+
+Examples
+--------
+
+#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
+ 8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
+ during repair.
+#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count
+ is 16. A quarter of a chunk is read from an available OSD for repair of a failed
+ chunk.
+
+
+
+How to choose configuration given a workload
+============================================
+
+Only a few sub-chunks are read of all the sub-chunks within a chunk. These sub-chunks
+are not necessarily stored consecutively within a chunk. For best disk IO
+performance, it is helpful to read contiguous data. Choose stripe-size such that
+sub-chunk size is sufficiently large.
+
+For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that::
+
+ sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...
+
+#. For large size workloads for which stripe size is large it is easy to choose k, m, d.
+ For example consider stripe-size of size 64MB, choosing *k=16*, *m=4* and *d=19* will
+ result in a sub-chunk count of 1024 and sub-chunk size of 4KB.
+#. For small size workloads *k=4*, *m=2* is a good configuration that gives both network
+ and disk IO benefits.
+
+Comparisons with LRC
+====================
+
+Locally Recoverable Codes (LRC) are also designed in order to save in terms of network
+bandwidth, disk IO during single OSD recovery. However, the focus in LRCs is to keep the
+number of OSDs contacted during repair (d) to be minimal at the cost of storage overhead.
+*clay* code has a storage overhead m/k. In the case of *lrc*, it stores (k+m)/d parities in
+addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both *clay* and *lrc*
+can recover from failure of any ``m`` OSDs.
+
+ +-----------------+----------------------------------+----------------------------------+
+ | Parameters | disk IO, storage overhead (LRC) | disk IO, storage overhead (CLAY) |
+ +=================+================+=================+==================================+
+ | (k=10, m=4) | 7 * S, 0.6 (d=7) | 3.25 * S, 0.4 (d=13) |
+ +-----------------+----------------------------------+----------------------------------+
+ | (k=16, m=4) | 4 * S, 0.5 (d=5) | 4.75 * S, 0.25 (d=19) |
+ +-----------------+----------------------------------+----------------------------------+
+
+
+where ``S`` is the amount of data stored of single OSD being recovered.