From c5a46dde94bfb18a050876a7d23446a991a7a224 Mon Sep 17 00:00:00 2001
From: Myna V
Date: Mon, 8 Oct 2018 18:56:01 +0530
Subject: [PATCH] doc: clay code plugin

Signed-off-by: Myna
---
 doc/rados/operations/erasure-code-clay.rst | 235 +++++++++++++++++++++
 doc/rados/operations/erasure-code.rst      |   1 +
 2 files changed, 236 insertions(+)
 create mode 100644 doc/rados/operations/erasure-code-clay.rst

diff --git a/doc/rados/operations/erasure-code-clay.rst b/doc/rados/operations/erasure-code-clay.rst
new file mode 100644
index 0000000000000..da4c5e25b664a
--- /dev/null
+++ b/doc/rados/operations/erasure-code-clay.rst
@@ -0,0 +1,235 @@

=========================
Coupled LAYer code plugin
=========================

Coupled LAYer (CLAY) codes are erasure codes designed to save network
bandwidth and disk IO when a failed node, OSD or rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires reading
from the *d=8* remaining OSDs in order to repair it. Recovering, say, 1GiB therefore
requires downloading 8 X 1GiB = 8GiB of information.

With the *clay* plugin, however, *d* is configurable within the range:

    k+1 <= d <= k+m-1

By default the clay code plugin picks *d=k+m-1*, as this gives the greatest savings
in network bandwidth and disk IO. When the *clay* plugin is configured with *k=8*,
*m=4* and *d=11* and a single OSD fails, d=11 OSDs are contacted and 250MiB is
downloaded from each of them, for a total download of 11 X 250MiB = 2.75GiB.
More general parameters are shown in the table below. The savings are substantial
when the repair is being done for a rack that stores information on the order of
terabytes.

+-------------+---------------------------+
| plugin      | total amount of disk IO   |
+=============+===========================+
|jerasure,isa | k*S                       |
+-------------+---------------------------+
| clay        | d*S/(d-k+1) = (k+m-1)*S/m |
+-------------+---------------------------+

where *S* is the amount of data stored on the single OSD being repaired. The table
uses the maximum possible value of *d*, which minimizes the amount of data
transmitted during recovery.
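As a quick illustration of the table above, the following minimal Python sketch
evaluates the two repair-traffic formulas for the *k=8*, *m=4* example. It is
purely illustrative arithmetic; the helper functions are made up for this sketch
and are not part of Ceph::

    # Repair-traffic formulas from the table above; s_gib is the amount of
    # data (S) stored on the failed OSD, in GiB.
    def jerasure_repair_traffic(k, s_gib):
        # jerasure/isa read k whole chunks to rebuild one lost chunk: k * S
        return k * s_gib

    def clay_repair_traffic(k, m, d, s_gib):
        # clay reads only a 1/(d-k+1) fraction of a chunk from each of d helpers
        return d * s_gib / (d - k + 1)

    S = 1.0  # GiB stored on the failed OSD
    print(jerasure_repair_traffic(8, S))     # 8.0 GiB, as in the jerasure example
    print(clay_repair_traffic(8, 4, 11, S))  # 2.75 GiB, as in the clay example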
Erasure code profile examples
=============================

The reduced bandwidth usage can be observed with a profile such as the following::

    $ ceph osd erasure-code-profile set CLAYprofile \
         plugin=clay \
         k=4 m=2 d=5 \
         crush-failure-domain=host
    $ ceph osd pool create claypool 12 12 erasure CLAYprofile


Create a clay profile
=====================

To create a new clay code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=clay \
         k={data-chunks} \
         m={coding-chunks} \
         [d={helper-chunks}] \
         [scalar_mds={plugin-name}] \
         [technique={technique-name}] \
         [crush-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the greater the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa* or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the plugin specified by ``scalar_mds``. Supported
              techniques are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', 'liber8tion' for jerasure; 'reed_sol_van',
              'cauchy' for isa; and 'single', 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.


Notion of sub-chunks
====================

The clay code is able to save disk IO and network bandwidth because it is a
vector code: it can operate on a chunk at a finer granularity, called a
sub-chunk. The number of sub-chunks within a chunk for a clay code is
given by:

    sub-chunk count = q\ :sup:`(k+m)/q`, where q = d-k+1


During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = sub-chunk count / q

Examples
--------

#. For a configuration with *k=4*, *m=2* and *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4* and *d=11*, the sub-chunk count is 64 and the repair sub-chunk
   count is 16. A quarter of a chunk is read from an available OSD for repair of a
   failed chunk.


How to choose a configuration given a workload
==============================================

Only a few of the sub-chunks within a chunk are read during repair, and these
sub-chunks are not necessarily stored consecutively within the chunk. For the best
disk IO performance it is helpful to read contiguous data, so choose a stripe size
that makes the sub-chunk size sufficiently large.

For a given stripe-size (fixed by the workload), choose ``k``, ``m``, ``d`` such that::

    sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...

#. For large-size workloads, for which the stripe size is large, it is easy to choose
   ``k``, ``m`` and ``d``. For example, with a stripe-size of 64MB, choosing *k=16*,
   *m=4* and *d=19* results in a sub-chunk count of 1024 and a sub-chunk size of 4KB
   (see the sketch below).
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both
   network and disk IO benefits.
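The sub-chunk arithmetic used in the examples above can be reproduced with a short
Python sketch. This is illustrative only, assuming the formulas quoted in the two
preceding sections; the function names are invented for this sketch and are not part
of Ceph::

    def sub_chunk_count(k, m, d):
        # sub-chunk count = q^((k+m)/q), with q = d-k+1; assumes (k+m) is
        # divisible by q, as it is in the configurations discussed above.
        q = d - k + 1
        return q ** ((k + m) // q)

    def repair_sub_chunk_count(k, m, d):
        # repair sub-chunk count = sub-chunk count / q
        return sub_chunk_count(k, m, d) // (d - k + 1)

    print(sub_chunk_count(4, 2, 5), repair_sub_chunk_count(4, 2, 5))    # 8 4
    print(sub_chunk_count(8, 4, 11), repair_sub_chunk_count(8, 4, 11))  # 64 16

    # Sub-chunk size for the 64MB stripe example: 64MB / (16 * 1024) = 4KB
    stripe_size_kib = 64 * 1024
    print(stripe_size_kib / (16 * sub_chunk_count(16, 4, 19)))          # 4.0 (KB)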
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, the focus of LRC is to keep the number
of OSDs contacted during repair (d) to a minimum, at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. An *lrc* code stores (k+m)/d parities
in addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k.
Both *clay* and *lrc* can recover from the failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5 (d=5)                 | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+


where ``S`` is the amount of data stored on the single OSD being recovered.
diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst
index de0ba36474990..03928cd4b23c7 100644
--- a/doc/rados/operations/erasure-code.rst
+++ b/doc/rados/operations/erasure-code.rst
@@ -193,3 +193,4 @@ Table of content
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec
+   erasure-code-clay
-- 
2.39.5