From cd41b64460106cc2486665d5c9e897a3bd817e1c Mon Sep 17 00:00:00 2001
From: myoungwon oh
Date: Fri, 7 Jun 2019 19:07:05 +0900
Subject: [PATCH] doc: add a deduplication document

Signed-off-by: Myoungwon Oh
---
 doc/dev/deduplication.rst | 258 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 doc/dev/deduplication.rst

diff --git a/doc/dev/deduplication.rst b/doc/dev/deduplication.rst
new file mode 100644
index 00000000000..62b487d3c64
--- /dev/null
+++ b/doc/dev/deduplication.rst
@@ -0,0 +1,258 @@
+===============
+ Deduplication
+===============
+
+
+Introduction
+============
+
+Applying data deduplication to an existing software stack is not easy,
+because it requires additional metadata management and changes to the
+original data processing path.
+
+In a typical deduplication system, an input data object is split into
+multiple chunks by a chunking algorithm. The deduplication system then
+compares each chunk with the chunks already stored in the system. To do
+this efficiently, it maintains a fingerprint index that stores the hash
+value of each chunk, so that existing chunks can be found by comparing
+hash values rather than scanning all of the contents in the underlying
+storage.
+
+There are many challenges in implementing deduplication on top of Ceph.
+Two of them are essential. The first is managing the scalability of the
+fingerprint index; the second is ensuring compatibility between the newly
+introduced deduplication metadata and the existing metadata.
+
+Key Idea
+========
+1. Content hashing (double hashing): Each client can find an object's data
+for a given object ID using CRUSH; with CRUSH, a client knows the object's
+location in the base tier.
+By hashing the object's content at the base tier, a new OID (chunk ID) is
+generated. The chunk tier stores, under this new OID, a chunk that holds
+part of the original object's content.
+
+ Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
+ CRUSH(K) -> chunk's location
+
+
+2. Self-contained object: An external metadata design makes integration
+with existing storage features difficult, because those features cannot
+recognize the additional external data structures. If the deduplication
+system is designed without any external component, the original storage
+features can be reused.
+
+More details can be found in https://ieeexplore.ieee.org/document/8416369
+
+Design
+======
+
+.. ditaa::
+
+           +-------------+
+           | Ceph Client |
+           +------+------+
+                  ^
+   Tiering is     |
+  Transparent     |               Metadata
+    to Ceph       |           +---------------+
+   Client Ops     |           |               |
+                  |    +----->+   Base Pool   |
+                  |    |      |               |
+                  |    |      +-----+---+-----+
+                  |    |            |   ^
+                  v    v            |   |  Dedup metadata in Base Pool
+           +------+----+--+         |   |  (Dedup metadata contains chunk offsets
+           |   Objecter   |         |   |   and fingerprints)
+           +-----------+--+         |   |
+                       ^            |   |  Data in Chunk Pool
+                       |            v   |
+                       |      +-----+---+-----+
+                       |      |               |
+                       +----->|   Chunk Pool  |
+                              |               |
+                              +---------------+
+                                    Data
+
+
+Pool-based object management:
+We define two pools. The base pool stores metadata objects and the chunk
+pool stores chunk objects. Since the two pools are separated by purpose
+and usage, each pool can be managed more efficiently according to its own
+characteristics. The base pool and the chunk pool can independently select
+a redundancy scheme (replication or erasure coding) depending on their
+usage, and each pool can be placed on different storage devices depending
+on the required performance.
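+
+For example, the two pools could be created with different redundancy
+schemes. The following is only a minimal sketch: the pool names match the
+examples used later in this document, while the PG counts and the use of
+the default erasure-code profile are arbitrary choices. ::
+
+  # replicated base pool for original objects and dedup (manifest) metadata
+  # (the PG counts below are only examples)
+  ceph osd pool create base_pool 64 64 replicated
+
+  # erasure-coded chunk pool for deduplicated chunk objects
+  # (uses the default erasure-code profile)
+  ceph osd pool create chunk_pool 64 64 erasure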
+
+Manifest object:
+Metadata objects are stored in the base pool; they contain the metadata
+needed for data deduplication.
+
+::
+
+        struct object_manifest_t {
+                enum {
+                        TYPE_NONE = 0,
+                        TYPE_REDIRECT = 1,
+                        TYPE_CHUNKED = 2,
+                };
+                uint8_t type;   // redirect, chunked, ...
+                hobject_t redirect_target;
+                std::map<uint64_t, chunk_info_t> chunk_map;
+        }
+
+
+Chunk object:
+Chunk objects are stored in the chunk pool. A chunk object contains chunk
+data and its reference count information.
+
+
+Although chunk objects and manifest objects serve a different purpose from
+existing objects, they can be handled in the same way as ordinary objects.
+Therefore, no additional dedup-specific operations are needed to support
+existing features such as replication.
+
+Usage
+=====
+
+To set up deduplication, you need two pools: one acts as the base pool and
+the other as the chunk pool. The base pool needs to be configured with the
+fingerprint_algorithm option as follows.
+
+::
+
+  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
+  --yes-i-really-mean-it
+
+1. Create objects ::
+
+     - rados -p base_pool put foo ./foo
+
+     - rados -p chunk_pool put foo-chunk ./foo-chunk
+
+2. Make a manifest object ::
+
+     - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
+       chunk_pool foo-chunk $START_OFFSET --with-reference
+
+
+Interface
+=========
+
+* set-redirect
+
+  Set up a redirection between a base_object in the base_pool and a
+  target_object in the target_pool.
+  A redirected object forwards all client operations to the target_object. ::
+
+        rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
+        <target_object>
+
+* set-chunk
+
+  Set a chunk offset in a source_object to create a link between it and a
+  target_object. ::
+
+        rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
+        <target_pool> <target_object> <target_offset>
+
+* tier-promote
+
+  Promote the object (including its chunks). ::
+
+        rados -p base_pool tier-promote <obj_name>
+
+* unset-manifest
+
+  Unset the manifest option on an object that has a manifest. ::
+
+        rados -p base_pool unset-manifest <obj_name>
+
+
+Dedup tool
+==========
+
+The dedup tool has two features: finding an optimal chunk offset for dedup
+chunking and fixing the reference count.
+
+* find optimal chunk offset
+
+  a. fixed chunk
+
+     To find a good fixed chunk length, you need to run the following
+     command many times while changing the chunk_size. ::
+
+       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
+       --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
+
+  b. rabin chunk (Rabin-Karp algorithm)
+
+     The Rabin-Karp algorithm is a string-searching algorithm based on a
+     rolling hash, but a rolling hash by itself is not enough for
+     deduplication because we do not know the chunk boundaries. So we need
+     content-based slicing that uses a rolling hash for content-defined
+     chunking.
+     The current implementation uses the simplest approach: look for chunk
+     boundaries by inspecting the rolling hash for a pattern (for example,
+     the lower N bits are all zeroes).
+
+     - Usage
+
+       Users who want to use deduplication need to find an ideal chunk
+       configuration. To do so, they should discover the optimal
+       configuration for their data workload via ceph-dedup-tool. This
+       chunking information is then used for object chunking through the
+       set-chunk API. ::
+
+         ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
+         --chunk-algorithm rabin --fingerprint-algorithm rabin
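+
+       Finding a good chunk configuration is an iterative search: the
+       estimate operation is run repeatedly with different parameters and
+       the reported results are compared. The loop below is only a minimal
+       sketch of such a sweep for rabin chunking; the candidate minimum
+       chunk sizes are arbitrary choices. ::
+
+         # candidate minimum chunk sizes below are only examples
+         for MIN_SIZE in 8192 16384 32768 65536
+         do
+             ceph-dedup-tool --op estimate --pool $POOL --min-chunk $MIN_SIZE \
+                 --chunk-algorithm rabin --fingerprint-algorithm rabin
+         done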
+
+       ceph-dedup-tool provides a number of options for tuning rabin
+       chunking. These are the rabin chunk options. ::
+
+         --mod-prime
+         --rabin-prime
+         --pow
+         --chunk-mask-bit
+         --window-size
+         --min-chunk
+         --max-chunk
+
+       Users should refer to the following equation when setting the above
+       options for rabin chunking. ::
+
+         rabin_hash =
+           (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
+
+  c. Fixed chunk vs content-defined chunk
+
+     Content-defined chunking may or may not be the optimal solution.
+     For example,
+
+     Data chunk A : abcdefgabcdefgabcdefg
+
+     Let's consider deduplicating data chunk A. The ideal chunk boundary is
+     from 1 to 7 (abcdefg), so with fixed chunking a chunk length of 7 is
+     optimal. With content-based slicing, however, the optimal chunk length
+     may not be found (the dedup ratio will not be 100%), because we first
+     need to find suitable parameters such as the boundary bit, window size
+     and prime value. This is not as easy as with fixed chunking.
+     However, content-defined chunking is very effective in the following
+     case.
+
+     Data chunk B : abcdefgabcdefgabcdefg
+
+     Data chunk C : Tabcdefgabcdefgabcdefg
+
+     Chunk C is chunk B shifted by one leading byte (T), so fixed-size
+     chunks no longer line up, while content-defined chunking finds the same
+     content boundaries again and can still deduplicate the repeated data.
+
+
+* fix reference count
+
+  Reference counting for dedup is designed to err on the side of false
+  positives: the state (manifest object: no ref, chunk object: has ref) can
+  occur instead of (manifest object: has ref, chunk object: no ref).
+  To fix such inconsistencies, ceph-dedup-tool supports chunk_scrub. ::
+
+        ceph-dedup-tool --op chunk_scrub --pool $POOL --chunk_pool $CHUNK_POOL
-- 
2.39.5