From cd41b64460106cc2486665d5c9e897a3bd817e1c Mon Sep 17 00:00:00 2001
From: myoungwon oh
Date: Fri, 7 Jun 2019 19:07:05 +0900
Subject: [PATCH] doc: add a deduplication document

Signed-off-by: Myoungwon Oh
---
 doc/dev/deduplication.rst | 258 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 doc/dev/deduplication.rst

diff --git a/doc/dev/deduplication.rst b/doc/dev/deduplication.rst
new file mode 100644
index 00000000000..62b487d3c64
--- /dev/null
+++ b/doc/dev/deduplication.rst
@@ -0,0 +1,258 @@
+===============
+ Deduplication
+===============
+
+
+Introduction
+============
+
+Applying data deduplication to an existing software stack is not easy,
+because it requires additional metadata management and changes to the
+original data processing path.
+
+In a typical deduplication system, an input data object is split into
+multiple chunks by a chunking algorithm. The deduplication system then
+compares each chunk with the chunks already stored in the system. To do
+this efficiently, it maintains a fingerprint index that stores the hash
+value of each chunk, so that existing chunks can be found by comparing
+hash values rather than scanning all of the contents in the underlying
+storage.
+
+There are many challenges in implementing deduplication on top of Ceph.
+Two of them are essential. The first is managing the scalability of the
+fingerprint index; the second is ensuring compatibility between the newly
+introduced deduplication metadata and the existing metadata.
+
+Key Idea
+========
+1. Content hashing (double hashing): Each client can find an object's data
+for a given object ID using CRUSH; with CRUSH, a client knows the object's
+location in the base tier.
+By hashing the object's content at the base tier, a new OID (chunk ID) is
+generated. The chunk tier stores, under this new OID, a chunk that holds
+part of the original object's content.
+
+ Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
+ CRUSH(K) -> chunk's location
+
+
+2. Self-contained object: An external metadata design makes integration
+with existing storage features difficult, because those features cannot
+recognize the additional external data structures. If the deduplication
+system is designed without any external component, the original storage
+features can be reused.
+
+More details can be found in https://ieeexplore.ieee.org/document/8416369
+
+Design
+======
+
+.. ditaa::
+
+           +-------------+
+           | Ceph Client |
+           +------+------+
+                  ^
+   Tiering is     |
+  Transparent     |               Metadata
+    to Ceph       |           +---------------+
+   Client Ops     |           |               |
+                  |    +----->+   Base Pool   |
+                  |    |      |               |
+                  |    |      +-----+---+-----+
+                  |    |            |   ^
+                  v    v            |   |  Dedup metadata in Base Pool
+           +------+----+--+         |   |  (Dedup metadata contains chunk offsets
+           |   Objecter   |         |   |   and fingerprints)
+           +-----------+--+         |   |
+                       ^            |   |  Data in Chunk Pool
+                       |            v   |
+                       |      +-----+---+-----+
+                       |      |               |
+                       +----->|   Chunk Pool  |
+                              |               |
+                              +---------------+
+                                    Data
+
+
+Pool-based object management:
+We define two pools. The base pool stores metadata objects and the chunk
+pool stores chunk objects. Since the two pools are separated by purpose
+and usage, each pool can be managed more efficiently according to its own
+characteristics. The base pool and the chunk pool can independently select
+a redundancy scheme (replication or erasure coding) depending on their
+usage, and each pool can be placed on different storage devices depending
+on the required performance.
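+
+For example, the two pools could be created with different redundancy
+schemes. The following is only a minimal sketch: the pool names match the
+examples used later in this document, while the PG counts and the use of
+the default erasure-code profile are arbitrary choices. ::
+
+  # replicated base pool for original objects and dedup (manifest) metadata
+  # (the PG counts below are only examples)
+  ceph osd pool create base_pool 64 64 replicated
+
+  # erasure-coded chunk pool for deduplicated chunk objects
+  # (uses the default erasure-code profile)
+  ceph osd pool create chunk_pool 64 64 erasure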
+
+Manifest object:
+Metadata objects are stored in the base pool; they contain the metadata
+needed for data deduplication.
+
+::
+
+        struct object_manifest_t {
+                enum {
+                        TYPE_NONE = 0,
+                        TYPE_REDIRECT = 1,
+                        TYPE_CHUNKED = 2,
+                };
+                uint8_t type;   // redirect, chunked, ...
+                hobject_t redirect_target;
+                std::map<uint64_t, chunk_info_t> chunk_map;
+        }
+
+
+Chunk object:
+Chunk objects are stored in the chunk pool. A chunk object contains chunk
+data and its reference count information.
+
+
+Although chunk objects and manifest objects serve a different purpose from
+existing objects, they can be handled in the same way as ordinary objects.
+Therefore, no additional dedup-specific operations are needed to support
+existing features such as replication.
+
+Usage
+=====
+
+To set up deduplication, you need two pools: one acts as the base pool and
+the other as the chunk pool. The base pool needs to be configured with the
+fingerprint_algorithm option as follows.
+
+::
+
+  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
+  --yes-i-really-mean-it
+
+1. Create objects ::
+
+     - rados -p base_pool put foo ./foo
+
+     - rados -p chunk_pool put foo-chunk ./foo-chunk
+
+2. Make a manifest object ::
+
+     - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
+       chunk_pool foo-chunk $START_OFFSET --with-reference
+
+
+Interface
+=========
+
+* set-redirect
+
+  Set up a redirection between a base_object in the base_pool and a
+  target_object in the target_pool.
+  A redirected object forwards all client operations to the target_object. ::
+
+        rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
+        <target_object>
+
+* set-chunk
+
+  Set a chunk offset in a source_object to create a link between it and a
+  target_object. ::
+
+        rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
+        <target_pool> <target_object> <target_offset>
+
+* tier-promote
+
+  Promote the object (including its chunks). ::
+
+        rados -p base_pool tier-promote <obj_name>
+
+* unset-manifest
+
+  Unset the manifest option on an object that has a manifest. ::
+
+        rados -p base_pool unset-manifest <obj_name>
+
+
+Dedup tool
+==========
+
+The dedup tool has two features: finding an optimal chunk offset for dedup
+chunking and fixing the reference count.
+
+* find optimal chunk offset
+
+  a. fixed chunk
+
+     To find a good fixed chunk length, you need to run the following
+     command many times while changing the chunk_size. ::
+
+       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
+       --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
+
+  b. rabin chunk (Rabin-Karp algorithm)
+
+     The Rabin-Karp algorithm is a string-searching algorithm based on a
+     rolling hash, but a rolling hash by itself is not enough for
+     deduplication because we do not know the chunk boundaries. So we need
+     content-based slicing that uses a rolling hash for content-defined
+     chunking.
+     The current implementation uses the simplest approach: look for chunk
+     boundaries by inspecting the rolling hash for a pattern (for example,
+     the lower N bits are all zeroes).
+
+     - Usage
+
+       Users who want to use deduplication need to find an ideal chunk
+       configuration. To do so, they should discover the optimal
+       configuration for their data workload via ceph-dedup-tool. This
+       chunking information is then used for object chunking through the
+       set-chunk API. ::
+
+         ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
+         --chunk-algorithm rabin --fingerprint-algorithm rabin
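+
+       Finding a good chunk configuration is an iterative search: the
+       estimate operation is run repeatedly with different parameters and
+       the reported results are compared. The loop below is only a minimal
+       sketch of such a sweep for rabin chunking; the candidate minimum
+       chunk sizes are arbitrary choices. ::
+
+         # candidate minimum chunk sizes below are only examples
+         for MIN_SIZE in 8192 16384 32768 65536
+         do
+             ceph-dedup-tool --op estimate --pool $POOL --min-chunk $MIN_SIZE \
+                 --chunk-algorithm rabin --fingerprint-algorithm rabin
+         done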
+
+       ceph-dedup-tool provides a number of options for tuning rabin
+       chunking. These are the rabin chunk options. ::
+
+         --mod-prime
+         --rabin-prime
+         --pow
+         --chunk-mask-bit
+         --window-size
+         --min-chunk
+         --max-chunk
+
+       Users should refer to the following equation when setting the above
+       options for rabin chunking. ::
+
+         rabin_hash =
+           (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
+
+  c. Fixed chunk vs content-defined chunk
+
+     Content-defined chunking may or may not be the optimal solution.
+     For example,
+
+     Data chunk A : abcdefgabcdefgabcdefg
+
+     Let's consider deduplicating data chunk A. The ideal chunk boundary is
+     from 1 to 7 (abcdefg), so with fixed chunking a chunk length of 7 is
+     optimal. With content-based slicing, however, the optimal chunk length
+     may not be found (the dedup ratio will not be 100%), because we first
+     need to find suitable parameters such as the boundary bit, window size
+     and prime value. This is not as easy as with fixed chunking.
+     However, content-defined chunking is very effective in the following
+     case.
+
+     Data chunk B : abcdefgabcdefgabcdefg
+
+     Data chunk C : Tabcdefgabcdefgabcdefg
+
+     Chunk C is chunk B shifted by one leading byte (T), so fixed-size
+     chunks no longer line up, while content-defined chunking finds the same
+     content boundaries again and can still deduplicate the repeated data.
+
+
+* fix reference count
+
+  Reference counting for dedup is designed to err on the side of false
+  positives: the state (manifest object: no ref, chunk object: has ref) can
+  occur instead of (manifest object: has ref, chunk object: no ref).
+  To fix such inconsistencies, ceph-dedup-tool supports chunk_scrub. ::
+
+        ceph-dedup-tool --op chunk_scrub --pool $POOL --chunk_pool $CHUNK_POOL
-- 
2.39.5