From c873a7af082a947f00fdc55947ae95f1bcd3d5c5 Mon Sep 17 00:00:00 2001 From: myoungwon oh Date: Tue, 21 Apr 2020 23:23:07 -0400 Subject: [PATCH] doc: break deduplication.rst int several files Signed-off-by: Samuel Just Signed-off-by: Myoungwon Oh --- doc/dev/deduplication.rst | 144 +-------------------------- doc/dev/osd_internals/manifest.rst | 153 +++++++++++++++++++++++++++++ 2 files changed, 154 insertions(+), 143 deletions(-) create mode 100644 doc/dev/osd_internals/manifest.rst diff --git a/doc/dev/deduplication.rst b/doc/dev/deduplication.rst index def15955e43..2dec9f03bc3 100644 --- a/doc/dev/deduplication.rst +++ b/doc/dev/deduplication.rst @@ -117,148 +117,6 @@ from existing objects, they can be handled the same way as original objects. Therefore, to support existing features such as replication, no additional operations for dedup are needed. -Usage -===== -To set up deduplication pools, you must have two pools. One will act as the -base pool and the other will act as the chunk pool. The base pool need to be -configured with fingerprint_algorithm option as follows. - -:: - - ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512 - --yes-i-really-mean-it - -1. Create objects :: - - - rados -p base_pool put foo ./foo - - - rados -p chunk_pool put foo-chunk ./foo-chunk - -2. Make a manifest object :: - - - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool - chunk_pool foo-chunk $START_OFFSET --with-reference - - -Interface -========= - -* set-redirect - - set redirection between a base_object in the base_pool and a target_object - in the target_pool. - A redirected object will forward all operations from the client to the - target_object. :: - - rados -p base_pool set-redirect --target-pool - - -* set-chunk - - set chunk-offset in a source_object to make a link between it and a - target_object. :: - - rados -p base_pool set-chunk --target-pool - - -* tier-promote - - promote the object (including chunks). :: - - rados -p base_pool tier-promote - -* unset-manifest - - unset manifest option from the object that has manifest. :: - - rados -p base_pool unset-manifest - -* tier-flush - - flush the object that has chunks to the chunk pool. :: - - rados -p base_pool tier-flush - -Dedup tool -========== - -Dedup tool has two features: finding optimal chunk offset for dedup chunking -and fixing the reference count. - -* find optimal chunk offset - - a. fixed chunk - - To find out a fixed chunk length, you need to run following command many - times while changing the chunk_size. :: - - ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size - --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512 - - b. rabin chunk(Rabin-karp algorithm) - - As you know, Rabin-karp algorithm is string-searching algorithm based - on a rolling-hash. But rolling-hash is not enough to do deduplication because - we don't know the chunk boundary. So, we need content-based slicing using - a rolling hash for content-defined chunking. - The current implementation uses the simplest approach: look for chunk boundaries - by inspecting the rolling hash for pattern(like the - lower N bits are all zeroes). - - - Usage - - Users who want to use deduplication need to find an ideal chunk offset. - To find out ideal chunk offset, Users should discover - the optimal configuration for their data workload via ceph-dedup-tool. - And then, this chunking information will be used for object chunking through - set-chunk api. :: - - ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size - --chunk-algorithm rabin --fingerprint-algorithm rabin - - ceph-dedup-tool has many options to utilize rabin chunk. - These are options for rabin chunk. :: - - --mod-prime - --rabin-prime - --pow - --chunk-mask-bit - --window-size - --min-chunk - --max-chunk - - Users need to refer following equation to use above options for rabin chunk. :: - - rabin_hash = - (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime) - - c. Fixed chunk vs content-defined chunk - - Content-defined chunking may or not be optimal solution. - For example, - - Data chunk A : abcdefgabcdefgabcdefg - - Let's think about Data chunk A's deduplication. Ideal chunk offset is - from 1 to 7 (abcdefg). So, if we use fixed chunk, 7 is optimal chunk length. - But, in the case of content-based slicing, the optimal chunk length - could not be found (dedup ratio will not be 100%). - Because we need to find optimal parameter such - as boundary bit, window size and prime value. This is as easy as fixed chunk. - But, content defined chunking is very effective in the following case. - - Data chunk B : abcdefgabcdefgabcdefg - - Data chunk C : Tabcdefgabcdefgabcdefg - - -* fix reference count - - The key idea behind of reference counting for dedup is false-positive, which means - (manifest object (no ref), chunk object(has ref)) happen instead of - (manifest object (has ref), chunk 1(no ref)). - To fix such inconsistency, ceph-dedup-tool supports chunk_scrub. :: - - ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL +Regarding how to use, please see :doc:`doc/dev/osd_internals/manifest.rst` diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst new file mode 100644 index 00000000000..806baa12889 --- /dev/null +++ b/doc/dev/osd_internals/manifest.rst @@ -0,0 +1,153 @@ +======== +Manifest +======== + + + +RAODS Interface +=============== + +To set up deduplication pools, you must have two pools. One will act as the +base pool and the other will act as the chunk pool. The base pool need to be +configured with fingerprint_algorithm option as follows. + +:: + + ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512 + --yes-i-really-mean-it + +1. Create objects :: + + - rados -p base_pool put foo ./foo + + - rados -p chunk_pool put foo-chunk ./foo-chunk + +2. Make a manifest object :: + + - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool + chunk_pool foo-chunk $START_OFFSET --with-reference + +Operations: + + +* set-redirect + + set a redirection between a base_object in the base_pool and a target_object + in the target_pool. + A redirected object will forward all operations from the client to the + target_object. :: + + rados -p base_pool set-redirect --target-pool + + +* set-chunk + + set the chunk-offset in a source_object to make a link between it and a + target_object. :: + + rados -p base_pool set-chunk --target-pool + + +* tier-promote + + promote the object (including chunks). :: + + rados -p base_pool tier-promote + +* unset-manifest + + unset the manifest info in the object that has manifest. :: + + rados -p base_pool unset-manifest + +* tier-flush + + flush the object which has chunks to the chunk pool. :: + + rados -p base_pool tier-flush + + + +Dedup tool +========== + +Dedup tool has two features: finding an optimal chunk offset for dedup chunking +and fixing the reference count. + +* find an optimal chunk offset + + a. fixed chunk + + To find out a fixed chunk length, you need to run the following command many + times while changing the chunk_size. :: + + ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size + --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512 + + b. rabin chunk(Rabin-karp algorithm) + + As you know, Rabin-karp algorithm is string-searching algorithm based + on a rolling-hash. But rolling-hash is not enough to do deduplication because + we don't know the chunk boundary. So, we need content-based slicing using + a rolling hash for content-defined chunking. + The current implementation uses the simplest approach: look for chunk boundaries + by inspecting the rolling hash for pattern(like the + lower N bits are all zeroes). + + - Usage + + Users who want to use deduplication need to find an ideal chunk offset. + To find out ideal chunk offset, Users should discover + the optimal configuration for their data workload via ceph-dedup-tool. + And then, this chunking information will be used for object chunking through + set-chunk api. :: + + ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size + --chunk-algorithm rabin --fingerprint-algorithm rabin + + ceph-dedup-tool has many options to utilize rabin chunk. + These are options for rabin chunk. :: + + --mod-prime + --rabin-prime + --pow + --chunk-mask-bit + --window-size + --min-chunk + --max-chunk + + Users need to refer following equation to use above options for rabin chunk. :: + + rabin_hash = + (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime) + + c. Fixed chunk vs content-defined chunk + + Content-defined chunking may or not be optimal solution. + For example, + + Data chunk A : abcdefgabcdefgabcdefg + + Let's think about Data chunk A's deduplication. Ideal chunk offset is + from 1 to 7 (abcdefg). So, if we use fixed chunk, 7 is optimal chunk length. + But, in the case of content-based slicing, the optimal chunk length + could not be found (dedup ratio will not be 100%). + Because we need to find optimal parameter such + as boundary bit, window size and prime value. This is as easy as fixed chunk. + But, content defined chunking is very effective in the following case. + + Data chunk B : abcdefgabcdefgabcdefg + + Data chunk C : Tabcdefgabcdefgabcdefg + + +* fix reference count + + The key idea behind of reference counting for dedup is false-positive, which means + (manifest object (no ref), chunk object(has ref)) happen instead of + (manifest object (has ref), chunk 1(no ref)). + To fix such inconsistency, ceph-dedup-tool supports chunk_scrub. :: + + ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL + + -- 2.39.5