doc/radosgw: Improve language, formatting in s3_objects_dedup.rst

author Ville Ojamo <14869000+bluikko@users.noreply.github.com>

Mon, 15 Dec 2025 17:04:18 +0000 (00:04 +0700)

committer Ville Ojamo <14869000+bluikko@users.noreply.github.com>

Mon, 9 Mar 2026 07:35:08 +0000 (14:35 +0700)
diff --git a/doc/radosgw/s3_objects_dedup.rst b/doc/radosgw/s3_objects_dedup.rst

index ae62531de26bb29f041ec703db43d25ca3dd80eb..28ea792f4b136efb564968976a44f986e2a59aa9 100644
--- a/doc/radosgw/s3_objects_dedup.rst
+++ b/doc/radosgw/s3_objects_dedup.rst
@@ -1,122 +1,147 @@
  =====================
  Full RGW Object Dedup
  =====================
-Adds ``radosgw-admin`` commands to remove duplicated RGW tail-objects and to collect and report deduplication stats.
  
-**************
-Admin commands
-**************
+Full RGW object deduplication adds ``radosgw-admin`` commands to remove
+duplicated RGW tail objects and to collect and report dedup statistics.
+
+
+Admin Commands
+==============
+
  - ``radosgw-admin dedup estimate``:
-   Starts a new dedup estimate session (aborting first existing session if exists).
-   It doesn't make any change to the existing system and will only collect statistics and report them.
+   Starts a new dedup estimate session, first aborting any existing session.
+   No changes are made to the existing system. Only statistics will be
+   collected and reported.
  - ``radosgw-admin dedup exec --yes-i-really-mean-it``:
-   Starts a new dedup session (aborting first existing session if exists).
-   It will perform a full dedup, finding duplicated tail-objects and removing them.
+   Starts a new dedup session, first aborting any existing session.
+   Performs a full dedup, finding duplicated tail objects and removing them.
  
-  This command can lead to **data-loss** and should not be used on production data!!
+   This command can lead to **data loss** and should not be used on production
+   data!
  - ``radosgw-admin dedup pause``:
     Pauses an active dedup session (dedup resources are not released).
  - ``radosgw-admin dedup resume``:
     Resumes a paused dedup session.
  - ``radosgw-admin dedup abort``:
-   Aborts an active dedup session and release all resources used by it.
+   Aborts an active dedup session, releasing all resources used by it.
  - ``radosgw-admin dedup stats``:
-   Collects & displays last dedup statistics.
+   Collects and displays the most recent dedup statistics.
  - ``radosgw-admin dedup throttle --max-bucket-index-ops=<count>``:
-   Specify max bucket-index requests per second allowed for a single RGW server during dedup, 0 means unlimited.
+   Specifies the maximum number of bucket index read requests per second
+   allowed for a single RGW server during dedup. ``0`` means unlimited.
  - ``radosgw-admin dedup throttle --stat``:
-   Display dedup throttle setting.
+   Displays the current dedup throttle setting.
+
  
-***************
  Skipped Objects
-***************
-Dedup Estimate process skips the following objects:
+===============
  
-- Objects smaller than rgw_dedup_min_obj_size_for_dedup (unless they are multipart).
+The dedup estimate process skips the following objects:
+
+- Objects smaller than :confval:`rgw_dedup_min_obj_size_for_dedup` (unless they
+  are multipart).
  - Objects with different placement rules.
  - Objects with different pools.
  - Objects with different storage classes.
  
-The full dedup process skips all the above and it also skips **compressed** and **user-encrypted** objects.
+The full dedup process skips all of the above and additionally skips
+**compressed** and **user-encrypted** objects.
  
-The minimum size object for dedup is controlled by the following config option:
+The minimum object size for dedup is controlled by the following
+configuration option:
  
  .. confval:: rgw_dedup_min_obj_size_for_dedup
  
-*******************
+
  Estimate Processing
-*******************
-The Dedup Estimate process collects all the needed information directly from
-the bucket indices reading one full bucket index object with thousands of
-entries at a time.
+===================
+
+The dedup estimate process collects all the needed information directly from
+the bucket indices, reading one full bucket index object at a time, a thousand
+entries per read.
  
-The bucket indices objects are sharded between the participating
-members so every bucket index object is read exactly one time.
-The sharding allow processing to scale almost linearly splitting the
-load evenly between the participating members.
+The bucket index objects are sharded between the participating members so each
+bucket index object is read exactly one time. The sharding allows processing to
+scale almost linearly, splitting the load evenly between the participating
+members.
  
-The Dedup Estimate process does not access the objects themselves
-(data/metadata) which means its processing time won't be affected by
-the underlying media storing the objects (SSD/HDD) since the bucket indices are
-virtually always stored on a fast medium (SSD with heavy memory
-caching).
+The dedup estimate process does not access the objects themselves
+(data/metadata), which means its processing time won't be affected by the
+underlying media (SSD/HDD) storing the objects. The bucket indices are
+virtually always accessed from a fast medium: placement on SSD
+:ref:`is recommended <hardware-recommendations>` and they are cached heavily
+in memory.
  
-The admin can throttle the estimate process by setting a limit to the number of
-bucket-index reads per-second per an RGW server (each read brings 1000 object entries) using:
+The administrator can throttle the estimate process by setting a limit on the
+number of bucket index reads per second per RGW server (each read returns
+1000 object entries) using:
  
-$ radosgw-admin dedup throttle --max-bucket-index-ops=<count>
+.. prompt:: bash #
+
+   radosgw-admin dedup throttle --max-bucket-index-ops=<count>
+
+A typical RGW server performs about 100 bucket index reads per second (about
+100,000 object entries per second). For example, setting ``count`` to 50 would
+then typically slow down the estimate process by half.
  
-A typical RGW server performs about 100 bucket-index reads per second (i.e. 100,000 object entries).
-Setting the count to 50 will typically slow down access by half and so on...
  
-*********************
  Full Dedup Processing
-*********************
-The Full Dedup process begins by constructing a dedup table from the bucket indices, similar to the estimate process above.
+=====================
+
+The full dedup process begins by constructing a dedup table from the bucket
+indices, similar to the estimate process above.
  
-This table is then scanned linearly to purge objects without duplicates, leaving only dedup candidates.
+This table is then scanned linearly to purge objects without duplicates,
+leaving only dedup candidates.
  
-Next, we iterate through these dedup candidate objects, reading their complete information from the object metadata (a per-object RADOS operation).
-During this step, we filter out **compressed** and **user-encrypted** objects.
+Next, we iterate through these dedup candidate objects, reading their complete
+information from the object metadata (a per-object RADOS operation). During
+this step, we filter out **compressed** and **user-encrypted** objects.
  
-Following this, we calculate a strong-hash of the object data, which involves a full-object read and is a resource-intensive operation.
-This strong-hash ensures that the dedup candidates are indeed perfect matches.
-If they are, we proceed with the deduplication:
+Following this, we calculate a cryptographically strong hash of the candidate
+object data. This involves a full-object read, which is a resource-intensive
+operation. The hash ensures that the dedup candidates are indeed perfect
+matches. If they are, we proceed with the deduplication:
+
+- Increment the reference count on the source tail objects one by one.
+- Copy the manifest from the source to the target.
+- Remove all tail objects on the target.
  
-- incrementing the reference count on the source tail-objects one by one.
-- copying the manifest from the source to the target.
-- removing all tail-objects on the target.
  
-***************
  Split Head Mode
-***************
+===============
+
  Dedup code can split the head object into 2 objects
  
  - one with attributes and no data and
-- a new tail-object with only data.
+- a new tail object with only data.
+
+The new tail object will be deduped, unlike the head objects, which cannot
+be deduplicated.
+This feature is only enabled for RGW objects without existing tail objects
+(in other words, objects sized 4 MB or less).
  
-The new-tail object will be deduped (unlike the head objects which can't be deduplicated)
-This feature is only enabled for RGW Objects without existing tail-objects (in other words object-size <= 4MB)
  
-************
  Memory Usage
-************
- +---------------+----------+
- | RGW Obj Count |  Memory  |
- +===============+==========+
- | 1M            | 8 MB     |
- +---------------+----------+
- | 4M            | 16 MB    |
- +---------------+----------+
- | 16M           | 32 MB    |
- +---------------+----------+
- | 64M           | 64 MB    |
- +---------------+----------+
- | 256M          | 128 MB   |
- +---------------+----------+
- | 1024M (1G)    | 256 MB   |
- +---------------+----------+
- | 4096M (4G)    | 512 MB   |
- +---------------+----------+
- | 16384M (16G)  | 1024 MB  |
- +---------------+----------+
+============
+
+ +------------------+----------+
+ | RGW Object Count |  Memory  |
+ +==================+==========+
+ | 1M               | 8 MB     |
+ +------------------+----------+
+ | 4M               | 16 MB    |
+ +------------------+----------+
+ | 16M              | 32 MB    |
+ +------------------+----------+
+ | 64M              | 64 MB    |
+ +------------------+----------+
+ | 256M             | 128 MB   |
+ +------------------+----------+
+ | 1024M (1G)       | 256 MB   |
+ +------------------+----------+
+ | 4096M (4G)       | 512 MB   |
+ +------------------+----------+
+ | 16384M (16G)     | 1024 MB  |
+ +------------------+----------+
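
Reviewer note on the Memory Usage table: the rows follow a simple pattern, with memory doubling each time the object count quadruples. The sketch below expresses that pattern as a square-root fit so sizes between rows can be approximated. The function name `estimated_dedup_memory_mb` and the fit itself are assumptions read off the table, not a formula taken from the RGW code or documentation.

```python
import math

# Rows copied from the "Memory Usage" table:
# (RGW object count in millions, memory in MB).
MEMORY_TABLE = [
    (1, 8),
    (4, 16),
    (16, 32),
    (64, 64),
    (256, 128),
    (1024, 256),
    (4096, 512),
    (16384, 1024),
]


def estimated_dedup_memory_mb(object_count_millions: float) -> float:
    """Approximate dedup memory in MB for a given object count (in millions).

    Assumption: memory grows with the square root of the object count,
    which is the pattern the table rows follow. Hypothetical helper,
    not part of radosgw-admin.
    """
    if object_count_millions <= 0:
        raise ValueError("object count must be positive")
    return 8.0 * math.sqrt(object_count_millions)


if __name__ == "__main__":
    for count, table_mb in MEMORY_TABLE:
        fit_mb = estimated_dedup_memory_mb(count)
        print(f"{count:>6}M objects: table {table_mb} MB, fit {fit_mb:.0f} MB")
```

The fit reproduces every row of the table, but values outside the listed range are extrapolations and should be treated with caution.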
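
Reviewer note on the throttle description in the Estimate Processing section: the arithmetic behind "setting ``count`` to 50 would slow down the estimate process by half" can be sketched as below. This is a rough model only. It assumes the throttle is the limiting factor, that every read returns the full 1000 entries stated in the text, and that work splits perfectly evenly across RGW servers (the text says "almost linearly"). The function name and parameters are hypothetical.

```python
ENTRIES_PER_READ = 1000  # from the text: each bucket index read returns 1000 entries


def estimate_scan_seconds(object_entries: int,
                          max_bucket_index_ops: int,
                          rgw_servers: int = 1) -> float:
    """Rough lower bound on the bucket index scan time, in seconds.

    Assumptions (not from the RGW code): the throttle is the bottleneck
    and the load is split perfectly evenly between RGW servers.
    """
    if max_bucket_index_ops <= 0:
        # A value of 0 means "unlimited", so the throttle gives no bound.
        raise ValueError("max_bucket_index_ops must be positive for this model")
    if rgw_servers < 1:
        raise ValueError("rgw_servers must be at least 1")
    entries_per_second = max_bucket_index_ops * ENTRIES_PER_READ * rgw_servers
    return object_entries / entries_per_second


if __name__ == "__main__":
    objects = 100_000_000
    for ops in (100, 50):
        seconds = estimate_scan_seconds(objects, ops)
        print(f"--max-bucket-index-ops={ops}: about {seconds:.0f} s")
```

Under this model, halving the throttle value doubles the scan time, which matches the example given in the documentation text.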