its usage and each pool can be placed in a different storage
location depending on the required performance.
-Regarding how to use, please see osd_internals/manifest.rst
+Regarding how to use, please see ``osd_internals/manifest.rst``
Usage Patterns
==============
-The different ceph interface layers present potentially different oportunities
+The different Ceph interface layers present potentially different opportunities
and costs for deduplication and tiering in general.
RadosGW
Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
-refcounting machinery (osd_internals/refcount.rst) directly without
+refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.
RBD/Cephfs
RADOS Machinery
===============
-For more information on rados redirect/chunk/dedup support, see osd_internals/manifest.rst.
-For more information on rados refcount support, see osd_internals/refcount.rst.
+For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
+For more information on rados refcount support, see ``osd_internals/refcount.rst``.
Status and Future Work
======================
At the moment, there exists some preliminary support for manifest
-objects within the osd as well as a dedup tool.
+objects within the OSD as well as a dedup tool.
RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
support to radosgw.
Aside from radosgw, completing work on manifest object support in the
-osd particularly as it relates to snapshots would be the next step for
+OSD, particularly as it relates to snapshots, would be the next step for
rbd and cephfs workloads.
Asynchronous Recovery
=====================
-PGs in Ceph maintain a log of writes to allow speedy recovery of data.
-Instead of scanning all of the objects to see what is missing on each
-osd, we can examine the pg log to see which objects we need to
-recover. See :ref:`Log Based PG <log-based-pg>` for more details on this process.
+Ceph Placement Groups (PGs) maintain a log of write transactions to
+facilitate speedy recovery of data. During recovery, each of these PG logs
+is used to determine which content in each OSD is missing or outdated.
+This obviates the need to scan all RADOS objects.
+See :ref:`Log Based PG <log-based-pg>` for more details on this process.
-Until now, this recovery process was synchronous - it blocked writes
-to an object until it was recovered. In contrast, backfill could allow
-writes to proceed (assuming enough up-to-date copies of the data were
-available) by temporarily assigning a different acting set, and
-backfilling an OSD outside of the acting set. In some circumstances,
+Prior to the Nautilus release this recovery process was synchronous: it
+blocked writes to a RADOS object until it was recovered. In contrast,
+backfill could allow writes to proceed (assuming enough up-to-date replicas
+were available) by temporarily assigning a different acting set, and
+backfilling an OSD outside of the acting set. In some circumstances
this ends up being significantly better for availability, e.g. if the
-pg log contains 3000 writes to different objects. Recovering several
-megabytes of an object (or even worse, several megabytes of omap keys,
-like rgw bucket indexes) can drastically increase latency for a small
+PG log contains 3000 writes to disjoint objects. When the PG log contains
+thousands of entries, it could actually be faster (though not as safe) to
+trade recovery for backfill by deleting and redeploying the containing
+OSD than to iterate through the PG log. Recovering several megabytes
+of RADOS object data (or even worse, several megabytes of omap keys,
+notably RGW bucket indexes) can drastically increase latency for a small
update, and combined with requests spread across many degraded objects
it is a recipe for slow requests.
-To avoid this, we can perform recovery in the background on an OSD out
-of the acting set, similar to backfill, but still using the PG log to
-determine what needs recovery. This is known as asynchronous recovery.
+To avoid this we can perform recovery in the background on an OSD
+out-of-band of the live acting set, similar to backfill, but still using
+the PG log to determine what needs to be done. This is known as *asynchronous
+recovery*.
-Exactly when we perform asynchronous recovery instead of synchronous
-recovery is not a clear-cut threshold. There are a few criteria which
+The threshold for performing asynchronous recovery instead of synchronous
+recovery is not clear-cut. There are a few criteria which
need to be met for asynchronous recovery:
-* try to keep min_size replicas available
-* use the approximate magnitude of the difference in length of
- logs combined with historical missing objects as the cost of recovery
-* use the parameter osd_async_recovery_min_cost to determine
+* Try to keep ``min_size`` replicas available
+* Use the approximate magnitude of the difference in length of
+ logs combined with historical missing objects to estimate the cost of
+ recovery
+* Use the parameter ``osd_async_recovery_min_cost`` to determine
when asynchronous recovery is appropriate
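
To make these criteria concrete, here is a minimal, self-contained sketch
of how such a cost check might look. It is illustrative only: the type and
helper names are invented here, and the real OSD combines the inputs
somewhat differently. ::

  #include <cstdint>

  // Illustrative sketch only -- not the actual OSD implementation.
  struct PeerRecoveryStats {
    uint64_t log_length;    // approximate PG log length advertised in pg_info_t
    uint64_t num_missing;   // historical missing objects on this peer
  };

  // Approximate cost of log-based recovery for one candidate peer, following
  // the criteria above: log-length delta plus known missing objects.
  inline uint64_t approximate_recovery_cost(const PeerRecoveryStats& primary,
                                            const PeerRecoveryStats& peer) {
    uint64_t log_delta = primary.log_length > peer.log_length
                             ? primary.log_length - peer.log_length
                             : peer.log_length - primary.log_length;
    return log_delta + peer.num_missing;
  }

  // Choose asynchronous recovery only when the estimated cost crosses the
  // configured threshold and enough replicas remain to keep min_size.
  inline bool choose_async_recovery(const PeerRecoveryStats& primary,
                                    const PeerRecoveryStats& peer,
                                    uint64_t osd_async_recovery_min_cost,
                                    bool min_size_still_met) {
    return min_size_still_met &&
           approximate_recovery_cost(primary, peer) >= osd_async_recovery_min_cost;
  }
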
With the existing peering process, when we choose the acting set we
-have not fetched the pg log from each peer, we have only the bounds of
-it and other metadata from their pg_info_t. It would be more expensive
+have not fetched the PG log from each peer; we have only the bounds of
+it and other metadata from their ``pg_info_t``. It would be more expensive
to fetch and examine every log at this point, so we only consider an
approximate check for log length for now. In Nautilus, we improved
-the accounting of missing objects, so post nautilus, this information
+the accounting of missing objects, so post-Nautilus this information
is also used to determine the cost of recovery.
-While async recovery is occurring, writes on members of the acting set
+While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send their log entries to the async
-recovery targets (just like we do for backfill osds) so that they
+recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.
Backfill Reservation
====================
-When a new osd joins a cluster, all pgs containing it must eventually backfill
-to it. If all of these backfills happen simultaneously, it would put excessive
-load on the osd. osd_max_backfills limits the number of outgoing or
-incoming backfills on a single node. The maximum number of outgoing backfills is
-osd_max_backfills. The maximum number of incoming backfills is
-osd_max_backfills. Therefore there can be a maximum of osd_max_backfills * 2
-simultaneous backfills on one osd.
+When a new OSD joins a cluster, all PGs with it in their acting sets must
+eventually backfill. If all of these backfills happen simultaneously
+they will place excessive load on the OSD: the "thundering herd"
+effect.
-Each OSDService now has two AsyncReserver instances: one for backfills going
-from the osd (local_reserver) and one for backfills going to the osd
-(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a queue
-by priority of waiting items and a set of current reservation holders. When a
-slot frees up, the AsyncReserver queues the Context* associated with the next
-item on the highest priority queue in the finisher provided to the constructor.
+The ``osd_max_backfills`` tunable limits the number of outgoing or
+incoming backfills that are active on a given OSD. Note that this limit is
+applied separately to incoming and to outgoing backfill operations.
+Thus there can be as many as ``osd_max_backfills * 2`` backfill operations
+in flight on each OSD. This subtlety is often missed, and Ceph
+operators can be puzzled as to why more ops are observed than expected.
-For a primary to initiate a backfill, it must first obtain a reservation from
-its own local_reserver. Then, it must obtain a reservation from the backfill
-target's remote_reserver via a MBackfillReserve message. This process is
-managed by substates of Active and ReplicaActive (see the substates of Active
-in PG.h). The reservations are dropped either on the Backfilled event, which
-is sent on the primary before calling recovery_complete and on the replica on
-receipt of the BackfillComplete progress message), or upon leaving Active or
-ReplicaActive.
+Each ``OSDService`` now has two AsyncReserver instances: one for backfills going
+from the OSD (``local_reserver``) and one for backfills going to the OSD
+(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
+manages a queue by priority of waiting items and a set of current reservation
+holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
+associated with the next item on the highest priority queue in the finisher
+provided to the constructor.
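
As a rough illustration of that pattern (and not the actual
``common/AsyncReserver.h`` implementation, which is templated, thread-safe
and supports cancellation), a priority-ordered reserver can be sketched
like this: ::

  #include <cstdint>
  #include <deque>
  #include <functional>
  #include <map>

  // Simplified sketch of an AsyncReserver-like queue; illustration only.
  class SimpleReserver {
    unsigned max_allowed;      // available slots, e.g. osd_max_backfills
    unsigned in_use = 0;
    // waiters grouped by priority; the highest priority queue is served first
    std::map<uint64_t, std::deque<std::function<void()>>,
             std::greater<uint64_t>> waiting;

    void maybe_grant() {
      while (in_use < max_allowed && !waiting.empty()) {
        auto& q = waiting.begin()->second;
        auto grant = std::move(q.front());
        q.pop_front();
        if (q.empty())
          waiting.erase(waiting.begin());
        ++in_use;
        grant();               // analogous to queueing the waiter's Context*
      }
    }

  public:
    explicit SimpleReserver(unsigned max) : max_allowed(max) {}

    void request_reservation(uint64_t priority, std::function<void()> on_grant) {
      waiting[priority].push_back(std::move(on_grant));
      maybe_grant();
    }

    void release_reservation() { // called when a backfill or recovery finishes
      if (in_use)
        --in_use;
      maybe_grant();
    }
  };
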
-It's important that we always grab the local reservation before the remote
+For a primary to initiate a backfill it must first obtain a reservation from
+its own ``local_reserver``. Then it must obtain a reservation from the backfill
+target's ``remote_reserver`` via a ``MBackfillReserve`` message. This process is
+managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
+of ``Active`` in ``PG.h``). The reservations are dropped either on the ``Backfilled``
+event (which is sent on the primary before calling ``recovery_complete``
+and on the replica on receipt of the ``BackfillComplete`` progress message),
+or upon leaving ``Active`` or ``ReplicaActive``.
+
+It's important to always grab the local reservation before the remote
reservation in order to prevent a circular dependency.
-We want to minimize the risk of data loss by prioritizing the order in
-which PGs are recovered. A user can override the default order by using
-force-recovery or force-backfill. A force-recovery at priority 255 will start
-before a force-backfill at priority 254.
+We minimize the risk of data loss by prioritizing the order in
+which PGs are recovered. Admins can override the default order by using
+``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op
+priority ``255`` will start before a ``force-backfill`` op at priority ``254``.
+
+If a recovery is needed because a PG is below ``min_size``, a base priority of
+``220`` is used. This is incremented by the number of OSDs short of the pool's
+``min_size`` as well as a value relative to the pool's ``recovery_priority``.
+The resultant priority is capped at ``253`` so that it does not confound forced
+ops as described above. Under ordinary circumstances a recovery op is
+prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``.
+The resultant priority is capped at ``219``.
-If a recovery is needed because a PG is below min_size a base priority of 220
-is used. The number of OSDs below min_size of the pool is added, as well as a
-value relative to the pool's recovery_priority. The total priority is limited
-to 253. Under ordinary circumstances a recovery is prioritized at 180 plus a
-value relative to the pool's recovery_priority. The total priority is limited
-to 219.
+If a backfill op is needed because the number of acting OSDs is less than
+the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs
+short of the pool's ``min_size`` is added as well as a value relative to
+the pool's ``recovery_priority``. The total priority is limited to ``253``.
+If a backfill op is needed because a PG is undersized,
+a priority of ``140`` is used. The number of OSDs below the size of the pool is
+added as well as a value relative to the pool's ``recovery_priority``. The
+resultant priority is capped at ``179``. If a backfill op is
+needed because a PG is degraded, a priority of ``140`` is used. A value
+relative to the pool's ``recovery_priority`` is added. The resultant priority
+is capped at ``179``. Under ordinary circumstances a
+backfill op priority of ``100`` is used. A value relative to the pool's
+``recovery_priority`` is added. The total priority is capped at ``139``.
-If a backfill is needed because the number of acting OSDs is less than min_size,
-a priority of 220 is used. The number of OSDs below min_size of the pool is
-added as well as a value relative to the pool's recovery_priority. The total
-priority is limited to 253. If a backfill is needed because a PG is undersized,
-a priority of 140 is used. The number of OSDs below the size of the pool is
-added as well as a value relative to the pool's recovery_priority. The total
-priority is limited to 179. If a backfill is needed because a PG is degraded,
-a priority of 140 is used. A value relative to the pool's recovery_priority is
-added. The total priority is limited to 179. Under ordinary circumstances a
-backfill is priority of 100 is used. A value relative to the pool's
-recovery_priority is added. The total priority is limited to 139.
+.. list-table:: Backfill and Recovery op priorities
+ :widths: 20 20 20
+ :header-rows: 1
+ * - Description
+ - Base priority
+ - Maximum priority
+ * - Backfill
+ - 100
+ - 139
+ * - Degraded Backfill
+ - 140
+ - 179
+ * - Recovery
+ - 180
+ - 219
+ * - Inactive Recovery
+ - 220
+ - 253
+ * - Inactive Backfill
+ - 220
+ - 253
+ * - force-backfill
+ - 254
+ -
+ * - force-recovery
+ - 255
+ -
-Description Base priority Maximum priority
------------ ------------- ----------------
-Backfill 100 139
-Degraded Backfill 140 179
-Recovery 180 219
-Inactive Recovery 220 253
-Inactive Backfill 220 253
-force-backfill 254
-force-recovery 255
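
The base/maximum pairs in the table can be read as a simple clamp: a base
value plus a pool-relative adjustment, capped so that one class of operation
never overlaps the next (or the forced priorities ``254``/``255``). The
sketch below assumes the adjustment is simply the pool's
``recovery_priority``, which is a simplification of what the OSD actually
computes. ::

  #include <algorithm>

  // Illustrative only: base and cap values taken from the table above.
  constexpr int kBackfillBase = 100,         kBackfillCap = 139;
  constexpr int kDegradedBackfillBase = 140, kDegradedBackfillCap = 179;
  constexpr int kRecoveryBase = 180,         kRecoveryCap = 219;
  constexpr int kInactiveBase = 220,         kInactiveCap = 253;

  // Clamp a base priority plus a pool-relative adjustment into [base, cap] so
  // that, for example, an ordinary backfill can never outrank a degraded one
  // and nothing ordinary reaches the forced priorities 254/255.
  inline int clamped_priority(int base, int pool_adjustment, int cap) {
    return std::min(cap, std::max(base, base + pool_adjustment));
  }

Under that assumption, a degraded backfill on a pool with
``recovery_priority`` 5 would be queued at
``clamped_priority(kDegradedBackfillBase, 5, kDegradedBackfillCap)`` = 145,
still below any ordinary recovery op.
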
last_epoch_started
======================
-info.last_epoch_started records an activation epoch e for interval i
-such that all writes committed in i or earlier are reflected in the
-local info/log and no writes after i are reflected in the local
+``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
+such that all writes committed in ``i`` or earlier are reflected in the
+local info/log and no writes after ``i`` are reflected in the local
info/log. Since no committed write is ever divergent, even if we
-get an authoritative log/info with an older info.last_epoch_started,
-we can leave our info.last_epoch_started alone since no writes could
+get an authoritative log/info with an older ``info.last_epoch_started``,
+we can leave our ``info.last_epoch_started`` alone since no writes could
have committed in any intervening interval (see ``PG::proc_master_log``).
-info.history.last_epoch_started records a lower bound on the most
-recent interval in which the pg as a whole went active and accepted
-writes. On a particular osd, it is also an upper bound on the
-activation epoch of intervals in which writes in the local pg log
-occurred (we update it before accepting writes). Because all
-committed writes are committed by all acting set osds, any
-non-divergent writes ensure that history.last_epoch_started was
+``info.history.last_epoch_started`` records a lower bound on the most
+recent interval in which the PG as a whole went active and accepted
+writes. On a particular OSD it is also an upper bound on the
+activation epoch of intervals in which writes in the local PG log
+occurred: we update it before accepting writes. Because all
+committed writes are committed by all acting set OSDs, any
+non-divergent writes ensure that ``history.last_epoch_started`` was
recorded by all acting set members in the interval. Once peering has
-queried one osd from each interval back to some seen
-history.last_epoch_started, it follows that no interval after the max
-history.last_epoch_started can have reported writes as committed
+queried one OSD from each interval back to some seen
+``history.last_epoch_started``, it follows that no interval after the max
+``history.last_epoch_started`` can have reported writes as committed
(since we record it before recording client writes in an interval).
-Thus, the minimum last_update across all infos with
-info.last_epoch_started >= MAX(history.last_epoch_started) must be an
+Thus, the minimum ``last_update`` across all infos with
+``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
upper bound on writes reported as committed to the client.
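
A minimal sketch of that bound, using simplified stand-ins for
``pg_info_t`` rather than the real types, might look like this: ::

  #include <algorithm>
  #include <cstdint>
  #include <optional>
  #include <vector>

  // Simplified stand-ins for pg_info_t / eversion_t; illustration only.
  struct SimplePeerInfo {
    uint64_t last_epoch_started;          // info.last_epoch_started
    uint64_t history_last_epoch_started;  // info.history.last_epoch_started
    uint64_t last_update;                 // stand-in for eversion_t
  };

  // Upper bound on writes reported as committed to clients: the minimum
  // last_update across infos whose last_epoch_started is at least the
  // maximum history.last_epoch_started seen.
  inline std::optional<uint64_t>
  committed_upper_bound(const std::vector<SimplePeerInfo>& infos) {
    uint64_t max_hles = 0;
    for (const auto& i : infos)
      max_hles = std::max(max_hles, i.history_last_epoch_started);

    std::optional<uint64_t> bound;
    for (const auto& i : infos) {
      if (i.last_epoch_started >= max_hles)
        bound = bound ? std::min(*bound, i.last_update) : i.last_update;
    }
    return bound;  // empty only if no info qualifies
  }
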
-We update info.last_epoch_started with the initial activation message,
-but we only update history.last_epoch_started after the new
-info.last_epoch_started is persisted (possibly along with the first
-write). This ensures that we do not require an osd with the most
-recent info.last_epoch_started until all acting set osds have recorded
+We update ``info.last_epoch_started`` with the initial activation message,
+but we only update ``history.last_epoch_started`` after the new
+``info.last_epoch_started`` is persisted (possibly along with the first
+write). This ensures that we do not require an OSD with the most
+recent ``info.last_epoch_started`` until all acting set OSDs have recorded
it.
-In find_best_info, we do include info.last_epoch_started values when
-calculating the max_last_epoch_started_found because we want to avoid
+In ``find_best_info``, we do include ``info.last_epoch_started`` values when
+calculating ``max_last_epoch_started_found`` because we want to avoid
designating a log entry divergent which in a prior interval would have
been non-divergent since it might have been used to serve a read. In
-activate(), we use the peer's last_epoch_started value as a bound on
+``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
how far back divergent log entries can be found.
However, in a case like ::
calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
-since osd.1 is the only one which recorded info.les=477 while 4,0
-which were the acting set in that interval did not (4 restarted and 0
-did not get the message in time) the pg is marked incomplete when
-either 4 or 0 would have been valid choices. To avoid this, we do not
-consider info.les for incomplete peers when calculating
-min_last_epoch_started_found. It would not have been in the acting
-set, so we must have another osd from that interval anyway (if
-maybe_went_rw). If that osd does not remember that info.les, then we
+since osd.1 is the only one which recorded ``info.les=477``, while osd.4 and osd.0
+(which were the acting set in that interval) did not (osd.4 restarted and osd.0
+did not get the message in time), the PG is marked incomplete when
+either osd.4 or osd.0 would have been valid choices. To avoid this, we do not
+consider ``info.les`` for incomplete peers when calculating
+``min_last_epoch_started_found``. It would not have been in the acting
+set, so we must have another OSD from that interval anyway (if
+``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
cannot have served reads.
-----------------
Currently, consistency for all Ceph pool types is ensured by primary
-log-based replication. This goes for both erasure-coded and
+log-based replication. This goes for both erasure-coded (EC) and
replicated pools.
Primary log-based replication
Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
-of ways to handle this, but ceph's architecture makes it easy for
+of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
-answer is to route all writes for a particular pg through a single
+answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
-actually need to serialize writes on a single object (and even then,
+actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
-concept of interval changes) and an increasing per-pg version number
--- this is referred to in the code with type eversion_t and stored as
-pg_info_t::last_update. Furthermore, we maintain a log of "recent"
+concept of interval changes) and an increasing per-PG version number
+-- this is referred to in the code with type ``eversion_t`` and stored as
+``pg_info_t::last_update``. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill). In practice, the log will extend much further
-(osd_min_pg_log_entries when clean, osd_max_pg_log_entries when not
+(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
clean) because it's handy for quickly performing recovery.
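
For orientation, the (epoch, per-PG version) pair described above can be
pictured as a small lexicographically ordered tuple; the definition below is
a stand-in for illustration, not the real ``eversion_t`` from
``osd_types.h``. ::

  #include <cstdint>
  #include <tuple>

  // Stand-in for eversion_t: the epoch in which the most recent write started
  // plus an increasing per-PG version number, compared lexicographically.
  struct simple_eversion_t {
    uint64_t epoch = 0;
    uint64_t version = 0;
    bool operator<(const simple_eversion_t& rhs) const {
      return std::tie(epoch, version) < std::tie(rhs.epoch, rhs.version);
    }
    bool operator==(const simple_eversion_t& rhs) const {
      return epoch == rhs.epoch && version == rhs.version;
    }
  };
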
Using this log, as long as we talk to a non-empty subset of the OSDs
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
-main point of divergence between replicated pools and ec pools in
-PG/PrimaryLogPG: replicated pools try to choose the newest valid
+main point of divergence between replicated pools and EC pools in
+``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.
The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
-reconstruct an ec object. Indeed, there are encodings where some log
-combinations would leave unrecoverable objects (as with a 4+2 encoding
+reconstruct an EC object. Indeed, there are encodings where some log
+combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
where 3 of the replicas remember a write, but the other 3 do not -- we
don't have 3 copies of either version). For this reason, log entries
representing *unstable* writes (writes not yet committed to the
-client) must be rollbackable using only local information on ec pools.
+client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (and in that case,
via a delayed application or via a set of instructions for rolling
back an in-place update) or not. Replicated pool log entries are
never able to be rolled back.
-For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
-osd_types.h:pg_log_entry_t, and peering in general.
+For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
+``osd_types.h:pg_log_entry_t``, and peering in general.
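
That divergence can be sketched as follows; the types and helper are
illustrative stand-ins, not the actual peering code. ::

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Illustration only: each surviving peer advertises the head of its log.
  struct CandidateHead {
    int osd;
    uint64_t last_update;   // stand-in for the eversion_t log head
  };

  // Replicated pools pick the *newest* head so clients never need to replay;
  // EC pools pick the *oldest* head so that any unstable tail entries can be
  // rolled back using only local information.
  inline int choose_authoritative_osd(const std::vector<CandidateHead>& peers,
                                      bool ec_pool) {
    auto by_head = [](const CandidateHead& a, const CandidateHead& b) {
      return a.last_update < b.last_update;
    };
    auto it = ec_pool ? std::min_element(peers.begin(), peers.end(), by_head)
                      : std::max_element(peers.begin(), peers.end(), by_head);
    return it->osd;        // assumes peers is non-empty
  }
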
ReplicatedBackend/ECBackend unification strategy
================================================
PGBackend
---------
-So, the fundamental difference between replication and erasure coding
+The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
-implementations of PrimaryLogPG, one for each of the two, if there
+implementations of ``PrimaryLogPG`` since there
are really only a few fundamental differences:
-#. How reads work -- async only, requires remote reads for ec
+#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside and do a
   two-phase commit (TPC)
#. Whether we choose the oldest or newest possible head entry during peering
Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
-#. PGBackend
-#. PGTransaction
-#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
+#. ``PGBackend``
+#. ``PGTransaction``
+#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
type -- like omap operations, class operation reads, and writes which are
- not aligned appends (officially, so far) for ec
+ not aligned appends (officially, so far) for EC
#. Misc other kludges here and there
-PGBackend and PGTransaction enable abstraction of differences 1, 2,
+``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above
and the addition of 4 as needed to the log entries.
-The replicated implementation is in ReplicatedBackend.h/cc and doesn't
-require much explanation, I think. More detail on the ECBackend can be
-found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
+The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
+require much additional explanation. More detail on the ``ECBackend`` can be
+found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.
PGBackend Interface Explanation
===============================
-Note: this is from a design document from before the original firefly
+Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.
Readable vs Degraded
--------------------
-For a replicated pool, an object is readable iff it is present on
-the primary (at the right version). For an ec pool, we need at least
-M shards present to do a read, and we need it on the primary. For
-this reason, PGBackend needs to include some interfaces for determining
+For a replicated pool, an object is readable IFF it is present on
+the primary (at the right version). For an EC pool, we need at least
+``k`` shards present to perform a read, and we need it on the primary. For
+this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that it
Core Changes:
-- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
+- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
| objects to allow the user to make these determinations.
Client Reads
------------
-Reads with the replicated strategy can always be satisfied
-synchronously out of the primary OSD. With an erasure coded strategy,
+Reads from a replicated pool can always be satisfied
+synchronously by the primary OSD. Within an erasure coded pool,
the primary will need to request data from some number of replicas in
-order to satisfy a read. PGBackend will therefore need to provide
-separate objects_read_sync and objects_read_async interfaces where
-the former won't be implemented by the ECBackend.
+order to satisfy a read. ``PGBackend`` will therefore need to provide
+separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
+the former won't be implemented by the ``ECBackend``.
-PGBackend interfaces:
+``PGBackend`` interfaces:
-- objects_read_sync
-- objects_read_async
+- ``objects_read_sync``
+- ``objects_read_async``
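
A skeletal version of that split (names mirror the text above, but this is a
sketch rather than the real ``PGBackend`` declaration) could look like: ::

  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  using buffer_t = std::vector<uint8_t>;   // stand-in for ceph::bufferlist

  // Sketch of the read interface split described above; not the real API.
  struct PGBackendSketch {
    virtual ~PGBackendSketch() = default;

    // Synchronous read served entirely from the primary's local copy.
    // A replicated backend implements this; an EC backend does not, because
    // it has to gather shards from other OSDs before it can reconstruct data.
    virtual int objects_read_sync(const std::string& oid,
                                  uint64_t off, uint64_t len,
                                  buffer_t* out) = 0;

    // Asynchronous read: the completion fires once enough data (one full
    // copy, or enough shards for EC) has been gathered and decoded.
    virtual void objects_read_async(
        const std::string& oid, uint64_t off, uint64_t len,
        std::function<void(int, buffer_t)> on_complete) = 0;
  };
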
-Scrub
------
+Scrubs
+------
We currently have two scrub modes with different default frequencies:
#. [shallow] scrub: compares the set of objects and metadata, but not
the contents
-#. deep scrub: compares the set of objects, metadata, and a crc32 of
+#. deep scrub: compares the set of objects, metadata, and a CRC32 of
the object contents (including omap)
The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
-of objects including, if the scrub is deep, a crc32 of the contents of
+of objects including, if the scrub is deep, a CRC32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.
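
In outline, that comparison amounts to collecting one map per replica and
diffing per-object summaries. The sketch below uses deliberately simplified
types; a real scrubmap carries far more state. ::

  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  // Simplified per-object scrub summary; a deep scrub also fills in the CRC.
  struct ScrubEntrySketch {
    uint64_t size = 0;
    uint32_t data_crc = 0;     // only meaningful for deep scrub
  };
  using ScrubMapSketch = std::map<std::string, ScrubEntrySketch>;

  // Primary-side comparison: any object that is missing on a replica, or
  // present with a mismatching size (or CRC, for deep scrub), is flagged.
  // A real scrub also has to notice objects the primary itself is missing.
  inline std::set<std::string>
  find_inconsistent(const ScrubMapSketch& primary,
                    const std::vector<ScrubMapSketch>& replicas,
                    bool deep) {
    std::set<std::string> inconsistent;
    for (const auto& [oid, pe] : primary) {
      for (const auto& rmap : replicas) {
        auto it = rmap.find(oid);
        if (it == rmap.end() ||
            it->second.size != pe.size ||
            (deep && it->second.data_crc != pe.data_crc)) {
          inconsistent.insert(oid);
          break;
        }
      }
    }
    return inconsistent;
  }
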
Most of this can work essentially unchanged with erasure coded PG with
-the caveat that the PGBackend implementation must be in charge of
+the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.
-PGBackend interfaces:
+``PGBackend`` interfaces:
-- be_*
+- ``be_*``
Recovery
--------
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.
-Another difference is that objects in erasure coded pg may be
-unrecoverable without being unfound. The "unfound" concept
-should probably then be renamed to unrecoverable. Also, the
-PGBackend implementation will have to be able to direct the search
-for pg replicas with unrecoverable object chunks and to be able
+Another difference is that objects in an erasure coded PG may be
+unrecoverable without being unfound. The ``unfound`` state
+should probably be renamed to ``unrecoverable``. Also, the
+``PGBackend`` implementation will have to be able to direct the search
+for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.
Core changes:
-- s/unfound/unrecoverable
+- ``s/unfound/unrecoverable``
``PGBackend`` interfaces:
Introduction
============
-As described in ../deduplication.rst, adding transparent redirect
+As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".
-See ../deduplication.rst
+See ``../deduplication.rst``
At a high level, each object has a piece of metadata embedded in
-the object_info_t which can map subsets of the object data payload
+the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.
This document exists to detail:
RBD
---
-For RBD, the primary goal is for either an osd-internal agent or a
+For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.
-As such, rbd operations (including class operations and snapshots)
+As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.
-Moreover, tiering/dedup operations must interleave with rbd operations
+Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.
Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:
-* Demote cold rbd chunk to slow pool:
+* Demote cold RBD chunk to slow pool:
1. Read object, noting current user_version.
2. In memory, run CDC implementation to fingerprint object.
using the CAS class.
4. Submit operation to base pool:
- * ASSERT_VER with the user version from the read to fail if the
+ * ``ASSERT_VER`` with the user version from the read to fail if the
object has been mutated since the read.
- * SET_CHUNK for each of the extents to the corresponding object
+ * ``SET_CHUNK`` for each of the extents to the corresponding object
in the base pool.
- * EVICT_CHUNK for each extent to free up space in the base pool.
- Results in each chunk being marked MISSING.
+ * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
+ Results in each chunk being marked ``MISSING``.
RBD users should then either see the state prior to the demotion or
subsequent to it.
Note that between 3 and 4, we potentially leak references, so a
periodic scrub would be needed to validate refcounts.
-* Promote cold rbd chunk to fast pool.
+* Promote cold RBD chunk to fast pool.
- 1. Submit TIER_PROMOTE
+ 1. Submit ``TIER_PROMOTE``
For clones, all of the above would be identical except that the
-initial read would need a LIST_SNAPS to determine which clones exist
-and the PROMOTE or SET_CHUNK/EVICT operations would need to include
-the cloneid.
+initial read would need a ``LIST_SNAPS`` to determine which clones exist
+and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
+the ``cloneid``.
RadosGW
-------
-For reads, RadosGW could operate as RBD above relying on the manifest
-machinery in the OSD to hide the distinction between the object being
-dedup'd or present in the base pool
+For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
+manifest machinery in the OSD to hide whether the object
+has been dedup'd or is still present in the base pool.
-For writes, RadosGW could operate as RBD does above, but it could
+For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
- Snapshots: We want to be able to deduplicate portions of clones
below the level of the rados snapshot system. As such, the
rados operations below need to be extended to work correctly on
- clones (e.g.: we should be able to call SET_CHUNK on a clone, clear the
- corresponding extent in the base pool, and correctly maintain osd metadata).
+ clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
+ corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
cache/tiering implementation, but to do that we need to ensure that we
can address the same use cases.
The existing implementation has some things that need to be cleaned up:
-* SET_REDIRECT: Should create the object if it doesn't exist, otherwise
+* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
one couldn't create an object atomically as a redirect.
-* SET_CHUNK:
+* ``SET_CHUNK``:
* Appears to trigger a new clone as user_modify gets set in
- do_osd_ops. This probably isn't desirable, see Snapshots section
+ ``do_osd_ops``. This probably isn't desirable, see Snapshots section
below for some options on how generally to mix these operations
- with snapshots. At a minimum, SET_CHUNK probably shouldn't set
+ with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
user_modify.
* Appears to assume that the corresponding section of the object
- does not exist (sets FLAG_MISSING) but does not check whether the
+ does not exist (sets ``FLAG_MISSING``) but does not check whether the
corresponding extent exists already in the object. Should always
leave the extent clean.
* Appears to clear the manifest unconditionally if not chunked,
that's probably wrong. We should return an error if it's a
- REDIRECT ::
+ ``REDIRECT`` ::
case CEPH_OSD_OP_SET_CHUNK:
if (oi.manifest.is_redirect()) {
}
-* TIER_PROMOTE:
+* ``TIER_PROMOTE``:
- * SET_REDIRECT clears the contents of the object. PROMOTE appears
+ * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
to copy them back in, but does not unset the redirect or clear the
reference. This violates the invariant that a redirect object
should be empty in the base pool. In particular, as long as the
redirect is set, it appears that all operations will be proxied
- even after the promote defeating the purpose. We do want PROMOTE
+    even after the promote, defeating the purpose. We do want ``PROMOTE``
to be able to atomically replace a redirect with the actual
object, so the solution is to clear the redirect at the end of the
promote.
* For a chunked manifest, we appear to flush prior to promoting.
Promotion will often be used to prepare an object for low latency
reads and writes, accordingly, the only effect should be to read
- any MISSING extents into the base pool. No flushing should be done.
+ any ``MISSING`` extents into the base pool. No flushing should be done.
* High Level:
- * It appears that FLAG_DIRTY should never be used for an extent pointing
+ * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
at a dedup extent. Writing the mutated extent back to the dedup pool
requires writing a new object since the previous one cannot be mutated,
just as it would if it hadn't been dedup'd yet. Thus, we should always
drop the reference and remove the manifest pointer.
* There isn't currently a way to "evict" an object region. With the above
- change to SET_CHUNK to always retain the existing object region, we
- need an EVICT_CHUNK operation to then remove the extent.
+ change to ``SET_CHUNK`` to always retain the existing object region, we
+ need an ``EVICT_CHUNK`` operation to then remove the extent.
Testing
-------
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:
-* Thrasher tests like qa/suites/rados/thrash/workloads/cache-snaps.yaml
+* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``
That test, of course, tests the existing cache/tiering machinery. Add
additional files to that directory that instead setup a dedup pool. Add
- support to ceph_test_rados (src/test/osd/TestRados*).
+ support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).
* RBD tests
- Add a test that runs an rbd workload concurrently with blind
+ Add a test that runs an RBD workload concurrently with blind
promote/evict operations.
-* RadosGW
+* RGW
  Add a test that runs an RGW workload concurrently with blind
promote/evict operations.
Snapshots
---------
-Fundamentally, I think we need to be able to manipulate the manifest
+Fundamentally we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.
-As such, the plan is to allow the object_manifest_t for each clone
+As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:
-* Modify the op processing pipeline to permit SET_CHUNK, EVICT_CHUNK
+* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
+to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
- use the overlaps in clone_range. ReplicatedBackend::calc_*_subsets
+ use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
are the two methods that would likely need to be modified.
-See snaps.rst for a rundown of the librados snapshot system and osd
+See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.
-The dedup-tool needs to be updated to use LIST_SNAPS to discover
+The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.
An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
-offset. In particular, make_writeable will generally create a clone
-that shares the same object_manifest_t references with the exception
+offset. In particular, ``make_writeable`` will generally create a clone
+that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object)
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
-means that a write that invokes make_writeable may decrease refcounts,
+means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::
10 : [0, 512) aaa, [512, 1024) bbb
refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
-What should be the refcount for aaa be at the end? By our
-above rule, it should be two since the two aaa refs are not
-contiguous. However, consider removing clone 20 ::
+What should be the refcount for ``aaa`` be at the end? By our
+above rule, it should be ``2`` since the two ``aaa`` refs are not
+contiguous. However, consider removing clone ``20`` ::
initial:
head: [0, 512) aaa, [512, 1024) bbb
10 : [0, 512) aaa, [512, 1024) bbb
refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0
-At this point, our rule dictates that refcount(aaa) is 1.
-This means that removing 20 needs to check for refs held by
+At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
+This means that removing ``20`` needs to check for refs held by
the clones on either side which will then match.
-See osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal
+See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.
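
The rule can be illustrated with a small self-contained sketch over
simplified chunk maps keyed by offset; this is not the actual
``calc_refs_to_drop_on_removal`` code, just the shape of the check. ::

  #include <cstdint>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // offset -> backing chunk (fingerprint oid); a very reduced chunk_map.
  using ChunkMapSketch = std::map<uint64_t, std::string>;

  inline bool has_ref(const ChunkMapSketch* m, uint64_t off,
                      const std::string& chunk) {
    if (!m)
      return false;
    auto it = m->find(off);
    return it != m->end() && it->second == chunk;
  }

  // Consecutive clones sharing a reference at the same offset share one
  // refcount, so removing a clone drops a reference when either
  //  * the removed clone held a ref that neither neighbour shares, or
  //  * both neighbours share a ref the removed clone did not hold, since
  //    the two previously separate runs merge into one (the clone-20 case
  //    above).
  inline std::vector<std::pair<uint64_t, std::string>>
  refs_to_drop_on_removal(const ChunkMapSketch& removed,
                          const ChunkMapSketch* preceding,
                          const ChunkMapSketch* succeeding) {
    std::vector<std::pair<uint64_t, std::string>> to_drop;
    for (const auto& [off, chunk] : removed) {        // case 1
      if (!has_ref(preceding, off, chunk) && !has_ref(succeeding, off, chunk))
        to_drop.emplace_back(off, chunk);
    }
    if (preceding && succeeding) {                    // case 2
      for (const auto& [off, chunk] : *preceding) {
        if (!has_ref(&removed, off, chunk) && has_ref(succeeding, off, chunk))
          to_drop.emplace_back(off, chunk);
      }
    }
    return to_drop;
  }
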
This seems complicated, but it gets us two valuable properties:
1) The refcount change from make_writeable will not block on
incrementing a ref
-2) We don't need to load the object_manifest_t for every clone
+2) We don't need to load the ``object_manifest_t`` for every clone
to determine how to handle removing one -- just the ones
immediately preceding and succeeding it.
-All clone operations will need to consider adjacent chunk_maps
+All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.
Cache/Tiering
-------------
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.
-See cache-pool.rst
+See ``cache-pool.rst``
The manifest machinery already shares some code paths with the
-existing cache/tiering code, mainly stat_flush.
+existing cache/tiering code, mainly ``stat_flush``.
In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:
for maintaining bloom filters which provide estimates of access
recency for objects. We probably need to modify this to permit
hitset maintenance for a normal pool -- there are already
- CEPH_OSD_OP_PG_HITSET* interfaces for querying them.
+ ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The osd already has a background tiering agent which
would need to be modified to instead flush and evict using
manifests.
- hitset
- age, ratio, bytes
-* Add tiering-mode to manifest-tiering.
+* Add tiering-mode to ``manifest-tiering``
- Writeback
- Read-only
Data Structures
===============
-Each object contains an object_manifest_t embedded within the
-object_info_t (see osd_types.h):
+Each RADOS object contains an ``object_manifest_t`` embedded within the
+``object_info_t`` (see ``osd_types.h``):
::
struct object_manifest_t {
  uint8_t type;                                // TYPE_NONE, TYPE_REDIRECT, TYPE_CHUNKED
  hobject_t redirect_target;                   // used by TYPE_REDIRECT
  std::map<uint64_t, chunk_info_t> chunk_map;  // used by TYPE_CHUNKED
}
-The type enum reflects three possible states an object can be in:
+The ``type`` enum reflects three possible states an object can be in:
-1. TYPE_NONE: normal rados object
-2. TYPE_REDIRECT: object payload is backed by a single object
- specified by redirect_target
-3. TYPE_CHUNKED: object payload is distributed among objects with
- size and offset specified by the chunk_map. chunk_map maps
- the offset of the chunk to a chunk_info_t shown below further
- specifying the length, target oid, and flags.
+1. ``TYPE_NONE``: normal RADOS object
+2. ``TYPE_REDIRECT``: object payload is backed by a single object
+ specified by ``redirect_target``
+3. ``TYPE_CHUNKED``: object payload is distributed among objects with
+   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
+   the offset of the chunk to a ``chunk_info_t`` as shown below, also
+   specifying the ``length``, target ``oid``, and ``flags``.
::
cflag_t flags; // FLAG_*
-FLAG_DIRTY at this time can happen if an extent with a fingerprint
+``FLAG_DIRTY`` at this time can be set if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
================
Similarly to cache/tiering, the initial touchpoint is
-maybe_handle_manifest_detail.
+``maybe_handle_manifest_detail``.
-For manifest operations listed below, we return NOOP and continue onto
-dedicated handling within do_osd_ops.
+For manifest operations listed below, we return ``NOOP`` and continue onto
+dedicated handling within ``do_osd_ops``.
-For redirect objects which haven't been promoted (apparently oi.size >
-0 indicates that it's present?) we proxy reads and writes.
+For redirect objects which haven't been promoted (apparently ``oi.size >
+0`` indicates that it's present?) we proxy reads and writes.
-For reads on TYPE_CHUNKED, if can_proxy_chunked_read (basically, all
-of the ops are reads of extents in the object_manifest_t chunk_map),
+For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
+of the ops are reads of extents in the ``object_manifest_t chunk_map``),
we proxy requests to those objects.
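
The containment test behind that check can be pictured as follows, over a
reduced view of the ``chunk_map`` (illustration only): ::

  #include <cstdint>
  #include <map>

  // offset -> length of each mapped extent; a reduced view of a chunk_map.
  using ExtentMapSketch = std::map<uint64_t, uint64_t>;

  // A chunked read can be proxied only if every byte of the requested extent
  // falls inside some extent recorded in the manifest's chunk_map.
  inline bool read_within_chunk_map(const ExtentMapSketch& chunk_map,
                                    uint64_t off, uint64_t len) {
    const uint64_t end = off + len;
    while (off < end) {
      auto it = chunk_map.upper_bound(off);  // first chunk starting after off
      if (it == chunk_map.begin())
        return false;                        // nothing starts at or before off
      --it;
      const uint64_t chunk_end = it->first + it->second;
      if (off >= chunk_end)
        return false;                        // off falls into a gap
      off = chunk_end;                       // continue past this chunk
    }
    return true;
  }
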
RADOS Interface
================
-To set up deduplication pools, you must have two pools. One will act as the
+To set up deduplication one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
-configured with fingerprint_algorithm option as follows.
+configured with the ``fingerprint_algorithm`` option as follows.
::
ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
--yes-i-really-mean-it
-1. Create objects ::
+Create objects ::
- - rados -p base_pool put foo ./foo
+ rados -p base_pool put foo ./foo
+ rados -p chunk_pool put foo-chunk ./foo-chunk
- - rados -p chunk_pool put foo-chunk ./foo-chunk
+Make a manifest object ::
-2. Make a manifest object ::
-
- - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
- chunk_pool foo-chunk $START_OFFSET --with-reference
+ rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
Operations:
-* set-redirect
+* ``set-redirect``
- set a redirection between a base_object in the base_pool and a target_object
- in the target_pool.
+ Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
+ in the ``target_pool``.
A redirected object will forward all operations from the client to the
- target_object. ::
+ ``target_object``. ::
void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
uint64_t tgt_version, int flag = 0);
rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
<target_object>
- Returns ENOENT if the object does not exist (TODO: why?)
- Returns EINVAL if the object already is a redirect.
+ Returns ``ENOENT`` if the object does not exist (TODO: why?)
+ Returns ``EINVAL`` if the object already is a redirect.
Takes a reference to target as part of operation, can possibly leak a ref
if the acting set resets and the client dies between taking the ref and
Truncates object, clears omap, and clears xattrs as a side effect.
- At the top of do_osd_ops, does not set user_modify.
+  At the top of ``do_osd_ops``, does not set ``user_modify``.
This operation is not a user mutation and does not trigger a clone to be created.
- The purpose of set_redirect is two.
+ There are two purposes of ``set_redirect``:
  1. Redirect all operations to the target object (like proxy)
- 2. Cache when tier_promote is called (redirect will be cleared at this time).
+ 2. Cache when ``tier_promote`` is called (redirect will be cleared at this time).
-* set-chunk
+* ``set-chunk``
- set the chunk-offset in a source_object to make a link between it and a
- target_object. ::
+ Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
+ ``target_object``. ::
void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
std::string tgt_oid, uint64_t tgt_offset, int flag = 0);
rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
<caspool> <target_object> <target-offset>
- Returns ENOENT if the object does not exist (TODO: why?)
- Returns EINVAL if the object already is a redirect.
- Returns EINVAL if on ill-formed parameter buffer.
- Returns ENOTSUPP if existing mapped chunks overlap with new chunk mapping.
+ Returns ``ENOENT`` if the object does not exist (TODO: why?)
+ Returns ``EINVAL`` if the object already is a redirect.
+  Returns ``EINVAL`` on an ill-formed parameter buffer.
+ Returns ``ENOTSUPP`` if existing mapped chunks overlap with new chunk mapping.
Takes references to targets as part of operation, can possibly leak refs
if the acting set resets and the client dies between taking the ref and
This operation is not a user mutation and does not trigger a clone to be created.
- TODO: SET_CHUNK appears to clear the manifest unconditionally if it's not chunked. ::
+ TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::
if (!oi.manifest.is_chunked()) {
oi.manifest.clear();
}
-* evict-chunk
+* ``evict-chunk``
Clears an extent from an object leaving only the manifest link between
- it and the target_object. ::
+ it and the ``target_object``. ::
void evict_chunk(
uint64_t offset, uint64_t length, int flag = 0);
rados -p base_pool evict-chunk <offset> <length> <object>
- Returns EINVAL if the extent is not present in the manifest.
+ Returns ``EINVAL`` if the extent is not present in the manifest.
Note: this does not exist yet.
-* tier-promote
+* ``tier-promote``
- promotes the object ensuring that subsequent reads and writes will be local ::
+ Promotes the object ensuring that subsequent reads and writes will be local ::
void tier_promote();
rados -p base_pool tier-promote <obj-name>
- Returns ENOENT if the object does not exist
+ Returns ``ENOENT`` if the object does not exist
For a redirect manifest, copies data to head.
For a chunked manifest, reads all MISSING extents into the base pool,
subsequent reads and writes will be served from the base pool.
- Implementation Note: For a chunked manifest, calls start_copy on itself. The
- resulting copy_get operation will issue reads which will then be redirected by
+ Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
+ resulting ``copy_get`` operation will issue reads which will then be redirected by
the normal manifest read machinery.
- Does not set the user_modify flag.
+ Does not set the ``user_modify`` flag.
- Future work will involve adding support for specifying a clone_id.
+ Future work will involve adding support for specifying a ``clone_id``.
-* unset-manifest
+* ``unset-manifest``
- unset the manifest info in the object that has manifest. ::
+  Unset the manifest info in an object that has a manifest. ::
void unset_manifest();
Clears manifest chunks or redirect. Lazily releases references, may
leak.
- do_osd_ops seems not to include it in the user_modify=false ignorelist,
+ ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
and so will trigger a snapshot. Note, this will be true even for a
- redirect though SET_REDIRECT does not flip user_modify. This should
- be fixed -- unset-manifest should not be a user_modify.
+ redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
+ be fixed -- ``unset-manifest`` should not be a ``user_modify``.
-* tier-flush
+* ``tier-flush``
- flush the object which has chunks to the chunk pool. ::
+ Flush the object which has chunks to the chunk pool. ::
void tier_flush();
rados -p base_pool tier-flush <obj-name>
- Included in the user_modify=false ignorelist, does not trigger a clone.
+ Included in the ``user_modify=false`` ``ignorelist``, does not trigger a clone.
Does not evict the extents.
-Dedup tool
-==========
+ceph-dedup-tool
+===============
-Dedup tool has two features: finding an optimal chunk offset for dedup chunking
-and fixing the reference count (see ./refcount.rst).
+``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
+and fixing the reference count (see ``./refcount.rst``).
-* find an optimal chunk offset
+* Find an optimal chunk offset
- a. fixed chunk
+ a. Fixed chunk
- To find out a fixed chunk length, you need to run the following command many
- times while changing the chunk_size. ::
+ To find out a fixed chunk length, you need to run the following command many
+ times while changing the ``chunk_size``. ::
ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
--chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
- b. rabin chunk(Rabin-karp algorithm)
+  b. Rabin chunk (Rabin-Karp algorithm)
- As you know, Rabin-karp algorithm is string-searching algorithm based
- on a rolling-hash. But rolling-hash is not enough to do deduplication because
- we don't know the chunk boundary. So, we need content-based slicing using
- a rolling hash for content-defined chunking.
- The current implementation uses the simplest approach: look for chunk boundaries
- by inspecting the rolling hash for pattern(like the
- lower N bits are all zeroes).
+ Rabin-Karp is a string-searching algorithm based
+ on a rolling hash. But a rolling hash is not enough to do deduplication because
+ we don't know the chunk boundary. So, we need content-based slicing using
+ a rolling hash for content-defined chunking.
+ The current implementation uses the simplest approach: look for chunk boundaries
+    by inspecting the rolling hash for a pattern (e.g. the
+    lower N bits all being zero).
- - Usage
-
- Users who want to use deduplication need to find an ideal chunk offset.
- To find out ideal chunk offset, Users should discover
- the optimal configuration for their data workload via ceph-dedup-tool.
- And then, this chunking information will be used for object chunking through
- set-chunk api. ::
+ Users who want to use deduplication need to find an ideal chunk offset.
+    To find the ideal chunk offset, users should discover
+ the optimal configuration for their data workload via ``ceph-dedup-tool``.
+ This information will then be used for object chunking through
+ the ``set-chunk`` API. ::
ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
--chunk-algorithm rabin --fingerprint-algorithm rabin
- ceph-dedup-tool has many options to utilize rabin chunk.
- These are options for rabin chunk. ::
+    ``ceph-dedup-tool`` has many options for tuning ``rabin chunk``;
+    they are listed below. ::
--mod-prime <uint64_t>
--rabin-prime <uint64_t>
--min-chunk <uint32_t>
--max-chunk <uint64_t>
- Users need to refer following equation to use above options for rabin chunk. ::
+    Users need to refer to the following equation to use the above
+    options for ``rabin chunk``. ::
rabin_hash =
(rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
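
    A compact, self-contained illustration of content-defined chunking with
    a rolling hash is shown below. It uses the simplest boundary test
    mentioned above (the lower N bits of the hash being zero); the constants
    and the multiplier are arbitrary choices for the example, not
    ``ceph-dedup-tool`` defaults. ::

      #include <cstddef>
      #include <cstdint>
      #include <string>
      #include <vector>

      // Toy content-defined chunker: slide a rolling hash over a fixed
      // window and cut a chunk whenever the lower 'mask_bits' bits of the
      // hash are all zero. Arithmetic is done modulo 2^64 so the rolling
      // update stays simple. All constants are illustrative only.
      inline std::vector<std::string> cdc_chunks(const std::string& data,
                                                 size_t window = 16,
                                                 unsigned mask_bits = 6,
                                                 size_t min_chunk = 64,
                                                 size_t max_chunk = 4096) {
        const uint64_t prime = 1099511628211ull;   // arbitrary odd multiplier
        const uint64_t mask = (1ull << mask_bits) - 1;

        uint64_t pow = 1;                          // prime^window (mod 2^64)
        for (size_t i = 0; i < window; ++i)
          pow *= prime;

        std::vector<std::string> chunks;
        size_t start = 0;
        uint64_t hash = 0;
        for (size_t i = 0; i < data.size(); ++i) {
          hash = hash * prime + static_cast<uint8_t>(data[i]);
          if (i >= window)                         // drop the byte leaving the window
            hash -= static_cast<uint8_t>(data[i - window]) * pow;

          size_t len = i + 1 - start;
          if ((len >= min_chunk && (hash & mask) == 0) || len >= max_chunk) {
            chunks.push_back(data.substr(start, len));
            start = i + 1;
          }
        }
        if (start < data.size())
          chunks.push_back(data.substr(start));    // trailing partial chunk
        return chunks;
      }

    Feeding two buffers that differ only by a short prefix through such a
    chunker tends to produce identical chunks again shortly after the
    insertion point, which is exactly where content-defined chunking beats
    fixed-size chunking (see the ``B``/``C`` example below).
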
c. Fixed chunk vs content-defined chunk
- Content-defined chunking may or not be optimal solution.
- For example,
+    Content-defined chunking may or may not be the optimal solution.
+    For example,
- Data chunk A : abcdefgabcdefgabcdefg
+ Data chunk ``A`` : ``abcdefgabcdefgabcdefg``
- Let's think about Data chunk A's deduplication. Ideal chunk offset is
- from 1 to 7 (abcdefg). So, if we use fixed chunk, 7 is optimal chunk length.
- But, in the case of content-based slicing, the optimal chunk length
- could not be found (dedup ratio will not be 100%).
- Because we need to find optimal parameter such
- as boundary bit, window size and prime value. This is as easy as fixed chunk.
- But, content defined chunking is very effective in the following case.
+    Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
+    from ``1`` to ``7`` (``abcdefg``), so with fixed chunking a chunk length of ``7`` is optimal.
+    In the case of content-based slicing, however, the optimal chunk length
+    might not be found (the dedup ratio will not be 100%),
+    because we first need to find suitable parameters such
+    as the boundary bit, window size and prime value, which is not as easy as
+    picking a fixed chunk length. Content-defined chunking is nonetheless very effective in the following case.
- Data chunk B : abcdefgabcdefgabcdefg
+ Data chunk ``B`` : ``abcdefgabcdefgabcdefg``
+
+    Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``
+
+    Here the single leading ``T`` in chunk ``C`` shifts every fixed-size chunk
+    boundary, so fixed chunking finds no duplicates between ``B`` and ``C``,
+    whereas content-defined chunking resynchronizes after the shift and still
+    deduplicates the repeated ``abcdefg`` pattern.
- Data chunk C : Tabcdefgabcdefgabcdefg
-
-* fix reference count
+* Fix reference count
  The key idea behind reference counting for dedup is to allow false positives, which means
- (manifest object (no ref), chunk object(has ref)) happen instead of
- (manifest object (has ref), chunk 1(no ref)).
- To fix such inconsistency, ceph-dedup-tool supports chunk_scrub. ::
+  ``(manifest object (no ref), chunk object (has ref))`` can occur instead of
+  ``(manifest object (has ref), chunk 1 (no ref))``.
+ To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::
ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
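
  In outline, the scrub pass cross-checks each reference recorded on a chunk
  against the manifest that is supposed to hold it; the sketch below uses
  simplified in-memory maps and is not the actual tool logic. ::

    #include <cstddef>
    #include <map>
    #include <set>
    #include <string>

    // chunk oid -> base-pool objects recorded as holding a reference to it.
    using ChunkRefsSketch = std::map<std::string, std::set<std::string>>;
    // base object -> chunk oids its manifest actually points at.
    using ManifestRefsSketch = std::map<std::string, std::set<std::string>>;

    // A recorded reference is a false positive when the named base object's
    // manifest no longer points at that chunk; chunk_scrub drops it.
    inline size_t scrub_chunk_refs(ChunkRefsSketch& chunk_refs,
                                   const ManifestRefsSketch& manifests) {
      size_t repaired = 0;
      for (auto& [chunk, holders] : chunk_refs) {
        for (auto it = holders.begin(); it != holders.end();) {
          auto m = manifests.find(*it);
          bool valid = (m != manifests.end()) && m->second.count(chunk);
          if (!valid) {
            it = holders.erase(it);   // stale (false-positive) reference
            ++repaired;
          } else {
            ++it;
          }
        }
      }
      return repaired;
    }
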
OSD Throttles
=============
-There are three significant throttles in the filestore: wbthrottle,
-op_queue_throttle, and a throttle based on journal usage.
+There are three significant throttles in the FileStore OSD back end:
+``wbthrottle``, ``op_queue_throttle``, and a throttle based on journal usage.
WBThrottle
----------
limits until the background flusher catches up.
The relevant config options are ``filestore_wbthrottle*``. There are
-different defaults for xfs and btrfs. Each set has hard and soft
+different defaults for XFS and Btrfs. Each set has hard and soft
limits on bytes (total dirty bytes), ios (total dirty ios), and
inodes (total dirty fds). The WBThrottle will begin flushing
when any of these hits the soft limit and will block in throttle()
Partial Object Recovery
=======================
-Partial Object Recovery devotes to improving the efficiency of
-log-based recovery rather than backfill. Original log-based recovery
-calculates missing_set based on the difference between pg_log.
+Partial Object Recovery improves the efficiency of log-based recovery (vs
+backfill). Original log-based recovery calculates ``missing_set`` based on ``pg_log``
+differences.
The whole object is recovered from one OSD to another
if the object is indicated as modified by the ``pg_log``, regardless of how much
State variables
---------------
-- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep
-- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep
-- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep
-- Initiated scrub after osd_deep_scrub_interval state is must scrub && !must_deep_scrub && time_for_deep
-- Initiated deep scrub state is must_scrub && must_deep_scrub
+- Periodic tick state is ``!must_scrub && !must_deep_scrub && !time_for_deep``
+- Periodic tick after ``osd_deep_scrub_interval`` state is ``!must_scrub && !must_deep_scrub && time_for_deep``
+- Initiated scrub state is ``must_scrub && !must_deep_scrub && !time_for_deep``
+- Initiated scrub after ``osd_deep_scrub_interval`` state is ``must_scrub && !must_deep_scrub && time_for_deep``
+- Initiated deep scrub state is ``must_scrub && must_deep_scrub``
Scrub Reservations
------------------
Ondisk Structures
-----------------
-Each object has in the pg collection a *head* object (or *snapdir*, which we
+Each object has in the PG collection a *head* object (or *snapdir*, which we
will come to shortly) and possibly a set of *clone* objects.
Each ``hobject_t`` has a snap field. For the *head* (the only writeable version
of an object), the snap field is set to ``CEPH_NOSNAP``. For the *clones*, the
See ``PrimaryLogPG::SnapTrimmer``, ``SnapMapper``
This trimming is performed asynchronously by the snap_trim_wq while the
-pg is clean and not scrubbing.
+PG is clean and not scrubbing.
#. The next snap in ``PG::snap_trimq`` is selected for trimming
#. We determine the next object for trimming out of ``PG::snap_mapper``.
Recovery
--------
Because the trim operations are implemented using repops and log entries,
-normal pg peering and recovery maintain the snap trimmer operations with
+normal PG peering and recovery maintain the snap trimmer operations with
the caveat that push and removal operations need to update the local
*SnapMapper* instance. If the purged_snaps update is lost, we merely
retrim a now empty snap.
pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
object does not involve reading all objects for any snap. Additionally,
upon construction, the *SnapMapper* is provided with a mask for filtering
-the objects in the single SnapMapper keyspace belonging to that pg.
+the objects in the single SnapMapper keyspace belonging to that PG.
Split
-----
-The snapid_t -> hobject_t key entries are arranged such that for any pg,
+The ``snapid_t -> hobject_t`` key entries are arranged such that for any PG,
up to 8 prefixes need to be checked to determine all hobjects in a particular
-snap for a particular pg. Upon split, the prefixes to check on the parent
-are adjusted such that only the objects remaining in the pg will be visible.
+snap for a particular PG. Upon split, the prefixes to check on the parent
+are adjusted such that only the objects remaining in the PG will be visible.
The children will immediately have the correct mapping.