Also, clean up some old ones.
Signed-off-by: Samuel Just <sjust@redhat.com>
--- /dev/null
+=================================
+ECBackend Implementation Strategy
+=================================
+
+Misc initial design notes
+=========================
+
+The initial design for ec pools (which still holds for ec pools
+without the hacky ec overwrites debug flag enabled) restricted EC
+pools to operations which can be easily rolled back:
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+ including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+ requires that we retain the deleted object until all replicas have
+ persisted the deletion event. ErasureCoded backend will therefore
+ need to store objects with the version at which they were created
+ included in the key provided to the filestore. Old versions of an
+ object can be pruned when all replicas have committed up to the log
+ event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+ to be set or removed, we can roll back these operations locally.
+
+Log entries contain a structure explaining how to locally undo the
+operation represented by the log entry
+(see osd_types.h:TransactionInfo::LocalRollBack).
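+
+As a rough illustration (the names below are invented for this sketch,
+not the actual osd_types.h definitions), the per-operation rollback
+information might look like::
+
+  #include <cstdint>
+  #include <map>
+  #include <optional>
+  #include <string>
+  #include <variant>
+
+  struct RollbackAppend {     // undo CEPH_OSD_OP_APPEND
+    uint64_t old_size;        // truncate back to this size
+  };
+  struct RollbackDelete {     // undo CEPH_OSD_OP_DELETE
+    uint64_t create_version;  // version-suffixed key still holds the object
+  };
+  struct RollbackAttrs {      // undo CEPH_OSD_OP_(SET|RM)ATTR
+    // prior values; std::nullopt means the attr did not exist before
+    std::map<std::string, std::optional<std::string>> old_attrs;
+  };
+  using LocalRollback =
+      std::variant<RollbackAppend, RollbackDelete, RollbackAttrs>;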
+
+PGTemp and Crush
+----------------
+
+Primaries are able to request a temp acting set mapping in order to
+allow an up-to-date OSD to serve requests while a new primary is
+backfilled (and for other reasons). An erasure coded pg needs to be
+able to designate a primary for these reasons without putting it in
+the first position of the acting set. It also needs to be able to
+leave holes in the requested acting set.
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+ cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+ a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+ and that all elements of acting are valid. This needs to be cleaned
+ up since the acting set may contain holes.
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an OSD which happens already to contain an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<shard_t, pg_t> in order to distinguish different pg
+chunks on a single OSD.
+
+Because the mapping of object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the objectstore must
+include the chunk id in the object key.
+
+Core changes:
+
+- The objectstore `ghobject_t needs to also include a chunk id
+ <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
+ tuple<hobject_t, gen_t, shard_t>.
+- coll_t needs to include a shard_t.
+- The OSD pg_map and similar pg mappings need to work in terms of a
+ spg_t (essentially
+ pair<pg_t, shard_t>). Similarly, pg->pg messages need to include
+ a shard_t
+- For client->PG messages, the OSD will need a way to know which PG
+ chunk should get the message since the OSD may contain both a
+ primary and non-primary chunk for the same pg
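+
+A minimal sketch of the shard-qualified identifiers described above
+(simplified considerably from src/common/hobject.h and
+src/osd/osd_types.h)::
+
+  #include <cstdint>
+
+  using shard_id_t = int8_t;            // NO_SHARD (-1) for replicated pools
+  constexpr shard_id_t NO_SHARD = -1;
+
+  struct pg_t  { uint64_t pool; uint32_t seed; };
+  struct spg_t {                        // "which chunk of which pg"
+    pg_t pgid;
+    shard_id_t shard;                   // distinguishes chunk 2 from chunk 4
+  };
+
+  struct hobject_t { /* name, key, hash, ... */ };
+  struct ghobject_t {                   // object key as the objectstore sees it
+    hobject_t hobj;
+    uint64_t generation;                // version at creation, for rollback
+    shard_id_t shard_id;                // keeps chunk 2 and 4 names distinct
+  };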
+
+Object Classes
+--------------
+
+Reads from object classes will return ENOTSUP on ec pools, because
+class reads invoke a special SYNC read which the EC backend does not
+implement.
+
+Scrub
+-----
+
+The main catch for ec pools is that sending a crc32 of the
+stored chunk on a replica isn't particularly helpful since the chunks
+on different replicas presumably store different data. Because we
+don't support overwrites except via DELETE, however, we have the
+option of maintaining a crc32 on each chunk through each append.
+Thus, each replica instead simply computes a crc32 of its own stored
+chunk and compares it with the locally stored checksum. The replica
+then reports to the primary whether the checksums match.
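+
+Because chunks are append-only in this mode, the running checksum can
+be maintained incrementally. A small sketch using zlib's crc32()
+(ShardChunk is an invented stand-in for the shard's stored state)::
+
+  #include <zlib.h>
+  #include <cstdint>
+  #include <string>
+
+  struct ShardChunk {
+    std::string data;                    // stored chunk contents
+    uint32_t stored_crc = crc32(0L, Z_NULL, 0);
+
+    void append(const std::string& frag) {
+      stored_crc = crc32(stored_crc,
+                         reinterpret_cast<const Bytef*>(frag.data()),
+                         frag.size());
+      data += frag;    // in practice, data and crc persist atomically
+    }
+    // Deep scrub: recompute over the local chunk only; each replica
+    // reports match/mismatch to the primary, never the crc itself.
+    bool scrub_ok() const {
+      uint32_t c = crc32(0L, Z_NULL, 0);
+      c = crc32(c, reinterpret_cast<const Bytef*>(data.data()),
+                data.size());
+      return c == stored_crc;
+    }
+  };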
+
+With overwrites, all scrubs are disabled for now until we work out
+what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).
+
+Crush
+-----
+
+If crush is unable to generate a replacement for a down member of an
+acting set, the acting set should have a hole at that position rather
+than shifting the other elements of the acting set out of position.
+
+=========
+ECBackend
+=========
+
+MAIN OPERATION OVERVIEW
+=======================
+
+A RADOS put operation can span
+multiple stripes of a single object. There must be code that
+tessellates the application level write into a set of per-stripe write
+operations -- some whole-stripes and up to two partial
+stripes. Without loss of generality, for the remainder of this
+document we will focus exclusively on writing a single stripe (whole
+or partial). We will use the symbol "W" to represent the number of
+blocks within a stripe that are being written, i.e., W <= K.
+
+There are two data flows for handling a write into an EC stripe. The
+choice between them is based on the size of the write operation and
+the arithmetic properties of the selected parity-generation algorithm.
+
+(1) whole stripe is written/overwritten
+(2) a read-modify-write operation is performed
+
+WHOLE STRIPE WRITE
+------------------
+
+This is the simple case, and is already performed in the existing code
+(for appends, that is). The primary receives all of the data for the
+stripe in the RADOS request, computes the appropriate parity blocks
+and sends the data and parity blocks to their destination shards, which
+write them. This is essentially the current EC code.
+
+READ-MODIFY-WRITE
+-----------------
+
+The primary determines which K-W blocks will remain unmodified,
+and reads them from the shards. Once all of the data is received it is
+combined with the received new data and new parity blocks are
+computed. The modified blocks are sent to their respective shards and
+written. The RADOS operation is acknowledged.
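+
+A toy illustration of the flow with a single XOR parity block (real
+pools use the configured erasure code plugin, not this)::
+
+  #include <array>
+  #include <cstdint>
+  #include <vector>
+
+  constexpr int K = 4;                       // data blocks per stripe
+  using Block = std::vector<uint8_t>;
+
+  Block xor_parity(const std::array<Block, K>& data) {
+    Block p(data[0].size(), 0);
+    for (const auto& d : data)
+      for (size_t i = 0; i < p.size(); ++i) p[i] ^= d[i];
+    return p;
+  }
+
+  // W new blocks arrive; the primary reads the other K-W blocks from
+  // the shards, combines them, and re-encodes. Only the W modified
+  // data blocks plus the parity blocks are then sent out and written.
+  void rmw(std::array<Block, K>& stripe, int w_index,
+           const Block& new_block, Block& parity_out) {
+    stripe[w_index] = new_block;
+    parity_out = xor_parity(stripe);
+  }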
+
+OSD Object Write and Consistency
+--------------------------------
+
+Regardless of the algorithm chosen above, writing of the data is a two
+phase process: commit and rollforward. The primary sends the log
+entries describing the operation (see
+osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)).
+In all cases, the "commit" is performed in place, possibly leaving some
+information required for a rollback in a write-aside object. The
+rollforward phase occurs once all acting set replicas have committed
+the commit (sorry, overloaded term) and removes the rollback information.
+
+In the case of overwrites of existing stripes, the rollback information
+has the form of a sparse object containing the old values of the
+overwritten extents populated using clone_range. This is essentially
+a place-holder implementation; in real life, bluestore will have an
+efficient primitive for this.
+
+The rollforward part can be delayed since we report the operation as
+committed once all replicas have committed. Currently, whenever we
+send a write, we also indicate that all previously committed
+operations should be rolled forward (see
+ECBackend::try_reads_to_commit). If there aren't any in the pipeline
+when we arrive at the waiting_rollforward queue, we start a dummy
+write to move things along (see the Pipeline section later on and
+ECBackend::try_finish_rmw).
+
+ExtentCache
+-----------
+
+It's pretty important to be able to pipeline writes on the same
+object. For this reason, there is a cache of extents written by
+cacheable operations. Each extent remains pinned until the operations
+referring to it are committed. The pipeline prevents rmw operations
+from running until uncacheable transactions (clones, etc) are flushed
+from the pipeline.
+
+See ExtentCache.h for a detailed explanation of how the cache
+states correspond to the higher level invariants about the conditions
+under which concurrent operations can refer to the same object.
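+
+A toy model of the idea (the real implementation in ExtentCache.h is
+considerably more involved)::
+
+  #include <cstdint>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  struct CachedExtent {
+    uint64_t off, len;
+    std::vector<uint8_t> data;
+    unsigned pins = 0;   // outstanding ops still referring to this extent
+  };
+
+  // per-object extents written by cacheable operations; an extent may
+  // be dropped only once every referencing op has committed (pins == 0)
+  using ObjectExtentCache =
+      std::map<std::string, std::vector<CachedExtent>>;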
+
+Pipeline
+--------
+
+Reading src/osd/ExtentCache.h should have given a good idea of how
+operations might overlap. There are several states involved in
+processing a write operation, as well as an important invariant
+(not enforced by ReplicatedPG at a higher level) which must be
+managed by ECBackend. The important invariant is that we can't
+have uncacheable and rmw operations running at the same time
+on the same object. For simplicity, we simply enforce that any
+operation which contains an rmw operation must wait until
+all in-progress uncacheable operations complete.
+
+There are improvements to be made here in the future.
+
+For more details, see ECBackend::waiting_* and
+ECBackend::try_<from>_to_<to>.
+
+++ /dev/null
-===================
-PG Backend Proposal
-===================
-
-NOTE: the last update of this page is dated 2013, before the Firefly
-release. The details of the implementation may be different.
-
-Motivation
-----------
-
-The purpose of the `PG Backend interface
-<https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h>`_
-is to abstract over the differences between replication and erasure
-coding as failure recovery mechanisms.
-
-Much of the existing PG logic, particularly that for dealing with
-peering, will be common to each. With both schemes, a log of recent
-operations will be used to direct recovery in the event that an OSD is
-down or disconnected for a brief period of time. Similarly, in both
-cases it will be necessary to scan a recovered copy of the PG in order
-to recover an empty OSD. The PGBackend abstraction must be
-sufficiently expressive for Replicated and ErasureCoded backends to be
-treated uniformly in these areas.
-
-However, there are also crucial differences between using replication
-and erasure coding which PGBackend must abstract over:
-
-1. The current write strategy would not ensure that a particular
- object could be reconstructed after a failure.
-2. Reads on an erasure coded PG require chunks to be read from the
- replicas as well.
-3. Object recovery probably involves recovering the primary and
- replica missing copies at the same time to avoid performing extra
- reads of replica shards.
-4. Erasure coded PG chunks created for different acting set
- positions are not interchangeable. In particular, it might make
- sense for a single OSD to hold more than 1 PG copy for different
- acting set positions.
-5. Selection of a pgtemp for backfill may differ between replicated
- and erasure coded backends.
-6. The set of necessary OSDs from a particular interval required to
- continue peering may differ between replicated and erasure coded
- backends.
-7. The selection of the authoritative log may differ between replicated
- and erasure coded backends.
-
-Client Writes
--------------
-
-The current PG implementation performs a write by performing the write
-locally while concurrently directing replicas to perform the same
-operation. Once all operations are durable, the operation is
-considered durable. Because these writes may be destructive
-overwrites, during peering, a log entry on a replica (or the primary)
-may be found to be divergent if that replica remembers a log event
-which the authoritative log does not contain. This can happen if only
-1 out of 3 replicas persisted an operation, but was not available in
-the next interval to provide an authoritative log. With replication,
-we can repair the divergent object as long as at least 1 replica has a
-current copy of the divergent object. With erasure coding, however,
-it might be the case that neither the new version of the object nor
-the old version of the object has enough available chunks to be
-reconstructed. This problem is much simpler if we arrange for all
-supported operations to be locally roll-back-able.
-
-- CEPH_OSD_OP_APPEND: We can roll back an append locally by
- including the previous object size as part of the PG log event.
-- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
- requires that we retain the deleted object until all replicas have
- persisted the deletion event. ErasureCoded backend will therefore
- need to store objects with the version at which they were created
- included in the key provided to the filestore. Old versions of an
- object can be pruned when all replicas have committed up to the log
- event deleting the object.
-- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
- to be set or removed, we can roll back these operations locally.
-
-Core Changes:
-
-- Current code should be adapted to use and rollback as appropriate
- APPEND, DELETE, (SET|RM)ATTR log entries.
-- The filestore needs to be able to deal with multiply versioned
- hobjects. This means adapting the filestore internally to
- use a `ghobject <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L238>`_
- which is basically a tuple<hobject_t, gen_t,
- shard_t>. The gen_t + shard_t need to be included in the on-disk
- filename. gen_t is a unique object identifier to make sure there
- are no name collisions when object N is created +
- deleted + created again. An interface needs to be added to get all
- versions of a particular hobject_t or the most recently versioned
- instance of a particular hobject_t.
-
-PGBackend Interfaces:
-
-- PGBackend::perform_write() : It seems simplest to pass the actual
- ops vector. The reason for providing an async, callback based
- interface rather than having the PGBackend respond directly is that
- we might want to use this interface for internal operations like
- watch/notify expiration or snap trimming which might not necessarily
- have an external client.
-- PGBackend::try_rollback() : Some log entries (all of the ones valid
- for the Erasure coded backend) will support local rollback. In
- those cases, PGLog can avoid adding objects to the missing set when
- identifying divergent objects.
-
-Peering and PG Logs
--------------------
-
-Currently, we select the log with the newest last_update and the
-longest tail to be the authoritative log. This is fine because we
-aren't generally able to roll operations on the other replicas forward
-or backwards, instead relying on our ability to re-replicate divergent
-objects. With the write approach discussed in the previous section,
-however, the erasure coded backend will rely on being able to roll
-back divergent operations since we may not be able to re-replicate
-divergent objects. Thus, we must choose the *oldest* last_update from
-the last interval which went active in order to minimize the number of
-divergent objects.
-
-The difficulty is that the current code assumes that as long as it has
-an info from at least 1 OSD from the prior interval, it can complete
-peering. In order to ensure that we do not end up with an
-unrecoverably divergent object, a K+M erasure coded PG must hear from at
-least K of the replicas of the last interval to serve writes. This ensures
-that we will select a last_update old enough to roll back at least K
-replicas. If a replica with an older last_update comes along later,
-we will be able to provide at least K chunks of any divergent object.
-
-Core Changes:
-
-- PG::choose_acting(), etc. need to be generalized to use PGBackend to
- determine the authoritative log.
-- PG::RecoveryState::GetInfo needs to use PGBackend to determine
- whether it has enough infos to continue with authoritative log
- selection.
-
-PGBackend interfaces:
-
-- have_enough_infos()
-- choose_acting()
-
-PGTemp
-------
-
-Currently, an OSD is able to request a temp acting set mapping in
-order to allow an up-to-date OSD to serve requests while a new primary
-is backfilled (and for other reasons). An erasure coded pg needs to
-be able to designate a primary for these reasons without putting it
-in the first position of the acting set. It also needs to be able
-to leave holes in the requested acting set.
-
-Core Changes:
-
-- OSDMap::pg_to_*_osds needs to separately return a primary. For most
- cases, this can continue to be acting[0].
-- MOSDPGTemp (and related OSD structures) needs to be able to specify
- a primary as well as an acting set.
-- Much of the existing code base assumes that acting[0] is the primary
- and that all elements of acting are valid. This needs to be cleaned
- up since the acting set may contain holes.
-
-Client Reads
-------------
-
-Reads with the replicated strategy can always be satisfied
-synchronously out of the primary OSD. With an erasure coded strategy,
-the primary will need to request data from some number of replicas in
-order to satisfy a read. The perform_read() interface for PGBackend
-therefore will be async.
-
-PGBackend interfaces:
-
-- perform_read(): as with perform_write() it seems simplest to pass
- the ops vector. The call to oncomplete will occur once the out_bls
- have been appropriately filled in.
-
-Distinguished acting set positions
-----------------------------------
-
-With the replicated strategy, all replicas of a PG are
-interchangeable. With erasure coding, different positions in the
-acting set have different pieces of the erasure coding scheme and are
-not interchangeable. Worse, crush might cause chunk 2 to be written
-to an OSD which happens already to contain an (old) copy of chunk 4.
-This means that the OSD and PG messages need to work in terms of a
-type like pair<shard_t, pg_t> in order to distinguish different pg
-chunks on a single OSD.
-
-Because the mapping of object name to object in the filestore must
-be 1-to-1, we must ensure that the objects in chunk 2 and the objects
-in chunk 4 have different names. To that end, the filestore must
-include the chunk id in the object key.
-
-Core changes:
-
-- The filestore `ghobject_t needs to also include a chunk id
- <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
- tuple<hobject_t, gen_t, shard_t>.
-- coll_t needs to include a shard_t.
-- The OSD pg_map and similar pg mappings need to work in terms of a
- spg_t (essentially
- pair<pg_t, shard_t>). Similarly, pg->pg messages need to include
- a shard_t
-- For client->PG messages, the OSD will need a way to know which PG
- chunk should get the message since the OSD may contain both a
- primary and non-primary chunk for the same pg
-
-Object Classes
---------------
-
-We probably won't support object classes at first on Erasure coded
-backends.
-
-Scrub
------
-
-We currently have two scrub modes with different default frequencies:
-
-1. [shallow] scrub: compares the set of objects and metadata, but not
- the contents
-2. deep scrub: compares the set of objects, metadata, and a crc32 of
- the object contents (including omap)
-
-The primary requests a scrubmap from each replica for a particular
-range of objects. The replica fills out this scrubmap for the range
-of objects including, if the scrub is deep, a crc32 of the contents of
-each object. The primary gathers these scrubmaps from each replica
-and performs a comparison identifying inconsistent objects.
-
-Most of this can work essentially unchanged with erasure coded PG with
-the caveat that the PGBackend implementation must be in charge of
-actually doing the scan, and that the PGBackend implementation should
-be able to attach arbitrary information to allow PGBackend on the
-primary to scrub PGBackend specific metadata.
-
-The main catch, however, for erasure coded PG is that sending a crc32
-of the stored chunk on a replica isn't particularly helpful since the
-chunks on different replicas presumably store different data. Because
-we don't support overwrites except via DELETE, however, we have the
-option of maintaining a crc32 on each chunk through each append.
-Thus, each replica instead simply computes a crc32 of its own stored
-chunk and compares it with the locally stored checksum. The replica
-then reports to the primary whether the checksums match.
-
-PGBackend interfaces:
-
-- scan()
-- scrub()
-- compare_scrub_maps()
-
-Crush
------
-
-If crush is unable to generate a replacement for a down member of an
-acting set, the acting set should have a hole at that position rather
-than shifting the other elements of the acting set out of position.
-
-Core changes:
-
-- Ensure that crush behaves as above for INDEP.
-
-Recovery
---------
-
-The logic for recovering an object depends on the backend. With
-the current replicated strategy, we first pull the object replica
-to the primary and then concurrently push it out to the replicas.
-With the erasure coded strategy, we probably want to read the
-minimum number of replica chunks required to reconstruct the object
-and push out the replacement chunks concurrently.
-
-Another difference is that objects in erasure coded pg may be
-unrecoverable without being unfound. The "unfound" concept
-should probably then be renamed to unrecoverable. Also, the
-PGBackend implementation will have to be able to direct the search
-for pg replicas with unrecoverable object chunks and to be able
-to determine whether a particular object is recoverable.
-
-
-Core changes:
-
-- s/unfound/unrecoverable
-
-PGBackend interfaces:
-
-- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
-- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
-- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
-- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
-- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_
-
-Backfill
---------
-
-For the most part, backfill itself should behave similarly between
-replicated and erasure coded pools with a few exceptions:
-
-1. We probably want to be able to backfill multiple OSDs concurrently
- with an erasure coded pool in order to cut down on the read
- overhead.
-2. We probably want to avoid having to place the backfill peers in the
- acting set for an erasure coded pg because we might have a good
- temporary pg chunk for that acting set slot.
-
-For 2, we don't really need to place the backfill peer in the acting
-set for replicated PGs anyway.
-For 1, PGBackend::choose_backfill() should determine which OSDs are
-backfilled in a particular interval.
-
-Core changes:
-
-- Backfill should be capable of handling multiple backfill peers
- concurrently even for
- replicated pgs (easier to test for now)
-- Backfill peers should not be placed in the acting set.
-
-PGBackend interfaces:
-
-- choose_backfill(): allows the implementation to determine which OSDs
- should be backfilled in a particular interval.
--- /dev/null
+=================================
+Proposed Next Steps for ECBackend
+=================================
+
+PARITY-DELTA-WRITE
+------------------
+
+RMW operations currently require 4 network hops (2 round trips). In
+principle, for some codes, we can reduce this to 3 by sending the
+update to the replicas holding the data blocks and having them
+compute a delta to forward onto the parity blocks.
+
+The primary reads the current values of the "W" blocks and then uses
+the new values of the "W" blocks to compute parity-deltas for each of
+the parity blocks. The W blocks and the parity delta-blocks are sent
+to their respective shards.
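+
+For intuition, with a simple XOR parity the delta arithmetic looks
+like the following (Reed-Solomon style codes additionally scale the
+delta by a per-shard code coefficient)::
+
+  #include <cstdint>
+  #include <vector>
+
+  using Block = std::vector<uint8_t>;
+
+  // compute delta = old ^ new (at the primary or at the data shard)
+  Block make_delta(const Block& old_blk, const Block& new_blk) {
+    Block d(old_blk.size());
+    for (size_t i = 0; i < d.size(); ++i)
+      d[i] = old_blk[i] ^ new_blk[i];
+    return d;
+  }
+
+  // parity shard: fold the delta in without seeing the rest of the stripe
+  void apply_delta(Block& parity, const Block& delta) {
+    for (size_t i = 0; i < parity.size(); ++i) parity[i] ^= delta[i];
+  }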
+
+The choice of whether to use a read-modify-write or a
+parity-delta-write is a complex policy issue that is TBD in the details
+and is likely to be heavily dependent on the computational costs
+associated with a parity-delta vs. a regular parity-generation
+operation. However, it is believed that the parity-delta scheme is
+likely to be the preferred choice, when available.
+
+The internal interface to the erasure coding library plug-ins needs to
+be extended to support the ability to query if parity-delta
+computation is possible for a selected algorithm as well as an
+interface to the actual parity-delta computation algorithm when
+available.
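+
+A hypothetical shape for that extension (method names invented here;
+the real interface lives in src/erasure-code/ErasureCodeInterface.h)::
+
+  #include <cstdint>
+  #include <map>
+  #include <vector>
+
+  struct ECPluginDeltaExt {
+    virtual ~ECPluginDeltaExt() = default;
+    // can this code update parity from a delta, without the full stripe?
+    virtual bool supports_parity_delta() const = 0;
+    // fold a data-shard delta into the given parity chunks
+    virtual int apply_parity_delta(
+        int data_shard,
+        const std::vector<uint8_t>& delta,
+        std::map<int, std::vector<uint8_t>>& parity_chunks) = 0;
+  };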
+
+Stripe Cache
+------------
+
+It may be a good idea to extend the current ExtentCache usage to
+cache some data past when the pinning operation releases it.
+One application pattern that is important to optimize is the small
+block sequential write operation (think of the journal of a journaling
+file system or a database transaction log). Regardless of the chosen
+redundancy algorithm, it is advantageous for the primary to
+retain/buffer recently read/written portions of a stripe in order to
+reduce network traffic. The dynamic contents of this cache may be used
+in the determination of whether a read-modify-write or a
+parity-delta-write is performed. The sizing of this cache is TBD, but
+we should plan on allowing at least a few full stripes per active
+client. Limiting the cache occupancy on a per-client basis will reduce
+the noisy neighbor problem.
+
+Recovery and Rollback Details
+=============================
+
+Implementing a Rollback-able Prepare Operation
+----------------------------------------------
+
+The prepare operation is implemented at each OSD through a simulation
+of a versioning or copy-on-write capability for modifying a portion of
+an object.
+
+When a prepare operation is performed, the new data is written into a
+temporary object. The PG log for the
+operation will contain a reference to the temporary object so that it
+can be located for recovery purposes as well as a record of all of the
+shards which are involved in the operation.
+
+In order to avoid fragmentation (and preserve future read performance),
+creation of the temporary object needs special attention. The name of
+the temporary object affects its location within the KV store. Right
+now it's unclear whether it's desirable for the name to sort near the
+base object or whether a separate subset of keyspace should be used
+for temporary objects. Sam believes that colocation with the base
+object is preferred (he suggests using the generation counter of the
+ghobject for temporaries), whereas Allen believes that using a
+separate subset of keyspace is desirable since these keys are
+ephemeral and we don't want to actually colocate them with the base
+object keys. Perhaps some modeling here can help resolve this
+issue. The data of the temporary object wants to be located as close
+to the data of the base object as possible. This may be best performed
+by adding a new ObjectStore creation primitive that takes the base
+object as an additional parameter that is a hint to the allocator.
+
+Sam: I think that the short lived thing may be a red herring. We'll
+be updating the donor and primary objects atomically, so it seems like
+we'd want them adjacent in the key space, regardless of the donor's
+lifecycle.
+
+The apply operation moves the data from the temporary object into the
+correct position within the base object and deletes the associated
+temporary object. This operation is done using a specialized
+ObjectStore primitive. In the current ObjectStore interface, this can
+be done using the clone_range function followed by a delete, but can be
+done more efficiently with a specialized move primitive.
+Implementation of the specialized primitive on FileStore can be done
+by copying the data. Some file systems have extensions that might also
+be able to implement this operation (like a defrag API that swaps
+chunks between files). It is expected that NewStore will be able to
+support this efficiently and natively (it has been noted that this
+sequence requires that temporary object allocations, which tend to be
+small, be efficiently converted into blocks for main objects and that
+blocks that were formerly inside of main objects must be reusable with
+minimal overhead).
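+
+Expressed against today's ObjectStore interface, the apply step is
+roughly the following (transaction methods are simplified stand-ins;
+see os/ObjectStore.h for the real signatures)::
+
+  #include <cstdint>
+
+  struct Extent { uint64_t off, len; };
+
+  // copy the prepared bytes out of the temp object, then delete it;
+  // a native move/swap primitive would avoid the data copy entirely
+  template <typename Txn, typename Obj>
+  void apply_prepared(Txn& t, const Obj& temp_obj, const Obj& base_obj,
+                      const Extent& e) {
+    t.clone_range(temp_obj, base_obj, e.off, e.len, e.off);
+    t.remove(temp_obj);
+  }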
+
+The prepare and apply operations can be separated arbitrarily in
+time. If a read operation accesses an object that has been altered by
+a prepare operation (but without a corresponding apply operation) it
+must return the data after the prepare operation. This is done by
+creating an in-memory database of objects which have had a prepare
+operation without a corresponding apply operation. All read operations
+must consult this in-memory data structure in order to get the correct
+data. It should be explicitly recognized that it is likely that there
+will be multiple prepare operations against a single base object and
+the code must handle this case correctly. This code is implemented as
+a layer between ObjectStore and all existing readers. Annoyingly,
+we'll want to trash this state when the interval changes, so the first
+thing that needs to happen after activation is that the primary and
+replicas apply up to last_update so that the empty cache will be
+correct.
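+
+A sketch of that read-side overlay (names invented for illustration)::
+
+  #include <cstdint>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  struct PreparedWrite {
+    uint64_t off, len;
+    std::string temp_object;   // where the new bytes currently live
+  };
+
+  // base object -> unapplied prepares, in order; multiple outstanding
+  // prepares against one object are expected and must be handled
+  using PrepareOverlay =
+      std::map<std::string, std::vector<PreparedWrite>>;
+
+  // readers consult the overlay first and take bytes from overlapping
+  // prepares in preference to the base object; the overlay is emptied
+  // on interval change by applying everything up to last_update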
+
+During peering, it is now obvious that an unapplied prepare operation
+can easily be rolled back simply by deleting the associated temporary
+object and removing that entry from the in-memory data structure.
+
+Partial Application Peering/Recovery modifications
+--------------------------------------------------
+
+Some writes will be small enough to not require updating all of the
+shards holding data blocks. For write amplification minimization
+reasons, it would be best to avoid writing to those shards at all,
+and delay even sending the log entries until the next write which
+actually hits that shard.
+
+The delaying (buffering) of the transmission of the prepare and apply
+operations for witnessing OSDs creates new situations that peering
+must handle. In particular, the logic for determining the authoritative
+last_update value (and hence the selection of the OSD which has the
+authoritative log) must be modified to account for the valid but
+missing (i.e., delayed/buffered) pglog entries to which the
+authoritative OSD was only a witness.
+
+Because a partial write might complete without persisting a log entry
+on every replica, we have to do a bit more work to determine an
+authoritative last_update. The constraint (as with a replicated PG)
+is that last_update >= the most recent log entry for which a commit
+was sent to the client (call this actual_last_update). Secondarily,
+we want last_update to be as small as possible since any log entry
+past actual_last_update (we do not apply a log entry until we have
+sent the commit to the client) must be able to be rolled back. Thus,
+the smaller a last_update we choose, the less recovery will need to
+happen (we can always roll back, but rolling a replica forward may
+require an object rebuild). Thus, we will set last_update to 1 before
+the oldest log entry we can prove cannot have been committed. In
+current master, this is simply the last_update of the shortest log
+from that interval (because that log did not persist any entry past
+that point -- a precondition for sending a commit to the client). For
+this design, we must consider the possibility that any log is missing,
+at its head, entries in which it did not participate. Thus, we
+must determine the most recent interval in which we went active
+(essentially, this is what find_best_info currently does). We then
+pull the log from each live osd from that interval back to the minimum
+last_update among them. Then, we extend all logs from the
+authoritative interval until each hits an entry in which it should
+have participated, but did not record. The shortest of these extended
+logs must therefore contain any log entry for which we sent a commit
+to the client -- and the last entry gives us our last_update.
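+
+In pseudocode (the types below are stand-ins for eversion_t and
+pg_log_entry_t, assuming log entries record their participating
+shards)::
+
+  #include <algorithm>
+  #include <cstdint>
+  #include <set>
+  #include <vector>
+
+  struct Entry {
+    uint64_t version;
+    std::set<int> participants;  // shards this write actually touched
+  };
+  struct ShardLog {
+    int shard;
+    std::vector<Entry> entries;  // oldest..newest, pulled back to the
+  };                             // min last_update of the candidate set
+
+  // extend `log` past its recorded head using `merged` (all entries
+  // from the authoritative interval, sorted by version); stop at the
+  // first entry it should have participated in but did not record
+  uint64_t extended_head(const ShardLog& log,
+                         const std::vector<Entry>& merged) {
+    uint64_t head = log.entries.empty() ? 0 : log.entries.back().version;
+    for (const Entry& e : merged) {
+      if (e.version <= head) continue;             // already recorded
+      if (e.participants.count(log.shard)) break;  // missed a write it saw
+      head = e.version;                            // witness-only gap: extend
+    }
+    return head;
+  }
+
+  // last_update = shortest extended head: no commit could have been
+  // sent to a client for anything newer
+  uint64_t choose_last_update(const std::vector<ShardLog>& logs,
+                              const std::vector<Entry>& merged) {
+    uint64_t best = UINT64_MAX;
+    for (const auto& l : logs)
+      best = std::min(best, extended_head(l, merged));
+    return best;
+  }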
+
+Deep scrub support
+------------------
+
+The simple answer here is probably our best bet. EC pools can't use
+the omap namespace at all right now. The simplest solution would be
+to take a prefix of the omap space and pack N L-bit checksums, each
+covering M bytes of data, into each key/value. The prefixing seems like
+a sensible precaution
+against eventually wanting to store something else in the omap space.
+It seems like any write will need to read at least the blocks
+containing the modified range. However, with a code able to compute
+parity deltas, we may not need to read a whole stripe. Even without
+that, we don't want to have to write to blocks not participating in
+the write. Thus, each shard should store checksums only for itself.
+It seems like you'd be able to store checksums for all shards on the
+parity blocks, but there may not be distinguished parity blocks which
+are modified on all writes (LRC or shec provide two examples). L
+should probably have a fixed number of options (16, 32, 64?) and be
+configurable per-pool at pool creation. N, M should likewise be
+configurable at pool creation with sensible defaults.
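+
+A sketch of the packing arithmetic (the prefix, widths, and key format
+here are illustrative placeholders, not a settled on-disk format)::
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <string>
+  #include <utility>
+
+  constexpr const char* CSUM_PREFIX = "_csum_";  // reserved omap prefix
+  constexpr uint32_t L_BITS = 32;                // checksum width
+  constexpr uint32_t N_PER_KEY = 256;            // checksums per value
+
+  // chunk index on this shard -> (omap key, byte offset in the value)
+  std::pair<std::string, uint32_t> csum_slot(uint64_t chunk_index) {
+    char key[32];
+    std::snprintf(key, sizeof(key), "%s%016llx", CSUM_PREFIX,
+                  (unsigned long long)(chunk_index / N_PER_KEY));
+    uint32_t off = (chunk_index % N_PER_KEY) * (L_BITS / 8);
+    return {std::string(key), off};
+  }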
+
+We need to handle online upgrade. I think the right answer is that
+the first overwrite to an object with an append only checksum
+removes the append only checksum and writes in whatever stripe
+checksums actually got written. The next deep scrub then writes
+out the full checksum omap entries.
+
+RADOS Client Acknowledgement Generation Optimization
+====================================================
+
+Now that the recovery scheme is understood, we can discuss the
+generation of the RADOS operation acknowledgement (ACK) by the
+primary ("sufficient" from above). It is NOT required that the primary
+wait for all shards to complete their respective prepare
+operations. Using our example where the RADOS operations writes only
+"W" chunks of the stripe, the primary will generate and send W+M
+prepare operations (possibly including a send-to-self). The primary
+need only wait for enough shards to be written to ensure recovery of
+the data. Thus, after writing W + M chunks, you can afford the loss of M
+chunks. Hence the primary can generate the RADOS ACK after W+M-M = W
+of those prepare operations are completed.
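+
+For example, with a k=4, m=2 code and a write touching W=2 chunks, the
+primary issues W+M=4 prepare operations and can ACK once any 2 of them
+complete: even if the 2 still in flight are lost, the write remains
+recoverable.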
+
+Inconsistent object_info_t versions
+===================================
+
+A natural consequence of only writing the blocks which actually
+changed is that we don't want to update the object_info_t of the
+objects which didn't. I actually think it would pose a problem to do
+so: pg ghobject namespaces are generally large, and unless the osd is
+seeing a bunch of overwrites on a small set of objects, I'd expect
+each write to be far enough apart in the backing ghobject_t->data
+mapping to each constitute a random metadata update. Thus, we have to
+accept that not every shard will have the current version in its
+object_info_t. We can't even bound how old the version on a
+particular shard will happen to be. In particular, the primary does
+not necessarily have the current version. One could argue that the
+parity shards would always have the current version, but not every
+code necessarily has designated parity shards which see every write
+(certainly LRC, iirc shec, and even with a more pedestrian code, it
+might be desirable to rotate the shards based on object hash). Even
+if you chose to designate a shard as witnessing all writes, the pg
+might be degraded with that particular shard missing. This is a bit
+tricky; currently, reads and writes implicitly return the most recent
+version of the object written. On reads, we'd have to read K shards
+to answer that question. We can get around that by adding a "don't
+tell me the current version" flag. Writes are more problematic: we
+need an object_info from the most recent write in order to form the
+new object_info and log_entry.
+
+A truly terrifying option would be to eliminate version and
+prior_version entirely from the object_info_t. There are a few
+specific purposes it serves:
+(1) On OSD startup, we prime the missing set by scanning backwards
+ from last_update to last_complete comparing the stored object's
+ object_info_t to the version of most recent log entry.
+(2) During backfill, we compare versions between primary and target
+ to avoid some pushes.
+
+We use it elsewhere as well:
+(3) While pushing and pulling objects, we verify the version.
+(4) We return it on reads and writes and allow the librados user to
+ assert it atomically on writes to allow the user to deal with write
+ races (used extensively by rbd).
+
+Case (3) isn't actually essential, just convenient. Oh well. (4)
+is more annoying. Writes are easy since we know the version. Reads
+are tricky because we may not need to read from all of the replicas.
+Simplest solution is to add a flag to rados operations to just not
+return the user version on read. We can also just not support the
+user version assert on ec for now (I think? Only user is rgw bucket
+indices iirc, and those will always be on replicated because they use
+omap).
+
+We can avoid (1) by maintaining the missing set explicitly. It's
+already possible for there to be a missing object without a
+corresponding log entry (Consider the case where the most recent write
+is to an object which has not been updated in weeks. If that write
+becomes divergent, the written object needs to be marked missing based
+on the prior_version which is not in the log.) The PGLog already has
+a way of handling those edge cases (see divergent_priors). We'd
+simply expand that to contain the entire missing set and maintain it
+atomically with the log and the objects. This isn't really an
+unreasonable option; the additional keys would be fewer than the
+existing log keys + divergent_priors and aren't updated in the fast
+write path anyway.
+
+The second case is a bit trickier. It's really an optimization for
+the case where a shard was out of the acting set long enough for the
+logs to no longer overlap, but not long enough for the PG to have
+healed and removed the old copy. Unfortunately, this describes the
+case where a node was taken down for maintenance with noout set. It's
+probably not acceptable to re-backfill the whole OSD in such a case,
+so we need to be able to quickly determine whether a particular shard
+is up to date given a valid acting set of other shards.
+
+Let ordinary writes which do not change the object size not touch the
+object_info at all. That means that the object_info version won't
+match the pg log entry version. Include in the pg_log_entry_t the
+current object_info version as well as which shards participated (as
+mentioned above). In addition to the object_info_t attr, record on
+each shard s a vector recording for each other shard s' the most
+recent write which spanned both s and s'. Operationally, we maintain
+an attr on each shard containing that vector. A write touching a shard
+set S updates the version stamp entry for each shard in S in the attr
+of each shard in S (and leaves the rest alone). If we have a valid acting
+set during backfill, we must have a witness of every write which
+completed -- so taking the max of each entry over all of the acting
+set shards must give us the current version for each shard. During
+recovery, we set the attribute on the recovery target to that max
+vector (Question: with LRC, we may not need to touch much of the
+acting set to recover a particular shard -- can we just use the max of
+the shards we used for recovery, or do we need to grab the version
+vector from the rest of the acting set as well? I'm not sure, not a
+big deal anyway, I think).
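+
+Operationally, the vector maintenance might look like this (names
+invented for the sketch)::
+
+  #include <algorithm>
+  #include <cstdint>
+  #include <set>
+  #include <vector>
+
+  using VersionVec = std::vector<uint64_t>;  // indexed by shard id
+
+  // a write at `version` touching shard set S updates, on each shard
+  // in S, the stamps for every member of S and leaves the rest alone
+  void note_write(VersionVec& my_attr, const std::set<int>& S,
+                  uint64_t version) {
+    for (int s : S) my_attr[s] = std::max(my_attr[s], version);
+  }
+
+  // a valid acting set witnessed every completed write, so the
+  // elementwise max over its shards' attrs gives the authoritative
+  // current version of every shard (used to skip backfill pushes)
+  VersionVec authoritative(const std::vector<VersionVec>& acting_attrs) {
+    VersionVec out(acting_attrs.at(0).size(), 0);
+    for (const auto& v : acting_attrs)
+      for (size_t i = 0; i < out.size(); ++i)
+        out[i] = std::max(out[i], v[i]);
+    return out;
+  }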
+
+The above lets us perform blind writes without knowing the current
+object version (log entry version, that is) while still allowing us to
+avoid backfilling up to date objects. The only catch is that our
+backfill scans will scan all replicas, not just the primary and the
+backfill targets.
+
+It would be worth adding into scrub the ability to check the
+consistency of the gathered version vectors -- probably by just
+taking 3 random valid subsets and verifying that they generate
+the same authoritative version vector.
+
+Implementation Strategy
+=======================
+
+It goes without saying that it would be unwise to attempt to do all of
+this in one massive PR. It's also not a good idea to merge code which
+isn't being tested. To that end, it's worth thinking a bit about
+which bits can be tested on their own (perhaps with a bit of temporary
+scaffolding).
+
+We can implement the overwrite friendly checksumming scheme easily
+enough with the current implementation. We'll want to enable it on a
+per-pool basis (probably using a flag which we'll later repurpose for
+actual overwrite support). We can enable it in some of the ec
+thrashing tests in the suite. We can also add a simple test
+validating the behavior of turning it on for an existing ec pool
+(later, we'll want to be able to convert append-only ec pools to
+overwrite ec pools, so that test will simply be expanded as we go).
+The flag should be gated by the experimental feature flag since we
+won't want to support this as a valid configuration -- testing only.
+We need to upgrade append-only pools in place during deep scrub.
+
+Similarly, we can implement the unstable extent cache with the current
+implementation; it even lets us cut out the readable ack the replicas
+send to the primary after the commit, which lets it release the lock.
+Same deal, implement, gate with experimental flag, add to some of the
+automated tests. I don't really see a reason not to use the same flag
+as above.
+
+We can certainly implement the move-range primitive with unit tests
+before there are any users. Adding coverage to the existing
+objectstore tests would suffice here.
+
+Explicit missing set can be implemented now, same deal as above --
+might as well even use the same feature bit.
+
+The TPC protocol outlined above can actually be implemented on an append
+only EC pool. Same deal as above, can even use the same feature bit.
+
+The RADOS flag to suppress the read op user version return can be
+implemented immediately. Mostly just needs unit tests.
+
+The version vector problem is an interesting one. For append only EC
+pools, it would be pointless since all writes increase the size and
+therefore update the object_info. We could do it for replicated pools
+though. It's a bit silly since all "shards" see all writes, but it
+would still let us implement and partially test the augmented backfill
+code as well as the extra pg log entry fields -- this depends on the
+explicit pg log entry branch having already merged. It's not entirely
+clear to me that this one is worth doing separately. It's enough code
+that I'd really prefer to get it done independently, but it's also a
+fair amount of scaffolding that will be later discarded.
+
+PGLog entries need to be able to record the participants and log
+comparison needs to be modified to extend logs with entries they
+wouldn't have witnessed. This logic should be abstracted behind
+PGLog so it can be unittested -- that would let us test it somewhat
+before the actual ec overwrites code merges.
+
+Whatever needs to happen to the ec plugin interface can probably be
+done independently of the rest of this (pending resolution of
+questions below).
+
+The actual nuts and bolts of performing the ec overwrite, it seems to
+me, can't be productively tested (and therefore implemented) until the
+above are complete, so best to get all of the supporting code in
+first.
+
+Open Questions
+==============
+
+Is there a code we should be using that would let us compute a parity
+delta without rereading and reencoding the full stripe? If so, is it
+the kind of thing we need to design for now, or can it be reasonably
+put off?
+
+What needs to happen to the EC plugin interface?
--- /dev/null
+============
+Log Based PG
+============
+
+Background
+==========
+
+Why ReplicatedPG?
+-----------------
+
+Currently, consistency for all ceph pool types is ensured by primary
+log-based replication. This goes for both erasure-coded and
+replicated pools. If you ever find yourself asking "why is it that
+both replicated and erasure coded pools are implemented by
+ReplicatedPG.h/cc", that's why (though it definitely should be called
+LogBasedPG, should include peering, and PG should be an abstract
+interface defining only those things the OSD needs to know to route
+messages etc -- but we live in an imperfect world where git deals
+imperfectly with cherry-picking between branches where the file has
+different names).
+
+Primary log-based replication
+-----------------------------
+
+Reads must return data written by any write which completed (where the
+client could possibly have received a commit message). There are lots
+of ways to handle this, but ceph's architecture makes it easy for
+everyone at any map epoch to know who the primary is. Thus, the easy
+answer is to route all writes for a particular pg through a single
+ordering primary and then out to the replicas. Though we only
+actually need to serialize writes on a single object (and even then,
+the partial ordering only really needs to provide an ordering between
+writes on overlapping regions), we might as well serialize writes on
+the whole PG since it lets us represent the current state of the PG
+using two numbers: the epoch of the map on the primary in which the
+most recent write started (this is a bit stranger than it might seem
+since map distribution itself is asynchronous -- see Peering and the
+concept of interval changes) and an increasing per-pg version number
+-- this is referred to in the code with type eversion_t and stored as
+pg_info_t::last_update. Furthermore, we maintain a log of "recent"
+operations extending back at least far enough to include any
+*unstable* writes (writes which have been started but not committed)
+and any objects which aren't up to date locally (see recovery and
+backfill). In practice, the log will extend much further
+(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not
+clean) because it's handy for quickly performing recovery.
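+
+The two numbers look like the following (a simplification of
+osd_types.h:eversion_t)::
+
+  #include <cstdint>
+  #include <tuple>
+
+  struct eversion_t {
+    uint64_t epoch;    // map epoch on the primary when the write started
+    uint64_t version;  // increasing per-pg sequence number
+    bool operator<(const eversion_t& o) const {
+      return std::tie(epoch, version) < std::tie(o.epoch, o.version);
+    }
+  };
+  // pg_info_t::last_update is the eversion_t of the newest write
+  // recorded in the pg log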
+
+Using this log, as long as we talk to a non-empty subset of the OSDs
+which must have accepted any completed writes from the most recent
+interval in which we accepted writes, we can determine a conservative
+log which must contain any write which has been reported to a client
+as committed. There is some freedom here: we can choose any log entry
+between the oldest head remembered by an element of that set (anything
+newer cannot have completed without that log containing it) and the
+newest head remembered (clearly, all writes in the log were started,
+so it's fine for us to remember them) as the new head. This is the
+main point of divergence between replicated pools and ec pools in
+PG/ReplicatedPG: replicated pools try to choose the newest valid
+option to avoid the client needing to replay those operations and
+instead recover the other copies. EC pools instead try to choose
+the *oldest* option available to them.
+
+The reason for this gets to the heart of the rest of the differences
+in implementation: one copy will not generally be enough to
+reconstruct an ec object. Indeed, there are encodings where some log
+combinations would leave unrecoverable objects (as with a 4+2 encoding
+where 3 of the replicas remember a write, but the other 3 do not -- we
+don't have enough chunks of either version). For this reason, log entries
+representing *unstable* writes (writes not yet committed to the
+client) must be rollbackable using only local information on ec pools.
+Log entries in general may therefore be rollbackable (in that case,
+either via delayed application or via a set of instructions for rolling
+back an in-place update) or not. Replicated pool log entries are
+never able to be rolled back.
+
+For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
+osd_types.h:pg_log_entry_t, and peering in general.
+
+ReplicatedBackend/ECBackend unification strategy
+================================================
+
+PGBackend
+---------
+
+So, the fundamental difference between replication and erasure coding
+is that replication can do destructive updates while erasure coding
+cannot. It would be really annoying if we needed to have two entire
+implementations of ReplicatedPG, one for each of the two, if there are
+really only a few fundamental differences:
+
+ 1. How reads work -- async only, requires remote reads for ec
+ 2. How writes work -- either restricted to append, or must
+ write aside and do a tpc
+ 3. Whether we choose the oldest or newest possible head entry
+ during peering
+ 4. A bit of extra information in the log entry to enable rollback
+
+and so many similarities
+
+ 1. All of the stats and metadata for objects
+ 2. The high level locking rules for mixing client IO with recovery
+ and scrub
+ 3. The high level locking rules for mixing reads and writes without
+ exposing uncommitted state (which might be rolled back or
+ forgotten later)
+ 4. The process, metadata, and protocol needed to determine the set
+ of osds which participated in the most recent interval in which we
+ accepted writes
+ 5. etc.
+
+Instead, we choose a few abstractions (and a few kludges) to paper
+over the difference:
+
+ 1. PGBackend
+ 2. PGTransaction
+ 3. PG::choose_acting chooses between calc_replicated_acting and
+ calc_ec_acting
+ 4. Various bits of the write pipeline disallow some operations based
+ on pool type -- like omap operations, class operation reads, and
+ writes which are not aligned appends (officially, so far)
+ for ec
+ 5. Misc other kludges here and there
+
+PGBackend and PGTransaction enable abstraction of differences 1, 2,
+and the addition of 4 as needed to the log entries.
+
+The replicated implementation is in ReplicatedBackend.h/cc and doesn't
+require much explanation, I think. More detail on the ECBackend can be
+found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
+
+PGBackend Interface Explanation
+===============================
+
+Note: this is from a design document from before the original firefly
+and is probably out of date w.r.t. some of the method names.
+
+Readable vs Degraded
+--------------------
+
+For a replicated pool, an object is readable iff it is present on
+the primary (at the right version). For an ec pool, we need at least
+K shards present to do a read, and we need it on the primary. For
+this reason, PGBackend needs to include some interfaces for determining
+when recovery is required to serve a read vs a write. This also
+changes the rules for when peering has enough logs to prove that it
+
+Core Changes:
+
+- PGBackend needs to be able to return
+ IsPG(Recoverable|Readable)Predicate objects to allow the user
+ to make these determinations.
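+
+A hedged sketch of what such a predicate might look like (the real
+classes are the IsPG*Predicate types in PGBackend.h)::
+
+  #include <set>
+
+  struct IsRecoverablePredicate {
+    virtual ~IsRecoverablePredicate() = default;
+    // can an object be reconstructed from shards present on `have`?
+    virtual bool operator()(const std::set<int>& have) const = 0;
+  };
+
+  struct ReplicatedRecoverable : IsRecoverablePredicate {
+    bool operator()(const std::set<int>& have) const override {
+      return !have.empty();          // any single copy suffices
+    }
+  };
+
+  struct ECRecoverable : IsRecoverablePredicate {
+    int k;                           // data chunks required by the code
+    explicit ECRecoverable(int k_) : k(k_) {}
+    bool operator()(const std::set<int>& have) const override {
+      return (int)have.size() >= k;  // need at least k distinct shards
+    }
+  };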
+
+Client Reads
+------------
+
+Reads with the replicated strategy can always be satisfied
+synchronously out of the primary OSD. With an erasure coded strategy,
+the primary will need to request data from some number of replicas in
+order to satisfy a read. PGBackend will therefore need to provide
+separate objects_read_sync and objects_read_async interfaces where
+the former won't be implemented by the ECBackend.
+
+PGBackend interfaces:
+
+- objects_read_sync
+- objects_read_async
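+
+A simplified sketch of the split (signatures are illustrative, not the
+actual PGBackend.h declarations)::
+
+  #include <cstdint>
+  #include <functional>
+  #include <string>
+  #include <vector>
+
+  struct PGBackendReads {
+    virtual ~PGBackendReads() = default;
+    // replicated: serve directly from the local store
+    virtual int objects_read_sync(const std::string& oid, uint64_t off,
+                                  uint64_t len,
+                                  std::vector<uint8_t>* out) = 0;
+    // ec: gather enough shard reads, reconstruct, then call back;
+    // this is the only flavor ECBackend implements
+    virtual void objects_read_async(
+        const std::string& oid, uint64_t off, uint64_t len,
+        std::function<void(int r, std::vector<uint8_t> data)> on_done) = 0;
+  };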
+
+Scrub
+-----
+
+We currently have two scrub modes with different default frequencies:
+
+1. [shallow] scrub: compares the set of objects and metadata, but not
+ the contents
+2. deep scrub: compares the set of objects, metadata, and a crc32 of
+ the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects. The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a crc32 of the contents of
+each object. The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
+
+Most of this can work essentially unchanged with erasure coded PG with
+the caveat that the PGBackend implementation must be in charge of
+actually doing the scan.
+
+
+PGBackend interfaces:
+
+- be_*
+
+Recovery
+--------
+
+The logic for recovering an object depends on the backend. With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in erasure coded pg may be
+unrecoverable without being unfound. The "unfound" concept
+should probably then be renamed to unrecoverable. Also, the
+PGBackend implementation will have to be able to direct the search
+for pg replicas with unrecoverable object chunks and to be able
+to determine whether a particular object is recoverable.
+
+
+Core changes:
+
+- s/unfound/unrecoverable
+
+PGBackend interfaces:
+
+- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
+- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
+- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
+- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
+- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_