From 4740fe46ecadd050227fe5807570946995634dee Mon Sep 17 00:00:00 2001 From: Samuel Just Date: Thu, 27 Oct 2016 18:32:22 -0700 Subject: [PATCH] doc/dev/osd_internals: add some docs for ECBackend Also, clean up some old ones. Signed-off-by: Samuel Just --- .../erasure_coding/ecbackend.rst | 207 ++++++++++ .../erasure_coding/pgbackend.rst | 320 --------------- .../erasure_coding/proposals.rst | 384 ++++++++++++++++++ doc/dev/osd_internals/log_based_pg.rst | 221 ++++++++++ 4 files changed, 812 insertions(+), 320 deletions(-) create mode 100644 doc/dev/osd_internals/erasure_coding/ecbackend.rst delete mode 100644 doc/dev/osd_internals/erasure_coding/pgbackend.rst create mode 100644 doc/dev/osd_internals/erasure_coding/proposals.rst create mode 100644 doc/dev/osd_internals/log_based_pg.rst diff --git a/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/doc/dev/osd_internals/erasure_coding/ecbackend.rst new file mode 100644 index 0000000000000..2409ccabbe424 --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/ecbackend.rst @@ -0,0 +1,207 @@ +================================= +ECBackend Implementation Strategy +================================= + +Misc initial design notes +========================= + +The initial (and still true for ec pools without the hacky ec +overwrites debug flag enabled) design for ec pools restricted +EC pools to operations which can be easily rolled back: + +- CEPH_OSD_OP_APPEND: We can roll back an append locally by + including the previous object size as part of the PG log event. +- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete + requires that we retain the deleted object until all replicas have + persisted the deletion event. ErasureCoded backend will therefore + need to store objects with the version at which they were created + included in the key provided to the filestore. Old versions of an + object can be pruned when all replicas have committed up to the log + event deleting the object. 
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr + to be set or removed, we can roll back these operations locally. + +Log entries contain a structure explaining how to locally undo the +operation represented by the log entry +(see osd_types.h:TransactionInfo::LocalRollBack). + +PGTemp and Crush +---------------- + +Primaries are able to request a temp acting set mapping in order to +allow an up-to-date OSD to serve requests while a new primary is +backfilled (and for other reasons). An erasure coded pg needs to be +able to designate a primary for these reasons without putting it in +the first position of the acting set. It also needs to be able to +leave holes in the requested acting set. + +Core Changes: + +- OSDMap::pg_to_*_osds needs to separately return a primary. For most + cases, this can continue to be acting[0]. +- MOSDPGTemp (and related OSD structures) needs to be able to specify + a primary as well as an acting set. +- Much of the existing code base assumes that acting[0] is the primary + and that all elements of acting are valid. This needs to be cleaned + up since the acting set may contain holes. + +Distinguished acting set positions +---------------------------------- + +With the replicated strategy, all replicas of a PG are +interchangeable. With erasure coding, different positions in the +acting set have different pieces of the erasure coding scheme and are +not interchangeable. Worse, crush might cause chunk 2 to be written +to an OSD which happens already to contain an (old) copy of chunk 4. +This means that the OSD and PG messages need to work in terms of a +type like pair<pg_t, shard_t> in order to distinguish different pg +chunks on a single OSD. + +Because the mapping of object name to object in the filestore must +be 1-to-1, we must ensure that the objects in chunk 2 and the objects +in chunk 4 have different names. To that end, the objectstore must +include the chunk id in the object key.
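The naming constraint above can be sketched in a few lines. This is a toy illustration only; the identifiers are hypothetical and do not reflect the real ghobject_t encoding:

```python
# Sketch: two chunks of the same PG held by one OSD must not collide in
# the object store, so the shard (chunk) id is part of the object key.
# All names here are illustrative, not the actual ghobject_t layout.

def object_key(pg_id, shard_id, object_name):
    """Build a store key qualified by PG and shard (chunk) id."""
    return (pg_id, shard_id, object_name)

# The same logical object stored as chunk 2 and as an (old) copy of
# chunk 4 on a single OSD gets two distinct keys:
k2 = object_key("1.a", 2, "foo")
k4 = object_key("1.a", 4, "foo")
assert k2 != k4
```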
+ +Core changes: + +- The objectstore `ghobject_t needs to also include a chunk id + `_ making it more like + tuple<hobject_t, gen_t, shard_t>. +- coll_t needs to include a shard_t. +- The OSD pg_map and similar pg mappings need to work in terms of a + spg_t (essentially + pair<pg_t, shard_t>). Similarly, pg->pg messages need to include + a shard_t. +- For client->PG messages, the OSD will need a way to know which PG + chunk should get the message since the OSD may contain both a + primary and non-primary chunk for the same pg. + +Object Classes +-------------- + +Reads from object classes will return ENOTSUP on ec pools by invoking +a special SYNC read. + +Scrub +----- + +The main catch, however, for ec pools is that sending a crc32 of the +stored chunk on a replica isn't particularly helpful since the chunks +on different replicas presumably store different data. Because we +don't support overwrites except via DELETE, however, we have the +option of maintaining a crc32 on each chunk through each append. +Thus, each replica instead simply computes a crc32 of its own stored +chunk and compares it with the locally stored checksum. The replica +then reports to the primary whether the checksums match. + +With overwrites, all scrubs are disabled for now until we work out +what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst). + +Crush +----- + +If crush is unable to generate a replacement for a down member of an +acting set, the acting set should have a hole at that position rather +than shifting the other elements of the acting set out of position. + +========= +ECBackend +========= + +MAIN OPERATION OVERVIEW +======================= + +A RADOS put operation can span +multiple stripes of a single object. There must be code that +tessellates the application level write into a set of per-stripe write +operations -- some whole-stripes and up to two partial +stripes.
Without loss of generality, for the remainder of this +document we will focus exclusively on writing a single stripe (whole +or partial). We will use the symbol "W" to represent the number of +blocks within a stripe that are being written, i.e., W <= K. + +There are two data flows for handling a write into an EC stripe. The +choice between them is based on the size +of the write operation and the arithmetic properties of the selected +parity-generation algorithm. + +(1) whole stripe is written/overwritten +(2) a read-modify-write operation is performed. + +WHOLE STRIPE WRITE +------------------ + +This is the simple case, and is already performed in the existing code +(for appends, that is). The primary receives all of the data for the +stripe in the RADOS request, computes the appropriate parity blocks +and sends the data and parity blocks to their destination shards which +write them. This is essentially the current EC code. + +READ-MODIFY-WRITE +----------------- + +The primary determines which of the K-W blocks are to be unmodified, +and reads them from the shards. Once all of the data is received it is +combined with the received new data and new parity blocks are +computed. The modified blocks are sent to their respective shards and +written. The RADOS operation is acknowledged. + +OSD Object Write and Consistency +-------------------------------- + +Regardless of the algorithm chosen above, writing of the data is a two +phase process: commit and rollforward. The primary sends the log +entries with the operation described (see +osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)). +In all cases, the "commit" is performed in place, possibly leaving some +information required for a rollback in a write-aside object. The +rollforward phase occurs once all acting set replicas have committed +the commit (sorry, overloaded term) and removes the rollback information.
+ +In the case of overwrites of existing stripes, the rollback information +has the form of a sparse object containing the old values of the +overwritten extents populated using clone_range. This is essentially +a place-holder implementation; in real life, bluestore will have an +efficient primitive for this. + +The rollforward part can be delayed since we report the operation as +committed once all replicas have committed. Currently, whenever we +send a write, we also indicate that all previously committed +operations should be rolled forward (see +ECBackend::try_reads_to_commit). If there aren't any in the pipeline +when we arrive at the waiting_rollforward queue, we start a dummy +write to move things along (see the Pipeline section later on and +ECBackend::try_finish_rmw). + +ExtentCache +----------- + +It's pretty important to be able to pipeline writes on the same +object. For this reason, there is a cache of extents written by +cacheable operations. Each extent remains pinned until the operations +referring to it are committed. The pipeline prevents rmw operations +from running until uncacheable transactions (clones, etc) are flushed +from the pipeline. + +See ExtentCache.h for a detailed explanation of how the cache +states correspond to the higher level invariants about the conditions +under which concurrent operations can refer to the same object. + +Pipeline +-------- + +Reading src/osd/ExtentCache.h should have given a good idea of how +operations might overlap. There are several states involved in +processing a write operation and an important invariant which +isn't enforced by ReplicatedPG at a higher level which need to be +managed by ECBackend. The important invariant is that we can't +have uncacheable and rmw operations running at the same time +on the same object. For simplicity, we simply enforce that any +operation which contains an rmw operation must wait until +all in-progress uncacheable operations complete.
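The exclusion invariant above can be modeled as a small admission queue. This is an illustrative sketch only; the real state machine lives in ECBackend and is considerably richer:

```python
class Pipeline:
    """Toy model of the rmw/uncacheable exclusion invariant:
    an rmw op may not start while uncacheable ops are in flight."""

    def __init__(self):
        self.in_flight_uncacheable = 0
        self.waiting_rmw = []   # rmw ops held back until a flush

    def start(self, op):
        """Try to admit op (a dict with an 'rmw' flag).
        Returns True if it may run now, False if it must wait."""
        if op['rmw'] and self.in_flight_uncacheable > 0:
            self.waiting_rmw.append(op)
            return False
        if not op['rmw']:
            self.in_flight_uncacheable += 1
        return True

    def finish_uncacheable(self):
        """Complete one uncacheable op; when the last one drains,
        release all queued rmw ops."""
        self.in_flight_uncacheable -= 1
        if self.in_flight_uncacheable == 0:
            ready, self.waiting_rmw = self.waiting_rmw, []
            return ready
        return []
```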
+ +There are improvements to be made here in the future. + +For more details, see ECBackend::waiting_* and +ECBackend::try_<from>_to_<to>. + diff --git a/doc/dev/osd_internals/erasure_coding/pgbackend.rst b/doc/dev/osd_internals/erasure_coding/pgbackend.rst deleted file mode 100644 index db602991af4ed..0000000000000 --- a/doc/dev/osd_internals/erasure_coding/pgbackend.rst +++ /dev/null @@ -1,320 +0,0 @@ -=================== -PG Backend Proposal -=================== - -NOTE: the last update of this page is dated 2013, before the Firefly -release. The details of the implementation may be different. - -Motivation ---------- - -The purpose of the `PG Backend interface -`_ -is to abstract over the differences between replication and erasure -coding as failure recovery mechanisms. - -Much of the existing PG logic, particularly that for dealing with -peering, will be common to each. With both schemes, a log of recent -operations will be used to direct recovery in the event that an OSD is -down or disconnected for a brief period of time. Similarly, in both -cases it will be necessary to scan a recovered copy of the PG in order -to recover an empty OSD. The PGBackend abstraction must be -sufficiently expressive for Replicated and ErasureCoded backends to be -treated uniformly in these areas. - -However, there are also crucial differences between using replication -and erasure coding which PGBackend must abstract over: - -1. The current write strategy would not ensure that a particular - object could be reconstructed after a failure. -2. Reads on an erasure coded PG require chunks to be read from the - replicas as well. -3. Object recovery probably involves recovering the primary and - replica missing copies at the same time to avoid performing extra - reads of replica shards. -4. Erasure coded PG chunks created for different acting set - positions are not interchangeable.
In particular, it might make - sense for a single OSD to hold more than 1 PG copy for different - acting set positions. -5. Selection of a pgtemp for backfill may differ between replicated - and erasure coded backends. -6. The set of necessary OSDs from a particular interval required to - continue peering may differ between replicated and erasure coded - backends. -7. The selection of the authoritative log may differ between replicated - and erasure coded backends. - -Client Writes -------------- - -The current PG implementation performs a write by performing the write -locally while concurrently directing replicas to perform the same -operation. Once all operations are durable, the operation is -considered durable. Because these writes may be destructive -overwrites, during peering, a log entry on a replica (or the primary) -may be found to be divergent if that replica remembers a log event -which the authoritative log does not contain. This can happen if only -1 out of 3 replicas persisted an operation, but was not available in -the next interval to provide an authoritative log. With replication, -we can repair the divergent object as long as at least 1 replica has a -current copy of the divergent object. With erasure coding, however, -it might be the case that neither the new version of the object nor -the old version of the object has enough available chunks to be -reconstructed. This problem is much simpler if we arrange for all -supported operations to be locally roll-back-able. - -- CEPH_OSD_OP_APPEND: We can roll back an append locally by - including the previous object size as part of the PG log event. -- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete - requires that we retain the deleted object until all replicas have - persisted the deletion event. ErasureCoded backend will therefore - need to store objects with the version at which they were created - included in the key provided to the filestore. 
Old versions of an - object can be pruned when all replicas have committed up to the log - event deleting the object. -- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr - to be set or removed, we can roll back these operations locally. - -Core Changes: - -- Current code should be adapted to use and rollback as appropriate - APPEND, DELETE, (SET|RM)ATTR log entries. -- The filestore needs to be able to deal with multiply versioned - hobjects. This means adapting the filestore internally to - use a `ghobject `_ - which is basically a tuple. The gen_t + shard_t need to be included in the on-disk - filename. gen_t is a unique object identifier to make sure there - are no name collisions when object N is created + - deleted + created again. An interface needs to be added to get all - versions of a particular hobject_t or the most recently versioned - instance of a particular hobject_t. - -PGBackend Interfaces: - -- PGBackend::perform_write() : It seems simplest to pass the actual - ops vector. The reason for providing an async, callback based - interface rather than having the PGBackend respond directly is that - we might want to use this interface for internal operations like - watch/notify expiration or snap trimming which might not necessarily - have an external client. -- PGBackend::try_rollback() : Some log entries (all of the ones valid - for the Erasure coded backend) will support local rollback. In - those cases, PGLog can avoid adding objects to the missing set when - identifying divergent objects. - -Peering and PG Logs -------------------- - -Currently, we select the log with the newest last_update and the -longest tail to be the authoritative log. This is fine because we -aren't generally able to roll operations on the other replicas forward -or backwards, instead relying on our ability to re-replicate divergent -objects. 
With the write approach discussed in the previous section, -however, the erasure coded backend will rely on being able to roll -back divergent operations since we may not be able to re-replicate -divergent objects. Thus, we must choose the *oldest* last_update from -the last interval which went active in order to minimize the number of -divergent objects. - -The difficulty is that the current code assumes that as long as it has -an info from at least 1 OSD from the prior interval, it can complete -peering. In order to ensure that we do not end up with an -unrecoverably divergent object, a K+M erasure coded PG must hear from at -least K of the replicas of the last interval to serve writes. This ensures -that we will select a last_update old enough to roll back at least K -replicas. If a replica with an older last_update comes along later, -we will be able to provide at least K chunks of any divergent object. - -Core Changes: - -- PG::choose_acting(), etc. need to be generalized to use PGBackend to - determine the authoritative log. -- PG::RecoveryState::GetInfo needs to use PGBackend to determine - whether it has enough infos to continue with authoritative log - selection. - -PGBackend interfaces: - -- have_enough_infos() -- choose_acting() - -PGTemp ------- - -Currently, an OSD is able to request a temp acting set mapping in -order to allow an up-to-date OSD to serve requests while a new primary -is backfilled (and for other reasons). An erasure coded pg needs to -be able to designate a primary for these reasons without putting it -in the first position of the acting set. It also needs to be able -to leave holes in the requested acting set. - -Core Changes: - -- OSDMap::pg_to_*_osds needs to separately return a primary. For most - cases, this can continue to be acting[0]. -- MOSDPGTemp (and related OSD structures) needs to be able to specify - a primary as well as an acting set. 
-- Much of the existing code base assumes that acting[0] is the primary - and that all elements of acting are valid. This needs to be cleaned - up since the acting set may contain holes. - -Client Reads ------------- - -Reads with the replicated strategy can always be satisfied -synchronously out of the primary OSD. With an erasure coded strategy, -the primary will need to request data from some number of replicas in -order to satisfy a read. The perform_read() interface for PGBackend -therefore will be async. - -PGBackend interfaces: - -- perform_read(): as with perform_write() it seems simplest to pass - the ops vector. The call to oncomplete will occur once the out_bls - have been appropriately filled in. - -Distinguished acting set positions ----------------------------------- - -With the replicated strategy, all replicas of a PG are -interchangeable. With erasure coding, different positions in the -acting set have different pieces of the erasure coding scheme and are -not interchangeable. Worse, crush might cause chunk 2 to be written -to an OSD which happens already to contain an (old) copy of chunk 4. -This means that the OSD and PG messages need to work in terms of a -type like pair in order to distinguish different pg -chunks on a single OSD. - -Because the mapping of object name to object in the filestore must -be 1-to-1, we must ensure that the objects in chunk 2 and the objects -in chunk 4 have different names. To that end, the filestore must -include the chunk id in the object key. - -Core changes: - -- The filestore `ghobject_t needs to also include a chunk id - `_ making it more like - tuple. -- coll_t needs to include a shard_t. -- The OSD pg_map and similar pg mappings need to work in terms of a - spg_t (essentially - pair). 
Similarly, pg->pg messages need to include - a shard_t -- For client->PG messages, the OSD will need a way to know which PG - chunk should get the message since the OSD may contain both a - primary and non-primary chunk for the same pg - -Object Classes --------------- - -We probably won't support object classes at first on Erasure coded -backends. - -Scrub ------ - -We currently have two scrub modes with different default frequencies: - -1. [shallow] scrub: compares the set of objects and metadata, but not - the contents -2. deep scrub: compares the set of objects, metadata, and a crc32 of - the object contents (including omap) - -The primary requests a scrubmap from each replica for a particular -range of objects. The replica fills out this scrubmap for the range -of objects including, if the scrub is deep, a crc32 of the contents of -each object. The primary gathers these scrubmaps from each replica -and performs a comparison identifying inconsistent objects. - -Most of this can work essentially unchanged with erasure coded PG with -the caveat that the PGBackend implementation must be in charge of -actually doing the scan, and that the PGBackend implementation should -be able to attach arbitrary information to allow PGBackend on the -primary to scrub PGBackend specific metadata. - -The main catch, however, for erasure coded PG is that sending a crc32 -of the stored chunk on a replica isn't particularly helpful since the -chunks on different replicas presumably store different data. Because -we don't support overwrites except via DELETE, however, we have the -option of maintaining a crc32 on each chunk through each append. -Thus, each replica instead simply computes a crc32 of its own stored -chunk and compares it with the locally stored checksum. The replica -then reports to the primary whether the checksums match. 
- -PGBackend interfaces: - -- scan() -- scrub() -- compare_scrub_maps() - -Crush ------ - -If crush is unable to generate a replacement for a down member of an -acting set, the acting set should have a hole at that position rather -than shifting the other elements of the acting set out of position. - -Core changes: - -- Ensure that crush behaves as above for INDEP. - -Recovery --------- - -The logic for recovering an object depends on the backend. With -the current replicated strategy, we first pull the object replica -to the primary and then concurrently push it out to the replicas. -With the erasure coded strategy, we probably want to read the -minimum number of replica chunks required to reconstruct the object -and push out the replacement chunks concurrently. - -Another difference is that objects in erasure coded pg may be -unrecoverable without being unfound. The "unfound" concept -should probably then be renamed to unrecoverable. Also, the -PGBackend implementation will have to be able to direct the search -for pg replicas with unrecoverable object chunks and to be able -to determine whether a particular object is recoverable. - - -Core changes: - -- s/unfound/unrecoverable - -PGBackend interfaces: - -- `on_local_recover_start `_ -- `on_local_recover `_ -- `on_global_recover `_ -- `on_peer_recover `_ -- `begin_peer_recover `_ - -Backfill --------- - -For the most part, backfill itself should behave similarly between -replicated and erasure coded pools with a few exceptions: - -1. We probably want to be able to backfill multiple OSDs concurrently - with an erasure coded pool in order to cut down on the read - overhead. -2. We probably want to avoid having to place the backfill peers in the - acting set for an erasure coded pg because we might have a good - temporary pg chunk for that acting set slot. - -For 2, we don't really need to place the backfill peer in the acting -set for replicated PGs anyway. 
-For 1, PGBackend::choose_backfill() should determine which OSDs are -backfilled in a particular interval. - -Core changes: - -- Backfill should be capable of handling multiple backfill peers - concurrently even for - replicated pgs (easier to test for now) -- Backfill peers should not be placed in the acting set. - -PGBackend interfaces: - -- choose_backfill(): allows the implementation to determine which OSDs - should be backfilled in a particular interval. diff --git a/doc/dev/osd_internals/erasure_coding/proposals.rst b/doc/dev/osd_internals/erasure_coding/proposals.rst new file mode 100644 index 0000000000000..d557cc8a4da7c --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/proposals.rst @@ -0,0 +1,384 @@ +================================= +Proposed Next Steps for ECBackend +================================= + +PARITY-DELTA-WRITE +------------------ + +RMW operations currently require 4 network hops (2 round trips). In +principle, for some codes, we can reduce this to 3 by sending the +update to the replicas holding the data blocks and having them +compute a delta to forward onto the parity blocks. + +The primary reads the current values of the "W" blocks and then uses +the new values of the "W" blocks to compute parity-deltas for each of +the parity blocks. The W blocks and the parity delta-blocks are sent +to their respective shards. + +The choice of whether to use a read-modify-write or a +parity-delta-write is a complex policy issue that is TBD in the details +and is likely to be heavily dependent on the computational costs +associated with a parity-delta vs. a regular parity-generation +operation. However, it is believed that the parity-delta scheme is +likely to be the preferred choice, when available.
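For an XOR-based parity, the delta scheme collapses to something very simple: the data shard computes delta = old XOR new, and the parity shard applies the delta with a single XOR, never seeing the other K-1 data blocks. A toy sketch under that assumption (real codes need library support, as discussed for the plugin interface):

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# The shard holding the data block computes the delta from old and new:
def parity_delta(old_block, new_block):
    return xor(old_block, new_block)

# The parity shard applies the delta without any other data block:
def apply_delta(old_parity, delta):
    return xor(old_parity, delta)

# Equivalent to recomputing parity over the full stripe:
d = [b'\x01', b'\x02', b'\x04']          # K = 3 data blocks
p = xor(xor(d[0], d[1]), d[2])           # parity = 0x07
new1 = b'\x08'                           # overwrite block 1
p2 = apply_delta(p, parity_delta(d[1], new1))
assert p2 == xor(xor(d[0], new1), d[2])  # same as full recompute
```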
+ +The internal interface to the erasure coding library plug-ins needs to +be extended to support the ability to query if parity-delta +computation is possible for a selected algorithm as well as an +interface to the actual parity-delta computation algorithm when +available. + +Stripe Cache +------------ + +It may be a good idea to extend the current ExtentCache usage to +cache some data past when the pinning operation releases it. +One application pattern that is important to optimize is the small +block sequential write operation (think of the journal of a journaling +file system or a database transaction log). Regardless of the chosen +redundancy algorithm, it is advantageous for the primary to +retain/buffer recently read/written portions of a stripe in order to +reduce network traffic. The dynamic contents of this cache may be used +in the determination of whether a read-modify-write or a +parity-delta-write is performed. The sizing of this cache is TBD, but +we should plan on allowing at least a few full stripes per active +client. Limiting the cache occupancy on a per-client basis will reduce +the noisy neighbor problem. + +Recovery and Rollback Details +============================= + +Implementing a Rollback-able Prepare Operation +---------------------------------------------- + +The prepare operation is implemented at each OSD through a simulation +of a versioning or copy-on-write capability for modifying a portion of +an object. + +When a prepare operation is performed, the new data is written into a +temporary object. The PG log for the +operation will contain a reference to the temporary object so that it +can be located for recovery purposes as well as a record of all of the +shards which are involved in the operation. + +In order to avoid fragmentation (and hence, future read performance), +creation of the temporary object needs special attention. The name of +the temporary object affects its location within the KV store. 
Right +now it's unclear whether it's desirable for the name to locate near the +base object or whether a separate subset of keyspace should be used +for temporary objects. Sam believes that colocation with the base +object is preferred (he suggests using the generation counter of the +ghobject for temporaries), whereas Allen believes that using a +separate subset of keyspace is desirable since these keys are +ephemeral and we don't want to actually colocate them with the base +object keys. Perhaps some modeling here can help resolve this +issue. The data of the temporary object wants to be located as close +to the data of the base object as possible. This may be best performed +by adding a new ObjectStore creation primitive that takes the base +object as an additional parameter that is a hint to the allocator. + +Sam: I think that the short lived thing may be a red herring. We'll +be updating the donor and primary objects atomically, so it seems like +we'd want them adjacent in the key space, regardless of the donor's +lifecycle. + +The apply operation moves the data from the temporary object into the +correct position within the base object and deletes the associated +temporary object. This operation is done using a specialized +ObjectStore primitive. In the current ObjectStore interface, this can +be done using the clonerange function followed by a delete, but can be +done more efficiently with a specialized move primitive. +Implementation of the specialized primitive on FileStore can be done +by copying the data. Some file systems have extensions that might also +be able to implement this operation (like a defrag API that swaps +chunks between files).
It is expected that NewStore will be able to +support this efficiently and natively (It has been noted that this +sequence requires that temporary object allocations, which tend to be +small, be efficiently converted into blocks for main objects and that +blocks that were formerly inside of main objects must be reusable with +minimal overhead). + +The prepare and apply operations can be separated arbitrarily in +time. If a read operation accesses an object that has been altered by +a prepare operation (but without a corresponding apply operation) it +must return the data after the prepare operation. This is done by +creating an in-memory database of objects which have had a prepare +operation without a corresponding apply operation. All read operations +must consult this in-memory data structure in order to get the correct +data. It should be explicitly recognized that it is likely that there +will be multiple prepare operations against a single base object and +the code must handle this case correctly. This code is implemented as +a layer between ObjectStore and all existing readers. Annoyingly, +we'll want to trash this state when the interval changes, so the first +thing that needs to happen after activation is that the primary and +replicas apply up to last_update so that the empty cache will be +correct. + +During peering, it is now obvious that an unapplied prepare operation +can easily be rolled back simply by deleting the associated temporary +object and removing that entry from the in-memory data structure. + +Partial Application Peering/Recovery modifications +-------------------------------------------------- + +Some writes will be small enough to not require updating all of the +shards holding data blocks. For write amplification minimization +reasons, it would be best to avoid writing to those shards at all, +and delay even sending the log entries until the next write which +actually hits that shard.
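The prepare/apply/rollback lifecycle described above can be sketched as an overlay on top of the base store. This is a toy model with hypothetical names; the real implementation works in terms of ObjectStore transactions and temporary objects:

```python
class PreparedOverlay:
    """Toy model: reads must see prepared-but-unapplied extents;
    rollback simply drops the temporary data."""

    def __init__(self, base):
        self.base = dict(base)      # object name -> bytes
        self.prepared = {}          # (object, txid) -> (offset, data)

    def prepare(self, obj, txid, offset, data):
        # New data lands in a "temporary object", keyed by transaction.
        self.prepared[(obj, txid)] = (offset, data)

    def read(self, obj):
        # Readers consult the in-memory set of unapplied prepares,
        # overlaying them on the base object in txid order.
        buf = bytearray(self.base.get(obj, b''))
        for (o, txid), (off, data) in sorted(self.prepared.items(),
                                             key=lambda kv: kv[0][1]):
            if o == obj:
                buf[off:off + len(data)] = data
        return bytes(buf)

    def apply(self, obj, txid):
        # Move the temp data into the base object, delete the temp.
        off, data = self.prepared.pop((obj, txid))
        buf = bytearray(self.base.get(obj, b''))
        buf[off:off + len(data)] = data
        self.base[obj] = bytes(buf)

    def rollback(self, obj, txid):
        # An unapplied prepare rolls back by deleting the temp object.
        self.prepared.pop((obj, txid), None)
```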
+ +The delaying (buffering) of the transmission of the prepare and apply +operations for witnessing OSDs creates new situations that peering +must handle. In particular, the logic for determining the authoritative +last_update value (and hence the selection of the OSD which has the +authoritative log) must be modified to account for the valid but +missing (i.e., delayed/buffered) pglog entries to which the +authoritative OSD was only a witness. + +Because a partial write might complete without persisting a log entry +on every replica, we have to do a bit more work to determine an +authoritative last_update. The constraint (as with a replicated PG) +is that last_update >= the most recent log entry for which a commit +was sent to the client (call this actual_last_update). Secondarily, +we want last_update to be as small as possible since any log entry +past actual_last_update (we do not apply a log entry until we have +sent the commit to the client) must be able to be rolled back. Thus, +the smaller a last_update we choose, the less recovery will need to +happen (we can always roll back, but rolling a replica forward may +require an object rebuild). Thus, we will set last_update to 1 before +the oldest log entry we can prove cannot have been committed. In +current master, this is simply the last_update of the shortest log +from that interval (because that log did not persist any entry past +that point -- a precondition for sending a commit to the client). For +this design, we must consider the possibility that any log is missing +at its head log entries in which it did not participate. Thus, we +must determine the most recent interval in which we went active +(essentially, this is what find_best_info currently does). We then +pull the log from each live osd from that interval back to the minimum +last_update among them. Then, we extend all logs from the +authoritative interval until each hits an entry in which it should +have participated, but did not record.
The shortest of these extended
logs must therefore contain any log entry for which we sent a commit
to the client -- and the last entry gives us our last_update.

Deep scrub support
------------------

The simple answer here is probably our best bet. EC pools can't use
the omap namespace at all right now. The simplest solution would be
to take a prefix of the omap space and pack N checksums, each L bits
long and covering M bytes, into each key/value. The prefixing seems
like a sensible precaution against eventually wanting to store
something else in the omap space. It seems like any write will need
to read at least the blocks containing the modified range. However,
with a code able to compute parity deltas, we may not need to read a
whole stripe. Even without that, we don't want to have to write to
blocks not participating in the write. Thus, each shard should store
checksums only for itself. It seems like you'd be able to store
checksums for all shards on the parity blocks, but there may not be
distinguished parity blocks which are modified on all writes (LRC or
shec provide two examples). L should probably have a fixed number of
options (16, 32, 64?) and be configurable per-pool at pool creation.
N and M should likewise be configurable at pool creation with
sensible defaults.

We need to handle online upgrade. I think the right answer is that
the first overwrite to an object with an append-only checksum
removes the append-only checksum and writes in whatever stripe
checksums actually got written. The next deep scrub then writes
out the full checksum omap entries.

RADOS Client Acknowledgement Generation Optimization
====================================================

Now that the recovery scheme is understood, we can discuss the
generation of the RADOS operation acknowledgement (ACK) by the
primary ("sufficient" from above). It is NOT required that the primary
wait for all shards to complete their respective prepare
operations.
Using our example where the RADOS operation writes only
"W" chunks of the stripe, the primary will generate and send W+M
prepare operations (possibly including a send-to-self). The primary
need only wait for enough shards to be written to ensure recovery of
the data. Thus, after writing W+M chunks, you can afford the loss of M
chunks. Hence the primary can generate the RADOS ACK after W+M-M => W
of those prepare operations are completed.

Inconsistent object_info_t versions
===================================

A natural consequence of only writing the blocks which actually
changed is that we don't want to update the object_info_t of the
objects which didn't. I actually think it would pose a problem to do
so: pg ghobject namespaces are generally large, and unless the osd is
seeing a bunch of overwrites on a small set of objects, I'd expect
each write to be far enough apart in the backing ghobject_t->data
mapping that each constitutes a random metadata update. Thus, we have
to accept that not every shard will have the current version in its
object_info_t. We can't even bound how old the version on a
particular shard will happen to be. In particular, the primary does
not necessarily have the current version. One could argue that the
parity shards would always have the current version, but not every
code necessarily has designated parity shards which see every write
(certainly LRC, iirc shec, and even with a more pedestrian code, it
might be desirable to rotate the shards based on object hash). Even
if you chose to designate a shard as witnessing all writes, the pg
might be degraded with that particular shard missing. This is a bit
tricky: currently, reads and writes implicitly return the most recent
version of the object written. On reads, we'd have to read K shards
to answer that question. We can get around that by adding a "don't
tell me the current version" flag.
Writes are more problematic: we
need an object_info from the most recent write in order to form the
new object_info and log_entry.

A truly terrifying option would be to eliminate version and
prior_version entirely from the object_info_t. There are a few
specific purposes it serves:

(1) On OSD startup, we prime the missing set by scanning backwards
    from last_update to last_complete comparing the stored object's
    object_info_t to the version of the most recent log entry.
(2) During backfill, we compare versions between primary and target
    to avoid some pushes.

We use it elsewhere as well:

(3) While pushing and pulling objects, we verify the version.
(4) We return it on reads and writes and allow the librados user to
    assert it atomically on writes to allow the user to deal with write
    races (used extensively by rbd).

Case (3) isn't actually essential, just convenient. Oh well. (4)
is more annoying. Writes are easy since we know the version. Reads
are tricky because we may not need to read from all of the replicas.
Simplest solution is to add a flag to rados operations to just not
return the user version on read. We can also just not support the
user version assert on ec for now (I think? Only user is rgw bucket
indices iirc, and those will always be on replicated because they use
omap).

We can avoid (1) by maintaining the missing set explicitly. It's
already possible for there to be a missing object without a
corresponding log entry. (Consider the case where the most recent write
is to an object which has not been updated in weeks. If that write
becomes divergent, the written object needs to be marked missing based
on the prior_version which is not in the log.) The PGLog already has
a way of handling those edge cases (see divergent_priors). We'd
simply expand that to contain the entire missing set and maintain it
atomically with the log and the objects.
This isn't really an
unreasonable option: the additional keys would be fewer than the
existing log keys + divergent_priors and aren't updated in the fast
write path anyway.

The second case is a bit trickier. It's really an optimization for
the case where a pg fell out of the acting set long enough for the
logs to no longer overlap but not long enough for the PG to have
healed and removed the old copy. Unfortunately, this describes the
case where a node was taken down for maintenance with noout set. It's
probably not acceptable to re-backfill the whole OSD in such a case,
so we need to be able to quickly determine whether a particular shard
is up to date given a valid acting set of other shards.

Let ordinary writes which do not change the object size not touch the
object_info at all. That means that the object_info version won't
match the pg log entry version. Include in the pg_log_entry_t the
current object_info version as well as which shards participated (as
mentioned above). In addition to the object_info_t attr, record on
each shard s a vector recording for each other shard s' the most
recent write which spanned both s and s'. Operationally, we maintain
an attr on each shard containing that vector. A write touching S
updates the version stamp entry for each shard in S on each shard in
S's attribute (and leaves the rest alone). If we have a valid acting
set during backfill, we must have a witness of every write which
completed -- so taking the max of each entry over all of the acting
set shards must give us the current version for each shard. During
recovery, we set the attribute on the recovery target to that max
vector (Question: with LRC, we may not need to touch much of the
acting set to recover a particular shard -- can we just use the max of
the shards we used to recover, or do we need to grab the version
vector from the rest of the acting set as well? I'm not sure, not a
big deal anyway, I think).
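A minimal sketch of this version-vector scheme (the data layout and names are
assumptions for illustration only, not the actual attr format):

```python
# Illustrative model of per-shard version vectors: each shard s keeps, for
# every other shard s', the version of the most recent write spanning both.

def record_write(attrs, participants, version):
    """A write at `version` touching shard set `participants` updates, on
    each participating shard, the stamp for every other participant (and
    leaves non-participants' attrs alone)."""
    for s in participants:
        for s2 in participants:
            attrs[s][s2] = version

def authoritative_vector(attrs, acting):
    """Element-wise max over a valid acting set: some acting shard witnessed
    every completed write, so the max recovers each shard's current version."""
    shards = set().union(*(attrs[s].keys() for s in acting))
    return {s2: max(attrs[s].get(s2, 0) for s in acting) for s2 in shards}

# Example: a full-stripe write at version 1, then a partial write at
# version 2 touching only shards 0 and 2.
attrs = {s: {} for s in range(4)}
record_write(attrs, {0, 1, 2, 3}, version=1)
record_write(attrs, {0, 2}, version=2)
assert authoritative_vector(attrs, acting={1, 2}) == {0: 2, 1: 1, 2: 2, 3: 1}
```

Note that the max is only authoritative over a *valid* acting set: querying
the invalid set {1, 3} above would miss the partial write entirely, which is
exactly why backfill must consult shards that witnessed every completed write.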
The above lets us perform blind writes without knowing the current
object version (log entry version, that is) while still allowing us to
avoid backfilling up to date objects. The only catch is that our
backfill scans will need to scan all replicas, not just the primary and
the backfill targets.

It would be worth adding into scrub the ability to check the
consistency of the gathered version vectors -- probably by just
taking 3 random valid subsets and verifying that they generate
the same authoritative version vector.

Implementation Strategy
=======================

It goes without saying that it would be unwise to attempt to do all of
this in one massive PR. It's also not a good idea to merge code which
isn't being tested. To that end, it's worth thinking a bit about
which bits can be tested on their own (perhaps with a bit of temporary
scaffolding).

We can implement the overwrite friendly checksumming scheme easily
enough with the current implementation. We'll want to enable it on a
per-pool basis (probably using a flag which we'll later repurpose for
actual overwrite support). We can enable it in some of the ec
thrashing tests in the suite. We can also add a simple test
validating the behavior of turning it on for an existing ec pool
(later, we'll want to be able to convert append-only ec pools to
overwrite ec pools, so that test will simply be expanded as we go).
The flag should be gated by the experimental feature flag since we
won't want to support this as a valid configuration -- testing only.
We need to upgrade append-only checksums in place during deep scrub.

Similarly, we can implement the unstable extent cache with the current
implementation; it even lets us cut out the readable ack the replicas
send to the primary after the commit which lets it release the lock.
Same deal: implement, gate with experimental flag, add to some of the
automated tests. I don't really see a reason not to use the same flag
as above.
We can certainly implement the move-range primitive with unit tests
before there are any users. Adding coverage to the existing
objectstore tests would suffice here.

Explicit missing set can be implemented now, same deal as above --
might as well even use the same feature bit.

The TPC protocol outlined above can actually be implemented on an
append-only EC pool. Same deal as above, can even use the same
feature bit.

The RADOS flag to suppress the read op user version return can be
implemented immediately. Mostly just needs unit tests.

The version vector problem is an interesting one. For append-only EC
pools, it would be pointless since all writes increase the size and
therefore update the object_info. We could do it for replicated pools
though. It's a bit silly since all "shards" see all writes, but it
would still let us implement and partially test the augmented backfill
code as well as the extra pg log entry fields -- this depends on the
explicit pg log entry branch having already merged. It's not entirely
clear to me that this one is worth doing separately. It's enough code
that I'd really prefer to get it done independently, but it's also a
fair amount of scaffolding that will be later discarded.

PGLog entries need to be able to record the participants, and log
comparison needs to be modified to extend logs with entries they
wouldn't have witnessed. This logic should be abstracted behind
PGLog so it can be unittested -- that would let us test it somewhat
before the actual ec overwrites code merges.

Whatever needs to happen to the ec plugin interface can probably be
done independently of the rest of this (pending resolution of
questions below).

The actual nuts and bolts of performing the ec overwrite can't, it
seems to me, be productively tested (and therefore implemented) until
the above are complete, so best to get all of the supporting code in
first.
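The log-extension comparison described earlier (extend each shard's log
through authoritative entries it did not participate in, then take the
minimum extended head as last_update) could be unit tested against a toy
model along these lines (types and names are invented, not PGLog's):

```python
# Toy model of the partial-write last_update selection. Each authoritative
# log entry is (version, participants); versions increase monotonically.

def extend_log(head, authoritative, shard):
    """Advance `head` through authoritative entries until hitting one this
    shard should have recorded (it participated) but did not."""
    for version, participants in authoritative:
        if version <= head:
            continue
        if shard in participants:
            break               # shard should have witnessed this write: stop
        head = version          # shard was only a non-participant: still valid
    return head

def choose_last_update(heads, authoritative):
    """heads: {shard: head_version}. The minimum over the extended heads must
    contain every write for which a commit was sent to the client."""
    return min(extend_log(h, authoritative, s) for s, h in heads.items())
```

For example, with authoritative entries `[(1, {0, 1, 2}), (2, {0, 2}),
(3, {1, 2})]`, shard 1's head of 1 extends to 2 (it merely witnessed entry 2)
but stops at entry 3, in which it participated without recording it -- so
last_update is 2 even though shard 2 has seen entry 3.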
+ +Open Questions +============== + +Is there a code we should be using that would let us compute a parity +delta without rereading and reencoding the full stripe? If so, is it +the kind of thing we need to design for now, or can it be reasonably +put off? + +What needs to happen to the EC plugin interface? diff --git a/doc/dev/osd_internals/log_based_pg.rst b/doc/dev/osd_internals/log_based_pg.rst new file mode 100644 index 0000000000000..4a0210bb842dc --- /dev/null +++ b/doc/dev/osd_internals/log_based_pg.rst @@ -0,0 +1,221 @@ +============ +Log Based PG +============ + +Background +========== + +Why ReplicatedPG? +----------------- + +Currently, consistency for all ceph pool types is ensured by primary +log-based replication. This goes for both erasure-coded and +replicated pools. If you ever find yourself asking "why is it that +both replicated and erasure coded pools are implemented by +ReplicatedPG.h/cc", that's why (though it definitely should be called +LogBasedPG, should include peering, and PG should be an abstract +interface defining only those things the OSD needs to know to route +messages etc -- but we live in an imperfect world where git deals +imperfectly with cherry-picking between branches where the file has +different names). + +Primary log-based replication +----------------------------- + +Reads must return data written by any write which completed (where the +client could possibly have received a commit message). There are lots +of ways to handle this, but ceph's architecture makes it easy for +everyone at any map epoch to know who the primary is. Thus, the easy +answer is to route all writes for a particular pg through a single +ordering primary and then out to the replicas. 
Though we only
actually need to serialize writes on a single object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-pg version number
-- this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill). In practice, the log will extend much further
(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not
clean) because it's handy for quickly performing recovery.

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (any
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and ec pools in
PG/ReplicatedPG: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies.
EC pools instead try to choose
the *oldest* option available to them.

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an ec object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a 4+2 encoding
where 3 of the shards remember a write, but the other 3 do not --
neither version has the 4 chunks needed to reconstruct the object).
For this reason, log entries representing *unstable* writes (writes
not yet committed to the client) must be rollbackable using only local
information on ec pools. Log entries in general may therefore be
rollbackable (and in that case, via a delayed application or via a
set of instructions for rolling back an in-place update) or not.
Replicated pool log entries are never able to be rolled back.

For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

So, the fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of ReplicatedPG, one for each of the two, if there are
really only a few fundamental differences:

 1. How reads work -- async only, requires remote reads for ec
 2. How writes work -- either restricted to append, or must
    write aside and do a tpc
 3. Whether we choose the oldest or newest possible head entry
    during peering
 4. A bit of extra information in the log entry to enable rollback

and so many similarities:

 1. All of the stats and metadata for objects
 2. The high level locking rules for mixing client IO with recovery
    and scrub
 3. The high level locking rules for mixing reads and writes without
    exposing uncommitted state (which might be rolled back or
    forgotten later)
 4. The process, metadata, and protocol needed to determine the set
    of osds which participated in the most recent interval in which we
    accepted writes
 5. etc.

Instead, we choose a few abstractions (and a few kludges) to paper
over the difference:

 1. PGBackend
 2. PGTransaction
 3. PG::choose_acting chooses between calc_replicated_acting and
    calc_ec_acting
 4. Various bits of the write pipeline disallow some operations based
    on pool type -- like omap operations, class operation reads, and
    writes which are not aligned appends (officially, so far)
    for ec
 5. Misc other kludges here and there

PGBackend and PGTransaction enable abstraction of differences 1 and 2,
and the addition of 4 as needed to the log entries.

The replicated implementation is in ReplicatedBackend.h/cc and doesn't
require much explanation, I think. More detail on the ECBackend can be
found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predates the original
firefly release and is probably out of date w.r.t. some of the method
names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable iff it is present on
the primary (at the right version). For an ec pool, we need at least
M shards present to do a read, and we need it on the primary. For
this reason, PGBackend needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that the
PG is recoverable.

Core Changes:

- PGBackend needs to be able to return
  IsPG(Recoverable|Readable)Predicate objects to allow the user
  to make these determinations.
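As a rough sketch of what such predicates might look like for a simple k+m
code (assuming, purely for illustration, that any k distinct shards suffice
to reconstruct -- which does not hold for codes with locality constraints
like LRC or shec):

```python
# Hypothetical predicate objects in the spirit of the interfaces above.
# The names and semantics are assumptions, not the actual Ceph classes.

class ECRecoverablePredicate:
    """An object is recoverable if at least k distinct shards survive."""

    def __init__(self, k):
        self.k = k  # number of data chunks in the k+m encoding

    def __call__(self, have):
        # `have` is the set of shard ids currently available.
        return len(set(have)) >= self.k

class ECReadablePredicate(ECRecoverablePredicate):
    """A read additionally needs the object on the primary (see above)."""

    def __init__(self, k, primary_shard):
        super().__init__(k)
        self.primary = primary_shard

    def __call__(self, have):
        return self.primary in have and super().__call__(have)
```

For a 2+1 code with the primary on shard 0, `{0, 1}` would be both readable
and recoverable, while `{1, 2}` would be recoverable but not readable -- the
read has to wait for recovery of the primary's shard.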
Client Reads
------------

Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD. With an erasure coded strategy,
the primary will need to request data from some number of replicas in
order to satisfy a read. PGBackend will therefore need to provide
separate objects_read_sync and objects_read_async interfaces where
the former won't be implemented by the ECBackend.

PGBackend interfaces:

- objects_read_sync
- objects_read_async

Scrub
-----

We currently have two scrub modes with different default frequencies:

1. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
2. deep scrub: compares the set of objects, metadata, and a crc32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a crc32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.

Most of this can work essentially unchanged with an erasure coded PG
with the caveat that the PGBackend implementation must be in charge of
actually doing the scan.

PGBackend interfaces:

- be_*

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.

Another difference is that objects in an erasure coded pg may be
unrecoverable without being unfound. The "unfound" concept
should probably then be renamed to unrecoverable.
Also, the +PGBackend implementation will have to be able to direct the search +for pg replicas with unrecoverable object chunks and to be able +to determine whether a particular object is recoverable. + + +Core changes: + +- s/unfound/unrecoverable + +PGBackend interfaces: + +- `on_local_recover_start `_ +- `on_local_recover `_ +- `on_global_recover `_ +- `on_peer_recover `_ +- `begin_peer_recover `_ -- 2.39.5