--- /dev/null
+===================
+PG Backend Proposal
+===================
+
+See src/osd/PGBackend.h
+
+Motivation
+----------
+
+The purpose of the PG Backend interface is to abstract over the
+differences between replication and erasure coding as failure recovery
+mechanisms.
+
+Much of the existing PG logic, particularly that for dealing with
+peering, will be common to each. With both schemes, a log of recent
+operations will be used to direct recovery in the event that an osd is
+down or disconnected for a brief period of time. Similarly, in both
+cases it will be necessary to scan a recovered copy of the PG in order
+to recover an empty OSD. The PGBackend abstraction must be
+sufficiently expressive for Replicated and ErasureCoded backends to be
+treated uniformly in these areas.
+
+However, there are also crucial differences between using replication
+and erasure coding which PGBackend must abstract over:
+
+1. The current write strategy would not ensure that a particular
+ object could be reconstructed after a failure.
+2. Reads on an erasure coded PG require chunks to be read from the
+ replicas as well.
+3. Object recovery probably involves recovering the missing copies on
+   the primary and the replicas at the same time to avoid performing
+   extra reads of replica shards.
+4. Erasure coded PG chunks created for different acting set
+ positions are not interchangeable. In particular, it might make
+ sense for a single OSD to hold more than 1 PG copy for different
+ acting set positions.
+5. Selection of a pgtemp for backfill may differ between replicated
+   and erasure coded backends.
+6. The set of necessary osds from a particular interval required
+   to continue peering may differ between replicated and erasure
+   coded backends.
+7. The selection of the authoritative log may differ between replicated
+   and erasure coded backends.
+
+Client Writes
+-------------
+
+The current PG implementation performs a write by applying it
+locally while concurrently directing replicas to apply the same
+operation. Once all copies are durable, the operation is
+considered durable. Because these writes may be destructive
+overwrites, during peering, a log entry on a replica (or the primary)
+may be found to be divergent if that replica remembers a log event
+which the authoritative log does not contain. This can happen if
+only 1 out of 3 replicas persisted an operation, and that replica
+was not available in the next interval to provide an authoritative
+log. With replication,
+we can repair the divergent object as long as at least 1 replica has a
+current copy of the divergent object. With erasure coding, however,
+it might be the case that neither the new version of the object nor
+the old version of the object has enough available chunks to be
+reconstructed. This problem is much simpler if we arrange for all
+supported operations to be locally rollbackable.
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+ including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+ requires that we retain the deleted object until all replicas have
+  persisted the deletion event. The ErasureCoded backend will therefore
+ need to store objects with the version at which they were created
+ included in the key provided to the filestore. Old versions of an
+ object can be pruned when all replicas have committed up to the log
+ event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+ to be set or removed, we can roll back these operations locally.
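+
+As a rough sketch (the struct and field names here are hypothetical,
+not existing Ceph types), the rollback metadata such log entries
+would need to carry might look like::
+
+  #include <cstdint>
+  #include <map>
+  #include <optional>
+  #include <string>
+
+  struct rollback_info_t {
+    // CEPH_OSD_OP_APPEND: object size before the append; rollback
+    // truncates back to this size.
+    std::optional<uint64_t> prior_size;
+    // CEPH_OSD_OP_DELETE: version under which the deleted object is
+    // retained until all replicas have committed the deletion event.
+    std::optional<uint64_t> retained_version;
+    // CEPH_OSD_OP_(SET|RM)ATTR: prior attr values; an empty optional
+    // means the attr did not exist before the operation.
+    std::map<std::string, std::optional<std::string>> prior_attrs;
+  };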
+
+Core Changes:
+
+- Current code should be adapted to use, and roll back as appropriate,
+  the APPEND, DELETE, and (SET|RM)ATTR log entries.
+- The filestore needs to be able to deal with multiply versioned
+ hobjects. This probably means adapting the filestore internally to
+ use a vhobject which is basically a pair<version_t, hobject_t>. The
+ version needs to be included in the on-disk filename. An interface
+ needs to be added to get all versions of a particular hobject_t or
+ the most recently versioned instance of a particular hobject_t.
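+
+A minimal sketch of such a vhobject (the stub types stand in for the
+real hobject_t and version_t from osd_types.h; the filename encoding
+is purely illustrative)::
+
+  #include <cstdint>
+  #include <string>
+  #include <tuple>
+
+  using version_stub_t = uint64_t;
+  struct hobject_stub_t {
+    std::string name;
+    bool operator<(const hobject_stub_t &r) const { return name < r.name; }
+  };
+
+  // An hobject plus the version at which it was created; the version
+  // is part of the on-disk filename so multiple versions can coexist.
+  struct vhobject_t {
+    hobject_stub_t hoid;
+    version_stub_t version;
+    bool operator<(const vhobject_t &r) const {
+      return std::tie(hoid, version) < std::tie(r.hoid, r.version);
+    }
+    std::string to_filename() const {
+      return hoid.name + "_" + std::to_string(version);
+    }
+  };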
+
+PGBackend Interfaces:
+
+- PGBackend::perform_write() : It seems simplest to pass the actual
+ ops vector. The reason for providing an async, callback based
+ interface rather than having the PGBackend respond directly is that
+ we might want to use this interface for internal operations like
+ watch/notify expiration or snap trimming which might not necessarily
+ have an external client.
+- PGBackend::try_rollback() : Some log entries (all of the ones valid
+  for the erasure coded backend) will support local rollback. In
+ those cases, PGLog can avoid adding objects to the missing set when
+ identifying divergent objects.
+
+Peering and PG Logs
+-------------------
+
+Currently, we select the log with the newest last_update and the
+longest tail to be the authoritative log. This is fine because we
+aren't generally able to roll operations on the other replicas forward
+or backwards, instead relying on our ability to re-replicate divergent
+objects. With the write approach discussed in the previous section,
+however, the erasure coded backend will rely on being able to roll
+back divergent operations since we may not be able to re-replicate
+divergent objects. Thus, we must choose the *oldest* last_update from
+the last interval which went active in order to minimize the number of
+divergent objects.
+
+The difficulty is that the current code assumes that as long as it has
+an info from at least 1 osd from the prior interval, it can complete
+peering. In order to ensure that we do not end up with an
+unrecoverably divergent object, an erasure coded PG must hear from at
+least N of the M replicas of the last interval to serve writes, where
+N is the minimum number of chunks required to reconstruct an object
+and M is the total number of chunks. This ensures that we will select
+a last_update old enough to roll back at least N replicas. If a
+replica with an older last_update comes along later, we will be able
+to provide at least N chunks of any divergent object.
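+
+The selection rule can be sketched as follows (a hypothetical helper,
+not actual Ceph code; eversion_stub_t stands in for eversion_t)::
+
+  #include <algorithm>
+  #include <cstdint>
+  #include <map>
+  #include <utility>
+
+  // (epoch, version), ordered lexicographically like eversion_t.
+  using eversion_stub_t = std::pair<uint32_t, uint64_t>;
+
+  // Proceed only once infos from at least N osds of the last active
+  // interval are in hand; the authoritative log is then the *oldest*
+  // last_update among them.
+  bool choose_auth_log_ec(
+    const std::map<int, eversion_stub_t> &last_updates, // osd -> last_update
+    unsigned n_min_chunks,                              // N
+    int *auth_osd) {
+    if (last_updates.empty() || last_updates.size() < n_min_chunks)
+      return false; // not enough infos to peer safely
+    auto oldest = std::min_element(
+      last_updates.begin(), last_updates.end(),
+      [](const std::pair<const int, eversion_stub_t> &a,
+         const std::pair<const int, eversion_stub_t> &b) {
+        return a.second < b.second;
+      });
+    *auth_osd = oldest->first;
+    return true;
+  }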
+
+Core Changes:
+
+- PG::choose_acting(), etc. need to be generalized to use PGBackend to
+ determine the authoritative log.
+- PG::RecoveryState::GetInfo needs to use PGBackend to determine
+ whether it has enough infos to continue with authoritative log
+ selection.
+
+PGBackend interfaces:
+
+- have_enough_infos()
+- choose_acting()
+
+PGTemp
+------
+
+Currently, an osd is able to request a temp acting set mapping in
+order to allow an up-to-date osd to serve requests while a new primary
+is backfilled (and for other reasons). An erasure coded pg needs to
+be able to designate a primary for these reasons without putting it
+in the first position of the acting set. It also needs to be able
+to leave holes in the requested acting set.
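+
+A sketch of what such a request might carry (the types and the -1
+hole sentinel are illustrative only)::
+
+  #include <vector>
+
+  using pg_stub_t = unsigned;
+  const int ACTING_HOLE = -1; // placeholder for an unfilled slot
+
+  struct pg_temp_request_t {
+    pg_stub_t pgid;
+    std::vector<int> acting; // may contain ACTING_HOLE entries
+    int primary;             // explicit primary; need not be acting[0]
+  };
+
+  // osd.7 serves as primary from position 2 while position 1 is empty.
+  pg_temp_request_t example{1u, {3, ACTING_HOLE, 7, 2}, 7};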
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+ cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+ a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+ and that all elements of acting are valid. This needs to be cleaned
+ up since the acting set may contain holes.
+
+Client Reads
+------------
+
+Reads with the replicated strategy can always be satisfied
+synchronously out of the primary osd. With an erasure coded strategy,
+the primary will need to request data from some number of replicas in
+order to satisfy a read. The perform_read() interface for PGBackend
+therefore will be async.
+
+PGBackend interfaces:
+
+- perform_read(): as with perform_write() it seems simplest to pass
+ the ops vector. The call to oncomplete will occur once the out_bls
+ have been appropriately filled in.
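+
+A caller might drive the async read roughly as follows (Context is
+Ceph's usual completion callback with a finish(int) hook; the wiring
+here is illustrative, not an actual interface)::
+
+  struct C_ClientReadDone : public Context {
+    vector<OSDOp> *ops; // out_bls are filled in by the backend
+    explicit C_ClientReadDone(vector<OSDOp> *o) : ops(o) {}
+    void finish(int r) override {
+      if (r < 0)
+        return; // reply to the client with the error
+      // out_bls are now valid; assemble and send the client reply.
+    }
+  };
+
+  void start_read(PGBackend *backend, vector<OSDOp> &ops) {
+    backend->perform_read(ops, new C_ClientReadDone(&ops));
+  }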
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an osd which already contains an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<chunk_id_t, pg_t> in order to distinguish different pg
+chunks on a single OSD.
+
+Because the mapping of object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the filestore must
+include the chunk id in the object key.
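+
+A minimal sketch of such a chunk-qualified pg type (named cpg_t to
+match the core changes below; the stubs stand in for pg_t and
+chunk_id_t)::
+
+  #include <cstdint>
+  #include <tuple>
+
+  using pg_stub_t = uint64_t;
+  using chunk_id_stub_t = uint8_t;
+
+  // Distinguishes the chunk-2 copy of a pg from the chunk-4 copy
+  // colocated on the same OSD.
+  struct cpg_t {
+    pg_stub_t pgid;
+    chunk_id_stub_t chunk;
+    bool operator<(const cpg_t &r) const {
+      return std::tie(pgid, chunk) < std::tie(r.pgid, r.chunk);
+    }
+  };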
+
+Core changes:
+
+- The filestore vhobject_t needs to also include a chunk id making it
+ more like tuple<hobject_t, version_t, chunk_id_t>.
+- coll_t needs to include a chunk_id_t.
+- The OSD pg_map and similar pg mappings need to work in terms of a
+ cpg_t (essentially pair<pg_t, chunk_id_t>). Similarly, pg->pg
+ messages need to include a chunk_id_t
+- For client->PG messages, the OSD will need a way to know which PG
+ chunk should get the message since the OSD may contain both a
+ primary and non-primary chunk for the same pg
+
+Object Classes
+--------------
+
+We probably won't support object classes at first on erasure coded
+backends.
+
+Scrub
+-----
+
+We currently have two scrub modes with different default frequencies:
+
+1. [shallow] scrub: compares the set of objects and metadata, but not
+ the contents
+2. deep scrub: compares the set of objects, metadata, and a crc32 of
+ the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects. The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a crc32 of the contents of
+each object. The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
+
+Most of this can work essentially unchanged with an erasure coded PG
+the caveat that the PGBackend implementation must be in charge of
+actually doing the scan, and that the PGBackend implementation should
+be able to attach arbitrary information to allow PGBackend on the
+primary to scrub PGBackend specific metadata.
+
+The main catch, however, for an erasure coded PG is that sending a crc32
+of the stored chunk on a replica isn't particularly helpful since the
+chunks on different replicas presumably store different data. Because
+we don't support overwrites except via DELETE, however, we have the
+option of maintaining a crc32 on each chunk through each append.
+Thus, each replica instead simply computes a crc32 of its own stored
+chunk and compares it with the locally stored checksum. The replica
+then reports to the primary whether the checksums match.
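+
+A sketch of the running checksum (zlib's crc32() is used here purely
+for illustration; any incremental checksum would do)::
+
+  #include <zlib.h>
+  #include <cstddef>
+
+  // Each append folds the new bytes into a stored checksum, so a
+  // deep scrub can recompute the chunk's crc locally and compare,
+  // without shipping chunk data to the primary.
+  struct chunk_checksum_t {
+    uLong crc = crc32(0L, Z_NULL, 0); // seed for an empty chunk
+    void on_append(const unsigned char *data, size_t len) {
+      crc = crc32(crc, data, static_cast<uInt>(len));
+    }
+  };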
+
+PGBackend interfaces:
+
+- scan()
+- scrub()
+- compare_scrub_maps()
+
+Crush
+-----
+
+If crush is unable to generate a replacement for a down member of an
+acting set, the acting set should have a hole at that position rather
+than shifting the other elements of the acting set out of position.
+
+Core changes:
+
+- Ensure that crush behaves as above for INDEP.
+
+Recovery
+--------
+
+The logic for recovering an object depends on the backend. With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in an erasure coded pg may be
+unrecoverable without being unfound. The "unfound" concept
+should probably then be renamed to unrecoverable. Also, the
+PGBackend implementation will have to be able to direct the search
+for pg replicas with unrecoverable object chunks and to determine
+whether a particular object is recoverable.
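+
+The recoverability check itself reduces to counting available chunks;
+a sketch (hypothetical helper, with chunk ids and osd ids as plain
+integers)::
+
+  #include <map>
+  #include <set>
+
+  // An object is recoverable iff at least N distinct chunks are held
+  // by some queried osd (chunk id -> osds holding a current copy).
+  bool ec_object_recoverable(
+    const std::map<unsigned, std::set<int>> &chunk_holders,
+    unsigned n_min_chunks) {
+    unsigned available = 0;
+    for (const auto &p : chunk_holders)
+      if (!p.second.empty())
+        ++available;
+    return available >= n_min_chunks;
+  }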
+
+Core changes:
+
+- s/unfound/unrecoverable
+
+PGBackend interfaces:
+
+- might_have_unrecoverable()
+- recoverable()
+- recover_object()
+
+Backfill
+--------
+
+For the most part, backfill itself should behave similarly between
+replicated and erasure coded pools with a few exceptions:
+
+1. We probably want to be able to backfill multiple osds concurrently
+ with an erasure coded pool in order to cut down on the read
+ overhead.
+2. We probably want to avoid having to place the backfill peers in the
+ acting set for an erasure coded pg because we might have a good
+ temporary pg chunk for that acting set slot.
+
+For 2, we don't really need to place the backfill peer in the acting
+set for replicated PGs anyway. For 1, PGBackend::choose_backfill()
+should determine which osds are backfilled in a particular interval.
+
+Core changes:
+
+- Backfill should be capable of handling multiple backfill peers
+ concurrently even for replicated pgs (easier to test for now)
+- Backfill peers should not be placed in the acting set.
+
+PGBackend interfaces:
+
+- choose_backfill(): allows the implementation to determine which osds
+ should be backfilled in a particular interval.
--- /dev/null
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+/*
+ * Ceph - scalable distributed file system
+ *
+ * Copyright (C) 2013 Inktank Storage, Inc.
+ *
+ * This is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License version 2.1, as published by the Free Software
+ * Foundation. See file COPYING.
+ *
+ */
+
+#ifndef CEPH_PGBACKEND_H
+#define CEPH_PGBACKEND_H
+
+#include "osd_types.h"
+
+/**
+ * PGBackend
+ *
+ * PGBackend defines an interface for logic handling IO and
+ * replication on RADOS objects. The PGBackend implementation
+ * is responsible for:
+ *
+ * 1) Handling client operations
+ * 2) Handling object recovery
+ * 3) Handling object access
+ */
+class PGBackend {
+public:
+  virtual ~PGBackend() {}
+
+  /// IO
+
+ /// Perform write
+  virtual int perform_write(
+    const vector<OSDOp> &ops, ///< [in] ops to perform
+    Context *onreadable, ///< [in] called when readable on all replicas
+    Context *ondurable ///< [in] called when durable on all replicas
+  ) = 0; ///< @return 0 or error
+
+ /// Attempt to roll back a log entry
+  virtual int try_rollback(
+ const pg_log_entry_t &entry, ///< [in] entry to roll back
+ ObjectStore::Transaction *t ///< [out] transaction
+ ) = 0; ///< @return 0 on success, -EINVAL if it can't be rolled back
+
+ /// Perform async read, oncomplete is called when ops out_bls are filled in
+  virtual int perform_read(
+ vector<OSDOp> &ops, ///< [in, out] ops
+ Context *oncomplete ///< [out] called with r code
+ ) = 0; ///< @return 0 or error
+
+ /// Peering
+
+ /**
+ * have_enough_infos
+ *
+ * Allows PGBackend implementation to ensure that enough peers have
+ * been contacted to satisfy its requirements.
+ *
+ * TODO: this interface should yield diagnostic info about which infos
+ * are required
+ */
+  virtual bool have_enough_infos(
+ const map<epoch_t, pg_interval_t> &past_intervals, ///< [in] intervals
+ const map<chunk_id_t, map<int, pg_info_t> > &peer_infos ///< [in] infos
+ ) = 0; ///< @return true if we can continue peering
+
+ /**
+ * choose_acting
+ *
+ * Allows PGBackend implementation to select the acting set based on the
+ * received infos
+ *
+ * @return False if the current acting set is inadequate, *req_acting will
+ * be filled in with the requested new acting set. True if the
+ * current acting set is adequate, *auth_log will be filled in
+ * with the correct location of the authoritative log.
+ */
+  virtual bool choose_acting(
+ const map<int, pg_info_t> &peer_infos, ///< [in] received infos
+ int *auth_log, ///< [out] osd with auth log
+ vector<int> *req_acting ///< [out] requested acting set
+ ) = 0;
+
+ /// Scrub
+
+ /// scan
+  virtual int scan(
+ const hobject_t &start, ///< [in] scan objects >= start
+ const hobject_t &up_to, ///< [in] scan objects < up_to
+ vector<hobject_t> *out ///< [out] objects returned
+ ) = 0; ///< @return 0 or error
+
+  /// scrub (TODO: ScrubMap::object needs to have PGBackend specific metadata)
+  virtual int scrub(
+ const hobject_t &to_stat, ///< [in] object to stat
+ bool deep, ///< [in] true if deep scrub
+ ScrubMap::object *o ///< [out] result
+ ) = 0; ///< @return 0 or error
+
+ /**
+ * compare_scrub_maps
+ *
+   * @param inconsistent [out] map of inconsistent objects to pair<correct, incorrect>
+ * @param errstr [out] stream of text about inconsistencies for user
+ * perusal
+ *
+ * TODO: this interface doesn't actually make sense...
+ */
+  virtual void compare_scrub_maps(
+    const map<int, ScrubMap> &maps, ///< [in] maps to compare
+    bool deep, ///< [in] true if scrub is deep
+    map<hobject_t, pair<set<int>, set<int> > > *inconsistent,
+    std::ostream *errstr
+  ) = 0;
+
+ /// Recovery
+
+ /**
+ * might_have_unrecoverable
+ *
+ * @param missing [in] missing,info gathered so far (must include acting)
+ * @param intervals [in] past intervals
+ * @param should_query [out] pair<int, cpg_t> shards to query
+ */
+  virtual void might_have_unrecoverable(
+    const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
+    const map<epoch_t, pg_interval_t> &past_intervals,
+    set<pair<int, cpg_t> > *should_query
+  ) = 0;
+
+  /**
+   * recoverable
+   *
+   * @param missing [in] missing,info gathered so far (must include acting)
+   */
+  virtual bool recoverable(
+    const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
+    const hobject_t &hoid ///< [in] object to check
+  ) = 0; ///< @return true if object can be recovered given missing
+
+ /**
+ * recover_object
+ *
+ * Triggers a recovery operation on the specified hobject_t
+ * onreadable must be called before onwriteable
+ *
+ * @param missing [in] set of info, missing pairs for queried nodes
+ */
+  virtual void recover_object(
+    const hobject_t &hoid, ///< [in] object to recover
+    const map<chunk_id_t, map<int, pair<pg_info_t, pg_missing_t> > > &missing,
+    Context *onreadable, ///< [in] called when object can be read
+    Context *onwriteable ///< [in] called when object can be written
+  ) = 0;
+
+ /// Backfill
+
+ /// choose_backfill
+  virtual void choose_backfill(
+    const map<chunk_id_t, map<int, pg_info_t> > &peer_infos, ///< [in] infos
+    const vector<int> &acting, ///< [in] acting set
+    const vector<int> &up, ///< [in] up set
+    set<int> *to_backfill ///< [out] osds to backfill
+  ) = 0;
+};
+
+#endif