* Andreas-Joachim Peters suggests reducing copies to a minimum. When
possible the output arguments will just point to the input
argument. This must be documented, as any side effect on the input
argument may modify the output argument.
* Fix typos
* Fix may/could/must/should to better reflect what's mandatory and
what's not.
* Reword the explanation of minimum_to_decode_with_cost to not suggest
an implementation. This will need to be revisited anyway, when the
semantics of the cost are defined.
The in-tree Hadoop shim was a combination of a libcephfs wrapper and the
bits to support Hadoop. This has been replaced by src/java, which
implements generic libcephfs wrappers, and, externally, the Hadoop shim
(see docs).
David Zafman [Wed, 11 Sep 2013 23:56:21 +0000 (16:56 -0700)]
osd/ReplicatedPG.cc: Verify that recovery is truly complete
Backportable change to ensure that recovery is indeed complete even if
no new ops were started or are running. Prevents some error condition
or unforeseen code path from crashing an OSD.
Backport: dumpling, cuttlefish
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
osd: implement basic caching policies in ReplicatedPG
Right now these are very basic and aren't as sophisticated as we
want them to end up, but we have a skeleton for where to put the
decision-making logic.
If we get back a redirect reply, we clean up the Op's external references
and re-send using the target_oloc and target_oid. To facilitate this,
recalc_op_target() now only fills them in and overrides them with pool
cache semantics if they're empty.
Keep in mind that this is a pretty simple redirect formula -- the
Objecter will keep following redirects forever if that's what the OSDs
send back. The client is not providing any synchronization right now.
Objecter: write a helper function to clean up ops that need to be retried
We have a little block to clean them up if we get back EAGAIN, but it's
actually leaking map references; we will also use this for redirects
from the OSDs.
When present, clients must send the request to the location specified
by the redirect (by using the combine_with_locator() function on
request_redirect_t).
A separate mechanism must be used to ensure that clients see and respect
the redirect, as we do not bump up the minimum required version to
decode.
Analogous to the oloc->base_oloc rename we did in
e2fcad09d94d965867147627b73e99da9454436f, we may specify a different
target name for a redirect. Rename the existing oid to base_oid to
avoid any confusion.
Loic Dachary [Wed, 28 Aug 2013 16:48:40 +0000 (18:48 +0200)]
ErasureCodeJerasure: plugin
Create the class matching the string found in the
erasure-code-technique parameter, using the same strings as the
original {encoder,decoder}.c examples from Jerasure-1.2A. Registers
the plugin in ErasureCodePluginRegistry.
ErasureCodeJerasureCauchy defines the prepare_schedule method to be used
by the prepare method, which is the only one overloaded by
ErasureCodeJerasureCauchyOrig (calling cauchy_original_coding_matrix)
and ErasureCodeJerasureCauchyGood (calling
cauchy_good_general_coding_matrix).
The schedule is retained for encoding and the bitmatrix for decoding.
parse : default to K=7, M=3, W=8 and packetsize = 8.
pad_in_length : pad to a multiple of k*w*packetsize*sizeof(int)
jerasure_encode, jerasure_decode map directly to the matching
jerasure functions
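The padding rule can be illustrated with a small sketch. The free function below is a hypothetical stand-in for the pad_in_length method (its signature is an assumption); it rounds a length up to the next multiple of k*w*packetsize*sizeof(int) and leaves already aligned lengths unchanged.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical free-function version of the padding rule described above:
// round in_length up to the next multiple of k * w * packetsize * sizeof(int).
std::size_t pad_in_length(std::size_t in_length, std::size_t k,
                          std::size_t w, std::size_t packetsize) {
  const std::size_t alignment = k * w * packetsize * sizeof(int);
  assert(alignment > 0);
  const std::size_t remainder = in_length % alignment;
  // Lengths that are already aligned are returned unchanged.
  return remainder == 0 ? in_length : in_length + (alignment - remainder);
}
```

With the defaults K=7, W=8, packetsize=8 and a 4-byte int, the alignment is 7*8*8*4 = 1792 bytes, so a 1-byte input would be padded to 1792 bytes.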
Loic Dachary [Thu, 29 Aug 2013 11:31:10 +0000 (13:31 +0200)]
ErasureCodeJerasure: unit test common to all techniques
A typed unit test is defined and must run regardless of the technique.
When a new technique is derived from ErasureCodeJerasure, it is added
to the JerasureTypes typedef and the test will validate that:
* it provides reasonable defaults for the technique specific
parameters
* it modifies the k, m and w to reasonable defaults depending
on the imposed constraints ( for instance Liber8tion requires
that w == 8 but the test sets it to 7 )
* the encoding of K=2, M=2 produces 4 chunks, the first two
of which contain the original buffer data, showing that the
code is systematic
* decoding when all 4 chunks are available indeed retrieves
the original buffer content
* decoding when the two data chunks are missing indeed
retrieves the original buffer content
With the introduction of the erasure code pool, arguments that are
interpreted depending on the pool type must be introduced.
For instance, the erasure code pool, which loads a plugin at run time,
will use erasure-code-k=10 to split each object in 10.
If an argument has the form key=value, it is stored in the new
properties data member of pg_pool_t as properties[key] = value;
otherwise the value is the empty string.
The pg_pool_t version is bumped to 10 and the encode/decode methods
modified to take the properties into account. The
generate_test_instances method creates a map with two entries, one of
which has the empty string as its value to cover the case when no value
is specified.
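As an illustration of the key=value rule, here is a minimal sketch. The helper name parse_properties and its argument list are hypothetical and not taken from the Ceph sources; only the mapping rule (value after the first '=', or the empty string when there is none) comes from the description above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not the actual Ceph code): turn pool arguments into a
// properties map. "key=value" stores value under key; a bare "key" stores
// the empty string.
std::map<std::string, std::string>
parse_properties(const std::vector<std::string>& args) {
  std::map<std::string, std::string> properties;
  for (const std::string& arg : args) {
    const std::string::size_type equal = arg.find('=');
    if (equal == std::string::npos)
      properties[arg] = "";
    else
      properties[arg.substr(0, equal)] = arg.substr(equal + 1);
  }
  return properties;
}
```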
Greg Farnum [Fri, 30 Aug 2013 23:33:31 +0000 (16:33 -0700)]
ReplicatedPG: do not meaninglessly fill in the reqid on make_writeable() cloning
This reqid is used to fill in a map that is used for giving clients
the correct version on replayed ops, unless the reqid is blank (in
which case it doesn't go into the map). Indirect clones are not ops
that the client wants to observe and the version is incorrect right now,
so don't fill it in.
Note that this should not have actually caused any buggy behavior, because
the map would have simply been overwritten by the real requested event
a short time later (while still protected by locks and things). But it's
very confusing.
Loic Dachary [Thu, 29 Aug 2013 10:58:53 +0000 (12:58 +0200)]
ErasureCodeJerasure: base class for jerasure ErasureCodeInterface
The ErasureCodeJerasure class is derived from ErasureCodeInterface and
is meant to be derived to implement each jerasure technique (
Reed-Solomon, Cauchy ... ).
The parameters K ( number of data chunks ), M ( number of coding chunks
) and W ( word size ) are data members common to all techniques. The
technique data member is expected to be set to a string describing the
technique for debugging purposes.
minimum_to_decode_with_cost ignores the cost and calls minimum_to_decode.
minimum_to_decode returns the first K chunks or an error if there are
not enough. Since all codes are systematic, when all chunks are
available returning the first K allows for concatenation and is the best
choice.
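The selection rule can be sketched as follows. This is a simplified, hypothetical free function: the signature and the -EIO error value are assumptions, and the sketch only shows "take the first K available chunks or fail".

```cpp
#include <cassert>
#include <cerrno>
#include <set>

// Sketch of the selection rule described above (hypothetical signature):
// copy the first k available chunk indices into *minimum, or return -EIO
// (an assumed error value) when fewer than k chunks are available.
int minimum_to_decode(unsigned k, const std::set<int>& available,
                      std::set<int>* minimum) {
  assert(minimum != nullptr);
  if (available.size() < k)
    return -EIO;
  minimum->clear();
  std::set<int>::const_iterator it = available.begin();
  for (unsigned i = 0; i < k; ++i, ++it)
    minimum->insert(*it);
  return 0;
}
```

Because std::set iterates in ascending order, the chunks with the lowest indices are picked first; with a systematic code and no missing chunk these are the data chunks.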
The encode method converts bufferlist into char* as expected by the
jerasure functions. The padding of the incoming buffer depends on the
technique and is computed by the pad_in_length method. Encoding is done
with the jerasure_encode method.
The decode method converts the char* returned by the jerasure functions
into bufferlists to be consumed by the caller. The decoding is done by
the jerasure_decode method.
The to_int convenience method is used to convert parameters. The
is_prime convenience method will be used by some techniques to validate
parameters.
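A possible shape for these two helpers is sketched below. The signatures are assumptions, as is the choice to fall back to a default value when a parameter is missing or malformed.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical to_int: read parameters[name] as an int, falling back to
// default_value when the key is absent or the value is not a clean integer.
int to_int(const std::string& name,
           const std::map<std::string, std::string>& parameters,
           int default_value) {
  std::map<std::string, std::string>::const_iterator it = parameters.find(name);
  if (it == parameters.end() || it->second.empty())
    return default_value;
  char* end = nullptr;
  const long value = std::strtol(it->second.c_str(), &end, 10);
  return *end == '\0' ? static_cast<int>(value) : default_value;
}

// Trial division; sufficient for the small values these parameters take.
bool is_prime(int value) {
  if (value < 2)
    return false;
  for (int d = 2; d * d <= value; ++d)
    if (value % d == 0)
      return false;
  return true;
}
```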
Immediately after creating an ErasureCodeJerasure derived object, the
init method must be called. It will call the parse method to interpret
the parameters required by the technique and set the k, m and w data
members. The prepare method is expected to compute the matrix ( and
schedule if necessary ) and store it in a data member. The init method
will be called while holding the ErasureCodePluginRegistry mutex. The
encode and decode methods will not be protected by a mutex and may be
called by different threads for the benefit of different placement
groups. They will not have any side effect on the object.
Loic Dachary [Fri, 23 Aug 2013 20:22:08 +0000 (22:22 +0200)]
ErasureCodeJerasure: import jerasure-1.2A
The files are copied verbatim from
http://web.eecs.utk.edu/~plank/plank/papers/Jerasure-1.2A.tar and a
section is added to the top level COPYING file to reflect the BSD
license.
Loic Dachary [Wed, 28 Aug 2013 15:29:18 +0000 (17:29 +0200)]
ErasureCodePlugin: plugin registry tests and example
libec_example.la is a fully functional plugin based on
ErasureCodeExample to test the ErasureCodePlugin abstract
interface. It is dynamically loaded to test the
ErasureCodePluginRegistry implementation.
Although the plugin is built in the test directory, it will be
installed. noinst_LTLIBRARIES won't build the shared library, only the
static version which is not suitable for testing.
Loic Dachary [Wed, 28 Aug 2013 13:57:54 +0000 (15:57 +0200)]
ErasureCodePlugin: plugin registry
An ErasureCodePluginRegistry singleton holds all erasure plugin objects
derived from ErasureCodePlugin and their dlopen(2) handles for the
lifetime of the OSD; it is cleaned up by the destructor.
The registry has a single entry point ( method factory ) and should
be used as follows:
If the plugin requested ( "jerasure" in the example above ) is not
found in the *plugins* data member, the load method is called and will:
* dlopen(parameters["erasure-code-directory"] + "jerasure")
* f = dlsym("__erasure_code_init")
* f("jerasure")
* check that it registered "jerasure"
The plugin is expected to do something like
instance.add(plugin_name, new ErasureCodePluginJerasure());
to register itself.
The factory method is protected with a Mutex to avoid race
conditions when using the same plugin from two threads.
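The lookup, load and register sequence can be sketched without any dynamic loading. Everything below is a hypothetical stand-in: Codec, Plugin and Registry only mimic the roles of ErasureCodeInterface, ErasureCodePlugin and ErasureCodePluginRegistry, std::mutex replaces the Ceph Mutex, and stub_loader takes the place of the dlopen/dlsym steps.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

struct Codec {  // stands in for ErasureCodeInterface
  virtual ~Codec() {}
  virtual std::string name() const = 0;
};

struct Plugin {  // stands in for ErasureCodePlugin
  virtual ~Plugin() {}
  virtual std::shared_ptr<Codec> factory(
      const std::map<std::string, std::string>& parameters) = 0;
};

class Registry {  // stands in for ErasureCodePluginRegistry
 public:
  static Registry& instance() {
    static Registry singleton;
    return singleton;
  }

  // Called by a plugin's init function to register itself.
  bool add(const std::string& name, std::unique_ptr<Plugin> plugin) {
    return plugins_.emplace(name, std::move(plugin)).second;
  }

  // Single entry point: find the plugin (loading it on first use) and ask it
  // for a codec. The lock serializes concurrent callers.
  std::shared_ptr<Codec> factory(
      const std::string& name,
      const std::map<std::string, std::string>& parameters) {
    std::lock_guard<std::mutex> guard(lock_);
    if (plugins_.find(name) == plugins_.end() && !load(name))
      return nullptr;
    return plugins_[name]->factory(parameters);
  }

  // Placeholder for the dlopen/dlsym steps of a real loader.
  bool (*loader)(Registry&, const std::string&) = nullptr;

 private:
  bool load(const std::string& name) {
    if (!loader || !loader(*this, name))
      return false;
    return plugins_.count(name) == 1;  // check that it registered itself
  }

  std::mutex lock_;
  std::map<std::string, std::unique_ptr<Plugin>> plugins_;
};

// Minimal plugin used to exercise the registry.
struct ExampleCodec : Codec {
  std::string name() const override { return "example"; }
};

struct ExamplePlugin : Plugin {
  std::shared_ptr<Codec> factory(
      const std::map<std::string, std::string>&) override {
    return std::make_shared<ExampleCodec>();
  }
};

// Stub loader: pretends the shared library for `name` was opened and that its
// init function registered the plugin.
bool stub_loader(Registry& registry, const std::string& name) {
  if (name != "example")
    return false;
  return registry.add(name, std::unique_ptr<Plugin>(new ExamplePlugin()));
}
```

In this sketch add() does not take the lock, because it is only reached from factory() while the lock is already held.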
The erasure_codelib_LTLIBRARIES variable is added to the Makefile
and the plugins are expected to add themselves and be installed
in $(libdir)/erasure-code.
Loic Dachary [Wed, 28 Aug 2013 13:46:34 +0000 (15:46 +0200)]
ErasureCodePlugin: plugin interface
When dynamically loaded, a plugin is expected to define
int __erasure_code_init(char *plugin_name);
When called, it is responsible for registering an ErasureCodePlugin
derived object that provides a factory method from which the concrete
implementation of the ErasureCodeInterface object can be generated:
Loic Dachary [Mon, 19 Aug 2013 17:15:07 +0000 (19:15 +0200)]
ErasureCode: example implementation : K=2 M=1
An erasure code implementation designed for tests. Although it is fully
functional and could be used on actual data, it is mainly provided for
testing purposes. It splits data in two, computes an XOR parity and
can sustain the loss of one chunk.
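The idea behind a K=2, M=1 XOR code can be sketched in a few lines. The function names are hypothetical and the sketch works on std::string rather than bufferlist; it also assumes an even-length input, where real code would pad.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of a K=2, M=1 XOR code (assumes an even-length input).
// Chunks 0 and 1 are the two halves of the data, chunk 2 is their XOR.
std::vector<std::string> xor_encode(const std::string& data) {
  assert(data.size() % 2 == 0);
  const std::size_t half = data.size() / 2;
  std::vector<std::string> chunks(3);
  chunks[0] = data.substr(0, half);
  chunks[1] = data.substr(half);
  chunks[2].resize(half);
  for (std::size_t i = 0; i < half; ++i)
    chunks[2][i] = static_cast<char>(chunks[0][i] ^ chunks[1][i]);
  return chunks;
}

// Rebuild one missing chunk (index 0, 1 or 2) by XORing the other two.
std::string xor_recover(const std::vector<std::string>& chunks,
                        std::size_t missing) {
  assert(chunks.size() == 3 && missing < 3);
  const std::string& a = chunks[(missing + 1) % 3];
  const std::string& b = chunks[(missing + 2) % 3];
  assert(a.size() == b.size());
  std::string out(a.size(), '\0');
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = static_cast<char>(a[i] ^ b[i]);
  return out;
}
```

Since the first two chunks are the data itself, the code is systematic: when nothing is lost, concatenating chunks 0 and 1 gives back the original.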
The constructor will usleep(3) for parameters["usleep"] microseconds
so that the caller can create race conditions.
Loic Dachary [Mon, 19 Aug 2013 16:56:56 +0000 (18:56 +0200)]
ErasureCode: abstract interface
The erasure coded pool relies on this abstract interface to encode and
decode the chunks stored in the OSD. It has been designed to be
generic enough to accommodate the libraries and algorithms that are
most likely to be used. It does not claim to be universal.
- In "includes", inttypes.h was conflicting with the system header of the
same name. This caused random build errors on some systems and under
some conditions. Rename it.
- Add fallback definitions of the PRI*64 macros when int_types.h does not
define them (which, unfortunately, can happen on some systems).
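A fallback of this kind might look like the sketch below. The "ll" length modifier is an assumption that only holds where the 64-bit types are long long, and format_u64 is a hypothetical helper added to show the macro in use.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Fallback definitions, used only when the platform headers did not provide
// the macros. The "ll" modifier is an assumption (64-bit types are long long).
#ifndef PRId64
#define PRId64 "lld"
#endif
#ifndef PRIu64
#define PRIu64 "llu"
#endif

// Hypothetical helper: format a 64-bit unsigned value with the PRIu64 macro.
int format_u64(char* buffer, std::size_t size, std::uint64_t value) {
  return std::snprintf(buffer, size, "%" PRIu64, value);
}
```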
Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>