osdc/ObjectCacher: finish contexts after dropping object reference
The context to finish can be a C_Client_PutInode, which may drop the
inode's last reference. So we should first drop the object's
reference, then finish the contexts.
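A minimal sketch of the ordering, with illustrative stand-in types
(close_object and its helpers are hypothetical, not the actual
ObjectCacher code):

    #include <list>

    struct Context {
      virtual ~Context() {}
      virtual void finish(int r) = 0;
      void complete(int r) { finish(r); delete this; }
    };

    struct Object {
      void put() { /* may free the object when its refcount hits zero */ }
    };

    // Release the object reference *before* running the waiting
    // contexts: a context like C_Client_PutInode may drop the inode's
    // last reference when it runs.
    void close_object(Object *ob, std::list<Context*>& waiters) {
      ob->put();                   // 1. drop the object's reference
      for (Context *c : waiters)   // 2. only then finish the contexts
        c->complete(0);
      waiters.clear();
    }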
Sage Weil [Wed, 11 Sep 2013 22:09:59 +0000 (15:09 -0700)]
osd: block requests on object during COPY_FROM
Block any request on an object (read or write) during the COPY_FROM
operation.
This could potentially be broken down into read vs write operations without
much difficulty, but blocking any op indiscriminately is sufficient for
now, so let's keep it simple.
Sage Weil [Wed, 11 Sep 2013 22:10:47 +0000 (15:10 -0700)]
osd: add infrastructure to block io on an obc
Add an is_blocked() method for the obc, and add infrastructure to block
any operations if it returns true. Clean up on_change(), and add a helper
to kick an obc when whatever condition led to it being blocked is no
longer true.
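A rough sketch of the block/kick pattern with simplified stand-in
types (maybe_block_op and the fields shown are illustrative, not the
actual obc interface):

    #include <list>

    struct OpRequest;

    struct ObjectContext {
      bool blocked = false;
      std::list<OpRequest*> waiting;   // ops parked while blocked
      bool is_blocked() const { return blocked; }
    };

    struct PG {
      void requeue_op(OpRequest *op) { /* hand back to the op queue */ }

      // At the top of op processing: park the op if the obc is blocked.
      bool maybe_block_op(ObjectContext& obc, OpRequest *op) {
        if (obc.is_blocked()) {
          obc.waiting.push_back(op);
          return true;                 // caller stops processing this op
        }
        return false;
      }

      // Called when whatever condition blocked the obc clears: requeue
      // everything that was parked while it was blocked.
      void kick_object_context_blocked(ObjectContext& obc) {
        obc.blocked = false;
        for (OpRequest *op : obc.waiting)
          requeue_op(op);
        obc.waiting.clear();
      }
    };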
Sage Weil [Thu, 5 Sep 2013 00:09:52 +0000 (17:09 -0700)]
osd/ReplicatedPG: stage object chunks to replicas during COPY_FROM
As we get each chunk of data during the COPY_FROM operation, write it out
to a temporary object on the replicas. When we get all the pieces, move
it into place.
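A sketch of the staging flow under a hypothetical Transaction
stand-in (append and move_into_place are illustrative names, not the
real ObjectStore API):

    #include <string>
    #include <vector>

    struct Transaction {
      void append(const std::string& oid, const std::vector<char>& data) {}
      void move_into_place(const std::string& from, const std::string& to) {}
    };

    // Each chunk lands in a temporary object on the replicas; only when
    // the final chunk arrives does the temp object move into place, so
    // a half-finished copy is never visible under the real name.
    void stage_copy_chunk(Transaction& t, const std::string& temp_oid,
                          const std::string& final_oid,
                          const std::vector<char>& chunk, bool last) {
      t.append(temp_oid, chunk);
      if (last)
        t.move_into_place(temp_oid, final_oid);
    }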
On btrfs, kb_used + kb_avail can be much smaller than total kb, and
what really matters to avoid filling up the disk is how much space is
available, not how much we've used. Thus, compute the ratio we use to
determine full or nearfull from kb_avail rather than from kb_used.
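A worked example of the difference (full_ratio is an illustrative
helper, not the actual monitor code):

    #include <cstdint>

    // Fullness derived from what is actually available, rather than
    // from what we have used.
    double full_ratio(int64_t kb, int64_t kb_avail) {
      return 1.0 - (double)kb_avail / (double)kb;
    }

    // e.g. kb = 1000000, kb_used = 400000, kb_avail = 100000 (on btrfs
    // the other 500000 may be lost to internal metadata/raid overhead):
    //   used-based ratio:  400000 / 1000000     = 0.40  looks half empty
    //   avail-based ratio: 1 - 100000 / 1000000 = 0.90  correctly near-full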
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Joe Buck [Sat, 14 Sep 2013 00:41:31 +0000 (17:41 -0700)]
Removing extraneous code
The ExternalResource code was unnecessary and caused
issues on CentOS. Removing it.
Update Makefile.am to reflect the fact that
an anonymous class was removed and its
$1.class file is no longer generated.
The in-tree Hadoop shim was a combination of a libcephfs wrapper and the
bits to support Hadoop. This has been replaced by src/java, which
implements generic libcephfs wrappers, and, externally, by the hadoop
shim (see docs).
Fixes: #6175
Backport: dumpling
We get a buffer off the remote gateway which might
not be NULL terminated. The JSON parser needs the
buffer to be NULL terminated, even though we provide
a buffer length, as it calls strlen().
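A minimal sketch of the implied fix (names are illustrative):

    #include <cstddef>
    #include <string>

    // Copying the received bytes into a std::string guarantees a
    // trailing '\0' via c_str(), making the buffer safe for a parser
    // that calls strlen() internally.
    std::string nul_terminated(const char *data, std::size_t len) {
      return std::string(data, len);
    }

    // usage sketch:
    //   std::string safe = nul_terminated(buf, buf_len);
    //   parse_json(safe.c_str(), safe.size());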
David Zafman [Wed, 11 Sep 2013 23:56:21 +0000 (16:56 -0700)]
osd/ReplicatedPG.cc: Verify that recovery is truly complete
Backportable change to ensure that recovery is truly complete even if
no new ops have started or are running. Prevents some error condition
or unforeseen code path from crashing an osd.
Backport: dumpling, cuttlefish
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
rgw: drain pending requests before completing write
Fixes: #6268
When doing aio write of objects (either regular or multipart parts) we
need to drain pending aio requests. Otherwise, if the gateway goes down,
the object might end up corrupted.
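A sketch of the drain under an assumed throttle structure (AioThrottle
here is hypothetical, not the rgw type):

    #include <condition_variable>
    #include <mutex>

    struct AioThrottle {
      std::mutex m;
      std::condition_variable cv;
      int pending = 0;

      void start_io() {               // one pending aio request more
        std::lock_guard<std::mutex> l(m);
        ++pending;
      }
      void finish_io() {              // one pending aio request less
        std::lock_guard<std::mutex> l(m);
        if (--pending == 0)
          cv.notify_all();
      }
      // Call before completing the write: block until every pending
      // request has finished, so a crashed gateway cannot leave a
      // half-written object behind.
      void drain() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return pending == 0; });
      }
    };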
Sage Weil [Tue, 3 Sep 2013 22:13:40 +0000 (15:13 -0700)]
os/FileStore: implement collection_move_rename
This is similar to a collection_add + collection_move sequence in that we
apply the same replay guards. The difference is that we roll it up into
a single operation, change the filename, and make the omap content carry
over by calling DBObjectMap->clone (as there is no rename function or
collection awareness in the DBObjectMap).
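A toy model of the semantics (FakeStore and FakeObjectMap are
stand-ins; the real FileStore operates on files and a leveldb-backed
DBObjectMap):

    #include <map>
    #include <string>

    struct FakeObjectMap {
      std::map<std::string, std::map<std::string, std::string>> omap;
      // No rename available -- mimic DBObjectMap->clone():
      void clone(const std::string& from, const std::string& to) {
        omap[to] = omap[from];
      }
    };

    struct FakeStore {
      std::map<std::string, std::string> files;  // "coll/name" -> data
      FakeObjectMap object_map;

      void collection_move_rename(const std::string& oldc,
                                  const std::string& oldn,
                                  const std::string& newc,
                                  const std::string& newn) {
        // (the real code first checks the same replay guards that
        //  collection_add + collection_move would set)
        const std::string oldk = oldc + "/" + oldn;
        const std::string newk = newc + "/" + newn;
        files[newk] = files[oldk];     // new collection *and* new name
        object_map.clone(oldk, newk);  // omap content carries over
        files.erase(oldk);
        object_map.omap.erase(oldk);
      }
    };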
rgw: when failing read from client, return correct error
Fixes: #6214
When getting a failed read from the client while putting an object,
we returned the wrong value (always 0), which in the chunked-upload
case ended up with the write being treated as successful.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
osd: implement basic caching policies in ReplicatedPG
Right now these are very basic and aren't as sophisticated as we
want them to end up, but we have a skeleton for where to put the
decision-making logic.
If we get back a redirect reply, we clean up the Op's external references
and re-send using the target_oloc and target_oid. To facilitate this,
recalc_op_target() now only fills them in and overrides them with pool
cache semantics if they're empty.
Keep in mind that this is a pretty simple redirect formula -- the
Objecter will keep following redirects forever if that's what the OSDs
send back. The client is not providing any synchronization right now.
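A simplified sketch of the resend path (the Op fields are reduced to
strings and handle_redirect is an illustrative name):

    #include <string>

    struct Op {
      std::string base_oloc, base_oid;      // what the caller asked for
      std::string target_oloc, target_oid;  // where the op is sent
    };

    void handle_redirect(Op& op, const std::string& oloc,
                         const std::string& oid) {
      // ...clean up the op's external references first, then:
      op.target_oloc = oloc;  // recalc_op_target() leaves non-empty
      op.target_oid = oid;    // targets alone, so the redirect sticks
      // ...and resend. There is no loop protection: the Objecter keeps
      // following redirects for as long as the OSDs send them back.
    }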
Objecter: write a helper function to clean up ops that need to be retried
We have a little block to clean them up if we get back EAGAIN, but it's
actually leaking map references; we will also use this for redirects
from the OSDs.
When present, clients must send the request to the location specified
by the redirect (by using the combine_with_locator() function on
request_redirect_t).
A separate mechanism must be used to ensure that clients see and respect
the redirect, as we do not bump up the minimum required version to
decode.
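A plausible, simplified shape for the redirect spec (types reduced;
the real object_locator_t carries a pool id, key and namespace):

    #include <string>

    struct object_locator_t {
      std::string pool;
    };

    struct request_redirect_t {
      object_locator_t redirect_locator;
      std::string redirect_object;  // empty -> keep the original name

      void combine_with_locator(object_locator_t& orig,
                                std::string& obj) const {
        orig = redirect_locator;
        if (!redirect_object.empty())
          obj = redirect_object;
      }
    };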
Analogous to the oloc->base_oloc rename we did in e2fcad09d94d965867147627b73e99da9454436f, we may specify a different
target name for a redirect. Rename the existing oid to base_oid to
avoid any confusion.
Loic Dachary [Wed, 28 Aug 2013 16:48:40 +0000 (18:48 +0200)]
ErasureCodeJerasure: plugin
Create the class matching the string found in the
erasure-code-technique parameter, using the same strings as the
original {encoder,decoder}.c examples from Jerasure-1.2A. Register
the plugin in ErasureCodePluginRegistry.
ErasureCodeJerasureCauchy defines the prepare_schedule method to be used
by the prepare method, which is the only method overridden by
ErasureCodeJerasureCauchyOrig (calling cauchy_original_coding_matrix)
and ErasureCodeJerasureCauchyGood (calling
cauchy_good_general_coding_matrix).
The schedule is retained for encoding and the bitmatrix for decoding.
parse: default to K=7, M=3, W=8 and packetsize=8.
pad_in_length: pad to a multiple of k*w*packetsize*sizeof(int).
jerasure_encode and jerasure_decode map directly to the matching
jerasure functions.
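A sketch of pad_in_length as described, plus the arithmetic for the
defaults:

    // Round the input length up to a multiple of
    // k*w*packetsize*sizeof(int).
    unsigned pad_in_length(unsigned in_length, unsigned k, unsigned w,
                           unsigned packetsize) {
      unsigned alignment = k * w * packetsize * (unsigned)sizeof(int);
      while (in_length % alignment)
        in_length++;
      return in_length;
    }

    // With the parse defaults K=7, M=3, W=8, packetsize=8 and a 4-byte
    // int, the alignment unit is 7 * 8 * 8 * 4 = 1792 bytes, so a
    // 4096-byte input pads to 5376 (= 3 * 1792).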
Loic Dachary [Thu, 29 Aug 2013 11:31:10 +0000 (13:31 +0200)]
ErasureCodeJerasure: unit test common to all techniques
A typed unit test is defined and must run regardless of the technique.
When a new technique is derived from ErasureCodeJerasure, it is added
to the JerasureTypes typedef and the test will validate that:
* it provides reasonable defaults for the technique-specific
  parameters
* it modifies k, m and w to reasonable defaults depending
  on the imposed constraints (for instance Liber8tion requires
  that w == 8 but the test sets it to 7)
* the encoding of K=2, M=2 produces 4 chunks, the first two
  of which contain the original buffer data, showing the
  code is systematic
* decoding when all 4 chunks are available indeed retrieves
  the original buffer content
* decoding when the two data chunks are missing indeed
  retrieves the original buffer content
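A sketch of the typed-test structure with stand-in technique types
(the real JerasureTypes typedef lists the ErasureCodeJerasure
subclasses instead):

    #include <gtest/gtest.h>

    struct TechniqueA { static const int w = 8; };
    struct TechniqueB { static const int w = 7; };

    template <typename Technique>
    class ErasureCodeJerasureTest : public ::testing::Test {};

    typedef ::testing::Types<TechniqueA, TechniqueB> JerasureTypes;
    TYPED_TEST_CASE(ErasureCodeJerasureTest, JerasureTypes);

    // Every TYPED_TEST runs once per entry in JerasureTypes, so a newly
    // derived technique gets the whole battery of checks just by being
    // added to the typedef.
    TYPED_TEST(ErasureCodeJerasureTest, sane_defaults) {
      int w = TypeParam::w;
      EXPECT_GT(w, 0);
    }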
With the introduction of the erasure code pool, arguments that are
interpreted depending on the pool type must be supported. For
instance, an erasure code pool that loads a plugin at run time will
use erasure-code-k=10 to split each object into 10 chunks.
If an argument has the form key=value, it is stored in the new
properties data member of pg_pool_t as properties[key] = value;
otherwise the value is the empty string.
The pg_pool_t version is bumped to 10 and the encode/decode methods
modified to take the properties into account. The
generate_test_instances method creates a two-entry map, one of whose
values is the empty string, to cover the case when no value is
specified.
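A minimal sketch of the parsing rule (parse_property is an
illustrative helper, not the actual monitor code):

    #include <map>
    #include <string>

    // "key=value" becomes properties[key] = value; a bare "key" maps
    // to the empty string.
    void parse_property(const std::string& arg,
                        std::map<std::string, std::string>& properties) {
      std::string::size_type eq = arg.find('=');
      if (eq == std::string::npos)
        properties[arg] = "";
      else
        properties[arg.substr(0, eq)] = arg.substr(eq + 1);
    }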