Yuan Zhou [Thu, 23 Oct 2014 05:27:45 +0000 (13:27 +0800)]
EC: document the LRC per layer plugin configuration
LRC now uses Jerasure as the default EC backend, but it is actually
possible to switch to another backend such as Isa using the low-level
configuration. This commit adds documentation on how to specify the EC
backend in each LRC layer:
We are dropping the requirement for MON_CAP_R for MMonGetMap.
Reason is simple enough: clients may need to contact the monitors and
obtain the latest monmap before authenticating. This happens, for
instance, when a client calls MonClient::get_monmap_privately(). The
osd uses this function during mkfs, prior to initializing a keyring or
even so much as existing.
Fixes: #9859
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Samuel Just [Mon, 20 Oct 2014 21:10:58 +0000 (14:10 -0700)]
PG:: reset_interval_flush and in set_last_peering_reset
If we have a change in the prior set, but not in the up/acting set, we go back
through Reset in order to reset peering state. Previously, we would reset
last_peering_reset in the Reset constructor. This did not, however, reset the
flush_interval, which caused the eventual flush event to be ignored and the
peering messages to not be sent.
Instead, we will always reset_interval_flush if we are actually changing the
last_peering_reset value.
Fixes: #9821
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
Loic Dachary [Mon, 13 Oct 2014 12:48:27 +0000 (14:48 +0200)]
erasure-code: add ErasureCode::encode unit test
Re-create and describe the situation that is fixed by 91a7e18f60bbc9acab3045baaa1b6505474ec4a9 which reworks the buffer
preparation function provided by ErasureCode::encode.
Janne Grunau [Mon, 29 Sep 2014 12:34:31 +0000 (14:34 +0200)]
erasure code: use 32-byte aligned buffers
Requiring page-aligned buffers and realigning the input if necessary
creates measurable overhead. With 32-byte alignment,
ceph_erasure_code_benchmark is between 10% and 20% faster depending on
the workload.
Also prevents a misaligned buffer when bufferlist::c_str(bufferlist)
has to allocate a new buffer to provide a contiguous one. See bug #9408.
Loic Dachary [Mon, 13 Oct 2014 14:32:18 +0000 (16:32 +0200)]
common: add an aligned buffer with less alignment than a page
SIMD optimized erasure code computation needs aligned memory. Buffers
aligned to a page boundary are not needed though. The buffers used
for the erasure code computation are typically smaller than a page.
The typical alignment requirements for SIMD operations are 16 bytes for
SSE2 and NEON and 32 bytes for AVX/AVX2.
Add new prototypes with an align argument, similar to the one enforcing
page alignment. The implementation is exactly the same, except for the
align parameter. The page alignment methods are then implemented as calls
to the more generic methods.
The align parameter is an unsigned (same type as CEPH_PAGE_SIZE). The
CEPH_PAGE_MASK value ( ~(CEPH_PAGE_SIZE - 1) ) was only used as
~CEPH_PAGE_MASK, i.e. equivalent of (CEPH_PAGE_SIZE - 1) once the double
~~ is reduced. These occurrences are replaced with (align - 1). The type
of CEPH_PAGE_MASK is an unsigned long, which is probably because it was
~(CEPH_PAGE_SIZE). When using (align - 1) as a mask for both
CEPH_PAGE_SIZE and SIMD alignment there is no need to use an unsigned
long because there is no risk of overflowing the unsigned value.
The CYGWIN specific code is also modified but not tested.
Unit tests are added for the new methods.
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
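The (align - 1) mask arithmetic described in this commit can be sketched as follows. This is a minimal illustration and not the Ceph buffer code: round_up_to, is_aligned, alloc_aligned, alloc_page_aligned and kPageSize are hypothetical names, and align is assumed to be a power of two that is also a multiple of sizeof(void*), as posix_memalign requires.

```cpp
#include <cassert>
#include <cstdint>
#include <stdlib.h>  // posix_memalign (POSIX), free

// Hypothetical helper: round len up to the next multiple of align.
// Assumes align is a power of two, so (align - 1) masks the low bits.
inline unsigned round_up_to(unsigned len, unsigned align) {
  return (len + (align - 1)) & ~(align - 1);
}

// True if p sits on an align-byte boundary (same mask, applied to an address).
inline bool is_aligned(const void *p, unsigned align) {
  return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}

// Generic allocation with an explicit alignment, e.g. 16 for SSE2/NEON
// or 32 for AVX/AVX2. Returns nullptr on failure; release with free().
inline void *alloc_aligned(unsigned len, unsigned align) {
  void *p = nullptr;
  if (posix_memalign(&p, align, round_up_to(len, align)) != 0)
    return nullptr;
  return p;
}

// The page-aligned variant becomes a call to the generic one.
// kPageSize is an assumed value for illustration only.
const unsigned kPageSize = 4096;
inline void *alloc_page_aligned(unsigned len) {
  return alloc_aligned(len, kPageSize);
}
```

A caller wanting AVX-friendly memory would use alloc_aligned(len, 32) and release it with free().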
Jason Dillaman [Tue, 21 Oct 2014 07:42:13 +0000 (03:42 -0400)]
rbd: Correct readahead divide by zero exception
When readahead is used on old-format RBD images, a divide
by zero signal will be thrown. This was caused by initializing
the readahead alignments prior to initializing the stripe layout
of old-format RBD images.
Fixes: 9857
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Loic Dachary [Thu, 16 Oct 2014 06:04:03 +0000 (23:04 -0700)]
cli: do not parse injectargs arguments twice
Because the arguments of injectargs are valid ceph arguments, they are
consumed when the ceph cli calls rados.conf_parse_argv(). It can be
worked around by obscuring them as in:
where '--osd_debug_drop_ping_probability 444' is a single argument that
does not match any known argument. The trick is that it will be
evaluated again once it reaches the OSD or the MON and translated into
the expected list of arguments. Although it is clear once explained, it
is obscure and leads to strange combinations such as:
this one is unfortunately much less documented and the user does not
usually know the exact semantics of --, let alone where it should be
placed.
The simpler solution is to split the argument list in two if
"injectargs" is found. The arguments that appear after the "injectargs"
argument are removed from the list of arguments until parsing is
complete. It implements the more intuitive syntax:
ceph tell osd.0 injectargs --osd_debug_op_order
and the other forms are still valid for backward compatibility.
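The splitting step can be sketched as below. The ceph cli itself is a Python script, so this C++ fragment only illustrates the idea: split_at_injectargs and the Args alias are hypothetical names, and handing the tail back after parsing is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Args = std::vector<std::string>;

// Hypothetical helper: split args into
//   first  = everything up to and including "injectargs"
//   second = everything after it, kept away from the generic parser
// If "injectargs" is absent, second is empty and first is the whole list.
inline std::pair<Args, Args> split_at_injectargs(const Args &args) {
  auto it = std::find(args.begin(), args.end(), std::string("injectargs"));
  if (it == args.end())
    return {args, Args{}};
  return {Args(args.begin(), it + 1), Args(it + 1, args.end())};
}
```

The generic parser would then only see the first part, and the second part would be appended again once parsing is complete.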
Adam Crume [Wed, 8 Oct 2014 00:45:53 +0000 (17:45 -0700)]
Fix read performance regression in ObjectCacher
The regression was introduced in commit 4fc9fffc494abedac0a9b1ce44706343f18466f1. The problem is that the cache
thinks it's full (when it's not), so it defers the read. This change
frees up cache space if necessary and only defers the read if enough
space cannot be freed.
qa/workunits: cephtool: don't remove self's key on auth tests
Suites run with CEPH_TEST_CLI_DUP_COMMAND=1, which will send a duplicate
command for every command issued with the 'ceph' tool. Behavior is to
get a reply from the command and then send a duplicate, looking for the
same outcome (guaranteeing idempotency of the operations). However, if
you remove the entity's own key from the keyring and are unlucky enough
that the client's connection fails (we also run tests with connection
failure injections), the 'ceph' tool won't be able to reconnect to the
cluster to send the duplicate command (as its entity no longer exists
in the cluster's keyring).
We rewrite the test instead of resorting to ugly hacks to work around
this behavior, simply having a new 'role-definer' added by the existing
'role-definer' (which we weren't testing anyway, so bonus points for
that) and then have one removing the other (to test the procedure) and
finally using 'client.admin' to remove the last 'role-definer'.
Fixes: #9820
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
mon: MDSMonitor: wait for osdmon to be writable when requesting proposal
Otherwise we may end up requesting the osdmon to propose while it is
mid-proposal. We can't simply return EAGAIN to the user either because
then we would have to expect the user to be able to successfully race
with the whole cluster in finding a window in which 'mds fs new' command
would succeed -- which is not a realistic expectation. Having the
command go through osdmon()->wait_for_writable() guarantees that the
command will be added to a queue and that we will, eventually, tend to it.
Fixes: #9794
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
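The queue-and-retry behaviour described in this commit can be pictured with a toy model. Everything here is hypothetical (ToyService, handle_fs_new and the applied counter are not Ceph names or signatures). The sketch only shows a command being parked while the service is mid-proposal and re-run once it becomes writable, instead of failing with EAGAIN.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for a service that is not writable mid-proposal.
class ToyService {
public:
  bool is_writable() const { return !proposing_; }
  void start_proposal() { proposing_ = true; }

  // Park a callback until the service is writable again.
  void wait_for_writable(std::function<void()> cb) {
    waiters_.push(std::move(cb));
  }

  // Proposal finished: become writable and run parked callbacks in order.
  void finish_proposal() {
    proposing_ = false;
    while (!waiters_.empty() && !proposing_) {
      std::function<void()> cb = std::move(waiters_.front());
      waiters_.pop();
      cb();
    }
  }

private:
  bool proposing_ = false;
  std::queue<std::function<void()>> waiters_;
};

// Hypothetical command handler: retries itself later rather than
// asking the user to race for a writable window.
inline void handle_fs_new(ToyService &osdmon, int &applied) {
  if (!osdmon.is_writable()) {
    osdmon.wait_for_writable(
        [&osdmon, &applied] { handle_fs_new(osdmon, applied); });
    return;
  }
  ++applied;  // stands in for the real state change
}
```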
mon: MDSMonitor: don't return -EINVAL if function is bool
Returning -EINVAL on a function that expects bool and the error code to
be in a variable 'r' can only achieve one thing: if this path is ever
touched, instead of returning an error as it was supposed to, we're
returning 'true' with 'r = 0' and, for no apparent reason, the user will
think everything went smoothly but with no new fs created.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
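The pitfall described in this commit is easy to reproduce in isolation. In the sketch below, buggy_handler and fixed_handler are hypothetical functions, not the MDSMonitor code. The first shows how -EINVAL silently converts to true while r stays 0. The second shows one possible correction. Which bool the real caller expects is an assumption here.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical handler: returns a bool and reports the error code via r.
inline bool buggy_handler(bool valid, int &r) {
  r = 0;
  if (!valid)
    return -EINVAL;  // bug: a nonzero int converts to true, and r is still 0
  return true;
}

// One possible fix: store the error in r and return an explicit bool.
// Returning false on error is an assumption about the caller's contract.
inline bool fixed_handler(bool valid, int &r) {
  r = 0;
  if (!valid) {
    r = -EINVAL;
    return false;
  }
  return true;
}
```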
mon: MDSMonitor: check all conditions are met *before* osdmon proposal
We should not allow ourselves to request the osdmon to propose before we
know for sure that we meet the required conditions to go through with
our own state change. Even if we still can't guarantee that our
proposal is going to be committed, we shouldn't just change the osdmon's
state just because we can. This way, at least, we make sure that our
checks hold up before doing anything with side-effects.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
We were just setting return code to -EINVAL, while allowing the logic to
continue regardless. If we are to return error, then we should abort
the operation as well and let the user know it went wrong instead of
continuing as if nothing had happened.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Jianpeng Ma [Fri, 17 Oct 2014 06:04:40 +0000 (14:04 +0800)]
test: fix compile warning in bufferlist.cc
test/bufferlist.cc: In member function ‘virtual void
Buffer_constructors_Test::TestBody()’:
test/bufferlist.cc:154:36: warning: ignoring return value of ‘int
system(const char*)’, declared with attribute warn_unused_result
[-Wunused-result]
::system("echo ABC > testfile");
^
test/bufferlist.cc: In member function ‘virtual void
TestRawPipe::SetUp()’:
test/bufferlist.cc:182:36: warning: ignoring return value of ‘int
system(const char*)’, declared with attribute warn_unused_result
[-Wunused-result]
::system("echo ABC > testfile");
^
test/bufferlist.cc: In member function ‘virtual void
BufferList_read_file_Test::TestBody()’:
test/bufferlist.cc:1768:53: warning: ignoring return value of ‘int
system(const char*)’, declared with attribute warn_unused_result
[-Wunused-result]
::system("echo ABC > testfile ; chmod 0 testfile");
^
test/bufferlist.cc:1770:32: warning: ignoring return value of ‘int
system(const char*)’, declared with attribute warn_unused_result
[-Wunused-result]
::system("chmod +r testfile");
^
test/bufferlist.cc: In member function ‘virtual void
BufferList_read_fd_Test::TestBody()’:
test/bufferlist.cc:1781:34: warning: ignoring return value of ‘int
system(const char*)’, declared with attribute warn_unused_result
[-Wunused-result]
::system("echo ABC > testfile");
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
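Warnings of this kind go away once the return value of system() is actually consumed. The helper below is a hypothetical sketch of that idea and does not reproduce the change made in test/bufferlist.cc. In a test, the result could equally be fed to an assertion.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: run a shell command and report whether system()
// returned 0. Because the result is used, -Wunused-result has nothing
// to complain about.
inline bool run_ok(const char *cmd) {
  int r = std::system(cmd);
  return r == 0;
}
```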
Jianpeng Ma [Fri, 17 Oct 2014 05:19:59 +0000 (13:19 +0800)]
librbd: fix compile warning in librbd/internal.cc.
librbd/internal.cc: In function 'void
librbd::readahead(librbd::ImageCtx*, const std::vector<std::pair<long
unsigned int, long unsigned int> >&, const md_config_t*)':
librbd/internal.cc:3150:38: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
ictx->total_bytes_read > conf->rbd_readahead_disable_after_bytes;
^
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Gregory Farnum [Thu, 16 Oct 2014 13:57:34 +0000 (06:57 -0700)]
Merge pull request #2628 from ceph/wip-client-flock
Wip client flock
Add support for file locking to the userspace client, and improve blocked-lock cancellation so that it doesn't remove locks that succeeded when racing.
Loic Dachary [Thu, 16 Oct 2014 00:14:53 +0000 (17:14 -0700)]
mon: add the osd crush rename-bucket command
The synopsis is:
osd crush rename-bucket name1 name2
It is made idempotent by interpreting -EALREADY, as returned by
CrushWrapper::rename_bucket, as success.
The crush_rename_bucket method first checks for errors with
CrushWrapper::can_rename_bucket if there is no pending crush so that it
can return early and avoid the creation of a pending crush map.
If renaming is possible, CrushWrapper::rename_bucket is called on the
pending crush map (and creates it indirectly if it does not already
exist).
Loic Dachary [Thu, 16 Oct 2014 00:06:12 +0000 (17:06 -0700)]
crush: add CrushWrapper::rename_item and can_rename_item
The can_rename_item method is a const method that checks whether renaming
an item could succeed. If not, it returns a unique -errno code and a
human-readable message.
Trying to rename a nonexistent item into an existing item returns
-EALREADY which can be treated as success if renaming is to be
idempotent.
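The idempotency rule can be illustrated with a toy name table. ToyNames and rename_idempotent are hypothetical and unrelated to the CrushWrapper implementation. Only the -EALREADY case comes from the commit message. The -ENOENT and -EEXIST codes are assumptions chosen for the sketch.

```cpp
#include <cassert>
#include <cerrno>
#include <set>
#include <string>

// Hypothetical toy name table, not CrushWrapper.
struct ToyNames {
  std::set<std::string> names;

  // Const check: would renaming src to dst succeed?
  int can_rename(const std::string &src, const std::string &dst) const {
    const bool has_src = names.count(src) > 0;
    const bool has_dst = names.count(dst) > 0;
    if (!has_src && has_dst)
      return -EALREADY;  // looks like the rename already happened
    if (!has_src)
      return -ENOENT;    // assumed code: nothing to rename
    if (has_dst)
      return -EEXIST;    // assumed code: destination name is taken
    return 0;
  }

  int rename(const std::string &src, const std::string &dst) {
    int r = can_rename(src, dst);
    if (r < 0)
      return r;
    names.erase(src);
    names.insert(dst);
    return 0;
  }
};

// Caller-side idempotency: treat -EALREADY as success.
inline int rename_idempotent(ToyNames &t, const std::string &src,
                             const std::string &dst) {
  int r = t.rename(src, dst);
  return r == -EALREADY ? 0 : r;
}
```

Issuing the same rename twice then succeeds both times, which is what a duplicated command needs.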
Loic Dachary [Thu, 16 Oct 2014 00:02:58 +0000 (17:02 -0700)]
crush: improve constness of CrushWrapper methods
A number of CrushWrapper get methods or predicates were not const
because they need to transparently maintain the rmaps. Make the rmaps
mutable and update the constness of the methods to match what the caller
would expect.
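The pattern of a lazily maintained reverse map behind const accessors can be sketched as follows. ToyWrapper and its members are hypothetical names and do not mirror the CrushWrapper layout. The -1 returned for a missing name is an arbitrary sentinel for the sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical wrapper with an id -> name map and a cached reverse map.
class ToyWrapper {
public:
  void set_name(int id, const std::string &name) {
    names_[id] = name;
    have_rmap_ = false;  // invalidate the cached reverse map
  }

  // Const predicate: the reverse map is rebuilt transparently if needed.
  bool name_exists(const std::string &name) const {
    build_rmap();
    return rmap_.count(name) > 0;
  }

  // Const getter; returns -1 (sketch-only sentinel) when the name is unknown.
  int get_id(const std::string &name) const {
    build_rmap();
    auto it = rmap_.find(name);
    return it == rmap_.end() ? -1 : it->second;
  }

private:
  // Const because it only touches the mutable cache members.
  void build_rmap() const {
    if (have_rmap_)
      return;
    rmap_.clear();
    for (const auto &kv : names_)
      rmap_[kv.second] = kv.first;
    have_rmap_ = true;
  }

  std::map<int, std::string> names_;
  mutable std::map<std::string, int> rmap_;  // cache, safe to update in const
  mutable bool have_rmap_ = false;
};
```

Callers holding a const reference can then use the predicates without a const_cast.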