Sage Weil [Fri, 24 Oct 2014 16:32:20 +0000 (09:32 -0700)]
osdc/Objecter: fix tick_event handling in shutdown vs tick race
If we fail to cancel the tick_event, we rely on tick() itself to clear
tick_event. I'm not quite sure how we got this wrong in the previous
commit, but this boils down to two cases:
1) shutdown() successfully cancels the event and clears tick_event. tick()
never runs. tick_event == NULL when we finish.
2) shutdown() fails to cancel the event because it has already started. In
this case tick() itself is blocking (or about to block) waiting on the
rlock. When it does run it will clear tick_event itself, then see
initialized == 0 and exit without rescheduling.
Fixes: #9873
Signed-off-by: Sage Weil <sage@redhat.com>
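The two cases above can be walked through with a small sequential model. This is only a sketch: MiniObjecter, tick_event_set, and the cancel_ok flag are illustrative names, not the Objecter code, and the locking is left out entirely.

```cpp
#include <cassert>

// Sequential model of the two shutdown-vs-tick cases; all names are
// illustrative only and no real locking or timer is involved.
struct MiniObjecter {
  bool initialized = true;
  bool tick_event_set = true;   // stands in for tick_event != NULL
  int reschedules = 0;

  // cancel_ok models whether the timer managed to cancel the pending event.
  void shutdown(bool cancel_ok) {
    initialized = false;
    if (cancel_ok) {
      tick_event_set = false;   // case 1: tick() will never run
    }
    // case 2: the callback has already started, so tick() clears the event
  }

  void tick() {
    tick_event_set = false;     // tick() always clears its own event first
    if (!initialized) {
      return;                   // shut down: exit without rescheduling
    }
    tick_event_set = true;      // normal operation: schedule the next tick
    ++reschedules;
  }
};
```

In both cases the model ends with no pending event: either shutdown(true) clears it, or shutdown(false) followed by tick() does.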
Sage Weil [Fri, 24 Oct 2014 16:20:41 +0000 (09:20 -0700)]
common/Timer: recheck stopping before sleep if we dropped the lock
If we have safe_callbacks==false, the stopping flag may have changed while
we were doing our callback. Recheck it and exit to avoid a deadlock on
shutdown.
Noah Watkins [Thu, 23 Oct 2014 20:22:52 +0000 (13:22 -0700)]
java: fill in stat structure correctly
A stat-filling helper function was added, but only stat and lstat were
updated to use it. This patch makes fstat use it too. Crucially, fstat
wasn't updating the mode flags.
We are dropping the requirement for MON_CAP_R for MMonGetMap.
Reason is simple enough: clients may need to contact the monitors and
obtain the latest monmap before authenticating. This happens, for
instance, when a client calls MonClient::get_monmap_privately(). The
osd uses this function during mkfs, prior to initializing a keyring or
even so much as existing.
Fixes: #9859
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Samuel Just [Mon, 20 Oct 2014 21:10:58 +0000 (14:10 -0700)]
PG: call reset_interval_flush in set_last_peering_reset
If we have a change in the prior set, but not in the up/acting set, we go back
through Reset in order to reset peering state. Previously, we would reset
last_peering_reset in the Reset constructor. This did not, however, reset the
flush_interval, which caused the eventual flush event to be ignored and the
peering messages to not be sent.
Instead, we will always reset_interval_flush if we are actually changing the
last_peering_reset value.
Fixes: #9821
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
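The rule in the last paragraph (reset the interval flush whenever last_peering_reset actually changes) might look roughly like this. MiniPG and its counter are hypothetical stand-ins for illustration, not the PG class.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant bits of PG state.
struct MiniPG {
  unsigned last_peering_reset = 0;
  unsigned flush_resets = 0;    // counts calls, for illustration only

  void reset_interval_flush() { ++flush_resets; }

  // Tie the flush reset to the value change itself, so no caller
  // (such as a Reset constructor) can update one without the other.
  void set_last_peering_reset(unsigned epoch) {
    if (last_peering_reset != epoch) {
      last_peering_reset = epoch;
      reset_interval_flush();
    }
  }
};
```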
Loic Dachary [Mon, 13 Oct 2014 12:48:27 +0000 (14:48 +0200)]
erasure-code: add ErasureCode::encode unit test
Re-create and describe the situation that is fixed by 91a7e18f60bbc9acab3045baaa1b6505474ec4a9 which reworks the buffer
preparation function provided by ErasureCode::encode.
Janne Grunau [Mon, 29 Sep 2014 12:34:31 +0000 (14:34 +0200)]
erasure code: use 32-byte aligned buffers
Requiring page aligned buffers and realigning the input if necessary
creates measurable overhead. ceph_erasure_code_benchmark is 10-20%
faster, depending on the workload.
Also prevents a misaligned buffer when bufferlist::c_str(bufferlist)
has to allocate a new buffer to provide a contiguous one. See bug #9408
Loic Dachary [Mon, 13 Oct 2014 14:32:18 +0000 (16:32 +0200)]
common: add an aligned buffer with less alignment than a page
SIMD optimized erasure code computation needs aligned memory. Buffers
aligned to a page boundary are not needed though. The buffers used
for the erasure code computation are typically smaller than a page.
The typical alignment requirements for SIMD operations are 16 bytes for
SSE2 and NEON and 32 bytes for AVX/AVX2.
Add new prototypes with an align argument, similar to the one enforcing
page alignment. The implementation is exactly the same, except for the
align parameter. The page alignment methods are then implemented as calls
to the more generic methods.
The align parameter is an unsigned (same type as CEPH_PAGE_SIZE). The
CEPH_PAGE_MASK value ( ~(CEPH_PAGE_SIZE - 1) ) was only used as
~CEPH_PAGE_MASK, i.e. equivalent of (CEPH_PAGE_SIZE - 1) once the double
~~ is reduced. These occurrences are replaced with (align - 1). The type
of CEPH_PAGE_MASK is an unsigned long, which is probably because it was
~(CEPH_PAGE_SIZE). When using (align - 1) as a mask for both
CEPH_PAGE_SIZE and SIMD alignment there is no need to use an unsigned
long because there is no risk of overflowing the unsigned value.
The CYGWIN specific code is also modified but not tested.
Unit tests are added for the new methods.
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
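The (align - 1) mask described above can be illustrated with a few standalone helpers. This is a minimal sketch, not the buffer code: kPageSize stands in for CEPH_PAGE_SIZE with an assumed value of 4096, and the function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr unsigned kPageSize = 4096;  // assumed stand-in for CEPH_PAGE_SIZE

// True if p is a multiple of align. align must be a power of two so that
// (align - 1) is a mask of the low bits.
inline bool is_aligned_to(const void* p, unsigned align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}

// Round n up to the next multiple of align (again a power of two).
inline std::size_t round_up_to(std::size_t n, unsigned align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  const std::size_t mask = static_cast<std::size_t>(align) - 1;
  return (n + mask) & ~mask;
}

// The page-aligned variant is just the generic one with align = page size.
inline bool is_page_aligned_ptr(const void* p) {
  return is_aligned_to(p, kPageSize);
}
```

With these, a 33-byte request rounds up to 64 bytes at 32-byte (AVX) alignment, and an address of 48 is 16-byte aligned but not 32-byte aligned.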
Adam Crume [Wed, 8 Oct 2014 00:45:53 +0000 (17:45 -0700)]
Fix read performance regression in ObjectCacher
The regression was introduced in commit 4fc9fffc494abedac0a9b1ce44706343f18466f1. The problem is that the cache
thinks it's full (when it's not), so it defers the read. This change
frees up cache space if necessary and only defers the read if enough
space cannot be freed.
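The "free space first, defer only if that fails" decision can be sketched as follows. MiniCache and its byte counters are hypothetical; the sketch assumes only clean data can be evicted and ignores everything else ObjectCacher tracks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cache accounting: only clean bytes are assumed evictable.
struct MiniCache {
  std::size_t max_size;
  std::size_t clean_bytes;
  std::size_t dirty_bytes;

  std::size_t used() const { return clean_bytes + dirty_bytes; }

  // Try to guarantee `need` free bytes, evicting clean data if required.
  // Returns false (and evicts nothing) when that is not possible.
  bool make_room(std::size_t need) {
    const std::size_t free_bytes = used() >= max_size ? 0 : max_size - used();
    if (free_bytes >= need) {
      return true;                 // already fits, nothing to evict
    }
    const std::size_t shortfall = need - free_bytes;
    if (shortfall > clean_bytes) {
      return false;                // cannot free enough space
    }
    clean_bytes -= shortfall;      // evict just enough clean data
    return true;
  }

  // Defer the read only when enough space cannot be freed.
  bool must_defer_read(std::size_t need) { return !make_room(need); }
};
```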
qa/workunits: cephtool: don't remove self's key on auth tests
Suites run with CEPH_TEST_CLI_DUP_COMMAND=1, which will send a duplicate
command for every command issued with the 'ceph' tool. Behavior is to
get a reply from the command and then send a duplicate, looking for the
same outcome (guaranteeing idempotency of the operations). However, it
so happens that if you remove the entity's own key from the keyring and
you happen to be unlucky enough so that the client's connection gets
failed (we also run tests with connection failure injections), the
'ceph' tool won't be able to reconnect to the cluster to send the
duplicate command (as its entity no longer exists in the cluster's
keyring).
We rewrite the test instead of resorting to ugly hacks to work around
this behavior, simply having a new 'role-definer' added by the existing
'role-definer' (which we weren't testing anyway, so bonus points for
that) and then have one removing the other (to test the procedure) and
finally using 'client.admin' to remove the last 'role-definer'.
Fixes: #9820
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Guang Yang [Mon, 13 Oct 2014 04:18:45 +0000 (04:18 +0000)]
The fix for issue 9614 was incomplete. As a result, erasure coded PGs with
one OSD down were wrongly marked as active+clean+degraded. This patch makes
sure the clean flag is not set for such PGs.
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Greg Farnum [Thu, 9 Oct 2014 22:12:19 +0000 (15:12 -0700)]
mds: Locker: fix a NULL deref in _update_cap_fields
The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.
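The shape of this fix, shown on made-up types (CapsMsg and InodeFields are illustrative, not the MDS structures): the message pointer is only dereferenced inside the branch where dirty is non-zero.

```cpp
#include <cassert>

// Illustrative stand-ins; not the real MClientCaps or inode types.
struct CapsMsg { unsigned long long size; };
struct InodeFields { unsigned long long size = 0; };

// m may be null when dirty == 0, so nothing is read from it up front.
void update_cap_fields(InodeFields& in, int dirty, const CapsMsg* m) {
  if (dirty != 0) {
    assert(m != nullptr);   // a non-zero dirty mask implies a message
    in.size = m->size;      // lookup happens only inside the if block
  }
}
```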
David Zafman [Wed, 24 Sep 2014 23:02:21 +0000 (16:02 -0700)]
osd: Return EOPNOTSUPP if a set-alloc-hint occurs with OSDs that don't support
Add CEPH_FEATURE_OSD_SET_ALLOC_HINT feature bit
Collect the intersection of all peer feature bits during peering
When handling CEPH_OSD_OP_SETALLOCHINT check that all OSDs support it
by checking for CEPH_FEATURE_OSD_SET_ALLOC_HINT feature bit.
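The three steps above could be sketched like this. The bit position chosen for kFeatureSetAllocHint is an arbitrary placeholder, not the real value of CEPH_FEATURE_OSD_SET_ALLOC_HINT, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <vector>

// Placeholder bit; the real feature bit value is not assumed here.
constexpr std::uint64_t kFeatureSetAllocHint = 1ULL << 5;

// Intersection of our own features with those of every peer.
std::uint64_t intersect_features(std::uint64_t own,
                                 const std::vector<std::uint64_t>& peers) {
  std::uint64_t common = own;
  for (std::uint64_t f : peers) {
    common &= f;
  }
  return common;
}

// Returns 0 if every OSD supports the op, -EOPNOTSUPP otherwise.
int check_set_alloc_hint(std::uint64_t common_features) {
  return (common_features & kFeatureSetAllocHint) ? 0 : -EOPNOTSUPP;
}
```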
Noah Watkins [Sun, 5 Oct 2014 20:15:13 +0000 (13:15 -0700)]
client: clean-up objecter on failed client init
During mount() the objecter isn't shut down if the mon client fails to
initialize. Objecter asserts in its destructor expect it to have been
shut down, but this was skipped.
hadoop@plana85:~$ ./hadoop/bin/hadoop fs -ls /
14/10/05 12:35:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
osdc/Objecter.cc: In function 'virtual Objecter::~Objecter()' thread 7ff422705700 time 2014-10-05 12:35:51.090776
osdc/Objecter.cc: 3927: FAILED assert(!m_request_state_hook)
ceph version 0.85-981-g25bcc39 (25bcc39bb809e2d13beea1529e4ab92d1b61fa5b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x7ff3f5c28f7f]
2: (Objecter::~Objecter()+0x397) [0x7ff3f5bca707]
3: (Objecter::~Objecter()+0x9) [0x7ff3f5bca8b9]
4: (Client::~Client()+0x7d) [0x7ff3f5b6770d]
5: (Client::~Client()+0x9) [0x7ff3f5b680a9]
6: (ceph_mount_info::mount(std::string const&)+0x149) [0x7ff3f5b1fa49]
7: (ceph_mount()+0x4e) [0x7ff3f5b1dcbe]
8: (Java_com_ceph_fs_CephMount_native_1ceph_1mount()+0xb7) [0x7ff4158b1c97]
9: [0x7ff41839dd68]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted
Johnu George [Mon, 29 Sep 2014 17:07:44 +0000 (10:07 -0700)]
Crush: Ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per the replica size configured
by the user. When there are more final osds (to be selected as per
rule) than replicas, the buffers overlap and this causes a crash. Now, it
ensures that at most num-rep osds are selected even if more
osds are allowed by the indep rule. The fix for firstn rules is already
merged as part of bug #9492. Required test files are added.
Johnu George [Wed, 24 Sep 2014 16:32:50 +0000 (09:32 -0700)]
Crush: Ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per the replica size configured
by the user. When there are more final osds (to be selected as per
rule) than replicas, the buffers overlap and this causes a crash. Now, it
ensures that at most num-rep osds are selected even if more
osds are allowed by the rule.
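The clamp described in the two Crush commits above amounts to taking the smaller of "what the rule allows" and num-rep. The sketch below shows only that bound; select_osds and its parameters are hypothetical, and the actual CRUSH selection logic is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Picks candidates in order, but never more than num_rep of them, because
// the scratch buffers are assumed to be sized for num_rep entries.
std::vector<int> select_osds(const std::vector<int>& candidates,
                             std::size_t rule_count,
                             std::size_t num_rep) {
  const std::size_t limit = std::min(rule_count, num_rep);
  std::vector<int> out;
  out.reserve(num_rep);
  for (std::size_t i = 0; i < candidates.size() && out.size() < limit; ++i) {
    out.push_back(candidates[i]);
  }
  return out;
}
```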
mon: Monitor: let 'handle_command()' deal with caps validation
If a given client doesn't have the required caps when running a command,
it must receive an EACCES or EPERM reply. This is already handled by
Monitor::handle_command(), which does an exceptionally good job at it.
Therefore, and unlike other messages that do not expect return values,
we can't simply drop the message if the client doesn't have the
appropriate capabilities, or things can get very weird very fast from
the user's perspective. Dropping the message for a command without a
reply has roughly the same effect as loss of quorum (timeout, pipes
failing) and confusion may ensue from it.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
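The point of the commit above, reduced to its core: a command always gets a reply, and missing caps turn into an error code rather than silence. CommandReply and this handle_command signature are simplified placeholders, not the Monitor interface.

```cpp
#include <cassert>
#include <cerrno>

// Simplified placeholder for a command reply.
struct CommandReply { int rc; };

// Never drop a command: a client without the required caps still gets
// an answer, here -EACCES, instead of waiting for a timeout.
CommandReply handle_command(bool has_required_caps) {
  if (!has_required_caps) {
    return CommandReply{-EACCES};
  }
  return CommandReply{0};
}
```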
qa/workunits: mon: auth_caps: account for mon blank caps
Test creating an entity with blank caps with and without '--force'
being specified. Without '--force' they must fail with EINVAL as the
monitor will not be able to parse them.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
mon: MonCommands: adjust indentation for 'auth add'
Eye-candy. We changed indentation of a few other entries and this one
was just too darn obvious, itching all over, night terrors, the whole
nine yards.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
qa/workunits: mon: auth_caps: variables must be local
We have variables with the same name that are being shared! We don't
hit any issues with it currently because the code just kind of works
even though that happens. Add a bit of new logic that relies on an
immutable return code (for instance) and we're in the woods.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
mon: Monitor: create logical divisions on dispatch() based on msg nature
Instead of a single switch(), have multiple switch() and order them by
increasing necessity of privileges.
This patch thus divides the big switch into:
- messages not requiring auth/caps checks at all
- messages whose caps shall be checked somewhere else
- messages the Monitor class needs to deal with but only require a
client to have enough caps for the monitor to consider handling them
- messages that only a monitor is allowed to send.
Backport: firefly
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
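The four groups listed above can be pictured as a gate that each message passes before it is handled. This is a loose sketch: the enum, the Sender struct, and the single switch are all illustrative, whereas the patch itself uses several switch() statements inside Monitor::dispatch().

```cpp
#include <cassert>

// Illustrative message groups, ordered by increasing privilege needs.
enum class MsgClass {
  NoAuthNeeded,          // no auth/caps checks at all
  CapsCheckedElsewhere,  // caps are validated by whoever handles it
  NeedsClientCaps,       // client must have enough caps to be considered
  MonitorOnly            // only another monitor may send these
};

struct Sender {
  bool is_monitor;
  bool has_enough_caps;
};

// Returns true if the message may proceed to its handler.
bool dispatch_gate(MsgClass c, const Sender& s) {
  switch (c) {
    case MsgClass::NoAuthNeeded:
    case MsgClass::CapsCheckedElsewhere:
      return true;
    case MsgClass::NeedsClientCaps:
      return s.has_enough_caps;
    case MsgClass::MonitorOnly:
      return s.is_monitor;
  }
  return false;  // unknown class: refuse
}
```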