Yan, Zheng [Sat, 29 Mar 2014 02:36:12 +0000 (10:36 +0800)]
mds: find approximal bounds when adjusting subtree auth
When it finishes exporting a subtree, the exporter MDS drops locks and
sends an MExportDirFinish message to the importer MDS. The bounds of
the subtree can get fragmented by a third party before the importer MDS
receives the MExportDirFinish message, so the importer MDS can add
inaccurate bounds to the EImportFinish event.
The fix is to find approximate bounds when finishing ambiguous imports.
Loic Dachary [Sat, 29 Mar 2014 09:34:29 +0000 (10:34 +0100)]
erasure-code: do not attempt to compile SSE4 on i386
SSE4 is unavailable only on older CPUs. Although the compiler could
probably generate the code, there is no point in doing so. The SSE4.1,
SSE4.2 and PCLMUL CPU features are only tested if the target CPU is
AMD64 or x86_64.
Yan, Zheng [Fri, 28 Mar 2014 17:53:15 +0000 (01:53 +0800)]
mds: commit new dirfrag before splitting it
Commit 6e013cd6 (properly set COMPLETE flag when merging dirfrags)
tries to solve the issue that a new dirfrag's COMPLETE flag gets lost
if the MDS splits the new dirfrag and the fragment operation then gets
rolled back. It records the original dirfrag's COMPLETE flag when the
EFragment PREPARE event is encountered. If the fragment operation
needs to be rolled back, the COMPLETE flag is journaled in the
corresponding EFragment ROLLBACK event. This is problematic when the
ROLLBACK event and the "mkdir" event belong to different log segments.
After the log segment that contains the "mkdir" event is trimmed, the
dirfrag can no longer be considered complete.
The fix is to commit the new dirfrag before splitting it. After the
dirfrag is committed to the object store, losing the COMPLETE flag is
not a big deal.
Dmitry Smirnov [Sat, 29 Mar 2014 00:59:24 +0000 (11:59 +1100)]
init: fix OSD startup issue
On machines that host both a MON and OSDs, the OSDs start shortly after
the MON at boot, but the MON needs time to become operational. The OSDs
then fail to start because the short timeout does not leave them enough
time to establish communication with the cluster. This is even more
likely when other monitors are down, which is not unusual when servers
are rebooting after a power failure.
Increasing the timeout significantly improves the chances of a
successful OSD start.
Sage Weil [Fri, 28 Mar 2014 19:34:07 +0000 (12:34 -0700)]
osdc/ObjectCacher: call read completion even when no target buffer
If we do not assemble a target bl, we still want to return a valid return
code with the number of bytes read-ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.
Sage Weil [Fri, 28 Mar 2014 20:10:06 +0000 (13:10 -0700)]
osd/PG: fix choose_acting revert to up case
If we decide to revert back to up, we need to
1- return false, so that we go into the NeedActingChange state, and
2- actually ask for that change.
It's too fugly to try to jump down to the existing queue_want_pg_temp
call 100+ lines down in this function, so just do it here. We already
know that we are requesting to clear the pg_temp.
Fixes: #7902
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Wed, 26 Mar 2014 02:15:15 +0000 (10:15 +0800)]
mds: fix negative dirstat assertion
When splitting a dirfrag, the delta dirstat is always added to the first
new dirfrag. Before the delta dirstat is propagated to the inode,
unlinking files from the remaining dirfrags can cause a negative inode
dirstat.
Yan, Zheng [Wed, 26 Mar 2014 01:51:23 +0000 (09:51 +0800)]
mds: fix stack overflow caused by nested dispatch
Commit bc3325b37 fixed a stack overflow bug that happens when replaying
client requests. A similar stack overflow can happen when processing
finished contexts.
Yan, Zheng [Mon, 24 Mar 2014 08:47:04 +0000 (16:47 +0800)]
mds: don't clear scatter dirty when cache rejoin ack is received
The auth MDS has received the dirty scatterlock state, but it hasn't
journaled the dirty state yet. The log segment that marked the
scatterlock dirty needs to be preserved. Therefore, we can't clear
the dirty flag of the scatterlock.
Yan, Zheng [Sun, 23 Mar 2014 09:47:05 +0000 (17:47 +0800)]
mds: trim empty non-auth dirfrags
Fragmenting a non-auth dirfrag results in several smaller dirfrags. Some
of the resulting dirfrags can be empty and are not needed to connect
to an auth subtree.
Yan, Zheng [Thu, 20 Mar 2014 03:30:46 +0000 (11:30 +0800)]
mds: handle race between cache rejoin and fragmenting
MDCache::handle_cache_expire() ignores mismatched dirfrags. This is
OK during normal operation because the MDS doesn't trim a replica inode
whose dirfrags are likely being fragmented (see commit 22535340).
During recovery, the recovering MDS can receive a survivor MDS' cache
expire message before it sends cache rejoin acks. In this case,
there can still be mismatched dirfrags, but nothing prevents the
survivor MDS from trimming the inode of these mismatched dirfrags. So
there can be unconnected dirfrags when the recovering MDS sends cache
rejoin acks.
The fix is, when a mismatched dirfrag is encountered during recovery,
to check whether the inode of the dirfrag is still replicated to the
sender MDS. If the inode is not replicated, remove the sender MDS from
the replica maps of all child dirfrags.
Yan, Zheng [Wed, 19 Mar 2014 11:56:26 +0000 (19:56 +0800)]
mds: handle interaction between slave rollback and fragmenting
For slave rename and rmdir events, the MDS needs to preserve the non-auth
dirfrag where the renamed inode originally lived until the slave commit
event is encountered. The current method is to use
MDCache::uncommitted_slave_rename_olddir to track any non-auth dirfrag
that needs to be preserved. This method does not work well if any
preserved dirfrag gets fragmented by a log event (such as ESubtreeMap)
between the slave prepare event and the slave commit event.
The fix is to track the inode of the dirfrag instead of directly
tracking the dirfrag that needs to be preserved.
Yan, Zheng [Fri, 28 Mar 2014 04:57:29 +0000 (12:57 +0800)]
mds: properly propagate dirty dirstat to auth inode
Propagate dirty dirstat to a freezing auth inode if the inode is
already auth pinned by the Mutation. Otherwise the dirstat can
be propagated to the inode after the client changes the inode's mtime.
Sage Weil [Thu, 27 Mar 2014 22:12:25 +0000 (15:12 -0700)]
osd/ReplicatedPG: tolerate missing clones in cache pools
A few cases:
- As we are working through the list, if we see a clone that is lower than
the next one we were expecting, we should be able to skip them.
- If we see a head, we can skip all of the rest of the clones.
- If we get to the end and next_clone was set, we can ignore it.
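The three cases above can be illustrated with a simplified model. This is only a sketch, not the ReplicatedPG scrub code: the function name, the use of plain integers for clone ids, the descending ordering, and the assumption that every clone seen is one of the expected ones are all hypothetical.

```python
def tolerate_missing_clones(expected, seen):
    """Return the expected clone ids that were skipped as missing.

    expected: clone ids in descending order (assumed ordering).
    seen:     clone ids actually found, also descending, and assumed
              to be a subset of `expected`.
    """
    missing = []
    pending = list(expected)
    for clone in seen:
        # A clone lower than the one we expected next means the ones in
        # between are absent, so skip over them instead of failing.
        while pending and pending[0] > clone:
            missing.append(pending.pop(0))
        if pending and pending[0] == clone:
            pending.pop(0)
    # Reaching the end (or the next head) with clones still pending is
    # tolerated as well: the remaining ones are simply recorded.
    missing.extend(pending)
    return missing
```

For example, with expected clones `[7, 5, 3]` and only `[7, 3]` present, the sketch reports `[5]` as missing rather than raising an error.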
Sage Weil [Thu, 27 Mar 2014 20:51:15 +0000 (13:51 -0700)]
osd/ReplicatedPG: improve clone vs head checking
- notice when we are missing a clone (that isn't at the end of the list)
- notice when we are missing a clone on the last object in the scrub map
- do not assert when we are missing a clone
There is still more we could do to improve this (like noticing one missing
clone but still checking the others), but we'll leave that aside for just
a moment...
Sage Weil [Thu, 27 Mar 2014 20:28:10 +0000 (13:28 -0700)]
ceph_test_rados_api_tier: scrub while cache tier is missing clones
Trigger a scrub to verify that we can handle a cache tier that is missing
some clones. We rely on the test harness to notice the error, and we do
not confirm that the scrub happened. In practice this is plenty of time,
however.
Yehuda Sadeh [Thu, 27 Mar 2014 17:53:25 +0000 (10:53 -0700)]
rgw: use s->content_length instead of s->length
Fixes: #7876
Need to use the actual content length, not the pointer to the string.
This was probably working because content_length > 0 correlates with
s->length being non-null.
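One plausible way the two values can disagree is sketched below. The `Request` class and both helper functions are hypothetical stand-ins, not rgw code, and the sketch assumes the pointer was being used as a truth test: `length` models the raw header string (absent or present) and `content_length` models the parsed integer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # Hypothetical stand-in for the request state.
    length: Optional[str]   # raw header string, None when absent
    content_length: int     # parsed numeric value


def has_body_by_pointer(s: Request) -> bool:
    # Mirrors testing a pointer: true whenever the header string exists.
    return s.length is not None


def has_body_by_value(s: Request) -> bool:
    # Uses the parsed value, so an explicit "0" counts as empty.
    return s.content_length > 0
```

The two checks agree when the header is absent or carries a positive value, which matches the correlation the message describes. They differ for a header that is present but says `"0"`.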
Loic Dachary [Thu, 27 Mar 2014 10:07:11 +0000 (11:07 +0100)]
erasure-code: test encode/decode of SSE optimized jerasure plugins
If the machine running make check has the required CPU features
available, load the SSE optimized plugin and check that it can encode /
decode a simple payload. If the CPU features are not available, only
test the generic plugin and display an informative message about the
tests that were skipped.
Loic Dachary [Thu, 27 Mar 2014 10:06:24 +0000 (11:06 +0100)]
erasure-code: test jerasure SSE optimized plugins selection
Test the selection of the plugin depending on the CPU features. The
prefix of the plugin is "jerasure" by default (jerasure_generic,
jerasure_sse3, jerasure_sse4) and can be modified with the
"jerasure-name" parameter. A test plugin is created for each
variant (test_jerasure_generic, test_jerasure_sse3, test_jerasure_sse4).
The flags set by ceph_probe are modified by the test to check if the
expected plugin suffix is appended.
Loic Dachary [Wed, 26 Mar 2014 10:16:01 +0000 (11:16 +0100)]
erasure-code: SSE optimized jerasure plugins
The jerasure plugin is compiled with three sets of flags:
* jerasure_generic with no SSE optimization
* jerasure_sse3 with SSE2, SSE3 and SSSE3 optimizations
* jerasure_sse4 with SSE2, SSE3, SSSE3, SSE41, SSE42 and PCLMUL optimizations
The jerasure plugin loads the appropriate plugin depending on the CPU
features detected at runtime.
To avoid confusion, the jerasure v1 branch that contains commits pending
review upstream is named v2-ceph and the gf-complete v2 branch is named
v2-ceph.
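The runtime selection can be sketched roughly as follows. This is an illustration only: the function name and the lowercase feature labels are assumptions, and the real plugin relies on the flags set by ceph_probe rather than on a Python set. The `name` argument models the "jerasure-name" prefix.

```python
def select_jerasure_plugin(cpu_features, name="jerasure"):
    """Pick the plugin variant matching the detected CPU features."""
    # Feature sets taken from the three compile variants listed above.
    sse3 = {"sse2", "sse3", "ssse3"}
    sse4 = sse3 | {"sse41", "sse42", "pclmul"}
    features = set(cpu_features)
    if sse4 <= features:
        suffix = "sse4"
    elif sse3 <= features:
        suffix = "sse3"
    else:
        suffix = "generic"
    return f"{name}_{suffix}"
```

A CPU that lacks any one of the SSE4-variant features falls back to the SSE3 variant, and one that lacks any of the SSE3-variant features falls back to the generic plugin.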
Loic Dachary [Sat, 25 Jan 2014 21:35:34 +0000 (22:35 +0100)]
erasure-code: allow loading a plugin from factory()
The Mutex scope is restricted to only protect the load() method and not
the factory() method. This allows a plugin to load another plugin from
within the factory() method.
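The effect of narrowing the lock scope can be shown with a small model. The `Registry` class, its method names, and the plugin callables below are hypothetical and only mimic the structure described; they are not the Ceph plugin registry API.

```python
import threading


class Registry:
    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a plain mutex
        self._loaders = {}
        self._plugins = {}

    def register(self, name, loader):
        self._loaders[name] = loader

    def _load(self, name):
        # Only the load step is protected by the lock.
        with self._lock:
            if name not in self._plugins:
                self._plugins[name] = self._loaders[name]()
            return self._plugins[name]

    def factory(self, name):
        plugin = self._load(name)
        # The lock is released here, so the plugin's factory may call
        # back into the registry to load another plugin.
        return plugin(self)


registry = Registry()
registry.register("inner", lambda: (lambda reg: "inner-codec"))
registry.register(
    "wrapper",
    lambda: (lambda reg: "wrapped(" + reg.factory("inner") + ")"),
)
result = registry.factory("wrapper")
```

If the lock were still held while `plugin(self)` runs, the nested `factory("inner")` call would block forever on the non-reentrant lock.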
Sage Weil [Thu, 27 Mar 2014 00:47:06 +0000 (17:47 -0700)]
osd: do not make pg_pool_t incompat when hit_sets are enabled
If we enable HitSet tracking, the OSD needs to know this, but clients do
not care. Setting the compat version is too heavyweight as it locks out
older kernels (*any* kernel, currently) that are unaffected by the new
fields.
Sage Weil [Sat, 22 Mar 2014 22:39:26 +0000 (15:39 -0700)]
osd: trim copy-get backend read to object size
We are passing a big number to the backend to read and it is trimming it
to the stripe boundary, and then setting the cursor at a slightly smaller
offset bound by oi.size. This is invalid, and will trigger an assert in
the _write_copy_chunk code:
0> 2014-03-21 15:12:23.761509 7f8dd2324700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::_write_copy_chunk(ReplicatedPG::CopyOpRef, PGBackend::PGTransaction*)' thread 7f8dd2324700 time 2014-03-21 15:12:23.758866
osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
To fix this, trim the buffer to the correct length in the completion
context.
Fixes: #7823
Signed-off-by: Sage Weil <sage@inktank.com>
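The mismatch and the trim can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the ReplicatedPG code: the function name, the zero padding up to the stripe boundary, and the requirement that `offset` does not exceed the object size are all hypothetical.

```python
def copy_get_chunk(obj: bytes, offset: int, max_len: int, stripe: int):
    """Read up to max_len bytes at offset and return (data, cursor)."""
    size = len(obj)
    # Assumed backend behaviour: an oversized read is cut at the stripe
    # boundary, which can lie past the end of the object.
    stripe_end = -(-size // stripe) * stripe  # size rounded up to a stripe
    end = min(offset + max_len, stripe_end)
    padded = obj + bytes(end - size) if end > size else obj
    data = padded[offset:end]
    # The fix: trim the buffer to the object size in the completion, so
    # the data length and the cursor offset stay consistent.
    if offset + len(data) > size:
        data = data[: size - offset]
    cursor = offset + len(data)
    return data, cursor
```

With a 10-byte object and a 4-byte stripe, an oversized read from offset 0 comes back as 12 bytes before the trim and 10 bytes after it, so the data length plus the starting offset equals the cursor offset, which is the relation the failed assert checks.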