mds: add dentries in dirfrag to LRU in reverse order
Files in a dirfrag are usually processed in the order of readdir
results. Files at the beginning of the results are more likely to be used in
the future than files at the end.
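The effect of the insertion order can be sketched with a generic LRU. This is a minimal illustration, not the MDS's own LRU class: the `LRU` class, `add_dentries_reversed`, and the file names are all hypothetical. Inserting in reverse readdir order leaves the first readdir entries as the most recently used, so they are evicted last.

```python
from collections import OrderedDict


class LRU:
    """Tiny LRU (hypothetical): the last key is the most recently used."""

    def __init__(self):
        self._items = OrderedDict()

    def insert(self, key):
        # Insert (or re-insert) the key as the most recently used entry.
        self._items.pop(key, None)
        self._items[key] = True

    def expire(self):
        # Evict and return the least recently used key.
        key, _ = self._items.popitem(last=False)
        return key


def add_dentries_reversed(lru, readdir_order):
    # Walk the readdir results backwards so the first entries are
    # inserted last and therefore end up as the most recently used.
    for name in reversed(readdir_order):
        lru.insert(name)


lru = LRU()
add_dentries_reversed(lru, ["a", "b", "c", "d"])
evicted = [lru.expire() for _ in range(2)]
print(evicted)  # entries from the end of the readdir results go first
```

With this ordering, cache pressure trims the tail of the directory listing before the head.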
Josh Durgin [Mon, 31 Mar 2014 21:53:31 +0000 (14:53 -0700)]
librbd: skip zeroes when copying an image
This is the simple coarse-grained solution, but it works well in
common cases like a small base image resized with a bunch of empty
space at the end. Finer-grained sparseness can be copied by using rbd
{export,import}-diff.
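The coarse-grained idea can be sketched as a chunk-by-chunk copy that skips any chunk consisting only of zero bytes. This is an illustration under stated assumptions, not the librbd code: `copy_skipping_zeroes`, the in-memory byte buffers, and the chunk size are hypothetical, and the destination is assumed to read as zeros wherever nothing was written.

```python
def copy_skipping_zeroes(src: bytes, chunk_size: int):
    """Copy src chunk by chunk, writing only chunks with nonzero data.

    Returns the destination contents and the number of chunk writes issued.
    """
    # Assumption: a freshly created destination reads back as zeros.
    dst = bytearray(len(src))
    writes = 0
    for off in range(0, len(src), chunk_size):
        chunk = src[off:off + chunk_size]
        if any(chunk):  # at least one nonzero byte in this chunk
            dst[off:off + len(chunk)] = chunk
            writes += 1
    return bytes(dst), writes


# A small "base image" followed by empty space, as after a resize.
image = b"datadata" + bytes(24)
copied, writes = copy_skipping_zeroes(image, chunk_size=8)
print(writes, copied == image)
```

A chunk that mixes data and zeroes is still written whole, which is why this approach is coarse-grained.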
By default, we don't send out maps with primary_temp mappings because
there is no infrastructure in place that would make sure that the
entire cluster knows about primary_temp. Add an option to allow
primary_temp mappings, for development purposes.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 31 Mar 2014 17:42:23 +0000 (10:42 -0700)]
mon/PGMap: clear pool sum when last pg is deleted
Use the x.0 pg as a sentinel for the existence of the pool. Note that we
have to clean it up in two paths: apply_incremental (which is actually
deprecated) and the normal PGMonitor refresh.
Fixes: #7912
Signed-off-by: Sage Weil <sage@inktank.com>
Loic Dachary [Sat, 29 Mar 2014 10:30:42 +0000 (11:30 +0100)]
doc: pgbackend dev doc outdated notice
* Warn the reader that the implementation is ahead and may differ
* Update the links to the Firefly branch
* Remove links to issues used during development to avoid confusion
Loic Dachary [Sat, 29 Mar 2014 10:27:00 +0000 (11:27 +0100)]
doc: erasure code developer notes updates
Update the introduction to explain erasure code profiles. Remove
obsolete explanations about partial writes etc. Remove links to tickets
used during development. Update permalinks to be closer to
Firefly (v0.78).
Yan, Zheng [Sun, 30 Mar 2014 01:21:57 +0000 (09:21 +0800)]
fuse: implement 'access' low level function
Add an empty 'access' function to the fuse low level functions. This
allows us to use ceph-fuse with fuse_default_permissions = false.
'fuse_default_permissions = false' can significantly improve the
speed of creating/removing a large number of files.
When fuse_default_permissions is true, the fuse kernel module sends
a getattr request whenever the kernel needs to check a directory's
permission. getattr (STAT_CAP_INODE_ALL) can be very slow if the
directory was just modified.
Yan, Zheng [Sat, 29 Mar 2014 02:36:12 +0000 (10:36 +0800)]
mds: find approximal bounds when adjusting subtree auth
When it finishes exporting a subtree, the exporter MDS drops locks and
sends an MExportDirFinish message to the importer MDS. The bounds of
the subtree can get fragmented by a third party before the importer MDS
receives the MExportDirFinish message. So the importer MDS can add
inaccurate bounds to the EImportFinish event.
The fix is to find approximate bounds when finishing ambiguous imports.
Loic Dachary [Sat, 29 Mar 2014 09:34:29 +0000 (10:34 +0100)]
erasure-code: do not attempt to compile SSE4 on i386
SSE4 is not available on older CPUs. Although the compiler could
probably generate the code, there is no point in doing so. The SSE4.1,
SSE4.2 and PCLMUL cpu features are only tested if the target CPU is
AMD64 or x86_64.
Yan, Zheng [Fri, 28 Mar 2014 17:53:15 +0000 (01:53 +0800)]
mds: commit new dirfrag before splitting it
Commit 6e013cd6 (properly set COMPLETE flag when merging dirfrags)
tries to solve the issue that a new dirfrag's COMPLETE flag gets lost
if the MDS splits the new dirfrag and the fragment operation then gets
rolled back. It records the original dirfrag's COMPLETE flag when the
EFragment PREPARE event is encountered. If the fragment operation
needs to be rolled back, the COMPLETE flag is journaled in the
corresponding EFragment ROLLBACK event. This is problematic when the
ROLLBACK event and the "mkdir" event belong to different log segments.
After the log segment that contains the "mkdir" event is trimmed, the
dirfrag cannot be considered complete.
The fix is to commit the new dirfrag before splitting it. After the
dirfrag is committed to the object store, losing the COMPLETE flag is
not a big deal.
Dmitry Smirnov [Sat, 29 Mar 2014 00:59:24 +0000 (11:59 +1100)]
init: fix OSD startup issue
On machines that host both a MON and OSDs, the OSDs start shortly after the
MON at boot, but the MON needs time to become operational. The OSDs then fail
to start because their short timeout does not leave enough time to establish
communication with the cluster. This is even more likely when other monitors
are down, which is not unusual when servers reboot after a power failure.
Increasing the timeout significantly improves the chances of a successful OSD
start.
Sage Weil [Fri, 28 Mar 2014 04:09:13 +0000 (21:09 -0700)]
msgr: add KEEPALIVE2 feature
This is similar to KEEPALIVE, except a timestamp is also exchanged. It is
sent with the KEEPALIVE, and then returned with the ACK. The last
received stamp is stored in the Connection so that it can be queried for
liveness. Since all of the users of keepalive are already regularly
triggering a keepalive, they can check the liveness at the same time.
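The exchange described above can be sketched as follows. This is a toy model, not the messenger code: the `Connection` class, the tuple message format, the tag strings, and `peer_reply` are all hypothetical names chosen for illustration. The sender stamps the keepalive, the peer echoes the stamp in its ACK, and the connection stores the last stamp received so callers can query liveness.

```python
class Connection:
    """Toy connection (hypothetical) that tracks the last acked keepalive."""

    def __init__(self):
        # Timestamp echoed back in the most recent ACK, or None if none yet.
        self.last_keepalive_ack = None

    def send_keepalive(self, now):
        # The keepalive carries the sender's timestamp.
        return ("KEEPALIVE2", now)

    def handle_ack(self, msg):
        tag, stamp = msg
        if tag == "KEEPALIVE2_ACK":
            self.last_keepalive_ack = stamp

    def is_alive(self, now, timeout):
        # Alive if some ACK was seen and its stamp is recent enough.
        if self.last_keepalive_ack is None:
            return False
        return now - self.last_keepalive_ack <= timeout


def peer_reply(msg):
    # The peer returns the same timestamp with its ACK.
    _tag, stamp = msg
    return ("KEEPALIVE2_ACK", stamp)


conn = Connection()
conn.handle_ack(peer_reply(conn.send_keepalive(100.0)))
print(conn.is_alive(105.0, timeout=10.0), conn.is_alive(120.0, timeout=10.0))
```

A caller that already sends keepalives on a timer can call `is_alive` in the same tick, which mirrors the point made in the commit message.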
Sage Weil [Fri, 28 Mar 2014 19:34:07 +0000 (12:34 -0700)]
osdc/ObjectCacher: call read completion even when no target buffer
If we do not assemble a target bl, we still want to return a valid return
code with the number of bytes read-ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.
Sage Weil [Fri, 28 Mar 2014 20:10:06 +0000 (13:10 -0700)]
osd/PG: fix choose_acting revert to up case
If we decide to revert back to up, we need to
1- return false, so that we go into the NeedActingChange state, and
2- actually ask for that change.
It's too fugly to try to jump down to the existing queue_want_pg_temp
call 100+ lines down in this function, so just do it here. We already
know that we are requesting to clear the pg_temp.
Fixes: #7902
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Wed, 26 Mar 2014 02:15:15 +0000 (10:15 +0800)]
mds: fix negative dirstat assertion
When splitting a dirfrag, the delta dirstat is always added to the first new
dirfrag. Before the delta dirstat is propagated to the inode, unlinking a file
from the remaining dirfrags can cause a negative inode dirstat.
Yan, Zheng [Wed, 26 Mar 2014 01:51:23 +0000 (09:51 +0800)]
mds: fix stack overflow caused by nested dispatch
Commit bc3325b37 fixed a stack overflow bug that happens when replaying
client requests. A similar stack overflow can happen when processing
finished contexts.
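One common way to avoid this kind of nesting is to queue work that is produced while a dispatch is already running and drain the queue in a loop. The sketch below illustrates that general technique and is not necessarily what the commit does. `Dispatcher`, `run_chain`, and the callback shape are hypothetical.

```python
from collections import deque


class Dispatcher:
    """Runs callbacks without nesting: re-entrant calls only enqueue."""

    def __init__(self):
        self.pending = deque()
        self.draining = False

    def queue(self, callback):
        self.pending.append(callback)
        # Only the outermost caller drains; nested calls return at once,
        # so the call stack stays shallow however long the chain is.
        if not self.draining:
            self.drain()

    def drain(self):
        self.draining = True
        try:
            while self.pending:
                callback = self.pending.popleft()
                callback(self)
        finally:
            self.draining = False


def run_chain(length):
    """Each callback queues the next one; return how many ran."""
    dispatcher = Dispatcher()
    completed = []

    def make_step(i):
        def step(d):
            completed.append(i)
            if i + 1 < length:
                d.queue(make_step(i + 1))
        return step

    dispatcher.queue(make_step(0))
    return len(completed)


# Far more steps than Python's default recursion limit would allow
# if each step invoked the next one directly.
print(run_chain(50000))
```

If each step called the next one directly instead of queueing it, a chain this long would exhaust the stack.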
Yan, Zheng [Mon, 24 Mar 2014 08:47:04 +0000 (16:47 +0800)]
mds: don't clear scatter dirty when cache rejoin ack is received
The auth mds has received dirty scatterlock state. But it hasn't
journaled the dirty state yet. The log segment that marked the
scatterlock dirty needs to be preserved. Therefore, we can't clear
the dirty flag of scatterlock.
Yan, Zheng [Sun, 23 Mar 2014 09:47:05 +0000 (17:47 +0800)]
mds: trim empty non-auth dirfrags
Fragmenting a non-auth dirfrag results in several smaller dirfrags. Some
of the resulting dirfrags can be empty and are not needed to connect
to the auth subtree.