Sage Weil [Fri, 18 Jan 2013 06:00:42 +0000 (22:00 -0800)]
mds: open mydir after replay
In certain cases, we may replay the journal and not end up with the
dirfrag for mydir open. This is fine--we just need to open it up and
fetch it below.
Yan, Zheng [Sun, 27 Jan 2013 07:14:55 +0000 (15:14 +0800)]
mds: fix 'discover' handling in the rejoin stage
If the MDS is in the rejoin stage, the current MDCache::handle_discover() only
handles 'discover' from MDSs from which it has already received a rejoin
acknowledgement. This can cause a circular wait because
MDCache::rejoin_gather_finish() fetches reconnected inodes before sending
rejoin acknowledgements, and fetching a reconnected inode may trigger a
'discover'. The fix is to not delay handling 'discover' from MDSs that are
also in the rejoin stage.
Yan, Zheng [Sat, 19 Jan 2013 01:24:12 +0000 (09:24 +0800)]
mds: fetch missing inodes from disk
The problem with fetching missing inodes from replicas is that replicated inodes
do not have up-to-date rstat and fragstat. So just fetch missing inodes from
disk.
Yan, Zheng [Fri, 18 Jan 2013 14:54:02 +0000 (22:54 +0800)]
mds: move variables special to rename into MDRequest::more
My previous patches added two pointers (ambiguous_auth_inode and
auth_pin_freeze) to class Mutation. They are both used by cross
authority rename, and both point to the renamed inode. Later patches
need to add more rename-specific state to MDRequest, so just move them
into MDRequest::more.
Yan, Zheng [Mon, 21 Jan 2013 02:04:03 +0000 (10:04 +0800)]
mds: don't journal opened non-auth inode
If we journal an opened non-auth inode, then during journal replay the
corresponding entry will add non-auth objects to the cache. But the MDS does
not journal all subsequent modifications (rmdir, rename) to these non-auth
objects, so the code that manages the cache and subtrees may get confused.
Besides, non-auth objects will be trimmed at the resolve stage.
Yan, Zheng [Wed, 16 Jan 2013 11:58:49 +0000 (19:58 +0800)]
mds: don't replace existing slave request
The MDS may receive a client request, but find there is an existing
slave request. This means another MDS is handling the same request, so
we should not replace the slave request with a new client request;
just forward the request.
The client request may include embedded cap releases; we need to process
them even if the request is forwarded.
Yan, Zheng [Wed, 16 Jan 2013 11:38:38 +0000 (19:38 +0800)]
mds: always use {push,pop}_projected_linkage to change linkage
The current code skips using {push,pop}_projected_linkage to modify a replica
dentry's linkage. This confuses EMetaBlob::add_dir_context() and makes
it record an out-of-date path when TO_ROOT mode is used. This patch changes
the code to always use {push,pop}_projected_linkage to modify a dentry's
linkage. It makes sure MDCache::create_subtree_map() records a correct and
up-to-date subtree map.
Yan, Zheng [Sat, 19 Jan 2013 01:49:04 +0000 (09:49 +0800)]
mds: send resolve messages after all MDS reach resolve stage
The current code sends resolve messages whenever the set of resolving MDSs
changes. There is no need to send resolve messages when some MDSs leave the
resolve stage. Sending messages while some MDSs are replaying is also
not very useful.
Yan, Zheng [Fri, 18 Jan 2013 11:41:48 +0000 (19:41 +0800)]
mds: split resolve into two sub-stages
The resolve stage serves to disambiguate the fate of uncommitted slave
updates and to resolve subtree authority. The MDS sends a resolve message
that claims subtree authority immediately when the resolve stage is entered.
When receiving a resolve message, the MDS also processes it immediately.
This may cause problems if there are uncommitted slave renames and some of
them need to be rolled back later, because a slave rename rollback may modify
the subtree map.
The fix is to split resolve into two sub-stages. The first sub-stage serves
to disambiguate slave updates and do the slave commit or rollback. After
the first sub-stage finishes, the MDS sends resolve messages that claim
subtree authority to other MDSs and processes received resolve messages.
Yan, Zheng [Sat, 19 Jan 2013 05:00:29 +0000 (13:00 +0800)]
mds: fix slave rename rollback
The main issue with the old slave rename rollback code is that it assumes
all affected objects are in the cache. The assumption is not true
when the MDS does a rollback in the resolve stage. This patch removes the
assumption and makes Server::do_rename_rollback() check each object
individually and roll back its changes.
Yan, Zheng [Sat, 19 Jan 2013 04:57:31 +0000 (12:57 +0800)]
mds: preserve non-auth/unlinked objects until slave commit
The MDS should not trim objects in a non-auth subtree immediately after
replaying a slave rename, because the slave rename may require a rollback
later and these objects are needed for the rollback.
Yan, Zheng [Fri, 18 Jan 2013 06:08:45 +0000 (14:08 +0800)]
mds: force journal straydn for rename if necessary
A rename may overwrite an empty directory inode and move it into the stray
directory. An MDS that has an auth subtree beneath the overwritten directory
needs to journal the stray dentry when handling the rename slave request.
Yan, Zheng [Fri, 18 Jan 2013 02:47:21 +0000 (10:47 +0800)]
mds: fix "had dentry linked to wrong inode" warning
The reason for the "had dentry linked to wrong inode" warning is that
Server::_rename_prepare() adds the destdir to the EMetaBlob before
adding the straydir. So during MDS recovery, the destdir is replayed
first and the old inode is directly replaced by the source inode.
We can avoid the warning by adding the straydir first.
Yan, Zheng [Sat, 19 Jan 2013 00:30:23 +0000 (08:30 +0800)]
mds: don't set xlocks on dentries done when early reply rename
_rename_finish() does not send dentry link/unlink message to replicas.
We should prevent dentries that are modified by the rename operation
from getting new replicas while the rename operation is committing.
So don't set xlocks on dentries "done".
Sage Weil [Mon, 28 Jan 2013 03:57:58 +0000 (19:57 -0800)]
mon: set limit so that we do not mark an entire down subtree out
Add new configurable 'mon osd down out subtree limit' so that you can
prevent marking out an entire subtree. If for example an entire rack is
down, do not mark anything in it out. If less than the whole rack is down,
everything is fair game.
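The rule above can be sketched as follows. This is a minimal illustration, not the monitor's actual code: the helper name may_mark_out() and the idea of passing in the full set of OSD ids under the limiting bucket (for example, one rack) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

// Hypothetical helper: 'subtree_osds' holds every OSD id under the bucket of
// the configured limit type (e.g. a rack) that contains the OSD in question.
// Returns false when the whole subtree is down, so nothing in it is marked
// out; returns true when only part of the subtree is down.
bool may_mark_out(const std::set<int>& subtree_osds,
                  const std::set<int>& down_osds)
{
  if (subtree_osds.empty())
    return true;  // no enclosing bucket of the limit type: no restriction
  bool all_down = std::all_of(
      subtree_osds.begin(), subtree_osds.end(),
      [&](int osd) { return down_osds.count(osd) > 0; });
  return !all_down;
}
```

With a rack holding OSDs 1, 2 and 3, the helper refuses to mark anything out when all three are down and allows it when only 1 and 2 are down.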
Danny Al-Gaaf [Mon, 28 Jan 2013 15:33:43 +0000 (16:33 +0100)]
rbd-fuse: fix usage of conn->want
Fix usage of conn->want and FUSE_CAP_BIG_WRITES. Both need libfuse
version >= 2.8. Encapsulate the related code line into a check for
the needed FUSE_VERSION as already done in ceph-fuse in some cases.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Sun, 27 Jan 2013 20:57:31 +0000 (21:57 +0100)]
utime: fix narrowing conversion compiler warning in sleep()
Fix compiler warning:
./include/utime.h: In member function 'void utime_t::sleep()':
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_sec' from
'__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is
ill-formed in C++11 [-Wnarrowing]
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_nsec' from
'__u32 {aka unsigned int}' to 'long int' inside { } is
ill-formed in C++11 [-Wnarrowing]
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
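One common way to silence this class of warning is to cast each field explicitly inside the braces. The sketch below shows that pattern; it is not necessarily the change this commit makes, and tv_pair is a stand-in type invented for the example rather than utime_t itself.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Stand-in for the two unsigned 32-bit fields named in the warning.
struct tv_pair {
  uint32_t tv_sec;
  uint32_t tv_nsec;
};

// Build a timespec with no implicit narrowing inside the braces: each
// unsigned field is cast explicitly before it is used as an initializer.
struct timespec to_timespec(const tv_pair& tv)
{
  assert(tv.tv_nsec < 1000000000u);  // nanoseconds must stay below one second
  struct timespec ts = { static_cast<time_t>(tv.tv_sec),
                         static_cast<long>(tv.tv_nsec) };
  return ts;
}
```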
mon: Monitor: rework timecheck code to clarify logic boundaries
The initial timecheck implementation relied on a cleanup function to
clean the state each time we changed epochs (or we got out of quorum),
and we would have to clean up the state in-between rounds in a potentially
confusing way some time down the line.
This patch creates logic boundaries in the code flow, making it clear
where we set up or clear the state when we start or finish an epoch, and
where we set up or clear the round state in-between rounds. It also
allowed for some other changes in behavior, such as when we set-up the
timecheck event, or when we cancel it. Despite the slight increase in
size, the mechanism just got more easily understandable than it was before.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Yan, Zheng [Sun, 6 Jan 2013 01:15:55 +0000 (09:15 +0800)]
mds: properly set error_dentry for discover reply
If MDCache::handle_discover() receives a 'discover path' request but
cannot find the base inode, it should properly set the 'error_dentry'
to make sure MDCache::handle_discover_reply() checks the correct object's
wait queue.
Yan, Zheng [Fri, 4 Jan 2013 02:36:50 +0000 (10:36 +0800)]
mds: introduce XSYN to SYNC lock state transition
If a lock is in the XSYN state, Locker::simple_sync() first tries to change
the lock state to EXCL. If it fails to change the lock state to EXCL, it just
returns. So Locker::simple_sync() does not guarantee that the lock state
eventually changes to SYNC. This issue can cause a replica that requests a
read lock to hang. The fix is to introduce an intermediate state for the XSYN
to SYNC transition.
Yan, Zheng [Thu, 17 Jan 2013 07:29:21 +0000 (15:29 +0800)]
mds: allow journaling multiple root inodes in EMetaBlob
In some cases (rename, rmdir, subtree map), we may need to journal multiple
root inodes (/, mdsdir) in one EMetaBlob. This patch modifies the EMetaBlob
format to support journaling multiple root inodes.
Yan, Zheng [Sat, 5 Jan 2013 02:07:11 +0000 (10:07 +0800)]
mds: lock remote inode's primary dentry during rename
Commit 1203cd2110 (mds: allow open_remote_ino() to open xlocked dentry)
makes Server::handle_client_rename() xlock remote inodes' primary
dentries so that the witness MDS can open the xlocked dentry. But I added
remote inodes' projected primary dentries to the xlock list. This is wrong
because projected dentries are invisible to path traversal.
Yan, Zheng [Thu, 17 Jan 2013 06:50:44 +0000 (14:50 +0800)]
mds: check deleted directory in Server::rdlock_path_xlock_dentry
Commit b03eab22e4 (mds: forbid creating file in deleted directory)
is incomplete: mknod, mkdir and symlink were missed. Moving the check
into Server::rdlock_path_xlock_dentry() fixes the issue.
Yan, Zheng [Fri, 11 Jan 2013 07:46:59 +0000 (15:46 +0800)]
mds: fix end check in Server::handle_client_readdir()
Commit 1174dd3188 (don't retry readdir request after issuing caps)
introduced a bug that wrongly marks 'end' in the readdir reply.
The code that touches existing dentries re-uses an iterator, and that
iterator is also used for checking whether the readdir has reached the end.
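The iterator-reuse pattern described above can be illustrated with a small sketch. The real Server::handle_client_readdir() is considerably more involved; the Dir alias, ReaddirResult struct and readdir_chunk() function here are hypothetical names used only to show why the 'end' check needs an iterator that the second pass does not move.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a directory: dentry name -> inode number.
using Dir = std::map<std::string, int>;

struct ReaddirResult {
  std::vector<std::string> names;
  bool end = false;  // true when no dentries remain after this reply
};

// Return up to 'max' entries starting at 'start'. The iterator used for the
// 'end' check ('it') is kept separate from the one used by the second pass.
ReaddirResult readdir_chunk(const Dir& dir, const std::string& start,
                            std::size_t max)
{
  assert(max > 0);
  ReaddirResult r;
  Dir::const_iterator first = dir.lower_bound(start);
  Dir::const_iterator it = first;
  for (; it != dir.end() && r.names.size() < max; ++it)
    r.names.push_back(it->first);

  // Second pass over the entries just returned (a stand-in for "touching"
  // existing dentries). It uses its own iterator so 'it' is left untouched.
  for (Dir::const_iterator touch = first; touch != it; ++touch)
    (void)touch->second;

  r.end = (it == dir.end());
  return r;
}
```

If the second loop advanced or reset 'it' instead of using 'touch', the final comparison against dir.end() would no longer describe where the listing stopped.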
mon: Elector: reset the acked leader when the election finishes and we lost
Failure to do so will mean that we will always ack the same leader during
an election started by another monitor. This had been working so far
because we were still acking the existing leader if it was supposed to
still be the leader; or we were acking a new potential leader; or we
would eventually fall behind on an election and start a new election
ourselves, thus resetting the previously acked leader. While this wasn't
something that mattered much until now, the timechecks code stumbled into
this tiny issue and was failing hard at completing a round because there
wouldn't be a reset before the election started -- timechecks are bound
to election epochs.
Fixes: #3854
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
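A heavily simplified model of the problem, for illustration only. It assumes monitors are identified by rank and that the lowest rank wins; ElectorSketch and its members are hypothetical names, not the Elector class itself.

```cpp
#include <cassert>

// Simplified model (assumption): the lowest rank wins an election.
struct ElectorSketch {
  int rank;               // our own rank
  int acked_leader = -1;  // rank we last deferred to, or -1 for none

  explicit ElectorSketch(int r) : rank(r) {}

  // Another monitor proposes itself. Defer only if it outranks us and does
  // not rank below a candidate we have already acked.
  bool handle_propose(int from) {
    assert(from != rank);
    if (from > rank)
      return false;  // we would win over the proposer
    if (acked_leader >= 0 && acked_leader < from)
      return false;  // already acked a better-ranked candidate
    acked_leader = from;
    return true;
  }

  // Called when an election ends. If we lost, forget whom we acked so the
  // next election starts from a clean state.
  void election_finished(int winner) {
    if (winner != rank)
      acked_leader = -1;
  }
};
```

In this model, a monitor of rank 2 that acked rank 0 and never reset would keep refusing a later proposal from rank 1, even if rank 0 is gone; calling election_finished() clears the stale ack.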
Dan Mick [Tue, 30 Oct 2012 21:02:53 +0000 (14:02 -0700)]
rbd-fuse: add simple RBD FUSE client
Currently written in C on FUSE hi-level interfaces, so error reporting
could be better. No serious work done for performance. But it's
usable as it stands.
Specify -c <conf> and a mountpoint, and images show up as files in
that mountpoint. You can create new images; they'll be created
with attributes stored in xattrs:
Images may be truncated or extended by rewriting. Currently
once an image is opened, it's not closed, so it can't be deleted
or changed outside of the fuse path.
Dan Mick [Sat, 26 Jan 2013 05:22:45 +0000 (21:22 -0800)]
s3/php: update to 1.5? version of API
Something like v1.5 of the Amazon PHP library requires the AmazonS3
constructor to be given an array of parameters rather than using
the globals. More research needs to happen, and particularly
about the v2 API, but this might solve someone's problem with
v1.5 while we do that research.
Ross Turk [Fri, 25 Jan 2013 20:48:31 +0000 (12:48 -0800)]
doc: wider sidebar, larger font, cleaned tip CSS
The sidebar is now about a hundred pixels wider and the fonts
are larger throughout. This works a lot better when you get
deep into the doc structure - it used to wrap horribly.
I also fixed how literals look inside .tip and .important.
For some reason, the lookup() retry loop (for when we happened to
race with a removal and grabbed an invalid WeakPtr) locked
the lock again. This causes the #3836 crash since the lock
is already locked. It's rare since it requires a lookup between
invalidation of the WeakPtr and removal of the WeakPtr entry.
Fixes: #3836
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
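The locking pattern at issue can be sketched with standard-library types. This is not the registry code the commit touches: RegistrySketch, its int keys and values, and the explicit remove() method are assumptions made for the example. The point is that the lock is acquired once before the retry loop, and the condition-variable wait releases and reacquires it, so the loop body never locks it a second time.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>

class RegistrySketch {
  std::mutex lock;
  std::condition_variable cond;
  std::map<int, std::weak_ptr<int>> contents;

public:
  std::shared_ptr<int> insert(int key, int value) {
    std::lock_guard<std::mutex> l(lock);
    std::shared_ptr<int> p = std::make_shared<int>(value);
    contents[key] = p;
    return p;
  }

  // Drop the entry and wake any lookup() waiting on a stale weak pointer.
  void remove(int key) {
    std::lock_guard<std::mutex> l(lock);
    contents.erase(key);
    cond.notify_all();
  }

  std::shared_ptr<int> lookup(int key) {
    std::unique_lock<std::mutex> l(lock);  // acquired once, before the loop
    for (;;) {
      auto it = contents.find(key);
      if (it == contents.end())
        return std::shared_ptr<int>();  // no such entry
      if (std::shared_ptr<int> p = it->second.lock())
        return p;  // entry is still alive
      // Raced with a removal: the weak pointer expired but the entry is
      // still present. wait() releases 'l' and reacquires it on wakeup, so
      // the loop must not lock the mutex again here.
      cond.wait(l);
    }
  }
};
```

In this sketch an expired entry stays in the map until someone calls remove(), so a lookup() that hits one blocks until that happens.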
Sage Weil [Fri, 25 Jan 2013 17:29:37 +0000 (09:29 -0800)]
osd: share incoming maps via Connection*, not addrs
Kill a set of parallel methods that are using the old addr/inst-based
msgr APIs, and instead use Connection handles. This is much safer and gets
us closer to killing the old msgr API.
Sage Weil [Fri, 25 Jan 2013 17:27:00 +0000 (09:27 -0800)]
osd: pass new maps to dead osds via existing Connection
Previously we were sending these maps to dead osds via their old addrs
using a new outgoing connection and setting the flags so that the msgr
would clean up. That mechanism is possibly buggy and fragile, and we can
avoid it entirely if we just reuse the existing heartbeat Connection.
Sage Weil [Fri, 25 Jan 2013 17:25:28 +0000 (09:25 -0800)]
osd: requeue osdmaps on heartbeat connections for cluster connection
If we receive an OSDMap on the cluster connection, requeue it for the
cluster messenger, and process it there where we normally do. This avoids
any concerns about locking and ordering rules.
The allowance is now only added for btrfs as of commit
e639254a0c5f8e3528fa8f2b2b451296653556bc, which makes us happy
for both non-btrfs (lower latency) and btrfs (better small io
throughput, no big stall during commit).
Sage Weil [Thu, 24 Jan 2013 06:16:49 +0000 (22:16 -0800)]
os/FileStore: only adjust up op queue for btrfs
We only need to adjust up the op queue limits during commit for btrfs,
because the snapshot initiation (async create) is currently
high-latency and the op queue is quiesced during that period.
This lets us revert 44dca5c, which disabled the extra allowance because
it is generally bad for non-btrfs writeahead mode.
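The behavior described above amounts to a conditional increase of the queue limit. A minimal sketch, assuming illustrative names (QueueLimits, effective_max_ops) that are not FileStore's own fields or methods:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative parameters; the names are assumptions for this example.
struct QueueLimits {
  uint64_t max_ops;             // normal op queue limit
  uint64_t committing_max_ops;  // extra allowance while a commit is running
};

// The extra allowance applies only when the backend is btrfs and a commit
// is in progress; every other case keeps the normal limit.
uint64_t effective_max_ops(const QueueLimits& cfg, bool is_btrfs,
                           bool committing)
{
  uint64_t limit = cfg.max_ops;
  if (is_btrfs && committing)
    limit += cfg.committing_max_ops;
  return limit;
}
```

With a normal limit of 500 and an allowance of 500, only the btrfs-while-committing case yields 1000; non-btrfs stores stay at 500 even during a commit.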