Sage Weil [Sat, 5 Jan 2013 18:39:08 +0000 (10:39 -0800)]
msg/Pipe: associate sending msgs to con inside lock
Associate a sending message with the connection inside the pipe_lock.
This way, if a racing thread tries to steal these messages, it will
be sure to reset the con pointer *after* we do, so that the con
pointer is valid in encode_payload() (and later).
Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit. (And fix one lingering user to use that
helper appropriately.)
Fixes: #3731 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Samuel Just [Fri, 4 Jan 2013 20:43:52 +0000 (12:43 -0800)]
ReplicatedPG: remove old-head optimization from push_to_replica
This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone. In general, using head in clone_subsets is tricky since
we might be writing to head during the push. calc_clone_subsets does not
consider head (probably for this reason). Handling the clone-from-head case
properly would require blocking writes on head in the interim, which is probably
a bad trade-off anyway.
Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.
Fixes: #3698 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 4 Jan 2013 01:15:07 +0000 (17:15 -0800)]
os/FileStore: fix non-btrfs op_seq commit order
The op_seq file is the starting point for journal replay. For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap. We normally ignore current/ contents anyway.
On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).
This fixes a serious bug that could cause data loss and corruption after
a power loss event. For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.
Fixes: #3721 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
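The non-btrfs ordering described in the commit above can be sketched as follows. This is a minimal illustration, not the FileStore code: commit_op_seq, op_seq_path and trim_journal are hypothetical names, and os.sync() stands in for whatever full sync the store performs.

```python
import os

def commit_op_seq(op_seq_path, committed_seq, trim_journal):
    """Record committed_seq durably, then allow the journal to be trimmed."""
    # 1. Full sync first, so everything up to committed_seq is on disk.
    #    (Assumption: os.sync() stands in for the store's own sync step.)
    os.sync()

    # 2. Only now write the replay starting point.
    fd = os.open(op_seq_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, f"{committed_seq}\n".encode())
        # 3. fsync the op_seq file itself before continuing.
        os.fsync(fd)
    finally:
        os.close(fd)

    # 4. Trimming is safe only after the op_seq file is durable.
    trim_journal(committed_seq)
```

If the trim ran before step 3, a power loss could leave a journal that no longer covers writes the op_seq file still points behind.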
Samuel Just [Thu, 3 Jan 2013 17:59:45 +0000 (09:59 -0800)]
OSD: for old osds, dispatch peering messages immediately
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message. However, old osds assume
that the activation message (log or info) will be _dispatched_
before the first sub_op_modify of the interval. Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 3 Jan 2013 06:20:06 +0000 (22:20 -0800)]
osd: let pgs process map advances before booting
The OSD deliberately consumes and processes most OSDMaps from while it
was down before it marks itself up, as this can be slow. The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD. The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.
Fix this by calling activate_map() even when booting and, when booting,
draining the peering_wq on each call. This is harmless since we are
not yet processing actual ops; we only need to be async when active.
Fixes: #3714 Signed-off-by: Sage Weil <sage@inktank.com>
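The drain-while-booting behaviour from the commit above, as a minimal sketch. The PeeringQueue class and this activate_map() signature are hypothetical stand-ins, not the OSD's actual interfaces; the point is only that a booting caller blocks until the queue is empty, while an active caller stays asynchronous.

```python
import queue
import threading

class PeeringQueue:
    """Hypothetical stand-in for peering_wq: one worker thread, FIFO order."""

    def __init__(self):
        self.q = queue.Queue()
        self.processed = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            epoch = self.q.get()
            self.processed.append(epoch)  # pretend to advance PGs to this map
            self.q.task_done()

    def drain(self):
        # Blocks until every queued item has been marked done.
        self.q.join()

def activate_map(wq, epoch, booting):
    wq.q.put(epoch)
    if booting:
        # Not serving ops yet, so waiting here is harmless.
        wq.drain()
```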
Sam Lang [Wed, 2 Jan 2013 22:07:13 +0000 (16:07 -0600)]
fuse: Fix cleanup code path on init failure
With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored. Also, if an
error occurs during cfuse.init(), we need to tear down the client
mount.
Sage Weil [Mon, 31 Dec 2012 23:22:23 +0000 (15:22 -0800)]
libcephfs: choose more unique nonce
We were using a per-process counter combined with the pid. A short
running process can easily loop through and reuse the same pid later.
Instead, go for 48 bits of randomness and the pid. This way if we get
a dup pid we'll only get a dup nonce once out of 2^48 tries.
Avoids #3630 when running a libcephfs test in a loop (so that the pid
is eventually reused). This is a better fix than the broken 8b599083705c2495810c00f9f5fd5bb8ace7f32e. The real solution on the MDS
side involves cleaning up the msgr/MDS interaction with session
shutdown.
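A minimal sketch of the nonce scheme described in the commit above. The bit layout (48 random bits above the low 16 bits of the pid, packed into 64 bits) is an assumption made for illustration; the commit only says that 48 bits of randomness are combined with the pid.

```python
import os
import secrets

def make_nonce():
    """Combine 48 random bits with the pid (layout is hypothetical)."""
    rand48 = secrets.randbits(48)
    pid16 = os.getpid() & 0xFFFF
    return (rand48 << 16) | pid16
```

With this layout, two processes that happen to share a pid still collide only if their 48 random bits also match.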
Sage Weil [Mon, 31 Dec 2012 23:23:29 +0000 (15:23 -0800)]
client: fix _create
make_request() clears out req->reply and frees req; we can't inspect
it here.
Instead, just assume that extra_bl is the create flag/ino if it is
present. Old code does not include an extra_bl on CREATE, and new code
will have the same first bytes for compatibility.
Josh Durgin [Sun, 30 Dec 2012 04:35:15 +0000 (20:35 -0800)]
librbd: fix race between unprotect and clone
Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.
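The check described in the commit above, as a minimal sketch. The Protection enum and parent_still_protected() are hypothetical names; read_status represents a fresh read of the header, performed just before clone returns.

```python
from enum import Enum

class Protection(Enum):
    UNPROTECTED = 0
    UNPROTECTING = 1
    PROTECTED = 2

def parent_still_protected(read_status):
    """Re-read the status and accept only the fully protected state."""
    status = read_status()  # fresh read, not a value cached earlier
    # UNPROTECTING must not count as protected.
    return status is Protection.PROTECTED
```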
Josh Durgin [Sat, 29 Dec 2012 00:54:51 +0000 (16:54 -0800)]
librbd: add {rbd_}open_read_only()
Since 58890cfad5f7bee933baa599a68e6c65993379d4, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.
For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.
Josh Durgin [Sat, 29 Dec 2012 03:47:09 +0000 (19:47 -0800)]
OSD: remove RD flag from CALL ops
20496b8d2b2c3779a771695c6f778abbdb66d92a forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.
Josh Durgin [Sat, 29 Dec 2012 03:44:36 +0000 (19:44 -0800)]
cls_rbd: get_children does not need write permission
This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.
Sage Weil [Sat, 29 Dec 2012 01:20:43 +0000 (17:20 -0800)]
msg/Pipe: use state_closed atomic_t for _lookup_pipe
We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock. Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.
Sage Weil [Sun, 23 Dec 2012 17:22:18 +0000 (09:22 -0800)]
msgr: fix race on Pipe removal from hash
When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry. The Pipe in this case will
have STATE_CLOSED. Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.
This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.
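The lookup behaviour from the two commits above can be sketched as follows. This is an illustration under assumptions, not the messenger code: a threading.Event stands in for the atomic state_closed flag, and the identity check in unregister() is an added assumption about how the stale entry is eventually removed.

```python
import threading

class Pipe:
    def __init__(self, addr):
        self.addr = addr
        self.pipe_lock = threading.Lock()
        # Stand-in for the atomic: set once, on the transition to CLOSED.
        self.state_closed = threading.Event()

    def close(self):
        with self.pipe_lock:
            self.state_closed.set()

class Messenger:
    def __init__(self):
        self.lock = threading.Lock()
        self.rank_pipe = {}

    def _lookup_pipe(self, addr):
        # Caller holds self.lock; pipe_lock is deliberately not taken here.
        pipe = self.rank_pipe.get(addr)
        if pipe is None or pipe.state_closed.is_set():
            return None  # a CLOSED entry is treated as absent
        return pipe

    def connect(self, addr):
        with self.lock:
            pipe = self._lookup_pipe(addr)
            if pipe is None:
                pipe = Pipe(addr)
                self.rank_pipe[addr] = pipe  # may replace a stale CLOSED entry
            return pipe

    def unregister(self, pipe):
        with self.lock:
            # Assumption: remove the entry only if it is still this pipe.
            if self.rank_pipe.get(pipe.addr) is pipe:
                del self.rank_pipe[pipe.addr]
```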
Sage Weil [Sun, 23 Dec 2012 17:19:05 +0000 (09:19 -0800)]
msgr: don't queue message on closed pipe
If we have a con that refs a pipe but it is closed, don't use it. If
the ref is still there, it is only because we are racing with fault()
and it is about to be (or just was) detached. Either way,
Remove the special-case check, which does not inform the peer what
protocol features are missing. It also enforces this requirement even
when we negotiate auth none.
Sam Lang [Wed, 19 Dec 2012 20:17:29 +0000 (10:17 -1000)]
mds: Return created inode in mds reply to create
If multiple clients race to create a file, multiple clients will send a
create request and get back a valid dentry+inode, but only one client
will actually win the race to create the file. All other clients should
treat the reply as an open of an existing file and check permissions.
This patch adds the created inode number to the mds create reply if that
request actually created the inode/file (and the feature is supported),
so the client can properly check permissions if the inode number isn't
returned. Fixes #3625.
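How a client might act on the reply described in the commit above, as a minimal sketch. CreateReply, created_ino and may_open are hypothetical names, not the MDS wire format or the Client API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CreateReply:
    ino: int                           # inode the dentry now points to
    created_ino: Optional[int] = None  # set only if this request created it

def handle_create_reply(reply: CreateReply, may_open: Callable[[int], bool]) -> str:
    if reply.created_ino is not None:
        # We won the race: the file is ours, no extra permission check.
        return "created"
    # We lost the race: treat the reply as an open of an existing file.
    if not may_open(reply.ino):
        raise PermissionError("open of existing file not permitted")
    return "opened"
```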
Sam Lang [Mon, 17 Dec 2012 19:54:23 +0000 (09:54 -1000)]
client: Make ll_create use _create
This is a fix for bug #3625, where multiple clients race to create a
file, and the loser returns EEXIST instead of a valid file handle.
The patch modifies ll_create in the Client class to use _create(),
which sends the request to the MDS (where an atomic create/open is
performed).
Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]
log: broadcast cond signals
We were using a single cond, and only signalling one waiter. That means
that if the flusher and several logging threads are waiting, and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.
Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.
Instead, break the single cond into two: one for loggers, and one for the
flusher. Always signal the (one) flusher, and always broadcast to all
loggers.
Backport: bobtail, argonaut Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
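The two-condition scheme from the commit above, as a minimal sketch. BoundedLog and its fields are hypothetical names, and the queue limit is simplified to a count of pending entries; both conditions share one lock.

```python
import threading

class BoundedLog:
    def __init__(self, max_pending):
        self.lock = threading.Lock()
        self.cond_loggers = threading.Condition(self.lock)  # loggers wait here
        self.cond_flusher = threading.Condition(self.lock)  # flusher waits here
        self.max_pending = max_pending
        self.pending = []
        self.flushed = []
        self.stopping = False

    def submit(self, entry):
        with self.lock:
            while len(self.pending) >= self.max_pending:
                self.cond_loggers.wait()
            self.pending.append(entry)
            # There is exactly one flusher, so a single signal is enough.
            self.cond_flusher.notify()

    def stop(self):
        with self.lock:
            self.stopping = True
            self.cond_flusher.notify()

    def flusher(self):
        while True:
            with self.lock:
                while not self.pending and not self.stopping:
                    self.cond_flusher.wait()
                if not self.pending and self.stopping:
                    return
                batch, self.pending = self.pending, []
                # Wake every blocked logger, not just one of them.
                self.cond_loggers.notify_all()
            self.flushed.extend(batch)  # the "write" happens outside the lock
```

Because a logger can only ever wake the flusher and the flusher wakes all loggers, a wakeup can no longer be consumed by the wrong kind of waiter.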
Sage Weil [Fri, 28 Dec 2012 19:34:47 +0000 (11:34 -0800)]
osd: allow RecoveryDone self-transition in RepNotRecovering
In a mixed cluster where some OSDs support the recovery reservations and
some don't, the replica may be new code in RepNotRecovering and will
complete a backfill. In that case, we want to just stay in
RepNotRecovering.
It may also be possible to make it infer what the primary is doing even
though it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.
Fixes: #3689 Signed-off-by: Sage Weil <sage@inktank.com>
Sam Lang [Fri, 28 Dec 2012 19:58:39 +0000 (13:58 -0600)]
ceph-fuse: Avoid doing handle cleanup in dtor
The CephFuse::Handle class needs the client
pointer to be valid for finalizing, so don't finalize
in the destructor (which doesn't get called until the
fuse handle leaves scope). Instead, use a finalize method that
gets called explicitly before the client pointer is freed.
Sam Lang [Thu, 6 Dec 2012 05:21:12 +0000 (23:21 -0600)]
ceph-fuse: Split main into init/main/finalize
With the invalidate callback enabled for fuse, the Client::unmount
call requires that the fuse channel and session objects remain for performing
the invalidate callbacks. This patch splits the ceph_fuse_ll_main
call into init, main, and finalize functions, so finalization of the
channel and session objects can be done after the unmount completes.
The patch includes cleanup for the code in fuse_ll.cc to make it more
in the style of C++ and make use of the pimpl idiom to hide the fuse
structures within the CephFuse::Handle pimpl class.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]
osd: drop 'osd recovery max active' back to previous default (5)
Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high. In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).
Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.
Sage Weil [Sun, 16 Dec 2012 20:26:06 +0000 (12:26 -0800)]
mds: replace closed sessions on connect
If a connection comes in and there is a closed session attached, remove it.
This is probably a failure of an old session to get cleaned up properly,
and in certain cases it may even be from a different client (if the addr
nonce is reused). In that case this prevents further damage, although
a complete solution would also clean up the closed connection state if
there is a fault. See #3630.
This fixes a hang that is reproduced by running the libcephfs
Caps.ReadZero test in a loop; eventually the client addr is reused and
we are linked to an ancient Session with a different client id.
Backport: bobtail Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 3 Mar 2012 23:39:06 +0000 (15:39 -0800)]
mds: don't force in->first == dn->first
The fullbit sets it now. For multiversion inodes, its "first" can be in
the future, since this dentry may not have changed when the inode was
cowed in place. (OTOH, the dentry cannot have changed without the inode
also having changed.)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Yan, Zheng [Sat, 8 Dec 2012 14:43:32 +0000 (22:43 +0800)]
mds: fix race between send_dentry_link() and cache expire
The MDentryLink message can race with cache expire. When it arrives at
the target MDS, it's possible there is no corresponding dentry in
the cache. If this race happens, we should expire the replica inode
encoded in the MDentryLink message. But to expire an inode, the MDS
needs to know which subtree the inode belongs to, so modify the
MDentryLink message to include this information.
Yan, Zheng [Mon, 10 Dec 2012 07:43:44 +0000 (15:43 +0800)]
mds: fix file existing check in Server::handle_client_openc()
Creating a new file needs to be handled by the directory fragment's auth
MDS; opening an existing file in write mode needs to be handled by the
corresponding inode's auth MDS. If a file is a remote link, its parent
directory fragment's auth MDS can be different from the corresponding
inode's auth MDS. So which MDS handles a create file request can
depend on whether the corresponding file already exists.
handle_client_openc() calls rdlock_path_xlock_dentry() at the very
beginning. It always assumes the request needs to be handled by the
directory fragment's auth MDS. When handling a create file request,
if the file already exists and is remotely linked to a non-auth inode,
handle_client_openc() falls back to handle_client_open(), and
handle_client_open() forwards the request because the MDS is not the
inode's auth MDS. Then, when the request arrives at the inode's auth MDS,
rdlock_path_xlock_dentry() is called and forwards the request
back.
Yan, Zheng [Sun, 9 Dec 2012 05:03:41 +0000 (13:03 +0800)]
mds: don't retry readdir request after issuing caps
If a remote linkage without an inode is encountered after some caps are
issued, Server::handle_client_readdir() should send the reply to
client immediately instead of retrying the request after opening
the remote dentry. This is because the MDS may want to revoke these
caps before the MDS succeeds in opening the remote dentry.
Yan, Zheng [Sat, 8 Dec 2012 16:53:28 +0000 (00:53 +0800)]
mds: take export lock set before sending MExportDirDiscover
Migrator::export_dir() only checks whether it can lock the export lock set
but does not take the lock set. So someone else can change the path to
the exporting dir and confuse Migrator::handle_export_discover().