Yan, Zheng [Mon, 5 May 2014 08:23:14 +0000 (16:23 +0800)]
mds: properly wake up dentry waiters after fragmenting dirfrag
When the active MDS wants to fragment a replica dirfrag, it should set
the 'replay' parameter of MDCache::adjust_dir_fragments() to false.
This makes sure that CDir::merge()/split() wake up any dentry waiters.
Ilya Dryomov [Fri, 2 May 2014 11:13:11 +0000 (15:13 +0400)]
test_librbd_fsx: fix a bug in docloseopen()
docloseopen() always opens the $iname image. This is bad, because the
image we had opened could have been something like $iname-clone3. Fix
it by leveraging the fact that rbd_ctx has an image name field.
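For illustration, a minimal C++ sketch of the fix's shape, using hypothetical stand-in types and helper names (the real fsx code is C and structured differently): reopen whatever image the context says is currently open, rather than the base $iname.

    #include <string>

    // Hypothetical stand-ins for fsx's context and image handle.
    struct rbd_ctx {
        std::string name;     // name of the image currently open, e.g. "$iname-clone3"
        void *image = nullptr;
    };

    // Placeholder open/close helpers standing in for the real code paths.
    static int image_close(rbd_ctx *ctx) { ctx->image = nullptr; return 0; }
    static int image_open(rbd_ctx *ctx, const std::string &name) { ctx->name = name; return 0; }

    static int docloseopen(rbd_ctx *ctx)
    {
        std::string name = ctx->name;     // remember before closing
        int ret = image_close(ctx);
        if (ret < 0)
            return ret;
        return image_open(ctx, name);     // reopen the same image, not $iname
    }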
Ilya Dryomov [Fri, 2 May 2014 11:13:11 +0000 (15:13 +0400)]
test_librbd_fsx: add krbd mode support
Add krbd mode support (-K) to test krbd in the same way librbd is
tested. This introduces a dependency on libkrbd and, because it's
a C++ static library, requires C++ linking. The rbd_operations
framework can be extended in the future to also test rbd_fuse.
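A rough sketch of the kind of operations table this implies; the field names and signatures below are assumptions for illustration, not the actual fsx definitions. The test body calls through function pointers, and -K selects a table backed by the kernel client instead of librbd.

    #include <sys/types.h>
    #include <cstddef>
    #include <cstdint>

    struct rbd_ctx;   // opaque test context

    struct rbd_operations {
        int     (*open)(const char *name, struct rbd_ctx *ctx);
        int     (*close)(struct rbd_ctx *ctx);
        ssize_t (*read)(struct rbd_ctx *ctx, uint64_t off, size_t len, char *buf);
        ssize_t (*write)(struct rbd_ctx *ctx, uint64_t off, size_t len, const char *buf);
        int     (*resize)(struct rbd_ctx *ctx, uint64_t size);
    };

    // One table backed by librbd, one backed by the kernel driver (mapped
    // via libkrbd, then accessed as an ordinary block device).
    static const rbd_operations librbd_operations = { /* librbd callbacks */ };
    static const rbd_operations krbd_operations   = { /* krbd callbacks   */ };

    // Chosen once at startup; -K picks krbd_operations.
    static const rbd_operations *ops = &librbd_operations;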
Ilya Dryomov [Thu, 1 May 2014 12:08:03 +0000 (16:08 +0400)]
test_librbd_fsx: add a flag to disable randomized striping
In preparation for krbd mode support, introduce an option to disable
randomized striping. The kernel as of 3.15 does not support "fancy"
striping and will not map images with non-default striping values.
Ilya Dryomov [Thu, 1 May 2014 12:08:03 +0000 (16:08 +0400)]
test_librbd_fsx: add holebdy option
In preparation for krbd mode support, provide an option to specify
alignment for discards. The kernel will reject discard requests whose
offset and length are not sector-size aligned.
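A small, self-contained sketch of the alignment arithmetic; the option name holebdy comes from the commit, the helper below is hypothetical and shows one conservative way to do it: keep the discard inside the requested range by rounding its start up and its end down to the boundary.

    #include <cstdint>
    #include <cassert>

    // One way to keep a discard inside [off, off + len) while honoring an
    // alignment requirement (holebdy would be the sector size when testing
    // through the kernel client): round the start up and the end down.
    static void align_discard(uint64_t off, uint64_t len, uint64_t holebdy,
                              uint64_t *aoff, uint64_t *alen)
    {
        assert(holebdy > 0);
        uint64_t start = (off + holebdy - 1) / holebdy * holebdy;   // round up
        uint64_t end   = (off + len) / holebdy * holebdy;           // round down
        *aoff = start;
        *alen = end > start ? end - start : 0;                      // may shrink to nothing
    }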
Ilya Dryomov [Thu, 1 May 2014 12:08:03 +0000 (16:08 +0400)]
test_librbd_fsx: make resizes sector-size aligned
In preparation for krbd mode support, change check_trunc_hack() to
resize to a sector-size aligned value. The kernel will not work with
images whose size is not sector-size aligned.
test_librbd_fsx: use posix_memalign() to allocate aligned buffers
Use posix_memalign() to allocate good_buf and temp_buf, which must be
writebdy and readbdy aligned respectively. The previous round_ptr_up()
approach made fsx crash in the free() calls at the end of main(), because
the pointer returned by malloc() was overwritten by the aligned pointer.
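A self-contained illustration of the allocation pattern; the sizes and alignments here are arbitrary examples, not fsx's defaults. posix_memalign() returns a pointer that is both aligned and valid to pass to free(), unlike rounding up a malloc()'d pointer.

    #include <cstdlib>
    #include <cstdio>

    int main()
    {
        const size_t writebdy = 512, readbdy = 512;   // example alignments
        const size_t maxfilelen = 1 << 20;            // example buffer size

        void *good_buf = nullptr, *temp_buf = nullptr;

        // posix_memalign() hands back a pointer that is already aligned, so the
        // very same pointer can later be passed to free() -- unlike malloc()ing
        // extra bytes and rounding the pointer up, which loses the address the
        // allocator actually returned.
        if (posix_memalign(&good_buf, writebdy, maxfilelen) != 0 ||
            posix_memalign(&temp_buf, readbdy, maxfilelen) != 0) {
            perror("posix_memalign");
            return 1;
        }

        /* ... exercise the buffers ... */

        free(good_buf);   // safe: these are the pointers the allocator returned
        free(temp_buf);
        return 0;
    }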
Gregory Farnum [Wed, 7 May 2014 06:06:14 +0000 (23:06 -0700)]
Merge pull request #1532 from ceph/wip-fast-dispatch
fast dispatch
This series adds an ms_fast_dispatch interface to the Messenger/Dispatcher, designed so that you can dispatch messages directly from the Pipe threads without going through the Dispatch queue.
It also sets up the OSD to make use of this interface for most operations, and switches to finer-grained locking and the use of local data in a bunch of different paths to enable that.
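A simplified sketch of the idea, using stand-in types rather than the actual Messenger/Dispatcher classes; the hook names follow the ones mentioned in this series, but the signatures and delivery logic below are assumptions.

    // Stand-in types, not the Ceph classes.
    struct Message { int type = 0; };

    struct FastDispatcher {
        // Asked from the reader thread: is 'm' safe to handle without queueing,
        // i.e. without taking any long-held locks?
        virtual bool ms_can_fast_dispatch(const Message *m) const = 0;

        // Runs in the reader (Pipe) thread itself; must not block for long.
        virtual void ms_fast_dispatch(Message *m) = 0;

        // The traditional queued dispatch path.
        virtual void ms_dispatch(Message *m) = 0;

        virtual ~FastDispatcher() = default;
    };

    // Delivery decision made by the reader thread, roughly:
    inline void deliver(FastDispatcher *d, Message *m, void (*enqueue)(Message *))
    {
        if (d->ms_can_fast_dispatch(m))
            d->ms_fast_dispatch(m);   // handled inline, dispatch queue bypassed
        else
            enqueue(m);               // handed to the dispatch-queue thread
    }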
Yehuda Sadeh [Tue, 6 May 2014 23:55:27 +0000 (16:55 -0700)]
rgw: fix stripe_size calculation
Fixes: #8299
Backport: firefly
The stripe size calculation was broken; specifically, it affected cases
where we had a manifest that described multiple parts.
Yan, Zheng [Sun, 4 May 2014 02:15:36 +0000 (10:15 +0800)]
mds: allow negative rstat
When splitting a dirfrag, the delta rstat is always added to the first new
dirfrag. Ancestors of the dirfrag may have negative rstat before the
delta rstat has been propagated to them. For example:
inode 100 n(v1 b-4096)
|- dir 100 n(v1 b-4096)
   |- dentry
      |- inode 101 n(v1 b-4096)
         |- dir 101.0* n(v1)/n(v1 b-4096)
         |- dir 101.1* n(v1)
mds: fix frozen inode check in MDCache::handle_discover()
When MDCache::handle_discover() encounters a frozen dirfrag, it should
proceed if the dirfrag is being merged, but the MDS hasn't frozen all
dirfrags yet. When MDCache::handle_discover() checks if an inode is
frozen, it should use CInode::is_frozen_inode() (which doesn't check whether
the inode's parent dirfrag is frozen).
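Stub classes to illustrate the distinction; the real CInode/CDir carry far more state and the freeze logic is more involved.

    struct CDirStub {
        bool frozen = false;
        bool is_frozen() const { return frozen; }
    };

    struct CInodeStub {
        bool frozen_inode = false;
        CDirStub *parent_dir = nullptr;

        // True only if the inode itself is frozen -- the check that
        // handle_discover() should use here.
        bool is_frozen_inode() const { return frozen_inode; }

        // Also true when the containing dirfrag is frozen, which is too
        // strict a test for this path.
        bool is_frozen() const {
            return frozen_inode || (parent_dir && parent_dir->is_frozen());
        }
    };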
mds: include authpinned objects in remote authpin request
Server::handle_slave_auth_pin() may drop old authpins if it encounters an
object that is not authpinnable. So it is better to include objects that
have already been authpinned in the remote authpin request.
Yan, Zheng [Fri, 28 Mar 2014 05:30:07 +0000 (13:30 +0800)]
mds: maintain auth bits during replay
Objects' STATE_AUTH bits are set when replaying an EImportStart event.
MDCache::trim_non_auth_subtree() clears objects' STATE_AUTH bits when
replaying an EExport event.
Yehuda Sadeh [Tue, 6 May 2014 18:06:29 +0000 (11:06 -0700)]
rgw: cut short object read if a chunk returns error
Fixes: #8289
Backport: firefly, dumpling
When reading an object, if we hit an error when trying to read one of
the rados objects then we should just stop. Otherwise we're just going
to continue reading the rest of the object, and since it can't be sent
back to the client (as we have a hole in the middle), we end up
accumulating everything in memory.
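A sketch of the control flow only; rgw's real read path is asynchronous and callback-driven, and read_chunk()/send_to_client() below are placeholders. The point is the early return: once one rados object fails, reading further chunks can only pile up data that can never be sent.

    #include <string>
    #include <vector>

    struct buffer_stub { std::string data; };

    // Placeholders for the real rados read and client send paths.
    static int read_chunk(const std::string &oid, buffer_stub *out) { (void)oid; (void)out; return 0; }
    static int send_to_client(const buffer_stub &bl) { (void)bl; return 0; }

    static int read_object(const std::vector<std::string> &rados_objs)
    {
        for (const auto &oid : rados_objs) {
            buffer_stub bl;
            int r = read_chunk(oid, &bl);
            if (r < 0)
                return r;   // stop: with a hole in the middle nothing more can
                            // be sent to the client, so reading on just buffers
            r = send_to_client(bl);
            if (r < 0)
                return r;
        }
        return 0;
    }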
Greg Farnum [Mon, 5 May 2014 04:58:04 +0000 (21:58 -0700)]
Pipe: wait for Pipes to finish running, instead of just stop()ing them
Add a stop_and_wait() function that, in addition to closing the Pipe and killing
its socket, waits for any fast_dispatch call which is in-progress. Use this in
several parts of the Pipe and SimpleMessenger code where appropriate.
This fixes several races with fast_dispatch and other avenues; here are two:
1) It could be that we grab the lock while the existing pipe is fast_dispatching
and then proceed to dispatch messages ourselves, beating it. Instead, wait for
the other pipe. Add a "reader_dispatching" member which tells us this is
happening, and when re-locking, signal the cond if we're shutting down.
2) It could be that a normally-dispatched Message in the OSD triggers a
mark_down() on the Connection and then clears out the Session
(Connection::priv) pointer, causing a racing fast_dispatch()'ed function to
assert out in the OSD because it requires a valid Session.
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
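A minimal sketch of the stop_and_wait() idea, using standard C++ primitives in place of Ceph's Mutex/Cond; reader_dispatching mirrors the member described in the message, and the rest of Pipe is elided.

    #include <mutex>
    #include <condition_variable>

    class PipeSketch {
        std::mutex pipe_lock;
        std::condition_variable cond;
        bool stopped = false;
        bool reader_dispatching = false;   // reader thread is in fast_dispatch

    public:
        void stop()
        {
            std::lock_guard<std::mutex> g(pipe_lock);
            stopped = true;               // socket teardown elided
        }

        void stop_and_wait()
        {
            std::unique_lock<std::mutex> l(pipe_lock);
            stopped = true;
            // Don't return until any in-flight fast_dispatch has finished, so
            // the caller can safely tear down state that call might still touch.
            cond.wait(l, [this] { return !reader_dispatching; });
        }

        // The reader thread wraps each fast_dispatch call like this:
        template <typename F> void run_fast_dispatch(F deliver)
        {
            {
                std::lock_guard<std::mutex> g(pipe_lock);
                if (stopped)
                    return;               // pipe already shut down, drop it
                reader_dispatching = true;
            }
            deliver();                    // runs without pipe_lock held
            {
                std::lock_guard<std::mutex> g(pipe_lock);
                reader_dispatching = false;
                if (stopped)
                    cond.notify_all();    // wake anyone blocked in stop_and_wait()
            }
        }
    };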
Sage Weil [Tue, 6 May 2014 18:01:27 +0000 (11:01 -0700)]
osd/ReplicatedPG: fix whiteouts for other cache mode
We were special-casing WRITEBACK mode for handling whiteouts; this needs to
also include the FORWARD and READONLY modes. To avoid having to list
specific cache modes, though, just check != NONE.
Fixes: #8296
Backport: firefly
Signed-off-by: Sage Weil <sage@inktank.com>
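The shape of the check, with a stand-in enum rather than the real pg_pool_t constants:

    enum class cache_mode { NONE, WRITEBACK, FORWARD, READONLY };

    inline bool whiteout_handling_needed(cache_mode m)
    {
        // Before: only WRITEBACK was special-cased.
        //   return m == cache_mode::WRITEBACK;
        // After: any caching mode (WRITEBACK, FORWARD, READONLY, ...) needs
        // whiteout handling, so test against NONE instead of listing modes.
        return m != cache_mode::NONE;
    }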
SimpleMessenger: Don't grab the lock when sending messages if we don't have to
We'd like it if sending a message didn't require any global locks, but the
submit_message() function conditionally needs it in order to create new
Pipes. So:
1) When failing on a dud Pipe, verify that it's still the Pipe the Connection
is linked to; if not, try sending along the newly-linked Pipe.
2) Add an "already_locked" param to submit_message
3) Have the Connection-based interface set this param to false, and
the addr-based interface set it to true, reflecting whether they have
taken the SimpleMessenger::lock.
4) If we discover we need to reference global data structures in
submit_message:
4a) if locked, do as we previously have
4b) if not locked, take the lock and call into submit_message again.
The net effect of this is that in the typical case, the Connection-based
_send_message() function no longer acquires global locks, only per-Connection
ones. In the case where the Connection must recreate a Pipe, it falls back to
performing like the addr-based _send_message() does. In the case where
we are racing with somebody else recreating a Pipe (either us or the other
end), we may try twice but we will still only take per-Connection/Pipe locks,
which is a fair tradeoff for not taking the global lock.
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
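A minimal sketch of the fallback pattern described above, with std::mutex standing in for SimpleMessenger::lock and the per-Connection/Pipe locks; the helper names are illustrative, not the actual SimpleMessenger code.

    #include <mutex>

    struct Connection { /* per-Connection state and lock elided */ };

    struct Messenger {
        std::mutex lock;   // stands in for SimpleMessenger::lock

        // Fast path: uses only per-Connection/Pipe state. Returns false if the
        // linked Pipe is missing or dud.
        bool try_send_on_existing_pipe(Connection *con) { (void)con; return false; }

        // Slow path: needs the global lock to look up or create a Pipe.
        void submit_message_locked(Connection *con) { (void)con; /* create Pipe, queue */ }

        void submit_message(Connection *con, bool already_locked)
        {
            if (try_send_on_existing_pipe(con))
                return;                       // common case: no global lock taken

            if (already_locked) {             // addr-based interface holds 'lock'
                submit_message_locked(con);
            } else {                          // Connection-based path: take it now
                std::lock_guard<std::mutex> l(lock);
                submit_message_locked(con);
            }
        }
    };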
OSD: scan for dropped PGs in consume_map instead of advance_map
We have to wait until after we know that nobody will be adding ops for
newly-dead PGs to the list. While we're moving it, switch the locking
so we only hold a write lock while deleting the actual lists.
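The locking half of this, sketched with C++17 std::shared_mutex in place of Ceph's RWLock and with illustrative names (waiting_for_pg and pg_exists_in_new_map are stand-ins): scan under a read lock, then hold the write lock only for the erases.

    #include <shared_mutex>
    #include <map>
    #include <list>
    #include <vector>

    struct OpStub {};
    using PGId = int;

    static std::shared_mutex pg_map_lock;
    static std::map<PGId, std::list<OpStub>> waiting_for_pg;

    // Placeholder for whatever "still alive in the new map" means here.
    static bool pg_exists_in_new_map(PGId pgid) { (void)pgid; return true; }

    static void prune_dead_pg_waitlists()
    {
        std::vector<PGId> dead;
        {
            std::shared_lock<std::shared_mutex> rl(pg_map_lock);   // scan only
            for (const auto &kv : waiting_for_pg)
                if (!pg_exists_in_new_map(kv.first))
                    dead.push_back(kv.first);
        }
        {
            std::unique_lock<std::shared_mutex> wl(pg_map_lock);   // brief write
            for (PGId pgid : dead)
                waiting_for_pg.erase(pgid);
        }
    }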
OSD: do not take the pre_publish_lock in connection utility functions
They loop back around for local connections and deadlock, so we use the
map reservation mechanism instead.
TODO: actually that issue is out of date, do we still want this change?
Greg Farnum [Wed, 26 Mar 2014 22:04:39 +0000 (15:04 -0700)]
OSD: enable ms_fast_dispatch
We've been setting it up; now this patch actually adds a fast path for osd ops
which bypasses the osd_lock and should not block on any long-held locks. In
addition to the actual ms_fast_dispatch; we take advantage of the fast_notify
functions in order to create a Session for every peer, since that is now the
data structure around which we handle incoming Messages and waitlisting; and
fast_preprocess in order to track when a peer has already sent us a new map
(otherwise, if we see an op with a too-new epoch, we have to request it from
the monitor).
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Don't hold the old PG's lock in _create_lock_pg. Instead, just copy the
necessary data bits into a holding location. Note that this means we aren't
protecting it against change while the new PG is created, which I *think*
is okay...