Sage Weil [Tue, 3 Dec 2013 16:16:41 +0000 (08:16 -0800)]
crush/mapper: new SET_CHOOSE_LEAF_TRIES command
Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).
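A minimal sketch of the idea of a per-rule try count, not the actual crush/mapper code. The step names, the tuple encoding of a rule, and the default of 1 are all hypothetical and chosen only for illustration:

```python
# Toy model: a rule is a list of (op, arg) steps. A "set_choose_leaf_tries"
# step changes the try count used by later chooseleaf steps in the same rule.
# Step names and the default value are assumptions, not the real CRUSH encoding.
DEFAULT_LEAF_TRIES = 1  # hypothetical default


def leaf_tries_per_step(steps):
    """Return (bucket_type, tries) for each chooseleaf step in the rule."""
    tries = DEFAULT_LEAF_TRIES
    resolved = []
    for op, arg in steps:
        if op == "set_choose_leaf_tries":
            tries = arg  # the rule overrides the try count from here on
        elif op == "chooseleaf":
            resolved.append((arg, tries))
    return resolved


# A wide indep-style rule asks for more tries than a replicated rule.
wide_rule = [("set_choose_leaf_tries", 5), ("chooseleaf", "host")]
rep_rule = [("chooseleaf", "host")]
print(leaf_tries_per_step(wide_rule))
print(leaf_tries_per_step(rep_rule))
```

The point of the sketch is only that the try count becomes state carried by the rule rather than a single global setting.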
(We should do the same for the other tunables, by the way!)
Sage Weil [Tue, 3 Dec 2013 01:17:13 +0000 (17:17 -0800)]
crush/mapper: pass parent r value for indep call
Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times. This avoids doing
useless work (calling multiple times and trying the same values).
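To illustrate why passing the parent's 'r' matters, here is a toy model. The way the parent and child values are combined (simple addition) is an assumption made for this sketch; the commit message does not say how the real code derives the child's 'r':

```python
# Toy model: each outer attempt has its own parent_r and makes a recursive
# call that tries a few r values. If the recursive call ignores parent_r,
# every outer attempt resamples the same inner values.
def inner_r_values(parent_r, inner_tries, pass_parent):
    """r values the recursive call would sample for one outer attempt."""
    base = parent_r if pass_parent else 0  # hypothetical combination rule
    return [base + t for t in range(inner_tries)]


def distinct_samples(outer_tries, inner_tries, pass_parent):
    """Set of distinct r values sampled across all outer attempts."""
    seen = set()
    for parent_r in range(outer_tries):
        seen.update(inner_r_values(parent_r, inner_tries, pass_parent))
    return seen


print(sorted(distinct_samples(3, 2, pass_parent=False)))
print(sorted(distinct_samples(3, 2, pass_parent=True)))
```

Without the parent value, three outer attempts only ever look at two inner values; with it, the retries cover new ground.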
Sage Weil [Tue, 3 Dec 2013 01:15:56 +0000 (17:15 -0800)]
crush/mapper: clarify numrep vs endpos
Pass numrep (the width of the result) separately from the number of results
we want *this* iteration. This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.
Sage Weil [Sat, 2 Nov 2013 18:54:09 +0000 (11:54 -0700)]
crush/mapper: strip firstn conditionals out of crush_choose, rename
Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals. This ends up
stripping out *lots* of code.
Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode. This appears to
have happened waaaay back in commit dae8bec9 (or earlier)... 2007.
Sage Weil [Sun, 11 Aug 2013 21:35:19 +0000 (14:35 -0700)]
crush: return CRUSH_ITEM_UNDEF for failed placements with indep
For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller. For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.
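The difference between the two modes can be sketched as follows. This is an illustration only: `None` stands in for a failed placement, and the sentinel object is a stand-in for CRUSH_ITEM_UNDEF, whose real numeric value is not reproduced here:

```python
# Stand-in for CRUSH_ITEM_UNDEF; the real constant's value is not shown here.
CRUSH_ITEM_UNDEF = object()


def result_firstn(slots):
    """firstn-style: drop failed slots, so the caller may get a short result."""
    return [item for item in slots if item is not None]


def result_indep(slots):
    """indep-style: keep every position and mark failures as undefined."""
    return [CRUSH_ITEM_UNDEF if item is None else item for item in slots]


slots = [4, None, 7]  # the middle placement failed
print(result_firstn(slots))      # item 7 shifts left into position 1
print(len(result_indep(slots)))  # width is preserved; item 7 stays at position 2
```

In the indep case a consumer that cares about position can still tell which slot is missing.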
Sage Weil [Sun, 11 Aug 2013 21:19:11 +0000 (14:19 -0700)]
crush: eliminate CRUSH_MAX_SET result size limitation
This is only present to size the temporary scratch arrays that we put on
the stack. Let the caller allocate them as they wish and remove the
limitation.
Josh Durgin [Mon, 2 Dec 2013 22:54:04 +0000 (14:54 -0800)]
osd: read into correct variable for magic string
4d140a71a1a48081b449b7d8dde808eb6e74c6b2 refactored this and
introduced a bug. peek_meta() was accidentally reading into magic,
then replacing magic with val, which was always the empty string,
resulting in the osd always failing to start due to 'mismatched'
magic values.
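The shape of the bug described above, as a small sketch. The dict-backed store, the helper names, and the magic value are hypothetical; only the pattern (read into one variable, then overwrite it with an empty one) is taken from the commit message:

```python
# Hypothetical stand-in for the on-disk metadata store.
def read_meta(store, key):
    """Return the stored value for key, or an empty string if absent."""
    return store.get(key, "")


def peek_meta_buggy(store):
    magic = read_meta(store, "magic")  # value is read into magic...
    val = ""                           # ...but val is never filled in
    magic = val                        # ...and then clobbers what was read
    return magic


def peek_meta_fixed(store):
    val = read_meta(store, "magic")    # read into the variable that is used
    magic = val
    return magic


store = {"magic": "example-magic"}     # made-up magic string
print(repr(peek_meta_buggy(store)))
print(repr(peek_meta_fixed(store)))
```

With the buggy version the caller always compares an empty string against the expected magic, which matches the "always failing to start" symptom.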
Sage Weil [Mon, 2 Dec 2013 16:31:23 +0000 (08:31 -0800)]
osd/OSDMap: add region, pdu, pod types while we are at it
One user noted that they have a 'pdu' type in their type hierarchy that
typically spans multiple racks. Others are known to use the 'pod'
terminology; add that too. And I can imagine 'region' above datacenter.
Factor this into a helper to make things a bit less fragile.
Sage Weil [Sun, 10 Nov 2013 06:03:42 +0000 (22:03 -0800)]
osd/OSDMap: add 'chassis' to default type hierarchy
A chassis is usually bigger than a host but smaller than a rack. This will
be useful for a broad class of modern hardware that sticks multiple hosts
in the same chassis (in sleds, or on cards, or blades, or whatever).
Sage Weil [Sun, 3 Nov 2013 04:21:39 +0000 (21:21 -0700)]
os/ObjectStore: add {read,write}_meta
Move these from the OSD. Use a generic implementation in ObjectStore that
hopefully all backends can share (so that it can remain in sync with the
start/stop scripts, ceph-disk, and other orchestration machinery).
Sage Weil [Sun, 3 Nov 2013 02:19:09 +0000 (19:19 -0700)]
osd: construct ObjectStore outside of OSD
This lets ceph_osd.cc handle the config details and use it directly for
all of the random command-line stuff, eliminating a bunch of mostly-
useless static wrappers in OSD.
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs
When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map. Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.
Fixes: #6910
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Tue, 26 Nov 2013 21:20:21 +0000 (13:20 -0800)]
PG: retry GetLog() each time we get a notify in Incomplete
If for some reason there are no up OSDs in the history which
happen to have usable copies of the pg, it's possible that
there is a usable copy elsewhere on the cluster which will
become known to the primary if it waits.
Fixes: #6909
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Tue, 26 Nov 2013 06:41:00 +0000 (14:41 +0800)]
mds: remove superfluous warning of releasing lease
When receiving the lease release message, it's possible that the lease
has already expired and the corresponding dentry has been trimmed from
the cache.
David Zafman [Mon, 25 Nov 2013 20:57:19 +0000 (12:57 -0800)]
osd: Remove bogus assert(active == acting.size())
We saw this assert because active is not correctly computed.
Remove assert and incorrectly computed active count.
We already use acting.size() to determine whether to set PG_STATE_DEGRADED.
Fixes: #6896
Signed-off-by: David Zafman <david.zafman@inktank.com>
Josh Durgin [Fri, 18 Oct 2013 15:23:40 +0000 (08:23 -0700)]
buffer: enable tracking of calls to c_str()
Track buffer::ptr::c_str() to catch internal calls that use it, like
buffer::ptr::cmp(). buffer::list::c_str() will be captured by this as
well, since it will do a final buffer::ptr::c_str() and possibly
several more if it needs to rebuild into a single raw buffer.
Josh Durgin [Fri, 18 Oct 2013 14:46:34 +0000 (07:46 -0700)]
buffer: attempt to size raw_pipe buffers
Make sure the requested length is below the maximum pipe size for now,
since we're only using one pipe and splicing once into and out of
it. The default max is 1MB on recent kernels, so this isn't such a
terrible limitation.
To get around this we could use multiple pipes, or keep both source and
destination fds open at the same time and call splice many times. This
is the more usual way to use splice, but would require a lot more work to
restructure the filestore and messenger to handle it.
Josh Durgin [Mon, 21 Oct 2013 19:40:30 +0000 (12:40 -0700)]
buffer: add methods to read and write using zero copy
Create explicit methods for testing. Make buffer::list::write_fd() use
zero-copy if all the buffers support it. Don't automatically handle
reads yet, since we need better detection of read length first.
Josh Durgin [Mon, 21 Oct 2013 15:58:56 +0000 (08:58 -0700)]
buffer: create raw pipe-based buffer
This uses a pipe to reference kernel memory so we can use splice(2) to
avoid extra data copies. Take an fd in the factory to create it, since
that's the only way to use it efficiently, which is its whole purpose.