Josh Durgin [Mon, 2 Dec 2013 22:54:04 +0000 (14:54 -0800)]
osd: read into correct variable for magic string
4d140a71a1a48081b449b7d8dde808eb6e74c6b2 refactored this and
introduced a bug. peek_meta() was accidentally reading into magic,
then replacing magic with val, which was always the empty string,
resulting in the osd always failing to start due to 'mismatched'
magic values.
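The bug pattern described above can be shown with a minimal Python sketch. This is not the actual C++ code: `read_meta`, `peek_meta_buggy`, `peek_meta_fixed`, and the magic string are hypothetical stand-ins chosen for illustration.

```python
def read_meta(store, key):
    """Hypothetical stand-in: return the stored value for key, or '' if absent."""
    return store.get(key, "")


def peek_meta_buggy(store):
    val = ""
    magic = read_meta(store, "magic")  # value lands in the wrong variable
    magic = val                        # ...and is then replaced by the empty val
    return magic


def peek_meta_fixed(store):
    val = read_meta(store, "magic")    # read into val
    magic = val                        # so the copy carries the real value
    return magic


# Placeholder magic string, not the real on-disk value.
EXPECTED_MAGIC = "example-magic"
store = {"magic": EXPECTED_MAGIC}
```

With the buggy variant the comparison against the expected magic can never succeed, which matches the "always failing to start" symptom.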
Sage Weil [Mon, 2 Dec 2013 16:31:23 +0000 (08:31 -0800)]
osd/OSDMap: add region, pdu, pod types while we are at it
One user noted that they have a 'pdu' type in their type hierarchy that
typically spans multiple racks. Others are known to use the 'pod'
terminology; add that too. And I can imagine 'region' above datacenter.
Factor this into a helper to make things a bit less fragile.
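A rough Python sketch of such a helper follows. The full ordering is an assumption; only the relative positions of chassis (between host and rack), pdu (above rack), and region (above datacenter) come from these commit messages. The names `DEFAULT_TYPES` and `build_default_type_map` are hypothetical.

```python
# Assumed ordering, smallest to largest. Only some relative positions are
# stated in the commit messages; the rest is illustrative.
DEFAULT_TYPES = [
    "osd", "host", "chassis", "rack", "row", "pdu",
    "pod", "room", "datacenter", "region", "root",
]


def build_default_type_map():
    """Return {type_id: type_name}, assigning ids in hierarchy order."""
    return {type_id: name for type_id, name in enumerate(DEFAULT_TYPES)}


type_map = build_default_type_map()
type_ids = {name: type_id for type_id, name in type_map.items()}
```

Keeping the names in one list means adding a level is a one-line change, rather than renumbering several call sites by hand.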
Sage Weil [Sun, 10 Nov 2013 06:03:42 +0000 (22:03 -0800)]
osd/OSDMap: add 'chassis' to default type hierarchy
A chassis is usually bigger than a host but smaller than a rack. This will
be useful for a broad class of modern hardware that sticks multiple hosts
in the same chassis (in sleds, or on cards, or blades, or whatever).
Sage Weil [Sun, 3 Nov 2013 04:21:39 +0000 (21:21 -0700)]
os/ObjectStore: add {read,write}_meta
Move these from the OSD. Use a generic implementation in ObjectStore that
hopefully all backends can share (so that it can remain in sync with the
start/stop scripts, ceph-disk, and other orchestration machinery).
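As an illustration only, a generic key/value metadata store of this kind could look like the Python sketch below. It assumes each key is kept as a small one-line text file in the data directory, which is what would let scripts read the same values; the function bodies are a guess at the idea, not the ObjectStore implementation.

```python
import os
import tempfile


def write_meta(data_dir, key, value):
    """Store value as a one-line text file named key inside data_dir."""
    path = os.path.join(data_dir, key)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(value + "\n")
        f.flush()
        os.fsync(f.fileno())       # make sure the bytes reach the disk
    os.replace(tmp_path, path)     # atomically swap the new file into place


def read_meta(data_dir, key):
    """Return the value stored for key, without the trailing newline."""
    with open(os.path.join(data_dir, key)) as f:
        return f.read().rstrip("\n")


def demo():
    """Round-trip one key through a throwaway directory."""
    with tempfile.TemporaryDirectory() as data_dir:
        write_meta(data_dir, "magic", "example-magic")
        return read_meta(data_dir, "magic")
```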
Sage Weil [Sun, 3 Nov 2013 02:19:09 +0000 (19:19 -0700)]
osd: construct ObjectStore outside of OSD
This lets ceph_osd.cc handle the config details and use it directly for
all of the random command-line stuff, eliminating a bunch of mostly-
useless static wrappers in OSD.
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs
When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map. Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.
Fixes: #6910
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Tue, 26 Nov 2013 21:20:21 +0000 (13:20 -0800)]
PG: retry GetLog() each time we get a notify in Incomplete
If for some reason there are no up OSDs in the history which
happen to have usable copies of the pg, it's possible that
there is a usable copy elsewhere on the cluster which will
become known to the primary if it waits.
Fixes: #6909
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Tue, 26 Nov 2013 06:41:00 +0000 (14:41 +0800)]
mds: remove superfluous warning of releasing lease
When receiving the lease release message, it's possible that the lease
has already expired and the corresponding dentry has been trimmed from
the cache.
David Zafman [Mon, 25 Nov 2013 20:57:19 +0000 (12:57 -0800)]
osd: Remove bogus assert(active == acting.size())
We saw this assert because active is not correctly computed.
Remove assert and incorrectly computed active count.
We already use acting.size() to determine whether to set PG_STATE_DEGRADED.
Fixes: #6896
Signed-off-by: David Zafman <david.zafman@inktank.com>
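A minimal Python sketch of deriving the degraded state from the acting set alone, with no separately maintained active count. The comparison against the pool's replica count is an assumption made for illustration, and `is_degraded` is a hypothetical name.

```python
def is_degraded(acting, pool_size):
    """Treat the PG as degraded when fewer OSDs are acting than the pool wants.

    Assumption: 'pool_size' is the desired replica count for the pool.
    """
    return len(acting) < pool_size
```

Since the state is computed directly from the acting set, there is no second counter that can drift out of sync and trip an assert.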
Josh Durgin [Fri, 18 Oct 2013 15:23:40 +0000 (08:23 -0700)]
buffer: enable tracking of calls to c_str()
Track buffer::ptr::c_str() to catch internal calls that use it, like
buffer::ptr::cmp(). buffer::list::c_str() will be captured by this as
well, since it will do a final buffer::ptr::c_str() and possibly
several more if it needs to rebuild into a single raw buffer.
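The idea of counting at the lowest-level accessor can be sketched in Python as below. `TrackedPtr` and its counter are hypothetical; the point is only that internal callers such as a comparison method are counted too, because they go through the same accessor.

```python
class TrackedPtr:
    """Toy stand-in for a buffer pointer that counts c_str() calls."""

    c_str_calls = 0  # class-wide counter shared by all instances

    def __init__(self, data: bytes):
        self._data = data

    def c_str(self) -> bytes:
        TrackedPtr.c_str_calls += 1  # every access is recorded here
        return self._data

    def cmp(self, other: "TrackedPtr") -> int:
        # Internal use of c_str() is captured by the same counter.
        a, b = self.c_str(), other.c_str()
        return (a > b) - (a < b)
```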
Josh Durgin [Fri, 18 Oct 2013 14:46:34 +0000 (07:46 -0700)]
buffer: attempt to size raw_pipe buffers
Make sure the requested length is below the maximum pipe size for now,
since we're only using one pipe and splicing once into and out of
it. The default max is 1MB on recent kernels, so this isn't such a
terrible limitation.
To get around this we could use multiple pipes, or keep both source and
destination fds open at the same time and call splice many times. This
is more usual usage for splice, but would require a lot more work to
restructure the filestore and messenger to handle it.
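A hedged Python sketch of the length check: it reads the system-wide limit from `/proc/sys/fs/pipe-max-size` on Linux and falls back to 1 MiB when that file is unavailable. The function names are hypothetical, and treating a length equal to the limit as acceptable is an assumption.

```python
DEFAULT_PIPE_MAX = 1 << 20  # 1 MiB, the default max mentioned above


def max_pipe_size(path="/proc/sys/fs/pipe-max-size"):
    """Return the system's maximum pipe size, or the 1 MiB default."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return DEFAULT_PIPE_MAX


def fits_in_one_pipe(length, limit=None):
    """True if a single pipe of at most 'limit' bytes can hold 'length' bytes."""
    if limit is None:
        limit = max_pipe_size()
    return 0 < length <= limit
```

A caller would fall back to an ordinary copying read or write whenever `fits_in_one_pipe` returns false.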
Josh Durgin [Mon, 21 Oct 2013 19:40:30 +0000 (12:40 -0700)]
buffer: add methods to read and write using zero copy
Create explicit methods for testing. Make buffer::list::write_fd() use
zero-copy if all the buffers support it. Don't automatically handle
reads yet, since we need better detection of read length first.
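The "only if all the buffers support it" rule can be sketched in Python as follows. `Buf`, `can_zero_copy`, and `choose_write_path` are hypothetical names, and treating an empty list as a plain copy is an assumption.

```python
class Buf:
    """Toy buffer that knows whether it can be written without copying."""

    def __init__(self, data: bytes, zero_copy: bool = False):
        self.data = data
        self.zero_copy = zero_copy

    def can_zero_copy(self) -> bool:
        return self.zero_copy


def choose_write_path(buffers):
    """Pick 'zero_copy' only when every buffer in the list supports it."""
    if buffers and all(b.can_zero_copy() for b in buffers):
        return "zero_copy"
    return "copy"  # any ordinary buffer forces the regular write path
```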
Josh Durgin [Mon, 21 Oct 2013 15:58:56 +0000 (08:58 -0700)]
buffer: create raw pipe-based buffer
This uses a pipe to reference kernel memory so we can use splice(2) to
avoid extra data copies. Take an fd in the factory to create it, since
that's the only way to use it efficiently, which is its whole purpose.
Josh Durgin [Wed, 16 Oct 2013 23:23:36 +0000 (16:23 -0700)]
buffer: abstract raw data related methods
Create a virtual function that returns the raw data instead of
accessing it directly, so raw buffers backed by pipes can be used as
buffer::ptrs. Make raw::is_page_aligned() virtual so it will not need
to look at the raw data for a pipe-based buffer.
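In Python terms, the abstraction could be sketched like this. All class names are hypothetical, the fake `address` stands in for a real pointer, and having the pipe-backed variant report that it is not page aligned is an assumption.

```python
import os
from abc import ABC, abstractmethod

PAGE_SIZE = 4096  # assumed page size for the illustration


class Raw(ABC):
    @abstractmethod
    def get_data(self) -> bytes:
        """Return the underlying bytes; subclasses decide how to get them."""

    @abstractmethod
    def is_page_aligned(self) -> bool:
        """Report alignment without forcing callers to touch the data."""


class RawMemory(Raw):
    def __init__(self, data: bytes, address: int):
        self._data = data
        self._address = address  # pretend memory address

    def get_data(self) -> bytes:
        return self._data

    def is_page_aligned(self) -> bool:
        return self._address % PAGE_SIZE == 0


class RawPipe(Raw):
    def __init__(self, read_fd: int, length: int):
        self._fd = read_fd
        self._length = length
        self._copied = None

    def get_data(self) -> bytes:
        # Only copy out of the pipe when someone actually asks for the bytes.
        if self._copied is None:
            self._copied = os.read(self._fd, self._length)
        return self._copied

    def is_page_aligned(self) -> bool:
        return False  # assumption: answered without reading the pipe
```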
Fixes: #6829
Backport: dumpling, emperor
We didn't initialize this member variable, so when modifying user info
that has this flag set, the 'system' flag might inadvertently be
reset.
mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump
Code would assume a formatter would always be defined. If a 'plain'
formatter or even an invalid formatter were to be supplied, the monitor
would crash and burn in poor style.
Fixes: #6820
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
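A small Python sketch of guarding against a missing formatter. The lookup table, the function names, and the choice to fall back to JSON are all assumptions made for illustration; the commit only states that the monitor must not assume a formatter exists.

```python
import json


def get_formatter(name):
    """Return a formatting callable, or None for 'plain' or unknown names."""
    formatters = {"json": json.dumps}
    return formatters.get(name)


def osd_crush_dump(crush, format_name):
    """Dump the crush data, never dereferencing a missing formatter."""
    formatter = get_formatter(format_name)
    if formatter is None:
        # Assumed fallback: use a known-good formatter instead of crashing.
        formatter = get_formatter("json")
    return formatter(crush)
```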
mon: OSDMonitor: receive CephInt on 'osd pool set' instead of CephString
This partially reverts 2fe0d0d9 in order to allow Emperor monitors to
forward mon command messages to Dumpling monitors without breaking a
cluster.
The need for this patch became obvious after issue #6796 was triggered.
Basically, in a mixed cluster of Emperor/Dumpling monitors, if a client
happens to obtain the command descriptions from an Emperor monitor and
then issue an 'osd pool set' this can turn out in one of two ways:
1. client msg gets forwarded to an Emperor leader and everything's a-okay;
2. client msg gets forwarded to a Dumpling leader and the string fails to
be interpreted without the monitor noticing, thus leaving the monitor with
an uninitialized variable leading to trouble.
If 2 is triggered, a multitude of bad things can happen, such as thousands
of pg splits, due to a simple 'osd pool set foo pg_num 128' turning out
to be interpreted as 109120394 or some other random number.
This patch is such that we make sure the client sends an integer instead
of a string. We also make sure to interpret anything the client sends as
possibly being a string, or an integer.
Fixes: #6796
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
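The "accept either a string or an integer" handling can be sketched in Python as below. `parse_pool_value` is a hypothetical name; the point of the sketch is that a value which cannot be interpreted raises an error instead of leaving a variable uninitialized.

```python
def parse_pool_value(value):
    """Interpret a client-supplied value that may be an int or a numeric string."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("expected an integer or a numeric string")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            raise ValueError("cannot interpret %r as an integer" % value)
    raise TypeError("expected an integer or a numeric string")
```

Either form of `pg_num` then yields the same number, and garbage input is refused rather than silently turned into a random value.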
Josh Durgin [Thu, 21 Nov 2013 02:35:34 +0000 (18:35 -0800)]
test: use older names for module setup/teardown
setUp and tearDown require nosetests 0.11, but 0.10.4 is the latest on
centos. Rename to use the older aliases, which still work with newer
versions of nosetests as well.
Fixes: #6368
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
David Zafman [Fri, 11 Oct 2013 22:53:49 +0000 (15:53 -0700)]
osd: Backfill peers should not be included in the acting set
Create actingbackfill in choose_acting()
Use first backfill target as previously
Add asserts to catch inappropriate use of actingbackfill
Use is_acting() in proc_replica_info() because this runs before actingbackfill is set
Remove backfill_targets from stray_set to prevent purge_strays from removing collection
Can't check is_replica() anymore for backfill operations since a backfill isn't
a replica due to acting set change.
fixes: #5855
Signed-off-by: David Zafman <david.zafman@inktank.com>
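A rough Python sketch of keeping the two sets separate. It assumes actingbackfill is simply the acting set followed by the backfill targets; `build_actingbackfill` is a hypothetical helper, not the choose_acting() code.

```python
def build_actingbackfill(acting, backfill_targets):
    """Combine acting OSDs and backfill targets without merging the two roles."""
    overlap = set(acting) & set(backfill_targets)
    # A backfill peer must not also appear in the acting set.
    assert not overlap, "backfill peers found in acting set: %s" % sorted(overlap)
    return list(acting) + list(backfill_targets)
```

Code that needs "everyone involved in this PG" would iterate over the combined list, while replica checks keep using the acting set only.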
Samuel Just [Wed, 30 Oct 2013 18:21:56 +0000 (11:21 -0700)]
ReplicatedPG/PGBackend: block all ops other than Pull prior to active
Previously, it was guaranteed that prior to activation, flushed would
be false on a replica. Now, there may be a period where flushed is true
due to the flush in Stray completing prior to activation and flushed
being false again. This is necessary since shortly it won't be possible
to determine from the osdmap whether a stray will be activated in a
particular interval.