Sage Weil [Mon, 2 Dec 2013 16:31:23 +0000 (08:31 -0800)]
osd/OSDMap: add region, pdu, pod types while we are at it
One user noted that they have a 'pdu' type in their type hierarchy that
typically spans multiple racks. Others are known to use the 'pod'
terminology; add that too. And I can imagine 'region' above datacenter.
Factor this into a helper to make things a bit less fragile.
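
To illustrate the helper idea, here is a minimal sketch that derives the type table from one ordered list, so adding a type is a one-line change. The ordering and numeric ids are assumptions inferred from these commit messages (chassis between host and rack, pdu above rack, region above datacenter); the positions of 'row', 'pod', 'room' and 'root' are guesses, and `build_default_types` is a hypothetical name, not the OSDMap API.

```python
# Hypothetical ordering, smallest failure domain first. Only the relative
# positions of chassis, pdu and region are taken from the commit messages.
DEFAULT_TYPE_NAMES = [
    "osd", "host", "chassis", "rack", "row", "pdu", "pod",
    "room", "datacenter", "region", "root",
]


def build_default_types(names=DEFAULT_TYPE_NAMES):
    """Map each type name to an id derived from its position in the list."""
    return {name: type_id for type_id, name in enumerate(names)}


types = build_default_types()
```

Deriving the ids from list position avoids hand-maintained numbering, which is the kind of fragility the helper is meant to remove.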
Sage Weil [Sun, 10 Nov 2013 06:03:42 +0000 (22:03 -0800)]
osd/OSDMap: add 'chassis' to default type hierarchy
A chassis is usually bigger than a host but smaller than a rack. This will
be useful for a broad class of modern hardware that sticks multiple hosts
in the same chassis (in sleds, or on cards, or blades, or whatever).
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs
When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map. Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.
Fixes: #6910
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Tue, 26 Nov 2013 21:20:21 +0000 (13:20 -0800)]
PG: retry GetLog() each time we get a notify in Incomplete
If for some reason there are no up OSDs in the history which
happen to have usable copies of the pg, it's possible that
there is a usable copy elsewhere on the cluster which will
become known to the primary if it waits.
Fixes: #6909
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Tue, 26 Nov 2013 06:41:00 +0000 (14:41 +0800)]
mds: remove superfluous warning of releasing lease
When receiving the lease release message, it's possible that the lease
has already expired and the corresponding dentry has been trimmed from
the cache.
David Zafman [Mon, 25 Nov 2013 20:57:19 +0000 (12:57 -0800)]
osd: Remove bogus assert(active == acting.size())
We saw this assert because active is not correctly computed.
Remove assert and incorrectly computed active count.
We already use acting.size() to determine whether to set PG_STATE_DEGRADED.
Fixes: #6896
Signed-off-by: David Zafman <david.zafman@inktank.com>
Josh Durgin [Fri, 18 Oct 2013 15:23:40 +0000 (08:23 -0700)]
buffer: enable tracking of calls to c_str()
Track buffer::ptr::c_str() to catch internal calls that use it, like
buffer::ptr::cmp(). buffer::list::c_str() will be captured by this as
well, since it will do a final buffer::ptr::c_str() and possibly
several more if it needs to rebuild into a single raw buffer.
Josh Durgin [Fri, 18 Oct 2013 14:46:34 +0000 (07:46 -0700)]
buffer: attempt to size raw_pipe buffers
Make sure the requested length is below the maximum pipe size for now,
since we're only using one pipe and splicing once into and out of
it. The default max is 1MB on recent kernels, so this isn't such a
terrible limitation.
To get around this we could use multiple pipes, or keep both source and
destination fds open at the same time and call splice many times. This
is more usual usage for splice, but would require a lot more work to
restructure the filestore and messenger to handle it.
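
The multi-splice alternative described above can be sketched as follows. This is not what the commit implements; it is an illustration that assumes Linux and Python 3.10 or later (for `os.splice`), and `splice_copy` and its `chunk` parameter are hypothetical names.

```python
import os


def splice_copy(src_fd, dst_fd, length, chunk=64 * 1024):
    """Copy up to `length` bytes from src_fd to dst_fd through a pipe.

    Both descriptors stay open for the whole copy and splice(2) is called
    repeatedly, so the total length is not bounded by the pipe size.
    """
    if not hasattr(os, "splice"):
        raise NotImplementedError("os.splice needs Linux and Python 3.10+")
    read_end, write_end = os.pipe()
    copied = 0
    try:
        while copied < length:
            # Source -> pipe: the kernel moves at most what the pipe can hold.
            moved = os.splice(src_fd, write_end, min(chunk, length - copied))
            if moved == 0:
                break  # end of file on the source
            # Pipe -> destination: drain everything that was just queued.
            remaining = moved
            while remaining:
                remaining -= os.splice(read_end, dst_fd, remaining)
            copied += moved
    finally:
        os.close(read_end)
        os.close(write_end)
    return copied
```

The destination must not be opened in append mode, since splice(2) rejects such descriptors.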
Josh Durgin [Mon, 21 Oct 2013 19:40:30 +0000 (12:40 -0700)]
buffer: add methods to read and write using zero copy
Create explicit methods for testing. Make buffer::list::write_fd() use
zero-copy if all the buffers support it. Don't automatically handle
reads yet, since we need better detection of read length first.
Josh Durgin [Mon, 21 Oct 2013 15:58:56 +0000 (08:58 -0700)]
buffer: create raw pipe-based buffer
This uses a pipe to reference kernel memory so we can use splice(2) to
avoid extra data copies. Take an fd in the factory to create it, since
that's the only way to use it efficiently, which is its whole purpose.
Josh Durgin [Wed, 16 Oct 2013 23:23:36 +0000 (16:23 -0700)]
buffer: abstract raw data related methods
Create a virtual function that returns the raw data instead of
accessing it directly, so raw buffers backed by pipes can be used as
buffer::ptrs. Make raw::is_page_aligned() virtual so it will not need
to look at the raw data for a pipe-based buffer.
Fixes: #6829
Backport: dumpling, emperor
We didn't initialize this member variable, so modifying the info of a
user that has the 'system' flag set might inadvertently reset that
flag.
mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump
Code would assume a formatter would always be defined. If a 'plain'
formatter or even an invalid formatter were to be supplied, the monitor
would crash and burn in poor style.
Fixes: 6820
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
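
A minimal sketch of the guard, assuming the fix falls back to a default formatter when none can be created. `make_formatter`, `crush_dump` and the choice of 'json-pretty' as the fallback are hypothetical stand-ins, not the monitor's actual code.

```python
import json


def make_formatter(name):
    """Return a serializer for `name`, or None when the name is unsupported."""
    if name == "json":
        return lambda obj: json.dumps(obj, sort_keys=True)
    if name == "json-pretty":
        return lambda obj: json.dumps(obj, sort_keys=True, indent=2)
    return None  # 'plain' or an invalid name


def crush_dump(crush, fmt="json-pretty"):
    """Dump `crush` with the requested format, never using a missing formatter."""
    formatter = make_formatter(fmt)
    if formatter is None:
        # Assumed behavior: fall back to a default instead of crashing.
        formatter = make_formatter("json-pretty")
    return formatter(crush)
```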
mon: OSDMonitor: receive CephInt on 'osd pool set' instead of CephString
This partially reverts 2fe0d0d9 in order to allow Emperor monitors to
forward mon command messages to Dumpling monitors without breaking a
cluster.
The need for this patch became obvious after issue #6796 was triggered.
Basically, in a mixed cluster of Emperor/Dumpling monitors, if a client
happens to obtain the command descriptions from an Emperor monitor and
then issue an 'osd pool set' this can turn out in one of two ways:
1. client msg gets forwarded to an Emperor leader and everything's a-okay;
2. client msg gets forwarded to a Dumpling leader and the string fails to
be interpreted without the monitor noticing, thus leaving the monitor with
an uninitialized variable leading to trouble.
If 2 is triggered, a multitude of bad things can happen, such as thousands
of pg splits, due to a simple 'osd pool set foo pg_num 128' turning out
to be interpreted as 109120394 or some other random number.
This patch makes sure the client sends an integer instead of a string.
We also make sure to interpret anything the client sends as possibly
being either a string or an integer.
Fixes: 6796
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
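
The "accept either a string or an integer" rule can be sketched like this. `parse_pool_value` is a hypothetical helper, not the monitor's code; the point it illustrates is that a value that cannot be parsed is rejected outright rather than leaving the result unset.

```python
def parse_pool_value(raw):
    """Return `raw` as an int, accepting an int or a numeric string."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError(f"expected an integer, got {raw!r}")
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        try:
            return int(raw.strip())
        except ValueError:
            raise ValueError(f"expected an integer, got {raw!r}") from None
    raise ValueError(f"expected an integer, got {raw!r}")
```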
Josh Durgin [Thu, 21 Nov 2013 02:35:34 +0000 (18:35 -0800)]
test: use older names for module setup/teardown
setUp and tearDown require nosetests 0.11, but 0.10.4 is the latest on
CentOS. Rename to use the older aliases, which still work with newer
versions of nosetests as well.
Fixes: #6368
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
David Zafman [Fri, 11 Oct 2013 22:53:49 +0000 (15:53 -0700)]
osd: Backfill peers should not be included in the acting set
Create actingbackfill in choose_acting()
Use first backfill target as previously
Add asserts to catch inappropriate use of actingbackfill
Use is_acting() in proc_replica_info() because this is before actingbackfill is set
Remove backfill_targets from stray_set to prevent purge_strays from removing collection
Can't check is_replica() anymore for backfill operations since a backfill isn't
a replica due to acting set change.
fixes: #5855
Signed-off-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Wed, 30 Oct 2013 18:21:56 +0000 (11:21 -0700)]
ReplicatedPG/PGBackend: block all ops other than Pull prior to active
Previously, it was guaranteed that prior to activation, flushed would
be false on a replica. Now, there may be a period where flushed is true
due to the flush in Stray completing prior to activation and flushed
being false again. This is necessary since shortly it won't be possible
to determine from the osdmap whether a stray will be activated in a
particular interval.
rallred [Tue, 12 Nov 2013 15:29:19 +0000 (08:29 -0700)]
RBD Documentation and Example fixes for --image-format
- RBD Documentation, --image-format wrongly specified as --format in examples
- RBD Documentation, better describe image format, to differentiate from --format
Josh Durgin [Mon, 18 Nov 2013 22:39:12 +0000 (14:39 -0800)]
osd: fix bench block size
The command was declared to take 'size' in dumpling, but was trying to
read 'bsize' instead, so it always used the default of 4MiB. Change
the bench command to read 'size', so it matches what existing clients
are sending.
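
A minimal sketch of why the mismatch went unnoticed: looking up a key the client never sends silently yields the default. `bench_block_size` and the dictionary layout are hypothetical; only the key names and the 4 MiB default come from the message above.

```python
DEFAULT_BENCH_BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB


def bench_block_size(cmdmap, key="size"):
    """Read the block size from a parsed command, falling back to the default."""
    return int(cmdmap.get(key, DEFAULT_BENCH_BLOCK_SIZE))


# What an existing client sends, per the declared command signature.
client_cmd = {"prefix": "bench", "size": 65536}

buggy = bench_block_size(client_cmd, key="bsize")  # key is never present
fixed = bench_block_size(client_cmd, key="size")   # matches the declaration
```

Here `buggy` is always the 4 MiB default, while `fixed` reflects the value the client asked for.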
Shipping an object_info_t to a replica with the dirty
flag set would cause the replica to interpret that
object as being lost. Instead, we always encode
lost into the slot where dumpling expects to find
it and add another field at the end of the encoding.
Backport: emperor
Fixes: #6761
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Tue, 12 Nov 2013 23:15:26 +0000 (15:15 -0800)]
ReplicatedPG: test for missing head before find_object_context
find_object_context doesn't return EAGAIN for a missing head.
I chose not to change that behavior since it might hide bugs
in the future. All other callers check for missing on head
before calling into find_object_context because we potentially
need head or snapdir to map a snapid onto a clone.
Backport: emperor
Fixes: 6758
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Tue, 12 Nov 2013 21:39:04 +0000 (13:39 -0800)]
JournalingObjectStore: journal->committed_thru after replay
It's possible that the osd stopped between when the filestore
op_seq file was updated and when the journal was trimmed. In
that case, it's possible that on boot the journal might be
full, and yet not be trimmed because commit_start assumes
there is no work to do. Calling committed_thru on the journal
ensures that the journal matches committed_seq.
Backport: emperor dumpling
Fixes: 6756
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
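
The effect of the extra call can be sketched with a toy model. `ToyJournal` and `mount` are hypothetical and model the journal as a list of sequence numbers; only the name `committed_thru` and the idea of trimming to committed_seq come from the message above.

```python
class ToyJournal:
    """Toy journal that only tracks the sequence numbers it still holds."""

    def __init__(self, entries):
        self.entries = sorted(entries)

    def committed_thru(self, seq):
        """Trim every entry whose sequence number is <= seq."""
        self.entries = [s for s in self.entries if s > seq]


def mount(journal, committed_seq):
    """Replay entries newer than committed_seq, then trim the journal."""
    replayed = [s for s in journal.entries if s > committed_seq]
    # Without this call, entries <= committed_seq left behind by a stop
    # between the op_seq update and the trim would keep occupying space.
    journal.committed_thru(committed_seq)
    return replayed
```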