Samuel Just [Mon, 13 Feb 2012 19:49:42 +0000 (11:49 -0800)]
ReplicatedPG: refactor push and pull
Now, push progress is represented by ObjectRecoveryProgress. In
particular, rather than tracking data_subset_*ing, we track the furthest
offset before which the data will be consistent once cloning is complete.
sub_op_push now separates the pull response implementation from the
replica push implementation.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Sun, 12 Feb 2012 01:53:47 +0000 (17:53 -0800)]
ReplicatedPG: is_degraded may return true for backfill
If is_degraded returns true for backfill, the object may not be
in any replica's missing set. Only call start_recovery_op if
we actually started an op. This bug could leave a PG stuck
in backfill.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Sat, 11 Feb 2012 22:55:06 +0000 (14:55 -0800)]
osd: filter trimming|purged snaps out of op SnapContext
We can receive an op with an old SnapContext that includes snaps that we've
already trimmed or are in the process of trimming. Filter them out!
Otherwise we will recreate and add links into collections we've already
marked as removed, and we'll get things like ENOTEMPTY when we try to
remove them. Or we just leave them lying around.
Fixes: #1949
Signed-off-by: Sage Weil <sage@newdream.net>
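The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual OSD code: the snapid alias, the filter_snaps name, and the two std::set arguments are assumptions made for the example.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for a snapshot id; not the real Ceph type.
using snapid = std::uint64_t;

// Return the snaps from `ctx_snaps` that are neither already purged nor
// currently being trimmed, preserving their original order.
std::vector<snapid> filter_snaps(const std::vector<snapid>& ctx_snaps,
                                 const std::set<snapid>& purged,
                                 const std::set<snapid>& trimming) {
  std::vector<snapid> live;
  for (snapid s : ctx_snaps) {
    if (purged.count(s) == 0 && trimming.count(s) == 0) {
      live.push_back(s);
    }
  }
  return live;
}
```

With the stale snaps dropped before the op is applied, nothing gets linked into a collection that has already been marked as removed.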
Sage Weil [Sat, 11 Feb 2012 17:28:14 +0000 (09:28 -0800)]
osd: queue pg removal under pg's epoch
The PG may be doing work relative to a different epoch than what the osd
has. Make sure the PG removal message is queued under that epoch to avoid
confusing/crashing the recipient like so:
2012-02-10 23:26:35.691793 7f387281f700 osd.3 514 queue_pg_for_deletion: 0.0
osd/OSD.cc: In function 'void OSD::handle_pg_remove(OpRequest*)' thread 7f387281f700 time 2012-02-10 23:26:35.691820
osd/OSD.cc: 4860: FAILED assert(pg->get_primary() == m->get_source().num())
Greg Farnum [Fri, 10 Feb 2012 23:07:10 +0000 (15:07 -0800)]
mon: remove the last_consumed setting in Paxos
This was only ever used while initializing the Paxos machine, and it
doesn't need to be. Its existence is just an invitation to have races
between updating the stashed data and the stashed version.
Greg Farnum [Fri, 10 Feb 2012 23:02:03 +0000 (15:02 -0800)]
mon: handle inconsistent disk states on startup.
This lets us recover from an interrupted slurp while still noticing
other corruption issues. Rather than running init() and then
update_from_paxos() on each instance, we run init() and check
consistency. If it is consistent, we update_from_paxos as before. If
it is not, we do nothing and detect the slurping state in
handle_probe_reply(). (This assumes the disk was in a slurping state.
If not, the daemon crashes because something else went horribly wrong.)
While we're at it, remove unnecessary sets of first_committed. These
are done in the call to pax->trim_to().
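As a rough sketch of the decision described above, assuming the consistency check and the slurping flag can each be reduced to a single bool. The enum and the classify_startup function are hypothetical, and the real code splits this decision between startup and handle_probe_reply() rather than making it in one place.

```cpp
// Hypothetical outcomes; none of these names come from the Ceph source.
enum class StartupAction {
  UpdateFromPaxos,  // disk state is consistent: proceed as before
  ResumeSlurp,      // inconsistent, but an interrupted slurp explains why
  Abort             // inconsistent with no slurp in progress: real corruption
};

// Collapse the two-phase check into one function for illustration only.
StartupAction classify_startup(bool consistent, bool slurping) {
  if (consistent) {
    return StartupAction::UpdateFromPaxos;
  }
  return slurping ? StartupAction::ResumeSlurp : StartupAction::Abort;
}
```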
Sage Weil [Fri, 10 Feb 2012 22:38:13 +0000 (14:38 -0800)]
messages: populate header.version in constructor
Define a HEAD_VERSION and COMPAT_VERSION for any versioned message. Pass
to Message constructor so that it is always initialized, even from the
default constructor. That's needed because we use that to check
decoding compatibility when receiving/decoding messages.
If we are conditionally encoding an old version, explicitly set
header.version in encode_payload().
We also set compat_version to demonstrate how it would be used for
future revisions. For now it's moot: no old code understands
compat_version yet, so nobody with old decode code will see these
values anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
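The pattern might look roughly like this. It is a simplified sketch: Header, Message, and MExample are stand-ins written for this example (only HEAD_VERSION, COMPAT_VERSION, header.version, compat_version, and encode_payload() are named in the commit), and the peer_understands_v2 parameter is an assumption.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for the real classes.
struct Header {
  std::uint16_t version = 0;
  std::uint16_t compat_version = 0;
};

class Message {
 public:
  Header header;

 protected:
  // Every subclass must say which versions it speaks.
  Message(std::uint16_t version, std::uint16_t compat_version) {
    header.version = version;
    header.compat_version = compat_version;
  }
};

class MExample : public Message {
 public:
  static constexpr std::uint16_t HEAD_VERSION = 2;
  static constexpr std::uint16_t COMPAT_VERSION = 1;

  // Even the default constructor populates the header.
  MExample() : Message(HEAD_VERSION, COMPAT_VERSION) {}

  // When conditionally encoding the old format, say so in the header.
  void encode_payload(bool peer_understands_v2) {
    if (!peer_understands_v2) {
      header.version = 1;
    }
  }
};
```

Because the base constructor takes the versions as arguments, a subclass cannot forget to initialize them.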
Greg Farnum [Fri, 10 Feb 2012 18:42:24 +0000 (10:42 -0800)]
mon: add a slurping flag to the Paxos state
Set it before we start slurping, and clear it when we end slurping.
This allows us to differentiate between deliberately inconsistent
disk states, and broken disk states. Run simple checks in a new
is_consistent() call.
Greg Farnum [Fri, 10 Feb 2012 17:16:58 +0000 (09:16 -0800)]
mon: initialize paxos state in constructor
These should all be initialized in init() anyway
(except accepted_pn_from, which is set in collect and handle_collect),
but initializing them to safe defaults in the constructor provides
a safety net.
Sage Weil [Fri, 10 Feb 2012 05:54:34 +0000 (21:54 -0800)]
osd: new encoding for pg_create_t
There was no version encoding previously, so this is an incompatible
change. Fortunately this type is only used in one place, MOSDPGCreate,
so we'll rev that encoding and compensate there. All is well!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 10 Feb 2012 00:20:31 +0000 (16:20 -0800)]
filestore: wait to start op if other ops are in line
We can have a sequence like:
- commit_start, blocked=true
- op_start thread A gets in line
- op_start thread B gets in line
- commit finished, blocked=false
- thread A goes
- op_start thread C sees blocked=false and continues
-> order broken
If there are people in line from a previous block, we need to get in line,
even if blocked == false.
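A minimal model of the rule above, assuming ops are identified by integers and the line is a simple FIFO. OpGate and its members are hypothetical names for this sketch, and the real filestore uses threads and condition variables rather than explicit calls.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical model of the op-start gate; names are illustrative only.
struct OpGate {
  bool blocked = false;            // true while a commit is in progress
  std::deque<std::uint64_t> line;  // ops waiting, oldest first

  // Called when an op arrives. Returns true if it may start immediately,
  // false if it had to join the line. Note the second condition: an op
  // queues behind earlier waiters even when blocked is already false.
  bool arrive(std::uint64_t op) {
    if (blocked || !line.empty()) {
      line.push_back(op);
      return false;
    }
    return true;
  }

  // Release the op at the front of the line, if there is one and we are
  // not blocked. Returns true and stores the op in `op` on success.
  bool release_front(std::uint64_t& op) {
    if (blocked || line.empty()) {
      return false;
    }
    op = line.front();
    line.pop_front();
    return true;
  }
};
```

Replaying the sequence from the commit message, thread C now lands behind thread B instead of jumping ahead of it.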
Sage Weil [Thu, 9 Feb 2012 19:23:10 +0000 (11:23 -0800)]
filestore: fix op queue quiesce during commit
When I added the ordering constraint fix back in 259c509a I got the
check backwards. We want to wait if we are blocked OR we are not at the
front of the line (i.e., proceed if we are not blocked AND first in line).
Fixes: #2046
Signed-off-by: Sage Weil <sage@newdream.net>
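The corrected condition, written both ways, could be sketched like this. The function and parameter names are assumptions for the example; the two forms are equivalent by De Morgan's law.

```cpp
// Hypothetical predicates; the names are not taken from the filestore code.

// An op waits if a commit has blocked the queue OR it is not first in line.
bool must_wait(bool blocked, bool first_in_line) {
  return blocked || !first_in_line;
}

// Equivalent formulation: proceed only if not blocked AND first in line.
bool may_proceed(bool blocked, bool first_in_line) {
  return !blocked && first_in_line;
}
```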
Josh Durgin [Thu, 9 Feb 2012 01:38:52 +0000 (17:38 -0800)]
ReplicatedPG: don't count deletions as ops
Counting them as ops but not requeueing the pg for recovery causes
backfill to stall when only deletions are sent in
recover_backfill(). Deletions are cheap and don't need to be acked, so
we can simply stop counting them as ops.
Yehuda Sadeh [Thu, 9 Feb 2012 01:08:39 +0000 (17:08 -0800)]
rgw: don't treat plus as a space in url decode
Any special character encoding should be done through %hex. The
plus sign is a valid character in object names, and in user id
(when used in signed urls).
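A decoder that follows this rule might look like the sketch below. It is not the rgw implementation: url_decode and hex_value are names chosen for the example, and copying malformed escapes through literally is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Convert one hex digit to its value, or return -1 if `c` is not a hex digit.
static int hex_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Decode %XX escapes only. '+' is copied through unchanged.
// Malformed or truncated escapes are copied through literally.
std::string url_decode(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      int hi = hex_value(in[i + 1]);
      int lo = hex_value(in[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;  // skip the two hex digits just consumed
        continue;
      }
    }
    out.push_back(in[i]);
  }
  return out;
}
```

A client that wants a space sends %20, and one that wants a literal plus can send either + or %2B.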
Sage Weil [Wed, 8 Feb 2012 19:20:47 +0000 (11:20 -0800)]
osd: flush on activate
PG::activate() can make lots of changes, most notably clean_up_local()
which deletes lots of local objects. Those changes need to be flushed
to the fs before we start servicing requests or else we risk processing a
client read on those objects.
Fixes: #1974
Signed-off-by: Sage Weil <sage@newdream.net>
Greg Farnum [Wed, 8 Feb 2012 19:08:51 +0000 (11:08 -0800)]
mon: waitlist new sessions trying to connect while we're out of quorum
If we're stuck out of the quorum, we don't want clients connecting to
us. Instead, waitlist their requests: process them when we get into
a quorum, and look at them every tick so we can toss them out if we
take too long to get there.
This change is smaller than it looks - most requests would
previously have been blocked anyway while waiting for Paxos to be
readable, and the ones that weren't (eg, one-time map subscribes)
should have been!
This also means we can tear out the cleanup code for new Sessions, so
tick() looks better.
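The waitlist behaviour described above can be modelled roughly as follows. All names here (Waiter, SessionWaitlist, add, tick, drain) are hypothetical, time is a plain integer, and the sketch assumes requests are added in nondecreasing time order so the oldest is always at the front.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical record of one waitlisted request.
struct Waiter {
  std::string name;
  std::uint64_t arrived;
};

// Hypothetical waitlist used while the monitor is out of quorum.
class SessionWaitlist {
 public:
  explicit SessionWaitlist(std::uint64_t timeout) : timeout_(timeout) {}

  void add(const std::string& name, std::uint64_t now) {
    waiting_.push_back(Waiter{name, now});
  }

  // Called every tick: drop requests that have waited longer than the
  // timeout and return their names.
  std::vector<std::string> tick(std::uint64_t now) {
    std::vector<std::string> dropped;
    while (!waiting_.empty() && now - waiting_.front().arrived > timeout_) {
      dropped.push_back(waiting_.front().name);
      waiting_.pop_front();
    }
    return dropped;
  }

  // Called on entering a quorum: hand back everything still waiting,
  // in arrival order, so it can be processed.
  std::vector<std::string> drain() {
    std::vector<std::string> ready;
    for (const Waiter& w : waiting_) {
      ready.push_back(w.name);
    }
    waiting_.clear();
    return ready;
  }

 private:
  std::uint64_t timeout_;
  std::deque<Waiter> waiting_;
};
```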