Sage Weil [Thu, 27 Oct 2011 16:47:20 +0000 (09:47 -0700)]
filejournal: journal_replay_from
Force journal replay from a point other than the op_seq recorded by the
fs. This is useful if you want to skip bad entries in the journal (e.g.,
because they were non-idempotent and you know they were applied and the fs
operations were fully ordered).
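
A minimal sketch of the idea, using a hypothetical replay_from value and a
simplified entry type rather than the actual FileJournal code:

    #include <cstdint>
    #include <iostream>
    #include <list>

    struct JournalEntry {
      uint64_t seq;   // sequence number recorded with each journal entry
    };

    // Replay every entry strictly newer than the cutoff. Normally the cutoff
    // is the op_seq recorded by the fs; a nonzero replay_from overrides it so
    // an operator can skip known-bad (already applied) entries.
    void replay_journal(const std::list<JournalEntry>& journal,
                        uint64_t fs_op_seq, uint64_t replay_from) {
      uint64_t cutoff = replay_from ? replay_from - 1 : fs_op_seq;
      for (const auto& e : journal) {
        if (e.seq <= cutoff)
          continue;   // already applied, or deliberately skipped
        std::cout << "replaying entry " << e.seq << "\n";
      }
    }
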
Sage Weil [Mon, 24 Oct 2011 20:55:29 +0000 (13:55 -0700)]
osd: fix last_complete adjustment after recovering an object
After we recover each object, we try to raise the last_complete value
(and matching complete_to iterator). If our log is purely a backlog, this
won't necessarily bring last_complete all the way up to last_update, and
we'll fail an assert later.
If complete_to does reach the end of the log, then we fast-forward
last_complete to last_update.
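
A compressed sketch of that fast-forward (simplified types, not the actual
PG recovery code):

    #include <list>

    struct eversion_t { unsigned epoch = 0, version = 0; };
    struct Entry { eversion_t version; };

    struct RecoveryInfo {
      std::list<Entry> log;
      std::list<Entry>::iterator complete_to;  // first entry not yet complete
      eversion_t last_complete, last_update;
    };

    // Recovery advances complete_to as objects become readable (elided).
    // With a pure backlog the entry versions may never reach last_update,
    // so fast-forward once we walk off the end of the log.
    void adjust_last_complete(RecoveryInfo& info) {
      if (info.complete_to == info.log.end())
        info.last_complete = info.last_update;   // avoid the later assert
    }
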
The crash we were hitting was in finish_recovery(), and looked something
like
Sage Weil [Sun, 23 Oct 2011 06:07:10 +0000 (23:07 -0700)]
osd: fix generate_past_intervals maybe_went_rw on oldest interval
We stop working backwards when we hit last_epoch_clean, which means that,
for the oldest interval, first_epoch may not be the _real_ first_epoch.
(We can't
continue working backward because we may have thrown out those maps
entirely.)
However, if the last_epoch_clean epoch is contained within that interval,
we know that the OSD did in fact go rw because it had to have completed
recovery (and thus peering) to set last_clean_epoch in the first place.
This fixes cases where two different nodes have slightly different
past intervals, generate different prior probe sets as a result, and
flip/flop on the acting set choice. (It may eventually have resolved when
the wrongly excluded node's notify raced in and arrived in time to be
considered, but that's still clearly no good.)
This does leave the start epoch for that oldest interval incorrect. That
doesn't currently matter except that it's confusing, but I'm not sure how
to mark it properly, or if it's worth the effort.
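
Roughly, the check amounts to the following (illustrative names, not the
actual generate_past_intervals code):

    // Cut-down interval record, loosely modeled on the OSD's past intervals.
    struct Interval {
      unsigned first = 0, last = 0;   // epoch range covered by this interval
      bool maybe_went_rw = false;
    };

    // We stop walking backwards at last_epoch_clean, so the oldest
    // interval's 'first' may be later than the real first epoch. But if
    // last_epoch_clean falls inside the interval, the OSD must have
    // completed peering and recovery to record it, so it went rw.
    void fix_oldest_interval(Interval& oldest, unsigned last_epoch_clean) {
      if (oldest.first <= last_epoch_clean && last_epoch_clean <= oldest.last)
        oldest.maybe_went_rw = true;
    }
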
Sage Weil [Tue, 25 Oct 2011 05:21:43 +0000 (22:21 -0700)]
osd: fix/simplify op discard checks
Use a helper to determine when we should discard an op due to the client
being disconnected. Use this when the op is first received, (re)queued,
and dequeued.
Fix the check to keep ops that are replays of ACKed requests, as we should
make every effort to reapply those even when the client goes away.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
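
The helper boils down to something like this (field names are illustrative,
not the actual OSD types):

    // An op is dropped only if its client session is gone AND it is not a
    // replay of a previously ACKed request.
    struct OpInfo {
      bool client_connected;
      bool is_replayed_ack;
    };

    bool op_is_discardable(const OpInfo& op) {
      return !op.client_connected && !op.is_replayed_ack;
    }
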
Sage Weil [Tue, 25 Oct 2011 04:44:36 +0000 (21:44 -0700)]
osd: handle missing/degraded in op thread
The _handle_op() method (and friends) is called when an op is initially
queued and when it is requeued. In the requeue case we have to be more
careful because the caller may be in the middle of doing all sorts of
random stuff. That means we need to limit ourselves to queueing or
discarding the op, and refrain from doing anything else with dangerous
side effects.
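
The constraint looks roughly like this (illustrative, not the actual
_handle_op() code):

    #include <deque>

    struct Op { bool discardable = false; };

    // On requeue the caller may be mid-way through other work, so we limit
    // ourselves to discarding the op or putting it (back) on the queue; no
    // other side effects are safe here.
    void handle_op(std::deque<Op*>& queue, Op* op, bool requeued) {
      if (op->discardable) {
        delete op;
        return;
      }
      if (requeued)
        queue.push_front(op);   // preserve original ordering
      else
        queue.push_back(op);
    }
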
Josh Durgin [Mon, 24 Oct 2011 19:38:01 +0000 (12:38 -0700)]
librados: use stored snap context for all operations
Using an empty snap context led to the failure of
test_rbd.TestImage.test_rollback_with_resize, since clones weren't
created when deleting objects. This test now passes.
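
For reference, this is roughly how a self-managed snap context is pinned on
an IoCtx in the librados C++ API; the fix amounts to installing the stored
context instead of an empty one (illustrative usage, not the librbd diff
itself):

    #include <rados/librados.hpp>
    #include <vector>

    // Install the stored snap context so writes and deletes clone objects
    // as needed; with an empty context no clones were made, which broke
    // rollback after a resize.
    int set_write_snap_context(librados::IoCtx& io_ctx, librados::snap_t seq,
                               std::vector<librados::snap_t>& snaps) {
      // snaps are listed newest-first
      return io_ctx.selfmanaged_snap_set_write_ctx(seq, snaps);
    }
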
Greg Farnum [Thu, 20 Oct 2011 23:26:15 +0000 (16:26 -0700)]
rgw: fix check_disk_state; add a strip_namespace function.
Use copies of the IoCtx rather than references so that
we can set locators without breaking stuff, and make use of the
on-disk locators which we just added.
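
The IoCtx-copy trick looks like this (illustrative, not the rgw code):
because IoCtx is copyable, a per-call copy can carry its own locator key
without disturbing the caller's context.

    #include <rados/librados.hpp>
    #include <ctime>
    #include <string>

    // Take the IoCtx by value so this copy can carry its own locator key
    // without perturbing the caller's context.
    int stat_with_locator(librados::IoCtx io_ctx, const std::string& oid,
                          const std::string& locator) {
      io_ctx.locator_set_key(locator);   // affects only this copy
      uint64_t size;
      time_t mtime;
      return io_ctx.stat(oid, &size, &mtime);
    }
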
Greg Farnum [Thu, 20 Oct 2011 20:49:45 +0000 (13:49 -0700)]
rgw: rename translate_raw_obj to translate_raw_obj_to_obj_in_ns
And document it, because the naming was so bad that neither the author nor
I noticed it wasn't doing what we wanted until I ran a test and it failed.
Sage Weil [Sun, 23 Oct 2011 23:16:03 +0000 (16:16 -0700)]
osd: pg_pool_t: set crash_replay_interval on data pool when decoding old encodings
We want to preserve the crash_replay_interval on old clusters being
upgraded. Kludge this by setting it to 60 (the old default) if the
crush_ruleset == 0 and owner == 0, which is normally true for just the
data pool.
This may also catch other pools that were created by hand, but that's still
better than enabling the replay interval on all pools when it is not needed.
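
A sketch of the decode kludge under those assumptions (fields loosely
follow pg_pool_t, not the actual decode path):

    #include <cstdint>

    struct pool_sketch {
      uint8_t  crush_ruleset = 0;
      uint64_t owner = 0;                  // auid
      uint32_t crash_replay_interval = 0;  // seconds; new field
    };

    // Old encodings carry no crash_replay_interval. The data pool was made
    // at mkfs time with crush_ruleset 0 and owner 0, so use that heuristic
    // to restore the old default of 60 seconds on upgrade.
    void fill_replay_interval_on_decode(pool_sketch& p) {
      p.crash_replay_interval =
          (p.crush_ruleset == 0 && p.owner == 0) ? 60 : 0;
    }
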
Sage Weil [Sun, 23 Oct 2011 22:32:58 +0000 (15:32 -0700)]
osd: make osd replay interval a per-pool property
Change the config value to only control the interval set when the data
pool is first created (presumably during mkfs), and start the replay
interval based on the pool property.
Introduce a per-pool crash_replay_interval so we can control whether
the OSD waits for replayed ACKed but not COMMITted requests for this
PG. For the metadata and rbd pools, for instance, the replay window
is useless.
Introduce a generic flags field, while we're modifying the encoding.
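
The activation-time decision then reduces to a per-pool check
(illustrative):

    #include <cstdint>

    struct pool_props { uint32_t crash_replay_interval = 0; };  // seconds

    // The replay window now comes from the pool, not the global config;
    // metadata and rbd pools set it to 0 and skip the window entirely.
    bool should_start_replay_window(const pool_props& pool) {
      return pool.crash_replay_interval > 0;
    }
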
Sage Weil [Sun, 23 Oct 2011 03:41:03 +0000 (20:41 -0700)]
osd: fix PG::Log::copy_after wrt backlogs (again)
Commit 68fe748fc2d703623050e8f2a448a0fd31ca8a0f fixed half of this problem,
but set this->tail incorrectly. If we read all the way to other.tail, the
entry we are on is a backlog entry, and its version is probably not
other.tail. Do not reset tail in this case, because we already set it to
other.tail above.
OTOH if we hit v, we do want to set this->tail to the current record as it
is the one that precedes the first log entry.
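
In code, the corrected loop looks roughly like this (simplified types, not
the actual PG::Log::copy_after):

    #include <list>

    struct eversion { unsigned epoch = 0, version = 0; };
    inline bool operator<=(const eversion& a, const eversion& b) {
      return a.epoch < b.epoch ||
             (a.epoch == b.epoch && a.version <= b.version);
    }

    struct Entry { eversion version; };

    struct Log {
      eversion tail, head;
      std::list<Entry> entries;
    };

    // Copy the entries of 'other' strictly newer than v. tail starts at
    // other.tail and is only reset when we stop at v; if we instead walk
    // off the end (a pure backlog), the last entry seen is a backlog entry
    // whose version is not necessarily other.tail, so leave tail alone.
    Log copy_after(const Log& other, eversion v) {
      Log result;
      result.head = other.head;
      result.tail = other.tail;        // set to other.tail up front
      for (auto p = other.entries.rbegin(); p != other.entries.rend(); ++p) {
        if (p->version <= v) {
          result.tail = p->version;    // record preceding first copied entry
          break;
        }
        result.entries.push_front(*p);
      }
      return result;
    }
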
This fixes an incorrect log.tail being sent to other nodes, which eventually
propagates as a log bound mismatch. For example,
Sage Weil [Fri, 21 Oct 2011 22:23:51 +0000 (15:23 -0700)]
osd: move may_need_replay calculation out of PriorSet
Although they both depend on past intervals, they are unrelated. Factor
out the may_need_replay calculation from PriorSet. Instead, do it right
before we activate when we need to decide whether to do a replay window
or not.
Sage Weil [Fri, 21 Oct 2011 22:02:34 +0000 (15:02 -0700)]
osd: fix last_clean interval bounds
It was _first and _last, inclusive, but the epochs are really points in
time, so _last should have been non-inclusive. Rename the variables to
_begin and _end, print them as proper intervals [begin,end), and fix the
PriorSet calculation to interpret the end bound properly.
Also break that check out into separate cases so that it is clear what is
really happening.
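
Concretely, with half-open bounds the membership test is:

    // Epochs are points in time, so a clean interval is half-open:
    // [begin, end). end itself is NOT part of the interval.
    bool in_clean_interval(unsigned e, unsigned begin, unsigned end) {
      return begin <= e && e < end;  // the old inclusive check used e <= end
    }
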
Sage Weil [Fri, 21 Oct 2011 21:45:59 +0000 (14:45 -0700)]
mon: fix last_clean_interval calculation
This up_from == first check is old and wrong. It may have been correct at
the time, when the OSD had a defined shutdown procedure, but that is not
currently the case. And if/when it is, the OSD can simply provide an
accurate clean_thru value.
Sage Weil [Fri, 21 Oct 2011 21:44:56 +0000 (14:44 -0700)]
osd: eliminate CRASHED state
This was an intermediate state that indicated that replay would be needed.
It was poorly named, and not very useful. Instead, just set the REPLAY
bit if we need replay, and then do it. No need for a separate CRASHED.
Sage Weil [Fri, 21 Oct 2011 16:56:19 +0000 (09:56 -0700)]
osd: simplify finalizing scrub on replica
We can simply call osr.flush() (with pg lock held) to ensure that prior
writes are visible and scrubbable. This avoids the funky handoff to
op_applied() (which didn't seem to work for me just now, although I didn't
fully debug it).
Sage Weil [Fri, 21 Oct 2011 16:14:15 +0000 (09:14 -0700)]
osd: PriorSet: acting/up membership implies still alive
If the osd is in the acting or up set, we can assume it is still alive,
even though we don't know that for sure, because if it is not, we will
rebuild the PriorSet.
Note that we have a dependency here on up_thru: we could/should rebuild the
PriorSet based on it IF we think it might change the value of the CRASHED
flag and IF we care enough. Right now we don't. Marking CRASHED when we
don't need to is conservative, and not dangerous.
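
As a sketch, the alive test becomes (names illustrative, not the actual
PriorSet code):

    #include <set>

    // Members of the current up or acting set are assumed alive: if one of
    // them actually fails, the mapping changes and the PriorSet is rebuilt
    // anyway, so the assumption is self-correcting.
    bool assume_still_alive(int osd, const std::set<int>& up,
                            const std::set<int>& acting, bool osdmap_up) {
      return osdmap_up || up.count(osd) > 0 || acting.count(osd) > 0;
    }
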