Sage Weil [Wed, 1 Dec 2010 21:48:56 +0000 (13:48 -0800)]
filestore: call lower-level do_transactions() during journal replay
We used to call apply_transactions, which avoided rejournaling anything
because the journal wasn't writeable yet, but that path relies on threads,
finishers, and other machinery that isn't appropriate or necessary when
we're just replaying journaled events.
Instead, call the lower-level do_transactions() directly.
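A minimal sketch of the idea, with hypothetical types and names (the real
FileStore code is far more involved): replay feeds each journaled
transaction list straight to the low-level apply function, with no queues,
threads, or finishers in the way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for ObjectStore::Transaction and FileStore.
struct Transaction { std::string op; };

struct FileStoreSketch {
  std::vector<std::string> applied;  // stands in for on-disk state

  // Low-level apply: just performs the ops; no threads, no journaling.
  int do_transactions(std::vector<Transaction>& tls,
                      unsigned long long seq) {
    (void)seq;
    for (auto& t : tls)
      applied.push_back(t.op);
    return 0;
  }

  // Replay loop: feed each journaled entry straight to do_transactions().
  int journal_replay(std::vector<std::vector<Transaction>>& journal) {
    unsigned long long seq = 0;
    for (auto& entry : journal)
      do_transactions(entry, ++seq);
    return 0;
  }
};
```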
I've found the manpage problem I noted before. It's about
monmaptool; the CLI states its usage as:
[--print] [--create [--clobber]] [--add name 1.2.3.4:567] [--rm name]
<mapfilename>
But the manpage states this as an example:
monmaptool --create --add 192.168.0.10:6789 --add 192.168.0.11:6789 --add
192.168.0.12:6789 --clobber monmap
This is definitely missing 'name' after the '--add' switch, resulting in
the error message "invalid ip:port '--add'". The attached patch fixes this
inconsistency.
Create a copy constructor for object_info_t, since we often want to copy
an object_info_t and would rather not try to remember all the fields.
Drop the lost parameter from one of the other constructors, because it's
not used that much.
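A trimmed-down illustration of the change; the fields here are hypothetical
stand-ins for object_info_t's real members:

```cpp
#include <cassert>
#include <string>

// Hypothetical, trimmed-down object_info_t: the copy constructor copies
// every field in one place, so callers no longer have to remember the
// full field list, and the rarely-used 'lost' parameter is dropped from
// the ordinary constructor.
struct object_info_t {
  std::string soid;       // object id (stand-in for sobject_t)
  unsigned version = 0;
  bool lost = false;

  object_info_t(const std::string& s, unsigned v)
    : soid(s), version(v) {}                 // no 'lost' parameter anymore

  object_info_t(const object_info_t& other)  // copy every field
    : soid(other.soid), version(other.version), lost(other.lost) {}
};
```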
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
In PG::mark_obj_as_lost, we have to mark a missing object as lost. We
should not assume that we have an old version of the missing object in
the ObjectStore. If the object doesn't exist in the object store, we
have to create it so that recovery can function correctly.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Tue, 30 Nov 2010 23:43:53 +0000 (15:43 -0800)]
mds: fix resolve for surviving observers
Make all survivors participate in resolve stage, so that survivors can
properly determine the outcome of migrations to the failed node that did
not complete.
The sequence (before):
- A starts to export /foo to B
- C has ambiguous auth (A,B) in its subtree map
- B journals import_start
- B fails
...
- B restarts
- B sends resolves to everyone
- does not claim /foo
- A sends resolve _only_ to B
- does claim /foo
- B knows its import did not complete
- C doesn't know anything
Also, the maybe_resolve_finish stuff was totally broken because the
recovery_set wasn't initialized.
See new (commented out) assert in Migrator.cc to reproduce the above.
In _process_pg_info, if the primary sends us a PG::Log, a replica should
merge that log into its own.
mark_all_unfound_as_lost / share_pg_log: don't send the whole PG::Log.
Just send the new entries that were just added when marking the objects
as lost.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Tue, 30 Nov 2010 20:48:32 +0000 (12:48 -0800)]
osd: fix misuses of OLOC_BLANK
Commit 6e2b594b fixed a bunch of bad get_object_context() calls, but even
with the parameter fixed some were still broken. Pass in a valid oloc in
those cases. The only places where OLOC_BLANK _is_ still used are where we
know we have the object locally and will load a valid value off disk.
Sage Weil [Tue, 30 Nov 2010 16:30:57 +0000 (08:30 -0800)]
filestore: make sure blocked op_start's wake up in order
If they wake up out of order (which, theoretically, they could before), we
can screw up journal submitting order in writebehind mode, or apply order
in parallel and writeahead journal mode.
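One way to sketch the ordered-wakeup requirement (the names and mechanism
here are illustrative, not the actual FileStore code): hand each blocked
caller a ticket, and have the condition-variable predicate admit only the
ticket whose turn it is, so a broadcast can never let a later op overtake
an earlier one.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical gate guaranteeing that blocked op_start() callers wake in
// submission order.
class OrderedGate {
  std::mutex lock;
  std::condition_variable cond;
  unsigned long long next_ticket = 0;   // next ticket to hand out
  unsigned long long serving = 0;       // ticket currently allowed through
public:
  unsigned long long op_start() {
    std::unique_lock<std::mutex> l(lock);
    unsigned long long ticket = next_ticket++;
    cond.wait(l, [&] { return serving == ticket; });  // wait for our turn
    return ticket;
  }
  void op_finish() {
    std::lock_guard<std::mutex> l(lock);
    ++serving;                 // admit the next ticket, strictly in order
    cond.notify_all();
  }
};
```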
Sage Weil [Tue, 30 Nov 2010 16:24:57 +0000 (08:24 -0800)]
filestore: assert op_submit_finish is called in order
Verify/assert that we aren't screwing up the submission pipeline ordering.
Namely, we want to make sure that if op_apply_start() blocks, we wake up
in the proper order and don't screw up the journaling.
Sage Weil [Tue, 30 Nov 2010 15:54:42 +0000 (07:54 -0800)]
filejournal: rework journal FULL behavior and fix throttling
Keep distinct states for FULL, WAIT, and NOTFULL.
The old code was more or less correct at one point, but assumed the seq
changed on each commit, not each operation; in its prior state it was
totally broken.
Also fix throttling (we were leaking items in the throttler that were
submitted while the journal was full).
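A hedged sketch of the three-state idea; the state names, transitions, and
sequence handling below are illustrative rather than the actual FileJournal
logic. The key point is that WAIT is keyed off the operation seq, not the
commit: after space frees up we must wait until every op submitted while
FULL has committed, so replay can't see a gap.

```cpp
#include <cassert>

// Illustrative journal fullness states:
//   NOTFULL - normal operation, ops are journaled.
//   FULL    - no room; new ops are applied but not journaled.
//   WAIT    - space freed, but we wait out the un-journaled ops.
enum journal_state_t { NOTFULL, FULL, WAIT };

struct JournalSketch {
  journal_state_t state = NOTFULL;
  unsigned long long full_high_seq = 0;  // last op seq submitted while FULL

  void hit_full(unsigned long long seq) {  // ran out of space at op 'seq'
    state = FULL;
    full_high_seq = seq;
  }
  void space_freed() {                     // a commit freed journal space
    if (state == FULL)
      state = WAIT;                        // can't journal yet: gap pending
  }
  void committed_thru(unsigned long long seq) {
    if (state == WAIT && seq >= full_high_seq)
      state = NOTFULL;                     // gap is closed, resume journaling
  }
  bool should_journal() const { return state == NOTFULL; }
};
```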
Sage Weil [Tue, 30 Nov 2010 15:51:16 +0000 (07:51 -0800)]
filestore: refactor op_queue/journal locking
- Combine journal_lock and lock.
- Move throttling outside of the lock (this fixes potential deadlock in
parallel journal mode)
- Make interface nomenclature a bit more helpful
Install SIGSEGV / SIGABORT handlers with sigaction using SA_RESETHAND.
This will ensure that if the signal handler itself encounters another
fault, the default signal handler (usually dumping core) takes over. Also,
flush the log before dumping core.
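The installation itself might look roughly like this; flush_log and the
handler body are hypothetical stand-ins for the real code:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <signal.h>

static void flush_log() { fflush(stderr); }  // stand-in for the real flush

static void fatal_handler(int signum) {
  // With SA_RESETHAND, the disposition is already back to SIG_DFL by the
  // time we get here, so a fault inside this handler takes the default
  // action (core dump) instead of recursing.
  flush_log();
  raise(signum);  // re-raise so the default handler dumps core
}

int install_fatal_handlers() {
  struct sigaction act;
  memset(&act, 0, sizeof(act));
  act.sa_handler = fatal_handler;
  sigemptyset(&act.sa_mask);
  act.sa_flags = SA_RESETHAND;  // one-shot: reset to SIG_DFL on entry
  if (sigaction(SIGSEGV, &act, nullptr) < 0)
    return -1;
  if (sigaction(SIGABRT, &act, nullptr) < 0)
    return -1;
  return 0;
}
```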
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Tue, 23 Nov 2010 18:25:39 +0000 (10:25 -0800)]
client: remove inode from flush_caps list when auth_cap changes
Avoid confusing other code (e.g. kick_flushing_caps) by not staying on an
mds's flushing_caps list when we don't even have an auth_cap with it anymore.
We'll need to re-flush to a new MDS later.
osd: PG::read_log: don't be clever with lost xattr
Formerly, we had a special case in read_log for dealing with objects
that were present on disk but whose attributes were not. This
conflicts with our plans to mark objects as lost by putting a bit in the
object attributes, since without those attributes, we'll never know if
the objects were formerly marked as lost.
This should almost never happen, and if it does, we just handle the
objects as missing in the normal way.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
The might_have_unfound set is used by the primary OSD during recovery.
This set tracks the OSDs which might have unfound objects that the
primary OSD needs. As we receive Missing from each OSD in
might_have_unfound, we will remove the OSD from the set.
When might_have_unfound is empty, we will mark objects as LOST if the
latest version of the object resided on an OSD marked as lost.
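An illustrative sketch of that bookkeeping, with made-up types (the real
PG code tracks far more state):

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical tracker for the primary's recovery bookkeeping: query each
// candidate OSD, erase it when its Missing info arrives, and only once the
// set is empty conclude that the remaining unfound objects are truly lost.
struct UnfoundTracker {
  std::set<int> might_have_unfound;   // OSDs we still need to hear from
  std::set<std::string> unfound;      // objects with no known good copy

  void got_missing(int osd, const std::set<std::string>& has) {
    for (auto& oid : has)
      unfound.erase(oid);             // that OSD has a usable copy
    might_have_unfound.erase(osd);    // heard from this OSD
  }
  bool can_mark_lost() const {        // all candidates accounted for?
    return might_have_unfound.empty();
  }
};
```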
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
We want to remove replicas that we don't ack, but those don't appear in
the strong_inode map; they're appended to the base_inode bufferlist. Make
a (temporary) set to track who those are so that we know who to get rid of.
Greg Farnum [Mon, 22 Nov 2010 23:04:22 +0000 (15:04 -0800)]
types: Allow inodeno_t structs to alias.
This removes a compiler warning that appeared after a gcc upgrade and is
apparently erroneous, about inodeno_t usage violating strict-aliasing rules
when the + operator is used.
Greg Farnum [Mon, 22 Nov 2010 23:02:54 +0000 (15:02 -0800)]
messenger: init rc to -1, removing compiler warning.
rc actually is initialized before all uses, but compilers tend to
have trouble with assignment in if-else branches, and -1 is considered
invalid so there's no danger of refactoring breaking anything.
Samuel Just [Tue, 16 Nov 2010 23:29:40 +0000 (15:29 -0800)]
Causes the MDSes to switch among a set of stray directories when
switching to a new journal segment.
MDSCache:
The stray member has been replaced with strays, an array of inodes
representing the set of available stray directories, as well as
stray_index indicating the index of the current stray directory.
get_stray() now returns a pointer to the current stray directory
inode.
advance_stray() advances stray_index to the next stray directory.
migrate_stray no longer takes a source argument; the source mds
is inferred from the parent of the dir entry.
stray dir entries are now stray<index> rather than stray.
scan_stray_dir now scans all stray directories.
MDSLog:
start_new_segment now calls advance_stray() on MDSCache to force a new
stray directory.
mdstypes:
NUM_STRAY indicates the number of stray directories to use per MDS
MDS_INO_STRAY now takes an index argument as well as the mds number
MDS_INO_STRAY_OWNER(i) returns the mds owner of the stray directory i
MDS_INO_STRAY_INDEX(i) returns the index of the stray directory i
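A hedged sketch of the inode-number arithmetic this layout implies; the
offset and NUM_STRAY values below are invented for illustration and are
not Ceph's actual constants:

```cpp
#include <cassert>

// Hypothetical layout: each MDS owns NUM_STRAY consecutive stray-directory
// inode numbers, so both the owner and the index can be recovered from
// the inode number alone.
const unsigned long long STRAY_OFFSET = 0x600;  // made-up base
const unsigned NUM_STRAY = 10;                  // stray dirs per MDS

unsigned long long mds_ino_stray(unsigned mds, unsigned i) {
  return STRAY_OFFSET + (unsigned long long)mds * NUM_STRAY + i;
}
unsigned stray_owner(unsigned long long ino) {   // which mds owns it
  return (unsigned)((ino - STRAY_OFFSET) / NUM_STRAY);
}
unsigned stray_index(unsigned long long ino) {   // which stray dir it is
  return (unsigned)((ino - STRAY_OFFSET) % NUM_STRAY);
}
```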
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
generate_past_intervals: generate back to lastclean
PG::generate_past_intervals needs to generate all the intervals back to
history.last_epoch_clean, rather than just to
history.last_epoch_started. This is required by
PG::build_might_have_unfound, which needs to examine these intervals
when building the might_have_unfound set.
Move the check for whether past_intervals is up-to-date into
generate_past_intervals itself. Fix the check.
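A small sketch of the corrected up-to-date check, using hypothetical
types in place of the real pg_interval map: the cached intervals must
reach all the way back to last_epoch_clean, not merely last_epoch_started,
before build_might_have_unfound can trust them.

```cpp
#include <cassert>
#include <map>

struct Interval { unsigned first, last; };  // stand-in for pg interval

// past_intervals is keyed by each interval's first epoch; it is only
// up to date if its oldest entry starts at or before last_epoch_clean.
bool past_intervals_up_to_date(
    const std::map<unsigned, Interval>& past_intervals,
    unsigned last_epoch_clean) {
  if (past_intervals.empty())
    return last_epoch_clean == 0;  // nothing clean yet, nothing to cover
  return past_intervals.begin()->second.first <= last_epoch_clean;
}
```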
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>