We need to ensure that buckets are output after their dependencies. The
best way to do this is a depth-first traversal of the bucket directed
acyclic graph. The previous solution was incorrect because in some
cases it didn't traverse the graph in the right order.
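
As a rough illustration of the ordering requirement, here is a minimal
post-order depth-first traversal. It is only a sketch, not the crushtool
code: BucketGraph, visit_bucket and dump_order are hypothetical names, and
the graph is assumed to be a plain map from bucket id to referenced ids.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for the bucket graph: bucket id -> ids of the items
// it references. Ids that are not keys are treated as non-bucket leaves.
using BucketGraph = std::map<int, std::vector<int>>;

// Emit `id` only after every bucket it references has been emitted.
static void visit_bucket(const BucketGraph& graph, int id,
                         std::set<int>& visited, std::vector<int>& order) {
  auto it = graph.find(id);
  if (it == graph.end())
    return;  // not a bucket, so there is nothing to dump
  if (!visited.insert(id).second)
    return;  // already reached through another parent
  for (int child : it->second)
    visit_bucket(graph, child, visited, order);
  order.push_back(id);  // all dependencies are already in `order`
}

// Return bucket ids in an order where dependencies come first.
// Assumes the graph is acyclic, as the bucket graph is expected to be.
std::vector<int> dump_order(const BucketGraph& graph) {
  std::set<int> visited;
  std::vector<int> order;
  for (const auto& entry : graph)
    visit_bucket(graph, entry.first, visited, order);
  assert(order.size() == graph.size());  // each bucket emitted exactly once
  return order;
}
```

Because a bucket is appended only after the recursion into its children
returns, a bucket shared by two parents is still emitted once, ahead of both.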
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
All the callers of CrushWrapper::get_bucket() check for error codes, but
not for NULL returns. So if there is no bucket (i.e., a NULL pointer) at
crush->bucket[i], just return the error code ENOENT. This is consistent
with how we handle other out-of-bounds requests.
Also, don't allow the caller to get us to try to access negative indices
in crush->bucket.
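
The checks described above might look roughly like the sketch below. The
Bucket and Map types, the out-parameter signature, and the negative sign on
ENOENT are all assumptions made for illustration. They are not the actual
CrushWrapper interface.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the CRUSH structures.
struct Bucket { int id; };

struct Map {
  std::vector<Bucket*> buckets;  // slots may be null
};

// Look up slot `idx`. Returns 0 and sets *out on success. Returns -ENOENT
// for a negative index, an index past the end, or an empty (null) slot.
int get_bucket(const Map& map, int idx, const Bucket** out) {
  assert(out != nullptr);
  if (idx < 0 || static_cast<std::size_t>(idx) >= map.buckets.size())
    return -ENOENT;
  const Bucket* b = map.buckets[idx];
  if (b == nullptr)
    return -ENOENT;
  *out = b;
  return 0;
}
```

With this shape, a caller that only checks the return code never sees a
NULL bucket.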
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
In crushtool, dump buckets in tree order. Buckets that reference other
buckets must be dumped after their dependencies, or else re-compilation
will fail.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Sat, 6 Nov 2010 18:35:54 +0000 (11:35 -0700)]
mds: remove MIX_STALE
Yay, we don't need it!
If we can't update the frag on scatter, fine. The staleness of the frag
is implicit in the frag's scatter stat version not matching the inode's.
If/when we do want to update it, the frag will clearly be writable, and
we can bring it back in sync then.
Sage Weil [Sun, 7 Nov 2010 03:17:32 +0000 (20:17 -0700)]
mds: don't use helper for rename srcdn
The rdlock_path_xlock_dentry helper works for _auth_ dentries that we
create locally in an auth dirfrag. For the srcdn, we need to discover an
_existing_ dentry that is not necessarily auth.
Call path_traverse ourselves, but be careful to take the appropriate locks
on the resulting dn, dir, and ancestors.
Sage Weil [Sat, 6 Nov 2010 18:02:13 +0000 (11:02 -0700)]
mds: never complete a gather on a flushing lock
scatter_writebehind() takes a wrlock, but that may still allow the lock
to complete a gather to LOCK and even move to, say, MIX before the data is
committed. Bad news!
Sage Weil [Sat, 6 Nov 2010 04:52:28 +0000 (21:52 -0700)]
mds: preserve stale state on import; some cleanup
Our new invariant is that MIX_STALE always implies is_stale(). And on
import, if is_stale(), MIX becomes MIX_STALE. This ensures that a replica
that we put into MIX_STALE doesn't turn back into MIX if we import it
and take the auth's state in CInode::decode_import().
Previously I changed the std::multimap decoder to minimize the number of
constructor invocations. However, it could be much more expensive to
copy an initialized (decoded) val_t than to copy an empty one, for
example when decoding std::multimap<int, std::set<int> >. So change
the code to insert a non-decoded val_t again.
However, this still saves two constructor invocations over the original.
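
To make the trade-off concrete, here is a minimal sketch of the
"insert empty, then decode in place" pattern. The Cursor type and the wire
layout (a count followed by elements) are assumptions for illustration only.
They stand in for the real buffer iterator and encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffer iterator: a cursor over ints.
struct Cursor {
  const std::vector<int>& data;
  std::size_t pos;
  explicit Cursor(const std::vector<int>& d) : data(d), pos(0) {}
  int next() {
    assert(pos < data.size());
    return data[pos++];
  }
};

// Decode a set laid out as: count, then that many elements.
void decode(std::set<int>& s, Cursor& in) {
  s.clear();
  int n = in.next();
  for (int i = 0; i < n; ++i)
    s.insert(in.next());
}

// Decode a multimap laid out as: count, then (key, value) pairs.
void decode(std::multimap<int, std::set<int>>& m, Cursor& in) {
  m.clear();
  int n = in.next();
  for (int i = 0; i < n; ++i) {
    int key = in.next();
    // Insert an empty value first (cheap to copy), then decode into the
    // stored element, so the populated set is never copied.
    auto it = m.insert(std::make_pair(key, std::set<int>()));
    decode(it->second, in);
  }
}
```

multimap::insert returns an iterator to the new element, which is what makes
decoding in place possible without a second lookup.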
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Samuel Just [Thu, 21 Oct 2010 23:54:01 +0000 (16:54 -0700)]
PG.cc: build_scrub_map now drops the PG lock while scanning the PG
build_inc_scrub_map scans all files modified since the given
version number and creates an incremental scrub map to
be merged with a scrub map created with build_scrub_map.
This scan is done while holding the pg lock.
ScrubMap.objects is now represented as a map rather than as
a vector.
PG.h: Added last_update_applied and finalizing_scrub members to
PG.
ReplicatedPG.cc:
calc_trim_to will not trim the log during a scrub (since
replicas need the log to construct incremental maps)
sub_op_modify_applied and op_applied maintain a
last_update_applied PG member to be used for determining
how far back a replica needs to go to construct an
incremental scrub map.
osd_types.h:
Added merge_incr method for combining a scrub map with
a subsequent incremental scrub map.
ScrubMap.objects is now a map from sobject_t to object.
PG scrubs will now drop the PG lock while initially scanning the PG
collection allowing writes to continue. The scrub map will be tagged
with the most recent version applied. After halting writes, the
primary will request an incremental map from any replicas whose map
versions do not match log.head.
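
One way to picture merging an incremental scrub map into a full one is
sketched below. Everything here is an assumption for illustration:
SimpleScrubMap, ScrubObject, the string keys, the valid_through/incr_since
version fields, and the "negative" marker for objects removed since the full
map was built are hypothetical and are not taken from osd_types.h.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-object scrub record. `negative` marks an object that was
// removed after the full map was built (an assumed convention).
struct ScrubObject {
  unsigned long size;
  bool negative;
};

struct SimpleScrubMap {
  std::map<std::string, ScrubObject> objects;
  unsigned long valid_through = 0;  // last version this map reflects
  unsigned long incr_since = 0;     // for incremental maps: starting version

  // Fold an incremental map into this one. Entries in `incr` override
  // existing ones, and negative entries remove the object.
  void merge_incr(const SimpleScrubMap& incr) {
    assert(valid_through == incr.incr_since);  // maps must be contiguous
    valid_through = incr.valid_through;
    for (const auto& entry : incr.objects) {
      if (entry.second.negative)
        objects.erase(entry.first);
      else
        objects[entry.first] = entry.second;
    }
  }
};
```

Keying the objects by name in a map, rather than storing them in a vector,
is what keeps this merge a simple per-key update.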
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Sage Weil [Fri, 5 Nov 2010 06:15:06 +0000 (23:15 -0700)]
mds: do not bump scatter stat lock in predirty_journal_parents
If we're in the MIX state, we clearly can't touch this without screwing up
the delicate scatter/gather behavior. If we're in, say, LOCK, there is
still no reason to update it. One frag at least is local and auth if we
are in this code, but there may be other frags on other nodes. This would
just make them appear stale when they are not.
Sage Weil [Fri, 5 Nov 2010 05:48:09 +0000 (22:48 -0700)]
mds: mark scatterlock stale on import of stale frag scatter stat
When the lock scattered, if we didn't have an auth frag that was frozen,
we go into MIX state. Later, we may import a stale dirfrag. We need to
move to MIX_STALE at that point, and/or mark the lock stale so that any
subsequent transition does so.
Sage Weil [Fri, 5 Nov 2010 05:44:01 +0000 (22:44 -0700)]
mds: match bottom half of assimilate_dirty_rstat_inodes with a dir flag
We only do the assimilate_dirty_rstat_inodes if we do an update AND the
frag rstat was non-stale, but the bottom half (_finish) doesn't have the
same info to know whether we did it because the top half updates the
fragstat version. Use a flag to indicate we've updated the dirfrag so
the bottom half will only run when needed.
Sage Weil [Fri, 5 Nov 2010 05:19:53 +0000 (22:19 -0700)]
mds: fix inode version used for inest in decode_lock_state
We need to pass the inode rstat's version into finish_scatter_update, not
the shadowed local variable. Otherwise we don't update the dirfrag when
we should.
Sage Weil [Thu, 4 Nov 2010 05:22:54 +0000 (22:22 -0700)]
mds: wait for last_failure_osd_epoch before starting journal replay
This is extremely important, and it forces the MDS to get the osdmap that
includes the blacklist entry for its predecessor. This in turn means that
any OSD we contact trying to read the journal will be forced to get that
osdmap (or newer) before handling our read request, which means that
anything we read cannot be overwritten by a racing request from our
predecessor. This prevents two MDSs writing to the journal at the same
time.
This change fixes potential (and observed!) journal corruption.
Don't start the SafeTimer when class Monitor is created. We want to hold off on
starting the thread until SimpleMessenger has fork()ed the process. Instead,
start the timer thread in Timer::init().
Use an auto_ptr to store the SafeTimer.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Logger.cc: avoid creating SafeTimer in global-ctor
Don't create a SafeTimer at global constructor time. Timers
contain a Thread, and the library stuff may not have been initialized at
global constructor time. Instead, just create the timer when we need it,
in flush_all_loggers.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Rework Timer and SafeTimer to be more efficient and to handle shutdown
correctly. Document the API, especially which locks need to be held where.
The destructor for both Timer and SafeTimer now joins the timer thread
safely. The shutdown() function is available to callers who want to join
it before the Timer is destroyed.
To make things more efficient, don't create a new std::set every time we
insert a Context. Use multimap instead. Don't signal the condition
variable unless the event we have inserted comes before all the other
events in the scheduled map. Don't allocate an extra Context in
SafeTimer.
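
The multimap and "signal only when needed" ideas can be sketched as follows.
This is an illustration under stated assumptions, not the Timer code: the
Schedule class, integer tick times and std::function callbacks are
hypothetical, and locking and the condition variable itself are left out.
add_event_at() simply reports whether the caller would need to signal.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

class Schedule {
 public:
  using Callback = std::function<void()>;

  // Add `cb` to run at time `when`. Returns true only if the new event is
  // now the earliest one, i.e. the timer thread's wakeup time has changed
  // and the caller should signal it. In every other case the thread's
  // current wakeup time is still correct.
  bool add_event_at(long when, Callback cb) {
    assert(cb);  // an empty callback would throw when run
    auto it = events_.insert(std::make_pair(when, std::move(cb)));
    return it == events_.begin();
  }

  // Run and remove every event due at or before `now`. Returns how many ran.
  std::size_t run_due(long now) {
    std::size_t ran = 0;
    while (!events_.empty() && events_.begin()->first <= now) {
      Callback cb = std::move(events_.begin()->second);
      events_.erase(events_.begin());
      cb();
      ++ran;
    }
    return ran;
  }

 private:
  // One entry per event. Duplicate times are allowed, so no per-time
  // std::set has to be created on insert.
  std::multimap<long, Callback> events_;
};
```

An event inserted with the same time as the current earliest one lands after
it in the multimap, so it correctly does not trigger a signal.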
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>