Sage Weil [Mon, 17 Nov 2008 21:23:03 +0000 (13:23 -0800)]
mds: use last_sent (not last_open) to untangle cap release races
If we use last_open, the client has to be smart about ignoring
MDS revocations after it sends a release request. (Or, the MDS has
to somehow know the ack is for an old cap.) Instead, just
serialize release over all cap messages sent to the client. It may
make for a slightly chattier cap release in some cases, but those
cases should be very rare, and this is simpler.
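
A minimal sketch of the idea, not the actual MDS code: the type `CapState` and its members are hypothetical names chosen for illustration. Every message sent to the client bumps a per-cap sequence, and a release is honored only if it acknowledges the latest message sent.

```cpp
#include <cstdint>

// Hypothetical, simplified per-client capability state (not the real Ceph class).
struct CapState {
    uint64_t last_sent = 0;  // seq of the most recent cap message sent to the client
    bool held = true;        // whether the client still holds the cap

    // Any message to the client (grant, revoke, ...) advances the sequence.
    uint64_t send_message() { return ++last_sent; }

    // Honor a release only if the client had seen every message we sent.
    // A release carrying an older seq raced with a later message and is ignored.
    bool handle_release(uint64_t client_seq) {
        if (client_seq < last_sent)
            return false;
        held = false;
        return true;
    }
};
```

In this sketch a client whose release is ignored would have to send another one after processing the later message, which is the "slightly chattier" case mentioned above.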
Sage Weil [Mon, 17 Nov 2008 17:03:47 +0000 (09:03 -0800)]
osd: remember past intervals instead of recalculating each time
This _vastly_ improves the speed of build_prior (and thus activate_map).
There is no need to recalculate this information each time as it is fully
dependent on _old_ OSDMaps, not current cluster state.
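
The caching idea can be sketched roughly as follows. `Interval` and `PastIntervals` here are hypothetical, simplified types, and the `compute` callback stands in for whatever walks the old OSDMaps; none of these names are taken from the Ceph source.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical summary of one past interval.
struct Interval {
    uint32_t first, last;     // epoch range covered by the interval
    std::vector<int> acting;  // OSDs that were acting during the interval
};

// Hypothetical cache keyed by the first epoch of each interval. Old maps
// never change, so a cached entry never needs to be invalidated.
class PastIntervals {
    std::map<uint32_t, Interval> by_first_;
    int computed_ = 0;  // how many times the expensive computation ran

public:
    template <typename F>
    const Interval& get(uint32_t first_epoch, F compute) {
        auto it = by_first_.find(first_epoch);
        if (it == by_first_.end()) {
            ++computed_;
            it = by_first_.emplace(first_epoch, compute(first_epoch)).first;
        }
        return it->second;
    }

    int computed() const { return computed_; }
};
```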
Sage Weil [Fri, 14 Nov 2008 23:31:07 +0000 (15:31 -0800)]
mds: adjust purge_stray sequence; include explicit ino destroy
First purge the inode content. Don't bother journaling our intent,
as that's implied by the fact that it's an unused stray.
Once purged, journal an event that destroys the inode and unlinks
the dentry. Don't remove null dentry itself, as we still need to
update the stray dir... it will get removed when that is committed.
Sage Weil [Fri, 14 Nov 2008 21:48:30 +0000 (13:48 -0800)]
mon: commit large numbers of state values quickly
Write them all, then sync once at the end.
Also include some infrastructure for using the latest stashed value
to recover. Don't use it yet, though. The interaction with
keeping last_committed and latest stashed values in sync wrt a
failure between the two is a bit tricky.
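
As an illustration of "write them all, then sync once", here is a small sketch against a hypothetical in-memory `Store` that only counts syncs. It does not model the monitor's real on-disk store, and `commit_all` is an invented helper name.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an on-disk store: writes are buffered in
// `pending` and become durable only when sync() is called.
struct Store {
    std::map<std::string, std::string> pending, durable;
    int syncs = 0;

    void write(const std::string& key, const std::string& value) {
        pending[key] = value;
    }

    void sync() {
        for (const auto& kv : pending)
            durable[kv.first] = kv.second;
        pending.clear();
        ++syncs;
    }
};

// Write every value first, then pay for a single sync at the end
// instead of one sync per value.
void commit_all(Store& s,
                const std::vector<std::pair<std::string, std::string>>& values) {
    for (const auto& kv : values)
        s.write(kv.first, kv.second);
    s.sync();
}
```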
Sage Weil [Thu, 13 Nov 2008 21:30:26 +0000 (13:30 -0800)]
mds: treat open requests as non-idempotent
The problem is that the reply contains a capability, and as such
is stateful and can't be lost. Forwards by the MDS on behalf of
the client, however, introduce the possibility of multiple copies
of a request in flight if one of the MDSs fails, and the client
will drop any duplicate replies it receives.
Alternatively, the client could _also_ parse duplicate responses
(i.e. call fill_trace). I'm not sure if that's a good idea. In
any case, MDS forwarded requests are only really important for
dealing with flash flood scenarios on extremely large clusters,
so let's just set this aside for now.
Sage Weil [Wed, 12 Nov 2008 22:15:22 +0000 (14:15 -0800)]
mds: multiversion inodes with multiple links, too
We may have remote links that get snapped. They need to be able to
find the (single) anchored multiversion inode to get the correct
version. (The anchor table isn't versioned.)
We'll need to go a step further than this and create snaprealms for
some of these too in order to handle inodes linked into multiple
realms. But that needs backpointers first...
Sage Weil [Wed, 12 Nov 2008 19:23:01 +0000 (11:23 -0800)]
mds: remove cap _after_ journaling update, at the same time we send the msg
There was an ordering problem that could come up when we prepared
a release message and removed the cap, but then didn't send it to
the client until after the update was journaled. This could cause
us to remove the _next_ instance of the capability (from a
subsequent open) in certain circumstances.
Instead, wait until after we journal the update before removing
the client cap and sending the ack. Since time has passed,
reverify that the release request seq is still >= the last_open at
that time. Introduce a helper to avoid duplicating code for the
case where no journaling is necessary and the cap is immediately
released in _do_cap_update.
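
The ordering described above might look roughly like this. `Cap`, `Journal`, `release_if_current`, and `handle_release` are hypothetical names for a simplified model; the real logic lives in the MDS and is not reproduced here.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical capability state.
struct Cap {
    uint64_t last_open = 0;  // seq of the most recent open
    bool present = true;     // whether the client cap still exists
};

// Hypothetical journal: callbacks run only once the update has committed.
struct Journal {
    std::queue<std::function<void()>> waiters;

    void submit(std::function<void()> on_commit) {
        waiters.push(std::move(on_commit));
    }

    void flush() {
        while (!waiters.empty()) {
            auto fn = std::move(waiters.front());
            waiters.pop();
            fn();
        }
    }
};

// Shared helper: remove the cap only if the release still refers to the
// current instance. Returns true if the cap was removed (the ack would be
// sent at this point).
bool release_if_current(Cap& cap, uint64_t release_seq) {
    if (release_seq < cap.last_open)
        return false;  // a newer open happened in the meantime
    cap.present = false;
    return true;
}

void handle_release(Journal& j, Cap& cap, uint64_t release_seq, bool need_journal) {
    if (!need_journal) {
        release_if_current(cap, release_seq);  // immediate path
        return;
    }
    // Deferred path: re-check the seq after the journal commit, since the
    // cap may have been re-opened while we were waiting.
    j.submit([&cap, release_seq] { release_if_current(cap, release_seq); });
}
```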
Sage Weil [Wed, 12 Nov 2008 18:23:15 +0000 (10:23 -0800)]
mds: use null snap context for purge if no realm
The inode may be unlinked, e.g. when we are replaying a journaled
purge_inode EUpdate. The snapc is not really important, as the
OSD will use the newer snapc it has for the object. And we only
really care when we're purging the HEAD anyway.
Sage Weil [Mon, 10 Nov 2008 23:53:26 +0000 (15:53 -0800)]
mds: fix replay lookup of snapshotted null dentries
Look up replayed dentry by dnLAST, not dnfirst, as we do with
primary and remote dentries, because that is how we identify
dentry instances in the timeline.
Sage Weil [Mon, 10 Nov 2008 23:25:24 +0000 (15:25 -0800)]
osd: ignore logs I don't expect without freaking out
We may get a log we didn't think we requested if the prior_set gets rebuilt
or our peering is restarted for some other reason. Just ignore it, instead
of asserting.
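
A minimal sketch of the "ignore instead of assert" behaviour, using a hypothetical `Peering` type that tracks which OSDs a log was requested from; the names are illustrative and are not taken from the OSD code.

```cpp
#include <set>

// Hypothetical peering state.
struct Peering {
    std::set<int> log_requested_from;  // OSDs we are still waiting on
    int logs_merged = 0;

    // Returns true if the log was expected and processed. An unexpected log
    // (for example after the prior set was rebuilt) is dropped quietly
    // rather than tripping an assertion.
    bool handle_log(int from_osd) {
        if (log_requested_from.erase(from_osd) == 0)
            return false;  // not requested: ignore
        ++logs_merged;
        return true;
    }
};
```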