Sage Weil [Thu, 13 Nov 2008 21:30:26 +0000 (13:30 -0800)]
mds: treat open requests as non-idempotent
The problem is that the reply contains a capability, and as such
is stateful and can't be lost. Forwarding by the MDS on behalf of
the client, however, introduces the possibility of multiple copies
of a request being in flight if one of the MDSs fails, and the
client will drop any duplicate replies it receives.
Alternatively, the client could _also_ parse duplicate responses
(i.e. call fill_trace). I'm not sure if that's a good idea. In
any case, MDS forwarded requests are only really important for
dealing with flash flood scenarios on extremely large clusters,
so let's just set this aside for now.
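A rough illustration of why a duplicated open reply is a problem (the names below are hypothetical, not the actual client code): a client that filters replies by request tid drops the second copy, and with it the capability that copy carried.

    #include <set>
    #include <cstdint>

    struct Reply {
      uint64_t tid;
      bool carries_cap;   // open replies carry a capability
    };

    struct ClientSketch {
      std::set<uint64_t> handled;   // tids we've already answered

      void handle_reply(const Reply &r) {
        if (!handled.insert(r.tid).second) {
          // Duplicate reply: dropped, along with any capability it
          // carried.  That lost cap is why open can't be treated as an
          // idempotent request that is safe to have in flight twice.
          return;
        }
        // normal path: fill_trace(), install the cap, wake the waiter
      }
    };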
Sage Weil [Wed, 12 Nov 2008 22:15:22 +0000 (14:15 -0800)]
mds: multiversion inodes with multiple links, too
We may have remote links that get snapped. They need to be able to
find the (single) anchored multiversion inode to get the correct
version. (The anchor table isn't versioned.)
We'll need to go a step further than this and create snaprealms for
some of these too in order to handle inodes linked into multiple
realms. But that needs backpointers first...
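A minimal sketch of the rule this implies, with made-up types (not the MDS cache code): an inode with more than one link stays a single, anchored, multiversion inode when it is cowed, because remote links resolve through the unversioned anchor table.

    // Hypothetical, simplified inode
    struct Inode {
      int nlink;           // primary + remote links
      bool multiversion;   // old versions kept inside the one inode
    };

    void cow_for_snapshot(Inode &in) {
      if (in.nlink > 1) {
        // Remote links can only find one anchored location, so that
        // location has to carry every version itself.
        in.multiversion = true;
      }
      // else: the normal cow path may clone the inode into the snap
    }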
Sage Weil [Wed, 12 Nov 2008 19:23:01 +0000 (11:23 -0800)]
mds: remove cap _after_ journaling update, at the same time we send the msg
There was an ordering problem that could come up when we prepared
a release message and removed the cap, but then didn't send it to
the client until after the update was journaled. This could cause
us to remove the _next_ instance of the capability (from a
subsequent open) in certain circumstances.
Instead, wait until after we journal the update before removing
the client cap and sending the ack. Since time has passed,
re-verify that the release request's seq is still >= the last_open
at that time. Introduce a helper to avoid duplicating code for the
case where no journaling is necessary and the cap is immediately
released in _do_cap_update.
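A sketch of the new ordering with hypothetical helpers (not the actual Locker code): journal first, and only in the commit callback re-check the release against last_open before removing the cap and sending the ack.

    #include <cstdint>
    #include <functional>

    struct Capability {
      uint64_t last_open;   // seq of the most recent open on this cap
    };

    void journal_update(std::function<void()> on_commit);   // hypothetical
    void remove_cap_and_ack(Capability *cap);               // hypothetical

    void handle_cap_release(Capability *cap, uint64_t release_seq) {
      journal_update([=] {
        // Time has passed while the update was being journaled; a
        // subsequent open may have reissued the cap.  Only remove it if
        // the release still covers the latest open.
        if (release_seq >= cap->last_open)
          remove_cap_and_ack(cap);
      });
    }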
Sage Weil [Wed, 12 Nov 2008 18:23:15 +0000 (10:23 -0800)]
mds: use null snap context for purge if no realm
The inode may be unlinked, e.g. when we are replaying a journaled
purge_inode EUpdate. The snapc is not really important, as the
OSD will use the newer snapc it has for the object. And we only
really care when we're purging the HEAD anyway.
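Roughly the fallback being described, in sketch form (placeholder types, not the real SnapContext/SnapRealm):

    struct SnapContext { /* seq + snaps; empty by default */ };
    struct SnapRealm   { SnapContext snapc; };

    SnapContext snapc_for_purge(const SnapRealm *realm) {
      if (realm)
        return realm->snapc;
      // No realm (e.g. unlinked inode during replay of a journaled
      // purge_inode EUpdate): use an empty context.  The OSD will apply
      // the newer context it already has for the object.
      return SnapContext();
    }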
Sage Weil [Mon, 10 Nov 2008 23:53:26 +0000 (15:53 -0800)]
mds: fix replay lookup of snapshotted null dentries
Look up the replayed dentry by dn->last, not dn->first, as we do
with primary and remote dentries, because that is how we identify
dentry instances in the timeline.
Sage Weil [Mon, 10 Nov 2008 23:25:24 +0000 (15:25 -0800)]
osd: ignore logs i don't expect without freaking out
We may get a log we didn't think we requested if the prior_set gets rebuilt
or our peering is restarted for some other reason. Just ignore it, instead
of asserting.
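The shape of the change, sketched with hypothetical fields (not the real PG peering code): drop an unexpected log instead of asserting.

    #include <set>

    struct PeeringSketch {
      std::set<int> log_requested;   // osds we asked for logs this round

      void handle_log(int from) {
        if (!log_requested.count(from)) {
          // We didn't (or no longer) expect this log -- the prior_set
          // was rebuilt or peering restarted.  Ignore it rather than
          // asserting.
          return;
        }
        log_requested.erase(from);
        // ... merge the log and continue peering ...
      }
    };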
Sage Weil [Sun, 9 Nov 2008 16:43:14 +0000 (08:43 -0800)]
client: adjust objecter locking
We want to unlock client_lock before calling into the objecter, mainly because the
callbacks rely on SafeCond, which takes a lock to signal a Cond, and that gets awkward
without mutex recursion (see _write()'s sync case).
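A simplified stand-in for the pattern (std::mutex in place of Ceph's Mutex/Cond, hypothetical objecter call), showing why the caller drops client_lock before calling in:

    #include <mutex>
    #include <condition_variable>

    struct SafeCond {                // completion that signals a Cond
      std::mutex &lock;
      std::condition_variable &cond;
      bool &done;
      void finish() {
        std::lock_guard<std::mutex> l(lock);   // takes the lock to signal
        done = true;
        cond.notify_all();
      }
    };

    std::mutex client_lock;
    std::condition_variable cond;

    // Stand-in for an objecter call whose completion may run in the
    // calling thread.
    void objecter_call(SafeCond *onfinish) { onfinish->finish(); }

    void sync_op() {                 // roughly _write()'s sync case
      std::unique_lock<std::mutex> cl(client_lock);
      bool done = false;
      SafeCond onfinish{client_lock, cond, done};

      cl.unlock();                   // client_lock isn't recursive, so drop
      objecter_call(&onfinish);      // it before the completion can fire

      cl.lock();
      cond.wait(cl, [&] { return done; });   // wait for the completion
    }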
Sage Weil [Fri, 7 Nov 2008 21:26:41 +0000 (13:26 -0800)]
mds: match last snap exactly on replay, add_*_dentry
In general, we add new snapped dentries and THEN the new live dentry
to the metablob. That means that during replay, we see [2,2] followed
by [3,head], replacing [2,head]. The [2,2] dentry should be added
anew, without paying heed to [2,head], and then the [3,head] should
replace/update [2,head].
It was mainly just the assertions in add_*_dentry that were getting
in the way, but the lookup_exact_snap is also slightly faster.
Sage Weil [Fri, 7 Nov 2008 00:28:17 +0000 (16:28 -0800)]
mds: check dn->last when finding existing dentries during replay
We can't simply search for an existing dentry based on the name and end
snap, as that may turn up the wrong item. For example, if we have
[2,head] and the operations being replayed cowed it to [2,2] and [3,head], then
if we replay the [2,2] item first we'll find [2,head] (the _wrong_ dentry)
and throw an assertion.
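A sketch of the exact-match lookup this requires (hypothetical containers; the cache effectively keys dentries by name and last snapid): only a dentry whose last matches is a hit, so replaying [2,2] can never pick up [2,head].

    #include <map>
    #include <string>
    #include <utility>
    #include <cstdint>

    typedef uint64_t snapid_t;

    struct Dentry { snapid_t first, last; /* ... */ };

    // dentries keyed by (name, last)
    typedef std::map<std::pair<std::string, snapid_t>, Dentry> dentry_map;

    Dentry *lookup_exact_snap(dentry_map &dir,
                              const std::string &name, snapid_t last) {
      dentry_map::iterator p = dir.find(std::make_pair(name, last));
      if (p == dir.end())
        return 0;   // e.g. replaying [2,2]: do not fall back to [2,head]
      return &p->second;
    }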
Sage Weil [Thu, 6 Nov 2008 23:03:49 +0000 (15:03 -0800)]
osd: improve build_prior logic
If, during some interval since the pg last went active, we may have gone
rw but none of the osds from that interval survived, then we include all of
those osds in the prior_set (even though they're down), because they may
have written data that we want.
The prior logic appears to have been broken: it was only looking at the
primary osd.
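A rough sketch of the intended logic with made-up types (not PG::build_prior itself): for each interval that may have gone rw, take the surviving osds, and if none survived take all of them, down or not.

    #include <set>
    #include <vector>

    struct Interval {
      std::vector<int> acting;   // osds acting during the interval
      bool maybe_went_rw;        // could writes have happened then?
    };

    std::set<int> build_prior(const std::vector<Interval> &past_intervals,
                              const std::set<int> &up_osds) {
      std::set<int> prior;
      for (size_t i = 0; i < past_intervals.size(); i++) {
        const Interval &in = past_intervals[i];
        if (!in.maybe_went_rw)
          continue;
        bool any_survived = false;
        for (size_t j = 0; j < in.acting.size(); j++) {
          if (up_osds.count(in.acting[j])) {
            prior.insert(in.acting[j]);   // survivors always count
            any_survived = true;
          }
        }
        if (!any_survived) {
          // Nobody from that interval survived, so any of them -- not
          // just the primary -- may hold writes we need.  Include them
          // all, even though they're down.
          prior.insert(in.acting.begin(), in.acting.end());
        }
      }
      return prior;
    }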