Sage Weil [Thu, 28 Jul 2011 22:24:23 +0000 (15:24 -0700)]
mds: fix log trimming races
trim() would iterate over segments. It would take the *p segment, ++p,
then call try_expire(). But the _expired() function would also clean up
and (if possible) retire subsequent segments on the list if they were on
the expired list, invalidating the p iterator.
Untangle the mess by making expired segment trimming (i.e. removing from
segment list) a separate operation performed only by trim() (probably a
good idea anyway). This keeps the iterator safe/stable.
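The separation described above can be sketched roughly as follows. This is a minimal illustration, not the actual MDLog code: `Segment`, `Log`, `expired`, and `trim_expired_segments()` are hypothetical names, and the expiry condition is reduced to a flag.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical stand-in for a log segment; not the real LogSegment.
struct Segment {
  uint64_t offset;
  bool can_expire;
};

struct Log {
  std::map<uint64_t, Segment> segments;  // ordered by journal offset
  std::set<uint64_t> expired;            // offsets that have finished expiring

  // Expiry only records the fact. It never erases from `segments`,
  // so any iterator held by trim() stays valid.
  void try_expire(Segment& s) {
    if (s.can_expire)
      expired.insert(s.offset);
  }

  // Removal from the segment list happens here and nowhere else:
  // retire expired segments from the front, stopping at the first
  // segment that has not expired yet.
  void trim_expired_segments() {
    while (!segments.empty()) {
      auto first = segments.begin();
      if (expired.count(first->first) == 0)
        break;
      expired.erase(first->first);
      segments.erase(first);
    }
  }

  void trim() {
    for (auto p = segments.begin(); p != segments.end(); ) {
      Segment& s = p->second;
      ++p;            // safe: try_expire() cannot invalidate p
      try_expire(s);
    }
    trim_expired_segments();
    assert(segments.empty() || expired.count(segments.begin()->first) == 0);
  }
};
```

With segments 1, 2, and 4 expirable and segment 3 not, a single trim() retires 1 and 2; segment 4 stays on the list (and in `expired`) until 3 can be retired.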
Sage Weil [Thu, 28 Jul 2011 21:51:06 +0000 (14:51 -0700)]
mds: separate type for gratuitous debug ESubtreeMaps
Give these a different type so they are not interpreted as subtree
boundaries during replay. Otherwise we break the truncate_finish code,
which references the truncate_start logsegment by offset. Probably other
stuff too.
Sage Weil [Thu, 28 Jul 2011 18:27:25 +0000 (11:27 -0700)]
client: open session with all mds targets
If we have an open session with an mds, we need to have an open session
with each of its targets as well.
The problem is if we, say,
- client has old mdsmap
- mds A adds B as target in mdsmap
- send request to mds A
- A exports to B
- we get the EXPORT, but B isn't listed as a target for A in client map
- client gets updated map
At the time we receive the map we need to open the session to B. We can't
really do it when we get the EXPORT because we don't know the target MDS.
We can either track which exports are pending to do it, or just blindly
open sessions with targets for any MDSs we have caps with. Which is
basically every session we have open. That's simplest for now.
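The "blindly open sessions with targets" approach might look something like the sketch below. All types here (`MdsMapSketch`, `ClientSketch`, `handle_map()`) are hypothetical simplifications and do not reflect the real Client or MDSMap interfaces.

```cpp
#include <map>
#include <set>

// Hypothetical, simplified model; not the real Client or MDSMap classes.
using mds_rank = int;

struct MdsMapSketch {
  // For each MDS, the ranks it has declared as export targets.
  std::map<mds_rank, std::set<mds_rank>> export_targets;
};

struct ClientSketch {
  std::set<mds_rank> sessions;  // MDSs we currently have a session with

  // Called whenever a new map arrives: for every MDS we already talk to,
  // make sure we also have a session with each of its targets.
  // Returns the ranks for which a session was newly opened.
  std::set<mds_rank> handle_map(const MdsMapSketch& m) {
    std::set<mds_rank> opened;
    // Snapshot the session set so we do not chase targets of targets.
    const std::set<mds_rank> current = sessions;
    for (mds_rank r : current) {
      auto it = m.export_targets.find(r);
      if (it == m.export_targets.end())
        continue;
      for (mds_rank t : it->second) {
        if (sessions.insert(t).second)
          opened.insert(t);
      }
    }
    return opened;
  }
};
```

Doing this at map-receipt time avoids tracking pending exports, at the cost of opening some sessions that are never used.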
mds: Split the CInode::scatter_wanted field in two
We use this field to indicate we want a scatter or an unscatter. Make
that distinction explicit.
Also, clear the unscatter_wanted in simple_lock when we start a gather!
Sage Weil [Thu, 28 Jul 2011 17:10:51 +0000 (10:10 -0700)]
heartbeatmap: warn if previous deadline is missed
This will generate missed deadline noise in the log that may otherwise be
missed by an infrequent heartbeat_interval. We generally want to know if
deadlines are missed, but we don't necessarily need to touch the heartbeat
file every second. This gets us both.
Sage Weil [Thu, 28 Jul 2011 05:38:43 +0000 (22:38 -0700)]
heartbeatmap: introduce heartbeat_map
Each thread registers and gets a private structure it can write a timeout
value to. The timeout is time_t and always fits in a single word, so no
locking is used to update it.
Anyone can call is_healthy() to find out if any timeouts have expired.
Eventually some background thread will do this.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
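A rough sketch of the idea, under stated assumptions: the names `Handle`, `HeartbeatMapSketch`, `add_worker()`, `reset_timeout()`, and `clear_timeout()` are illustrative only, and the sketch uses std::atomic to express the lock-free single-word update in portable C++ rather than relying on a plain time_t store as the commit describes.

```cpp
#include <atomic>
#include <ctime>
#include <list>
#include <mutex>
#include <string>

// Hypothetical per-thread structure; not the real heartbeat handle.
struct Handle {
  std::string name;
  std::atomic<time_t> timeout{0};  // 0 means "no deadline armed"
};

class HeartbeatMapSketch {
  std::mutex lock;            // protects the list only, never the timeouts
  std::list<Handle> workers;  // std::list keeps element addresses stable

public:
  // Register a thread and hand back its private structure.
  Handle* add_worker(const std::string& name) {
    std::lock_guard<std::mutex> g(lock);
    workers.emplace_back();
    workers.back().name = name;
    return &workers.back();
  }

  // Called by the owning thread: one word-sized store, no lock taken.
  static void reset_timeout(Handle* h, time_t now, time_t grace) {
    h->timeout.store(now + grace, std::memory_order_relaxed);
  }

  static void clear_timeout(Handle* h) {
    h->timeout.store(0, std::memory_order_relaxed);
  }

  // Callable from any thread: false if any armed deadline has passed.
  bool is_healthy(time_t now) {
    std::lock_guard<std::mutex> g(lock);
    for (const Handle& h : workers) {
      time_t t = h.timeout.load(std::memory_order_relaxed);
      if (t != 0 && t < now)
        return false;
    }
    return true;
  }
};
```

A worker would typically call reset_timeout() before a potentially slow operation and clear_timeout() afterwards, while a monitoring thread polls is_healthy().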
Sage Weil [Thu, 28 Jul 2011 04:44:44 +0000 (21:44 -0700)]
mds: mark ambig imports in ESubtreeMap during resolve
During resolve we may journal EImportFinish(true/false) as we resolve our
imports/exports. And as a side-effect we may journal an ESubtreeMap. We
need to properly mark ambig subtrees in that entry based on the
my_ambiguous_imports (resolve state), not just the migrator state (for the
active mds).
Note that the other Migrator::is_ambiguous_import() user
(send_resolve_now()) already does this correctly.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 28 Jul 2011 03:46:03 +0000 (20:46 -0700)]
mds: handle aborted slave rename while waiting for second prep
When we get the first prep, we may respond to the master with an expanded
list of witnesses for the rename before making any change (or rollback
plan). If the master fails before sending the second prep attempt, we
may end up in the abort path of _commit_slave_rename() with an empty
rollback_bl. That's fine; don't crash. We still need to unfreeze the
srci, but can skip the do_rename_rollback since we didn't actually journal
a change.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 27 Jul 2011 21:38:38 +0000 (14:38 -0700)]
mds: honor scatter_wanted while freezing
- mds A authpins item on mds B
- mds B starts to freeze tree containing item
- mds A tries wrlock_start on A, sends REQSCATTER to B
- mds B lock is unstable, sets scatter_wanted
- mds B lock stabilizes, calls try_eval, defers because freezing.
-> deadlock
In general, we want to avoid the eval while freezing to prevent starvation.
However, in this case with the multi-mds locking, we need to honor
the scatter_wanted even so.
Insert this check in try_eval(). This will catch it on the first try_eval
call after the lock stabilizes. The ambiguous auth will never catch us
while freezing, and the master holds an auth_pin to prevent a freeze, so
we will never defer the eval; no need to do the same logic in the other
eval method (eval(MDSCacheObject*, ...)) used for retry.
Sage Weil [Wed, 27 Jul 2011 21:32:24 +0000 (14:32 -0700)]
mds: try_eval in many places
These are the obvious places where we drop locks and may need to defer the
eval until after unfreeze. There are probably more; a full audit is in
order.
Sage Weil [Wed, 27 Jul 2011 20:13:38 +0000 (13:13 -0700)]
mds: implement try_eval() on a single lock
We frequently call eval() on locks, usually after dropping an rd/wr/xlock.
At that point the eval() may do nothing because the object is now freezing
or frozen. However, we still need to do the eval eventually.
These callers should eventually all switch to try_eval(), and retry as
needed.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
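The defer-and-retry pattern described above could be sketched as below. This is a toy model: `ObjectSketch`, `LockSketch`, and the waiter list are assumptions made for illustration, and the real eval() does far more than bump a counter.

```cpp
#include <functional>
#include <vector>

// Hypothetical miniature of a cache object that can be frozen.
struct ObjectSketch {
  bool freezing_or_frozen = false;
  std::vector<std::function<void()>> unfreeze_waiters;

  // Clear the frozen state and run everything that was deferred.
  void unfreeze() {
    freezing_or_frozen = false;
    std::vector<std::function<void()>> run;
    run.swap(unfreeze_waiters);
    for (auto& fn : run)
      fn();
  }
};

struct LockSketch {
  ObjectSketch* parent;
  int evals = 0;  // counts how often eval() actually ran

  explicit LockSketch(ObjectSketch* p) : parent(p) {}

  void eval() { ++evals; }  // stand-in for the real lock state evaluation

  // Returns true if eval() ran now, false if it was deferred until
  // the parent object is unfrozen.
  bool try_eval() {
    if (parent->freezing_or_frozen) {
      parent->unfreeze_waiters.push_back([this] { try_eval(); });
      return false;
    }
    eval();
    return true;
  }
};
```

The point of the wrapper is that a caller that has just dropped a lock never silently loses its eval: either it runs immediately or it is queued for the unfreeze.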
We were accidentally setting them to standby-for-rank -1 if their
leader MDS wasn't active on startup. Things worked out in the end
anyway since they would go from standby to active for the appropriate
rank, but we want them to be in proper standby-replay!
Sage Weil [Wed, 27 Jul 2011 19:42:21 +0000 (12:42 -0700)]
mds: make two passes on scatter_nudge
It's possible for scatter_nudge on a scatterlock in LOCK with dirty set to
go to MIX immediately and remain stable. Give two 'nudge' passes before
we stop to avoid looping.
This fixes an assert failure where a nudge from log trimming ended up in a
stable state and asserted (!c). The second pass will go on to trigger the
dirty writebehind.
Sage Weil [Tue, 26 Jul 2011 23:10:48 +0000 (16:10 -0700)]
mds: fix projected rename adjustment
- we may journal one (or _maybe_ both, probably not) of the subtree root
addition OR the bound addition, depending on whether oldparent and
newparent are auth.
- we can't rely on get_subtree_root() to move bounds since the projected
subtree isn't a root in the real tree. use CDir::contains() instead.
Sage Weil [Tue, 26 Jul 2011 21:58:35 +0000 (14:58 -0700)]
mds: track projected rename effect on subtree map
Renames can affect the subtree map. We don't actually update it until the
rename commits, but while it is in flight to the journal we may journal
an ESubtreeMap event, which should reflect the change.
Adjust all callers (that journal) to project their change.
Mirror the logic in adjust_subtree_after_rename() in the subtree map
journaling code. Be a bit careful, because we only journal subtrees that
we are auth for, so there are some additional checks to get a consistent
result.
Sage Weil [Tue, 26 Jul 2011 17:52:49 +0000 (10:52 -0700)]
mds: only create up renamed diri frag subtrees if they differ from parent
Commit 00ec86a2041 opens up subtrees with CDIR_AUTH_UNDEF blindly for any
renamed dir inode. This is correct on the rename target, but not on a
random observer, where we end up with the parent and child having the
same auth. Oddly the comment seemed to have it right. Fix the code.
Sage Weil [Tue, 26 Jul 2011 05:17:24 +0000 (22:17 -0700)]
client: reencode cap releases for each request
I think commit f7170f9 was based on some of my bad advice. Every time the
client sends a request, it should look at what caps it has that might
conflict with the operation and (if possible) release them with the
request. I suspect I was confusing this with the case on the MDS side of
things where we only process the release(s) when we first receive the
message and not when it is deferred/retried.
Specifically, this fixes a problem where we send a request to mds A and
release some set of caps, A tells us to talk to B instead, and we resend
the same message with (old, now bogus) releases intended for A to B
instead, where they probably make no sense.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
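One way to picture the fix: the release list is rebuilt from the client's current caps every time the request is (re)sent, rather than being encoded once and reused. The sketch below is a hypothetical model; `RequestSketch`, `ClientCaps`, `encode_releases()`, and the cap labels are invented for illustration and are not the real client structures.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical request carrying the caps released with this send.
struct RequestSketch {
  std::vector<std::string> releases;
};

struct ClientCaps {
  // Caps currently held, keyed by the MDS that issued them.
  std::map<int, std::set<std::string>> held;

  // Build the release list for one send to `mds`. Called on every send
  // and resend, so releases meant for a previous target never leak
  // into a message addressed to a different MDS.
  void encode_releases(RequestSketch& req, int mds,
                       const std::set<std::string>& conflicting) {
    req.releases.clear();  // drop anything encoded for an earlier target
    auto it = held.find(mds);
    if (it == held.end())
      return;
    for (const auto& cap : conflicting) {
      if (it->second.erase(cap) > 0)
        req.releases.push_back(cap);
    }
  }
};
```

If the request is first sent to mds 0 and then forwarded to mds 1, the second call produces a list based only on caps held from mds 1, which may well be empty.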
This class has subclasses representing temporary and permanent
exceptions, as well as argument parsing errors. An instance of this
class can be created from a message or from another exception.
We always print out the type of exception on the last line of stderr,
right after the exception information.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Mon, 25 Jul 2011 20:15:11 +0000 (13:15 -0700)]
objecter: treat RESETSESSION like a reset
Commit 065cdf5 rewrote ms_handle_reset but didn't adjust
ms_handle_remote_reset (they used to be identical). The result is lost
MOSDOps if the osd ever sends a RESETSESSION.
Sage Weil [Mon, 25 Jul 2011 05:08:13 +0000 (22:08 -0700)]
mds: be careful about calls to try_subtree_merge
try_subtree_merge will, on occasion, journal something. And anytime we
journal something we may open a new segment and journal an ESubtreeMap.
That means our subtree state needs to be consistent with any in-progress
or finishing migrations.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 24 Jul 2011 21:22:12 +0000 (14:22 -0700)]
mds: submit_entry last
MDLog::submit_entry() may journal an ESubtreeMap as a side-effect, so make
sure we have updated our state correctly _before_ calling it. The safest
is to just do it last.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
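The ordering hazard can be illustrated with a toy log whose submit call may snapshot state as a side effect. Everything here (`StateSketch`, `LogSketch`, `finish_import()`, the string entries) is a hypothetical stand-in, not the real MDLog interface.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory state that a snapshot entry would capture.
struct StateSketch {
  int subtree_count = 0;
};

struct LogSketch {
  const StateSketch* state;
  std::vector<std::string> journal;
  bool segment_full = false;

  explicit LogSketch(const StateSketch* s) : state(s) {}

  // Submitting an entry may first start a new segment, which begins
  // with a snapshot of whatever the state looks like right now.
  void submit_entry(const std::string& entry) {
    if (segment_full) {
      journal.push_back("subtreemap:" + std::to_string(state->subtree_count));
      segment_full = false;
    }
    journal.push_back(entry);
  }
};

// Update in-memory state first, then journal as the very last step,
// so any snapshot taken inside submit_entry() sees the new state.
void finish_import(StateSketch& st, LogSketch& log) {
  st.subtree_count += 1;
  log.submit_entry("import_finish");
}
```

Had finish_import() called submit_entry() before updating `subtree_count`, the snapshot written at the segment boundary would have recorded the stale value.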