Josh Durgin [Tue, 2 Aug 2011 19:14:06 +0000 (12:14 -0700)]
osd: put_object_context: tolerate pgs being deleted
PGs that are queued for deletion won't be in the osdmap,
and may not be in the pg_map, but if they are, it's safe to
put object_context. Otherwise, the pg is being deleted and
will clean up the object contexts itself.
osd, pg: clean up watchers on pg deletion and shutdown
Watchers and their object contexts need to be cleaned up so
they aren't used after the pg is gone. This happened if the
pool was deleted and the connection to the watcher was reset.
Sage Weil [Sat, 30 Jul 2011 05:10:17 +0000 (22:10 -0700)]
mds: fix create_subtree_map for new dirs
Currently mkdir foo ; rmdir foo fails because we can't get_subtree_map()
on a new directory that isn't linked in the committed plane. Since we are
journaling the projected subtree, it makes sense to use
get_projected_subtree_map() here.
It's easiest to keep both the old and new directories in the rename
project map instead of looking at the next-to-most-recent parent for the
inode. The committed version is irrelevant (could conceivably be multiple
renames behind) and the current projected parent is just newdir; we need
olddir too, and we don't project for cross-mds rename anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
This allows clients to determine whether they have the latest
mds, mon, or osd map. This is useful for figuring out if a pool
does not exist, or if the osdmap with it simply hasn't been
received yet.
Sage Weil [Fri, 29 Jul 2011 20:44:24 +0000 (13:44 -0700)]
mds: identify slave requests with reqid + attempt number
We need to distinguish between different attempts to process a request, or
else we can get annoying races in the slave request handling code. E.g.,
- request sent to mds A
- A authpins items on B, B registered slave_request
- A forwards request to C, sends slave finish to B
- C receives request, sends authpin slave request to B
- B receives C's authpin request, discards (*)
- B receives A's finish, closes slave request
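As a rough illustration of the sequence above (not the actual MDS code), the
sketch below keys B's slave request table on the pair (reqid, attempt), so that
A's finish for the old attempt cannot close or shadow C's newer one. The names
SlaveKey, SlaveRequests and replay_forwarding_race are hypothetical, and plain
integers stand in for the real request id type.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical types for illustration only; not the real MDS structures.
using SlaveKey = std::pair<std::uint64_t, std::uint32_t>;  // (reqid, attempt)

class SlaveRequests {
  std::set<SlaveKey> open_;  // slave requests currently registered on this mds
public:
  // Register a slave request. Returns false only if this exact
  // (reqid, attempt) pair is already registered.
  bool start(std::uint64_t reqid, std::uint32_t attempt) {
    return open_.insert(SlaveKey(reqid, attempt)).second;
  }
  // Close a slave request. Only the matching attempt is affected.
  bool finish(std::uint64_t reqid, std::uint32_t attempt) {
    return open_.erase(SlaveKey(reqid, attempt)) == 1;
  }
  bool is_open(std::uint64_t reqid, std::uint32_t attempt) const {
    return open_.count(SlaveKey(reqid, attempt)) == 1;
  }
};

// Replays the sequence from B's point of view. Returns true if C's
// request is still registered after A's finish arrives.
inline bool replay_forwarding_race() {
  SlaveRequests b;
  b.start(42, 0);   // A authpins on B (attempt 0)
  b.start(42, 1);   // C's authpin request after the forward (attempt 1)
  b.finish(42, 0);  // A's finish closes attempt 0 only
  assert(!b.is_open(42, 0));
  return b.is_open(42, 1);
}
```

With a table keyed on reqid alone, the second start() would collide with the
first, which is the discard step in the list above.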
Sage Weil [Thu, 28 Jul 2011 22:24:23 +0000 (15:24 -0700)]
mds: fix log trimming races
trim() would iterate over segments. It would take the *p segment, ++p,
then call try_expire(). But the _expired() function would also clean up
and (if possible) retire subsequent segments on the list if they were on
the expired list, invalidating the p iterator.
Untangle the mess by making expired segment trimming (i.e. removing from
segment list) a separate operation performed only by trim() (probably a
good idea anyway). This keeps the iterator safe/stable.
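A minimal sketch of the two-phase idea, assuming a simplified log where
segments are identified by an integer offset and expired segments are only
removed from the front of the list (that front-only rule is an assumption of
this sketch). Log, LogSegment, try_expire and trim_expired are illustrative
names, not the MDLog interface.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <set>

// Hypothetical, simplified stand-ins for the MDS log structures.
struct LogSegment {
  int offset;       // identifies the segment
  bool can_expire;  // whether expiry would succeed right now
};

class Log {
  std::list<LogSegment> segments;  // oldest first
  std::set<int> expired;           // offsets of segments that have expired
public:
  void add(int offset, bool can_expire) {
    segments.push_back(LogSegment{offset, can_expire});
  }

  // Phase 1: record expiry only. Never modifies the segment list, so
  // iterators held by the caller stay valid.
  void try_expire(const LogSegment& s) {
    if (s.can_expire)
      expired.insert(s.offset);
  }

  // Phase 2: remove expired segments from the front of the list.
  void trim_expired() {
    while (!segments.empty() && expired.count(segments.front().offset)) {
      expired.erase(segments.front().offset);
      segments.pop_front();
    }
  }

  void trim() {
    for (auto p = segments.begin(); p != segments.end(); ++p)
      try_expire(*p);  // safe: nothing is erased during iteration
    trim_expired();    // the only place segments leave the list
  }

  std::size_t size() const { return segments.size(); }
  int oldest() const {
    assert(!segments.empty());
    return segments.front().offset;
  }
};
```

Because only trim_expired() erases from the list, the loop in trim() never
sees its iterator invalidated.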
Sage Weil [Thu, 28 Jul 2011 21:51:06 +0000 (14:51 -0700)]
mds: separate type for gratuitous debug ESubtreeMaps
Give these a different type so they are not interpreted as subtree
boundaries during replay. Otherwise we break the truncate_finish code,
which references the truncate_start logsegment by offset. Probably other
stuff too.
Sage Weil [Thu, 28 Jul 2011 18:27:25 +0000 (11:27 -0700)]
client: open session with all mds targets
If we have an open session with an mds, we need to have an open session
with each of its targets as well.
The problem is if we, say,
- client has old mdsmap
- mds A adds B as target in mdsmap
- send request to mds A
- A exports to B
- we get the EXPORT, but B isn't listed as a target for A in client map
- client gets updated map
At the time we receive the map we need to open the session to B. We can't
really do it when we get the EXPORT because we don't know the target MDS.
We can either track which exports are pending to do it, or just blindly
open sessions with targets for any MDSs we have caps with. Which is
basically every session we have open. That's simplest for now.
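The "blindly open sessions with targets" approach could look roughly like the
sketch below. It assumes the map exposes a per-mds set of export targets;
MdsMap, Client, handle_mds_map and the integer rank type are hypothetical
stand-ins, not the real client classes.

```cpp
#include <cassert>
#include <map>
#include <set>

using mds_rank = int;  // hypothetical: plain int instead of the real rank type

// Hypothetical, reduced view of the mdsmap: only export targets per mds.
struct MdsMap {
  std::map<mds_rank, std::set<mds_rank>> export_targets;
};

class Client {
  std::set<mds_rank> sessions;  // mds ranks we have an open session with
public:
  void open_session(mds_rank r) {
    assert(r >= 0);
    sessions.insert(r);
  }
  bool has_session(mds_rank r) const { return sessions.count(r) == 1; }

  // On receiving a new map, open a session with every target of every
  // mds we already have a session with. This is a single pass: targets
  // of newly opened sessions are picked up on the next map update.
  void handle_mds_map(const MdsMap& m) {
    std::set<mds_rank> to_open;
    for (mds_rank r : sessions) {
      auto it = m.export_targets.find(r);
      if (it == m.export_targets.end())
        continue;
      to_open.insert(it->second.begin(), it->second.end());
    }
    for (mds_rank r : to_open)
      open_session(r);
  }
};
```

Collecting the targets first avoids inserting into the session set while
iterating over it.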
mds: Split the CInode::scatter_wanted field in two
We use this field to indicate we want a scatter or an unscatter. Make
that distinction explicit.
Also, clear the unscatter_wanted in simple_lock when we start a gather!
Sage Weil [Thu, 28 Jul 2011 17:10:51 +0000 (10:10 -0700)]
heartbeatmap: warn if previous deadline is missed
This will generate missed deadline noise in the log that may otherwise be
missed by an infrequent heartbeat_interval. We generally want to know if
deadlines are missed, but we don't necessarily need to touch the heartbeat
file every second. This gets us both.
Sage Weil [Thu, 28 Jul 2011 05:38:43 +0000 (22:38 -0700)]
heartbeatmap: introduce heartbeat_map
Each thread registers and gets a private structure it can write a timeout
value to. The timeout is time_t and always fits in a single word, so no
locking is used to update it.
Anyone can call is_healthy() to find out if any timeouts have expired.
Eventually some background thread will do this.
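A rough sketch of the idea, under assumptions: the description above relies on
a word-sized time_t write needing no lock, whereas this example swaps in
std::atomic with relaxed ordering to express the same thing in portable C++.
The names HeartbeatMap, heartbeat_handle, add_worker, reset_timeout and
clear_timeout, and passing the current time in explicitly, are illustrative
choices, not necessarily the real interface.

```cpp
#include <atomic>
#include <cassert>
#include <ctime>
#include <list>
#include <mutex>
#include <string>

// Hypothetical per-thread structure; the owning thread writes the timeout.
struct heartbeat_handle {
  std::string name;
  std::atomic<std::time_t> timeout{0};  // 0 means "no deadline set"
};

class HeartbeatMap {
  std::mutex lock;                      // protects the list, not the timeouts
  std::list<heartbeat_handle> workers;  // std::list keeps handle addresses stable
public:
  heartbeat_handle* add_worker(const std::string& name) {
    std::lock_guard<std::mutex> l(lock);
    workers.emplace_back();
    workers.back().name = name;
    return &workers.back();
  }

  // Called by the owning thread; no lock is taken for the update.
  static void reset_timeout(heartbeat_handle* h, std::time_t now, std::time_t grace) {
    assert(h != nullptr);
    h->timeout.store(now + grace, std::memory_order_relaxed);
  }
  static void clear_timeout(heartbeat_handle* h) {
    assert(h != nullptr);
    h->timeout.store(0, std::memory_order_relaxed);
  }

  // Anyone may call this; returns false if any set deadline has passed.
  bool is_healthy(std::time_t now) {
    std::lock_guard<std::mutex> l(lock);
    for (auto& h : workers) {
      std::time_t t = h.timeout.load(std::memory_order_relaxed);
      if (t != 0 && t < now)
        return false;
    }
    return true;
  }
};
```

A worker would call reset_timeout() before starting a unit of work and
clear_timeout() when it goes idle; a monitor thread polls is_healthy().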
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 28 Jul 2011 04:44:44 +0000 (21:44 -0700)]
mds: mark ambig imports in ESubtreeMap during resolve
During resolve we may journal EImportFinish(true/false) as we resolve our
imports/exports. And as a side-effect we may journal an ESubtreeMap. We
need to properly mark ambig subtrees in that entry based on the
my_ambiguous_imports (resolve state), not just the migrator state (for the
active mds).
Note that the other Migrator::is_ambiguous_import() user
(send_resolve_now()) already does this correctly.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 28 Jul 2011 03:46:03 +0000 (20:46 -0700)]
mds: handle aborted slave rename while waiting for second prep
When we get the first prep, we may respond to the master with an expanded
list of witnesses for the rename before making any change (or rollback
plan). If the master fails before sending the second prep attempt, we
may end up in the abort path of _commit_slave_rename() with an empty
rollback_bl. That's fine; don't crash. We still need to unfreeze the
srci, but can skip the do_rename_rollback since we didn't actually journal
a change.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 27 Jul 2011 21:38:38 +0000 (14:38 -0700)]
mds: honor scatter_wanted while freezing
- mds A authpins item on mds B
- mds B starts to freeze tree containing item
- mds A tries wrlock_start on A, sends REQSCATTER to B
- mds B lock is unstable, sets scatter_wanted
- mds B lock stabilizes, calls try_eval, defers because freezing.
-> deadlock
In general, we want to avoid the eval while freezing to prevent starvation.
However, in this case with the multi-mds locking, we need to honor
the scatter_wanted even so.
Insert this check in try_eval(). This will catch it on the first try_eval
call after the lock stabilizes. The ambiguous auth will never catch us
while freezing, and the master holds an auth_pin to prevent a freeze, so
we will never defer the eval; no need to do the same logic in the other
eval method (eval(MDSCacheObject*, ...)) used for retry.
Sage Weil [Wed, 27 Jul 2011 21:32:24 +0000 (14:32 -0700)]
mds: try_eval in many places
These are the obvious places where we drop locks and may need to defer the
eval until after unfreeze. There are probably more; a full audit is in
order.
Sage Weil [Wed, 27 Jul 2011 20:13:38 +0000 (13:13 -0700)]
mds: implement try_eval() on a single lock
We frequently call eval() on locks, usually after dropping an rd/wr/xlock.
At that point the eval() may do nothing because the object is now freezing
or frozen. However, we still need to do the eval eventually.
These callers should eventually all switch to try_eval(), and retry as
needed.
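A minimal sketch of the retry pattern, assuming a simplified cache object that
keeps a list of callbacks to run on unfreeze. CacheObject, its fields and the
evals counter are hypothetical; the real try_eval() signature and waiter
mechanism may differ.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a cache object that owns a lock.
struct CacheObject {
  bool freezing_or_frozen = false;
  std::vector<std::function<void()>> unfreeze_waiters;
  int evals = 0;  // counts evals that actually ran (for illustration)

  // Plain eval: does nothing while the object is freezing or frozen.
  void eval() {
    if (freezing_or_frozen)
      return;
    ++evals;
  }

  // try_eval: if the eval cannot run now, queue a retry for unfreeze
  // instead of silently dropping it.
  void try_eval() {
    if (freezing_or_frozen) {
      unfreeze_waiters.push_back([this] { try_eval(); });
      return;
    }
    eval();
  }

  void unfreeze() {
    freezing_or_frozen = false;
    std::vector<std::function<void()>> ws;
    ws.swap(unfreeze_waiters);  // run a snapshot; waiters may re-queue
    assert(unfreeze_waiters.empty());
    for (auto& w : ws)
      w();
  }
};
```

The difference from calling eval() directly is that the deferred eval is
guaranteed to happen once the object is unfrozen.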
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We were accidentally setting them to standby-for-rank -1 if their
leader MDS wasn't active on startup. Things worked out in the end
anyway since they would go from standby to active for the appropriate
rank, but we want them to be in proper standby-replay!
Sage Weil [Wed, 27 Jul 2011 19:42:21 +0000 (12:42 -0700)]
mds: make two passes on scatter_nudge
It's possible for scatter_nudge on a scatterlock in LOCK with dirty set to
go to MIX immediately and remain stable. Give two 'nudge' passes before
we stop to avoid looping.
This fixes an assert failure where a nudge from log trimming ended up in a
stable state and asserted (!c). The second pass will then trigger the dirty
writebehind.
Sage Weil [Tue, 26 Jul 2011 23:10:48 +0000 (16:10 -0700)]
mds: fix projected rename adjustment
- we may journal one (or _maybe_ both, probably not) of the subtree root
addition OR the bound addition, depending on whether oldparent and
newparent are auth.
- we can't rely on get_subtree_root() to move bounds since the projected
subtree isn't a root in the real tree. use CDir::contains() instead.