Sage Weil [Thu, 5 May 2011 15:54:23 +0000 (08:54 -0700)]
osd: handle notify+info explicitly in GetInfo state
This fixes a few things:
- do not proceed past GetInfo if there are down osds. ever.
- if we get a new info that moves last_epoch_started forward,
rebuild prior, because we may have eliminated said down osds.
- if we get dup info, do nothing
- if we get new info, see if we can proceed to GetLog
This is all simpler/cleaner by handling Notify/Info (they're the same)
explicitly in the GetInfo state and not falling back to the parent
state handler.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
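The rules listed in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual PG code: Info, Action, GetInfoSketch and handle_info are hypothetical names, and the real state is far richer than two integers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical, simplified stand-in for a pg info; not the real Ceph type.
struct Info {
  std::uint32_t last_epoch_started;
  std::uint32_t last_update;
  bool operator==(const Info& o) const {
    return last_epoch_started == o.last_epoch_started &&
           last_update == o.last_update;
  }
};

enum class Action { Ignore, RebuildPrior, Wait, GoToGetLog };

struct GetInfoSketch {
  std::set<int> prior;            // osds whose info we want (assumed shape)
  std::set<int> down;             // members of prior that are currently down
  std::map<int, Info> peer_info;  // infos received so far
  std::uint32_t best_les = 0;     // newest last_epoch_started seen

  // Notify and Info are treated identically here, as the commit describes.
  Action handle_info(int from, const Info& info) {
    auto it = peer_info.find(from);
    if (it != peer_info.end() && it->second == info)
      return Action::Ignore;        // dup info: do nothing
    peer_info[from] = info;
    if (info.last_epoch_started > best_les) {
      best_les = info.last_epoch_started;
      return Action::RebuildPrior;  // caller recomputes prior and down
    }
    if (!down.empty())
      return Action::Wait;          // never proceed past GetInfo with down osds
    for (int osd : prior)
      if (!peer_info.count(osd))
        return Action::Wait;        // still waiting on some peer
    return Action::GoToGetLog;
  }
};
```

In this sketch the caller is assumed to rebuild prior and down when RebuildPrior is returned, and then to re-check whether it can move on.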
Sage Weil [Thu, 5 May 2011 15:12:24 +0000 (08:12 -0700)]
osd: fix GetInfo querying
Don't query for info we already have, or have already requested. Remove
unneeded helper so that this is simpler and we have access to the info
we need.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
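A rough sketch of the "don't query twice" rule, assuming the prior set, the received infos and the outstanding requests are each tracked as a set of osd ids. peers_to_query and its parameters are hypothetical names, not the real helper.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Returns the peers that still need an info query, and records them in
// `requested` so that a later call does not ask them again.
std::vector<int> peers_to_query(const std::set<int>& prior,
                                const std::set<int>& have_info,
                                std::set<int>& requested,
                                int self) {
  std::vector<int> out;
  for (int osd : prior) {
    if (osd == self)
      continue;                        // no need to query ourselves
    if (have_info.count(osd))
      continue;                        // info already received
    if (!requested.insert(osd).second)
      continue;                        // query already outstanding
    out.push_back(osd);
  }
  return out;
}
```

Calling it a second time with the same sets returns an empty list, since every remaining peer is already in requested.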
Sage Weil [Thu, 5 May 2011 15:11:41 +0000 (08:11 -0700)]
osd: handle event notify/info/log from Initial
We shouldn't post a creation event and jump into peering/stray based on
pg creation when we are about to process more information or else we will
send out unnecessary queries. Instead, handle those from Initial and jump
to the appropriate state.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 4 May 2011 20:05:09 +0000 (13:05 -0700)]
osd: move directly to Reset state on pg load
Add Initial -> Reset transition on pg load. This avoids doing any
activation-type stuff (like sending messages) before we are ready. In
particular, we want to advance through any new OSDMaps and only
send out queries/notifies/whatever when we get to the activate_map
stage.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Wed, 4 May 2011 17:21:54 +0000 (10:21 -0700)]
PG: ReplicaActive must respond to requests from discover_all_missing
If the peer does not yet have the pg during GetMissing, there won't be
a peer_missing entry for that peer. In that case, discover_all_missing
can legitimately request a missing set after the pg has gone active.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Josh Durgin [Wed, 4 May 2011 16:10:00 +0000 (09:10 -0700)]
PG: use a state_name member instead of overriding get_state_name
Also add debugging to each state constructor. Since dout uses
the recovery machine context, anything using it in the constructor
must be a state, not a simple_state.
Sage Weil [Tue, 3 May 2011 22:31:28 +0000 (15:31 -0700)]
osd: feed new pg mapping into state machine
instead of recalculating it. Also pass the last map into warm_restart,
while we're at it. Drop the Reset state constructor and instead repost
the AdvMap event before transitioning.
Sage Weil [Tue, 3 May 2011 20:08:35 +0000 (13:08 -0700)]
osd: fix pg log entry types to not always be delete
This was broken by the osd_trans work merged in 01f3526b62. We need to
use the obs reference to new_obs. This caused objects to be deleted during
pg recovery.
Sage Weil [Tue, 3 May 2011 19:34:54 +0000 (12:34 -0700)]
osdmap: allow incremental to represent osd deletion
Convert new_down to new_state, with values xored onto the old state. We
preserve compatibility with old incrementals because they were (virtually)
always 0, and we can special case that to mean toggle CEPH_OSD_UP. We
don't really care if clients get new values right; if they don't clear
the EXISTS flag that doesn't really hurt them. It's only important that
the monitor get it right.
To ensure that, we rev the monitor internal protocol.
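The xor scheme with the legacy special case might look like this. It is a sketch under assumptions: the flag constants are local stand-ins for CEPH_OSD_EXISTS and CEPH_OSD_UP with assumed bit values, and apply_new_state is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the real flags; the bit values are assumptions.
constexpr std::uint32_t OSD_EXISTS = 1u << 0;
constexpr std::uint32_t OSD_UP     = 1u << 1;

// Apply one incremental entry to an osd's state word. Old incrementals
// carried (virtually) always 0 here, so 0 is read as "toggle UP".
std::uint32_t apply_new_state(std::uint32_t old_state,
                              std::uint32_t new_state) {
  std::uint32_t mask = new_state ? new_state : OSD_UP;
  return old_state ^ mask;
}
```

With this encoding, marking an osd down is a mask of OSD_UP (or a legacy 0), and deleting it is a mask that also clears OSD_EXISTS.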
Samuel Just [Tue, 3 May 2011 00:03:56 +0000 (17:03 -0700)]
OSD,PG: Peering refactor
Previously, peering was handled by a de facto state machine in do_peer
and related methods. Peering state will now be encapsulated in
RecoveryState, which uses boost::statechart internally to enforce an
explicit state machine abstraction. OSD::handle_pg_* pass off to
PG::handle_*, which pass messages to the state machine.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Fri, 22 Apr 2011 00:42:51 +0000 (17:42 -0700)]
OSD,PG: Move pg reset code from OSD::advance_map to PG
OSD::advance_map previously handled resetting the PG for peering. Now,
PG::acting_up_affected returns true if peering needs to be restarted and
PG::warm_restart takes care of resetting the pg.
Sage Weil [Tue, 3 May 2011 01:19:32 +0000 (18:19 -0700)]
cfuse: encode/decode dev_t properly
The fuse layer passes through "encoded" dev_t values (probably for
compatibility reasons or something). I copied the encode/decode methods
from the kernel and encode/decode the st_rdev values where appropriate
(where struct stat is exposed directly or via the fuse_entry_param
struct).
Fixes: #1031
Signed-off-by: Sage Weil <sage@newdream.net>
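For reference, the kernel-style encoding can be sketched as below. The bit layout follows the kernel's new_encode_dev/new_decode_dev helpers as I understand them, and DevNumbers, encode_dev and decode_dev are hypothetical names used only for this illustration, not the functions added to cfuse.

```cpp
#include <cassert>
#include <cstdint>

// Major and minor are kept as plain integers so the sketch does not
// depend on the platform's dev_t layout.
struct DevNumbers {
  std::uint32_t major_num;
  std::uint32_t minor_num;
};

// Low 8 bits of minor, then 12 bits of major, then the rest of minor.
std::uint32_t encode_dev(DevNumbers d) {
  return (d.minor_num & 0xffu) | (d.major_num << 8) |
         ((d.minor_num & ~0xffu) << 12);
}

DevNumbers decode_dev(std::uint32_t enc) {
  std::uint32_t major_num = (enc & 0xfff00u) >> 8;
  std::uint32_t minor_num = (enc & 0xffu) | ((enc >> 12) & 0xfff00u);
  return DevNumbers{major_num, minor_num};
}
```

For example, major 8 and minor 1 encode to 0x801, and decoding that value gives the same pair back.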
MDS: fix handle_client_rename use of path_traverse.
It was using the MDS_TRAVERSE_DISCOVERXLOCK flag, which allows
path_traverse to return success if it encounters a NULL dentry. When
we're looking for a source inode, though, that doesn't work out! We
want MDS_TRAVERSE_DISCOVER, which will go away and look for the dentry
on other inodes but requires a linked dentry, not a NULL one.
Sage Weil [Sat, 30 Apr 2011 00:30:45 +0000 (17:30 -0700)]
mds: trim non-auth swallowed subtrees during resolve
Consider:
- peer auth for /foo
- ambiguous import /foo/bar
- peer claims /foo, swallows /foo/bar.
- disambiguate_imports sees we didn't get /foo/bar, cancels ambiguous
import.
-> we are left with /foo/bar (and content) in cache, even though it is
non-auth.
Fix by pulling the try_trim_non_auth_subtree() back out of
cancel_ambiguous_import, and trimming the containing subtree in the
disambiguate (resolve completion) case. (For the journal replay case the
subtree structure is deterministic and no such check is needed.)
Sage Weil [Thu, 28 Apr 2011 22:17:18 +0000 (15:17 -0700)]
mds: ignore fragment_notify when dft state doesn't match
In particular, if there is a resolve in there somewhere, we may have found
out about this refragment from the src because they send resolve messages
to all nodes (to resolve ambiguous migrations). If that's the case we
can ignore the message.
Sage Weil [Thu, 28 Apr 2011 20:44:55 +0000 (13:44 -0700)]
mds: try_trim_non_auth_subtree on any canceled import (including resolve)
We were trimming on journal replay of an import failure, but not on a
canceled ambiguous import during resolve. Fix that by moving the call into
the helper (and passing a CDir* instead of a dirfrag_t).
Sage Weil [Thu, 28 Apr 2011 20:22:30 +0000 (13:22 -0700)]
mds: fix steal_dentry dir_auth_pins adjustment
Pass down the correct value for dir_auth_pins (dh->auth_pins plus the
inode's auth_pins, but nothing nested beneath the inode). The CDentry
doesn't track dir auth pins independently, and doesn't really need to.
Sage Weil [Thu, 28 Apr 2011 20:00:44 +0000 (13:00 -0700)]
mds: fix export_prep trace format
The prep message includes a spanning tree in the interior of the subtree
that includes all parent inodes of bounding dirfrags. That used to look
like
df dentry inode (dir dentry inode)*
The code to generate those traces was stopping if the df->ino had already
been included. The problem was that we may have done that inode on a
different dirfrag.
The format now allows a trace to start with a dentry (already had the
dirfrag, same check as before), a dirfrag (already had the inode, the new
case), or a '-' (nothing at all). A single byte indicates which it is and
how to start decoding.
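A minimal sketch of the leading tag byte follows. The byte values 'd' and 'f' are assumptions chosen for illustration (only '-' appears in the message above), and TraceStart, encode_start and decode_start are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>

// What the first element of a trace is.
enum class TraceStart { Dentry, Dirfrag, Nothing };

// Byte values for Dentry and Dirfrag are assumed, not taken from the source.
char encode_start(TraceStart s) {
  switch (s) {
    case TraceStart::Dentry:  return 'd';  // dirfrag already sent
    case TraceStart::Dirfrag: return 'f';  // inode already sent
    case TraceStart::Nothing: return '-';  // nothing to send
  }
  throw std::invalid_argument("unknown trace start");
}

// The decoder reads this byte first to learn what to decode next.
TraceStart decode_start(char tag) {
  switch (tag) {
    case 'd': return TraceStart::Dentry;
    case 'f': return TraceStart::Dirfrag;
    case '-': return TraceStart::Nothing;
    default:  throw std::invalid_argument("bad trace start byte");
  }
}
```

Rejecting unknown bytes keeps a malformed trace from being decoded from the wrong starting point.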