Greg Farnum [Thu, 20 Jan 2011 19:57:23 +0000 (11:57 -0800)]
MDSMonitor: Adjust handling of MDSes asking for standby-replay.
1) If the MDS does not specify an MDS to follow, we mark it as
standing-by for -2. MDSMap::find_standby_for() has been modified
to grab these MDSes.
2) If an MDS asks for standby-replay and specifies a name but
not a rank, fill in the rank if the named MDS is known to us. If it
is not known, do nothing.
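The two rules above can be sketched roughly as follows. This is an illustrative model only, not the MDSMonitor code: the StandbyRequest struct, the resolve_standby function, the name-to-rank map, and the constant names are all hypothetical. Only the value -2 is taken from the commit message, and using -1 for "no rank given" is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sentinel names; only the value -2 comes from the commit message.
constexpr int RANK_NONE = -1;        // assumption: "no rank given"
constexpr int RANK_ANY_FOLLOW = -2;  // standby-replay with no target specified

// Hypothetical model of what an MDS sends when asking for standby-replay.
struct StandbyRequest {
  int standby_for_rank = RANK_NONE;
  std::string standby_for_name;      // empty means "no name given"
};

// known: name -> rank for the MDSes the monitor currently knows about.
inline void resolve_standby(StandbyRequest& req,
                            const std::map<std::string, int>& known) {
  if (req.standby_for_rank != RANK_NONE)
    return;                                  // a rank was given explicitly
  if (req.standby_for_name.empty()) {
    req.standby_for_rank = RANK_ANY_FOLLOW;  // rule 1: nothing specified
    return;
  }
  auto it = known.find(req.standby_for_name);
  if (it != known.end()) {
    assert(it->second >= 0);                 // known ranks are non-negative here
    req.standby_for_rank = it->second;       // rule 2: fill in rank from name
  }
  // Unknown name: do nothing, as the commit message describes.
}
```

In this sketch a request that names an unknown MDS keeps its RANK_NONE value, so it can be resolved later once the named MDS appears.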
Greg Farnum [Wed, 19 Jan 2011 23:59:46 +0000 (15:59 -0800)]
mds: Adjust replay state changes and options parsing.
The MDS used to interpret g_conf.id as a rank. It no longer does
so and requires that standby ranks/names be set via the g_conf options,
or else along with the replay command in the CLI. Remove the MDS versions
of standby_for_[rank|name] and just use the ones in g_conf for simplicity.
However, the MDS only looks at the rank when switching to standby;
making names usable will require an update to the MDSMonitor code to
plug in ranks from names.
Greg Farnum [Fri, 21 Jan 2011 19:07:52 +0000 (11:07 -0800)]
mds: Keep journaler in readonly mode until replay completes.
Previously we were switching it off for the final non-standby replay
when a standby-replay got activated. This caused issues
since the states weren't quite correct!
Greg Farnum [Wed, 12 Jan 2011 22:54:01 +0000 (14:54 -0800)]
mds: use direct replay test when deciding whether to rebalance.
The previous use of standby_for_rank testing was prone to errors
and I think would have ended up causing bugs if it was in the middle
of a standby_replay run when it got a new MDS map pushing
it into regular replay mode.
The client side behavior here is correct: we should feed the raw pg into
osdmap->pg_to_acting_osds. The real problem is (was!) that pgp_num > pg_num
in current maps, which is illegal.
Sage Weil [Sat, 15 Jan 2011 00:57:38 +0000 (16:57 -0800)]
osd: drop messages from before we moved back to boot state
We want to make sure we ignore any messages sent to us before we moved
back to the boot state (after being wrongly marked down). This is only
a problem currently while we are in the BOOT state and waiting to be
re-added to the map, because we may then call _share_map_incoming and
send something on the new rebound messenger to an old peer. Also assert
that we are !booting there to be sure.
Yehuda Sadeh [Sat, 15 Jan 2011 00:29:41 +0000 (16:29 -0800)]
auth: new rotating secret ttl should depend on now() + ttl
Before, it depended only on the previous rotating secret (which was
always bigger than g_clock.now()). Since ticket rotation never
happens exactly when the old ticket expires (it probably takes a few
seconds after that), we ended up with tickets that expire much
sooner than we expected.
Unit tests should not parse the normal "-c ceph.conf" command line
arguments; they should not read config files, etc. If something
needs initializing for a specific unit test, we'll either fix it
to not need it, initialize it just for that, or figure out some nicer
way of doing this.
Greg Farnum [Sat, 15 Jan 2011 00:22:11 +0000 (16:22 -0800)]
MDS: Use new C_Gather::get_num_remaining() in MDCache.
It was using get_num(), which now reports the number created.
This probably wouldn't have worked previously except that
~C_Gather::C_GatherSub was inappropriately calling rm_sub().
Greg Farnum [Sat, 15 Jan 2011 00:11:01 +0000 (16:11 -0800)]
C_Gather: Rewrite for thread safety.
Previously, C_Gather wasn't thread safe at all,
and there was an issue with creating subs while some
subs were being finished.
These issues are now fixed.
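For illustration, here is a minimal sketch of a thread-safe gather that separates the number of subs created (get_num) from the number still outstanding (get_num_remaining). It is not the Ceph C_Gather class: the Gather name, the use of std::mutex and std::function, and the explicit activate() step (one possible way to make it safe to create subs while earlier subs are finishing) are assumptions made for this example.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Simplified stand-in for a gather callback; not the Ceph C_Gather class.
class Gather {
public:
  explicit Gather(std::function<void()> on_finish)
    : on_finish_(std::move(on_finish)) {}

  // Register a sub-operation. The caller invokes the returned callable
  // exactly once when that sub-operation completes.
  std::function<void()> new_sub() {
    std::lock_guard<std::mutex> l(lock_);
    assert(!fired_);  // no new subs after the finisher has run
    ++created_;
    ++remaining_;
    return [this] { sub_finished(); };
  }

  // Call once all subs have been created; only then may the finisher fire.
  void activate() {
    std::function<void()> cb;
    {
      std::lock_guard<std::mutex> l(lock_);
      activated_ = true;
      cb = take_if_done();
    }
    if (cb) cb();  // run the finisher outside the lock
  }

  int get_num() const {            // subs created so far
    std::lock_guard<std::mutex> l(lock_);
    return created_;
  }
  int get_num_remaining() const {  // subs not yet finished
    std::lock_guard<std::mutex> l(lock_);
    return remaining_;
  }

private:
  void sub_finished() {
    std::function<void()> cb;
    {
      std::lock_guard<std::mutex> l(lock_);
      assert(remaining_ > 0);
      --remaining_;
      cb = take_if_done();
    }
    if (cb) cb();
  }

  // Caller must hold lock_. Returns the finisher at most once.
  std::function<void()> take_if_done() {
    if (!activated_ || remaining_ != 0 || fired_)
      return nullptr;
    fired_ = true;
    return std::move(on_finish_);
  }

  mutable std::mutex lock_;
  std::function<void()> on_finish_;
  bool activated_ = false;
  bool fired_ = false;
  int created_ = 0;
  int remaining_ = 0;
};
```

Because the finisher cannot run before activate(), a sub that completes early cannot trigger it while the caller is still adding subs. The object must outlive every callable returned by new_sub().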
Sage Weil [Fri, 14 Jan 2011 06:08:56 +0000 (22:08 -0800)]
mds: use common helper to journal a client session close
We saw a bug where an ESession close was followed by an EMetaBlob on that
session (see 6d0dc4bf64b2792d6fc007268c5a42ae4e2e583c). My best guess is
that a session timeout raced with a request waiting on locks (only the
explicit client close path was calling request_kill). To avoid that,
introduce a helper to journal client close so that the common work (killing
any pending requests AND releasing prealloc inos) happens in all cases.
Sage Weil [Fri, 14 Jan 2011 06:08:40 +0000 (22:08 -0800)]
mds: tolerate (with warning) replayed op with bad prealloc_inos
This comes up when an ESession close is followed by an EMetaBlob that
uses a prealloc_ino. That isn't supposed to happen (it's probably a corner
case with session timeout vs a request waiting on locks that didn't
get killed/canceled?). But tolerate it during replay just the same.
Sage Weil [Thu, 13 Jan 2011 21:14:24 +0000 (13:14 -0800)]
filejournal: rewrite completion handling, fix ordering on full->notfull
Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion ordering invariant.
This also fixes an ordering bug: When restarting the journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions. That same logic needs to also apply to new items
submitted during that commit interval. This was broken before, but the
simpler structure fixes it. Fixes #666.
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
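The ordering rule described in this commit can be illustrated with a single FIFO of completions that is held back while a previous commit is still pending. This is a sketch of the idea only, not the FileJournal code: the OrderedCompletions class and its submit, written_thru and committed_thru methods are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model: one ordered queue, so deferred and newly
// submitted items can never complete out of order.
class OrderedCompletions {
public:
  // deferred = true models the state right after a journal restart.
  explicit OrderedCompletions(bool deferred) : deferred_(deferred) {}

  // Queue a completion; sequence numbers must be strictly increasing.
  void submit(uint64_t seq, std::function<void()> cb) {
    assert(queue_.empty() || queue_.back().first < seq);
    queue_.emplace_back(seq, std::move(cb));
  }

  // Entries up to and including seq have been written.
  void written_thru(uint64_t seq) {
    if (seq > written_) written_ = seq;
    drain();
  }

  // The previous commit has finished; stop deferring.
  void committed_thru() {
    deferred_ = false;
    drain();
  }

private:
  // Fire ready completions from the front, oldest first.
  void drain() {
    if (deferred_) return;  // applies to old and new items alike
    while (!queue_.empty() && queue_.front().first <= written_) {
      auto cb = std::move(queue_.front().second);
      queue_.pop_front();
      cb();
    }
  }

  std::deque<std::pair<uint64_t, std::function<void()>>> queue_;
  uint64_t written_ = 0;
  bool deferred_;
};
```

Items submitted while the queue is deferred wait behind the earlier ones and fire in sequence order once committed_thru() is called.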
Samuel Just [Thu, 13 Jan 2011 20:18:17 +0000 (12:18 -0800)]
PG: activate should not enqueue snap_trimmer on a replica
Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.
This should fix #702.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 23:09:51 +0000 (15:09 -0800)]
ReplicatedPG: Fix oi.size bug in _rollback_to
_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0. _rollback_to now resets
the size to match the rolled back object. Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
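A toy model of the size bug and its fix, assuming simplified stand-ins for the real structures: the Object and ObjectInfo types and the delete_head and rollback_to functions below are hypothetical and only mirror the sequence of steps the commit message describes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-ins for the OSD's object metadata.
struct ObjectInfo {
  uint64_t size = 0;
};

struct Object {
  ObjectInfo oi;
  std::string data;
};

// Models _delete_head: removes the data and zeroes the recorded size.
inline void delete_head(Object& head) {
  head.data.clear();
  head.oi.size = 0;
}

// Models _rollback_to: delete the head, then clone the snapshot into place.
inline void rollback_to(Object& head, const Object& clone) {
  delete_head(head);
  head.data = clone.data;
  // The step the fix adds: without it, oi.size stays 0 after the rollback
  // and a later size check (scrub, in the commit message) fails.
  head.oi.size = clone.oi.size;
  assert(head.oi.size == head.data.size());
}
```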
Samuel Just [Wed, 12 Jan 2011 21:51:55 +0000 (13:51 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup
Previously, get_object_context and get_snapset_context did not register
the resulting objects. In some cases, these objects would not get
registered and multiple copies would end up being created. This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 20:07:44 +0000 (12:07 -0800)]
ReplicatedPG: snap_trimmer workaround
Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps. This workaround should let kvmtest
come back up. A real fix is still needed.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Greg Farnum [Tue, 4 Jan 2011 21:32:47 +0000 (13:32 -0800)]
uclient: Switch how inodes link to dentries a bit.
Inodes now have a set of parent dentries, rather than a single
pointer. This allows the cache to accurately represent multiple
hard links.
Various minor adjustments were made so that this change in
format works and is error checked.
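A rough sketch of the data-structure change, assuming hypothetical names throughout (Inode, Dentry, dn_set, link_dentry, unlink_dentry); it is not the uclient code, only an illustration of an inode tracking a set of parent dentries instead of a single pointer, with asserts standing in for the error checking mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

struct Inode;

// Hypothetical cache entry: a name in some directory pointing at an inode.
struct Dentry {
  std::string name;
  Inode* inode = nullptr;
};

struct Inode {
  // Previously a single parent pointer; a set lets several dentries
  // (hard links) refer to the same inode.
  std::set<Dentry*> dn_set;

  // Number of links currently present in the cache.
  std::size_t cached_links() const { return dn_set.size(); }
};

// Attach a dentry to an inode; the dentry must not already be linked.
inline void link_dentry(Dentry& dn, Inode& in) {
  assert(dn.inode == nullptr);
  dn.inode = &in;
  in.dn_set.insert(&dn);
}

// Detach a dentry; it must currently be one of the inode's parents.
inline void unlink_dentry(Dentry& dn) {
  assert(dn.inode != nullptr);
  std::size_t erased = dn.inode->dn_set.erase(&dn);
  assert(erased == 1);
  (void)erased;  // avoid an unused-variable warning when asserts are off
  dn.inode = nullptr;
}
```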
Making oldest_update a class variable complicates log merging and wastes
space in the PG struct. Even though memory is big, cachelines are still
small. Just calculate it when we need it.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>