In _process_pg_info, if the primary sends us a PG::Log, a replica should
merge that log into its own.
mark_all_unfound_as_lost / share_pg_log: don't send the whole PG::Log.
Just send the new entries that were just added when marking the objects
as lost.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Install SIGSEGV / SIGABRT handlers with sigaction using SA_RESETHAND.
This will ensure that if the signal handler itself encounters another
fault, the default signal handler (usually dump core) will be what is
used. Also, flush the log before dumping core.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
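A minimal sketch of the SA_RESETHAND technique described above, not the
actual Ceph handler. The names fatal_signal_handler and
install_fatal_handler are hypothetical, and the log flush is replaced by
a single async-signal-safe write() as a placeholder.

```cpp
#include <csignal>
#include <cstring>
#include <unistd.h>

// Hypothetical handler. A real handler would flush the log here; this one
// only writes a fixed message using an async-signal-safe call.
extern "C" void fatal_signal_handler(int signum) {
  static const char msg[] = "fatal signal caught, dumping core\n";
  ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;
  // SA_RESETHAND already restored the default disposition when this handler
  // was entered, so re-raising triggers the default action (usually a core).
  raise(signum);
}

// Installs the handler for one signal. Returns 0 on success, -1 on failure
// (errno is set by sigaction).
int install_fatal_handler(int signum) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = fatal_signal_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND;  // one-shot: disposition resets to SIG_DFL
  return sigaction(signum, &sa, nullptr);
}
```

A caller would invoke install_fatal_handler(SIGSEGV) and
install_fatal_handler(SIGABRT) once at startup. Because the disposition
is reset on entry, a second fault inside the handler falls through to the
default action instead of recursing.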
Sage Weil [Tue, 23 Nov 2010 18:25:39 +0000 (10:25 -0800)]
client: remove inode from flush_caps list when auth_cap changes
Avoid confusing other code (e.g. kick_flushing_caps) by staying on the mds
flushing_caps list when we don't even have an auth_cap with them anymore.
We'll need to re-flush to a new MDS later.
osd: PG::read_log: don't be clever with lost xattr
Formerly, we had a special case in read_log for dealing with objects
whose data was present on the disk, but whose attributes were not. This
conflicts with our plans to mark objects as lost by putting a bit in the
object attributes, since without those attributes, we'll never know if
the objects were formerly marked as lost.
This should almost never happen, and if it does, we just handle the
objects as missing in the normal way.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
The might_have_unfound set is used by the primary OSD during recovery.
This set tracks the OSDs which might have unfound objects that the
primary OSD needs. As we receive Missing from each OSD in
might_have_unfound, we will remove the OSD from the set.
When might_have_unfound is empty, we will mark objects as LOST if the
latest version of the object resided on an OSD marked as lost.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
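The bookkeeping described above can be sketched roughly as follows. This
is an illustration only: UnfoundTracker, got_missing_from and
ready_to_mark_lost are hypothetical names, OSDs are reduced to plain
integer ids, and the decision about which objects to mark LOST is left
out.

```cpp
#include <set>
#include <utility>

// Hypothetical stand-in for the primary's recovery state.
class UnfoundTracker {
 public:
  explicit UnfoundTracker(std::set<int> osds)
      : might_have_unfound_(std::move(osds)) {}

  // Called when a Missing set arrives from an OSD: that OSD no longer
  // needs to be queried, so drop it from the set.
  void got_missing_from(int osd) { might_have_unfound_.erase(osd); }

  // Once every candidate OSD has reported, the primary may go on to mark
  // objects as LOST (that step is not modeled here).
  bool ready_to_mark_lost() const { return might_have_unfound_.empty(); }

 private:
  std::set<int> might_have_unfound_;
};
```

Erasing an id that is not in the set is a no-op, so a repeated Missing
message from the same OSD is harmless in this sketch.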
We want to remove replicas that we don't ack, but those don't appear in
the strong_inode map; they're appended to the base_inode bufferlist. Make
a (temporary) set to track who those are so that we know who to get rid of.
Greg Farnum [Mon, 22 Nov 2010 23:04:22 +0000 (15:04 -0800)]
types: Allow inodeno_t structs to alias.
This removes a compiler warning that appeared in a gcc upgrade and
is apparently erroneous, about its usage violating strict-aliasing rules
when the + operator is used.
Greg Farnum [Mon, 22 Nov 2010 23:02:54 +0000 (15:02 -0800)]
messenger: init rc to -1, removing compiler warning.
This actually is initialized before all uses, but compilers tend to
have trouble with assignment in if-else branches, and -1 is considered
invalid so there's no danger of refactoring breaking anything.
Samuel Just [Tue, 16 Nov 2010 23:29:40 +0000 (15:29 -0800)]
Causes the MDSes to switch among a set of stray directories when
switching to a new journal segment.
MDSCache:
The stray member has been replaced with strays, an array of inodes
representing the set of available stray directories, as well as
stray_index indicating the index of the current stray directory.
get_stray() now returns a pointer to the current stray directory
inode.
advance_stray() advances stray_index to the next stray directory.
migrate_stray no longer takes a source argument; the source mds
is inferred from the parent of the dir entry.
stray dir entries are now stray<index> rather than stray.
scan_stray_dir now scans all stray directories.
MDSLog:
start_new_segment now calls advance_stray() on MDSCache to force a new
stray directory.
mdstypes:
NUM_STRAY indicates the number of stray directories to use per MDS
MDS_INO_STRAY now takes an index argument as well as the mds number
MDS_INO_STRAY_OWNER(i) returns the mds owner of the stray directory i
MDS_INO_STRAY_INDEX(i) returns the index of the stray directory i
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
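One way to picture the inode numbering implied above: each MDS gets a
contiguous block of NUM_STRAY inode numbers, so the owning MDS and the
stray index can both be recovered from an inode number. The sketch below
uses constexpr functions rather than the real macros. The values of
NUM_STRAY and STRAY_OFFSET, and the wrap-around in next_stray_index, are
assumptions for illustration and are not taken from the commit.

```cpp
#include <cstdint>

// Assumed values, for illustration only.
constexpr uint64_t NUM_STRAY = 10;        // stray directories per MDS
constexpr uint64_t STRAY_OFFSET = 0x600;  // hypothetical first stray inode

// Inode number of stray directory `index` on MDS `mds`.
constexpr uint64_t stray_ino(uint64_t mds, uint64_t index) {
  return STRAY_OFFSET + mds * NUM_STRAY + index;
}

// MDS that owns the stray directory with inode number `ino`.
constexpr uint64_t stray_owner(uint64_t ino) {
  return (ino - STRAY_OFFSET) / NUM_STRAY;
}

// Index of the stray directory with inode number `ino` within its MDS.
constexpr uint64_t stray_index(uint64_t ino) {
  return (ino - STRAY_OFFSET) % NUM_STRAY;
}

// Index to use after a new journal segment starts (assumed to wrap).
constexpr uint64_t next_stray_index(uint64_t current) {
  return (current + 1) % NUM_STRAY;
}
```

With this layout, stray_owner(stray_ino(m, i)) gives back m and
stray_index(stray_ino(m, i)) gives back i for any i below NUM_STRAY.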
generate_past_intervals: generate back to lastclean
PG::generate_past_intervals needs to generate all the intervals back to
history.last_epoch_clean, rather than just to
history.last_epoch_started. This is required by
PG::build_might_have_unfound, which needs to examine these intervals
when building the might_have_unfound set.
Move the check for whether past_intervals is up-to-date into
generate_past_intervals itself. Fix the check.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Mon, 22 Nov 2010 17:49:43 +0000 (09:49 -0800)]
osd: bind to new cluster address when wrongly marked down
If we come back up on the same address, there is a possible race. Other
nodes will mark_down when they see us go down. If we go up first, queue
some messages, and _then_ they see that we're down and mark_down, the
messages we queued will get lost. Since it's stateful on the cluster
backend, we need to introduce an ordering so that closing out the _old_
session doesn't break the new session. We do this by binding to a new
address (just a different port, actually) before marking ourselves back
up.
Greg Farnum [Mon, 22 Nov 2010 16:50:32 +0000 (08:50 -0800)]
client: only encode_cap_releases once per request.
Accomplish this by making a list of cap releases in the (permanent)
MetaRequest, and then copying that into the (potentially-temporary)
MClientRequest.
Sage Weil [Mon, 22 Nov 2010 03:59:43 +0000 (19:59 -0800)]
osd: unconditionally set up separate msgr instance for osd<->osd msgs
Always set up cluster_messenger (before we would only do so if there was
an explicit address configured for it). The overhead to do so is minimal,
it simplifies the code, and will allow us to fix down->up transitions
(later).
The test for unfound objects was reversed, leading us to try to pull
unfound objects and refrain from pulling objects that we knew how to
get. Should fix bug #585.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
ReplicatedPG::get_object_context takes three parameters. The last two
are "const object_locator_t& oloc" and "bool can_create".
Unfortunately, booleans can degrade to ints, and ints can be used to
initialize objects of type object_locator_t.
So when you make a call like:
> ctx->snapset_obc = get_object_context(snapoid, true);
What happens is that you actually call:
> get_object_context(snapoid, object_locator_t(1), false);
So you pass an invalid and *not* blank object_locator_t, and pass false
for can_create. This is not what the caller wanted. This change gets rid
of the default parameters and fixes the callers.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
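The pitfall can be reproduced in a few lines. This is a reduced sketch,
not the Ceph code: Locator stands in for object_locator_t,
get_context_old / get_context_new are hypothetical names, and the
default values shown are assumed from the description above.

```cpp
#include <string>

// Stand-in for object_locator_t. The constructor is not explicit, so an
// int (and therefore a bool) converts to a Locator silently.
struct Locator {
  int pool;
  Locator(int p = -1) : pool(p) {}
};

// Records which arguments the callee actually received.
struct Call {
  int pool;
  bool can_create;
};

// Before: both trailing parameters have defaults.
Call get_context_old(const std::string& /*oid*/,
                     const Locator& oloc = Locator(),
                     bool can_create = false) {
  return {oloc.pool, can_create};
}

// After: no defaults, so every caller must spell out both arguments.
Call get_context_new(const std::string& /*oid*/,
                     const Locator& oloc,
                     bool can_create) {
  return {oloc.pool, can_create};
}
```

Calling get_context_old("snap", true) binds true to oloc as Locator(1)
and leaves can_create at its default of false, while the same
two-argument call against get_context_new fails to compile. Marking the
Locator constructor explicit would be another way to catch this, though
that is not what the commit describes.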
Don't loop in ReplicatedPG::start_recovery_ops. There is already a loop
in both recover_replicas and recover_primary that will try to do as many
recovery ops as it can; there's no need to repeat it. Also, the former
loop provably would never execute more than once because of the way
the code was structured.
If there are no more recovery operations to do, and PG::is_all_uptodate
is true at the end of ReplicatedPG::start_recovery_ops, call
finish_recovery.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Wed, 17 Nov 2010 22:37:38 +0000 (14:37 -0800)]
osd: rev PG::Info encoding for last_epoch_clean change
This was missed by 184fbf582b27c10b47101735a4495fe8c73ad186, so any fs
created between now and then won't decode properly. It's more important
to make an fs prior to that work, though, so that the upgrade path from
the last stable version works.
Sage Weil [Wed, 17 Nov 2010 19:39:24 +0000 (11:39 -0800)]
mds: wrlock scatterlocks to prevent a gather racing with split/merge logging
We have the dirs split in our cache for some time while journaling it to
disk, before the fragment_notify goes out. Make sure we don't do a
scatterlock gather during that time that will confuse the inode auth (who
has their dirfrags fragmented differently).
Track discover requests by tid. The old system of tracking outstanding
discovers was kludgey and somewhat broken. Also there is a possibility
of getting dup replies if someone does kick_requests().
There is still room for improvement with the logic determining when a
discover is sent: we may want to discover multiple dirfrags in parallel,
but the current code will only do one at a time.
Signed-off-by: Sage Weil <sage@newdream.net>
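Tracking by tid might look roughly like the sketch below. DiscoverInfo,
DiscoverTracker and their members are hypothetical names, not the MDS
code. The idea is that each outgoing discover gets a fresh tid, and a
reply is only acted on if its tid is still outstanding, so a duplicate
reply is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical record of one outstanding discover request.
struct DiscoverInfo {
  int target_mds;
  uint64_t base_ino;
};

class DiscoverTracker {
 public:
  // Registers a new discover and returns the tid assigned to it.
  uint64_t send(int target_mds, uint64_t base_ino) {
    uint64_t tid = ++last_tid_;
    pending_.emplace(tid, DiscoverInfo{target_mds, base_ino});
    return tid;
  }

  // Returns true if the reply matched an outstanding request. A duplicate
  // or unknown tid returns false and changes nothing.
  bool handle_reply(uint64_t tid) { return pending_.erase(tid) == 1; }

  std::size_t outstanding() const { return pending_.size(); }

 private:
  uint64_t last_tid_ = 0;
  std::map<uint64_t, DiscoverInfo> pending_;
};
```

Keying on tid rather than on the target or the inode means two discovers
for the same directory stay distinguishable, which is what makes dropping
duplicate replies straightforward.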
comment
Jim Schutt [Wed, 17 Nov 2010 20:39:52 +0000 (13:39 -0700)]
Detect broken system linux/fiemap.h
RedHat 5.5 has a /usr/include/linux/fiemap.h, but it is
broken because it does not itself include linux/types.h.
As a result, __u64 and friends are not defined.
We have a Ceph-local copy of fiemap.h, so use it
if the system version is broken.
While we're at it, fix up the configure message to
note we're using a local copy.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>