Sage Weil [Fri, 21 Jan 2011 18:08:26 +0000 (10:08 -0800)]
msgr: always start reaper
If we didn't explicitly bind (i.e. are a client), then we don't start
the accepter. That's fine. But the reaper thread start was also
conditional, when it shouldn't be; otherwise the client can't clean up
old Pipes (and their sockets).
Fixes: #732
Signed-off-by: Sage Weil <sage@newdream.net>
Previously, snap_trimmer would get the clone object information from the
object store rather than using find_object_context. This would cause
the cached version to not be updated with the new version in the case
that the object information got updated. As a result, the need field of
the missing object could get a stale version inconsistent with the most
recent logged version.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Wed, 19 Jan 2011 22:02:27 +0000 (14:02 -0800)]
ReplicatedPG.cc: update coi version and prior_version to match log
Previously, the oi on a clone would not get an updated version when snaps
was updated, so oi.version would lag behind the missing item's need field
during recovery.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Tue, 18 Jan 2011 23:09:51 +0000 (15:09 -0800)]
mds: kick discovers when peers enter active|clientreplay|rejoin
We process discovers when active, clientreplay, or later stages of rejoin.
Wait until then to resend pending discovers. In particular, do NOT send
them when the node has just failed (from handle_mds_failure), or we will
crash.
Sage Weil [Sat, 15 Jan 2011 00:57:38 +0000 (16:57 -0800)]
osd: drop messages from before we moved back to boot state
We want to make sure we ignore any messages sent to us before we moved
back to the boot state (after being wrongly marked down). This is only
a problem currently while we are in the BOOT state and waiting to be
re-added to the map, because we may then call _share_map_incoming and
send something on the new rebound messenger to an old peer. Also assert
that we are !booting there to be sure.
Greg Farnum [Sat, 15 Jan 2011 00:22:11 +0000 (16:22 -0800)]
MDS: Use new C_Gather::get_num_remaining() in MDCache.
It was using get_num(), which now reports the number created.
This probably wouldn't have worked previously except that
~C_Gather::C_GatherSub was inappropriately calling rm_sub().
Greg Farnum [Sat, 15 Jan 2011 00:11:01 +0000 (16:11 -0800)]
C_Gather: Rewrite for thread safety.
Previously, C_Gather wasn't thread safe at all,
and there was an issue with creating subs while some
subs were being finished.
These issues are now fixed.
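The distinction between get_num() and get_num_remaining(), and the kind of
locking a thread-safe gather needs, can be sketched as follows. This is a
simplified illustration, not Ceph's C_Gather: the class name GatherSketch,
the activate() step, and the use of std::function callbacks are all
assumptions made for the example.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Illustrative only: a simplified gather, NOT the real C_Gather.
class GatherSketch {
 public:
  explicit GatherSketch(std::function<void()> on_finish)
      : on_finish_(std::move(on_finish)) {}

  // Register a sub-operation and return the callback that completes it.
  std::function<void()> new_sub() {
    std::lock_guard<std::mutex> l(lock_);
    ++created_;
    ++remaining_;
    return [this] { sub_finish(); };
  }

  // Hypothetical step: the finisher may only fire after the caller has
  // finished creating subs, so early completions cannot trigger it.
  void activate() {
    std::function<void()> fire;
    {
      std::lock_guard<std::mutex> l(lock_);
      activated_ = true;
      if (remaining_ == 0 && !fired_) { fired_ = true; fire = on_finish_; }
    }
    if (fire) fire();  // run the finisher outside the lock
  }

  int get_num() const {            // number of subs ever created
    std::lock_guard<std::mutex> l(lock_);
    return created_;
  }
  int get_num_remaining() const {  // number of subs not yet finished
    std::lock_guard<std::mutex> l(lock_);
    return remaining_;
  }

 private:
  void sub_finish() {
    std::function<void()> fire;
    {
      std::lock_guard<std::mutex> l(lock_);
      assert(remaining_ > 0);
      --remaining_;
      if (activated_ && remaining_ == 0 && !fired_) {
        fired_ = true;
        fire = on_finish_;
      }
    }
    if (fire) fire();
  }

  mutable std::mutex lock_;
  std::function<void()> on_finish_;
  int created_ = 0;
  int remaining_ = 0;
  bool activated_ = false;
  bool fired_ = false;
};
```

A caller that wants to know whether work is still outstanding should ask
get_num_remaining(); after three subs are created and two complete,
get_num() still reports 3 while get_num_remaining() reports 1.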
Sage Weil [Fri, 14 Jan 2011 06:08:40 +0000 (22:08 -0800)]
mds: tolerate (with warning) replayed op with bad prealloc_inos
This comes up when an ESession close is followed by an EMetaBlob that
uses a prealloc_ino. That isn't supposed to happen (it's probably a corner
case with session timeout vs a request waiting on locks that didn't
get killed/canceled?). But tolerate it during replay just the same.
Samuel Just [Thu, 13 Jan 2011 19:15:15 +0000 (11:15 -0800)]
ReplicatedPG: snap_trimmer skip removed snaps without collections
If no writes are made between two snapshots, the first won't get a snap
collection. Subsequently removing that snap causes a crash in
snap_trimmer since the collection does not exist.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Thu, 13 Jan 2011 19:02:58 +0000 (11:02 -0800)]
PG: added adjust_local_snaps, activate now checks local collections
adjust_local_snaps handles removing local collections contained in
to_check. On activate, pg will now remove local collections contained
in purged_snaps.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Thu, 13 Jan 2011 20:18:17 +0000 (12:18 -0800)]
PG: activate should not enqueue snap_trimmer on a replica
Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.
This should fix #702.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Sage Weil [Thu, 13 Jan 2011 21:14:24 +0000 (13:14 -0800)]
filejournal: rewrite completion handling, fix ordering on full->notfull
Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion ordering invariant.
This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions. That same logic needs to also apply to new items
submitted during that commit interval. This was broken before, but the
simpler structure fixes it. Fixes #666.
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
Samuel Just [Wed, 12 Jan 2011 23:09:51 +0000 (15:09 -0800)]
ReplicatedPG: Fix oi.size bug in _rollback_to
_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0. _rollback_to now resets
the size to match the rolled back object. Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 21:51:55 +0000 (13:51 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup
Previously, get_object_context and get_snapset_context did not register
the resulting objects. In some cases, these objects would not get
registered and multiple copies would end up created. This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 20:07:44 +0000 (12:07 -0800)]
ReplicatedPG: snap_trimmer workaround
Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps. This workaround should let kvmtest
come back up. A real fix is still needed.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Mon, 10 Jan 2011 22:45:06 +0000 (14:45 -0800)]
ReplicatedPG: Fix bug in rollback
Previously, _rollback_to assumed that the rollback was a noop if
ctx->clone_obc was set and its prior version matches head's version.
However, this broke in sequences like:
Write "snap1 contents" to oid "blah"
create snapshot "snap1"
Write "snap2 contents" to oid "blah"
create snapshot "snap2"
rollback oid "blah" to snapshot "snap1"
In this case, make_writeable would have just cloned head to the snap2
clone, but the relevant clone is actually "snap1". _rollback_to now
verifies that the most recent clone is the correct one before assuming
that head is already correct.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
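The extra check described above can be sketched as a small predicate. This
is a hedged illustration only: CloneSketch, rollback_is_noop, and the plain
integer snap ids and versions are hypothetical stand-ins, not the actual
ReplicatedPG types or code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of the clone that make_writeable just made.
struct CloneSketch {
  uint64_t snap;           // snapshot this clone belongs to
  uint64_t prior_version;  // head version the clone was taken from
};

// The rollback is a noop only if a clone was just made from the current
// head AND that clone is the one the rollback targets.
bool rollback_is_noop(const CloneSketch* just_cloned,
                      uint64_t head_version,
                      uint64_t target_clone_snap) {
  if (!just_cloned)
    return false;
  if (just_cloned->prior_version != head_version)
    return false;
  // The check the old code lacked: is the newest clone the right one?
  return just_cloned->snap == target_clone_snap;
}
```

In the sequence from the commit message, the fresh clone belongs to snap2
while the target is snap1, so the predicate returns false and the rollback
has to copy the snap1 clone into place.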
Samuel Just [Thu, 6 Jan 2011 23:48:13 +0000 (15:48 -0800)]
ReplicatedPG: clone_overlap should contain one entry per clone
Previously, writefull and _delete_head would remove the last
entry from snapset.clone_overlap. Now, the last entry becomes
an empty interval_set. clone_overlap should contain one entry
per clone.
The missing entries previously caused a bug in _rollback_to where
iter would be clone_overlap.end().
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
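The invariant "one clone_overlap entry per clone" can be illustrated with a
small model. The types below are assumptions for the example: a std::set of
(offset, length) pairs stands in for interval_set, and SnapSetSketch and
clear_newest_overlap are hypothetical names, not Ceph code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Stand-in for interval_set: a set of (offset, length) extents.
using IntervalSetSketch = std::set<std::pair<uint64_t, uint64_t>>;

struct SnapSetSketch {
  std::vector<uint64_t> clones;                           // oldest to newest
  std::map<uint64_t, IntervalSetSketch> clone_overlap;    // one entry per clone
};

// After a full overwrite or delete, head shares no bytes with the newest
// clone. Empty that clone's overlap set but keep the map entry itself.
void clear_newest_overlap(SnapSetSketch& ss) {
  if (ss.clones.empty())
    return;
  ss.clone_overlap[ss.clones.back()].clear();
}
```

Because the key is kept, a later lookup of the newest clone in
clone_overlap finds an empty set instead of hitting end().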
Sage Weil [Fri, 24 Dec 2010 16:36:05 +0000 (08:36 -0800)]
osd: generate backlog if needed to get last_complete >= log.tail || backlog
If primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering. This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!
Sage Weil [Mon, 3 Jan 2011 22:32:48 +0000 (14:32 -0800)]
mds: load root inode on replay if auth
If we are auth for the root inode, load its initial value off of disk. We
may not see it in the log if it has not been modified. If it has, this
is useless but fast/harmless. This only occurs for brand-new filesystems
where the mds is immediately restarted.
It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Mon, 20 Dec 2010 21:22:49 +0000 (13:22 -0800)]
osd: compensate for replicas with tail > last_complete
Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590). In that case, the primary
can compensate by sending more log info to the replica.
Sage Weil [Sat, 18 Dec 2010 05:02:58 +0000 (21:02 -0800)]
mds: make nested scatterlock state change check more robust
The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start). (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)
We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() (which may
call scatter_start) case.
Sage Weil [Fri, 17 Dec 2010 23:12:17 +0000 (15:12 -0800)]
filestore: make OpSequencer::flush() work for writeahead journaling items
It was only waiting for items in the op_queue to complete. The goal is
to wait for anything we've called queue_transactions(&osr,...) on. If we
do writeahead journaling, though, there might be new ops that are still
journaling but not yet submitted to the fs that are missed.
This adds a journal queue to the OpSequencer, and uses it in the writeahead
case only.
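The idea of tracking journaling ops separately so that flush() can see them
can be sketched roughly as below. This is a simplified model under stated
assumptions, not the FileStore code: OpSequencerSketch and its method names
are hypothetical, ops are plain sequence numbers, and flush() here waits
for both queues to drain rather than for a specific op.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

class OpSequencerSketch {
 public:
  // Writeahead mode: the op is being journaled, not yet queued for the fs.
  void queue_journal(uint64_t seq) {
    std::lock_guard<std::mutex> l(lock_);
    journal_q_.push_back(seq);
  }

  // Journal write done: move the op to the fs queue under a single lock
  // hold, so flush() never sees a moment where it is in neither queue.
  void journaled(uint64_t seq) {
    std::lock_guard<std::mutex> l(lock_);
    assert(!journal_q_.empty() && journal_q_.front() == seq);
    journal_q_.pop_front();
    op_q_.push_back(seq);
  }

  // The op has been applied to the fs.
  void applied(uint64_t seq) {
    std::lock_guard<std::mutex> l(lock_);
    assert(!op_q_.empty() && op_q_.front() == seq);
    op_q_.pop_front();
    cond_.notify_all();
  }

  bool idle() {
    std::lock_guard<std::mutex> l(lock_);
    return journal_q_.empty() && op_q_.empty();
  }

  // Block until nothing is journaling and nothing is waiting on the fs.
  void flush() {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [this] { return journal_q_.empty() && op_q_.empty(); });
  }

 private:
  std::mutex lock_;
  std::condition_variable cond_;
  std::deque<uint64_t> journal_q_;  // journaling, not yet submitted to fs
  std::deque<uint64_t> op_q_;       // submitted to fs, not yet applied
};
```

A flush() that looked only at op_q_ would return while an op was still in
journal_q_, which is the gap this commit describes.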
Sage Weil [Fri, 17 Dec 2010 20:54:38 +0000 (12:54 -0800)]
osd: flush pg writes to disk before starting scrub scan
This avoids two races:
- we just completed recovery by pushing objects to the replica, and the
replica starts scanning before those writes reach the fs.
- we just trimmed to something after last_update_applied.
Sage Weil [Tue, 14 Dec 2010 17:26:12 +0000 (09:26 -0800)]
mon: trim pgmap less aggressively
This will make observer crashes due to missed states (#648) much harder to
hit. Eventually the pgmap state trim problem will go away when the
monitor/paxos code is restructured (#647).
Sage Weil [Sun, 12 Dec 2010 22:39:48 +0000 (14:39 -0800)]
mds: fix replay/resent vs completed request check
If it is a _replayed_ request, we should always send a simple ack if it is
completed, because the client does not care about any additional caps.
If it is a _resent_ request, then we want to return useful caps on open or
create requests, even if any modification side-effects have already been
committed. The additional checks for completed already exist in the
create and open handlers.
Vangelis Koukis [Thu, 9 Dec 2010 18:53:22 +0000 (20:53 +0200)]
Fix overflow in FileJournal::_open_file()
Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.
Relevant messages from the osd logfile:
2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal
The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().
Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
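The logged size of 18446744072560312320 bytes is what results when
3000 * 1048576 is computed in 32 bits, read as a signed value, and then
widened to 64 bits. That the configured journal size was 3000 MB is an
inference from that number, not something the commit states. The sketch
below reproduces the arithmetic without relying on undefined signed
overflow; the function names are hypothetical and this is not the code in
FileJournal::_open_file().

```cpp
#include <cassert>
#include <cstdint>

// Model a 32-bit multiply: wrap modulo 2^32, reinterpret the result as a
// signed 32-bit value, then sign-extend to 64 bits.
uint64_t journal_bytes_32bit(uint32_t size_mb) {
  uint32_t wrapped = size_mb * 1048576u;               // wraps modulo 2^32
  int32_t as_signed = static_cast<int32_t>(wrapped);   // negative if >= 2^31
  return static_cast<uint64_t>(static_cast<int64_t>(as_signed));
}

// Widen before multiplying, so the product is computed in 64 bits.
uint64_t journal_bytes_64bit(uint32_t size_mb) {
  return static_cast<uint64_t>(size_mb) << 20;
}
```

With size_mb = 3000 the first function yields 18446744072560312320 and the
second yields 3145728000.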
Sage Weil [Wed, 8 Dec 2010 23:53:13 +0000 (15:53 -0800)]
filejournal: reset last_committed_seq if we find journal to be invalid
If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal. If that happens we also need to reset
last_committed_seq to avoid a crash like
Sage Weil [Tue, 7 Dec 2010 21:31:01 +0000 (13:31 -0800)]
mds: sync->mix replica state is sync->mix(2)
When auth first moves to sync->mix,
- auth sends AC_MIX to replicas
- replicas go to sync->mix
- replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
- auth gets all acks, sends AC_MIX again
- replica moves to MIX
So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.
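The replica-side transitions listed above can be sketched as a small state
machine. This is an illustration under assumptions, not the MDS Locker
code: the enum names, the GATHER_DONE event, and ReplicaLockSketch are
hypothetical, and only the sync->mix path is modeled.

```cpp
#include <cassert>

enum class LockState { SYNC, SYNC_MIX, SYNC_MIX2, MIX };
enum class Event { AC_MIX, GATHER_DONE };

struct ReplicaLockSketch {
  LockState state;
  int acks_sent = 0;  // number of AC_SYNCACK messages sent to auth

  explicit ReplicaLockSketch(LockState s) : state(s) {}

  void handle(Event e) {
    switch (state) {
      case LockState::SYNC:
        if (e == Event::AC_MIX) state = LockState::SYNC_MIX;
        break;
      case LockState::SYNC_MIX:
        if (e == Event::GATHER_DONE) {
          ++acks_sent;                     // send AC_SYNCACK
          state = LockState::SYNC_MIX2;
        }
        break;
      case LockState::SYNC_MIX2:
        if (e == Event::AC_MIX) state = LockState::MIX;  // second AC_MIX
        break;
      case LockState::MIX:
        break;
    }
  }
};

// A replica created while auth is in sync->mix starts at sync->mix(2).
LockState initial_state_for_new_replica() { return LockState::SYNC_MIX2; }
```

A new replica that starts in SYNC_MIX2 reaches MIX on the second AC_MIX,
whereas one that started in SYNC_MIX would ignore that message in this
model and stay stuck.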