Sage Weil [Thu, 13 Jan 2011 21:14:24 +0000 (13:14 -0800)]
filejournal: rewrite completion handling, fix ordering on full->notfull
Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion ordering invariant.
This also fixes an ordering bug: when restarting the journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions. That same logic also needs to apply to new items
submitted during that commit interval. This was broken before, but the
simpler structure fixes it. Fixes #666.
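Roughly, the new structure boils down to a strictly ordered queue of pending
completions that is drained only once a covering committed_thru arrives; a
minimal sketch with made-up names (not the actual FileJournal code):

    #include <cstdint>
    #include <deque>
    #include <functional>

    // Hypothetical sketch: hold completions back until the commit they
    // depend on is known durable, then fire them strictly in order.
    struct DeferredCompletions {
      struct Item {
        uint64_t seq;                  // journal sequence number
        std::function<void()> onsafe;  // completion callback
      };
      std::deque<Item> pending;        // kept ordered by seq

      void submit(uint64_t seq, std::function<void()> onsafe) {
        pending.push_back({seq, std::move(onsafe)});
      }

      // Called when everything up to 'seq' is known to be durable,
      // including the committed_thru of the commit that was already in
      // flight when the journal restarted.
      void committed_thru(uint64_t seq) {
        while (!pending.empty() && pending.front().seq <= seq) {
          pending.front().onsafe();
          pending.pop_front();
        }
      }
    };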
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
Samuel Just [Thu, 13 Jan 2011 20:18:17 +0000 (12:18 -0800)]
PG: activate should not enqueue snap_trimmer on a replica
Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.
This should fix #702.
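The shape of the fix, as an illustrative sketch only (helper names invented,
not the real PG::activate()):

    #include <functional>

    // Only the primary drives snap trimming; a replica whose purged_snaps
    // lags pool->cached_removed_snaps may legitimately see a non-empty
    // snap_trimq and must not enqueue the trimmer.
    void maybe_queue_snap_trim(bool is_primary, bool snap_trimq_empty,
                               const std::function<void()> &queue_snap_trim) {
      if (is_primary && !snap_trimq_empty)
        queue_snap_trim();
    }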
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 23:09:51 +0000 (15:09 -0800)]
ReplicatedPG: Fix oi.size bug in _rollback_to
_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0. _rollback_to now resets
the size to match the rolled-back object. Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.
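A minimal sketch of the size fix, with stand-in types rather than the real
object_info_t:

    #include <cstdint>

    struct object_info_sketch {
      uint64_t size;
    };

    // After _delete_head() has zeroed head's size, copy the size back from
    // the clone being rolled back to, so scrub's size check stays happy.
    void restore_rollback_size(object_info_sketch &head_oi,
                               const object_info_sketch &clone_oi) {
      head_oi.size = clone_oi.size;  // was left at 0 before this fix
    }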
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 21:51:55 +0000 (13:51 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup
Previously, get_object_context and get_snapset_context did not register
the resulting objects. In some cases, these objects would not get
registered and multiple copies would end up being created. This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.
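A rough sketch of the idea, using invented names and a plain std::map in
place of the real registration structures:

    #include <map>
    #include <memory>
    #include <string>

    struct ObjectContextSketch { std::string oid; };

    // Whatever gets created is recorded here, so a later lookup for the
    // same object returns the existing context instead of a second copy.
    static std::map<std::string, std::shared_ptr<ObjectContextSketch>> registry;

    std::shared_ptr<ObjectContextSketch>
    get_object_context_sketch(const std::string &oid) {
      auto it = registry.find(oid);
      if (it != registry.end())
        return it->second;                 // reuse the registered context
      auto obc = std::make_shared<ObjectContextSketch>();
      obc->oid = oid;
      registry[oid] = obc;                 // register on creation
      return obc;
    }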
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Wed, 12 Jan 2011 20:07:44 +0000 (12:07 -0800)]
ReplicatedPG: snap_trimmer workaround
Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps. This workaround should let kvmtest
come back up. A real fix is still needed.
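Roughly, the workaround amounts to something like this sketch (std::set
standing in for interval_set; names invented):

    #include <cstdint>
    #include <set>

    // Drop from snap_trimq anything purged_snaps says is already gone, so
    // the trimmer does not retrim snaps whose clones no longer exist.
    void filter_purged(std::set<uint64_t> &snap_trimq,
                       const std::set<uint64_t> &purged_snaps) {
      for (auto it = snap_trimq.begin(); it != snap_trimq.end(); ) {
        if (purged_snaps.count(*it))
          it = snap_trimq.erase(it);
        else
          ++it;
      }
    }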
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Greg Farnum [Tue, 4 Jan 2011 21:32:47 +0000 (13:32 -0800)]
uclient: Switch how inodes link to dentries a bit.
Inodes now have a set of parent dentries, rather than a single
pointer. This allows the cache to accurately represent multiple
hard links.
Various minor adjustments were made so that this change in
format works and is error-checked.
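The data-structure change, as a hedged sketch with minimal stand-in types:

    #include <set>

    struct DentrySketch;

    // Before: a single back-pointer could describe only one name.
    struct InodeBefore {
      DentrySketch *dn;
    };

    // After: one parent dentry per hard link, so the client cache can
    // represent several names for the same inode.
    struct InodeAfter {
      std::set<DentrySketch*> dn_set;
    };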
Making oldest_update a class variable complicates log merging and wastes
space in the PG struct. Even though memory is big, cachelines are still
small. Just calculate it when we need it.
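A hedged guess at what "calculate it when we need it" could look like: scan
the in-memory log for its smallest version (invented types; the real log
structures differ):

    #include <algorithm>
    #include <cstdint>
    #include <list>

    struct LogEntrySketch { uint64_t version; };

    // Computed on demand instead of cached as a PG member.
    uint64_t oldest_update_sketch(const std::list<LogEntrySketch> &log) {
      uint64_t oldest = UINT64_MAX;
      for (const auto &e : log)
        oldest = std::min(oldest, e.version);
      return oldest;
    }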
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Tommi Virtanen [Tue, 11 Jan 2011 22:02:16 +0000 (14:02 -0800)]
Git ignored files cleanup.
Make gitignore entries not match recursively.
I wanted to introduce a directory "osdmaptool" to contain cli tests
for that tool, but all the files there were ignored because of these
rules. Better to be explicit about what you want ignored.
Move all ignores for generated binaries to be together.
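For illustration (not the actual entries changed here), anchoring a pattern
keeps it from matching same-named paths deeper in the tree:

    # matches a path named "osdmaptool" anywhere in the tree
    osdmaptool
    # matches only the top-level build product
    /osdmaptool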
Samuel Just [Mon, 10 Jan 2011 22:45:06 +0000 (14:45 -0800)]
ReplicatedPG: Fix bug in rollback
Previously, _rollback_to assumed that the rollback was a noop if
ctx->clone_obc was set and its prior version matched head's version.
However, this broke in sequences like:
Write "snap1 contents" to oid "blah"
create snapshot "snap1"
Write "snap2 contents" to oid "blah"
create snapshot "snap2"
rollback oid "blah" to snapshot "snap1"
In this case, make_writeable would have just cloned head to the snap2
clone, but the relevant clone is actually "snap1". _rollback_to now
verifies that the most recent clone is the correct one before assuming
that head is already correct.
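A hedged sketch of the added check (names invented; the real code works on
clone_obc and the SnapSet):

    #include <cstdint>
    #include <vector>

    // The rollback may be treated as a no-op only when the newest clone is
    // the very clone we are rolling back to AND head was cloned from it
    // unmodified; otherwise (as in the snap1/snap2 sequence above) head
    // still has to be rewritten from the older clone.
    bool rollback_is_noop_sketch(const std::vector<uint64_t> &clones, // oldest..newest
                                 uint64_t target_clone,
                                 bool head_matches_newest_clone) {
      if (clones.empty())
        return false;
      return clones.back() == target_clone && head_matches_newest_clone;
    }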
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Tommi Virtanen [Fri, 7 Jan 2011 21:15:40 +0000 (13:15 -0800)]
Use Google Test framework for unit tests.
Use ``make check`` to run the tests.
The src/gtest directory comes from ``svn export
http://googletest.googlecode.com/svn/tags/release-1.5.0 src/gtest``
and running "git add -f src/gtest".
gtest is licensed under the New BSD license, see src/gtest/COPYING.
For more on Google Test, see http://code.google.com/p/googletest/
Changed autogen.sh to regenerate gtest automake files too. Make sure to
run ``./autogen.sh && ./configure`` after merging this commit, or
incremental builds may fail. The automake integration is inspired
heavily by the protobuf project, and may still be problematic.
Make git ignore files generated by gtest compilation.
Currently putting in just one new-style unit test; refactoring old
tests to fit will come in separate commits.
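A new-style test under this setup looks roughly like the hypothetical
example below (only TEST/EXPECT_EQ come from gtest; the function under test
is made up):

    #include "gtest/gtest.h"

    // Exercises pure in-process logic only -- no daemons, sockets, or
    // mounted filesystems.
    static int clamp(int v, int lo, int hi) {
      return v < lo ? lo : (v > hi ? hi : v);
    }

    TEST(ClampTest, StaysWithinBounds) {
      EXPECT_EQ(3, clamp(3, 0, 10));
      EXPECT_EQ(0, clamp(-5, 0, 10));
      EXPECT_EQ(10, clamp(42, 0, 10));
    }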
Note: if you are starting daemons, listening on TCP ports, using
multiple machines, mounting filesystems, etc., it's not a unit test
and does not belong in this setup. A framework for system/integration
tests will be provided later.
Sage Weil [Thu, 6 Jan 2011 22:28:01 +0000 (14:28 -0800)]
mds: take rdlocks on bounding dftlocks; clean up migrator lock code
We need to take an rdlock on bounding dirfrags during migration for a
rather irritating reason: when we export the bound inode, we need to send
scatterlock state for the dirfrags as well, so that the new auth also gets
the correct info. If we race with a refragment, this info is useless, as
we can't redivvy it up. And it's needed for the scatterlocks to work
properly: when the auth is in a sync/lock state it keeps each dirfrag's
portion in the local (auth OR replica) dirfrag.
So: take an rdlock on the bounding dirfrags to avoid this. Clean up the
Locker bulk rdlock interface while we're at it to be more general and
useful.
Also, while we're here, do an rdlock_try at this point. Note that we still
are going to fail more often than before, since dftlocks will frequently
be scattered if there has been a recent fragmentation. There is some
inevitable conflict here between refragmentation (which wants dftlock
in MIX) and exports (which want it SYNC). TODO.
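A purely illustrative sketch of the bulk try-rdlock idea (all names
invented; the real code goes through Locker and SimpleLock):

    #include <vector>

    struct DftLockSketch {
      bool can_rdlock() const { return sync; }  // readable only in SYNC-ish states
      void get_rdlock()       { ++readers; }
      void put_rdlock()       { --readers; }
      bool sync = true;
      int readers = 0;
    };

    // Try to rdlock every bounding dirfrag's dftlock; if any one is
    // scattered (e.g. after a recent refragmentation) release whatever was
    // taken and report failure so the export can be retried later.
    bool rdlock_try_bounds(std::vector<DftLockSketch*> &bounds) {
      std::vector<DftLockSketch*> taken;
      for (auto *l : bounds) {
        if (!l->can_rdlock()) {
          for (auto *t : taken)
            t->put_rdlock();
          return false;
        }
        l->get_rdlock();
        taken.push_back(l);
      }
      return true;
    }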
Samuel Just [Fri, 7 Jan 2011 20:21:09 +0000 (12:21 -0800)]
ReplicatedPG: register_object_context and register_snapset_context cleanup
Previously, get_object_context and get_snapset_context did not register
the resulting objects. In some cases, these objects would not get
registered and multiple copies would end up being created. This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
Samuel Just [Thu, 6 Jan 2011 23:48:13 +0000 (15:48 -0800)]
ReplicatedPG: clone_overlap should contain one entry per clone
Previously, writefull and _delete_head would remove the last
entry from snapset.clone_overlap. Now, the last entry becomes
an empty interval_set. clone_overlap should contain one entry
per clone.
The missing entries previously caused a bug in _rollback_to where
iter would be clone_overlap.end().
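The shape of the fix, sketched with a std::set standing in for interval_set:

    #include <cstdint>
    #include <map>
    #include <set>

    using overlap_map = std::map<uint64_t, std::set<uint64_t>>;

    // On writefull/_delete_head the newest clone no longer overlaps head at
    // all, but it must keep an (empty) entry so there is exactly one entry
    // per clone and _rollback_to's iterator never lands on end().
    void clear_newest_overlap(overlap_map &clone_overlap) {
      if (!clone_overlap.empty())
        clone_overlap.rbegin()->second.clear();   // emptied, not erased
    }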
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
PGMonitor::prepare_pg_stats should check to see if the stats in the
MOSDPgStats message are the same as the ones we already have. If so, no
need to create an incremental; just send an ACK and return false.
The leading Monitor now marks OSDs as down if they haven't sent a
MOSDPGStat message in the last 15 minutes.
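A very rough sketch of the short-circuit (stand-in types; not the real
PGMonitor code):

    #include <cstdint>
    #include <functional>
    #include <map>

    struct pg_stat_sketch {
      uint64_t bytes = 0, objects = 0;
      bool operator==(const pg_stat_sketch &o) const {
        return bytes == o.bytes && objects == o.objects;
      }
    };

    // If the reported stats match what we already have, just ack and return
    // false so no incremental update is proposed.
    bool prepare_pg_stats_sketch(const std::map<int, pg_stat_sketch> &current,
                                 int osd, const pg_stat_sketch &reported,
                                 const std::function<void()> &send_ack) {
      auto it = current.find(osd);
      if (it != current.end() && it->second == reported) {
        send_ack();
        return false;
      }
      return true;     // changed: caller builds the incremental
    }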
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Always forward the PGStats to the leader, even if they are the same as
the old PGStats. The leader will mark down OSDs that haven't sent
PGStats for a few minutes.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Every g_conf.osd_mon_report_interval_max seconds, we send out a PG
stat update even if nothing has changed. This is to let the monitors
know that we're alive.
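The rule reduces to something like this sketch (plumbing invented):

    #include <ctime>

    // Report if anything changed, or if g_conf.osd_mon_report_interval_max
    // seconds have passed since the last report, so the monitor can tell
    // a quiet OSD from a dead one.
    bool should_send_pg_stats(std::time_t now, std::time_t last_report,
                              int report_interval_max, bool stats_changed) {
      return stats_changed || (now - last_report) >= report_interval_max;
    }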
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Greg Farnum [Mon, 20 Dec 2010 21:32:43 +0000 (13:32 -0800)]
mdcache: change replay trimming a bit.
Previously we were re-inserting dentries on the open list. But if
there weren't any other available dentries to trim, this could
have led to an infinite loop!
Now, we save them in a list and pop them back in once the trim
is done.
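A sketch of the new trim loop shape, with made-up types:

    #include <cstddef>
    #include <list>

    struct DentryS {};

    // Dentries we cannot trim are parked on a side list instead of being
    // pushed straight back onto the open list, so a pass over nothing but
    // untrimmable dentries terminates; afterwards they are returned.
    void trim_sketch(std::list<DentryS*> &open_list, std::size_t target,
                     bool (*try_trim)(DentryS*)) {
      std::list<DentryS*> skipped;
      while (open_list.size() + skipped.size() > target && !open_list.empty()) {
        DentryS *dn = open_list.front();
        open_list.pop_front();
        if (!try_trim(dn))
          skipped.push_back(dn);     // park it; do NOT reinsert mid-loop
      }
      open_list.splice(open_list.end(), skipped);  // pop them back in
    }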
Greg Farnum [Fri, 17 Dec 2010 21:25:04 +0000 (13:25 -0800)]
mdlog: return EAGAIN if replay falls off the tail of the journal.
This can happen when we're following an active journal, and
would previously cause the MDS to shut down. Now we return EAGAIN,
so the MDS can recover as it likes.
Currently, that recovery is a simple respawn, as when we discover
we've fallen behind via probing.
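Sketched with invented names (read and expire positions as plain integers),
the check is roughly:

    #include <cerrno>
    #include <cstdint>

    // If the journal has been trimmed past our replay position we have
    // fallen off its tail: report -EAGAIN instead of asserting, and let
    // the caller decide how to recover (currently, respawn).
    int replay_position_check(uint64_t read_pos, uint64_t expire_pos) {
      if (read_pos < expire_pos)
        return -EAGAIN;
      return 0;
    }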
Greg Farnum [Fri, 17 Dec 2010 00:47:30 +0000 (16:47 -0800)]
journaler: Add init_headers function, call when reading head off disk.
Uninitialized headers were causing a failed assert during replay,
and there's no good reason to leave them set at their defaults just
because the *current* incarnation of this MDS has never written to
disk!
Greg Farnum [Thu, 16 Dec 2010 19:53:38 +0000 (11:53 -0800)]
mds: After probing the journal, reset if we've fallen behind.
Previously, if the journal got trimmed and we missed log entries,
we failed out in the journaling step and stopped.
This is still possible and needs to be fixed, but pre-emptively checking
that we're still in the live part of the journal narrows the race window.
Greg Farnum [Wed, 8 Dec 2010 17:42:57 +0000 (09:42 -0800)]
MDSMonitor: Do not set the rank of an MDS in standby-replay or oneshot-replay modes.
This was causing issues with identification in various circumstances,
and turns out to be unnecessary. The MDS will now set its whoami
variable from the standby_for_rank field if that's appropriate.
Greg Farnum [Tue, 7 Dec 2010 19:48:08 +0000 (11:48 -0800)]
Journaler: Remove the unused read_pos field.
Rename it to unused_field, fill the in-memory read_pos
from header.expire_pos, and fill unused_field with the expire_pos
for safety.
(The on-disk header pos was used to fill in read_pos, but it was
always reset to expire_pos before being used and was only ever
set at the end of replay.)
Greg Farnum [Wed, 1 Dec 2010 21:28:44 +0000 (13:28 -0800)]
MDS: Implement the hooks for standby_replay.
This commit adds the necessary state checks and machinery
for the MDS to go through a "looping" replay.
It does not yet implement online trimming, nor is there any
way to get the MDS into or out of a standby_replay state.
Greg Farnum [Wed, 24 Nov 2010 00:20:05 +0000 (16:20 -0800)]
mds: Create new STATE_ONESHOT_REPLAY for the MDS.
This takes over the previous behavior of STATE_STANDBY_REPLAY,
allowing standby-replay to be used for the upcoming continuous-replay
that will enable hot standbys.