Greg Farnum [Mon, 20 Dec 2010 21:32:43 +0000 (13:32 -0800)]
mdcache: change replay trimming a bit.
Previously we were re-inserting dentries on the open list. But if
there weren't any other available dentries to trim, this could
have led to an infinite loop!
Now, we save them in a list and pop them back in once the trim
is done.
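A minimal sketch of the "save them in a list and pop them back in" idea, assuming a simplified LRU. The `Entry` type and the `trim` signature are hypothetical stand-ins, not the actual MDCache code:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical stand-in for a cache entry; not the real CDentry.
struct Entry {
  std::string name;
  bool trimmable;
};

// Trim the LRU toward `target` entries. Entries that cannot be trimmed are
// parked in `deferred` rather than re-inserted into `lru`, so every
// iteration removes one entry from `lru` and the loop must terminate.
std::size_t trim(std::list<Entry>& lru, std::size_t target) {
  std::list<Entry> deferred;
  std::size_t trimmed = 0;
  while (lru.size() + deferred.size() > target && !lru.empty()) {
    Entry e = lru.front();
    lru.pop_front();
    if (e.trimmable)
      ++trimmed;               // dropped from the cache
    else
      deferred.push_back(e);   // keep, but out of the loop's way
  }
  // Trim is done: pop the saved entries back into the LRU.
  lru.splice(lru.end(), deferred);
  assert(deferred.empty());
  return trimmed;
}
```

If the untrimmable entries were pushed straight back onto `lru` inside the loop, a cache holding only such entries would never shrink and the loop would never exit.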
Greg Farnum [Fri, 17 Dec 2010 21:25:04 +0000 (13:25 -0800)]
mdlog: return EAGAIN if replay falls off the tail of the journal.
This can happen when we're following an active journal, and
would previously cause the MDS to shut down. Now we return EAGAIN,
so the MDS can recover as it likes.
Currently, that recovery is a simple respawn, as when we discover
we've fallen behind via probing.
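A rough sketch of the check, under the assumption that "falling off the tail" means the reader's position is older than the journal's current expire position. `JournalBounds` and `check_replay_pos` are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical view of the live region of a journal: [expire_pos, write_pos).
struct JournalBounds {
  uint64_t expire_pos;
  uint64_t write_pos;
};

// Returns 0 if read_pos is still inside the live journal, or -EAGAIN if the
// journal has been trimmed past it. The caller decides how to recover
// (for example by respawning) instead of shutting down.
int check_replay_pos(const JournalBounds& j, uint64_t read_pos) {
  assert(j.expire_pos <= j.write_pos);
  if (read_pos < j.expire_pos)
    return -EAGAIN;
  return 0;
}
```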
Greg Farnum [Fri, 17 Dec 2010 00:47:30 +0000 (16:47 -0800)]
journaler: Add init_headers function, call when reading head off disk.
Uninitialized headers were causing a failed assert during replay,
and there's no good reason to leave them set at their defaults just
because the *current* incarnation of this MDS has never written to
disk!
Greg Farnum [Thu, 16 Dec 2010 19:53:38 +0000 (11:53 -0800)]
mds: After probing the journal, reset if we've fallen behind.
Previously, if the journal got trimmed and we missed log entries,
we failed out in the journaling step and stopped.
This is still possible and needs to be fixed, but pre-emptively checking
that we're still in the live part of the journal narrows the race range.
Greg Farnum [Wed, 8 Dec 2010 17:42:57 +0000 (09:42 -0800)]
MDSMonitor: Do not set the rank of an MDS in standby-replay
or oneshot-replay modes.
This was causing issues with identification in various circumstances,
and turns out to be unnecessary. The MDS will now set its whoami
variable from the standby_for_rank field if that's appropriate.
Greg Farnum [Tue, 7 Dec 2010 19:48:08 +0000 (11:48 -0800)]
Journaler: Remove the unused read_pos field.
Rename it to unused_field, fill the in-memory read_pos
from header.expire_pos, and fill unused_field with the expire_pos
for safety.
(The on-disk header pos was used to fill in read_pos, but it was
always reset to expire_pos before being used and was only ever
set at the end of replay.)
Greg Farnum [Wed, 1 Dec 2010 21:28:44 +0000 (13:28 -0800)]
MDS: Implement the hooks for standby_replay.
This commit adds the necessary state checks and machinery
for the MDS to go through a "looping" replay.
It does not yet implement online trimming, nor is there any
way to get the MDS into or out of a standby_replay state.
Greg Farnum [Wed, 24 Nov 2010 00:20:05 +0000 (16:20 -0800)]
mds: Create new STATE_ONESHOT_REPLAY for the MDS.
This takes over the previous behavior of STATE_STANDBY_REPLAY,
allowing standby-replay to be used for the upcoming continuous-replay
that will enable hot standbys.
Sage Weil [Wed, 5 Jan 2011 23:31:06 +0000 (15:31 -0800)]
mds: change refragment journaling/store strategy
We had a serious problem before where we were updating the cache and
redivvying up the dentries among fragments, but not immediately
journaling it. This was okay only if we were lucky and no other update
journaled something (e.g. some random child journaling its ancestors).
Instead, journal (PREPARE) immediately and in parallel with the new
dirfrag stores. When the stores complete, journal again (COMMIT). On
journal replay, for any PREPAREs without matching COMMITs we immediately
journal a ROLLBACK.
Other behavior is essentially unchanged. We don't send the notify until
both the PREPARE and STORES complete. But that part doesn't really matter:
if we restart and roll back, peers will find out during resolve/rejoin,
as before.
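The replay rule (any PREPARE without a matching COMMIT gets a ROLLBACK) can be sketched as follows. The `FragOp` and `FragEvent` types and the numeric ids are assumptions made for this example, not the real journal event classes:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical journal event kinds for a refragment operation.
enum class FragOp { PREPARE, COMMIT, ROLLBACK };

struct FragEvent {
  FragOp op;
  uint64_t id;  // hypothetical identifier tying the events of one operation together
};

// Scan a replayed log and return the ids of operations that were PREPAREd
// but never resolved. On replay, each of these would get a ROLLBACK journaled.
std::vector<uint64_t> find_uncommitted(const std::vector<FragEvent>& log) {
  std::set<uint64_t> pending;
  for (const FragEvent& ev : log) {
    if (ev.op == FragOp::PREPARE)
      pending.insert(ev.id);
    else
      pending.erase(ev.id);  // COMMIT or ROLLBACK resolves the operation
  }
  return std::vector<uint64_t>(pending.begin(), pending.end());
}
```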
Command-line programs (as opposed to daemons) should send their logs to
stderr rather than to a log file, syslog, etc. This is especially
important because most users want to run the ceph command-line programs
as non-root, and often only root has permissions to add to the ceph
log directory.
Create a new function, set_foreground_logging, that overrides ceph.conf
settings to force all log output to stderr. For daemons, we still only
send the very highest priority messages to stderr, and only before they
daemonize().
Don't ever log to stdout because it interferes with scripts that parse
the output of stdout. Instead, log to stderr if the user gives the
--foreground or --nodaemon argument.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
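As an illustration only, an override like this could look roughly as follows. The function name comes from the commit message, but the `LogConf` struct, its fields, and the signature are hypothetical and do not reflect the real g_conf layout:

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of logging settings loaded from ceph.conf.
struct LogConf {
  std::string log_file;  // empty means "no log file"
  bool log_to_syslog;
  bool log_to_stderr;
};

// Force all log output to stderr, overriding whatever the config file said.
// Intended for command-line programs, which often run as non-root and may
// not be able to write to the log directory.
void set_foreground_logging(LogConf& conf) {
  conf.log_file.clear();
  conf.log_to_syslog = false;
  conf.log_to_stderr = true;
}
```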
Sage Weil [Tue, 4 Jan 2011 22:45:34 +0000 (14:45 -0800)]
mds: force fragmentation for ambiguous imports as well
Handle needed refragmentation for processing ambiguous bounds. That means
forcing the peers' subtree root fragmentation, and also interpreting the
peer's bounds appropriately, given that the peer's fragmentation may not
match our own.
Sage Weil [Tue, 4 Jan 2011 22:39:58 +0000 (14:39 -0800)]
mds: make resolve adjust dir fragmentation as needed
During resolve, adjust dir fragmentation as needed based on the subtrees
the sender explicitly claims. The given fragmentation on the root is
always valid. Their bounds may not be; only split our frags as needed if
they happen to be partially in and partially out of the sender's bounding
fragset.
Sage Weil [Tue, 4 Jan 2011 18:20:18 +0000 (10:20 -0800)]
client: fix frag selection code
Calling fragtree_t::contains() on a non-frag_t is nonsense and will crash.
And a fragtree is a complete partition of the space. What we really want
to check is if we know where to find the specific frag_t we need.
Sage Weil [Fri, 24 Dec 2010 16:36:05 +0000 (08:36 -0800)]
osd: generate backlog if needed to get last_complete >= log.tail || backlog
If primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering. This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!
osd: Make g_conf.osd_max_notify_timeout a uint32_t
Make g_conf.osd_max_notify_timeout a uint32_t. Squashes an annoying
compiler warning and avoids the awkward issue of users specifying
negative timeouts.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Mon, 3 Jan 2011 22:32:48 +0000 (14:32 -0800)]
mds: load root inode on replay if auth
If we are auth for the root inode, load its initial value off of disk. We
may not see it in the log if it has not been modified. If it has, this
is useless but fast/harmless. This only occurs for brand-new filesystems
where the mds is immediately restarted.
It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
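A small sketch of the kind of fix described, assuming a simplified structure. The `History` struct below is hypothetical and omits the other fields of the real PG::Info::History; the only field name taken from the message is last_epoch_clean:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced version of a PG history record.
struct History {
  uint32_t epoch_created;
  uint32_t last_epoch_clean;

  // Zero every field explicitly so a freshly created History never carries
  // whatever garbage happened to be in memory.
  History() : epoch_created(0), last_epoch_clean(0) {}
};
```

Without the constructor, a default-initialized `History` on the stack would hold indeterminate values, which is consistent with the odd log output mentioned above.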
Make g_conf.keyring a plain old string rather than an array of strings.
Don't do substitution using the user's HOME variable; this could lead
to security holes for setuid processes.
Get rid of AuthMonitor::read_keyfile because there is already a Keyring
member function, Keyring::load, that does the same thing.
qa/rbd/common.sh: we can now use cconf to figure out what the keyring
is.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>