Sage Weil [Wed, 23 Feb 2011 22:25:06 +0000 (14:25 -0800)]
mds: fix export cancellation vs nested freezes
Prevent freezes from completing while we are canceling exports. Otherwise
if we are freezing /a/b and /a, and cancel /a/b, we may inadvertently
complete the freeze on /a (synchronously) and confuse ourselves. Pin
all freezes beforehand so that when we cancel each one we do not cause
any others to prematurely complete.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Wed, 23 Feb 2011 21:55:43 +0000 (13:55 -0800)]
FileStore: fix OpSequencer::flush error
In writeahead mode, an op will disappear from jq without immediately
reappearing in q. Thus, q can be empty before seq is requeued and
finished. last_thru_q and last_thru_jq will now be tracked explicitly.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 21:34:01 +0000 (13:34 -0800)]
mon: fix dup mds takeover
Allow a standby to take over for a single MDS only by consistently looking
at the pending_mdsmap and not mdsmap. Mixing the two leads to all kinds
of confusion.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 21:01:08 +0000 (13:01 -0800)]
mds: refragment dirs when inode dirfragtree updates from journal
Force dir fragmentation specified by dirfragtree when replayed from
the journal.
Example:
mds0 is auth for /foo, mds1 is auth for /foo/bar.
mds1 fragments /foo/bar. journals etc.
mds0 gets fragment notify and the in-memory inode's dirfragtree changes.
mds0 journals the /foo/bar inode for some random reason.
mds0 imports /foo/bar.
On replay, mds0 refragments upon first mention of the new fragtree in the
journal, so that the dirfragtree <-> dir frags always match. Confusion is
avoided when we, say, import /foo/bar.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 22 Feb 2011 20:45:21 +0000 (12:45 -0800)]
osd: fix recovery pointer when pulling head before snapid
If recovery wants to pull a snapped object and needs the head first, pull()
does that, but the caller doesn't ++skipped and incorrectly bumps the
recovery pointer, preventing us from going back and re-pulling the snapped
object later.
Return a tristate enum from pull so we can tell what it did and update our
recovery state appropriately.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
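The tristate return described above can be sketched roughly as follows. This
is only an illustration: the names PullResult, RecoveryState and
apply_pull_result are hypothetical and are not taken from the Ceph tree, and
the bookkeeping is reduced to three counters. The point it shows is that the
recovery pointer advances only when the object that was actually asked for
got pulled.

```cpp
#include <cassert>

// Hypothetical names: the commit only says that pull() now returns a
// tristate enum, not what the values are called.
enum class PullResult {
  None,    // nothing was pulled
  Wanted,  // the object we asked for was pulled
  Other    // a different object (the head) was pulled first
};

// Simplified stand-in for the caller's recovery bookkeeping.
struct RecoveryState {
  int pointer = 0;  // index of the next object to recover
  int started = 0;  // pulls issued
  int skipped = 0;  // objects we must come back to
};

// Update recovery state according to what pull() reported.
void apply_pull_result(RecoveryState& st, PullResult r) {
  switch (r) {
  case PullResult::Wanted:
    ++st.started;
    ++st.pointer;   // safe to move on
    break;
  case PullResult::Other:
    ++st.started;   // a pull was issued, but for the head
    ++st.skipped;   // the snapped object still needs a pull later
    break;
  case PullResult::None:
    ++st.skipped;   // nothing happened; do not advance
    break;
  }
}
```

With a plain boolean return the caller cannot tell Wanted from Other, which
is the case the commit message describes.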
Sage Weil [Tue, 22 Feb 2011 20:20:40 +0000 (12:20 -0800)]
osd: verify object version during push
Fail to push if the ondisk version doesn't match the version we want to
send.
This isn't supposed to happen. If it does it means we have a bug somewhere
else. Log something to the error log and don't push. This is better than
the current behavior, which goes into a loop (repeatedly pulling the object
and retrying when it's not the right version).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
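A minimal sketch of the check described above, assuming a simplified version
stamp. ObjectVersion and ok_to_push are hypothetical names used only for
illustration; the real OSD types and logging calls are not shown in this log.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical, simplified version stamp (not the real OSD type).
struct ObjectVersion {
  unsigned epoch;
  unsigned long long version;
  bool operator==(const ObjectVersion& o) const {
    return epoch == o.epoch && version == o.version;
  }
};

// Returns true if the push may proceed. On a mismatch, log an error and
// refuse, rather than pulling and retrying in a loop.
bool ok_to_push(const ObjectVersion& ondisk, const ObjectVersion& wanted) {
  if (ondisk == wanted)
    return true;
  std::fprintf(stderr,
               "push: ondisk version %u'%llu != wanted %u'%llu, not pushing\n",
               ondisk.epoch, ondisk.version, wanted.epoch, wanted.version);
  return false;
}
```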
Sage Weil [Tue, 22 Feb 2011 17:40:47 +0000 (09:40 -0800)]
osd: improve up_thru request behavior
There is some epoch the OSD wants for up_thru, based on when the PG mapping
last changed. However, once the monitor gets to the point where it must
update the map, it should set up_thru to the most recent epoch the OSD has
seen (i.e. the epoch it is known to be "up thru"!). This will hopefully/
frequently avoid any subsequent up_thru requests.
MOSDAlive already has a separate field (in PaxosServiceMessage) to hold the
latest epoch; just fix the constructor to set it properly, and make the
monitor use it. No protocol change, yay!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Mon, 14 Feb 2011 21:24:40 +0000 (13:24 -0800)]
OSD: convert waiting_for_pg from hash_map to map.
This doesn't need to be a hash_map; there will only be an entry
for each PG that gets a message request while it's not active.
Shouldn't be too many PGs that that happens to, right?
Greg Farnum [Sat, 12 Feb 2011 01:25:14 +0000 (17:25 -0800)]
PG: convert hash_maps to maps, remove unused.
waiting_for_[missing|degraded]_object don't need to be
hash_maps, and we don't use stat_object_temp_rd at all.
Swap to map and remove to reduce per-PG memory consumption!
Shouldn't need to include DoutStreambuf.h; that's all implementation.
Don't include Mutex.h, since we don't use it.
*Do* include config.h, since we need it.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Josh Durgin [Fri, 18 Feb 2011 01:30:19 +0000 (17:30 -0800)]
librbd: hold image context lock minimally
Holding the image context lock during snapshot removal prevented the
client from responding to a notify, causing a deadlock. This could be
triggered by removing a snapshot while concurrently adding more to the
same image.
Greg Farnum [Tue, 15 Feb 2011 16:58:48 +0000 (08:58 -0800)]
Journaler: call set_layout after init_headers.
set_layout modifies last_committed, but then init_headers
uses operator= and overwrites those changes. In this case
it doesn't matter as they're both writing the same changes,
but make the ordering explicit for the future.
We should handle the situation where we assert() while already holding
the dout() lock. At the same time, we want to get the dout lock if we
can, because it makes the logs look nicer. pthread_mutex_trylock solves
the dilemma.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
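The trylock approach described above might look roughly like this. It is a
sketch under assumptions: log_lock stands in for the dout lock, and
report_failure is a hypothetical helper, not the actual assert handler.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdio>

// Stand-in for the dout lock: a plain, statically initialized mutex.
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

// Print a failure message, taking the log lock only if it is free.
// Returns true if the lock was held while printing.
bool report_failure(const char* msg) {
  assert(msg != nullptr);
  // trylock returns immediately (EBUSY) if the mutex is already locked,
  // even by the calling thread, so this path cannot self-deadlock.
  bool locked = (pthread_mutex_trylock(&log_lock) == 0);
  std::fprintf(stderr, "%s\n", msg);
  if (locked)
    pthread_mutex_unlock(&log_lock);
  return locked;
}
```

If the lock is free, output is serialized with other log lines; if the
failing thread already holds it, the message is still printed, just without
the lock.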
Convert _dout_lock to plain pthread_mutex_t. This way, we don't have to
depend on the order of global constructor initialization. It should also
be slightly more efficient. The dout_lock was never subject to lockdep
anyway, so that's not an issue.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Using ELF TLS via the __thread keyword is much faster than using
pthread_getspecific and pthread_setspecific. It's also much nicer
looking syntactically. Finally, the __thread keyword is going to be
standardized in C++0x. So there's no reason to have an infrastructure
dependent on pthread_getspecific.
There were no users so this shouldn't affect anything negatively.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
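For comparison, here is a small sketch of the two styles the message above
contrasts. All names (tls_depth, depth_key, enter_tls, enter_key) are
hypothetical. Note that __thread is a GCC/Clang extension; the keyword that
C++11 eventually standardized for thread-local storage is thread_local.

```cpp
#include <pthread.h>
#include <cassert>

// Style 1: ELF TLS. Each thread gets its own copy of this variable.
static __thread int tls_depth = 0;

int enter_tls() { return ++tls_depth; }

// Style 2: POSIX thread-specific data, which needs a key, one-time key
// creation, a destructor, and a lookup on every access.
static pthread_key_t depth_key;
static pthread_once_t depth_key_once = PTHREAD_ONCE_INIT;

static void destroy_depth(void* p) { delete static_cast<int*>(p); }

static void make_depth_key() {
  int rc = pthread_key_create(&depth_key, destroy_depth);
  assert(rc == 0);
  (void)rc;  // silence unused-variable warnings when NDEBUG is set
}

int enter_key() {
  pthread_once(&depth_key_once, make_depth_key);
  int* depth = static_cast<int*>(pthread_getspecific(depth_key));
  if (depth == nullptr) {
    // First use on this thread: allocate and register the per-thread slot.
    depth = new int(0);
    pthread_setspecific(depth_key, depth);
  }
  return ++*depth;
}
```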
Sage Weil [Sat, 12 Feb 2011 06:47:51 +0000 (22:47 -0800)]
debian: add python, python-dev build-deps
Might be overkill? The error I see from pbuilder is
checking for a Python interpreter with version >= 2.4... none
error: configure: in `/tmp/buildd/ceph-0.24.3-676-gcde53e9':
error: configure: Failed to find Python 2.4 or newer
...but I'm guessing python-dev is needed too?
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Fri, 11 Feb 2011 23:54:18 +0000 (15:54 -0800)]
MDCache: add max_dir_commit_size.
Configured by setting mds_dir_max_commit_size in conf, or else
by looking at osd_max_write_size. This should lead to sane
max commits even if the user doesn't specify anything.
This will be used in the next commit or two by CDir.
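The fallback described above could be sketched like this. The function name,
the megabyte units, and the 90% headroom factor are all assumptions made for
illustration; the commit message only states that an explicit
mds_dir_max_commit_size wins and that osd_max_write_size is consulted
otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper. Both inputs are assumed to be in megabytes, with 0
// meaning "not set" for the MDS option.
std::uint64_t pick_max_dir_commit_size(std::uint64_t mds_dir_max_commit_mb,
                                       std::uint64_t osd_max_write_mb) {
  const std::uint64_t MB = 1024 * 1024;
  if (mds_dir_max_commit_mb != 0)
    return mds_dir_max_commit_mb * MB;  // explicit setting wins
  assert(osd_max_write_mb > 0);
  // Assumed headroom: stay a little below the largest write the OSD accepts.
  return osd_max_write_mb * MB * 9 / 10;
}
```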
Josh Durgin [Fri, 11 Feb 2011 21:21:05 +0000 (13:21 -0800)]
objecter: set linger op target pg when a linger is resent
send_linger always creates a new Op, but op_submit does not fill in
the target pg if an existing session is passed in, so when a linger
was resent, it had the wrong pg set.
This caused a crash in cosd with debugging turned on when running
testlibrbd twice. This occurred because the object context for the
linger in the wrong pg had no object name set.
Currently, we haven't read the configuration at the time we initialize
these locks. So we can't know whether lockdep has been enabled, or what
verbosity it is supposed to have. So just disable it on these locks.
Potentially ExportControl's initialization could be moved to after
g_conf.lockdep and g_conf.debug_lockdep have been read from the
configuration, if lockdep is needed for this component.
ConfFile probably doesn't need a lock at all, but that's another story.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>