Sage Weil [Thu, 3 Mar 2011 00:13:54 +0000 (16:13 -0800)]
mds: rip out rename linkmerge support
It turns out POSIX says rename(a,b) is a no-op when a and b link to the
same inode. This is super weird but good news because it means we can
rip out a bunch of poorly tested code.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
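The POSIX behavior this commit relies on can be checked from userspace. The following is a minimal sketch, not code from the Ceph tree: the helper name and the scratch file names are made up for illustration, and it assumes a local filesystem that supports hard links.

```cpp
#include <cstdio>      // std::rename, std::remove
#include <fstream>
#include <string>
#include <sys/stat.h>  // stat
#include <unistd.h>    // link, getpid

// Returns true if rename(a, b) succeeds and both names still exist and refer
// to the same inode afterwards. The scratch file names are hypothetical.
bool rename_of_hard_links_is_noop(const std::string& dir) {
    const std::string a = dir + "/linkmerge_a_" + std::to_string(getpid());
    const std::string b = dir + "/linkmerge_b_" + std::to_string(getpid());
    std::remove(a.c_str());
    std::remove(b.c_str());

    { std::ofstream out(a); out << "data"; }             // create a
    if (link(a.c_str(), b.c_str()) != 0) return false;   // b: second link to a's inode

    // Both names link to the same inode, so POSIX says this does nothing.
    const int rc = std::rename(a.c_str(), b.c_str());

    struct stat sa{}, sb{};
    const bool a_exists = stat(a.c_str(), &sa) == 0;
    const bool b_exists = stat(b.c_str(), &sb) == 0;
    const bool same_inode = a_exists && b_exists &&
                            sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;

    std::remove(a.c_str());
    std::remove(b.c_str());
    return rc == 0 && same_inode;
}
```

Calling rename_of_hard_links_is_noop("/tmp") should return true on a POSIX-conforming filesystem: the source name is not removed, which is why the MDS needs no special merge path for this case.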
Alexandre Oliva [Wed, 2 Mar 2011 21:39:09 +0000 (13:39 -0800)]
cmds/cosd: Fix IsHeapProfilerRunning implicit return type cast.
G++ complains about the difference between the return type of tcmalloc's
IsHeapProfilerRunning (int) and the return type of the function that
g_conf.profiler_running is supposed to point to (bool). We could
probably get away with a type-cast, but as a compiler developer and
former C++ language lawyer, I'd rather not take the risk of destroying
the universe by invoking undefined behavior ;-)
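One way to avoid the cast is a small wrapper whose signature matches the pointer exactly. This is only a sketch of that idea and may not be what the commit actually did; fake_IsHeapProfilerRunning and conf_t are hypothetical stand-ins so the example builds without tcmalloc or the Ceph config headers.

```cpp
// Hypothetical stub standing in for tcmalloc's IsHeapProfilerRunning(),
// which returns int.
int fake_IsHeapProfilerRunning() { return 1; }

// Hypothetical stand-in for the config structure holding the callback.
struct conf_t {
    bool (*profiler_running)() = nullptr;
};

// Wrapper with exactly the expected type. The int -> bool conversion is
// applied to the returned value, which is well defined, instead of
// reinterpreting the function pointer itself.
static bool heap_profiler_running_wrapper() {
    return fake_IsHeapProfilerRunning() != 0;
}

bool query_profiler(conf_t& conf) {
    conf.profiler_running = heap_profiler_running_wrapper;  // no cast needed
    return conf.profiler_running();
}
```

Calling a function through a pointer of a different function type is undefined behavior in C++, so the wrapper costs one extra call but keeps the code inside the language rules.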
Sage Weil [Tue, 1 Mar 2011 00:05:08 +0000 (16:05 -0800)]
osd: trigger discover_all_missing after replay delay
We were calling discover_all_missing only when we went immediately active,
not after we were in the replay state (which triggers from a timer event
that calls OSD::activate_pg()). Move the call into PG::activate() so that
we catch both callers.
This requires passing in a query_map from the caller. While we're at it,
clean up some other instances where we are defining a new query_map
deep within the call tree.
Fixes: #847 (I hope)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
The headers and ceph_fs.cc are written such that they can be shared
verbatim between the kernel and userspace code. Omitting the headers
was deliberate, because they differ depending on the build environment.
The default file layout seems fine in config.cc, since it is declared
in config.h, and is a bunch of tunables we generally try to keep in
config.cc.
The previous change changed all PoolHandle uses to IoContext. This
change also renames the variable names.
Also fix a few API functions whose names weren't quite right after the
previous change. rados_pool_list really does just list pools; it has
nothing to do with ioctxes.
rados_ioctx_change_auid should be rados_ioctx_pool_set_auid. Although it
takes an ioctx as an argument, it operates on the pool.
rados_ioctx_close should just return void. APIs where the close
operation can fail are broken. What is the user supposed to do if
closing doesn't work?
Also, fix a few test programs that got overlooked earlier.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Thu, 24 Feb 2011 20:31:58 +0000 (12:31 -0800)]
FileStore.h: reorder queue operations in _journaled_ahead
In writeahead mode, an op could disappear from jq without immediately
reappearing in q. Thus, q can be empty before seq is requeued and
finished. _journaled_ahead will now enqueue the op in q before removing
from jq.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
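The reordering described in the commit above can be sketched as follows. This is a simplified, single-threaded model: Sequencer is a hypothetical stand-in for the real OpSequencer, and the `between` callback only marks the point at which another thread could observe the queues.

```cpp
#include <cstdint>
#include <deque>
#include <functional>

struct Sequencer {                  // hypothetical, far simpler than OpSequencer
    std::deque<uint64_t> jq;        // ops waiting on the journal
    std::deque<uint64_t> q;         // ops queued to be applied

    bool empty() const { return jq.empty() && q.empty(); }

    // Enqueue the op in q before removing it from jq, so that it is never
    // absent from both queues. `between` runs at the intermediate point.
    void journaled_ahead(const std::function<void(const Sequencer&)>& between) {
        if (jq.empty()) return;
        q.push_back(jq.front());    // step 1: op becomes visible in q
        between(*this);             // an observer here still sees the op
        jq.pop_front();             // step 2: only now leave jq
    }
};
```

With the two steps in the opposite order, an observer at the intermediate point would see both queues empty even though the op is still in flight. The real code also needs locking, which this sketch leaves out.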
This commit introduced an error in parallel journaling mode.
OpSequencer::flush is only meant to ensure that the ops have become
readable, not necessarily journalled.
Sage Weil [Thu, 24 Feb 2011 13:49:29 +0000 (05:49 -0800)]
Makefile: fix libatomic_ops linking
LDADD seems to have no effect on the final link command. Switching this
back to AM_LDFLAGS. This was changed in 1c7d8f1ac2c, although it's not
clear that the change was intentional...
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Feb 2011 08:34:37 +0000 (00:34 -0800)]
mds: remove "N stopped" from short mdsmap summary
It's confusing because it sounds like we're talking about daemons, when we
really just mean there are some ranks that created some ondisk state but
aren't currently part of the running cluster.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 23:08:58 +0000 (15:08 -0800)]
mds: strengthen assertions in rejoin ack
The ACK only contains items we asked for with a WEAK request. Assert as
much. (The old continue bits were from ~2007, when this was originally
written.)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 22:25:06 +0000 (14:25 -0800)]
mds: fix export cancellation vs nested freezes
Prevent freezes from completing while we are canceling exports. Otherwise
if we are freezing /a/b and /a, and cancel /a/b, we may inadvertently
complete the freeze on /a (synchronously) and confuse ourselves. Pin
all freezes beforehand so that when we cancel each one we do not cause
any others to prematurely complete.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Wed, 23 Feb 2011 21:55:43 +0000 (13:55 -0800)]
FileStore: fix OpSequencer::flush error
In writeahead mode, an op will disappear from jq without immediately
reappearing in q. Thus, q can be empty before seq is requeued and
finished. last_thru_q and last_thru_jq will now be tracked explicitly.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 21:34:01 +0000 (13:34 -0800)]
mon: fix dup mds takeover
Allow a standby to take over for a single MDS only by consistently looking
at the pending_mdsmap and not mdsmap. Mixing the two leads to all kinds
of confusion.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 23 Feb 2011 21:01:08 +0000 (13:01 -0800)]
mds: refragment dirs when inode dirfragtree updates from journal
Force dir fragmentation specified by dirfragtree when replayed from
the journal.
Example:
mds0 is auth for /foo, mds1 is auth for /foo/bar.
mds1 fragments /foo/bar. journals etc.
mds0 gets fragment notify and the in-memory inode's dirfragtree changes.
mds0 journals the /foo/bar inode for some random reason.
mds0 imports /foo/bar.
On replay, mds0 refragments upon first mention of the new fragtree in the
journal, so that the dirfragtree <-> dir frags always match. Confusion is
avoided when we, say, import /foo/bar.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Split rados_init into rados_create and rados_connect. The pattern will
be for users to call create, set configuration, and then connect. Rename
rados_release to rados_destroy, to be more symmetrical with
rados_create. You can't reconnect after calling destroy.
Don't create the messenger inside the RadosClient constructor. Instead,
wait until RadosClient::connect().
Rename rados_conf_apply to rados_reopen_log. Add comment about SIGHUP.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
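The create / configure / connect pattern from the commit above, with network setup deferred until connect, might look roughly like this. Client and Messenger are hypothetical types written for this sketch; they are not the real RadosClient or the librados C API.

```cpp
#include <map>
#include <memory>
#include <string>

struct Messenger {};   // hypothetical placeholder for the network layer

class Client {         // hypothetical; not the real RadosClient
public:
    // Construction does no network setup.
    Client() = default;

    // Configuration can be set between construction and connect().
    void conf_set(const std::string& key, const std::string& value) {
        conf_[key] = value;
    }

    // The messenger is created here, not in the constructor.
    bool connect() {
        if (messenger_) return false;        // already connected
        messenger_.reset(new Messenger());
        return true;
    }

    bool connected() const { return static_cast<bool>(messenger_); }

private:
    std::map<std::string, std::string> conf_;
    std::unique_ptr<Messenger> messenger_;
};
```

In this model, destroying the object plays the role of rados_destroy: once it is gone there is nothing left to reconnect.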
Everyone uses get_conf to get configuration values. So the logic for
defaulting to some value if we can't find the requested key should live
there. Also fix a case in cconf where we could encounter a usage error
and keep on going.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
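Putting the fallback logic in the lookup itself, as the commit above describes, can be sketched like this. ConfMap and get_conf_or are hypothetical names; the real get_conf signature is not shown in this log.

```cpp
#include <map>
#include <string>

using ConfMap = std::map<std::string, std::string>;  // hypothetical stand-in

// Returns the value stored under `key`, or `fallback` if the key is absent.
// Every caller gets the same defaulting behavior from one place.
std::string get_conf_or(const ConfMap& conf,
                        const std::string& key,
                        const std::string& fallback) {
    const auto it = conf.find(key);
    return it == conf.end() ? fallback : it->second;
}
```

Callers then never need their own "if missing, use X" branch, which keeps the defaulting rule consistent across tools.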
Sage Weil [Tue, 22 Feb 2011 20:45:21 +0000 (12:45 -0800)]
osd: fix recovery pointer when pulling head before snapid
If recovery wants to pull a snapped object and needs the head first, pull()
does that, but the caller doesn't ++skipped and incorrectly bumps the
recovery pointer, preventing us from going back and re-pulling the snapped
object later.
Return a tristate enum from pull so we can tell what it did and update our
recovery state appropriately.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
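The tristate return value described in the commit above could be modeled along these lines. The enum values, Obj, and RecoveryState are hypothetical names for this sketch; the real pull() works on far richer state.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names: what pull() actually did for the requested object.
enum class PullResult { None, Object, Head };

struct Obj {
    bool is_snap;        // a snapped object (clone) rather than a head
    bool head_present;   // whether its head is already available locally
};

// Pulls the head first if a snapped object needs it, and says so.
PullResult pull(Obj& o) {
    if (o.is_snap && !o.head_present) {
        o.head_present = true;          // pretend the head gets pulled
        return PullResult::Head;
    }
    return PullResult::Object;
}

struct RecoveryState {
    std::size_t pointer = 0;   // everything before this index is handled
    int skipped = 0;           // objects we must come back to
    int started = 0;           // pulls actually issued
};

RecoveryState recover_pass(std::vector<Obj>& objs) {
    RecoveryState st;
    for (std::size_t i = 0; i < objs.size(); ++i) {
        switch (pull(objs[i])) {
        case PullResult::Object: ++st.started; break;
        case PullResult::Head:   ++st.started; ++st.skipped; break;  // revisit later
        case PullResult::None:   ++st.skipped; break;
        }
        // Only advance the recovery pointer while nothing has been skipped.
        if (st.skipped == 0) st.pointer = i + 1;
    }
    return st;
}
```

Because the caller can tell a head pull from a pull of the requested object, the pointer stops in front of the snapped object and a later pass picks it up again.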
Sage Weil [Tue, 22 Feb 2011 20:20:40 +0000 (12:20 -0800)]
osd: verify object version during push
Fail to push if the ondisk version doesn't match the version we want to
send.
This isn't supposed to happen. If it does it means we have a bug somewhere
else. Log something to the error log and don't push. This is better than
the current behavior, which goes into a loop (repeatedly pulling the object
and retrying when it's not the right version).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
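The check described in the commit above amounts to comparing versions before sending and refusing on a mismatch. A minimal sketch, with ObjVersion and ok_to_push as hypothetical names and std::cerr standing in for the error log:

```cpp
#include <cstdint>
#include <iostream>

struct ObjVersion {        // hypothetical stand-in for an object version
    uint64_t epoch;
    uint64_t version;
};

inline bool operator==(const ObjVersion& a, const ObjVersion& b) {
    return a.epoch == b.epoch && a.version == b.version;
}

// Returns true if the push may proceed. On a mismatch, logs an error and
// refuses, instead of re-pulling and retrying in a loop.
bool ok_to_push(const ObjVersion& ondisk, const ObjVersion& wanted) {
    if (ondisk == wanted) return true;
    std::cerr << "push: ondisk version " << ondisk.epoch << "'" << ondisk.version
              << " != wanted " << wanted.epoch << "'" << wanted.version
              << "; not pushing\n";
    return false;
}
```

Failing once and loudly makes the underlying bug visible in the log rather than hiding it behind endless retries.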
Sage Weil [Tue, 22 Feb 2011 17:40:47 +0000 (09:40 -0800)]
osd: improve up_thru request behavior
There is some epoch the OSD wants for up_thru, based on when the PG mapping
last changed. However, once the monitor gets to the point where it must
update the map, it should set up_thru to the most recent epoch the OSD has
seen (i.e. the epoch it is known to be "up thru"!). This will hopefully/
frequently avoid any subsequent up_thru requests.
MOSDAlive already has a separate field (in PaxosServiceMessage) to hold the
latest epoch; just fix the constructor to set it properly, and make the
monitor use it. No protocol change, yay!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
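The monitor-side rule from the commit above can be sketched as below. AliveRequest is a hypothetical stand-in for MOSDAlive, and the field and function names are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstdint>

using epoch_t = uint32_t;

struct AliveRequest {      // hypothetical stand-in for MOSDAlive
    epoch_t want;          // epoch the OSD needs recorded as up_thru
    epoch_t latest_seen;   // most recent map epoch the OSD has seen
};

// The map only has to change if the recorded value is behind the request.
bool needs_update(epoch_t current_up_thru, const AliveRequest& m) {
    return current_up_thru < m.want;
}

// If an update is needed anyway, record the latest epoch the OSD has seen,
// not just the minimum it asked for.
epoch_t new_up_thru(epoch_t current_up_thru, const AliveRequest& m) {
    if (!needs_update(current_up_thru, m)) return current_up_thru;
    return std::max(m.want, m.latest_seen);
}
```

For example, with up_thru at 5 and a request of {want 8, latest_seen 12}, the map records 12, so a follow-up request wanting 11 needs no further map update.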
Greg Farnum [Mon, 14 Feb 2011 21:24:40 +0000 (13:24 -0800)]
OSD: convert waiting_for_pg from hash_map to map.
This doesn't need to be a hash_map; there will only be an entry
for each PG that gets a message request while it's not active.
Shouldn't be too many PGs that that happens to, right?
Greg Farnum [Sat, 12 Feb 2011 01:25:14 +0000 (17:25 -0800)]
PG: convert hash_maps to maps, remove unused.
waiting_for_[missing|degraded]_object don't need to be
hash_maps, and we don't use stat_object_temp_rd at all.
Swap to map and remove to reduce per-PG memory consumption!