Tommi Virtanen [Tue, 19 Apr 2011 18:20:24 +0000 (11:20 -0700)]
debian: Handle missing tcmalloc on Debian lenny.
lenny doesn't have a suitable libgoogle-perftools-dev, and
release.sh edits it out of build-deps. Detect that and tell
configure that not having tcmalloc is ok.
Sage Weil [Tue, 19 Apr 2011 16:25:30 +0000 (09:25 -0700)]
mds: remove MDSlaveUpdate from list on deletion
These are added to the LogSegment list on the slaves, but also need to be
removed from that list when we replay a COMMIT|ROLLBACK or when the op's
fate is determined during the resolve stage.
This fixes a crash like
./include/elist.h: In function 'elist<T>::item::~item() [with T =
MDSlaveUpdate*]', in thread '0x7fb2004d5700'
./include/elist.h: 39: FAILED assert(!is_on_list())
ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
1: (MDSlaveUpdate::~MDSlaveUpdate()+0x59) [0x4d9fe9]
2: (ESlaveUpdate::replay(MDS*)+0x422) [0x4d2772]
3: (MDLog::_replay_thread()+0xb90) [0x67f850]
4: (MDLog::ReplayThread::entry()+0xd) [0x4b89ed]
5: (()+0x7971) [0x7fb20564a971]
6: (clone()+0x6d) [0x7fb2042e692d]
Fixes: #1019
Signed-off-by: Sage Weil <sage@newdream.net>
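The invariant behind the assertion in the trace above can be sketched in a few lines. This is a minimal Python model, not the Ceph C++ code: ListItem, SegmentList and SlaveUpdate are hypothetical stand-ins for elist<T>::item, the LogSegment list and MDSlaveUpdate. It only illustrates that a list hook must be unlinked before it is torn down.

```python
class ListItem:
    """Hypothetical stand-in for an intrusive list hook such as elist<T>::item."""

    def __init__(self, owner):
        self.owner = owner
        self.container = None  # the list this item is linked into, if any

    def is_on_list(self):
        return self.container is not None

    def remove_myself(self):
        # Unlink from the containing list; a no-op if not linked.
        if self.container is not None:
            self.container.items.remove(self)
            self.container = None

    def destroy(self):
        # Models the failed assertion in the trace: unlink before teardown.
        assert not self.is_on_list(), "item destroyed while still on a list"


class SegmentList:
    """Hypothetical stand-in for the per-LogSegment list."""

    def __init__(self):
        self.items = []

    def push_back(self, item):
        assert not item.is_on_list()
        item.container = self
        self.items.append(item)


class SlaveUpdate:
    """Hypothetical stand-in for MDSlaveUpdate."""

    def __init__(self):
        self.item = ListItem(self)

    def destroy(self):
        # The fix described above: remove from the list on deletion.
        self.item.remove_myself()
        self.item.destroy()
```

Destroying a bare ListItem that is still linked trips the assertion, while SlaveUpdate.destroy() unlinks first and succeeds.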
MDS: Make _rename_apply inode import auth_pinning more intelligent.
We don't want auth_pins on the locallocks (they're never auth_pinned)
and we only want new auth_pins that are for locks on the inode that we
imported -- not for each xlock that the mdr has everywhere (like,
say, on the srcdn)!
Greg Farnum [Thu, 31 Mar 2011 21:02:48 +0000 (14:02 -0700)]
mds: If we're a slave, clean up xlocks when we export an inode.
Because we can do an inode import during a rename that skips the usual
channels, we were getting into an odd state with the xlocks (which we
did as a slave for an inode that we exported away). Clean up the
record of these xlocks for inodes before we get into the request
cleanup (at which point we are labeled as no-longer-auth, and the
standard cleanup routines will break).
Greg Farnum [Thu, 31 Mar 2011 00:10:05 +0000 (17:10 -0700)]
mds: properly drop imported xlocks.
Because we can do an inode import during a rename that skips the usual
channels, we were getting into an odd state with the xlocks (which
were formerly remote and are now local). Clean up the record of
those remote xlocks.
Rename all the get_uid_by_* functions to get_user_info_by_*. Remove
get_user_info() and call the appropriate function instead (either the
by_uid or the by_access_key variant).
mds: don't run all of try_subtree_merge on a rename across MDSes.
Previously we'd try to do the whole thing, which meant that
the replica got a lock twiddle before it had finished the export.
That broke things spectacularly, since we weren't respecting our
invariants about who gets remote locking messages.
Now we pass through a flag and respect our invariants.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
I don't remember why we needed can_xlock_local() to begin with, but
I can tell that adding this get_xlock_by() check won't break anything
that was already working (really, it's still not a strong enough
check).
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Thu, 24 Mar 2011 21:11:06 +0000 (14:11 -0700)]
MDS: Remove inappropriate assert from _logged_slave_rename.
The slave also can hold some auth pins from locks which the
master has asked it to grab. It's possible we can intelligently
determine how many, but for now just drop the assert.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Thu, 24 Mar 2011 19:23:38 +0000 (12:23 -0700)]
MDS: Server::handle_slave_rename_prep now accounts for dir snaplock.
Previously it ignored the auth pin required to hold snap xlock, which
is currently always held for a rename on a dir. This would lead to
a permanent hang on the request. Now we account for it!
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Tue, 22 Mar 2011 21:23:33 +0000 (14:23 -0700)]
Server: ensure slave mdses have full dest tree
We were already taking rdlocks on the source tree, to make
sure that each slave MDS could traverse to the source dentry. Now,
if there are slave MDSes, we take rdlocks on each destination
ancestor to make sure the slaves can also traverse there.
This fixes an fsstress bug.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Sage Weil [Fri, 15 Apr 2011 22:51:50 +0000 (15:51 -0700)]
mds: keep import/export subtree_map state in sync with journal
We were being sloppy before with the ESubtreeMap vs import/export events.
Fix that by doing a few things:
- add an ambig flag to the subtree map items, and set it for in-progress
  imports. That means an ESubtreeMap followed by EImportFinish will do
  the right thing now.
- adjust the dir_auth on EExport journaling (handle_export_dir_ack) so
  that our journaled subtree_map state is always in sync with what we
  see during replay.
Also document clearly what the dir_auth variations actually mean.
Sage Weil [Fri, 15 Apr 2011 20:53:54 +0000 (13:53 -0700)]
mds: fix export cancel during IMPORT_PREPPING
If we are in PREPPING, we need to drop the stickydirs() on the inodes, and
not the pins on the dirfrags. Do this in the helper so we can keep the
call chains simple.
Also deal with the case where we get a cancel in PREPPED state.
Sage Weil [Thu, 14 Apr 2011 01:36:33 +0000 (18:36 -0700)]
mds: cancel exports in PREPPING state on any failure
The prepping nodes may need to discover bounds from the failed node and
may hang indefinitely. Meanwhile, we won't send out mds_resolve messages
until in-progress migrations complete. Deadlock.
In certain cases the importing node can manufacture the replica. If it
doesn't realize that right off, though, it will get hung up trying to
discover from the wrong node, get referred to the failed node, and block
waiting for recovery. The replica forging is a bit suspect anyway, so
let's avoid the whole thing if we can!
Sage Weil [Fri, 15 Apr 2011 17:02:46 +0000 (10:02 -0700)]
mds: don't skip inodes in journal that may be trimmed during replay
During replay we trim non-auth inodes on EExport or EImportFinish abort.
Subtree trimming may be delayed, too.
Skip parents if the diri is in the same blob, or if it is journaled in the
current segment *and* it is in a subtree that is unambiguously auth. We can't
easily be more precise than that because the actual event we care about on
replay is EExport, but the migrator doesn't twiddle auth bits to false until
later.
Also, reset last_journaled on import.
This fixes replay bugs like
2011-04-13 18:15:18.064029 7f65588ef710 mds1.journal EImportStart.replay 10000000015 bounds []
2011-04-13 18:15:18.064034 7f65588ef710 mds1.journal EMetaBlob.replay 2 dirlumps by unknown0
2011-04-13 18:15:18.064040 7f65588ef710 mds1.journal EMetaBlob.replay dir 10000000010
2011-04-13 18:15:18.064046 7f65588ef710 mds1.journal EMetaBlob.replay missing dir ino 10000000010
mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*)', in thread '0x7f65588ef710'
mds/journal.cc: 407: FAILED assert(0)
ceph version 0.25-683-g653580a (commit:653580ae84c471c34872f14a0308c78af71f7243)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x53) [0xa53d26]
2: (EMetaBlob::replay(MDS*, LogSegment*)+0x7eb) [0x7a737d]
Fixes: #994
Signed-off-by: Sage Weil <sage@newdream.net>
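The skip rule stated in this commit message reduces to a small boolean predicate. The sketch below is illustrative Python, not the journal code itself. The function name and its three parameters are hypothetical labels for the conditions named in the message.

```python
def can_skip_parent(in_same_blob, journaled_in_current_segment,
                    subtree_unambiguously_auth):
    """Return True if the parent need not be journaled again.

    Hypothetical helper: skip when the diri is already in the same blob,
    or when it was journaled in the current segment AND sits in a subtree
    that is unambiguously auth.
    """
    return in_same_blob or (
        journaled_in_current_segment and subtree_unambiguously_auth
    )
```

Note that being journaled in the current segment is not sufficient on its own: if the subtree's authority is ambiguous, the inode may be trimmed during replay, so the parent is kept.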
Stop accepting old-style section names of the form $type$id. Instead,
we want section names of the form $type.$id. So [osd0] will no longer
be a valid section name; instead, use [osd.0].
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
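As an illustration of the naming rule above, here is a minimal Python sketch of a section-name check. It is not the Ceph config parser. The function name, the list of daemon types and the handling of plain type-wide sections such as [osd] or [global] are assumptions made for the example.

```python
import re

# Assumption: an illustrative, non-exhaustive list of daemon types.
DAEMON_TYPES = ("osd", "mds", "mon")
_TYPES = "|".join(DAEMON_TYPES)

NEW_STYLE = re.compile(r"^(%s)\.(\w+)$" % _TYPES)  # $type.$id
OLD_STYLE = re.compile(r"^(%s)(\w+)$" % _TYPES)    # $type$id


def check_section_name(name):
    """Return (type, id) for a valid section name, or raise ValueError."""
    if name == "global" or name in DAEMON_TYPES:
        return name, None  # type-wide section with no id
    m = NEW_STYLE.match(name)
    if m:
        return m.group(1), m.group(2)
    m = OLD_STYLE.match(name)
    if m:
        raise ValueError("old-style section [%s]; use [%s.%s]"
                         % (name, m.group(1), m.group(2)))
    raise ValueError("unrecognized section name [%s]" % name)
```

With this sketch, check_section_name("osd.0") is accepted, while check_section_name("osd0") raises an error that suggests [osd.0].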
Rados Gateway: get rid of RGWOp::err. We already have req_state::err and
that represents the same thing.
Standardize nomenclature for errors. 'errno' is our internal
representation of the error. 'code' is what is returned by S3.
'message' is the message at the end. Improve rgw_err.
dump_errno shouldn't modify req_state, but just dump the error.
A new function set_req_state_err sets the error based on an 'errno'.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
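The split between setting and dumping an error can be sketched as follows. This is a Python model under stated assumptions, not the rgw C++ code: the field names (http_ret, s3_code, message) and the errno-to-code table are hypothetical. Only the function names set_req_state_err and dump_errno, and the idea that dumping must not modify the request state, come from the message above.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class RgwErr:
    http_ret: int = 200   # HTTP status
    s3_code: str = ""     # 'code': what is returned by S3
    message: str = ""     # 'message': the text at the end


@dataclass
class ReqState:
    err: RgwErr = field(default_factory=RgwErr)


# Hypothetical mapping from the internal errno to (HTTP status, S3 code).
ERR_TABLE = {
    errno.ENOENT: (404, "NoSuchKey"),
    errno.EACCES: (403, "AccessDenied"),
    errno.EEXIST: (409, "BucketAlreadyExists"),
}


def set_req_state_err(s, err_no, message=""):
    """Set the error on the request state from a positive or negative errno."""
    http_ret, code = ERR_TABLE.get(abs(err_no), (500, "InternalError"))
    s.err = RgwErr(http_ret, code, message)


def dump_errno(s):
    """Format the status from the request state without modifying it."""
    return "%d %s" % (s.err.http_ret, s.err.s3_code)
```

Keeping all mutation in set_req_state_err means dump_errno can be called any number of times without changing what the client eventually sees.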
Reading a config file into any md_config_t structure except g_conf used
to be impossible. This is because the config_option code used to
contain explicit references to g_conf. Those have been removed, so now
any md_config_t should be able to read a configuration file.
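The design point above, that parsing should write only into the instance it is called on, can be shown with a small Python sketch. The Config class and its parse_string method are hypothetical stand-ins for md_config_t, and the standard configparser module stands in for the Ceph config parser.

```python
import configparser


class Config:
    """Hypothetical stand-in for md_config_t: all state lives on the instance."""

    def __init__(self):
        self.values = {}

    def parse_string(self, text):
        # No global object is referenced here, so any instance can be filled.
        parser = configparser.ConfigParser()
        parser.read_string(text)
        for section in parser.sections():
            for key, value in parser.items(section):
                self.values[(section, key)] = value
```

Two Config objects parsed from different inputs stay independent, which is what removing the explicit g_conf references makes possible.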
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Wed, 13 Apr 2011 03:57:11 +0000 (20:57 -0700)]
mds: fix resolve
This was broken by a01fba175b646f6 when an ambiguous import was changed
from CDIR_AUTH_UNKNOWN to <whoami,whoami> and disambiguate_imports wasn't
updated accordingly. The result was inconsistent results for subtree
ownership on different nodes.
This updates disambiguate_imports to match that EImportStart::replay
change.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 12 Apr 2011 22:32:17 +0000 (15:32 -0700)]
mds: fix _freeze_dir assert for refragment case
is_freezeable_dir() is true at freeze time, but it need not stay true for
the lifetime of the freeze. We may split later on and call _freeze_dir on
the new fragments, so this assertion isn't necessarily true then.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Since the object store is ultimately based on ext3, ext4, or btrfs, and
object names ultimately get translated into file names, we need to
impose a corresponding limit on the length of ceph object names.
Otherwise, the "writeback" thread in the FileStore gets ENAMETOOLONG,
and the transaction does not succeed, even though we journalled it.
Perhaps we will extend or eliminate MAX_CEPH_OBJECT_NAME_LEN at some
point by using prehashing or some other technique. Until then, we need
to be sure to check for this.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
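A check of this kind might look like the Python sketch below. The constant name comes from the message above, but its value here is a placeholder and not the limit Ceph uses. The function name is hypothetical, and measuring the length in UTF-8 bytes is an assumption.

```python
import errno

# Placeholder value for illustration only; not the limit Ceph actually uses.
MAX_CEPH_OBJECT_NAME_LEN = 100


def check_object_name(name):
    """Return 0 if the name fits, or -ENAMETOOLONG if it is too long."""
    # Assumption: the limit applies to the encoded byte length of the name.
    if len(name.encode("utf-8")) > MAX_CEPH_OBJECT_NAME_LEN:
        return -errno.ENAMETOOLONG
    return 0
```

Rejecting the name up front, before the operation is journaled, avoids the situation described above where the journal accepts a transaction that the FileStore later cannot apply.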