Greg Farnum [Tue, 5 Oct 2010 16:25:38 +0000 (09:25 -0700)]
client: Fix truncate_seq/truncate_length initialization.
Initializing to 0 was causing file_to_extents to get called on every inode
since the MDS initializes truncate_seq to 1 and truncate_length to -1.
This revealed itself as a crash on directory inodes, which have their
layouts zeroed since merging the file_layouts branch.
To make clearer, assert that anything being truncated is a file inode.
Previously we unconditionally encoded the standard layout, which
on a directory inode is meaningless. So, use that spot to fill
in the default dir layout, if it exists. Otherwise, zero-fill.
This lets us display default directory layouts without changing
the protocol, which is good.
Always throw exceptions by value rather than as pointers. Always catch
exceptions as const references to avoid unnecessary copying. This fixes a
few minor memory leaks and should simplify handling exceptions in the
future.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
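A minimal sketch of the throw-by-value, catch-by-const-reference convention
described above. The names parse_error, parse_rank and try_parse are
hypothetical, chosen only for illustration; they are not taken from the tree.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical exception type, for illustration only.
struct parse_error : public std::runtime_error {
  explicit parse_error(const std::string& what) : std::runtime_error(what) {}
};

// Hypothetical helper. Throws by value: nothing is allocated with new,
// so the catch site has nothing to delete and nothing can leak.
int parse_rank(const std::string& s) {
  if (s.empty())
    throw parse_error("empty rank");
  return std::stoi(s);
}

// Catches by const reference: the exception object is not copied or sliced.
std::string try_parse(const std::string& s) {
  try {
    return "rank " + std::to_string(parse_rank(s));
  } catch (const parse_error& e) {
    return std::string("error: ") + e.what();
  }
}
```

With `throw new parse_error(...)` the handler would have to catch a pointer
and remember to delete it, which is the kind of leak this change removes.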
Sage Weil [Fri, 1 Oct 2010 22:54:56 +0000 (15:54 -0700)]
mon: add 'mds fail N' command
Manually mark an mds rank as failed. The daemon should kill itself when
it finds out.
Note that this doesn't do any sanity checks, so it can also be used to
adjust state in an otherwise inconsistent mdsmap due to other bugs (one
where, say, an mds in up but has no info, or not up but not in the failed
set.)
Sage Weil [Fri, 1 Oct 2010 19:43:20 +0000 (12:43 -0700)]
mds: fix stray replica push on _rename_prepare_witness()
We need to push all parents of the straydn to the target. This changed
a while back with the mdsdir stuff but this bit of code wasn't updated.
Updated to mirror send_dentry_unlink().
This fixes a crash like:
mds/MDCache.cc: In function 'void MDCache::adjust_subtree_auth(CDir*, std::pair<int, int>, bool)':
mds/MDCache.cc:644: FAILED assert(root)
ceph version 0.22~rc (0e67718a365b42969e785f544ea3b4258bb2407f)
1: (MDCache::add_replica_dir(ceph::buffer::list::iterator&, CInode*, int, std::list<Context*, std::allocator<Context*> >&)+0x1c1) [0x536a91]
2: (MDCache::add_replica_stray(ceph::buffer::list&, int)+0xdb) [0x536fab]
3: (Server::handle_slave_rename_prep(MDRequest*)+0x1113) [0x4d5c33]
4: (Server::dispatch_slave_request(MDRequest*)+0x21b) [0x4de80b]
5: (Server::handle_slave_request(MMDSSlaveRequest*)+0x145) [0x4e1955]
6: (MDS::_dispatch(Message*)+0x2598) [0x49e038]
...
Sage Weil [Fri, 1 Oct 2010 19:32:59 +0000 (12:32 -0700)]
osd: revamp forgetting lost objects
The old forget lost objects rewrote history in the PG log, which is asking
for all kinds of trouble. Instead, add new log events to indicate that
an object is LOST (deleted) or LOST_REVERTed (reverted to an older
version).
The LOST_REVERT case means we may need to recover the old version from
another node and rewrite the version number. This isn't implemented yet;
for now we just assert.
Sage Weil [Fri, 1 Oct 2010 05:00:06 +0000 (22:00 -0700)]
osd: fix recovery_primary loop on local clone
When we take the clone branch, we update the missing map. This invalidates
our current iterator, which can cause badness. Instead, increment the
iterator near the top of the loop so we don't have to worry about it.
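A minimal sketch of the "advance the iterator first" pattern, assuming a
std::map stands in for the missing map. The type missing_map and the function
recover_clones are hypothetical and only illustrate the idea; the real
recovery loop is more involved.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the missing map: object id -> state.
typedef std::map<int, std::string> missing_map;

// Visits every entry, erasing those marked "clone".
// Returns the number of entries visited.
int recover_clones(missing_map& missing) {
  int visited = 0;
  missing_map::iterator p = missing.begin();
  while (p != missing.end()) {
    // Copy what we need and advance before touching the map. Erasing from
    // a std::map invalidates only iterators to the erased element, and p
    // no longer points at it.
    int oid = p->first;
    bool is_clone = (p->second == "clone");
    ++p;
    ++visited;
    if (is_clone)
      missing.erase(oid);
  }
  return visited;
}
```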
coll_t is now a string. META_COLL and TEMP_COLL are just constants now.
Now there is a constructor that takes pgid_t and snapid_t, rather than
factory methods. It's clear what that constructor does, so wrapping it
in factory methods should be unnecessary.
Bump coll_t serialization version to 3. Implement decoding for the old
versions.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Colin McCabe [Wed, 29 Sep 2010 02:00:28 +0000 (19:00 -0700)]
interval_set: hide data members
This change makes interval_set::m and interval_set::_size private data
members in interval_set, instead of public. This change also creates a
non-const iterator. Using this iterator, users can modify the length of
an interval. So now, all users can use the iterators rather than
interacting with the class internals directly.
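A simplified sketch of the idea, not the real interval_set: the map and the
cached size are private, and a non-const iterator offers a set_len() that
keeps the size consistent. The class tiny_interval_set and its member names
are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>

// Simplified stand-in for illustration; not the real interval_set.
class tiny_interval_set {
  std::map<int, int> m;  // start -> length (private)
  int _size;             // total covered length (private)

public:
  class iterator {
    std::map<int, int>::iterator it;
    tiny_interval_set* owner;
    friend class tiny_interval_set;
    iterator(std::map<int, int>::iterator i, tiny_interval_set* o)
      : it(i), owner(o) {}

  public:
    int get_start() const { return it->first; }
    int get_len() const { return it->second; }
    // Changes the interval length and keeps the cached size in sync,
    // which callers poking at the map directly could forget to do.
    void set_len(int len) {
      owner->_size += len - it->second;
      it->second = len;
    }
    iterator& operator++() { ++it; return *this; }
    bool operator!=(const iterator& o) const { return it != o.it; }
  };

  tiny_interval_set() : _size(0) {}

  // Assumes the new interval does not overlap or repeat an existing one.
  void insert(int start, int len) { m[start] = len; _size += len; }

  int size() const { return _size; }
  iterator begin() { return iterator(m.begin(), this); }
  iterator end() { return iterator(m.end(), this); }
};
```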
mon: Fix issue first addressed in 2c5a3d99aa3be5ce114072e84f73a0a6426e63fd.
We were properly falling out of the while loop when we reached end(), but
not checking for it in the following if-else. Now we do!
Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
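A minimal sketch of this class of bug, assuming a scan over a std::map. The
function classify_key is hypothetical and is not the monitor code; it only
shows why the if-else after the loop has to test for end() first.

```cpp
#include <cassert>
#include <map>

// Hypothetical example: scan for the first key >= want.
// Returns 1 on an exact match, 0 if only a later key exists,
// and -1 if the scan ran off the end of the map.
int classify_key(const std::map<int, int>& m, int want) {
  std::map<int, int>::const_iterator p = m.begin();
  while (p != m.end() && p->first < want)
    ++p;
  // The loop may have stopped because p == m.end(). Dereferencing p in
  // that case is undefined behavior, so test for it before the if-else.
  if (p == m.end())
    return -1;
  else if (p->first == want)
    return 1;
  else
    return 0;
}
```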
The setup-chroot.sh script is very handy for building the server in a
chroot environment. I thought I would share it here in case anyone else
finds it useful.
Sage Weil [Mon, 27 Sep 2010 15:31:34 +0000 (08:31 -0700)]
mds: don't block request on freezing if we're already auth_pinned.
If we are already auth_pinned, we're past the gates; don't stop on freezable.
This screws up xlock: the lock moves to PREXLOCK state, but the request
that would normally xlock it gets deferred because of a racing freezing
of the tree. Then the PREXLOCK gather kicks in and badness happens.
Sage Weil [Sat, 25 Sep 2010 03:10:08 +0000 (20:10 -0700)]
osd: add coll_t::is_pg() method
This makes the interface a bit more adaptable for a situation where it has
a simple string representation instead of the strict structure it has now.
Eventually this function can simply attempt a pg_t parse.
Sage Weil [Fri, 24 Sep 2010 18:43:37 +0000 (11:43 -0700)]
osd: make sparse data/clone push behave with partial object push
We can't error out if we don't get everything we want in one go now that
we support pushing objects in pieces. Remove this check entirely, since
we don't have a good error handling case anyway.
Sage Weil [Fri, 24 Sep 2010 16:40:40 +0000 (09:40 -0700)]
mds: defer cap release and update consistently when frozen
We need to preserve the order of processing of cap release and writeback
messages across handle_client_caps() and process_request_cap_release().
Use a helper with the appropriate condition, and defer the release
processing as needed.
Sage Weil [Fri, 24 Sep 2010 15:15:54 +0000 (08:15 -0700)]
mds: always mark parent scatterlock when marking dirty rstat
Note that this will let the parent nestlock 'dirty' state get out of
sync with the lock state, as the whole point of the dirty rstat lists is
that it can happen any time. It does, however, queue us up.
Sage Weil [Thu, 23 Sep 2010 23:12:21 +0000 (16:12 -0700)]
mds: scatter pin frozen tree on importer too
The importer also needs to scatter pin. This avoids scatterlock gather
races like so:
A: start exporting to B
A: freeze, scatter pin tree
C: initiate gather
A: delay reply to gather
B: reply to gather, do not include (non-auth) dirfrag
A,B: finish migration
A: reply to gather, do not include (now non-auth) dirfrag
C: gets no info about the dirfrag!
By pinning on the importer, we ensure that at least one MDS will respond
to the gather with auth dirfrag info.
Sage Weil [Thu, 23 Sep 2010 17:00:07 +0000 (10:00 -0700)]
mds: fix bounding frag rstat/fragstat update during import
Be careful about when we update bounding dirfrag info during an import. If
the lock is in a MIX state, we do NOT want to update, since the inode
auth doesn't know jack (unless they are also dirfrag auth, in which case
we'll find out when we unscatter anyway).
Sage Weil [Thu, 23 Sep 2010 04:10:18 +0000 (21:10 -0700)]
mds: do not scatter_writebehind on nudge if replicated
This can cause the inode rstat etc to become out of sync with dirfrag
accounted_rstat when the scatterlock is not in a gathered state: the
local values will get updated but those on other nodes will not, and the
inode will drift out of sync with the dirfrags.
Other callers to scatter_writebehind() are all in contexts where we have
_just_ gathered dirfrag state, or there is no remote dirfrag state to
gather.
Sage Weil [Wed, 22 Sep 2010 22:42:52 +0000 (15:42 -0700)]
mds: use scatter pins for migration instead of rd/wrlocks
This is simpler (for the migrator), and wrlocks allow scatter_writebehind,
which is a no-no for a frozen tree. By pinning the frozen dir's parent
inode, we prevent any scatter or unscatter operations from implicitly
updating metadata within the frozen root dirfrag.
Sage Weil [Wed, 22 Sep 2010 18:31:12 +0000 (11:31 -0700)]
mon: move election start reset to starting_election() helper
An election can start either because we call it, or because someone else
calls it. Either way, we need to reset our state, so move that code into
the starting_election() callback, which is called by the elector's
start()/call_election() anyway.
This hopefully fixes a case where we see a timeout expire on the monitor
and fail the assertion