Sage Weil [Fri, 15 Oct 2010 14:59:29 +0000 (07:59 -0700)]
mds: take nestlock wrlock when projecting rstat into dirfrag
We were already checking that we _can_ wrlock before doing the rstat
projection (if we can't, we mark_dirty_rstat() on the inode), but we
weren't actually taking the wrlock to prevent lock state changes while
that happened.
This bug eventually manifested itself as a failed assertion at the
now familiar
mds/CInode.cc: In function 'virtual void CInode::decode_lock_state(int, ceph::bufferlist&)':
mds/CInode.cc:1364: FAILED assert(pf->rstat == rstat)
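A minimal sketch of the pattern described above: check can_wrlock(), then actually hold the wrlock while the projection runs. ToyLock, ToyInode and project_rstat() are hypothetical stand-ins, not the real MDS lock or inode classes.

```cpp
#include <cassert>

// Hypothetical stand-in for a scatter lock; not the real Ceph lock API.
struct ToyLock {
  bool stable = true;   // whether a wrlock may be taken right now
  int num_wrlocks = 0;  // outstanding wrlocks pin the lock state

  bool can_wrlock() const { return stable; }
  void get_wrlock() { assert(can_wrlock()); ++num_wrlocks; }
  void put_wrlock() { assert(num_wrlocks > 0); --num_wrlocks; }

  // A state change is refused while any wrlock is held.
  bool try_change_state() {
    if (num_wrlocks > 0) return false;
    stable = !stable;
    return true;
  }
};

// Hypothetical inode carrying a nestlock and a single rstat counter.
struct ToyInode {
  ToyLock nestlock;
  long rstat = 0;
  bool dirty_rstat = false;
};

// Project a delta into the rstat while holding the wrlock.
// Returns true if projected, false if deferred via the dirty flag.
bool project_rstat(ToyInode& in, long delta) {
  if (!in.nestlock.can_wrlock()) {
    in.dirty_rstat = true;  // analogous to mark_dirty_rstat()
    return false;
  }
  in.nestlock.get_wrlock();  // the step that was missing
  in.rstat += delta;
  in.nestlock.put_wrlock();
  return true;
}
```

The point of the sketch is that try_change_state() fails for as long as the wrlock is held, so the lock cannot cycle in the middle of the projection.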
Greg Farnum [Fri, 15 Oct 2010 18:21:02 +0000 (11:21 -0700)]
messenger: introduce timeouts on pipes.
This will return read errors on a pipe if it gets no data
for the given period of time (default 15 minutes). In a stateful
session the Connection will hang around and the next write will
initiate standard reconnect, so things keep working but we don't
rack up hundreds of useless threads!
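One way to get a read error after an idle period is to poll() the descriptor with a timeout before reading. This is only a sketch under that assumption; read_with_timeout() is a hypothetical helper, not the messenger's actual code, and the timeout value is left to the caller.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: read up to len bytes from fd, returning -ETIMEDOUT
// if no data arrives within timeout_ms. Other failures return -errno.
ssize_t read_with_timeout(int fd, void* buf, size_t len, int timeout_ms) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  int r;
  do {
    // Note: a retry after EINTR restarts the full timeout in this sketch.
    r = poll(&pfd, 1, timeout_ms);
  } while (r < 0 && errno == EINTR);

  if (r < 0) return -errno;
  if (r == 0) return -ETIMEDOUT;  // idle for the whole period

  ssize_t n = read(fd, buf, len);
  return n < 0 ? -errno : n;
}
```

A caller would treat -ETIMEDOUT like any other read error and tear down the idle pipe, leaving reconnect to the next write.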
Sage Weil [Thu, 14 Oct 2010 22:07:16 +0000 (15:07 -0700)]
mds: fix can_scatter_pin() to be only SYNC and MIX
Those are the only states where the replica can effectively prevent the
lock from cycling in a way that would force a frozen dirfrag beneath
the scatterpinned inode to update/journal something
(accounted_fragstat/rstat).
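The resulting predicate can be sketched as below. ToyLockState is a hypothetical enum with only a few states; the real lock state set is larger.

```cpp
#include <cassert>

// Hypothetical subset of lock states, for illustration only.
enum class ToyLockState { SYNC, MIX, LOCK, EXCL };

// Scatter-pinning is allowed only in SYNC and MIX.
bool can_scatter_pin(ToyLockState s) {
  return s == ToyLockState::SYNC || s == ToyLockState::MIX;
}
```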
Greg Farnum [Mon, 11 Oct 2010 16:40:29 +0000 (09:40 -0700)]
mds: Fix projection in rename code paths.
We aren't actually projecting the inode unless destdn->is_auth(),
so check for that before projecting the snaprealm (which requires
a projected inode)!
Then on rename_apply, open the snaprealm on non-auth MDSes.
This was causing a mis-match in the projection code, since
assimilate_...finish() calls pop_and_dirty_projected_inode(), but
the first half is only called on CEPH_LOCK_INEST locks. So make them match!
Greg Farnum [Thu, 7 Oct 2010 20:04:26 +0000 (13:04 -0700)]
mds: MDCache should adjust_nested_anchors once the op's been logged.
Fixes crashes from assert(nested_anchors >= 0) failures
when updating at the wrong point.
Sage Weil [Tue, 12 Oct 2010 04:25:17 +0000 (21:25 -0700)]
mds: avoid EXCL if mds_caps_wanted in _do_cap_update
The file_excl() trigger asserts that mds_caps_wanted is empty, so the caller
shouldn't call it when mds_caps_wanted is non-empty. In that case, just go to
LOCK instead.
All we're doing is picking a state to move to that will allow us to
update max_size.
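The choice of target state can be sketched as follows. ToyTarget and pick_state_for_max_size_update() are hypothetical names, and the map type (MDS rank to wanted caps bits) is an assumption made for illustration.

```cpp
#include <cassert>
#include <map>

// Hypothetical target states for the file lock.
enum class ToyTarget { EXCL, LOCK };

// Pick a state that allows updating max_size. EXCL is only valid when no
// other MDS wants caps; otherwise fall back to LOCK.
ToyTarget pick_state_for_max_size_update(
    const std::map<int, int>& mds_caps_wanted) {
  return mds_caps_wanted.empty() ? ToyTarget::EXCL : ToyTarget::LOCK;
}
```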
Sage Weil [Tue, 12 Oct 2010 03:51:19 +0000 (20:51 -0700)]
mds: bump rstat version in predirty_journal_parents
When we propagate the rstat to inode in predirty_journal_parents (because
we hold the nestlock), bump the rstat version as well. This avoids
confusing any replicas, who expect the rstat to have a new version or to
remain unchanged when the lock scatters.
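A small sketch of the invariant the replicas rely on: either the version moved forward, or nothing changed at all. ToyRstat and both functions are hypothetical; the real rstat has more fields.

```cpp
#include <cassert>

// Hypothetical, trimmed-down rstat with a version counter.
struct ToyRstat {
  long rbytes = 0;
  long rfiles = 0;
  unsigned long version = 0;
};

// Propagate a delta into the inode's rstat and bump the version with it.
void propagate_rstat(ToyRstat& inode_rstat, long dbytes, long dfiles) {
  inode_rstat.rbytes += dbytes;
  inode_rstat.rfiles += dfiles;
  ++inode_rstat.version;  // the bump this commit adds
}

// What a replica expects when the lock scatters: a newer version, or an
// identical rstat at the same version.
bool replica_view_consistent(const ToyRstat& before, const ToyRstat& after) {
  bool same = before.rbytes == after.rbytes && before.rfiles == after.rfiles;
  return after.version > before.version ||
         (after.version == before.version && same);
}
```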
Sage Weil [Fri, 8 Oct 2010 17:45:51 +0000 (10:45 -0700)]
filestore: don't start commit if nothing new is _applied_
We were starting a commit if we had started a new op, but that left a
window in which the op was still being journaled and nothing new had been
applied to disk. With this fix we only commit if committing/committed
will increase. Now the check matches the
Sage Weil [Thu, 7 Oct 2010 23:15:45 +0000 (16:15 -0700)]
osd: loosen caller_ops asserts
The problem is that merge_log adds new items to the log before it unindexes
divergent items, and that behavior is needed by the current implementation
of merge_old_entry(). Since the divergent items may be the same requests
(and frequently are) these asserts need to be loosened up.
Now, the most recent addition "wins," and we only remove the entry in
unindex() if it points to us.
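The "most recent addition wins" rule can be sketched with a map from request id to log entry. ToyEntry and ToyLogIndex are hypothetical types that borrow the caller_ops name from the text; they are not the real PG log structures.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical log entry identified by the request that produced it.
struct ToyEntry {
  std::string reqid;
  int version;
};

// Hypothetical index from request id to the entry that currently owns it.
struct ToyLogIndex {
  std::unordered_map<std::string, const ToyEntry*> caller_ops;

  // No assert that the slot is empty: the most recent addition wins.
  void index(const ToyEntry& e) { caller_ops[e.reqid] = &e; }

  // Only remove the slot if it still points at this entry.
  void unindex(const ToyEntry& e) {
    auto it = caller_ops.find(e.reqid);
    if (it != caller_ops.end() && it->second == &e)
      caller_ops.erase(it);
  }
};
```

With this shape, indexing the new entry before unindexing the divergent one for the same request leaves the new entry in place.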
Sage Weil [Thu, 7 Oct 2010 23:09:25 +0000 (16:09 -0700)]
osd: move to boot state if down OR wrong address in map
Saw an OSD that was up in the map, but the address didn't match. Caused
all kinds of strange behavior. I'm not sure what I had in mind when the
original test only checked for down AND same address before moving to boot
state, since having the wrong address is clearly bad news.
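The corrected condition, sketched with hypothetical types (ToyMapEntry is not the real OSDMap interface, and addresses are plain strings here):

```cpp
#include <cassert>
#include <string>

// Hypothetical view of this OSD's entry in the map.
struct ToyMapEntry {
  bool up;
  std::string addr;
};

// Move to boot state if the map says we are down OR lists a different address.
bool should_boot(const ToyMapEntry& e, const std::string& my_addr) {
  return !e.up || e.addr != my_addr;
}
```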
Sage Weil [Thu, 7 Oct 2010 14:52:02 +0000 (07:52 -0700)]
debug: always append to log
We were truncating if we were in log_per_instance mode. But normally those
logs don't exist. And if they do, we probably don't want to truncate
them. This is particularly true if we respawn ourselves (e.g. after being
marked down) and restart with the same pid.
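In standard C++ terms, "always append" means opening the file with std::ios::app rather than letting it truncate. A sketch with hypothetical helper names, not the actual debug code:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Append one line to the log; std::ios::app preserves existing content.
void append_log_line(const std::string& path, const std::string& line) {
  std::ofstream out(path, std::ios::out | std::ios::app);
  out << line << '\n';
}

// Count the lines currently in the file (0 if it cannot be opened).
std::size_t count_lines(const std::string& path) {
  std::ifstream in(path);
  std::size_t n = 0;
  std::string tmp;
  while (std::getline(in, tmp)) ++n;
  return n;
}
```

A process that respawns with the same pid and reopens the same path keeps whatever was logged before the restart.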
Greg Farnum [Wed, 6 Oct 2010 23:35:14 +0000 (16:35 -0700)]
mds: Check the lock state, not the inode state!
This was causing a lot of slowdowns.
Additionally, pin the inode when exporting caps -- otherwise it could
disappear out from under a cap ack. This was probably just exposed
by fixing the lock check.
Sage Weil [Tue, 7 Sep 2010 17:01:58 +0000 (10:01 -0700)]
osd: log error instead of crashing on failed pull attempt
If peering screws up and the primary mistakenly tries to pull an object
that we don't have, log an error instead of crashing. This will still
throw off recovery (it will hang), but that's better than crashing
outright.
Sage Weil [Fri, 24 Sep 2010 18:43:37 +0000 (11:43 -0700)]
osd: make sparse data/clone push behave with partial object push
We can't error out if we don't get everything we want in one go now that
we support pushing objects in pieces. Remove this check entirely, since
we don't have a good error handling case anyway.
Sage Weil [Tue, 5 Oct 2010 22:41:40 +0000 (15:41 -0700)]
osd: cancel deletion on pg change
If the primary changes, cancel deletion so that the new primary can
consider whether it needs anything we have. Previously we were only
canceling if our role changed, but that makes little sense.
Greg Farnum [Tue, 5 Oct 2010 16:25:38 +0000 (09:25 -0700)]
client: Fix truncate_seq/truncate_length initialization.
Initializing to 0 was causing file_to_extents to get called on every inode
since the MDS initializes truncate_seq to 1 and truncate_length to -1.
This revealed itself as a crash on directory inodes, which have their
layouts zeroed since merging the file_layouts branch.
To make this clearer, assert that anything being truncated is a file inode.
Previously we unconditionally encoded the standard layout, which
on a directory inode is meaningless. So, use that spot to fill
in the default dir layout, if it exists. Otherwise, zero-fill.
This lets us display default directory layouts without changing
the protocol, which is good.
Always throw exceptions by value rather than as pointers. Always catch
exceptions as const references to avoid unnecessary copying. This fixes a
few minor memory leaks and should simplify handling exceptions in the
future.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
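A short illustration of the convention, using only standard library exceptions. parse_positive() and describe() are hypothetical functions invented for this example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Throws by value: no heap allocation, so nothing for the catcher to delete.
int parse_positive(const std::string& s) {
  int v = std::stoi(s);  // may itself throw std::invalid_argument
  if (v <= 0)
    throw std::domain_error("value must be positive");
  return v;
}

// Catches by const reference: no copy and no slicing of derived types.
std::string describe(const std::string& s) {
  try {
    return "ok: " + std::to_string(parse_positive(s));
  } catch (const std::exception& e) {
    return std::string("error: ") + e.what();
  }
}
```

Throwing `new SomeError(...)` instead would leave the catch site responsible for deleting the pointer, which is the kind of leak this change removes.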
Sage Weil [Fri, 1 Oct 2010 22:54:56 +0000 (15:54 -0700)]
mon: add 'mds fail N' command
Manually mark an mds rank as failed. The daemon should kill itself when
it finds out.
Note that this doesn't do any sanity checks, so it can also be used to
adjust state in an otherwise inconsistent mdsmap due to other bugs (one
where, say, an mds is up but has no info, or is not up but is also not in
the failed set.)
Sage Weil [Fri, 1 Oct 2010 19:43:20 +0000 (12:43 -0700)]
mds: fix stray replica push on _rename_prepare_witness()
We need to push all parents of the straydn to the target. This changed
a while back with the mdsdir stuff but this bit of code wasn't updated.
Updated to mirror send_dentry_unlink().
This fixes a crash like:
mds/MDCache.cc: In function 'void MDCache::adjust_subtree_auth(CDir*, std::pair<int, int>, bool)':
mds/MDCache.cc:644: FAILED assert(root)
ceph version 0.22~rc (0e67718a365b42969e785f544ea3b4258bb2407f)
1: (MDCache::add_replica_dir(ceph::buffer::list::iterator&, CInode*, int, std::list<Context*, std::allocator<Context*> >&)+0x1c1) [0x536a91]
2: (MDCache::add_replica_stray(ceph::buffer::list&, int)+0xdb) [0x536fab]
3: (Server::handle_slave_rename_prep(MDRequest*)+0x1113) [0x4d5c33]
4: (Server::dispatch_slave_request(MDRequest*)+0x21b) [0x4de80b]
5: (Server::handle_slave_request(MMDSSlaveRequest*)+0x145) [0x4e1955]
6: (MDS::_dispatch(Message*)+0x2598) [0x49e038]
...
Sage Weil [Fri, 1 Oct 2010 19:32:59 +0000 (12:32 -0700)]
osd: revamp forgetting lost objects
The old code for forgetting lost objects rewrote history in the PG log, which
is asking for all kinds of trouble. Instead, add new log events to indicate that
an object is LOST (deleted) or LOST_REVERTed (reverted to an older
version).
The LOST_REVERT case means we may need to recover the old version from
another node and rewrite the version number. This isn't implemented yet;
for now we just assert.
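To illustrate why appending events is safer than rewriting history, here is a toy replay over an append-only log. Everything in it is hypothetical: the types, the revert_to field, and in particular the LOST_REVERT handling, which only models the intended end state and skips the recovery from another node that the real code would need.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical event kinds; LOST and LOST_REVERT are the names from the text.
enum class ToyOp { MODIFY, REMOVE, LOST, LOST_REVERT };

// Hypothetical log entry. revert_to is only meaningful for LOST_REVERT.
struct ToyLogEntry {
  ToyOp op;
  std::string oid;
  int version;
  int revert_to;
};

// Replay the log, oldest first, into an object -> current version map.
// Earlier entries are never edited; later events override their effect.
std::map<std::string, int> replay(const std::vector<ToyLogEntry>& log) {
  std::map<std::string, int> objects;
  for (const auto& e : log) {
    switch (e.op) {
    case ToyOp::MODIFY:
      objects[e.oid] = e.version;
      break;
    case ToyOp::REMOVE:
    case ToyOp::LOST:  // a lost object is treated as deleted
      objects.erase(e.oid);
      break;
    case ToyOp::LOST_REVERT:  // assumption: fall back to an older version
      objects[e.oid] = e.revert_to;
      break;
    }
  }
  return objects;
}
```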