Greg Farnum [Mon, 22 Nov 2010 16:50:32 +0000 (08:50 -0800)]
client: only encode_cap_releases once per request.
Accomplish this by making a list of cap releases in the (permanent)
MetaRequest, and then copying that into the (potentially-temporary)
MClientRequest.
Sage Weil [Fri, 12 Nov 2010 23:56:54 +0000 (15:56 -0800)]
msgr: do not clear halt_delivery
We need to keep the halt_delivery plug set on failure/shutdown in order to
prevent a racing reader from queuing new messages. The only time we clear
it is when we discard messages due to a session reset.
Sage Weil [Fri, 12 Nov 2010 21:09:24 +0000 (13:09 -0800)]
msgr: only close socket on reconnect or shutdown
We can't modify 'sd' or (more importantly) close sd while any other thread
might be using it, or else we might race with an open and they might end
up using someone else's fd.
Take care to _only_ close(sd) in connect(), when the reader thread is
stopped, or when reaping the connection.
Sage Weil [Fri, 12 Nov 2010 21:41:14 +0000 (13:41 -0800)]
msgr: protect pipe queuing with _both_ pipe and dispatch_queue locks
We want to make sure the pipe's queue item doesn't go away.
Also, make queue_received() require pipe_lock to be held. This avoids some
useless unlocking/locking, since (in the case where the pipe is already
queued) we then don't need to drop the pipe_lock at all.
Sage Weil [Fri, 12 Nov 2010 15:55:41 +0000 (07:55 -0800)]
uclient: insert lssnap results under snapdir, not live dir
Put the readdir results (list of snapshots) in the right place in the
hierarchy; we were putting them in the parent dir (as if they were real
directories).
This bug manifested itself as a snaptest-2.sh failure.
Sage Weil [Thu, 11 Nov 2010 04:58:49 +0000 (20:58 -0800)]
mds: fix null_snapflush with multiple intervening snaps
The client is allowed to not send a snapflush if there is no dirty metadata
to write for a given snap. However, the mds can only look up inodes by
the last snapid in the interval. So, when doing a null_snapflush (filling
in for snapflushes the client didn't send), we have to walk forward through
intervening snaps until we find the right inode.
Note that this means we will call _do_snap_update multiple times on the
same inode, but with different snapids.
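A minimal sketch of the forward walk described above. The types and the by-last-snapid index are hypothetical stand-ins, not the actual MDS cache structures; the only property borrowed from the commit message is that an inode can be found only by the last snapid of the interval it covers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;

// Hypothetical stand-in for a snapped inode covering snaps [first, last].
struct SnappedInode {
  snapid_t first;
  snapid_t last;
};

// Hypothetical index: lookups are exact matches on the *last* snapid only.
using InodeIndex = std::map<snapid_t, SnappedInode>;

// Find the inode covering `target` by trying each snap >= target in order,
// since the interval's last snapid is not known up front.
const SnappedInode* find_by_walking_forward(const InodeIndex& by_last,
                                            const std::vector<snapid_t>& snaps,
                                            snapid_t target) {
  assert(std::is_sorted(snaps.begin(), snaps.end()));
  for (snapid_t s : snaps) {
    if (s < target)
      continue;                       // earlier snaps cannot cover target
    auto it = by_last.find(s);        // exact match on last snapid
    if (it != by_last.end() && it->second.first <= target)
      return &it->second;             // interval [first, last] contains target
  }
  return nullptr;                     // no inode covers this snap
}
```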
Sage Weil [Wed, 10 Nov 2010 17:03:37 +0000 (09:03 -0800)]
objecter: throttle before looking at lock protected state
The take_op_budget() may drop our lock if we are in keep_balanced_budget
mode, so we need to do that _before_ we take references to internal state
that may change out from under us during that time.
This fixes a crash like
./osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_inst(int)':
./osd/OSDMap.h:460: FAILED assert(exists(osd) && is_up(osd))
ceph version 0.22.1 (commit:c6f403a6f441184956e00659ce713eaee7014279)
1: (Objecter::op_submit(Objecter::Op*)+0x6c2) [0x38658854c2]
2: /usr/lib64/librados.so.1() [0x3865855dc9]
3: (RadosClient::aio_write(RadosClient::PoolCtx&, object_t, long,
ceph::buffer::list const&, unsigned long,
RadosClient::AioCompletion*)+0x24b) [0x386585724b]
4: (rados_aio_write()+0x9a) [0x386585741a]
5: /usr/bin/qemu-kvm() [0x45a305]
6: /usr/bin/qemu-kvm() [0x45a430]
7: /usr/bin/qemu-kvm() [0x43bb73]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (ABRT) ***
ceph version 0.22.1 (commit:c6f403a6f441184956e00659ce713eaee7014279)
1: (sigabrt_handler(int)+0x91) [0x3865922b91]
2: /lib64/libc.so.6() [0x3c0c032a30]
3: (gsignal()+0x35) [0x3c0c0329b5]
4: (abort()+0x175) [0x3c0c034195]
5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3c110beaad]
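The ordering issue fixed by this commit can be sketched as follows. This is an illustration only: MiniObjecter, take_budget(), and the while_unlocked hook are hypothetical names, not the Objecter API. The point is that any call that may drop the lock has to come before reading lock-protected state.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>

// Hypothetical miniature of a client with lock-protected state.
struct MiniObjecter {
  std::mutex lock;
  std::map<int, bool> osd_up;            // protected by `lock`
  std::function<void()> while_unlocked;  // hook that runs while the lock is dropped

  // May drop and retake the lock, e.g. while waiting for budget.
  void take_budget(std::unique_lock<std::mutex>& l) {
    l.unlock();
    if (while_unlocked)
      while_unlocked();                  // protected state may change here
    l.lock();
  }

  // Returns true if `osd` is up at the time the op is actually submitted.
  bool submit(int osd) {
    std::unique_lock<std::mutex> l(lock);
    take_budget(l);                      // throttle first...
    assert(l.owns_lock());
    auto it = osd_up.find(osd);          // ...then read protected state
    return it != osd_up.end() && it->second;
  }
};
```

Reading osd_up before take_budget() and acting on it afterwards would use a stale view of the map, which is the kind of mismatch the assert in the trace above caught.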
Sage Weil [Tue, 9 Nov 2010 17:55:14 +0000 (09:55 -0800)]
mds: fix inode freeze auth pin allowance
When we're renaming across nodes, we need to freeze the inode. This
requires that we allow for the auth_pins that _we_ hold, which include
one because of the linklock xlock, and one by the MDRequest.
Sage Weil [Sat, 6 Nov 2010 18:35:54 +0000 (11:35 -0700)]
mds: remove MIX_STALE
Yay, we don't need it!
If we can't update the frag on scatter, fine. The staleness of the frag
is implicit in the frag's scatter stat version not matching the inode's.
If/when we do want to update it, the frag will clearly be writable, and
we can bring it back in sync then.
Sage Weil [Sun, 7 Nov 2010 03:17:32 +0000 (20:17 -0700)]
mds: don't use helper for rename srcdn
The rdlock_path_xlock_dentry helper works for _auth_ dentries that we
create locally in an auth dirfrag. For the srcdn, we need to discover an
_existing_ dentry that is not necessarily auth.
Call path_traverse ourselves, but be careful to take the appropriate locks
on the resulting dn, dir, and ancestors.
Sage Weil [Sat, 6 Nov 2010 18:02:13 +0000 (11:02 -0700)]
mds: never complete a gather on a flushing lock
The scatter_writebehind() takes a wrlock, but that may still allow the lock
to complete a gather to LOCK and even move to, say, MIX before the data is
committed. Bad news!
Sage Weil [Sat, 6 Nov 2010 04:52:28 +0000 (21:52 -0700)]
mds: preserve stale state on import; some cleanup
Our new invariant is that MIX_STALE always implies is_stale(). And on
import, if is_stale(), MIX becomes MIX_STALE. This ensures that a replica
that we put into MIX_STALE doesn't turn back into MIX if we import it
and take the auth's state in CInode::decode_import().
Previously I changed the std::multimap decoder to minimize the number of
constructor invocations. However, it could be much more expensive to
copy an initialized (decoded) val_t than to copy an empty one. For
example, if we are decoding std::multimap < int, std::set <int> >. So
change the code to insert a non-decoded val_t again.
However, this still saves two constructor invocations over the original.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
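A minimal sketch of the "insert an empty value, then decode in place" approach described above. The Reader type and the decode() overloads are hypothetical and use a toy byte layout; they are not Ceph's bufferlist or encoding.h API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical byte-stream reader (host byte order, toy format).
struct Reader {
  const std::vector<uint8_t>& buf;
  std::size_t off;
  explicit Reader(const std::vector<uint8_t>& b) : buf(b), off(0) {}
  uint32_t u32() {
    assert(off + 4 <= buf.size());
    uint32_t v;
    std::memcpy(&v, buf.data() + off, 4);
    off += 4;
    return v;
  }
};

// Test helper: append a u32 in the same toy format.
void put_u32(std::vector<uint8_t>& out, uint32_t v) {
  uint8_t b[4];
  std::memcpy(b, &v, 4);
  out.insert(out.end(), b, b + 4);
}

void decode(int& v, Reader& r) { v = static_cast<int>(r.u32()); }

void decode(std::set<int>& s, Reader& r) {
  s.clear();
  uint32_t n = r.u32();
  for (uint32_t i = 0; i < n; ++i) {
    int v;
    decode(v, r);
    s.insert(v);
  }
}

template <class K, class V>
void decode(std::multimap<K, V>& m, Reader& r) {
  m.clear();
  uint32_t n = r.u32();
  for (uint32_t i = 0; i < n; ++i) {
    K k;
    decode(k, r);
    // Insert a cheap, empty value and decode into the stored element,
    // so a fully populated V is never copied into the map.
    auto it = m.insert(std::make_pair(k, V()));
    decode(it->second, r);
  }
}
```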
Sage Weil [Fri, 5 Nov 2010 06:15:06 +0000 (23:15 -0700)]
mds: do not bump scatter stat lock in predirty_journal_parents
If we're in the MIX state, we clearly can't touch this without screwing up
the delicate scatter/gather behavior. If we're in, say, LOCK, there is
still no reason to update it. One frag at least is local and auth if we
are in this code, but there may be other frags on other nodes. This would
just make them appear stale when they are not.
Sage Weil [Fri, 5 Nov 2010 05:48:09 +0000 (22:48 -0700)]
mds: mark scatterlock stale on import of stale frag scatter stat
When the lock scattered, if we didn't have an auth frag that was frozen,
we go into MIX state. Later, we may import a stale dirfrag. We need to
move to MIX_STALE at that point, and/or mark the lock stale so that any
subsequent transition does so.
Sage Weil [Fri, 5 Nov 2010 05:44:01 +0000 (22:44 -0700)]
mds: match bottom half of assimilate_dirty_rstat_inodes with a dir flag
We only do the assimilate_dirty_rstat_inodes if we do an update AND the
frag rstat was non-stale, but the bottom half (_finish) can't tell whether
we did it, because the top half updates the fragstat version. Use a flag
to indicate we've updated the dirfrag so the bottom half will only run
when needed.
Sage Weil [Fri, 5 Nov 2010 05:19:53 +0000 (22:19 -0700)]
mds: fix inode version used for inest in decode_lock_state
We need to pass the inode rstat's version into finish_scatter_update, not
the shadowed local variable. Otherwise we don't update the dirfrag when
we should.
Sage Weil [Thu, 4 Nov 2010 05:22:54 +0000 (22:22 -0700)]
mds: wait for last_failure_osd_epoch before starting journal replay
This is extremely important, and it forces the MDS to get the osdmap that
includes the blacklist entry for its predecessor. This in turn means that
any OSD we contact trying to read the journal will be forced to get that
osdmap (or newer) before handling our read request, which means that
anything we read cannot be overwritten by a racing request from our
predecessor. This prevents two MDSs writing to the journal at the same
time.
This change fixes potential (and observed!) journal corruption.
Sage Weil [Wed, 3 Nov 2010 20:08:06 +0000 (13:08 -0700)]
mds: use helper for scatter dirfrag update; use on local dirfrags
Any time we scatter is an opportunity to update the dirfrag with the
accounted scatter stat if it is out of date. We should use that
opportunity even when the dirfrag is on the same node as the inode (i.e.,
not just through decode_lock_state).
Fix a compiler warning about an uninitialized variable. Basically, we
used to insert uninitialized values into a std::multimap and then fix
them later. Rather than doing that, just insert the value we want
directly into the map.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
TestEncoding: add a templated encode-then-decode fn that can be used to
test encoding followed by decoding of any type. Test encoding and
decoding of a std::multimap.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
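A templated encode-then-decode check of this kind might look like the sketch below. It assumes a toy buffer of 32-bit words and hypothetical encode()/decode() overloads; it is not the actual TestEncoding code or Ceph's encoding functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Buf = std::vector<uint32_t>;  // toy buffer of 32-bit words

void encode(int v, Buf& b) { b.push_back(static_cast<uint32_t>(v)); }

void decode(int& v, const Buf& b, std::size_t& off) {
  assert(off < b.size());
  v = static_cast<int>(b[off++]);
}

template <class K, class V>
void encode(const std::multimap<K, V>& m, Buf& b) {
  b.push_back(static_cast<uint32_t>(m.size()));
  for (const auto& kv : m) {
    encode(kv.first, b);
    encode(kv.second, b);
  }
}

template <class K, class V>
void decode(std::multimap<K, V>& m, const Buf& b, std::size_t& off) {
  m.clear();
  assert(off < b.size());
  uint32_t n = b[off++];
  for (uint32_t i = 0; i < n; ++i) {
    K k;
    V v;
    decode(k, b, off);
    decode(v, b, off);
    m.insert(std::make_pair(k, v));
  }
}

// Encode `in`, decode it into a fresh object, and report whether the
// result compares equal and the whole buffer was consumed.
template <class T>
bool round_trips(const T& in) {
  Buf b;
  encode(in, b);
  T out;
  std::size_t off = 0;
  decode(out, b, off);
  return in == out && off == b.size();
}
```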