Sage Weil [Thu, 11 Nov 2010 04:58:49 +0000 (20:58 -0800)]
mds: fix null_snapflush with multiple intervening snaps
The client is allowed to not send a snapflush if there is no dirty metadata
to write for a given snap. However, the mds can only look up inodes by
the last snapid in the interval. So, when doing a null_snapflush (filling
in for snapflushes the client didn't send), we have to walk forward through
intervening snaps until we find the right inode.
Note that this means we will call _do_snap_update multiple times on the
same inode, but with different snapids.
Sage Weil [Wed, 10 Nov 2010 23:33:31 +0000 (15:33 -0800)]
osd: call sched_scrub on reserve reply
Otherwise we have to wait until the next time the timer calls it. During
that window we hold a reservation locally, no other peer can reserve a
scrub from us, and nobody makes any progress.
PG::search_for_missing: when we find a previously unfound object, check
to see if there is an entry in waiting_for_missing_object representing a
client waiting for this object.
PG::repair_object: assert that waiting_for_missing_object is empty
before messing with missing_loc. It definitely should be during a scrub.
ReplicatedPG role change logic: always take_object_waiters on the wait
queues when the PG acting set changes.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
OSD::_process_pg_info: sometimes call search_for_missing
OSD::_process_pg_info: If we're the primary for this active PG, and we
have missing objects, call search_for_missing. This should ensure that
we know where to find our missing objects.
The reason why this wasn't there before is that previously, we kept the
PG from going active until all the missing objects were found.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Erase the code in PG::peer that used to keep us from becoming active
when objects were still unfound. Print out the number of missing and
unfound objects at the end of PG::peer.
Erase PG::check_for_lost_objects and PG::forget_lost_objects.
Sage Weil [Wed, 10 Nov 2010 17:03:37 +0000 (09:03 -0800)]
objecter: throttle before looking at lock protected state
The take_op_budget() may drop our lock if we are in keep_balanced_budget
mode, so we need to do that _before_ we take references to internal state
that may change out from under us during that time.
This fixes a crash like
./osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_inst(int)':
./osd/OSDMap.h:460: FAILED assert(exists(osd) && is_up(osd))
ceph version 0.22.1 (commit:c6f403a6f441184956e00659ce713eaee7014279)
1: (Objecter::op_submit(Objecter::Op*)+0x6c2) [0x38658854c2]
2: /usr/lib64/librados.so.1() [0x3865855dc9]
3: (RadosClient::aio_write(RadosClient::PoolCtx&, object_t, long,
ceph::buffer::list const&, unsigned long,
RadosClient::AioCompletion*)+0x24b) [0x386585724b]
4: (rados_aio_write()+0x9a) [0x386585741a]
5: /usr/bin/qemu-kvm() [0x45a305]
6: /usr/bin/qemu-kvm() [0x45a430]
7: /usr/bin/qemu-kvm() [0x43bb73]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (ABRT) ***
ceph version 0.22.1 (commit:c6f403a6f441184956e00659ce713eaee7014279)
1: (sigabrt_handler(int)+0x91) [0x3865922b91]
2: /lib64/libc.so.6() [0x3c0c032a30]
3: (gsignal()+0x35) [0x3c0c0329b5]
4: (abort()+0x175) [0x3c0c034195]
5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3c110beaad]
We need to ensure that buckets are output after their dependencies. The
best way to do this is a depth-first traversal of the bucket directed
acyclic graph. The previous solution was incorrect because in some
cases it didn't traverse the graph in the right order.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
All the callers of CrushWrapper::get_bucket() check for error codes, but
not for NULL returns. So if there is no bucket (i.e., a NULL pointer) at
crush->bucket[i], just return the error code ENOENT. This is consistent
with how we handle other out-of-bounds requests.
Also, don't allow the caller to get us to try to access negative indices
in crush->bucket.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Tue, 9 Nov 2010 17:55:14 +0000 (09:55 -0800)]
mds: fix inode freeze auth pin allowance
When we're renaming across nodes, we need to freeze the inode. This
requires that we allow for the auth_pins that _we_ hold, which include
one because of the linklock xlock, and one by the MDRequest.
In crushtool, dump buckets in tree order. Buckets which reference other
buckets must be dumped after their dependencies, or else re-compilation
will fail.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Sat, 6 Nov 2010 18:35:54 +0000 (11:35 -0700)]
mds: remove MIX_STALE
Yay, we don't need it!
If we can't update the frag on scatter, fine. The staleness of the frag
is implicit in the frag's scatter stat version not matching the inode's.
If/when we do want to update it, the frag will clearly be writable, and
we can bring it back in sync then.
Sage Weil [Sun, 7 Nov 2010 03:17:32 +0000 (20:17 -0700)]
mds: don't use helper for rename srcdn
The rdlock_path_xlock_dentry helper works for _auth_ dentries that we
create locally in an auth dirfrag. For the srcdn, we need to discover an
_existing_ dentry that is not necessarily auth.
Call path_traverse ourselves, but be careful to take the appropriate locks
on the resulting dn, dir, and ancestors.
Sage Weil [Sat, 6 Nov 2010 18:02:13 +0000 (11:02 -0700)]
mds: never complete a gather on a flushing lock
The scatter_writebehind() takes a wrlock, but that may still allow the lock
to complete a gather to LOCK and even move to say MIX before the data is
committed. Bad news!
Sage Weil [Sat, 6 Nov 2010 04:52:28 +0000 (21:52 -0700)]
mds: preserve stale state on import; some cleanup
Our new invariant is that MIX_STALE always implies is_stale(). And on
import, if is_stale(), MIX becomes MIX_STALE. This ensures that a replica
that we put into MIX_STALE doesn't turn back into MIX if we import it
and take the auth's state in CInode::decode_import().
Previously I changed the std::multimap decoder to minimize the number of
constructor invocations. However, it could be much more expensive to
copy an initialized (decoded) val_t than to copy an empty one. For
example, if we are decoding std::multimap < int, std::set <int> >. So
change the code to insert a non-decoded val_t again.
However, this still saves two constructor invocations over the original.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Samuel Just [Thu, 21 Oct 2010 23:54:01 +0000 (16:54 -0700)]
PG.cc: build_scrub_map now drops the PG lock while scanning the PG
build_inc_scrub_map scans all files modified since the given
version number and creates an incremental scrub map to
be merged with a scrub map created with build_scrub_map.
This scan is done while holding the pg lock.
ScrubMap.objects is now represented as a map rather than as
a vector.
PG.h: Added last_update_applied and finalizing_scrub members to
PG.
ReplicatedPG.cc:
calc_trim_to will not trim the log during a scrub (since
replicas need the log to construct incremental maps)
sub_op_modify_applied and op_applied maintain a
last_update_applied PG member to be used for determining
how far back a replica needs to go to construct an
incremental scrub map.
osd_types.h:
Added merge_incr method for combining a scrub map with
a subsequent incremental scrub map.
ScrubMap.objects is now a map from sobject_t to object.
PG scrubs will now drop the PG lock while initially scanning the PG
collection allowing writes to continue. The scrub map will be tagged
with the most recent version applied. After halting writes, the
primary will request an incremental map from any replicas whose map
versions do not match log.head.
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>