There are two phases in recovery: one where we get all the right objects
on to the primary, and another where we push all those objects out to
the replicas. Formerly, we would not start the second phase until there
were no missing objects at all on the primary.
This change modifies that so that we will start the second phase even if
there are unfound objects. However, we will still wait for all findable
missing objects to be brought to us, of course.
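
The change in gating can be sketched roughly as follows. This is an
illustration only: RecoveryState, is_unfound, num_unfound and the two
may_push_to_replicas_* helpers are hypothetical names, not the PG code. It
assumes an object is "unfound" when it is missing and no OSD is known to
hold a copy of it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the primary's recovery bookkeeping.
struct RecoveryState {
  std::set<std::string> missing;                     // objects the primary lacks
  std::map<std::string, std::set<int>> missing_loc;  // object -> OSDs known to hold it
};

// Assumption: a missing object is "unfound" when no OSD is known to hold it.
inline bool is_unfound(const RecoveryState& s, const std::string& oid) {
  if (s.missing.count(oid) == 0) return false;
  auto it = s.missing_loc.find(oid);
  return it == s.missing_loc.end() || it->second.empty();
}

inline std::size_t num_unfound(const RecoveryState& s) {
  std::size_t n = 0;
  for (const auto& oid : s.missing)
    if (is_unfound(s, oid)) ++n;
  return n;
}

// Old rule: the second phase starts only when nothing is missing at all.
inline bool may_push_to_replicas_old(const RecoveryState& s) {
  return s.missing.empty();
}

// New rule: the second phase may start once every object that is still
// missing is unfound, i.e. all findable objects have been brought to us.
inline bool may_push_to_replicas_new(const RecoveryState& s) {
  return num_unfound(s) == s.missing.size();
}
```

With one findable and one unfound object missing, both rules say "wait".
Once the findable object has been pulled, only the new rule lets the
second phase begin.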
Get rid of uptodate_set. We can find the same information by looking at
the missing and missing_loc sets directly. Keeping the uptodate_set...
er... up-to-date would be very difficult in the presence of all the things
that can modify the missing and missing_loc sets.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Add discover_all_missing. This function makes sure that we have messages
en route to any OSD that we think might have information that could help
us discover where our unfound objects lie.
We call this function:
* In do_peer, right after activating the PG
* In _process_pg_info, when we're the primary of an active PG
* From handle_pg_notify, when we're the primary of an active PG
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
PG::search_for_missing: fix a bug with the handling of MSG_OSD_PG_INFO
messages. Formerly, when processing these messages, we erroneously
assumed that there was nothing missing on the peer at all even in cases
where there were missing objects.
PG::merge_log: drop unused Missing parameter.
_process_pg_info: Don't assume that just because we requested a Log
message at some point, that is the message we're processing.
Correctly handle cases where we didn't get the peer's Missing set or
Log.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Add MOSDPGMissing, a message which just contains the missing objects
information for a PG. We will request messages like this one in order to
locate all of our unfound objects.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Mon, 15 Nov 2010 04:26:52 +0000 (20:26 -0800)]
msgr: use provided rx buffer if present
This changes the read path so that we hold the Connection::lock mutex while
reading data off the socket. This ensures that we are reading into a
buffer we are allowed to use, and allows users to revoke a previously
posted buffer. If that happens, switch over to a newly allocated buffer.
Note that currently the final result bufferlist may contain part of the
provided buffer and part of a newly allocated buffer. This is okay as long
as we will always read the same data into the buffer. And in practice, if
the rx buffer is revoked then the message itself will be thrown out anyway.
We have to explicitly shut down the timer in Objecter::shutdown.
Otherwise, we are relying on the destructor of SafeTimer to do it.
Unfortunately, that destructor gets called after the mutex the timer is
using has already been destroyed.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
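
The timer shutdown ordering described in the commit above can be sketched
as follows. ToyTimer and ToyObjecter are hypothetical stand-ins, not the
real SafeTimer or Objecter. The sketch assumes the timer borrows a mutex
owned elsewhere and needs that mutex in order to stop cleanly.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical timer that borrows a mutex it does not own.
class ToyTimer {
 public:
  ToyTimer(std::mutex& lock, std::vector<std::string>& log)
      : lock_(lock), log_(log) {}

  // Stops the timer while the borrowed mutex is known to be alive.
  void shutdown() {
    std::lock_guard<std::mutex> guard(lock_);
    stopped_ = true;
    log_.push_back("timer stopped under lock");
  }

  // A destructor that had to stop the timer itself would need lock_, which
  // may already have been destroyed. This sketch only records that case.
  ~ToyTimer() {
    if (!stopped_) log_.push_back("timer destroyed while still running");
  }

 private:
  std::mutex& lock_;
  std::vector<std::string>& log_;
  bool stopped_ = false;
};

// Hypothetical owner of the timer.
class ToyObjecter {
 public:
  ToyObjecter(std::mutex& lock, std::vector<std::string>& log)
      : timer_(lock, log) {}

  // Explicit shutdown: the timer is stopped here, not in a destructor.
  void shutdown() { timer_.shutdown(); }

 private:
  ToyTimer timer_;
};
```

Calling shutdown() before the owner goes out of scope means the timer's
destructor has nothing left to do, so it never touches the mutex.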
Sage Weil [Fri, 12 Nov 2010 23:56:54 +0000 (15:56 -0800)]
msgr: do not clear halt_delivery
We need to keep the halt_delivery plug set on failure/shutdown in order to
prevent a racing reader from queuing new messages. The only time we clear
it is when we discard messages due to a session reset.
Sage Weil [Fri, 12 Nov 2010 21:09:24 +0000 (13:09 -0800)]
msgr: only close socket on reconnect or shutdown
We can't modify 'sd' or (more importantly) close sd while any other thread
might be using it, or else we might race with an open and they might end
up using someone else's fd.
Take care to _only_ close(sd) in connect(), when the reader thread is
stopped, or when reaping the connection.
Sage Weil [Fri, 12 Nov 2010 21:41:14 +0000 (13:41 -0800)]
msgr: protect pipe queuing with _both_ pipe and dispatch_queue locks
We want to make sure the pipe's queue item doesn't go away.
Also, make queue_received() require pipe_lock to be held. This avoids some
useless unlocking/locking, since (in the case where the pipe is already
queued) we then don't need to drop the pipe_lock at all.
Sage Weil [Fri, 12 Nov 2010 15:55:41 +0000 (07:55 -0800)]
uclient: insert lssnap results under snapdir, not live dir
Put the readdir results (list of snapshots) in the right place in the
hierarchy; we were putting them in the parent dir (as if they were real
directories).
This bug manifested itself as a snaptest-2.sh failure.
Sage Weil [Fri, 12 Nov 2010 00:38:02 +0000 (16:38 -0800)]
timer: rewrite mostly from scratch
Just use the provided lock. This _vastly_ reduces the complexity because
we don't have to worry about races between our thread trying to fire off a
timer that is being canceled.
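
A rough, single-threaded sketch of why sharing the caller's lock removes
the cancel/fire race. ToyLockedTimer and its methods are hypothetical and
are not the rewritten timer's interface. fire_due() stands in for the
timer thread and uses integer "ticks" rather than real time.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical timer that uses the lock provided by its owner.
class ToyLockedTimer {
  struct Event {
    int when;
    std::function<void()> cb;
  };

 public:
  explicit ToyLockedTimer(std::mutex& lock) : lock_(lock) {}

  // Caller must hold the shared lock.
  int add_event(int when, std::function<void()> cb) {
    int id = next_id_++;
    events_.emplace(id, Event{when, std::move(cb)});
    return id;
  }

  // Caller must hold the shared lock. Because callbacks also run under
  // that lock, a successful cancel guarantees the callback never runs.
  bool cancel(int id) { return events_.erase(id) > 0; }

  // Stands in for the timer thread: takes the shared lock and runs every
  // due callback with the lock held.
  void fire_due(int now) {
    std::lock_guard<std::mutex> guard(lock_);
    for (;;) {
      auto it = std::find_if(
          events_.begin(), events_.end(),
          [now](const std::pair<const int, Event>& kv) {
            return kv.second.when <= now;
          });
      if (it == events_.end()) break;
      std::function<void()> cb = std::move(it->second.cb);
      events_.erase(it);  // remove first, so the callback cannot be rerun
      cb();
    }
  }

 private:
  std::mutex& lock_;
  std::map<int, Event> events_;
  int next_id_ = 0;
};
```

Since a callback can only fire while the shared lock is held, a caller
holding that lock sees each event as either still pending (cancel
succeeds) or already finished (cancel reports failure), never half-fired.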
Sage Weil [Thu, 11 Nov 2010 04:58:49 +0000 (20:58 -0800)]
mds: fix null_snapflush with multiple intervening snaps
The client is allowed to not send a snapflush if there is no dirty metadata
to write for a given snap. However, the mds can only look up inodes by
the last snapid in the interval. So, when doing a null_snapflush (filling
in for snapflushes the client didn't send), we have to walk forward through
intervening snaps until we find the right inode.
Note that this means we will call _do_snap_update multiple times on the
same inode, but with different snapids.
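
The forward walk can be sketched like this. All names here (ToySnapId,
SnapInode, find_inode_for_snap) are hypothetical. The sketch assumes each
snapped inode covers an interval [first, last] of snapids and is indexed
only by last.

```cpp
#include <cassert>
#include <map>
#include <set>

using ToySnapId = unsigned long;  // hypothetical stand-in for a snapid

// A snapped inode is valid for the snapid interval [first, last].
struct SnapInode {
  ToySnapId first;
  ToySnapId last;
};

// Inodes can only be looked up by the last snapid of their interval.
using InodeByLast = std::map<ToySnapId, SnapInode>;

// Starting at snapid, walk forward through the known snaps until some
// snap is the 'last' of an inode, then check that the inode's interval
// actually covers snapid.
inline const SnapInode* find_inode_for_snap(const InodeByLast& inodes,
                                            const std::set<ToySnapId>& snaps,
                                            ToySnapId snapid) {
  for (auto s = snaps.lower_bound(snapid); s != snaps.end(); ++s) {
    auto in = inodes.find(*s);
    if (in == inodes.end()) continue;  // intervening snap, keep walking
    return in->second.first <= snapid ? &in->second : nullptr;
  }
  return nullptr;
}
```

For snaps {3, 5, 7, 9} and a single inode covering [3, 7], a lookup for
snapid 4 finds nothing at snap 5, walks on to 7, and returns that inode.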
Sage Weil [Wed, 10 Nov 2010 23:33:31 +0000 (15:33 -0800)]
osd: call sched_scrub on reserve reply
Otherwise we have to wait until the next time it's called by the timer, and
during that period we have a reservation locally, and any other peers can't
reserve a scrub from us, and nobody makes any progress.
PG::search_for_missing: when we find a previously unfound object, check
to see if there is an entry in waiting_for_missing_object representing a
client waiting for this object.
PG::repair_object: assert that waiting_for_missing_object is empty
before messing with missing_loc. It definitely should be during a scrub.
ReplicatedPG role change logic: always take_object_waiters on the wait
queues when the PG acting set changes.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
OSD::_process_pg_info: search_for_missing sometimes
OSD::_process_pg_info: If we're the primary for this active PG, and we
have missing objects, call search_for_missing. This should ensure that
we know where to find our missing objects.
The reason why this wasn't there before is that previously, we kept the
PG from going active until all the missing objects were found.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Erase the code in PG::peer that used to keep us from becoming active
when objects were still unfound. Print out the number of missing and
unfound objects at the end of PG::peer.
Erase PG::check_for_lost_objects and PG::forget_lost_objects.
Sage Weil [Wed, 10 Nov 2010 17:03:37 +0000 (09:03 -0800)]
objecter: throttle before looking at lock protected state
The take_op_budget() may drop our lock if we are in keep_balanced_budget
mode, so we need to do that _before_ we take references to internal state
that may change out from under us during that time.
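
The ordering problem can be sketched as follows. ToyObjecter here is a
hypothetical stand-in, not the real Objecter. The while_unlocked hook is
an assumption that plays the part of another thread changing state while
the lock is dropped.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical stand-in for an object whose throttle may drop its lock.
struct ToyObjecter {
  std::mutex lock;
  int epoch = 1;                         // lock-protected state
  std::function<void()> while_unlocked;  // simulates a concurrent update

  // May release the lock while waiting for budget, then retakes it.
  void take_op_budget(std::unique_lock<std::mutex>& held) {
    held.unlock();
    if (while_unlocked) while_unlocked();
    held.lock();
  }

  // Wrong order: state is read first and may be stale after throttling.
  int submit_stale() {
    std::unique_lock<std::mutex> held(lock);
    int seen = epoch;
    take_op_budget(held);
    return seen;
  }

  // Right order: throttle first, then look at lock-protected state.
  int submit_fresh() {
    std::unique_lock<std::mutex> held(lock);
    take_op_budget(held);
    return epoch;
  }
};
```

In the stale variant the value read before throttling no longer matches
the protected state once the lock is retaken.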
This fixes a crash like
./osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_inst(int)':
./osd/OSDMap.h:460: FAILED assert(exists(osd) && is_up(osd))
ceph version 0.22.1 (commit:c6f403a6f441184956e00659ce713eaee7014279)
1: (Objecter::op_submit(Objecter::Op*)+0x6c2) [0x38658854c2]
2: /usr/lib64/librados.so.1() [0x3865855dc9]
3: (RadosClient::aio_write(RadosClient::PoolCtx&, object_t, long,
ceph::buffer::list const&, unsigned long,
RadosClient::AioCompletion*)+0x24b) [0x386585724b]
4: (rados_aio_write()+0x9a) [0x386585741a]
5: /usr/bin/qemu-kvm() [0x45a305]
6: /usr/bin/qemu-kvm() [0x45a430]
7: /usr/bin/qemu-kvm() [0x43bb73]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (ABRT) ***
ceph version 0.22.1 (commit:c6f403a6f441184956e00659ce713eaee7014279)
1: (sigabrt_handler(int)+0x91) [0x3865922b91]
2: /lib64/libc.so.6() [0x3c0c032a30]
3: (gsignal()+0x35) [0x3c0c0329b5]
4: (abort()+0x175) [0x3c0c034195]
5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3c110beaad]