Samuel Just [Sun, 17 Jun 2012 23:16:42 +0000 (16:16 -0700)]
OSD,PG: clean up _get_or_create_pg and set interval based on msg
Previously, we set last_peering_reset based on the epoch in which the pg
was created. We now pass the map from the query_epoch to the creation
methods and set last_peering_reset based on that map instead.
Samuel Just [Mon, 18 Jun 2012 19:52:06 +0000 (12:52 -0700)]
PG,OSD: prevent pg from completing peering until deletion is complete
hobject_t must now be globally unique in the filestore. Thus, if we
start creating objects in a pg before the removal collections for the
previous incarnation are fully removed, we might end up with a second
instance of the same hobject, violating the filestore rules.
Samuel Just [Fri, 29 Jun 2012 21:11:07 +0000 (14:11 -0700)]
OSD,PG: clean up pg removal
PG opsequencers will be used for removing a pg. If the pg is recreated
before the removal is complete, we need the new pg incarnation to be
able to inherit the osr of its predecessor.
Previously, we queued the pg for removal and only rendered it unusable
after the contents were fully removed. Now, we synchronously remove it
from the map and queue a transaction renaming the collections. We then
asynchronously clean up those collections. If the pg is recreated, it
will inherit the same osr until the cleanup is complete, ensuring correct
op ordering with respect to the collection rename.
Samuel Just [Wed, 20 Jun 2012 22:42:18 +0000 (15:42 -0700)]
PG: flush ops by the end of peering without osr.flush
Rather than explicitly flushing the filestore, send a noop through the
filestore at the beginning of peering and, at the end, wait for it to
finish by adding an extra state.
Also, delay ops until flushed is true. Until we have finished flushing,
we cannot safely read objects.
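A single-threaded sketch of the idea, assuming the filestore applies queued ops in order: a noop queued at the start of peering completes only after everything queued before it. `OrderedQueueSketch`, `PgSketch`, and all member names are hypothetical and stand in for the real sequencer and PG classes.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical in-order op queue standing in for the filestore sequencer.
class OrderedQueueSketch {
  std::deque<std::function<void()>> ops_;
public:
  void queue(std::function<void()> op) {
    assert(op);  // an empty std::function would throw when applied
    ops_.push_back(std::move(op));
  }
  // Apply the oldest pending op; returns false if nothing was pending.
  bool apply_one() {
    if (ops_.empty()) return false;
    std::function<void()> op = std::move(ops_.front());
    ops_.pop_front();
    op();
    return true;
  }
};

// Hypothetical pg that delays client ops until the noop has completed.
struct PgSketch {
  bool flushed = false;
  std::deque<int> waiting_for_flush;  // ids of delayed ops
  std::deque<int> served;             // ids of ops allowed to proceed

  void start_peering(OrderedQueueSketch& q) {
    flushed = false;
    // The noop does no work; its completion proves earlier ops were applied.
    q.queue([this] { on_flushed(); });
  }
  void on_flushed() {
    flushed = true;
    while (!waiting_for_flush.empty()) {
      served.push_back(waiting_for_flush.front());
      waiting_for_flush.pop_front();
    }
  }
  void handle_op(int id) {
    if (flushed) served.push_back(id);
    else waiting_for_flush.push_back(id);
  }
};
```

The pg must outlive the queued noop, since the callback captures `this`.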
Samuel Just [Thu, 14 Jun 2012 02:05:47 +0000 (19:05 -0700)]
OSD,PG: Move pg accessible methods, objects to OSDService
In order to clarify data structure locking, PGs will now access
OSDService rather than the OSD directly. Over time, more structures will
be moved to the OSDService. osd_lock can no longer be held while pg
locks are held.
Samuel Just [Mon, 7 May 2012 20:51:55 +0000 (12:51 -0800)]
OSD: do not drop osd_lock in handle_osd_map
PGs have their map updates done in a different thread. Thus, we no
longer need to grab the pg locks. activate_map no longer requires
the map_lock in order to allow us to queue events for the pgs.
Samuel Just [Thu, 31 May 2012 05:19:58 +0000 (21:19 -0800)]
OSD: initialize pgs in get_or_create_pg via handle_create
Previously, pgs were initialized via Info/Log/etc. Since the event
which triggered the pg creation may now be queued, map update events may
occur before the event is processed. Thus, get_or_create_pg now handles
the initialization prior to queuing the event.
Samuel Just [Tue, 24 Apr 2012 23:00:49 +0000 (16:00 -0700)]
OSD,PG: handle pg map advance in process_peering_event
The pg map will now be advanced in process_peering_event (in advance_pg)
to allow handle_osd_map to not grab pg locks in-line. handle_osd_map
queues NullEvts to ensure that each pg is updated in a timely fashion.
Samuel Just [Tue, 17 Apr 2012 23:54:06 +0000 (16:54 -0700)]
PG: CephPeeringEvt
CephPeeringEvt is now the supertype for all peering state machine
events. This will allow us to generalize checking for stale peering
events and delaying events for future maps.
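One way a common supertype could make such checks generic is sketched below. The field name `epoch_sent`, the `triage` function, and the exact comparison rules are assumptions made for illustration; the commit message does not spell them out.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;

// Hypothetical common base: every peering event records the map epoch
// in which it was sent.
struct PeeringEvtSketch {
  epoch_t epoch_sent;
};

enum class Disposition { Discard, Delay, Process };

// Decide what to do with an event given the pg's state (assumed rules):
//  - sent before the last peering reset: stale, discard
//  - sent in a map the pg has not reached yet: delay
//  - otherwise: hand it to the state machine
inline Disposition triage(const PeeringEvtSketch& evt,
                          epoch_t last_peering_reset,
                          epoch_t current_epoch) {
  assert(last_peering_reset <= current_epoch);
  if (evt.epoch_sent < last_peering_reset) return Disposition::Discard;
  if (evt.epoch_sent > current_epoch) return Disposition::Delay;
  return Disposition::Process;
}
```

Because the check only touches the base type, it does not need to know which concrete event it is looking at.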
Samuel Just [Tue, 3 Jul 2012 18:23:16 +0000 (11:23 -0700)]
ReplicatedPG: remove faulty scrub assert in sub_op_modify_applied
This assert assumed that all ops submitted before MOSDRepScrub was
submitted were processed by the time that MOSDRepScrub was
processed. In fact, MOSDRepScrub's scrub_to may refer to a
last_update yet to be seen by the replica.
Sage Weil [Sun, 1 Jul 2012 22:37:31 +0000 (15:37 -0700)]
msgr: choose incoming connection if ours is STANDBY
If the connect_seq matches, but our existing connection is in STANDBY, take
the incoming one. Otherwise, the other end will wait indefinitely for us
to connect but we won't.
Alternatively, we could "win" the race and trigger a connection by sending
a keepalive (or similar), but that is more work; we may as well accept the
incoming connection we have now.
This removes STANDBY from the acceptable WAIT case states. It also keeps
responsibility squarely on the shoulders of the peer with something to
deliver.
Without this patch, a 3-osd vstart cluster with
'ms inject socket failures = 100' and rados bench write -b 4096 would start
generating slow request warnings after a few minutes due to the osds
failing to connect to each other. With the patch, I completed a 10-minute
run without problems.
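The decision for the matching connect_seq case might be sketched like this. The enum values, `resolve_race`, and the `we_win_tiebreak` parameter are hypothetical; how the tie is broken when the existing pipe is not in STANDBY is not described above, so the sketch takes it as an input.

```cpp
#include <cassert>

// Hypothetical subset of pipe states.
enum class PipeState { CONNECTING, OPEN, STANDBY, CLOSED };

enum class Verdict { ReplaceExisting, TellPeerToWait };

// Called when an incoming connection races with an existing pipe and the
// connect_seq values match.
inline Verdict resolve_race(PipeState existing, bool we_win_tiebreak) {
  // A STANDBY pipe is not trying to connect, so telling the peer to wait
  // would leave both sides idle. Take the incoming connection instead.
  if (existing == PipeState::STANDBY) return Verdict::ReplaceExisting;
  return we_win_tiebreak ? Verdict::TellPeerToWait : Verdict::ReplaceExisting;
}
```

For example, `assert(resolve_race(PipeState::STANDBY, true) == Verdict::ReplaceExisting);` holds even when this side would otherwise win the race.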
Sage Weil [Fri, 29 Jun 2012 00:50:47 +0000 (17:50 -0700)]
msgr: preserve incoming message queue when replacing pipes
If we replace an existing pipe with a new one, move the incoming queue
of messages that have not yet been dispatched over to the new Pipe so that
they are not lost.
Alternatively, we could set in_seq = existing->in_seq - existing->in_qlen,
but that would make the other end resend those messages, which is a waste
of bandwidth.
Very easy to reproduce the original bug with 'ms inject socket failures'.
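The two options can be contrasted in a small sketch. `PipeSketch` and both functions are hypothetical simplifications: messages are represented by their sequence numbers only, and carrying `in_seq` over unchanged is an assumption about what keeping the queue implies.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical, simplified stand-in for a messenger pipe.
struct PipeSketch {
  std::uint64_t in_seq = 0;         // highest sequence number received
  std::deque<std::uint64_t> in_q;   // received but not yet dispatched
};

// Option taken by the patch: hand the undispatched messages to the new pipe.
inline void take_over_incoming(PipeSketch& replacement, PipeSketch& existing) {
  replacement.in_q = std::move(existing.in_q);
  existing.in_q.clear();                  // leave the old pipe visibly empty
  replacement.in_seq = existing.in_seq;   // assumed: nothing needs resending
}

// Alternative from the message: drop the queue and rewind in_seq so the
// peer resends those messages (existing->in_seq - existing->in_qlen).
inline std::uint64_t in_seq_if_dropping_queue(const PipeSketch& existing) {
  assert(existing.in_seq >= existing.in_q.size());
  return existing.in_seq - existing.in_q.size();
}
```

With `in_seq == 10` and three queued messages, the alternative would rewind to 7 and cost three retransmissions, while moving the queue costs none.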
Sage Weil [Fri, 29 Jun 2012 00:38:34 +0000 (17:38 -0700)]
msgr: move incoming queue to separate class
This extricates the incoming queue and its funky relationship with
DispatchQueue from Pipe and moves it into IncomingQueue. There is now a
single IncomingQueue attached to each Pipe. DispatchQueue is now no
longer tied to Pipe.
This modularizes the code a bit better (though that is still a work in
progress) and (more importantly) will make it possible to move the
incoming messages from one pipe to another in accept().
Sage Weil [Thu, 28 Jun 2012 00:06:40 +0000 (17:06 -0700)]
msgr: make D_CONNECT constant non-zero, fix ms_handle_connect() callback
A while ago we inadvertently broke ms_handle_connect() callbacks because
of a check for m being non-zero in the dispatch_entry() thread. Adjust the
enums so that they get delivered again.
This fixes hangs when, for example, the ceph tool sends a command, gets a
connection reset, and doesn't get the connect callback to resend after
reconnecting to a new monitor.
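The failure mode can be illustrated with a sketch in which small integer codes travel through the same queue as message pointers. Only `D_CONNECT` comes from the commit title; `D_NUM_CODES`, `Message`, `encode`, and `classify` are hypothetical, and the real dispatch loop may be structured differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a queued message.
struct Message { int type; };

// Control codes share the queue with Message pointers. Starting at 1
// keeps every code distinct from a null pointer.
enum ControlCode : std::uintptr_t { D_CONNECT = 1, D_NUM_CODES };

enum class Kind { Nothing, Control, Real };

// Disguise a control code as a queue entry.
inline Message* encode(ControlCode c) {
  assert(c != 0);  // a zero code would look like "no message"
  return reinterpret_cast<Message*>(static_cast<std::uintptr_t>(c));
}

// Classify one queue entry the way a dispatch loop might.
inline Kind classify(Message* m) {
  if (!m) return Kind::Nothing;  // a code equal to 0 would be dropped here
  if (reinterpret_cast<std::uintptr_t>(m) < D_NUM_CODES) return Kind::Control;
  return Kind::Real;
}
```

Had `D_CONNECT` been 0, its queue entry would have been indistinguishable from an empty slot on common platforms, so the connect callback would never fire.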
Sage Weil [Wed, 27 Jun 2012 00:07:31 +0000 (17:07 -0700)]
msgr: do not try to reconnect con with CLOSED pipe
If we have a con with a closed pipe, drop the message. For lossless
sessions, the state will be STANDBY if we should reconnect. For lossy
sessions, we will end up with CLOSED and we *should* drop the message.
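Expressed as a decision table, assuming a reduced set of pipe states; `PipeState`, `SendAction`, and `on_send` are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// Hypothetical subset of pipe states relevant when sending.
enum class PipeState { OPEN, STANDBY, CLOSED };

enum class SendAction { Queue, Reconnect, Drop };

// What to do with an outgoing message given the pipe behind the con.
inline SendAction on_send(PipeState s) {
  switch (s) {
    case PipeState::OPEN:    return SendAction::Queue;
    case PipeState::STANDBY: return SendAction::Reconnect;  // lossless session
    case PipeState::CLOSED:  return SendAction::Drop;       // do not reconnect
  }
  return SendAction::Drop;  // unreachable for valid states
}
```

A lossy session that failed ends in CLOSED, so `assert(on_send(PipeState::CLOSED) == SendAction::Drop);` captures the intended behavior.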
Sage Weil [Wed, 27 Jun 2012 00:06:41 +0000 (17:06 -0700)]
msgr: move to STANDBY if we replace during accept and then fail
If we replace an existing pipe during accept() and then fail, move to
STANDBY so that our connection state (connect_seq, etc.) is preserved.
Otherwise, we will throw out that information and falsely trigger a
RESETSESSION on the next connection attempt.