Sage Weil [Tue, 31 Jul 2012 22:02:55 +0000 (15:02 -0700)]
cephtool: send keepalive to tell target
If we 'ceph tell <foo> ...' to a non-monitor, we need to send keepalives to
ensure we detect a tcp drop. (Not so for monitors; monclient already does
its own keepalive thing.)
Sage Weil [Tue, 31 Jul 2012 21:45:51 +0000 (14:45 -0700)]
cephtool: fix deadlock on fault when waiting for osdmap
send_command() was blocking for the osdmap, and also called from the
connect callback. Instead, re-call it from the handle_osd_map() callback
so that it never blocks.
This was easy to trigger with 'ceph osd tell osd.0 foo' and ms failure
injection.
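A minimal sketch of the non-blocking pattern described above, not the actual cephtool code. ToolSketch and its members are hypothetical names; only send_command() and handle_osd_map() are taken from the message. The idea is that send_command() records a pending send instead of waiting, and the map callback re-drives it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the tool's state; names are illustrative only.
struct ToolSketch {
  bool have_osdmap = false;        // becomes true once a map has arrived
  bool pending = false;            // a command is waiting for the map
  std::string cmd;                 // the command to deliver
  std::vector<std::string> sent;   // what was actually sent

  // Never blocks: either sends now or notes that a send is pending.
  void send_command() {
    assert(!cmd.empty());
    if (!have_osdmap) {
      pending = true;
      return;
    }
    pending = false;
    sent.push_back(cmd);
  }

  // Connect callback: safe because send_command() cannot block.
  void handle_connect() { send_command(); }

  // Map callback: re-drives the command that was waiting.
  void handle_osd_map() {
    have_osdmap = true;
    if (pending)
      send_command();
  }
};
```

Because nothing waits inside the connect callback, a fault that arrives while the map is still outstanding cannot wedge the callback thread.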
Sage Weil [Thu, 26 Jul 2012 23:50:30 +0000 (16:50 -0700)]
msg/Pipe: if we send a wait, make sure we follow through
Mark our outgoing connection attempt if we send a WAIT in accept(). This
ensures we don't go to standby or closed in fault() on the outgoing
connection for any reason.
Sage Weil [Wed, 25 Jul 2012 00:12:02 +0000 (17:12 -0700)]
msg/Pipe: make STANDBY behavior optional
In particular, lossless_peers should use STANDBY, but lossless_clients
should reconnect immediately since they are already doing their own session
management.
Specifically, this fixes the problem where the Client tries to open a
connection to the MDS and faults after delivering its OPEN_SESSION message
but before it gets the reply: the session isn't open yet, so it isn't
pinging. It could, but it is simpler and faster to make the msgr layer
keep the connection open instead of waiting for a periodic keepalive.
Fixes: #2824
Signed-off-by: Sage Weil <sage@inktank.com>
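A rough sketch of how an optional STANDBY behavior could hang off the policy. PolicySketch, its fields, and after_fault() are assumptions made for illustration, not the msgr's real types, and the real fault path considers more than this (for example, what is queued).

```cpp
#include <cassert>

// Hypothetical, simplified policy; field and function names are assumptions.
struct PolicySketch {
  bool lossy;     // lossy sessions are closed on fault
  bool standby;   // idle in STANDBY after a fault instead of reconnecting
};

inline PolicySketch lossless_peer()   { return {false, true}; }
inline PolicySketch lossless_client() { return {false, false}; }

enum class NextState { CLOSED, STANDBY, CONNECTING };

// What a faulted connection does next under each policy in this sketch.
inline NextState after_fault(const PolicySketch& p) {
  assert(!(p.lossy && p.standby));  // standby only makes sense when lossless
  if (p.lossy)
    return NextState::CLOSED;
  return p.standby ? NextState::STANDBY : NextState::CONNECTING;
}
```

With this split, a client that faults right after sending OPEN_SESSION reconnects at once rather than idling until some later keepalive.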
Sage Weil [Fri, 20 Jul 2012 16:00:26 +0000 (09:00 -0700)]
msg/Pipe: go to standby on lossless server connection faults
Go directly to the STANDBY state, and print a more accurate message.
Otherwise, we do the same check in writer() and go to STANDBY then. This
is less confusing.
Sage Weil [Thu, 19 Jul 2012 20:51:04 +0000 (13:51 -0700)]
osd: reopen heartbeat connections when they fail
If we have an active peer whose Connection fails, open a new one. This
is necessary now that a lossy client connection does not automatically
reopen on its own (which is necessary to avoid races with session-based
lossy clients and the ms_handle_reset callback).
Sage Weil [Thu, 19 Jul 2012 16:42:57 +0000 (09:42 -0700)]
msgr: drop CLOSED checks during queueing
AFAICS these checks are pointless. There should be no harm in queueing
messages on a closed connection; they'll get cleaned up when it is
deregistered. Moreover, the *queuer* shouldn't be the one who has to
unregister a Pipe.
Sage Weil [Thu, 19 Jul 2012 16:30:33 +0000 (09:30 -0700)]
msgr: do not reopen failed lossy Connections
There was a race where:
- sending stuff to a lossy Connection
- it fails, and queues itself for reap, queues a RESET event
- reaper clears the Pipe
- some thread queues new messages and the Pipe is reopened, messages sent
- RESET event delivered to dispatch, connection is closed and reopened.
The result was that messages got sent to the OSD out of order during the
window between the fault() and ms_handle_reset() getting called. This will
prevent that.
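To illustrate the ordering argument, here is a small model (hypothetical types, not the msgr code) in which a failed lossy Connection refuses further sends. Anything queued after the fault is dropped, and only the reset handler opens a fresh Connection and resends in order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a lossy Connection; not the real msgr types.
struct LossyConSketch {
  bool failed = false;       // set on fault, never cleared
  std::vector<int> wire;     // messages that reached the peer, in order

  void fault() { failed = true; }

  // Returns true if the message was sent, false if it was dropped.
  bool send(int msg) {
    if (failed)
      return false;          // do not reopen; wait for the RESET handler
    wire.push_back(msg);
    return true;
  }
};

// Stand-in for what a dispatcher's ms_handle_reset() might do: open a
// brand new Connection and resend what is still outstanding, in order.
inline LossyConSketch reopen_and_resend(const LossyConSketch& old_con,
                                        const std::vector<int>& outstanding) {
  assert(old_con.failed);
  LossyConSketch fresh;
  for (int m : outstanding)
    fresh.send(m);
  return fresh;
}
```

Since the old Connection never carries traffic again, nothing can slip onto the wire between fault() and the reset handler.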
Sage Weil [Thu, 19 Jul 2012 16:28:39 +0000 (09:28 -0700)]
msg/Pipe: disconnect Pipe from lossy Connection immediately on failure
When we have a lossy connection failure, immediately disconnect the Pipe
and set the Connection failed flag. There is no reason to wait until the
reaper comes along.
Sage Weil [Tue, 17 Jul 2012 22:27:27 +0000 (15:27 -0700)]
msg/DispatchQueue: fix locking in dispatch thread
The locking was awkward with locally delivered messages: we dropped dq
lock, inq lock, re-took dq lock, etc. We would also take + drop + retake
+ drop the dq lock when queuing events. Blech!
Instead:
* simplify the queueing of cons for the local_queue
* dequeue the con under the original dq lock
* queue events under a single dq lock interval, by telling
local_queue.queue() we already have it.
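The "tell queue() we already have the lock" idea can be sketched as below. LocalQueueSketch and queue_batch() are hypothetical; the real local_queue interface may differ. The point is only that a batch is queued under one lock interval instead of take/drop per event.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

class LocalQueueSketch {
  std::mutex lock_;          // plays the role of the dq lock
  std::deque<int> events_;

public:
  std::mutex& lock() { return lock_; }

  // Pass have_lock = true when the caller already holds lock().
  void queue(int ev, bool have_lock = false) {
    std::unique_lock<std::mutex> l(lock_, std::defer_lock);
    if (!have_lock)
      l.lock();
    events_.push_back(ev);
  }

  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return events_.size();
  }

  int pop() {
    std::lock_guard<std::mutex> l(lock_);
    assert(!events_.empty());
    int ev = events_.front();
    events_.pop_front();
    return ev;
  }
};

// Queue a batch under a single lock interval.
inline void queue_batch(LocalQueueSketch& q, const std::deque<int>& evs) {
  std::lock_guard<std::mutex> l(q.lock());
  for (int ev : evs)
    q.queue(ev, /*have_lock=*/true);
}
```

The flag is easy to misuse (passing true without holding the lock is a data race), so a sketch like this trades some safety for fewer lock round trips.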
Sage Weil [Tue, 17 Jul 2012 17:56:15 +0000 (10:56 -0700)]
msgr: indicate whether clients are lossy
We need to know whether the client is lossy before we connect to the peer
in order to know whether to deliver a RESET event or not on connection
failure. Lossy clients get one, lossless do not.
And in any case, we know ahead of time, so we may as well indicate as much
in the Policy.
Sage Weil [Tue, 17 Jul 2012 22:30:11 +0000 (15:30 -0700)]
msgr: do not discard_queue in Pipe reaper
The IncomingQueue can live beyond the Pipe. In particular, there is no
reason not to deliver messages we've received on this connection even
though the socket has errored out.
Separate incoming queue discard from outgoing, and only do the latter in
the reaper.
Sage Weil [Sat, 14 Jul 2012 20:46:55 +0000 (13:46 -0700)]
msgr: rework accept() connect_seq/race handling
We change a couple of key things here:
* If there is a matching connect_seq and the existing connection is in OPEN (or
STANDBY; same thing + a failure), we send a RETRY_SESSION and ask the peer to
bump their connect_seq. This handles the case where there was a race, our
end successfully opened, but the peer's racing attempt was slowly processed.
* We always reply with connect_seq + 1. This handles the above case
more cleanly, and lets us use the same code path.
Also avoid duplicating the RETRY_SESSION path with a goto. Beautiful!
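A sketch of the matching-connect_seq branch only. It assumes the "+ 1" is applied to the existing connection's connect_seq, which the message does not spell out; the type and function names are hypothetical, and all other branches of accept() are left out.

```cpp
#include <cassert>
#include <cstdint>

enum class PipeState { CONNECTING, OPEN, STANDBY, CLOSED };

// True when the peer should be told to bump its connect_seq and retry:
// the sequence numbers match and our side already opened (or opened and
// then went to STANDBY after a failure).
inline bool should_retry_session(PipeState existing_state,
                                 std::uint32_t existing_connect_seq,
                                 std::uint32_t peer_connect_seq) {
  return peer_connect_seq == existing_connect_seq &&
         (existing_state == PipeState::OPEN ||
          existing_state == PipeState::STANDBY);
}

// connect_seq carried in the RETRY_SESSION reply (assumed: existing + 1).
inline std::uint32_t retry_session_connect_seq(std::uint32_t existing_connect_seq) {
  assert(existing_connect_seq != UINT32_MAX);  // avoid wrap in the sketch
  return existing_connect_seq + 1;
}
```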
Sage Weil [Tue, 10 Jul 2012 20:24:51 +0000 (13:24 -0700)]
mds: fix race in connection accept; fix con replacement
We solve two problems with this patch. The first is that the messenger
will now reuse an existing session's Connection with a new connection,
which means that we don't want to change session->connection when we
are validating an authorizer. Instead, set (but do not change) it.
We also want to avoid a race where:
- mds recovers, replays Sessions with no con's
- multiple connection attempts for the same session race in the msgr
- both are authorized, but out of order
- Session->connection gets set to the losing attempt's Connection*
Instead, we take advantage of an accept event that is called only for
accepted winners.
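The race can be modeled in a few lines. The types below are hypothetical stand-ins, not the MDS code: authorizer verification may run for several racing attempts and only sets the pointer if it is unset, while the accept event runs for the winner alone and is what finally decides the pointer.

```cpp
#include <cassert>

struct ConSketch { int id; };

struct SessionSketch {
  ConSketch* connection = nullptr;
};

struct MdsSketch {
  SessionSketch session;

  // May run for several racing attempts, in any order:
  // set the pointer if it is unset, but never change it.
  void verify_authorizer(ConSketch* con) {
    assert(con != nullptr);
    if (!session.connection)
      session.connection = con;
  }

  // Runs only for the attempt that won the race in the msgr.
  void handle_accept(ConSketch* con) {
    assert(con != nullptr);
    session.connection = con;
  }
};
```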
Sage Weil [Tue, 10 Jul 2012 20:20:30 +0000 (13:20 -0700)]
dispatcher: new 'accept' event type
Create a new event type when we successfully accept a connection. This is
distinct from the authorizer verification, which may happen for multiple
racing connection attempts. In contrast, this will only happen on those
that win the race(s). I don't think this is that important for stateless
servers (OSD, MON), but it is important for the MDS to ensure that it keeps
its Session con reference pointing to the most recently-successful
connection attempt.
Sage Weil [Mon, 9 Jul 2012 17:05:12 +0000 (10:05 -0700)]
msgr: drop unnecessary (un)locking on queuing connection events
This used to be necessary because the pipe_lock was used when queueing
the pipe in the dispatch queue. Now that is handled by IncomingQueue's
own lock, so these can be removed.
By no longer dropping the lock, we eliminate a whole category of potential
hard-to-debug races. (Not that any were observed, but now we don't need to
worry about them.)
Sage Weil [Thu, 5 Jul 2012 03:47:54 +0000 (20:47 -0700)]
msgr: move dispatch thread into DispatchQueue
The DispatchQueue class now completely owns message delivery. This is
cleaner and lets us drop the redundant destination_stopped flag from
msgr (DQ has its own stop flag).
Sage Weil [Mon, 9 Jul 2012 17:06:55 +0000 (10:06 -0700)]
msgr: simplify checks for queueing connection events
Looking through git history it is not clear exactly how these checks
came to be. They seem to have grown during the multiple-entity-per-rank
transition a few years back. I'm not fully convinced they are necessary,
but we will keep them regardless.
Push checks into DispatchQueue and look at the local stop flag to
determine whether these events should be queued. This moves us away from
the kludgey SimpleMessenger::destination_stopped flag (which will soon
be removed).
Also move the refcount futzing into the DispatchQueue methods. This makes
the callers much simpler.
Sage Weil [Tue, 10 Jul 2012 20:18:27 +0000 (13:18 -0700)]
msgr: take over existing Connection on Pipe replacement
If a new pipe/socket is taking over an existing session, it should also
take over the Connection* associated with the existing session. Because
we cannot clear existing->connection_state, we just take another reference.
Clean up the comments a bit while we're here.
This affects MDS<->client sessions when reconnecting after a socket fault.
It probably also affects intra-cluster (osd/osd, mds/mds, mon/mon)
sessions as well, but I did not confirm that.
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sun, 1 Jul 2012 22:37:31 +0000 (15:37 -0700)]
msgr: choose incoming connection if ours is STANDBY
If the connect_seq matches, but our existing connection is in STANDBY, take
the incoming one. Otherwise, the other end will wait indefinitely for us
to connect but we won't.
Alternatively, we could "win" the race and trigger a connection by sending
a keepalive (or similar), but that is more work; we may as well accept the
incoming connection we have now.
This removes STANDBY from the acceptable WAIT case states. It also keeps
responsibility squarely on the shoulders of the peer with something to
deliver.
Without this patch, a 3-osd vstart cluster with
'ms inject socket failures = 100' and rados bench write -b 4096 would start
generating slow request warnings after a few minutes due to the osds
failing to connect to each other. With the patch, I complete a 10 minute
run without problems.
Sage Weil [Fri, 29 Jun 2012 00:50:47 +0000 (17:50 -0700)]
msgr: preserve incoming message queue when replacing pipes
If we replace an existing pipe with a new one, move the incoming queue
of messages that have not yet been dispatched over to the new Pipe so that
they are not lost.
Alternatively, we could set in_seq = existing->in_seq - existing->in_qlen,
but that would make the other end resend those messages, which is a waste
of bandwidth.
Very easy to reproduce the original bug with 'ms inject socket failures'.
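Both options from this message can be sketched side by side. PipeSketch and the function names are hypothetical, and carrying in_seq over to the new pipe is an assumption of the sketch; in_q.size() plays the role of in_qlen.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical model of the receive side of a pipe.
struct PipeSketch {
  std::uint64_t in_seq = 0;            // highest sequence number received
  std::deque<std::uint64_t> in_q;      // received but not yet dispatched
};

// Chosen approach: hand the undispatched messages to the new pipe.
inline void take_over_incoming(PipeSketch& existing, PipeSketch& incoming) {
  incoming.in_seq = existing.in_seq;   // assumed: sequence state carries over
  incoming.in_q = std::move(existing.in_q);
  existing.in_q.clear();               // leave the old queue explicitly empty
}

// Rejected alternative: rewind in_seq so the peer resends the queued messages.
inline std::uint64_t rewound_in_seq(const PipeSketch& existing) {
  assert(existing.in_seq >= existing.in_q.size());
  return existing.in_seq - existing.in_q.size();
}
```

Moving the queue keeps what was already received; rewinding would be correct too, but pays for every queued message a second time on the wire.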
Sage Weil [Fri, 29 Jun 2012 00:38:34 +0000 (17:38 -0700)]
msgr: move incoming queue to separate class
This extricates the incoming queue and its funky relationship with
DispatchQueue from Pipe and moves it into IncomingQueue. There is now a
single IncomingQueue attached to each Pipe. DispatchQueue is now no
longer tied to Pipe.
This modularizes the code a bit better (tho that is still a work in
progress) and (more importantly) will make it possible to move the
incoming messages from one pipe to another in accept().
Sage Weil [Thu, 28 Jun 2012 00:06:40 +0000 (17:06 -0700)]
msgr: make D_CONNECT constant non-zero, fix ms_handle_connect() callback
A while ago we inadvertently broke ms_handle_connect() callbacks because
of a check for m being non-zero in the dispatch_entry() thread. Adjust the
enums so that they get delivered again.
This fixes hangs when, for example, the ceph tool sends a command, gets a
connection reset, and doesn't get the connect callback to resend after
reconnecting to a new monitor.
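Why a zero-valued code gets lost can be shown with a toy dispatch loop. The constants and dispatch_all() are hypothetical; the sketch only assumes that queued items are tested for being non-zero before delivery, as the message describes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical event codes; only their values matter here.
constexpr std::uintptr_t BROKEN_D_CONNECT = 0;  // looks like "no item"
constexpr std::uintptr_t FIXED_D_CONNECT = 1;   // survives a non-zero check

static_assert(FIXED_D_CONNECT != 0, "event codes must be non-zero");

// Stand-in for a dispatch loop that skips anything that tests as zero.
inline std::vector<std::uintptr_t>
dispatch_all(const std::vector<std::uintptr_t>& queued) {
  std::vector<std::uintptr_t> delivered;
  for (std::uintptr_t m : queued) {
    if (m)                       // the check that swallowed a zero code
      delivered.push_back(m);
  }
  return delivered;
}
```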
Sage Weil [Wed, 27 Jun 2012 00:07:31 +0000 (17:07 -0700)]
msgr: do not try to reconnect con with CLOSED pipe
If we have a con with a closed pipe, drop the message. For lossless
sessions, the state will be STANDBY if we should reconnect. For lossy
sessions, we will end up with CLOSED and we *should* drop the message.
Sage Weil [Wed, 27 Jun 2012 00:06:41 +0000 (17:06 -0700)]
msgr: move to STANDBY if we replace during accept and then fail
If we replace an existing pipe during accept() and then fail, move to
STANDBY so that our connection state (connect_seq, etc.) is preserved.
Otherwise, we will throw out that information and falsely trigger a
RESETSESSION on the next connection attempt.
Yehuda Sadeh [Wed, 27 Jun 2012 00:16:11 +0000 (17:16 -0700)]
rest-bench: mark request as complete later
We marked a request as complete in the callback, however
it might be that we're still inside S3_runall_request_context()
which means that request is not really complete yet.
Possibly fixes bug #2652.
If we play 1-4 and then replay 1-4 again, we will end up removing
(b, 1)'s attributes since nlink for (a, 1) the second time through
is 1. We fix this by marking spos on the object_map header for
(a, 1) when we remove (a, 1) but not the attributes.
- keyrings have new default locations that everyone should use.
- the user key setup is vastly simplified if you use the
'ceph auth get-or-create' command.
mon: MonmapMonitor: Use default port when the port specified on 'add' is zero
Fixes a bug triggered by using the ceph tool to 'mon add' with a port set
to zero. We now default to the monitor's default port (6789) instead, and
we will fail if that port is already assigned to some other monitor.
Fixes: bug #2661
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
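A small sketch of the fallback, under the assumption that "already assigned" means the same (ip, port) pair is present in the map. mon_add() and AddrSketch are hypothetical and not the MonmapMonitor interface; 6789 is the default port named in the message.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

constexpr std::uint16_t DEFAULT_MON_PORT = 6789;  // port named in the message

using AddrSketch = std::pair<std::string, std::uint16_t>;  // (ip, port)

// Returns true if the monitor address was added, false if it is taken.
inline bool mon_add(std::set<AddrSketch>& monmap, const std::string& ip,
                    std::uint16_t port) {
  assert(!ip.empty());
  if (port == 0)
    port = DEFAULT_MON_PORT;     // fall back instead of storing port 0
  return monmap.emplace(ip, port).second;
}
```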
Sage Weil [Wed, 20 Jun 2012 18:07:29 +0000 (11:07 -0700)]
objecter: do not feed session to op_submit()
The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly. The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor. It
didn't affect the test results because we weren't testing whether the
watch was working in that case.
Instead, we'll just recalculate and get the same value the parent linger
op did. Which is fine, and goes through all the usual code paths so
nothing is missed.
Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.
Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 21 Jun 2012 14:31:47 +0000 (07:31 -0700)]
mon: encoding new monmap using quorum feature set
It is probably unlikely that someone will expand the mon cluster with a
mixed feature set, but we know the quorum features here, so we should use
them.
Sage Weil [Thu, 21 Jun 2012 03:41:17 +0000 (20:41 -0700)]
mon: conditionally encode auth incremental with quorum feature bits
If the quorum does not yet all have the MONENC feature, stick to the old
encoding.
It might be more polite to require a super-quorum before switching over,
and take note so that thereafter we can stick to the new encoding, but
that has more moving parts and I'm not sure it's worth the complexity.
Sage Weil [Thu, 21 Jun 2012 03:33:41 +0000 (20:33 -0700)]
mon: track intersection of quorum member features
When we form a quorum, also note the intersection of the quorum members'
feature bits. This will inform decisions about what encodings we use.
This is an imperfect strategy because the quorum may change, and we may
have a mon with old code join in and not understand what is going on.
However, it does ensure that a majority of the members run new code, so in
the absence of other failures we can make progress.
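The intersection itself is just a bitwise AND over the members' feature masks. The sketch below uses a hypothetical bit value for MONENC and hypothetical function names; it shows how the result could gate an encoding choice, not how the monitor actually stores or queries it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t FEATURE_MONENC_SKETCH = 1ull << 0;  // hypothetical bit

// Intersection of the feature bits of every quorum member.
inline std::uint64_t quorum_features(const std::vector<std::uint64_t>& members) {
  assert(!members.empty());      // a quorum has at least one member
  std::uint64_t common = ~0ull;
  for (std::uint64_t f : members)
    common &= f;
  return common;
}

// Use the new encoding only if every quorum member understands it.
inline bool use_new_encoding(std::uint64_t quorum) {
  return (quorum & FEATURE_MONENC_SKETCH) != 0;
}
```

A single old member clears the bit for the whole quorum, which is the conservative behavior the message is after.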
Samuel Just [Tue, 19 Jun 2012 16:11:57 +0000 (09:11 -0700)]
PG: improve find_best_info
07f853db3982e68b952a337cf91cbf7ec0709de9 is actually too conservative;
it suffices to find any info with a last_update of at least the smallest
last_update from the last period to go active. An info from a previous
interval is acceptable if the last interval never reported a committed
operation and thus still has the same last_update.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>