git.apps.os.sepia.ceph.com Git - ceph-ci.git/commit

author	Sage Weil <sage@redhat.com>
	Fri, 30 Nov 2018 15:40:49 +0000 (09:40 -0600)
committer	Sage Weil <sage@redhat.com>
	Fri, 30 Nov 2018 15:58:47 +0000 (09:58 -0600)
commit	8346e397631c6faa8d4305b71b86e3de4f099fc4
tree	c6fd64a2e0336a9ccd5ecbb57db6675c9cff1318
parent	c0e4fedee4da4b250b8654616ec73d958378b4a1

msg/async: do not trigger RESETSESSION from connect fault during connection phase

Previously, if we got a connection fault during the connect/connect_reply
phase, we would increment connect_seq on the client side and trigger a
RESETSESSION on the server side (because there was not yet an existing
connection to replace). This led to dropped messages, usually in the
form of stuck peering in the rados/thrash suite.

The problem is that the condition for 'reconnect' vs 'backoff' inherited
the test from SimpleMessenger, which had only a single STATE_CONNECTING
state. In contrast, AsyncMessenger also has
CONNECTING_WAIT_BANNER_AND_IDENTIFY and CONNECTING_SEND_CONNECT_MSG, and
if we were in these states, we would increment connect_seq instead of
backing off and retrying (without an increment).

Fix by adjusting the condition to match the range of CONNECTING states
in AsyncMessenger.
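
For illustration only, the difference between the two conditions can be
sketched as below. This is not the actual AsyncConnection code: the enum
is a hypothetical, shortened state list (the real one has more states,
and its ordering is assumed here), and FaultAction, on_fault_old and
on_fault_new are made-up names. The sketch assumes the old test compared
against the single connecting state and that the connect-phase states
are contiguous in the enum.

```cpp
// Hypothetical, simplified state list. Only the three CONNECTING names
// come from the commit message; the rest and the ordering are assumed.
enum State {
  STATE_NONE,
  STATE_OPEN,
  STATE_CONNECTING,
  STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY,
  STATE_CONNECTING_SEND_CONNECT_MSG,
  STATE_CLOSED
};

// Backoff: retry later without touching connect_seq.
// Reconnect: increment connect_seq and reconnect.
enum class FaultAction { Backoff, Reconnect };

// SimpleMessenger-style test: only the single connecting state backs off,
// so a fault in a later connect-phase state is treated as a reconnect.
inline FaultAction on_fault_old(State s) {
  return s == STATE_CONNECTING ? FaultAction::Backoff
                               : FaultAction::Reconnect;
}

// Range test: every state in the connect phase backs off.
inline FaultAction on_fault_new(State s) {
  const bool connecting =
      s >= STATE_CONNECTING && s <= STATE_CONNECTING_SEND_CONNECT_MSG;
  return connecting ? FaultAction::Backoff : FaultAction::Reconnect;
}
```

With the range test, a fault in CONNECTING_SEND_CONNECT_MSG backs off
like a fault in CONNECTING does, while a fault on an open connection
still takes the reconnect path.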

Fixes: http://tracker.ceph.com/issues/36612
Signed-off-by: Sage Weil <sage@redhat.com>