From: Kamoltat Sirivadhna Date: Fri, 30 May 2025 05:38:32 +0000 (+0000) Subject: ProtocolV2: modify handle_existing_connection logic X-Git-Tag: v21.0.0~256^2~437^2~4 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=5fcbbfb4ccf57463319cb79db985acfd76597f18;p=ceph.git ProtocolV2: modify handle_existing_connection logic Problem: When a monitor daemon (e.g., mon.c) is killed and restarts, its attempt to re-establish a session with a peer (mon.a) is incorrectly rejected. This happens because mon.a retains stale connection state, including a higher peer_global_seq, from the previous session. When the newly restarted mon.c sends a ClientIdentFrame with a fresh client_cookie and a low global_seq (e.g., 1), mon.a compares this against the old connection’s peer_global_seq (e.g., 61) and mistakenly concludes that the connection is stale. As a result, mon.a terminates the incoming connection from mon.c, delaying session recovery and triggering a transient MON_NETSPLIT warning. The connection is only recovered once mon.a's backoff timer expires and it initiates a new connection to mon.c. Solution: Reorder the logic in handle_existing_connection so that the check for client_cookie mismatch is performed before the global_seq comparison. This leverages the existing code path that handles session resets and restarts when a peer has generated a new client_cookie (e.g., after a daemon restart). By bumping this condition earlier, we avoid incorrectly rejecting valid new connections as "stale" and instead allow proper session negotiation to proceed. This change prevents unnecessary connection drops and eliminates the transient MON_NETSPLIT in such restart scenarios. Fixes: https://tracker.ceph.com/issues/71344 Signed-off-by: Kamoltat Sirivadhna --- diff --git a/src/msg/async/ProtocolV2.cc b/src/msg/async/ProtocolV2.cc index 58e4f4df21df..43e3208d6397 100644 --- a/src/msg/async/ProtocolV2.cc +++ b/src/msg/async/ProtocolV2.cc @@ -2671,6 +2671,19 @@ CtPtr ProtocolV2::handle_existing_connection(const AsyncConnectionRef& existing) return WRITE(wait, "wait", read_frame); } + if (exproto->server_cookie && exproto->client_cookie && + exproto->client_cookie != client_cookie) { + // Found previous session + // peer has reseted and we're going to reuse the existing connection + // by replacing the communication socket + ldout(cct, 1) << __func__ << " found previous session existing=" << existing + << ", peer must have reseted." << dendl; + if (connection->policy.resetcheck) { + exproto->reset_session(); + } + return reuse_connection(existing, exproto); + } + if (exproto->peer_global_seq > peer_global_seq) { ldout(cct, 1) << __func__ << " this is a stale connection, peer_global_seq=" << peer_global_seq @@ -2693,19 +2706,6 @@ CtPtr ProtocolV2::handle_existing_connection(const AsyncConnectionRef& existing) return send_server_ident(); } - if (exproto->server_cookie && exproto->client_cookie && - exproto->client_cookie != client_cookie) { - // Found previous session - // peer has reseted and we're going to reuse the existing connection - // by replacing the communication socket - ldout(cct, 1) << __func__ << " found previous session existing=" << existing - << ", peer must have reseted." << dendl; - if (connection->policy.resetcheck) { - exproto->reset_session(); - } - return reuse_connection(existing, exproto); - } - if (exproto->client_cookie == client_cookie) { // session establishment interrupted between client_ident and server_ident, // continuing...