From: Ilya Dryomov
Date: Thu, 22 Apr 2021 10:29:59 +0000 (+0200)
Subject: mon/MonClient: reset authenticate_err in _reopen_session()
X-Git-Tag: v15.2.13~3^2~5^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=12c5de7aa9f45eda68afe09bc87fb17c0637f882;p=ceph.git

mon/MonClient: reset authenticate_err in _reopen_session()

Otherwise, if "mon host" list has at least one unqualified IP address
without a port and both msgr1 and msgr2 are turned on, there is a race
affecting MonClient::authenticate().  For backwards compatibility
reasons such an address is expanded into two entries, each being
treated as a separate monitor.  For example, "mon host = 1.2.3.4"
generates the following initial monmap:

  0: v1:1.2.3.4:6789/0
  1: v2:1.2.3.4:3300/0

See MonMap::_add_ambiguous_addr() for details.  Then, the following
can happen:

1. we connect to both endpoints and attempt to authenticate
2. authenticate() sets authenticate_err to 1 and sleeps on auth_cond
3. msgr1 authenticates first (i.e. it gets the final MAuth message
   before msgr2 gets the monmap)
4. active_con is set to msgr1 connection, msgr2 connection is closed
   as redundant
5. _finish_auth() sets authenticate_err to 0 and signals auth_cond,
   but before either the monmap is received or authenticate() wakes
   up, msgr1 connection is closed due to a network hiccup
6. ms_handle_reset() calls _reopen_session() which clears active_con
   and again connects to both endpoints and attempts to authenticate
7. authenticate() wakes up, sees that there is no active_con and goes
   back to sleep, but this time with authenticate_err == 0
8. msgr2 authenticates first but doesn't call _finish_auth() because
   it is called only if authenticate_err == 1
9. active_con is set to msgr2 connection, msgr1 connection is closed
   as redundant
10. authenticate() hangs on auth_cond until timeout defaulting to
    5 minutes

The discrepancy between msgr1 and msgr2 plays a key role.
For msgr1, authentication is considered to be complete as soon as the
final MAuth message is received -- the monmap is not waited for.  For
msgr2, authentication is considered to be complete only after the
monmap is received.

Avoid the race by setting authenticate_err to 1 in _reopen_session(),
so that _finish_auth() is called on/after every authentication attempt
instead of just the first one.

Fixes: https://tracker.ceph.com/issues/50477
Signed-off-by: Ilya Dryomov
(cherry picked from commit 8c9de31c9806629d22c30b35769e664446090046)
---

diff --git a/src/mon/MonClient.cc b/src/mon/MonClient.cc
index d35725319f024..b94b4d9a9ead4 100644
--- a/src/mon/MonClient.cc
+++ b/src/mon/MonClient.cc
@@ -547,7 +547,6 @@ int MonClient::authenticate(double timeout)
   until += ceph::make_timespan(timeout);
   if (timeout > 0.0)
     ldout(cct, 10) << "authenticate will time out at " << until << dendl;
-  authenticate_err = 1;  // == in progress
   while (!active_con && authenticate_err >= 0) {
     if (timeout > 0.0) {
       auto r = auth_cond.wait_until(lock, until);
@@ -680,6 +679,8 @@ void MonClient::_reopen_session(int rank)
   active_con.reset();
   pending_cons.clear();
 
+  authenticate_err = 1;  // == in progress
+
   _start_hunting();
 
   if (rank >= 0) {