mon/MonClient: reset authenticate_err in _reopen_session()
Otherwise, if the "mon host" list has at least one unqualified IP
address without a port and both msgr1 and msgr2 are turned on, there is
a race affecting MonClient::authenticate().
For backwards compatibility reasons such an address is expanded into
two entries, each being treated as a separate monitor. For example,
"mon host = 1.2.3.4" generates the following initial monmap:
0: v1:1.2.3.4:6789/0
1: v2:1.2.3.4:3300/0
See MonMap::_add_ambiguous_addr() for details.
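As a rough illustration of the expansion described above, here is a
minimal sketch. It is not the code in MonMap::_add_ambiguous_addr();
expand_ambiguous_addr() is a hypothetical helper that only reproduces
the two entries shown in the example monmap.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Ceph): expand a bare IP address
// into the two entries shown in the example monmap above, the msgr1
// entry on port 6789 followed by the msgr2 entry on port 3300.
std::vector<std::string> expand_ambiguous_addr(const std::string& ip) {
  return {
    "v1:" + ip + ":6789/0",  // rank 0: msgr1 endpoint
    "v2:" + ip + ":3300/0",  // rank 1: msgr2 endpoint
  };
}
```

Calling expand_ambiguous_addr("1.2.3.4") yields the two entries listed
in the example, which the client then treats as separate monitors.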
Then, the following can happen:
1. we connect to both endpoints and attempt to authenticate
2. authenticate() sets authenticate_err to 1 and sleeps on auth_cond
3. msgr1 authenticates first (i.e. it gets the final MAuth message
before msgr2 gets the monmap)
4. active_con is set to msgr1 connection, msgr2 connection is closed
as redundant
5. _finish_auth() sets authenticate_err to 0 and signals auth_cond,
but before either the monmap is received or authenticate() wakes
up, msgr1 connection is closed due to a network hiccup
6. ms_handle_reset() calls _reopen_session() which clears active_con
and again connects to both endpoints and attempts to authenticate
7. authenticate() wakes up, sees that there is no active_con and goes
back to sleep, but this time with authenticate_err == 0
8. msgr2 authenticates first but doesn't call _finish_auth() because
it is called only if authenticate_err == 1
9. active_con is set to msgr2 connection, msgr1 connection is closed
as redundant
10. authenticate() hangs on auth_cond until the timeout expires
(5 minutes by default)
The discrepancy between msgr1 and msgr2 plays a key role. For msgr1,
authentication is considered to be complete as soon as the final MAuth
message is received -- the monmap is not waited for. For msgr2,
authentication is considered to be complete only after the monmap is
received.
Avoid the race by setting authenticate_err to 1 in _reopen_session(),
so that _finish_auth() is called on/after every authentication attempt
instead of just the first one.
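The effect of the change can be sketched with a simplified,
single-threaded model of the state involved. This is not the MonClient
code: MonClientModel, its members and waiter_woken_after_reconnect()
are hypothetical names, the condition variable is reduced to a signal
counter, and the msgr1/msgr2 difference is ignored. It only replays the
ordering of steps 2-9 under those assumptions.

```cpp
// Simplified model, assumed for illustration only (not Ceph code).
struct MonClientModel {
  bool reset_err_on_reopen;   // true models the behaviour after this fix
  int authenticate_err = 0;   // 1 means "authentication in progress"
  bool has_active_con = false;
  int signals = 0;            // times auth_cond would have been signaled

  explicit MonClientModel(bool fix) : reset_err_on_reopen(fix) {}

  // authenticate() arms the sentinel before sleeping on auth_cond.
  void start_authenticate() { authenticate_err = 1; }

  // A connection finished authenticating; the waiter is signaled only
  // if authenticate_err is still 1 (stand-in for _finish_auth()).
  void on_authenticated() {
    has_active_con = true;
    if (authenticate_err == 1) {
      authenticate_err = 0;
      ++signals;
    }
  }

  // Stand-in for _reopen_session(): drops active_con and, with the
  // fix, re-arms authenticate_err.
  void reopen_session() {
    has_active_con = false;
    if (reset_err_on_reopen) {
      authenticate_err = 1;
    }
  }
};

// Replays steps 2-9 and reports whether the waiter would be signaled
// again after the second authentication.
bool waiter_woken_after_reconnect(bool with_fix) {
  MonClientModel m(with_fix);
  m.start_authenticate();   // step 2
  m.on_authenticated();     // steps 3-5: first authentication, signal
  m.reopen_session();       // step 6: connection reset
  int before = m.signals;   // step 7: waiter sees no active_con, sleeps
  m.on_authenticated();     // steps 8-9: second authentication
  return m.signals > before;
}
```

In this model, waiter_woken_after_reconnect(false) returns false,
which corresponds to the hang in step 10, and
waiter_woken_after_reconnect(true) returns true.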
Fixes: https://tracker.ceph.com/issues/50477
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 8c9de31c9806629d22c30b35769e664446090046)