From: 宋顺10180185
Date: Tue, 17 Sep 2019 00:26:52 +0000 (-0400)
Subject: OSD: avoid failure peer info to resent
X-Git-Tag: v15.1.0~1402^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=37a758a14a4dbbb96fa89c569914b22fbf260025;p=ceph.git

OSD: avoid failure peer info to resent

maybe_update_heartbeat_peers may remove a peer and never add it back.
If that peer is in failure_pending but does not actually have a problem,
failure_pending holds the entry until the peer really goes down or is
added back as a heartbeat peer. Each time ms_handle_connect is called,
these pending failures are resent:

2019-09-12 09:44:47.080933 7f1fad781700 10 osd.13 6175 ms_handle_connect on mon
2019-09-12 09:44:47.080937 7f1fad781700 10 osd.13 6175 send_alive up_thru currently 6159 want 6155
2019-09-12 09:44:47.080945 7f1fad781700 10 osd.13 6175 requeue_failures 0 + 1 -> 1

Fix this by dropping failure_pending entries whose peer is no longer in
heartbeat_peers at the end of maybe_update_heartbeat_peers, and calling
send_still_alive for each dropped entry.

Signed-off-by: 宋顺10180185
---

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index d67002909e9..f2b928b9468 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -4482,6 +4482,16 @@ void OSD::maybe_update_heartbeat_peers()
   }
   dout(10) << "maybe_update_heartbeat_peers " << heartbeat_peers.size()
 	   << " peers, extras " << extras << dendl;
+
+  // clean up stale failure pending
+  for (auto it = failure_pending.begin(); it != failure_pending.end();) {
+    if (heartbeat_peers.count(it->first) == 0) {
+      send_still_alive(osdmap->get_epoch(), it->first, it->second.second);
+      failure_pending.erase(it++);
+    } else {
+      it++;
+    }
+  }
 }
 
 void OSD::reset_heartbeat_peers(bool all)
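
For readers unfamiliar with the erase-while-iterating pattern used in the hunk above, the following is a minimal standalone sketch of the same cleanup. It does not use Ceph's types: `PeerId`, `PendingReport`, and `prune_stale_pending` are hypothetical names introduced only for illustration, and returning the dropped ids stands in for the `send_still_alive` call in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-ins for the OSD state; these are not Ceph's real types.
using PeerId = int;
struct PendingReport {
  int epoch_reported;  // illustrative payload only
};

// Removes every pending report whose peer is no longer a heartbeat peer.
// Returns the removed peer ids in ascending order (std::map iteration order)
// so the caller could retract the failure report for each of them.
std::vector<PeerId> prune_stale_pending(
    std::map<PeerId, PendingReport>& pending,
    const std::set<PeerId>& heartbeat_peers) {
  const std::size_t before = pending.size();
  std::vector<PeerId> dropped;

  for (auto it = pending.begin(); it != pending.end();) {
    if (heartbeat_peers.count(it->first) == 0) {
      dropped.push_back(it->first);
      // erase() returns the iterator following the removed element, so the
      // loop never touches an invalidated iterator.
      it = pending.erase(it);
    } else {
      ++it;
    }
  }

  // Every entry is either still pending or was reported as dropped.
  assert(pending.size() + dropped.size() == before);
  return dropped;
}
```

The patch writes `failure_pending.erase(it++)`, which is equally safe for `std::map` because the iterator is advanced before the element is erased; `it = pending.erase(it)` is the C++11 form of the same idea.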