When we get a ping reply, remove the peer from failure_queue and, if
the peer is in the failure_pending map, send a still-alive message.

Otherwise, the monitor could slowly accumulate sporadic failure
reports, leading to an OSD being incorrectly marked out.

This bug may have been contributing to the wrongly-marked-down
thrashing observed on some systems.

Signed-off-by: Samuel Just <sam.just@inktank.com>
       if (locked && is_active())
         _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from));
     }
+
+    // Cancel false reports
+    if (failure_queue.count(from))
+      failure_queue.erase(from);
+    if (failure_pending.count(from))
+      send_still_alive(failure_pending[from]);
   }
   break;
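
The bookkeeping described in the commit message can be sketched as a small standalone model. This is only an illustration, not the actual OSD code: `HeartbeatModel`, `handle_ping_reply`, and `still_alive_sent` are hypothetical names, and the container types (a set of peer ids, a map from peer id to a string) are assumptions standing in for whatever the real `failure_queue` and `failure_pending` hold. As in the hunk above, the model leaves the `failure_pending` entry in place after sending the still-alive message.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the OSD heartbeat state; not the real Ceph class.
struct HeartbeatModel {
  // Peers queued to be reported as failed, but not yet reported (assumed type).
  std::set<int> failure_queue;
  // Peers already reported as failed, mapped to the reported target (assumed type).
  std::map<int, std::string> failure_pending;
  // Records every still-alive message "sent", so the effect can be inspected.
  std::vector<std::string> still_alive_sent;

  // Stand-in for send_still_alive(): just logs the target.
  void send_still_alive(const std::string& target) {
    still_alive_sent.push_back(target);
  }

  // Cancel false reports when a ping reply arrives from peer `from`.
  void handle_ping_reply(int from) {
    // A queued but unsent failure report is simply dropped.
    failure_queue.erase(from);

    // A report that was already sent has to be retracted explicitly.
    auto it = failure_pending.find(from);
    if (it != failure_pending.end())
      send_still_alive(it->second);
  }
};
```

In this model a reply from a peer that is only in `failure_queue` produces no message at all, while a reply from a peer in `failure_pending` produces exactly one still-alive message, which is the distinction the commit message draws.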