osd: refactor heartbeat health check
The original logic reuses the timestamp at which we sent a ping to
a specific heartbeat peer to update the last_rx_front/last_rx_back
field when the corresponding reply arrives. That timestamp is later
treated as the exact time at which we received the reply, and is
used to calculate the heartbeat latency and to determine whether the
relevant peer is dead.
However, this is not accurate enough, as there may be a delay between
receiving a reply and calling heartbeat_check(). We can eliminate
the delay by introducing a map that tracks the ping history,
each entry of which consists of three elements:
1. "tx_time", used as the map key, is the exact timestamp at which
we sent the pings.
2. "deadline" is the time by which we must have received all replies;
otherwise we consider this peer "dead".
3. "unacknowledged" is how many replies to the corresponding ping are
still outstanding. The initial value is 2 (as we send two pings, one
from the front side and one from the back side, for each peer).
We insert an item into the map every time we send out a ping, and
decrease the "unacknowledged" counter by 1 each time we get a reply to
the tracked ping. If "unacknowledged" drops to 0, we know all the replies
have been successfully collected and we can safely erase the relevant
item from the map, as well as any items sent earlier.
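The bookkeeping described above can be sketched roughly as follows.
This is a minimal illustration, not the code in src/osd/OSD.cc: the
PingTracker struct, its on_ping_sent()/on_reply() helpers, the grace
parameter and the plain double used as a timestamp are all assumptions
made for the example.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for a timestamp; plain seconds keep the sketch simple.
using Time = double;

struct PingTracker {
  // tx_time -> (deadline, unacknowledged)
  std::map<Time, std::pair<Time, int>> ping_history;

  // Record a ping sent at `now`. Two replies (front and back) are
  // expected no later than `now + grace`.
  void on_ping_sent(Time now, Time grace) {
    ping_history[now] = std::make_pair(now + grace, 2);
  }

  // Handle a reply that echoes the tx_time of the ping it answers.
  void on_reply(Time tx_time) {
    auto it = ping_history.find(tx_time);
    if (it == ping_history.end())
      return;  // entry already erased, e.g. by a later fully acked ping
    int &unacknowledged = it->second.second;
    assert(unacknowledged > 0);
    if (--unacknowledged == 0) {
      // All replies collected: erase this entry and every earlier one.
      ping_history.erase(ping_history.begin(), ++it);
    }
  }
};
```

Because std::map keeps its keys ordered, the oldest outstanding ping is
always the first entry, which makes erasing "this entry and everything
before it" a single range erase.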
By comparing the current timestamp with the oldest deadline, we can now
make a much more accurate decision about whether the corresponding peer
is healthy or not. And by setting last_rx_* to the timestamp at which we
receive the reply, the lower bound on when we could no longer hear a
reply from the corresponding connection is also much clearer.
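The deadline comparison could look something like the sketch below.
The is_healthy() helper, the PingHistory alias and the double-based
timestamp are hypothetical names chosen for illustration; treating an
empty history as healthy is likewise an assumption of this sketch.

```cpp
#include <map>
#include <utility>

// Hypothetical stand-in for a timestamp, in seconds.
using Time = double;
// tx_time -> (deadline, unacknowledged)
using PingHistory = std::map<Time, std::pair<Time, int>>;

// A peer is considered healthy if no ping is outstanding, or if the
// deadline of the oldest outstanding ping has not yet passed.
bool is_healthy(const PingHistory &ping_history, Time now) {
  if (ping_history.empty())
    return true;  // assumption: nothing outstanding means nothing overdue
  Time oldest_deadline = ping_history.begin()->second.first;
  return now <= oldest_deadline;
}
```

Only the first map entry needs to be inspected, since entries are
ordered by tx_time and the oldest ping is the first to expire as long
as the grace period stays the same between pings.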
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 477774ceee42641f6d6884536462f92567bfea11)
Conflicts:
src/osd/OSD.cc (send_still_alive() has 1 less argument)