From 8d0469ce7f6bb836bd70186f7586aaf4a2c67074 Mon Sep 17 00:00:00 2001
From: Kefu Chai <kchai@redhat.com>
Date: Tue, 25 May 2021 14:31:02 +0800
Subject: [PATCH] mon/OSDMonitor: drop stale failure_info even if
 can_mark_down()

in a124ee85b03e15f4ea371358008ecac65f9f4e50, we add a check to drop
stale failure_info reports. but if osdmap does not prohibit us from
marking the osd in question down, the branch checking the stale info
is not executed. in general, it is allowed to mark an osd down, so
the fix of a124ee85b03e15f4ea371358008ecac65f9f4e50 just fails to
work.

in this change, we check for stale failure report of osd in question
as long as the osd is not marked down in the same function. this should
address the slow ops of failure report issue.

Fixes: https://tracker.ceph.com/issues/50964
Signed-off-by: Kefu Chai <kchai@redhat.com>

(cherry picked from commit 2d21ab905889c36bf9a9ecc6f0b66f4142c826e3)
---
 src/mon/OSDMonitor.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 848d94e8a2849..9ed5e06f0527b 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -2904,8 +2904,9 @@ bool OSDMonitor::check_failures(utime_t now)
   auto p = failure_info.begin();
   while (p != failure_info.end()) {
     auto& [target_osd, fi] = *p;
-    if (can_mark_down(target_osd)) {
-      found_failure |= check_failure(now, target_osd, fi);
+    if (can_mark_down(target_osd) &&
+	check_failure(now, target_osd, fi)) {
+      found_failure = true;
       ++p;
     } else if (is_failure_stale(now, fi)) {
       dout(10) << " dropping stale failure_info for osd." << target_osd
-- 
2.39.5