OSD: send_still_alive when we get a reply if we reported failure
author Samuel Just <sam.just@inktank.com>
Fri, 13 Jul 2012 16:20:02 +0000 (09:20 -0700)
committer Samuel Just <sam.just@inktank.com>
Fri, 13 Jul 2012 19:18:46 +0000 (12:18 -0700)
When we get a ping reply, remove the peer from the failure_queue
and send a still-alive message if the peer is in the failure_pending
map.

Otherwise, the monitor could slowly accumulate sporadic failure reports,
leading to an OSD being incorrectly marked out.

This bug may have been contributing to the wrongly-marked-down
thrashing observed on some systems.

Signed-off-by: Samuel Just <sam.just@inktank.com>
src/osd/OSD.cc

index 2a48dbb4b5d9bcc281723f8245c273ded19adabf..28efcc4d771e302b54813a4366ab590ddfb2a7bb 100644
@@ -1703,6 +1703,12 @@ void OSD::handle_osd_ping(MOSDPing *m)
        if (locked && is_active())
          _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from));
       }
+
+      // Cancel false reports
+      if (failure_queue.count(from))
+       failure_queue.erase(from);
+      if (failure_pending.count(from))
+       send_still_alive(failure_pending[from]);
     }
     break;
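
For readers without the surrounding OSD code, here is a minimal standalone sketch of the bookkeeping the hunk above adds. It is not the Ceph implementation: FailureTracker, on_ping_reply, still_alive_sent and the use of std::string for the pending entry are hypothetical names and types chosen for illustration, and send_still_alive is a stub that only records its argument instead of messaging the monitor. Only the two container names and the order of operations are taken from the diff.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the OSD's failure-reporting state.
struct FailureTracker {
  // Peers we intend to report as failed but have not reported yet.
  std::set<int> failure_queue;
  // Peers already reported as failed, mapped to an illustrative peer label.
  std::map<int, std::string> failure_pending;
  // Records what the stub below would have sent to the monitor.
  std::vector<std::string> still_alive_sent;

  // Stub: the real function sends a message to the monitor.
  void send_still_alive(const std::string& inst) {
    still_alive_sent.push_back(inst);
  }

  // Called when a ping reply arrives from peer `from`.
  void on_ping_reply(int from) {
    // Cancel a report that has not gone out yet.
    // erase() on a missing key is a no-op, so no count() check is needed.
    failure_queue.erase(from);

    // Retract a report that was already sent.
    auto it = failure_pending.find(from);
    if (it != failure_pending.end())
      send_still_alive(it->second);
  }
};
```

Like the hunk, the sketch leaves the failure_pending entry in place after sending the still-alive message. Whether the entry should also be erased at that point is not something this diff shows.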