Currently we rely on OSDs to watch over each other, perform failure detection,
and report to OSDMonitor. Before we can safely mark an OSD as down,
reports from a sufficient number of different reporters must have been collected.
Also, the target OSD has to have been reported as failed for long enough before we
make any final decision, in order to ride out temporary problems such as network
failures or network congestion, which, if handled carelessly, may cause more serious
problems such as flapping.
From the above analysis, even if we have gathered enough witnesses, we still have to
wait long enough before sentencing the guilty OSD to death. Therefore we rely on the
tick() thread to do this hourglass job. However, the problem is that the tick() thread
is currently unable to trigger a propose even when it has witnessed such a moment.
Fix this by making check_failures() return whether any failure was found, and by
setting do_propose in the caller when it did.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
return true;
}
-void OSDMonitor::check_failures(utime_t now)
+bool OSDMonitor::check_failures(utime_t now)
{
+ bool found_failure = false;
for (map<int,failure_info_t>::iterator p = failure_info.begin();
p != failure_info.end();
++p) {
if (can_mark_down(p->first)) {
- check_failure(now, p->first, p->second);
+ found_failure |= check_failure(now, p->first, p->second);
}
}
+ return found_failure;
}
bool OSDMonitor::check_failure(utime_t now, int target_osd, failure_info_t& fi)
utime_t now = ceph_clock_now(g_ceph_context);
// mark osds down?
- check_failures(now);
+ if (check_failures(now))
+ do_propose = true;
// mark down osds out?
SimpleLRU<version_t, bufferlist> inc_osd_cache;
SimpleLRU<version_t, bufferlist> full_osd_cache;
- void check_failures(utime_t now);
+ bool check_failures(utime_t now);
bool check_failure(utime_t now, int target_osd, failure_info_t& fi);
// map thrashing
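
The rule described in the commit message (enough distinct reporters, a failure report that is old enough, and a periodic tick that requests a proposal once both hold) can be illustrated with a minimal standalone sketch. This is not the Ceph code: `Monitor`, `FailureInfo`, `pending_down`, `min_reporters`, and `grace` are hypothetical stand-ins, and times are plain seconds rather than `utime_t`.

```cpp
#include <cstddef>
#include <map>
#include <set>

// Hypothetical stand-in for failure_info_t; not the real Ceph type.
struct FailureInfo {
  std::set<int> reporters;      // distinct OSDs that reported the target as failed
  double first_reported = 0.0;  // time of the earliest report, in seconds
};

// Hypothetical, heavily simplified model of the monitor's failure tracking.
struct Monitor {
  std::map<int, FailureInfo> failure_info;  // keyed by target OSD id
  std::set<int> pending_down;               // OSDs queued to be marked down
  std::size_t min_reporters = 2;            // assumed reporter threshold
  double grace = 20.0;                      // assumed grace period, in seconds

  // Returns true only if this call newly queued target_osd to be marked down.
  bool check_failure(double now, int target_osd, const FailureInfo& fi) {
    if (pending_down.count(target_osd))
      return false;  // already queued, nothing new to propose
    if (fi.reporters.size() < min_reporters)
      return false;  // not enough witnesses yet
    if (now - fi.first_reported < grace)
      return false;  // reported too recently, may be a transient problem
    pending_down.insert(target_osd);
    return true;
  }

  // Mirrors the patched signature: report whether any failure was found.
  bool check_failures(double now) {
    bool found_failure = false;
    for (std::map<int, FailureInfo>::const_iterator p = failure_info.begin();
         p != failure_info.end(); ++p) {
      found_failure |= check_failure(now, p->first, p->second);
    }
    return found_failure;
  }

  // Periodic tick: returns whether a proposal should be triggered.
  bool tick(double now) {
    bool do_propose = false;
    if (check_failures(now))
      do_propose = true;
    return do_propose;
  }
};
```

With two reporters recorded at time 0 for one OSD, a tick at 5 seconds returns false under these assumed thresholds because the grace period has not elapsed, while a tick at 25 seconds returns true. Before the change, the equivalent of that second tick had no way to signal that a proposal was needed.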