One of our clusters ran into network issues several days ago, and we
observed that some OSDs were stuck at __waiting_for_healthy__ with an
obsolete OSDMap (epoch 683) in hand, while the newest OSDMap on the
monitor side had already advanced to epoch 1589:
```
2018-08-28 15:26:54.858892 7faa3869c700 1 osd.31 683 is_healthy false -- only 1/5 up peers (less than 33%)
2018-08-28 15:26:54.858909 7faa3869c700 1 osd.31 683 not healthy; waiting to boot
2018-08-28 15:26:55.859007 7faa3869c700 1 osd.31 683 is_healthy false -- only 1/5 up peers (less than 33%)
2018-08-28 15:26:55.859023 7faa3869c700 1 osd.31 683 not healthy; waiting to boot
2018-08-28 15:26:56.859122 7faa3869c700 1 osd.31 683 is_healthy false -- only 1/5 up peers (less than 33%)
2018-08-28 15:26:56.859151 7faa3869c700 1 osd.31 683 not healthy; waiting to boot
```
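The "only 1/5 up peers (less than 33%)" line comes from the heartbeat-based
health check an OSD performs while waiting to boot: it counts how many of its
current heartbeat_peers have replied recently and refuses to boot if the up
fraction is below the minimum healthy ratio (roughly 1/3 by default). A minimal
sketch of that check, with simplified names and types that are assumptions
rather than Ceph's actual code:
```cpp
#include <chrono>
#include <map>

// Simplified stand-in for the per-peer heartbeat bookkeeping.
struct PeerInfo {
  std::chrono::steady_clock::time_point last_rx;  // last heartbeat reply seen
};

// Roughly what the "is_healthy false -- only X/Y up peers" check does:
// count peers whose last reply is newer than the grace cutoff and compare
// the up fraction against a minimum healthy ratio (about 1/3 by default).
bool peers_healthy(const std::map<int, PeerInfo>& peers,
                   std::chrono::seconds grace,
                   double min_healthy_ratio = 1.0 / 3.0)
{
  const auto cutoff = std::chrono::steady_clock::now() - grace;
  int up = 0, total = 0;
  for (const auto& p : peers) {
    if (p.second.last_rx >= cutoff)
      ++up;
    ++total;
  }
  // With only 1 of 5 peers responding, 0.2 < 0.33: not healthy, keep waiting.
  return total == 0 || up >= total * min_healthy_ratio;
}
```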
Since most of osd.31's heartbeat_peers were actually offline and osd.31 itself
was stuck at __waiting_for_healthy__, it never refreshed its osdmap
(which is required to contact new up heartbeat_peers) and hence could
stay stuck at __waiting_for_healthy__ forever.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
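To make the failure loop concrete, here is a small standalone model (not Ceph
code; MockOsd, request_next_map(), and the hard-coded 1/3 ratio are
illustrative assumptions). Without the periodic map request the OSD logs
"not healthy; waiting to boot" on every tick and never makes progress, because
nothing ever replaces its dead peer set; the actual fix, shown in the diff
below, adds exactly such a request to the OSD tick:
```cpp
#include <chrono>
#include <iostream>

// Toy model of an OSD stuck at waiting_for_healthy with a stale map.
struct MockOsd {
  int map_epoch = 683;   // last osdmap epoch this OSD has seen
  int up_peers = 1;      // heartbeat peers still reachable under that map
  int total_peers = 5;
  std::chrono::steady_clock::time_point last_mon_heartbeat{};

  bool is_healthy() const {
    // Same shape as the log: 1/5 up peers is below the ~33% threshold.
    return total_peers == 0 || up_peers >= total_peers * (1.0 / 3.0);
  }

  // Stand-in for osdmap_subscribe(osdmap->get_epoch() + 1, false): ask the
  // monitor for the next epoch so new, live heartbeat peers can be chosen.
  void request_next_map() {
    std::cout << "subscribing for osdmap epoch " << map_epoch + 1 << "\n";
  }

  void tick(std::chrono::seconds mon_heartbeat_interval) {
    if (!is_healthy()) {
      // Without this block the OSD only logs "not healthy; waiting to boot"
      // on every tick and never learns about a newer map.
      auto now = std::chrono::steady_clock::now();
      if (now - last_mon_heartbeat > mon_heartbeat_interval) {
        last_mon_heartbeat = now;
        request_next_map();
      }
    }
  }
};

int main() {
  MockOsd osd;
  osd.tick(std::chrono::seconds(30));  // asks the monitor once per interval
}
```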
```
   if (is_waiting_for_healthy()) {
     start_boot();
+    if (is_waiting_for_healthy()) {
+      // failed to boot
+      Mutex::Locker l(heartbeat_lock);
+      utime_t now = ceph_clock_now();
+      if (now - last_mon_heartbeat > cct->_conf->osd_mon_heartbeat_interval) {
+        last_mon_heartbeat = now;
+        dout(1) << __func__ << " checking mon for new map" << dendl;
+        osdmap_subscribe(osdmap->get_epoch() + 1, false);
+      }
+    }
   }
   do_waiters();
```
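The request is gated on osd_mon_heartbeat_interval, so a stuck OSD contacts the
monitor at most once per interval rather than on every tick, and it subscribes
starting from get_epoch() + 1 so it can catch up on the maps it missed (up to
epoch 1589 in the example above), pick live heartbeat_peers, and finally pass
the health check.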