From: xie xingguo
Date: Thu, 6 Sep 2018 11:20:01 +0000 (+0800)
Subject: osd/OSD: ping monitor if we are stuck at __waiting_for_healthy__
X-Git-Tag: v14.0.1~368^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=b6c0eeaebb54188759bec43e05119ad4d166b9e8;p=ceph.git

osd/OSD: ping monitor if we are stuck at __waiting_for_healthy__

One of our clusters encountered some network issues several days ago,
and we observed that some OSDs were stuck at __waiting_for_healthy__
with an obsolete OSDMap (683) in hand. By contrast, the newest OSDMap
on the monitor side had been successfully bumped up to 1589:

```
2018-08-28 15:26:54.858892 7faa3869c700 1 osd.31 683 is_healthy false -- only 1/5 up peers (less than 33%)
2018-08-28 15:26:54.858909 7faa3869c700 1 osd.31 683 not healthy; waiting to boot
2018-08-28 15:26:55.859007 7faa3869c700 1 osd.31 683 is_healthy false -- only 1/5 up peers (less than 33%)
2018-08-28 15:26:55.859023 7faa3869c700 1 osd.31 683 not healthy; waiting to boot
2018-08-28 15:26:56.859122 7faa3869c700 1 osd.31 683 is_healthy false -- only 1/5 up peers (less than 33%)
2018-08-28 15:26:56.859151 7faa3869c700 1 osd.31 683 not healthy; waiting to boot
```

Since most heartbeat_peers of osd.31 were actually offline and osd.31
itself was stuck at __waiting_for_healthy__, it was unable to refresh
its osdmap (which was required for contacting the new up
heartbeat_peers) and hence could be stuck at __waiting_for_healthy__
forever.

Fix by asking the monitor for newer maps, at most once per
osd_mon_heartbeat_interval, whenever start_boot() leaves us in
__waiting_for_healthy__.

Signed-off-by: xie xingguo
---

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index e684898fd431..e78064de179c 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -4833,6 +4833,16 @@ void OSD::tick()
 
   if (is_waiting_for_healthy()) {
     start_boot();
+    if (is_waiting_for_healthy()) {
+      // failed to boot
+      Mutex::Locker l(heartbeat_lock);
+      utime_t now = ceph_clock_now();
+      if (now - last_mon_heartbeat > cct->_conf->osd_mon_heartbeat_interval) {
+        last_mon_heartbeat = now;
+        dout(1) << __func__ << " checking mon for new map" << dendl;
+        osdmap_subscribe(osdmap->get_epoch() + 1, false);
+      }
+    }
   }
 
   do_waiters();
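
For readers without a Ceph tree at hand, the throttling logic in the hunk above can be sketched as a standalone toy model. This is not Ceph code: `ToyOsd`, `simulate_ticks`, the field names, and the numeric defaults (a 0.33 healthy ratio taken from the "less than 33%" log line, and a 30 second interval standing in for `osd_mon_heartbeat_interval`) are all assumptions made for illustration, and locking and the real monitor subscription are left out.

```cpp
#include <cassert>
#include <vector>

// Toy model of the patched OSD::tick() path. Every name here is hypothetical.
struct ToyOsd {
  double min_healthy_ratio = 0.33;       // assumed from "less than 33%" in the log
  double mon_heartbeat_interval = 30.0;  // seconds; assumed stand-in for osd_mon_heartbeat_interval
  double last_mon_heartbeat = 0.0;       // time of the last map request
  unsigned epoch = 683;                  // stale map epoch from the log
  int up_peers = 1;                      // "1/5 up peers"
  int total_peers = 5;
  std::vector<unsigned> requested_epochs;  // what we asked the imaginary monitor for

  // Healthy only if enough heartbeat peers are up: 1 < 5 * 0.33, so this is false.
  bool is_healthy() const {
    return up_peers >= total_peers * min_healthy_ratio;
  }

  // One tick: while unhealthy, ask for epoch + 1 at most once per interval.
  void tick(double now) {
    if (is_healthy()) {
      return;  // the real OSD would proceed to boot here
    }
    if (now - last_mon_heartbeat > mon_heartbeat_interval) {
      last_mon_heartbeat = now;
      requested_epochs.push_back(epoch + 1);
    }
  }
};

// Tick once per simulated second and return how many map requests were made.
inline int simulate_ticks(ToyOsd& osd, int seconds) {
  assert(seconds >= 0);
  for (int t = 1; t <= seconds; ++t) {
    osd.tick(static_cast<double>(t));
  }
  return static_cast<int>(osd.requested_epochs.size());
}
```

In this model the map request is decoupled from the per-second tick, so an OSD that stays unhealthy contacts the monitor only once per interval rather than on every tick.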