This is an annoying race: we really should delay going
clean until the backfill peer has acknowledged the clean
info, but we currently don't. To keep this bug from
messing up the nightlies, we'll delay killing the peer
for 20s to make it likely that the backfill peer has
received the clean info.
Workaround: #6116
Signed-off-by: Samuel Just <sam.just@inktank.com>
self.ceph_manager.wait_for_clean(
timeout=self.config.get('timeout')
)
+ # Wait 20s to make it likely that any backfill peers have received
+ # the clean info before we kill an osd (workaround for #6116).
+ time.sleep(20)
+
self.log("Recovered, killing an osd")
self.kill_osd(mark_down=True, mark_out=True)
self.log("Waiting for clean again")
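
For readers without the surrounding test harness, the ordering this patch introduces can be sketched in isolation. This is only an illustration under stated assumptions, not the harness code: `recover_then_kill`, `settle_secs`, and the injectable `sleep` parameter are hypothetical names, and the three callables stand in for `self.ceph_manager.wait_for_clean`, `self.kill_osd`, and `self.log` from the hunk above.

```python
import time

# Grace period in seconds, taken from the patch above.
BACKFILL_SETTLE_SECS = 20


def recover_then_kill(wait_for_clean, kill_osd, log, timeout=None,
                      settle_secs=BACKFILL_SETTLE_SECS, sleep=time.sleep):
    """Wait for clean, pause for a grace period, then kill an OSD.

    All parameters are hypothetical stand-ins for the harness methods;
    `sleep` is injectable so the ordering can be checked without waiting.
    """
    wait_for_clean(timeout=timeout)
    # The cluster can report clean before the backfill peer has received
    # the clean info, so pause to make it likely that the peer has it.
    # This is a time-based workaround, not a guarantee.
    sleep(settle_secs)
    log("Recovered, killing an osd")
    kill_osd(mark_down=True, mark_out=True)
    log("Waiting for clean again")
```

A fixed sleep only narrows the race window. The fix the message describes as preferable, delaying the clean state until the backfill peer acknowledges it, would belong in the OSD itself rather than in the test.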