From: Bill Scales
Date: Sun, 17 Aug 2025 15:42:11 +0000 (+0100)
Subject: qa: test_pool_min_size should kill osds first then mark them down
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=16542cb69254ea4dff4b8cbb41a8935a1a72f8a7;p=ceph.git

qa: test_pool_min_size should kill osds first then mark them down

The objective of test_pool_min_size is to inject up to M failures in a
K+M pool to prove that it has enough redundancy to stay active. It was
selecting OSDs and then killing and marking them out one at a time.

Testing with wide erasure codes (high values of K and M, for example
8+4) found that this test sometimes failed with a PG becoming dead.
Debugging showed that once the first OSD had been killed and marked
out, rebalancing and async recovery could start, which further reduced
the redundancy of the PG. When the remaining error injects happened,
the PG correctly became dead.

In practice OSDs are not normally killed and marked out one after
another in quick succession. The more common scenario is that one or
more OSDs fail at about the same time (let's say over a couple of
minutes) and then, after mon_osd_down_out_interval (10 mins), the mon
marks them out. Killing the OSDs first and then marking them out
prevents additional async recovery from starting between the injects.

If OSDs do fail over a long period of time, such that the mon marks
each OSD out in turn, then hopefully there is enough time for async
recovery to run between the failures.

This commit changes the error inject to kill all the selected OSDs
first and then to mark them out.

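The difference between the two injection orders can be sketched with a
small stand-alone toy model. This is only an illustration, not code
from qa/tasks/ceph_manager.py: rolling_inject, atomic_inject, record
and kills_after_first_out are hypothetical helpers, and the kill and
mark_out callbacks merely stand in for the thrasher's kill_osd and
out_osd. The model assumes that any kill issued after some OSD has
already been marked out may land while rebalancing or async recovery
is under way.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, int]
Action = Callable[[int], None]


def rolling_inject(osds: List[int], kill: Action, mark_out: Action) -> None:
    """Old order (hypothetical model): kill and mark out each OSD in turn."""
    for osd in osds:
        kill(osd)
        mark_out(osd)


def atomic_inject(osds: List[int], kill: Action, mark_out: Action) -> None:
    """New order (hypothetical model): kill every OSD, then mark them all out."""
    for osd in osds:
        kill(osd)
    for osd in osds:
        mark_out(osd)


def record(inject: Callable[[List[int], Action, Action], None],
           osds: List[int]) -> List[Event]:
    """Run an injection strategy and return the ordered list of events."""
    events: List[Event] = []
    inject(osds,
           lambda osd: events.append(("kill", osd)),
           lambda osd: events.append(("out", osd)))
    return events


def kills_after_first_out(events: List[Event]) -> int:
    """Count kills issued after at least one OSD was already marked out.

    Assumption of this toy model: each such kill may coincide with
    rebalancing or async recovery triggered by the earlier mark-out.
    """
    seen_out = False
    count = 0
    for action, _osd in events:
        if action == "out":
            seen_out = True
        elif action == "kill" and seen_out:
            count += 1
    return count


if __name__ == "__main__":
    selected = [3, 7, 11]  # made-up OSD ids
    print(kills_after_first_out(record(rolling_inject, selected)))
    print(kills_after_first_out(record(atomic_inject, selected)))
```

In this model the rolling order leaves every kill except the first
exposed to a preceding mark-out, while the atomic order leaves none,
which is the property the change relies on.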
Signed-off-by: Bill Scales
---

diff --git a/qa/tasks/ceph_manager.py b/qa/tasks/ceph_manager.py
index 0f7e92c5c2f..38672785163 100644
--- a/qa/tasks/ceph_manager.py
+++ b/qa/tasks/ceph_manager.py
@@ -1046,8 +1046,19 @@ class OSDThrasher(Thrasher):
             self.log("chose to kill {n} OSDs".format(n=most_killable))
             acting_set = self.get_rand_pg_acting_set(pool_id)
             assert most_killable < len(acting_set)
+            # kill the selected osds first, then mark them out. This makes
+            # the error inject a single 'atomic' failure. It simulates what
+            # happens if multiple OSDs fail over a couple of minutes and
+            # then the mon marks them out mon_osd_down_out_interval (10 mins)
+            # later. In contrast if each osd is killed and marked out in turn
+            # then this simulates a rolling failure, here rebalancing and
+            # async recovery can start after the first osd is marked out
+            # further reducing redundancy. With this number of injects in
+            # quick succession this risks a PG in the pool becoming dead
             for i in range(0, most_killable):
-                self.kill_osd(osd=acting_set[i], mark_out=True)
+                self.kill_osd(osd=acting_set[i])
+            for i in range(0, most_killable):
+                self.out_osd(osd=acting_set[i])
             self.log("dead_osds={d}, live_osds={ld}".format(d=self.dead_osds, ld=self.live_osds))
             with safe_while(
                 sleep=25, tries=5,