From: Bill Scales <bill_scales@uk.ibm.com>
Date: Sun, 17 Aug 2025 15:42:11 +0000 (+0100)
Subject: qa: test_pool_min_size should kill osds first then mark them down
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=16542cb69254ea4dff4b8cbb41a8935a1a72f8a7;p=ceph.git

qa: test_pool_min_size should kill osds first then mark them down

The objective of test_pool_min_size is to inject up to M failures
in a K+M pool to prove that it has enough redundancy to stay
active.

It was selecting OSDs and then killing and marking them out
one at a time. Testing with wide erasure codes (high values of
K and M, for example 8+4) found that this test sometimes
failed with a PG becoming dead. Debugging showed that after
one OSD had been killed and marked out, rebalancing and
async recovery could start, which further reduced the
redundancy of the PG. When the remaining error injects
happened, the PG correctly became dead.

In practice OSDs are not normally killed and marked out
one after another in quick succession. The more common
scenario is that one or more OSDs fail at about the
same time (let's say over a couple of minutes) and then
after mon_osd_down_out_interval (10 mins) the mon
will mark them out. Killing the OSDs first and then
marking them out prevents additional async recovery
from starting.

If OSDs do fail over a long period of time such that
the mon marks each OSD out then hopefully there is
enough time for async recovery to run between the
failures.

This commit changes the error inject to kill all the
selected OSDs first and then to mark them out.
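
As a rough illustration of the ordering difference only, the
sketch below records the calls each strategy would make. The
FakeThrasher class and the helper functions are hypothetical
stand-ins written for this note; they are not the OSDThrasher
code and do not talk to a cluster.

```python
class FakeThrasher:
    """Hypothetical stand-in that only records the order of calls."""

    def __init__(self):
        self.events = []

    def kill_osd(self, osd):
        self.events.append(("kill", osd))

    def out_osd(self, osd):
        self.events.append(("out", osd))


def inject_rolling(thrasher, osds):
    # Old ordering: each OSD is killed and marked out in turn.
    for osd in osds:
        thrasher.kill_osd(osd)
        thrasher.out_osd(osd)


def inject_atomic(thrasher, osds):
    # New ordering: kill every selected OSD, then mark them all out.
    for osd in osds:
        thrasher.kill_osd(osd)
    for osd in osds:
        thrasher.out_osd(osd)


def kills_after_first_out(events):
    # Count kills that land after the first "out", i.e. failures
    # injected once rebalancing could already have started.
    first_out = next(i for i, (op, _) in enumerate(events) if op == "out")
    return sum(1 for op, _ in events[first_out + 1:] if op == "kill")


rolling = FakeThrasher()
inject_rolling(rolling, [3, 7, 9])
atomic = FakeThrasher()
inject_atomic(atomic, [3, 7, 9])
```

With three OSDs, the rolling ordering injects two kills after
the first mark-out, while the atomic ordering injects none.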

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
---

diff --git a/qa/tasks/ceph_manager.py b/qa/tasks/ceph_manager.py
index 0f7e92c5c2f..38672785163 100644
--- a/qa/tasks/ceph_manager.py
+++ b/qa/tasks/ceph_manager.py
@@ -1046,8 +1046,19 @@ class OSDThrasher(Thrasher):
                 self.log("chose to kill {n} OSDs".format(n=most_killable))
                 acting_set = self.get_rand_pg_acting_set(pool_id)
                 assert most_killable < len(acting_set)
+                # kill the selected osds first, then mark them out. This makes
+                # the error inject a single 'atomic' failure. It simulates what
+                # happens if multiple OSDs fail over a couple of minutes and
+                # then the mon marks them out mon_osd_down_out_interval (10 mins)
+                # later. In contrast, if each osd is killed and marked out in
+                # turn, this simulates a rolling failure: rebalancing and async
+                # recovery can start after the first osd is marked out, further
+                # reducing redundancy. With this number of injects in quick
+                # succession, this risks a PG in the pool becoming dead.
                 for i in range(0, most_killable):
-                    self.kill_osd(osd=acting_set[i], mark_out=True)
+                    self.kill_osd(osd=acting_set[i])
+                for i in range(0, most_killable):
+                    self.out_osd(osd=acting_set[i])
                 self.log("dead_osds={d}, live_osds={ld}".format(d=self.dead_osds, ld=self.live_osds))
                 with safe_while(
                     sleep=25, tries=5,