From a761be714d5c865c7c32b4bac4fab1c365b51c58 Mon Sep 17 00:00:00 2001
From: Laura Flores
Date: Mon, 16 May 2022 17:59:42 -0500
Subject: [PATCH] qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

All `rados/thrash-erasure-code-big` tests that die due to the
“wait_for_recovery” timeout have one thing in common: they contain
either `thrashers/pggrow` or `thrashers/mapgap`. Unlike the
non-offending thrashers (default, careful, fastread, and morepggrow),
pggrow and mapgap lack an override setting for `osd max backfills`.

`osd max backfills` is the max number of backfill operations allowed
to/from an OSD. The higher the number, the quicker the recovery. By
default, this value is 1. On all of the non-offending thrashers, the
default value of 1 gets overridden in their .yaml files with a
value > 1. This is not the case for pggrow and mapgap.

The mclock op scheduler is known to override `osd max backfills` with a
high value, but all of the thrash-erasure-code-big thrashers have their
op queue set to “debug_random”, which chooses randomly between op queues
(the debug_random op queue is set to override the default
mclock_scheduler in qa/config/rados.yaml). So, coupled with the
“debug_random” op queue, the low `osd max backfills` setting is causing
some tests to time out in recovery.
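
For reference, a minimal sketch of where such an override sits in a
thrasher .yaml. The nesting under overrides/ceph/conf/osd is assumed
from the usual layout of these suite files, and 6 is the value this
patch adds for mapgap and pggrow; other thrashers may use a different
value.

```yaml
overrides:
  ceph:
    conf:
      osd:
        # Assumed nesting; 6 is the value added by this patch.
        osd max backfills: 6
```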

WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow”
tests die due to timed-out recovery about 17/100 times, as seen here
with a pggrow test:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/

WITH `osd max backfills` specified, as I have suggested in this PR,
99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/

I also scheduled 145 tests WITH `osd max backfills` that are a mix of
pggrow and mapgap thrashers. 144/145 tests passed, with one test failing
for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/

Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores
(cherry picked from commit 40062676c2ceed49b9fa147127ffa83ba6118e2a)
---
 qa/suites/rados/thrash-erasure-code-big/thrashers/mapgap.yaml | 1 +
 qa/suites/rados/thrash-erasure-code-big/thrashers/pggrow.yaml | 1 +
 2 files changed, 2 insertions(+)

diff --git a/qa/suites/rados/thrash-erasure-code-big/thrashers/mapgap.yaml b/qa/suites/rados/thrash-erasure-code-big/thrashers/mapgap.yaml
index 318b20266f112..18843d87220a1 100644
--- a/qa/suites/rados/thrash-erasure-code-big/thrashers/mapgap.yaml
+++ b/qa/suites/rados/thrash-erasure-code-big/thrashers/mapgap.yaml
@@ -11,6 +11,7 @@ overrides:
         osd map cache size: 1
         osd scrub min interval: 60
         osd scrub max interval: 120
+        osd max backfills: 6
 tasks:
 - thrashosds:
     timeout: 1800
diff --git a/qa/suites/rados/thrash-erasure-code-big/thrashers/pggrow.yaml b/qa/suites/rados/thrash-erasure-code-big/thrashers/pggrow.yaml
index 772f2698b6790..9cbb80dba9e83 100644
--- a/qa/suites/rados/thrash-erasure-code-big/thrashers/pggrow.yaml
+++ b/qa/suites/rados/thrash-erasure-code-big/thrashers/pggrow.yaml
@@ -7,6 +7,7 @@ overrides:
       osd:
         osd scrub min interval: 60
         osd scrub max interval: 120
+        osd max backfills: 6
 tasks:
 - thrashosds:
     timeout: 1200
-- 
2.39.5