From 8a4e9d152f426648fcef93b26ebb9e0954b43824 Mon Sep 17 00:00:00 2001
From: "Kamoltat (Junior) Sirivadhna"
Date: Fri, 6 Mar 2026 17:20:18 +0000
Subject: [PATCH] qa: whitelist slow requests progress.yaml

The slow requests were reported because, during the test, 16 concurrent
4 MB writes were running while recovery and backfill were disabled. At
the same time, osd.0 was marked out and then back in, causing PG
remapping.

Because recovery/backfill was disabled, some PGs could not restore
their replicas after the remap, leaving them in degraded/remapped
states. As a result, a batch of writes remained stuck in the replicated
write path, leading to an IO stall and slow ops being reported.

The solution is to ignore this warning, since we are testing the
progress module, not the write paths of OSDs. We intentionally disable
backfill and recovery to prevent the recovery event from finishing
quickly; we want to prolong it until the progress event pops up.

Fixes: https://tracker.ceph.com/issues/70320
Signed-off-by: Kamoltat (Junior) Sirivadhna
(cherry picked from commit 6b0c943c8bd004665529c5c5786ecec42bcc9ff7)
---
 .../rados/mgr/tasks/4-units/progress.yaml     |  2 ++
 qa/tasks/mgr/test_progress.py                 | 27 +++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/qa/suites/rados/mgr/tasks/4-units/progress.yaml b/qa/suites/rados/mgr/tasks/4-units/progress.yaml
index 6ed4f442955f..e09b6cc63c54 100644
--- a/qa/suites/rados/mgr/tasks/4-units/progress.yaml
+++ b/qa/suites/rados/mgr/tasks/4-units/progress.yaml
@@ -12,6 +12,8 @@ overrides:
       - \(FS_WITH_FAILED_MDS\)
       - \(FS_DEGRADED\)
       - \(OSDMAP_FLAGS\)
+      - \(slow requests\)
+
 tasks:
   - cephfs_test_runner:
       modules:
diff --git a/qa/tasks/mgr/test_progress.py b/qa/tasks/mgr/test_progress.py
index 3a13055c0922..fd9f5a737b64 100644
--- a/qa/tasks/mgr/test_progress.py
+++ b/qa/tasks/mgr/test_progress.py
@@ -9,6 +9,33 @@ log = logging.getLogger(__name__)
 
 
 class TestProgress(MgrTestCase):
+    """
+    Test suite for the progress module.
+
+    IMPORTANT: Slow Requests / IO Stalls
+    =====================================
+    These tests intentionally trigger slow requests and IO stalls as a side effect
+    of the testing methodology. This is expected behavior and should be ignored
+    when evaluating test results.
+
+    Why this occurs (an example):
+    - Tests run 16 concurrent 4 MB writes (via rados bench -t 16) to populate the cluster
+    - While recovery/backfill is disabled (nobackfill, norecover flags set), OSDs are
+      marked out and then back in, causing PG remapping
+    - Because recovery/backfill is disabled, PGs cannot restore their replicas after
+      remapping, leaving them in degraded/remapped states
+    - This causes writes to get stuck in the replicated write path, leading to IO
+      stalls and slow ops being reported
+
+    Why we do this:
+    - The purpose is to test the progress module's event tracking, NOT the OSD write paths
+    - We intentionally disable recovery/backfill to prolong recovery events, giving us
+      time to observe and validate that progress events are correctly generated, tracked,
+      and completed by the progress module
+    - Without disabling recovery, events would complete too quickly to properly test
+
+    Test configurations should include slow request warnings in their ignorelist.
+    """
     POOL = "progress_data"
 
     # How long we expect to wait at most between taking an OSD out
-- 
2.47.3