qa/cephadm: ignore transient CEPHADM_FAILED_DAEMON in smoke-singlehost [67622/head]
author Kefu Chai <k.chai@proxmox.com>
Tue, 3 Mar 2026 03:30:58 +0000 (11:30 +0800)
committer Kefu Chai <k.chai@proxmox.com>
Tue, 3 Mar 2026 03:47:39 +0000 (11:47 +0800)

Add CEPHADM_FAILED_DAEMON to the log-ignorelist for smoke-singlehost tests
to prevent false failures from transient daemon states during deployment.

Background:
-----------
During daemon deployment (especially of OSDs), there is a brief window
(typically 2-4 seconds) in which the daemon status is reported as 'unknown'
before the daemon fully starts and registers with the cluster. This triggers
the CEPHADM_FAILED_DAEMON health warning, which clears automatically once
the daemon completes startup.

This is expected and documented behavior during daemon deployment. Other
cephadm test suites already ignore this warning (see commit 53b462764c6
"qa: fix log errors for cephadm tests" which added CEPHADM_FAILED_DAEMON
to the ignorelists for smoke-small, smoke-roleless, osds, upgrade tests,
and many workunits).

The smoke-singlehost test was inadvertently missed in that commit, causing
intermittent false failures whenever the health warning is logged during
this transient state.

Failure Example:
----------------
Job 50357 from test run dgalloway-2026-02-13_23:06:25 failed with:

  2026-02-17T00:13:31.081 cluster [WRN] Health check failed: 1 failed
  cephadm daemon(s) (CEPHADM_FAILED_DAEMON)

Timeline:
  00:13:28 - Deploying daemon osd.1 on trial167
  00:13:30 - Reconfiguring daemon osd.1 on trial167
  00:13:31 - Health check: daemon osd.1 in unknown state (CEPHADM_FAILED_DAEMON)
  00:13:34 - Health check cleared: CEPHADM_FAILED_DAEMON (daemon started successfully)
  00:13:35+ - osd.1 running normally

The test framework flagged this as a failure because it detected the warning
in the cluster log, even though the daemon successfully started and the
warning cleared within 3 seconds.
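
The effect of the ignorelist can be illustrated with a minimal sketch. This
is not teuthology's actual implementation; the helper name first_unignored
and the severity pattern are assumptions made for illustration only. It
treats each ignorelist entry as a regular expression and reports the first
warning-or-worse cluster log line that no entry matches:

```python
import re

# Hypothetical helper for illustration; not teuthology's real code.
# Assumption: only lines tagged [ERR], [WRN] or [SEC] can fail a job.
SEVERITY = re.compile(r"\[(ERR|WRN|SEC)\]")


def first_unignored(log_lines, ignorelist):
    """Return the first ERR/WRN/SEC line matching no ignorelist pattern, else None."""
    ignores = [re.compile(pattern) for pattern in ignorelist]
    for line in log_lines:
        if not SEVERITY.search(line):
            continue  # informational lines never fail the job
        if any(pattern.search(line) for pattern in ignores):
            continue  # explicitly ignored by the suite's overrides
        return line
    return None


# The warning from the failure example above, joined onto one line.
log = [
    "2026-02-17T00:13:31.081 cluster [WRN] Health check failed: 1 failed "
    "cephadm daemon(s) (CEPHADM_FAILED_DAEMON)",
]

# Before this change: only OSD_DOWN is ignored, so the warning is reported.
print(first_unignored(log, ["OSD_DOWN"]) is not None)

# After this change: the warning is filtered out and nothing is reported.
print(first_unignored(log, ["OSD_DOWN", "CEPHADM_FAILED_DAEMON"]) is None)
```

With only OSD_DOWN in the list, the sketch returns the warning line; with
CEPHADM_FAILED_DAEMON added, it returns None.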

This brings smoke-singlehost in line with other cephadm test suites that
already handle this expected transient state.

References:
-----------
Similar fixes:
- commit 53b462764c6: Added CEPHADM_FAILED_DAEMON to multiple test suites
- commit 69076ae1022: Added CEPHADM_FAILED_DAEMON to nvmeof tests

Fixes: https://tracker.ceph.com/issues/75277
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
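
diff --git a/qa/suites/orch/cephadm/smoke-singlehost/1-start.yaml b/qa/suites/orch/cephadm/smoke-singlehost/1-start.yaml
index f350954d13a7738f41ec1d79040c01498fb0c1a2..fd952f9644ace71f9537f2a9e85de2c16182b771 100644
--- a/qa/suites/orch/cephadm/smoke-singlehost/1-start.yaml
+++ b/qa/suites/orch/cephadm/smoke-singlehost/1-start.yaml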
@@ -24,6 +24,7 @@ overrides:
   ceph:
     log-ignorelist:
       - OSD_DOWN
+      - CEPHADM_FAILED_DAEMON
     conf:
       osd:
         osd shutdown pgref assert: true