The test has been flaky on loaded CI, failing at different assertions
depending on where the expiration budget runs out:

    [ RUN      ] QuiesceDbTest.RepeatedQuiesceAwait
    src/test/mds/TestQuiesceDb.cc:1112: Failure
    Expected equality of these values:
      ERR(115)
        Which is: Operation now in progress(115)
      run_request(...)
        Which is: Connection timed out(110)
    [  FAILED  ] QuiesceDbTest.RepeatedQuiesceAwait (627 ms)

The test proves its central invariant -- "await resets the expiration
timer" -- indirectly, by having the set survive a series of
sleep_for(expiration/2) + await pairs. With expiration=0.1s, the two
40ms release-awaits consume 80ms and leave only ~20ms of slack, which
scheduler jitter regularly eats. A similar overrun inside the quiesce
loop produces the other failure variant: sleep_for(expiration/2)
overshoots and the gap between two awaits exceeds the expiration.
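
The old budget can be restated as a small compile-time sketch. It only
encodes the figures quoted above (0.1s expiration, two 40ms
release-awaits, sleep of expiration/2); the names are illustrative and
are not taken from TestQuiesceDb.cc.

```cpp
#include <chrono>

using namespace std::chrono_literals;
using ms = std::chrono::milliseconds;

// Figures quoted in the text for the old test (names are hypothetical).
constexpr ms old_expiration = 100ms;     // expiration = 0.1s
constexpr ms old_release_await = 40ms;   // each release-await
constexpr int old_release_awaits = 2;

// Slack left before the set expires if scheduling cost nothing.
constexpr ms old_slack =
    old_expiration - old_release_awaits * old_release_await;
static_assert(old_slack == 20ms, "only ~20ms of headroom");

// Quiesce loop: the gap between two awaits is the sleep plus any overshoot.
constexpr ms old_sleep = old_expiration / 2;
constexpr bool old_gap_exceeds_expiration(ms overshoot) {
  return old_sleep + overshoot > old_expiration;
}
```

Under this simplified model, old_gap_exceeds_expiration(60ms) is true,
which corresponds to the second failure variant.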

The per-iteration budget needs to cover ~3 CFS schedule latencies
(test wakes from sleep_for, manager wakes on notify, test wakes on
completion) plus mutex/queue processing, which is roughly 20ms
nominal on Linux with sched_latency_ns=6ms and up to 30-40ms under
heavy contention. Kernels built with a lower CONFIG_HZ coarsen
ceph::coarse_real_clock's CLOCK_REALTIME_COARSE reads, but that
affects only read precision, not the dominant scheduling-latency
term.
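
As a rough check of that estimate, the scheduling term alone can be
computed from the figures above. This is a sketch with hypothetical
names; it ignores the mutex/queue processing cost, which the text folds
into the rounded ~20ms figure.

```cpp
#include <chrono>

using namespace std::chrono_literals;
using msec = std::chrono::milliseconds;

// Scheduling term only: one schedule latency per wakeup.
constexpr msec scheduling_overhead(msec sched_latency, int wakeups) {
  return wakeups * sched_latency;
}

constexpr msec cfs_latency = 6ms;  // sched_latency_ns = 6ms, from the text
constexpr int wakeups_per_iteration = 3;  // test, manager, test again

constexpr msec nominal_sched =
    scheduling_overhead(cfs_latency, wakeups_per_iteration);
static_assert(nominal_sched == 18ms, "rounded to ~20ms with processing");

// Lower end of the heavy-contention range vs. the old ~20ms slack.
constexpr msec contended_low = 30ms;
constexpr msec old_slack_from_text = 20ms;
static_assert(contended_low > old_slack_from_text,
              "contended overhead exceeds the old slack");
```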

Fix by:

1. Raising expiration from 0.1s to 0.2s, so the margin left after
   sleep=expiration/2 (100ms) is ~3x the worst-case per-iteration
   overhead.
2. Reducing the quiesce loop from 10 to 3 iterations, since three
   iterations already span 1.5x expiration cumulatively -- enough
   to prove the resets are extending the deadline.
3. Replacing the 2-iteration release-EINPROGRESS loop with a single
   release+await(sec(0)) and an at_age equality assertion, so
   "release does not reset the timer" is checked directly rather
   than inferred from multiple awaits racing the expiration.
4. Using await = 2*expiration for the final ETIMEDOUT so the
   expiration is guaranteed to fire inside the wait window rather
   than on its boundary.
5. Tracking at_age across the quiesce loop and asserting it advances.
   The loop's ASSERT_EQ(OK(),...) already fails if resets stop
   working, but the at_age check also catches regressions where
   at_age is updated to a stuck or stale value.
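
The new schedule can be illustrated with a toy model. This is a sketch
only: ToySet, quiesce_await and release_await are hypothetical
stand-ins, not the QuiesceDbManager API, and the semantics (a quiesce
await resets at_age, a release await does not) are assumptions taken
from the items above.

```cpp
#include <chrono>

using namespace std::chrono_literals;
using dur = std::chrono::milliseconds;

// Hypothetical stand-in for a quiesce set; not Ceph code.
struct ToySet {
  dur expiration;
  dur at_age{0};  // time of the last timer reset

  constexpr bool expired(dur now) const { return now - at_age > expiration; }
  // Assumed: an await on a live quiescing set resets the timer.
  constexpr bool quiesce_await(dur now) {
    if (expired(now)) return false;
    at_age = now;
    return true;
  }
  // Assumed: a release await leaves at_age untouched.
  constexpr bool release_await(dur now) const { return !expired(now); }
};

// Walks the new schedule with a fixed per-iteration overhead.
constexpr bool new_schedule_holds(dur overhead) {
  constexpr dur expiration = 200ms;
  ToySet s{expiration};
  dur now{0};
  for (int i = 0; i < 3; ++i) {
    now = now + expiration / 2 + overhead;
    dur before = s.at_age;
    if (!s.quiesce_await(now)) return false;  // set must survive
    if (!(s.at_age > before)) return false;   // item 5: at_age advances
  }
  dur before_release = s.at_age;
  if (!s.release_await(now)) return false;       // item 3: await(sec(0))
  if (s.at_age != before_release) return false;  // item 3: no reset
  now = now + 2 * expiration;                    // item 4: final wait
  return s.expired(now);
}

static_assert(new_schedule_holds(40ms), "tolerates 40ms overhead");
```

In this model the schedule still holds with 40ms of overhead per
iteration, and only breaks once the overhead pushes the gap between
awaits past the 200ms expiration.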

The test now runs in ~500ms, comparable to the original.