In test_pg_scrub, after an OSD is killed, the subsequent pg_scrub check and the
calls to flush_pg_stats can hang or time out under the default timeouts because
the OSD is no longer running. This was causing test failures.
This fix addresses two issues:
1. test_pg_scrub: Explicitly pass WAIT_FOR_CLEAN_TIMEOUT=10 alongside the existing
TIMEOUT=2 to the pg_scrub call made after the OSD is killed. This prevents a hang
in the wait_for_clean check within pg_scrub.
2. flush_pg_stats: Run the ceph tell osd.$osd flush_pg_stats command under
timeout $timeout so that it fails quickly when an OSD is unresponsive. An OSD
that returns no sequence number is skipped, as before.
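
The pattern in item 2 can be tried without a Ceph cluster. The sketch below is
only an illustration: fake_flush and collect_seqs are hypothetical stand-ins
(not helpers from ceph-helpers.sh), and OSD 1 is assumed to be the dead one. It
relies on the coreutils timeout command and on bash's export -f.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `ceph tell osd.$osd flush_pg_stats`:
# OSD 1 plays the killed daemon and blocks; the others print a fake seq.
fake_flush() {
    if [ "$1" = "1" ]; then
        sleep 30
    else
        echo "4$1"
    fi
}
export -f fake_flush   # make the function visible to the child bash below

# Hypothetical helper mirroring the loop in the diff: the first argument is
# the per-OSD timeout in seconds, the remaining arguments are OSD ids.
collect_seqs() {
    local timeout=$1
    shift
    local seqs='' seq osd
    for osd in "$@"; do
        # `|| true` keeps a timed-out call from aborting the caller;
        # an empty result means the OSD did not answer and is skipped.
        seq=$(timeout "$timeout" bash -c 'fake_flush "$1"' _ "$osd" 2>/dev/null) || true
        if test -z "$seq"; then
            continue
        fi
        seqs="$seqs $osd-$seq"
    done
    echo "$seqs"
}

collect_seqs 1 0 1 2
```

With a one-second timeout the call for OSD 1 is cut short and that OSD is left
out of the result, while OSDs 0 and 2 are still collected.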
Fixes: https://tracker.ceph.com/issues/74004
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
wait_for_clean || return 1
pg_scrub 1.0 || return 1
kill_daemons $dir KILL osd || return 1
- ! TIMEOUT=2 pg_scrub 1.0 || return 1
+ ! WAIT_FOR_CLEAN_TIMEOUT=10 TIMEOUT=2 pg_scrub 1.0 || return 1
teardown $dir || return 1
}
ids=`ceph osd ls`
seqs=''
for osd in $ids; do
- seq=`ceph tell osd.$osd flush_pg_stats`
- if test -z "$seq"
- then
- continue
+ seq=$(timeout $timeout ceph tell osd.$osd flush_pg_stats 2>/dev/null) || true
+ if test -z "$seq"; then
+ continue
fi
seqs="$seqs $osd-$seq"
done