From: Darrick J. Wong Date: Mon, 10 Nov 2025 18:26:48 +0000 (-0800) Subject: generic/778: fix background loop control with sentinel files X-Git-Tag: v2025.11.18~10 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=09728c352684b3bb0407c631a6098c5e814b2016;p=xfstests-dev.git generic/778: fix background loop control with sentinel files This test fails on my slowish QA VM with 32k-fsblock xfs: --- /run/fstests/bin/tests/generic/778.out 2025-10-20 10:03:43.432910446 -0700 +++ /var/tmp/fstests/generic/778.out.bad 2025-11-04 12:01:31.137813652 -0800 @@ -1,2 +1,137 @@ QA output created by 778 -Silence is golden +umount: /opt: target is busy. +mount: /opt: /dev/sda4 already mounted on /opt. + dmesg(1) may have more information after failed mount system call. +cycle mount failed +(see /var/tmp/fstests/generic/778.full for details) Injecting a 'ps auxfww' into the _scratch_cycle_mount helper reveals that this process is still sitting on /opt: root 1804418 9.0 0.8 144960 134368 pts/0 Dl+ 12:01 0:00 /run/fstests/xfsprogs/io/xfs_io -i -c open -fsd /opt/testfile -c pwrite -S 0x61 -DA -V1 -b 134217728 134217728 134217728 Yes, that's the xfs_io process started by atomic_write_loop. Inexplicably, the awloop killing code terminates the subshell running the for loop in atomic_write_loop but only waits for the subshell itself to exit. It doesn't wait for any of that subshell's children, and that's why the unmount fails. A bare "wait" (without the $awloop_pid parameter) also doesn't wait for the xfs_io because the parent shell sees the subshell exit and treats that as job completion. We can't use killall here because the system could be running check-parallel, nor can we use pkill here because the pid namespace containment code was removed. The simplest stupid answer is to use sentinel files to control the loop. Cc: fstests@vger.kernel.org # v2025.10.20 Fixes: ca954527ff9d97 ("generic: Add sudden shutdown tests for multi block atomic writes") Signed-off-by: Darrick J. Wong Reviewed-by: Ojaswin Mujoo Reviewed-by: Christoph Hellwig Signed-off-by: Zorro Lang --- diff --git a/tests/generic/778 b/tests/generic/778 index 7cfabc3a..715de458 100755 --- a/tests/generic/778 +++ b/tests/generic/778 @@ -21,6 +21,9 @@ _scratch_mount >> $seqres.full testfile=$SCRATCH_MNT/testfile touch $testfile +awloop_runfile=$tmp.awloop_running +awloop_killfile=$tmp.awloop_kill + awu_max=$(_get_atomic_write_unit_max $testfile) blksz=$(_get_block_size $SCRATCH_MNT) echo "Awu max: $awu_max" >> $seqres.full @@ -31,25 +34,48 @@ num_blocks=$((awu_max / blksz)) filesize=$(( 10 * 1024 * 1024 * 1024 )) _cleanup() { - [ -n "$awloop_pid" ] && kill $awloop_pid &> /dev/null - wait + kill_awloop } atomic_write_loop() { local off=0 local size=$awu_max + + rm -f $awloop_killfile + touch $awloop_runfile + for ((i=0; i<$((filesize / $size )); i++)); do # Due to sudden shutdown this can produce errors so just # redirect them to seqres.full $XFS_IO_PROG -c "open -fsd $testfile" -c "pwrite -S 0x61 -DA -V1 -b $size $off $size" >> /dev/null 2>>$seqres.full + if [ ! -w "$testfile" ] || [ -e "$awloop_killfile" ]; then + break + fi echo "Written to offset: $((off + size))" >> $tmp.aw off=$((off + size)) done + + rm -f $awloop_runfile +} + +# Use sentinel files to control the loop execution because we don't know the +# pid of the xfs_io process and so we can't wait for it directly. A bare +# wait command won't wait for a D-state xfs_io process so we can't do that +# either. We can't use killall because check-parallel, and we can't pkill +# because the pid namespacing code was removed withotu fixing check-parallel. +kill_awloop() { + test -e $awloop_runfile || return + + touch $awloop_killfile + + for ((i=0;i<300;i++)); do + test -e $awloop_runfile || break + sleep 0.1 + done } start_atomic_write_and_shutdown() { atomic_write_loop & - awloop_pid=$! local max_loops=100 local i=0 @@ -70,9 +96,7 @@ start_atomic_write_and_shutdown() { echo "# Shutting down filesystem while write is running" >> $seqres.full _scratch_shutdown - kill $awloop_pid 2>/dev/null # the process might have finished already - wait $awloop_pid - unset $awloop_pid + kill_awloop } # This test has the following flow: