sched/deadline: Fix dl_server getting stuck
author Peter Zijlstra <peterz@infradead.org>
Tue, 16 Sep 2025 21:02:41 +0000 (23:02 +0200)
committer Peter Zijlstra <peterz@infradead.org>
Thu, 25 Sep 2025 07:51:50 +0000 (09:51 +0200)
commit 4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c
tree ee3a49e04d70d79beb5ac3bb93e39adde655af30
parent f83ec76bf285bea5727f478a68b894f5543ca76e

John found it was easy to hit lockup warnings when running locktorture
on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
("sched/deadline: Less agressive dl_server handling").

While debugging, it seems we can end up with the dl_server dequeued
while dl_se->dl_server_active is still set. This causes
dl_server_start() to return without enqueueing the dl_server, so it
fails to run when RT tasks starve the CPU.
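
The early return can be pictured with a small user-space sketch. This is
a toy model, not the kernel code: struct toy_dl_se, its enqueued field
and toy_dl_server_start() are hypothetical names, and the real
dl_server_start() does more than this.

```c
#include <stdbool.h>

/*
 * Toy stand-in for the two bits of server state discussed above.
 * Hypothetical names; this is not the kernel's struct sched_dl_entity.
 */
struct toy_dl_se {
	bool dl_server_active;	/* server thinks it has been started */
	bool enqueued;		/* server is actually on the runqueue */
};

/* Returns true if this call enqueued the server, false on the early return. */
static bool toy_dl_server_start(struct toy_dl_se *dl_se)
{
	if (dl_se->dl_server_active)
		return false;	/* already "active": nothing gets enqueued */

	dl_se->dl_server_active = true;
	dl_se->enqueued = true;
	return true;
}
```

Starting from { .dl_server_active = true, .enqueued = false }, every call
takes the early return and the server never reaches the runqueue.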

When this happens, dl_server_timer() catches the
'!dl_se->server_has_tasks(dl_se)' case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally returns
HRTIMER_NORESTART.

This ends in no new timer and also no enqueue, leaving the dl_server
'dead', allowing starvation.
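
A minimal sketch of why that path leaves the server 'dead', written as a
user-space toy rather than kernel code. All names here (struct
toy_server, toy_timer_old(), toy_server_is_dead(), the TOY_* values) are
hypothetical. The sketch does not model what dl_server_stopped() does
internally; it simply assumes the active flag stays set, as in the stuck
state described above.

```c
#include <stdbool.h>

enum toy_restart { TOY_NORESTART, TOY_RESTART };

struct toy_server {
	bool active;		/* models dl_se->dl_server_active */
	bool enqueued;		/* server is on the runqueue */
	bool timer_armed;	/* some server timer is pending */
	bool has_fair_tasks;	/* models the server_has_tasks() result */
	int runtime;		/* remaining budget */
	int budget;		/* budget granted per period */
};

/* Old timer behaviour as described above, heavily simplified. */
static enum toy_restart toy_timer_old(struct toy_server *s)
{
	s->timer_armed = false;		/* this expiry consumed the timer */
	s->runtime = s->budget;		/* replenish */

	if (!s->has_fair_tasks)
		return TOY_NORESTART;	/* no enqueue and no new timer */

	s->enqueued = true;
	return TOY_NORESTART;
}

/* 'Dead': marked active, but nothing will ever put it on the runqueue. */
static bool toy_server_is_dead(const struct toy_server *s)
{
	return s->active && !s->enqueued && !s->timer_armed;
}
```

With has_fair_tasks false at expiry, toy_server_is_dead() holds
afterwards: the budget was refilled, but nothing is left to enqueue the
server.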

What should have happened is for the bandwidth timer to start the
zero-laxity timer, which in turn would enqueue the dl_server and cause
dl_se->server_pick_task() to be called -- which will stop the
dl_server if no fair tasks are observed for a whole period.

IOW, it is totally irrelevant if there are fair tasks at the moment of
bandwidth refresh.
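
The intended flow can be sketched the same way. This is an illustrative
toy under stated assumptions, not the patch itself: struct toy_srv and
the toy_*() helpers are hypothetical, and "no fair tasks for a whole
period" is reduced to a single saw_fair_task flag checked by a made-up
toy_period_end() step.

```c
#include <stdbool.h>

struct toy_srv {
	bool active;		/* models dl_se->dl_server_active */
	bool enqueued;		/* server is on the runqueue */
	bool zero_laxity_armed;	/* zero-laxity timer pending */
	bool saw_fair_task;	/* a fair task was picked this period */
};

/* Bandwidth refresh: never asks whether fair tasks exist right now. */
static void toy_bandwidth_timer(struct toy_srv *s)
{
	s->saw_fair_task = false;	/* a new period starts */
	s->zero_laxity_armed = true;
}

/* The zero-laxity timer is what enqueues the server. */
static void toy_zero_laxity_timer(struct toy_srv *s)
{
	s->zero_laxity_armed = false;
	s->enqueued = true;
}

/* Stand-in for the pick path: only here are fair tasks observed. */
static void toy_pick_task(struct toy_srv *s, bool fair_task_available)
{
	if (fair_task_available)
		s->saw_fair_task = true;
}

/* Stop the server only after a whole period without any fair task. */
static void toy_period_end(struct toy_srv *s)
{
	if (!s->saw_fair_task) {
		s->enqueued = false;
		s->active = false;
	}
}
```

In this model the decision to stop is made only from what the pick path
observed over the period, never at the moment of the refresh.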

This removes all dl_se->server_has_tasks() users, so remove the whole
thing.

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
include/linux/sched.h
kernel/sched/deadline.c
kernel/sched/fair.c
kernel/sched/sched.h