osd: reset_tp_timeout should reset timeout for all shards
ShardedThreadPools are only used by the classic OSD process
which can have more than one thread for the same shard. Each
thread has a heartbeat timeout used to detect stalled threads.
Some code that is known to take a long time makes calls to
reset_tp_timeout to reset this timeout. However for sharded
pools this can be ineffective because it is common for threads
for the same shard to use the same locks (e.g. PG Lock) and
therefore if thread A is taking a long time and resetting
its timeout while holding a lock, thread B for the same shard
is liable to be waiting for the same lock, will not be
resetting its timeout and can be timed out.
Debug for issue 72879 showed heartbeat timeouts occurring at
the same time for both shards, an attempt to fix the problem
by calling reset_tp_timeout for the slow thread still showed
the other threads for the shard timing out waiting for the PG
lock that was held bythe slow thread. Looking at the OSD code
most places where reset_tp_timeout is called the thread is
holding the PG lock.
This commit moves the concept of shard_index from OSD into
ShardedThreadPool and modifies reset_tp_timeout so that it resets
the timeout for all threads for the same shard.
Some code calls reset_tp_timeout from inside loops that can take
a long time without consideration for how long the thread has
actually been running for. There is a risk that this type of
call could repeatedly reset the timeout for another shard which
is genuinely stuck and hence defeat the heartbeat checks. To
prevent this reset_tp_timeout is modified to be a NOP unless
the thread has been processing the current workitem for more
than 0.5 seconds. Therefore threads have to be slow but making
forward progress to be abe to reset the timeout.
Fixes: https://tracker.ceph.com/issues/72879 Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>