git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/commit
crimson/osd: fix race between AllReplicasRecovered and DeferRecovery (68383/head)
author Aishwarya Mathuria <amathuri@redhat.com>
Tue, 14 Apr 2026 07:59:36 +0000 (13:29 +0530)
committer Aishwarya Mathuria <amathuri@redhat.com>
Wed, 15 Apr 2026 10:25:19 +0000 (10:25 +0000)
commit 278ace1bc4abf7a91a7e0932091353b19da11222
tree 71d90cd5d05468edb4bd94db6b3ba6a6be2574c5
parent dc6913037fd9de70ddcca7b12ac9ac95ae738adc
crimson/osd: fix race between AllReplicasRecovered and DeferRecovery

Fixes a crash where an AllReplicasRecovered event arrives in the
NotRecovering state because asynchronous event delivery races with
DeferRecovery preemption.

The issue occurs when:
1. Recovery completes and AllReplicasRecovered is queued asynchronously
2. A higher priority operation (e.g., client I/O) triggers AsyncReserver
   to preempt recovery, posting a DeferRecovery event
3. DeferRecovery is processed first, transitioning the PG to NotRecovering
4. AllReplicasRecovered arrives in the wrong state and crashes with "bad
   state machine event" because NotRecovering doesn't handle it

The fix follows Classic OSD's approach in PrimaryLogPG::start_recovery_ops():
clear PG_STATE_RECOVERING before posting recovery completion events. This
makes the existing safety check in PeeringState::Recovering::react() work:
when DeferRecovery arrives and sees !state_test(PG_STATE_RECOVERING), it
discards itself, preventing the state transition that would cause the crash.
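
The ordering and the guard described above can be illustrated with a small
toy model. This is only a sketch, not the code in
src/crimson/osd/pg_recovery.cc or PeeringState: ToyPG, recovering_flag,
Outcome and simulate() are hypothetical names, and the boolean merely stands
in for PG_STATE_RECOVERING. The model assumes DeferRecovery is always
delivered ahead of AllReplicasRecovered, which is the problematic ordering.

```cpp
#include <cassert>
#include <deque>

// Toy model only: names mirror the commit message, not the real Ceph classes.
enum class State { Recovering, NotRecovering, Recovered };
enum class Event { AllReplicasRecovered, DeferRecovery };
enum class Outcome { Ok, BadEvent };

struct ToyPG {
  State state = State::Recovering;
  bool recovering_flag = true;  // stands in for PG_STATE_RECOVERING

  // Returns false when the current state has no handler for the event,
  // i.e. the "bad state machine event" case.
  bool react(Event ev) {
    switch (state) {
    case State::Recovering:
      if (ev == Event::DeferRecovery) {
        if (!recovering_flag) {
          return true;  // stale preemption: discard it, stay in Recovering
        }
        state = State::NotRecovering;
        return true;
      }
      state = State::Recovered;  // AllReplicasRecovered
      return true;
    case State::NotRecovering:
      // No handler for AllReplicasRecovered in this state.
      return ev != Event::AllReplicasRecovered;
    case State::Recovered:
      return true;  // the toy model ignores further events
    }
    return false;
  }
};

// Replays the racy ordering. When clear_flag_first is true, the flag is
// cleared before the completion event is queued, as the fix does.
Outcome simulate(bool clear_flag_first) {
  ToyPG pg;
  std::deque<Event> queue;

  // Recovery has completed; the completion event will be queued asynchronously.
  if (clear_flag_first) {
    pg.recovering_flag = false;
  }

  // The preemption event ends up ahead of the completion event.
  queue.push_back(Event::DeferRecovery);
  queue.push_back(Event::AllReplicasRecovered);

  for (Event ev : queue) {
    if (!pg.react(ev)) {
      return Outcome::BadEvent;
    }
  }
  return Outcome::Ok;
}
```

In this model, simulate(false) reproduces the unhandled-event outcome, while
simulate(true) lets the guard drop DeferRecovery so that AllReplicasRecovered
is still handled in Recovering.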

Fixes: https://tracker.ceph.com/issues/73314
Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
src/crimson/osd/pg_recovery.cc