author	Aishwarya Mathuria <amathuri@redhat.com>
	Tue, 14 Apr 2026 07:59:36 +0000 (13:29 +0530)
committer	Aishwarya Mathuria <amathuri@redhat.com>
	Wed, 15 Apr 2026 10:25:19 +0000 (10:25 +0000)
commit	278ace1bc4abf7a91a7e0932091353b19da11222
tree	71d90cd5d05468edb4bd94db6b3ba6a6be2574c5
parent	dc6913037fd9de70ddcca7b12ac9ac95ae738adc

crimson/osd: fix race between AllReplicasRecovered and DeferRecovery

Fixes a crash where an AllReplicasRecovered event arrives in the
NotRecovering state because async event delivery races with DeferRecovery
preemption.

The issue occurs when:
1. Recovery completes and AllReplicasRecovered is queued asynchronously
2. A higher priority operation (e.g., client I/O) triggers AsyncReserver
   to preempt recovery, posting a DeferRecovery event
3. DeferRecovery is processed first, transitioning the PG to NotRecovering
4. AllReplicasRecovered arrives in the wrong state → crash with "bad state
   machine event" because NotRecovering doesn't handle it
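
The four steps above can be replayed in a small, self-contained toy model.
This is only a sketch under stated assumptions: ToyPG, post(), post_front(),
drain() and replay_race() are hypothetical names invented for illustration,
and the event queue is a plain std::deque rather than crimson's real event
delivery. Only the state and event names come from this commit message.

```cpp
#include <cassert>
#include <deque>
#include <stdexcept>
#include <string>

namespace race_sketch {

// Toy stand-ins; not the real PeeringState types.
enum class State { Recovering, NotRecovering, Recovered };
enum class Event { AllReplicasRecovered, DeferRecovery };

struct ToyPG {
  State state = State::Recovering;
  std::deque<Event> queue;  // models asynchronously delivered events

  void post(Event e) { queue.push_back(e); }
  // Models an event that ends up being processed ahead of queued ones.
  void post_front(Event e) { queue.push_front(e); }

  void handle(Event e) {
    if (state == State::Recovering) {
      state = (e == Event::DeferRecovery) ? State::NotRecovering
                                          : State::Recovered;
      return;
    }
    // Toy assumption: outside Recovering, only AllReplicasRecovered is fatal.
    if (e == Event::AllReplicasRecovered) {
      throw std::logic_error("bad state machine event");
    }
  }

  void drain() {
    while (!queue.empty()) {
      Event e = queue.front();
      queue.pop_front();
      handle(e);
    }
  }
};

// Replays steps 1-4 and returns the error text, or "" if nothing failed.
inline std::string replay_race() {
  ToyPG pg;
  assert(pg.state == State::Recovering);  // precondition of the scenario
  pg.post(Event::AllReplicasRecovered);   // step 1: completion queued
  pg.post_front(Event::DeferRecovery);    // steps 2-3: preemption wins
  try {
    pg.drain();                           // step 4: event hits NotRecovering
  } catch (const std::logic_error& err) {
    return err.what();
  }
  return "";
}

}  // namespace race_sketch
```

In this model, race_sketch::replay_race() returns the "bad state machine
event" text, since DeferRecovery moves the toy PG to NotRecovering before
AllReplicasRecovered is handled.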

The fix follows Classic OSD's approach in PrimaryLogPG::start_recovery_ops():
clear PG_STATE_RECOVERING before posting recovery completion events. This
makes the existing safety check in PeeringState::Recovering::react() work:
when DeferRecovery arrives and sees !state_test(PG_STATE_RECOVERING), it
discards itself, preventing the state transition that would cause the crash.
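
The ordering described above can be sketched with a toy model. This is not
the patch itself: ToyPG, complete_recovery(), preempt(), drain() and
replay_with_fix() are hypothetical names, and the boolean recovering_flag
merely stands in for PG_STATE_RECOVERING. The sketch assumes that
DeferRecovery is processed ahead of the already queued completion event.

```cpp
#include <cassert>
#include <deque>
#include <stdexcept>

namespace fix_sketch {

// Toy stand-ins; not the real PeeringState types.
enum class State { Recovering, NotRecovering, Recovered };
enum class Event { AllReplicasRecovered, DeferRecovery };

struct ToyPG {
  State state = State::Recovering;
  bool recovering_flag = true;  // stands in for PG_STATE_RECOVERING
  std::deque<Event> queue;      // models asynchronously delivered events

  // Recovery completion: clear the flag first, then queue the event.
  void complete_recovery() {
    assert(recovering_flag);    // recovery must have been in progress
    recovering_flag = false;
    queue.push_back(Event::AllReplicasRecovered);
  }

  // Preemption: DeferRecovery ends up processed before queued events.
  void preempt() { queue.push_front(Event::DeferRecovery); }

  void handle(Event e) {
    if (state != State::Recovering) {
      if (e == Event::AllReplicasRecovered) {
        throw std::logic_error("bad state machine event");
      }
      return;
    }
    if (e == Event::DeferRecovery) {
      if (!recovering_flag) {
        return;  // safety check: recovery already finished, discard
      }
      state = State::NotRecovering;
    } else {
      state = State::Recovered;
    }
  }

  void drain() {
    while (!queue.empty()) {
      Event e = queue.front();
      queue.pop_front();
      handle(e);
    }
  }
};

// Same interleaving as the crash scenario, with the flag cleared early.
inline State replay_with_fix() {
  ToyPG pg;
  pg.complete_recovery();
  pg.preempt();
  pg.drain();
  return pg.state;
}

}  // namespace fix_sketch
```

With the flag cleared before the completion event is queued, the toy
DeferRecovery is discarded while still in Recovering, so
fix_sketch::replay_with_fix() ends in State::Recovered instead of throwing.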

Fixes: https://tracker.ceph.com/issues/73314
Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>