]> git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/commitdiff
osd/PG: fix DeferRecovery vs AllReplicasRecovered race 21706/head
authorSage Weil <sage@redhat.com>
Fri, 27 Apr 2018 20:00:58 +0000 (15:00 -0500)
committerSage Weil <sage@redhat.com>
Sun, 29 Apr 2018 21:00:41 +0000 (16:00 -0500)
- DeferRecovery event queued by AsyncReserver due to preemption
  event.  We are in Recovering state with RECOVERING bit set.
- We finish recovery, clear RECOVERING state bit, and queue
  AllReplicasRecovered from PrimaryLogPG::start_recovery_ops()
- DeferRecovery event arrives, moving us from Recovering -> NotRecovering
- AllReplciasRecovered event arrives, crashing us.

This is all hard to deal with because the events are queued and may
arrive later.  Solve the problem here by tolerating a delayed
DeferRecovery event: if the RECOVERING pg state bit isn't set, ignore
it (it's old).  The async reserver cancel events are unpredictable.

Fixes: http://tracker.ceph.com/issues/23860
Signed-off-by: Sage Weil <sage@redhat.com>
src/osd/PG.cc

index c4805e03ae8d3648be2151474a2c920f7bbdd1b4..c490c47709d23beee9360da9e7a8c7c29b024ab7 100644 (file)
@@ -7685,6 +7685,12 @@ boost::statechart::result
 PG::RecoveryState::Recovering::react(const DeferRecovery &evt)
 {
   PG *pg = context< RecoveryMachine >().pg;
+  if (!pg->state_test(PG_STATE_RECOVERING)) {
+    // we may have finished recovery and have an AllReplicasRecovered
+    // event queued to move us to the next state.
+    ldout(pg->cct, 10) << "got defer recovery but not recovering" << dendl;
+    return discard_event();
+  }
   ldout(pg->cct, 10) << "defer recovery, retry delay " << evt.delay << dendl;
   pg->state_set(PG_STATE_RECOVERY_WAIT);
   pg->osd->local_reserver.cancel_reservation(pg->info.pgid);