From: Sage Weil <sage@redhat.com>
Date: Mon, 25 Nov 2019 19:15:24 +0000 (-0600)
Subject: osd/PeeringState: clear LAGGY and WAIT states on exiting Started
X-Git-Tag: v15.1.0~743^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=7bbc724d99e998bf6e06c3d32dc68348ab6aa45a;p=ceph.git

osd/PeeringState: clear LAGGY and WAIT states on exiting Started

These flags were cleared only in recheck_readable(), so a flag set in a
prior interval could bleed into a new interval.  More dangerously, in a
mixed-version cluster, one interval might include only octopus+ OSDs
while the next might include a pre-octopus OSD, which bypasses most of
the laggy recheck code.  This could lead to a stalled request and/or a
requeue ordering bug when release_object_locks() looked at is_laggy()
and put a lock waiter on the waiting_for_readable list.
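
The bleed-through described above can be illustrated with a minimal,
standalone sketch. It is not the Ceph code: ToyPeeringState,
exit_started() and the bit values are hypothetical stand-ins, and only
the flag names and the state_clear() call mirror the patch.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bit values, chosen for illustration only (assumption);
// the real definitions live in the Ceph headers.
constexpr uint64_t PG_STATE_LAGGY = 1ULL << 0;
constexpr uint64_t PG_STATE_WAIT  = 1ULL << 1;

// Hypothetical stand-in for the PG state bitmask kept by PeeringState.
struct ToyPeeringState {
  uint64_t state = 0;

  void state_set(uint64_t mask)   { state |= mask; }
  void state_clear(uint64_t mask) { state &= ~mask; }

  bool is_laggy() const { return (state & PG_STATE_LAGGY) != 0; }
  bool is_wait() const  { return (state & PG_STATE_WAIT) != 0; }
};

// Hypothetical model of leaving Started at the end of an interval.
// with_fix == true applies the same clear that this patch adds.
inline void exit_started(ToyPeeringState& ps, bool with_fix) {
  if (with_fix) {
    ps.state_clear(PG_STATE_WAIT | PG_STATE_LAGGY);
  }
}
```

In this toy model, setting PG_STATE_LAGGY and then calling
exit_started(ps, false) leaves is_laggy() true for the next interval,
while exit_started(ps, true) resets both flags, so nothing in the new
interval has to run recheck_readable() to get rid of them.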

Fixes: https://tracker.ceph.com/issues/42978
Signed-off-by: Sage Weil <sage@redhat.com>
---

diff --git a/src/osd/PeeringState.cc b/src/osd/PeeringState.cc
index a9b5b4fde41d..a326dc28a4e3 100644
--- a/src/osd/PeeringState.cc
+++ b/src/osd/PeeringState.cc
@@ -4301,6 +4301,7 @@ void PeeringState::Started::exit()
   DECLARE_LOCALS;
   utime_t dur = ceph_clock_now() - enter_time;
   pl->get_peering_perf().tinc(rs_started_latency, dur);
+  ps->state_clear(PG_STATE_WAIT | PG_STATE_LAGGY);
 }
 
 /*--------Reset---------*/