git.apps.os.sepia.ceph.com Git - ceph-ci.git/commit
osd/PeeringState: fix pending want_acting vs osd offline race
author xie xingguo <xie.xingguo@zte.com.cn>
Thu, 12 Mar 2020 01:41:10 +0000 (09:41 +0800)
committer xie xingguo <xie.xingguo@zte.com.cn>
Tue, 17 Mar 2020 04:07:28 +0000 (12:07 +0800)
commit 155c44bfe068f749dfb186df47d954f402bf903e
tree 9cb485d17b814f20d29fb00702e6e5d239b60f84
parent ab2488773626d40fb1f8345a99f4bbd6712c0fd4

In general there are two scenarios in which we might call
choose_acting to post a pending want_acting change:

1) We are in the middle of peering, and we decide to select some
  peers other than the current acting set in order to continue serving
  client reads and writes.
  In this case, when any OSD from the pending want_acting set goes down,
  the primary will restart the peering process and tidy want_acting up
  properly (see PeeringState::Primary::exit()).

2) The PG is active, and we want to transition all successfully
  backfilled (or async-recovered) peers back into the up set.
  In this case, every want_acting member is guaranteed to come from
  either the current up set or the acting set (as we pass
  restrict_to_up_acting == true when calling down into choose_acting).

From 1, we know we'd never leak a want_acting set that might
contain stray peers into the Active state. From 2, we know that the
assert would effectively catch any Active caller of choose_acting
that fails to set restrict_to_up_acting properly.
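
The invariant behind rule 2 can be illustrated with a minimal sketch.
This is not the code in src/osd/PeeringState.cc: the helper names, the
use of plain int OSD ids, and the vector-based sets are all assumptions
made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not a Ceph function): true when every member of
// want_acting also appears in the up set or in the acting set.
bool within_up_or_acting(const std::vector<int>& want_acting,
                         const std::vector<int>& up,
                         const std::vector<int>& acting) {
  auto contains = [](const std::vector<int>& v, int osd) {
    return std::find(v.begin(), v.end(), osd) != v.end();
  };
  return std::all_of(want_acting.begin(), want_acting.end(),
                     [&](int osd) {
                       return contains(up, osd) || contains(acting, osd);
                     });
}

// Toy stand-in for the sanity check described above: aborts if a stray
// peer has leaked into want_acting while the PG is active.
void check_active_want_acting(const std::vector<int>& want_acting,
                              const std::vector<int>& up,
                              const std::vector<int>& acting) {
  assert(within_up_or_acting(want_acting, up, acting));
}
```

With up = {1, 3} and acting = {2, 4}, a want_acting of {1, 2} passes
the check, while {1, 5} would trip it because osd 5 is a stray.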

However, in 023524a I did introduce a third scenario that can
violate rule 2: we now call choose_acting with the
restrict_to_up_acting option off for any stray peer coming back to
life when the PG is active. If that peer goes down (again) while the
corresponding pg_temp change is still in-flight, we would reliably
fire the assert.

Fix by calling choose_acting again whenever Active sees a new map
that marks down a stray OSD in want_acting, so we don't leave
a dirty want_acting (and pg_temp) there.
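
The trigger condition for the fix can be sketched roughly as follows.
This is a simplified model, not the actual change: ToyOSDMap and
need_rechoose_acting are hypothetical names, and the real code works
on Ceph's own map and PG types rather than plain ints.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Hypothetical stand-in for an OSD map: it only records which OSDs
// the map marks down.
struct ToyOSDMap {
  std::set<int> down;
  bool is_down(int osd) const { return down.count(osd) > 0; }
};

// Hypothetical predicate: true when the new map marks down a member of
// want_acting that is a stray, i.e. in neither the up nor acting set.
// In that case the caller would run choose_acting again.
bool need_rechoose_acting(const ToyOSDMap& new_map,
                          const std::vector<int>& want_acting,
                          const std::vector<int>& up,
                          const std::vector<int>& acting) {
  auto contains = [](const std::vector<int>& v, int osd) {
    return std::find(v.begin(), v.end(), osd) != v.end();
  };
  return std::any_of(want_acting.begin(), want_acting.end(),
                     [&](int osd) {
                       bool stray = !contains(up, osd) &&
                                    !contains(acting, osd);
                       return stray && new_map.is_down(osd);
                     });
}
```

Only strays are checked here because, per scenario 1 and rule 2, a down
OSD that belongs to the up or acting set is already handled elsewhere.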

Fixes: https://tracker.ceph.com/issues/44507
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
src/osd/PeeringState.cc