author	xie xingguo <xie.xingguo@zte.com.cn>
	Fri, 24 Jul 2020 01:57:40 +0000 (09:57 +0800)
committer	xie xingguo <xie.xingguo@zte.com.cn>
	Sat, 25 Jul 2020 00:28:30 +0000 (08:28 +0800)
commit	10eff2567971ca57b1e821f704de490add021c8e
tree	85389d8fce8884a73180e906b9a1ae201999e369
parent	8c1a077e560248760ac441f315b84304aa693e72

osd/PeeringState: prevent peer's num_objects going negative

Saw it in a teuthology run:

-5645> 2020-07-20 04:34:32.067 7f351e329700 5 osd.5 pg_epoch: 667 ... exit Started/Primary/Active/Backfilling
-5642> 2020-07-20 04:34:32.067 7f351e329700 5 osd.5 pg_epoch: 667 ... enter Started/Primary/Active/Recovered
-5633> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 ... _update_calc_stats shard 5 primary objects 0 missing 0
-5632> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 ... _update_calc_stats shard 3 objects -1 missing 1
-5631> 2020-07-20 04:34:32.067 7f351e329700 20 osd.5 pg_epoch: 667 ... _update_calc_stats shard 6 objects 0 missing 0

This will crash the choose_acting() procedure, because it mistakenly
concludes that peer 3 still needs asynchronous recovery (due to
num_objects_missing = 1) when it has in fact been fully recovered
by backfill.

While I did not dig into the root cause, there are a couple of
possible explanations for how num_objects can be off. If a roll
forward or log replay deleted something twice, the result could be
an undercount. Or it could be something as simple as corruption.

Since _update_calc_stats() is going to fix num_objects_missing
for that peer anyway, let's make sure it always starts with a
clean state.
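
A minimal sketch of the idea, not the actual patch: the helper name
estimate_missing() and its clamp_peer flag are hypothetical, and the
formula (primary count minus peer count) is only inferred from the
log lines above, where primary objects 0 and peer objects -1 yield
missing 1. Clamping the peer's count to zero first removes the
spurious missing object.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only (not Ceph's real code).
// Estimates how many objects a peer is missing relative to the primary.
// Assumption: missing = primary_objects - peer_objects, floored at 0.
int64_t estimate_missing(int64_t primary_objects,
                         int64_t peer_objects,
                         bool clamp_peer)
{
  // The sketch assumes the primary's own count is sane.
  assert(primary_objects >= 0);

  if (clamp_peer) {
    // Start from a clean state: a negative object count is meaningless,
    // so treat it as zero before computing the difference.
    peer_objects = std::max<int64_t>(0, peer_objects);
  }
  return std::max<int64_t>(0, primary_objects - peer_objects);
}
```

With the values from the log, estimate_missing(0, -1, false) gives 1
(the bogus state seen on shard 3), while estimate_missing(0, -1, true)
gives 0.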

Fixes: https://tracker.ceph.com/issues/46705
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>