On detecting a corrupted object, primary may automatically
repair that object by leveraging the existing recovery procedure,
which turned out to be racy with a previous unfinished _finish_recovery
callback - the problem would then be that _finish_recovery might
continue to purge some strays that we still want to pull data from.
Fix by re-checking if there are any newly added missing objects when
executing _finish_recovery.
Note that before https://github.com/ceph/ceph/pull/29756 we might
instead have to call needs_recovery to catch the race condition
since we did not evict pg from clean state when triggering an auto-repair..
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
void PG::_finish_recovery(Context *c)
{
std::scoped_lock locker{*this};
- // When recovery is initiated by a repair, that flag is left on
- state_clear(PG_STATE_REPAIR);
- if (recovery_state.is_deleting()) {
+ if (recovery_state.is_deleting() || !is_clean()) {
+ dout(10) << __func__ << " raced with delete or repair" << dendl;
return;
}
+ // When recovery is initiated by a repair, that flag is left on
+ state_clear(PG_STATE_REPAIR);
if (c == finish_sync_event) {
dout(10) << "_finish_recovery" << dendl;
finish_sync_event = 0;