From fd3c78bdb2321cd72ce8fb077aba3f191da9d683 Mon Sep 17 00:00:00 2001
From: Bill Scales
Date: Wed, 18 Jun 2025 12:11:51 +0100
Subject: [PATCH] osd: Optimized EC pools bug fix when repeating GetLog

When the primary shard of an optimized EC pool does not have a copy of
the log, it may need to run the GetLog peering step twice: the first
time to get a full copy of the log from a shard that sees all log
entries, and a second time to get the "best" log from a nonprimary
shard, which may have a partial log due to partial writes.

A side effect of repeating GetLog is that the missing list is collected
for both the "best" shard and the shard that provides a full copy of
the log. This latter missing list confuses later steps in the peering
process and may cause this shard to complete writes and end up
diverging from the primary.

Discarding this missing list causes Peering to behave the same as if
the GetLog step did not need to be repeated.

Signed-off-by: Bill Scales
(cherry picked from commit d2ba0f932c61746b51d0d427056a53e24db6ea0f)
---
 src/osd/PeeringState.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/osd/PeeringState.cc b/src/osd/PeeringState.cc
index cba770b1115..925a05610b2 100644
--- a/src/osd/PeeringState.cc
+++ b/src/osd/PeeringState.cc
@@ -7484,7 +7484,9 @@ boost::statechart::result PeeringState::GetLog::react(const GotLog&)
     // Our log was behind that of the auth_log_shard which was a non-primary
     // with a sparse log. We have just got a log from a primary shard to
     // catch up and now need to recheck if we need to rollback the log to
-    // the auth_log_shard
+    // the auth_log_shard. Discard the received missing log as this
+    // may not be consistent with the authoritative log
+    ps->peer_missing.erase(auth_log_shard);
     psdout(10) << "repeating auth_log_shard selection" << dendl;
     post_event(RepeatGetLog());
     return discard_event();
-- 
2.39.5