From fd3c78bdb2321cd72ce8fb077aba3f191da9d683 Mon Sep 17 00:00:00 2001
From: Bill Scales <bill_scales@uk.ibm.com>
Date: Wed, 18 Jun 2025 12:11:51 +0100
Subject: [PATCH] osd: Optimized EC pools bug fix when repeating GetLog

When the primary shard of an optimized EC pool does not have
a copy of the log it may need to run the GetLog peering
step twice: the first time to get a full copy of the log from
a shard that sees all log entries, and then a second time
to get the "best" log from a nonprimary shard which may
have a partial log due to partial writes.

A side effect of repeating GetLog is that the missing
list is collected for both the "best" shard and the
shard that provides a full copy of the log. The latter
missing list confuses later steps in the peering
process and may cause this shard to complete writes
and end up diverging from the primary. Discarding
this missing list causes Peering to behave the same as if
the GetLog step did not need to be repeated.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d2ba0f932c61746b51d0d427056a53e24db6ea0f)
---
 src/osd/PeeringState.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/osd/PeeringState.cc b/src/osd/PeeringState.cc
index cba770b1115..925a05610b2 100644
--- a/src/osd/PeeringState.cc
+++ b/src/osd/PeeringState.cc
@@ -7484,7 +7484,9 @@ boost::statechart::result PeeringState::GetLog::react(const GotLog&)
       // Our log was behind that of the auth_log_shard which was a non-primary
       // with a sparse log. We have just got a log from a primary shard to
       // catch up and now need to recheck if we need to rollback the log to
-      // the auth_log_shard
+      // the auth_log_shard. Discard the received missing list as it
+      // may not be consistent with the authoritative log
+      ps->peer_missing.erase(auth_log_shard);
       psdout(10) << "repeating auth_log_shard selection" << dendl;
       post_event(RepeatGetLog());
       return discard_event();
-- 
2.39.5