From ec83507111ce80e777f85410617c9c54842cd2fb Mon Sep 17 00:00:00 2001
From: Alex Ainscow <aainscow@uk.ibm.com>
Date: Mon, 14 Jul 2025 23:57:49 +0100
Subject: [PATCH] osd: Always send EC messages to all shards following error.

Explanation of bug which is being fixed:

Log entry 204'784 is an error - "_rollback_to deleting head on smithi17019940-573 because
got ENOENT|whiteout on find_object_context" so this log entry is generated outside of EC by
PrimaryLogPG. It should be applied to all shards, however osd 13(2) was a little slow and
the update got interrupted by a new epoch so it didn't apply it. All the other shards
marked it as applied and completed (there isn't the usual interlock that EC has of making
sure all shards apply the update before any complete it).

We then processed 4 partial writes applying and completing them (they didn't update osd
13(2)), then we have a new epoch and go through peering. Peering says osd 13(2) didn't see
update 204'784 (it didn't) and therefore the error log entry and the 4 partial writes need
to be rolled back. The other shards had completed those 4 partial writes so we end up with
4 missing objects on all the shards which become unfound objects.

I think the underlying bug means that log entry 204'784 isn't really complete and may
"disappear" from the log in a subsequent peering cycle. Trying to forcefully rollback a
logged error doesn't generate a missing object or a miscompare, so the consequences of the
bug are hidden. It is however tripping up the new EC code where proc_master_log is being
much stricter about what a completed write means.

Fix:
After generating a logged error we could force the next write to EC to update metadata on
all shards even if its a partial write. This means this write won't complete unless all
shards see the logged error. This will make new EC behave the same as old EC. There is
already an interlock with EC (call_write_ordered) which is called just before generating
the log error that ensures that any in-flight writes complete before submitting the log
error. We could set a boolean flag here (at the point call_write_ordered is called is fine,
don't need to wait for the callback) to say the next write has to be to all shards. The
flag can be cleared if we generate the transactions for the next write, or we get an
on_change notification (peering will then clear up the mess)

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 4948f74331c13cd93086b057e0f25a59573e3167)
---
 src/osd/ECCommon.cc | 9 ++++++++-
 src/osd/ECCommon.h  | 1 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/osd/ECCommon.cc b/src/osd/ECCommon.cc
index b81bb5ffc6e..34304f932f9 100644
--- a/src/osd/ECCommon.cc
+++ b/src/osd/ECCommon.cc
@@ -791,12 +791,15 @@ void ECCommon::RMWPipeline::cache_ready(Op &op) {
      * As such we must never skip a transaction completely.  Note that if
      * should_send is false, then an empty transaction is sent.
      */
-    if (should_send && op.skip_transaction(pending_roll_forward, shard, transaction)) {
+    if (!next_write_all_shards && should_send && op.skip_transaction(pending_roll_forward, shard, transaction)) {
       // Must be an empty transaction
       ceph_assert(transaction.empty());
       dout(20) << __func__ << " Skipping transaction for shard " << shard << dendl;
       continue;
     }
+    if (!should_send || transaction.empty()) {
+      dout(20) << __func__ << " Sending empty transaction for shard " << shard << dendl;
+    }
     op.pending_commits++;
     const pg_stat_t &stats =
         (should_send || !backfill_shards.contains(pg_shard))
@@ -845,6 +848,8 @@ void ECCommon::RMWPipeline::cache_ready(Op &op) {
     }
   }
 
+  next_write_all_shards = false;
+
   if (!messages.empty()) {
     get_parent()->send_message_osd_cluster(messages, get_osdmap_epoch());
   }
@@ -956,6 +961,7 @@ void ECCommon::RMWPipeline::on_change() {
   tid_to_op_map.clear();
   oid_to_version.clear();
   waiting_commit.clear();
+  next_write_all_shards = false;
 }
 
 void ECCommon::RMWPipeline::on_change2() {
@@ -963,5 +969,6 @@ void ECCommon::RMWPipeline::on_change2() {
 }
 
 void ECCommon::RMWPipeline::call_write_ordered(std::function<void(void)> &&cb) {
+  next_write_all_shards = true;
   extent_cache.add_on_write(std::move(cb));
 }
\ No newline at end of file
diff --git a/src/osd/ECCommon.h b/src/osd/ECCommon.h
index b564acbf7b3..2a4dc63de67 100644
--- a/src/osd/ECCommon.h
+++ b/src/osd/ECCommon.h
@@ -612,6 +612,7 @@ struct ECCommon {
     ECCommon &ec_backend;
     ECExtentCache extent_cache;
     uint64_t ec_pdw_write_mode;
+    bool next_write_all_shards = false;
 
     RMWPipeline(CephContext *cct,
                 ceph::ErasureCodeInterfaceRef ec_impl,
-- 
2.39.5