From: Bill Scales
Date: Wed, 27 Aug 2025 13:44:08 +0000 (+0100)
Subject: osd: Optimized EC incorrectly rolled backwards write
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=4eb4b5e01ed0a96ec20022ab106b9f2fe8292cf7;p=ceph-ci.git

osd: Optimized EC incorrectly rolled backwards write

A bug in choose_acting in this scenario:

* The current primary shard has been absent and so has missed the latest
  few writes
* All the recent writes are partial writes that have not updated shard X
* All the recent writes have completed

The authoritative shard is chosen from the set of primary-capable shards
that have the highest last epoch started; these all have log entries for
the recent writes. The get log shard is chosen from the set of all shards
that have the highest last epoch started; this chooses shard X because it
is furthest behind.

The primary shard's last_update is not less than the get log shard's
last_update, so this if statement decides that the primary has a good
enough log:

    if ((repeat_getlog != nullptr) &&
        get_log_shard != all_info.end() &&
        (info.last_update < get_log_shard->second.last_update) &&
        pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {

We then proceed through peering using the primary's log and the log from
shard X. Neither has details about the recent writes, which are then
incorrectly rolled back.

The if statement should be looking at last_update of the authoritative
shard rather than the get_log_shard. The code would then realize that it
needs to get the log from the authoritative shard first and then make a
second pass where it gets the log from the get log shard. Peering would
then have information about the partial writes (obtained from the
authoritative shard's log) and could correctly roll these writes forward
by deducing that the get_log_shard did not have these log entries because
they were partial writes.
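For illustration, the corrected check compares the primary's last_update
against the auth_log_shard rather than the get_log_shard. This is a
condensed sketch mirroring the added hunk below, reusing the variable
names from PeeringState::choose_acting, not the verbatim patched code:

    if ((repeat_getlog != nullptr) &&
        (info.last_update < auth_log_shard->second.last_update) &&
        pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
      // Primary log is behind the authoritative shard and the chosen
      // get_log_shard is non-primary, so its log may be sparse: fetch
      // the auth log first, then repeat this step so partial writes can
      // be rolled forward instead of backwards.
      get_log_shard = auth_log_shard;
      *repeat_getlog = true;
    }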
Signed-off-by: Bill Scales
(cherry picked from commit ac4e0926bbac4ee4d8e33110b8a434495d730770)
---
diff --git a/src/osd/PeeringState.cc b/src/osd/PeeringState.cc
index 3a06323b774..c7189b4beef 100644
--- a/src/osd/PeeringState.cc
+++ b/src/osd/PeeringState.cc
@@ -2511,26 +2511,6 @@ bool PeeringState::choose_acting(pg_shard_t &get_log_shard_id,
   auto get_log_shard = find_best_info(all_info, restrict_to_up_acting,
                                       false, history_les_bound);
 
-  if ((repeat_getlog != nullptr) &&
-      get_log_shard != all_info.end() &&
-      (info.last_update < get_log_shard->second.last_update) &&
-      pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
-    // Only EC pools with ec_optimizations enabled:
-    // Our log is behind that of the get_log_shard which is a
-    // non-primary shard and hence may have a sparse log,
-    // get a complete log from the auth_log_shard first then
-    // repeat this step in the state machine to work out what
-    // has to be rolled backwards
-    psdout(10) << "get_log_shard " << get_log_shard->first
-               << " is ahead but is a non_primary shard" << dendl;
-    if (auth_log_shard != all_info.end()) {
-      psdout(10) << "auth_log_shard " << auth_log_shard->first
-                 << " selected instead" << dendl;
-      get_log_shard = auth_log_shard;
-      *repeat_getlog = true;
-    }
-  }
-
   if ((auth_log_shard == all_info.end()) ||
       (get_log_shard == all_info.end())) {
     if (up != acting) {
@@ -2546,6 +2526,23 @@ bool PeeringState::choose_acting(pg_shard_t &get_log_shard_id,
     return false;
   }
 
+  if ((repeat_getlog != nullptr) &&
+      (info.last_update < auth_log_shard->second.last_update) &&
+      pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
+    // Only EC pools with ec_optimizations enabled:
+    // Our log is behind that of the auth_log_shard and
+    // the get log shard is a non-primary shard and hence may have
+    // a sparse log, get a complete log from the auth_log_shard
+    // first then repeat this step in the state machine to work
+    // out what has to be rolled backwards
+    psdout(10) << "get_log_shard " << get_log_shard->first
+               << " is ahead but is a non_primary shard" << dendl;
+    psdout(10) << "auth_log_shard " << auth_log_shard->first
+               << " selected instead" << dendl;
+    get_log_shard = auth_log_shard;
+    *repeat_getlog = true;
+  }
+
   ceph_assert(!auth_log_shard->second.is_incomplete());
 
   get_log_shard_id = get_log_shard->first;