From: Bill Scales
Date: Wed, 27 Aug 2025 13:44:08 +0000 (+0100)
Subject: osd: Optimized EC incorrectly rolled backwards write
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=4eb4b5e01ed0a96ec20022ab106b9f2fe8292cf7;p=ceph-ci.git

osd: Optimized EC incorrectly rolled backwards write

A bug in choose_acting in this scenario:

* The current primary shard has been absent and so has missed the latest
  few writes
* All the recent writes are partial writes that have not updated shard X
* All the recent writes have completed

The authoritative shard is chosen from the set of primary-capable shards
that have the highest last epoch started; these all have log entries for
the recent writes. The get log shard is chosen from the set of all shards
that have the highest last epoch started; this chooses shard X because it
is furthest behind.

The primary shard's last_update is not less than the get log shard's
last_update, so this if statement decides that the primary has a good
enough log:

    if ((repeat_getlog != nullptr) &&
        get_log_shard != all_info.end() &&
        (info.last_update < get_log_shard->second.last_update) &&
        pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {

We then proceed through peering using the primary's log and the log from
shard X. Neither has details about the recent writes, which are then
incorrectly rolled back.

The if statement should be looking at last_update of the authoritative
shard rather than the get_log_shard. The code would then realize that it
needs to get the log from the authoritative shard first and then make a
second pass where it gets the log from the get log shard. Peering would
then have information about the partial writes (obtained from the
authoritative shard's log) and could correctly roll these writes forward
by deducing that the get_log_shard did not have these log entries because
they were partial writes.
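For illustration, the corrected check compares the primary's last_update
against the auth_log_shard rather than the get_log_shard. This is a
condensed sketch mirroring the added hunk below, reusing the variable
names from PeeringState::choose_acting, not the verbatim patched code:

    if ((repeat_getlog != nullptr) &&
        (info.last_update < auth_log_shard->second.last_update) &&
        pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
      // Primary log is behind the authoritative shard and the chosen
      // get_log_shard is non-primary, so its log may be sparse: fetch
      // the auth log first, then repeat this step so partial writes can
      // be rolled forward instead of backwards.
      get_log_shard = auth_log_shard;
      *repeat_getlog = true;
    }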
Signed-off-by: Bill Scales
(cherry picked from commit ac4e0926bbac4ee4d8e33110b8a434495d730770)
---
diff --git a/src/osd/PeeringState.cc b/src/osd/PeeringState.cc
index 3a06323b774..c7189b4beef 100644
--- a/src/osd/PeeringState.cc
+++ b/src/osd/PeeringState.cc
@@ -2511,26 +2511,6 @@ bool PeeringState::choose_acting(pg_shard_t &get_log_shard_id,
   auto get_log_shard = find_best_info(all_info, restrict_to_up_acting,
                                       false, history_les_bound);
 
-  if ((repeat_getlog != nullptr) &&
-      get_log_shard != all_info.end() &&
-      (info.last_update < get_log_shard->second.last_update) &&
-      pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
-    // Only EC pools with ec_optimizations enabled:
-    // Our log is behind that of the get_log_shard which is a
-    // non-primary shard and hence may have a sparse log,
-    // get a complete log from the auth_log_shard first then
-    // repeat this step in the state machine to work out what
-    // has to be rolled backwards
-    psdout(10) << "get_log_shard " << get_log_shard->first
-               << " is ahead but is a non_primary shard" << dendl;
-    if (auth_log_shard != all_info.end()) {
-      psdout(10) << "auth_log_shard " << auth_log_shard->first
-                 << " selected instead" << dendl;
-      get_log_shard = auth_log_shard;
-      *repeat_getlog = true;
-    }
-  }
-
   if ((auth_log_shard == all_info.end()) ||
       (get_log_shard == all_info.end())) {
     if (up != acting) {
@@ -2546,6 +2526,23 @@ bool PeeringState::choose_acting(pg_shard_t &get_log_shard_id,
     return false;
   }
 
+  if ((repeat_getlog != nullptr) &&
+      (info.last_update < auth_log_shard->second.last_update) &&
+      pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
+    // Only EC pools with ec_optimizations enabled:
+    // Our log is behind that of the auth_log_shard and
+    // the get log shard is a non-primary shard and hence may have
+    // a sparse log, get a complete log from the auth_log_shard
+    // first then repeat this step in the state machine to work
+    // out what has to be rolled backwards
+    psdout(10) << "get_log_shard " << get_log_shard->first
+               << " is ahead but is a non_primary shard" << dendl;
+    psdout(10) << "auth_log_shard " << auth_log_shard->first
+               << " selected instead" << dendl;
+    get_log_shard = auth_log_shard;
+    *repeat_getlog = true;
+  }
+
   ceph_assert(!auth_log_shard->second.is_incomplete());
 
   get_log_shard_id = get_log_shard->first;