From ec83507111ce80e777f85410617c9c54842cd2fb Mon Sep 17 00:00:00 2001 From: Alex Ainscow Date: Mon, 14 Jul 2025 23:57:49 +0100 Subject: [PATCH] osd: Always send EC messages to all shards following error. Explanation of bug which is being fixed: Log entry 204'784 is an error - "_rollback_to deleting head on smithi17019940-573 because got ENOENT|whiteout on find_object_context" so this log entry is generated outside of EC by PrimaryLogPG. It should be applied to all shards, however osd 13(2) was a little slow and the update got interrupted by a new epoch so it didn't apply it. All the other shards marked it as applied and completed (there isn't the usual interlock that EC has of making sure all shards apply the update before any complete it). We then processed 4 partial writes applying and completing them (they didn't update osd 13(2)), then we have a new epoch and go through peering. Peering says osd 13(2) didn't see update 204'784 (it didn't) and therefore the error log entry and the 4 partial writes need to be rolled back. The other shards had completed those 4 partial writes so we end up with 4 missing objects on all the shards which become unfound objects. I think the underlying bug means that log entry 204'784 isn't really complete and may "disappear" from the log in a subsequent peering cycle. Trying to forcefully rollback a logged error doesn't generate a missing object or a miscompare, so the consequences of the bug are hidden. It is however tripping up the new EC code where proc_master_log is being much stricter about what a completed write means. Fix: After generating a logged error we could force the next write to EC to update metadata on all shards even if its a partial write. This means this write won't complete unless all shards see the logged error. This will make new EC behave the same as old EC. There is already an interlock with EC (call_write_ordered) which is called just before generating the log error that ensures that any in-flight writes complete before submitting the log error. We could set a boolean flag here (at the point call_write_ordered is called is fine, don't need to wait for the callback) to say the next write has to be to all shards. The flag can be cleared if we generate the transactions for the next write, or we get an on_change notification (peering will then clear up the mess) Signed-off-by: Alex Ainscow (cherry picked from commit 4948f74331c13cd93086b057e0f25a59573e3167) --- src/osd/ECCommon.cc | 9 ++++++++- src/osd/ECCommon.h | 1 + 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/src/osd/ECCommon.cc b/src/osd/ECCommon.cc index b81bb5ffc6e..34304f932f9 100644 --- a/src/osd/ECCommon.cc +++ b/src/osd/ECCommon.cc @@ -791,12 +791,15 @@ void ECCommon::RMWPipeline::cache_ready(Op &op) { * As such we must never skip a transaction completely. Note that if * should_send is false, then an empty transaction is sent. */ - if (should_send && op.skip_transaction(pending_roll_forward, shard, transaction)) { + if (!next_write_all_shards && should_send && op.skip_transaction(pending_roll_forward, shard, transaction)) { // Must be an empty transaction ceph_assert(transaction.empty()); dout(20) << __func__ << " Skipping transaction for shard " << shard << dendl; continue; } + if (!should_send || transaction.empty()) { + dout(20) << __func__ << " Sending empty transaction for shard " << shard << dendl; + } op.pending_commits++; const pg_stat_t &stats = (should_send || !backfill_shards.contains(pg_shard)) @@ -845,6 +848,8 @@ void ECCommon::RMWPipeline::cache_ready(Op &op) { } } + next_write_all_shards = false; + if (!messages.empty()) { get_parent()->send_message_osd_cluster(messages, get_osdmap_epoch()); } @@ -956,6 +961,7 @@ void ECCommon::RMWPipeline::on_change() { tid_to_op_map.clear(); oid_to_version.clear(); waiting_commit.clear(); + next_write_all_shards = false; } void ECCommon::RMWPipeline::on_change2() { @@ -963,5 +969,6 @@ void ECCommon::RMWPipeline::on_change2() { } void ECCommon::RMWPipeline::call_write_ordered(std::function &&cb) { + next_write_all_shards = true; extent_cache.add_on_write(std::move(cb)); } \ No newline at end of file diff --git a/src/osd/ECCommon.h b/src/osd/ECCommon.h index b564acbf7b3..2a4dc63de67 100644 --- a/src/osd/ECCommon.h +++ b/src/osd/ECCommon.h @@ -612,6 +612,7 @@ struct ECCommon { ECCommon &ec_backend; ECExtentCache extent_cache; uint64_t ec_pdw_write_mode; + bool next_write_all_shards = false; RMWPipeline(CephContext *cct, ceph::ErasureCodeInterfaceRef ec_impl, -- 2.39.5