crimson/osd: reject Seastore PG merges across shards
Seastore cannot currently merge collections across reactor shards.
On cross-shard detection, tell the monitor the source PG is not ready
(via MOSDPGReadyToMerge{ ready=false }) so the unsafe pg_num decrement
is never proposed, then send MOSDPGStopMerge to clamp pg_num_target and
permanently disable further shrink for the pool.
crimson/osd: implement PG merge detection and orchestration in PGAdvanceMap
Integrate PG merge handling into the map advancement pipeline.
When pg_num shrinks between epochs, check_for_merges() returns a
merge_result_t describing whether this PG is a merge source, target, or
not involved. start() stops advancing through later epochs once a merge
is detected, then either finish_merge_advance() or the normal activate
path runs so complete_rctx() always happens in one place.
- check_for_merges(): detect pg_num shrink and dispatch merge_pg().
- merge_pg(): merge-only work — Seastore eligibility, source handoff
setup, target rendezvous collection and PG::merge_from().
- finish_merge_advance(): commit rctx and complete the role-specific
steps (source: complete_rctx, stop, register_merge_source; target:
handle_advance_map, handle_activate_map, complete_rctx).
Add PG::merge_from to execute the merge of source PGs into a target PG.
This function builds a transaction to remove source-specific metadata
objects and merge source collections into the target collection.
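The role detection described above can be sketched as follows. This is an illustration only, not the crimson code: MergeRole, merge_parent() and classify() are hypothetical names, and the parent rule (clear the highest set bit of the seed until it falls below the new pg_num) is an assumption about how a merge source maps to its target.

```python
from enum import Enum
from typing import Optional, Tuple


class MergeRole(Enum):
    NOT_INVOLVED = "not_involved"
    SOURCE = "source"
    TARGET = "target"


def merge_parent(seed: int, new_pg_num: int) -> int:
    """Assumed parent rule: drop the top set bit until seed < new_pg_num."""
    if new_pg_num < 1:
        raise ValueError("new_pg_num must be at least 1")
    while seed >= new_pg_num:
        seed &= ~(1 << (seed.bit_length() - 1))
    return seed


def classify(seed: int, old_pg_num: int,
             new_pg_num: int) -> Tuple[MergeRole, Optional[int]]:
    """Return (role, target seed) for one PG when pg_num shrinks."""
    if new_pg_num >= old_pg_num or seed >= old_pg_num:
        return MergeRole.NOT_INVOLVED, None      # no shrink, or unknown PG
    if seed >= new_pg_num:
        # This PG disappears: it is a source and folds into its parent.
        return MergeRole.SOURCE, merge_parent(seed, new_pg_num)
    # A surviving PG is a target if any disappearing seed folds into it.
    for source in range(new_pg_num, old_pg_num):
        if merge_parent(source, new_pg_num) == seed:
            return MergeRole.TARGET, None
    return MergeRole.NOT_INVOLVED, None
```

With pg_num going from 8 to 7, seed 7 would be classified as a source whose target is seed 3, seed 3 as a target, and every other seed as not involved.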
crimson/osd: per-PG rendezvous for cross-shard merge source handoff
Add infrastructure so source PGs can be extracted from their birth
shard, moved to the target shard, and collected by the target PG before
merge proceeds.
Cross-shard safety: PGs are tied to their birth_shard for destruction.
register_merge_source() uses extract_pg() to detach the source,
seastar::foreign_ptr to hop cores, and
crimson::local_shared_foreign_ptr on the target so release routes
destruction back to the birth shard.
Synchronization: replace the per-shard ShardServices merge_info_t
registry (shared_promise waiters, ready_pgs staging, and cleanup hooks)
with merge state on the target PG itself. Source-side
register_merge_source() delivers PGs via PG::add_merge_source(); the
target waits in PG::collect_merge_sources(n) on a per-PG semaphore.
Duplicate source registrations are ignored. PG::stop() breaks the
semaphore so shutdown does not hang.
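A minimal sketch of the rendezvous semantics, assuming a thread-based model rather than Seastar futures. It uses a threading.Condition in place of the per-PG semaphore, and the MergeRendezvous class is hypothetical. Only the method names add_merge_source(), collect_merge_sources() and stop() come from the description above.

```python
import threading


class MergeRendezvous:
    """Hypothetical stand-in for the merge state kept on the target PG."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sources = {}       # pgid -> source PG object
        self._stopped = False

    def add_merge_source(self, pgid, pg):
        """Deliver one source; duplicates and late arrivals are ignored."""
        with self._cond:
            if self._stopped or pgid in self._sources:
                return False
            self._sources[pgid] = pg
            self._cond.notify_all()
            return True

    def collect_merge_sources(self, n, timeout=None):
        """Block until n distinct sources have arrived or stop() is called."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: self._stopped or len(self._sources) >= n, timeout)
            if self._stopped:
                raise RuntimeError("rendezvous broken by stop()")
            if not ready:
                raise TimeoutError("merge sources did not arrive in time")
            return dict(self._sources)

    def stop(self):
        """Wake any waiter so shutdown does not hang."""
        with self._cond:
            self._stopped = True
            self._cond.notify_all()
```

The point of the design is that the waiter and the state it waits on live in the same object, so there is no separate registry to clean up when the target goes away.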
ShardServices::register_merge_source() and extract_pg() live in
shard_services; rendezvous types and methods live on PG.
This function ensures that when a PG being removed or merged calls
stop(), it clears primary state and notifies the Monitor to clear any
pending merge flags.
It will also call client_request_orderer.clear_and_cancel() ensuring
all remaining client requests are properly completed. This is needed
for merging in particular since on_change() is never called for the merge epoch
(handle_advance_map is skipped after merge detection), so
clear_and_cancel() is never invoked on the source PG's orderer.
crimson/osd/shard_services: inherit from peering_sharded_service
Update ShardServices to inherit from seastar::peering_sharded_service.
This allows the service to access its own sharded container directly
via container() rather than manually storing a reference to it.
crimson/osd: Add functions to notify mon when PGs are ready to merge
A PG is in the pending merge state when its ID is >= pg_num_pending and
< pg_num. In that state IO is paused, and once the PG peers we notify
the mon that it is idle and safe to merge.
Use Gated for merge notify callbacks.
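The range check and the notify condition can be sketched as below. Both function names are hypothetical, and the sketch covers only the condition stated above, not the messaging itself.

```python
def is_pending_merge(pg_id: int, pg_num: int, pg_num_pending: int) -> bool:
    """True when pg_num_pending <= pg_id < pg_num."""
    return pg_num_pending <= pg_id < pg_num


def should_notify_ready_to_merge(pg_id: int, pg_num: int,
                                 pg_num_pending: int, peered: bool) -> bool:
    """Notify the mon only for a pending-merge PG that has finished peering."""
    return peered and is_pending_merge(pg_id, pg_num, pg_num_pending)
```

For example, with pg_num = 8 and pg_num_pending = 7, only PG 7 is pending merge.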
Patrick Donnelly [Wed, 10 Jun 2026 18:30:59 +0000 (14:30 -0400)]
Merge PR #66726 into main
* refs/pull/66726/head:
doc: Update documentation to reflect new functionality
test: Add integration tests for EC Omap operations and recovery
osd: Hook up omap operations in EC pools
osd: Allow for recovery of OMAP header and entries in EC pools
doc: Write design document to explain the reasoning behind implementing this feature
osd: Introduce functions required for EC OMAP support
osd: Add ECOmapJournal class and relocate OmapUpdateType enum class
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Ronen Friedman [Wed, 3 Jun 2026 05:40:25 +0000 (05:40 +0000)]
crimson/osd: move get_max_object_size() to store level
is_offset_and_length_valid() called get_sharded_store() locally to
obtain the store-specific max_object_size. On alien cores (where
smp::count > store_shard_nums), the local store is inactive and the
call hits assert(shard_store.get_status() == true).
The max object size is a property of the store, not of a store shard,
so there is no reason to acquire the store shard to obtain it.
Instead, a get_max_object_size() method is added to the Store interface.
Sun Yuechi [Wed, 10 Jun 2026 00:13:53 +0000 (08:13 +0800)]
cmake: disable Catch2 tests when Catch2 is unavailable
debhelper on noble passes -DFETCHCONTENT_FULLY_DISCONNECTED=ON, so CPM
cannot fetch Catch2 and silently skips it, leaving no Catch2 targets
behind and breaking the generate step. Fall back to WITH_CATCH2=OFF
with a warning instead.
dheart [Tue, 9 Jun 2026 13:27:14 +0000 (21:27 +0800)]
os/bluestore: prevent reallocation and corruption when shared_blob key is missing/undecodable
When the shared_blob key is missing or fails to decode,
it is necessary to scan the blob's pextents directly as the sole authoritative source
to verify allocated blocks and prevent double-allocation.
Emmanuel Ameh [Tue, 9 Jun 2026 12:40:03 +0000 (13:40 +0100)]
doc/man: Remove stale EOL release names from deprecation notices
ceph.rst: "osd create" deprecation notice cited "the Luminous release"
(2017, EOL 2020). Update to a plain deprecation statement directing
users to the replacement command (osd new).
rbd.rst: cephx_require_signatures option deprecation cited "the Bobtail
release" (2013, EOL 2015) as context for why the option is deprecated.
Remove the EOL release name; retain the deprecation warning. Fix the
companion nocephx_require_signatures notice for consistency ("in a
future release" instead of "in the future").
Matty Williams [Mon, 18 May 2026 09:09:32 +0000 (10:09 +0100)]
osd: Hook up omap operations in EC pools
Add pool flag to determine if omap operations are supported in a pool.
- Currently disabled in EC pools (will later be enabled for Fast EC pools)
Require all OSDs to run the umbrella release or later to enable the pool flag.
Change recovery reads to use journal updates.
Clear the journal for a new epoch.
Set omap_complete accurately before recovery.
Encode omap updates and add entry to journal.
Decode omap updates, apply updates to object store, then remove from journal.
Change omap reads in PrimaryLogPG to use PGBackend functions, including omap updates from journal.
Assisted-by: Bob
Used for debugging and copying patterns (e.g. implementing REPLACE type to match MODIFY).
Fixes: https://tracker.ceph.com/issues/74188
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
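The idea of reading omap through pending journal updates can be sketched as follows. This is a simplified model, not the ECOmapJournal implementation: the update kinds "set", "remove" and "clear", the tuple layout, and both function names are assumptions made for illustration.

```python
def read_omap(stored: dict, journal: list) -> dict:
    """Return the omap view a reader should see: store plus pending updates."""
    view = dict(stored)                      # never edit the stored copy
    for kind, key, value in journal:
        if kind == "set":
            view[key] = value
        elif kind == "remove":
            view.pop(key, None)
        elif kind == "clear":
            view.clear()
        else:
            raise ValueError(f"unknown update kind: {kind}")
    return view


def apply_journal(stored: dict, journal: list) -> None:
    """Apply pending updates to the store, then drop them from the journal."""
    merged = read_omap(stored, journal)
    stored.clear()
    stored.update(merged)
    journal.clear()
```

A read before apply_journal() and a read after it return the same view, which is the property the journal-aware read path needs.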
Matty Williams [Tue, 12 May 2026 15:11:17 +0000 (16:11 +0100)]
osd: Allow for recovery of OMAP header and entries in EC pools
Add omap fields to read_request_t, read_result_t, ECSubRead and ECSubReadReply.
Read and write omap header and entries if !omap_complete.
Require omap_complete to finish recovery.
Fixes: https://tracker.ceph.com/issues/74244
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
Matty Williams [Fri, 12 Dec 2025 11:21:10 +0000 (11:21 +0000)]
osd: Introduce functions required for EC OMAP support
Introduced a "supports_omap" pool flag which is always enabled for Replicated pools and currently always disabled for EC pools.
Introduced wrappers around omap read operations in PGBackend to include updates from the journal in EC pools with optimisations enabled.
Introduced a function for encoding an EC_OMAP operation in the ObjectModDesc::Visitor class and a function for committing an operation in the Trimmer struct.
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
Reviewed-by: Joseph Mundackal <jmundackal@bloomberg.net>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Adam Kupczyk [Tue, 1 Jul 2025 13:47:14 +0000 (13:47 +0000)]
os/bluestore: Add new onode recovery method
Added read_allocation_from_onodes_mt function
(originally copied from read_allocation_from_onodes).
Added Decoder_AllocationsAndStatFS class
(originally copied from ExtentDecoderPartial).
There are significant differences from the originals:
- shared blobs are not scanned at all
- to avoid accounting allocations more than once,
collisions are detected at the SimpleBitmap level;
only the first onode referencing a shared blob marks the allocation
- Blobs are not preserved
- instead we remember only whether a blob or spanning blob was compressed
The goal is to make recovery faster and to prepare for a
multithreaded refactor.
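The "first onode wins" rule can be sketched with a plain bitmap. This is a simplified model, not the BlueStore code: mark_allocations() is a hypothetical name, extents are (start_block, block_count) pairs, and a bytearray stands in for SimpleBitmap.

```python
def mark_allocations(onodes, num_blocks):
    """Mark every block referenced by any onode exactly once.

    onodes: iterable of (name, [(start_block, block_count), ...]).
    Returns (bitmap, allocated_blocks, collisions).
    """
    bitmap = bytearray(num_blocks)       # 0 = free, 1 = allocated
    allocated = 0
    collisions = 0
    for _name, extents in onodes:
        for start, count in extents:
            for block in range(start, start + count):
                if bitmap[block]:
                    # Already marked by an earlier onode (e.g. a shared
                    # blob): count the collision, do not account it again.
                    collisions += 1
                else:
                    bitmap[block] = 1
                    allocated += 1
    return bitmap, allocated, collisions
```

Because the collision is caught at the bitmap, no separate pass over shared blobs is needed to keep the allocation count correct.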
Adam Kupczyk [Fri, 4 Jul 2025 16:28:16 +0000 (16:28 +0000)]
os/bluestore: Rework on decoding
Refactored ExtentDecoder.
Introduced decode_create_blob method to it.
Converted bluestore_blob_t::decode and Blob::decode methods into templates.
Added a clear example of how to specialize these and other decoders.
mds: fix shutdown hang when ephemeral pins active and max_mds is 0
During shutdown, `ceph fs set <fs> down true` sets max_mds to 0 before
the MDS daemons have finished exporting their subtrees. shutdown_pass()
iterates over auth subtrees and skips any dir whose inode is
ephemerally pinned, expecting handle_export_pins() to re-place them.
However, handle_export_pins() calls hash_into_rank_bucket() which (after
the companion fix) now returns MDS_RANK_NONE when max_mds == 0. With
no valid target rank the export is never scheduled, so the ephemerally-
pinned dirs are skipped by shutdown_pass() indefinitely and the daemon
loops.
mds: fix crash in hash_into_rank_bucket() when max_mds is 0
When a CephFS cluster is paused (e.g. via `ceph fs set <fs> down true`
or `ceph fs pause`) the MDS map's max_mds is set to 0. Any subsequent
call to hash_into_rank_bucket() with max_mds == 0 triggers a crash:
the jump-consistent-hash loop never executes (j starts at 0, condition
j < max_mds is immediately false), leaving b = -1, so the final
assert(result >= 0 && result < max_mds) aborts the daemon.
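The failure mode can be reproduced with a small sketch of the published jump consistent hash algorithm. This is not the MDS code: the constants follow the commonly published form of the algorithm, RANK_NONE is a hypothetical stand-in for MDS_RANK_NONE, and the guard shows one possible shape of the fix.

```python
MASK64 = (1 << 64) - 1
RANK_NONE = None  # hypothetical stand-in for MDS_RANK_NONE


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Unguarded jump consistent hash; returns -1 when num_buckets is 0."""
    b, j = -1, 0
    while j < num_buckets:               # never entered when num_buckets == 0
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def hash_into_rank_bucket(key: int, max_mds: int):
    """Guarded variant: report 'no rank' instead of asserting on -1."""
    if max_mds <= 0:
        return RANK_NONE
    result = jump_consistent_hash(key, max_mds)
    assert 0 <= result < max_mds
    return result
```

Callers then have to treat the "no rank" result as "nothing to schedule", which is the case the shutdown fix above has to handle.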
This commit enables the ceph-osd-crimson and ceph-osd-crimson-dbg
packages for Debian builds that have gcc version 13 or above.
This is a first step toward adding noble to the supported distros
for crimson.
Kefu Chai [Sun, 7 Jun 2026 08:58:20 +0000 (16:58 +0800)]
mgr/dashboard: don't mutate the cached osd_map in CephService
test_pool_list fails intermittently:
Traceback (most recent call last):
File "qa/tasks/mgr/dashboard/test_pool.py", line 182, in test_pool_list
self.assertNotIn('pg_status', pool)
AssertionError: 'pg_status' unexpectedly found in
{'pool': 1, 'pool_name': 'rbd', ..., 'pg_status': {'active+clean': 1}, ...}
mgr.get('osd_map') defaults to mutable=False, so cacheable_get_python()
returns the mgr's shared cached object rather than a copy.
get_pool_list_with_stats() writes pool['pg_status'] and pool['stats']
into those cached dicts, and get_erasure_code_profiles() sets ecp['name']
and rewrites ecp['k']/['m'] to int. The writes outlive the request, so
once a stats=true call has run, GET /api/pool with stats=false still
returns pools carrying pg_status and the assertion above fails. It only
triggers while the cache stays valid between the two requests, hence the
flakiness.
Audited the other dashboard readers of cached mgr.get() keys: these two
are the only sites that mutate the result; the rest only read, and
health.py already copies its osd_map before editing.
Copy the dicts before stamping them; the cache stays clean.
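The bug and the fix can be shown in miniature. This sketch does not use the dashboard or mgr APIs: the cached map is a plain dict passed in by the caller, and both function names are hypothetical.

```python
import copy


def pool_list_mutating(cached_osd_map, stats):
    """Buggy pattern: stamps stats onto the shared cached dicts."""
    pools = cached_osd_map["pools"]
    if stats:
        for pool in pools:
            pool["pg_status"] = {"active+clean": 1}
    return pools


def pool_list_copying(cached_osd_map, stats):
    """Fixed pattern: copy each pool dict before adding keys to it."""
    pools = [copy.copy(pool) for pool in cached_osd_map["pools"]]
    if stats:
        for pool in pools:
            pool["pg_status"] = {"active+clean": 1}
    return pools
```

A shallow copy is enough here because only new top-level keys are added; code that edits nested values would need copy.deepcopy() instead.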
Sun Yuechi [Sat, 6 Jun 2026 09:44:57 +0000 (17:44 +0800)]
Dockerfile.build: fetch sccache on riscv64
sccache ships a riscv64 release artifact since v0.13.0, published under the
riscv64gc target triple. Map uname -m "riscv64" to that asset name so the
download resolves on riscv64 instead of being skipped.
Sun Yuechi [Sat, 6 Jun 2026 09:44:33 +0000 (17:44 +0800)]
Dockerfile.build: bump sccache to v0.15.0
The releases since v0.8.2 add caching for C++20 modules, assembly, and C
preprocessor output, plus broader GCC/MSVC flag handling. They also avoid
double-caching when ccache is on PATH and carry assorted cache-correctness
and storage-backend fixes.