dheart [Tue, 9 Jun 2026 13:27:14 +0000 (21:27 +0800)]
os/bluestore: prevent reallocation and corruption when shared_blob key is missing/undecodable
When the shared_blob key is missing or fails to decode,
it is necessary to scan the blob's pextents directly as the sole authoritative source
to verify allocated blocks and prevent double-allocation.
Emmanuel Ameh [Tue, 9 Jun 2026 12:40:03 +0000 (13:40 +0100)]
doc/man: Remove stale EOL release names from deprecation notices
ceph.rst: "osd create" deprecation notice cited "the Luminous release"
(2017, EOL 2020). Update to a plain deprecation statement directing
users to the replacement command (osd new).
rbd.rst: cephx_require_signatures option deprecation cited "the Bobtail
release" (2013, EOL 2015) as context for why the option is deprecated.
Remove the EOL release name; retain the deprecation warning. Fix the
companion nocephx_require_signatures notice for consistency ("in a
future release" instead of "in the future").
Matty Williams [Mon, 18 May 2026 09:09:32 +0000 (10:09 +0100)]
osd: Hook up omap operations in EC pools
Add pool flag to determine if omap operations are supported in a pool.
- Currently disabled in EC pools (will later be enabled for Fast EC pools)
Require all osds to have umbrella or later release version to enable pool flag.
Change recovery reads to use journal updates.
Clear the journal for a new epoch.
Set omap_complete accurately before recovery.
Encode omap updates and add entry to journal.
Decode omap updates, apply updates to object store, then remove from journal.
Change omap reads in PrimaryLogPG to use PGBackend functions, including omap updates from journal.
Assisted-by: Bob
Used for debugging and copying patterns (e.g. implementing REPLACE type to match MODIFY).
Fixes: https://tracker.ceph.com/issues/74188
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
Matty Williams [Tue, 12 May 2026 15:11:17 +0000 (16:11 +0100)]
osd: Allow for recovery of OMAP header and entries in EC pools
Add omap fields to read_request_t, read_result_t, ECSubRead and ECSubReadReply.
Read and write omap header and entries if !omap_complete.
Require omap_complete to finish recovery.
Fixes: https://tracker.ceph.com/issues/74244
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
Matty Williams [Fri, 12 Dec 2025 11:21:10 +0000 (11:21 +0000)]
osd: Introduce functions required for EC OMAP support
Introduced a "supports_omap" pool flag which is always enabled for Replicated pools and currently always disabled for EC pools.
Introduced wrappers around omap read operations in PGBackend to include updates from the journal in EC pools with optimisations enabled.
Introduced a function for encoding an EC_OMAP operation in the ObjectModDesc::Visitor class and a function for committing an operation in the Trimmer struct.
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
previously, we got the mono_clock::now() in OSD::start() and passed it
to PerShardState. this worked fine. but it was a little bit convoluted
-- we pass the startup_time all the way to PerShardState.
in this change, we just call mono_clock::now() in the constructor
of PerShardState. simpler this way.
the startup_time has two consumers:
- the PGs hosted by the sharded_service use it as a reference for the
monotonic timestamp
- Heartbeat::send_heartbeats() uses it for the mono_ping_stamp.
strictly speaking, we cannot guarantee that all PerShardState
sharded services share the identical startup timestamp, as they are
constructed on different shards. but this does not matter, as PGs
always use the hosting shard service for the referencing timestamp,
and OSD always uses the shard service on local shard for sending
heartbeats.
Kefu Chai [Mon, 8 Jun 2026 08:54:47 +0000 (16:54 +0800)]
crimson/osd: remove OSD::startup_time
OSD::startup_time was added in 91c0df81 to provide
a monotonically increasing timestamp representing the startup time. but
later, we decided to keep track of this timestamp in PerShardState.
but we didn't remove OSD::startup_time when adding
PerShardState::get_mnow().
in this change, we remove the unused OSD::startup_time.
Kefu Chai [Mon, 8 Jun 2026 08:24:40 +0000 (16:24 +0800)]
crimson/osd: coroutinize OSD::start()
OSD::start() is a long, deeply nested .then() continuation chain. let's
rewrite it as a coroutine to make it readable. the chain was already
sequential and the one concurrent step keeps its when_all_succeed(), so
the rewrite preserves both ordering and concurrency.
start() runs once at boot, off the i/o path, so the small overhead of
co_await over a hand-rolled continuation chain is a fine price for the
readability.
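As a rough analogy only, the shape of the rewrite can be sketched with Python's asyncio rather than Seastar: sequential steps become plain awaits, and the one concurrent step keeps an explicit gather (standing in for when_all_succeed()). The step names below are hypothetical and do not mirror the real OSD::start() steps.

```python
import asyncio


async def step(name, log):
    """Hypothetical boot step: yield to the event loop, then record completion."""
    await asyncio.sleep(0)
    log.append(name)


async def start(log):
    # Sequential steps read top to bottom instead of nesting continuations.
    await step("open_store", log)
    # The single concurrent step stays concurrent (analogous to when_all_succeed()).
    await asyncio.gather(
        step("start_public_messenger", log),
        step("start_cluster_messenger", log),
    )
    await step("start_boot", log)


def run_start():
    """Run the sketch once and return the order in which steps completed."""
    log = []
    asyncio.run(start(log))
    return log
```

Ordering is preserved for the first and last steps, while the two messenger steps may complete in either order.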
Reviewed-by: Joseph Mundackal <jmundackal@bloomberg.net>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Adam Kupczyk [Tue, 1 Jul 2025 13:47:14 +0000 (13:47 +0000)]
os/bluestore: Add new onode recovery method
Added read_allocation_from_onodes_mt function
(originally copied from read_allocation_from_onodes).
Added Decoder_AllocationsAndStatFS class
(originally copied from ExtentDecoderPartial).
There are significant differences from originals:
- shared blobs are not scanned at all
- to not account allocations more than once,
collisions are detected on SimpleBitmap level;
only the first onode referencing shared blob will mark allocation
- Blobs are not preserved
- instead we remember only if blob or spanning blob was compressed
The underlying goal is to make recovery faster and prepare for a
multithreaded refactor.
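A minimal sketch of the "first referencer marks the allocation" idea, assuming a Python set stands in for SimpleBitmap and extents are (offset, length) pairs in block units. The function and variable names are hypothetical and are not BlueStore APIs.

```python
def mark_allocations(onodes):
    """Mark every block at most once.

    `onodes` is a list of (name, extents) pairs, where extents is a list of
    (offset, length) tuples in block units. Returns the set of used blocks
    and, per onode, how many blocks it newly marked.
    """
    used = set()  # stand-in for the allocation bitmap
    newly_marked = {}
    for name, extents in onodes:
        count = 0
        for offset, length in extents:
            for block in range(offset, offset + length):
                if block in used:
                    # Collision: an earlier onode already claimed this block,
                    # e.g. through a shared blob. Do not account for it twice.
                    continue
                used.add(block)
                count += 1
        newly_marked[name] = count
    return used, newly_marked
```

With two onodes that overlap on blocks 2 and 3, only the first one is charged for the overlap, so the total matches the number of distinct blocks.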
Adam Kupczyk [Fri, 4 Jul 2025 16:28:16 +0000 (16:28 +0000)]
os/bluestore: Rework on decoding
Refactored ExtentDecoder.
Introduced decode_create_blob method to it.
Converted bluestore_blob_t::decode and Blob::decode methods into templates.
Created a clear example of how to specialize these and other decoders.
Jon Bailey [Thu, 4 Jun 2026 10:27:07 +0000 (11:27 +0100)]
test: Remove invalid unit test
This test claimed to test invalid ops; however, with the inclusion of sync reads in EC (https://github.com/ceph/ceph/pull/67079), it is valid to perform class reads in EC. In addition, work was done around illegal ops in https://github.com/ceph/ceph/pull/66258, and the existence of TEST(ClsHello, BadMethods) in test_cls_hello.cc covers illegal ops in that PR, which leads me to think this test is unnecessary. For these reasons, I think it is better to remove this test, as it is incorrect and also not working.
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
mds: fix shutdown hang when ephemeral pins active and max_mds is 0
During shutdown, `ceph fs set <fs> down true` sets max_mds to 0 before
the MDS daemons have finished exporting their subtrees. shutdown_pass()
iterates over auth subtrees and skips any dir whose inode is
ephemerally pinned, expecting handle_export_pins() to re-place them.
However, handle_export_pins() calls hash_into_rank_bucket() which (after
the companion fix) now returns MDS_RANK_NONE when max_mds == 0. With
no valid target rank the export is never scheduled, so the ephemerally-
pinned dirs are skipped by shutdown_pass() indefinitely and the daemon
loops.
mds: fix crash in hash_into_rank_bucket() when max_mds is 0
When a CephFS cluster is paused (e.g. via `ceph fs set <fs> down true`
or `ceph fs pause`) the MDS map's max_mds is set to 0. Any subsequent
call to hash_into_rank_bucket() with max_mds == 0 triggers a crash:
the jump-consistent-hash loop never executes (j starts at 0, condition
j < max_mds is immediately false), leaving b = -1, so the final
assert(result >= 0 && result < max_mds) aborts the daemon.
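The failure mode can be reproduced with a small Python sketch of the jump consistent hash loop. This is an illustration under stated assumptions, not the MDS code: jump_hash() follows the published algorithm, RANK_NONE is a hypothetical stand-in for MDS_RANK_NONE, and the guard shows one way a zero bucket count could be handled.

```python
MASK64 = (1 << 64) - 1
RANK_NONE = None  # hypothetical stand-in for MDS_RANK_NONE


def jump_hash(key, num_buckets):
    """Jump consistent hash; returns -1 if the loop body never runs."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, masked to emulate uint64_t overflow.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def hash_into_rank_bucket(key, max_mds):
    """Guarded wrapper: no valid rank exists when max_mds is 0."""
    if max_mds <= 0:
        return RANK_NONE
    rank = jump_hash(key, max_mds)
    assert 0 <= rank < max_mds
    return rank
```

Calling jump_hash(key, 0) returns -1 for any key, which is exactly the value the range assertion rejects; the guarded wrapper returns the sentinel instead.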
This commit enables ceph-osd-crimson and ceph-osd-crimson-dbg
packages for debian builds which have gcc version 13 or above.
This is done as a first step to add noble to supported distros
for crimson.
Kefu Chai [Sun, 7 Jun 2026 08:58:20 +0000 (16:58 +0800)]
mgr/dashboard: don't mutate the cached osd_map in CephService
test_pool_list fails intermittently:
Traceback (most recent call last):
File "qa/tasks/mgr/dashboard/test_pool.py", line 182, in test_pool_list
self.assertNotIn('pg_status', pool)
AssertionError: 'pg_status' unexpectedly found in
{'pool': 1, 'pool_name': 'rbd', ..., 'pg_status': {'active+clean': 1}, ...}
mgr.get('osd_map') defaults to mutable=False, so cacheable_get_python()
returns the mgr's shared cached object rather than a copy.
get_pool_list_with_stats() writes pool['pg_status'] and pool['stats']
into those cached dicts, and get_erasure_code_profiles() sets ecp['name']
and rewrites ecp['k']/['m'] to int. The writes outlive the request, so
once a stats=true call has run, GET /api/pool with stats=false still
returns pools carrying pg_status and the assertion above fails. It only
triggers while the cache stays valid between the two requests, hence the
flakiness.
Audited the other dashboard readers of cached mgr.get() keys: these two
are the only sites that mutate the result; the rest only read, and
health.py already copies its osd_map before editing.
Copy the dicts before stamping them; the cache stays clean.
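The leak and the fix can be sketched outside the dashboard as follows. This is a minimal illustration, assuming a module-level dict stands in for the mgr's shared cache; get_osd_map(), pool_list_mutating() and pool_list_copying() are hypothetical names, and copy.deepcopy() is one possible way to copy.

```python
import copy

# Stand-in for the mgr's shared cached object.
_CACHED_OSD_MAP = {"pools": [{"pool": 1, "pool_name": "rbd"}]}


def get_osd_map():
    """Mimics a mutable=False getter: returns the shared object, not a copy."""
    return _CACHED_OSD_MAP


def pool_list_mutating(pg_status):
    """Buggy pattern: stamps extra keys straight into the cached dicts."""
    pools = get_osd_map()["pools"]
    for pool in pools:
        pool["pg_status"] = pg_status
    return pools


def pool_list_copying(pg_status):
    """Fixed pattern: copy each dict first, so the cache stays clean."""
    pools = [copy.deepcopy(pool) for pool in get_osd_map()["pools"]]
    for pool in pools:
        pool["pg_status"] = pg_status
    return pools
```

After pool_list_copying() the cached pools carry no pg_status key; after pool_list_mutating() they do, which is the state the failing assertion observed.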
Sun Yuechi [Sat, 6 Jun 2026 09:44:57 +0000 (17:44 +0800)]
Dockerfile.build: fetch sccache on riscv64
sccache ships a riscv64 release artifact since v0.13.0, published under the
riscv64gc target triple. Map uname -m "riscv64" to that asset name so the
download resolves on riscv64 instead of being skipped.
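The mapping can be sketched as a small lookup. Only the riscv64 to riscv64gc entry comes from this change; passing every other machine name through unchanged is an assumption, and sccache_arch() is a hypothetical helper, not part of Dockerfile.build.

```python
import platform

# Machine names whose release asset uses a different architecture label.
_ARCH_OVERRIDES = {"riscv64": "riscv64gc"}


def sccache_arch(machine=None):
    """Return the architecture label to use when building the asset name."""
    if machine is None:
        machine = platform.machine()  # same value as `uname -m` on Linux
    # Assumption: any machine name without an override is used as-is.
    return _ARCH_OVERRIDES.get(machine, machine)
```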
Sun Yuechi [Sat, 6 Jun 2026 09:44:33 +0000 (17:44 +0800)]
Dockerfile.build: bump sccache to v0.15.0
The releases since v0.8.2 add caching for C++20 modules, assembly, and C
preprocessor output, plus broader GCC/MSVC flag handling. They also avoid
double-caching when ccache is on PATH and carry assorted cache-correctness
and storage-backend fixes.
Kobi Ginon [Mon, 25 May 2026 12:38:34 +0000 (15:38 +0300)]
cephadm: set Grafana http_addr to 0.0.0.0 when unset
Grafana 11.1+ rejects non-literal http_addr values (e.g. localhost)
in grafana-apiserver. Use 0.0.0.0 by default; stop bracket-wrapping
IPv6 addresses in http_addr.
Fixes: https://tracker.ceph.com/issues/75365
Signed-off-by: Kobi Ginon <kginon@redhat.com>
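A minimal sketch of the defaulting rule, assuming the configured address arrives as an optional string. grafana_http_addr() and is_ip_literal() are hypothetical helpers rather than cephadm functions; the literal check uses Python's ipaddress module and only approximates what Grafana accepts.

```python
import ipaddress


def is_ip_literal(value):
    """True if value parses as a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


def grafana_http_addr(configured):
    """Fall back to 0.0.0.0 when unset; never wrap IPv6 in brackets."""
    addr = (configured or "").strip()
    if not addr:
        return "0.0.0.0"
    return addr
```

Under this check, "localhost" and a bracketed "[::1]" are not literals, while "0.0.0.0" and a bare "::1" are.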
The test uses add-repo --release 17.2.6 to verify version-string repo
handling, but debian-17.2.6 only has focal and bullseye suites and
jammy packages weren't built until 17.2.7. This causes apt-get update
to fail with a 404 on ubuntu_22.04 nodes.
Kefu Chai [Fri, 5 Jun 2026 01:34:56 +0000 (09:34 +0800)]
ceph.spec.in: only require c-ares >= 1.28 on el10+
87e233bb2628784c8c59603e74bc728a8944265e added an unconditional
"Requires: c-ares >= 1.28.0" to ceph-osd-crimson: seastar links
ares_query_dnsrec, which c-ares only grew in 1.28, and the libcares.so.2
SONAME doesn't carry the version so rpm can't infer the floor itself.
But the floor only earns its place where the build links the symbol
against a newer c-ares than the runtime has, and that's an EL thing.
el10's minors cross 1.28 under one $releasever (10.1 ships 1.25, 10.2
ships 1.34), so a builder rolls to 1.34 while a frozen 10.1 node stays on
1.25; without the floor the rpm installs there and the osd then crashes
on the missing symbol. el9 builds the legacy ares_query path and doesn't
need it at all.
Fedora and SUSE don't have the skew: one c-ares per release, built and
run against the same one, so the auto libcares.so.2 dep covers them. So
pin it only on el10+, arch-qualified with %{?_isa}.
Ronen Friedman [Thu, 4 Jun 2026 13:05:26 +0000 (13:05 +0000)]
crimson/test: chain invoke_on_all() future instead of calling get()
The reactor start-up code on ARM64 uses invoke_on_all() to
set a configuration option.
Replace smp::invoke_on_all().get() with future chaining. This
avoids waiting on a future from a reactor continuation (outside
of a seastar thread), which throws an exception.
Ronen Friedman [Fri, 29 May 2026 18:21:51 +0000 (18:21 +0000)]
osd/scrub: clean up inconsistent_obj_wrapper and ScrubStore
Add a default constructor to inconsistent_obj_wrapper, allowing
decode_wrapper() to avoid requiring a dummy hobject_t that gets
immediately overwritten by decode(). Remove the now-unnecessary
hobject_t parameter from merge_encoded_error_wrappers().
Introduce a 'last_degraded' timestamp to the pg_stat_t structure to track
the initial point of redundancy loss. This field, used in conjunction
with 'last_clean', allows the manager to calculate a cluster-wide
durability score by measuring the duration of vulnerability windows.
Changes:
1) Add last_degraded (utime_t) to pg_stat_t in osd_types.h.
2) Increment pg_stat_t encoding version to 31. The decode logic
defaults last_degraded to last_clean for backward compatibility
during rolling upgrades.
3) Update operator==, dump(), and generate_test_instances() to
support ceph-dencoder testing and JSON output.
4) Implement latching logic in PeeringState::prepare_stats_for_publish():
- A PG is considered vulnerable if in DEGRADED or UNDERSIZED state.
- last_degraded is set to 'now' only if it is <= last_clean,
effectively latching the timestamp to the start of the failure
event until the PG next becomes clean.
5) Standalone tests to verify:
- The last_degraded timestamp latching logic.
- That the last_degraded timestamp is modified when OSDs are marked 'out' for
draining, in which case PGs are marked undersized.
6) Release note the addition of 'last_degraded' field to PG stats.
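The latching rule in item 4 can be sketched as a pure function. This is an illustration under stated assumptions: timestamps are plain numbers, and update_last_degraded() is a hypothetical helper, not the code in PeeringState::prepare_stats_for_publish().

```python
def update_last_degraded(last_degraded, last_clean, now, degraded, undersized):
    """Return the new last_degraded value for one stats publish.

    The timestamp moves to `now` only when the PG is vulnerable and no
    failure event has been latched since the PG was last clean.
    """
    vulnerable = degraded or undersized
    if vulnerable and last_degraded <= last_clean:
        return now  # start of a new vulnerability window
    return last_degraded  # already latched, or the PG is not vulnerable
```

Starting from last_degraded == last_clean (the decode default), the first degraded publish latches the timestamp, later degraded publishes leave it alone, and a new window can only start after last_clean has advanced past it.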
John Mulligan [Thu, 14 May 2026 14:02:56 +0000 (10:02 -0400)]
doc: add more details about the remote-control sidecar service
Add a section about how to set up and access the remote-control sidecar
service. Correct part of the existing config docs that was inaccurate.
Cover the three approaches to making use of the remote-control service
as a client.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Fri, 8 May 2026 18:01:36 +0000 (14:01 -0400)]
container: include python3-ceph-smb-ctl in ceph image
The python3-ceph-smb-ctl package provides the ceph-smb-ctl CLI tool (and
requires needed deps) and is a weak dependency of python3-ceph-common.
However, since the container disables weak dependencies by default we
need to explicitly list it if we want it in the container image. Which
we do.
Signed-off-by: John Mulligan <jmulligan@redhat.com>