dheart [Tue, 9 Jun 2026 13:27:14 +0000 (21:27 +0800)]
os/bluestore: prevent reallocation and corruption when shared_blob key is missing/undecodable
When the shared_blob key is missing or fails to decode,
it is necessary to scan the blob's pextents directly as the sole authoritative source
to verify allocated blocks and prevent double-allocation.
Reviewed-by: Joseph Mundackal <jmundackal@bloomberg.net>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
mds: fix shutdown hang when ephemeral pins active and max_mds is 0
During shutdown, `ceph fs set <fs> down true` sets max_mds to 0 before
the MDS daemons have finished exporting their subtrees. shutdown_pass()
iterates over auth subtrees and skips any dir whose inode is
ephemerally pinned, expecting handle_export_pins() to re-place them.
However, handle_export_pins() calls hash_into_rank_bucket() which (after
the companion fix) now returns MDS_RANK_NONE when max_mds == 0. With
no valid target rank the export is never scheduled, so the ephemerally-
pinned dirs are skipped by shutdown_pass() indefinitely and the daemon
loops.
mds: fix crash in hash_into_rank_bucket() when max_mds is 0
When a CephFS cluster is paused (e.g. via `ceph fs set <fs> down true`
or `ceph fs pause`) the MDS map's max_mds is set to 0. Any subsequent
call to hash_into_rank_bucket() with max_mds == 0 triggers a crash:
the jump-consistent-hash loop never executes (j starts at 0, condition
j < max_mds is immediately false), leaving b = -1, so the final
assert(result >= 0 && result < max_mds) aborts the daemon.
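The failure mode can be illustrated with a minimal Python sketch of a jump consistent hash loop. This is not the MDS code: `jump_consistent_hash`, `hash_into_rank_bucket_guarded`, and the `RANK_NONE` sentinel are hypothetical names, and the early return only stands in for a "no valid rank" result.

```python
MASK64 = (1 << 64) - 1
RANK_NONE = None  # hypothetical stand-in for a "no rank" sentinel


def jump_consistent_hash(key, num_buckets):
    """Jump consistent hash; returns -1 when num_buckets <= 0."""
    b, j = -1, 0
    while j < num_buckets:  # never entered when num_buckets == 0
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def hash_into_rank_bucket_guarded(key, max_mds):
    """Guarded variant: bail out before hashing when there are no ranks."""
    if max_mds <= 0:
        return RANK_NONE
    result = jump_consistent_hash(key, max_mds)
    assert 0 <= result < max_mds
    return result
```

With `max_mds == 0` the unguarded function returns -1, which is exactly the value the range assertion rejects; the guarded variant returns the sentinel instead of reaching the assertion.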
This commit enables ceph-osd-crimson and ceph-osd-crimson-dbg
packages for Debian builds which have gcc version 13 or above.
This is done as a first step to add noble to supported distros
for crimson.
Kefu Chai [Sun, 7 Jun 2026 08:58:20 +0000 (16:58 +0800)]
mgr/dashboard: don't mutate the cached osd_map in CephService
test_pool_list fails intermittently:
Traceback (most recent call last):
File "qa/tasks/mgr/dashboard/test_pool.py", line 182, in test_pool_list
self.assertNotIn('pg_status', pool)
AssertionError: 'pg_status' unexpectedly found in
{'pool': 1, 'pool_name': 'rbd', ..., 'pg_status': {'active+clean': 1}, ...}
mgr.get('osd_map') defaults to mutable=False, so cacheable_get_python()
returns the mgr's shared cached object rather than a copy.
get_pool_list_with_stats() writes pool['pg_status'] and pool['stats']
into those cached dicts, and get_erasure_code_profiles() sets ecp['name']
and rewrites ecp['k']/['m'] to int. The writes outlive the request, so
once a stats=true call has run, GET /api/pool with stats=false still
returns pools carrying pg_status and the assertion above fails. It only
triggers while the cache stays valid between the two requests, hence the
flakiness.
Audited the other dashboard readers of cached mgr.get() keys: these two
are the only sites that mutate the result; the rest only read, and
health.py already copies its osd_map before editing.
Copy the dicts before stamping them; the cache stays clean.
Sun Yuechi [Sat, 6 Jun 2026 09:44:57 +0000 (17:44 +0800)]
Dockerfile.build: fetch sccache on riscv64
sccache ships a riscv64 release artifact since v0.13.0, published under the
riscv64gc target triple. Map uname -m "riscv64" to that asset name so the
download resolves on riscv64 instead of being skipped.
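The mapping idea, sketched in Python rather than the Dockerfile's shell. The table and `sccache_asset_arch` are hypothetical; only the riscv64 to riscv64gc entry comes from the text above, and the other two entries are assumed to map to themselves.

```python
# Hypothetical table: `uname -m` output -> architecture prefix of the asset.
ASSET_ARCH = {
    "x86_64": "x86_64",      # assumption
    "aarch64": "aarch64",    # assumption
    "riscv64": "riscv64gc",  # from the commit message
}


def sccache_asset_arch(machine):
    """Return the asset prefix, or None when the download should be skipped."""
    return ASSET_ARCH.get(machine)
```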
Sun Yuechi [Sat, 6 Jun 2026 09:44:33 +0000 (17:44 +0800)]
Dockerfile.build: bump sccache to v0.15.0
The releases since v0.8.2 add caching for C++20 modules, assembly, and C
preprocessor output, plus broader GCC/MSVC flag handling. They also avoid
double-caching when ccache is on PATH and carry assorted cache-correctness
and storage-backend fixes.
Kefu Chai [Fri, 5 Jun 2026 01:34:56 +0000 (09:34 +0800)]
ceph.spec.in: only require c-ares >= 1.28 on el10+
87e233bb2628784c8c59603e74bc728a8944265e added an unconditional
"Requires: c-ares >= 1.28.0" to ceph-osd-crimson: seastar links
ares_query_dnsrec, which c-ares only grew in 1.28, and the libcares.so.2
SONAME doesn't carry the version so rpm can't infer the floor itself.
But the floor only earns its place where the build links the symbol
against a newer c-ares than the runtime has, and that's an EL thing.
el10's minors cross 1.28 under one $releasever (10.1 ships 1.25, 10.2
ships 1.34), so a builder rolls to 1.34 while a frozen 10.1 node stays on
1.25; without the floor the rpm installs there and the osd then crashes
on the missing symbol. el9 builds the legacy ares_query path and doesn't
need it at all.
Fedora and SUSE don't have the skew: one c-ares per release, built and
run against the same one, so the auto libcares.so.2 dep covers them. So
pin it only on el10+, arch-qualified with %{?_isa}.
Ronen Friedman [Thu, 4 Jun 2026 13:05:26 +0000 (13:05 +0000)]
crimson/test: chain invoke_on_all() future instead of calling get()
The reactor start-up code on ARM64 uses invoke_on_all() to
set a configuration option.
Replace smp::invoke_on_all().get() with future chaining. This
avoids waiting on a future from a reactor continuation (outside
of a seastar thread), which throws an exception.
Ronen Friedman [Fri, 29 May 2026 18:21:51 +0000 (18:21 +0000)]
osd/scrub: clean up inconsistent_obj_wrapper and ScrubStore
Add a default constructor to inconsistent_obj_wrapper, allowing
decode_wrapper() to avoid requiring a dummy hobject_t that gets
immediately overwritten by decode(). Remove the now-unnecessary
hobject_t parameter from merge_encoded_error_wrappers().
Introduce a 'last_degraded' timestamp to the pg_stat_t structure to track
the initial point of redundancy loss. This field, used in conjunction
with 'last_clean', allows the manager to calculate a cluster-wide
durability score by measuring the duration of vulnerability windows.
Changes:
1) Add last_degraded (utime_t) to pg_stat_t in osd_types.h.
2) Increment pg_stat_t encoding version to 31. The decode logic
defaults last_degraded to last_clean for backward compatibility
during rolling upgrades.
3) Update operator==, dump(), and generate_test_instances() to
support ceph-dencoder testing and JSON output.
4) Implement latching logic in PeeringState::prepare_stats_for_publish():
- A PG is considered vulnerable if in DEGRADED or UNDERSIZED state.
- last_degraded is set to 'now' only if it is <= last_clean,
effectively latching the timestamp to the start of the failure
event until the PG next becomes clean.
5) Standalone tests to verify:
- The last_degraded timestamp latching logic.
- That the last_degraded timestamp is updated when OSDs are marked 'out'
for draining, in which case PGs are marked undersized.
6) Release note the addition of 'last_degraded' field to PG stats.
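The decode default and the latching rule from items 2 and 4 can be sketched as follows. This is a simplified Python model with plain numbers as timestamps; `decode_last_degraded` and `latch_last_degraded` are hypothetical names, not functions from the OSD code.

```python
def decode_last_degraded(struct_v, encoded_value, last_clean):
    """Encodings older than v31 carry no last_degraded; fall back to last_clean."""
    return encoded_value if struct_v >= 31 else last_clean


def latch_last_degraded(last_degraded, last_clean, now, degraded, undersized):
    """Stamp 'now' only at the start of a failure event, then hold it."""
    vulnerable = degraded or undersized
    if vulnerable and last_degraded <= last_clean:
        return now
    return last_degraded
```

Once stamped, last_degraded is greater than last_clean, so repeated calls during the same failure leave it untouched. It can only be restamped after last_clean has advanced past it, that is, after the PG has been clean again.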
John Mulligan [Thu, 14 May 2026 14:02:56 +0000 (10:02 -0400)]
doc: add more details about the remote-control sidecar service
Add a section about how to set up and access the remote-control sidecar
service. Update a bit of the existing config docs that was not accurate.
Cover the three approaches to making use of the remote-control service
as a client.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Fri, 8 May 2026 18:01:36 +0000 (14:01 -0400)]
container: include python3-ceph-smb-ctl in ceph image
The python3-ceph-smb-ctl package provides the ceph-smb-ctl CLI tool (and
requires needed deps) and is a weak dependency of python3-ceph-common.
However, since the container disables weak dependencies by default we
need to explicitly list it if we want it in the container image. Which
we do.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Kefu Chai [Thu, 4 Jun 2026 10:38:24 +0000 (18:38 +0800)]
test/rgw/posix: free the quota handler in TestDriver
TestDriver::init() allocates quota_handler via
RGWQuotaHandler::generate_handler() but nothing frees it. The real
POSIXDriver frees it in finalize(), which the unit tests never call, so
every fixture that runs init() leaks the handler and the stat caches
hanging off it: 274 allocations, ~40KB, all rooted at generate_handler()
under ASan:
==6102==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 3200 byte(s) in 5 object(s) allocated from:
#1 RGWQuotaHandler::generate_handler(...) src/rgw/rgw_quota.cc:989
#2 TestDriver::init(...) src/test/rgw/test_rgw_posix_driver.cc:1100
#3 POSIXDriverTest::SetUp() src/test/rgw/test_rgw_posix_driver.cc:1191
...
SUMMARY: AddressSanitizer: 40099 byte(s) leaked in 274 allocation(s).
So free it in ~TestDriver(), the counterpart to the init() allocation.
~POSIXDriver() is empty and nothing else touches quota_handler, so there
is no double free, and free_handler(nullptr) is a no-op when init()
bailed out early.
ceph.spec.in: require c-ares >= 1.28 for ceph-osd-crimson
Seastar's DNS stack uses ares_query_dnsrec when built against c-ares
>= 1.28 (ARES_VERSION >= 0x011c00). Only ceph-osd-crimson links that
path; classic-osd does not, so add the version floor on the crimson
subpackage only.
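As a quick check of the arithmetic, ARES_VERSION packs major, minor, and patch into one integer as (major << 16) | (minor << 8) | patch, so 0x011c00 is 1.28.0. The helper names in this Python sketch are hypothetical.

```python
DNSREC_FLOOR = 0x011C00  # ARES_VERSION value for c-ares 1.28.0


def ares_version(major, minor, patch):
    """Pack a c-ares version the way ARES_VERSION does."""
    return (major << 16) | (minor << 8) | patch


def meets_dnsrec_floor(major, minor, patch):
    return ares_version(major, minor, patch) >= DNSREC_FLOOR
```

Under this encoding 1.25.0 falls below the floor and 1.34.6 clears it.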
Rocky Linux 10 shaman builds use docker.io/rockylinux/rockylinux:10
(os-release 10.1), but dnf builddeps resolve against the live Rocky 10
BaseOS/AppStream repos, which track the newest minor and install
c-ares-devel/c-ares 1.34.6. CMake links ceph-osd-crimson against that
library. Teuthology nodes are provisioned as Rocky 10.1 and install only
the requested Ceph packages without a full distro upgrade, so their
baseline c-ares stays at 1.25.0 (< 1.28, no ares_query_dnsrec). Install
succeeds but OSD startup fails with "undefined symbol: ares_query_dnsrec".
Require c-ares >= 1.28 on ceph-osd-crimson so dnf upgrades to a suitable
libcares (1.34.6 is already in Rocky 10.1 baseos) or fails cleanly at
install. Ubuntu crimson CI does not show this mismatch: the same LTS is
used for building and testing, and maintainers do not bump upstream
package versions across an LTS lifecycle (only cherry-picked fixes), so
build-time and runtime libc-ares stay aligned.
get('osd_map') returns the cached object directly, so del and key
assignments were silently corrupting the cache for subsequent callers.
Take a shallow copy before modifying, and use pop() instead of del in
case the cache was already corrupted.
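A short sketch of why pop() is the more tolerant choice here. `strip_keys` is a hypothetical helper and the key names are made up for illustration.

```python
def strip_keys(cached, keys):
    """Return a shallow copy of `cached` without `keys`; never raises on a miss."""
    result = dict(cached)  # leave the cached object untouched
    for key in keys:
        # `del result[key]` would raise KeyError if an earlier caller had
        # already removed the key from the shared object.
        result.pop(key, None)
    return result
```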
mgr/DaemonServer: clarify ok-to-upgrade error message for CRUSH buckets
Refine the error string in DaemonServer.cc returned by the
ok-to-upgrade command when OSDs in a CRUSH bucket cannot be upgraded.
The original message is ambiguous. It fails to clearly convey that
stopping *any* individual OSD in that specific bucket will drop PGs
offline, meaning no OSDs within that bucket can be safely upgraded at
this time.
Update the phrasing to explicitly state that at least X PGs will go offline
if any OSD out of the total count in that CRUSH bucket is stopped. Also
standardize on capitalized acronyms (PG, OSD, CRUSH) and wrap the bucket
name in single quotes for better log readability.
Kobi Ginon [Wed, 3 Jun 2026 14:03:09 +0000 (17:03 +0300)]
doc/rbd: clarify Rocky iSCSI gateway requirements
List Rocky Linux 8+ alongside RHEL/CentOS Stream 7.5+. Note that packaged
ceph-iscsi must recognize Rocky in /etc/os-release (ceph-iscsi#282). Add a
short Rocky note under iSCSI targets; expand the overview maintenance
warning with migration guidance to RBD and the NVMe-oF gateway.
Casey Bodley [Tue, 2 Jun 2026 20:17:59 +0000 (16:17 -0400)]
osdc: deliver neorados completions to associated executor
while Objecter delivers librados completions directly by calling
Context::complete(), neorados completions are passed in as
boost::asio::any_completion_handler and delivered to an asio executor
via boost::asio::defer() on completion
asio handlers may have an "associated executor" so callers can customize
where these completions are delivered. for example, multithreaded
applications often use strand executors to synchronize completion
handlers and prevent data races between concurrent operations
however, applications like radosgw that depend on strands for
thread-safety did not get it due to the fact that Objecter's
Op::complete() delivered all neorados completions to the default
io_context executor
use boost::asio::get_associated_executor() to respect the handler's
executor affinity, if any. but because the Op's handler is the
type-erased any_completion_handler, its associated executor is also
type-erased as any_completion_executor. that any_completion_executor
doesn't support the blocking::never_t property required by defer/post,
so defer() was changed to dispatch() which may call the handler directly
if Objecter is already running on the requested executor. i assume this
is safe, given that librados' Context-based completions already do this
mds: correction in the description for mds_log_max_segments config
Since we use an unsigned integer for the config option
`mds_log_max_segments`, the value '-1' is not permitted.
And there's no need to disable this limit. Hence removing
this statement from its description.
Resetting a txn doesn't really create a new one semantically.
Avoid incrementing "created" on reset; otherwise we end up
with inflated numbers where the MUTATE txn created count
is twice as high as the committed count.
Note, "resets" are already tracked as invalidated.
Since we are changing the 'application' for the report,
we need non-RO access in case of a cached API call.
Use the 'pool_stats' map directly to avoid copying the pg_dump,
which can be huge.