This commit enables ceph-osd-crimson and ceph-osd-crimson-dbg
packages for debian builds which have gcc version 13 or above.
This is done as a first step toward adding noble to the supported
distros for crimson.
Kefu Chai [Sun, 7 Jun 2026 08:58:20 +0000 (16:58 +0800)]
mgr/dashboard: don't mutate the cached osd_map in CephService
test_pool_list fails intermittently:
Traceback (most recent call last):
File "qa/tasks/mgr/dashboard/test_pool.py", line 182, in test_pool_list
self.assertNotIn('pg_status', pool)
AssertionError: 'pg_status' unexpectedly found in
{'pool': 1, 'pool_name': 'rbd', ..., 'pg_status': {'active+clean': 1}, ...}
mgr.get('osd_map') defaults to mutable=False, so cacheable_get_python()
returns the mgr's shared cached object rather than a copy.
get_pool_list_with_stats() writes pool['pg_status'] and pool['stats']
into those cached dicts, and get_erasure_code_profiles() sets ecp['name']
and rewrites ecp['k']/['m'] to int. The writes outlive the request, so
once a stats=true call has run, GET /api/pool with stats=false still
returns pools carrying pg_status and the assertion above fails. It only
triggers while the cache stays valid between the two requests, hence the
flakiness.
Audited the other dashboard readers of cached mgr.get() keys: these two
are the only sites that mutate the result; the rest only read, and
health.py already copies its osd_map before editing.
Copy the dicts before stamping them; the cache stays clean.
Sun Yuechi [Sat, 6 Jun 2026 09:44:57 +0000 (17:44 +0800)]
Dockerfile.build: fetch sccache on riscv64
sccache ships a riscv64 release artifact since v0.13.0, published under the
riscv64gc target triple. Map uname -m "riscv64" to that asset name so the
download resolves on riscv64 instead of being skipped.
Sun Yuechi [Sat, 6 Jun 2026 09:44:33 +0000 (17:44 +0800)]
Dockerfile.build: bump sccache to v0.15.0
The releases since v0.8.2 add caching for C++20 modules, assembly, and C
preprocessor output, plus broader GCC/MSVC flag handling. They also avoid
double-caching when ccache is on PATH and carry assorted cache-correctness
and storage-backend fixes.
Kefu Chai [Fri, 5 Jun 2026 01:34:56 +0000 (09:34 +0800)]
ceph.spec.in: only require c-ares >= 1.28 on el10+
87e233bb2628784c8c59603e74bc728a8944265e added an unconditional
"Requires: c-ares >= 1.28.0" to ceph-osd-crimson: seastar links
ares_query_dnsrec, which c-ares only grew in 1.28, and the libcares.so.2
SONAME doesn't carry the version so rpm can't infer the floor itself.
But the floor only earns its place where the build links the symbol
against a newer c-ares than the runtime has, and that's an EL thing.
el10's minors cross 1.28 under one $releasever (10.1 ships 1.25, 10.2
ships 1.34), so a builder rolls to 1.34 while a frozen 10.1 node stays on
1.25; without the floor the rpm installs there and the osd then crashes
on the missing symbol. el9 builds the legacy ares_query path and doesn't
need it at all.
Fedora and SUSE don't have the skew: one c-ares per release, built and
run against the same one, so the auto libcares.so.2 dep covers them. So
pin it only on el10+, arch-qualified with %{?_isa}.
Ronen Friedman [Thu, 4 Jun 2026 13:05:26 +0000 (13:05 +0000)]
crimson/test: chain invoke_on_all() future instead of calling get()
The reactor start-up code on ARM64 uses invoke_on_all() to
set a configuration option.
Replace smp::invoke_on_all().get() with future chaining. This
avoids waiting on a future from a reactor continuation (outside
of a seastar thread), which throws an exception.
Ronen Friedman [Fri, 29 May 2026 18:21:51 +0000 (18:21 +0000)]
osd/scrub: clean up inconsistent_obj_wrapper and ScrubStore
Add a default constructor to inconsistent_obj_wrapper, allowing
decode_wrapper() to avoid requiring a dummy hobject_t that gets
immediately overwritten by decode(). Remove the now-unnecessary
hobject_t parameter from merge_encoded_error_wrappers().
Introduce a 'last_degraded' timestamp to the pg_stat_t structure to track
the initial point of redundancy loss. This field, used in conjunction
with 'last_clean', allows the manager to calculate a cluster-wide
durability score by measuring the duration of vulnerability windows.
Changes:
1) Add last_degraded (utime_t) to pg_stat_t in osd_types.h.
2) Increment pg_stat_t encoding version to 31. The decode logic
defaults last_degraded to last_clean for backward compatibility
during rolling upgrades.
3) Update operator==, dump(), and generate_test_instances() to
support ceph-dencoder testing and JSON output.
4) Implement latching logic in PeeringState::prepare_stats_for_publish():
- A PG is considered vulnerable if in DEGRADED or UNDERSIZED state.
- last_degraded is set to 'now' only if it is <= last_clean,
effectively latching the timestamp to the start of the failure
event until the PG next becomes clean.
5) Standalone tests to verify:
- The last_degraded timestamp latching logic.
- The last_degraded timestamp update when OSDs are marked 'out' for
draining, in which case PGs are marked undersized.
6) Release note the addition of 'last_degraded' field to PG stats.
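
The latching rule in item 4 can be sketched as follows. This is a
simplified model, not the PeeringState code: `PGStat` and
`update_last_degraded()` are hypothetical names, timestamps are plain
floats rather than utime_t, and the caller is assumed to advance
`last_clean` whenever the PG becomes clean.

```python
from dataclasses import dataclass


@dataclass
class PGStat:
    # Simplified stand-ins for the pg_stat_t timestamp fields.
    last_clean: float = 0.0
    last_degraded: float = 0.0


def update_last_degraded(stat, degraded, undersized, now):
    """Latch last_degraded at the start of a vulnerability window."""
    vulnerable = degraded or undersized
    # Only stamp 'now' if no window has been opened since the last clean
    # state; later calls in the same window leave the timestamp alone.
    if vulnerable and stat.last_degraded <= stat.last_clean:
        stat.last_degraded = now
    return stat


# Usage: the first degraded report latches; the second does not move it.
stat = PGStat(last_clean=100.0, last_degraded=100.0)
update_last_degraded(stat, degraded=True, undersized=False, now=110.0)
update_last_degraded(stat, degraded=True, undersized=False, now=120.0)
first_window_start = stat.last_degraded

# The PG becomes clean, then goes undersized: a new window is latched.
stat.last_clean = 130.0
update_last_degraded(stat, degraded=False, undersized=True, now=140.0)
second_window_start = stat.last_degraded
```

Under these assumptions the length of a vulnerability window is the gap
between a latched last_degraded and the next last_clean.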
Kefu Chai [Thu, 4 Jun 2026 10:38:24 +0000 (18:38 +0800)]
test/rgw/posix: free the quota handler in TestDriver
TestDriver::init() allocates quota_handler via
RGWQuotaHandler::generate_handler() but nothing frees it. The real
POSIXDriver frees it in finalize(), which the unit tests never call, so
every fixture that runs init() leaks the handler and the stat caches
hanging off it: 274 allocations, ~40KB, all rooted at generate_handler()
under ASan:
==6102==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 3200 byte(s) in 5 object(s) allocated from:
#1 RGWQuotaHandler::generate_handler(...) src/rgw/rgw_quota.cc:989
#2 TestDriver::init(...) src/test/rgw/test_rgw_posix_driver.cc:1100
#3 POSIXDriverTest::SetUp() src/test/rgw/test_rgw_posix_driver.cc:1191
...
SUMMARY: AddressSanitizer: 40099 byte(s) leaked in 274 allocation(s).
So free it in ~TestDriver(), the counterpart to the init() allocation.
~POSIXDriver() is empty and nothing else touches quota_handler, so there
is no double free, and free_handler(nullptr) is a no-op when init()
bailed out early.
ceph.spec.in: require c-ares >= 1.28 for ceph-osd-crimson
Seastar's DNS stack uses ares_query_dnsrec when built against c-ares
>= 1.28 (ARES_VERSION >= 0x011c00). Only ceph-osd-crimson links that
path; classic-osd does not, so add the version floor on the crimson
subpackage only.
Rocky Linux 10 shaman builds use docker.io/rockylinux/rockylinux:10
(os-release 10.1), but dnf builddeps resolve against the live Rocky 10
BaseOS/AppStream repos, which track the newest minor and install
c-ares-devel/c-ares 1.34.6. CMake links ceph-osd-crimson against that
library. Teuthology nodes are provisioned as Rocky 10.1 and install only
the requested Ceph packages without a full distro upgrade, so their
baseline c-ares stays at 1.25.0 (< 1.28, no ares_query_dnsrec). Install
succeeds but OSD startup fails with "undefined symbol: ares_query_dnsrec".
Require c-ares >= 1.28 on ceph-osd-crimson so dnf upgrades to a suitable
libcares (1.34.6 is already in Rocky 10.1 baseos) or fails cleanly at
install. Ubuntu crimson CI does not show this mismatch: the same LTS is
used for building and testing, and maintainers do not bump upstream
package versions across an LTS lifecycle (only cherry-picked fixes), so
build-time and runtime libc-ares stay aligned.
get('osd_map') returns the cached object directly, so del and key
assignments were silently corrupting the cache for subsequent callers.
Take a shallow copy before modifying, and use pop() instead of del in
case the cache was already corrupted.
mgr/DaemonServer: clarify ok-to-upgrade error message for CRUSH buckets
Refine the error string in DaemonServer.cc returned by the
ok-to-upgrade command when OSDs in a CRUSH bucket cannot be upgraded.
The original message is ambiguous. It fails to clearly convey that
stopping *any* individual OSD in that specific bucket will drop PGs
offline, meaning no OSDs within that bucket can be safely upgraded at
this time.
Update the phrasing to explicitly state that at least X PGs will go offline
if any OSD out of the total count in that CRUSH bucket is stopped. Also
standardize on capitalized acronyms (PG, OSD, CRUSH) and wrap the bucket
name in single quotes for better log readability.
Kobi Ginon [Wed, 3 Jun 2026 14:03:09 +0000 (17:03 +0300)]
doc/rbd: clarify Rocky iSCSI gateway requirements
List Rocky Linux 8+ alongside RHEL/CentOS Stream 7.5+. Note that packaged
ceph-iscsi must recognize Rocky in /etc/os-release (ceph-iscsi#282). Add a
short Rocky note under iSCSI targets; expand the overview maintenance
warning with migration guidance to RBD and the NVMe-oF gateway.
mds: correction in the description for mds_log_max_segments config
Since the config option `mds_log_max_segments` is an unsigned
integer, the value '-1' is not permitted. There is also no need
to disable this limit, so remove that statement from its
description.
Since we are changing the 'application' for the report, we need a
non-read-only result in case the API call is served from the cache.
Use the 'pool_stats' map directly to avoid copying the pg_dump,
which can be huge.
mgr: replace TTLCache with MgrMapCache and protect api with readonly
This patch removes the old TTLCache implementation and introduces
a new generic MgrMapCache driven by a runtime toggle:
- Add `mgr_map_cache_enabled` config option in global.yaml
- Swap out `ttl_cache` for `api_cache` (MgrMapCache) in ActivePyModules
- Update cacheable_get_python() and get_python() to use LFU‐based api_cache
- Add a new get_mutable parameter to the get API call to get a copy
- Invalidate api_cache on notify_all events
- Remove all TTLCache headers, sources, and tests
- Include MgrMapCache.cc in CMakeLists and update BaseMgrModule bindings
- Improve logging around cache hits, misses, and state changes
- ActivePyModules
* Remove unused update_cache_metrics()
* Log cache hits/misses inline and only insert into cache when
enabled+cacheable (with proper Py_INCREF)
* Switch get_python() to use PyFormatterRO for cacheable routes, PyFormatter otherwise
- MgrMapCache/LFUCache
* Add can_read_cache()/can_write_cache() helpers and use const& for key parameters
* Guard perf counter increments and improve debug logging
- PyFormatter
* Add PyFormatterRO subclass that freezes dicts/lists into read-only
proxies on the fly
- Python mgr_module
* Simplify get() to return raw result
This change ensures immutable JSON output on cache hits and tightens up cache logic.
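
The idea of handing out read-only views of cached data can be
illustrated in pure Python. This is only an analogy under stated
assumptions: the commit's PyFormatterRO is C++ code that builds the
proxies while formatting, whereas the hypothetical `freeze()` helper
below converts an existing structure after the fact, using the standard
library's `types.MappingProxyType` for dicts and tuples for lists.

```python
from types import MappingProxyType


def freeze(obj):
    """Recursively wrap dicts in read-only proxies and lists in tuples."""
    if isinstance(obj, dict):
        return MappingProxyType({k: freeze(v) for k, v in obj.items()})
    if isinstance(obj, list):
        return tuple(freeze(v) for v in obj)
    return obj


def try_write(mapping, key, value):
    """Return True if the write succeeded, False if it was rejected."""
    try:
        mapping[key] = value
    except TypeError:
        return False
    return True


# Hypothetical cached map; reads still work, writes are rejected.
cached = freeze({"pools": [{"pool": 1, "pool_name": "rbd"}]})
```

A caller that needs to edit such a result has to take its own copy
first, for example `dict(cached["pools"][0])`.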
mgr/cli: add cache flush command with proper status reporting
Allow operators to invalidate individual mgr Python caches at runtime
without restarting the manager. Introduces a new CLI command:
$ ceph mgr cli cache flush <map-name>
which returns success or a clear error if the named cache entry doesn’t
exist or isn’t cacheable. This makes it easy to drop stale cached maps
(e.g. osd_map, mon_map) on demand.
Fixes: https://tracker.ceph.com/issues/72447
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
mgr: add new unit tests for MgrMapCache
- Guard against null perf‐counter before calling inc(), preventing crashes
- Add “foo” to allowed_keys list (for test coverage)
- Rename and refocus the CMake test target from TTLCache to MgrMapCache
- Introduce test_mgrmapcache.cc with LFUCache tests.
- Remove the obsolete test_ttlcache.cc
Fixes: https://tracker.ceph.com/issues/72447
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
mgr/test_cache: add new tests
Naman Munet [Mon, 4 May 2026 12:57:53 +0000 (18:27 +0530)]
mgr/dashboard: multisite sync-policy page should include daemon selection
Fixes: https://tracker.ceph.com/issues/71522
Changes include:
- Added daemon selection support to all sync policy endpoints
- Enhanced backend with daemon context awareness
- Fetch only the sync policies from the specified daemon
Dhairya Parmar [Wed, 20 May 2026 21:18:15 +0000 (02:48 +0530)]
mds: persist session auth_name in ESession journal event
So that it can be applied to the freshly created session when
ESession::replay recreates a session because the OMAP version fell
behind the ESession cmapv. Without it, the newly created session would
be rejected as a target when a client tries to reclaim it.