Kefu Chai [Thu, 4 Jun 2026 10:38:24 +0000 (18:38 +0800)]
test/rgw/posix: free the quota handler in TestDriver
TestDriver::init() allocates quota_handler via
RGWQuotaHandler::generate_handler() but nothing frees it. The real
POSIXDriver frees it in finalize(), which the unit tests never call, so
every fixture that runs init() leaks the handler and the stat caches
hanging off it: 274 allocations, ~40KB, all rooted at generate_handler()
under ASan:
==6102==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 3200 byte(s) in 5 object(s) allocated from:
#1 RGWQuotaHandler::generate_handler(...) src/rgw/rgw_quota.cc:989
#2 TestDriver::init(...) src/test/rgw/test_rgw_posix_driver.cc:1100
#3 POSIXDriverTest::SetUp() src/test/rgw/test_rgw_posix_driver.cc:1191
...
SUMMARY: AddressSanitizer: 40099 byte(s) leaked in 274 allocation(s).
So free it in ~TestDriver(), the counterpart to the init() allocation.
~POSIXDriver() is empty and nothing else touches quota_handler, so there
is no double free, and free_handler(nullptr) is a no-op when init()
bailed out early.
ceph.spec.in: require c-ares >= 1.28 for ceph-osd-crimson
Seastar's DNS stack uses ares_query_dnsrec when built against c-ares
>= 1.28 (ARES_VERSION >= 0x011c00). Only ceph-osd-crimson links that
path; classic-osd does not, so add the version floor on the crimson
subpackage only.
Rocky Linux 10 shaman builds use docker.io/rockylinux/rockylinux:10
(os-release 10.1), but dnf builddeps resolve against the live Rocky 10
BaseOS/AppStream repos, which track the newest minor and install
c-ares-devel/c-ares 1.34.6. CMake links ceph-osd-crimson against that
library. Teuthology nodes are provisioned as Rocky 10.1 and install only
the requested Ceph packages without a full distro upgrade, so their
baseline c-ares stays at 1.25.0 (< 1.28, no ares_query_dnsrec). Install
succeeds but OSD startup fails with "undefined symbol: ares_query_dnsrec".
Require c-ares >= 1.28 on ceph-osd-crimson so dnf upgrades to a suitable
libcares (1.34.6 is already in Rocky 10.1 baseos) or fails cleanly at
install. Ubuntu crimson CI does not show this mismatch: the same LTS is
used for building and testing, and maintainers do not bump upstream
package versions across an LTS lifecycle (only cherry-picked fixes), so
build-time and runtime libc-ares stay aligned.
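The version floor above can be checked with a little arithmetic. This is a minimal sketch, assuming c-ares packs its version as (major << 16) | (minor << 8) | patch, which matches the 0x011c00 value quoted for 1.28; the helper names are hypothetical and not part of the spec file.

```python
# Sketch only: helper names are hypothetical; the packing layout is an
# assumption that matches ARES_VERSION >= 0x011c00 for c-ares 1.28.0.
ARES_FLOOR = 0x011C00  # c-ares 1.28.0, per the commit message


def ares_version(major: int, minor: int, patch: int = 0) -> int:
    """Pack a c-ares version triple the way ARES_VERSION is compared."""
    return (major << 16) | (minor << 8) | patch


def meets_floor(major: int, minor: int, patch: int = 0) -> bool:
    """True when the given c-ares version is at or above the 1.28 floor."""
    return ares_version(major, minor, patch) >= ARES_FLOOR


if __name__ == "__main__":
    # Versions taken from the commit message: build host vs. test node.
    for triple in [(1, 34, 6), (1, 25, 0)]:
        print(triple, "meets floor:", meets_floor(*triple))
```

The build-time library (1.34.6) passes the check while the test node baseline (1.25.0) does not, which is the mismatch the new Requires is meant to surface at install time.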
get('osd_map') returns the cached object directly, so del and key
assignments were silently corrupting the cache for subsequent callers.
Take a shallow copy before modifying, and use pop() instead of del in
case the cache was already corrupted.
mgr/DaemonServer: clarify ok-to-upgrade error message for CRUSH buckets
Refine the error string in DaemonServer.cc returned by the
ok-to-upgrade command when OSDs in a CRUSH bucket cannot be upgraded.
The original message is ambiguous. It fails to clearly convey that
stopping *any* individual OSD in that specific bucket will drop PGs
offline, meaning no OSDs within that bucket can be safely upgraded at
this time.
Update the phrasing to explicitly state that at least X PGs will go offline
if any OSD out of the total count in that CRUSH bucket is stopped. Also
standardize on capitalized acronyms (PG, OSD, CRUSH) and wrap the bucket
name in single quotes for better log readability.
Kobi Ginon [Wed, 20 May 2026 13:09:53 +0000 (16:09 +0300)]
ceph.spec: declare PyYAML and Jinja2 Requires for cephadm RPM
cephadm's zipapp imports yaml and jinja2. When the package is built
with cephadm_bundling and RPM-sourced deps (--without cephadm_pip_deps),
only BuildRequires were listed for jinja2 and no runtime Requires were
declared for PyYAML or Jinja2, so minimal "dnf install cephadm" could
fail with ModuleNotFoundError. The unbundled cephadm case also lacked
PyYAML.
Add matching BuildRequires/Requires in the rpm-bundle branch and
Requires (plus SUSE naming for Jinja2/PyYAML) in the no-bundle branch.
Kobi Ginon [Wed, 3 Jun 2026 14:03:09 +0000 (17:03 +0300)]
doc/rbd: clarify Rocky iSCSI gateway requirements
List Rocky Linux 8+ alongside RHEL/CentOS Stream 7.5+. Note that packaged
ceph-iscsi must recognize Rocky in /etc/os-release (ceph-iscsi#282). Add a
short Rocky note under iSCSI targets; expand the overview maintenance
warning with migration guidance to RBD and the NVMe-oF gateway.
Casey Bodley [Tue, 2 Jun 2026 20:17:59 +0000 (16:17 -0400)]
osdc: deliver neorados completions to associated executor
while Objecter delivers librados completions directly by calling
Context::complete(), neorados completions are passed in as
boost::asio::any_completion_handler and delivered to an asio executor
via boost::asio::defer() on completion
asio handlers may have an "associated executor" so callers can customize
where these completions are delivered. for example, multithreaded
applications often use strand executors to synchronize completion
handlers and prevent data races between concurrent operations
however, applications like radosgw that depend on strands for
thread-safety did not get it due to the fact that Objecter's
Op::complete() delivered all neorados completions to the default
io_context executor
use boost::asio::get_associated_executor() to respect the handler's
executor affinity, if any. but because the Op's handler is the
type-erased any_completion_handler, its associated executor is also
type-erased as any_completion_executor. that any_completion_executor
doesn't support the blocking::never_t property required by defer/post,
so defer() was changed to dispatch() which may call the handler directly
if Objecter is already running on the requested executor. i assume this
is safe, given that librados' Context-based completions already do this
mds : correction in the description for mds_log_max_segments config
Since we use an unsigned integer for the config option
`mds_log_max_segments`, the value '-1' is not permitted,
and there is no need to disable this limit. Hence remove
this statement from its description.
Resetting a txn doesn't really create a new one semantically.
Avoid incrementing "created" on reset; otherwise we end up
with inflated numbers where the MUTATE txn created count
is twice as high as the committed count.
Note that "resets" are already tracked as invalidated.
Matty Williams [Fri, 12 Dec 2025 10:13:11 +0000 (10:13 +0000)]
osd: Add ECOmapJournal class and relocate OmapUpdateType enum class
The ECOmapJournal will be used to store omap updates (in EC pools with optimisations enabled) that have not yet been committed to the object store.
Added unit tests for this class.
Promoted OmapUpdateType to osd_types.h so that it can be shared across multiple files without circular dependencies.
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
Since we change the 'application' field for the report,
we need a mutable (non-RO) result in case the API call is served from the cache.
Use the 'pool_stats' map directly to avoid copying the pg_dump,
which can be huge.
mgr: replace TTLCache with MgrMapCache and protect api with readonly
This patch removes the old TTLCache implementation and introduces
a new generic MgrMapCache driven by a runtime toggle:
- Add `mgr_map_cache_enabled` config option in global.yaml
- Swap out `ttl_cache` for `api_cache` (MgrMapCache) in ActivePyModules
- Update cacheable_get_python() and get_python() to use LFU‐based api_cache
- Add a new get_mutable parameter to the get API call to return a copy
- Invalidate api_cache on notify_all events
- Remove all TTLCache headers, sources, and tests
- Include MgrMapCache.cc in CMakeLists and update BaseMgrModule bindings
- Improve logging around cache hits, misses, and state changes
- ActivePyModules
* Remove unused update_cache_metrics()
* Log cache hits/misses inline and only insert into cache when
enabled+cacheable (with proper Py_INCREF)
* Switch get_python() to use PyFormatterRO for cacheable routes, PyFormatter otherwise
- MgrMapCache/LFUCache
* Add can_read_cache()/can_write_cache() helpers and use const& for key parameters
* Guard perf counter increments and improve debug logging
- PyFormatter
* Add PyFormatterRO subclass that freezes dicts/lists into read-only
proxies on the fly
- Python mgr_module
* Simplify get() to return raw result
This change ensures immutable JSON output on cache hits and tightens up cache logic.
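To illustrate the read-only idea in plain Python: the sketch below is an analogy, not the PyFormatterRO implementation (which lives in C++). It assumes dicts can be wrapped in `types.MappingProxyType` and lists replaced by tuples; the `freeze` helper is hypothetical.

```python
from types import MappingProxyType


def freeze(value):
    """Recursively wrap dicts and lists so callers cannot mutate them."""
    if isinstance(value, dict):
        # mappingproxy exposes a read-only view of the underlying dict
        return MappingProxyType({k: freeze(v) for k, v in value.items()})
    if isinstance(value, (list, tuple)):
        # tuples are immutable sequences
        return tuple(freeze(v) for v in value)
    return value


if __name__ == "__main__":
    cached = freeze({"epoch": 7, "osds": [{"id": 0}, {"id": 1}]})
    try:
        cached["epoch"] = 8
    except TypeError as exc:
        print("rejected:", exc)
```

A caller that needs to modify the result would ask for a mutable copy instead, which is the role the commit assigns to the get_mutable parameter.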
mgr/cli: add cache flush command with proper status reporting
Allow operators to invalidate individual mgr Python caches at runtime
without restarting the manager. Introduces a new CLI command:
$ ceph mgr cli cache flush <map-name>
which returns success or a clear error if the named cache entry doesn’t
exist or isn’t cacheable. This makes it easy to drop stale cached maps
(e.g. osd_map, mon_map) on demand.
Fixes: https://tracker.ceph.com/issues/72447
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
mgr: add new unit tests for MgrMapCache
- Guard against null perf‐counter before calling inc(), preventing crashes
- Add “foo” to allowed_keys list (for test coverage)
- Rename and refocus the CMake test target from TTLCache to MgrMapCache
- Introduce test_mgrmapcache.cc with LFUCache tests.
- Remove the obsolete test_ttlcache.cc
Fixes: https://tracker.ceph.com/issues/72447
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
mgr/test_cache: add new tests
Kefu Chai [Wed, 3 Jun 2026 05:53:06 +0000 (13:53 +0800)]
crimson/osd: fold the split-child setup into handle_split_pg_creation
A readability cleanup with no behaviour change.
split_pg()'s loop still did all the per-child setup inline (core
mapping, make_pg, split_colls, split_into, the snapmapper touch) and
then called handle_split_pg_creation() to kick off the child's
PGAdvanceMap. That is a lot of detail for what the loop is really doing.
So let's move that setup into handle_split_pg_creation() and have it
return the child PG. The loop then just asks it to create each child and
collects the result, and the per-child PeeringCtx never has to leave the
function. Children are still created one at a time, each with its own
PeeringCtx.
Kefu Chai [Tue, 26 May 2026 08:01:48 +0000 (16:01 +0800)]
crimson/osd: give each split child its own PeeringCtx
This is a readability cleanup, not a bugfix; the existing code is
correct.
handle_split_pg_creation() handed std::move(rctx) to each child's
PGAdvanceMap, reusing the parent's PeeringCtx by moving it into the
child. That leaves the parent's rctx moved-from, so the next split_pg()
iteration writes into the empty husk and moves it again, and
split_stats() runs against it afterwards too. It works, but only because
a moved-from ceph::os::Transaction comes back empty, which is a subtle
thing to rely on.
So let's just give each child its own PeeringCtx: split_colls() and the
snapmapper touch go into the child's context, and we hand that to the
child's do-init PGAdvanceMap. The only behavioural difference is that the
parent's own map-advance writes now commit with the parent instead of
riding into the first child's transaction, which is harmless because the
children are built from the parent's in-memory state.
Naman Munet [Mon, 4 May 2026 12:57:53 +0000 (18:27 +0530)]
mgr/dashboard: multisite sync-policy page should include daemon selection
Fixes: https://tracker.ceph.com/issues/71522
Changes include:
- Added daemon selection support to all sync policy endpoints
- Enhanced backend with daemon context awareness
- Fetch only the sync policies from the specified daemon
mgr/cephadm: Don't skip OSDs with non-empty osdspec_affinity
`ceph cephadm osd activate` calls `deploy_osd_daemons_for_existing_osds`
with a synthetic DriveGroupSpec where service_id=''. Commit fbe3a053
introduced an unconditional osdspec_affinity filter:
if osd['tags']['ceph.osdspec_affinity'] != spec.service_id:
continue
With service_id='', this check skips every OSD whose affinity tag is
non-empty. The fix is to only enforce the affinity check when
spec.service_id is non-empty. An empty service_id means the caller is
osd activate, which should adopt any existing OSD regardless of its
affinity tag.
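The conditional filter can be sketched as follows. This is an illustration under assumptions, not the cephadm code: `select_osds` and the sample data are hypothetical, and only the tag name and the comparison come from the snippet quoted above.

```python
def select_osds(osds, service_id):
    """Return the OSDs a spec with this service_id should deploy.

    An empty service_id (the synthetic spec used by osd activate) disables
    the affinity check, so every existing OSD is adopted.
    """
    selected = []
    for osd in osds:
        affinity = osd["tags"]["ceph.osdspec_affinity"]
        # Enforce the filter only when the spec actually names a service.
        if service_id and affinity != service_id:
            continue
        selected.append(osd)
    return selected


if __name__ == "__main__":
    osds = [
        {"id": 0, "tags": {"ceph.osdspec_affinity": "fast"}},
        {"id": 1, "tags": {"ceph.osdspec_affinity": ""}},
    ]
    print([o["id"] for o in select_osds(osds, "")])
    print([o["id"] for o in select_osds(osds, "fast")])
```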
Kefu Chai [Tue, 2 Jun 2026 07:48:41 +0000 (15:48 +0800)]
rgw/posix: start the Inotify thread last, after the rest is built
f62e811f9ef fixed the wfd/efd init-order race but missed a sibling: thrd
was still declared before map_mutex, the watch maps and the shutdown flag.
Members come up in declaration order, so building thrd kicks off ev_loop()
while those are still uninitialized.
That is bad news, because ev_loop() reads shutdown and, when an event
arrives, locks map_mutex and pokes at the maps. Doing any of that before
they are constructed is undefined behavior: reading the shutdown atomic
before its initializer has even stored false, or locking a std::mutex that
does not exist yet. Making shutdown a std::atomic<bool> made concurrent
access well-defined, but that does not help if the load happens before the
object is constructed.
So just declare thrd last, and the thread will not start until everything
it touches is ready. wfd/efd stay ahead of it, so the earlier fix still
holds.
helgrind caught the shutdown read in ev_loop() racing its own initializer
in the constructor.
Adam Kupczyk [Mon, 18 May 2026 16:33:45 +0000 (16:33 +0000)]
objectstore/test_kv: Unittest for util_divide_key_range
Extensive tests for quality of KeyValueDB::util_divide_key_range.
Tests speed and correctness of split.
Has 2 control modes:
1) on Jenkins (detected by JENKINS_HOME) run with reduced scope
2) passing env VERBOSE=1 gives more details
Adam Kupczyk [Fri, 15 May 2026 16:07:07 +0000 (16:07 +0000)]
kv/KeyValueDB: New utility function util_divide_key_range
Significant reshuffle. Cleaned loops.
Points scanned on db were [size]->[key]. Now it is [key]->[size],
which is better since keys are unique by design, but calculation
of size can be a victim to RocksDB estimation precision.
Dhairya Parmar [Wed, 20 May 2026 21:18:15 +0000 (02:48 +0530)]
mds: persist session auth_name in ESession journal event
So that it can be applied to the freshly created session when
ESession::replay recreates a session because the OMAP version fell
behind the ESession cmapv. Without it, the newly created session would
be rejected as a target when a client tries to reclaim it.
Shai Fultheim [Sun, 24 May 2026 11:19:56 +0000 (14:19 +0300)]
crimson/os/seastore: enforce capacity in RBMCleaner::try_reserve_projected_usage
RBMCleaner::try_reserve_projected_usage always returned true and just
incremented stats.projected_used_bytes. The EPM BackgroundProcess
relies on the return value to block IO when the device is full, so
this effectively disabled backpressure for the RANDOM_BLOCK_SSD
backend: concurrent transactions could each reserve unbounded amounts,
and the over-commit surfaced downstream as `unexpected enospc` asserts
in the data path (object_data_handler.cc and friends, where ENOSPC is
treated as crimson::ct_error::enospc::assert_failure because the
existing infrastructure assumes ENOSPC is impossible). The OSD aborted
under sustained random-write workloads that exceeded RBM capacity.
Compute the device's data capacity as total - journal, subtract a 5%
headroom (for metadata writes and fragmentation slack the AVL allocator
cannot pack into), and reject reservations that would push
used + projected over the line. The existing EPM blocking-IO path
(extent_placement_manager.cc:726) already queues the IO until
release_projected_usage wakes it, so no caller-side changes are needed.
This is the minimal fix to keep the OSD alive under sustained random
writes. It converts a crash into a stall: once the device fills and
the cleaner has nothing to free (RBMCleaner::clean_space is still a
TODO), new writes block indefinitely instead of crashing. Verified
against an 8-job 1MB random-write fio (--size 63g, 90GB RBM, 3GB
journal): 68 GB user-written, host WAF 1.696, OSD survives, watchdog
kills fio after slow-ops timeout. Without this patch the same workload
asserts in the data path.
The headroom is intentionally generous (5%) because there is no GC
yet; once RBMCleaner::clean_space() exists, the headroom can shrink.
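The reservation rule described above (data capacity = total - journal, minus 5% headroom, reject when used + projected would exceed the limit) can be sketched like this. The class and method names are hypothetical, integer rounding is an assumption, and the real logic lives in C++ inside RBMCleaner.

```python
HEADROOM_PERCENT = 5  # from the commit message


class ReservationTracker:
    """Toy model of capacity-checked projected-usage reservations."""

    def __init__(self, total_bytes, journal_bytes, used_bytes=0):
        data_capacity = total_bytes - journal_bytes
        headroom = data_capacity * HEADROOM_PERCENT // 100
        self.limit = data_capacity - headroom
        self.used = used_bytes
        self.projected = 0

    def try_reserve(self, nbytes):
        """Reserve nbytes, or return False so the caller can block the IO."""
        if self.used + self.projected + nbytes > self.limit:
            return False
        self.projected += nbytes
        return True

    def release(self, nbytes):
        """Give back a reservation once the transaction completes."""
        self.projected -= nbytes


if __name__ == "__main__":
    tracker = ReservationTracker(total_bytes=1000, journal_bytes=100)
    print(tracker.limit, tracker.try_reserve(800), tracker.try_reserve(100))
```

Returning False instead of always returning True is what lets the caller apply backpressure; the sketch does not model the wakeup performed by release_projected_usage.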
Marcel Lauhoff [Tue, 5 May 2026 12:21:03 +0000 (14:21 +0200)]
rgw: SSE-KMS: Handle Testing Key Per Object
The testing backend uses a 'keysel' attribute to derive a per-object
key from the KEK in the config. A single key_id with distinct keysel
values yields different keys, which need to be cached as such.
Add the keysel to the cache key id to avoid these collisions.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
Marcel Lauhoff [Tue, 10 Mar 2026 10:51:30 +0000 (11:51 +0100)]
rgw: KMS Cache Shutdown: Reaper first
1. Don't delete the KMS cache before draining/joining the frontend
coroutine threads. They may still depend on the KMS cache.
2. Stop the TTL reaper early to get it off the coroutine pool.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
sieve_expire_erase_unmutexed did not update the sieve hand passed as
advertised. Make it return the updated hand and use that to
update the global _sieve_hand in expire_erase
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
Marcel Lauhoff [Tue, 3 Mar 2026 20:24:10 +0000 (21:24 +0100)]
rgw: SSE-KMS: Handle Vault Transit Key Per Object
KMS backends Barbican, Vault KV, and KMIP have a static key per
key_id. However, with Vault Transit, each object has a unique DEK
wrapped by the transit key.
Keying the cache with key_id in Transit mode results in only the first
DEK being cached and reused for all subsequent objects.
Fix this by appending a hash of the wrapped DEK to the cache key.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
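A rough sketch of the cache-key change for Transit mode. The hash algorithm (SHA-256), the ":" separator, and the function name are assumptions for illustration; the commit only states that a hash of the wrapped DEK is appended to the cache key.

```python
import hashlib
from typing import Optional


def kms_cache_key(key_id: str, wrapped_dek: Optional[bytes] = None) -> str:
    """Build a cache key; per-object backends add a digest of the wrapped DEK."""
    if wrapped_dek is None:
        # Static-key backends: one secret per key_id.
        return key_id
    digest = hashlib.sha256(wrapped_dek).hexdigest()
    return f"{key_id}:{digest}"


if __name__ == "__main__":
    a = kms_cache_key("transit-key", b"wrapped-dek-for-object-a")
    b = kms_cache_key("transit-key", b"wrapped-dek-for-object-b")
    print(a != b)  # distinct objects no longer share a cache entry
```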
Marcel Lauhoff [Thu, 19 Dec 2024 14:41:30 +0000 (15:41 +0100)]
rgw: SSE-KMS Secrets Cache
Add SSE Key Management System secrets cache to RGW.
It is common to have secrets shared by many if not all objects in a
bucket. Without RGW-side caching every PUT/GET will cause a request to
an external KMS. This not only adds load to the KMS, but also slows
down reads and writes.
Combine WebCache, ceph::async::call_once and LinuxKeyringSecret into
KMSCache. WebCache stores async::once_result to wrap results of a KMS
secret fetch to mitigate cache stampedes (concurrent cache requests to
the same key coalesce into one). The retrieved secrets are stored in
the Linux kernel key retention service (LinuxKeyringSecret) for safe
keeping and retrieval by subsequent requests. KMSCache adds a TTL reaper
and life cycle management.
Cache values and error handling: The cache stores positive
fetch results, permanent errors (e.g. key does not exist) and
transient errors (e.g. fetch timeout), each with a different TTL.
Unit tests to cover cached / uncached KMS retrieve and runtime cache
disable via config.
Add perf counter `kms_fetch_lat` to track KMS fetch request latency
and error counters to track permanent, transient and key store
errors.
Fixes: https://tracker.ceph.com/issues/68524
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
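The per-outcome TTL idea can be sketched as below. The TTL values, class names, and the injectable clock are all hypothetical; the sketch leaves out request coalescing, the keyring storage, and the reaper thread.

```python
import time
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    PERMANENT_ERROR = "permanent"
    TRANSIENT_ERROR = "transient"


# Hypothetical TTLs in seconds; the commit only says they differ per outcome.
TTL_SECONDS = {
    Outcome.OK: 300.0,
    Outcome.PERMANENT_ERROR: 60.0,
    Outcome.TRANSIENT_ERROR: 5.0,
}


class OutcomeTTLCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, key, outcome, value=None):
        expires_at = self._clock() + TTL_SECONDS[outcome]
        self._entries[key] = (expires_at, outcome, value)

    def get(self, key):
        """Return (outcome, value), or None when missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, outcome, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy expiry; a reaper would do this in bulk
            return None
        return outcome, value
```

Caching transient errors briefly avoids hammering a struggling KMS, while the short TTL lets a recovered backend be retried soon.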
Marcel Lauhoff [Thu, 18 Dec 2025 18:42:07 +0000 (19:42 +0100)]
common: Refactor LinuxKeyringSecret into Keyring Interface
Goal: Support multiple backends and faking / mocking for testing.
Add abstract classes Keyring (factory) and KeyringSecret. Add
"Unsupported" implementation for non-Linux platforms. Add a get_best
factory function that currently returns the LinuxKeyring impl on Linux
or Unsupported elsewhere.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
Casey Bodley [Mon, 24 Mar 2025 16:51:15 +0000 (12:51 -0400)]
common/async: add call_once() algorithm for optional_yield
modeled after std::call_once() to guarantee that racing callers wait for
the initial caller to finish. the main differences here are
* support for coroutine callers to suspend instead of blocking while
waiting for the initial caller, and
* the wrapped function must return a value, which is cached and returned
to all callers
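For intuition, here is a blocking-only analogue of that contract: the first caller runs the function, racing callers wait, and everyone receives the cached value. It does not model coroutine suspension or optional_yield, and the `OnceResult` name here is a hypothetical Python class, not the C++ type.

```python
import threading


class OnceResult:
    """Run a function at most once and hand its result to every caller."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False
        self._value = None

    def call_once(self, fn):
        # Racing callers block on the lock until the initial caller finishes.
        with self._lock:
            if not self._done:
                self._value = fn()   # if fn raises, a later caller retries
                self._done = True
            return self._value


if __name__ == "__main__":
    once = OnceResult()
    print(once.call_once(lambda: "fetched"), once.call_once(lambda: "ignored"))
```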
Casey Bodley [Mon, 24 Mar 2025 16:50:16 +0000 (12:50 -0400)]
common/async: yield_waiter overloads for unique_lock
if async_wait() can race with complete() across threads, the
yield_waiter's handler_state needs to be protected by a mutex. add
an async_wait() overload for unique_lock that behaves like
condition_variable::wait(): the lock is released immediately before
suspending, and reacquired immediately before calling its completion
handler