Shai Fultheim [Sun, 24 May 2026 11:19:56 +0000 (14:19 +0300)]
crimson/os/seastore: enforce capacity in RBMCleaner::try_reserve_projected_usage
RBMCleaner::try_reserve_projected_usage always returned true and just
incremented stats.projected_used_bytes. The EPM BackgroundProcess
relies on the return value to block IO when the device is full, so
this effectively disabled backpressure for the RANDOM_BLOCK_SSD
backend: concurrent transactions could each reserve unbounded amounts,
and the over-commit surfaced downstream as `unexpected enospc` asserts
in the data path (object_data_handler.cc and friends, where ENOSPC is
treated as crimson::ct_error::enospc::assert_failure because the
existing infrastructure assumes ENOSPC is impossible). The OSD aborted
under sustained random-write workloads that exceeded RBM capacity.
Compute the device's data capacity as total - journal, subtract a 5%
headroom (for metadata writes and fragmentation slack the AVL allocator
cannot pack into), and reject reservations that would push
used + projected over the line. The existing EPM blocking-IO path
(extent_placement_manager.cc:726) already queues the IO until
release_projected_usage wakes it, so no caller-side changes are needed.
This is the minimal fix to keep the OSD alive under sustained random
writes. It converts a crash into a stall: once the device fills and
the cleaner has nothing to free (RBMCleaner::clean_space is still a
TODO), new writes block indefinitely instead of crashing. Verified
against an 8-job 1MB random-write fio (--size 63g, 90GB RBM, 3GB
journal): 68 GB user-written, host WAF 1.696, OSD survives, watchdog
kills fio after slow-ops timeout. Without this patch the same workload
asserts in the data path.
The headroom is intentionally generous (5%) because there is no GC
yet; once RBMCleaner::clean_space() exists, the headroom can shrink.
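The capacity check described above can be modelled roughly as follows. This is a minimal Python sketch, not the seastore code: the class and method names are hypothetical, and only the arithmetic (total - journal, minus 5% headroom, reject when used + projected would exceed the limit) is taken from the commit message.

```python
HEADROOM_PERCENT = 5  # headroom stated in the commit message


def usable_capacity(total_bytes: int, journal_bytes: int) -> int:
    """Data capacity (total - journal) minus the headroom, in bytes."""
    data_capacity = total_bytes - journal_bytes
    return data_capacity - data_capacity * HEADROOM_PERCENT // 100


class ProjectedUsage:
    """Hypothetical stand-in for the cleaner's usage accounting."""

    def __init__(self, total_bytes: int, journal_bytes: int):
        self.limit = usable_capacity(total_bytes, journal_bytes)
        self.used_bytes = 0
        self.projected_used_bytes = 0

    def try_reserve(self, nbytes: int) -> bool:
        # Reject a reservation that would push used + projected over the limit.
        if self.used_bytes + self.projected_used_bytes + nbytes > self.limit:
            return False
        self.projected_used_bytes += nbytes
        return True

    def release(self, nbytes: int) -> None:
        # Give the reservation back; a blocked caller could then retry.
        self.projected_used_bytes -= nbytes
```

With the 90GB device and 3GB journal from the test setup, the limit works out to about 82.65GB, so a caller that gets False is expected to wait until a release makes room.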
Marcel Lauhoff [Tue, 5 May 2026 12:21:03 +0000 (14:21 +0200)]
rgw: SSE-KMS: Handle Testing Key Per Object
The testing backend uses a 'keysel' attribute to derive a per object
key from the KEK in the config. A single key_id with distinct keysel
values yields different keys, which need to be cached as such.
Add the keysel to the cache key id to handle these collisions.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
Marcel Lauhoff [Tue, 10 Mar 2026 10:51:30 +0000 (11:51 +0100)]
rgw: KMS Cache Shutdown: Reaper first
1. Don't delete the KMS cache before draining/joining the frontend
coroutine threads. They may still depend on the KMS cache.
2. Stop the TTL reaper early to get it off the coroutine pool.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
sieve_expire_erase_unmutexed did not update the sieve hand passed to
it, contrary to what it advertised. Make it return the updated hand
and use that to update the global _sieve_hand in expire_erase.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
Marcel Lauhoff [Tue, 3 Mar 2026 20:24:10 +0000 (21:24 +0100)]
rgw: SSE-KMS: Handle Vault Transit Key Per Object
KMS backends Barbican, Vault KV, and KMIP have a static key per
key_id. However, with Vault Transit, each object has a unique DEK
wrapped by the transit key.
Keying the cache with key_id in Transit mode results in only the first
DEK being cached and reused for all subsequent objects.
Fix this by appending a hash of the wrapped DEK to the cache key.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
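A rough illustration of the cache-key change, as a Python sketch. The function name, the ":" separator, and the choice of SHA-256 are assumptions made for the example; the commit only says that a hash of the wrapped DEK is appended to the cache key.

```python
import hashlib
from typing import Optional


def cache_key(key_id: str, wrapped_dek: Optional[bytes] = None) -> str:
    """Build a cache key; per-object DEKs get a distinguishing suffix."""
    if wrapped_dek is None:
        # Backends with one static key per key_id: key_id alone is enough.
        return key_id
    # Transit-style backends: each object has its own wrapped DEK, so mix
    # a digest of it into the key (SHA-256 is an assumption here).
    digest = hashlib.sha256(wrapped_dek).hexdigest()
    return f"{key_id}:{digest}"
```

Two objects sharing a key_id but carrying different wrapped DEKs then map to different cache entries, while repeated requests for the same object still hit the same entry.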
Marcel Lauhoff [Thu, 19 Dec 2024 14:41:30 +0000 (15:41 +0100)]
rgw: SSE-KMS Secrets Cache
Add SSE Key Management System secrets cache to RGW.
It is common to have secrets shared by many if not all objects in a
bucket. Without RGW-side caching every PUT/GET will cause a request to
an external KMS. This not only adds load to the KMS, but also slows
down reads and writes.
Combine WebCache, ceph::async::call_once and LinuxKeyringSecret into
KMSCache. WebCache stores async::once_result to wrap results of a KMS
secret fetch to mitigate cache stampedes (concurrent cache requests to
the same key coalesce into one). The retrieved secrets are stored in
the Linux kernel key retention service (LinuxKeyringSecret) for safe
keeping and retrieval by subsequent requests. KMSCache adds a TTL reaper
and life cycle.
Cache values and error handling: The cache stores positive
fetch results, permanent errors (e.g. key does not exist) and
transient errors (e.g. fetch timeout), each with a different TTL.
Unit tests to cover cached / uncached KMS retrieve and runtime cache
disable via config.
Add perf counter `kms_fetch_lat` to track KMS fetch request latency
and error counters to track permanent, transient and key store
errors.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
Fixes: https://tracker.ceph.com/issues/68524
On-behalf-of: SAP marcel.lauhoff@sap.com
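The per-outcome TTL idea can be sketched as below. This is an illustrative Python model only: the class names and the TTL values are hypothetical, and it leaves out the request coalescing and keyring storage that KMSCache provides.

```python
import time
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    PERMANENT_ERROR = "permanent"
    TRANSIENT_ERROR = "transient"


# Hypothetical TTLs in seconds; the real values are configuration-dependent.
TTL_SECONDS = {
    Outcome.OK: 300.0,
    Outcome.PERMANENT_ERROR: 60.0,
    Outcome.TRANSIENT_ERROR: 5.0,
}


class TtlCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (outcome, value, expires_at)

    def put(self, key, outcome, value=None):
        expires_at = self._clock() + TTL_SECONDS[outcome]
        self._entries[key] = (outcome, value, expires_at)

    def get(self, key):
        """Return (outcome, value), or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        outcome, value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy expiry; a reaper would do this too
            return None
        return outcome, value
```

Caching errors with a short TTL keeps a failing key from triggering a fetch on every request, while still allowing recovery soon after a transient failure clears.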
Marcel Lauhoff [Thu, 18 Dec 2025 18:42:07 +0000 (19:42 +0100)]
common: Refactor LinuxKeyringSecret into Keyring Interface
Goal: Support multiple backends and faking / mocking for testing.
Add abstract classes Keyring (factory) and KeyringSecret. Add
"Unsupported" implementation for non-Linux platforms. Add a get_best
factory function that currently returns the LinuxKeyring impl on Linux
or Unsupported elsewhere.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
On-behalf-of: SAP marcel.lauhoff@sap.com
Casey Bodley [Mon, 24 Mar 2025 16:51:15 +0000 (12:51 -0400)]
common/async: add call_once() algorithm for optional_yield
modeled after std::call_once() to guarantee that racing callers wait for
the initial caller to finish. the main differences here are
* support for coroutine callers to suspend instead of blocking while
waiting for the initial caller, and
* the wrapped function must return a value, which is cached and returned
to all callers
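The caching semantics can be modelled in a few lines of Python. This sketch is only an analogy with hypothetical names: it blocks racing threads on a lock, whereas the call_once() described above lets coroutine callers suspend instead of blocking.

```python
import threading


class OnceResult:
    """Runs a function at most once and hands its value to every caller."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False
        self._value = None

    def call_once(self, fn):
        # Racing callers wait here until the initial caller has finished.
        with self._lock:
            if not self._done:
                self._value = fn()
                self._done = True
            return self._value
```

If fn raises, _done stays False, so a later caller retries; that mirrors how std::call_once treats an exceptional call.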
Casey Bodley [Mon, 24 Mar 2025 16:50:16 +0000 (12:50 -0400)]
common/async: yield_waiter overloads for unique_lock
if async_wait() can race with complete() across threads, the
yield_waiter's handler_state needs to be protected by a mutex. add
an async_wait() overload for unique_lock that behaves like
condition_variable::wait(): the lock is released immediately before
suspending, and reacquired immediately before calling its completion
handler
Adam Kupczyk [Tue, 1 Jul 2025 11:30:59 +0000 (11:30 +0000)]
kv/KeyValueDB: New utility function util_divide_key_range
New function splits provided range into smaller chunks.
Declared in KeyValueDB, but implemented only for RocksDBStore.
Useful for splitting large datasets for multiple threads to
iterate in parallel.
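To illustrate the general idea of chunking a key range for parallel iteration, here is a small Python sketch. It is not how RocksDBStore implements util_divide_key_range; the function name and the approach (splitting an in-memory sorted key list evenly) are assumptions made for the example.

```python
def divide_sorted_keys(keys, parts):
    """Split sorted keys into up to `parts` contiguous, non-overlapping chunks."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    n = len(keys)
    parts = min(parts, n)  # never produce empty chunks
    chunks = []
    start = 0
    for i in range(parts):
        # Spread the remainder over the first n % parts chunks.
        size = n // parts + (1 if i < n % parts else 0)
        chunks.append(keys[start:start + size])
        start += size
    return chunks
```

Each chunk could then be handed to its own worker thread, since the chunks cover the whole range without overlapping.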
Devika Babrekar [Thu, 21 May 2026 06:33:04 +0000 (12:03 +0530)]
mgr/dashboard: Combining Quorum tables data on Monitors page
Fixes: https://tracker.ceph.com/issues/76746
Signed-off-by: Devika Babrekar <devika.babrekar@ibm.com>
Sun Yuechi [Mon, 1 Jun 2026 06:52:03 +0000 (14:52 +0800)]
cmake: link legacy-option-headers from targets that use legacy options
The *_legacy_options.h headers that define the legacy ConfigValues
members are generated at build time by y2c.py. Linking the
legacy-option-headers INTERFACE library adds an order dependency on
that step. A few targets reference legacy members without linking it,
so under a parallel build they can be compiled before the headers
exist and fail with "class ConfigValues has no member ...":
Kefu Chai [Mon, 1 Jun 2026 10:40:06 +0000 (18:40 +0800)]
qa/cephadm: query iSCSI gateway FQDN from inside the container
rbd-target-api validates that the gateway hostname supplied by gwcli
matches the container's own socket.getfqdn(). Running the same call on
the host can return a different value when the host and container resolve
names differently (e.g. on Rocky 10), causing gateway creation to fail
with HTTP 400 and all subsequent gwcli configuration to break silently.
Query the FQDN from inside the iSCSI container directly so the value is
always consistent with what rbd-target-api expects. This also removes the
"run twice" workaround, which was compensating for host-side DNS
warm-up flakiness rather than addressing the underlying mismatch.
Kefu Chai [Tue, 19 May 2026 14:43:56 +0000 (22:43 +0800)]
container: install ceph-mgr-modules-core and ceph-mgr-modules-standard
The Containerfile uses --setopt=install_weak_deps=False throughout, so
ceph-mgr-modules-core (a Recommends of ceph-mgr, not a Requires) and
the split-out module packages are not automatically installed. Add them
explicitly.
ceph-mgr-modules-core was split into per-module packages so that users
only need to install what they actually use. To ease migration for
existing deployments that want the full former set, add a meta-package
ceph-mgr-modules-standard that pulls in all modules which were
previously bundled in ceph-mgr-modules-core.
Kefu Chai [Tue, 19 May 2026 13:50:23 +0000 (21:50 +0800)]
debian, rpm: add ceph-mgr-cli-api package
cli_api is a new module in this release (not previously shipped in
ceph-mgr-modules-core), so it gets its own package without any
Breaks/Replaces or Obsoletes against the old monolithic package.
Kefu Chai [Thu, 21 May 2026 12:45:36 +0000 (20:45 +0800)]
debian: fix missing python3 deps for diskprediction-local and osd-support
diskprediction-local depends on python3-prettytable and osd-support
depends on python3-cherrypy3; both need to be declared explicitly now
that these modules are separate packages.
Kefu Chai [Thu, 21 May 2026 12:45:30 +0000 (20:45 +0800)]
rpm: split ceph-mgr-modules-core into per-module packages
ceph-mgr-modules-core has historically bundled always-on modules
together with optional ones, forcing users to install modules and their
dependencies even when they have no use for them. Split each optional
module into its own package so users and distributions can install only
what they need.
ceph-mgr-modules-core is trimmed to the 10 always-on modules defined
in src/mon/MgrMonitor.cc: balancer, crash, devicehealth, orchestrator,
pg_autoscaler, progress, rbd_support, status, telemetry, volumes.
Each optional module now follows the pattern of ceph-mgr-k8sevents and
ceph-mgr-rook.
New packages carry Obsoletes: ceph-mgr-modules-core < 21.0.0 for
proper upgrade path.
The split also exposes cross-module Python dependencies: modules
co-installed in ceph-mgr-modules-core could freely import each other,
but once separated into individual packages those imports require
explicit Requires entries. Now the inter-dependencies are expressed
properly in ceph.spec.in.
Kefu Chai [Thu, 21 May 2026 12:54:02 +0000 (20:54 +0800)]
rpm: deduplicate mgr module scriptlets with a macro
Define %ceph_mgr_module_scripts() to emit the identical %post/%postun
pair for each optional mgr module package, replacing the 5 existing
hand-written copies (dashboard, diskprediction-local, rook, k8sevents,
cephadm) with a single call site per package. Subsequent commits that
introduce new mgr module packages can use the macro from the start.
Kefu Chai [Tue, 19 May 2026 13:45:25 +0000 (21:45 +0800)]
debian: split ceph-mgr-modules-core into per-module packages
ceph-mgr-modules-core has historically bundled always-on modules
together with optional ones, forcing users to install modules and their
dependencies even when they have no use for them. Split each optional
module into its own package so users and distributions can install only
what they need.
ceph-mgr-modules-core is trimmed to the 10 always-on modules defined
in src/mon/MgrMonitor.cc: balancer, crash, devicehealth, orchestrator,
pg_autoscaler, progress, rbd_support, status, telemetry, volumes.
Each optional module now follows the pattern of ceph-mgr-k8sevents and
ceph-mgr-rook.
New packages carry Breaks/Replaces: ceph-mgr-modules-core (<< 21.0.0)
for proper file ownership transfer on upgrade.
The split also exposes cross-module Python dependencies: modules
co-installed in ceph-mgr-modules-core could freely import each other,
but once separated into individual packages those imports require
explicit Depends entries. Now the inter-dependencies are properly
reflected in debian/control.
Kefu Chai [Mon, 1 Jun 2026 05:19:04 +0000 (13:19 +0800)]
test/libcephfs: reduce SnapDiffDeletionRecreation bulk_count on Windows
this test timed out on Windows, and HugeSnapDiffLargeDelta, at half
the file count, passed in 508 seconds on the same run, suggesting this
test takes ~17 minutes on Windows -- beyond the test runner limit.
we haven't profiled the Windows client yet, but the likely culprit is
EventPoll, the Windows messenger backend, which scans the entire poll
array on every event_wait() and poll_ctl() call rather than using a
keyed data structure.
in this change, we reduce bulk_count to 1 << 12 on Windows. the unique
thing this test covers is the deletion-recreation pattern: a name that
exists as a file in snap1, gets deleted, and reappears as a directory in
snap2 -- it must show up in the diff with both snapids. 4096 produces
1024 such pairs, which is enough to exercise that logic. multi-fragment
snapdiff is already covered by HugeSnapDiffLargeDelta, which derives its
file count from mds_bal_split_size and mds_bal_fragment_fast_factor
explicitly to trigger fragmentation.
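The numbers in this message can be checked with a little arithmetic. The Python sketch below assumes runtime scales linearly with file count, which is what the ~17 minute estimate implies; the variable names are illustrative.

```python
# HugeSnapDiffLargeDelta ran with half the file count and took 508 seconds.
half_count_seconds = 508

# Assumption: runtime scales linearly with the number of files.
estimated_seconds = 2 * half_count_seconds
estimated_minutes = estimated_seconds / 60  # just under 17 minutes

# Reduced bulk_count used on Windows.
bulk_count = 1 << 12  # 4096

# The message states that 4096 entries yield 1024 deletion-recreation
# pairs, i.e. one pair per four entries.
pairs = bulk_count // 4
```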
Sun Yuechi [Sun, 31 May 2026 09:04:09 +0000 (17:04 +0800)]
compressor/zstd: include <zstd.h> instead of the bundled path
ZstdCompressor.h hard-codes #include "zstd/lib/zstd.h", which only
resolves because include_directories(src) puts the bundled submodule
on the search path. It thus silently depends on src/zstd being checked
out, and breaks with -DWITH_SYSTEM_ZSTD=ON where the submodule is absent.
ceph_zstd already links Zstd::Zstd, whose INTERFACE_INCLUDE_DIRECTORIES
points at the directory holding zstd.h in both modes: src/zstd/lib for
the bundled build, ${Zstd_INCLUDE_DIR} for the system one. Use <zstd.h>
so the include resolves through that interface either way.
Sun Yuechi [Sat, 30 May 2026 06:15:12 +0000 (14:15 +0800)]
cmake: add WITH_SYSTEM_SPDK to link a system-installed SPDK
By default ceph builds the bundled src/spdk fork via BuildSPDK. Add a
WITH_SYSTEM_SPDK option that instead locates a distro-provided SPDK
through a new Findspdk.cmake (pkg-config based, modelled on
Finddpdk.cmake), exposing the same spdk::spdk target.
Sun Yuechi [Sat, 30 May 2026 06:11:11 +0000 (14:11 +0800)]
blk/spdk: support both old and new spdk_env_opts member names
SPDK 21.01 renamed two struct spdk_env_opts members: pci_whitelist ->
pci_allowed and master_core -> main_core. Guard the assignments in
NVMEDevice with SPDK_VERSION.
Kefu Chai [Sat, 30 May 2026 07:49:18 +0000 (15:49 +0800)]
rgw/posix: fix event replay in BucketCache ev_loop
evec is never cleared after each n->notify() call, so events accumulate
across iterations of ev_loop's inner for loop. Each notify() call
receives not just the current event but all events dispatched in earlier
iterations too.
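The accumulation bug can be reproduced in miniature. This Python sketch uses hypothetical function names and a plain list in place of evec; it shows only the pattern, not the BucketCache code.

```python
def dispatch_buggy(batches, notify):
    evec = []
    for events in batches:
        evec.extend(events)
        notify(list(evec))  # evec is never cleared, so old events replay


def dispatch_fixed(batches, notify):
    evec = []
    for events in batches:
        evec.extend(events)
        notify(list(evec))
        evec.clear()  # each notify() now sees only the current events
```

In the buggy variant the second notify() receives the first iteration's events again, which is the replay the commit describes.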
Kefu Chai [Sat, 30 May 2026 07:49:14 +0000 (15:49 +0800)]
rgw/posix: fix refcount leaks in BucketCache
get_bucket(FLAG_LOCK) increments the refcount via lru.ref(), but three
paths returned without the paired lru.unref(): the "do nothing" early
return and the INVALIDATE branch in notify(), and unconditionally in
invalidate_bucket(). Entries hitting these paths accumulated inflated
refcounts that the LRU could never reclaim, leaking during
~BucketCache() → cache.drain().
Replace the manual lru.unref() calls in notify(), add_entry(),
remove_entry(), invalidate_bucket(), and list_bucket() with a scope_guard
declared before unique_lock. Since the guard outlives ulk, it fires after
the mutex is released on all paths, including exceptions from
getRWTransaction() or txn->commit() (e.g. MDB_MAP_FULL, EIO) that the
manual calls never reached.
list_bucket() also had a bare b->mtx.unlock() after fill(); replace it
with unique_lock{..., std::adopt_lock} so a throw from fill() releases
the mutex too.
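The ordering argument (guard declared before the lock, so it fires after the mutex is released, even on exceptions) has a close analogue in nested Python context managers. The sketch below is an analogy with hypothetical names, not the rgw/posix code.

```python
import threading
from contextlib import contextmanager


class Entry:
    def __init__(self):
        self.refcount = 0
        self.mtx = threading.Lock()


@contextmanager
def held_ref(entry):
    entry.refcount += 1  # stands in for lru.ref()
    try:
        yield entry
    finally:
        entry.refcount -= 1  # stands in for lru.unref(); runs on every path


def update(entry, commit):
    with held_ref(entry):  # outer scope: reference dropped last
        with entry.mtx:  # inner scope: mutex released first
            commit()  # may raise, e.g. on an I/O error
```

Because the reference is dropped in a finally block, an exception from commit() can no longer leave the refcount inflated.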
Eric Zhang [Thu, 28 May 2026 22:56:44 +0000 (15:56 -0700)]
qa: Fix teuthology test timing out
Enable autoscale mode for pools, which is off by default in teuthology.
Increase mon_target_pg_per_osd so pools scale up by enough.
Signed-off-by: Eric Zhang <emzhang@ibm.com>
The POSIXBucket copy constructor incorrectly calls .get() on a
temporary unique_ptr returned by clone(), causing immediate
deletion of the Directory object. This leaves a dangling pointer
that triggers a segfault during destruction.
Matt Benjamin [Tue, 3 Feb 2026 22:12:22 +0000 (17:12 -0500)]
posixdriver: fix cksum_type, flags propagation
Posixdriver doesn't serialize POSIXMultipartUpload, but rather a
member mp_obj of type POSIXMPObj--so to avoid losing the latter's
inherited cksum_type and cksum_flags members (which are already
copied in), copy them out in POSIXMultiPartUpload::get_info() which
we need to call to copy out dest_placement anyway.
(oops, cksum_type was copied in, but not cksum_flags)
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Sun, 15 Feb 2026 20:56:03 +0000 (15:56 -0500)]
posixdriver: fix cache fill of versioned buckets
This change completes the original intent (hypothesized) to
conditionally set the FLAG_CURRENT bit on just the current
entries during bucket listing cache fill.
This avoids interning 2 copies of the current version of each
object in the listing cache, and also correctly sets the
FLAG_CURRENT bit as required--so the current versions are correctly
reported in versioned listings.
Janky logic to find the current version by explicitly chasing
the symlink target and saving it outside the enumeration scope
has been replaced with proper call to stat() provided by Dang.
Symlink::fill_cache() is no longer used, so removed.
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Sun, 15 Feb 2026 15:21:28 +0000 (10:21 -0500)]
posixdriver: add bde.flags to bucket cache serde cycle
The upstream logic (mostly?) correctly uses bde.flags when filling
the cache for versioned objects, but cache ser(de)ialization has
been discarding that member.
This change suppresses the visible result where RGW incorrectly produces
multiple versions in non-versioned listing because none uniquely sets
FLAG_CURRENT:
Cached listings for versions are still incorrect in containing an
extra entry for the "current" version with an empty instance
(from the Symlink)--the visible effect being that list-object-versions
output is incorrect (no entry is sent with IsLatest, after the
empty instance version has been filtered out).
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Jacques Heunis [Thu, 15 Jan 2026 12:11:11 +0000 (12:11 +0000)]
tools/rados: Remove plain text snippets from rados bench JSON output
`rados bench` emits performance stats as its output. It is very helpful
for this output to be in a machine-readable format and the CLI provides
the `--format=json` flag to achieve this.
There are some logs that do not respect the formatter flag though, as
they provide status updates as the tool is running and do not form part
of the output dataset. This prevents the contents of stdout from being
valid JSON which destroys the machine-readability of the output.
To resolve this we gate those status messages behind a check for the
formatter. If any specific formatter is provided we do not emit the
status logs. This leaves the plaintext output largely untouched while
helping the machine-readable output to be well-formed.
Fixes: https://tracker.ceph.com/issues/74370
Signed-off-by: Jacques Heunis <jheunis@bloomberg.net>
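The gating can be pictured with a small Python sketch. It is a model of the behaviour, not the rados tool's code: the function name, the status text, and the result fields are placeholders invented for the example.

```python
import json
import sys


def run_bench(output_format=None, out=sys.stdout):
    def status(msg):
        # Progress messages are only emitted when no formatter was requested.
        if output_format is None:
            print(msg, file=out)

    status("benchmark starting (placeholder status text)")
    results = {"ops": 100, "seconds": 10}  # made-up numbers
    status("benchmark finished (placeholder status text)")

    if output_format == "json":
        print(json.dumps(results), file=out)
    else:
        print(f"ops: {results['ops']} in {results['seconds']}s", file=out)
```

With a formatter set, the output stream contains only the serialized results, so a consumer can parse it directly.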
Casey Bodley [Fri, 29 May 2026 14:43:33 +0000 (10:43 -0400)]
qa/rgw: remove ragweed from multifs subsuite
it's currently broken with newer python on rocky 10 and ubuntu 24
(tracked in https://tracker.ceph.com/issues/72500) and doesn't provide
interesting test coverage outside of rgw/upgrade
Jamie Pryde [Fri, 29 May 2026 11:44:56 +0000 (12:44 +0100)]
qa: Ignore deprecated EC plugin warning in teuthology tests
Add DEPRECATED_EC_PLUGIN to the list of health warnings to
ignore in the thrash-erasure-code-* tests that use deprecated
plugins or techniques. It is expected that this warning will
be raised.
Sun Yuechi [Fri, 29 May 2026 10:39:51 +0000 (18:39 +0800)]
rgw: move SWIFT error_handler out-of-line to fix link failure
The two error_handler overrides are defined inline in rgw_rest_swift.h
and delegate to RGWSwiftWebsiteHandler::error_handler, a non-virtual
function defined in rgw_rest_swift.cc (librgw_a.a). Because the header
is included by rgw_rest.cc, the inline bodies are emitted in
librgw_common.a, which then ODR-uses that symbol across archives.
The link line lists librgw_a.a before librgw_common.a, and GNU ld only
pulls archive members on demand: when librgw_a.a is scanned nothing yet
references RGWSwiftWebsiteHandler::error_handler, so rgw_rest_swift.cc.o
is dropped and the symbol is later unresolved. This shows up as a link
failure with gcc 16 -O2.
Move the two bodies into rgw_rest_swift.cc next to the function they
call, so the ODR-use stays within the same object and the build no
longer depends on archive scan order. No functional change.
Sun Yuechi [Fri, 29 May 2026 10:19:18 +0000 (18:19 +0800)]
cmake/AddCephTest: use namespaced Catch2 imported targets
AddCephTest.cmake links the bare target names Catch2 / Catch2WithMain.
With WITH_SYSTEM_CATCH2=ON, CPM resolves Catch2 via find_package(),
which only exports the namespaced IMPORTED targets Catch2::Catch2 /
Catch2::Catch2WithMain. CMake then treats the bare names as plain
library names and the link fails with -lCatch2WithMain, since the
physical library is named libCatch2Main (OUTPUT_NAME "Catch2Main").
Use the namespaced names. Catch2 exports them as ALIASes in the bundled
(CPM subproject) build too, so the default path keeps working.