The python binaries on some CI images and dev boxes ship stripped, so the
allocator frames in an interpreter-shutdown leak come through unsymbolised
(/usr/bin/python3.13+0x...) and the function-name matches above cannot apply.
leak:python3.10 already handled this for 3.10, and a stale comment claimed 3.12
does not leak.
Add leak:python3.12 and leak:python3.13, mirroring the 3.10 entry, so the
interpreter globals are suppressed whatever CPython the build uses.
Kefu Chai [Tue, 23 Jun 2026 08:14:19 +0000 (16:14 +0800)]
ceph.in: load asan/lsan suppressions on WITH_ASAN builds
bin/ceph from a WITH_ASAN build aborts at exit, with LeakSanitizer reporting
CPython and Cython module-init allocations as leaks:
==2577940==ERROR: LeakSanitizer: detected memory leaks
Direct leak ... in PyObject_Malloc (/usr/bin/python3.12+...)
#4 __pyx_pymod_exec_rados rados_processed.c
SUMMARY: AddressSanitizer: 32113 byte(s) leaked in 30 allocation(s).
These are interpreter globals that live for the process lifetime. qa/lsan.supp
already suppresses them, but bin/ceph never loaded it: vstart.sh sets
LSAN_OPTIONS for the daemons it spawns, while a bin/ceph invoked separately
(ceph-api runs ./bin/ceph fsid once vstart.sh returns) inherits none and exits
non-zero. It stayed hidden until radosgw-admin stopped crashing in vstart and
the run reached that call.
ceph.in already re-execs with the ASan runtime preloaded under WITH_ASAN. Set
ASAN_OPTIONS and LSAN_OPTIONS first, from the CEPH_ASAN_OPTIONS and
CEPH_LSAN_OPTIONS that CMake also feeds add_ceph_test(), so the re-exec'd
interpreter starts with the suppressions loaded. Use setdefault so a value
from the caller still wins.
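The setdefault behaviour described above can be sketched as follows. This is a minimal illustration, not the code in ceph.in: the helper name and the suppression paths are placeholders, and a plain dict stands in for os.environ.

```python
def apply_sanitizer_defaults(env, asan_options, lsan_options):
    """Fill in sanitizer options only where the caller has not set them."""
    # dict.setdefault() writes the value only when the key is absent, so an
    # ASAN_OPTIONS or LSAN_OPTIONS exported by the caller is left untouched.
    env.setdefault("ASAN_OPTIONS", asan_options)
    env.setdefault("LSAN_OPTIONS", lsan_options)
    return env


# Placeholder values; the real ones would come from the build configuration.
env = {"LSAN_OPTIONS": "detect_leaks=0"}
apply_sanitizer_defaults(
    env,
    "suppressions=/path/to/asan.supp",
    "suppressions=/path/to/lsan.supp",
)
```

Here the caller's LSAN_OPTIONS survives and only ASAN_OPTIONS is filled in. Note that a variable the caller set to an empty string also counts as present and is kept.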
Kefu Chai [Tue, 23 Jun 2026 08:14:19 +0000 (16:14 +0800)]
cmake: factor the ASan/LSan test options into cache variables
add_ceph_test() spelled out the suppression-file paths and sanitizer flags
inline. bin/ceph needs the same options, so lift them into CEPH_ASAN_OPTIONS
and CEPH_LSAN_OPTIONS and have add_ceph_test() consume those. The environment
the tests run with is unchanged.
Kefu Chai [Tue, 23 Jun 2026 07:43:28 +0000 (15:43 +0800)]
mgr/dashboard: skip the table when an nvmeof cli result has no columns
The dashboard leaves prettytable unpinned. prettytable commit 2574492 ("Apply
some Pylint rules (PLR)", #436) rewrote _stringify_row()'s row_height as
`max(_get_size(c)[1] for c in row)`, which raises ValueError("max() iterable
argument is empty") on a row with no cells. The change is undocumented and
shipped in 3.18.0; get_string() trips on it when a table has a row but no
columns.
AnnotatedDataTextOutputFormatter builds such a table for an empty result, or
one whose only field is status or error_message, so NvmeofCLICommand.call()
returns -EINVAL and the command fails. This broke run-tox-mgr-dashboard-py3
once the tox virtualenv picked up prettytable 3.18.0.
Return an empty string when there are no columns instead of formatting a
degenerate table.
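A minimal sketch of the failure mode and the guard, assuming nothing about the dashboard's or prettytable's real code: row_height() only imitates the shape of the expression quoted above, and format_rows() is a hypothetical stand-in for the formatter.

```python
def row_height(row):
    """Tallest cell in a row, counted in lines (illustrative only)."""
    # max() over an empty generator raises ValueError, which is what a row
    # with no cells triggers.
    return max(len(str(cell).splitlines() or [""]) for cell in row)


def format_rows(field_names, rows):
    """Render rows as text, skipping the table when there are no columns."""
    if not field_names:
        # Nothing to show: return early instead of formatting a table that
        # has rows but no columns.
        return ""
    lines = [" | ".join(field_names)]
    for row in rows:
        # Touch the height so an empty row would surface the same error.
        row_height(row)
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)
```

With the guard in place, format_rows([], [[]]) returns an empty string, while calling row_height([]) directly still raises ValueError.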
Ilya Dryomov [Mon, 22 Jun 2026 16:58:47 +0000 (18:58 +0200)]
Merge pull request #69641 from ceph/ljflores-patch-1
doc: update email address for Laura Flores
Reviewed-by: Gregory Farnum <gfarnum@redhat.com>
Reviewed-by: Dan van der Ster <dan.vanderster@clyso.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Kefu Chai [Mon, 22 Jun 2026 07:25:18 +0000 (15:25 +0800)]
doc/rados/operations: document the kernel-client min-compat holdout
A kernel CephFS or RBD mount can keep ``set-require-min-compat-client
reef`` from succeeding: the kernel client does not advertise the
pg-upmap-primary feature, so ``ceph features`` reports it as a luminous
client even when the daemons and every userspace client are newer.
ceph-fuse, libcephfs, and librbd advertise the full feature set and are
not affected.
Document this, and add a section on finding the holdout with
``ceph features`` and switching it to a userspace client.
unittest-seastore runs the seastar reactor on a separate thread
(SeastarRunner) and stops it at exit without draining its pending
tasks, so a few cached extents those tasks still held are leaked at
shutdown. Suppress them by the three seastore subsystems they come from.
Several unittests fail LeakSanitizer on still-reachable allocations that
belong to third-party libraries, not to Ceph. These are boost.thread's
main-thread TLS, OpenSSL's one-time init (the ForkDeathTest children
_exit() before it is freed), and the cipher, DRBG and error-stack state
that OpenSSL and libcryptsetup keep behind the librbd encryption and
migration unittests. None get freed without OPENSSL_cleanup(), so suppress
them by their allocation entry points.
Sun Yuechi [Sun, 21 Jun 2026 08:42:54 +0000 (16:42 +0800)]
vstart: load lsan/asan suppressions on WITH_ASAN builds
AddCephTest.cmake runs unittests with
ASAN_OPTIONS/LSAN_OPTIONS=suppressions=qa/{asan,lsan}.supp, but vstart.sh
does not, so on a WITH_ASAN build `ceph-mon --mkfs` aborts on a still-reachable
leak that those suppressions cover and fails the "ceph API tests" job. Export
the same options when WITH_ASAN=ON.
Sun Yuechi [Sat, 20 Jun 2026 07:17:26 +0000 (15:17 +0800)]
cmake: define BOOST_USE_UCONTEXT tree-wide under ASan
Under WITH_ASAN Boost.Context is built ucontext-only, so consumers that
include its headers without linking Boost::context (e.g. libosd) were
still built for the fcontext backend and broke the link:
Define the backend tree-wide so every consumer agrees on it.
riscv64's ASan runtime mis-handles makecontext/swapcontext, so the
ucontext fiber backend reports false-positive heap-buffer-overflows on
fiber switch that can't be suppressed. So exclude it.
Sun Yuechi [Sun, 21 Jun 2026 08:42:54 +0000 (16:42 +0800)]
test/encoding: probe `setarch -R` and drop the arch argument
readable.sh wraps ceph-dencoder with `setarch $(uname -m) -R` to disable
ASLR on ASan builds, but the arch-qualified form also sets the personality
to that arch, which fails where setarch can't (e.g. riscv64). Use bare
`setarch -R` to only clear ASLR, and probe it first so the script falls
back to running ceph-dencoder unwrapped.
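The probe-and-fall-back idea can be sketched in Python as below. readable.sh itself is a shell script, so this is only a rendering of the logic; the function names are hypothetical.

```python
import subprocess


def can_disable_aslr():
    """Probe whether bare `setarch -R` works on this host."""
    try:
        result = subprocess.run(
            ["setarch", "-R", "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # setarch is not installed or cannot be executed.
        return False
    return result.returncode == 0


def dencoder_command(args, probe=can_disable_aslr):
    """Build the ceph-dencoder command, wrapped only if the probe passes."""
    prefix = ["setarch", "-R"] if probe() else []
    return prefix + ["ceph-dencoder", *args]
```

Passing the probe as a parameter keeps the fallback path easy to exercise on hosts where setarch does work.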
Sun Yuechi [Sun, 21 Jun 2026 08:42:54 +0000 (16:42 +0800)]
crimson: build seastar with the default allocator under ASan
With this build configured as RelWithDebInfo, seastar keeps its own
allocator instead of falling back to libc's. Under ASan that allocator
is called (via dlsym) before it is initialized and SIGSEGVs every
seastar/crimson unittest before main(). Define SEASTAR_DEFAULT_ALLOCATOR
under WITH_ASAN to keep seastar on the libc allocator.
Sun Yuechi [Sun, 21 Jun 2026 16:14:55 +0000 (00:14 +0800)]
cmake: use the libc allocator for sanitizer builds
tcmalloc/jemalloc keep exporting the global operator new/delete even though
their malloc is shadowed by the sanitizer interceptor, so memory the sanitizer
allocated gets freed through tcmalloc and SIGSEGVs (e.g. seastar coroutine
frames). Force libc when WITH_ASAN is set.
Sun Yuechi [Sat, 20 Jun 2026 02:53:51 +0000 (10:53 +0800)]
cmake/boost: load context Jamfile before passing context-impl to b2
With WITH_ASAN, b2 runs as `b2 context-impl=ucontext headers stage` for the
build step and `b2 context-impl=ucontext install` for the install step. The
`context-impl` feature is declared in libs/context/build/Jamfile.v2, which
neither the headers/stage nor the install targets load, so b2 aborts with:
error: unknown feature "<context-impl>"
Name the context project as a target in both commands so its Jamfile (and the
feature) loads before the build request is expanded.
This works around https://github.com/boostorg/context/issues/297, fixed
upstream in
https://github.com/boostorg/context/commit/12ac945158ae3c2373ec0c888899373218aa209f
and first released in Boost 1.88; drop it once the bundled Boost is bumped to
1.88 or newer.
Sun Yuechi [Sat, 20 Jun 2026 05:51:39 +0000 (13:51 +0800)]
test/cmake: drop dead env_vars_for_tox_tests set_property
Both `tox_tests` and `env_vars_for_tox_tests` have been undefined since
f0079a1030b, so this expands to `set_property(TEST PROPERTY ENVIRONMENT)`,
which is a no-op. The actual per-test environment for tox tests is set in
add_tox_test() (cmake/modules/AddCephTest.cmake). Remove the leftover.
Sun Yuechi [Sun, 7 Jun 2026 14:19:08 +0000 (22:19 +0800)]
mypy: skip follow_imports for prettytable
With test venvs on system site-packages, mypy picks up the system
prettytable (3.4.0+, typed). It flags mgr's add_row(tuple) against the
list[Any] signature (src/mypy.ini) and qa's float_format = str against
the dict[str, str] property (qa/mypy.ini). Skip follow_imports in both.
Sun Yuechi [Sat, 6 Jun 2026 18:05:34 +0000 (02:05 +0800)]
test: optionally run test venvs with system site-packages
Add a CEPH_PYTHON_SYSTEM_SITE switch (off by default). When set:
- setup-virtualenv.sh builds its venv with --system-site-packages;
- run_tox.sh exports VIRTUALENV_SYSTEM_SITE_PACKAGES=true for tox's venvs.
This lets distro packages satisfy test dependencies instead of pip building
them from sdist, which helps where prebuilt wheels are missing (e.g. scipy and
numpy on riscv64) by avoiding a slow rebuild when the RPMs are installed.
Sun Yuechi [Fri, 19 Jun 2026 06:38:10 +0000 (14:38 +0800)]
cmake: make GTEST_PARALLEL_COMMAND visible to all test directories
GTEST_PARALLEL_COMMAND was set as an ordinary variable inside the
`if(NOT TARGET gtest-parallel_ext)` guard, so it only existed in the
first directory that include()s AddCephTest (src/common/options). Later
includes skip the guarded block and leave it empty, so PARALLEL
unittests under src/test silently ran serially.
Promote it to CACHE INTERNAL so it is visible across all directories
regardless of include order.
Sun Yuechi [Thu, 18 Jun 2026 19:06:06 +0000 (03:06 +0800)]
mgr/tox: run pytest in parallel
The py3 and coverage tox environments run the full mgr pytest suite
serially, which makes run-tox-mgr the longest test in CI. Add
pytest-xdist and pass `-n auto` to both so the suite is distributed
across the available CPUs.
pytest-xdist is constrained to <2 to stay compatible with the pinned
pytest-cov. Running in parallel also surfaced a hard-coded port in
cephadm's test_node_proxy, which now allocates an ephemeral port per
process.
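One common way to get a per-process port is to bind to port 0 and let the kernel choose. The sketch below shows that technique under the assumption that it is acceptable for the test; it is not taken from test_node_proxy, and the helper name is hypothetical.

```python
import socket


def allocate_ephemeral_port(host="127.0.0.1"):
    """Ask the kernel for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the kernel to pick any free port.
        sock.bind((host, 0))
        # getsockname() returns (address, port) for AF_INET sockets.
        return sock.getsockname()[1]


port = allocate_ephemeral_port()
```

The port is released when the socket closes, so another process could in principle claim it before the test server binds. For parallel test workers this window is usually small enough to ignore.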
Kefu Chai [Fri, 19 Jun 2026 11:20:09 +0000 (19:20 +0800)]
rocksdb: update submodule to fix FTBFS due to missing <cstdint>
59afb3d6 bumped the rocksdb submodule in the hope of fixing the FTBFS
when building rocksdb with GCC 16, but the tree still failed to build:
```
In file included from /ceph/src/rocksdb/include/rocksdb/trace_record_result.h:14,
from /ceph/src/rocksdb/trace_replay/trace_record_result.cc:6:
/ceph/src/rocksdb/include/rocksdb/trace_record.h:55:32: error: expected ')' before 'timestamp'
55 | explicit TraceRecord(uint64_t timestamp);
| ~ ^~~~~~~~~~
| )
/ceph/src/rocksdb/include/rocksdb/trace_record.h:63:11: error: 'uint64_t' does not name a type
63 | virtual uint64_t GetTimestamp() const;
| ^~~~~~~~
/ceph/src/rocksdb/include/rocksdb/trace_record.h:1:1: note: 'uint64_t' is defined in header '<cstdint>'; this is probably fixable by adding '#include <cstdint>'
```
in this change, we cherry-pick the upstream fix to address this build
failure.
Kefu Chai [Sun, 17 Mar 2024 10:42:44 +0000 (18:42 +0800)]
script/run-make: enable ASan
when performing tests, we should enable sanitizers for detecting
potential issues. so, in this change, we enable ASan, TSan and
UBSan.
script/run-make.sh is used by our CI job for testing PRs, so
enabling these sanitizers helps us to identify issues as early as
possible. because ASan cannot be used along with TSan, we prefer
using ASan to capture memory-related issues over detecting
multi-threading issues.
also, because of https://bugs.llvm.org/show_bug.cgi?id=23272, we
cannot enable multiple sanitizers. but we should enable UBSan as well,
once we can use a higher version of Clang than Clang-14. with
Clang-14, when enabling UBSan, we'd have the following FTBFS
```
error: Cannot represent a difference across sections
```
when compiling `src/tools/neorados.cc`
Abhishek Desai [Tue, 26 May 2026 07:48:40 +0000 (13:18 +0530)]
mgr/dashboard : Support wildcard sans and zonegroup hostnames
fixes : https://tracker.ceph.com/issues/76795
Signed-off-by: Abhishek Desai <abhishek.desai1@ibm.com>
Sun Yuechi [Fri, 29 May 2026 08:35:26 +0000 (16:35 +0800)]
librbd/cache/pwl: cancel periodic_stats timer before perf_stop()
AbstractWriteLog::shut_down() calls perf_stop(), which deletes
m_perfcounter, but the 5s periodic_stats timer is only canceled later
in the destructor. If it fires in between, periodic_stats ->
update_image_cache_state dereferences the freed m_perfcounter. Cancel
the timer under m_timer_lock first.
Destroying an AbstractWriteLog that was init()'ed without a matching
shut_down() is illegal, so the cancel_event() in the destructor is now
redundant and is dropped.
Fixes: https://tracker.ceph.com/issues/77501
Signed-off-by: Sun Yuechi <sunyuechi@iscas.ac.cn>
Kefu Chai [Thu, 18 Jun 2026 14:06:43 +0000 (22:06 +0800)]
ceph.spec.in: exclude CI and test directories from mgr plugin packages
The ceph-mgr-dashboard and ceph-mgr-rook packages install their entire
mgr module directory, which includes a ci/ subdirectory containing
Dockerfiles, e2e test scripts, and cluster specs used only for upstream
CI pipelines. ceph-mgr-cephadm similarly ships a tests/ directory with
Python unit tests. These files have no runtime purpose on a deployed
system and should not be shipped in the binary packages.
Exclude mgr/cephadm/tests, mgr/dashboard/ci, and mgr/rook/ci via
%exclude directives.
Kefu Chai [Thu, 18 Jun 2026 14:06:36 +0000 (22:06 +0800)]
debian/rules: exclude CI and test directories from mgr plugin packages
The ceph-mgr-dashboard and ceph-mgr-rook packages install their entire
mgr module directory, which includes a ci/ subdirectory containing
Dockerfiles, e2e test scripts, and cluster specs used only for upstream
CI pipelines. ceph-mgr-cephadm similarly ships a tests/ directory with
Python unit tests. These files have no runtime purpose on a deployed
system and should not be shipped in the binary packages.
Exclude mgr/cephadm/tests, mgr/dashboard/ci, and mgr/rook/ci via
dh_install --exclude.
mgr/cephadm: add ca_cert_required parameter to get_certificates mock in test
Add the ca_cert_required parameter to the lambda function mocking
CephadmService.get_certificates in test_prometheus_config_security_enabled
to match the updated method signature.
This ensures the test mock properly handles the new parameter that was
added to the get_certificates method.
mgr/smb: Add ssl_certificates buffer for smb features
Add an ssl_certificates buffer for smb features like remote_control
and keybridge. When a certificate is applied, it is stored with the
feature name as the key and SSLParameters as the value, where
SSLParameters holds the cert, key and ca-cert.
mgr/cephadm: Add function to get ssl certificate from ssl_certificates
Add _get_certificates_from_spec_ssl_certificates to get certificates
from the ssl_certificates buffer. It returns TLSCredentials from the
SSLParameters of ssl_certificates[feature].
Kefu Chai [Sun, 14 Jun 2026 06:52:00 +0000 (14:52 +0800)]
journal/ObjectPlayer: don't acquire locks in destructor
~ObjectPlayer took m_timer_lock only to assert two invariants. but that
lock is borrowed by reference from the caller's SafeTimer, and an
ObjectPlayer can outlive it: a C_Fetch/C_WatchFetch completion on the
librados finisher may hold the last reference and run ~ObjectPlayer after
the timer and its lock are already gone. re-taking the freed lock is a
heap-use-after-free, which unittest_journal hits on arm64 under ASan:
the lock isn't needed though: at refcount 0 the watch has been cancelled
(m_watch_ctx == nullptr, asserted below) so no timer task references us, and
no fetch is in flight since a pending fetch holds a reference. nothing else
can touch our state. Furthermore, we can skip acquiring `m_lock` as
well, because, in the destructor, it shouldn't really matter -- if one
of these asserts fails because the execution of the destructor races
with some `ObjectPlayer` method, we would get what the `assert()` was
added for. They are here to catch bugs and such a race just being
possible is a bug in itself.
Xiubo Li [Wed, 17 Jun 2026 01:14:43 +0000 (09:14 +0800)]
common/options: mark client_force_lazyio as not runtime updatable
The client_force_lazyio option is currently marked as supporting
runtime updates in the configuration schema, but this is misleading.
The value is read once during each file open/create and stored in
the file flags. There is no config observer registered to handle
dynamic updates, and there is no logic to propagate changes to the
already opened file handles.
This patch adds the NO_RUNTIME flag to the option definition to
correctly reflect reality.
Fixes: https://tracker.ceph.com/issues/77451
Signed-off-by: Xiubo Li <xiubo.li@clyso.com>
Matthew N. Heler [Mon, 18 May 2026 01:57:01 +0000 (20:57 -0500)]
mon: add monitor RocksDB backup and restore
Implements an opt-in backup mechanism for the monitor using
rocksdb::BackupEngine. Backups run on a schedule when
mon_backup_interval is set, or are triggered manually via
`ceph tell mon.* backup`. Cleanup keeps the last N, hourly,
and daily snapshots, with a free-space guard. Off by default.
Restore is offline: stop the mon and run
ceph-mon --restore-backup <dir> --yes-i-really-mean-it
optionally with --backup-version (BackupEngine logical version,
as shown by --list-backups). The mon keyring is stashed alongside
the RocksDB backup so a wiped mon_data is recovered end-to-end,
and kv_backend is stamped back when missing.
Co-authored-by: Daniel Poelzleithner <poelzleithner@b1-systems.de>
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Emmanuel Ameh [Tue, 9 Jun 2026 12:21:38 +0000 (13:21 +0100)]
doc/install: Update mirrors.rst to use https and current release
All mirror URLs used insecure http://. Update to https://. The example
repository paths used debian-hammer and rpm-hammer (Hammer was EOL in
2016); update to the current stable release (tentacle). Also update
the GitHub mirroring link from the stale master branch to main.
Emmanuel Ameh [Tue, 9 Jun 2026 12:19:27 +0000 (13:19 +0100)]
doc: Replace Python 2 package names with Python 3 equivalents
librados-intro.rst referenced ``python-rados`` for CentOS/RHEL.
rbd-openstack.rst referenced ``python-rbd`` for both apt and yum.
Python 2 reached end-of-life in January 2020; these package names
install the Python 2 bindings (or fail entirely) on current distros.
Replace with the correct Python 3 package names: python3-rados and
python3-rbd.
rbd-mirror: prune obsolete primary mirror snapshots after relocation
Previously, obsolete primary and demoted primary snapshots on the
secondary cluster were not cleaned up immediately after relocation.
Instead, old primary snapshots remained until a subsequent promote
operation triggered their cleanup, while old demoted primary snapshots
persisted until a later demote operation removed them.
Proactively clean up obsolete primary and demoted primary snapshots
that are no longer required after relocation, and add test coverage
to validate the cleanup behavior.