Patrick Donnelly [Fri, 17 Apr 2026 21:47:14 +0000 (17:47 -0400)]
Merge PR #67511 into tentacle
* refs/pull/67511/head:
doc/_ext: fix ceph_commands.py for new decorator-based command system
pybind/mgr: add per-module CLICommand instances to remaining modules
pybind/mgr/dashboard: create DBCLICommand, use throughout
pybind/mgr/tests/test_object_format: update DecoDemo to use fresh CLICommand
pybind/mgr/smb: adapt SMBCommand to use CLICommandBase
pybind/orchestrator,cephadm: replace CLICommandMeta
pybind/mgr: mechanically fix simple users to not import CLI*Command
pybind/mgr/mgr_module: support per-module CLICommand instances and globals
pybind/.../dashboard: misc automatic linter fixes
Kefu Chai [Sun, 8 Feb 2026 12:34:15 +0000 (20:34 +0800)]
doc/_ext: fix ceph_commands.py for new decorator-based command system
After commit 4aa9e246f, mgr modules migrated from using a class-level
COMMANDS list to decorator-based command registration using per-module
CLICommand instances (e.g., @BalancerCLICommand.Read('balancer status')).
This broke the ceph_commands.py Sphinx extension which was hardcoded to
expect m.COMMANDS to be a list, causing documentation builds to fail.
Not all modules use this per-module CLICommand yet, however. Some modules
are fully migrated (balancer, hello, etc.) and use decorators, while others
are only partially migrated (volumes, progress, stats, influx, k8sevents,
osd_perf_query, osd_support): they have a CLICommand defined but still
use the old COMMANDS list.
This fix updates _collect_module_commands() to handle three scenarios:
1. Fully migrated modules: Check CLICommand.dump_cmd_list() and use it
if it returns commands
2. Partially migrated modules: Fall back to the old COMMANDS list if
dump_cmd_list() returns empty
3. Legacy modules: Use COMMANDS list if CLICommand doesn't exist
This ensures the Sphinx extension works with modules in any migration
state, maintaining backwards compatibility while supporting the new
decorator pattern.
Conflicts:
- src/pybind/mgr/prometheus/module.py
CLIReadCommand/CLIWriteCommand removed from imports, not yet in tentacle - kept tentacle version
- src/pybind/mgr/orchestrator/module.py
_cert_store_key_ls, _cert_store_cert_ls - signature changed in main, kept tentacle version
_cert_store_entity_ls - function renamed in main, kept tentacle name
- src/pybind/mgr/smb/cli.py
typing imports differ between main and tentacle - manually reconciled
error_wrapper not backported to tentacle - removed from cherry-pick
- src/pybind/mgr/mirroring/module.py
import differs between main and tentacle - manually kept tentacle imports
- src/pybind/mgr/pg_autoscaler/module.py
get_scaling_threshold not backported to tentacle - removed that change
orchestrator and cephadm relied on CLICommandMeta to bypass the global
behavior of CLICommand. That is no longer a problem, so replace
CLICommandMeta with OrchestratorCLICommandBase to preserve the magic
error wrapping.
Patrick Donnelly [Tue, 14 Apr 2026 15:52:23 +0000 (11:52 -0400)]
Merge PR #67757 into tentacle
* refs/pull/67757/head:
qa/tasks/barbican: add kek for simple_crypto_plugin
qa/suites/rgw: use 'member' instead of 'Member' for roles in barbican
qa/suites/rgw: bump keystone to stable/2025.2
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Aliaksei Makarau [Tue, 31 Mar 2026 06:40:04 +0000 (08:40 +0200)]
This change introduces shared memory communication (SMC-D) for the cluster network.
SMC-D is faster than Ethernet in IBM Z LPARs and/or VMs (zVM or KVM).
mgr/DaemonServer: Limit search for OSDs to upgrade within the crush bucket.
The behavior of the 'ok-to-upgrade' command is now more deterministic with
respect to the parameters passed.
To achieve the above, the commit implements the following changes:
1. The 'ok-to-upgrade' command is modified to operate strictly on the OSDs
within the CRUSH bucket and, if possible, meet the '--max' criteria when
specified. When --max <num> is provided, the command returns up to <num>
OSD IDs from the specified CRUSH bucket that can be safely stopped for
simultaneous upgrade. This is useful when only a subset of OSDs within
the bucket needs to be upgraded for performance or other reasons.
2. Modifies the standalone tests to reflect the above change.
3. Modifies the relevant documentation to reflect the change in behavior.
* refs/pull/67533/head:
qa/cephadm: ensure host has been fully saved before considering bootstrap complete
mgr/cephadm: add __getstate__ so OSD class can be pickled
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Guillaume Abrioux <gabrioux@redhat.com>
debian: remove stale distutils override from py3dist-overrides
distutils was deprecated in Python 3.10 (PEP 632) and removed in
Python 3.12. The `python3-distutils` package no longer exists in
Debian Trixie (Python 3.13) or Ubuntu 24.04+ (Python 3.12).
The only runtime reference was in `debian/ceph-mgr.requires`, already
cleaned up by 3fb3f892aa3. This override is now dead code: no
installed file declares a runtime dependency on `distutils`, so
`dh_python3` never resolves it. Removing it prevents a latent
uninstallable-dependency bug if `distutils` were accidentally
reintroduced in a `.requires` file.
Fixes: https://tracker.ceph.com/issues/75901
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
(cherry picked from commit d1d07a0542228b7c40238a9a78d138ad07130240)
* refs/pull/67907/head:
doc/start: Add ARM support note to hardware-recommendations.rst
doc: Improve start/hardware-recommendations.rst
doc: Update the old ceph.com/community/ links to ceph.io/en/news/blog/
doc/start: Improve hardware-recommendations.rst
doc/start: fix wording in swap tip
doc: Use ref instead of full URLs for intra-docs links
doc: Use existing labels and ref for hyperlinks in architecture.rst
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
* refs/pull/66948/head:
mgr/DaemonServer: Re-order OSDs in crush bucket to maximize OSDs for upgrade
mgr/DaemonServer: Implement ok-to-upgrade command
mgr/DaemonServer: Modify offline_pg_report to handle set or vector types
* refs/pull/66482/head:
mgr/prometheus/test_module: Adding unit-test for new classes
mgr/prometheus: metrics header for standby module
mgr/prometheus: Use RLock to fix deadlock in HealthHistory
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
mgr/prometheus: prune stale health checks, compress output
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
* refs/pull/65913/head:
client: signal waitfor_commit waiters for write delegation enabled inode
test/libcephfs: add test for fsync on a write delegated inode
client: adjust `Fb` cap ref count check during synchronous fsync()
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Previously s3tests_java.py set JAVA_HOME using the `alternatives`
command. That was problematic: `alternatives` is not present on all
Ubuntu systems, and some installations of Java don't update
alternatives. So instead we look for a "java-8" JVM in /usr/lib/jvm/
and set JAVA_HOME to the first one we find.
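The lookup described above can be sketched as follows. The helper name and
the exact glob pattern are assumptions for illustration; the real
s3tests_java.py code may match JVM directories differently.

```python
import glob
import os

def find_java8_home(jvm_root="/usr/lib/jvm"):
    # Pick the first directory under jvm_root whose name contains
    # "java-8"; results are sorted so the choice is deterministic.
    candidates = sorted(glob.glob(os.path.join(jvm_root, "*java-8*")))
    return candidates[0] if candidates else None
```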
Yuval Lifshitz [Fri, 20 Feb 2026 15:41:14 +0000 (15:41 +0000)]
test/rgw/notification: do not use netstat in the code
* net-tools is deprecated in Fedora and Ubuntu
* using `netstat -p` (used to verify that the HTTP server is listening on
a port) requires root privileges, which may fail in some test environments
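A netstat-free liveness check can be done with a plain socket connect, which
needs no root privileges. This is a sketch of the idea, not the test suite's
actual replacement code; the function name is illustrative.

```python
import socket

def port_is_listening(port, host="127.0.0.1", timeout=1.0):
    # If connect succeeds, something is accepting on that port;
    # connection refused / timeout means nothing is listening.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```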
qa/tasks/backfill_toofull.py: Fix assert failures with & without compression
The following issues with the test are addressed:
1. The test was encountering an assertion failure (assert backfillfull < 0.9) with
compression enabled. This was because the condition did not factor in the
compression ratio; without it the backfillfull ratio can easily exceed 1. By
factoring in the compression ratio, the backfillfull ratio stays in the
range (0, n), where n varies depending on the type of compression used.
2. The main contributing factor for (1) above is the amount of data written to
the pool. The writes were time-bound earlier leading to excess data and
eventually the assertion failure. By limiting the data written to the OSDs
to 50% of the OSD capacity in the first phase and only 20% in the re-write
phase, the outcome of the test is more deterministic regardless of
compression being enabled or not.
3. A potential false cluster error is avoided by swapping the setting of
the nearfull-ratio and backfill-ratio after the re-write phase.
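Point 1 above amounts to scaling written bytes by the compression ratio when
predicting fill. The following is illustrative only; the function name,
parameters, and exact formula are assumptions, not the test's actual code.

```python
def expected_backfillfull_ratio(bytes_written, osd_capacity,
                                compression_ratio=1.0):
    # With compression, the data actually stored is roughly
    # bytes_written * compression_ratio (ratio < 1 when data shrinks),
    # so a fill estimate that ignores it can appear to exceed 1.
    return (bytes_written * compression_ratio) / osd_capacity
```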
Casey Bodley [Wed, 25 Mar 2026 16:38:59 +0000 (12:38 -0400)]
qa/rgw/upgrade: symlinks are explicit about distro versions
avoid relying on "ubuntu_latest" and "rpm_latest" symlinks, which change
over time on main. be explicit about the distro versions supported by
the initial release
Ville Ojamo [Thu, 15 May 2025 10:32:29 +0000 (17:32 +0700)]
doc: Use existing labels and ref for hyperlinks in architecture.rst
Use validated ":ref:" hyperlinks instead of "external links" in "target
definitions" when linking within the Ceph docs:
- Update to use existing labels when linking from architecture.rst.
- Remove unused "target definitions".
Also use title case for section titles in
doc/start/hardware-recommendations.rst, because link text is now
generated from the section titles.
Other than generated link texts the rendered PR should look the same as
the old docs, only differing in the source RST.
Alex Ainscow [Wed, 18 Mar 2026 14:51:57 +0000 (14:51 +0000)]
src: Move the decision to build the ISA plugin to the top level make file
Previously, the first time you built Ceph, common did not see the correct
value of WITH_EC_ISA_PLUGIN. The consequence is that global.yaml gets
built with osd_erasure_code_plugins not including isa. This is not great
given that it's our default plugin.
We considered simply removing this parameter from make entirely, but this
may require more discussion about supporting old hardware.
So the slightly ugly fix is to move this erasure-code-specific declaration
to the top level.
Fixes: https://tracker.ceph.com/issues/75537
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit cecce28f16b0867ea8578a8f0c1478e24a40e525)
mon [stretch-mode]: Allow a max bucket weight diff threshold
Problem:
Users ran into a problem where the crush bucket weight difference
check in stretch mode is too strict: e.g., one of the disks added
to one of the nodes had a slight variation in capacity, and this
prevented Ceph from enabling the stretch cluster because the crush
weights were not balanced, even though the difference was very small.
Solution:
- Introduce mon_stretch_max_bucket_weight_delta in mon.yaml.in.
This config option defaults to 0.1 and is used as a threshold
allowing the weight difference between the two crush buckets in
stretch mode to be no greater than 10%.
- Introduce the STRETCH_MODE_BUCKET_WEIGHT_IMBALANCE health warning,
raised when the weight delta between the two sites exceeds 10%.
- Modify the documentation.
- Modify the tests that exercise this code path.
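The threshold check can be sketched as a relative-difference test. This is a
simplified illustration; the monitor code may normalize the weights
differently, and the function name is an assumption.

```python
def bucket_weights_balanced(w1, w2, max_delta=0.1):
    # Allow the two stretch-mode site weights to differ by at most
    # max_delta (mon_stretch_max_bucket_weight_delta, default 0.1,
    # i.e. 10%) relative to the heavier bucket.
    heavier = max(w1, w2)
    if heavier == 0:
        return True
    return abs(w1 - w2) / heavier <= max_delta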
Backport 6dddf54 introduced a new connection feature bit
NVMEOF_BEACON_DIFF, but there are plans (#66624) to make further
enhancements to that feature bit. This would cause the mons to crash
during upgrades.
However, this connection feature bit should not have been added to
begin with. The correct way to do this is to extend e55ad7bce2fb85096cd31ff9846403f9dbd01e85 by @athanatos to require
`CEPH_MON_FEATURE_INCOMPAT_NVMEOF_BEACON_DIFF` only if all mons support it.
This should be done by having mons add/update their supported features
in the MonMap via an update from `MMonJoin` (see, for instance, `crush_loc`,
which was recently added to `mon_info_t`). Once the supported features
indicated for each mon in the `MonMap` show they understand the new
NVMEOF_BEACON_DIFF, then it should be turned on globally in the
`MonMap` as a required feature (added to the incompat set).
Conflicts:
src/mon/NVMeofGwMon.h: conflicts with header change from 19c9be2
fix missing header change in #66584
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Ilya Dryomov [Sun, 1 Mar 2026 21:55:52 +0000 (22:55 +0100)]
qa/workunits/rbd: short-circuit status() if "ceph -s" fails
In mirror-thrash tests, status() can be invoked after one of the
clusters is effectively stopped due to a watchdog bark:
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.rbd_mirror.[cluster2] failed
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
...
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ status
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ local cluster daemon image_pool image_ns image
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ for cluster in ${CLUSTER1} ${CLUSTER2}
In this scenario all commands that are invoked from the loop body
are going to time out anyway.
Ilya Dryomov [Sun, 1 Mar 2026 16:45:51 +0000 (17:45 +0100)]
qa: rbd_mirror_fsx_compare.sh doesn't error out as expected
In mirror-thrash tests, one of the clusters can be effectively stopped
due to a watchdog bark while rbd_mirror_fsx_compare.sh is running and is
in the middle of the "wait for all images" loop:
In this scenario "rbd ls" is going to time out repeatedly, turning the
loop into up to a ~60-hour sleep (up to 720 iterations with a 5-minute
timeout + 10-second sleep per iteration).
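The ~60-hour figure follows directly from the numbers in the commit message,
assuming every iteration pays the full 5-minute timeout plus the 10-second
sleep:

```python
# Worst-case duration of the "wait for all images" loop.
iterations = 720
seconds_per_iteration = 5 * 60 + 10   # 5-minute timeout + 10-second sleep
total_hours = iterations * seconds_per_iteration / 3600
```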
Ilya Dryomov [Fri, 27 Feb 2026 14:18:27 +0000 (15:18 +0100)]
qa/tasks: make rbd_mirror_thrash inherit from ThrasherGreenlet
Commit 21b4b89e5280 ("qa/tasks: watchdog terminate thrasher") made it
required for a thrasher to have a stop_and_join() method, but the
preceding commit a035b5a22fb8 ("thrashers: standardize stop and join
method names") failed to add it to rbd_mirror_thrash (whether as an
ad-hoc implementation or by way of inheriting from ThrasherGreenlet).
Later on, commit 783f0e3a9903 ("qa: Adding a new class for the
daemonwatchdog to monitor") worsened the issue by expanding the use
of stop_and_join() to all watchdog barks rather than just the case of
a thrasher throwing an exception, which is something that practically
never happens.
client: adjust `Fb` cap ref count check during synchronous fsync()
The cephfs client holds a ref on Fb caps when handing out a write delegation[0].
An fsync from a (Ganesha) client holding the write delegation will block
indefinitely[1], waiting for the Fb cap ref count to drop to 0, which will
never happen until the delegation is returned/recalled.
If an inode has been write delegated, adjust the cap reference count
check in fsync().
Note: This only works for synchronous fsync(), since `client_lock` is
held for the entire duration of the call (at least along the path leading
up to the reference count check). Asynchronous fsync() needs to be fixed
separately (as that can drop `client_lock`).
mgr/prometheus: Use RLock to fix deadlock in HealthHistory
The HealthHistory.check() method acquires the lock and then calls
HealthHistory.save(), which also tries to acquire the same lock.
With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock).
Switch to RLock to allow nested acquisition by the same thread.
PR #65245 added the locks.
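The nested-acquisition pattern described above looks roughly like this. The
class is a minimal sketch of the shape of the code, not the actual
prometheus module implementation:

```python
import threading

class HealthHistory:
    def __init__(self):
        # RLock lets the same thread re-acquire in save() while
        # check() already holds it; a plain Lock would deadlock.
        self._lock = threading.RLock()
        self.healthcheck = {}

    def save(self):
        with self._lock:
            pass  # persist self.healthcheck (elided in this sketch)

    def check(self, health):
        with self._lock:
            self.healthcheck.update(health)
            self.save()  # nested acquisition: safe with RLock
```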
Nitzan Mordechai [Tue, 26 Aug 2025 14:30:12 +0000 (14:30 +0000)]
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and `erase` / `clear` did not correctly decref the values, leading
to use-after-free or leaks depending on usage.
Changes:
- Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
- Add `TTLCache::clear()` method for proper memory cleanup
- Ensure TTLCache::get() returns a new reference
- Fix misuse of std::move on c_str() in PyJSONFormatter
These changes prevent both memory leaks and use-after-free errors when
mgr modules use cached Python objects.
Nitzan Mordechai [Wed, 20 Aug 2025 14:50:40 +0000 (14:50 +0000)]
mgr/prometheus: prune stale health checks, compress output
This patch introduces several improvements to the Prometheus module:
- Introduces `HealthHistory._prune()` to drop stale and inactive health checks.
Limits the in-memory healthcheck dict to a configurable max_entries (default 1000).
TTL for stale entries is configurable via `healthcheck_history_stale_ttl` (default 3600s).
- Refactors HealthHistory.check() to use a unified iteration over known and current checks,
improving concurrency and minimizing redundant updates.
- Use cherrypy.tools.gzip instead of manual gzip.compress() for cleaner
HTTP compression with proper header handling and client negotiation.
- Introduces new module options:
- `healthcheck_history_max_entries`
- Add proper error handling for CherryPy engine startup failures
- Remove os._exit monkey patch in favor of proper exception handling
- Remove manual Content-Type header setting (CherryPy handles automatically)
* refs/pull/67318/head:
qa/multisite: use boto3's ClientError in place of assert_raises from tools.py.
qa/multisite: test fixes
qa/multisite: boto3 in tests.py
qa/multisite: zone files use boto3 resource api
qa/multisite: switch to boto3 in multisite test libraries
Ilya Dryomov [Tue, 24 Feb 2026 11:46:35 +0000 (12:46 +0100)]
librbd/mirror: detect trashed snapshots in UnlinkPeerRequest
If two instances of UnlinkPeerRequest race with each other (e.g. due
to the rbd-mirror daemon unlinking from a previous mirror snapshot and the
user taking another mirror snapshot at the same time), the snapshot that
UnlinkPeerRequest was created for may be in the process of being removed
(which may mean trashed by SnapshotRemoveRequest::trash_snap()) or fully
removed by the time unlink_peer() grabs the image lock. Because trashed
snapshots weren't handled explicitly, UnlinkPeerRequest could spuriously
fail with EINVAL ("not mirror snapshot" case) instead of the expected
ENOENT ("missing snapshot" case). This in turn could lead to spurious
ImageReplayer failures with it stopping prematurely.
ImageUpdateWatchers::flush() requests aren't tracked with an
m_in_flight-like mechanism the way ImageUpdateWatchers::send_notify()
requests are, but both cases involve callbacks representing delayed
work that is very likely to (indirectly) reference ImageCtx.
When the image is getting closed, ImageUpdateWatchers::shut_down() is
called before anything that belongs to ImageCtx is destroyed. However,
the shutdown can complete prematurely in the face of a pending flush if
one gets sent shortly before CloseRequest is invoked. The callback for
that flush will then race with CloseRequest and may execute after parts
of, or even the entire, ImageCtx have been destroyed, leading to
use-after-free and various segfaults.