Patrick Donnelly [Wed, 22 Apr 2026 17:22:51 +0000 (13:22 -0400)]
Merge PR #67418 into tentacle
* refs/pull/67418/head:
mgr/cephadm: validate hostname in NodeProxyCache
node-proxy: improve HTTP error logging in client
node-proxy: get serial number instead of SKU
node-proxy: allow multiple sources per component
node-proxy: re-auth and retry once on 401
node-proxy: fix flake8 E721 in _dict_diff
node-proxy: make the update loop interval configurable
mgr/node-proxy: fix "ceph orch hardware status --category criticals"
node-proxy: normalize storage data per member
node-proxy: encapsulate send logic in dedicated method
node-proxy: log actual data delta in reporter
node-proxy: add periodic heartbeats in main and reporter loops
node-proxy: adjust log levels
node-proxy: add unit tests
node-proxy: add tox config for mypy, flake8, isort, black
node-proxy: black and isort formatting pass
node-proxy: fix mypy errors
node-proxy: handle nested Redfish paths for components
node-proxy: split out config, bootstrap and redfish logic
node-proxy: refactor config loading
node-proxy: add 'vendor based' redfish system selection
node-proxy: introduce component spec registry and overrides for updates
mgr/cephadm: safe status/health access in node-proxy agent and inventory
node-proxy: narrow build_data exception handling and re-raise
node-proxy: refactor Endpoint/EndpointMgr and fix chassis paths
node-proxy: use safe field access in storage update
node-proxy: reduce log verbosity for missing optional fields
Patrick Donnelly [Wed, 22 Apr 2026 17:21:13 +0000 (13:21 -0400)]
Merge PR #67343 into tentacle
* refs/pull/67343/head:
ceph-volume: fix test_reject_readonly_device unit test
ceph-volume: single lvs call to speed up exclude_lvm_osd_devices
ceph-volume: avoid Device() instantiation in lvm OSD filtering
ceph-volume: avoid RuntimeError on ceph-volume raw list with non-existent loop devices
Patrick Donnelly [Wed, 22 Apr 2026 14:38:38 +0000 (10:38 -0400)]
Merge PR #67850 into tentacle
* refs/pull/67850/head:
mgr, qa: clarify module checks in DaemonServer
mgr, qa: add `pending_modules` to asock command
mgr, common, qa, doc: issue health error after max expiration is exceeded
mgr: ensure that all modules have started before advertising active mgr
Laura Flores [Fri, 12 Sep 2025 20:14:30 +0000 (20:14 +0000)]
mgr, qa: clarify module checks in DaemonServer
The current check groups modules not being
enabled with failing to initialize. In this commit,
we reorder the checks:
1. Screen for a module being enabled. If it's not,
issue an EOPNOTSUPP with instructions on how
to enable it.
2. Screen for whether a module is active. If a module
is enabled, then the cluster expects it to
be active to support commands. If the module
took too long to initialize though, we will
catch this and issue an ETIMEDOUT error with
a link for troubleshooting.
Now, these two separate issues are not grouped
together, and they are checked in the right order.
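A minimal Python sketch of the check order described above. The real checks live in the C++ DaemonServer; the function name `check_module`, the set-based inputs, and the message wording here are hypothetical and only illustrate the ordering.

```python
import errno
from typing import Set, Tuple


def check_module(name: str, enabled: Set[str], active: Set[str]) -> Tuple[int, str]:
    """Return (0, "") if the module can serve commands, else (-errno, message)."""
    # Step 1: a module that is not enabled cannot serve commands at all.
    if name not in enabled:
        return (-errno.EOPNOTSUPP,
                f"Module '{name}' is not enabled: "
                f"use `ceph mgr module enable {name}` to enable it")
    # Step 2: enabled but not active means initialization took too long.
    if name not in active:
        return (-errno.ETIMEDOUT,
                f"Module '{name}' is enabled but has not finished initializing")
    return (0, "")
```

Checking "enabled" first means a disabled module is never reported as a timeout.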
Laura Flores [Thu, 11 Sep 2025 22:13:51 +0000 (22:13 +0000)]
mgr, qa: add `pending_modules` to asock command
Now, the command `ceph tell mgr mgr_status` will show a
"pending_modules" field. This is another way for Ceph operators
to check which modules haven't been initialized yet (in addition
to the health error).
This command was also added to testing scenarios in the workunit.
Laura Flores [Tue, 29 Jul 2025 22:46:46 +0000 (22:46 +0000)]
mgr, common, qa, doc: issue health error after max expiration is exceeded
----------------- Enhancement to the Original Fix -----------------
During a mgr failover, the active mgr is marked available if:
1. The mon has chosen a standby to be active
2. The chosen active mgr has all of its modules initialized
Now that we've improved the criteria for sending the "active" beacon
by enforcing it to retry initializing mgr modules, we need to account
for extreme cases in which the modules are stuck loading for a very long
time, or even indefinitely. In these extreme cases where the modules might
never initialize, we don't want to delay sending the "active" beacon for
too long. This can result in blocking other important mgr functionality,
such as reporting PG availability in the health status. We want
to avoid sending warnings about PGs being unknown in the health status when
that's not ultimately the problem.
To account for an exceptionally long module loading time, I added a new
configurable `mgr_module_load_expiration`. If we exceed this maximum amount
of time (in ms) allotted for the active mgr to load the mgr modules before declaring
availability, the standby will then proceed to mark itself "available" and
send the "active" beacon to the mon and unblock other critical mgr functionality.
If this happens, a health error will be issued at this time, indicating
which mgr modules got stuck initializing (See src/mgr/PyModuleRegistry.cc). The
idea is to unblock the rest of the mgr's critical functionality while making it
clear to Ceph operators that some modules are unusable.
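The expiration behavior can be sketched in Python as follows. This is only an illustration under stated assumptions: the mgr itself is C++ and is not known to poll like this, and `wait_for_modules`, `pending`, and `poll_ms` are hypothetical names. Only the idea of a maximum wait in ms, after which availability is declared and the stuck modules are reported, comes from the commit message.

```python
import time
from typing import Callable, List, Set, Tuple


def wait_for_modules(pending: Callable[[], Set[str]],
                     expiration_ms: int,
                     poll_ms: int = 10) -> Tuple[bool, List[str]]:
    """Wait until no modules are pending or the expiration elapses.

    Returns (all_started, stuck_modules). A non-empty stuck_modules list
    is what a health error would be built from.
    """
    deadline = time.monotonic() + expiration_ms / 1000.0
    while True:
        remaining = pending()
        if not remaining:
            return True, []
        if time.monotonic() >= deadline:
            # Stop waiting: declare availability anyway and report stragglers.
            return False, sorted(remaining)
        time.sleep(poll_ms / 1000.0)
```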
The workunit was rewritten so it tests for these scenarios:
1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be
issued)
3. Unacceptable delay in module loading behavior (a health error should be
issued)
This documentation explains the "Module failed to initialize"
cluster error.
Users are advised to try failing over
the mgr to reboot the module initialization process,
then if the error persists, file a bug report. I decided
to write it this way instead of providing more complex
debugging tips such as advising to disable some mgr modules
since every case will be different depending on which modules
failed to initialize.
In the bug report, developers can ask for the health detail
output to narrow down which module is causing a bottleneck,
and then ask the user to try disabling certain modules until
the mgr is able to fully initialize.
Laura Flores [Fri, 25 Apr 2025 22:11:19 +0000 (22:11 +0000)]
mgr: ensure that all modules have started before advertising active mgr
----------------- Explanation of Problem ----------------
When the mgr is restarted or failed over via `ceph mgr fail` or during an
upgrade, mgr modules sometimes take longer to start up (this includes
loading their class, commands, and module options, and being removed
from the `pending_modules` map structure). This startup delay can happen
due to a cluster's specific hardware or if a code bottleneck is triggered in
a module’s `serve()` function (each mgr module has a `serve()` function that
performs initialization tasks right when the module is loaded).
When this startup delay occurs, any mgr module command issued against the
cluster around the same time fails with an error saying that the command is not
supported:
```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```
We should try to lighten any bottlenecks in the mgr module `serve()`
functions wherever possible, but the root cause of this failure is that the
mgr sends a beacon to the mon too early, indicating that it is active before
the module loading has completed. Specifically, some of the mgr modules
have loaded their class but have not yet been deleted from the `pending_modules`
structure, indicating that they have not finished starting up.
--------------------- Explanation of Fix --------------------
This commit improves the criteria for sending the “active” beacon to the mon so
the mgr does not signal that it’s active too early. We do this through the following additions:
1. A new context `ActivePyModules::recheck_modules_start` that will be set if not all modules
have finished startup.
2. A new function `ActivePyModules::check_all_modules_started()` that checks if modules are
still pending startup; if all have started up (`pending_modules` is empty), then we send
the beacon right away. But if some are still pending, we pass the beacon task on to the new
recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.
3. Logic in ActivePyModules::start_one() that only gets triggered if the modules did not all finish
startup the first time we checked. We know this is the case if the new recheck context
`recheck_modules_start` was set from `nullptr`. The beacon is only sent once `pending_modules` is
confirmed to be empty, which means that all the modules have started up and are ready to support commands.
4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in
MgrStandby::send_beacon() when we check that the active mgr has been initialized (thus, it is available).
We only send the beacon when this boolean is set. Currently, we set these booleans at the end of Mgr::init(),
which means that it gets set early before `pending_modules` is clear. With this adjustment, the bools are set
only after we check that all modules have started up. The send_beacon code is triggered on mgr failover AND on
every Mgr::tick(), which occurs by default every two seconds. If we don’t adjust when these bools are set, we
only fix the mgr failover part, but the mgr still sends the beacon too early via Mgr::tick(). Below is the relevant
code from MgrStandby::send_beacon(), which is triggered in Mgr::background_init() AND in Mgr::tick():
```
// Whether I think I am available (request MgrMonitor to set me
// as available in the map)
bool available = active_mgr != nullptr && active_mgr->is_initialized();
auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;
```
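The recheck flow from points 1 to 4 can be modeled with a small Python sketch. It is not the C++ implementation: the class below is a simplified stand-in that reuses the names `pending_modules`, `recheck_modules_start`, `check_all_modules_started`, and `start_one` from the description, while `_mark_available` and the `send_beacon` callback are hypothetical helpers.

```python
from typing import Callable, Optional, Set


class ActiveModules:
    def __init__(self, modules: Set[str], send_beacon: Callable[[], None]) -> None:
        self.pending_modules = set(modules)
        self.send_beacon = send_beacon
        # Set only if some modules were still pending at the first check.
        self.recheck_modules_start: Optional[Callable[[], None]] = None
        self.initialized = False

    def _mark_available(self) -> None:
        # "initialized" flips only once every module has started.
        self.initialized = True
        self.send_beacon()

    def check_all_modules_started(self) -> None:
        if not self.pending_modules:
            self._mark_available()
        else:
            # Defer the beacon until the last module finishes startup.
            self.recheck_modules_start = self._mark_available

    def start_one(self, name: str) -> None:
        self.pending_modules.discard(name)
        if self.recheck_modules_start is not None and not self.pending_modules:
            callback, self.recheck_modules_start = self.recheck_modules_start, None
            callback()
```

In this model the beacon is sent exactly once, either at the first check or when the last pending module finishes.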
--------------------- Reproducing the Bug ----------------------
At face value, this issue is not deterministically reproducible since it
can depend on environmental factors or specific cluster workloads.
However, I was able to deterministically reproduce it by injecting a
bottleneck into the balancer module:
```
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
self.update_pg_upmap_activity(plan) # update pg activity in `balancer status detail`
self.optimizing = False
+ # causing a bottleneck
+ for i in range(0, 1000):
+ for j in range (0, 1000):
+ x = i + j
+ self.log.debug("hitting the bottleneck in the balancer module")
self.log.debug('Sleeping for %d', sleep_interval)
self.event.wait(sleep_interval)
self.event.clear()
```
Then, the error reproduces every time by running:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```
With this fix, the active mgr is marked as "initialized" only after all
the modules have started up, and this error goes away. The command may
take a bit longer to execute depending on the extent of the delay.
This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.
The config is 0 ms by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
`ceph config set mgr mgr_module_load_delay <ms>`
A second dev-only config also allows you to specify which
module you want to be delayed in loading time. You may change
this with the following command:
`ceph config set mgr mgr_module_load_delay_name <module name>`
The workunit added here tests a simulated slow loading module
scenario to ensure that this case is properly handled.
Patrick Donnelly [Fri, 17 Apr 2026 21:47:14 +0000 (17:47 -0400)]
Merge PR #67511 into tentacle
* refs/pull/67511/head:
doc/_ext: fix ceph_commands.py for new decorator-based command system
pybind/mgr: add per-module CLICommand instances to remaining modules
pybind/mgr/dashboard: create DBCLICommand, use throughout
pybind/mgr/tests/test_object_format: update DecoDemo to use fresh CLICommand
pybind/mgr/smb: adapt SMBCommand to use CLICommandBase
pybind/orchestrator,cephadm: replace CLICommandMeta
pybind/mgr: mechanically fix simple users to not import CLI*Command
pybind/mgr/mgr_module: support per-module CLICommand instances and globals
pybind/.../dashboard: misc automatic linter fixes
Kefu Chai [Sun, 8 Feb 2026 12:34:15 +0000 (20:34 +0800)]
doc/_ext: fix ceph_commands.py for new decorator-based command system
After commit 4aa9e246f, mgr modules migrated from using a class-level
COMMANDS list to decorator-based command registration using per-module
CLICommand instances (e.g., @BalancerCLICommand.Read('balancer status')).
This broke the ceph_commands.py Sphinx extension which was hardcoded to
expect m.COMMANDS to be a list, causing documentation builds to fail.
But not all modules are using this per-module CLICommand. Some modules are
fully migrated (balancer, hello, etc.) and use decorators, while others
are partially migrated (volumes, progress, stats, influx, k8sevents,
osd_perf_query, osd_support) - they have CLICommand defined but still
use the old COMMANDS list.
This fix updates _collect_module_commands() to handle three scenarios:
1. Fully migrated modules: Check CLICommand.dump_cmd_list() and use it
if it returns commands
2. Partially migrated modules: Fall back to the old COMMANDS list if
dump_cmd_list() returns empty
3. Legacy modules: Use COMMANDS list if CLICommand doesn't exist
This ensures the Sphinx extension works with modules in any migration
state, maintaining backwards compatibility while supporting the new
decorator pattern.
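The three-way fallback can be sketched as below. This is an assumption-laden illustration, not the extension's actual code: `collect_module_commands` is a hypothetical stand-in for `_collect_module_commands()`, and looking up `CLICommand` and `COMMANDS` as attributes of the module class is assumed from the description.

```python
from typing import Any, Dict, List


def collect_module_commands(module_cls: Any) -> List[Dict[str, Any]]:
    """Return a module's commands regardless of its migration state."""
    cli = getattr(module_cls, "CLICommand", None)
    if cli is not None:
        # Fully migrated: decorators registered commands on CLICommand.
        cmds = cli.dump_cmd_list()
        if cmds:
            return list(cmds)
    # Partially migrated (empty dump) or legacy (no CLICommand at all):
    # fall back to the class-level COMMANDS list.
    return list(getattr(module_cls, "COMMANDS", []))
```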
Conflicts:
- src/pybind/mgr/prometheus/module.py
CLIReadCommand/CLIWriteCommand removed from imports, not yet in tentacle - kept tentacle version
- src/pybind/mgr/orchestrator/module.py
_cert_store_key_ls, _cert_store_cert_ls - signature changed in main, kept tentacle version
_cert_store_entity_ls - function renamed in main, kept tentacle name
_cert_store_entity_ls - function name changed in main but not in tentacle
- src/pybind/mgr/smb/cli.py
typing imports differ between main and tentacle - manually reconciled
error_wrapper not backported to tentacle - removed from cherry-pick
- src/pybind/mgr/mirroring/module.py
import differs between main and tentacle - manually kept tentacle imports
- src/pybind/mgr/pg_autoscaler/module.py
get_scaling_threshold not backported to tentacle - removed that change
orchestrator and cephadm relied on CLICommandMeta to bypass the global
behavior of CLICommand. That is no longer a problem, so replace
CLICommandMeta with OrchestratorCLICommandBase to preserve the magic
error wrapping.
cephadm: wait for latest osd map after ceph-volume before OSD deploy
After ceph-volume creates an OSD, the mgr's cached osd map can
lag behind the monitors. get_osd_uuid_map() then misses the new osd
id and deploy_osd_daemons_for_existing_osds() skips deploying the
cephadm daemon, which reports a misleading "Created no osd(s)" while
the osd exists.
This behavior is often seen with raw devices. (lvm list returns quicker).
This also fixes get_osd_uuid_map(only_up=True) as the previous branch
never populated the map when 'only_up' was true.
Now we only include osds with 'up==1' so a new OSD created (but still down)
is not treated as already present.
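A sketch of the `only_up` filtering, assuming an osd map dump shaped like the JSON from `ceph osd dump` (an "osds" list whose entries carry "osd", "uuid", and "up"). The standalone function `osd_uuid_map` is hypothetical and is not cephadm's actual get_osd_uuid_map().

```python
from typing import Any, Dict


def osd_uuid_map(osd_dump: Dict[str, Any], only_up: bool = False) -> Dict[str, str]:
    """Map osd id (as a string) to uuid, optionally keeping only 'up' OSDs."""
    result: Dict[str, str] = {}
    for osd in osd_dump.get("osds", []):
        # With only_up, a created-but-still-down OSD is left out of the map.
        if only_up and osd.get("up") != 1:
            continue
        result[str(osd["osd"])] = osd["uuid"]
    return result
```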
Patrick Donnelly [Tue, 14 Apr 2026 15:52:23 +0000 (11:52 -0400)]
Merge PR #67757 into tentacle
* refs/pull/67757/head:
qa/tasks/barbican: add kek for simple_crypto_plugin
qa/suites/rgw: use 'member' instead of 'Member' for roles in barbican
qa/suites/rgw: bump keystone to stable/2025.2
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com> Reviewed-by: Casey Bodley <cbodley@redhat.com>
Aliaksei Makarau [Tue, 31 Mar 2026 06:40:04 +0000 (08:40 +0200)]
This change introduces the shared memory communication (SMC-D) for the cluster network.
SMC-D is faster than ethernet in IBM Z LPARs and/or VMs (zVM or KVM).
mgr/DaemonServer: Limit search for OSDs to upgrade within the crush bucket.
The behavior of the 'ok-to-upgrade' command is now more deterministic with
respect to the parameters passed.
To achieve the above, the commit implements the following changes:
1. The 'ok-to-upgrade' command is modified to operate strictly on the OSDs
within the CRUSH bucket and, if possible, meet the '--max' criteria when
specified. When --max <num> is provided, the command returns up to <num>
OSD IDs from the specified CRUSH bucket that can be safely stopped for
simultaneous upgrade. This is useful when only a subset of OSDs within
the bucket needs to be upgraded for performance or other reasons.
2. Modifies the standalone tests to reflect the above change.
3. Modifies the relevant documentation to reflect the change in behavior.
Patrick Donnelly [Tue, 24 Feb 2026 19:30:24 +0000 (14:30 -0500)]
mds: use SimpleLock::WAIT_ALL for wait mask
The Locker uses has_any_waiter for a particular lock to evaluate whether
to nudge the log. For the squid, tentacle, and main branches, this
larger bit mask (all 64 bits) will cause this to wrongly return true for
other locks which have waiters. The side-effect of waking requests
spuriously is undesirable but should not affect performance
significantly.
For reef and older releases, using std::numeric_limits<uint64_t>::max()
in has_any_waiter() causes a bitwise overflow that sets the wait-queue
search bound impossibly high, resulting in the method always incorrectly
returning false. This results in nudge_log never nudging the log!
Fixes: db5c9dc2e6cc95a8d112c2131e4cac5340ca9dd0 Fixes: https://tracker.ceph.com/issues/75141 Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 0be3a7896be1d5e9d6b04a6b943c2b5535b8523f)
* refs/pull/67533/head:
qa/cephadm: ensure host has been fully saved before considering bootstrap complete
mgr/cephadm: add __getstate__ so OSD class can be pickled
Reviewed-by: Adam King <adking@redhat.com> Reviewed-by: Guillaume Abrioux <gabrioux@redhat.com>
Run NVMe preformat in the LVM bluestore prepare path and
set skip_mkfs_discard so ceph-osd --mkfs gets --bdev-enable-discard false,
matching the raw OSD behavior.
debian: remove stale distutils override from py3dist-overrides
distutils was deprecated in Python 3.10 (PEP 632) and removed in
Python 3.12. The `python3-distutils` package no longer exists in
Debian Trixie (Python 3.13) or Ubuntu 24.04+ (Python 3.12).
The only runtime reference was in `debian/ceph-mgr.requires`, already
cleaned up by 3fb3f892aa3. This override is now dead code, since no
installed file declares a runtime dependency on `distutils`, so
`dh_python3` never resolves it. Removing it prevents a latent
uninstallable-dependency bug if `distutils` were accidentally
reintroduced in a `.requires` file.
Fixes: https://tracker.ceph.com/issues/75901 Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Signed-off-by: Max R. Carrara <m.carrara@proxmox.com> Signed-off-by: Kefu Chai <k.chai@proxmox.com>
(cherry picked from commit d1d07a0542228b7c40238a9a78d138ad07130240)
* refs/pull/67907/head:
doc/start: Add ARM support note to hardware-recommendations.rst
doc: Improve start/hardware-recommendations.rst
doc: Update the old ceph.com/community/ links to ceph.io/en/news/blog/
doc/start: Improve hardware-recommendations.rst
doc/start: fix wording in swap tip
doc: Use ref instead of full URLs for intra-docs links
doc: Use existing labels and ref for hyperlinks in architecture.rst
Reviewed-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
* refs/pull/66948/head:
mgr/DaemonServer: Re-order OSDs in crush bucket to maximize OSDs for upgrade
mgr/DaemonServer: Implement ok-to-upgrade command
mgr/DaemonServer: Modify offline_pg_report to handle set or vector types
* refs/pull/66482/head:
mgr/prometheus/test_module: Adding unit-test for new classes
mgr/prometheus: metrics header for standby module
mgr/prometheus: Use RLock to fix deadlock in HealthHistory
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
mgr/prometheus: prune stale health checks, compress output
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com> Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
* refs/pull/65913/head:
client: signal waitfor_commit waiters for write delegation enabled inode
test/libcephfs: add test for fsync on a write delegated inode
client: adjust `Fb` cap ref count check during synchronous fsync()
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com> Reviewed-by: Venky Shankar <vshankar@redhat.com>
Previously s3tests_java.py set JAVA_HOME using the `alternatives`
command. That had issues in that `alternatives` is not present on all
Ubuntu systems, and some installations of Java don't update
alternatives. So instead we look for a "java-8" jvm in /usr/lib/jvm/
and set JAVA_HOME to the first one we find.
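One way the lookup could work is sketched below. The glob pattern `*java-8*` and the helper name `find_java8_home` are assumptions; the commit message only says that the first "java-8" jvm found under /usr/lib/jvm/ is used.

```python
import glob
import os
from typing import Optional


def find_java8_home(jvm_root: str = "/usr/lib/jvm") -> Optional[str]:
    """Return the first directory under jvm_root whose name contains 'java-8'."""
    # Sort so the choice is stable across runs.
    for path in sorted(glob.glob(os.path.join(jvm_root, "*java-8*"))):
        if os.path.isdir(path):
            return path
    return None
```

A caller would then export the result, for example `os.environ["JAVA_HOME"] = home` when `home` is not None.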
Yuval Lifshitz [Fri, 20 Feb 2026 15:41:14 +0000 (15:41 +0000)]
test/rgw/notification: do not use netstat in the code
* net-tools are deprecated in fedora and ubuntu
* using netstat -p (used to verify that the http server is listening on
a port) requires root privileges, which may fail in some test environments
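The commit message does not say what replaced netstat. One unprivileged way to check that a server is listening, shown here purely as an assumption, is to attempt a TCP connection; `is_port_listening` is a hypothetical helper.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising on refusal.
        return sock.connect_ex((host, port)) == 0
```

Unlike `netstat -p`, this does not identify which process owns the port; it only confirms that something accepts connections there.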
The 'ceph orch daemon add osd' path mishandles osd.default, which
can save an incomplete osd.default spec (wrong store key, incomplete
spec, and apply ordering) and can skip work or reject valid LV paths.
This commit validates before saving, persists the full device
selections, matches the spec store by service name, and forces OSD
creation so each command is actually executed.
qa/tasks/backfill_toofull.py: Fix assert failures with & without compression
The following issues with the test are addressed:
1. The test was encountering assertion failure (assert backfillfull < 0.9) with
compression enabled. This was because the condition was not factoring in the
compression ratio. Without it the backfillfull ratio can easily exceed 1. By
factoring in the compression ratio, the backfillfull ratio will be in the
range (0 - n), where n can vary depending on the type of compression used.
2. The main contributing factor for (1) above is the amount of data written to
the pool. The writes were time-bound earlier leading to excess data and
eventually the assertion failure. By limiting the data written to the OSDs
to 50% of the OSD capacity in the first phase and only 20% in the re-write
phase, the outcome of the test is more deterministic regardless of
compression being enabled or not.
3. A potential false cluster error is avoided by swapping the setting of
the nearfull-ratio and backfill-ratio after the re-write phase.
Ujjawal Anand [Tue, 3 Mar 2026 05:38:29 +0000 (11:08 +0530)]
ceph-volume: skip virtual cdrom devices in inventory
- Some hosts expose IPMI/BMC virtual media as /dev/sr0. These
devices are reported as SCSI type 5 (CD/DVD) via sysfs and
appear in inventory/orchestrator output.
- These are not real disks and cannot be used as OSD targets.
- Filter out devices with SCSI type 5 to avoid listing
virtual cdrom devices as valid OSD candidates.
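A sketch of such a filter, assuming the SCSI device type is read from the sysfs attribute `/sys/block/<dev>/device/type`. The helper names are hypothetical and this is not ceph-volume's actual code.

```python
import os

SCSI_TYPE_CDROM = 5  # SCSI peripheral device type for CD/DVD devices


def is_cdrom_type(raw: str) -> bool:
    """Interpret the contents of a sysfs 'type' attribute."""
    try:
        return int(raw.strip()) == SCSI_TYPE_CDROM
    except ValueError:
        return False


def is_scsi_cdrom(dev_name: str, sys_block: str = "/sys/block") -> bool:
    """Return True if the block device reports SCSI type 5 via sysfs."""
    type_path = os.path.join(sys_block, dev_name, "device", "type")
    try:
        with open(type_path) as f:
            return is_cdrom_type(f.read())
    except OSError:
        # No type attribute (e.g. non-SCSI device): not a cdrom.
        return False
```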
Casey Bodley [Wed, 25 Mar 2026 16:38:59 +0000 (12:38 -0400)]
qa/rgw/upgrade: symlinks are explicit about distro versions
avoid relying on "ubuntu_latest" and "rpm_latest" symlinks, which change
over time on main. be explicit about the distro versions supported by
the initial release
Normally, when fast devices are passed to the batch command but
no fast allocations can be found, the batch command
does nothing and returns an empty plan. This leads to issues
because the empty return makes the problem silent,
which makes it hard to debug in certain scenarios. I propose
changing this to raise an error, and have made changes in osd.py
to better log the errors and process the exceptions. This
shouldn't affect processes that much, and the change in
osd.py ensures the raised errors will not interrupt the return
output. I've also changed the unit tests to account for the
change.