Laura Flores [Fri, 12 Sep 2025 20:14:30 +0000 (20:14 +0000)]
mgr, qa: clarify module checks in DaemonServer
The current check groups modules not being
enabled with failing to initialize. In this commit,
we reorder the checks:
1. Screen for a module being enabled. If it's not,
issue an EOPNOTSUPP with instructions on how
to enable it.
2. Screen for a module being active. If a module
is enabled, then the cluster expects it to
be active to support commands. If the module
took too long to initialize though, we will
catch this and issue an ETIMEDOUT error with
a link for troubleshooting.
Now, these two separate issues are not grouped
together, and they are checked in the right order.
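The reordered checks can be sketched roughly as follows. This is a minimal Python illustration, not the DaemonServer code: the function name `check_module`, the `enabled`/`active` sets, and the message wording are all hypothetical.

```python
import errno

def check_module(name, enabled, active):
    """Return (code, message) for a command that requires module `name`.

    `enabled` and `active` are hypothetical sets of module names.
    """
    # 1. Not enabled: the operator has to turn the module on first.
    if name not in enabled:
        return (-errno.EOPNOTSUPP,
                f"Module '{name}' is not enabled: use "
                f"`ceph mgr module enable {name}` to enable it")
    # 2. Enabled but not active: initialization took too long.
    if name not in active:
        return (-errno.ETIMEDOUT,
                f"Module '{name}' is enabled but has not finished initializing")
    # Enabled and active: the command can be dispatched.
    return (0, "")
```

Checking "enabled" first means an operator who never enabled the module is not pointed at a troubleshooting path that does not apply to them.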
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Thu, 11 Sep 2025 22:13:51 +0000 (22:13 +0000)]
mgr, qa: add `pending_modules` to asock command
Now, the command `ceph tell mgr mgr_status` will show a
"pending_modules" field. This is another way for Ceph operators
to check which modules haven't been initialized yet (in addition
to the health error).
This command was also added to testing scenarios in the workunit.
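A script consuming this field might look like the sketch below. It assumes the command's JSON output carries "pending_modules" as an array of module names; that shape, the helper name, and the sample document are assumptions for illustration, not taken from the actual output.

```python
import json

def pending_modules(mgr_status_json):
    """Return the modules still pending, given a `mgr_status` JSON document."""
    status = json.loads(mgr_status_json)
    # Assumption: the field is a JSON array of names; a missing field means none.
    return sorted(status.get("pending_modules", []))

# Hypothetical sample document, not real command output.
sample = '{"pending_modules": ["telemetry", "balancer"]}'
```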
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Tue, 29 Jul 2025 22:46:46 +0000 (22:46 +0000)]
mgr, common, qa, doc: issue health error after max expiration is exceeded
----------------- Enhancement to the Original Fix -----------------
During a mgr failover, the active mgr is marked available if:
1. The mon has chosen a standby to be active
2. The chosen active mgr has all of its modules initialized
Now that we've improved the criteria for sending the "active" beacon
by enforcing it to retry initializing mgr modules, we need to account
for extreme cases in which the modules are stuck loading for a very long
time, or even indefinitely. In these extreme cases where the modules might
never initialize, we don't want to delay sending the "active" beacon for
too long. This can result in blocking other important mgr functionality,
such as reporting PG availability in the health status. We want
to avoid sending warnings about PGs being unknown in the health status when
that's not ultimately the problem.
To account for an exceptionally long module loading time, I added a new
configurable `mgr_module_load_expiration`. If we exceed this maximum amount
of time (in ms) allotted for the active mgr to load the mgr modules before declaring
availability, the standby will then proceed to mark itself "available" and
send the "active" beacon to the mon and unblock other critical mgr functionality.
If this happens, a health error will be issued at this time, indicating
which mgr modules got stuck initializing (See src/mgr/PyModuleRegistry.cc). The
idea is to unblock the rest of the mgr's critical functionality while making it
clear to Ceph operators that some modules are unusable.
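The decision described above can be summarized in a small Python sketch. It is a simplification under stated assumptions: `beacon_decision` and its parameters are hypothetical names, and the real logic lives in the C++ mgr code.

```python
def beacon_decision(pending, elapsed_ms, expiration_ms):
    """Decide whether to declare availability and which modules to flag.

    Returns (available, stuck_modules). `expiration_ms` stands in for the
    value of `mgr_module_load_expiration`.
    """
    if not pending:
        # All modules initialized: available, nothing to report.
        return True, []
    if elapsed_ms > expiration_ms:
        # Deadline exceeded: unblock the mgr and report what is still stuck.
        return True, sorted(pending)
    # Still within the allotted time: keep waiting.
    return False, []
```

A non-empty second element corresponds to the case where a health error naming the stuck modules would be raised.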
The workunit was rewritten so it tests for these scenarios:
1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be
issued)
3. Unacceptable delay in module loading behavior (a health error should be
issued)
This documentation explains the "Module failed to initialize"
cluster error.
Users are advised to try failing over
the mgr to reboot the module initialization process,
then if the error persists, file a bug report. I decided
to write it this way instead of providing more complex
debugging tips such as advising to disable some mgr modules
since every case will be different depending on which modules
failed to initialize.
In the bug report, developers can ask for the health detail
output to narrow down which module is causing a bottleneck,
and then ask the user to try disabling certain modules until
the mgr is able to fully initialize.
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 25 Apr 2025 22:11:19 +0000 (22:11 +0000)]
mgr: ensure that all modules have started before advertising active mgr
----------------- Explanation of Problem ----------------
When the mgr is restarted or failed over via `ceph mgr fail` or during an
upgrade, mgr modules sometimes take longer to start up (this includes
loading their class, commands, and module options, and being removed
from the `pending_modules` map structure). This startup delay can happen
due to a cluster's specific hardware or if a code bottleneck is triggered in
a module’s `serve()` function (each mgr module has a `serve()` function that
performs initialization tasks right when the module is loaded).
When this startup delay occurs, any mgr module command issued against the
cluster around the same time fails with an error saying that the command is not
supported:
```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```
We should try to lighten any bottlenecks in the mgr module `serve()`
functions wherever possible, but the root cause of this failure is that the
mgr sends a beacon to the mon too early, indicating that it is active before
the module loading has completed. Specifically, some of the mgr modules
have loaded their class but have not yet been deleted from the `pending_modules`
structure, indicating that they have not finished starting up.
--------------------- Explanation of Fix --------------------
This commit improves the criteria for sending the “active” beacon to the mon so
the mgr does not signal that it’s active too early. We do this through the following additions:
1. A new context `ActivePyModules::recheck_modules_start` that will be set if not all modules
have finished startup.
2. A new function `ActivePyModules::check_all_modules_started()` that checks if modules are
still pending startup; if all have started up (`pending_modules` is empty), then we send
the beacon right away. But if some are still pending, we pass the beacon task on to the new
recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.
3. Logic in ActivePyModules::start_one() that only gets triggered if the modules did not all finish
startup the first time we checked. We know this is the case if the new recheck context
`recheck_modules_start` was set from `nullptr`. The beacon is only sent once `pending_modules` is
confirmed to be empty, which means that all the modules have started up and are ready to support commands.
4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in
MgrStandby::send_beacon() when we check that the active mgr has been initialized (thus, it is available).
We only send the beacon when this boolean is set. Currently, we set these booleans at the end of Mgr::init(),
which means that it gets set early before `pending_modules` is clear. With this adjustment, the bools are set
only after we check that all modules have started up. The send_beacon code is triggered on mgr failover AND on
every Mgr::tick(), which occurs by default every two seconds. If we don’t adjust when these bools are set, we
only fix the mgr failover part, but the mgr still sends the beacon too early via Mgr::tick(). Below is the relevant
code from MgrStandby::send_beacon(), which is triggered in Mgr::background_init() AND in Mgr::tick():
```
// Whether I think I am available (request MgrMonitor to set me
// as available in the map)
bool available = active_mgr != nullptr && active_mgr->is_initialized();
auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;
```
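The interplay between the first check and the deferred recheck can be modeled with a toy Python class. It borrows the names `pending_modules`, `recheck_modules_start`, `check_all_modules_started` and `start_one` from the description above, but everything else (the class, the callback, the data types) is an assumption and not the C++ implementation.

```python
class ModuleStartTracker:
    """Toy model of deferring the beacon until `pending_modules` is empty."""

    def __init__(self, modules, send_beacon):
        self.pending_modules = set(modules)
        self.send_beacon = send_beacon      # callback that marks the mgr available
        self.recheck_modules_start = None   # set only if the first check fails

    def check_all_modules_started(self):
        if not self.pending_modules:
            self.send_beacon()              # everything is up: send right away
        else:
            # Some modules are still pending: defer the beacon task.
            self.recheck_modules_start = self.send_beacon

    def start_one(self, name):
        self.pending_modules.discard(name)
        # Only act if an earlier check deferred the beacon.
        if self.recheck_modules_start is not None and not self.pending_modules:
            callback, self.recheck_modules_start = self.recheck_modules_start, None
            callback()
```

Clearing the recheck context before invoking it keeps the beacon from being sent twice if `start_one` is called again.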
--------------------- Reproducing the Bug ----------------------
At face value, this issue is not deterministically reproducible since it
can depend on environmental factors or specific cluster workloads.
However, I was able to deterministically reproduce it by injecting a
bottleneck into the balancer module:
```
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
self.update_pg_upmap_activity(plan) # update pg activity in `balancer status detail`
self.optimizing = False
+ # causing a bottleneck
+ for i in range(0, 1000):
+ for j in range (0, 1000):
+ x = i + j
+ self.log.debug("hitting the bottleneck in the balancer module")
self.log.debug('Sleeping for %d', sleep_interval)
self.event.wait(sleep_interval)
self.event.clear()
```
Then, the error reproduces every time by running:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```
With this fix, the active mgr is marked as "initialized" only after all
the modules have started up, and this error goes away. The command may
take a bit longer to execute depending on the extent of the delay.
This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.
The config is 0 ms by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
`ceph config set mgr mgr_module_load_delay <ms>`
A second dev-only config also allows you to specify which
module you want to be delayed in loading time. You may change
this with the following command:
`ceph config set mgr mgr_module_load_delay_name <module name>`
The workunit added here tests a simulated slow loading module
scenario to ensure that this case is properly handled.
Xuehan Xu [Tue, 24 Feb 2026 07:35:58 +0000 (15:35 +0800)]
crimson/os/seastore/lba: TRIM/CLEANER trans to adjust deltas of
LBALeafNodes when committing them.
This is to deal with the following scenario:
1. A client transaction modifies a field of an LBALeafNode value other
than the pladdr;
2. A TRIM/CLEANER transaction concurrently modifies the pladdr for the
same laddr_t.
In the old approach, the client transaction may overwrite the pladdr with the
outdated value after the TRIM/CLEANER transaction commits.
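The conflict and the intended outcome can be illustrated with a small Python model. This is only a sketch of the idea: the `Mapping` type, its `refcount` field (a stand-in for "some other field"), and `merge_on_commit` are hypothetical and do not mirror the seastore data structures.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Mapping:
    pladdr: int    # physical address, changed here by TRIM/CLEANER
    refcount: int  # hypothetical "other field" a client may modify

def merge_on_commit(client_pending, rewritten_pladdr):
    """Apply a TRIM/CLEANER pladdr change onto a client's pending value."""
    # Keep the client's change to the other field, but take the new pladdr
    # so the client cannot later write back the outdated address.
    return replace(client_pending, pladdr=rewritten_pladdr)

committed = Mapping(pladdr=100, refcount=1)
client = replace(committed, refcount=2)   # client changes a non-pladdr field
merged = merge_on_commit(client, 200)     # TRIM/CLEANER moved the extent
```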
Xuehan Xu [Sun, 1 Mar 2026 04:42:49 +0000 (12:42 +0800)]
crimson/os/seastore/btree: make updates of lba leaf nodes ptrs
synchronous with contents updates
Since we need to merge the contents of lba leaf nodes when committing
trim/cleaner transactions, and we rely on the child ptrs to determine
whether to modify mappings of pending leaf nodes, we must make sure
the ptr updates and node content updates are synchronous.
Xuehan Xu [Mon, 1 Dec 2025 09:44:45 +0000 (17:44 +0800)]
crimson/os/seastore/transaction_manager: block client transactions if
they conflict with rewriting transactions until the rewriting
transactions finish
Xuehan Xu [Tue, 14 Oct 2025 03:05:19 +0000 (11:05 +0800)]
crimson/os/seastore/btree_types: BtreeCursors don't hold local copies of
lba/backref values
Since lba mapping values might change during the execution of
client transactions once we allow background transactions to be
submitted without invalidating client ones, we want to prevent other
components that use lba/backref mappings from keeping local copies, to
avoid potential problems.
Kefu Chai [Mon, 9 Feb 2026 02:09:14 +0000 (10:09 +0800)]
doc: update mgr module command documentation for per-module registries
Update documentation to reflect the new per-module command registry
pattern introduced in PR #66467. The old global CLICommand decorators
have been replaced with module-specific registries.
Changes:
- doc/mgr/modules.rst: Rewrite CLICommand section with setup guide,
update all examples to use AntigravityCLICommand pattern
- src/pybind/mgr/object_format.py: Add note explaining per-module
registries and update all decorator examples
- doc/dev/developer_guide/dash-devel.rst: Update dashboard plugin
examples to use DBCLICommand
All examples now correctly show:
- Creating registry with CLICommandBase.make_registry_subtype()
- Using module-specific decorator names (e.g., @StatusCLICommand.Read)
- Setting CLICommand class attribute for framework registration
Xuehan Xu [Fri, 6 Mar 2026 16:51:14 +0000 (00:51 +0800)]
crimson/osd/pg: drop inappropriate assertions
The handler of interruptions may be scheduled long after the
interruptions happen, when the world may have changed completely.
So the assertions about temporary states don't seem appropriate
in the handlers of those interruptions.
rgw/dedup: split-head mechanism
Split head object into 2 objects - one with attributes and no data and
a new tail-object with only data.
The new tail object will be deduped (unlike the head objects, which can't
be deduped).
We will split the head for objects with size 16MB or less.
A few extra improvements:
Skip objects created by server-side-copy
Use reftag for comp-swap instead of manifest
Skip shared-manifest objects after reading attributes
Made max_obj_size_for_split and min_obj_size_for_dedup config values in
rgw.yaml.in
refined test: validate size after dedup
TBD: add rados ls -l to report object size on-bulk to speedup the process
improved tests - verify refcount are working, validate objects, remove
duplicates and then verify the last remaining object making sure it was
not deleted
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
Kotresh HR [Fri, 6 Mar 2026 07:28:38 +0000 (12:58 +0530)]
tools/cephfs_mirror: Remove additional wait in pop_dataq_entry
An additional wait sneaked in while popping a job from
syncm's data_q. When the conditional wait was converted to a
timed wait as part of f6a6e781b887b01a640d6321a2c085577d9ba07e,
the additional wait should have been removed. It causes no
harm in most of the workflow but might cause issues when
the mirror daemon is stopped, so it should be removed.
Ville Ojamo [Thu, 5 Mar 2026 06:02:55 +0000 (13:02 +0700)]
doc: Fix link and improve Crimson doc
Fix Seastar external link that was not working.
Capitalize consistently as Crimson, SeaStore in text.
Fix typos including in a label and in a ref using it.
Wrap text at column 80.
Remove unused highlight directive.
Fix article and hyphenation.
Try to reduce amount of commas in text and improve language.
Use already existing label and ref instead of section title for link.
Use confval role for configuration keys in text.
Use an autoclass reference instead of hardcoding URL.
Trim spaces at end of lines and convert tabs to spaces.
Use a colon instead of a hyphen pretending to be an em dash.
Signed-off-by: Ville Ojamo <git2233+ceph@ojamo.eu>
ShreeJejurikar [Thu, 26 Feb 2026 07:57:55 +0000 (13:27 +0530)]
rgw: add bucket logging pytest suite
Add a pytest-based test suite for RGW bucket logging that exercises the
radosgw-admin bucket logging CLI commands (list, info, flush) and
verifies the associated S3-level cleanup behavior.
John Mulligan [Thu, 5 Mar 2026 13:30:37 +0000 (08:30 -0500)]
Merge pull request #67571 from phlogistonjohn/jjm-smb-remotectl-local
smb: add remote-control local mode feature
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Anoop C S <anoopcs@cryptolab.net>
Reviewed-by: Xavi Hernandez <xhernandez@gmail.com>
Ville Ojamo [Thu, 5 Mar 2026 09:02:42 +0000 (16:02 +0700)]
doc: Improve start/quick-rbd.rst
Remove mention of FAQ with a broken link.
Use ref for intra-docs links and add labels in destination documents.
Promptify all CLI example commands.
Use standard angle brackets for mandatory arguments in commands.
Remove an unused external link definition.
Trim spaces at end of lines and convert tabs to spaces.
Signed-off-by: Ville Ojamo <git2233+ceph@ojamo.eu>
John Mulligan [Mon, 23 Feb 2026 17:23:06 +0000 (12:23 -0500)]
cephadm: add support for a remote control local socket
It's not an oxymoron, it's Remote Control Local Socket (tm)!
This allows processes on the ceph host to use a unix domain socket
without mTLS to communicate with the remote control sidecar server
in the samba service.
At the higher level we treat the 2nd listener as a "feature" even
though it really configures the same sidecar as "remote-control".
This way it's easy to have one of "remote-control",
"remote-control-local" or both in the service spec configuring the
smb service.
NOTE: This service does have the ability to verify that the client has
admin-ish access to ceph services by needing the client to pass
the ceph user name and key over the grpc headers.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Wed, 4 Mar 2026 17:44:32 +0000 (12:44 -0500)]
Merge pull request #67534 from phlogistonjohn/jjm-smb-debug-opts
smb: add debug level options to smb cluster resource
Reviewed-by: Xavi Hernandez <xhernandez@gmail.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Anoop C S <anoopcs@cryptolab.net>
Reviewed-by: Adam King <adking@redhat.com>
NitzanMordhai [Wed, 4 Mar 2026 12:38:59 +0000 (12:38 +0000)]
qa/tasks/mgr: test_module_selftest set influx hostname to avoid warnings during plugin test
This is a follow-up PR for PR #66376 and completes the set for influx server
start.
The self-test will hit the error MGR_INFLUX_NO_SERVER since we don't have
a hostname configured. The following command adds a test hostname
so the error won't appear and fail the test.