Vallari Agrawal [Fri, 13 Mar 2026 08:47:46 +0000 (14:17 +0530)]
qa: ignore NVMEOF_GATEWAY_DOWN in nvmeof_scalability.yaml
Sometimes during scale-up/scale-down, a gateway goes into the
UNAVAILABLE state (which triggers the NVMEOF_GATEWAY_DOWN warning)
for a couple of seconds and then self-recovers.
In these cases, none of the scale test asserts fail,
so NVMEOF_GATEWAY_DOWN can be ignorelisted: the scale test asserts
on the expected gateway count and checks that all expected gateways
are AVAILABLE between each iteration of scale-up/scale-down.
Vallari Agrawal [Fri, 13 Mar 2026 08:32:06 +0000 (14:02 +0530)]
qa/tasks/nvmeof.py: retry do_check if gw in CREATED
In do_check(), ensure all the namespaces and listeners are
added to the gateway (i.e. the gateway is not in the CREATED state)
after the gateway is restarted. This prevents moving on to the
next iteration of thrashing while gateways are still being
updated.
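The retry described above can be sketched as a simple polling loop; this is an illustrative sketch, not the actual qa/tasks/nvmeof.py code, and the function name, timeout values, and injectable clock/sleep parameters are assumptions:

```python
import time

# Sketch: after a gateway restart, poll its state and only proceed
# once it has left CREATED, i.e. all namespaces and listeners have
# been re-added. clock/sleep are injectable for testing.
def wait_until_ready(get_state, timeout=120, interval=5,
                     clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout
    while clock() < deadline:
        state = get_state()
        if state != "CREATED":
            return state      # gateway is fully populated again
        sleep(interval)
    raise TimeoutError("gateway still in CREATED state after restart")
```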
John Mulligan [Fri, 13 Mar 2026 17:42:09 +0000 (13:42 -0400)]
script/build-with-container: add CONFIGURE_ARGS env var to configure step
Add a new optional CONFIGURE_ARGS environment variable to the configure
step so that there's a mechanism to pass custom cmake options that
aren't handled elsewhere in the run-make.sh script.
Because configure is a rather fundamental build step, it's probably
preferable to set this via an env file so that it persists across
rebuilds. Using an environment variable here also avoids both needing to
change run-make.sh and adding another CLI option to BWC, which already has
too many.
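The mechanism can be illustrated with a small Python sketch of how a CONFIGURE_ARGS-style variable could be folded into a cmake invocation; the real logic lives in shell in run-make.sh, so this function and its argument handling are assumptions:

```python
import shlex

# Sketch: split CONFIGURE_ARGS with shell quoting rules and append the
# pieces to the cmake command line; an unset/empty variable adds nothing.
def build_cmake_cmd(env: dict, source_dir: str = "..") -> list:
    cmd = ["cmake"]
    extra = env.get("CONFIGURE_ARGS", "")
    cmd.extend(shlex.split(extra))   # honors quoting, empty var is a no-op
    cmd.append(source_dir)
    return cmd
```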
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Instead of using "ceph orch daemon restart",
wait for the daemon to come back up on its own
during revival.
Also improve the do_check retry logic,
and make some logging improvements in the nvmeof.thrasher task.
Ilya Dryomov [Thu, 12 Mar 2026 10:30:24 +0000 (11:30 +0100)]
include/ceph_features: note more kernel versions
Despite both MONNAMES and MONENC being pre-argonaut feature bits and
the kernel client implicitly assuming argonaut since 5.0, its monmap
decoding routine didn't handle MONNAMES and MONENC until 5.11 (when it
became necessary as part of msgr2 support).
Casey Bodley [Wed, 11 Mar 2026 22:27:54 +0000 (18:27 -0400)]
rgw: set 2_min minimum on rgw_mp_lock_max_time
because the lock renewal request is sent every half interval, don't
allow the lock duration to get small enough that rados request latency
becomes significant
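The clamping rationale can be sketched as follows; the 2-minute floor comes from the commit subject, but the function names and the exact clamp semantics here are illustrative assumptions, not the rgw code:

```python
# Sketch: the renewal request fires every half lock interval, so the
# configured value is clamped to a 2-minute floor to keep rados request
# latency small relative to the renewal window.
TWO_MINUTES = 120  # seconds

def effective_lock_duration(configured_secs: int) -> int:
    return max(configured_secs, TWO_MINUTES)

def renewal_interval(configured_secs: int) -> float:
    # renewal happens at half the (clamped) lock duration
    return effective_lock_duration(configured_secs) / 2
```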
Vallari Agrawal [Wed, 4 Mar 2026 06:21:00 +0000 (11:51 +0530)]
qa: Add "auto_pool_create" to nvmeof_initiator
While deploying gateways with "ceph orch apply nvmeof",
--pool is now optional. If it is not passed, a pool named
".nvmeof" is created automatically.
In the nvmeof task, "auto_pool_create: True" skips --pool
in "ceph orch apply nvmeof".
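The task-side behavior can be sketched like this; the command layout and function name are hypothetical illustrations of the flag-skipping logic, not the actual nvmeof task code:

```python
# Sketch: when auto_pool_create is set, omit --pool from the apply
# command and let the server create the default ".nvmeof" pool.
def build_apply_cmd(service_id: str, pool: str = None,
                    auto_pool_create: bool = False) -> list:
    cmd = ["ceph", "orch", "apply", "nvmeof", service_id]
    if not auto_pool_create:
        cmd += ["--pool", pool or ".nvmeof"]
    return cmd
```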
Laura Flores [Fri, 12 Sep 2025 20:14:30 +0000 (20:14 +0000)]
mgr, qa: clarify module checks in DaemonServer
The current check groups a module not being
enabled together with a module failing to initialize.
In this commit, we reorder the checks:
1. Check whether the module is enabled. If it's not,
issue EOPNOTSUPP with instructions on how
to enable it.
2. Check whether the module is active. If a module
is enabled, then the cluster expects it to
be active to support commands. If the module
took too long to initialize, though, we
catch this and issue an ETIMEDOUT error with
a link for troubleshooting.
Now these two separate issues are no longer grouped
together, and they are checked in the right order.
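The reordered checks can be sketched in Python; the real code is C++ in the mgr's DaemonServer, so the function shape, return convention, and messages here are illustrative assumptions:

```python
import errno

# Sketch of the two ordered checks: enabled first, then active.
def check_module(name, enabled_modules, active_modules):
    if name not in enabled_modules:      # check 1: is the module enabled?
        return (-errno.EOPNOTSUPP,
                f"Module '{name}' is not enabled: use `ceph mgr module enable {name}`")
    if name not in active_modules:       # check 2: has it finished initializing?
        return (-errno.ETIMEDOUT,
                f"Module '{name}' took too long to initialize; see troubleshooting docs")
    return (0, "")
```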
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Thu, 11 Sep 2025 22:13:51 +0000 (22:13 +0000)]
mgr, qa: add `pending_modules` to asock command
Now, the command `ceph tell mgr mgr_status` shows a
"pending_modules" field. This is another way for Ceph operators
to check which modules haven't been initialized yet (in addition
to the health error).
This command was also added to testing scenarios in the workunit.
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
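An operator-side consumer of the new field might look like this; the exact JSON shape of the `ceph tell mgr mgr_status` output used here is an assumption for illustration:

```python
import json

# Hypothetical mgr_status output containing the new field.
sample = json.loads("""
{
  "available": false,
  "pending_modules": ["telemetry", "balancer"]
}
""")

def modules_still_loading(status: dict) -> list:
    """Return the modules that have not finished initializing."""
    return status.get("pending_modules", [])
```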
Laura Flores [Tue, 29 Jul 2025 22:46:46 +0000 (22:46 +0000)]
mgr, common, qa, doc: issue health error after max expiration is exceeded
----------------- Enhancement to the Original Fix -----------------
During a mgr failover, the active mgr is marked available if:
1. The mon has chosen a standby to be active
2. The chosen active mgr has all of its modules initialized
Now that we've improved the criteria for sending the "active" beacon
by enforcing it to retry initializing mgr modules, we need to account
for extreme cases in which the modules are stuck loading for a very long
time, or even indefinitely. In these extreme cases where the modules might
never initialize, we don't want to delay sending the "active" beacon for
too long. This can result in blocking other important mgr functionality,
such as reporting PG availability in the health status. We want
to avoid sending warnings about PGs being unknown in the health status when
that's not ultimately the problem.
To account for an exceptionally long module loading time, I added a new
configurable `mgr_module_load_expiration`. If we exceed this maximum amount
of time (in ms) allotted for the active mgr to load the mgr modules before declaring
availability, the standby will then proceed to mark itself "available" and
send the "active" beacon to the mon and unblock other critical mgr functionality.
If this happens, a health error will be issued at this time, indicating
which mgr modules got stuck initializing (See src/mgr/PyModuleRegistry.cc). The
idea is to unblock the rest of the mgr's critical functionality while making it
clear to Ceph operators that some modules are unusable.
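The decision described above can be condensed into a small sketch; the option name mirrors the new `mgr_module_load_expiration`, but the control flow is a deliberate simplification of the real C++:

```python
# Sketch: decide whether the mgr should declare itself available, and
# whether a health error should accompany that declaration.
def should_declare_available(all_started: bool, elapsed_ms: int,
                             expiration_ms: int):
    """Return (available, health_error)."""
    if all_started:
        return True, False    # normal path: modules done, no error
    if elapsed_ms >= expiration_ms:
        return True, True     # give up waiting: go active, raise health error
    return False, False       # keep waiting; beacon not yet "active"
```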
The workunit was rewritten so it tests for these scenarios:
1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be
issued)
3. Unacceptable delay in module loading behavior (a health error should be
issued)
This documentation explains the "Module failed to initialize"
cluster error.
Users are advised to try failing over
the mgr to reboot the module initialization process,
then if the error persists, file a bug report. I decided
to write it this way instead of providing more complex
debugging tips such as advising to disable some mgr modules
since every case will be different depending on which modules
failed to initialize.
In the bug report, developers can ask for the health detail
output to narrow down which module is causing a bottleneck,
and then ask the user to try disabling certain modules until
the mgr is able to fully initialize.
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 25 Apr 2025 22:11:19 +0000 (22:11 +0000)]
mgr: ensure that all modules have started before advertising active mgr
----------------- Explanation of Problem ----------------
When the mgr is restarted or failed over via `ceph mgr fail` or during an
upgrade, mgr modules sometimes take longer to start up (this includes
loading their class, commands, and module options, and being removed
from the `pending_modules` map structure). This startup delay can happen
due to a cluster's specific hardware or if a code bottleneck is triggered in
a module’s `serve()` function (each mgr module has a `serve()` function that
performs initialization tasks right when the module is loaded).
When this startup delay occurs, any mgr module command issued against the
cluster around the same time fails with an error saying that the command is not
supported:
```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```
We should try to lighten any bottlenecks in the mgr module `serve()`
functions wherever possible, but the root cause of this failure is that the
mgr sends a beacon to the mon too early, indicating that it is active before
the module loading has completed. Specifically, some of the mgr modules
have loaded their class but have not yet been deleted from the `pending_modules`
structure, indicating that they have not finished starting up.
--------------------- Explanation of Fix --------------------
This commit improves the criteria for sending the “active” beacon to the mon so
the mgr does not signal that it’s active too early. We do this through the following additions:
1. A new context `ActivePyModules::recheck_modules_start` that will be set if not all modules
have finished startup.
2. A new function `ActivePyModules::check_all_modules_started()` that checks if modules are
still pending startup; if all have started up (`pending_modules` is empty), then we send
the beacon right away. But if some are still pending, we pass the beacon task on to the new
recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.
3. Logic in ActivePyModules::start_one() that only gets triggered if the modules did not all finish
startup the first time we checked. We know this is the case if the new recheck context
`recheck_modules_start` was set from `nullptr`. The beacon is only sent once `pending_modules` is
confirmed to be empty, which means that all the modules have started up and are ready to support commands.
4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in
MgrStandby::send_beacon() when we check that the active mgr has been initialized (thus, it is available).
We only send the beacon when this boolean is set. Currently, we set these booleans at the end of Mgr::init(),
which means that it gets set early before `pending_modules` is clear. With this adjustment, the bools are set
only after we check that all modules have started up. The send_beacon code is triggered on mgr failover AND on
every Mgr::tick(), which occurs by default every two seconds. If we don’t adjust when these bools are set, we
only fix the mgr failover part, but the mgr still sends the beacon too early via Mgr::tick(). Below is the relevant
code from MgrStandby::send_beacon(), which is triggered in Mgr::background_init() AND in Mgr::tick():
```
// Whether I think I am available (request MgrMonitor to set me
// as available in the map)
bool available = active_mgr != nullptr && active_mgr->is_initialized();
auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;
```
--------------------- Reproducing the Bug ----------------------
At face value, this issue is not deterministically reproducible, since it
can depend on environmental factors or specific cluster workloads.
However, I was able to deterministically reproduce it by injecting a
bottleneck into the balancer module:
```
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
             self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
             self.optimizing = False
+            # causing a bottleneck
+            for i in range(0, 1000):
+                for j in range(0, 1000):
+                    x = i + j
+                    self.log.debug("hitting the bottleneck in the balancer module")
             self.log.debug('Sleeping for %d', sleep_interval)
             self.event.wait(sleep_interval)
             self.event.clear()
```
Then, the error reproduces every time by running:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```
With this fix, the active mgr is marked as "initialized" only after all
the modules have started up, and this error goes away. The command may
take a bit longer to execute depending on the extent of the delay.
This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.
The config is 0 ms by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
`ceph config set mgr mgr_module_load_delay <ms>`
A second dev-only config also allows you to specify which
module you want to be delayed in loading time. You may change
this with the following command:
`ceph config set mgr mgr_module_load_delay_name <module name>`
The workunit added here tests a simulated slow loading module
scenario to ensure that this case is properly handled.
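How the two dev-only configs could combine at module-load time can be sketched like this; the exact injection point in the mgr is an assumption, and the function name is hypothetical:

```python
import time

# Sketch: sleep only when loading the module targeted by
# mgr_module_load_delay_name, for mgr_module_load_delay milliseconds.
def maybe_delay(module_name, delay_ms, delay_name, sleep=time.sleep):
    """Return True if a delay was injected for this module."""
    if delay_ms > 0 and module_name == delay_name:
        sleep(delay_ms / 1000.0)
        return True
    return False
```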
Introduce a new NVMe-oF mgr module which creates the pool
used for storing NVMe-related metadata for the "ceph orch apply nvmeof" command.
This removes the need for users to manually create and configure the
metadata pool before using the NVMe-oF functionality, simplifying
setup and reducing the chance of misconfiguration.
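The module's job can be sketched as a create-if-missing check; the ".nvmeof" pool name comes from the earlier commit in this log, while the rados-like callback here is a stand-in, not the real mgr interface:

```python
# Sketch: ensure the NVMe-oF metadata pool exists, creating it on demand.
DEFAULT_POOL = ".nvmeof"

def ensure_nvmeof_pool(existing_pools: set, create_pool) -> bool:
    """Create the metadata pool if missing; return True if created."""
    if DEFAULT_POOL in existing_pools:
        return False          # idempotent: nothing to do
    create_pool(DEFAULT_POOL)
    return True
```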
Rotem Shapira [Wed, 18 Feb 2026 13:51:45 +0000 (13:51 +0000)]
rgw/lua: create fresh VM for each background script execution
Previously, the background thread reused the same Lua VM across
iterations, causing stale state to persist. This made operations
like 'pairs(RGW)' fail to iterate properly.
Now we create a fresh VM on each iteration, which:
- Fixes the iteration bug
- Simplifies the code (no need to update limits on existing VM)
- Ensures clean state for each script execution
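The stale-state problem can be illustrated with a language-shifted analogy; the real fix creates a fresh lua_State per iteration in C++, so this Python version (reused dict vs. fresh dict as the "VM" environment) is purely illustrative:

```python
# Analogy: reusing one mutable environment across iterations leaks
# names from earlier runs; building a fresh one each time keeps every
# script execution clean.
def run_reusing_env(scripts, env=None):
    env = {} if env is None else env
    for script in scripts:
        exec(script, env)              # stale names persist across iterations
    return env

def run_fresh_env(scripts):
    results = []
    for script in scripts:
        env = {}                       # fresh "VM" per iteration (the fix)
        exec(script, env)
        results.append({k: v for k, v in env.items() if k != "__builtins__"})
    return results
```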
Verified with unit tests:
- TableIterateBackground
- TableIterateBackgroundBreak
- TableIterateStepByStep