----------------- Explanation of Problem ----------------
When the mgr is restarted, failed over via `ceph mgr fail`, or failed over
during an upgrade, mgr modules sometimes take longer to start up (startup
includes loading the module's class, commands, and module options, and
removing the module from the `pending_modules` set). This startup delay can
happen due to a cluster's specific hardware or if a code bottleneck is
triggered in a module's `serve()` function (each mgr module has a `serve()`
function that performs initialization tasks as soon as the module is loaded).
When this startup delay occurs, any mgr module command issued against the
cluster around the same time fails with an error saying that the command is
not supported:
```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```
We should try to alleviate bottlenecks in the mgr module `serve()`
functions wherever possible, but the root cause of this failure is that the
mgr sends a beacon to the mon too early, reporting itself as active before
module loading has completed. Specifically, some of the mgr modules
have loaded their class but have not yet been erased from the
`pending_modules` set, meaning they have not finished starting up.
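To make the race concrete, below is a condensed sketch of the pre-fix flow,
pieced together from the mgr code quoted throughout this commit (simplified;
locking details and error handling are omitted):
```
// In ActivePyModules::start_one(): module startup is queued on a
// finisher thread, and a module only leaves `pending_modules` once
// its (possibly slow) load completes.
finisher.queue(new LambdaContext([this, active_module, name](int) {
  active_module->load(this);     // bottlenecks during startup bite here
  std::lock_guard l(lock);
  pending_modules.erase(name);   // only now has the module started
}));

// Meanwhile, MgrStandby::send_beacon() (run from tick() every two
// seconds) reports the mgr as available as soon as `initialized` is
// set at the end of Mgr::init(), even while `pending_modules` is
// still non-empty.
bool available = active_mgr != nullptr && active_mgr->is_initialized();
```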
--------------------- Explanation of Fix --------------------
This commit improves the criteria for sending the “active” beacon to the mon so
the mgr does not signal that it’s active too early. We do this through the following additions:
1. A new context pointer, `ActivePyModules::recheck_modules_start`, which is set when not all
modules have finished starting up.
2. A new function, `ActivePyModules::check_all_modules_started()`, that checks whether modules are
still pending startup; if all have started (`pending_modules` is empty), we send
the beacon right away. If some are still pending, we hand the beacon task to the new
recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.
3. Logic in `ActivePyModules::start_one()` that only triggers if the modules did not all finish
startup the first time we checked, which is the case when the new recheck context
`recheck_modules_start` is non-null. The beacon is sent only once `pending_modules` is
confirmed to be empty, meaning all the modules have started up and are ready to support commands.
4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in
`MgrStandby::send_beacon()`, where we check that the active mgr has been initialized (and is therefore available);
the mgr is only reported as available once `initialized` is set. Currently, we set these booleans at the end of `Mgr::init()`,
which means they are set before `pending_modules` is clear. With this adjustment, the booleans are set
only after we have confirmed that all modules have started up. The send_beacon code is triggered on mgr failover AND on
every `Mgr::tick()`, which occurs by default every two seconds. If we did not adjust when these booleans are set, we
would only fix the mgr failover path, and the mgr would still send the beacon too early via `Mgr::tick()`. Below is the relevant
code from `MgrStandby::send_beacon()`, which is triggered in `Mgr::background_init()` AND in `Mgr::tick()`:
```
// Whether I think I am available (request MgrMonitor to set me
// as available in the map)
bool available = active_mgr != nullptr && active_mgr->is_initialized();
auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;
```
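To summarize, the two halves of the new handshake look like this (a condensed
sketch of the hunks further below; see `ActivePyModules::check_all_modules_started()`
and `ActivePyModules::start_one()`):
```
// Half 1: Mgr::background_init() asks to be notified once module
// startup is complete; `initialized` is flipped only in that callback.
py_module_registry->check_all_modules_started(
  new LambdaContext([this, completion](int) {
    {
      std::lock_guard l(lock);
      initializing = false;
      initialized = true;  // the next beacon reports available=true
    }
    completion->complete(0);
  }));

// Half 2: ActivePyModules::start_one(), after erasing a module from
// `pending_modules`, fires the parked context once the set drains.
if (pending_modules.empty() && recheck_modules_start) {
  finisher.queue(recheck_modules_start);
  recheck_modules_start = nullptr;
}
```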
--------------------- Reproducing the Bug ----------------------
At face value, this issue is not deterministically reproducible, since it
can depend on environmental factors or specific cluster workloads.
However, I was able to reproduce it deterministically by injecting a
bottleneck into the balancer module:
```
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
         self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
         self.optimizing = False
+        # causing a bottleneck
+        for i in range(0, 1000):
+            for j in range(0, 1000):
+                x = i + j
+                self.log.debug("hitting the bottleneck in the balancer module")
         self.log.debug('Sleeping for %d', sleep_interval)
         self.event.wait(sleep_interval)
         self.event.clear()
```
Then, the error reproduces every time by running:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```
With this fix, the active mgr is marked as "initialized" only after all
the modules have started up, and this error goes away. The command may
take a bit longer to execute depending on the extent of the delay.
---------------------- Integration Testing ---------------------
This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.
The config defaults to 0 ms, so no delay is added outside
of testing scenarios. The config can be adjusted
with the following command:
`ceph config set mgr mgr_module_load_delay <ms>`
A second dev-only config specifies which module's loading
should be delayed. You may change this with the following
command:
`ceph config set mgr mgr_module_load_delay_name <module name>`
The workunit added here simulates a slow-loading module
to ensure that this case is properly handled.
--------------------- Documentation --------------------
The new documentation describes the three existing mgr states so Ceph
operators can better interpret their Ceph status output.
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
mgr active: $name
+Interpreting Ceph-Mgr Statuses
+==============================
+
+A cluster's status will show each ``ceph-mgr`` daemon in one of three states:
+
+1. **active**
+
+ This mgr daemon has been fully initialized, which means it is ready to receive
+ and execute commands. Only one mgr will be in this state at a time.
+
+2. **active (starting)**
+
+ This mgr daemon has been chosen to be ``active``, but it is not done initializing.
+ Although it is not yet ready to execute commands, an operator may still issue commands,
+ which will be held and executed once the manager becomes ``active``. Only one mgr will
+ be in this state at a time.
+
+3. **standby**
+
+ This mgr daemon is not currently receiving or executing commands, but it is ready to
+ take over if the current active mgr becomes unavailable. An operator may also manually
+ promote a standby manager to active via ``ceph mgr fail`` if desired. All other mgr daemons
+ that are not ``active`` or ``active (starting)`` will be in this state.
+
+Each of these states is visible in the output of ``ceph -s``. For example:
+
+.. code-block:: console
+
+ $ ceph -s
+ cluster:
+ id: b150f540-745a-460c-a566-376b28b95ac3
+ health: HEALTH_OK
+
+ services:
+ mon: 3 daemons, quorum a,b,c (age 47m) [leader: a]
+ mgr: x(active, starting, since 3s)
+ mds: 1/1 daemons up, 2 standby
+ osd: 4 osds: 4 up (since 47m), 4 in (since 47m)
+
+ data:
+ volumes: 1/1 healthy
+ pools: 4 pools, 177 pgs
+ objects: 24 objects, 451 KiB
+ usage: 4.0 GiB used, 400 GiB / 404 GiB avail
+ pgs: 177 active+clean
+
Client authentication
---------------------
--- /dev/null
+overrides:
+ ceph:
+ log-ignorelist:
+ - \(CEPHADM_STRAY_DAEMON\)
+ - \(CEPHADM_STRAY_HOST\)
+ - \(MGR_DOWN\)
+ - evicting unresponsive client
+
+tasks:
+ - workunit:
+ clients:
+ client.0:
+ - mgr/test_mgr_module_loading_time.sh
--- /dev/null
+#!/bin/bash
+
+setup_cephadm() {
+ # This will create CEPHADM_STRAY_HOST warnings, but we just want to be able to run an orch command.
+ echo "Enabling cephadm module..."
+ ceph mgr module enable cephadm
+ ceph orch set backend cephadm
+}
+
+check_cluster_status() {
+ echo "Checking cluster status..."
+ ceph -s
+}
+
+set_balancer_delay() {
+ echo "Setting balancer module load delay..."
+ ceph config set mgr mgr_module_load_delay_name balancer
+ ceph config set mgr mgr_module_load_delay 10000
+}
+
+test_loading_time() {
+ echo "Testing with module load delay of 10000 ms..."
+ ceph mgr fail
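+ # 'ceph orch status' below races against the delayed module reload;
+ # with the fix it should be held and succeed once module loading
+ # completes, rather than failing with ENOTSUP.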
+
+ local orch_status_output
+ orch_status_output=$(ceph orch status 2>&1)
+ local rc=$?
+
+ echo "$orch_status_output"
+
+ if [[ $rc -eq 0 && "$orch_status_output" == *"Backend: cephadm"* ]]; then
+ echo "PASS: Excess loading time was properly supported."
+ elif [[ "$orch_status_output" == *"Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date"* ]]; then
+ echo "FAIL: Excess loading time was not properly supported."
+ exit 1
+ else
+ echo "FAIL: Unexpected error in 'ceph orch status':"
+ exit 1
+ fi
+}
+
+main() {
+ setup_cephadm || return 1
+ check_cluster_status || return 1
+ set_balancer_delay || return 1
+ test_loading_time || return 1
+}
+
+main "$@"
flags:
- no_mon_update
- cluster_create
+- name: mgr_module_load_delay
+ type: millisecs
+ level: dev
+ default: 0
+ desc: Number of milliseconds for Manager modules to delay loading. For testing purposes only.
+ long_desc: Number of milliseconds by which to delay a Manager module's loading at mgr startup. Use together with mgr_module_load_delay_name to simulate a slow-loading module. For testing purposes only.
+ see_also:
+ - mgr_module_load_delay_name
+ flags:
+ - runtime
+ services:
+ - mgr
+- name: mgr_module_load_delay_name
+ type: str
+ level: dev
+ desc: Specify which Manager module should have its loading delayed by mgr_module_load_delay milliseconds. For testing purposes only.
+ long_desc: Specify which Manager module should have its loading delayed by mgr_module_load_delay milliseconds. Use together with mgr_module_load_delay to simulate a slow-loading module. For testing purposes only.
+ see_also:
+ - mgr_module_load_delay
+ flags:
+ - runtime
+ services:
+ - mgr
- name: cephadm_path
type: str
level: advanced
// Send all python calls down a Finisher to avoid blocking
// C++ code, and avoid any potential lock cycles.
finisher.queue(new LambdaContext([this, active_module, name](int) {
+ // Delay loading in testing scenarios
+ auto delay = g_conf().get_val<std::chrono::milliseconds>("mgr_module_load_delay");
+ std::string delayed_module = g_conf().get_val<std::string>("mgr_module_load_delay_name");
+ if ((name == delayed_module) && (delay > std::chrono::milliseconds{0})) {
+ dout(4) << "Delaying load time for module '" << name
+ << "' by " << delay << "..." << dendl;
+ std::this_thread::sleep_for(delay);
+ }
int r = active_module->load(this);
std::lock_guard l(lock);
pending_modules.erase(name);
<< active_module->get_fin_thread_name() << dendl;
active_module->finisher.start();
}
+
+ // Signal when we're finally done starting up modules
+ if (pending_modules.empty() && recheck_modules_start) {
+ finisher.queue(recheck_modules_start);
+ recheck_modules_start = nullptr;
+ }
}));
}
return f.get();
});
}
+
+void ActivePyModules::check_all_modules_started(Context *modules_start_complete) {
+ std::lock_guard l(lock);
+ if (pending_modules.empty()) {
+ // Modules are already done starting, signal completion right away
+ finisher.queue(modules_start_complete);
+ } else {
+ recheck_modules_start = modules_start_complete; // signal that we need to check again later
+ }
+}
{
// module class instances not yet created
std::set<std::string, std::less<>> pending_modules;
+ // This context is set during mgr initialization when
+ // modules need more time to finish starting up.
+ // See ActivePyModules::check_all_modules_started() and
+ // ActivePyModules::start_one().
+ Context *recheck_modules_start = nullptr;
// module class instances already created
std::map<std::string, std::shared_ptr<ActivePyModule>> modules;
PyModuleConfig &module_config;
bool inject_python_on() const;
void update_cache_metrics();
+ // Sends the "active" beacon right away if all mgr modules
+ // have finished startup. If some modules are still pending
+ // startup, the "active" beacon is scheduled to send later
+ // only after all modules are ready.
+ // See "PyModuleRegistry::check_all_modules_started()".
+ void check_all_modules_started(Context *modules_start_complete);
};
finisher.queue(new LambdaContext([this, completion](int r){
init();
- completion->complete(0);
+ py_module_registry->check_all_modules_started(
+ new LambdaContext([this, completion](int){
+ {
+ std::lock_guard l(lock);
+ initializing = false;
+ initialized = true;
+ }
+ completion->complete(0);
+ }));
}));
}
#endif
dout(4) << "Complete." << dendl;
- initializing = false;
- initialized = true;
}
void Mgr::load_all_metadata()
active_modules->config_notify();
}
}
+
+void PyModuleRegistry::check_all_modules_started(Context *modules_start_complete) {
+ active_modules->check_all_modules_started(modules_start_complete);
+}
return active_modules->get_module_finisher(name);
}
+ // Sends the "active" beacon right away if all mgr modules
+ // have finished startup. If some modules are still pending
+ // startup, the "active" beacon is scheduled to send later
+ // after all modules are ready.
+ // See "Mgr::background_init()".
+ void check_all_modules_started(Context *modules_start_complete);
+
// <<< (end of ActivePyModules cheeky call-throughs)
};