----------------- Enhancement to the Original Fix -----------------
During a mgr failover, the active mgr is marked available if:
1. The mon has chosen a standby to be active
2. The chosen active mgr has all of its modules initialized
Now that we've improved the criteria for sending the "active" beacon
by requiring the mgr to retry initializing its modules, we need to account
for extreme cases in which the modules are stuck loading for a very long
time, or even indefinitely. In these cases, we don't want to delay sending
the "active" beacon for too long, since doing so blocks other important mgr
functionality, such as reporting PG availability in the health status. We
want to avoid sending warnings about PGs being unknown in the health status
when that's not ultimately the problem.
To account for an exceptionally long module loading time, I added a new
configurable `mgr_module_load_expiration`. If the chosen active mgr exceeds
this maximum amount of time (in ms) allotted for loading the mgr modules
before declaring availability, it will proceed to mark itself "available",
send the "active" beacon to the mon, and unblock other critical mgr
functionality. If this happens, a health error will be issued indicating
which mgr modules got stuck initializing (see src/mgr/PyModuleRegistry.cc).
The idea is to unblock the rest of the mgr's critical functionality while
making it clear to Ceph operators that some modules are unusable.
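For example, since `mgr_module_load_expiration` is a runtime option with a
default of 20000 ms, an operator could widen the window like so (an
illustrative value, not a recommendation):

  ceph config set mgr mgr_module_load_expiration 30000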
--------------------- Integration Testing --------------------
The workunit was rewritten so that it tests these scenarios:
1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be
issued)
3. Unacceptable delay in module loading behavior (a health error should be
issued)
4. Disabling the problematic module after an unacceptable delay (the health
error should clear)
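The workunit can also be run locally against a vstart cluster:

  cd ceph/build
  ../qa/workunits/mgr/test_mgr_module_loading_time.sh --vstart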
--------------------- Documentation --------------------
This documentation explains the "Module failed to initialize"
cluster error.
Users are advised to try failing over
the mgr to restart the module initialization process
and, if the error persists, to file a bug report. I decided
to write it this way instead of providing more complex
debugging tips, such as advising users to disable some mgr modules,
since every case will be different depending on which modules
failed to initialize.
In the bug report, developers can ask for the health detail
output to narrow down which module is causing a bottleneck,
and then ask the user to try disabling certain modules until
the mgr is able to fully initialize.
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
(cherry picked from commit bf25a08cc58c8e872806e92dd36d9ff91e5523d9)
mgr active: $name
-Interpreting Ceph-Mgr Statuses
-==============================
+Interpreting Manager Daemon Status
+==================================
A cluster's health status will show each ``ceph-mgr`` daemon in one of three states:
1. **active**
- This mgr daemon has been fully initialized, which means it is ready to receive
- and execute commands. Only one mgr will be in this state at a time.
+ This Manager daemon has been fully initialized, which means it is ready to receive
+ and execute commands. Only one Manager will be in this state at a time.
2. **active (starting)**
- This mgr daemon has been chosen to be ``active``, but it is not done initializing.
+ This Manager daemon has been chosen to be ``active``, but it is not done initializing.
Although it is not yet ready to execute commands, an operator may still issue commands,
- which will be held and executed once the manager becomes ``active``. Only one mgr will
- be in this state at a time.
+ which will be held and executed once the Manager becomes ``active``. Only one Manager
+ will be in this state at a time.
3. **standby**
- This mgr daemon is not currently receiving or executing commands, but it is there to
- take over if the current active mgr becomes unavailable. An operator may also manually
- promote standby manager to active via ``ceph mgr fail`` if desired. All other mgr daemons
- which are not ``active`` or ``active (starting)`` will be in this state.
+ This Manager daemon is not currently receiving or executing commands, but it is ready to
+ take over if the current active Manager becomes unavailable. An administrator may
+ manually promote a standby to become active via ``ceph mgr fail`` if desired. All other
+ Manager daemons which are not ``active`` or ``active (starting)`` will be in this state.
-Each of these states are visible in the output of the ``ceph -s``. For example:
+Each of these states is visible in the output of the ``ceph status`` command. For example:
.. code-block:: console
- $ ceph -s
+ $ ceph status
cluster:
id: b150f540-745a-460c-a566-376b28b95ac3
health: HEALTH_OK
daemon(s) or use ``ceph mgr fail`` on the active daemon in order to force
failover to another daemon.
+**Module failed to initialize**
+
+If the output of ``ceph health detail`` looks something like this, it means that
+some modules took too long to initialize after a Manager failover and are unable
+to process commands:
+
+.. code-block:: console
+
+ HEALTH_ERR 4 mgr modules have failed
+   [ERR] MGR_MODULE_ERROR: 4 mgr modules have failed
+ Module 'rbd_support' has failed: Module failed to initialize.
+ Module 'status' has failed: Module failed to initialize.
+ Module 'telemetry' has failed: Module failed to initialize.
+ Module 'volumes' has failed: Module failed to initialize.
+
+You can also see these modules listed under ``pending_modules``
+in the output of the following command:
+
+.. prompt:: bash $
+
+ ceph tell mgr mgr_status
+
+To troubleshoot, you may run ``ceph mgr fail`` to restart
+module initialization:
+
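+.. prompt:: bash $
+
+   ceph mgr fail
+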
+Note that the health error may clear on its own since modules
+will continue to initialize in the background.
+
+If the modules are still failing to initialize, please file a bug
+report under the `"mgr" project <https://tracker.ceph.com/projects/mgr>`_
+for further assistance.
+
OSDs
----
#!/bin/bash
-setup_cephadm() {
- # This will create CEPHADM_STRAY_HOST warnings, but we just want to be able to run an orch command.
- echo "Enabling cephadm module..."
- ceph mgr module enable cephadm
- ceph orch set backend cephadm
-}
-
-check_cluster_status() {
- echo "Checking cluster status..."
- ceph -s
-}
-
-set_balancer_delay() {
- echo "Setting balancer module load delay..."
- ceph config set mgr mgr_module_load_delay_name balancer
- ceph config set mgr mgr_module_load_delay 10000
-}
-
-test_loading_time() {
- echo "Testing with module load delay of 10000 ms..."
- ceph mgr fail
-
- local orch_status_output
- if ! orch_status_output=$(ceph orch status 2>&1); then
- echo "FAIL: 'ceph orch status' failed to run:"
- echo "$orch_status_output"
- exit 1
- fi
+# This script tests how the mgr handles different module loading times post failover.
+# The motivation is this tracker ticket: https://tracker.ceph.com/issues/71631
+# To run this script on a vstart cluster, use the following command:
+# cd ceph/build
+# ../qa/workunits/mgr/test_mgr_module_loading_time.sh --vstart
+
+vstart=0
+if [ "$1" = "--vstart" ]; then
+ vstart=1
+fi
+
+ceph="ceph"
+if [ $vstart -eq 1 ]; then
+ ceph="./bin/ceph"
+fi
+
+# This will create CEPHADM_STRAY_HOST warnings, but we just want to be able to run an orch command.
+echo "Enabling cephadm module..."
+"$ceph" mgr module enable cephadm
+"$ceph" orch set backend cephadm
+
+echo "Checking cluster status..."
+"$ceph" -s
+
+# ------ Test 1 ------
+echo "Test 1: Test normal module loading behavior without any injected delays"
+
+echo "Ensure that no module is set for a load delay..."
+"$ceph" config set mgr mgr_module_load_delay_name ""
+
+echo "Test 1: Ensure that there is no injected load delay..."
+"$ceph" config set mgr mgr_module_load_delay 0
+
+"$ceph" mgr fail
+orch_status_output=$("$ceph" orch status 2>&1)
+
+echo "$orch_status_output"
+if [[ "$orch_status_output" == *"Backend: cephadm"* ]]; then
+ echo "PASS: orch command succeeded during normal behavior."
+elif [[ "$orch_status_output" == *"Error ENOTSUP: Module 'orchestrator' is not enabled/loaded"* ]]; then
+ echo "FAIL: orch command failed during normal behavior."
+ exit 1
+else
+ echo "FAIL: Unexpected error in orch command during normal behavior."
+ echo "$orch_status_output"
+ exit 1
+fi
+
+echo "Ensure health detail DOES NOT warn about any modules that failed initialization..."
+health=$("$ceph" health detail 2>&1)
+if [[ "$health" == *"Module failed to initialize"* ]]; then
+ echo "FAIL: One or more modules failed to initialize during small delay."
+ echo "$health"
+ exit 1
+fi
+
+echo "Verify that mgr is active..."
+stat=$("$ceph" -s 2>&1)
+if [[ "$stat" != *"active, since"* ]]; then
+ echo "FAIL: Mgr should be in 'active' state."
+ echo "$stat"
+ exit 1
+fi
+
+# ------ Test 2 ------
+echo "Select balancer module to receive loading delays..."
+"$ceph" config set mgr mgr_module_load_delay_name balancer
+
+echo "Test 2: Inject small delay (10000 ms) that should not exceed max loading retries"
+"$ceph" config set mgr mgr_module_load_delay 10000
+
+"$ceph" mgr fail
+orch_status_output=$("$ceph" orch status 2>&1)
+
+echo "$orch_status_output"
+if [[ "$orch_status_output" == *"Backend: cephadm"* ]]; then
+ echo "PASS: orch command succeeded during small delay."
+elif [[ "$orch_status_output" == *"Error ENOTSUP: Module 'orchestrator' is not enabled/loaded"* ]]; then
+ echo "FAIL: orch command failed during small delay."
+ exit 1
+else
+ echo "FAIL: Unexpected error in orch command during small delay."
+ echo "$orch_status_output"
+ exit 1
+fi
+
+echo "Ensure health detail DOES NOT warn about any modules that failed initialization..."
+health=$("$ceph" health detail 2>&1)
+if [[ "$health" == *"Module failed to initialize"* ]]; then
+ echo "FAIL: One or more modules failed to initialize during small delay."
+ echo "$health"
+ exit 1
+fi
+
+echo "Verify that mgr is active..."
+stat=$("$ceph" -s 2>&1)
+if [[ "$stat" != *"active, since"* ]]; then
+ echo "FAIL: Mgr should be in 'active' state."
+ echo "$stat"
+ exit 1
+fi
+
+# ------ Test 3 ------
+echo "Test 3: Inject large delay (10000000000 ms) that exceeds max loading retries and emits cluster error"
+"$ceph" config set mgr mgr_module_load_delay 10000000000
+
+"$ceph" mgr fail
+orch_status_output=$("$ceph" orch status 2>&1)
+
+echo "$orch_status_output"
+if [[ "$orch_status_output" == *"Error ENOTSUP: Module 'orchestrator' is not enabled/loaded"* ]]; then
+ echo "PASS: orch command failed during large delay as expected."
+else
+ echo "FAIL: Unexpected error in orch command during large delay."
echo "$orch_status_output"
+ exit 1
+fi
+
+echo "Ensure health detail DOES warn about any modules that failed initialization..."
+health=$("$ceph" health detail 2>&1)
+if [[ "$health" == *"Module failed to initialize"* ]]; then
+ echo "PASS: Cluster properly issued error about modules that failed to initialize."
+ echo "$health"
+else
+ echo "FAIL: Cluster did not properly issue error about modules that failed to initialize."
+ echo "$health"
+ exit 1
+fi
+
+echo "Verify that mgr is active..."
+stat=$("$ceph" -s 2>&1)
+if [[ "$stat" != *"active, since"* ]]; then
+ echo "FAIL: Mgr should be in 'active' state."
+ echo "$stat"
+ exit 1
+fi
+
+# ------ Test 4 ------
+echo "Test 4: Disable the problematic module and confirm that the health error goes away"
+
+echo "Disabling the balancer module..."
+"$ceph" mgr module force disable balancer --yes-i-really-mean-it
+
+echo "Sleeping for 10 seconds to allow the health error to clear up..."
+sleep 10
+
+echo "Ensure health detail no longer warns about any modules that failed initialization..."
+health=$("$ceph" health detail 2>&1)
+if [[ "$health" == *"Module failed to initialize"* ]]; then
+ echo "FAIL: One or more modules failed to initialize despite problem module being disabled."
+ echo "$health"
+ exit 1
+fi
+
+echo "Verify that mgr is active..."
+stat=$("$ceph" -s 2>&1)
+if [[ "$stat" != *"active, since"* ]]; then
+ echo "FAIL: Mgr should be in 'active' state."
+ echo "$stat"
+ exit 1
+fi
- if [[ "$orch_status_output" == *"Backend: cephadm"* ]]; then
- echo "PASS: Excess loading time was properly supported."
- elif [[ "$orch_status_output" == *"Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date"* ]]; then
- echo "FAIL: Excess loading time was not properly supported."
- exit 1
- else
- echo "FAIL: Unexpected error in 'ceph orch status':"
- echo "$orch_status_output"
- exit 1
- fi
-}
-
-main() {
- setup_cephadm || return 1
- check_cluster_status || return 1
- set_balancer_delay || return 1
- test_loading_time || return 1
-}
-
-main "$@"
+echo "All tests passed."
- runtime
services:
- mgr
+- name: mgr_module_load_expiration
+ type: millisecs
+ level: dev
+ default: 20000
+  desc: Maximum number of milliseconds the active mgr is allowed to spend loading mgr modules before declaring availability.
+  long_desc: Maximum number of milliseconds the active mgr is allowed to spend loading mgr modules. If any modules are still
+    uninitialized after this expiration is exceeded, the mgr proceeds to declare availability, but a health error will be
+    issued indicating which modules did not load in time.
+ flags:
+ - runtime
+ services:
+ - mgr
- name: cephadm_path
type: str
level: advanced
clog(clog_),
audit_clog(audit_clog_),
initialized(false),
- initializing(false)
+ initializing(false),
+ initialization_start_time(ceph::coarse_mono_clock::zero())
{
cluster_state.set_objecter(objecter);
}
ceph_assert(!initializing);
ceph_assert(!initialized);
initializing = true;
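+  // Record when module loading began; exceeded_initialization_expiration()
+  // measures elapsed time against this timestamp.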
+ initialization_start_time = ceph::coarse_mono_clock::now();
finisher.start();
return false;
}
+bool Mgr::exceeded_initialization_expiration()
+{
+  // initialization_start_time=0 when initialization hasn't started yet,
+  // so we know we can't have exceeded the time expiration.
+ if (ceph::coarse_mono_clock::is_zero(initialization_start_time)) {
+ return false;
+ }
+
+ // Save the amount of time elapsed
+ auto time_elapsed = ceph::coarse_mono_clock::now() - initialization_start_time;
+ dout(20) << "time elapsed since mgr initialization: " << time_elapsed << dendl;
+
+  // Reset the start time if the expiration time has been exceeded.
+  // Signal initialized=true so the mgr forcibly sends an "active" beacon.
+ auto expiration = g_conf().get_val<std::chrono::milliseconds>("mgr_module_load_expiration");
+ bool exceeded_expiration = time_elapsed > expiration;
+ if (exceeded_expiration) {
+ std::lock_guard l(lock);
+ initialization_start_time = ceph::coarse_mono_clock::zero();
+ initializing = false;
+ initialized = true;
+ }
+
+ return exceeded_expiration;
+}
+
void Mgr::handle_mgr_digest(ref_t<MMgrDigest> m)
{
dout(10) << m->mon_status_json.length() << dendl;
bool initialized;
bool initializing;
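+  // Time at which module initialization began; zero until init() runs.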
+ ceph::coarse_mono_time initialization_start_time;
public:
Mgr(MonClient *monc_, const MgrMap& mgrmap,
~Mgr();
bool is_initialized() const {return initialized;}
+ bool exceeded_initialization_expiration();
entity_addrvec_t get_server_addrs() const {
return server.get_myaddrs();
}
}
// Whether I think I am available (request MgrMonitor to set me
- // as available in the map)
- bool available = active_mgr != nullptr && active_mgr->is_initialized();
+ // as available in the map).
+ //
+ // The active mgr is marked available if:
+ // 1. The mon has chosen a standby to be active
+ // 2. The chosen active mgr has all of its modules initialized
+ //
+ // In extreme cases, if modules take very long to initialize (a buffer of extra time
+ // is allowed; see "mgr_module_load_expiration"), we will proceed to mark the chosen
+ // active mgr "available" to unblock other mgr functionality such as reporting PG
+ // availability. If this happens, a health error will be issued indicating which
+  // mgr modules got stuck initializing (see src/mgr/PyModuleRegistry.cc). This unblocks
+  // the rest of the mgr's functionality while making it clear that some modules
+  // are unusable.
+ bool available = false;
+ if (active_mgr != nullptr) {
+ available = active_mgr->is_initialized() || active_mgr->exceeded_initialization_expiration();
+ }
auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;
// checks (to avoid outputting two health messages about a
// module that said can_run=false but we tried running it anyway)
failed_modules[module->get_name()] = module->get_error_string();
+  } else if (active_modules->is_pending(module->get_name())) {
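+    // The module is still pending after the active mgr declared availability
+    // (see mgr_module_load_expiration), so report it as failed to initialize.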
+ failed_modules[module->get_name()] = "Module failed to initialize.";
}
}