git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/commitdiff
cephadm: retry cleaning old cgroups when it fails 67049/head
author Adam King <adking@redhat.com>
Thu, 22 Jan 2026 16:25:02 +0000 (11:25 -0500)
committer Adam King <adking@redhat.com>
Thu, 22 Jan 2026 16:25:02 +0000 (11:25 -0500)
When redeploying a daemon, the shutdown triggered by cephadm
running `systemctl stop` may not have completed by the time
we try to clean the old cgroup, so the cleanup fails. Moving
on immediately to start the systemd unit in that state tends
to result in the unit failing to start. This patch retries
cleaning the old cgroups (up to four attempts, sleeping 0.5,
1 and 2 seconds between them), which should avoid the race
condition and the daemon start failures it causes.

Signed-off-by: Adam King <adking@redhat.com>
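
For readers skimming the message, the retry described above can be sketched in isolation roughly as follows. This is only an illustrative sketch, not code from the patch: `retry_on_oserror`, its `delays` default and the injectable `sleep` parameter are hypothetical names chosen here, and a `None` sentinel stands in for "last attempt, do not wait again".

```python
import logging
import time
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger(__name__)


def retry_on_oserror(
    action: Callable[[], None],
    delays: Sequence[float] = (0.5, 1.0, 2.0),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical helper: run action, retrying on OSError.

    Returns True as soon as action succeeds, False once every
    attempt has failed.
    """
    # One attempt per delay, plus a final attempt (None) after
    # which we give up instead of sleeping again.
    schedule: List[Optional[float]] = [*delays, None]
    for delay in schedule:
        try:
            action()
        except OSError:
            if delay is None:
                logger.warning('Failed %d times. Giving up!', len(schedule))
                return False
            logger.warning('Failed. Retrying in %s seconds...', delay)
            sleep(delay)
        else:
            # Success: stop retrying.
            return True
    return False
```

Under these assumptions, a call such as `retry_on_oserror(lambda: cg_trim(cg_path))` would make at most four attempts. Passing `sleep` as a parameter is a design choice made only so the sketch can be exercised without real waiting.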
src/cephadm/cephadm.py

index 4e255e551656bf9e793c3bfa7916937c0730adf9..03234518cc74ec25c71c73954dc99c41a345dfa7 100755 (executable)
@@ -1015,10 +1015,18 @@ def clean_cgroup(ctx: CephadmContext, fsid: str, unit_name: str) -> None:
             if p.is_dir():
                 cg_trim(p)
         path.rmdir()
-    try:
-        cg_trim(cg_path)
-    except OSError:
-        logger.warning(f'Failed to trim old cgroups {cg_path}')
+
+    for s in [0.5, 1.0, 2.0, False]:
+        try:
+            cg_trim(cg_path)
+        except OSError:
+            if not s:
+                logger.warning(f'Failed 4 times to trim old cgroups <{cg_path}>. Giving up!')
+            else:
+                logger.warning(f'Failed to trim old cgroups <{cg_path}>. Retrying in {s} seconds...')
+                time.sleep(s)
+        else:
+            break
 
 
 def deploy_daemon_units(