From: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Date: Mon, 17 Feb 2025 10:32:06 +0000 (+0530)
Subject: rbd-mirror: bootstrap wait for previous disabling group to cleanup
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=2d009c331c1324f7b0f667e41f4ad93b4e5e984c;p=ceph.git

rbd-mirror: bootstrap wait for previous disabling group to cleanup

The problem was seen with the following sequence of operations:
1. The rbd-mirror daemon is stopped on the secondary
2. Mirroring on the group is disabled
3. Image[s] are added to/removed from the group
4. Mirroring on the group is enabled again
5. The rbd-mirror daemon is started again

How this is handled:
1. Two GroupReplayers are started by the InstanceReplayer, one for the old
   group and one for the new group (not surprisingly, both deal with the same
   pool images).
2. The GroupReplayer for the old group enters
   group_replayer::BootstrapRequest, notices that remote_group_id is not
   found, and starts cleaning up the group: """tries to remove the local group
   and all the images. Finally returns to GroupReplayer, stops the
   GroupReplayer setting the state as stopped with description group removed
   and finally unregisters the admin socket hook."""
3. Meanwhile, the GroupReplayer for the new group runs concurrently with the
   old one, figures out that a local group_id exists for the name and """tries
   to remove the local group and all the images. Finally returns to
   GroupReplayer, stops the GroupReplayer setting the state as stopped with
   description group removed and finally unregisters the admin socket hook."""

As you can see, 2 and 3 end up in the same situation because they run
concurrently, i.e. one has to add the group with a name and create images in
the pool, whereas the other has to remove the group with the same name from
the same pool.

Thanks to Ilya for the suggestion here; the fix is simple. The way this is
handled for standalone images is that the second replayer (i.e. (3)) sees that
the image is in MIRROR_IMAGE_STATE_DISABLING state and backs off (i.e. the
second replayer waits and retries later). Do the same for groups by checking
for MIRROR_GROUP_STATE_DISABLING.

If the second replayer backs off with ERESTART, the first replayer should
eventually clean up the old group which would allow the second replayer to
proceed with creating a new group.
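
The back-off described above can be sketched in isolation as follows. This is
only an illustrative model, not the rbd-mirror code: GroupState, LocalGroup,
bootstrap_new_group() and retry_bootstrap() are hypothetical names, and the
callback stands in for whatever the first replayer does between attempts.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

#ifndef ERESTART
#define ERESTART 85  // Linux value; fallback for platforms that do not define it
#endif

// Hypothetical stand-ins for the mirror group states (not the cls::rbd types).
enum class GroupState { DISABLED, ENABLED, DISABLING };

// Hypothetical model of the local pool: a single group slot looked up by name.
struct LocalGroup {
  bool exists = false;
  GroupState state = GroupState::DISABLED;
};

// Models the second replayer's bootstrap step: if a group with the same name
// is still being torn down, back off with -ERESTART instead of touching it.
int bootstrap_new_group(LocalGroup& local) {
  if (local.exists && local.state == GroupState::DISABLING) {
    return -ERESTART;
  }
  local.exists = true;
  local.state = GroupState::ENABLED;
  return 0;
}

// Retries the bootstrap up to max_attempts times. between_attempts models
// the time in which the first replayer may finish removing the old group.
int retry_bootstrap(LocalGroup& local, int max_attempts,
                    const std::function<void(LocalGroup&)>& between_attempts) {
  assert(max_attempts > 0);
  int r = -ERESTART;
  for (int i = 0; i < max_attempts && r == -ERESTART; ++i) {
    r = bootstrap_new_group(local);
    if (r == -ERESTART) {
      between_attempts(local);
    }
  }
  return r;
}
```

In this model the second replayer never removes anything itself; it only
proceeds once the old group is gone, which is the property the fix relies on.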

fixes: issue#27

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
---

diff --git a/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc b/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc
index 84ef18b54cb96..32cae92110215 100644
--- a/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc
+++ b/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc
@@ -699,6 +699,14 @@ void BootstrapRequest<I>::handle_get_local_mirror_group(int r) {
     return;
   }
 
+  if (m_local_mirror_group.state == cls::rbd::MIRROR_GROUP_STATE_DISABLING) {
+    derr << "group with same name exists: " << m_group_name
+         << " and is currently disabling" << dendl;
+    finish(-ERESTART); // The other group replayer might be removing the
+                       // group already, so wait and retry later.
+    return;
+  }
+
   dout(20) << m_local_mirror_group << dendl;
 
   list_local_group_snapshots();