From: Prasanna Kumar Kalever
Date: Mon, 17 Feb 2025 10:32:06 +0000 (+0530)
Subject: rbd-mirror: bootstrap wait for previous disabling group to cleanup
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=2d009c331c1324f7b0f667e41f4ad93b4e5e984c;p=ceph.git

rbd-mirror: bootstrap wait for previous disabling group to cleanup

Was seeing a case where the following operations are done:
1. The daemon is stopped on the secondary
2. Mirroring on the group is then disabled
3. Image[s] are added to/removed from the group
4. The group is enabled for mirroring again
5. The mirroring daemon is brought back to life

From the handling:
1. Two GroupReplayers are started by the InstanceReplayer, one for the old
   group and one for the new group (not surprisingly, both deal with the
   same pool images).
2. The GroupReplayer for the old group instance enters
   group_replayer::BootstrapRequest, notices that remote_group_id is not
   found, and starts cleaning up the group:
   """tries to remove local group and all the images. Finally returns to
   GroupReplayer, stop the GroupReplayer setting the state as stopped with
   description group removed and finally unregister admin socket hook."""
3. On the other hand, the GroupReplayer for the new group instance runs
   concurrently with the old one, figures out that a local group_id with
   the same name exists, and
   """tries to remove local group and all the images. Finally returns to
   GroupReplayer, stop the GroupReplayer setting the state as stopped with
   description group removed and finally unregister admin socket hook."""

You can see that 2 and 3 end up in the same situation because of the
concurrent behaviour, i.e. one has to add the group with a name and create
images in the pool, whereas the other has to remove the group with the same
name from the same pool.

Thanks to Ilya for the suggestion here; with it, the fix is simple.

The way this is handled for standalone images is that the second replayer
(i.e. (3)) sees that the image is in MIRROR_IMAGE_STATE_DISABLING state and
backs off (i.e. the second group waits and retries later). If the second
replayer backs off with ERESTART, the first replayer should eventually clean
up the old group, which would allow the second replayer to proceed with
creating a new group.

fixes: issue#27
Signed-off-by: Prasanna Kumar Kalever
---

diff --git a/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc b/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc
index 84ef18b54cb96..32cae92110215 100644
--- a/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc
+++ b/src/tools/rbd_mirror/group_replayer/BootstrapRequest.cc
@@ -699,6 +699,14 @@ void BootstrapRequest::handle_get_local_mirror_group(int r) {
     return;
   }
 
+  if (m_local_mirror_group.state == cls::rbd::MIRROR_GROUP_STATE_DISABLING) {
+    derr << "group with same name exists: " << m_group_name
+         << " and is currently disabling" << dendl;
+    finish(-ERESTART); // The other group replayer might be removing the
+                       // group already, so wait and retry later.
+    return;
+  }
+
   dout(20) << m_local_mirror_group << dendl;
 
   list_local_group_snapshots();
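
The back-off described in the commit message can be illustrated with a small standalone sketch. It is not rbd-mirror code: `GroupState`, `bootstrap`, and `bootstrap_with_retry` are hypothetical names invented for illustration, and how the daemon actually schedules the retry is not shown in this patch. The sketch only models the idea that the new replayer returns -ERESTART while the old group is still disabling and succeeds once the cleanup has finished.

```cpp
#include <cassert>
#include <cerrno>

// ERESTART is not part of POSIX; provide a fallback (the Linux value)
// so the sketch also builds on platforms that do not define it.
#ifndef ERESTART
#define ERESTART 85
#endif

// Hypothetical stand-in for the local mirror group state.
enum class GroupState { Disabling, Enabled, Absent };

// Hypothetical bootstrap step: back off while an old group with the same
// name is still being torn down, otherwise proceed.
int bootstrap(GroupState local_state) {
  if (local_state == GroupState::Disabling) {
    return -ERESTART;  // the other replayer may still be cleaning up
  }
  return 0;
}

// Hypothetical retry driver. cleanup_step models the old replayer making
// progress between two attempts of the new replayer.
template <typename Cleanup>
int bootstrap_with_retry(GroupState& state, int max_attempts,
                         Cleanup cleanup_step) {
  assert(max_attempts > 0);
  int r = -ERESTART;
  for (int i = 0; i < max_attempts && r == -ERESTART; ++i) {
    r = bootstrap(state);
    if (r == -ERESTART) {
      cleanup_step(state);  // give the old replayer a chance to finish
    }
  }
  return r;
}
```

In this model the second replayer never touches a group that is still being removed; it simply reports -ERESTART until the state changes, which mirrors the behaviour the message describes for standalone images.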