The problem shows up with the following sequence of operations:
1. The mirroring daemon is stopped on the secondary
2. Mirroring on the group is disabled
3. Image[s] are added to/removed from the group
4. Mirroring on the group is enabled again
5. The mirroring daemon is started again
What happens when the daemon comes back:
1. Two GroupReplayers are started by the InstanceReplayer, one for the old group
and one for the new group (not surprisingly, both deal with the same pool images)
2. The GroupReplayer for the old group instance enters
group_replayer::BootstrapRequest, notices that remote_group_id is not found, and
starts cleaning up the group: it tries to remove the local group and all the
images. It then returns to GroupReplayer, which stops, sets the state to
stopped with description "group removed", and finally unregisters the
admin socket hook.
3. Meanwhile, the GroupReplayer for the new group instance runs concurrently
with the old one, finds that a local group_id already exists for the group
name, and follows the same path: it tries to remove the local group and all
the images, returns to GroupReplayer, stops with state stopped and
description "group removed", and finally unregisters the admin socket hook.
Steps 2 and 3 end up in the same situation because the two replayers run
concurrently, even though their goals conflict: one has to add the group by
name and create its images in the pool, whereas the other has to remove the
group with the same name from the same pool.
Thanks to Ilya for the suggestion here; following it, the fix is simple.
For standalone images, the second replayer (i.e. (3)) sees that the image is
in MIRROR_IMAGE_STATE_DISABLING state and backs off (i.e. it waits and
retries later). Do the same for groups: if the second replayer backs off with
ERESTART when the local group is in MIRROR_GROUP_STATE_DISABLING state, the
first replayer should eventually clean up the old group, which would allow the
second replayer to proceed with creating a new group.
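
The back-off can be pictured with the standalone sketch below. It is only a
model of the idea, not the rbd-mirror code: GroupState, LocalGroup,
kERestart, bootstrap_new_group() and attempts_until_bootstrapped() are
hypothetical names made up for illustration, and the retry loop stands in
for whatever re-schedules the replayer after it finishes with ERESTART.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins; the real types live in cls::rbd and rbd-mirror.
enum class GroupState { DISABLING, ENABLED };

constexpr int kERestart = 85;  // stand-in for ERESTART (85 on Linux)

struct LocalGroup {
  std::string name;
  GroupState state;
};

// Bootstrap for the new group: back off while a same-named local group is
// still being disabled, otherwise proceed (0 means success).
int bootstrap_new_group(const std::optional<LocalGroup>& local) {
  if (local && local->state == GroupState::DISABLING) {
    return -kERestart;
  }
  return 0;
}

// Retry loop: cleanup_done_at models the attempt at which the old
// replayer has finished removing the old group. Returns the attempt that
// succeeded, or -1 if max_attempts was exhausted.
int attempts_until_bootstrapped(int cleanup_done_at, int max_attempts) {
  std::optional<LocalGroup> local = LocalGroup{"g1", GroupState::DISABLING};
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (attempt >= cleanup_done_at) {
      local.reset();  // old group is gone
    }
    if (bootstrap_new_group(local) == 0) {
      return attempt;
    }
  }
  return -1;
}
```

In this model the new group's replayer never touches the old group; it only
waits until the old replayer's cleanup has completed.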
fixes: issue#27
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
return;
}
+ if (m_local_mirror_group.state == cls::rbd::MIRROR_GROUP_STATE_DISABLING) {
+ derr << "group with same name exists: " << m_group_name
+ << " and is currently disabling" << dendl;
+ finish(-ERESTART); // The other group replayer might be removing the
+ // group already, so wait and retry later.
+ return;
+ }
+
dout(20) << m_local_mirror_group << dendl;
list_local_group_snapshots();