git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
5 days ago librbd/api: restrict APIs on secondary mirror groups 64946/head
VinayBhaskar-V [Mon, 11 Aug 2025 09:32:03 +0000 (09:32 +0000)]
librbd/api: restrict APIs on secondary mirror groups

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
6 days ago Merge pull request #64782 from ajarr/ajarr-wip-async-group-enable
Ilya Dryomov [Fri, 19 Sep 2025 12:44:31 +0000 (14:44 +0200)]
Merge pull request #64782 from ajarr/ajarr-wip-async-group-enable

librbd: make mirror group enable asynchronous

Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 days ago Merge pull request #65452 from abitdrag/wip-miki-test-rbd-cgsm-base
Ilya Dryomov [Tue, 16 Sep 2025 22:12:05 +0000 (00:12 +0200)]
Merge pull request #65452 from abitdrag/wip-miki-test-rbd-cgsm-base

qa/workunits/rbd: test snapshot sync with interrupted daemon on secondary

Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
8 days ago Merge pull request #65524 from pkalever/pkalever-user-snap-N-peer-uuid-fixes
Ilya Dryomov [Tue, 16 Sep 2025 22:03:56 +0000 (00:03 +0200)]
Merge pull request #65524 from pkalever/pkalever-user-snap-N-peer-uuid-fixes

rbd-mirror: fix group snapshot sync ordering and also properly remove peer UUIDs

Reviewed-by: VinayBhaskar-V <vvarada@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
8 days ago rbd-mirror: do not prune primary snapshots as part of the secondary daemon 65524/head
Prasanna Kumar Kalever [Tue, 16 Sep 2025 12:47:18 +0000 (18:17 +0530)]
rbd-mirror: do not prune primary snapshots as part of the secondary daemon

Problem:
In a relocate situation there is a demote involved, and we have a demote
snapshot both on the primary with peer UUID attached and on the secondary with
peer UUID attached. Later, a promote on, say, site-b triggered primary demote
group snap removal on site-a, which started an image snap removal. This sent a
request to the image replayer (idle at that time), but the request had not yet
gone through.

As a next command we had --force promote on site-a but since the request on
site-a for image snap removal is already in place it somehow went through and
removed the image snaps leaving the group snap on site-a primary demote snapshot
partially deleted with peer UUID attached. And now note site-a is again primary.

A later test involved a demote on site-b followed by a resync on it.
But since the primary demote snapshot is lingering in the partially deleted
state and since peer UUID is still attached, there is no way the resync will
go through, leaving the syncing snaps in incomplete state forever.

Note: Any CLI command on new primary site-a cannot delete the partially deleted
snap because there is a peer UUID attached to it.

Solution:
Do not prune any primary snapshot (including primary demote snapshots) as part
of the secondary daemon operations.

It is safe, however, to unlink remote group snapshots (such as non-primary
demote snapshots). Just ensure that primary demote snapshots are not pruned by
the secondary daemon.
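
The rule above can be sketched as a simple predicate. This is only an
illustration: the state names and the GroupSnap type are hypothetical stand-ins,
not the daemon's actual C++ types.

```python
from dataclasses import dataclass

# Hypothetical state labels, used only for this sketch.
PRIMARY = "primary"
PRIMARY_DEMOTED = "primary_demoted"
NON_PRIMARY = "non_primary"
NON_PRIMARY_DEMOTED = "non_primary_demoted"


@dataclass
class GroupSnap:
    snap_id: str
    state: str


def secondary_may_prune(snap: GroupSnap) -> bool:
    """Return True only for snapshots the secondary daemon may prune.

    Primary snapshots, including primary demote snapshots, are never
    pruned from the secondary side.
    """
    return snap.state not in (PRIMARY, PRIMARY_DEMOTED)
```

With this predicate a non-primary demote snapshot remains eligible for
unlinking, while a primary demote snapshot is always left alone.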

Even in scenarios where a relocation back to the initial cluster occurs in the
future, the primary demoted snapshot on the current secondary (which becomes
primary during relocation) will not be synced again, even though it retains
a peer UUID. This is because the scan_for_unsynced_group_snapshots()
function uses the last local group snapshot ID to locate the corresponding
snapshot on the remote side and continues syncing from that point onward, rather
than starting from the beginning.
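
A minimal sketch of the "resume from the last local snapshot" idea, assuming
snapshot IDs are kept in creation order. The function name and the fallback of
returning everything when the ID is unknown are assumptions made for
illustration, not the behavior of scan_for_unsynced_group_snapshots() itself.

```python
def unsynced_remote_snaps(remote_ids, last_local_id):
    """Return the remote snapshot IDs that come after last_local_id.

    Assumption for this sketch: if there is no local snapshot yet, or its
    ID cannot be found remotely, every remote snapshot is returned.
    """
    if last_local_id is None or last_local_id not in remote_ids:
        return list(remote_ids)
    # Continue from the position just past the last locally known snapshot.
    start = remote_ids.index(last_local_id) + 1
    return list(remote_ids[start:])
```

An older snapshot that sits before the last local ID is therefore skipped even
if it still carries a peer UUID.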

However, in the event of a full resync request, this snapshot will be synced.
This is expected behavior and should not pose any issues. This actually
helps in pruning the primary demote snap on primary quickly later.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 days ago librbd/mirror: remove peer UUID of group snapshot before attempting removal
Prasanna Kumar Kalever [Thu, 4 Sep 2025 16:26:15 +0000 (21:56 +0530)]
librbd/mirror: remove peer UUID of group snapshot before attempting removal

Problem:
When the number of snapshots exceeds the configured limit
(rbd_mirroring_max_mirroring_snapshots), the cleanup process removes the last
group snapshot on the primary. If this removal fails partially, the
group snapshot with some or no image snaps may still remain.

As a result:
* The snapshot is incorrectly considered for synchronization to the secondary.
* The secondary snapshot can remain forever in the incomplete state, stalling
  daemon progress.

Solution:
Before attempting to remove the group snapshot:
* Explicitly remove the peer UUID associated with the snapshot.
* This ensures that the snapshot is no longer considered eligible for
  synchronization.
* Even if the actual snapshot deletion fails partially, the snapshot is not
  tried for mirroring, avoiding daemon stalls or inconsistencies.
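
The ordering described above can be sketched as follows. The dict layout and
the helper names are hypothetical; the point is only that peers are unlinked
before the removal is attempted.

```python
def remove_group_snapshot(snap, remove_image_snaps):
    """Unlink peers first, then attempt the actual removal.

    `snap` is a dict with a "peer_uuids" set; `remove_image_snaps` is a
    caller-supplied callable that may raise OSError on partial failure.
    """
    snap["peer_uuids"].clear()        # step 1: no longer eligible for sync
    try:
        remove_image_snaps(snap)      # step 2: may fail part-way
    except OSError:
        return False                  # snapshot lingers, but without peers
    snap["removed"] = True
    return True


def eligible_for_sync(snap):
    """A snapshot is considered for mirroring only while it has peers."""
    return bool(snap["peer_uuids"]) and not snap.get("removed", False)
```

Even when the removal callable fails, the leftover snapshot has no peer UUID
and is not picked up for synchronization.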

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 days ago rbd_mirror: order the user group snaps and the mirror group snaps syncing
Prasanna Kumar Kalever [Fri, 29 Aug 2025 08:36:58 +0000 (14:06 +0530)]
rbd_mirror: order the user group snaps and the mirror group snaps syncing

In case of user group snap addition:
* Wait for the user group snap sync to complete and only then mark
  the mirror group snap as complete
In case of user group snap removal:
* Wait for the user group snap to get pruned and only then mark the
  mirror group snap as complete
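
As a rough sketch of the ordering rule, assuming the replayer tracks the user
group snaps it is still waiting on (both argument names are hypothetical):

```python
def can_complete_mirror_snap(pending_user_syncs, pending_user_prunes):
    """Decide whether a mirror group snap may be marked complete.

    It is allowed only when no user group snap is still waiting to be
    synced (addition case) or pruned (removal case).
    """
    return not pending_user_syncs and not pending_user_prunes
```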

Updated tests:
  • test_create_group_with_images_then_mirror_with_regular_snapshots
      enable scenario 1 as it works with the recent changes; also remove the
      additional mirror group snapshot as it is no longer required.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
9 days ago qa/workunits/rbd: test snapshot sync with interrupted daemon on secondary 65452/head
Miki Patel [Tue, 9 Sep 2025 10:19:24 +0000 (15:49 +0530)]
qa/workunits/rbd: test snapshot sync with interrupted daemon on secondary

Validate the snapshot mirroring functionality when the rbd-mirror daemon on
the secondary cluster is interrupted during a snapshot sync process.

Signed-off-by: Miki Patel <miki.patel132@gmail.com>
9 days ago Merge pull request #65004 from VinayBhaskar-V/wip-async-gr
Ilya Dryomov [Mon, 15 Sep 2025 17:45:05 +0000 (19:45 +0200)]
Merge pull request #65004 from VinayBhaskar-V/wip-async-gr

rbd-mirror: replace cond.wait() with async callbacks in group_replayer::Replayer

Reviewed-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
9 days ago rbd-mirror: replace cond.wait() with async callbacks in group_replayer::Replayer 65004/head
VinayBhaskar-V [Fri, 4 Jul 2025 12:41:38 +0000 (18:11 +0530)]
rbd-mirror: replace cond.wait() with async callbacks in group_replayer::Replayer

This commit also replaces all cls sync API calls with async calls and
handles shut_down by properly tracking all async_ops()

Co-authored-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2 weeks ago librbd/api/Group.cc: use correct error code symbolic 64782/head
Ramana Raja [Mon, 8 Sep 2025 17:44:22 +0000 (13:44 -0400)]
librbd/api/Group.cc: use correct error code symbolic

... constant, EOPNOTSUPP, instead of ENOTSUP, for handling errors
raised on the server side for not having support for mirror groups.

Signed-off-by: Ramana Raja <rraja@redhat.com>
2 weeks ago Merge pull request #65341 from ajarr/wip-ajarr-fix-force-promote-delete-gp-race
Ilya Dryomov [Fri, 5 Sep 2025 16:47:35 +0000 (18:47 +0200)]
Merge pull request #65341 from ajarr/wip-ajarr-fix-force-promote-delete-gp-race

qa/workunits/rbd: fix race in test_force_promote_delete_group

Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
3 weeks ago qa/workunits/rbd: fix race in test_force_promote_delete_group 65341/head
Ramana Raja [Tue, 26 Aug 2025 19:22:12 +0000 (15:22 -0400)]
qa/workunits/rbd: fix race in test_force_promote_delete_group

... that is a part of the rbd_mirror_group_simple test suite.

In "test_force_promote_delete_group," the secondary group is
force-promoted, but the original primary is not demoted. The
force-promoted group is disabled for mirroring. One of its member images
is removed. The group is then re-enabled for mirroring before being
removed.

Sometimes the check to confirm that the group has been removed fails.
This happens because the rbd-mirror daemon’s group replayer, after
restarting, recreates the group immediately after it has been explicitly
removed and before the check for group removal.

To fix this race, remove the check to confirm that the group has been
removed as it is not necessary. The successful execution of the group
removal command should be sufficient.

Signed-off-by: Ramana Raja <rraja@redhat.com>
4 weeks ago librbd/api/Mirror: enhancements to group_enable API
N Balachandran [Mon, 2 Jun 2025 04:08:52 +0000 (09:38 +0530)]
librbd/api/Mirror: enhancements to group_enable API

Notable enhancements have been made to the mirror group_enable API:
* Previously, member images were enabled for mirroring synchronously
  and one after another. They are now enabled asynchronously and
  concurrently, making the time to enable a group independent of the
  number of member images it contains.

* Intermediate steps involved in enabling an image, such as fetching
  mirror peers, acquiring an exclusive lock, creating a mirror snapshot,
  among other operations, run asynchronously but are synchronized
  across all the group member images using C_Gather callbacks. An
  asynchronous operation on an image proceeds to the next step only
  after that operation completes for all member images. This allows for
  easier cleanup of member images and the group when an intermediate
  operation fails for one or more member images.

* The exclusive locks on member images are now held for a shorter
  duration. All member image locks are acquired just before taking
  image snapshots and released after retrieving the snapshot IDs.
  Previously, the locks were obtained earlier in the sequence of
  operations, even for steps that did not need them, and were not
  explicitly released.
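
The "every image finishes a step before any image starts the next" pattern can
be illustrated with a thread pool. This is a sketch of the idea only: the real
code uses C_Gather callbacks in C++, and the step functions here are
hypothetical callables that return True on success.

```python
from concurrent.futures import ThreadPoolExecutor


def run_steps_in_lockstep(images, steps):
    """Run each step on all images concurrently, one step at a time.

    The next step starts only after the current one has finished for
    every image. Returns (ok, names_of_completed_steps).
    """
    completed = []
    with ThreadPoolExecutor(max_workers=max(1, len(images))) as pool:
        for step in steps:
            # list() waits for every image: this is the gather point.
            results = list(pool.map(step, images))
            if not all(results):
                # Stop here so the caller can clean up all member images.
                return False, completed
            completed.append(step.__name__)
    return True, completed
```

Because a failure is detected at a step boundary, every image is known to be
at the same stage when cleanup begins.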

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
Co-Authored-by: Ramana Raja <rraja@redhat.com>
6 weeks ago Merge pull request #64913 from VinayBhaskar-V/wip-fix-group-disable
Ilya Dryomov [Tue, 12 Aug 2025 19:00:58 +0000 (21:00 +0200)]
Merge pull request #64913 from VinayBhaskar-V/wip-fix-group-disable

rbd-mirror: Allow group disable without --force when in PROMOTION_STATE_UNKNOWN

Reviewed-by: Ramana Raja <rraja@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
6 weeks ago rbd-mirror: Allow group disable without --force when in PROMOTION_STATE_UNKNOWN 64913/head
VinayBhaskar-V [Fri, 8 Aug 2025 13:00:34 +0000 (13:00 +0000)]
rbd-mirror: Allow group disable without --force when in PROMOTION_STATE_UNKNOWN

Previously, disabling a primary group in PROMOTION_STATE_UNKNOWN required the
force flag, which deviated from the existing behavior of standalone images.
This change aligns group disable behavior with that of standalone images by
allowing primary groups in PROMOTION_STATE_UNKNOWN to be disabled without the
force flag.

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
8 weeks ago rbd-mirror: further adding some more helper routines 53793/head
Prasanna Kumar Kalever [Mon, 23 Jun 2025 06:15:55 +0000 (11:45 +0530)]
rbd-mirror: further adding some more helper routines

This commit aims to simplify:
load_local_group_snapshots
load_remote_group_snapshots
scan_for_unsynced_group_snapshots

With this commit it should be clear that none of the above routines use
reverse iterators or too much nesting

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago cleanup: rename remove_mirror_peer_uuid to mirror_group_snapshot_unlink_peer
Prasanna Kumar Kalever [Fri, 20 Jun 2025 10:56:53 +0000 (16:26 +0530)]
cleanup: rename remove_mirror_peer_uuid to mirror_group_snapshot_unlink_peer

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: fix group replayer deleting the previous mirror group snapshot
Prasanna Kumar Kalever [Fri, 20 Jun 2025 10:52:46 +0000 (16:22 +0530)]
rbd-mirror: fix group replayer deleting the previous mirror group snapshot

The prune logic in the group replayer currently holds the previous
group snapshot until the current group snapshot syncing is complete.
Hence there is always a previous group snapshot (and the respective
image snapshots).

But when there are user group snapshots, the group replayer takes them into
account and prunes the previous mirror snapshots, which shouldn't be done.
The group replayer should always hold the previous mirror group snapshot
(irrespective of the user group snap).
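
A small sketch of the intended prune selection, under the assumption that
snapshots are listed oldest first and tagged as "mirror" or "user" (both the
layout and the function name are hypothetical):

```python
def mirror_snaps_to_prune(snaps):
    """Return mirror snap IDs that may be pruned.

    `snaps` is an ordered list of (snap_id, kind) pairs. The two newest
    mirror snaps (current and previous) are always kept, no matter how
    many user snaps sit between them.
    """
    mirror_ids = [sid for sid, kind in snaps if kind == "mirror"]
    return mirror_ids[:-2]
```

User snapshots are filtered out before counting, so they cannot push the
previous mirror snapshot into the prune list.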

Also rename unlink_group_snapshots() -> prune_group_snapshots() and
substitute the use of unlink with prune as required.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago cleanup: adjust the layout of various routines
Prasanna Kumar Kalever [Fri, 20 Jun 2025 10:28:22 +0000 (15:58 +0530)]
cleanup: adjust the layout of various routines

1. move prepare_non_primary_mirror_snap_name to anonymous namespace
2. call is_rename_requested() in handle_load_remote_group_snapshots()
3. restructure various routines for better readability

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: unifying lock access
Prasanna Kumar Kalever [Thu, 19 Jun 2025 13:51:53 +0000 (19:21 +0530)]
rbd-mirror: unifying lock access

cleanup the intermediate unnecessary unlocking followed by locking

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: align mirror_snapshot_complete and regular_snapshot_complete
Prasanna Kumar Kalever [Thu, 19 Jun 2025 13:46:17 +0000 (19:16 +0530)]
rbd-mirror: align mirror_snapshot_complete and regular_snapshot_complete

keep mirror_snapshot_complete and regular_snapshot_complete similar and
cleanup validate_image_snaps_sync_complete()

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: avoid altering local group snap vector elements
Prasanna Kumar Kalever [Tue, 27 May 2025 07:52:37 +0000 (13:22 +0530)]
rbd-mirror: avoid altering local group snap vector elements

Now the m_local_group_snaps is dedicated to ListSnapshotsRequest() locally
and is unaltered.

Note: this patch fixes the crash that can be seen when a user snap is
created, removed, and then created again with the same name.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: allow removal of user group snapshot if id do not match
Prasanna Kumar Kalever [Tue, 27 May 2025 07:42:05 +0000 (13:12 +0530)]
rbd-mirror: allow removal of user group snapshot if id do not match

Currently unlink_group_snapshots() compares regular group snapshot names on
the remote group and the local group. If a user snapshot with the same name
exists on the remote group, we do not remove it locally. However, the user may
have deleted the user group snapshot on the remote group and recreated it with
the same name (before the deletion was conveyed locally via a mirror group
snapshot). In that case we still don't delete the group snapshot locally,
because the name still matches the locally existing snapshot. Hence we need to
preserve the user snapshot locally only if the IDs match, not the group
snapshot names.

Without this fix the mirroring daemon stays stuck at the same snapshot.
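
The difference between matching by name and matching by ID can be shown with a
small sketch. The dict layout (snapshot ID mapped to name) and the function
name are assumptions for illustration.

```python
def local_user_snaps_to_remove(local, remote):
    """Return IDs of local user snaps that should be removed.

    `local` and `remote` map snapshot ID -> snapshot name. A local snap is
    kept only if the same ID exists remotely; a matching name alone is
    not enough.
    """
    return sorted(sid for sid in local if sid not in remote)
```

A snapshot that was deleted and recreated remotely under the same name gets a
new ID, so the stale local copy is selected for removal.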

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago qa/workunits/rbd: fix test_group_with_clone_image and test_images_different_pools
Prasanna Kumar Kalever [Thu, 12 Jun 2025 15:11:49 +0000 (20:41 +0530)]
qa/workunits/rbd: fix test_group_with_clone_image and test_images_different_pools

Uncomment and run test_group_with_clone_image and test_images_different_pools.
These tests are now modified to expect failures at mirror enable time.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd: unlink the individual image snapshots taken as part of group snapshot
Prasanna Kumar Kalever [Thu, 24 Apr 2025 07:14:24 +0000 (12:44 +0530)]
librbd: unlink the individual image snapshots taken as part of group snapshot

... but are not linked to group snapshot. This can happen when a mirror group
snapshot process crashes and the group snapshot is left in INCOMPLETE
state. The fix is to crawl through the list of images in the group, find the
individual image snapshots that carry a matching group snap id, and remove
them.
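
A sketch of the crawl, assuming each image snapshot carries a tag with the
group snap id it was taken for. The data layout and names are hypothetical.

```python
def find_unlinked_image_snaps(group_snap_id, linked, images):
    """Find image snaps taken for a group snap but never linked to it.

    `images` maps image name -> list of (image_snap_id, group_snap_tag).
    `linked` is a set of (image name, image_snap_id) pairs that the group
    snapshot already references.
    """
    orphans = []
    for image, snaps in images.items():
        for snap_id, tag in snaps:
            if tag == group_snap_id and (image, snap_id) not in linked:
                orphans.append((image, snap_id))
    return orphans
```

The returned pairs are the candidates for removal after a crashed group
snapshot operation.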

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd/api: fix mirror group demote
Prasanna Kumar Kalever [Thu, 17 Apr 2025 15:04:21 +0000 (20:34 +0530)]
librbd/api: fix mirror group demote

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd/api: fix mirror group disable
Prasanna Kumar Kalever [Thu, 26 Jun 2025 11:18:00 +0000 (16:48 +0530)]
librbd/api: fix mirror group disable

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd/api: fix group enable
Prasanna Kumar Kalever [Thu, 17 Apr 2025 16:17:14 +0000 (21:47 +0530)]
librbd/api: fix group enable

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: filter group snapshots by mirror_peer_uuid in group_replayer
VinayBhaskar-V [Tue, 27 May 2025 13:28:51 +0000 (18:58 +0530)]
rbd-mirror: filter group snapshots by mirror_peer_uuid in group_replayer

The group_replayer::Replayer was previously including all remote group
snapshots during synchronization, regardless of whether they were
associated with the current peer's mirror_peer_uuid. This could result
in syncing snapshots that belong to other peers, leading to inconsistent
behavior.

This commit ensures that only remote group snapshots containing the
expected mirror_peer_uuid are processed. Snapshots that do not match
are filtered out early in the replay logic, preventing cross-peer
interference.
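
The early filter can be sketched as below. The "mirror_peer_uuids" key and the
function name are assumptions for this example, not the replayer's actual data
model.

```python
def snaps_for_peer(remote_snaps, peer_uuid):
    """Keep only remote group snaps that list the expected peer UUID.

    Each snap is a dict with a "mirror_peer_uuids" collection.
    """
    return [s for s in remote_snaps if peer_uuid in s["mirror_peer_uuids"]]
```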

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
8 weeks ago rbd-mirror: ensure safe access to m_prune_snap_ids_by_gr
VinayBhaskar-V [Tue, 27 May 2025 13:19:08 +0000 (18:49 +0530)]
rbd-mirror: ensure safe access to m_prune_snap_ids_by_gr

The ImageReplayer m_prune_snap_ids_by_gr set is modified under a lock
but accessed from external calls without a lock. This unsynchronized access can
lead to undefined behavior if the set is accessed while the set is being modified.

This commit ensures that ImageReplayer m_prune_snap_ids_by_gr is
safely accessed under a lock.
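
The same idea in a minimal Python sketch: every access, read or write, takes
the lock, and readers get a copy instead of the live set. The PruneSet class
is hypothetical and only mirrors the pattern.

```python
import threading


class PruneSet:
    """A set of snapshot IDs guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = set()

    def add(self, snap_id):
        with self._lock:
            self._ids.add(snap_id)

    def snapshot(self):
        """Return a copy so callers never iterate the live set."""
        with self._lock:
            return set(self._ids)
```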

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
8 weeks ago qa/workunits/rbd: bypass regular group snapshot removal checks
Prasanna Kumar Kalever [Thu, 22 May 2025 17:55:38 +0000 (23:25 +0530)]
qa/workunits/rbd: bypass regular group snapshot removal checks

Currently there is a delay in the deletion of the regular group snaps that
needs fixing, so comment out the checks for now to avoid the temporary test
failures.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd: allow group remove when group mirroring isn't supported
Ilya Dryomov [Mon, 16 Jun 2025 20:46:38 +0000 (22:46 +0200)]
librbd: allow group remove when group mirroring isn't supported

Ignore a potential EOPNOTSUPP error from Mirror::group_disable() in
Group::remove() -- this would come up if the client side (including rbd
CLI) is upgraded before the OSDs.  The same is done for standalone
images in RemoveRequest::handle_disable_mirror().
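
The error handling described above follows a common pattern: treat "operation
not supported" from the optional step as success and continue. A sketch with
hypothetical callables that return 0 or a negative errno value:

```python
import errno


def remove_group(disable_mirroring, remove):
    """Disable mirroring, then remove the group.

    EOPNOTSUPP from the disable step is ignored, since it only means the
    server side does not know about group mirroring. Any other error
    aborts the removal.
    """
    r = disable_mirroring()
    if r < 0 and r != -errno.EOPNOTSUPP:
        return r
    return remove()
```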

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Resolves: rhbz#2372434

8 weeks ago rbd: don't fail "rbd group info" if group mirroring isn't supported
Ilya Dryomov [Mon, 16 Jun 2025 20:46:27 +0000 (22:46 +0200)]
rbd: don't fail "rbd group info" if group mirroring isn't supported

Make it behave as if mirroring was disabled on the group.  This comes
up if the client side (including rbd CLI) is upgraded before the OSDs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Resolves: rhbz#2372434

8 weeks ago rbd: don't fail "rbd mirror pool status" if group mirroring isn't supported
Ilya Dryomov [Mon, 16 Jun 2025 16:07:57 +0000 (18:07 +0200)]
rbd: don't fail "rbd mirror pool status" if group mirroring isn't supported

Make it behave as if mirroring was disabled on all groups.  This comes
up if the client side (including rbd CLI) is upgraded before the OSDs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Resolves: rhbz#2372434

8 weeks ago librbd/api: fix the way we get the promotion state of the group today
Prasanna Kumar Kalever [Wed, 21 May 2025 18:30:49 +0000 (00:00 +0530)]
librbd/api: fix the way we get the promotion state of the group today

avoid using get_last_mirror_snapshot_state() and start using
GroupGetInfoRequest()

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: avoid erasing local snap vector elements
Prasanna Kumar Kalever [Wed, 21 May 2025 13:10:51 +0000 (18:40 +0530)]
rbd-mirror: avoid erasing local snap vector elements

... let the next refresh that loads local snaps reload the new set of snaps

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: group replayer defend for orphan snapshot
Prasanna Kumar Kalever [Sun, 18 May 2025 17:38:29 +0000 (23:08 +0530)]
rbd-mirror: group replayer defend for orphan snapshot

once the group replayer notices the orphan snapshot, it should shut down.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd: group get info skip incomplete snapshot
Prasanna Kumar Kalever [Fri, 16 May 2025 08:44:06 +0000 (14:14 +0530)]
librbd: group get info skip incomplete snapshot

group get info shouldn't look at incomplete group snapshot if group is
primary or primary demoted. For example, a failed force promote can leave an
incomplete primary group snapshot on the group; it doesn't mean that the group
is primary yet.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd: group promote create orphan snapshot
Prasanna Kumar Kalever [Mon, 19 May 2025 15:08:50 +0000 (20:38 +0530)]
librbd: group promote create orphan snapshot

Create a dummy/orphan snapshot on a force promote operation. It is a
special non-primary snapshot without a mirror uuid or any image snapshots
within it. Its existence denotes that a force promote had begun. It is
created when a force promote was issued and no demote snapshot exists locally.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd/api: fix mirror group promote
Prasanna Kumar Kalever [Tue, 29 Apr 2025 09:39:17 +0000 (15:09 +0530)]
librbd/api: fix mirror group promote

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago cls/rbd: add a field complete to GroupSnapshotNamespaceMirror
N Balachandran [Tue, 20 May 2025 11:15:50 +0000 (16:45 +0530)]
cls/rbd: add a field complete to GroupSnapshotNamespaceMirror

Add a field to allow tracking whether the mirror group snapshot syncing
is complete separate from the mirror group snapshot creation itself.

At this point it's never set/used but already exposed in the public API
to avoid a binary incompatibility in the near future.

Co-authored-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks ago rbd-mirror: ignore EOPNOTSUPP for group listing
N Balachandran [Mon, 12 May 2025 15:35:24 +0000 (21:05 +0530)]
rbd-mirror: ignore EOPNOTSUPP for group listing

pool_watcher::RefreshEntitiesRequest errored out if listing mirrored
groups on a remote cluster failed with EOPNOTSUPP. In mixed clusters, this
prevented image replayers from starting up and hence syncing standalone images.
EOPNOTSUPP is now ignored when listing mirror groups.

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks ago rbd-mirror: fix missing mirror group status
N Balachandran [Fri, 25 Apr 2025 11:31:58 +0000 (17:01 +0530)]
rbd-mirror: fix missing mirror group status

The recent changes to the GroupReplayer Bootstrap caused the
mirror group status for a newly enabled group to not be written to
the rbd_mirroring object on the secondary for up to 10 seconds.
The cls_rbd group_status_set did not write the status to disk if
the group does not exist or if its state is not enabled. This has been
modified to write the status even if the status is creating.

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks ago rbd-mirror: return EREMCHG from bootstrap in case of rename
VinayBhaskar-V [Mon, 5 May 2025 06:12:55 +0000 (11:42 +0530)]
rbd-mirror: return EREMCHG from bootstrap in case of rename

also, align the status descriptions to match with ImageReplayer

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago librbd: let Mirror::group_get_info() return ENOENT only if group DNE
Ilya Dryomov [Tue, 29 Apr 2025 18:12:42 +0000 (20:12 +0200)]
librbd: let Mirror::group_get_info() return ENOENT only if group DNE

Getting ENOENT when the group doesn't exist is very much desired,
especially considering the fact that groups can't be opened like images
are and that group APIs operate in terms of group names.  However, in
case the group exists but mirroring is disabled, it's better to remain
consistent with Mirror::image_get_info() and return a partially formed
struct where state is set to RBD_MIRROR_GROUP_DISABLED -- this is the
behavior that API users have grown to rely on.
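
The two cases can be sketched as follows. The function below is a hypothetical
stand-in that works on a plain dict; it is not the librbd API, and only the
state name RBD_MIRROR_GROUP_DISABLED is taken from the text above.

```python
import errno


def group_get_info(groups, name):
    """Return (retcode, info) for a group.

    `groups` maps group name -> mirror info dict, or None when the group
    exists but mirroring is disabled on it.
    """
    if name not in groups:
        return -errno.ENOENT, None          # group does not exist
    info = groups[name]
    if info is None:
        # Group exists, mirroring disabled: partially formed struct.
        return 0, {"state": "RBD_MIRROR_GROUP_DISABLED"}
    return 0, info
```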

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks ago qa/workunits/rbd: update to mirror group snapshot tests
John Agombar [Mon, 28 Apr 2025 12:29:46 +0000 (13:29 +0100)]
qa/workunits/rbd: update to mirror group snapshot tests

Updated tests:
• test_odf_failover_failback - Update retry_promote scenario to limit retries to
  10000 attempts

Disabled tests:
• test_create_group_with_images_then_mirror_with_regular_snapshots scenario 1
  test sometimes fails as removed regular snapshot remains on secondary cluster
  after new mirror group snapshot is sync complete

Helpers:
- get_pool_image_count() $XMLSTARLET variable shouldn't be used anymore

Signed-off-by: John Agombar <agombar@uk.ibm.com>
8 weeks ago rbd-mirror: increase load_local_group_snapshots() task interval
Ilya Dryomov [Fri, 25 Apr 2025 14:13:04 +0000 (16:13 +0200)]
rbd-mirror: increase load_local_group_snapshots() task interval

It's a gross workaround for a lack of notifications on groups.
Let's at least not run this task every second (for every mirror-enabled
group).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks ago librbd/api: fix group promote completely
Prasanna Kumar Kalever [Thu, 17 Apr 2025 13:23:05 +0000 (18:53 +0530)]
librbd/api: fix group promote completely

* do not group/image demote on undo (promote failure)
* wait for demote snapshot sync to complete locally before accepting and continuing
  to actually perform group promote.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago rbd-mirror: avoid undo on force group promote failure
VinayBhaskar-V [Tue, 15 Apr 2025 11:45:09 +0000 (17:15 +0530)]
rbd-mirror: avoid undo on force group promote failure

also drops redundant util::notify_unquiesce() calls

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago pybind/rbd: add new mirror group state
N Balachandran [Fri, 25 Apr 2025 02:41:02 +0000 (08:11 +0530)]
pybind/rbd: add new mirror group state

Adds the new RBD_MIRROR_GROUP_CREATING state to the python bindings.

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks ago pybind/rbd/mock_rbd.pxi: add missing group related structs and APIs
Ramana Raja [Thu, 24 Apr 2025 15:22:32 +0000 (11:22 -0400)]
pybind/rbd/mock_rbd.pxi: add missing group related structs and APIs

Signed-off-by: Ramana Raja <rraja@redhat.com>
8 weeks ago rbd-mirror: fix remove_mirror_peer_uuid
N Balachandran [Wed, 23 Apr 2025 15:52:34 +0000 (21:22 +0530)]
rbd-mirror: fix remove_mirror_peer_uuid

A race between the remove_mirror_peer_uuid() function in the group Replayer,
which called group_snap_set without checking if there were any
peer_uuids on the remote snap, and a snapshot cleanup on the primary
could lead to a case where the group snapshot is recreated on the
primary without the snap_order key. This causes all mirroring operations
on the group to fail on the primary as ListSnapshotsRequest fails when
it cannot find the snap_order key.
This has been fixed to update the group snapshot only if it contains the
peer_uuid.
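
The guard can be sketched like this. The dict layout and the `write_back`
callable are hypothetical; the point is that nothing is written unless the
peer UUID is actually present.

```python
def unlink_peer(snap, peer_uuid, write_back):
    """Remove peer_uuid from a group snap and persist the change.

    If the snap does not contain the UUID, nothing is written, so a
    snapshot that was cleaned up concurrently is not recreated.
    """
    if peer_uuid not in snap["peer_uuids"]:
        return False                     # nothing to unlink, no write
    snap["peer_uuids"].remove(peer_uuid)
    write_back(snap)
    return True
```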

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks ago qa/workunits/rbd: update to mirror group snapshot tests
John Agombar [Tue, 22 Apr 2025 13:54:44 +0000 (14:54 +0100)]
qa/workunits/rbd: update to mirror group snapshot tests

Updated tests:
• test_force_promote_before_initial_sync - Updated test to automatically calculate size
  of image depending on cluster resources available (faster cluster => faster sync => larger image)

Enabled tests:
- test_enable_mirroring_when_duplicate_group_exists scenarios 1,2 and 4
- test_demote_snap_sync

New tests:
- test_invalid_actions - Disabled as policing is missing from initial release

Signed-off-by: John Agombar <agombar@uk.ibm.com>
8 weeks ago rbd-mirror: group replayer bootstrap changes
N Balachandran [Sat, 19 Apr 2025 11:26:49 +0000 (16:56 +0530)]
rbd-mirror: group replayer bootstrap changes

The group replayer BootstrapRequest has been refactored to make
it easier to handle various scenarios. It now mimics the ImageReplayer
Bootstrap sequence to a great degree.

Credit to Ilya Dryomov <idryomov@gmail.com> for the help with
integrating GroupMirrorStateUpdateRequest into the Replayer.

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks ago rbd-mirror: fix GroupReplayer status updates
N Balachandran [Sat, 19 Apr 2025 03:26:55 +0000 (08:56 +0530)]
rbd-mirror: fix GroupReplayer status updates

The GroupReplayer status update mechanism now behaves like that
of the ImageReplayer. The health state is also returned correctly.

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks ago rbd-mirror: release lock before calling m_async_op_tracker.finish_op()
VinayBhaskar-V [Thu, 17 Apr 2025 14:29:36 +0000 (19:59 +0530)]
rbd-mirror: release lock before calling m_async_op_tracker.finish_op()

fix in the InstanceReplayer::start_group_replayers

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
8 weeks ago rbd-mirror: prevent a non-primary group from being removed
VinayBhaskar-V [Thu, 17 Apr 2025 10:35:19 +0000 (16:05 +0530)]
rbd-mirror: prevent a non-primary group from being removed

also adjust the smoke test to align with the changes.

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks ago pybind/mgr/rbd_support: check group enabled for snap mirroring
Ramana Raja [Tue, 15 Apr 2025 20:26:39 +0000 (16:26 -0400)]
pybind/mgr/rbd_support: check group enabled for snap mirroring

... in mirror group snap scheduler's group_validator.

Make sure that the mirror group is in enabled state before performing
schedule operations on it.

Signed-off-by: Ramana Raja <rraja@redhat.com>
8 weeks ago librbd/api: propagate ENOENT error in Mirror::group_get_info() API
Ramana Raja [Tue, 15 Apr 2025 19:16:34 +0000 (15:16 -0400)]
librbd/api: propagate ENOENT error in Mirror::group_get_info() API

The ENOENT error was ignored by the librbd API that retrieves mirror
group information. This manifests as a bug in the mirror group
snapshot scheduler, where the scheduler utilized this API to prevent
scheduler commands on groups that are not enabled for snapshot-based
mirroring. Since the librbd API masked the ENOENT error, the
scheduler's check for mirror group mode was compromised, resulting
in scheduler commands unexpectedly succeeding on groups that were not
enabled for snapshot-based mirroring. Therefore, allow the ENOENT error
from the API that retrieves mirror group information to surface. This
fixes the spurious mirror group mode check in the mirror snapshot
scheduler. Additionally, add a test to ensure that group snapshot
schedule commands fail on a group not enabled for mirroring.
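
The effect of the masked error on the scheduler's check can be sketched with a
small Python model. This is an illustration only, not the rbd_support or librbd
code: GROUPS, get_group_info() and validate_group() are hypothetical names, and
the model assumes the validator relies solely on the lookup reporting ENOENT.

```python
import errno

# Hypothetical stand-in for mirror group metadata: only groups enabled
# for snapshot-based mirroring have an entry.
GROUPS = {"enabled_group": {"mode": "snapshot", "state": "enabled"}}


def get_group_info(name, mask_enoent):
    """Model of a lookup that either hides or surfaces ENOENT."""
    info = GROUPS.get(name)
    if info is None:
        if mask_enoent:
            # Old behaviour: the missing record is swallowed and an
            # empty result is returned as if the call had succeeded.
            return {}
        raise OSError(errno.ENOENT, "mirror group info not found")
    return info


def validate_group(name, mask_enoent=False):
    """Accept the group unless the lookup reports ENOENT."""
    try:
        get_group_info(name, mask_enoent)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return False
        raise
    return True
```

With mask_enoent=True the validator accepts a group that was never enabled,
which mirrors the unexpected success described above.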

Signed-off-by: Ramana Raja <rraja@redhat.com>
8 weeks agoqa/workunits/rbd: update to rbd_mirror_group_simple
John Agombar [Tue, 15 Apr 2025 15:37:24 +0000 (16:37 +0100)]
qa/workunits/rbd: update to rbd_mirror_group_simple

Added -p option to rbd_mirror_group_simple.sh. This can be used to print the enabled tests
and scenarios (for use by Jenkins) when producing a list of tests to run

Changed order of tests so that long running tests are run later

Updated tests:
• test_group_rename - workaround for intermittent failure

New tests:
• test_invalid_actions - test is currently disabled as it is not complete

Signed-off-by: John Agombar <agombar@uk.ibm.com>
8 weeks agorbd-mirror: fix crash in group Replayer
N Balachandran [Tue, 15 Apr 2025 11:34:41 +0000 (17:04 +0530)]
rbd-mirror: fix crash in group Replayer

The rbd-mirror daemon crashed if the Replayer was destroyed while
an on-going remote group snap listing operation was in progress. The
callback would attempt to access members of the Replayer instance which
no longer existed.

Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
8 weeks agoqa/workunits/rbd: update to mirror group snapshot tests
John Agombar [Thu, 10 Apr 2025 18:29:48 +0000 (19:29 +0100)]
qa/workunits/rbd: update to mirror group snapshot tests

Added -d <filename> option to rbd_mirror_group_simple.sh.  This can be used to
save the stdout and stderr from commands run during a test+scenario into $TEMPDIR/filename.
After a successful test completion the contents of this file are deleted prior to running
the next test to prevent the file from becoming too large.

Updated tests:
• various - Added step after a resync request to wait for the group id to change before
  continuing.  This ensures that the group delete/recreate step has been completed and
  prevents later commands from failing with group doesn't exist type errors.
• test_enable_mirroring_when_duplicate_group_exists - Added checks of the "state" and
  "description" fields for the peer site in the group status output.  Test is disabled
  as it currently fails
• test_enable_mirroring_when_duplicate_image_exists_scenarios - simplified test to only
  have a single duplicate named image.  Test fails still and is disabled.
• test_remote_namespace - added steps to take new snapshot on primary after failover
  and check that this syncs successfully.
• test_group_and_standalone_images_do_io - merged 2 scenarios to remove duplication

New tests:
Updated tests:
• test_demote_snap_sync - Checks that a demote snap is correctly synced to the secondary
  after the daemon is restarted

Signed-off-by: John Agombar <agombar@uk.ibm.com>
8 weeks agolibrbd: don't list snapshots in Group::snap_get_mirror_namespace()
VinayBhaskar-V [Fri, 11 Apr 2025 08:46:44 +0000 (10:46 +0200)]
librbd: don't list snapshots in Group::snap_get_mirror_namespace()

It's redundant -- GroupSnapshot struct can be fetched based on the
passed group snap ID directly.

Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
8 weeks agorbd: don't fail "rbd group snap ls" if namespace details aren't available
Ilya Dryomov [Thu, 10 Apr 2025 19:50:41 +0000 (21:50 +0200)]
rbd: don't fail "rbd group snap ls" if namespace details aren't available

Treat namespace details as an optional set of details, the same as "rbd
snap ls --all" command does.  This avoids sporadic failures with ENOENT
when snapshot listing races snapshot removal which "rbd group snap ls"
command is actually more vulnerable to than "rbd snap ls --all" because
namespace details for group snapshots aren't cached internally.

Credit to N Balachandran <nibalach@redhat.com> for root causing this.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks agorbd-mirror: don't reset m_on_start_finish in GroupReplayer::restart()
Ilya Dryomov [Thu, 10 Apr 2025 11:44:45 +0000 (13:44 +0200)]
rbd-mirror: don't reset m_on_start_finish in GroupReplayer::restart()

If restart() is called while a previous start() is still in progress,
resetting m_on_start_finish before calling stop() from inside restart()
causes on_finish that was passed to start() to be leaked.  One of the
ways this manifests in is a hang in InstanceReplayer::stop() on any
kind of shutdown -- due to InstanceReplayer::start_group_replayer()
keeping track of its start() invocations through C_TrackedOp

  group_replayer->start(new C_TrackedOp(m_async_op_tracker, nullptr),
                        false, false);

and flushing m_async_op_tracker in InstanceReplayer::stop().
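
The leak can be modelled in a few lines of Python. This is a simplified sketch
and not the rbd-mirror implementation: AsyncOpTracker and GroupReplayerModel
are hypothetical stand-ins, and the model assumes that stop() is what completes
a callback left over from an in-flight start().

```python
class AsyncOpTracker:
    """Counts operations that have started but not yet finished."""

    def __init__(self):
        self.pending = 0

    def start_op(self):
        self.pending += 1

    def finish_op(self):
        self.pending -= 1


class GroupReplayerModel:
    def __init__(self, reset_in_restart):
        self.reset_in_restart = reset_in_restart
        self.on_start_finish = None

    def start(self, on_finish):
        # The start is still in flight, so the callback is only stored.
        self.on_start_finish = on_finish

    def stop(self):
        # Stopping completes any callback left over from start().
        callback, self.on_start_finish = self.on_start_finish, None
        if callback is not None:
            callback()

    def restart(self):
        if self.reset_in_restart:
            # Old behaviour: the stored callback is dropped here, so
            # stop() below never gets a chance to complete it.
            self.on_start_finish = None
        self.stop()


def pending_ops_after_restart(reset_in_restart):
    """Return how many tracked ops remain after start() then restart()."""
    tracker = AsyncOpTracker()
    replayer = GroupReplayerModel(reset_in_restart)
    tracker.start_op()
    replayer.start(tracker.finish_op)
    replayer.restart()
    return tracker.pending
```

In this model a non-zero result corresponds to a tracker flush that would never
return, i.e. the hang on shutdown.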

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks agorbd-mirror: fix in shutdown sequence in InstanceReplayer
VinayBhaskar-V [Wed, 9 Apr 2025 05:45:33 +0000 (11:15 +0530)]
rbd-mirror: fix in shutdown sequence in InstanceReplayer

Co-authored-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
8 weeks agorbd_mirror: add fields to mirror group status's description str
Ramana Raja [Thu, 27 Mar 2025 17:09:47 +0000 (13:09 -0400)]
rbd_mirror: add fields to mirror group status's description str

Add the following fields to the mirror group status's description string:
- last_snapshot_bytes
- last_snapshot_complete_seconds
- local_snapshot_timestamp
- remote_snapshot_timestamp

Signed-off-by: Ramana Raja <rraja@redhat.com>
8 weeks agoqa/workunits/rbd: update to mirror group snapshot tests
John Agombar [Thu, 20 Mar 2025 20:48:57 +0000 (20:48 +0000)]
qa/workunits/rbd: update to mirror group snapshot tests

Added new environment variable RBD_MIRROR_GLOBAL_DEBUG to allow additional debug to be turned on
after a cluster has been created.

Updated tests:
• test_group_rename - disabled some checks that are failing as the fix is deferred
• test_empty_group and test_empty_groups - added some snapshot, demote, promote steps

New tests:
• test_remote_namespace - Added new scenarios that test default and non-default
  namespaces on the local and remote clusters.

Enabled tests:
- test_force_promote
- test_group_rename

Signed-off-by: John Agombar <agombar@uk.ibm.com>
8 weeks agolibrbd/api: cleanup prepare_group_images to dup group_ioctx
Prasanna Kumar Kalever [Thu, 3 Apr 2025 11:49:08 +0000 (17:19 +0530)]
librbd/api: cleanup prepare_group_images to dup group_ioctx

.. dup group_ioctx instead of changing to default namespace and
resetting back to original

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd_mirror: fix cross namespace group snapshot mirroring
Prasanna Kumar Kalever [Thu, 3 Apr 2025 10:47:39 +0000 (16:17 +0530)]
rbd_mirror: fix cross namespace group snapshot mirroring

If a group belongs to a non-default namespace, the mirror daemon on the
secondary sets an empty mirror peer uuid on the non-primary demote snapshot.
When the previous secondary then becomes primary (through an explicit promote
request), that non-primary demote snapshot is unlinked as part of the promote
request because no mirror peer uuid is set on it. Because the previous
dependent snapshot is removed, split-brain errors occur, which led to the
recent test failures.

This fix makes sure the right mirror peer uuid is set on the non-primary demote
snapshot even if the group belongs to a non-default namespace.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agoqa/workunits/rbd: stick to RBD_MIRROR_MODE for mirror image mode
Ilya Dryomov [Sun, 30 Mar 2025 09:09:33 +0000 (11:09 +0200)]
qa/workunits/rbd: stick to RBD_MIRROR_MODE for mirror image mode

MIRROR_IMAGE_MODE was dropped in favor of RBD_MIRROR_MODE in commit
9b773eec4a8c ("qa/suites/rbd: Cleanup of MIRROR_IMAGE_MODE").

The check in mirror_group_snapshot_and_wait_for_sync_complete() is
redundant because "rbd mirror group snapshot" command would fail anyway
and mirror_group_internal() isn't used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks agoqa/workunits/rbd: use xmlstarlet directly in rbd_mirror_helpers.sh
Ilya Dryomov [Sun, 30 Mar 2025 09:09:33 +0000 (11:09 +0200)]
qa/workunits/rbd: use xmlstarlet directly in rbd_mirror_helpers.sh

Commit e09f04669053 ("qa/workunits/rbd: mirror group functional tests")
mistakenly resurrected a redundant variable which was dropped in commit
4f309603caa3 ("qa: drop XMLSTARLET variable, use xmlstarlet directly").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks agoqa/workunits/rbd: stop_mirror() should send a given signal just once
Ilya Dryomov [Sun, 30 Mar 2025 09:09:33 +0000 (11:09 +0200)]
qa/workunits/rbd: stop_mirror() should send a given signal just once

Commit e09f04669053 ("qa/workunits/rbd: mirror group functional tests")
moved the kill invocation in stop_mirror() from outside of the wait loop
to inside.  This is wrong because the signal should be sent just once:
rbd-mirror daemon installs a oneshot handler for SIGINT and SIGTERM and
a subsequent signal immediately kills the process.
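
The intended shape of the loop (signal once, then only poll) can be sketched in
Python. This is an illustration of the pattern rather than the shell helper
itself: stop_process() and FakeDaemon are hypothetical names, and the signal
delivery and liveness check are passed in as callables so the sketch stays
self-contained.

```python
import time


def stop_process(send_signal, is_running, timeout=10.0, interval=0.1):
    """Send the signal exactly once, then wait for the process to exit."""
    send_signal()  # outside the loop: a repeat could kill the process
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False  # still running; the caller decides what to do
        time.sleep(interval)
    return True


class FakeDaemon:
    """Test double that exits a few polls after the first signal."""

    def __init__(self, polls_to_exit=3):
        self.signals = 0
        self.polls_left = polls_to_exit

    def signal(self):
        self.signals += 1

    def running(self):
        if self.signals == 0:
            return True
        self.polls_left -= 1
        return self.polls_left > 0
```

Calling stop_process(d.signal, d.running) on a FakeDaemon d leaves d.signals at
one regardless of how many polls the shutdown takes.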

The same commit also erroneously changed RBD_MIRROR_INSTANCES default
for all rbd_mirror_helpers.sh users instead of just group tests.

This fixes sporadic failures in "TEST: no blocklists".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks agorbd-mirror: fix issues around health state
Prasanna Kumar Kalever [Tue, 1 Apr 2025 13:17:21 +0000 (18:47 +0530)]
rbd-mirror: fix issues around health state

* move m_status_state & m_state_desc under lock
* fix unit test build issue

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: implement group replayer Health State
Prasanna Kumar Kalever [Fri, 28 Mar 2025 16:24:44 +0000 (21:54 +0530)]
rbd-mirror: implement group replayer Health State

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: catch and bubble-up all the image level errors to group status
Prasanna Kumar Kalever [Fri, 28 Mar 2025 14:25:54 +0000 (19:55 +0530)]
rbd-mirror: catch and bubble-up all the image level errors to group status

Wait for the group to reach the expected status before checking for image-level
errors.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agolibrbd: get_group_snap_get_mirror_namespace() API + groups in "rbd mirror pool status"
VinayBhaskar-V [Thu, 27 Feb 2025 22:08:55 +0000 (03:38 +0530)]
librbd: get_group_snap_get_mirror_namespace() API + groups in "rbd mirror pool status"

Co-authored-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: VinayBhaskar-V <vvarada@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 weeks agorbd_mirror: avoid passing empty remote_mirror_uuid to group_replayer
Prasanna Kumar Kalever [Thu, 27 Mar 2025 07:25:33 +0000 (12:55 +0530)]
rbd_mirror: avoid passing empty remote_mirror_uuid to group_replayer

group_replayer can fetch remote_mirror_uuid as remote_pool_meta.mirror_uuid

>>> gc = rbd.Group(ioctx, 'test_group')
>>> print(gc.group_snap_get_mirror_namespace('104b430672cf'))
{'state': 2, 'mirror_peer_uuids': [],
'primary_mirror_uuid': '6cd393ad-c21d-42e6-a404-0dabf596bfe7',
'primary_snap_id': '104b430672cf'}

Thanks to Ilya for highlighting the issue.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agolibrbd/api: don't mask images in group with read-only as part of image_demote()
Prasanna Kumar Kalever [Wed, 26 Mar 2025 13:40:33 +0000 (19:10 +0530)]
librbd/api: don't mask images in group with read-only as part of image_demote()

If the images are part of a group, wait until group_demote() has finished
with GroupUnlinkPeerRequest() and only then mask the images in the group with
IMAGE_READ_ONLY_FLAG_NON_PRIMARY.

Thanks to Nithya for working along for a better fix here.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agocleanup: uncommented assert m_state_builder
Prasanna Kumar Kalever [Wed, 26 Mar 2025 07:35:39 +0000 (13:05 +0530)]
cleanup: uncommented assert m_state_builder

Thanks to Nithya for highlighting it.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agolibrbd/api: add defence in group enable to bail about different pool images
Prasanna Kumar Kalever [Tue, 25 Mar 2025 19:39:52 +0000 (01:09 +0530)]
librbd/api: add defence in group enable to bail about different pool images

If a group contains images from different pools do not allow enabling
mirroring on it.
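
A minimal sketch of such a guard, in Python. The function name, the pool id
arguments and the use of ValueError are assumptions made for illustration; the
sketch also assumes that "different pools" means any image living outside the
group's own pool.

```python
def check_group_images_same_pool(group_pool_id, image_pool_ids):
    """Raise if any image in the group lives outside the group's pool."""
    foreign = sorted({p for p in image_pool_ids if p != group_pool_id})
    if foreign:
        raise ValueError(
            "cannot enable mirroring: images found in other pools %s" % foreign
        )
```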

Thanks to Ilya for all the suggestions and review.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd: fix text in mirror group help messages
John Agombar [Mon, 30 Sep 2024 12:28:24 +0000 (13:28 +0100)]
rbd: fix text in mirror group help messages

Signed-off-by: John Agombar <agombar@uk.ibm.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agolibrbd/api: finalize the API's about skip-quiesce and ignore-quiesce-error flags
Prasanna Kumar Kalever [Mon, 24 Mar 2025 10:03:31 +0000 (15:33 +0530)]
librbd/api: finalize the API's about skip-quiesce and ignore-quiesce-error flags

* leave --skip-quiesce and --ignore-quiesce-error options only on
  rbd mirror group snapshot command
* drop flags argument from mirror_group_enable(), mirror_group_promote() and
  mirror_group_demote() APIs, it will remain only on
  mirror_group_create_snapshot() and aio_mirror_group_create_snapshot() APIs
* mirror_group_promote() and mirror_group_demote() should behave as if
  RBD_SNAP_CREATE_SKIP_QUIESCE flag was passed
* mirror_group_enable() should use get_default_snap_create_flags() to get flags
  -- it will be governed by rbd_default_snapshot_quiesce_mode config option
* make each of the mentioned APIs explicitly do either
  a) snap_create_flags_api_to_internal(<flags passed by the user>, &snap_create_flags),
  b) snap_create_flags_api_to_internal(get_default_snap_create_flags(), &snap_create_flags)

Credits to Ilya Dryomov <idryomov@gmail.com> for the above finalisation.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agocleanup: bootstrap no more need prepare_non_primary_mirror_snap_name
Prasanna Kumar Kalever [Mon, 24 Mar 2025 05:19:10 +0000 (10:49 +0530)]
cleanup: bootstrap no more need prepare_non_primary_mirror_snap_name

We removed the need to create non-primary group snapshots on the
secondary as part of bootstrap, so this function is no longer needed.

Thanks to Nithya for highlighting.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: don't call group_snap_set for every image snap for regular group snap
Prasanna Kumar Kalever [Fri, 21 Mar 2025 17:37:54 +0000 (23:07 +0530)]
rbd-mirror: don't call group_snap_set for every image snap for regular group snap

We already avoid calling group_snap_set() for each image snapshot update
for mirror group snapshots, but it still happens for regular group
snapshots. This commit fixes that.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: call set_image_replayer_limits() only if group_snap_set() successful
Prasanna Kumar Kalever [Fri, 21 Mar 2025 18:35:16 +0000 (00:05 +0530)]
rbd-mirror: call set_image_replayer_limits() only if group_snap_set() successful

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: address group_snap_set() failures
Prasanna Kumar Kalever [Fri, 21 Mar 2025 14:00:09 +0000 (19:30 +0530)]
rbd-mirror: address group_snap_set() failures

There are 5 places where we call group_snap_set()

1. create_mirror_snapshot()
2. create_regular_snapshot()
3. mirror_snapshot_complete()
4. regular_snapshot_complete() and
5. remove_mirror_peer_uuid()

For cases 1 & 2, we can simply delete the group snapshot created so far,
which is basically an empty INCOMPLETE group snapshot, in the respective
callback handler and let the state machine recreate it later.

For cases 3 & 4, we cannot delete the created snapshot, because the
image snapshots would have synced or be syncing by now, and deleting the
group snapshot would bring additional complications (if there is a failover
at the same time). Hence the m_retry_validate_snap flag is set in this case;
this allows the rescan even for regular group snapshots, and if the
snapshot is still INCOMPLETE on disk, validate_image_snaps_sync_complete()
will be called again.

For case 5, added logic to retry remove_mirror_peer_uuid() again.
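
The per-caller handling described above can be summarised as a lookup table.
This is a Python sketch for orientation only: the Recovery enum and
on_group_snap_set_failure() are hypothetical names, while the keys are the
five call sites listed in this message.

```python
from enum import Enum


class Recovery(Enum):
    DELETE_INCOMPLETE_SNAPSHOT = "delete the empty INCOMPLETE group snapshot"
    RETRY_VALIDATION = "set m_retry_validate_snap and rescan"
    RETRY_REMOVE_PEER_UUID = "retry remove_mirror_peer_uuid()"


# Recovery chosen when group_snap_set() fails, keyed by the call site.
RECOVERY_BY_CALLER = {
    "create_mirror_snapshot": Recovery.DELETE_INCOMPLETE_SNAPSHOT,
    "create_regular_snapshot": Recovery.DELETE_INCOMPLETE_SNAPSHOT,
    "mirror_snapshot_complete": Recovery.RETRY_VALIDATION,
    "regular_snapshot_complete": Recovery.RETRY_VALIDATION,
    "remove_mirror_peer_uuid": Recovery.RETRY_REMOVE_PEER_UUID,
}


def on_group_snap_set_failure(caller):
    """Return the recovery action for a failed group_snap_set() call."""
    return RECOVERY_BY_CALLER[caller]
```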

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: fix m_stop_requested leading to a race
Prasanna Kumar Kalever [Fri, 21 Mar 2025 08:59:45 +0000 (14:29 +0530)]
rbd-mirror: fix m_stop_requested leading to a race

* if m_stop_requested is set, is_replay_interrupted() returns true.
* shut_down() should set m_stop_requested to false; it was instead setting
  it to true, which could lead to a race and a possible crash when accessing
  the GroupReplayer (GR) between shut_down() and notify_group_listener_stop()

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agocleanup: avoid passing last_local_snap_id to unlink_group_snapshots
Prasanna Kumar Kalever [Fri, 21 Mar 2025 08:45:05 +0000 (14:15 +0530)]
cleanup: avoid passing last_local_snap_id to unlink_group_snapshots

last_local_snap_id can be fetched from m_local_group_snaps.rbegin()

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agocleanup: avoid use of remote_group_snap id and name variable names for readability
Prasanna Kumar Kalever [Fri, 21 Mar 2025 08:27:00 +0000 (13:57 +0530)]
cleanup: avoid use of remote_group_snap id and name variable names for readability

remote_group_snap_id is used interchangeably with local_group_snap_id
this patch cleans it up.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agoqa/workunits/rbd: fix and enable test_force_promote_before_initial_sync
Prasanna Kumar Kalever [Thu, 20 Mar 2025 20:54:49 +0000 (02:24 +0530)]
qa/workunits/rbd: fix and enable test_force_promote_before_initial_sync

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agolibrbd/api: fail group promote when there is no previous snapshot
Prasanna Kumar Kalever [Thu, 20 Mar 2025 17:36:29 +0000 (23:06 +0530)]
librbd/api: fail group promote when there is no previous snapshot

If the initial snapshot created when the group was enabled did not sync to the
secondary and is still incomplete, and a force promote then happens on the
secondary, there is no previous snapshot for that force promote to roll back to.

In this situation the force promote should fail.
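
The rule can be sketched as a search for a rollback target. This Python model
is an assumption-laden illustration, not the librbd code: PromoteError,
pick_rollback_snapshot() and the (snap_id, complete) tuples are hypothetical.

```python
class PromoteError(Exception):
    """Raised when a force promote has nothing to roll back to."""


def pick_rollback_snapshot(snapshots):
    """Return the newest complete snapshot id.

    `snapshots` is a list of (snap_id, complete) tuples, oldest first.
    """
    for snap_id, complete in reversed(snapshots):
        if complete:
            return snap_id
    # Only incomplete snapshots (or none at all) exist on the secondary.
    raise PromoteError("no complete snapshot to roll back to")
```

With only the incomplete initial snapshot present, the search finds nothing and
the promote fails instead of proceeding without a rollback point.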

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agoqa/workunits/rbd: update to mirror group snapshot tests
John Agombar [Thu, 20 Mar 2025 20:48:57 +0000 (20:48 +0000)]
qa/workunits/rbd: update to mirror group snapshot tests

Updated status() helper function to dump contents of stderr and stdout for last command
New helper functions to check for image snaps existence
Added new environment variable RBD_MIRROR_HIDE_BASH_DEBUGGING to turn off set -x output.
Previously RBD_MIRROR_SHOW_CLI_CMD was being used for this and controlling the display of cli output.

New tests:
• test_group_rename - test that a group rename is only mirrored to the remote after a mirror group
  snapshot command.  Also test that a group rename is not inadvertently mirrored or undone
  (test commented out as it is failing)
• test_enable_mirroring_when_duplicate_group_exists - various scenarios that check an empty group
  and approaches to fixing the duplicate names on either site.
  (test is commented out as it is not yet finished)
• test_enable_mirroring_when_duplicate_group_and_images_exists - builds on the previous test
  but has duplicate named images too (test is commented out as it is failing)
• test_image_snapshots_with_group - test regular image snapshots along with mirror group snapshots

Enabled tests:
- test_force_promote scenarios 1,2,3 and 5 pass

Signed-off-by: John Agombar <agombar@uk.ibm.com>
8 weeks agorbd-mirror: avoid bootstrap creating a local non_primary group snapshot
Prasanna Kumar Kalever [Thu, 20 Mar 2025 07:00:23 +0000 (12:30 +0530)]
rbd-mirror: avoid bootstrap creating a local non_primary group snapshot

Previously we were bound to this, but with the current design it is no longer needed.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: look for mismatch in name only on secondary cluster
Prasanna Kumar Kalever [Wed, 19 Mar 2025 18:17:10 +0000 (23:47 +0530)]
rbd-mirror: look for mismatch in name only on secondary cluster

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agoqa/workunits/rbd: enable and fix test_resync_marker test
Prasanna Kumar Kalever [Thu, 20 Mar 2025 06:02:13 +0000 (11:32 +0530)]
qa/workunits/rbd: enable and fix test_resync_marker test

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agorbd-mirror: group-replayer check for remote demote state
Prasanna Kumar Kalever [Thu, 20 Mar 2025 05:44:48 +0000 (11:14 +0530)]
rbd-mirror: group-replayer check for remote demote state

There are 3 possible situations for resync flagging and the rbd-mirror
daemon acting on it:

1. No demotion on primary while/just before resync is played
    there is no demote snap alongside the resync, so we can cancel syncing
    other snaps and start the resync as soon as it is flagged, because there
    is no point syncing snaps when we are going to delete the whole group
    and resync afresh anyway.

2. First demote + immediately resync
    the demote came first, which means that before proceeding with the resync
    we should always check that the last remote snap is PRIMARY (validate that
    the remote is still primary) and only then proceed.

3. First resync + immediately demote
    the resync came first, so we head straight to resync.
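
The three situations reduce to one check made at the moment the resync flag is
seen. The following Python sketch is a simplified reading of the list above,
with hypothetical names (should_start_resync, PRIMARY, DEMOTED); it is not the
group replayer's actual state machine.

```python
PRIMARY = "primary"
DEMOTED = "demoted"


def should_start_resync(resync_flagged, last_remote_snap_state):
    """Decide whether to begin a resync right now.

    The resync proceeds only while the last remote snapshot still shows
    the remote as primary; a demote that landed first blocks it.
    """
    if not resync_flagged:
        return False
    return last_remote_snap_state == PRIMARY
```

Situations 1 and 3 both see a PRIMARY remote when the flag is noticed and go
ahead; situation 2 sees the demote first and does not.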

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agolibrbd/api: fix rollback failures to mirror group snapshots
Prasanna Kumar Kalever [Tue, 18 Mar 2025 16:50:17 +0000 (22:20 +0530)]
librbd/api: fix rollback failures to mirror group snapshots

group_snap_rollback_by_record() is made flexible enough to handle mirror snapshots
alongside the existing user snapshots, with minor changes.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
8 weeks agoqa/workunits/rbd: adjust grep invocation in is_leader()
Ilya Dryomov [Sun, 16 Mar 2025 16:33:58 +0000 (17:33 +0100)]
qa/workunits/rbd: adjust grep invocation in is_leader()

admin_daemon() no longer populates stdout and stderr, all output is
redirected inside of run_cmd_internal().

This unbreaks "TEST: release leader and wait it is reacquired".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>