librbd: unlink newest mirror snapshot when at capacity, bump capacity

CreatePrimaryRequest::unlink_peer() invoked via "rbd mirror image
snapshot" command or via rbd_support mgr module when creating a new
scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots
capacity on the primary cluster can race with Replayer::unlink_peer()
invoked by rbd-mirror when finishing syncing an older snapshot on the
secondary cluster. Consider the following:

[ primary: primary-snap1, primary-snap2, primary-snap3
  secondary: non-primary-snap1 (complete), non-primary-snap2 (syncing) ]

0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks
   primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks
   non-primary-snap2 complete

[ snap1 (the old base) is no longer needed on either cluster ]

4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3
   delta

[ primary: primary-snap2, primary-snap3, primary-snap4
  secondary: non-primary-snap2 (complete), non-primary-snap3 (syncing) ]

8. rbd_support unlinks and removes primary-snap3 which is in-use by
   rbd-mirror

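The collision is easiest to see by replaying the interleaving above
against a toy model of the primary snapshot queue (a minimal C++
sketch, not librbd code -- the container, slot numbers and variable
names are purely illustrative):

    // Steps 0-8 above: the old "third oldest" pick made in step 2 is
    // exactly the snapshot the replayer is using by step 8.
    #include <cassert>
    #include <cstdio>
    #include <deque>

    int main() {
      std::deque<int> primary{1, 2, 3};  // primary-snap1..primary-snap3
      int base = 1, target = 2;          // 0. syncing snap1..snap2 delta

      primary.push_back(4);              // 1. rbd_support creates primary-snap4
      int victim = primary[2];           // 2. old policy: third oldest,
                                         //    i.e. primary-snap3

      primary.pop_front();               // 4. primary-snap1 unlinked/removed
      base = 2;                          // 6. snap2 becomes the new base
      target = 3;                        // 7. syncing snap2..snap3 delta

      assert(victim == target);          // 8. primary-snap3 is removed while
                                         //    it is the replayer's sync target
      std::printf("snap%d removed while syncing snap%d..snap%d\n",
                  victim, base, target);
    }
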
If snap trimming on the primary cluster kicks in soon enough, the
secondary image becomes corrupted: rbd-mirror would eventually finish
"syncing" non-primary-snap3 and mark it complete in spite of bogus data
in the HEAD -- the primary cluster OSDs would start returning ENOENT
for snap-trimmed objects. Luckily, rbd-mirror's attempt to pick snap3
as the new base would wedge the replayer with a "split-brain detected:
failed to find matching non-primary snapshot in remote image" error.

Before commit a888bff8d00e ("librbd/mirror: tweak which snapshot is
unlinked when at capacity") this could happen pretty much all the
time, as it was the second oldest snapshot that was unlinked. That
commit changed it to the third oldest snapshot, turning this into
a narrower race that is nevertheless still quite possible to hit.

Unfortunately, this race condition appears to be inherent to the way
snapshot-based mirroring is currently implemented:

a. when mirror snapshots are created on the producer side of the
   snapshot queue, they are already linked
b. mirror snapshots can be concurrently unlinked/removed on both
   sides of the snapshot queue by non-cooperating clients (local
   rbd_mirror_image_create_snapshot() vs remote rbd-mirror)
c. with mirror peer links off the table due to (a), there is no
   existing way for rbd-mirror to persistently mark a snapshot as
   in-use (see the sketch below)

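To make (c) concrete, here is a purely hypothetical sketch of the
missing primitive -- no such in_use flag exists in the mirror snapshot
metadata today, and the struct and function names below are invented
for illustration only:

    #include <deque>
    #include <optional>

    struct MirrorSnap {
      int id;
      bool linked = true;    // (a): snapshots are born linked, so link
                             // state cannot double as a claim
      bool in_use = false;   // hypothetical persistent claim that
                             // rbd-mirror cannot express today
    };

    // If such a flag existed, the primary-side unlink could simply
    // skip claimed snapshots instead of racing with the replayer.
    std::optional<int> pick_unlink_victim(const std::deque<MirrorSnap>& q) {
      for (const auto& snap : q) {
        if (snap.linked && !snap.in_use) {
          return snap.id;
        }
      }
      return std::nullopt;  // everything claimed -- defer unlinking
    }
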
As a workaround, bump rbd_mirroring_max_mirroring_snapshots to 5 and
always unlink the newest snapshot (i.e. slot 4) instead of the third
oldest snapshot (i.e. slot 2). Hopefully this gives enough leeway,
as rbd-mirror would need to sync two snapshots (i.e. transition from
syncing 0-1 to 1-2 and then to 2-3) before potentially colliding with
CreatePrimaryRequest::unlink_peer() at slot 4.

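In code terms, the policy change amounts to something like the
following (a minimal sketch, assuming the mirror snapshots are held
oldest-first in a simple sequence -- the names and the container are
illustrative, not the actual CreatePrimaryRequest implementation):

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // rbd_mirroring_max_mirroring_snapshots, bumped from 3 to 5
    constexpr std::size_t kCapacity = 5;

    // Index of the snapshot to unlink when a new mirror snapshot is
    // about to be created at capacity, oldest-first ordering.
    std::size_t unlink_slot(const std::vector<int>& snaps) {
      assert(snaps.size() == kCapacity);
      // old policy: slot 2 (third oldest) -- racy, see above
      // new policy: slot 4 (newest), far from the replayer's base
      return snaps.size() - 1;
    }

The newest slot is the safest pick because the replayer consumes the
queue from the oldest end and only reaches slot 4 after completing the
earlier deltas.
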
Fixes: https://tracker.ceph.com/issues/55803
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ef83c0f347c38f3eee70d554d5433a1a17b209b6)