Ilya Dryomov [Sun, 29 May 2022 16:20:34 +0000 (18:20 +0200)]
librbd: unlink newest mirror snapshot when at capacity, bump capacity
CreatePrimaryRequest::unlink_peer(), invoked via the "rbd mirror image
snapshot" command or via the rbd_support mgr module when creating a new
scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots
capacity on the primary cluster, can race with Replayer::unlink_peer()
invoked by rbd-mirror when it finishes syncing an older snapshot on the
secondary cluster. Consider the following:
0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks
primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks
non-primary-snap2 complete
[ snap1 (the old base) is no longer needed on either cluster ]
4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3
delta
8. rbd_support unlinks and removes primary-snap3 which is in-use by
rbd-mirror
If snap trimming on the primary cluster kicks in soon enough, the
secondary image becomes corrupted: rbd-mirror would eventually finish
"syncing" non-primary-snap3 and mark it complete in spite of bogus data
in the HEAD -- the primary cluster OSDs would start returning ENOENT
for snap-trimmed objects. Luckily, rbd-mirror's attempt to pick snap3
as the new base would wedge the replayer with a "split-brain detected:
failed to find matching non-primary snapshot in remote image" error.
Before commit a888bff8d00e ("librbd/mirror: tweak which snapshot is
unlinked when at capacity") this could happen pretty much all the time,
as it was the second oldest snapshot that was unlinked. That commit
changed it to be the third oldest snapshot, turning this into a
narrower but still very much hittable race.
Unfortunately this race condition appears to be inherent to the way
snapshot-based mirroring is currently implemented:
a. when mirror snapshots are created on the producer side of the
snapshot queue, they are already linked
b. mirror snapshots can be concurrently unlinked/removed on both
sides of the snapshot queue by non-cooperating clients (local
rbd_mirror_image_create_snapshot() vs remote rbd-mirror)
c. with mirror peer links off the list due to (a), there is no
existing way for rbd-mirror to persistently mark a snapshot as
in-use
As a workaround, bump rbd_mirroring_max_mirroring_snapshots to 5 and
always unlink the newest snapshot (i.e. slot 4) instead of the third
oldest snapshot (i.e. slot 2). Hopefully this gives enough leeway,
as rbd-mirror would need to sync two snapshots (i.e. transition from
syncing 0-1 to 1-2 and then to 2-3) before potentially colliding with
rbd_mirror_image_create_snapshot() on slot 4.
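To make the new selection policy concrete, here is a minimal sketch
(illustrative Python, not the actual CreatePrimaryRequest::unlink_peer()
code):

    # Hedged sketch of the unlink policy described above.
    # mirror_snap_ids holds this peer's mirror snapshot ids, oldest first.
    MAX_MIRRORING_SNAPSHOTS = 5   # bumped from 3 by this change

    def pick_snap_to_unlink(mirror_snap_ids):
        if len(mirror_snap_ids) < MAX_MIRRORING_SNAPSHOTS:
            return None                # under capacity, nothing to unlink
        # old behaviour: mirror_snap_ids[2] (third oldest, slot 2)
        return mirror_snap_ids[-1]     # new behaviour: the newest (slot 4)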
Ilya Dryomov [Sun, 29 May 2022 17:55:04 +0000 (19:55 +0200)]
test/librbd: fix set_val() call in SuccessUnlink* test cases
rbd_mirroring_max_mirroring_snapshots isn't actually set to 3 there
due to the stray conf_ prefix. It didn't matter until now because the
default was also 3.
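The same option-name rule can be seen from the Python rados binding
(an illustrative sketch, not the C++ test code fixed here):

    import rados

    cluster = rados.Rados(conffile='')   # assumes a reachable cluster config
    # correct: the option name carries no "conf_" prefix
    cluster.conf_set("rbd_mirroring_max_mirroring_snapshots", "3")
    # buggy form: "conf_rbd_mirroring_max_mirroring_snapshots" names an
    # unknown option, so the intended override never takes effect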
Xuehan Xu [Fri, 20 May 2022 09:23:03 +0000 (17:23 +0800)]
crimson/os/seastore/segment_cleaner: add dedicated backref trimming process
Space reclamation needs to merge backrefs up to the point where the latest
release of extents within the scope of the reclamation process happened.
When the journal size is large, that merge may generate a transaction
record whose size exceeds the max record size threshold. So we need to
have a dedicated backref trimming process that merges most of the
backrefs before space reclamation happens.
This commit also fixes https://tracker.ceph.com/issues/55692 by
retrying the in-flight backref trimming transaction when it is
invalidated by other transactions on the ROOT block.
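Very roughly, the trimming process amounts to the following (an
illustrative Python sketch with assumed names and threshold, not the
seastore implementation):

    MAX_RECORD_BYTES = 4 << 20   # assumed max record size threshold

    def trim_backrefs(pending_backrefs, encoded_size):
        """Merge backref entries in bounded batches so the merge done at
        reclaim time never emits an oversized transaction record."""
        batch, batch_bytes = [], 0
        for entry in pending_backrefs:
            size = encoded_size(entry)
            if batch and batch_bytes + size > MAX_RECORD_BYTES:
                yield batch            # submit one bounded merge transaction
                batch, batch_bytes = [], 0
            batch.append(entry)
            batch_bytes += size
        if batch:
            yield batch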
Xiubo Li [Thu, 31 Mar 2022 07:16:49 +0000 (15:16 +0800)]
client: stop retrying the request when exceeding 256 times
The type of 'retry_attempt' in 'MetaRequest' is 'int', while in
'ceph_mds_request_head' the type of 'num_retry' is '__u8'. So if a
request is retried more than 256 times, the MDS will receive an
incorrect retry seq.
In this case it's usually a bug in the MDS, and continuing to retry the
request makes no sense. For now let's limit it to 256. In the future
this could be fixed properly in the ceph code, making the hardcoded
limit here unnecessary.
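For illustration, here is the wraparound in miniature (a hedged Python
sketch, not the actual client code):

    # An int retry counter copied into an unsigned 8-bit wire field silently
    # wraps around once it exceeds 255, so the MDS sees a bogus retry seq.
    retry_attempt = 256                # MetaRequest-side counter, a plain int
    num_retry = retry_attempt & 0xff   # what fits into the __u8 on the wire
    print(num_retry)                   # prints 0, not 256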
Fixes: https://tracker.ceph.com/issues/55144
Signed-off-by: Xiubo Li <xiubli@redhat.com>
as per https://www.json.org/json-en.html, JSON encodes bool as the
bare tokens true or false, without quotes. before this change, the
quotes were always added when encoding boolean values.
but this change is not backward compatible.
encode_json()'s bool overload is used by rgw. it uses JSONObj
defined in common/ceph_json.h to decode JSON-encoded structs,
and it does not differentiate bool from str when decoding a boolean
value, even though it could have checked the "quoted" member variable
of JSONObj to validate the type of the value. so we should be fine.
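for comparison, the standard behaviour is easy to see with Python's
json module (illustration only; the change itself is in encode_json()'s
C++ bool overload):

    import json
    print(json.dumps({"enabled": True}))    # {"enabled": true}   -- bare token
    print(json.dumps({"enabled": "true"}))  # {"enabled": "true"} -- old, quoted form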
but gcc-toolset-8-annobin provides this file. upgrading to
gcc-toolset-11 does not help. see https://centos.pkgs.org/8-stream/centos-appstream-x86_64/gcc-toolset-11-annobin-plugin-gcc-10.23-1.el8.x86_64.rpm.html
so, the intermediate solution would be to disable the plugin, if
we want to use gcc-toolset to build rpm packages.
in this change, _annotated_build is undefined to prevent the compiler
from adding this extra information to the binary. in general this
change should be safe; without this information it is merely harder to
tell whether the binary is hardened or what ABI version it expects. see
also https://fedoraproject.org/wiki/Changes/Annobin
Rishabh Dave [Thu, 19 May 2022 18:29:25 +0000 (23:59 +0530)]
qa/cephfs: remove temporary files
These temporary files don't matter for test execution with teuthology
but they do matter for execution with vstart_runner.py since the test
fails if these files exist already. And tests are often run repeatedly
with vstart_runner.py, unlike with teuthology.
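As an illustration, the pattern is simply to clean up whatever scratch
files a test writes (a hedged sketch; the path and helper names are
assumptions, not the actual tests touched by this commit):

    def test_example(self):
        scratch = "/tmp/cephfs_test_scratch"   # hypothetical path
        try:
            self.mount_a.run_shell(["touch", scratch])
            # ... exercise the behaviour under test ...
        finally:
            # without this, a re-run under vstart_runner.py fails because
            # the file is still present from the previous run
            self.mount_a.run_shell(["rm", "-f", scratch])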
Fixes: https://tracker.ceph.com/issues/55719
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Laura Flores [Mon, 16 May 2022 22:59:42 +0000 (17:59 -0500)]
qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default value of 1 gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, whenever “debug_random” happens to pick an op queue other than mclock, the low default `osd max backfills` value applies, causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
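For reference, the kind of override the other thrashers already carry
looks like this (expressed as a Python/teuthology-style dict purely for
illustration; the actual change edits the mapgap and pggrow .yaml files,
and the value 3 is an assumption based on comparable thrashers):

    override = {
        "overrides": {
            "ceph": {
                "conf": {
                    "osd": {
                        "osd max backfills": 3,   # assumed value > 1
                    },
                },
            },
        },
    }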
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
Adam King [Fri, 1 Apr 2022 12:20:28 +0000 (08:20 -0400)]
mgr/cephadm: make UpgradeState from_json a bit safer
This way, for downgrades to whatever version this lands in (and any
later version), having added new parameters to UpgradeState shouldn't
break anything. We can't do much about downgrades to versions older
than this one, but this should help in the future.
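A minimal sketch of the defensive pattern (the field names below are
illustrative, not the real UpgradeState attributes):

    import inspect

    class UpgradeState:
        def __init__(self, target_name: str = '', progress_id: str = '',
                     paused: bool = False):
            self.target_name = target_name
            self.progress_id = progress_id
            self.paused = paused

        @classmethod
        def from_json(cls, data: dict) -> 'UpgradeState':
            # drop keys this (possibly older) version does not know about
            # and rely on defaults for anything missing, instead of cls(**data)
            params = set(inspect.signature(cls.__init__).parameters) - {'self'}
            return cls(**{k: v for k, v in data.items() if k in params})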
Adam King [Mon, 28 Mar 2022 16:10:15 +0000 (12:10 -0400)]
mgr/cephadm: split _do_upgrade into sub functions
This function was around 500 lines and difficult to work
with. Splitting it into sub functions should hopefully make
it a bit easier to understand and make changes to.
Rishabh Dave [Thu, 19 May 2022 15:33:54 +0000 (21:03 +0530)]
cephfs-shell: check version before importing Cmd2ArgparseError
Cmd2ArgparseError is available only from cmd2 version 1.0.1 onwards;
before that, SystemExit(2) is raised instead. This commit creates an
empty Cmd2ArgparseError class for earlier versions so that a similar
error won't crop up again.
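The shape of the shim is roughly the following (a hedged sketch; the
exact import path and version-comparison helper are assumptions):

    from pkg_resources import parse_version
    import cmd2

    if parse_version(cmd2.__version__) >= parse_version("1.0.1"):
        from cmd2.exceptions import Cmd2ArgparseError
    else:
        class Cmd2ArgparseError(Exception):
            """Empty stand-in so 'except Cmd2ArgparseError' keeps working on
            older cmd2, where argparse failures raise SystemExit(2) instead."""
            pass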
Fixes: https://tracker.ceph.com/issues/55716
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Soumya Koduri [Fri, 6 May 2022 17:10:12 +0000 (22:40 +0530)]
rgw/qa: Run tests on multiple cloudtier config
Run cloudtier tests with the parameter 'retain_head_object'
set to both true and false.
However, having multiple cloudtier storage classes in the same task
increases the transition time and results in spurious failures.
Hence, until there is a consistent way of running the tests without
having to depend on lc_debug_interval, one of the configs is disabled
for now.