In preparation for multiple similarly configured MockTestImageCtx
objects being used in a single test, centralize their creation and add
a couple of helpers for setting expectations from a callback.
mon/Monitor: during shutdown don't accept new authentication and create new sessions
During shutdown, the monitor is designed not to accept new authentication requests
or create new sessions. However, a problem arises when the monitor marks its status
as "shutdown" but still accepts new authentication requests and creates new sessions.
This issue causes the monitor to fail when checking the session list.
To fix this problem, an update is implemented. With this fix,
a monitor in the "shutdown" state will correctly reject new authentication requests
and prevent the creation of new sessions.
This ensures that the monitor operates as intended during the shutdown process.
NitzanMordhai [Thu, 16 Nov 2023 07:09:29 +0000 (07:09 +0000)]
Tools/rados: Improve Error Messaging for Object Name Resolution
The current implementation of 'rados clearomap' exhibits a behavior where
an error message is generated without the associated object name or,
in the case of a non-existent object name, may result in a segmentation fault.
The proposed fix addresses this issue by enhancing the error message.
After applying the fix, error messages will consistently display the correct
object name, providing users with more accurate and actionable information.
Casey Bodley [Mon, 8 Jan 2024 16:24:18 +0000 (08:24 -0800)]
make-dist: don't use --continue option for wget
the boost jfrog mirror is broken and returns an HTML error page instead
of the archive. the file size of this page is 11534 bytes
when download_from() retries the download from download.ceph.com, the -c
option tells it to resume the download of the existing file. the
resulting boost_1_82_0.tar.bz2 ends up with the correct total file size
of 121325129 bytes, but the first 11534 bytes still correspond to the
HTML from jfrog. that causes the sha256sum mismatch
remove the -c option so that wget fetches the archive in its entirety
Ilya Dryomov [Wed, 22 Nov 2023 13:39:13 +0000 (14:39 +0100)]
librados: make querying pools for selfmanaged snaps reliable
If get_pool_is_selfmanaged_snaps_mode() is invoked on a fresh RADOS
client instance that still lacks an osdmap, it returns false, same as
for "this pool is not in selfmanaged snaps mode". The same happens if
the pool in question doesn't exist since the signature doesn't allow to
return an error.
The motivation for this API was to prevent users from running "rados
cppool" on a pool with unmanaged snapshots and deleting the original
thinking that they have a full copy. Unfortunately, it's exactly
"rados cppool" that fell into this trap, so no warning is printed and
--yes-i-really-mean-it flag isn't enforced.
Ilya Dryomov [Sun, 10 Dec 2023 16:01:24 +0000 (17:01 +0100)]
test/pybind/rbd: don't ignore from_snapshot in check_diff()
Despite the test in test_diff_iterate() being correct, it started
failing:
> check_diff(self.image, 0, IMG_SIZE, 'snap1', [(0, 512, False)])
...
a = [], b = [(0, 512, False)]
...
> assert a == b
E AssertionError
This is because check_diff() drops 'snap1' argument on the floor and
passes None to image.diff_iterate() instead. This goes back to 2013,
see commit e88fe3cbbc8f ("rbd.py: add some missing functions").
- beginning of time -> HEAD, through intermediate snap
- snap -> snap, directly
- snap -> HEAD, directly
But coverage is too weak: none of the weird OBJECT_PENDING cases and
only a single diff-iterate vs deep-copy case is tested, for example.
Coverage is missing completely for:
- beginning of time -> HEAD, directly
- beginning of time -> snap, directly
- beginning of time -> snap, through intermediate snap
- snap -> snap, through intermediate snap
- snap -> HEAD, through intermediate snap
Ilya Dryomov [Fri, 8 Dec 2023 14:19:02 +0000 (15:19 +0100)]
librbd: OBJECT_PENDING should always be treated as dirty
OBJECT_PENDING is a transition state which normally isn't encountered
in (snapshot) object maps. In case it's encountered, for example when
a snapshot is taken after losing power at the time a discard was being
handled, the object should be treated as dirty and produce a diff as
a result.
Assuming an object is marked OBJECT_PENDING, theoretically there are
four cases with respect to object's state in the next snapshot:
Prior to commit b81cd2460de7 ("librbd/object_map: diff state machine
should track object existence"), (3) was handled incorrectly (diff set
to DIFF_STATE_NONE instead of DIFF_STATE_UPDATED).
Post commit 399a45e11332 ("librbd/object_map: rbd diff between two
snapshots lists entire image content"), (4) is handled incorrectly
(diff set to DIFF_STATE_DATA instead of DIFF_STATE_DATA_UPDATED).
Similar to DiffIterateTest.DiffIterateDeterministic, systematically
cover the most common cases involving full-object discards. With this
in place, issue [1] can be reproduced by any of:
(preparatory) before snap3 is taken
(1) beginning of time -> HEAD
(2) snap1 -> HEAD
(5) beginning of time -> snap3
(6) snap1 -> snap3
Sub-object discards aren't covered here because of further issues
[2][3].
Ilya Dryomov [Fri, 10 Nov 2023 10:14:42 +0000 (11:14 +0100)]
librbd: resurrect "exists" assert in simple_diff_cb()
This effectively reverts commit 3ccc3bb4bd35 ("librbd: diff_iterate
needs to handle holes in parent images") which just dropped the assert
instead of addressing the root cause of reported crashes.
Ilya Dryomov [Thu, 9 Nov 2023 19:44:18 +0000 (20:44 +0100)]
librbd: diff-iterate shouldn't ever report "new hole" against a hole
If an object doesn't exist in both start and end versions but there is
an intermediate snapshot which contains it (i.e. the object is written
to and captured at some point but then discarded prior to or in the end
version), diff-iterate reports "new hole" -- callback is invoked with
exists=false. This occurs both on the slow list_snaps path and in
fast-diff mode.
Despite going all the way back to the introduction of diff-iterate in
commit 0296c7cdae91 ("librbd: implement diff_iterate"), this behavior
is wrong and contradicts diff-iterate API documentation added in commit a69532e86450 ("librbd: document diff_iterate in header") in the same
series:
If the source snapshot name is NULL, we interpret that as
the beginning of time and return all allocated regions of the
image.
It also triggered an assert added in commit c680531e070a ("librbd:
change diff_iterate interface to be more C-friendly") in the same
series. Unfortunately, commit f1f6407221a0 ("test_librbd: add
diff_iterate test including discard"), also part of the same series,
added a test which expected the wrong behavior. Very confusing!
A year later, a different manifestation of this bug was fixed in commit 9a1ab95176fe ("rbd: Fix rbd diff for non-existent objects"), but the
fix only covered the case where calc_snap_set_diff() goes past the end
snap ID while processing clones. The case where it runs out of clones
to process before reaching the end snap ID remained mishandled.
A year after that, commit 3ccc3bb4bd35 ("librbd: diff_iterate needs to
handle holes in parent images") dropped the assert mentioned above and
this bug got enshrined in the newly introduced fast-diff mode.
Finally, a few years later, deep-copy actually started relying on this
bug in commit e5a21e904142 ("librbd: deep-copy image copy state machine
skips clean objects"). This necessitates bifurcation in DiffRequest
because deep-copy wants the "has this object been touched" semantics,
which is different from diff-iterate (and also potentially much more
expensive to produce!).
This commit brings a minimal update to TestMockObjectMapDiffRequest
tests and DiffIterateTest.DiffIterateDiscard. Coverage is expanded in
the following commits.
Xiubo Li [Mon, 27 Nov 2023 07:55:42 +0000 (15:55 +0800)]
mds: set the loner to true for LOCK_EXCL_XSYN
For filelock when in LOCK_EXCL_XSYN state the non-loner clients
should be issued empty caps, but since the loner of this state
is set to false and it could make the Locker to issue the Fcb caps
to them, which is incorrect.
Ilya Dryomov [Fri, 1 Dec 2023 17:29:12 +0000 (18:29 +0100)]
test/librbd: drop DiffIterateTest.DiffIterateRegression6926
This was added to test [1]. It's duplicated by several cases in
DiffIterateTest.DiffIterateDeterministicPP now. Specifically, the
issue could be reproduced by any of:
(8) beginning of time -> snap2
(9) snap1 -> snap2
(10) beginning of time -> snap1
Ilya Dryomov [Fri, 1 Dec 2023 17:54:19 +0000 (18:54 +0100)]
test/librbd: drop TestLibRBD.SnapDiff
This was added to integration test [1], separate from the fix which
went in only with unit test adjustments. It's duplicated by several
cases in DiffIterateTest.DiffIterateDeterministic now. Specifically,
the issue could be reproduced by any of:
(3) snap2 -> HEAD
(4) snap3 -> HEAD
(7) snap2 -> snap3
scribble()-based DiffIterate tests are too weak: at least two
regressions that should been caught by DiffIterate.DiffIterate or
DiffIterate.DiffIterateStress were missed [1][2]. Aside from the
randomness which can be both a good and a bad thing, asserts there
ensure only that the returned diff covers all changes that were made.
If the returned diff is too excessive or otherwise bogus, this isn't
detected [3].
Add a deterministic test to systematically cover the most common cases
that don't involve discards. A similar test for discards will be added
with the fix for [4].
Comment out debug log in vector_iterate_cb() like it's done in
iterate_cb().
Ilya Dryomov [Mon, 27 Nov 2023 10:59:26 +0000 (11:59 +0100)]
librbd: fix read_whole_object handling in ObjectListSnapsRequest
Originally, in commit 2be4840afd4f ("librados/snap_set_diff: don't
assert on empty snapset"), exists was set to true. This didn't make
ObjectListSnapsRequest, causing the following deep-copy tests to fail
when run against calc_snap_set_diff() rigged to return "whole object"
as described in [1]:
This is a regression introduced in commit cc87a8bd697e ("librbd:
deep-copy object utilizes image-extent IO methods") by way of commit 11923e234efc ("librbd: generic object list snapshot request").
Ilya Dryomov [Mon, 27 Nov 2023 09:11:52 +0000 (10:11 +0100)]
librbd: fix LIST_SNAPS_FLAG_WHOLE_OBJECT behavior
Bundling read_whole_object and LIST_SNAPS_FLAG_WHOLE_OBJECT cases
together is wrong:
- In read_whole_object case, calc_snap_set_diff() sets just
read_whole_object. Everything else is zeroed out and may require
resetting to fit with the rest of ObjectListSnapsRequest logic.
- In LIST_SNAPS_FLAG_WHOLE_OBJECT case, only the diff should be
expanded. Everything else is set by calc_snap_set_diff() and should
be used as is. This goes for end_size in particular -- if it's reset
to object size, bogus zero extents may be returned as the object
would appear to have grown.
This is a regression introduced in commit 4429ed4f3f4c ("librbd: switch
diff iterate API to use new snaps list dispatch methods") by way of
commit 66dd53d9c4d9 ("librbd: optionally return full object extent for
any snapshot deltas").
Ilya Dryomov [Sun, 19 Nov 2023 21:44:28 +0000 (22:44 +0100)]
test/librbd: make ListSnapsWholeObject actually test stuff
Despite being added in commit 66dd53d9c4d9 ("librbd: optionally return
full object extent for any snapshot deltas") ostensibly to test the new
LIST_SNAPS_FLAG_WHOLE_OBJECT code, it surely doesn't do that because
the flag isn't even passed to MockObjectListSnapsRequest::create().
I can only guess, but it looks like snap ID 3 was intended to be
a starting point. Otherwise, with 0 and CEPH_NOSNAP passed as snap
IDs, the overlap that is set up for the clone wouldn't affect the
computation in any way.
Use snap ID 3 as a starting point and run both with and without
LIST_SNAPS_FLAG_WHOLE_OBJECT on the same snapset to pinpoint the
difference.
Ilya Dryomov [Sat, 11 Nov 2023 13:15:49 +0000 (14:15 +0100)]
librados/snap_set_diff: set end_size only if end object exists
Since commit 73f50a13109f ("rbd-mirror: use generalized deep copy for
image sync"), the only user of calc_snap_set_diff() immediately unsets
end_size otherwise.
calc_snap_set_diff() semantics are clearer if end_size is set together
with end_exists and clone_end_snap_id.
Ilya Dryomov [Sat, 9 Dec 2023 20:00:42 +0000 (21:00 +0100)]
test/librbd: avoid config-related crashes in DiscardWithPruneWriteOverlap
For reasons that I think no longer apply today, set_val() and
set_val_or_die() refuse to set "type: str" config options that aren't
marked as "can be changed at runtime" -- set_val() returns an error and
set_val_or_die() terminates the process. What is and isn't marked as
"can be changed at runtime" seems to be pretty much random both within
and outside of RBD, so let's just refactor how config is set here.
While at it, I realized that reproducer config is underspecified:
- for rbd_cache_policy and rbd_cache_writethrough_until_flush settings
to matter, rbd_cache must be set to true and rbd_cache_max_dirty must
be set to a positive number
- order should be set explicitly, because rbd_default_order can be as
low as 12 (for 4096-byte objects), interfering with the logic of the
test
Yuri Weinstein [Thu, 7 Dec 2023 16:30:34 +0000 (08:30 -0800)]
Merge pull request #54771 from ajarr/wip-63714-pacific
pacific: qa/workunits/rbd/cli_generic.sh: narrow race window when checking that rbd_support module command fails after blocklisting the module's client
Yuval Lifshitz [Wed, 23 Feb 2022 15:21:10 +0000 (17:21 +0200)]
rgw: prevent spurious/lost notifications in the index completion thread
this was happening when asyn completions happened during reshard.
more information about testing:
https://gist.github.com/yuvalif/d526c0a3a4c5b245b9e951a6c5a10517
we also add more logs to the completion manager.
should allow finding unhandled completions due to reshards.
Joshua Baergen [Thu, 9 Nov 2023 16:43:22 +0000 (09:43 -0700)]
librbd: Append one journal event per image request
In the case where an image request is split across multiple object
extents and journaling is enabled, multiple journal events are appended.
Prior to this change, all object requests would wait for the last
journal event to complete, since journal events complete in order and
thus the last one completing implies that all prior journal events were
safe at that point.
The issue with this is that there's nothing stopping that last journal
event from being cleaned up before all object requests have stopped
referring to it. Thus, it's entirely possible for the following sequence
to occur:
1. An image request gets split into two image extents and two object
requests. Journal events are appended (one per image extent).
2. The first object request gets delayed due to an overlap, but the
second object request gets submitted and starts waiting on the last
journal event (which also causes a C_CommitIOEvent to be instantiated
against that journal event).
3. Journaling completes, and the C_CommitIOEvent fires. The
C_CommitIOEvent covers the entire range of data that was journaled in
this event, and so the event is cleaned up.
4. The first object request from above is allowed to make progress; it
tries to wait for the journal event that was just cleaned up which
causes the assert in wait_event() to fire.
As far as I can tell, this is only possible on the discard path today,
and only recently. Up until 21a26a752843295ff946d1543c2f5f9fac764593
(librbd: Fix local rbd mirror journals growing forever), m_image_extents
always contained a single extent for all I/O types; this commit changed
the discard path so that if discard granularity changed the discard
request, m_image_extents would be repopulated, and if the request
happened to cross objects then there would be multiple m_image_extents.
It appears that the intent here was that there should be one journal
event per image request and the pending_extents kept track of what had
completed thus far. This commit restores that 1:1 relationship.
Just in case a CInode's nlink is 1, and then a unlink request comes
and then early replies and submits to the MDLogs, but just before
the MDlogs are flushed a link request comes, and the link request
also succeeds and early replies to client.
Later when the unlink/link requests' MDLog events are flushed and
the callbacks are called, which will fire a stray denty reintegration.
But it will pick the new dentry, which is from the link's request
and is a remote dentry, to do the reintegration. While in the
'rename' code when traversing the path it will trigger to call the
'dn->link_remote()', which later will fire a new stray dentry
reintegration.
The problem is if the first 'rename' request is retried several
times, and in each time it will fire a new reintegration, which
makes no sense and maybe blocked for a very long time dues to some
reasons and then will be reported as slow request warning.
... when checking whether a rbd_support module command fails after
blocklisting the module's client.
In tests that check the recovery of the rbd_support module after its
client is blocklisted, the rbd_support module's client is
blocklisted using the `osd blocklist add` command. Next,
`osd blocklist ls` command is issued to confirm that the client is
blocklisted. A rbd_support module command is then issued and expected
to fail in order to verify that the blocklisting has affected the
rbd_support module's operations. Sometimes it was observed that before
this rbd_support module command reached the ceph-mgr, the rbd_support
module detected the blocklisting, recovered from it, and was able to
serve the command. To reduce the race window that occurs when trying to
verify that the rbd_support module's operation is affected by client
blocklisting, get rid of the `osd blocklist ls` command.
Kim Minjong [Fri, 3 Feb 2023 02:47:47 +0000 (11:47 +0900)]
ceph-volume: fix a bug in _check_generic_reject_reasons
The types of removable and ro are wrong. Here, both filters are not
working at all. Changed this from integer to string and corrected the test
data.
Delete redundant logic in get_block_devs_sysfs. Given the name of the
function, I think it is correct to judge from _check_generic_reject_reasons,
and in fact it was before v17.2.4.
Fixes: https://tracker.ceph.com/issues/58591 Signed-off-by: Kim Minjong <make.dirty.code@gmail.com>
(cherry picked from commit a78e660728c6c0442cdbfa65db776b5856aee933)
This functions works for what it is supposed to do:
check if a device is busy.
That being said, this induces a race condition in `get_devices()`
Indeed, it does:
1/ `os.open()` with `(os.O_RDWR | os.O_EXCL)`
2/ `os.close()`
The second call has an effect: it triggers a udev event which causes
systemd-udevd to re-process the device. This seems to be a question of
millisecond but because of this, /sys (sysfs) isn't fully populated as
expected. Given that get_devices() collects a lot of details from sysfs
in a loop, some of these details can be missed.
ceph-volume overall doesn't make decisions based on `is_locked_raw_device()`
This detail is used only for reporting (inventory).
For this reason, dropping this function seems reasonnable.
As a compromise, we can check if the device has partitions and/or a FileSystem
on it.
As storage capacities grow, multi-actuator technology introduced by Seagate addresses the downward pressure on performance that comes with growing drive capacities and areal densities. Having multiple actuators enables drives to maintain the performance needs of customers with data-intensive applications. However, this innovation requires storage stack changes because a single hard drive is now represented by two or more independent actuators (independent_access_ranges) that transfer data concurrently and are represented by a single LBA address space.
This code addition to the `ceph-volume` command assists Ceph administrators in identifying drives with multiple actuators by utilizing the Linux kernel's multi-actuator support introduced in version 5.16. Dual-actuator hard drives are gaining market share and becoming more important to Ceph deployments on large discs.
The code has been tested with Seagate Osprey Exos 2X18 with Mach.2 technology.
$ sudo ceph-volume inventory /dev/sdd
====== Device report /dev/sdd ======
path /dev/sdd
ceph device None
lsm data {}
available False
rejected reasons Has GPT headers
device id ST18000NM0092-3CX103_MVV00H3J
removable 0
ro 0
vendor ATA
model ST18000NM0092-3C
sas address
rotational 1
actuators 2
scheduler mode mq-deadline
human readable size 16.37 TB
Copyright (c) 2022 Seagate Technology LLC and/or its Affiliates
mon: fix iterator mishandling in PGMap::apply_incremental
Fixes: https://tracker.ceph.com/issues/58303 Signed-off-by: Oliver Schmidt <os@flyingcircus.io>, Christian Theune <ct@flyingcircus.io>
(cherry picked from commit 6ba497f2e7a68517fc3a47b6cfd79a19724f8d39)