Laura Flores [Fri, 26 Jan 2024 17:32:43 +0000 (17:32 +0000)]
osd: clear out unneeded pending pg-upmap-primary mappings
If the score did not improve, we should clear out any
pending pg-upmap-primary mappings so they don't execute
in situations where the same incremental is used to balance
multiple pools (i.e. in the balancer mgr module).
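A minimal sketch of the idea, written in Python rather than the OSD's C++. The names `Incremental` and `balance_pool` are hypothetical, not Ceph APIs, and "lower score is better" is an assumption made for illustration. It only shows why mappings staged for a pool whose score did not improve have to be dropped before the same incremental is reused for the next pool.

```python
from dataclasses import dataclass, field


@dataclass
class Incremental:
    # pgid -> OSD proposed as the new primary (hypothetical stand-in
    # for the pending pg-upmap-primary changes of an incremental)
    new_pg_upmap_primary: dict = field(default_factory=dict)


def balance_pool(inc, proposed, score_before, score_after):
    """Stage the proposed mappings; keep them only if the score improved.

    Assumption for this sketch: a lower score is better.
    """
    inc.new_pg_upmap_primary.update(proposed)
    if score_after >= score_before:
        # No improvement: clear what this pool staged so it cannot be
        # executed later together with another pool's changes.
        for pgid in proposed:
            inc.new_pg_upmap_primary.pop(pgid, None)
        return False
    return True


# One incremental shared across two pools, as a balancer loop might do.
inc = Incremental()
kept_a = balance_pool(inc, {"1.0": 3, "1.1": 4}, score_before=1.5, score_after=1.5)
kept_b = balance_pool(inc, {"2.0": 1}, score_before=2.0, score_after=1.2)
```

Without the clearing step, the two mappings staged for the first pool would still be in `inc` when the second pool's changes are applied.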
Laura Flores [Tue, 2 Jan 2024 21:28:03 +0000 (21:28 +0000)]
mgr/balancer: add pg_upmap_primaries to `balancer status detail`
Followup to https://github.com/ceph/ceph/pull/54801/commits/8a5553597ca6a428cb8ffc9fc5bebde048fbd068.
Streamlines some of the logic so pg upmap activity is properly
initialized and updated in offline mode as well as online.
Laura Flores [Thu, 18 Jan 2024 18:57:24 +0000 (18:57 +0000)]
mgr: add read balancer support inside the balancer module
Read balancing may now be managed automatically via the balancer
manager module. Users may choose between two new modes: ``upmap-read``, which
offers upmap and read optimization simultaneously, or ``read``, which may be used
to only optimize reads. Existing balancer commands have also been updated to
include more information about read balancing.
Run the following commands to test the new automatic behavior:
`ceph balancer on` (on by default)
`ceph balancer mode <read|upmap-read>`
`ceph balancer status`
Run the following commands to test the new supervised behavior:
`ceph balancer off`
`ceph balancer mode <read|upmap-read>`
`ceph balancer eval` | `ceph balancer eval <pool-name>`
`ceph balancer eval-verbose` | `ceph balancer eval-verbose <pool-name>`
`ceph balancer optimize <plan-name>`
`ceph balancer show <plan-name>`
`ceph balancer eval <plan-name>`
`ceph balancer execute <plan-name>`
In the balancer module, there is also a new "self_test" function which tests
the module's basic functionality. This test can be triggered with the following
commands:
`ceph mgr module enable selftest`
`ceph mgr self-test module balancer`
Related Trello: https://trello.com/c/sWoKctzL/859-add-read-balancer-support-inside-the-balancer-module
Signed-off-by: Laura Flores <lflores@ibm.com>
Ramana Raja [Wed, 17 Jan 2024 18:24:36 +0000 (13:24 -0500)]
rbd-nbd: map using netlink interface by default
Mapping rbd images to nbd devices using the ioctl interface is not
robust. It was discovered that the device size or the md5 checksum
of the nbd device was incorrect immediately after mapping using
the ioctl method. When using the nbd netlink interface to map RBD images,
the issue was not encountered. Switch to using the nbd netlink interface
for mapping.
Fixes: https://tracker.ceph.com/issues/64063
Signed-off-by: Ramana Raja <rraja@redhat.com>
Just because this is what Ceph's config uses and it saves a narrowing
conversion. If we want to set a max value on the thread count, we
should do it in config.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Ramana Raja [Tue, 23 Jan 2024 21:07:04 +0000 (16:07 -0500)]
rbd-nbd: log errors during netlink_resize() using derr
When using rbd CLI to map the images to NBD devices via netlink,
any errors that arose during image resizing in netlink_resize()
were not logged. Switching the error logging from using cerr to
derr helps log the errors from netlink_resize().
Ramana Raja [Mon, 22 Jan 2024 22:06:58 +0000 (17:06 -0500)]
rbd_nbd: fix resize of images mapped using netlink
Include device identifier or cookie in the message sent to the kernel
to resize images mapped to NBD devices using netlink. Otherwise,
netlink_resize() fails and the size of the device isn't updated.
Fixes: https://tracker.ceph.com/issues/64139
Signed-off-by: Ramana Raja <rraja@redhat.com>
While submitting the log line asynchronously is reasonable,
with this implementation the EntryVector &q parameter does
not necessarily outlive the submission continuation.
Ronen Friedman [Wed, 17 Jan 2024 15:36:16 +0000 (09:36 -0600)]
osd/scrub: check reservation replies for relevance
Compare a token (nonce) carried in the reservation reply with the remembered
token of the reservation request. If they don't match, the reply is
stale and should be ignored (and logged).
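The check described above can be sketched as follows. This is an illustration in Python, not the OSD code; `ReservationTracker`, its methods, and the use of a simple counter as the nonce source are all assumptions made for the example.

```python
import itertools
import logging

log = logging.getLogger("scrub-sketch")


class ReservationTracker:
    """Hypothetical primary-side bookkeeping for replica reservations."""

    def __init__(self):
        self._nonces = itertools.count(1)
        self._pending = {}  # replica id -> nonce of the outstanding request

    def send_request(self, replica):
        # Remember the token that the matching reply must carry.
        nonce = next(self._nonces)
        self._pending[replica] = nonce
        return nonce

    def handle_reply(self, replica, nonce):
        # A reply whose token differs from the remembered one belongs to
        # an earlier request: log it and ignore it.
        if self._pending.get(replica) != nonce:
            log.warning("ignoring stale reservation reply from %s (nonce %s)",
                        replica, nonce)
            return False
        del self._pending[replica]
        return True


tracker = ReservationTracker()
old = tracker.send_request("osd.1")   # first attempt, later abandoned
new = tracker.send_request("osd.1")   # retry replaces the remembered token
stale_accepted = tracker.handle_reply("osd.1", old)
fresh_accepted = tracker.handle_reply("osd.1", new)
```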
Ronen Friedman [Sun, 31 Dec 2023 16:18:09 +0000 (10:18 -0600)]
osd/scrub: introduce a 'not before' attribute for scrub jobs
The 'not before' (NB) attribute enables the OSD to delay the next attempt to schedule a specific
scrub job. This is useful for jobs that have failed for whatever
reason, especially if the primary has failed to acquire the replicas.
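A rough Python sketch of a 'not before' gate, under the assumption that it is a per-job timestamp consulted when picking jobs to schedule. `ScrubJob`, `delay_after_failure` and `pick_ready` are hypothetical names, not the OSD's scrub scheduler API.

```python
from dataclasses import dataclass


@dataclass
class ScrubJob:
    pgid: str
    not_before: float = 0.0  # earliest time the job may be tried again

    def delay_after_failure(self, now, delay):
        # Push the next attempt into the future; never move it earlier.
        self.not_before = max(self.not_before, now + delay)

    def is_eligible(self, now):
        return now >= self.not_before


def pick_ready(jobs, now):
    """Return the pgids of jobs whose 'not before' time has passed."""
    return [job.pgid for job in jobs if job.is_eligible(now)]


jobs = [ScrubJob("1.a"), ScrubJob("1.b")]
# 1.a failed to acquire its replicas at t=100; retry no sooner than t=130.
jobs[0].delay_after_failure(now=100.0, delay=30.0)
ready_now = pick_ready(jobs, now=110.0)
ready_later = pick_ready(jobs, now=130.0)
```

A single timestamp per job gives the same "do not retry too quickly" effect without keeping failed jobs in a separate queue.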
Ronen Friedman [Sat, 30 Dec 2023 12:36:26 +0000 (06:36 -0600)]
osd/scrub: remove the 'penalized jobs' queue
The 'penalized jobs' queue was used to track scrub jobs that had failed
to acquire their replicas, and to prevent those jobs from being retried
too quickly. This functionality will be replaced by a
simple 'not before' delay (see the next commits).
Yingxin Cheng [Mon, 11 Dec 2023 06:38:51 +0000 (14:38 +0800)]
crimson/osd: drop a foreign-copy to shard-0 for every pg operation
Use ConnectionRef before pg submission and switch to
ConnectionXcoreRef after that.
The intent is to drop the foreign copy of the connection to shard 0 at
pg submission time. This should remove two pairs of crosscore
communications in shard 0 for each I/O, one for connection-ref foreign
copy, another for connection-ref destruction.
Ilya Dryomov [Sat, 6 Jan 2024 16:08:04 +0000 (17:08 +0100)]
librbd: try to preserve object map for diff-iterate in fast-diff mode
As an optimization, try to ensure that the object map for the end
version is preloaded through the acquisition of exclusive lock and
as a consequence remains around until exclusive lock is released.
If it's not around, DiffRequest would (re)load it on each call.
Ilya Dryomov [Sat, 6 Jan 2024 16:05:39 +0000 (17:05 +0100)]
librbd/object_map: potentially use in-memory object map in DiffRequest
If the object map for the end version is around (already loaded in
memory, either due to the end version being a snapshot or due to
exclusive lock being held), use it to run diff-iterate against the
beginning of time. Since it's the only object map needed in that
case, such calls would be satisfied locally.
Ilya Dryomov [Fri, 5 Jan 2024 12:15:54 +0000 (13:15 +0100)]
librbd/object_map: decouple object map processing in DiffRequest
In preparation for potentially using in-memory object map, decouple
object map processing from loading object maps and place the logic in
prepare_for_object_map() and process_object_map().
Ilya Dryomov [Fri, 5 Jan 2024 11:23:24 +0000 (12:23 +0100)]
common/bit_vector: fix iterator vs reference constness confusion
T (ConstIterator or Iterator) is confused with const T here:
IteratorImpl dereference operator is wrongly overloaded on const
and returns Reference instead of ConstReference for ConstIterator.
This then fails inside bufferlist bowels because Reference is
incompatible with bufferlist::const_iterator.
Ilya Dryomov [Thu, 4 Jan 2024 10:39:20 +0000 (11:39 +0100)]
librbd/object_map: don't resize object map in handle_load_object_map()
Currently it's done in two cases:
- if the loaded object map is larger than expected based on byte size,
it's truncated to expected number of objects
- in case of deep-copy, if the loaded object map is smaller than diff
state, it's expanded to get "track the largest of all versions in the
set" semantics
Both of these cases can be easily dealt with without modifying the
object map. Being able to process a const object map is needed for
working on in-memory object map which is external to DiffRequest.
It's totally broken: instead of returning the current position and
moving to the next position, it returns the next position and doesn't
move anywhere. Luckily it hasn't been used until now.
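The description above reads like post-increment semantics gone wrong. As an illustration only, here is the contrast in Python with a hypothetical `Cursor` class; the affected code is not shown in this log, so nothing below is taken from it.

```python
class Cursor:
    """Hypothetical position holder used to contrast the two behaviors."""

    def __init__(self, pos=0):
        self.pos = pos

    def advance_broken(self):
        # Returns the NEXT position and leaves this cursor where it was.
        return Cursor(self.pos + 1)

    def advance_fixed(self):
        # Returns the CURRENT position, then moves to the next one.
        current = Cursor(self.pos)
        self.pos += 1
        return current


broken = Cursor(5)
broken_result = broken.advance_broken().pos

fixed = Cursor(5)
fixed_result = fixed.advance_fixed().pos
```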
Ilya Dryomov [Thu, 28 Dec 2023 09:14:18 +0000 (10:14 +0100)]
librbd: propagate diff-iterate range to parent in fast-diff mode
When getting parent diff, pass the overlap-reduced image extent instead
of the entire 0..overlap range to avoid a similar quadratic slowdown on
cloned images.
Ilya Dryomov [Wed, 27 Dec 2023 17:07:05 +0000 (18:07 +0100)]
librbd/object_map: add support for ranged diff-iterate
Currently diff-iterate in fast-diff mode is performed on the entire
image no matter what image extent is passed to the API. Then, unused
diff just gets discarded as DiffIterate ends up querying only objects
that the passed image extent maps to. This hasn't been an issue for
internal consumers ("rbd du", "rbd diff", etc) because they work on the
entire image, but turns out to lead to quadratic slowdown in some QEMU
use cases.
0..UINT64_MAX range is carved out for deep-copy which is unranged by
definition. To get effectively unranged diff-iterate, 0..UINT64_MAX-1
range can be used.
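To make the ranged behavior concrete, the sketch below maps an image extent to a half-open range of object numbers and shows how the reserved 0..UINT64_MAX range differs from 0..UINT64_MAX-1. It assumes the range is expressed in object numbers and that objects have a fixed size; the function names are hypothetical and do not mirror librbd.

```python
UINT64_MAX = 2**64 - 1


def extent_to_object_range(offset, length, object_size):
    """Map an image extent to a half-open [start, end) object range."""
    if length == 0:
        return (0, 0)
    start = offset // object_size
    end = (offset + length - 1) // object_size + 1
    return (start, end)


def is_deep_copy_range(start, end):
    # 0..UINT64_MAX is reserved for deep-copy, which is always unranged.
    return start == 0 and end == UINT64_MAX


def clamp_range(start, end, num_objects):
    # A range that reaches past the image is cut down to its object count.
    return (min(start, num_objects), min(end, num_objects))


MiB = 1024 * 1024
# A 4 MiB extent starting one byte into object 2 touches objects 2 and 3.
touched = extent_to_object_range(8 * MiB + 1, 4 * MiB, 4 * MiB)
# 0..UINT64_MAX-1 is an ordinary range that happens to cover every object.
whole_image = clamp_range(0, UINT64_MAX - 1, num_objects=10)
```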
Ilya Dryomov [Sat, 23 Dec 2023 14:19:09 +0000 (15:19 +0100)]
test/librbd: expand TestMockObjectMapDiffRequest edge case coverage
For each covered edge case or error, run through the following
scenarios:
- where the edge case concerns snap_id_start
- where the edge case concerns snap_id_end
- where the edge case concerns intermediate snapshot and
snap_id_start == 0 (diff against the beginning of time)
- where the edge case concerns intermediate snapshot and
snap_id_start != 0 (diff from snapshot)
Ilya Dryomov [Sat, 23 Dec 2023 13:47:54 +0000 (14:47 +0100)]
librbd/object_map: allow intermediate snaps to be skipped on diff-iterate
In case of diff-iterate against the beginning of time, the result
depends only on the end version. Loading and processing object maps
of intermediate snapshots is redundant and can be skipped.
This optimization is made possible by commit be507aaed15f ("librbd:
diff-iterate shouldn't ever report "new hole" against a hole") and, to
a lesser extent, the previous commit.
Getting FastDiffInvalid, LoadObjectMapError and ObjectMapTooSmall to
pass required tweaking not just expectations, but also start/end snap
ids and thus also the meaning of these tests. This is addressed in the
next commit.
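The optimization in this commit can be pictured with a small Python sketch. `versions_to_load` is a hypothetical helper that takes plain integer snapshot ids; it is not how DiffRequest is structured, only an illustration of which versions need their object maps loaded.

```python
def versions_to_load(all_snap_ids, snap_id_start, snap_id_end):
    """Return the versions whose object maps a diff would need."""
    if snap_id_start == 0:
        # Diff against the beginning of time: only the end version matters.
        return [snap_id_end]
    # Diff from a snapshot: every version in between still contributes.
    middle = sorted(s for s in all_snap_ids if snap_id_start < s < snap_id_end)
    return [snap_id_start, *middle, snap_id_end]


from_beginning = versions_to_load([3, 5, 8], snap_id_start=0, snap_id_end=8)
from_snapshot = versions_to_load([3, 5, 8], snap_id_start=3, snap_id_end=8)
```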
Ilya Dryomov [Fri, 22 Dec 2023 17:50:20 +0000 (18:50 +0100)]
librbd/object_map: resurrect diff-iterate behavior when image is shrunk
The new "track the largest of all versions in the set, diff state is
only ever grown" semantics introduced in commit 330f2a7bb94f ("librbd:
helper state machine for computing diffs between object-maps") don't
make sense for diff-iterate. It's a waste because DiffIterate won't
query beyond the end version size -- this is baked into the API.
Limit this behavior to deep-copy and resurrect the original behavior
from 2015 for diff-iterate.