Casey Bodley [Tue, 11 Mar 2025 16:07:22 +0000 (12:07 -0400)]
cls/rgw: non-versioned listings skip past version suffix
when skipping a versioned entry for a non-versioned listing, we must
advance the marker or risk infinite loops. in particular, plain entries
converted by convert_plain_entry_to_versioned() sort at the end of an
object's versions, but have an empty version id whose retry would start
back at the beginning of the object's versions
Casey Bodley [Tue, 11 Mar 2025 16:51:02 +0000 (12:51 -0400)]
rgw/rados: fix list_objects_ordered() detection of "forward progress"
for multiple versions of the same object name, ListObjectVersions is
supposed to return versions "in the order that they were stored,
returning the most recently stored object first"
this sort order is preserved by the bucket index in cls_rgw, so
list_objects_ordered() should not expect the version ids to be sorted
lexicographically. replace the not-less-than comparison with equality
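
A minimal sketch of the "forward progress" check described above, not the actual list_objects_ordered() code. The names list_all, make_pager and fetch_page are hypothetical, and markers are modelled as (name, version_id) pairs. It shows why an unchanged marker, rather than a not-less-than comparison, is the safe signal that a listing is stuck:

```python
def make_pager(index, page_size):
    """Hypothetical pager over an index already in storage order."""
    def fetch_page(marker):
        start = index.index(marker) + 1 if marker in index else 0
        page = index[start:start + page_size]
        truncated = start + page_size < len(index)
        next_marker = page[-1] if page else marker
        return page, next_marker, truncated
    return fetch_page


def list_all(fetch_page, start_marker=("", "")):
    """Drain a paginated listing, failing if the marker stops advancing."""
    marker = start_marker
    results = []
    while True:
        entries, next_marker, truncated = fetch_page(marker)
        results.extend(entries)
        if not truncated:
            return results
        # Versions of one name come back newest-first, not sorted by id,
        # so next_marker may compare lower than marker after real progress.
        # Only an unchanged marker means the listing is stuck.
        if next_marker == marker:
            raise RuntimeError("listing made no forward progress")
        marker = next_marker


# Version ids deliberately not in lexicographic order.
index = [("a", "v9"), ("a", "v3"), ("a", "v1"), ("b", "v2")]
listed = list_all(make_pager(index, 1))
```

With a page size of 1, the marker moves from ("a", "v9") to ("a", "v3"), which compares lower; a not-less-than check would wrongly treat that step as a lack of progress.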
Zac Dover [Mon, 24 Mar 2025 12:26:11 +0000 (22:26 +1000)]
src/common: add guidance for deep-scrubbing ratio warning
Add an explanation of how to set the value of
"mon_warn_pg_not_deep_scrubbed_ratio" to the confval definition of that
variable. Although this variable contains the string "mon", it is set on
the Manager. I have added a note to direct users to set this value on
the Manager.
This issue was pointed out by Petr Tlapa on Slack in late March of 2025.
Nitzan Mordechai [Thu, 20 Feb 2025 07:37:45 +0000 (07:37 +0000)]
LogMonitor: set no_reply for forward MLog commands
On stretch mode clusters we can see slow ops when
removing and adding OSDs with --zap --force while the OSDs
are connected to a peon monitor that forwards the MLog to the leader.
Currently no_reply is set only when we are connected to the leader;
this fix also covers the forwarded case, so no_reply is always set.
When extending the log, the sequence was left in a bad state: a transaction was first created to update the log with the current seq number, but the "real" transaction was left with the same sequence number, when it should be `extend_log_transaction.seq + 1`.
This commit fixes documentation about many-to-many topic relationship for notifications. The current sentence states the same fact twice instead of clarifying.
John Mulligan [Tue, 18 Mar 2025 19:56:25 +0000 (15:56 -0400)]
reef: mgr/diskprediction_local: avoid more mypy errors
Similar to c4111033172db28c4737e8438f27901811919ce4 this patch
suppresses mypy errors in the diskprediction_local mgr module.
I probably put the magic comment on more lines than needed but
mypy does not have a block-comment method to suppress checking
for just a region of code today.
This patch is not a backport as the issue is only impacting
reef CI jobs and so it is applied directly to the reef branch.
Signed-off-by: John Mulligan <phlogistonjohn@asynchrono.us>
Samuel Just [Thu, 13 Feb 2025 04:16:47 +0000 (04:16 +0000)]
dmclock/.../dmclock_server: do not clean clients with requests
PriorityQueueBase::do_clean() shouldn't remove ClientRec instances which
still have queued requests. Otherwise, very low priority clients might
end up having requests actually lost, which shouldn't be possible.
In the OSD, this resulted in PGRecovery items being lost if queued with
background_best_effort while expanding a cluster. Such items can
legitimately sit in the queue for a long period of time as they
represent background data migration which is allowed to be starved by an
aggressive client workload. Dropping the items broke an assumption in
the OSD that all items enqueued would eventually be dequeued resulting
in resources being leaked.
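
A simplified sketch of the rule above, in Python rather than the dmclock C++ sources. The field names last_tick and requests and the erase_age parameter are assumptions made for illustration. An idle client record is only erased when its request queue is empty:

```python
from collections import deque


class ClientRec:
    """Hypothetical per-client record: last activity plus queued requests."""
    def __init__(self, last_tick):
        self.last_tick = last_tick
        self.requests = deque()


def do_clean(clients, now, erase_age):
    """Erase idle client records, but never one that still holds requests."""
    for client_id in list(clients):
        rec = clients[client_id]
        idle = now - rec.last_tick > erase_age
        if idle and not rec.requests:
            del clients[client_id]
    return clients


clients = {"foreground": ClientRec(0), "background": ClientRec(0)}
clients["background"].requests.append("PGRecovery")
do_clean(clients, now=100, erase_age=10)
```

Both records are equally idle here, but only "foreground" is removed; the starved "background" client keeps its queued item until it is eventually dequeued.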
Samuel Just [Thu, 13 Feb 2025 03:54:28 +0000 (03:54 +0000)]
test/osd/TestMClockScheduler: create_item should pass prio < cutoff
Cutoff is set to 12, so let's pass something < 12 rather than 12.
Comments in some tests suggest that the intent is for create_item
to create things in the mclock queue rather than the high_queue.
Samuel Just [Thu, 13 Feb 2025 02:55:27 +0000 (02:55 +0000)]
test/osd/TestMClockScheduler: add test for very slow dequeue
Related: https://tracker.ceph.com/issues/61594
Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit b35589f7eb39e6bfabe7df1c55281f41925eca61)
John Mulligan [Thu, 13 Mar 2025 11:59:42 +0000 (07:59 -0400)]
script: ensure curl is always available in build containers
Ensure that curl is installed in all build containers regardless of
ceph's dependencies or other factors. This allows us to use curl in
any subsequent build steps/scripts.
Fixes: https://tracker.ceph.com/issues/70451
Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit b4e11f75bfa76036b9109485aa1cb4f9d633c8a2)
Conflicts:
src/pybind/mgr/dashboard/frontend/package-lock.json (conflicts
with typescript package version, kept the existing one)
src/pybind/mgr/dashboard/frontend/package.json (conflicts with
typescript package version, kept the existing one)
src/pybind/mgr/dashboard/frontend/src/app/ceph/rgw/rgw-multisite-migrate/rgw-multisite-migrate.component.ts (conflicts with automated system user creation in main)
src/pybind/mgr/dashboard/frontend/src/app/shared/forms/cd-validators.ts (conflicts with oauthAddressTest validator)
Laura Flores [Fri, 7 Mar 2025 06:22:00 +0000 (06:22 +0000)]
mon, osd: add command to remove invalid pg-upmap-primary entries
The current rm-pg-upmap-primary command checks that the pgid exists
in the pgmap before continuing to remove it. Due to https://tracker.ceph.com/issues/66867,
some invalid pg-upmap-primary entries may exist for pools that have been removed.
Currently, these mappings are impossible to remove since the pgids no longer
exist in the pgmap.
This new command, rm-pg-upmap-primary-all, allows users to remove
all pg-upmap-primary mappings in the osdmap at once, both
valid and invalid.
This command may also be helpful when upgrading from versions where users
are plagued by https://tracker.ceph.com/issues/61948. Users may use an upgraded
mon to remove all pg-upmap-primary entries (valid and invalid) so they can continue
to upgrade to a safe version.
See manual testing for this patch here: https://tracker.ceph.com/issues/67179#note-12
Fixes: https://tracker.ceph.com/issues/67179
Fixes: https://tracker.ceph.com/issues/69760
Signed-off-by: Laura Flores <lflores@ibm.com>
(cherry picked from commit 6e9e2033bf0f4779bdfac9a3a4f29115459c8c0e)
Conflicts:
src/osd/OSDMap.cc
src/osd/OSDMap.h
The per-pool `rm_all_upmap_prims` function is part of
https://github.com/ceph/ceph/commit/2953db8b58535605882dff2e1d4ff36e6075e122, which
is related to the "size optimized" read balancer feature that
is only included in Squid and later.
John Mulligan [Sat, 8 Feb 2025 20:03:32 +0000 (15:03 -0500)]
container: stop deleting python generated files
Stop deleting the python generated files (pyc, pyo) that RPM packages
have installed. At some point in the misty past someone thought it would
be a good idea to remove these. This practice got carried over to the
new in-tree Containerfile. IMO this is probably due to a thought to save
space, but if that's the case then the RPMs should not be carrying them
either. Plus, not having them is going to slow python down as it needs
to compile every py file that gets loaded. Let's be consistent: if the
RPMs have pyc and pyo files then they should be in the image - if
they're bad or too big they should not be in the RPMs either, right?
This has the pleasant side effect of making `rpm -Va` inside the image
happier.
Fixes: https://tracker.ceph.com/issues/69869
Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 0f178e61de52c6a0b757f8f6937340c002e66c73)
John Mulligan [Sat, 8 Feb 2025 19:51:23 +0000 (14:51 -0500)]
container: avoid installing docs using the dnf configuration
Avoid installing docs by using the dnf configuration tsflags parameter,
passing the nodocs flag. This tells dnf and rpm not to install
documentation, such as manpages. Stop installing the docs just to delete
them later with an `rm -rf` type command. Now the docs don't get
installed in the first place, saving space, but the rpm is happy
(`rpm -Va` no longer shows docs as 'missing').
Fixes: https://tracker.ceph.com/issues/69868
Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit bf9b8d36aba3c7a8c7a3ecfc4d00359985e745b6)
Hannes Baum [Wed, 6 Nov 2024 08:46:09 +0000 (09:46 +0100)]
mgr: fix subuser creation via dashboard
Subusers couldn't be created through the dashboard, because the get call was overwritten with Python magic due to it being the function under the HTTP call.
The get function was therefore split into an "external" and an "internal" function, where the internal one
can be used by other functions without triggering the magic. Since the user object is then returned correctly, json.loads could be removed.
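
A hypothetical illustration of that split, not the dashboard's actual code. The endpoint decorator, the Users class and its data are all invented for this sketch, and the "magic" is assumed to be a wrapper that JSON-encodes whatever the handler returns:

```python
import functools
import json


def endpoint(func):
    """Hypothetical stand-in for the HTTP wrapper: JSON-encodes results."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return json.dumps(func(*args, **kwargs))
    return wrapper


class Users:
    def __init__(self):
        self._users = {"alice": {"uid": "alice", "subusers": []}}

    def _get(self, uid):
        """Internal lookup: returns the plain user object."""
        return self._users[uid]

    @endpoint
    def get(self, uid):
        """External handler: same data, passed through the wrapper."""
        return self._get(uid)

    def create_subuser(self, uid, subuser):
        # Calling the internal function avoids the wrapper, so the result
        # is already a dict and no json.loads round trip is needed.
        user = self._get(uid)
        user["subusers"].append(f"{uid}:{subuser}")
        return user["subusers"]


users = Users()
created = users.create_subuser("alice", "swift")
```

In this sketch, code that called get() internally would receive a string and have to decode it again; routing internal callers through _get() removes that step.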
This test deals with enabling/disabling the modules. My assumption is
that after enabling the module, the test waits for an active mgr but is
not able to find it in time, so it fails. Taking inspiration from
https://github.com/ceph/ceph/pull/58995/commits/6c7253be6f6fbfa6faed7a539cb78847fec04580,
this adds retries and logs to see if that's the case.
Joshua Baergen [Wed, 18 Dec 2024 17:27:58 +0000 (10:27 -0700)]
blk/KernelDevice: Introduce a cap on the number of pending discards
Some disks have a discard performance that is too low to keep up with
write workloads. Using async discard in this case will cause the OSD to
run out of capacity due to the number of outstanding discards preventing
allocations from being freed. While sync discard could be used in this
case to cause backpressure, this might have unacceptable performance
implications.
For the most part, as long as enough discards are getting through to a
device, then it will stay trimmed enough to maintain acceptable
performance. Thus, we can introduce a cap on the pending discard count,
ensuring that the queue of allocations to be freed doesn't get too long
while also issuing sufficient discards to disk. The default value of 1000000 has ample room for discard spikes (e.g. from snaptrim); it could
result in multiple minutes of discards being queued up, but at least
it's not unbounded (though if a user really wants unbounded behaviour,
they can choose it by setting the new configuration option to 0).
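
A minimal sketch of a capped pending-discard queue, under stated assumptions: the class and the max_pending name are hypothetical, and a discard rejected by the cap is assumed to be skipped by the caller rather than retried. As in the description above, a cap of 0 means unbounded:

```python
class DiscardQueue:
    """Hypothetical bounded queue of pending (offset, length) discards."""
    def __init__(self, max_pending):
        self.max_pending = max_pending  # 0 disables the cap
        self.pending = []

    def try_queue(self, offset, length):
        """Queue a discard; return False if the cap has been reached."""
        if self.max_pending and len(self.pending) >= self.max_pending:
            return False
        self.pending.append((offset, length))
        return True


capped = DiscardQueue(max_pending=2)
accepted = [capped.try_queue(i * 4096, 4096) for i in range(3)]

unbounded = DiscardQueue(max_pending=0)
all_accepted = all(unbounded.try_queue(i, 1) for i in range(1000))
```

The third request on the capped queue is refused, which keeps the queue length bounded; the unbounded queue accepts everything.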
ceph-volume: allow zapping partitions on multipath devices
ceph-volume refuses to zap a device if it is a partition on a multipath
device due to an overly strict condition. This change ensures that only
full mapper devices (excluding partitions) are blocked from being zapped,
allowing partitions on multipath devices to be processed correctly.
With commit fcbf7367d285 ("rbd-nbd: map using netlink interface by
default") backported to reef, this reef-only fixup limited to fsx is no
longer needed.
Ramana Raja [Wed, 17 Jan 2024 18:24:36 +0000 (13:24 -0500)]
rbd-nbd: map using netlink interface by default
Mapping rbd images to nbd devices using the ioctl interface is not
robust. It was discovered that the device size or the md5 checksum
of the nbd device was incorrect immediately after mapping using
the ioctl method. When using the nbd netlink interface to map RBD images,
the issue was not encountered. Switch to using the nbd netlink interface
for mapping.
Ilya Dryomov [Mon, 3 Mar 2025 16:59:35 +0000 (17:59 +0100)]
test/pybind/rbd: fix read offset in write zeroes tests
Random data is written and write zeroes is invoked on 0~256, but the
read is done on 256~256. This means that if write zeroes malfunctions
the test wouldn't catch it (especially in the thick provision case).
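
The effect of the wrong read offset can be reproduced with a small stand-in, where a bytearray plays the role of the image. This is not the pybind test itself; all names are hypothetical, and a fixed byte pattern is used in place of random data so the sketch is deterministic:

```python
def write_zeroes(image, offset, length):
    """Working implementation: zero the requested range."""
    image[offset:offset + length] = bytes(length)


def broken_write_zeroes(image, offset, length):
    """Malfunctioning implementation: does nothing."""


def run_check(zero_fn, read_offset):
    """Write data to 0~256, zero 0~256, then verify 256 bytes at read_offset."""
    image = bytearray(512)            # fresh image reads back as zeroes
    image[0:256] = b"\xab" * 256      # stand-in for the random data
    zero_fn(image, 0, 256)
    return bytes(image[read_offset:read_offset + 256]) == bytes(256)


good_at_0 = run_check(write_zeroes, 0)
broken_at_256 = run_check(broken_write_zeroes, 256)
broken_at_0 = run_check(broken_write_zeroes, 0)
```

Reading at 256~256 passes even with the broken implementation, because that range was never written; reading at 0~256 is what exposes the malfunction.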
VinayBhaskar-V [Tue, 26 Nov 2024 11:18:51 +0000 (16:48 +0530)]
librbd: add rbd_diff_iterate3() API to take source snapshot by ID
Allow a diff to start from a non-user snapshot. This would be used by
the "rbd du" command, and in other places, to account for non-user
snapshots, which are currently just skipped, potentially resulting in
underreported space usage.
Conflicts:
src/include/rbd/librbd.h [ commit e5ccce14c4b0 ("rbd: add group
snap info command") not in reef ]
src/test/pybind/test_rbd.py [ commit d7fd66ec9944 ("librbd: add
rbd_clone4() API to take parent snapshot by ID") not in reef ]