Alex Ainscow [Thu, 27 Mar 2025 14:38:44 +0000 (14:38 +0000)]
crimson/osd: Add scrub stubs for crimson and classic, ready for new EC
The new optimised EC code is not backward compatible with the old EC code.
Before this commit, some stub code assumed that an hinfo
xattr would exist and could be used for scrub. This is no longer the case in new EC.
We plan to first make the scrubbing changes for new EC in classic and will
subsequently port them to crimson. The ported code will not look like the code here,
so there is little point in keeping it.
Additionally, add some stubs for scrub in classic optimized EC.
There will be a later PR specifically for dealing with scrubbing in
new EC, which will fix all the FIXMEs in classic.
The crimson code will be fixed up at a later date and will only
support optimised EC.
Alex Ainscow [Fri, 28 Mar 2025 13:46:30 +0000 (13:46 +0000)]
common: Generalise to_interval_set to allow more interval_set implementations.
This generalises to_interval_set so that the interval set no longer needs to share a
common internal map structure with interval_map. The implementation works
through iteration, so the old restriction no longer applies.
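As a rough illustration, the conversion can be written purely in terms of the
public iterator, so any interval-set-like type works (the exact names and
signatures below are assumptions for illustration, not the actual Ceph code):
```
// Sketch only: assumes an interval_map-style iterator exposing
// get_off()/get_len(), and an interval_set-style insert(off, len).
template <typename IntervalSet, typename IntervalMap>
IntervalSet to_interval_set(const IntervalMap &imap) {
  IntervalSet out;
  for (auto it = imap.begin(); it != imap.end(); ++it) {
    // Only the extent matters for the set; the mapped value is dropped.
    out.insert(it.get_off(), it.get_len());
  }
  return out;
}
```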
Alex Ainscow [Thu, 27 Mar 2025 11:44:29 +0000 (11:44 +0000)]
common: bitset_set
This bitset_set change relaxes the policing of bitset_set, so that
out-of-range values can be queried via the contains interface. This
means that callers can simplify their code. For example:
if (key == invalid || !set.contains(key)) {
  do_stuff();
}
Bill Scales [Wed, 26 Mar 2025 13:43:43 +0000 (13:43 +0000)]
osd: EC Optimizations: proc_master_log changes for partial logs
proc_master_log is part of the peering process that merges
the authoritative log (in the case of EC pools, the log of the
shard missing the most updates) into the primary log.
When there are partial writes it is likely that the
authoritative log is behind, because partial writes
did not update that shard. proc_master_log works out where
the logs diverge and then examines each additional log entry
to see if all the updates made in that log entry have been
applied. If any shard is missing an update then that log
entry (and all subsequent entries) needs to be rolled back;
otherwise the entry can be rolled forward and included in
the authoritative log.
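A loose sketch of that decision, assuming each log entry can report whether
every participating shard applied it (all names here are illustrative, not
the actual proc_master_log code):
```
// Walk the entries past the divergence point; the first entry that
// some shard failed to apply marks the rollback point.
eversion_t find_roll_forward_limit(
    const std::list<pg_log_entry_t> &entries_past_divergence,
    const std::function<bool(const pg_log_entry_t &)> &applied_everywhere)
{
  eversion_t limit;  // divergence point: nothing rolled forward yet
  for (const auto &entry : entries_past_divergence) {
    if (!applied_everywhere(entry)) {
      break;  // this entry and all later ones are rolled back
    }
    limit = entry.version;  // safe to roll forward up to here
  }
  return limit;
}
```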
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 13:25:07 +0000 (13:25 +0000)]
osd: EC Optimizations: Peer changes for partial logs
Changes to peering for replica/strays to handle partial
logs. For EC optimized pools shards may not have a complete
log if there have been partial writes that did not update
the shard. If the most recent entries in the log have all
skipped updating a shard then it will have a log that ends
earlier than other shards. During peering, the primary (which
has a full copy of the log) works out whether other shards
have any missing objects and then communicates this to
the replica/stray shards during activation.
The primary uses the partial write last complete (pwlc) data in
pg_info_t to explain to other shards whether they are missing
log entries or just need to update last_update and
last_complete.
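For illustration, the adjustment could look roughly like this (the layout of
the pwlc data as a version range is an assumption; the idea is a range of
versions known to be complete on the shard despite the absent log entries):
```
// If pwlc shows that every entry the shard "lacks" was a partial write
// that skipped it, advance its pointers instead of recovering anything.
void maybe_advance_from_pwlc(pg_info_t &info,
                             const eversion_t &pwlc_from,
                             const eversion_t &pwlc_to)
{
  if (info.last_update >= pwlc_from && info.last_update < pwlc_to) {
    info.last_update = pwlc_to;  // no entries actually missing
    info.last_complete = std::max(info.last_complete, pwlc_to);
  }
}
```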
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 13:15:45 +0000 (13:15 +0000)]
osd: EC Optimizations: Get missing changes for partial logs
Changes to the get missing step of peering to handle partial
writes. Having established the authoritative log, the primary
works out which shards are missing objects. With partial
writes this code needs to differentiate between a shard that
missed an update (and hence has a missing object) and a
shard that was not updated by a partial write. The divergent
log entries are examined to see if the updates were partial
writes that did not involve the shard.
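Sketched as a scan over the divergent entries (the is_written_shard helper
is an assumption about how the log entry exposes its written-shards data):
```
// Any divergent entry that actually targeted the shard means the
// shard is genuinely missing that object; entries from partial
// writes that skipped the shard leave nothing to recover.
bool shard_missing_from_divergence(
    const std::list<pg_log_entry_t> &divergent,
    const shard_id_t &shard)
{
  for (const auto &entry : divergent) {
    if (entry.is_written_shard(shard)) {
      return true;  // shard missed a write it should have applied
    }
  }
  return false;
}
```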
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 13:04:46 +0000 (13:04 +0000)]
osd: EC Optimizations: PG log changes for partial logs
Optimized EC pools will not add a log entry for shards that
are not modified by a partial write. This means the shard
will have a partial copy of the log.
There are several asserts in PGLog that assume that the log
is contiguous; these need to be relaxed when it is an optimized
EC pool (other pools retain the full-strength asserts).
During peering the primary may provide a complete log to a
non-primary shard to merge into its log. This merge can skip
log entries for partial writes that do not update the shard.
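A hedged illustration of that merge (names are assumptions, not the real
PGLog code): entries from partial writes that never touched the shard are
simply skipped rather than treated as gaps.
```
void merge_complete_log(std::list<pg_log_entry_t> &shard_log,
                        const std::list<pg_log_entry_t> &primary_log,
                        const shard_id_t &shard,
                        const eversion_t &shard_head)
{
  for (const auto &entry : primary_log) {
    if (entry.version <= shard_head) {
      continue;  // the shard already has this entry
    }
    if (!entry.is_written_shard(shard)) {
      continue;  // partial write skipped this shard: no entry needed
    }
    shard_log.push_back(entry);
  }
}
```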
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 12:40:13 +0000 (12:40 +0000)]
osd: EC optimizations: Get log twice when auth_log_shard is a non-primary
When an event such as splitting the PG occurs, the new primary does
not have any log at the start of peering. Non-primary shards in an
EC optimized pool may not have a complete log of writes due to
partial writes. If the chosen authoritative shard is a non-primary
shard then the new primary needs to first get a full copy of the
log (which extends past the authoritative shard's log) from another
shard, and then repeat the get log step to get the authoritative
shard's log so it can be merged, rewinding divergent entries.
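The decision for the extra round trip reduces to something like the
following (a sketch with illustrative names, not the actual peering code):
```
// A non-primary shard's log may stop early due to partial writes, so
// first obtain a full log from another shard, then fetch and merge
// the authoritative shard's log, rewinding divergent entries.
bool need_second_get_log(bool pool_is_optimized_ec,
                         bool auth_shard_is_nonprimary,
                         bool have_full_log_locally)
{
  return pool_is_optimized_ec &&
         auth_shard_is_nonprimary &&
         !have_full_log_locally;
}
```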
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Thu, 6 Mar 2025 09:47:17 +0000 (09:47 +0000)]
osd: EC Optimizations: Partial write changes to add_next_event
add_next_event is used during peering to process log entries
that a shard is missing to build up a list of missing objects.
With EC optimized pools and partial writes, not every update
modifies every shard. The log entry contains details of which
shards were modified, and this can be used to work out whether
a missing entry needs to be created or updated.
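Roughly, the check could look like this (is_written_shard is an assumption
about the log entry's written-shards data; the pg_missing_t::add call mirrors
its usual (oid, need, have, is_delete) shape):
```
void add_next_event_sketch(const pg_log_entry_t &entry,
                           const shard_id_t &shard,
                           pg_missing_t &missing)
{
  if (!entry.is_written_shard(shard)) {
    return;  // partial write skipped this shard: nothing is missing
  }
  // the shard should have applied this update; track it as missing
  missing.add(entry.soid, entry.version, entry.prior_version,
              entry.is_delete());
}
```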
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 10:46:07 +0000 (10:46 +0000)]
osd: EC Optimizations: Relax reset_complete_to for partial writes
EC Optimized pools can have shards missing log entries because
of partial writes. This means it is possible to have a missing
entry with a newer version than the log. Relax an assert in
reset_complete_to to avoid this.
reset_complete_to also resets last_complete to 0 when the
oldest missing object is before the first log entry. This
is too aggressive for partial writes and needs to be relaxed.
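The relaxation amounts to gating the old check on the pool type, along these
lines (condition and variable names are assumptions for illustration):
```
void check_missing_vs_log(bool pool_is_optimized_ec,
                          const eversion_t &oldest_missing,
                          const eversion_t &log_head)
{
  if (!pool_is_optimized_ec) {
    // Non-optimized pools: a missing entry newer than the log head
    // would indicate corruption, so assert as before.
    ceph_assert(oldest_missing <= log_head);
  }
  // Optimized EC: partial writes make this legal, so no assert.
}
```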
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 10:05:07 +0000 (10:05 +0000)]
osd: EC Optimizations: Add shard_id_sets for backfill_target and ...
acting_recovery_backfill
Optimized EC code uses shard_id_sets as a convenient and fast way of
representing sets of shards. Peering calculates a backfill_target set
and an acting_recovery_backfill set as maps of pg_shard_ids, and
these are then used while processing I/O requests.
Modify peering so that it initializes a shard_id_set version of
these two sets and makes these available to ECBackend code.
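The initialization is essentially a projection from pg_shard_t to its shard
id (a sketch with illustrative names, not the actual peering code):
```
// Derive shard_id_set versions of the two peering sets so the I/O
// paths can test membership with a cheap bitset lookup.
void init_shard_sets(const std::set<pg_shard_t> &backfill_targets,
                     const std::set<pg_shard_t> &acting_recovery_backfill,
                     shard_id_set &backfill_target_shards,
                     shard_id_set &acting_recovery_backfill_shards)
{
  for (const pg_shard_t &pgs : backfill_targets) {
    backfill_target_shards.insert(pgs.shard);
  }
  for (const pg_shard_t &pgs : acting_recovery_backfill) {
    acting_recovery_backfill_shards.insert(pgs.shard);
  }
}
```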
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 26 Mar 2025 08:30:32 +0000 (08:30 +0000)]
osd: EC Optimizations: Share pwlc between peers
Optimized EC pools add partial_writes_last_complete (pwlc) data to
pg_info_t to track shards that were not updated because of partial
writes. During peering the primary collects the info structure from
all the replica/strays and then having reconciled the log can send
the info back to peers. Different shards may have newer or older
versions of pwlc; the primary merges these together to create
the definitive copy and then redistributes it to the other shards.
The primary also adjusts the last_update and last_complete values
in the info structure received from peers using the pwlc data to
advance these where shards were not updated because of a partial
write.
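The merge itself is a per-shard "keep the newer range" operation, roughly
(the map/pair types here are assumptions about how pwlc is represented):
```
using pwlc_map = std::map<shard_id_t, std::pair<eversion_t, eversion_t>>;

// Per shard, keep whichever pwlc range extends furthest, forming the
// definitive copy that is redistributed to peers.
void merge_pwlc_from_peer(pwlc_map &definitive, const pwlc_map &from_peer)
{
  for (const auto &[shard, range] : from_peer) {
    auto [it, inserted] = definitive.try_emplace(shard, range);
    if (!inserted && it->second.second < range.second) {
      it->second = range;  // peer holds a newer pwlc for this shard
    }
  }
}
```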
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Tue, 25 Mar 2025 17:41:57 +0000 (17:41 +0000)]
osd: EC Optimizations: Update pwlc in pg_info_t
Optimized EC pools add extra data to the log entry to track
which shards were updated by a partial write. When the log
entry is completed this needs to be summarized in the
partial_writes_last_complete map in pg_info_t.
Summarising this data in pg_info_t makes it easy to determine
whether a shard is behind because it is missing updates or has
just not been involved in recent updates. It also ensures that
even if there is a long sequence of updates that all skip a
shard, a record of this is retained in the info structure
even after the log has been trimmed.
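A loose, hedged illustration of the bookkeeping (names and exact semantics
are assumptions; the point is that skipped shards get their pwlc record
extended when the entry completes):
```
void update_pwlc_on_complete(
    const pg_log_entry_t &entry,
    const shard_id_set &all_shards,
    std::map<shard_id_t, std::pair<eversion_t, eversion_t>> &pwlc)
{
  for (const shard_id_t &shard : all_shards) {
    if (entry.is_written_shard(shard)) {
      continue;  // shard has its own log entry; nothing to summarize
    }
    // record that the shard is still complete up to this version
    pwlc[shard].second = entry.version;
  }
}
```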
Edited by aainscow as suggested in comment here:
https://github.com/ceph/ceph/pull/62522/files#r2050803678
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
pybind: switch from pkgutil.find_loader() to importlib.util.find_spec()
Replace pkgutil.find_loader() with importlib.util.find_spec() throughout
Python bindings. This addresses the deprecation warning in Python 3.10
(scheduled for removal in 3.14) that appeared when generating librbd
Python bindings.
The importlib.util.find_spec() API has been available since Python 3.4
and is compatible with our minimum required Python version (3.9, since
commit 51f71fc1).
The warning resolved:
```
/home/kefu/dev/ceph/src/pybind/rbd/setup.py:8: DeprecationWarning: 'pkgutil.find_loader' is deprecated and slated for removal in Python 3.14; use importlib.util.find_spec() instead
if not pkgutil.find_loader('setuptools'):
```
J. Eric Ivancich [Wed, 16 Apr 2025 16:38:33 +0000 (12:38 -0400)]
rgw: prevent crash in `radosgw-admin bucket object shard ...`
This subcommand is used to ask radosgw-admin which bucket index shard
a given object in a given bucket would have its bucket index entry
on. The user is required to supply the number of shards (i.e., the
command doesn't look that up). Previously, providing 0 resulted in a
divide-by-zero runtime exception. Values less than or equal to zero
are now rejected.
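The guard boils down to validating the divisor up front, along these lines
(a sketch, not the actual radosgw-admin code; the message text is illustrative):
```
#include <cerrno>
#include <iostream>

// Validate the user-supplied shard count before it is used as a
// divisor when mapping an object name to a bucket index shard.
int validate_shard_count(long long num_shards) {
  if (num_shards <= 0) {
    std::cerr << "ERROR: number of shards must be greater than zero"
              << std::endl;
    return -EINVAL;  // reject rather than divide by zero later
  }
  return 0;
}
```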
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Fix stray example command block leftover from rebase in
cloud-transition.rst.
Remove extra character > in cloud-sync-module.rst.
Add missing formatting char ` in cloud-sync-module.rst.
Remove extra empty line between example commands that
resulted in a line with just a "#" prompt.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
rbd-mirror: release lock before calling m_async_op_tracker.finish_op()
m_async_op_tracker.finish_op() in InstanceReplayer::start_image_replayers
may invoke a completion that re-enters code paths that attempt to acquire
the same mutex (m_lock), violating the non-recursive lock constraint.
This can be fixed by releasing the lock before calling
m_async_op_tracker.finish_op().
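The pattern looks roughly like this (a sketch of the locking discipline, not
the actual InstanceReplayer code; the function shape is an assumption):
```
#include <functional>
#include <mutex>

// Release the non-recursive lock before running a completion that may
// re-enter code paths which acquire the same lock.
void finish_after_unlock(std::mutex &m_lock,
                         const std::function<void()> &finish_op) {
  {
    std::scoped_lock locker{m_lock};
    // ... state updates that require m_lock ...
  }  // m_lock released here
  finish_op();  // safe even if the completion re-acquires m_lock
}
```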
Merge pull request #62818 from ronen-fr/wip-rf-iocnt-plus
osd/scrub: performance counters: count I/Os, use unlabeled counters
Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Ville Ojamo [Thu, 10 Apr 2025 10:34:57 +0000 (17:34 +0700)]
doc/radosgw: Promptify CLI, cosmetic fixes
Use the more modern prompt block for CLI commands
and use the right prompt, $ vs #.
Fix indentation on JSON example outputs and
some CLI command switches.
Add an arguably missing comma in JSON example output.
Add a full stop at the end of a one-sentence paragraph.
Remove extra comma mid-sentence in another.
Fix missing backslashes or typo at end of multiline commands.
Make lines under section headings as long as the heading text.
Fix hyperlinks. Fix list items prefixed with - instead of *.
Format configuration syntax in the middle of text as code.
Fix typo "PI" to "API" and remove extra space.
Remove colons at the end of section headers in a few places.
Use Title Case in section titles consistently with short words lowercase.
Possibly controversial: don't add whitespace before and
after main title section header text.
Possibly controversial: don't indent line continuation
backslashes, leave only 1 space before them.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
osd/scrub: a single counters selection mechanism - step 1
Following the preceding PR, the Scrubber now employs
two methods for selecting the specific subset of performance
counters to update (the replicated pool set or the EC one).
The first method uses labeled counters, with 4 optional labels
(Primary/Replica X Replicated/EC Pool). The second method
names the specific OSD counters to use in ScrubIoCounterSet
objects, then selects the appropriate set based on the pool type.
This commit is the first step on the path to unifying the two
methods - discarding the use of labeled counters, and only
naming OSD counters.
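The named-counter approach amounts to grouping the counter handles per pool
type and picking a set once per scrub (struct and member names below are
assumptions for illustration, not the actual ScrubIoCounterSet definition):
```
// Group the OSD counter indices per pool type; the scrubber selects
// the right set once, based on the pool being scrubbed.
struct ScrubIoCounterSet {
  int blocks_read;      // illustrative counter indices
  int bytes_read;
  int chunks_verified;
};

const ScrubIoCounterSet replicated_io_counters{/* ... */};
const ScrubIoCounterSet ec_io_counters{/* ... */};

const ScrubIoCounterSet &select_counters(bool pool_is_ec) {
  return pool_is_ec ? ec_io_counters : replicated_io_counters;
}
```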
osd/scrub: perf-counters for I/O performed by the scrubber
Define two sets of performance counters to track I/O performed
by the scrubber: one set to be used when scrubbing a PG
in a replicated pool, and one for EC PGs.
The tested version of https://github.com/ceph/ceph/pull/62080 was **different**
from the one that got merged.
The untested change altered the boolean returned by start_recovery_ops.
While the seastar::repeat loop in BackgroundRecoveryT<T>::start() was changed accordingly,
other do_recovery() return cases were not considered.
See tested vs. merged here: https://github.com/Matan-B/ceph/pull/2/files
start_recovery_ops, as used by do_recovery, should return whether the iteration (i.e. recovery) should keep going.
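The intended contract, sketched (not the actual crimson code; the free
start_recovery_ops declaration is an assumption standing in for the member
function): every return path must map the bool to stop_iteration the same way.
```
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>

seastar::future<bool> start_recovery_ops();  // true = keep going

// Keep iterating while recovery reports that it should continue.
seastar::future<> recovery_loop() {
  return seastar::repeat([] {
    return start_recovery_ops().then([](bool keep_going) {
      return keep_going ? seastar::stop_iteration::no
                        : seastar::stop_iteration::yes;
    });
  });
}
```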
Direct users to upgrade only to Squid v19.2.2, and warn readers not to
upgrade to Squid 19.2.1. This PR is raised in response to a request from
Neha Ojha.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
rgw: metadata and data sync fairness notifications to retry upon any error case
This is a complementary fix to the earlier one described at #62156.
When the sync shard notification fails for any reason, including a timeout,
this change keeps the retry loop going for both metadata and data sync.
John Mulligan [Fri, 11 Apr 2025 17:02:15 +0000 (13:02 -0400)]
mgr/cephadm: do not delete smb fs cephx keys
This change effectively disables fencing for the smb service because
the previous attempt to implement fencing would destroy the only
cephx key. Deleting this key would prevent any smb service that is
part of the logical cluster from talking to cephfs, even ones that
were not to be fenced.
The whole concept of fencing and ranks needs a bit of a rethink in
regards to smb. For now, we're just going to rely on ctdb and not
cephadm for smb's HA.
Fixes: 60300360cc500091e9dadf929d00bb72afad033c
Signed-off-by: John Mulligan <jmulligan@redhat.com>
The 'delay_ready_t' parameter was used in the past to
control whether, when the scrub scheduling inputs change
(e.g. on a configuration change), scheduling targets that are
already ripe for scrubbing also have their schedule recomputed.
This parameter, however, is ignored: all "regular-periodic"
scrubbing targets are always rescheduled when the scheduling inputs
change.
The commit removes the 'delay_ready_t' parameter from the codebase.