Alex Ainscow [Fri, 6 Jun 2025 11:09:04 +0000 (12:09 +0100)]
osd: Optimised EC should avoid decodes off the end of objects.
This was a particular edge case whereby the need to do an encode and a decode as part of recovery
was causing EC to attempt to do a decode off the end of the shard, despite this being
unnecessary.
Alex Ainscow [Fri, 23 May 2025 08:59:31 +0000 (09:59 +0100)]
osd: During recovery, pass "for_recovery" when attempting re-reads
get_all_remaining_reads() is used for two purposes:
1. If a read unexpectedly fails, recover from other shards.
2. If a shard is missing, but is allowed to be missing (typically
due to unequal shard sizes), we rely on this function to return
no new reads, without an error.
The test failure we saw was case (2), but I think case (1) is also important.
Most of the time we probably would not notice, but if insufficient redundancy
exists without the for_recovery being set, then this will result in
recovery failing.
Alex Ainscow [Fri, 2 May 2025 09:11:45 +0000 (10:11 +0100)]
osd: Improve backfill in new EC.
In old EC, the full stripe was always read and written. In new EC, we only attempt
to recover the shards that were missing. If an old OSD is available, the read can
be directed there.
Alex Ainscow [Thu, 8 May 2025 15:14:03 +0000 (16:14 +0100)]
osd: Cope with empty reads from an OSD without panic.
If a ReadOp from EC contains two objects where one object only reads from a single shard, but
other objects require other shards, then this bug can be hit. The fix should make it clear
what the issue is.
Alex Ainscow [Thu, 8 May 2025 13:22:36 +0000 (14:22 +0100)]
osd: Recover non-primary shards with the correct version.
Scrub revealed a bug whereby the non-primary shards were being given a
version number in the OI which did not match the expected version in
the authoritative OI.
A secondary issue is that all attributes were being pushed to the
non-primary shards, whereas only OI is actually needed.
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
# Conflicts:
# src/osd/osd_types.h
Alex Ainscow [Wed, 7 May 2025 09:33:24 +0000 (10:33 +0100)]
osd: Do not do a read-modify-write if op.delete_first is set
Some client OPs are able to generate transactions which delete an
object and then write it again. This is used by the copy-from ops.
If such a write is not 4k aligned, then the new EC code was incorrectly doing
a read-modify-write on the misaligned 4k. This causes some
garbage to be written to the backend OSD, off the end of the
object. This is only a problem if the object is later extended
without the end being written.
Problematic sequence is:
1. Create two objects (A and B) of size X and Y where:
X > Y, (Y % 4096) != 0
2. copy_from OP B -> A
3. Extend B without writing offset Y+1
This will result in a corrupt data buffer at Y+1 without this fix.
Alex Ainscow [Thu, 1 May 2025 09:09:15 +0000 (10:09 +0100)]
osd: Fix off-by-one error in shard_extent_map.
Inserting the first parity buffer was causing the ro-range within the SEM to be incorrectly calculated.
Simple fix, and I have added some unit tests to defend against this error in the future.
Alex Ainscow [Thu, 24 Apr 2025 13:02:41 +0000 (14:02 +0100)]
osd: clone + delete ops should invalidate source new EC extent cache.
The op.is_delete() function only returns true if the op is not ALSO
doing something else (e.g. a clone). This causes issues with clearing
the new EC extent cache.
Alex Ainscow [Wed, 23 Apr 2025 14:41:11 +0000 (15:41 +0100)]
osd: Fix parity updates in truncates.
Previously in optimised EC, when truncating to a partial
stripe, the parity was not being updated. This fix reads
the non-truncated data from the final stripe and calculates
parity updates, which are written to the parity shards.
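The parity recalculation described above can be illustrated with a minimal sketch. It assumes a single XOR parity chunk as a stand-in for the real erasure-code plugin, and the helper name `parity_after_truncate` and the in-memory chunk layout are hypothetical, not taken from the Ceph source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: `chunks` holds the k data chunks of the final stripe
// (all the same size) and `keep` is the number of stripe bytes that survive
// the truncate. XOR parity is an assumption made for this sketch only.
std::vector<uint8_t> parity_after_truncate(
    std::vector<std::vector<uint8_t>>& chunks, std::size_t keep) {
  const std::size_t chunk_size = chunks.empty() ? 0 : chunks[0].size();
  std::vector<uint8_t> parity(chunk_size, 0);
  for (std::size_t c = 0; c < chunks.size(); ++c) {
    assert(chunks[c].size() == chunk_size);
    for (std::size_t i = 0; i < chunk_size; ++i) {
      // Logical offset of this byte within the stripe.
      if (c * chunk_size + i >= keep) {
        chunks[c][i] = 0;  // truncated bytes count as zero in this sketch
      }
      parity[i] ^= chunks[c][i];  // fold the retained data into the parity
    }
  }
  return parity;
}
```

The point of the sketch is only that the retained data of the final stripe has to be read back before the new parity can be computed.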
Alex Ainscow [Tue, 22 Apr 2025 12:41:19 +0000 (13:41 +0100)]
osd: Fix EC cache invalidation bug
With optimised EC, there were two bugs with cache invalidation:
1. If two invalidates were in the queue, it is possible that the second
invalidate might be cleared by the first.
2. Reads were being requested if size was being reduced.
Also, added a few debug improvements and some new asserts.
Alex Ainscow [Wed, 9 Apr 2025 12:49:49 +0000 (13:49 +0100)]
osd: Make EC alignment independent of page size.
Code which manipulates full pages is often faster. To exploit this,
optimised EC was written to deal with 4k alignment wherever possible.
When inputs are not aligned, they are quickly aligned to 4k.
Not all architectures use 4k page sizes. Some Power architectures, for
example, have a 64k page size. In such situations, it is unlikely that
using 64k page alignment will provide any performance boost; indeed, it
is likely to hurt performance significantly. As such, EC has been
changed to maintain its own internal alignment (4k), which can be configured.
This has the added advantage that we can potentially tweak this
value in the future.
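As a rough illustration of alignment that is decoupled from the CPU page size, the sketch below rounds offsets to a configurable power-of-two boundary. The constant `kEcAlign` and the helper names are hypothetical; only the 4k default comes from the message above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the configurable EC alignment. The 4096 default
// is the value named in the commit message, not the CPU page size.
constexpr uint64_t kEcAlign = 4096;

inline bool is_power_of_two(uint64_t v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// Round an offset down to a multiple of `align` (must be a power of two).
inline uint64_t align_down(uint64_t off, uint64_t align = kEcAlign) {
  assert(is_power_of_two(align));
  return off & ~(align - 1);
}

// Round an offset up to a multiple of `align` (must be a power of two).
inline uint64_t align_up(uint64_t off, uint64_t align = kEcAlign) {
  assert(is_power_of_two(align));
  return (off + align - 1) & ~(align - 1);
}
```

Passing the alignment as a parameter rather than querying the page size is what lets the same code behave identically on 4k and 64k page machines.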
Alex Ainscow [Wed, 16 Apr 2025 09:41:48 +0000 (10:41 +0100)]
osd: Fix Truncates in Optimised EC
The previous truncate code attempted to perform a non-aligned truncate by
creating a zero buffer at the end of the object, which was written.
The new code initially truncates to the exact size of the user object before
growing the object to the required 4k alignment. This simpler arrangement
also simplifies the rollback.
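The two-step arrangement described above can be sketched as follows. The shard is modelled as an in-memory byte vector and `truncate_then_align` is a hypothetical name, so this is an illustration of the idea rather than the OSD code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: step 1 truncates to the exact user size, step 2 grows the
// buffer to the next multiple of `align`. The growth is zero-filled, so no
// separate zero buffer has to be built and written.
void truncate_then_align(std::vector<uint8_t>& shard, std::size_t new_size,
                         std::size_t align = 4096) {
  assert(align != 0 && new_size <= shard.size());
  shard.resize(new_size);  // exact truncate
  const std::size_t aligned = (new_size + align - 1) / align * align;
  shard.resize(aligned, 0);  // zero-filled growth to the aligned size
}
```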
Bill Scales [Fri, 6 Jun 2025 12:28:14 +0000 (13:28 +0100)]
osd: EC optimizations fix bug when recovering only partial write objects
PGLog::reset_complete_to is not handling the scenario where all the
missing objects have a partial write that excludes updating the shard being
recovered as their most recent update. In this scenario the oldest need
is newer than the newest log entry. Setting last_complete to the head of the
log confuses the code and makes it think that recovery has completed.
The fix is to hold last_complete one entry behind the head of the log
until all missing objects have been recovered.
PGLog::recover_got already does this when an object is recovered and the
remaining objects to recover match this scenario, so this fix just makes
reset_complete_to behave the same way as recover_got.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Thu, 5 Jun 2025 10:17:06 +0000 (11:17 +0100)]
osd: EC optimizations correct pwlc after PG split
When a PG splits, the log entries are divided between the two PGs;
this can result in PWLC referring to log entries in the other PG.
Roll back PWLC after the split so it is not advanced further than
the most recently completed log entry.
Non-primary shards can be missing log entries and may roll back
PWLC too far because of this. However, this does not matter
because a split occurs at the start of a peering cycle and these
shards will be updated with the correct PWLC from the primary
shard later in the peering cycle when they are activated.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Mon, 26 May 2025 13:33:12 +0000 (14:33 +0100)]
osd: EC optimizations overaggressive check for missing objects
Relax an assert in read_log_and_missing for optimized EC
pools. Because the log may not have entries for partial
writes but the missing list is calculated from the full
log, the need version for a missing item may be newer than
the latest log entry for that object.
ceph_objectstore_tool needs care because we don't want to add
extra dependencies. To minimise the dependencies, we always
relax the asserts when using this tool.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Bill Scales [Wed, 21 May 2025 17:16:50 +0000 (18:16 +0100)]
osd: EC Optimizations fix proc_master_log handling of splits
For optimized EC pools proc_master_log needs to deal with
the other log being merged being behind the local log because
it is missing partial writes. This is done by finding the
point where the logs diverge and then checking whether local
log entries have been committed on all the shards.
A bug in this code meant that after a PG split (where there
may be gaps in the log due to entries moving to the other PG)
the divergence point was not found, and committed
partial writes ended up being discarded, which creates
unfound objects.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Tue, 20 May 2025 10:37:05 +0000 (11:37 +0100)]
osd: EC Optimizations fix missing call to partial_write
When a shard is backfilling and it receives a log entry where the
transaction is not applied it can skip the roll forward by
immediately advancing crt. However it is still necessary to
call partial_write in this scenario to keep the pwlc information
up to date.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Wed, 14 May 2025 07:39:40 +0000 (08:39 +0100)]
osd: EC Optimizations bug fix for flip/flop acting set
EC optimizations pools have a set of non-primary shards which
cannot become the primary because they do not have all the
metadata updates. If one of these shards is chosen as the
primary it will set the acting set to force another shard to
be chosen.
It is important that the selected acting set is the same
acting set that will be chosen by the next primary (assuming
nothing else changes) otherwise a PG can get into a state where
the acting set flip/flops between two different states causing
the PG to get stuck in peering and hanging I/O.
A bug in update_peer_info meant that non-primary shards did not
present the same info to choose_acting_set as primary shards
because they were not updating their pg_info_t based on pwlc
information from other shards.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Tue, 13 May 2025 11:55:14 +0000 (12:55 +0100)]
osd: Refuse to commit/rollforward beyond end of log.
In optimised EC, if a transaction is applied to all shards, followed by a
partial transaction AND these two transactions overlap, then it is
possible for the non-primary shards to commit a version which is after
the end of the log.
This commit changes the apply_log such that the commit version will be
changed to the head of the log in such situations.
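A minimal sketch of that clamping follows. The `Version` struct is a simplified, hypothetical stand-in for the real log version type, and `clamp_commit` is an illustrative name rather than the function changed by this commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical, simplified stand-in for a PG log version (epoch, version).
struct Version {
  uint64_t epoch;
  uint64_t version;
};

inline bool operator<(const Version& a, const Version& b) {
  return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

// Never commit past the head of the local log: a requested commit version
// that lies beyond the head is replaced by the head itself.
inline Version clamp_commit(const Version& requested, const Version& head) {
  const Version result = std::min(requested, head);
  assert(!(head < result));  // postcondition: result is not beyond the head
  return result;
}
```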
Alex Ainscow [Tue, 29 Apr 2025 11:02:07 +0000 (12:02 +0100)]
osd: Refactor partial_write to address multiple issues.
We fix a number of issues with partial_write here.
Fix an issue where it is unclear whether the empty PWLC state is
newer or older than a populated PWLC on another shard by always
updating the pwlc with an empty range, rather than blank.
This is an unfortunate small increase in metadata, so we should
come back to this in a later commit (or possibly later PR).
Normally a PG log consists of a set of log entries, with each
log entry having a version number one greater than the previous
entry. When a PG splits, the PG log is split so that each of the
new PGs only has log entries for objects in that PG, which
means there can be gaps between version numbers.
PGBackend::partial_write is trying to keep track of adjacent
log updates that do not update a particular shard, storing
these as a range in partial_writes_last_complete. To do this
it must compare with the version number of the previous log
entry rather than testing for a version number increment of one.
Also simplify partial_writes to make it more readable.
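The gap-tolerant adjacency test described above might look roughly like the sketch below. `SkippedRange` and `note_skipped` are hypothetical names, and plain integers stand in for log versions; the real code tracks one such range per shard.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical range of consecutive log entries that skipped one shard.
struct SkippedRange {
  uint64_t from;  // version just before the first skipped entry
  uint64_t to;    // version of the last skipped entry
};

// Record that log entry `version`, whose predecessor in this PG's log is
// `previous`, did not update the shard. Adjacency is decided by comparing
// with `previous`, not by checking `version == to + 1`, so gaps left by a
// PG split do not break the range.
inline void note_skipped(SkippedRange& r, uint64_t previous, uint64_t version) {
  assert(previous < version);
  if (r.to == previous) {
    r.to = version;  // adjacent in the log: extend the existing range
  } else {
    r.from = previous;  // not adjacent: start a new range
    r.to = version;
  }
}
```

With entries 10, 11 and 14 in the log after a split, the range still extends to 14 because 11 is the previous entry, whereas an increment-by-one test would have restarted it.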
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Alex Ainscow [Tue, 29 Apr 2025 10:59:24 +0000 (11:59 +0100)]
osd: nonprimary shards are permitted to have a crt newer than head
Non-primary shards do not get updates for some transactions. It is possible
however for other transactions to increase the can_rollback_to to a later
version. This causes an assert for some operations.
Bill Scales [Fri, 25 Apr 2025 14:03:02 +0000 (15:03 +0100)]
osd: overaggressive assert in read_log_and_missing with optimized EC pool
read_log_and_missing is called during OSD initialization to sanity check
the PG log. One of its checks is too aggressive for an optimized EC pool
where because of a partial write there can be a log entry but no update
to the object on this shard (other shards will have been updated). The
fix is to skip the checks when the log entry indicates this shard was
not updated.
Only affects pools with the allow_ec_optimizations flag on.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Thu, 29 May 2025 11:53:27 +0000 (12:53 +0100)]
osd: EC optimizations rework for pg_temp
Bug fixes for how pg_temp is used with optimized EC pools. For these
pools pg_temp is re-ordered with non-primary shards last. The acting
set was undoing this re-ordering in PeeringState, but this is too
late and results in code getting the shard id wrong. One consequence
of this was an OSD refusing to create a PG because of an incorrect
shard id.
This commit moves the re-ordering earlier into OSDMap::_get_temp_osds,
some changes are then required to OSDMap::clean_temps.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Fri, 23 May 2025 09:45:46 +0000 (10:45 +0100)]
osd: EC Optimizations OSDMap::clean_temps preventing change of primary
clean_temps is clearing pg_temp if the acting set will be the same
as the up set. For optimized EC pools this is overaggressive because
there are scenarios where it is setting acting set to be the same as
up set to force an alternative shard to be chosen as primary - this
happens because the acting set is transformed to place non-primary
shards at the end of the pg_temp vector.
Detect this scenario and stop clean_temps from undoing the acting
set which is being set by PeeringState::choose_acting.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Thu, 22 May 2025 12:12:57 +0000 (13:12 +0100)]
osd: EC optimizations bug in OSDMap::clean_temps
OSDMap clean_temps clears pg_temp for a PG when the up set
matches the acting_set. For optimized EC pools the pg_temp
is reordered to place primary shards first, this function
was not calling pgtemp_undo_primaryfirst to revert the
reordering.
This meant that a 2+1 EC PG with up set [1,2,3] and
a desired acting set [1,3,2] re-ordered the acting
set to produce pg_temp as [1,2,3] and then deleted this
because it equals the up set.
Calling pgtemp_undo_primaryfirst makes this code work
as intended.
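The 2+1 example above can be reproduced with a small sketch. It assumes that shard 1 is the only non-primary shard in a 2+1 pool, which is an assumption chosen to match the numbers in the message; `primary_first` and `undo_primary_first` are hypothetical helpers, not the real OSDMap functions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using OsdVec = std::vector<int>;

// Reorder an acting set (indexed by shard) so that shards able to become
// primary come first and the non-primary shards come last.
OsdVec primary_first(const OsdVec& acting,
                     const std::set<std::size_t>& nonprimary) {
  OsdVec out;
  out.reserve(acting.size());
  for (std::size_t s = 0; s < acting.size(); ++s)
    if (!nonprimary.count(s)) out.push_back(acting[s]);
  for (std::size_t s = 0; s < acting.size(); ++s)
    if (nonprimary.count(s)) out.push_back(acting[s]);
  return out;
}

// Inverse of primary_first: map a reordered pg_temp back to shard order.
OsdVec undo_primary_first(const OsdVec& pg_temp,
                          const std::set<std::size_t>& nonprimary) {
  OsdVec out(pg_temp.size());
  std::size_t pos = 0;
  for (std::size_t s = 0; s < pg_temp.size(); ++s)
    if (!nonprimary.count(s)) out[s] = pg_temp[pos++];
  for (std::size_t s = 0; s < pg_temp.size(); ++s)
    if (nonprimary.count(s)) out[s] = pg_temp[pos++];
  assert(pos == pg_temp.size());
  return out;
}
```

Under that assumption, acting set [1,3,2] is stored as pg_temp [1,2,3], which equals the up set only in its reordered form. Comparing after undoing the reordering gives [1,3,2], so the entry is no longer mistaken for a redundant one.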
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Mon, 19 May 2025 16:01:39 +0000 (17:01 +0100)]
osd: EC Optimizations fix routing of requests to non-zero shard id
Pools with EC optimizations can use pg_temp to modify the selection
of the primary. When this happens the clients route requests to the
correct OSD, but the wrong shard, which causes hung I/Os or misdirected
I/Os.
Fix Objecter to select the correct shard when sending requests to
EC optimized pools. Fix OSD to modify the shard when receiving
requests from legacy clients.
Add new unittests to test new functions for remapping the shard id.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Thu, 24 Apr 2025 14:13:31 +0000 (15:13 +0100)]
osd: Cosmetic code cleanup and improve debug
All these changes are one of:
* Whitespace changes
* Addition/removal of debug
* Typos
* Make CPU-intensive debug statements debug level 30
* Remove unnecessary counter which did not really help with debug
* Extra comments
Zac Dover [Wed, 25 Jun 2025 09:19:49 +0000 (19:19 +1000)]
doc/radosgw: line edit bucket_logging.rst
Edit doc/radosgw/bucket_logging.rst so that it is not solecistic and so
that its punctuation is corrected and its use of articles is corrected.
This file remains in my judgment demotic and maybe demotic enough to
warrant another editorial pass in the future.
Venky Shankar [Wed, 25 Jun 2025 06:39:39 +0000 (12:09 +0530)]
Merge PR #59435 into main
* refs/pull/59435/head:
mgr/volumes: Fix json.loads for test on mon caps
mgr/volumes: Add test for mon caps if auth key has remaining mds/osd caps
mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps
Add comprehensive documentation for defining configuration options in
ceph-mgr modules, including all supported properties and their usage.
Previously, the documentation did not explain how to define ceph-mgr
module configuration options, despite subtle differences from other Ceph
components. This change documents all supported Option properties, their
types, and provides clear examples to help module developers properly
configure their options.
Kefu Chai [Wed, 25 Jun 2025 03:02:46 +0000 (11:02 +0800)]
doc: do not depend on typed-ast
the typed-ast project has been marked end of life since July 2023, and
is not maintained anymore. we build the document using readthedocs'
service, and in .readthedocs.yml we use python 3.9, which comes with
the ast module included in its standard library.
the typed-ast dependency was originally added in 30d41597, but now that
we are using python 3.9, there is no need to use this module anymore.
Kefu Chai [Wed, 25 Jun 2025 03:50:24 +0000 (11:50 +0800)]
doc/dev/config: Document how to use :confval: directive for config options
Add comprehensive guide for documenting configuration options using the
:confval: directive, including naming conventions and cross-referencing.
Previously, the documentation lacked guidance on using the :confval:
directive and the important distinction between regular config options
and mgr module options (which require the mgr/<module>/ namespace
prefix). This change provides detailed examples and best practices for
properly documenting and referencing both types of configuration options.
Kefu Chai [Tue, 24 Jun 2025 14:38:13 +0000 (22:38 +0800)]
rbd: fix unused function warning when WITH_KRBD is disabled
Guard print_error_description() and get_unsupported_features() with
`#ifdef WITH_KRBD` to prevent compiler warnings when KRBD support is
not enabled.
These functions are only called by do_kernel_map(), which is itself
conditionally compiled. When WITH_KRBD is not defined, the compiler
generates unused function warnings for these helper functions.
Fixes warning:
```
/home/kefu/dev/ceph/src/tools/rbd/action/Kernel.cc:305:13: warning: ‘void rbd::action::kernel::print_error_description(const char*, const char*, const char*, const char*, int)’ defined but not used [-Wunused-function]
305 | static void print_error_description(const char *poolname,
| ^~~~~~~~~~~~~~~~~~~~~~~
```
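The guarding pattern described in this message can be sketched in isolation. The function names below (`describe_error`, `do_map`, `map_image`) are hypothetical and deliberately simpler than the real rbd helpers; the only thing carried over is the `WITH_KRBD` macro.

```cpp
#include <cassert>
#include <string>

#ifdef WITH_KRBD
// Helpers that are only called from the kernel-map path live inside the
// same guard as their only caller, so a build without WITH_KRBD never sees
// an unused static function.
static std::string describe_error(int err) {
  return "map failed: " + std::to_string(err);
}

static std::string do_map(const std::string& image) {
  return image.empty() ? describe_error(-22) : "mapped " + image;
}
#endif

// Always-compiled entry point; it falls back to a message when kernel
// support is not built in.
std::string map_image(const std::string& image) {
#ifdef WITH_KRBD
  return do_map(image);
#else
  (void)image;  // parameter is unused in this configuration
  return "krbd support not compiled in";
#endif
}
```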
Zac Dover [Mon, 23 Jun 2025 12:50:03 +0000 (22:50 +1000)]
doc/rados: clarify "upmap_max_deviation"
Clarify the threshold set by "upmap_max_deviation" and add the
information about this configurable that is currently in
src/pybind/mgr/balancer/module.py to src/common/options/global.yaml.in,
so that it will be accessible by means of ".. confval::" declarations.
Let PGShardManager::invoke_on_each_shard_seq pass the local shard_services
instance instead of using an additional helper.
The downside of dropping the generic sharded_map_seq helper is that it is
able to support *any* (seastar::)sharded object. However, as shard_services
is the only user of it - directly using the local instance without the
helper seems easier to read.
Matan Breizman [Sun, 22 Jun 2025 10:10:10 +0000 (10:10 +0000)]
crimson/common/smp_helpers: fix reactor_map_seq
Copy f into reactor_map_seq which would be kept alive
due to this method being a coroutine. That way, we can ensure
the lambdas passed to each core that are capturing f by
reference would be safe.
Alternatively, we can also copy f by using its copy ctor and
pass a copy to each shard:
co_await crimson::submit_to(core, F(f))
However, avoiding the copy is possible here due to the sequential
traversal. Note that seastar's invoke_on_all does copy each callback to
every shard and runs the invocations in parallel.
The above fixes f's captures becoming invalid, which resulted
in segfaults on different shards.