Jon Bailey [Tue, 26 May 2026 12:31:47 +0000 (13:31 +0100)]
qa: test_rados_tool - change check on osd dump command to use json
Previously the test_rados_tool.sh test was dependent on flag ordering. This meant that adding a new flag after full_quota (such as split_reads or ec_optimizations) could break the teuthology test when it runs with these flags on. We prevent this by changing the condition to use json, so that we no longer depend on the order of flags in the default command line output.
This also adds a check that the pool name matches the pool we are working on, to avoid false positives when other pools exist.
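A minimal Python sketch of an order-independent check, for illustration only. The field names pool_name and flags_names and the sample document are assumptions about the shape of the json osd dump output, and pool_has_flag is a hypothetical helper, not the code in test_rados_tool.sh.

```python
import json

# Assumed shape of the json osd dump output; the values are made up.
SAMPLE_OSD_DUMP = json.dumps({
    "pools": [
        {"pool": 1, "pool_name": "other", "flags_names": "hashpspool"},
        {"pool": 2, "pool_name": "test_pool",
         "flags_names": "hashpspool,full_quota,ec_optimizations"},
    ]
})


def pool_has_flag(osd_dump_json, pool_name, flag):
    """Return True if the named pool carries the flag, whatever the flag order."""
    dump = json.loads(osd_dump_json)
    for pool in dump.get("pools", []):
        # Match on the pool name first so other pools cannot cause a false positive.
        if pool.get("pool_name") == pool_name:
            return flag in pool.get("flags_names", "").split(",")
    return False
```

Matching on the parsed flag list rather than on a substring of the text output is what removes the dependence on flag order.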
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Alex Ainscow [Thu, 16 Apr 2026 09:28:29 +0000 (10:28 +0100)]
test/librbd: Fix infinite recursion in MockTestMemRadosClient::do_mon_command
The do_mon_command() method was calling the mocked mon_command() which
has a default action that calls do_mon_command(), creating an infinite
recursion that caused a segmentation fault due to stack overflow.
Fixed by calling TestRadosClient::mon_command() (the base class
implementation) instead of the mocked version.
This resolves the segfault in unittest_librbd when running the
run-rbd-unit-tests.sh script.
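The shape of the fix can be shown with a small Python analogue. This is a sketch, not the C++ mock: the class and method names are borrowed from the commit message, and the bodies are assumptions made for illustration.

```python
class TestRadosClient:
    def mon_command(self, cmd):
        # Base class implementation that does the real work.
        return "base handled " + cmd


class MockTestMemRadosClient(TestRadosClient):
    def mon_command(self, cmd):
        # Stand-in for the mocked method, whose default action
        # delegates to do_mon_command().
        return self.do_mon_command(cmd)

    def do_mon_command(self, cmd):
        # Calling self.mon_command(cmd) here would bounce back into the
        # mock and recurse until the stack overflows. Naming the base
        # class explicitly breaks the cycle.
        return TestRadosClient.mon_command(self, cmd)
```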
Alex Ainscow [Thu, 5 Feb 2026 14:51:46 +0000 (14:51 +0000)]
test: Parameterize librados tests and add more split op tests.
Previously the librados tests were each restricted to a particular
configuration. Here we parameterize to execute against multiple
configurations of pool and fix the necessary create/clean up work.
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Fixed a bug where reference_sub_read was set to -1 when non-read operations
were processed before read operations, causing crashes when accessing
sub_reads[reference_sub_read].
Changes:
1. Added init_reference_sub_read() virtual method to initialize reference_sub_read
early, after _calc_target() populates the acting set but before processing
any operations.
2. ECSplitOp::init_reference_sub_read(): Sets reference_sub_read to the acting
index of the primary shard by performing a reverse lookup.
3. ReplicaSplitOp::init_reference_sub_read(): Counts valid OSDs and picks a
random acting index for load balancing.
4. Simplified init_read() in both classes by removing duplicate logic that
previously set reference_sub_read during read processing.
5. Added safety check in SplitOp::init() to ensure the reference_sub_read
entry exists in sub_reads before accessing it for non-read operations.
The fix ensures reference_sub_read is always set before any operations are
processed, preventing the crash that occurred when STAT, GETXATTR, or other
non-read operations appeared before READ operations in the operation list.
Tested with split_op_cxx.cc test suite: 33/35 tests pass (2 test expectation
issues unrelated to the core bug fix).
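The two initialisation strategies described in items 2 and 3 can be sketched in Python as follows. This is an illustrative analogue under assumptions: NO_OSD is a hypothetical sentinel for an empty acting slot, and the function names are not taken from the C++ code.

```python
import random

NO_OSD = -1  # hypothetical sentinel for an invalid acting-set entry


def ec_reference_index(acting, primary):
    """Reverse lookup: acting index of the primary, or -1 if absent."""
    for index, osd in enumerate(acting):
        if osd == primary:
            return index
    return -1


def replica_reference_index(acting, rng=random):
    """Pick a random acting index among the valid OSDs, or -1 if none."""
    valid = [index for index, osd in enumerate(acting) if osd != NO_OSD]
    if not valid:
        return -1
    return rng.choice(valid)
```

Running either function before any operation is examined means the index is already set when a non-read operation comes first.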
Matty Williams [Mon, 20 Oct 2025 15:46:43 +0000 (16:46 +0100)]
test/osd: Add balanced read flags to io_sequence exerciser
Added optional "-b"/"balanced" flag to the end of read/read2/read3 operations in interactive mode, to make them balanced reads.
Balanced read percentage is not used in interactive mode.
Add command line argument to specify percentage of read ops that should use the balanced reads flag. Default is 100%.
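One way a percentage like this could be applied per read op is sketched below. The function name and the use of a seeded random generator are assumptions for illustration, not the exerciser's implementation.

```python
import random


def use_balanced_read(percentage, rng):
    """Decide whether one read op should carry the balanced reads flag.

    percentage is an integer from 0 to 100; rng is a random.Random instance.
    """
    return rng.randrange(100) < percentage
```

With percentage set to 100, the stated default, every read is balanced; with 0, none are.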
Signed-off-by: Matty Williams <Matty.Williams@ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Alex Ainscow [Thu, 5 Feb 2026 14:45:04 +0000 (14:45 +0000)]
osdc: Refactor SplitOp
There are a large number of changes in this commit which were found through
development and testing of split ops.
I have split out all the objecter updates carefully, but since the split op
code is not currently used in production, I have not documented every change
and made significant refactors/rearrangements.
Alex Ainscow [Thu, 5 Feb 2026 14:00:38 +0000 (14:00 +0000)]
osdc: Do not recalculate target for split ops.
SplitOp calculates the target and sets the necessary target OSD itself. This
means that calc_target is not required again on first submit of the sub
read ops.
Alex Ainscow [Thu, 5 Feb 2026 13:34:58 +0000 (13:34 +0000)]
osdc: Extend op_post_submit to cope with successful Ops and move SplitOp decision point.
The locking situation in Objecter is complex. When ops are completed whether with success or otherwise, some locks are held. For split ops, this is particularly complex, since multiple sessions are involved in the completion.
To avoid all these deadlock issues, splitOps choose to schedule a completion task using asio::post, which can then take the appropriate locks before completing the IO, without risk of deadlock.
Usage of this will be added in a refactor of SplitOps.
In addition, split ops were previously calculated as soon as the op was submitted. Here we move the submit down to below the throttling and timeout code. This way we throttle/timeout the original op.
The timeout (op_cancel) will be handled in a later commit.
As part of this commit we also introduce a SplitOp session. This allows us to keep track of the parent ops while the child ops have been submitted and redrive the correct op(s) when necessary.
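The deferred-completion idea can be illustrated with a small Python analogue. This is only a sketch under assumptions: the Session class, its post() queue and run_posted() are hypothetical stand-ins for an asio::post style executor and are not Objecter code.

```python
import threading
from collections import deque


class Session:
    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like a plain mutex
        self.posted = deque()         # tasks waiting to run without the lock
        self.completed = []

    def post(self, task):
        # Queue the task instead of running it in the current call stack.
        self.posted.append(task)

    def handle_reply(self, op):
        with self.lock:
            # Calling self.complete(op) here would try to take self.lock
            # again and deadlock, so the completion is posted instead.
            self.post(lambda: self.complete(op))

    def complete(self, op):
        with self.lock:
            self.completed.append(op)

    def run_posted(self):
        # Runs outside handle_reply(), so no lock is held on entry.
        while self.posted:
            self.posted.popleft()()
```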
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Alex Ainscow [Thu, 5 Feb 2026 13:30:36 +0000 (13:30 +0000)]
osdc: Add split_op statistic
This statistic counts the number of ops which have been submitted using the
split op mechanism. It lets users gauge how useful the mechanism is, and
lets performance and development teams confirm that it is being used in
any given application.
Alex Ainscow [Thu, 5 Feb 2026 13:24:29 +0000 (13:24 +0000)]
osdc: Add config option to specify split-replica-read threshold
SplitOps will add support to split replica reads. This allows the user to
specify the threshold at which they are split.
The default is currently set at 256k. This is set at a point where we have
confidence that split ops will never reduce performance.
Development may choose to reduce this default based on performance measurements.
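A threshold test of this kind might look like the sketch below. The names are hypothetical, 256k is read as 256 * 1024 bytes, and treating a read exactly at the threshold as eligible is an assumption that the commit message does not confirm.

```python
DEFAULT_SPLIT_THRESHOLD = 256 * 1024  # assumed byte value of the 256k default


def should_split_replica_read(length, threshold=DEFAULT_SPLIT_THRESHOLD):
    """Return True if a replica read of this length is large enough to split.

    Assumption: reads at or above the threshold are split.
    """
    return length >= threshold
```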
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Callum James <callum.james@ibm.com>
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Alex Ainscow [Thu, 5 Feb 2026 13:16:25 +0000 (13:16 +0000)]
osdc: Add FORCE and FAIL_ON_EAGAIN flags.
Previously, the lower levels of Objecter would potentially redrive ops to
different OSDs when the map changed, or the OSD returns -EAGAIN. These
flags will be used to change this behaviour:
* FORCE_OSD means that the OSD is fixed and cannot be changed.
* FAIL_ON_EAGAIN means that rather than redriving, the op should be failed (to splitops)
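The intended effect of the two flags can be sketched in Python. The bit values, the OpFlag type and both helper functions are hypothetical and only model the behaviour described above.

```python
import errno
from enum import IntFlag


class OpFlag(IntFlag):
    FORCE_OSD = 0x1       # hypothetical bit value
    FAIL_ON_EAGAIN = 0x2  # hypothetical bit value


def on_reply(flags, result):
    """Return 'fail', 'redrive' or 'complete' for an OSD reply code."""
    if result == -errno.EAGAIN:
        # With FAIL_ON_EAGAIN the op is failed back to the caller
        # instead of being redriven.
        return "fail" if flags & OpFlag.FAIL_ON_EAGAIN else "redrive"
    return "complete"


def target_after_map_change(flags, current_osd, new_osd):
    """With FORCE_OSD the target stays fixed across map changes."""
    return current_osd if flags & OpFlag.FORCE_OSD else new_osd
```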
Alex Ainscow [Thu, 5 Feb 2026 15:00:03 +0000 (15:00 +0000)]
osd: Correct accounting and return codes for Direct Reads
We will never return -EAGAIN from ECBackend. If ECBackend returns EAGAIN, this causes the PrimaryLogPG code to drop the op. This is for historical reasons, but hard to refactor out.
Instead, the PrimaryLogPG code has been refactored to work out that EAGAIN is required much earlier in the processing, where EAGAIN will be returned to the client.
Here we also correct accounting in do_read and sparse_read so that we can correctly track the number of bytes read from direct reads.
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Alex Ainscow [Thu, 5 Feb 2026 13:14:07 +0000 (13:14 +0000)]
osd: Torn write protection for Direct Reads
It is possible for a direct read to query two separate shards and
get a different version of the object from each shard.
To solve this we add a get_internal_version op to tell us the version
of the object on that shard and submit that in the same transaction
as the read so we can ensure the versions are what we expect. If we
have a mismatch, we resubmit the read through the primary path.
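The version check and fallback can be sketched as below. This is an illustrative Python analogue: each shard reply is modelled as a hypothetical (version, data) pair, and read_via_primary is a caller-supplied stand-in for the primary path.

```python
def versions_consistent(shard_versions):
    """True when every shard reported the same object version."""
    return len(set(shard_versions)) <= 1


def direct_read(shard_replies, read_via_primary):
    """Combine shard data, or fall back to the primary on a torn read.

    shard_replies is a list of (version, data) pairs, one per shard.
    """
    versions = [version for version, _ in shard_replies]
    if not versions_consistent(versions):
        # A write landed between the shard reads; retry through the primary.
        return read_via_primary()
    return b"".join(data for _, data in shard_replies)
```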
Also includes a couple of spelling fixes and tidy-ups.
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Callum James <callum.james@ibm.com>
Alex Ainscow [Thu, 5 Feb 2026 15:00:03 +0000 (15:00 +0000)]
osd: SplitOp preparatory work in osd_types
- Add ec_data_shard_count interface
- Prevent sending of the split reads flag to tentacle OSDs
- Add ec data shard count and coding shard count into the pool
- Encode shard mappings into the pool, for use by Direct reads
Alex Ainscow [Thu, 5 Feb 2026 13:25:20 +0000 (13:25 +0000)]
mon: Functionality for enabling and upgrading ec_direct_reads
When a cluster upgrades to umbrella, we will enable direct reads for any pool which is using ec optimizations.
We also add k and m to the pg_pool_t structure to allow more efficient parsing of the k and m values of EC rather than string parsing of the profile.
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Alex Ainscow [Thu, 5 Feb 2026 13:19:05 +0000 (13:19 +0000)]
mon: Add mechanism for user to add/clear pool flags.
Previously, every time we had a new experimental feature, switched with a
pool flag, we needed to add a bunch of boilerplate. Given that end users
should not be using these features, adding all of this user-visible
behaviour is not desirable.
This adds a single mechanism to specify a flag set by number. These magic
numbers can be used during development and then either removed, or
promoted to user-friendly flags.
7f739adae2 dropped the last log call from get_segment_manager(), after
which `LOG_PREFIX(SegmentManager::get_segment_manager)` and
`SET_SUBSYS(seastore_device)` had no remaining users under `HAVE_ZNS`,
generating:
Afreen Misbah [Wed, 27 May 2026 00:07:38 +0000 (05:37 +0530)]
mgr/dashboard: fix nested shell quoting in cephadm e2e start-cluster
with_libvirt wraps commands in sg libvirt -c "$1", adding an extra
shell layer. Nested double quotes inside the outer double-quoted
string caused the argument to be split — with_libvirt received a
truncated $1, producing "Unterminated quoted string" on the remote
shell.
Drop the unnecessary inner double quotes around cephadm shell
arguments since cephadm shell accepts the command as separate args.
Use single quotes for the grep pattern inside the double-quoted
string so it survives the sg subshell.
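The word-splitting effect of nested double quotes can be reproduced with Python's shlex module, which follows POSIX shell quoting rules. The command strings below are hypothetical and much simpler than the real start-cluster script; they show only how the argument after -c is split, not the exact error seen on the remote shell.

```python
import shlex

# Hypothetical command lines shaped like the argument handed to sg -c.
NESTED_DOUBLE = 'sg libvirt -c "echo "a b""'
INNER_SINGLE = "sg libvirt -c \"echo 'a b'\""


def command_arguments(line):
    """Return the words that follow -c after POSIX shell word splitting."""
    words = shlex.split(line)
    return words[words.index("-c") + 1:]
```

With nested double quotes the inner quote closes the outer string early, so -c receives a truncated first word and a stray second one. With single quotes inside, the whole command arrives as one argument.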
Casey Bodley [Tue, 26 May 2026 16:03:48 +0000 (12:03 -0400)]
rgw/s3control: skip account id check for admin users
allow access to admin users that don't belong to the requested account.
this is also necessary for multisite, where requests are forwarded to
the metadata master as the multisite system user instead of the original
requester
osd_scrub_queued_snaptrims_limit, introduced in PR#68737,
blocks the initiation of non-urgent scrubs on OSDs that
are overloaded with snap-trim operations.
ShreeJejurikar [Wed, 20 May 2026 07:18:03 +0000 (12:48 +0530)]
qa/rgw/bucket-logging: configure STS for assume-role test
Set rgw sts key and enable rgw s3 auth use sts, both needed by
test_bucket_logging_requester_assumed_role. Mirrors the existing
settings in qa/suites/rgw/verify/overrides.yaml.
Xuehan Xu [Fri, 15 May 2026 09:10:04 +0000 (17:10 +0800)]
crimson/os/seastore: also update the mappings copied by client
transactions when committing background rewriting transactions
With the 128-bit laddr key layout in place, SeaStore::rename would
involve copying mappings. These mappings must also be updated when
the logical extents they point to are rewritten.
This commit introduces performance counters for individual Ceph mgr modules.
These counters allow monitoring module behavior, debugging latency issues,
and identifying performance bottlenecks, all without modifying the modules themselves.
The following counters are now exposed under:
> ceph daemon mgr.<id> perf dump
Example structure:
"mgr_module_<module_name>": {
"notify_avg_usec": { <- Average time spent handling notify events
"avgcount": 0,
"sum": 0
},
"cmd_avg_usec": { <- Average time spent processing CLI/admin commands
"avgcount": 0,
"sum": 0
},
"serve_avg_usec": { <- Average time spent in module serve loop (if applicable)
"avgcount": 0,
"sum": 0
},
"alive": 1, <- Module is alive (1 = running, 0 = exited)
"cpu_usage": 0, <- CPU usage in percent
"mem_rss_change": 0, <- Memory RSS change in bytes
"mem_rss_current": 490737664 <- Memory RSS current in bytes
}
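A consumer of these counters could derive the average from the avgcount and sum pair, as in the Python sketch below. The sample values are made up, mgr_module_example is a hypothetical module name, and reading sum as microseconds is an assumption based on the _usec suffix.

```python
import json

# Made-up sample in the shape shown above, without the annotations.
SAMPLE = json.loads("""
{
  "mgr_module_example": {
    "notify_avg_usec": {"avgcount": 4, "sum": 100},
    "cmd_avg_usec": {"avgcount": 0, "sum": 0},
    "alive": 1
  }
}
""")


def average(counter):
    """Average of an avgcount/sum counter, or 0.0 when nothing was recorded."""
    count = counter["avgcount"]
    return counter["sum"] / count if count else 0.0
```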
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
Conflicts:
src/mgr/ActivePyModules.cc - finisher.queue changed by 63859, adding py_module to the parameter list
src/mgr/PyModuleRegistry.cc - check_all_modules_started added by 63859
ceph-volume: OSD mapper lifecycle (LVM + raw) for activate
This adds small helpers so activate can consistently bring the OSD device
stack online (LVM lvchange, optional mapper open) and tear it down again,
with refresh in between. Same idea for the raw path. Crypto is handled
inside that flow when the OSD is encrypted.
Kefu Chai [Sun, 24 May 2026 08:25:46 +0000 (16:25 +0800)]
rgw: bump Apache Arrow submodule from 17.0.0 to 19.0.1
When WITH_SYSTEM_ARROW is false, Ceph builds Arrow from the bundled
src/apache submodule. Our CI uses ubuntu:jammy as the base image, which
does not package libarrow-dev, so the bundled path is always taken there.
Arrow 17.0.0 vendors a copy of Thrift whose download URLs are no longer
reachable, breaking CI builds that try to fetch them at configure time.
Bump arrow submodule to 19.0.1, the latest Arrow release that:
- builds successfully on ubuntu:jammy, and
- requires only CMake 3.22 (the version shipped by ubuntu:jammy)
See also
CMake version shipped by ubuntu:jammy
- https://packages.ubuntu.com/jammy/cmake
Kefu Chai [Fri, 22 May 2026 11:01:17 +0000 (19:01 +0800)]
crimson/scrub: fix assert in PGScrubber::release_range() on interval change
when an interval change occurs while ScrubReserveRange is still
waiting to acquire background_process_lock, ChunkState::exit()
calls release_range() but blocked is not yet set. this triggers
ceph_assert(blocked) in release_range().
fix by checking if blocked is set before asserting. if blocked is
not set, the range was never reserved, so release_range() is a
no-op. ScrubReserveRange's finally block handles lock cleanup in
this case.
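the guard can be sketched with a small Python analogue. the ScrubRange class and its fields are hypothetical and only model the "release without a reservation is a no-op" behaviour, not the crimson scrubber itself.

```python
class ScrubRange:
    def __init__(self):
        self.blocked = None  # set only once a range has been reserved
        self.releases = 0

    def reserve(self, token):
        self.blocked = token

    def release_range(self):
        """Release the reserved range; return False if nothing was reserved."""
        if self.blocked is None:
            # Interval change arrived before the reservation completed:
            # there is nothing to release, so do not assert.
            return False
        self.blocked = None
        self.releases += 1
        return True
```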