Bill Scales [Wed, 1 Oct 2025 14:52:23 +0000 (15:52 +0100)]
osd: Optimized EC missing list not updated on recovering shard (OLD FIX)
Shards that are recovering (last_complete != last_update) are using
pwlc to advance last_update for writes that did not affect the shard.
However simply incrementing last_update means that the primary doesn't
send the shard log entries that it missed and consequently it cannot
update its missing list.
If the shard is already missing object X at version V1 and there was
a partial write at V2 that did not update the shard, it does not need
to retain the log entry, but it does need to update the missing list
to say it needs V2 rather than V1. This ensures all shards report
a need for an object at the same version and avoids an assert in
MissingLoc::add_active_missing when the primary is trying to
combine the missing lists from all the shards to work out what has
to be recovered.
The fix is to avoid applying pwlc when last_complete != last_update.
This forces the primary to send the log to the recovering shard,
which can then update its missing list (and discard the log
entries, as they are partial writes).
Fixes: https://tracker.ceph.com/issues/73249
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Fri, 17 Oct 2025 17:33:09 +0000 (18:33 +0100)]
osdc: Forward cls call to version.read to primary
RGW uses a class method on erasure coded pools to read a version
number attribute. These are associated with large reads which
would benefit from direct reads.
Alex Ainscow [Fri, 17 Oct 2025 12:22:03 +0000 (13:22 +0100)]
osdc: outbl concatenates all output buffers.
The functionality defined by the librados tests is that if an outbl and
out_bl buffers are provided by the client and there are multiple ops,
then the outbl is a concatenation of all out_bl buffers. SplitOps needs
to honour this too.
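As a rough illustration of the behaviour described above, the following Python model joins the per-op output buffers into one combined buffer in op order. The function name and the use of `bytes` objects are assumptions made for this sketch; it is not the Objecter code.

```python
def concat_outbl(per_op_out_bl):
    """Model only: the combined outbl is every per-op out_bl joined in op order."""
    return b"".join(per_op_out_bl)


# Three ops, the second of which produced no output.
combined = concat_outbl([b"stat", b"", b"data"])
```

An op with an empty out_bl contributes nothing to the combined buffer in this model.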
Alex Ainscow [Fri, 17 Oct 2025 08:26:34 +0000 (09:26 +0100)]
osdc: SplitOps handle client-provided buffers without replacing them.
Some clients provide buffers to be written to by Objecter through a
(legacy?) field in the op called "outbl" (not to be confused with out_bl).
If those clients provide a buffer of exactly the right size, then the
expectation is that the buffer gets written to without being moved. This change
makes SplitOps mimic the read completion behaviour from Objecter.
The memcpy here has a significant performance impact; however, it is thought
that most clients will not use this interface.
There are many comments in this code to explain the reasons, as it is quite
a surprising mechanism.
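A minimal Python model of the rule described above, assuming the only condition is an exact size match: a client buffer of exactly the reply's size is filled in place, and anything else results in a different buffer being handed back. The function name, the `bytearray` representation and the fallback branch are assumptions of this sketch, not the SplitOps implementation.

```python
def complete_read(client_buf, reply):
    """Return the buffer the client ends up with after a read completes."""
    if client_buf is not None and len(client_buf) == len(reply):
        # Exact size match: copy into the caller's storage so the buffer
        # is not moved. This copy models the memcpy cost.
        client_buf[:] = reply
        return client_buf
    # Assumed fallback: hand back a new buffer holding the reply.
    return bytearray(reply)


buf = bytearray(4)
same = complete_read(buf, b"abcd")               # filled in place
other = complete_read(bytearray(2), b"abcd")     # wrong size, replaced
```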
Alex Ainscow [Fri, 17 Oct 2025 08:11:48 +0000 (09:11 +0100)]
osdc: SplitOps tolerate object underruns
The OSDs can correctly underrun a read if the object is smaller than the
requested read size. The librados test suite detected that the split ops
code did not tolerate such an underrun. This is now fixed.
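The underrun case can be sketched with a small Python model, assuming a read is split into chunk-aligned sub-reads that are joined in order. The function name, the chunking scheme and the in-memory `bytes` object standing in for the OSD are all assumptions of this sketch.

```python
def split_read(obj, offset, length, chunk):
    """Read [offset, offset + length) as chunk-aligned sub-reads and join them."""
    out = bytearray()
    pos, end = offset, offset + length
    while pos < end:
        # Each sub-read stops at the next chunk boundary or at the end
        # of the requested range, whichever comes first.
        want = min(chunk - pos % chunk, end - pos)
        got = obj[pos:pos + want]  # slicing past the end returns fewer bytes
        out += got
        if len(got) < want:
            # Underrun: the object ended early, so stop without error.
            break
        pos += want
    return bytes(out)


# The object is 10 bytes long but the caller asks for 100 bytes from offset 2.
data = split_read(b"0123456789", offset=2, length=100, chunk=4)
```

The caller receives the bytes that exist rather than an error, which is the tolerance the commit describes.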
Alex Ainscow [Fri, 17 Oct 2025 08:01:38 +0000 (09:01 +0100)]
test/librados: Extend librados tests to cover both fast EC and split ops.
The librados tests have special handling for EC, but they do not attempt to test
either fast EC or split ops. This change upgrades the tests to be parameterized
tests and constructs the necessary pools and boilerplate.
There is also a minor tweak to a stat test to allow for longer names which
get generated in my containerised test environment.
Alex Ainscow [Fri, 17 Oct 2025 07:52:27 +0000 (08:52 +0100)]
osd: Add mon command to override pool flags.
This is intended as a developer tool and as such requires the
yes_i_really_mean_it flag. The idea is that we can add new
experimental pool features, with a generic way of turning
the features on, without polluting the parameters in the command
yet further.
The command is perhaps a little messy:
ceph osd pool <pool> set set_pool_flags <int>
ceph osd pool <pool> set unset_pool_flags <int>
I also decided against an API to show the pool flags, as this
would be hard to read without some complex decode, and the
functionality to observe pool flags in logs with high debug
levels already exists.
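A sketch of the set and unset semantics, assuming the <int> argument is treated as a bitmask that is OR-ed into or cleared from the existing pool flags. The function names and example values are hypothetical; the commit text does not spell out the exact semantics.

```python
def set_pool_flags(flags, mask):
    """Turn on every bit in mask (assumed meaning of set_pool_flags)."""
    return flags | mask


def unset_pool_flags(flags, mask):
    """Turn off every bit in mask (assumed meaning of unset_pool_flags)."""
    return flags & ~mask


flags = set_pool_flags(0b0001, 0b0110)         # existing bit kept, two bits added
flags_after = unset_pool_flags(flags, 0b0100)  # one of the added bits cleared
```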
Alex Ainscow [Fri, 3 Oct 2025 12:35:48 +0000 (13:35 +0100)]
osd: Relax missing entry assert for partial writes.
This assert was relaxed to allow for missing partial write logs. However
it needs to be relaxed further to cope with the missing list not containing
some objects with later log entries.
Bill Scales [Wed, 1 Oct 2025 14:52:23 +0000 (15:52 +0100)]
osd: Optimized EC missing list not updated on recovering shard
Shards that are recovering (last_complete != last_update) that
skip transactions and log entries because of partial writes are
using pwlc to advance last_update. However simply incrementing
last_update is not sufficient - there are scenarios where the
needed version of a missing object has to be updated.
If the shard is already missing object X at version V1 and there was
a partial write at V2 that did not update the shard, it does not need
to retain the log entry, but it does need to update the missing list
to say it needs V2 rather than V1. This ensures all shards report
a need for an object at the same version and avoids an assert in
MissingLoc::add_active_missing when the primary is trying to
combine the missing lists from all the shards to work out what has
to be recovered. Avoiding applying pwlc during the early phase
of the peering process ensures the missing list gets updated.
However if a shard is not missing object X and there was a partial
write at V2 that did not update the shard then at the end of peering
it is still necessary to advance last_update by applying pwlc. This
ensures that in later peering cycles the code does not change its
mind and think the shard is now missing object X.
The fix is to be more sophisticated about when pwlc can be used
to advance last_update for a recovering shard. The code now
passes in a parameter indicating whether we are in the early
(pre activate) or later phase of peering. This also means that
additional calls to apply_pwlc are needed when peering gets to
activating and is searching for missing to make updates that were
not made earlier.
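The two rules above can be modelled in a few lines of Python. This is a simplified sketch, not the OSD code: versions are plain integers rather than eversion_t values, the missing list is a dict of object to needed version, and both function names are hypothetical.

```python
def may_apply_pwlc(last_complete, last_update, early_phase):
    """A recovering shard does not apply pwlc in the early (pre-activate) phase."""
    recovering = last_complete != last_update
    return not (recovering and early_phase)


def process_skipped_entry(missing, obj, version):
    """Update the missing list for a partial-write log entry the shard skipped."""
    if obj in missing and missing[obj] < version:
        # Already missing at V1: now record that V2 is needed instead.
        missing[obj] = version
    # The entry itself is not retained. A shard that is not missing the
    # object gains no new missing entry.
    return missing


missing = process_skipped_entry({"X": 1}, "X", 2)
```

In this model a recovering shard first receives the log and fixes up its missing list, and only in the later phase is pwlc allowed to advance last_update.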
Fixes: https://tracker.ceph.com/issues/73249
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Tue, 14 Oct 2025 08:24:56 +0000 (09:24 +0100)]
osdc: Add SplitOp capability to Objecter
This will provide the ability for Objecter to split up
certain ops and distribute them to the OSDs directly if
that provides a performance advantage.
This is experimental code and is switched off unless the
magic pool flags are enabled. These magic pool flags were
pushed in an earlier commit in the same PR.
Alex Ainscow [Fri, 3 Oct 2025 14:11:00 +0000 (15:11 +0100)]
osdc: Remove unused con parameter from Objecter::_calc_target()
This parameter is not used by the _calc_target code. It is being
removed just to clean up the code, as we are making some changes
to _calc_target in later stages of the split io PR.
Alex Ainscow [Fri, 3 Oct 2025 13:55:56 +0000 (14:55 +0100)]
osdc: Interface to submit IO with ASIO Post.
For direct read failures, the locking is such that we cannot
immediately send a new IO without deadlocking. This new interface
allows an op to be sent as an asio post.
Alex Ainscow [Fri, 3 Oct 2025 13:39:03 +0000 (14:39 +0100)]
osd: Implement read sync for EC direct reads
When doing a direct read in EC, only a single OSD is involved by
definition. As such we can do the more performant sync read, rather
than an async read.
Alex Ainscow [Fri, 3 Oct 2025 13:15:32 +0000 (14:15 +0100)]
osd: Generalise can_serve_replica_read for consumption by EC.
The can_serve_replica_read() function is called by a replica to determine whether there are
any uncommitted writes. If such writes exist, then the system will reject the IO to avoid
the risk of reading data from a write which may yet be rolled back.
The same code is going to be useful for EC direct reads.
Alex Ainscow [Fri, 3 Oct 2025 12:53:33 +0000 (13:53 +0100)]
osd: Replace unused EC offset translation function with useful one.
The old chunk_aligned_shard_offset_to_ro_offset was not only unused, it
didn't actually have the correct logic. We replace it here with a similar
but more useful function that will be used in sparse reads for EC.
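For readers unfamiliar with the translation, here is a sketch of the conventional striping arithmetic for mapping an offset within one data shard to an offset in the whole object (the "ro" offset). It assumes k data shards, a fixed chunk size, and that shard i holds the i-th chunk of every stripe. The function name and signature are hypothetical and are not those of the replacement function in this commit.

```python
def shard_offset_to_ro_offset(shard, shard_offset, k, chunk_size):
    """Map (shard, offset within shard) to an offset in the whole object."""
    stripe = shard_offset // chunk_size  # which stripe the offset falls in
    within = shard_offset % chunk_size   # position inside that chunk
    # Each stripe holds k chunks laid out in shard order.
    return (stripe * k + shard) * chunk_size + within


# With k=4 and 4096-byte chunks, offset 4100 on shard 2 is in stripe 1.
ro = shard_offset_to_ro_offset(shard=2, shard_offset=4100, k=4, chunk_size=4096)
```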
Alex Ainscow [Fri, 3 Oct 2025 12:49:58 +0000 (13:49 +0100)]
osd: Introduce pool flag for "split IO" and Plugin flag for "direct read"
These flags will currently behave as follows:
1. The pool flag is never set, unless by a user with the osd_pool_default_flags
config option.
2. The pool flag will be removed for EC pools where the plugin does not support
direct reads.
3. Replica pools will never remove the flag.
The intention is to eventually invert this logic and allow split IOs upon
upgrade to Umbrella in this same function.
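The three rules above can be sketched as a small Python function. The bit value, the constant name and the function signature are hypothetical; only the decision logic is taken from the list.

```python
SPLIT_IO = 0x1  # hypothetical bit value for the "split IO" pool flag


def effective_flags(default_flags, is_ec, plugin_supports_direct_read):
    """Apply the three rules to the flags a new pool starts with."""
    # Rule 1: the flag is only present if the user put it in the defaults.
    flags = default_flags
    # Rule 2: EC pools lose the flag when the plugin lacks direct reads.
    if is_ec and not plugin_supports_direct_read:
        flags &= ~SPLIT_IO
    # Rule 3: replica pools never remove the flag.
    return flags


ec_unsupported = effective_flags(SPLIT_IO, is_ec=True, plugin_supports_direct_read=False)
replica = effective_flags(SPLIT_IO, is_ec=False, plugin_supports_direct_read=False)
```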
Matan Breizman [Sun, 12 Oct 2025 12:29:10 +0000 (12:29 +0000)]
doc: Introduce Crimson User Guide
All of Crimson's documentation was included under `doc/dev`.
As we gradually lean towards more user-facing documentation such as
deployment, usage, packaging and so on, we should have a separate guide
for non-dev docs.
Nizamudeen A [Thu, 16 Oct 2025 05:35:32 +0000 (11:05 +0530)]
mgr/dashboard: fix generic form submit validator for inline edit
currently the validation error is applied generically to the
parent formgroup, which puts the whole form into an error state when
one of the inline edits fails validation. change that to apply the
error to a single control.
Fixes: https://tracker.ceph.com/issues/73558
Signed-off-by: Nizamudeen A <nia@redhat.com>
Casey Bodley [Wed, 15 Oct 2025 21:08:48 +0000 (17:08 -0400)]
cmake: BuildArrow.cmake uses bundled thrift if system version < 0.17
the bump to arrow 17.0.0 broke the ubuntu jammy builds with:
In file included from /usr/include/thrift/transport/TTransport.h:25,
from /usr/include/thrift/protocol/TProtocol.h:28,
from /usr/include/thrift/TBase.h:24,
from /build/ceph-20.3.0-3599-g3d863d32/src/arrow/cpp/src/generated/parquet_types.h:14,
from /build/ceph-20.3.0-3599-g3d863d32/src/arrow/cpp/src/generated/parquet_constants.h:10,
from /build/ceph-20.3.0-3599-g3d863d32/src/arrow/cpp/src/generated/parquet_constants.cpp:7:
/usr/include/thrift/transport/TTransportException.h:23:10: fatal error: boost/numeric/conversion/cast.hpp: No such file or directory
23 | #include <boost/numeric/conversion/cast.hpp>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
when comparing the gcc command line with arrow-15.0.0, the following argument
is no longer present:
> -isystem /build/ceph-20.3.0-3402-gb2db4947/obj-x86_64-linux-gnu/boost/include
arrow 17.0.0 seems to assume that thrift doesn't depend on boost anymore. a
comment in https://github.com/apache/arrow/issues/32266 claims that
> we don't need Boost with system Thrift 0.17.0 or later
but our jammy builds are stuck with libthrift-0.16.0. to reenable jammy builds,
instruct Arrow's cmake to use its bundled thrift dependency if our system thrift
version is < 0.17.0
Kefu Chai [Wed, 15 Oct 2025 07:46:26 +0000 (15:46 +0800)]
debian/control: Add libxsimd-dev build dependency for vendored Arrow
In commit e8460cbd, we introduced the "pkg.ceph.arrow" build profile to
support building with system Arrow packages. However, neither Debian nor
Ubuntu currently ships Arrow packages.
Since WITH_RADOSGW_SELECT_PARQUET is always enabled in debian/rules,
Arrow support is required for all builds. When the pkg.ceph.arrow profile
is not selected, the build uses vendored Arrow. With the recent change to
use AUTO mode for xsimd detection, Arrow will attempt to find system xsimd
>= 9.0.1. Adding libxsimd-dev as a build dependency ensures it's available
for Arrow to detect and use, reducing build time on supported distributions.
On distributions with insufficient xsimd versions (< 9.0.1), Arrow will
automatically fall back to its bundled version.
Kefu Chai [Wed, 15 Oct 2025 07:46:22 +0000 (15:46 +0800)]
cmake/BuildArrow: Use AUTO mode for xsimd dependency detection
Arrow requires xsimd >= 9.0.1 according to arrow/cpp/thirdparty/versions.txt.
Previously, we unconditionally set -Dxsimd_SOURCE=BUNDLED, forcing the use
of Arrow's vendored xsimd regardless of system package availability.
This commit changes to -Dxsimd_SOURCE=AUTO, which allows Arrow's
resolve_dependency mechanism to automatically:
1. Try to find system xsimd package
2. Check if version >= 9.0.1
3. Use system version if found and sufficient
4. Fall back to bundled version otherwise
This reduces build time and dependencies on systems with sufficient xsimd,
while maintaining compatibility with older distributions.
Distribution availability:
- Ubuntu Noble (24.04): libxsimd-dev 12.1.1 (✓ will use system)
- Ubuntu Jammy (22.04): libxsimd-dev 7.6.0 (✗ will use bundled)
- Debian Trixie (13): libxsimd-dev 13.2.0 (✓ will use system)
- CentOS Stream 9: xsimd-devel 7.4.9 (✗ will use bundled)
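The AUTO decision for the distributions listed above can be checked with a short Python model. It only mirrors the version comparison described in this commit message; the function names are hypothetical and it does not reproduce Arrow's actual resolve_dependency logic.

```python
REQUIRED = (9, 0, 1)  # minimum xsimd version, per arrow/cpp/thirdparty/versions.txt


def parse_version(text):
    """Turn a dotted version string such as '12.1.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def xsimd_source(system_version):
    """Return 'system' if the installed xsimd is new enough, else 'bundled'."""
    if system_version is None:  # no system package found
        return "bundled"
    return "system" if parse_version(system_version) >= REQUIRED else "bundled"


distros = {
    "Ubuntu Noble": "12.1.1",
    "Ubuntu Jammy": "7.6.0",
    "Debian Trixie": "13.2.0",
    "CentOS Stream 9": "7.4.9",
}
results = {name: xsimd_source(version) for name, version in distros.items()}
```

Comparing tuples of integers avoids the string-comparison trap where "12.1.1" would sort before "9.0.1".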