Ronen Friedman [Thu, 20 Nov 2025 13:54:20 +0000 (07:54 -0600)]
osd/scrub: do not attempt to read past the end of an object
When performing deep scrubs, the scrubber reads object data
in strides. Existing code uses a short read to detect the end
of the object (so if the object size is a multiple of the
stride, an extra read is performed, which returns 0 bytes).
The proposed change is to avoid such extra read attempts
by using our knowledge of the object size.
Also, some minor code cleanups in the relevant function.
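The difference between the two end-of-object strategies can be illustrated with a minimal sketch. This is not the scrubber code: the function names and the in-memory read model are hypothetical, and only the read-counting behaviour described above is modelled.

```python
def stride_reads_short_read(object_size: int, stride: int) -> int:
    """Count reads when the end of the object is detected by a short read."""
    reads = 0
    offset = 0
    while True:
        # Bytes a read at this offset would return (0 once past the end).
        got = max(0, min(stride, object_size - offset))
        reads += 1
        offset += got
        if got < stride:
            # A short read (possibly 0 bytes) marks the end of the object.
            return reads


def stride_reads_known_size(object_size: int, stride: int) -> int:
    """Count reads when the loop stops at the known object size."""
    reads = 0
    offset = 0
    while offset < object_size:
        length = min(stride, object_size - offset)
        reads += 1
        offset += length
    return reads
```

With an object size that is an exact multiple of the stride, the first function issues one more read than the second; otherwise the two counts match.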
Alex Ainscow [Fri, 28 Nov 2025 14:33:13 +0000 (14:33 +0000)]
osd: Perform shard look up correctly in partial EC writes
Plugins are permitted to provide a mapping to change the order in which OSDs
are used. In practice only LRC does this and it is not currently enabled
with optimisations, so this is a theoretical bug.
The bug here was that the "first" shard was assumed to be shard_id_t(0). However,
this is not true for LRC.
Fixes: https://tracker.ceph.com/issues/74016
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:06:22 +0000 (10:06 +0000)]
qa: Reduce number of osd threads when using compression
Smithi nodes used by teuthology tests have 8 CPU cores and typically run
4 OSD processes. When bluestore software compression is enabled the size
of the OSD thread pool needs to be reduced to 2 threads per OSD because
these threads can easily use 100% of a core. This avoids excessive
context switching, which leads to OSD threads timing out,
which in turn causes the OSD to drop heartbeat pings and the monitor to
temporarily mark it down. In extreme cases this can lead to PGs getting
stuck in repeated loops of peering until the teuthology test times out.
Context switches happen opportunistically at the end of system calls,
so functions with lots of logging are some of the worst affected.
Fixes: https://tracker.ceph.com/issues/72879
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:38:44 +0000 (10:38 +0000)]
osd: Restrict logging in MissingLoc::add_source_info
add_source_info can generate an excessive amount of logging
if a PG has thousands of missing objects. When a system is
under load and threads are repeatedly context switching this
can lead to timeouts (tests showed this function taking up
to 10 seconds to execute with 99% of that time being in
logging calls where the thread was being pre-empted).
Stopping logging after the function has been running for
more than 0.5 seconds strikes a balance between providing
sufficient information to debug problems and providing
more stability when a system is heavily loaded.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
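A time-bounded logging loop of the kind described for add_source_info might look like the sketch below. Everything except the 0.5 second threshold is an assumption: the function name, the injectable clock, and the one-off "suppressing" notice are hypothetical and are not taken from the OSD code.

```python
import time

LOG_BUDGET_SECONDS = 0.5  # threshold quoted in the commit message


def add_source_info_sketch(missing_objects, log, clock=time.monotonic):
    """Process every object, but stop per-object logging after the budget."""
    start = clock()
    logging_enabled = True
    processed = 0
    for obj in missing_objects:
        if logging_enabled and clock() - start > LOG_BUDGET_SECONDS:
            # Hypothetical one-off notice so readers of the log know why
            # the per-object lines stop.
            log("add_source_info: suppressing further per-object logging")
            logging_enabled = False
        if logging_enabled:
            log(f"add_source_info: considering {obj}")
        # The real work continues regardless of whether logging is enabled.
        processed += 1
    return processed
```

The work itself is never skipped; only the log output is cut off once the function has been running for longer than the budget.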
Bill Scales [Fri, 21 Nov 2025 10:25:48 +0000 (10:25 +0000)]
osd: Increase log level for listing missing list
Logging the entire contents of a missing list can generate a
1M character log line when there are 8000 missing objects in a
PG. Other places in the code logging the missing list use debug
level 25 which is not enabled by default in teuthology tests.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Mon, 24 Nov 2025 09:18:21 +0000 (09:18 +0000)]
osd: reset_tp_timeout should reset timeout for all shards
ShardedThreadPools are only used by the classic OSD process
which can have more than one thread for the same shard. Each
thread has a heartbeat timeout used to detect stalled threads.
Some code that is known to take a long time makes calls to
reset_tp_timeout to reset this timeout. However for sharded
pools this can be ineffective because it is common for threads
for the same shard to use the same locks (e.g. PG Lock) and
therefore if thread A is taking a long time and resetting
its timeout while holding a lock, thread B for the same shard
is liable to be waiting for the same lock, will not be
resetting its timeout and can be timed out.
Debug for issue 72879 showed heartbeat timeouts occurring at
the same time for both shards. An attempt to fix the problem
by calling reset_tp_timeout for the slow thread still showed
the other threads for the shard timing out waiting for the PG
lock that was held by the slow thread. Looking at the OSD code,
in most places where reset_tp_timeout is called the thread is
holding the PG lock.
This commit moves the concept of shard_index from OSD into
ShardedThreadPool and modifies reset_tp_timeout so that it resets
the timeout for all threads for the same shard.
Some code calls reset_tp_timeout from inside loops that can take
a long time without consideration for how long the thread has
actually been running for. There is a risk that this type of
call could repeatedly reset the timeout for another shard which
is genuinely stuck and hence defeat the heartbeat checks. To
prevent this, reset_tp_timeout is modified to be a NOP unless
the thread has been processing the current work item for more
than 0.5 seconds. Therefore threads have to be slow but making
forward progress to be able to reset the timeout.
Fixes: https://tracker.ceph.com/issues/72879
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
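The two rules described in this commit (reset every thread of the caller's shard, and do nothing unless the caller has been on its current work item for more than 0.5 seconds) can be sketched as follows. The class and field names are hypothetical and the heartbeat is reduced to a single deadline per worker; this is an illustration under those assumptions, not the ShardedThreadPool implementation.

```python
import time
from dataclasses import dataclass

MIN_WORK_SECONDS = 0.5  # threshold quoted in the commit message


@dataclass
class WorkerSketch:
    shard_index: int
    deadline: float = 0.0      # time at which the heartbeat would expire
    work_started: float = 0.0  # when the current work item began


class ShardedPoolSketch:
    def __init__(self, workers, grace, clock=time.monotonic):
        self.workers = workers
        self.grace = grace
        self.clock = clock

    def reset_tp_timeout(self, caller: WorkerSketch) -> bool:
        now = self.clock()
        if now - caller.work_started <= MIN_WORK_SECONDS:
            # NOP: the caller has not been running long enough to justify
            # extending anyone's heartbeat.
            return False
        for worker in self.workers:
            # Extend the deadline for every thread serving the same shard,
            # since siblings may be blocked on a lock the caller holds.
            if worker.shard_index == caller.shard_index:
                worker.deadline = now + self.grace
        return True
```

Threads belonging to other shards are left alone, so a genuinely stuck shard still trips its own heartbeat check.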
Alex Ainscow [Tue, 14 Oct 2025 08:24:56 +0000 (09:24 +0100)]
osdc: Add SplitOp capability to Objecter
This will provide the ability for Objecter to split up
certain ops and distribute them to the OSDs directly if
that provides a performance advantage.
This is experimental code and is switched off unless the
magic pool flags are enabled. These magic pool flags were
pushed in an earlier commit in the same PR.
Alex Ainscow [Fri, 3 Oct 2025 14:11:00 +0000 (15:11 +0100)]
osdc: Remove unused con parameter from Objecter::_calc_target()
This parameter is not used by the _calc_target code. It is being
removed just to clean up the code, as we are making some changes
to _calc_target in later stages of the split io PR.
Alex Ainscow [Fri, 3 Oct 2025 13:55:56 +0000 (14:55 +0100)]
osdc: Interface to submit IO with ASIO Post.
For direct read failures, the locking is such that we cannot
immediately send a new IO without deadlocking. This new interface
allows an op to be sent as an asio post.
Alex Ainscow [Fri, 3 Oct 2025 13:39:03 +0000 (14:39 +0100)]
osd: Implement sync reads and sparse reads for EC for direct reads
Sparse reads for EC are simple to implement, as the code is essentially
identical to that of replica, with some address translation.
When doing a direct read in EC, only a single OSD is involved.
As such we can do the more performant sync read rather than
an async read.
Alex Ainscow [Fri, 3 Oct 2025 13:15:32 +0000 (14:15 +0100)]
osd: Generalise can_serve_replica_read for consumption by EC.
The can_serve_replica_read() function is called by replica to determine whether there are
any uncommitted writes. If such writes exist, then the system will reject the IO to avoid
the risk of reading data from a write which may yet be rolled back.
The same code is going to be useful for EC direct reads.
Alex Ainscow [Fri, 3 Oct 2025 12:53:33 +0000 (13:53 +0100)]
osd: Replace unused EC offset translation function with useful one.
The old chunk_aligned_shard_offset_to_ro_offset was not only unused, it
didn't actually have the correct logic. We replace it here with a similar
but more useful function that will be used in sparse reads for EC.
Alex Ainscow [Fri, 3 Oct 2025 12:49:58 +0000 (13:49 +0100)]
osd: Introduce pool flag for "split IO" and Plugin flag for "direct read"
These flags will currently behave as follows:
1. The pool flag is never set, unless by a user with the osd_pool_default_flags
config option.
2. The pool flag will be removed for EC pools where the plugin does not support
direct reads.
3. Replica pools will never remove the flag.
The intention is to eventually invert this logic and allow split IOs upon
upgrade to Umbrella in this same function.
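The three rules listed above reduce to a small decision function. The sketch below is only an illustration of that logic; the function and parameter names are hypothetical and do not correspond to identifiers in the OSD or monitor code.

```python
def effective_split_io_flag(flag_set_by_user: bool,
                            is_erasure_pool: bool,
                            plugin_supports_direct_read: bool) -> bool:
    """Return whether the split IO pool flag survives for a given pool."""
    if not flag_set_by_user:
        # Rule 1: the flag is never set implicitly.
        return False
    if is_erasure_pool and not plugin_supports_direct_read:
        # Rule 2: EC pools lose the flag when the plugin cannot do
        # direct reads.
        return False
    # Rule 3: replica pools (and EC pools with direct read support)
    # keep the flag.
    return True
```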
Nitzan Mordechai [Tue, 18 Nov 2025 09:37:48 +0000 (09:37 +0000)]
Objecter: respect higher epoch subscription in tick
The OSD and Objecter share the same MonClient. During preboot, a potential
race condition exists where the OSD subscribes to osdmap epoch X, while
the Objecter subscribes to epoch X - 1.
The Objecter's subscription overrides the OSD's subscription. Consequently,
the monitor ignores the request (as it believes the OSD already has the
older map), causing the OSD to hang during preboot.
To fix this, check if a higher epoch is already subscribed before calling
_maybe_request_map during Objecter::tick. If a higher epoch is found,
maintain the existing subscription.
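The override and the guard against it can be modelled with a toy subscription table. This is a sketch under stated assumptions: the class, its methods, and tick_maybe_request_map are hypothetical stand-ins, not the MonClient or Objecter API.

```python
class MonSubscriptionsSketch:
    """Toy model of a shared subscription table (last writer wins)."""

    def __init__(self):
        self._start = {}  # subscription name -> requested start epoch

    def sub_want(self, what: str, start: int) -> None:
        # A later request simply replaces an earlier one, which is how
        # the Objecter's request could override the OSD's.
        self._start[what] = start

    def sub_start(self, what: str):
        return self._start.get(what)


def tick_maybe_request_map(subs: MonSubscriptionsSketch,
                           objecter_epoch: int) -> bool:
    """Request the osdmap unless a higher epoch is already subscribed."""
    current = subs.sub_start("osdmap")
    if current is not None and current > objecter_epoch:
        # Keep the existing, higher subscription.
        return False
    subs.sub_want("osdmap", objecter_epoch)
    return True
```

If the OSD has already subscribed from epoch 10, a tick asking for epoch 9 leaves the table unchanged, while a tick asking for epoch 11 replaces it.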
Nizamudeen A [Wed, 26 Nov 2025 06:20:40 +0000 (11:50 +0530)]
mgr/dashboard: fix server side table sort
Show a loading screen while a server-side sort is in progress,
since the sort can be slow.
The slowness is more visible in bigger environments. In a test environment,
sorting too many times in a short interval can also produce some
inconsistencies. This only applies to tables such as OSDs or hosts, where
server-side rendering is enabled.
Fixes: https://tracker.ceph.com/issues/73994
Signed-off-by: Nizamudeen A <nia@redhat.com>
Laura Flores [Mon, 24 Nov 2025 17:31:05 +0000 (11:31 -0600)]
qa/suites/upgrade: add "OBJECT_UNFOUND" to ignorelists
The thrashing in the upgrade tests has been configured to be very aggressive;
the tests are permitted to stop up to 4 of the 8 OSDs, so these kinds of
health warnings are expected.
Fixes: https://tracker.ceph.com/issues/72424
Signed-off-by: Laura Flores <lflores@ibm.com>
Afreen Misbah [Tue, 21 Oct 2025 16:37:46 +0000 (22:07 +0530)]
mgr/dashboard: Carbonize the Change Password Form
Fixes https://tracker.ceph.com/issues/73193
- using carbon based stylings, typography and components
- used grid layout for form arrangement
- breadcrumb is slightly off, which needs to be fixed by applying grid layout to the app shell
Kefu Chai [Sat, 22 Nov 2025 00:24:36 +0000 (08:24 +0800)]
qa/suites/rados/encoder: exclude ceph-osd-* when installing LTS releases
In a37b5b5, the ceph-osd-classic and ceph-osd-crimson packages were
added to qa/packages/packages.yaml. The "install" task uses this file as
the default package list for all branches, including LTS releases like
Reef.
However, a37b5b5 only exists in the main branch and won't be backported
to LTS branches. This causes installation failures in the rados/encoder
test suite, which verifies forward compatibility by installing LTS
releases and testing whether they can decode the latest corpus.
Exclude ceph-osd-classic and ceph-osd-crimson from LTS installations to
ensure the test suite can successfully install ceph-dencoder, which is
required for the interoperability tests.
Imran Imtiaz [Thu, 20 Nov 2025 14:45:32 +0000 (14:45 +0000)]
mgr/dashboard: add GET API endpoint for consistency groups
Signed-off-by: Imran Imtiaz <imran.imtiaz@uk.ibm.com>
Fixes: https://tracker.ceph.com/issues/73942
Add a consistency group dashboard API endpoint to get the list of images
in the consistency groups that match the namespace of the group.
Seena Fallah [Mon, 19 Feb 2024 09:39:24 +0000 (10:39 +0100)]
cmake: skip boost dependency on ALIAS executable targets
The current add_executable override in Boost does not support alias
targets. Although Ceph currently has no alias targets that are
affected by this limitation, addressing this issue now will benefit
future developments and personal projects.
This change enhances the robustness of the override logic, ensuring
compatibility with alias targets moving forward.