Ville Ojamo [Wed, 22 Oct 2025 07:19:31 +0000 (14:19 +0700)]
doc: Use validated links with ref instead of external links
Use :ref: for intra-docs links that are validated, instead of external
links.
Only use already existing labels.
Fix a few anchors that pointed to now-renamed section titles.
Use automatically generated link text where appropriate.
Delete unused link definitions.
Mostly in doc/rados/ but also a few in doc/rbd/.
Try to fix all links in each of the changed documents.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
cmake: convert erasure_code and json_spirit to OBJECT libraries
This resolves a circular dependency issue where ceph-common was linked
against erasure_code and json_spirit static libraries, while these
libraries themselves referenced symbols from ceph-common, creating an
unresolvable circular dependency. The static libraries were incorrectly
marked PUBLIC, causing executables linking against ceph-common to also
link against them directly.
The circular dependency manifested as linker errors in tests like
ceph_test_ino_release_cb, where libjson_spirit.a and liberasure_code.a
contained undefined references to ceph-common symbols (e.g.,
ceph::__ceph_assert_fail and get_str_list) that couldn't be resolved
due to the linking order.
For instance, ceph_test_ino_release_cb failed to link:
```
/usr/bin/ld: ../../../lib/libcephfs.so.2.0.0: undefined reference to symbol '_ZN4ceph18__ceph_assert_failERKNS_11assert_dataE'
/usr/bin/ld: ../../../lib/libceph-common.so.2: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
```
Changes:
- Convert erasure_code and json_spirit from STATIC to OBJECT libraries
- Embed their object files directly into ceph-common during linking,
breaking the circular dependency and preventing public propagation
of these dependencies to downstream targets
- Remove direct linkage from targets that already get these symbols
through ceph-common dependencies:
* Erasure code unit tests (unittest_erasure_code_isa,
unittest_erasure_code_plugin_isa, unittest_erasure_code_example)
* Test libraries (radostest)
* Plugins (denc-mod-osd, cls_refcount, cls_rgw, cls_lua, ec_lrc)
This also prevents ODR violations that would occur if targets linked
against these libraries directly while also getting them from
ceph-common. Such violations cause undefined behavior including
segmentation faults, as static variables and vtables would exist in
duplicate, leading to crashes during destruction or when accessing
shared state.
For example, before removing direct linkage from plugins, ceph-dencoder
would segfault on certain object types:
/ceph/src/test/encoding/readable.sh: line 111: Segmentation fault
$CEPH_DENCODER type ScrubMap import ... decode encode decode dump_json
crimson/mclock_scheduler: Support mclock for crimson
This patch brings the mclock scheduler source for crimson in line with
the classic OSD. Currently crimson uses the feature only for
background recovery operations, but later we will use it for other
OSD operations as well. To use it, the user needs to configure the
crimson_osd_scheduler_concurrency parameter for the OSD.
Replace item_t with WorkItem variant to maintain similarity
with classic OSD.
crimson/background_recovery: switch to unified SchedulerClass and introduce get_average_object_size for pg
1) Replace usage of crimson::osd::scheduler::scheduler_class_t
with unified SchedulerClass
2) Add priority to scheduler params structure
3) Introduce get_average_object_size for pg
crimson/osd,osd_operation: initialize mClock scheduler, detect rotational devices, and run OperationThrottler background task
Initialize the mClock scheduler on all shards when the device class
is non-rotational. If the device is rotational, throw an exception
to prevent unsupported configurations.
In addition, introduce a background task in OperationThrottler that
continuously dequeues and schedules client requests from the mClock
scheduler based on available credits and throttling limits.
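The dequeue loop described above can be sketched roughly as follows. This is only an illustration: ThrottlerSketch, max_in_flight, run_once and complete are hypothetical names, a plain FIFO stands in for the mClock queue, and none of this is the crimson API.

```python
from collections import deque


class ThrottlerSketch:
    """Hypothetical stand-in for OperationThrottler; not the crimson API."""

    def __init__(self, max_in_flight, rotational):
        if rotational:
            # Mirrors the commit text: rotational devices are rejected.
            raise RuntimeError("mClock sketch: rotational device not supported")
        self.max_in_flight = max_in_flight  # assumed throttling limit
        self.in_flight = 0
        self.queue = deque()  # FIFO placeholder for the mClock queue
        self.started = []

    def enqueue(self, item):
        self.queue.append(item)

    def run_once(self):
        """One pass of the background task: start items while credits remain."""
        while self.queue and self.in_flight < self.max_in_flight:
            self.started.append(self.queue.popleft())
            self.in_flight += 1

    def complete(self):
        """Return one credit and let the background pass run again."""
        self.in_flight -= 1
        self.run_once()
```

In this sketch a completed operation returns a credit, which is what lets the next queued item start.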
Alex Ainscow [Fri, 28 Nov 2025 14:33:13 +0000 (14:33 +0000)]
osd: Perform shard look up correctly in partial EC writes
Plugins are permitted to provide a mapping to change the order in which OSDs
are used. In practice only LRC does this and it is not currently enabled
with optimisations, so this is a theoretical bug.
The bug here was that the "first" shard was assumed to be shard_id_t(0). However,
this is not true for LRC.
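A minimal sketch of the idea, assuming the plugin mapping is a list whose entry i gives the shard that holds raw chunk i. The function name and the example mapping are hypothetical and are not taken from the OSD code.

```python
def first_shard(chunk_mapping):
    """Return the shard holding the first raw chunk.

    chunk_mapping is assumed to be a list where entry i is the shard id
    used for raw chunk i. An empty mapping means the identity order.
    """
    if not chunk_mapping:
        return 0  # no remapping: the first shard really is shard 0
    return chunk_mapping[0]


# Identity order: assuming shard 0 happens to be correct.
identity = first_shard([])
# Hypothetical remapped order: the first chunk lives on shard 1.
remapped = first_shard([1, 2, 0])
```

With the remapped order, code that hard-codes shard 0 would pick the wrong shard.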
Fixes: https://tracker.ceph.com/issues/74016
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:06:22 +0000 (10:06 +0000)]
qa: Reduce number of osd threads when using compression
Smithi nodes used by teuthology tests have 8 CPU cores and typically run
4 OSD processes. When bluestore software compression is enabled the size
of the OSD thread pool needs to be reduced to 2 threads per OSD because
these threads can easily use 100% of a core. This avoids excessive
context switching, which otherwise causes OSD threads to time out,
the OSD to drop heartbeat pings, and the monitor to temporarily mark
it down. In extreme cases this can lead to PGs getting stuck in
repeated loops of peering until the teuthology test times out.
Context switches happen opportunistically at the end of system calls,
so functions with lots of logging are some of the worst affected.
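The numbers above work out as follows. The commit only states the values 8, 4 and 2. Dividing cores by OSD processes is an assumption about how they relate, and the function name is hypothetical.

```python
def osd_threads_per_process(cpu_cores, osd_processes):
    """Assumed sizing rule: at most one busy OSD thread per core."""
    return max(1, cpu_cores // osd_processes)


# Figures quoted in the commit message: 8 cores, 4 OSD processes.
threads = osd_threads_per_process(8, 4)
```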
Fixes: https://tracker.ceph.com/issues/72879
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:38:44 +0000 (10:38 +0000)]
osd: Restrict logging in MissingLoc::add_source_info
add_source_info can generate an excessive amount of logging
if a PG has thousands of missing objects. When a system is
under load and threads are repeatedly context switching this
can lead to timeouts (tests showed this function taking up
to 10 seconds to execute with 99% of that time being in
logging calls where the thread was being pre-empted).
Stopping logging after the function has been running for
more than 0.5 seconds strikes a balance between providing
sufficient information to debug problems and keeping the
system stable when it is heavily loaded.
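The time-bounded logging can be sketched as below. Only the 0.5 second budget comes from the commit. The function name, the log callback and the injectable clock are hypothetical, and the real function does more than log.

```python
import time

LOG_BUDGET_SECONDS = 0.5  # threshold quoted in the commit message


def add_source_info_sketch(missing, log, clock=time.monotonic):
    """Process every object, but stop logging once the budget is spent."""
    start = clock()
    suppressed = False
    processed = 0
    for oid in missing:
        # The clock is only consulted while logging is still enabled.
        if not suppressed and clock() - start > LOG_BUDGET_SECONDS:
            suppressed = True
            log("further logging suppressed")
        if not suppressed:
            log(f"missing {oid}")
        processed += 1  # the real work continues regardless of logging
    return processed
```

Passing the clock in as a parameter keeps the sketch testable without sleeping.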
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:25:48 +0000 (10:25 +0000)]
osd: Increase log level for listing missing list
Logging the entire contents of a missing list can generate a
1M character log line when there are 8000 missing objects in a
PG. Other places in the code logging the missing list use debug
level 25 which is not enabled by default in teuthology tests.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Mon, 24 Nov 2025 09:18:21 +0000 (09:18 +0000)]
osd: reset_tp_timeout should reset timeout for all shards
ShardedThreadPools are only used by the classic OSD process
which can have more than one thread for the same shard. Each
thread has a heartbeat timeout used to detect stalled threads.
Some code that is known to take a long time makes calls to
reset_tp_timeout to reset this timeout. However for sharded
pools this can be ineffective because it is common for threads
for the same shard to use the same locks (e.g. PG Lock) and
therefore if thread A is taking a long time and resetting
its timeout while holding a lock, thread B for the same shard
is liable to be waiting for the same lock, will not be
resetting its timeout and can be timed out.
Debug for issue 72879 showed heartbeat timeouts occurring at
the same time for both shards. An attempt to fix the problem
by calling reset_tp_timeout for the slow thread still showed
the other threads for the shard timing out waiting for the PG
lock that was held by the slow thread. Looking at the OSD code,
in most places where reset_tp_timeout is called the thread is
holding the PG lock.
This commit moves the concept of shard_index from OSD into
ShardedThreadPool and modifies reset_tp_timeout so that it resets
the timeout for all threads for the same shard.
Some code calls reset_tp_timeout from inside loops that can take
a long time without consideration for how long the thread has
actually been running for. There is a risk that this type of
call could repeatedly reset the timeout for another shard which
is genuinely stuck and hence defeat the heartbeat checks. To
prevent this, reset_tp_timeout is modified to be a NOP unless
the thread has been processing the current work item for more
than 0.5 seconds. Therefore threads have to be slow but making
forward progress to be able to reset the timeout.
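The behaviour described above can be sketched as follows. ShardedPoolSketch, its fields and the round-robin thread-to-shard assignment are hypothetical and do not reflect the actual ShardedThreadPool code. Only the per-shard reset and the 0.5 second guard come from the commit.

```python
import time

MIN_RUNTIME_SECONDS = 0.5  # guard quoted in the commit message


class ShardedPoolSketch:
    """Hypothetical model of per-thread heartbeat deadlines."""

    def __init__(self, num_shards, threads_per_shard, grace, clock=time.monotonic):
        self.clock = clock
        self.grace = grace
        total = num_shards * threads_per_shard
        # Assumed round-robin assignment of threads to shards.
        self.shard_of = {t: t % num_shards for t in range(total)}
        now = clock()
        self.deadline = {t: now + grace for t in self.shard_of}
        self.item_start = {t: now for t in self.shard_of}

    def start_item(self, tid):
        now = self.clock()
        self.item_start[tid] = now
        self.deadline[tid] = now + self.grace

    def reset_tp_timeout(self, tid):
        now = self.clock()
        # NOP unless this thread has been on its work item long enough.
        if now - self.item_start[tid] <= MIN_RUNTIME_SECONDS:
            return False
        shard = self.shard_of[tid]
        # Extend the deadline of every thread serving the same shard.
        for other, other_shard in self.shard_of.items():
            if other_shard == shard:
                self.deadline[other] = now + self.grace
        return True
```

Threads on other shards keep their original deadlines, so a genuinely stuck shard is still detected.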
Fixes: https://tracker.ceph.com/issues/72879
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Tue, 14 Oct 2025 08:24:56 +0000 (09:24 +0100)]
osdc: Add SplitOp capability to Objecter
This will provide the ability for Objecter to split up
certain ops and distribute them to the OSDs directly if
that provides a performance advantage.
This is experimental code and is switched off unless the
magic pool flags are enabled. These magic pool flags were
pushed in an earlier commit in the same PR.