Zac Dover [Fri, 22 Aug 2025 08:39:29 +0000 (18:39 +1000)]
doc/cephfs: edit troubleshooting.rst (Slow MDS)
Move the "Slow requests (MDS)" section immediately after the first
section in this document ("Slow/Stuck Operations"), because the first
procedure on the page directs the reader to undertake the operation in
"Slow requests (MDS)" before trying anything else.
Dan Mick [Thu, 21 Aug 2025 20:00:43 +0000 (13:00 -0700)]
make-debs.sh: make "skip debug packages" conditional
Now that we're using make-debs.sh as a builder inside containers,
the default should be to build all the packages, including debug.
(Also, fix a typo.)
Ilya Dryomov [Thu, 21 Aug 2025 19:39:29 +0000 (21:39 +0200)]
mon/MonClient: post version request completions outside of monc_lock
dispatch() is allowed to invoke the completion object in the current
thread, before control returns from dispatch(). This isn't desirable
when it comes to discarding version requests in MonClient::shutdown()
and MonClient::_reopen_session() because completion objects could then
be invoked under monc_lock. In case of MonClient::_reopen_session() in
particular, this leads to an attempt to acquire monc_lock once again in
MonClient::get_version() on a retry due to monc_errc::session_reset
that is converted to errc::resource_unavailable_try_again:
MonClient::ms_handle_reset
< takes monc_lock >
MonClient::_reopen_session
< invokes the completion object via dispatch() with ec == monc_errc::session_reset >
Objecter::CB_Objecter_GetVersion::operator() [ ec == errc::resource_unavailable_try_again ]
Objecter::_wait_for_latest_osdmap
MonClient::get_version
< attempts to take monc_lock in the body of the lambda >
The end result is either a lockup or some form of undefined behavior.
The best possible outcome here is an exception (std::system_error with
"Resource deadlock avoided" error) and a successive call to
std::terminate().
This is a regression introduced in commit e81d4eae4e76 ("common/async:
Update `use_blocked` for newer asio"). Revert to posting version
request completions for the error cases in a way that is uniform with
the success case in MonClient::handle_get_version_reply().
Samuel Just [Tue, 12 Aug 2025 00:36:16 +0000 (17:36 -0700)]
crimson/.../store-bench: refactor arguments and workloads
- Adds a workload abstraction grouping arguments with associated
workload.
- Reworks argument parsing to occur prior to seastar app, allows
passing unparsed arguments to ceph.
- Refactors time usages as necessary to use std::chrono types.
- Removes LOG_PREFIX usages that don't currently have log lines.
- Other minor cleanups not worth separating out.
make_snapmapper_oid was missing for the new PGs created post-splitting.
This was causing a scrub error due to the missing snap mapper object in the children PGs.
crimson/osd/scrub: check if stats have been marked invalid
In case of splits/merges the PeeringState::split_into function will mark
the child and parent pgs stats invalid.
We need to check for this when scrub finishes, update the stats and mark them valid.
Kefu Chai [Mon, 9 Jun 2025 12:35:44 +0000 (20:35 +0800)]
src: Fix memory leaks in generate_test_instance() by returning values instead of pointers
Problem:
The current `generate_test_instance()` function returns `std::list<T*>`,
which creates memory management issues:
- Inconsistent lifecycle management of test instances
- Callers don't always properly clean up allocated memory
- Results in ASan memory leak reports in unit tests and ceph-dencoder
Solution:
Change `generate_test_instance()` to return `std::list<T>` instead of `std::list<T*>`:
Core Changes:
- Modified all classes with `generate_test_instance()` to return `std::list<T>`
- Use `emplace_back()` without parameters** to avoid copy/move
constructors for classes that don't define them
- Updated ceph-dencoder to handle the new return type
ceph-dencoder Adaptations:
Since `m_list` now holds `T` objects instead of `T*`, and we can't
assume `T` is copyable/moveable:
- Keep `m_object` as a pointer for flexibility
- Handle two scenarios:
1. `m_object` points to an element in `m_list`
2. `m_object` points to a decoded instance (requires manual cleanup)
- Introduce `make_ptr()` as a factory function to create a smart pointer
to conditionally free the managed pointer.
Additional Cleanup:
- Simplify DencoderBase constructor from template to plain
function (extra parameters were never used in derived classes)
With this change, object lifecycles are now managed by value semantics
instead of raw pointers, eliminating memory leaks.
Improve source rpm detection by adding a new detection method that
executes and rpm command in a container to get exactly the version of
the source rpm that the ceph.spec file would have generated. For
backwards compatibility and that I don't entirely trust myself to have
tested this the old methods are still available.
The old `--rpm-no-match-sha` is now an alias for `--srpm-match=any` to
cause it to build any (unique) ceph srpm it finds.
`--srpm-match=versionglob` retains the previous default behavior of
using a glob matching on the git id or ceph version value. The new
default of `--srpm-match=auto` implements the rpm command based behavior
described above.
All of this is wrapped in a new step `find-rpm` but that's mostly an
implementation detail and for testing.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Alex Ainscow [Wed, 13 Aug 2025 11:03:21 +0000 (12:03 +0100)]
src: Add sign-compare warnings to clang
For a while, GCC has generated warnings about sign errors. A common
mistake if compiling with clang was to accidentally introduce signedness
errors, which were picked up by the GCC builds.
This occurs due to an inconsistency in -Wall implementation between clang
and gcc: gcc includes sign-compare, clang does not.
See:
https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wall
vs
https://clang.llvm.org/docs/DiagnosticsReference.html#wall
Note that sign-compare is included under -Wextra for clang:
https://clang.llvm.org/docs/DiagnosticsReference.html#wextra
Clang will now generate similar warnings with -Wsign-compare:
https://clang.llvm.org/docs/DiagnosticsReference.html#wsign-compare
Interestingly, if specified on its own, -Wsign-compare will include
C, whereas gcc -Wall affects C++ only. Therefore we must work around
this in the make file to emulate the GCC behaviour in clang builds.
Also fix a couple of warnings found in some tests.
Kefu Chai [Mon, 18 Aug 2025 02:41:07 +0000 (10:41 +0800)]
os/Transaction: initialize unused fields in TransactionData
Initialize unused1, unused2, and unused3 fields to zero in TransactionData
to ensure consistent encoding/decoding behavior.
Background:
In commit a0c9fec7, we updated TransactionData encoding/decoding and bumped
the Transaction encoding version from 9 to 10. As part of this change, we
renamed three fields to mark them as unused:
- largest_data_len → unused1
- largest_data_off → unused2
- largest_data_off_in_data_bl → unused3
The move constructor was also updated to stop setting these fields, leaving
them uninitialized after move operations.
Problem:
This worked with existing tests because check-generated.sh reused struct
instances, preserving stale values across encode/decode cycles. However,
an upcoming test change will stop reusing instances and compare hexdumps
of encoded/re-encoded values to verify consistency. Uninitialized fields
cause these comparisons to fail due to garbage values.
Solution:
Initialize the unused fields to zero in the move constructor. This preserves
existing behavior while ensuring consistent encoding. These fields can be
removed entirely in a future change.
Samuel Just [Wed, 13 Aug 2025 16:54:35 +0000 (09:54 -0700)]
crimson/.../fixed_kv_node: don't call copy_out if delta_buffer is empty
Cache::mark_transaction_conflicted calls get_delta(), which in turn
calls FixedKVNodeLayout::copy_out for lba nodes. If the mutation_pending
extent happens not to have any deltas, it'll fail in memcpy in
FixedKVNodeLayout::copy_out.
I think this is valid because a transaction may become conflicted
between when duplicate_for_write is called and when the actual mutation
is performed on the extent.
Fixes: https://tracker.ceph.com/issues/72579 Signed-off-by: Samuel Just <sjust@redhat.com>
Ronen Friedman [Thu, 7 Aug 2025 04:54:30 +0000 (23:54 -0500)]
qa/standalone/scrub: re-code osd-scrub-dump.sh to test scrub repair functionality.
The new version of osd-scrub-dump.sh is designed to
allow multiple "corruption methods" on a subset of objects.
The functionality includes specifying:
- the number of objects created;
- the number to have their Primary version modified;
- the number to have their Replicas modified;
- the set of "manipulations" to perform on the objects.
The arm64-only module uadk needs numa.h to build; nothing else
ensures it's available. Make it an unconditional ceph build
dependency on behalf of the arm64 build.
Fixes: https://tracker.ceph.com/issues/72594 Signed-off-by: Dan Mick <dan.mick@redhat.com>
tasks/cephfs: Use different errmsg for invalid dir
During test_df_for_invalid_directory, path_walk is now called.
Use a more general error message as more errnos can be returned
and this will be a better catch all.
Signed-off-by: Christopher Hoffman <choffman@redhat.com>