Kefu Chai [Tue, 3 Jun 2025 08:07:33 +0000 (16:07 +0800)]
librbd/cache/pwl: fix memory leak in SyncPoint persist context cleanup
Previously, SyncPoint allocated two C_Gather instances tracked by raw
pointers but failed to properly clean them up when only a single sync
point existed, causing memory leaks detected by AddressSanitizer.
This change fixes the leak by modifying AbstractWriteLog::shut_down()
to check for prior sync points in the chain. When the current sync point
is the only one present, we now activate the m_prior_log_entries_persisted
context to ensure:
- The onfinish callback executes and releases the captured strong
  reference to the enclosing SyncPoint
- The parent m_sync_point_persist context completes and gets properly
  released
This ensures all allocated contexts are cleaned up correctly during
shutdown, eliminating the memory leak.
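The ownership cycle described above can be illustrated with a minimal
sketch. It is not the librbd code: Tracked, Callback, Gather,
SyncPointSketch, run_shutdown() and live_objects are hypothetical
stand-ins, and Gather only imitates the "completes and frees itself once
activated" behaviour that the message attributes to the real contexts.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical leak counter: number of sketch objects still alive.
int live_objects = 0;

struct Tracked {
  Tracked() { ++live_objects; }
  Tracked(const Tracked&) = delete;
  virtual ~Tracked() { --live_objects; }
};

// Stand-in for a completion context: runs once, then deletes itself.
struct Callback : Tracked {
  std::function<void()> fn;
  explicit Callback(std::function<void()> f) : fn(std::move(f)) {}
  void complete() { fn(); delete this; }
};

// Stand-in for a gather with no sub-contexts: it completes its onfinish
// callback and frees itself only when activate() is called.
struct Gather : Tracked {
  Callback* onfinish;
  explicit Gather(Callback* c) : onfinish(c) {}
  void activate() {
    assert(onfinish != nullptr);
    onfinish->complete();  // destroys the lambda and its captured shared_ptr
    delete this;
  }
};

struct SyncPointSketch : Tracked {
  Gather* prior_persisted = nullptr;  // raw pointer, not freed by the destructor
};

// Returns how many objects are still alive after a simulated shutdown.
int run_shutdown(bool activate_gather) {
  live_objects = 0;
  {
    auto sp = std::make_shared<SyncPointSketch>();
    // The callback captures a strong reference to the sync point.
    sp->prior_persisted = new Gather(new Callback([sp] {}));
    if (activate_gather) {
      sp->prior_persisted->activate();
    }
  }  // the local strong reference is dropped here
  return live_objects;
}
```

In this sketch, run_shutdown(false) leaves the sync point, the gather and
the callback alive, because nothing ever releases the captured reference,
while run_shutdown(true) leaves none.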
The ASan report:
```
Indirect leak of 2064 byte(s) in 1 object(s) allocated from:
#0 0x56440919ae2d in operator new(unsigned long) (/home/jenkins-build/build/workspace/ceph-pull-requests/build/bin/unittest_librbd+0x2f3de2d) (BuildId: 6a04677c6ee5235f1a41815df807f97c5b96d4cd)
#1 0x56440bd67751 in __gnu_cxx::new_allocator<Context*>::allocate(unsigned long, void const*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:127:27
#2 0x56440bd676e0 in std::allocator<Context*>::allocate(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/allocator.h:185:32
#3 0x56440bd676e0 in std::allocator_traits<std::allocator<Context*>>::allocate(std::allocator<Context*>&, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:464:20
#4 0x56440bd6730b in std::_Vector_base<Context*, std::allocator<Context*>>::_M_allocate(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:346:20
#5 0x7fd33e00e8d1 in std::vector<Context*, std::allocator<Context*>>::reserve(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:78:22
#6 0x7fd33e00c51c in librbd::cache::pwl::SyncPoint::SyncPoint(unsigned long, ceph::common::CephContext*) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/SyncPoint.cc:20:27
#7 0x56440bd65f26 in decltype(::new((void*)(0)) librbd::cache::pwl::SyncPoint(std::declval<unsigned long&>(), std::declval<ceph::common::CephContext*&>())) std::construct_at<librbd::cache::pwl::SyncPoint, unsigned long&, ceph::common::CephContext*&>(librbd::cache::pwl::SyncPoint*, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:97:39
#8 0x56440bd65b98 in void std::allocator_traits<std::allocator<librbd::cache::pwl::SyncPoint>>::construct<librbd::cache::pwl::SyncPoint, unsigned long&, ceph::common::CephContext*&>(std::allocator<librbd::cache::pwl::SyncPoint>&, librbd::cache::pwl::SyncPoint*, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:518:4
#9 0x56440bd657d3 in std::_Sp_counted_ptr_inplace<librbd::cache::pwl::SyncPoint, std::allocator<librbd::cache::pwl::SyncPoint>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<unsigned long&, ceph::common::CephContext*&>(std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:519:4
#10 0x56440bd65371 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<librbd::cache::pwl::SyncPoint, std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(librbd::cache::pwl::SyncPoint*&, std::_Sp_alloc_shared_tag<std::allocator<librbd::cache::pwl::SyncPoint>>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:651:6
#11 0x56440bd65163 in std::__shared_ptr<librbd::cache::pwl::SyncPoint, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(std::_Sp_alloc_shared_tag<std::allocator<librbd::cache::pwl::SyncPoint>>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1342:14
#12 0x56440bd650e6 in std::shared_ptr<librbd::cache::pwl::SyncPoint>::shared_ptr<std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(std::_Sp_alloc_shared_tag<std::allocator<librbd::cache::pwl::SyncPoint>>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:409:4
#13 0x56440bd65057 in std::shared_ptr<librbd::cache::pwl::SyncPoint> std::allocate_shared<librbd::cache::pwl::SyncPoint, std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(std::allocator<librbd::cache::pwl::SyncPoint> const&, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:862:14
#14 0x56440bca97e7 in std::shared_ptr<librbd::cache::pwl::SyncPoint> std::make_shared<librbd::cache::pwl::SyncPoint, unsigned long&, ceph::common::CephContext*&>(unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:878:14
#15 0x56440bd443c8 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::new_sync_point(librbd::cache::pwl::DeferredContexts&) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1905:20
#16 0x56440bd42e4c in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::flush_new_sync_point(librbd::cache::pwl::C_FlushRequest<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>>*, librbd::cache::pwl::DeferredContexts&) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1951:3
#17 0x56440bd9cbf2 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::flush_new_sync_point_if_needed(librbd::cache::pwl::C_FlushRequest<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>>*, librbd::cache::pwl::DeferredContexts&) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1990:5
#18 0x56440bd9c636 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::internal_flush(bool, Context*)::'lambda'(librbd::cache::pwl::GuardedRequestFunctionContext&)::operator()(librbd::cache::pwl::GuardedRequestFunctionContext&) const /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:2152:9
#19 0x56440bd9b9b4 in boost::detail::function::void_function_obj_invoker<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::internal_flush(bool, Context*)::'lambda'(librbd::cache::pwl::GuardedRequestFunctionContext&), void, librbd::cache::pwl::GuardedRequestFunctionContext&>::invoke(boost::detail::function::function_buffer&, librbd::cache::pwl::GuardedRequestFunctionContext&) /opt/ceph/include/boost/function/function_template.hpp:100:11
#20 0x56440bd29321 in boost::function_n<void, librbd::cache::pwl::GuardedRequestFunctionContext&>::operator()(librbd::cache::pwl::GuardedRequestFunctionContext&) const /opt/ceph/include/boost/function/function_template.hpp:789:14
#21 0x56440bd28d85 in librbd::cache::pwl::GuardedRequestFunctionContext::finish(int) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/Request.h:335:5
#22 0x5644091e0fe0 in Context::complete(int) /home/jenkins-build/build/workspace/ceph-pull-requests/src/include/Context.h:102:5
#23 0x56440bd9b378 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::detain_guarded_request(librbd::cache::pwl::C_BlockIORequest<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>>*, librbd::cache::pwl::GuardedRequestFunctionContext*, bool) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1202:20
#24 0x56440bd96c50 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::internal_flush(bool, Context*) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:2154:3
#25 0x56440bd1e4b5 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:703:3
#26 0x56440bdb9022 in librbd::cache::pwl::TestMockCacheSSDWriteLog_compare_and_write_compare_matched_Test::TestBody() /home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/cache/pwl/test_mock_SSDWriteLog.cc:403:7
```
Add an instruction that includes the --enable-auth flag in a "ceph orch
apply mgmt-gateway" command, in accordance with a request made by
afreen23 here: https://github.com/ceph/ceph/pull/60440#discussion_r1953530599
Zac Dover [Fri, 30 May 2025 12:38:03 +0000 (22:38 +1000)]
doc/rados/operations: edit cache-tiering.rst
Strengthen the warning against deploying cache tiering in releases after
Reef. This follows up on Anthony D'Atri's request in
https://github.com/ceph/ceph/pull/63465.
Improve the English in the "desc" field of the
"osd_deep_scrub_interval_cv" variable, as suggested by Anthony D'Atri in
https://github.com/ceph/ceph/pull/63490#discussion_r2124893516.
Ronen Friedman [Sun, 15 Jun 2025 19:40:59 +0000 (14:40 -0500)]
osd/scrub: allow longer waits for replicas to respond
Increase the two timeouts associated with replica
responses to scrub requests.
This addresses the case where a cluster event triggers
repeering (e.g. an OSD in the active set goes down) and
a request times out before the new interval is
established. This scenario does not lead to any real
data loss or crashes, but it does result in log
warnings (and failed tests).
Zac Dover [Tue, 10 Jun 2025 03:04:13 +0000 (13:04 +1000)]
doc/rados: edit ops/user-management.rst
Edit a sentence in the imperative mood so that it matches the general
form of imperative sentences immediately preceding commands that contain
replaceable portions.
This commit targets only the Squid release branch.
Follows up on https://github.com/ceph/ceph/pull/58235/.
Zac Dover [Tue, 10 Jun 2025 02:54:18 +0000 (12:54 +1000)]
doc/mgr: edit telemetry.rst (lines 300-400)
Edit doc/mgr/telemetry.rst (lines 300-400).
Follow up on the suggestions made by Anthony D'Atri in
https://github.com/ceph/ceph/pull/63741 (except for the one about
including Lovecraftian lore in the dummy user data in this file).
Zac Dover [Tue, 10 Jun 2025 10:58:22 +0000 (20:58 +1000)]
doc/rados: enhance "pools.rst"
Add a link to the instructions for modifying a user's caps for a given
pool, placing it where the reader would naturally look for it.
Zac Dover [Tue, 10 Jun 2025 10:38:54 +0000 (20:38 +1000)]
doc/rbd: add mirroring troubleshooting info
Add a note to doc/rbd/rbd-mirroring.rst that directs the reader to set
both "site-a" and "site-b" to have the same pool names in the event that
rbd throws the error message "failed to import peer bootstrap token".
This information was reported to the Ceph upstream by Petr Tlapa in June
of 2025, and credit for its development goes to Petr.
Zac Dover [Thu, 1 May 2025 07:31:33 +0000 (17:31 +1000)]
doc/src/common/options: mgr.yaml.in edit
Improve the "desc" field under the "mgr_data" entry in
src/common/options/mgr.yaml.in.
This is a test to determine whether the Jenkins tests can be passed.
This test is made after the mystifying failure of
https://github.com/ceph/ceph/pull/62983.
Zac Dover [Mon, 2 Jun 2025 02:32:36 +0000 (12:32 +1000)]
doc/start: edit documenting-ceph.rst
Edit the section "Build the Source" in doc/start/documenting-ceph.rst.
Also correct a misuse of the word "presently", which means "in a little
while", not "now".
Zac Dover [Wed, 4 Jun 2025 23:39:33 +0000 (09:39 +1000)]
doc/glossary: s/OMAP/omap/
Change "OMAP" to "omap" to match the capitalization established by
Eleanor Cawthon in her 2012 omap paper, here:
https://ceph.io/assets/pdfs/CawthonKeyValueStore.pdf.
J. Eric Ivancich [Tue, 25 Mar 2025 22:10:27 +0000 (18:10 -0400)]
rgw: fix bug with rgw-gap-list
rgw-gap-list would enter an infinite loop if it reached the end of the
second file before the end of the first.
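The failure mode can be illustrated with a generic two-cursor walk over
two sorted lists. This sketch is not taken from rgw-gap-list:
missing_from_second() is a hypothetical helper, and it assumes both
inputs are sorted in ascending order and free of duplicates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the entries of `first` that do not appear in `second`.
std::vector<std::string> missing_from_second(
    const std::vector<std::string>& first,
    const std::vector<std::string>& second) {
  assert(std::is_sorted(first.begin(), first.end()));
  assert(std::is_sorted(second.begin(), second.end()));

  std::vector<std::string> gaps;
  std::size_t i = 0, j = 0;
  while (i < first.size()) {
    if (j == second.size()) {
      // The second list ran out first: every remaining entry is a gap.
      // A loop that omits this branch has no way to advance `i` here.
      gaps.push_back(first[i++]);
    } else if (first[i] < second[j]) {
      gaps.push_back(first[i++]);  // present only in the first list
    } else if (second[j] < first[i]) {
      ++j;                         // present only in the second list
    } else {
      ++i;                         // present in both lists
      ++j;
    }
  }
  return gaps;
}
```

Every branch advances at least one cursor, so the loop terminates
regardless of which list is exhausted first.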
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Michael J. Kidd <linuxkidd@gmail.com>
(cherry picked from commit 0cfbc57d2c43ea88845561f14e295d0d48e44b32)
Change the wording of a sentence in doc/radosgw/metrics.rst so that its
articles read as though they were written by a native speaker of the
English language.
This commit is being raised as part of a diagnostic process aimed at
discovering why the ReadtheDocs check is failing on PR
https://github.com/ceph/ceph/pull/62877.