Kefu Chai [Tue, 3 Jun 2025 08:07:33 +0000 (16:07 +0800)]
librbd/cache/pwl: fix memory leak in SyncPoint persist context cleanup
Previously, SyncPoint allocated two C_Gather instances tracked by raw
pointers but failed to properly clean them up when only a single sync
point existed, causing memory leaks detected by AddressSanitizer.
This change fixes the leak by modifying AbstractWriteLog::shut_down()
to check for prior sync points in the chain. When the current sync point
is the only one present, we now activate the m_prior_log_entries_persisted
context to ensure:
- The onfinish callback executes and releases the captured strong
reference to the enclosing SyncPoint
- The parent m_sync_point_persist context completes and gets properly
released
This ensures all allocated contexts are cleaned up correctly during
shutdown, eliminating the memory leak.
The ASan report:
```
Indirect leak of 2064 byte(s) in 1 object(s) allocated from:
#0 0x56440919ae2d in operator new(unsigned long) (/home/jenkins-build/build/workspace/ceph-pull-requests/build/bin/unittest_librbd+0x2f3de2d) (BuildId: 6a04677c6ee5235f1a41815df807f97c5b96d4cd)
#1 0x56440bd67751 in __gnu_cxx::new_allocator<Context*>::allocate(unsigned long, void const*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:127:27
#2 0x56440bd676e0 in std::allocator<Context*>::allocate(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/allocator.h:185:32
#3 0x56440bd676e0 in std::allocator_traits<std::allocator<Context*>>::allocate(std::allocator<Context*>&, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:464:20
#4 0x56440bd6730b in std::_Vector_base<Context*, std::allocator<Context*>>::_M_allocate(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:346:20
#5 0x7fd33e00e8d1 in std::vector<Context*, std::allocator<Context*>>::reserve(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:78:22
#6 0x7fd33e00c51c in librbd::cache::pwl::SyncPoint::SyncPoint(unsigned long, ceph::common::CephContext*) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/SyncPoint.cc:20:27
#7 0x56440bd65f26 in decltype(::new((void*)(0)) librbd::cache::pwl::SyncPoint(std::declval<unsigned long&>(), std::declval<ceph::common::CephContext*&>())) std::construct_at<librbd::cache::pwl::SyncPoint, unsigned long&, ceph::common::CephContext*&>(librbd::cache::pwl::SyncPoint*, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:97:39
#8 0x56440bd65b98 in void std::allocator_traits<std::allocator<librbd::cache::pwl::SyncPoint>>::construct<librbd::cache::pwl::SyncPoint, unsigned long&, ceph::common::CephContext*&>(std::allocator<librbd::cache::pwl::SyncPoint>&, librbd::cache::pwl::SyncPoint*, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:518:4
#9 0x56440bd657d3 in std::_Sp_counted_ptr_inplace<librbd::cache::pwl::SyncPoint, std::allocator<librbd::cache::pwl::SyncPoint>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<unsigned long&, ceph::common::CephContext*&>(std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:519:4
#10 0x56440bd65371 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<librbd::cache::pwl::SyncPoint, std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(librbd::cache::pwl::SyncPoint*&, std::_Sp_alloc_shared_tag<std::allocator<librbd::cache::pwl::SyncPoint>>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:651:6
#11 0x56440bd65163 in std::__shared_ptr<librbd::cache::pwl::SyncPoint, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(std::_Sp_alloc_shared_tag<std::allocator<librbd::cache::pwl::SyncPoint>>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1342:14
#12 0x56440bd650e6 in std::shared_ptr<librbd::cache::pwl::SyncPoint>::shared_ptr<std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(std::_Sp_alloc_shared_tag<std::allocator<librbd::cache::pwl::SyncPoint>>, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:409:4
#13 0x56440bd65057 in std::shared_ptr<librbd::cache::pwl::SyncPoint> std::allocate_shared<librbd::cache::pwl::SyncPoint, std::allocator<librbd::cache::pwl::SyncPoint>, unsigned long&, ceph::common::CephContext*&>(std::allocator<librbd::cache::pwl::SyncPoint> const&, unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:862:14
#14 0x56440bca97e7 in std::shared_ptr<librbd::cache::pwl::SyncPoint> std::make_shared<librbd::cache::pwl::SyncPoint, unsigned long&, ceph::common::CephContext*&>(unsigned long&, ceph::common::CephContext*&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:878:14
#15 0x56440bd443c8 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::new_sync_point(librbd::cache::pwl::DeferredContexts&) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1905:20
#16 0x56440bd42e4c in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::flush_new_sync_point(librbd::cache::pwl::C_FlushRequest<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>>*, librbd::cache::pwl::DeferredContexts&) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1951:3
#17 0x56440bd9cbf2 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::flush_new_sync_point_if_needed(librbd::cache::pwl::C_FlushRequest<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>>*, librbd::cache::pwl::DeferredContexts&) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1990:5
#18 0x56440bd9c636 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::internal_flush(bool, Context*)::'lambda'(librbd::cache::pwl::GuardedRequestFunctionContext&)::operator()(librbd::cache::pwl::GuardedRequestFunctionContext&) const /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:2152:9
#19 0x56440bd9b9b4 in boost::detail::function::void_function_obj_invoker<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::internal_flush(bool, Context*)::'lambda'(librbd::cache::pwl::GuardedRequestFunctionContext&), void, librbd::cache::pwl::GuardedRequestFunctionContext&>::invoke(boost::detail::function::function_buffer&, librbd::cache::pwl::GuardedRequestFunctionContext&) /opt/ceph/include/boost/function/function_template.hpp:100:11
#20 0x56440bd29321 in boost::function_n<void, librbd::cache::pwl::GuardedRequestFunctionContext&>::operator()(librbd::cache::pwl::GuardedRequestFunctionContext&) const /opt/ceph/include/boost/function/function_template.hpp:789:14
#21 0x56440bd28d85 in librbd::cache::pwl::GuardedRequestFunctionContext::finish(int) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/Request.h:335:5
#22 0x5644091e0fe0 in Context::complete(int) /home/jenkins-build/build/workspace/ceph-pull-requests/src/include/Context.h:102:5
#23 0x56440bd9b378 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::detain_guarded_request(librbd::cache::pwl::C_BlockIORequest<librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>>*, librbd::cache::pwl::GuardedRequestFunctionContext*, bool) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:1202:20
#24 0x56440bd96c50 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::internal_flush(bool, Context*) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:2154:3
#25 0x56440bd1e4b5 in librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::shut_down(Context*) /home/jenkins-build/build/workspace/ceph-pull-requests/src/librbd/cache/pwl/AbstractWriteLog.cc:703:3
#26 0x56440bdb9022 in librbd::cache::pwl::TestMockCacheSSDWriteLog_compare_and_write_compare_matched_Test::TestBody() /home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/cache/pwl/test_mock_SSDWriteLog.cc:403:7
```
Ronen Friedman [Thu, 19 Jun 2025 15:27:38 +0000 (10:27 -0500)]
osd/scrub: clarify that osd_scrub_auto_repair_num_errors counts objects
'osd_scrub_auto_repair_num_errors' limits the number of damaged objects
that we will try to auto-repair during a scrub. Its documentation
referred to "number of errors", which did not fit the implementation.
Fixes: https://tracker.ceph.com/issues/71754 Fixes: Red Hat BZ2316244 Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 680b58ffd0bf5b213ec525f8d783297fb0b14343)
Add an instruction that includes the --enable-auth flag in a "git orch
apply mgmt-gateway" command, in accordance with a request made by
afreen23 here: https://github.com/ceph/ceph/pull/60440#discussion_r1953530599
Improve the English in the "desc" field of the
"osd_deep_scrub_interval_cv" variable, as suggested by Anthony D'Atri in
https://github.com/ceph/ceph/pull/63490#discussion_r2124893516.
Zac Dover [Tue, 10 Jun 2025 03:04:13 +0000 (13:04 +1000)]
doc/rados: edit ops/user-management.rst
Edit an sentence in the imperative mood so that it matches the general
form of imperative sentences immediately preceding commands that contain
replaceable portions.
This commit targets only the Squid release branch.
Follows up on https://github.com/ceph/ceph/pull/58235/.
Zac Dover [Tue, 10 Jun 2025 02:54:18 +0000 (12:54 +1000)]
doc/mgr: edit telemetry.rst (lines 300-400)
Edit doc/mgr/telemetry.rst (lines 300-400).
Follow up on the suggestions made by Anthony D'Atri in
https://github.com/ceph/ceph/pull/63741 (except for the one about
including Lovecraftian lore in the dummy user data in this file).
Zac Dover [Tue, 10 Jun 2025 10:58:22 +0000 (20:58 +1000)]
doc/rados: enhance "pools.rst"
Add a link to the instructions for modifying a user's caps for a given
pool. Add this link where it makes sense to add it. Add this link where
the reader would naturally want to have the link.
Zac Dover [Tue, 10 Jun 2025 10:38:54 +0000 (20:38 +1000)]
doc/rbd: add mirroring troubleshooting info
Add a note to doc/rbd/rbd-mirroring.rst that directs the reader to set
both "site-a" and "site-b" to have the same pool names in the event that
rbd throws the error message "failed to import peer bootstrap token".
This information was reported to the Ceph upstream by Petr Tlapa in June
of 2025, and credit for its development goes to Petr.
Ronen Friedman [Wed, 4 Jun 2025 17:44:16 +0000 (12:44 -0500)]
osd/scrub: make m_session_started_at at Session state ctor
ScrubMachine::get_time_scrubbing() must access the Session object
to compute the scrub duration. But the State data is not externally
accessible before its ctor has completed.
As we always happen to try to access that data inside the ctor,
this always results in a warning log message.
Here we move m_session_started_at into the outer state, simplifying
the logic required to access it.
mgr/dashboard: fix KeyError exception in HardwareService.get_summary()
Typical error:
```
[dashboard ERROR exception] Internal Server Error
Traceback (most recent call last):
File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 48, in dashboard_exception_handler
return handler(*args, **kwargs)
File "/lib/python3.9/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 263, in inner
ret = func(*args, **kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/_rest_controller.py", line 193, in wrapper
return func(*vpath, **params)
File "/usr/share/ceph/mgr/dashboard/controllers/hardware.py", line 21, in summary
return HardwareService.get_summary(categories, hostname)
File "/usr/share/ceph/mgr/dashboard/services/hardware.py", line 33, in get_summary
'ok': sum(item['status']['health'] == 'OK' for items in data.values()
File "/usr/share/ceph/mgr/dashboard/services/hardware.py", line 33, in <genexpr>
'ok': sum(item['status']['health'] == 'OK' for items in data.values()
KeyError: 'status'
```
The recent change from commit `fbcdf571ca1` introduced this regression.