Adam C. Emerson [Wed, 26 Nov 2025 05:54:55 +0000 (00:54 -0500)]
neorados: On watch error, go through safe cookie interface to cancel
Possibly unnecessary defensive programming: at present I'm not sure of
a way this could happen, other than perhaps a shutdown while `watch_()`
is executing. In any case, adding a lookup to the error path isn't
really a problem.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Adam C. Emerson [Wed, 26 Nov 2025 02:54:58 +0000 (21:54 -0500)]
neorados: Go through linger_cancel on `io_context` shutdown
Rather than just dropping the reference, clean up the linger operation
properly within Objecter. Also, clear out handlers before
relinquishing reference to avoid use-after-free.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Kefu Chai [Thu, 4 Dec 2025 08:05:27 +0000 (16:05 +0800)]
mgr/PyModule: clear Python exception when NOTIFY_TYPES is missing
Commit 4589c4d8ac5 ("mgr: do not require NOTIFY_TYPES in python modules")
changed load_notify_types() to treat missing NOTIFY_TYPES as non-fatal,
but failed to clear the Python exception state set by PyObject_GetAttrString().
When PyObject_GetAttrString() fails to find an attribute, it:
1. Returns nullptr
2. Sets AttributeError in the interpreter's exception state
The exception state persists across subsequent Python C API calls until
explicitly cleared. This causes unrelated operations to fail with misleading
error messages.
Bug Manifestation:
------------------
In PyModule::load() (line 292), the call sequence is: load_notify_types(),
which triggers the failing PyObject_GetAttrString() described above, followed
by load_subclass_of(), which imports mgr_module.
The failure occurs because PyImport_ImportModule() at line 658 checks
PyErr_Occurred() before operating. When it detects the AttributeError still
pending from step 2 above, it refuses to import and returns NULL immediately.
This causes load_subclass_of() to report:
```
1 mgr[py] Loading python module 'balancer'
-1 mgr[py] Module balancer has missing NOTIFY_TYPES member
-1 mgr[py] Module not found: 'mgr_module'
-1 mgr[py] AttributeError: type object 'Module' has no attribute 'NOTIFY_TYPES'
```
This error is misleading - mgr_module exists and is importable, but the
import fails due to the uncleared exception from an unrelated operation.
Without this fix, modules without NOTIFY_TYPES may fail to load their
standby class, or cause subsequent module loads to fail entirely with
confusing error messages pointing to the wrong module.
Fix:
----
This patch follows Python's EAFP (Easier to Ask for Forgiveness than
Permission) philosophy by attempting the attribute access and handling the
expected AttributeError, rather than checking preconditions with
PyObject_HasAttrString() (LBYL approach). The EAFP approach:
- Is more pythonic and idiomatic
- Requires only one API call instead of two (more efficient)
- Properly distinguishes expected errors (AttributeError) from unexpected
errors (e.g., interpreter shutdown, memory errors)
- Handles unexpected errors appropriately by reporting and returning -EINVAL
Implementation:
- Use PyErr_ExceptionMatches() to check for expected AttributeError
- Clear expected exception with PyErr_Clear()
- Report unexpected exceptions via handle_pyerror()
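For illustration, a minimal sketch of the EAFP pattern described above
(function and variable names are hypothetical, not the actual PyModule.cc code):
```
#include <Python.h>
#include <cerrno>

// sketch: fetch an optional attribute, tolerating only AttributeError
int load_notify_types_sketch(PyObject* pClass)
{
  PyObject* notify_types = PyObject_GetAttrString(pClass, "NOTIFY_TYPES");
  if (!notify_types) {
    if (PyErr_ExceptionMatches(PyExc_AttributeError)) {
      // expected: the module simply has no NOTIFY_TYPES; clear it so
      // later C API calls (e.g. PyImport_ImportModule) don't trip over
      // a stale AttributeError
      PyErr_Clear();
      return 0;
    }
    // unexpected (interpreter shutdown, MemoryError, ...): report and fail
    PyErr_Print();  // stand-in for Ceph's handle_pyerror()
    return -EINVAL;
  }
  // ... parse the notify types ...
  Py_DECREF(notify_types);
  return 0;
}
```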
Refs: 4589c4d8ac5 ("mgr: do not require NOTIFY_TYPES in python modules")
Refs: https://tracker.ceph.com/issues/55835
Fixes: https://tracker.ceph.com/issues/70789
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
Casey Bodley [Tue, 2 Dec 2025 21:19:05 +0000 (16:19 -0500)]
rgw/rados: fix init/shutdown order for RGWIndexCompletionManager
RGWRados::init_complete() initializes this RGWIndexCompletionManager
after starting some background threads that depend on it, like RGWLC,
RGWDataSyncProcessorThread and RGWObjectExpirer. this can lead to a
crash when accessing a null index_completion_manager
RGWRados::finalize() deletes it before stopping the Restore thread
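As a sketch of the corrected ordering on both paths (generic names; the
workers stand in for RGWLC, RGWDataSyncProcessorThread, the Restore thread
and friends):
```
#include <memory>
#include <thread>
#include <vector>

struct IndexCompletionManager { /* completion machinery elided */ };

struct StoreSketch {
  std::unique_ptr<IndexCompletionManager> index_completion_manager;
  std::vector<std::thread> workers;

  void init_complete() {
    // create the manager *before* starting anything that uses it,
    // so no worker can observe a null pointer
    index_completion_manager = std::make_unique<IndexCompletionManager>();
    workers.emplace_back([this] { /* dereferences the manager */ });
  }

  void finalize() {
    // stop/join every worker before destroying the manager
    for (auto& w : workers) w.join();
    workers.clear();
    index_completion_manager.reset();
  }
};
```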
Fixes: https://tracker.ceph.com/issues/74067
Reported-by: J. Eric Ivancich <ivancich@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Patrick Donnelly [Thu, 13 Nov 2025 19:51:20 +0000 (14:51 -0500)]
include: detect corrupt frag from byteswap
If a big-endian MDS writes frag_t values into the metadata pool, these
will persist and confuse the MDS after it tries properly parsing them as
little-endian. Fortunately detecting this situation is fairly easy as we
restrict the number of bits and the number of bits restricts the mask
value.
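A hedged sketch of such a check, assuming a frag carries a 24-bit prefix
`value` plus a split depth `bits` (the exact frag_t layout differs; this
only shows the bits-restricts-mask idea):
```
#include <cstdint>

constexpr unsigned MAX_BITS = 24;  // assumed bound on split depth

bool frag_is_corrupt(uint32_t value, unsigned bits)
{
  if (bits > MAX_BITS)
    return true;  // impossible depth: almost certainly byteswapped
  // only the top `bits` bits of the 24-bit prefix may be set
  const uint32_t mask =
    bits ? (0xffffffu << (MAX_BITS - bits)) & 0xffffffu : 0;
  return (value & ~mask) != 0;
}
```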
Fixes: https://tracker.ceph.com/issues/73792
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Errors in do_transaction_no_callbacks fail with a vague abort message
showing only the transaction dump.
Instead, add a meaningful abort message with the actual error and method.
This could be replaced by assert_all_func, though we would likely
rewrite this method soon.
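A minimal sketch of the kind of abort message meant here (names are
illustrative; Ceph's actual macro is ceph_abort_msg):
```
#include <fmt/core.h>
#include <cstdlib>
#include <iostream>

[[noreturn]] void abort_with_context(int err, const char* method)
{
  // surface the actual error and the failing method, not just a dump
  std::cerr << fmt::format("{} failed with error {}, aborting\n",
                           method, err);
  std::abort();
}
```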
Nitzan Mordechai [Tue, 26 Aug 2025 14:30:12 +0000 (14:30 +0000)]
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and `erase` / `clear` did not correctly decref the values, leading
to use-after-free or leaks depending on usage.
Changes:
- Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
- Add `TTLCache::clear()` method for proper memory cleanup
- Ensure TTLCache::get() returns a new reference
- Fix misuse of std::move on c_str() in PyJSONFormatter
These changes prevent both memory leaks and use-after-free errors when
mgr modules use cached Python objects.
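A simplified sketch of the resulting ownership rules (the real TTLCache is
templated and tracks TTLs, elided here):
```
#include <Python.h>
#include <map>
#include <string>

class TTLCacheSketch {
  std::map<std::string, PyObject*> values;
public:
  void insert(const std::string& key, PyObject* v) {
    Py_INCREF(v);          // the cache owns one reference
    erase(key);            // drop any previous entry's reference
    values[key] = v;
  }
  PyObject* get(const std::string& key) {
    auto it = values.find(key);
    if (it == values.end()) return nullptr;
    Py_INCREF(it->second); // callers always get a *new* reference
    return it->second;
  }
  void erase(const std::string& key) {
    auto it = values.find(key);
    if (it == values.end()) return;
    Py_DECREF(it->second); // release the cache's reference
    values.erase(it);
  }
  void clear() {
    for (auto& [k, v] : values) Py_DECREF(v);
    values.clear();
  }
};
```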
Nitzan Mordechai [Wed, 20 Aug 2025 14:50:40 +0000 (14:50 +0000)]
mgr/prometheus: prune stale health checks, compress output
This patch introduces several improvements to the Prometheus module:
- Introduces `HealthHistory._prune()` to drop stale and inactive health checks.
  Limits the in-memory healthcheck dict to a configurable max_entries (default 1000);
  the TTL for stale entries is configurable via `healthcheck_history_stale_ttl` (default 3600s).
- Refactors HealthHistory.check() to use a unified iteration over known and current checks,
  improving concurrency and minimizing redundant updates.
- Uses cherrypy.tools.gzip instead of manual gzip.compress() for cleaner
  HTTP compression with proper header handling and client negotiation.
- Introduces new module options:
  - `healthcheck_history_max_entries`
  - `healthcheck_history_stale_ttl`
- Adds proper error handling for CherryPy engine startup failures.
- Removes the os._exit monkey patch in favor of proper exception handling.
- Removes manual Content-Type header setting (CherryPy handles this automatically).
Dhairya Parmar [Wed, 15 Oct 2025 11:43:27 +0000 (17:13 +0530)]
mds: update older versioned sr_t struct(s) to incorporate snapshot visibility flag
- If there is a cluster upgrade, the older sr_t structs' flags will not have the
  SNAPSHOT_VISIBILITY bit set, which means existing subvolumes report 0 instead
  of 1 for the vxattr ceph.dir.subvolume.snaps.visible. This doesn't lead to any
  actual denial of access to subvolume snapshots, since the client config to
  respect snapshot visibility defaults to false, but it might come as a surprise
  to a cluster operator who toggles that config to true and sees snapshot
  visibility denied (since `flags` reports 0). So, change the in-memory flags
  while decoding the sr_t members.
- Bump struct_v to 8 while encoding; while decoding, all structs older than v8
  are assigned the SNAPSHOT_VISIBILITY flag in-memory (sketched below).
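A hedged sketch of the decode-time shim (the bit position and the surrounding
decode machinery are illustrative, not the actual mdstypes code):
```
#include <cstdint>

constexpr uint32_t SNAPSHOT_VISIBILITY = 1u << 0;  // bit value illustrative

void sr_t_decode_flags_sketch(uint8_t struct_v, uint32_t& flags)
{
  // ... decode members according to struct_v ...
  if (struct_v < 8) {
    // pre-v8 encodings predate the flag: snapshots were always
    // visible, so reflect that in the in-memory flags
    flags |= SNAPSHOT_VISIBILITY;
  }
}
```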
Casey Bodley [Mon, 1 Dec 2025 15:25:16 +0000 (10:25 -0500)]
cmake: fix for -DWITH_BREAKPAD=OFF
in 1ba55a20be1023c585ba96617dc6a9d2aa79a51b, i tried to avoid the NOT
condition by swapping the option's defaults. but when the condition is
false, the option is forced to ON even if the user manually set it OFF
fix this by inverting the condition and swapping the default values
Reported-by: Joseph Mundackal <joseph.j.mundackal@gmail.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Imran Imtiaz [Mon, 1 Dec 2025 14:25:07 +0000 (14:25 +0000)]
mgr/dashboard: add API endpoint to delete images from consistency groups
Signed-off-by: Imran Imtiaz <imran.imtiaz@uk.ibm.com>
Fixes: https://tracker.ceph.com/issues/74033
Create a consistency group dashboard API endpoint that enables removal
of RBD images from the group.
Kefu Chai [Thu, 27 Nov 2025 12:28:31 +0000 (20:28 +0800)]
osd: fix ODR violation in max_prio_map
The static std::map max_prio_map was defined in the osd_types.h header
file, causing every translation unit that included this header to get
its own copy of the variable. This led to One Definition Rule (ODR)
violations where multiple instances of the same variable existed at
runtime.
During program cleanup, destructors for these multiple instances would
attempt to free the same memory regions, resulting in segmentation
faults in tcmalloc/memory allocator as seen with ceph-dencoder.
This issue surfaced after a change that converts erasure_code
and json_spirit to OBJECT libraries. Before that change, these were
STATIC libraries that were linked via target_link_libraries. The
incorrect linkage meant their object files (and thus their copies of
max_prio_map) were kept separate and didn't conflict at runtime.
After converting to OBJECT libraries and properly incorporating them
into libceph-common.so (commit 8b0e3fb2c23), the multiple copies of
max_prio_map from different translation units all ended up in the same
shared library, exposing the ODR violation. During program exit, the
dynamic linker attempted to run destructors for all instances, leading
to double-free crashes.
Fix by moving the map into a static helper function in PeeringState.cc
(the only file that uses it). The map is now a function-local static
const variable, ensuring a single instance that is properly initialized
and destructed.
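In sketch form (values illustrative):
```
// before (osd_types.h): every TU including the header got its own copy
//   static std::map<int, int> max_prio_map = { ... };

// after (PeeringState.cc): one instance, constructed on first use
#include <map>

static const std::map<int, int>& get_max_prio_map()
{
  // function-local static: single definition, thread-safe init
  // (C++11), destroyed exactly once at exit
  static const std::map<int, int> max_prio_map = {
    // { priority, max_priority } entries ...
  };
  return max_prio_map;
}
```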
Backtrace before fix:
```
#0 0x00007ffff7dbb1a0 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int) () from /lib/x86_64-linux-gnu/libtcmalloc.so.4
#1 0x00007ffff7dbb57f in tcmalloc::ThreadCache::Scavenge() () from /lib/x86_64-linux-gnu/libtcmalloc.so.4
#2 0x00007ffff6bc8aa2 in std::__new_allocator<std::_Rb_tree_node<std::pair<int const, int> > >::deallocate (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890, __n=1)
#3 0x00007ffff6bc89f9 in std::allocator<std::_Rb_tree_node<std::pair<int const, int> > >::deallocate (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890, __n=1)
#4 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<int const, int> > > >::deallocate (__a=..., __p=0x555555f43890, __n=1)
#5 std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_put_node (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890)
#6 0x00007ffff6bc892e in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_drop_node (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890)
#7 0x00007ffff6bc886e in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_erase (this=0x7ffff7d48f78 <max_prio_map>, __x=0x555555f43890)
#8 0x00007ffff6bc8854 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_erase (this=0x7ffff7d48f78 <max_prio_map>, __x=0x555555f43cb0)
#9 0x00007ffff6bc8854 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_erase (this=0x7ffff7d48f78 <max_prio_map>, __x=0x555555f43ad0)
#10 0x00007ffff6bc8805 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::~_Rb_tree (this=0x7ffff7d48f78 <max_prio_map>)
#11 0x00007ffff6bc7345 in std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >::~map (this=0x7ffff7d48f78 <max_prio_map>)
#12 0x00007ffff484bd51 in __cxa_finalize (d=0x7ffff7d3f440) at ./stdlib/cxa_finalize.c:97
#13 0x00007ffff6af9487 in __do_global_dtors_aux () from /home/kefu/dev/ceph/build/lib/libceph-common.so.2
#14 0x00007ffff7fbfd20 in ?? ()
#15 0x00007ffff7fc8fc2 in _dl_call_fini (closure_map=0x7fffffffd0f0, closure_map@entry=0x7ffff7fbfd20) at ./elf/dl-call_fini.c:43
#16 0x00007ffff7fcbe72 in _dl_fini () at ./elf/dl-fini.c:120
#17 0x00007ffff484c291 in __run_exit_handlers (status=0, listp=0x7ffff49f1680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:118
#18 0x00007ffff484c35a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:148
#19 0x00007ffff4833caf in __libc_start_call_main (main=main@entry=0x55555556cd90 <main(int, char const**)>, argc=argc@entry=2, argv=argv@entry=0x7fffffffd488) at ../sysdeps/nptl/libc_start_call_main.h:74
#20 0x00007ffff4833d65 in __libc_start_main_impl (main=0x55555556cd90 <main(int, char const**)>, argc=2, argv=0x7fffffffd488, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd478) at ../csu/libc-start.c:360
#21 0x00005555555695e1 in _start ()
```
Log buckets can now be created within erasure-coded (EC) pools.
To support append operations, a temporary log record object is initially
created in the replicated default.rgw.log pool. This object is then copied
to the EC pool when the log record is committed.
All implicit log commit operations will execute asynchronously. A new
BucketLoggingManager class is responsible for processing these pending
commits at set intervals. Explicit commit operations, however, will
continue to be performed synchronously.
Nizamudeen A [Tue, 25 Nov 2025 11:31:31 +0000 (17:01 +0530)]
mgr/dashboard: support custom prop for table item redirection
use an extra customTemplateConfig called `customRowProperty` where
you can provide the key of the property you wish to route to, instead
of relying on the cell's prop itself.
Fixes: https://tracker.ceph.com/issues/73989
Signed-off-by: Nizamudeen A <nia@redhat.com>
Ronen Friedman [Thu, 20 Nov 2025 13:54:20 +0000 (07:54 -0600)]
osd/scrub: do not attempt to read past the end of an object
When performing deep scrubs, the scrubber reads object data
in strides. Existing code uses a short read to detect the end
of the object (and if the object size is a multiple of the
stride - an extra read is performed, which returns 0 bytes).
The proposed change is to avoid such extra read attempts,
by using our knowledge of the object size.
Also - some minor code cleanups in the relevant function.
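A minimal sketch of the stride loop using the known size to clamp the
final read (names hypothetical):
```
#include <algorithm>
#include <cstdint>

void scrub_read_sketch(uint64_t object_size, uint64_t stride,
                       void (*read_chunk)(uint64_t off, uint64_t len))
{
  for (uint64_t off = 0; off < object_size; off += stride) {
    // the last read is clamped to the remaining bytes, so no extra
    // 0-byte read happens when the size is a multiple of the stride
    read_chunk(off, std::min(stride, object_size - off));
  }
}
```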
Alex Ainscow [Fri, 28 Nov 2025 14:33:13 +0000 (14:33 +0000)]
osd: Perform shard look up correctly in partial EC writes
Plugins are permitted to provide a mapping to change the order in which OSDs
are used. In practice only LRC does this and it is not currently enabled
with optimisations, so this is a theoretical bug.
The bug here was that the "first" shard was assumed to be shard_id_t(0). However,
this is not true for LRC.
Fixes: https://tracker.ceph.com/issues/74016
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Issue: Route was being force-reloaded using a two-step navigation hack causing unnecessary redirects and side effects.
Fix: Replaced the hack with Angular’s native same-URL reload using onSameUrlNavigation: 'reload' for a clean, stable route refresh.
Bill Scales [Fri, 21 Nov 2025 10:06:22 +0000 (10:06 +0000)]
qa: Reduce number of osd threads when using compression
Smithi nodes used by teuthology tests have 8 CPU cores and typically run
4 OSD processes. When bluestore software compression is enabled the size
of the OSD thread pool needs to be reduced to 2 threads per OSD because
these threads can easily use 100% of a core. This avoids excessive
amounts of context switches, which leads to OSD threads timing out,
which causes the OSD to drop heartbeat pings and for the monitor to
temporarily mark it down. In extreme cases this can lead to PGs getting
stuck in repeated loops of peering until the teuthology test times out.
Context switches happen oppurtunistically at the end of system calls
so functions with lots of logging are some of the worst affected.
Fixes: https://tracker.ceph.com/issues/72879
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:38:44 +0000 (10:38 +0000)]
osd: Restrict logging in MissingLoc::add_source_info
add_source_info can generate an excessive amount of logging
if a PG has thousands of missing objects. When a system is
under load and threads are repeatedly context switching this
can lead to timeouts (tests showed this function taking up
to 10 seconds to execute with 99% of that time being in
logging calls where the thread was being pre-empted).
Stopping logging after the function has been running for
more than 0.5 seconds strikes a balance between providing
sufficient information to debug problems and providing
more stability when a system is heavily loaded.
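A hedged sketch of that time budget (a plain stream stands in for dout):
```
#include <chrono>
#include <iostream>

void add_source_info_sketch(int num_missing)
{
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  const auto log_budget = std::chrono::milliseconds(500);
  for (int i = 0; i < num_missing; ++i) {
    // ... per-object work continues regardless ...
    if (clock::now() - start < log_budget) {
      std::cout << "considering object " << i << "\n";
    }
    // past the budget: keep working, just stop logging
  }
}
```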
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Fri, 21 Nov 2025 10:25:48 +0000 (10:25 +0000)]
osd: Increase log level for listing missing list
Logging the entire contents of a missing list can generate a
1M character log line when there are 8000 missing objects in a
PG. Other places in the code logging the missing list use debug
level 25 which is not enabled by default in teuthology tests.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Mon, 24 Nov 2025 09:18:21 +0000 (09:18 +0000)]
osd: reset_tp_timeout should reset timeout for all shards
ShardedThreadPools are only used by the classic OSD process
which can have more than one thread for the same shard. Each
thread has a heartbeat timeout used to detect stalled threads.
Some code that is known to take a long time makes calls to
reset_tp_timeout to reset this timeout. However for sharded
pools this can be ineffective because it is common for threads
for the same shard to use the same locks (e.g. PG Lock) and
therefore if thread A is taking a long time and resetting
its timeout while holding a lock, thread B for the same shard
is liable to be waiting for the same lock, will not be
resetting its timeout and can be timed out.
Debug for issue 72879 showed heartbeat timeouts occurring at
the same time for both shards; an attempt to fix the problem
by calling reset_tp_timeout for the slow thread still showed
the other threads for the shard timing out waiting for the PG
lock that was held by the slow thread. Looking at the OSD code,
in most places where reset_tp_timeout is called the thread is
holding the PG lock.
This commit moves the concept of shard_index from OSD into
ShardedThreadPool and modifies reset_tp_timeout so that it resets
the timeout for all threads for the same shard.
Some code calls reset_tp_timeout from inside loops that can take
a long time without consideration for how long the thread has
actually been running for. There is a risk that this type of
call could repeatedly reset the timeout for another shard which
is genuinely stuck and hence defeat the heartbeat checks. To
prevent this reset_tp_timeout is modified to be a NOP unless
the thread has been processing the current workitem for more
than 0.5 seconds. Therefore threads have to be slow but making
forward progress to be able to reset the timeout.
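A hedged sketch of the new behaviour (the real pool tracks HeartbeatMap
handles per worker; this only shows the shard-wide reset and the 0.5s guard):
```
#include <chrono>
#include <cstdint>
#include <vector>

struct WorkerSketch {
  uint32_t shard_index;
  std::chrono::steady_clock::time_point item_start;  // current workitem
  std::chrono::steady_clock::time_point deadline;    // heartbeat timeout
};

void reset_tp_timeout_sketch(std::vector<WorkerSketch>& workers,
                             const WorkerSketch& caller,
                             std::chrono::seconds grace)
{
  const auto now = std::chrono::steady_clock::now();
  // NOP unless the caller has been on its workitem for > 0.5s, so a
  // tight loop of resets cannot mask a genuinely stuck sibling thread
  if (now - caller.item_start < std::chrono::milliseconds(500))
    return;
  for (auto& w : workers) {
    if (w.shard_index == caller.shard_index)
      w.deadline = now + grace;  // reset every thread on this shard
  }
}
```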
Fixes: https://tracker.ceph.com/issues/72879
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Tue, 14 Oct 2025 08:24:56 +0000 (09:24 +0100)]
osdc: Add SplitOp capability to Objecter
This will provide the ability for Objecter to split up
certain ops and distribute them to the OSDs directly if
that provides a performance advantage.
This is experimental code and is switched off unless the
magic pool flags are enabled. These magic pool flags were
pushed in an earlier commit in the same PR.