Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
(cherry picked from commit cadd4a4f6cc376882555a31f0304766430ba9e6a)
(contributed as https://github.com/ceph/ceph/pull/64819)
* refs/pull/66482/head:
mgr/prometheus/test_module: Adding unit-test for new classes
mgr/prometheus: metrics header for standby module
mgr/prometheus: Use RLock to fix deadlock in HealthHistory
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
mgr/prometheus: prune stale health checks, compress output
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
* refs/pull/65913/head:
client: signal waitfor_commit waiters for write delegation enabled inode
test/libcephfs: add test for fsync on a write delegated inode
client: adjust `Fb` cap ref count check during synchronous fsync()
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Alex Ainscow [Wed, 18 Mar 2026 14:51:57 +0000 (14:51 +0000)]
src: Move the decision to build the ISA plugin to the top level make file
Previously, the first time you built ceph, common did not see the correct
value of WITH_EC_ISA_PLUGIN. As a consequence, global.yaml got
built with osd_erasure_code_plugins not including isa. This is not great
given it's our default plugin.
We considered simply removing this parameter from make entirely, but this
may require more discussion about supporting old hardware.
So the slightly ugly fix is to move this erasure-code specific declaration
to the top level.
Fixes: https://tracker.ceph.com/issues/75537
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit cecce28f16b0867ea8578a8f0c1478e24a40e525)
Backport 6dddf54 introduced a new connection feature bit
NVMEOF_BEACON_DIFF but there are plans (#66624) to make further
enhancements on that feature bit. This would cause the mons to crash
during upgrades.
However, this connection feature bit should not have been added to
begin with. The correct way to do this is to extend
e55ad7bce2fb85096cd31ff9846403f9dbd01e85 by @athanatos to require
`CEPH_MON_FEATURE_INCOMPAT_NVMEOF_BEACON_DIFF` if all mons support it.
This should be done by having mons add/update their supported features
in the MonMap via an update from `MMonJoin` (see for instance `crush_loc`
which was recently added to `mon_info_t`). Once the supported features
indicated for each mon in the `MonMap` show they understand the new
NVMEOF_BEACON_DIFF, then it should be turned on globally in the
`MonMap` as a required feature (added to the incompat set).
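The gating described above can be sketched roughly as follows. This is a toy model in Python, not Ceph's C++ monitor code: `MonInfo`, `MonMap`, `supported_features`, `required_features` and `maybe_require_feature` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical feature name standing in for the real incompat feature.
NVMEOF_BEACON_DIFF = "nvmeof_beacon_diff"


@dataclass
class MonInfo:
    name: str
    # Features this mon has reported (in the proposal, via MMonJoin).
    supported_features: set = field(default_factory=set)


@dataclass
class MonMap:
    mons: dict = field(default_factory=dict)
    required_features: set = field(default_factory=set)


def maybe_require_feature(monmap, feature):
    """Add `feature` to the required set only once every mon supports it."""
    if not monmap.mons:
        return False
    if all(feature in m.supported_features for m in monmap.mons.values()):
        monmap.required_features.add(feature)
        return True
    return False


monmap = MonMap(mons={
    "a": MonInfo("a", {NVMEOF_BEACON_DIFF}),
    "b": MonInfo("b"),  # not upgraded yet
})
enabled_during_upgrade = maybe_require_feature(monmap, NVMEOF_BEACON_DIFF)

# Mon "b" is upgraded and reports the feature.
monmap.mons["b"].supported_features.add(NVMEOF_BEACON_DIFF)
enabled_after_upgrade = maybe_require_feature(monmap, NVMEOF_BEACON_DIFF)
```

In this model the feature stays off while any mon is still on the old version, which is the property that avoids crashes in the middle of an upgrade.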
Conflicts:
src/mon/NVMeofGwMon.h: conflicts with header change from 19c9be2
fix missing header change in #66584
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Ilya Dryomov [Sun, 1 Mar 2026 21:55:52 +0000 (22:55 +0100)]
qa/workunits/rbd: short-circuit status() if "ceph -s" fails
In mirror-thrash tests, status() can be invoked after one of the
clusters is effectively stopped due to a watchdog bark:
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.rbd_mirror.[cluster2] failed
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
...
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ status
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ local cluster daemon image_pool image_ns image
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ for cluster in ${CLUSTER1} ${CLUSTER2}
In this scenario all commands that are invoked from the loop body
are going to time out anyway.
Ilya Dryomov [Sun, 1 Mar 2026 16:45:51 +0000 (17:45 +0100)]
qa: rbd_mirror_fsx_compare.sh doesn't error out as expected
In mirror-thrash tests, one of the clusters can be effectively stopped
due to a watchdog bark while rbd_mirror_fsx_compare.sh is running and is
in the middle of the "wait for all images" loop:
In this scenario "rbd ls" is going to time out repeatedly, turning the
loop into up to a ~60-hour sleep (up to 720 iterations with a 5-minute
timeout + 10-second sleep per iteration).
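As a quick check of the figure above, the worst case can be computed from the numbers quoted in the message (the variable names are illustrative):

```python
# Worst-case duration of the "wait for all images" loop, using the
# figures quoted in the commit message.
ITERATIONS = 720
TIMEOUT_SECONDS = 5 * 60  # each "rbd ls" call times out after 5 minutes
SLEEP_SECONDS = 10        # sleep between iterations

total_seconds = ITERATIONS * (TIMEOUT_SECONDS + SLEEP_SECONDS)
total_hours = total_seconds / 3600

print(f"{total_seconds} s = {total_hours:.0f} h")  # 223200 s = 62 h
```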
Ilya Dryomov [Fri, 27 Feb 2026 14:18:27 +0000 (15:18 +0100)]
qa/tasks: make rbd_mirror_thrash inherit from ThrasherGreenlet
Commit 21b4b89e5280 ("qa/tasks: watchdog terminate thrasher") made it
required for a thrasher to have a stop_and_join() method, but the
preceding commit a035b5a22fb8 ("thrashers: standardize stop and join
method names") failed to add it to rbd_mirror_thrash (whether as an
ad-hoc implementation or by way of inheriting from ThrasherGreenlet).
Later on, commit 783f0e3a9903 ("qa: Adding a new class for the
daemonwatchdog to monitor") worsened the issue by expanding the use
of stop_and_join() to all watchdog barks rather than just the case of
a thrasher throwing an exception which is something that practically
never happens.
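A minimal sketch of the failure mode, assuming a watchdog that calls stop_and_join() on every thrasher. The classes below are hypothetical stand-ins and do not reproduce the real qa/tasks classes or their greenlet machinery:

```python
class AdHocThrasher:
    """Hypothetical thrasher that only knows how to stop."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class JoinableThrasher:
    """Hypothetical base class that provides stop_and_join()."""

    def __init__(self):
        self.stopped = False
        self.joined = False

    def stop(self):
        self.stopped = True

    def join(self):
        self.joined = True

    def stop_and_join(self):
        self.stop()
        self.join()


class InheritingThrasher(JoinableThrasher):
    """Gets stop_and_join() for free by inheriting from the base class."""


def bark(thrashers):
    """Toy watchdog bark: return names of thrashers that could not be stopped."""
    failed = []
    for thrasher in thrashers:
        try:
            thrasher.stop_and_join()
        except AttributeError:
            failed.append(type(thrasher).__name__)
    return failed


failed = bark([AdHocThrasher(), InheritingThrasher()])
```

Only the thrasher without the method ends up in `failed`; inheriting from the common base class removes the problem without an ad-hoc implementation.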
client: adjust `Fb` cap ref count check during synchronous fsync()
The cephfs client holds a ref on Fb caps when handing out a write delegation[0].
As a result, fsync from a (Ganesha) client holding a write delegation will block indefinitely[1]
waiting for the cap ref for Fb to drop to 0, which will never happen until the
delegation is returned/recalled.
If an inode has been write delegated, adjust the cap reference count
check in fsync() accordingly.
Note: This only works for synchronous fsync() since `client_lock` is
held for the entire duration of the call (at least until the path leading
up to the reference count check). Asynchronous fsync() needs to be fixed
separately (as that can drop `client_lock`).
mgr/prometheus: Use RLock to fix deadlock in HealthHistory
The HealthHistory.check() method acquires the lock and then calls
HealthHistory.save(), which also tries to acquire the same lock.
With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock).
Switch to RLock to allow nested acquisition by the same thread.
PR #65245 added the locks.
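The nested acquisition can be reproduced with a small stand-in class. `HealthHistorySketch` is hypothetical and only mirrors the check()/save() call pattern described above; save() uses a non-blocking acquire so that the Lock() case reports failure instead of hanging:

```python
import threading


class HealthHistorySketch:
    """Hypothetical stand-in mirroring the check() -> save() call pattern."""

    def __init__(self, lock):
        self.lock = lock
        self.saved = 0

    def save(self):
        # Non-blocking so the plain Lock() case returns False instead of
        # blocking forever on a lock this thread already holds.
        if not self.lock.acquire(blocking=False):
            return False
        try:
            self.saved += 1
            return True
        finally:
            self.lock.release()

    def check(self):
        with self.lock:
            return self.save()


with_lock = HealthHistorySketch(threading.Lock()).check()    # False
with_rlock = HealthHistorySketch(threading.RLock()).check()  # True
```

With a blocking acquire, the `Lock()` variant would never return from check(), which is the deadlock the commit fixes.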
Nitzan Mordechai [Tue, 26 Aug 2025 14:30:12 +0000 (14:30 +0000)]
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and `erase` / `clear` did not correctly decref the values, leading
to use-after-free or leaks depending on usage.
Changes:
- Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
- Add `TTLCache::clear()` method for proper memory cleanup
- Ensure TTLCache::get() returns a new reference
- Fix misuse of std::move on c_str() in PyJSONFormatter
These changes prevent both memory leaks and use-after-free errors when
mgr modules use the cached Python objects logic.
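The ownership rule behind these changes (a container that stores an object holds one reference to it, and gives it up on erase/clear) can be observed from Python itself. This is only an illustration using a plain dict under CPython, not the C++ TTLCache; `Payload` is a hypothetical placeholder type:

```python
import sys


class Payload:
    """Hypothetical cached value."""


value = Payload()
cache = {}

before = sys.getrefcount(value)
cache["key"] = value             # the container takes its own reference
during = sys.getrefcount(value)
cache.clear()                    # ... and drops it again on cleanup
after = sys.getrefcount(value)
```

A C++ container holding raw `PyObject*` values has to do the equivalent bookkeeping by hand, which is what moving the incref into insert and the decref into erase/clear achieves.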
Nitzan Mordechai [Wed, 20 Aug 2025 14:50:40 +0000 (14:50 +0000)]
mgr/prometheus: prune stale health checks, compress output
This patch introduces several improvements to the Prometheus module:
- Introduces `HealthHistory._prune()` to drop stale and inactive health checks.
Limits the in-memory healthcheck dict to a configurable max_entries (default 1000).
TTL for stale entries is configurable via `healthcheck_history_stale_ttl` (default 3600s).
- Refactors HealthHistory.check() to use a unified iteration over known and current checks,
improving concurrency and minimizing redundant updates.
- Use cherrypy.tools.gzip instead of manual gzip.compress() for cleaner
HTTP compression with proper header handling and client negotiation.
- Introduces new module options:
- `healthcheck_history_max_entries`
- Add proper error handling for CherryPy engine startup failures
- Remove os._exit monkey patch in favor of proper exception handling
- Remove manual Content-Type header setting (CherryPy handles automatically)
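The pruning behaviour in the first item can be sketched as below. This is not the module's implementation: the class, the entry layout (`active`, `last_seen`) and the eviction order when the size cap is exceeded (least recently seen first) are assumptions made for illustration; only the defaults of 1000 entries and 3600 seconds come from the message.

```python
import time


class HealthHistorySketch:
    """Hypothetical stand-in for the health check history."""

    def __init__(self, max_entries=1000, stale_ttl=3600.0):
        self.max_entries = max_entries
        self.stale_ttl = stale_ttl
        # name -> {"active": bool, "last_seen": float}
        self.healthcheck = {}

    def prune(self, now=None):
        """Drop stale inactive entries, then enforce the size cap."""
        now = time.time() if now is None else now

        # 1. Inactive checks not seen within the TTL are stale.
        stale = [name for name, entry in self.healthcheck.items()
                 if not entry["active"]
                 and now - entry["last_seen"] > self.stale_ttl]
        for name in stale:
            del self.healthcheck[name]

        # 2. Assumed policy: evict the least recently seen entries first.
        excess = max(len(self.healthcheck) - self.max_entries, 0)
        by_age = sorted(self.healthcheck,
                        key=lambda name: self.healthcheck[name]["last_seen"])
        for name in by_age[:excess]:
            del self.healthcheck[name]

        return len(stale) + excess


history = HealthHistorySketch(max_entries=2, stale_ttl=3600)
history.healthcheck = {
    "OLD_INACTIVE": {"active": False, "last_seen": 0},
    "A": {"active": True, "last_seen": 100},
    "B": {"active": True, "last_seen": 200},
    "C": {"active": True, "last_seen": 300},
}
removed = history.prune(now=5000)
remaining = sorted(history.healthcheck)
```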
* refs/pull/67318/head:
qa/multisite: use boto3's ClientError in place of assert_raises from tools.py.
qa/multisite: test fixes
qa/multisite: boto3 in tests.py
qa/multisite: zone files use boto3 resource api
qa/multisite: switch to boto3 in multisite test libraries
Ilya Dryomov [Tue, 24 Feb 2026 11:46:35 +0000 (12:46 +0100)]
librbd/mirror: detect trashed snapshots in UnlinkPeerRequest
If two instances of UnlinkPeerRequest race with each other (e.g. due
to rbd-mirror daemon unlinking from a previous mirror snapshot and the
user taking another mirror snapshot at the same time), the snapshot that
UnlinkPeerRequest was created for may be in the process of being removed
(which may mean trashed by SnapshotRemoveRequest::trash_snap()) or fully
removed by the time unlink_peer() grabs the image lock. Because trashed
snapshots weren't handled explicitly, UnlinkPeerRequest could spuriously
fail with EINVAL ("not mirror snapshot" case) instead of the expected
ENOENT ("missing snapshot" case). This in turn could lead to spurious
ImageReplayer failures with it stopping prematurely.
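The intended error classification can be sketched as follows. This is a toy model, not librbd code: the dict layout and the `namespace` values ("mirror", "trash", "user") are hypothetical.

```python
import errno


def unlink_peer_precheck(snaps, snap_id):
    """Classify the snapshot before unlinking a peer from it.

    `snaps` maps snapshot id -> {"namespace": ...}; layout is hypothetical.
    """
    snap = snaps.get(snap_id)
    if snap is None:
        return -errno.ENOENT   # fully removed: "missing snapshot" case
    if snap["namespace"] == "trash":
        return -errno.ENOENT   # being removed: treat as missing too
    if snap["namespace"] != "mirror":
        return -errno.EINVAL   # "not mirror snapshot" case
    return 0


snaps = {
    1: {"namespace": "mirror"},
    2: {"namespace": "trash"},
    3: {"namespace": "user"},
}
results = {sid: unlink_peer_precheck(snaps, sid) for sid in (1, 2, 3, 4)}
```

Handling the trashed case before the "not mirror snapshot" check is what keeps a racing removal from surfacing as EINVAL.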
ImageUpdateWatchers::flush() requests aren't tracked with
m_in_flight-like mechanism the way ImageUpdateWatchers::send_notify()
requests are, but in both cases callbacks that represent delayed work
that is very likely to (indirectly) reference ImageCtx are involved.
When the image is getting closed, ImageUpdateWatchers::shut_down() is
called before anything that belongs to ImageCtx is destroyed. However,
the shutdown can complete prematurely in the face of a pending flush if
one gets sent shortly before CloseRequest is invoked. The callback for
that flush will then race with CloseRequest and may execute after parts
of or even the entire ImageCtx is destroyed, leading to use-after-free
and various segfaults.
Ilya Dryomov [Thu, 19 Feb 2026 14:45:39 +0000 (15:45 +0100)]
test: disable known flaky tests in run-rbd-unit-tests
The failures seem to be more frequent on newer hardware. In the
absence of immediate fixes, disable a few tests that have been known to
be flaky for a long time to avoid disrupting "make check" runs.
Patrick Donnelly [Thu, 26 Feb 2026 20:17:06 +0000 (15:17 -0500)]
Merge PR #66540 into tentacle
* refs/pull/66540/head:
include: detect corrupt frag from byteswap
test/encoding: print context on diff failure
mds: dump frag_t as an object
common/frag: produce valid fragments for test instances
common: simplify fragment printing
common: properly convert frag_t to net/store endianness
mds: include sysinfo in status command output
include/frag.h: un-inline methods to reduce header dependencies
Tested-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Thu, 26 Feb 2026 20:16:07 +0000 (15:16 -0500)]
Merge PR #65358 into tentacle
* refs/pull/65358/head:
qa: Disable a test for kernel mount
qa: Run test_admin with the squid client
src/test/mds: Fix TestMDSAuthCaps
client: Fix the multifs auth caps check
mds: Fix multifs auth caps check
qa: Fix validation of client_version
qa: Test cross fs access by single client in multifs
Patrick Donnelly [Thu, 26 Feb 2026 13:53:27 +0000 (08:53 -0500)]
Merge PR #67333 into tentacle
* refs/pull/67333/head:
test/test_bluefs: make a standalone test case to reproduce bug#74765
os/bluestore: update volume selector after recovering BlueFS WAL in
test/test_bluefs: reproduce volume selector inconsistency after
os/bluestore: move RocksDBBlueFSVolumeSelector to BlueFS.cc
os/bluestore: rename/repurpose bluefs_check_volume_selector_on_umount setting.
os/bluestore/bluefs: Fix stat() for WAL envelope mode
Igor Fedotov [Wed, 11 Feb 2026 16:36:20 +0000 (19:36 +0300)]
test/test_bluefs: reproduce volume selector inconsistency after
recovering WAL in envelope mode.
Reproduces: https://tracker.ceph.com/issues/74765
Exact bug reproduction (which occurs on file removal) requires
'bluefs_check_volume_selector_on_mount' to be set manually to false
in 'wal_v2_simulate_crash' test case.
That's not the case for the default implementation, which fails volume selector
validation at an earlier stage during BlueFS mount.
This seems a general approach which provides better test coverage.
Laura Flores [Fri, 28 Feb 2025 06:04:42 +0000 (06:04 +0000)]
osd: add pg-upmap-primary to clean_pg_upmaps
This commit adds handling of pg_upmap_primary to `clean_pg_upmaps`
so invalid mappings are properly removed. Invalid mappings occur
when pools are deleted, the PG size is decreased, or an OSD is
marked down/out.
Test cases cover cases where:
1. PG num is reduced (all pg_upmap_primary mappings should be canceled)
2. An OSD outside of the up set has been mapped as the primary
3. An OSD is marked "out"
4. A mapping is redundant
5. A pool is deleted
Run the unit test with:
$ ninja unittest_osdmap
$ ./bin/unittest_osdmap --gtest_filter=*CleanPGUpmapPrimaries* --debug_osd=10
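The five cases listed above can be modelled in a few lines. This is a toy sketch, not the OSDMap code: the data layout, the `shrinking` flag (standing in for however the real code detects a PG num reduction) and the reading of "redundant" as "the mapped OSD is already the primary" are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    pg_num: int
    shrinking: bool = False  # hypothetical flag: PG num is being reduced


def clean_pg_upmap_primaries(mappings, pools, up_sets, out_osds):
    """Return the pg_upmap_primary mappings that survive cleaning.

    mappings: {(pool_id, pg_seed): primary_osd}
    pools:    {pool_id: Pool}
    up_sets:  {(pool_id, pg_seed): [osd, ...]}, first entry is the primary
    out_osds: set of OSD ids marked "out"
    """
    kept = {}
    for pgid, osd in mappings.items():
        pool = pools.get(pgid[0])
        if pool is None:
            continue  # case 5: pool was deleted
        if pool.shrinking:
            continue  # case 1: cancel every mapping in the pool
        up = up_sets.get(pgid, [])
        if osd not in up:
            continue  # case 2: OSD is outside of the up set
        if osd in out_osds:
            continue  # case 3: OSD is marked "out"
        if up[0] == osd:
            continue  # case 4: redundant, OSD is already the primary
        kept[pgid] = osd
    return kept


pools = {1: Pool(pg_num=8), 2: Pool(pg_num=8, shrinking=True)}
up_sets = {(1, i): [1, 2, 3] for i in range(4)}
up_sets[(2, 0)] = [1, 2, 3]
mappings = {
    (1, 0): 2,  # valid
    (1, 1): 9,  # not in the up set
    (1, 2): 3,  # marked out
    (1, 3): 1,  # already the primary
    (2, 0): 2,  # pool is shrinking
    (3, 0): 2,  # pool no longer exists
}
kept = clean_pg_upmap_primaries(mappings, pools, up_sets, out_osds={3})
```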
Alex Ainscow [Wed, 11 Feb 2026 18:11:12 +0000 (18:11 +0000)]
mon: Deny EC optimizations (fast EC) for non-4k-aligned chunk-sizes.
There are some bugs in the way Fast EC handles non 4k-aligned chunk sizes.
Such chunk sizes are buggy and even if they did work, the performance
would not be very good. Storage efficiency is also not helped by these
unusual encodings.
This commit will reject any attempt to turn optimizations (fast EC) on.
If the default is set to turn optimizations on, then this will be ignored
if the profile is not 4k aligned.
Note that creating such a profile in the first place requires a --force override.
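The decision described above can be sketched as a small function. This is an illustration of the stated rules only, not the monitor's code; the function name, its parameters and the use of ValueError for a rejected request are hypothetical.

```python
ALIGNMENT = 4096  # 4k


def effective_ec_optimizations(chunk_size, requested, default_on):
    """Decide whether EC optimizations (fast EC) end up enabled.

    requested: True/False for an explicit request, None to use the default.
    """
    aligned = chunk_size % ALIGNMENT == 0
    if requested is None:
        # A default of "on" is silently ignored for unaligned chunk sizes.
        return default_on and aligned
    if requested and not aligned:
        # An explicit attempt to turn optimizations on is rejected.
        raise ValueError(f"chunk size {chunk_size} is not 4k-aligned")
    return requested


aligned_default = effective_ec_optimizations(8192, None, default_on=True)
unaligned_default = effective_ec_optimizations(5000, None, default_on=True)
try:
    effective_ec_optimizations(5000, True, default_on=False)
    rejected = False
except ValueError:
    rejected = True
```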
Fixes: https://tracker.ceph.com/issues/74813
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 442b45295f707b8d155caf5d1d51afd4664900db)
Kefu Chai [Fri, 21 Nov 2025 11:50:26 +0000 (19:50 +0800)]
debian: remove invoke-rc.d calls from postrm scripts
Previously, we called "invoke-rc.d ceph stop" in postrm scripts to
support sysvinit-based installations. However, this fails on systemd-
based systems, which are now the default on modern Debian distributions.
When invoke-rc.d detects systemd as the init system, it delegates to
systemctl, converting "invoke-rc.d ceph stop" to "systemctl stop
ceph.service". Since Ceph provides ceph.target and template units
(ceph-osd@.service, ceph-mon@.service, etc.) rather than a monolithic
ceph.service, this command always fails with exit code 5
(LSB EXIT_NOTINSTALLED).
This failure prevents the auto-generated cleanup sections added by
debhelper from executing properly, which can leave the system in an
inconsistent state during package removal.
Changes:
- Remove the invoke-rc.d call entirely. Systemd will handle service
cleanup through its own mechanisms when the package is removed.
- Remove redundant "exit 0" statement (the script exits successfully
by default, and "set -e" is no longer needed without commands that
might fail).
- Remove vim modeline comment, as it's unnecessary for a 3-line script.
- Eventually remove this 3-line script stanza, as this is exactly what
debhelper provides us.
Conflicts:
src/test/rgw/rgw_multi/tests.py
- Assert changes
- We still don't replicate object locks in tentacle
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Shilpa Jagannath [Wed, 28 Jan 2026 00:04:19 +0000 (19:04 -0500)]
qa/multisite: test fixes
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
(cherry picked from commit 90eb0612fd83547b20a1e1eeae8f2384526be508)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Conflicts:
src/test/rgw/rgw_multi/tests.py
- Whitespace and other drift
- Tentacle doesn't sync object locks
- `test_suspended_delete_marker_incremental_sync` is a new test but
not a new behavior.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Shilpa Jagannath [Wed, 21 Jan 2026 06:50:01 +0000 (01:50 -0500)]
qa/multisite: zone files use boto3 resource api
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
(cherry picked from commit 47ad27cb2f373ee5057a8b2fd1acaaa5073729b7)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Shilpa Jagannath [Tue, 20 Jan 2026 23:42:30 +0000 (18:42 -0500)]
qa/multisite: switch to boto3 in multisite test libraries
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
(cherry picked from commit c84efd9af5a4e2845745285ada994aad072eb2a0)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
The existing StandardPolicy that is exposed as the RBD_LOCK_MODE_EXCLUSIVE
argument to rbd_lock_acquire() disables automatic exclusive lock
transitions with "permanent" semantics: any request to release the lock
causes the peer to error out immediately. Such a lock owner can
perform maintenance operations that are proxied from other peers, but
any write-like I/O issued by other peers will fail with EROFS.
This isn't suitable for use cases where one of the peers wants to
manage exclusive lock manually (i.e. rbd_lock_acquire() is used) but
the lock is acquired only for very short periods of time. The rest of
the time the lock is expected to be held by other peers that stay in
the default "auto" mode (AutomaticPolicy) and run as usual, completely
unconcerned with each other or the manual-mode peer. However, these
peers become acutely aware of the manual-mode peer because when it grabs
the lock with RBD_LOCK_MODE_EXCLUSIVE their I/O gets disrupted: higher
layers translate EROFS into generic EIO, filesystems shut down, etc.
Add a new TransientPolicy exposed as RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT
to allow disabling automatic exclusive lock transitions with semantics
that would cause the other peers to block waiting for the lock to be
released by the manual-mode peer. This is intended to be a low-level
interface -- no attempt to safeguard against potential misuse causing
e.g. indefinite blocking is made.
It's possible to switch between RBD_LOCK_MODE_EXCLUSIVE and
RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT modes of operation both while the
lock is held and after it's released.
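The difference between the three policies, as seen by another peer issuing a write while the lock is held, can be summarised in a toy model. Nothing below calls librbd; the enum and the outcome labels are hypothetical and only restate the semantics described above.

```python
import errno
from enum import Enum


class LockPolicy(Enum):
    AUTOMATIC = "auto"
    STANDARD = "RBD_LOCK_MODE_EXCLUSIVE"
    TRANSIENT = "RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT"


def peer_write_outcome(owner_policy):
    """What a write from another peer runs into, per lock owner policy."""
    if owner_policy is LockPolicy.AUTOMATIC:
        return ("proceeds", 0)            # lock transitions automatically
    if owner_policy is LockPolicy.STANDARD:
        return ("fails", -errno.EROFS)    # peer errors out immediately
    if owner_policy is LockPolicy.TRANSIENT:
        return ("blocks", 0)              # peer waits for the lock release
    raise ValueError(owner_policy)


outcomes = {p.name: peer_write_outcome(p)[0] for p in LockPolicy}
```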
Ilya Dryomov [Mon, 19 Jan 2026 16:43:41 +0000 (17:43 +0100)]
librbd: prepare lock_acquire() for changing between policies
In preparation for adding a new TransientPolicy, get rid of the check
implemented in terms of exclusive_lock::Policy::may_auto_request_lock()
that essentially makes it so that exclusive lock policy on a given
image handle can be changed from the default AutomaticPolicy only once.
In order to effect another change a new image handle would have been
needed which is pretty suboptimal.
Ilya Dryomov [Mon, 22 Dec 2025 18:07:27 +0000 (19:07 +0100)]
librbd: fix RequestLockPayload log message in ImageWatcher
exclusive_lock::Policy::lock_requested() isn't guaranteed to queue
the release of exclusive lock (and in fact only one of the two existing
implementations does that). Instead of talking about the lock, log the
response to the notification.
Ilya Dryomov [Mon, 22 Dec 2025 16:22:53 +0000 (17:22 +0100)]
librbd: amend error message in lock_acquire()
... since it went stale with commit 2914eef50d69 ("rbd: Changed
exclusive-lock implementation to use the new managed-lock"). In the
context of exclusive lock, requesting the lock refers to a specific
action which may or may not be performed as part of acquiring the lock
and lock_acquire() doesn't get visibility into that.
Ilya Dryomov [Tue, 11 Nov 2025 20:39:58 +0000 (21:39 +0100)]
qa/valgrind.supp: make gcm_cipher_internal suppression more resilient
gcm_cipher_internal() and ossl_gcm_stream_final() make it to the stack
trace only on CentOS Stream 9. On Ubuntu 22.04 and Rocky 10, it looks
as follows:
Thread 4 msgr-worker-1:
Conditional jump or move depends on uninitialised value(s)
at 0x70A36D4: ??? (in /usr/lib64/libcrypto.so.3.2.2)
by 0x70A39A1: ??? (in /usr/lib64/libcrypto.so.3.2.2)
by 0x6F8A09C: EVP_DecryptFinal_ex (in /usr/lib64/libcrypto.so.3.2.2)
by 0xB498C1F: ceph::crypto::onwire::AES128GCM_OnWireRxHandler::authenticated_decrypt_update_final(ceph::buffer::v15_2_0::list&) (crypto_onwire.cc:271)
by 0xB4992D7: ceph::msgr::v2::FrameAssembler::disassemble_preamble(ceph::buffer::v15_2_0::list&) (frames_v2.cc:281)
by 0xB482D98: ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int) (ProtocolV2.cc:1149)
by 0xB475318: ProtocolV2::run_continuation(Ct<ProtocolV2>&) (ProtocolV2.cc:54)
by 0xB457012: AsyncConnection::process() (AsyncConnection.cc:495)
by 0xB49E61A: EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (Event.cc:492)
by 0xB49EA9D: UnknownInlinedFun (Stack.cc:50)
by 0xB49EA9D: UnknownInlinedFun (invoke.h:61)
by 0xB49EA9D: UnknownInlinedFun (invoke.h:111)
by 0xB49EA9D: std::_Function_handler<void (), NetworkStack::add_thread(Worker*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:290)
by 0xBB11063: ??? (in /usr/lib64/libstdc++.so.6.0.33)
by 0x4F17119: start_thread (in /usr/lib64/libc.so.6)
The proposal to amend the existing suppression so that it's tied to the
specific callsite rather than libcrypto internals [1] received a thumbs
up from Radoslaw.
Ilya Dryomov [Tue, 11 Nov 2025 15:33:16 +0000 (16:33 +0100)]
qa/tasks/qemu: install genisoimage package
genisoimage is expected to be included in our base images but currently
isn't on Rocky 10. Since it's quite a niche thing, let's install the
package explicitly.