Ilya Dryomov [Mon, 19 Jan 2026 16:43:41 +0000 (17:43 +0100)]
librbd: prepare lock_acquire() for changing between policies
In preparation for adding a new TransientPolicy, get rid of the check
implemented in terms of exclusive_lock::Policy::may_auto_request_lock()
that essentially makes it so that exclusive lock policy on a given
image handle can be changed from the default AutomaticPolicy only once.
In order to effect another change a new image handle would have been
needed which is pretty suboptimal.
Ilya Dryomov [Mon, 22 Dec 2025 18:07:27 +0000 (19:07 +0100)]
librbd: fix RequestLockPayload log message in ImageWatcher
exclusive_lock::Policy::lock_requested() isn't guaranteed to queue
the release of exclusive lock (and in fact only one of the two existing
implementations does that). Instead of talking about the lock, log the
response to the notification.
Ilya Dryomov [Mon, 22 Dec 2025 16:22:53 +0000 (17:22 +0100)]
librbd: amend error message in lock_acquire()
... since it went stale with commit 2914eef50d69 ("rbd: Changed
exclusive-lock implementation to use the new managed-lock"). In the
context of exclusive lock, requesting the lock refers to a specific
action which may or may not be performed as part of acquiring the lock
and lock_acquire() doesn't get visibility into that.
Ville Ojamo [Mon, 5 Jan 2026 06:10:45 +0000 (13:10 +0700)]
doc: Remove sphinxcontrib-seqdiag Python package from RTD builds
This is a proactive PR to avoid breaking docs builds when Setuptools 81
starts to be used in the RTD builds process.
The sphinxcontrib-seqdiag Python package is not compatible with
Setuptools 81 or later due to use of pkg_resources:
https://setuptools.pypa.io/en/latest/pkg_resources.html
Setuptools 81 release should be imminent, with the Python deprecation
warning stating pkg_resources "removal as early as 2025-11-30".
Seqdiag seems to be unmaintained: the latest update on PyPI was in
2021 and there have been no updates to the seqdiag git repo either.
There are no seqdiag directives left in the docs after last seqdiags
were removed in PR #52308.
Two other options would exist for fixing the situation (see PR for
discussion) but this seems to be the suitable one.
Kyr Shatskyy [Fri, 21 Nov 2025 21:20:04 +0000 (22:20 +0100)]
qa/workunits/rgw: drop netstat usage
`netstat` is deprecated on modern Linux and usually requires an
extra package to be installed. Usually that package is `net-tools`,
but on openSUSE, for example, `net-tools` does not include
`netstat`. Thus, let us use `ss` as an alternative.
When using `netstat -nltp` we get lines like:
'tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 25156/valgrind.bin \ntcp6 0 0 :::443 :::* LISTEN 25156/valgrind.bin \n'
When using `ss -nltp` we get lines like:
'LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("memcheck-amd64-",pid=66045,fd=72))'
so we need to filter processes by `memcheck`. The rest of the
parsing code, however, works the same way as it did for netstat.
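As an illustration of the filtering described above, here is a minimal Python sketch that extracts listening ports from `ss -nltp` output for processes matching `memcheck`. The function name `listening_ports` and the `process_hint` parameter are hypothetical; this is not the code from the workunit, only a sketch that assumes the six-column layout shown in the sample line.

```python
# Sample line copied from the commit message above.
SS_LINE = 'LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("memcheck-amd64-",pid=66045,fd=72))'


def listening_ports(ss_output, process_hint="memcheck"):
    """Return local ports of listening sockets owned by a matching process.

    Hypothetical helper: assumes the `ss -nltp` column order
    State, Recv-Q, Send-Q, Local:Port, Peer:Port, Process.
    """
    ports = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a listening socket.
        if len(fields) < 6 or fields[0] != "LISTEN":
            continue
        # Keep only sockets whose process column mentions the hint.
        if process_hint not in fields[5]:
            continue
        # The port is whatever follows the last colon of the local address.
        port = fields[3].rpartition(":")[2]
        if port.isdigit():
            ports.append(int(port))
    return ports


print(listening_ports(SS_LINE))
```

Filtering on the process column is needed because `ss` reports the truncated executable name (`memcheck-amd64-`) rather than `valgrind.bin` as `netstat` did.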
Alex Ainscow [Fri, 2 Jan 2026 18:47:37 +0000 (18:47 +0000)]
osd/ECUtil: Fix erase_after_ro_offset length calculation and add tests
System test logs showed EC recovery failures with assertion errors when
recovering small objects (smaller than stripe width) in EC pools.
The recovery would fail with "shard_size >= tobj_size" assertions
because shards that should be empty incorrectly contained data.
The primary change in this commit fixes a bug in
shard_extent_map_t::erase_after_ro_offset() where the length
calculation was incorrect:
When ro_offset < ro_start, the incorrect calculation caused data that
should be erased to remain on shards, leading to recovery failures.
Additionally, this commit adds 13 comprehensive unit tests to TestECUtil
that thoroughly exercise erase_after_ro_offset across various edge cases,
including the critical scenario of objects smaller than stripe width where
some shards should remain empty. These tests successfully catch the bug
when it is re-introduced.
Note: The unit tests in this commit were generated with assistance from
an LLM (Large Language Model) and subsequently validated and refined.
Fixes: https://tracker.ceph.com/issues/74329
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit a60c86ae0a8db3f37832de779341d1df7510d9dd)
Afreen Misbah [Mon, 2 Feb 2026 18:56:10 +0000 (00:26 +0530)]
mgr/dashboard: Reverting server_addr to traddr
- the server_addr is used as traddr
- the change in https://github.com/ceph/ceph/pull/66822 is not backported yet, hence renaming to the convention used in tentacle
Afreen Misbah [Tue, 13 Jan 2026 20:47:40 +0000 (02:17 +0530)]
mgr/dashboard: Add productive card component
- add generic productive card component
- based on carbon design system
- there are two versions of the card: with shadow (tinted effect) and without.
- applies gray10 theme which is decided by new designs.
Ilya Dryomov [Fri, 30 Jan 2026 15:32:35 +0000 (16:32 +0100)]
qa/tasks/rbd_mirror_thrash: don't use random.randrange() on floats
This stopped working in Python 3.12:
Changed in version 3.12: Automatic conversion of non-integer types
is no longer supported. Calls such as randrange(10.0) and
randrange(Fraction(10, 1)) now raise a TypeError.
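A minimal sketch of two ways to avoid passing floats to `random.randrange()`, assuming the caller wants either a float delay or a whole number of seconds. The helper names `random_delay` and `random_whole_seconds` are hypothetical and are not taken from the rbd_mirror_thrash task; which variant the actual fix uses is not stated in the commit message.

```python
import random


def random_delay(min_delay, max_delay, rng=random):
    """Pick a float delay in [min_delay, max_delay].

    random.uniform() accepts floats, unlike random.randrange().
    """
    return rng.uniform(min_delay, max_delay)


def random_whole_seconds(max_seconds, rng=random):
    """Pick an integer in [0, int(max_seconds)).

    The explicit int() conversion replaces the automatic conversion
    that randrange() no longer performs in Python 3.12.
    """
    return rng.randrange(int(max_seconds))


print(random_delay(1.0, 2.5))
print(random_whole_seconds(10.0))
```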
Afreen Misbah [Mon, 15 Dec 2025 15:53:44 +0000 (21:23 +0530)]
mgr/dashboard: Fix display of IP address in host page
- Hosts data is merged with the hosts' facts, which do not include the address, so it is not displayed in the UI
- Hence the value is empty in the API
- Caused by https://github.com/ceph/ceph/pull/65102
Ilya Dryomov [Thu, 29 Jan 2026 20:41:03 +0000 (21:41 +0100)]
qa/workunits/rbd: reduce randomized sleeps in live import tests
These tests were tuned for slower hardware than what we have now.
Currently "rbd migration execute" always finishes (successfully) before
the NBD server is killed.
Ilya Dryomov [Wed, 28 Jan 2026 09:41:13 +0000 (10:41 +0100)]
qa/workunits/rbd: drop randomized sleeps in "big image" tests
These tests were tuned for slower hardware than what we have now.
Even without these the image is often 25-30% synced by the time the
test gets to the "non-primary snapshot in question is still being
synced" assert.
Ilya Dryomov [Tue, 27 Jan 2026 20:56:23 +0000 (21:56 +0100)]
qa/workunits/rbd: avoid unnecessary sleeping in stop_mirror()
There is no need to wait for anything if -KILL is passed for sig
because the process would disappear immediately. In teuthology runs
where multiple rbd-mirror daemons are deployed (and therefore need to
be stopped when stop_mirrors() is called by the test), it causes
gratuitous delays of 4+ seconds.
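The idea can be sketched in Python as follows. The real stop_mirror() is part of the workunit helpers and is not reproduced here; `stop_process`, its parameters, and the `subprocess.Popen`-like `proc` object are assumptions made for illustration only.

```python
import signal
import time


def stop_process(proc, sig=signal.SIGTERM, grace=4.0, poll=0.1, sleep=time.sleep):
    """Send `sig` to `proc` and return how long we waited for it to exit.

    Hypothetical helper: `proc` is anything with send_signal() and poll(),
    such as subprocess.Popen. `grace` and `poll` are illustrative values.
    """
    proc.send_signal(sig)
    if sig == signal.SIGKILL:
        # Following the commit's reasoning: a killed process disappears
        # immediately, so there is nothing to wait for.
        return 0.0
    waited = 0.0
    # For other signals, poll until the process exits or the grace
    # period runs out.
    while proc.poll() is None and waited < grace:
        sleep(poll)
        waited += poll
    return waited
```

With several daemons to stop, skipping the wait for -KILL removes a fixed delay per daemon rather than a one-off cost.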
Leonid Chernin [Thu, 9 Oct 2025 05:24:20 +0000 (08:24 +0300)]
nvmeofgw: fast-failover changes
beacon timeouts are measured also in prepare_beacon function to be able
to correctly implement shorter timeout values
default failover detection time was set to 7 sec
default beacon tick was set to 1 second
changed condition for detection of ceph slowness in NVMeofgwMon
Afreen Misbah [Wed, 28 Jan 2026 09:59:08 +0000 (15:29 +0530)]
mgr/dashboard: fetch all namespaces in a gateway group
- adds a new API /api/gateway_group/{group}/namespace
- updates tests
- needed for UI flows and in general to fetch all namespaces; the existing API could not be changed because backward compatibility has to be maintained
- in a followup PR will add server side pagination
Jon Bailey [Tue, 27 Jan 2026 14:29:05 +0000 (14:29 +0000)]
osd: Fix issue where it is possible for stats to be recovered incorrectly during merge operations.
To hit the problem, after you take a snapshot, you:
* Perform a write
* Perform a partial write that only involves the primary
* Perform a partial write that only involves a non-primary
* Primary goes down
* Primary comes up
* Primary goes through peering and chooses a non-primary shard as its peering partner
The result of these operations is the stats reporting a size difference equal to the partial write that only involves the primary: the non-primary is not aware of the clone operation by design, and so stats that are missing that update are copied to the OSD. This commit prevents it by invalidating the stats in the case where this happens. There will be a future commit to further narrow the set of cases where stats invalidations can happen.
Imran Imtiaz [Thu, 8 Jan 2026 10:37:32 +0000 (10:37 +0000)]
mgr/dashboard: fix RBD mirror schedule inheritance in pool and image APIs
Signed-off-by: Imran Imtiaz <imran.imtiaz@uk.ibm.com>
Fixes: https://tracker.ceph.com/issues/74494
Fix the bug where the Pool API was reporting random image schedules
instead of pool schedules. Implement proper schedule inheritance
hierarchy (Image > Pool > Cluster) for both Pool and Image APIs.
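The inheritance order (Image > Pool > Cluster) can be illustrated with a short Python sketch. The function `effective_schedule` and its return shape are hypothetical and do not reflect the dashboard's actual API; the sketch only shows the "most specific level wins" rule under the assumption that an unset schedule is represented as None.

```python
def effective_schedule(image_schedule=None, pool_schedule=None, cluster_schedule=None):
    """Return (level, schedule) for the most specific schedule that is set.

    Hypothetical helper illustrating Image > Pool > Cluster precedence.
    """
    candidates = (
        ("image", image_schedule),
        ("pool", pool_schedule),
        ("cluster", cluster_schedule),
    )
    for level, schedule in candidates:
        if schedule is not None:
            return level, schedule
    return None, None


# An image without its own schedule inherits from the pool.
print(effective_schedule(None, "every 1h", "every 1d"))
# A pool-level query never looks at image schedules at all.
print(effective_schedule(pool_schedule=None, cluster_schedule="every 1d"))
```

Keeping the pool-level lookup separate from image schedules is what avoids reporting an arbitrary image's schedule as the pool's.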
Nitzan Mordechai [Thu, 13 Nov 2025 14:03:58 +0000 (14:03 +0000)]
qa/workunits: add Rocky Linux support to librados tests
Add Rocky Linux to the list of supported RPM-based distributions in
test_librados_build.sh and version_number_sanity.sh. Rocky Linux uses
the same package names and commands as CentOS/RHEL, so it can use the
existing RPM codepath.
Without this change, the tests fail on Rocky Linux systems with
"unknown distro" errors.
Leonid Chernin [Mon, 8 Dec 2025 20:54:44 +0000 (22:54 +0200)]
nvmeofgw: prevent map corruption while processing beacons from deleted gws
Fix a race that corrupts the map when a deleted gw sends beacons
after its data was removed from the pending map while it still exists
in the main map. Process beacons only if the GW's data exists in both
maps, main-map and pending-map; otherwise just ignore the beacons.
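A minimal Python sketch of the guard described above, assuming both maps can be modelled as dictionaries keyed by a gateway id. The names `should_process_beacon`, `main_map`, and `pending_map` are hypothetical stand-ins, not the monitor's actual data structures.

```python
def should_process_beacon(gw_id, main_map, pending_map):
    """Process a beacon only if the gateway exists in both maps.

    Hypothetical guard: a gateway that was deleted from the pending map
    but is still present in the main map must have its beacons ignored.
    """
    return gw_id in main_map and gw_id in pending_map


main_map = {"gw1": {}, "gw2": {}}
pending_map = {"gw1": {}}  # gw2 was deleted, deletion not yet committed

print(should_process_beacon("gw1", main_map, pending_map))
print(should_process_beacon("gw2", main_map, pending_map))
```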
Samuel Just [Thu, 6 Nov 2025 23:54:50 +0000 (23:54 +0000)]
mon: add NVMEOF_BEACON_DIFF to mon_feature_t and mon CompatSet
In order for the client to safely send BEACON_DIFF messages, it
needs to be the case that the leader at the time of receipt will
support BEACON_DIFF.
Simply using the connection features for the MonClient's target mon is
insufficient, because it might be a peon. If the peon supports
BEACON_DIFF and the leader does not, the leader will either crash or
interpret it as a full BEACON. Neither outcome is acceptable.
Instead, we need to wire up a feature bit to the MonMap mon_feature_t
members and the CompatSet.
Adding FEATURE_BEACON_DIFF to ceph::features::mon get_supported()
and get_persistent() ensures that once all monitors in the quorum
support it, MonMap::get_required_features() will include it.
See Elector::propose_to_peers, Monitor::(win|lose)_election,
MonmapMonitor::apply_mon_features.
Once FEATURE_BEACON_DIFF is present in MonMap::get_required_features():
- Monitor::apply_monmap_to_compatset_features() will prevent
downgrades of the monitors by updating the CompatSet to include
CEPH_MON_FEATURE_INCOMPAT_NVMEOF_BEACON_DIFF
- Monitor::calc_quorum_requirements() will set
Monitor::required_features to require the NVMEOF_BEACON_DIFF
for any monitor peers.
- MonClient::get_monmap_required_features() will eventually include
ceph::features::mon::FEATURE_NVMEOF_BEACON_DIFF.
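The gating logic can be sketched in Python with plain bitmasks. This is only a conceptual model: the bit value, `quorum_required_features`, and `can_send_beacon_diff` are hypothetical, and the real mon_feature_t and CompatSet machinery is considerably richer than a single integer.

```python
from functools import reduce

# Hypothetical bit value, chosen only for this sketch.
FEATURE_NVMEOF_BEACON_DIFF = 1 << 0


def quorum_required_features(per_mon_supported):
    """Features that every monitor in the quorum supports.

    Modelled as the bitwise AND of each monitor's supported set, so a
    feature becomes required only once all monitors have it.
    """
    if not per_mon_supported:
        return 0
    return reduce(lambda a, b: a & b, per_mon_supported)


def can_send_beacon_diff(monmap_required_features):
    """The client may send BEACON_DIFF only if the monmap requires it."""
    return bool(monmap_required_features & FEATURE_NVMEOF_BEACON_DIFF)


# One monitor has not been upgraded yet: the client must keep sending
# full beacons, whichever monitor it happens to be connected to.
mixed = quorum_required_features([FEATURE_NVMEOF_BEACON_DIFF, 0, FEATURE_NVMEOF_BEACON_DIFF])
print(can_send_beacon_diff(mixed))

upgraded = quorum_required_features([FEATURE_NVMEOF_BEACON_DIFF] * 3)
print(can_send_beacon_diff(upgraded))
```

Checking the monmap's required features rather than the connection features is what protects against the peon/leader mismatch described above.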
Leonid Chernin [Mon, 15 Sep 2025 11:04:04 +0000 (14:04 +0300)]
nvmeofgw: beacon diff implementation in the monitor and in the MonClient.
- monclient encodes subsystems by beacon-diff rules if the BEACON_DIFF
  bit is enabled by the quorum
- monitor processes beacons using the new beacon-diff schema
- monitor detects the sequence out-of-order (ooo) condition and handles it
- when ooo is detected, the monitor sends an ack to the gw with the expected correct sequence
- monitor skips failovers for some interval when ooo is detected
- monitor ignores all beacons with incorrect sequences until the gw sends the expected one
- coding upgrade rules
Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
Fixes: https://tracker.ceph.com/issues/72394
(cherry picked from commit 3555a28e45c5b44289f12abe2fc843e21c7ebf87)
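The out-of-order handling listed in the beacon diff commit above can be sketched in Python. The class `BeaconSequenceTracker`, its return values, and the choice to accept whatever sequence number a gateway's first beacon carries are assumptions made for illustration; the commit message does not describe the monitor's actual data structures, and the failover-skipping interval is left out.

```python
class BeaconSequenceTracker:
    """Hypothetical per-gateway sequence check for beacon-diff messages."""

    def __init__(self):
        # gw_id -> next sequence number the monitor expects
        self.expected = {}

    def on_beacon(self, gw_id, seq):
        """Return ("process", seq) or ("ack_expected", expected_seq)."""
        # Assumption: the first beacon from a gateway is accepted as-is.
        expected = self.expected.get(gw_id, seq)
        if seq != expected:
            # Out of order: ignore the beacon and tell the gateway which
            # sequence number the monitor is waiting for.
            return ("ack_expected", expected)
        self.expected[gw_id] = seq + 1
        return ("process", seq)


tracker = BeaconSequenceTracker()
print(tracker.on_beacon("gw1", 0))
print(tracker.on_beacon("gw1", 2))  # sequence 1 was lost
print(tracker.on_beacon("gw1", 1))  # gateway resends the expected one
```

Since a diff is only meaningful relative to the previous beacon, the sketch applies nothing until the expected sequence arrives.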