mgr/DaemonServer: Re-order OSDs in crush bucket to maximize OSDs for upgrade
DaemonServer::_maximize_ok_to_upgrade_set() attempts to find which OSDs
from the initial set found as part of _populate_crush_bucket_osds() can be
upgraded as part of the initial phase. If the initial set results in failure,
the convergence logic trims the 'to_upgrade' vector from the end until a safe
set is found.
Therefore, it would be advantageous to sort the OSDs in ascending order of the
number of PGs they host. By placing OSDs with the fewest (or no) PGs at the
beginning of the vector, the trim logic along with _check_offlines_pgs() will
have the best chance of finding OSDs to upgrade as it approaches a grouping
of OSDs that have the fewest or no PGs.
To achieve the above, a temporary vector of struct pgs_per_osd is created and
sorted for a given crush bucket. The sorted OSDs are pushed to the main
crush_bucket_osds that is eventually used to run the _check_offlines_pgs()
logic to find a safe set of OSDs to upgrade.
pgmap is passed to _populate_crush_bucket_osds() to utilize get_num_pg_by_osd()
for the above logic to work.
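The ordering described above can be illustrated with a minimal Python sketch. It is not the DaemonServer code: the PgsPerOsd class, the order_osds_by_pg_count() helper and the num_pg_by_osd dict (standing in for pgmap.get_num_pg_by_osd()) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PgsPerOsd:
    # Hypothetical stand-in for the temporary 'struct pgs_per_osd'.
    osd_id: int
    num_pgs: int


def order_osds_by_pg_count(osd_ids, num_pg_by_osd):
    """Return osd_ids sorted by ascending number of hosted PGs.

    num_pg_by_osd is a plain dict used here in place of
    pgmap.get_num_pg_by_osd(); OSDs missing from it count as 0 PGs.
    """
    tmp = [PgsPerOsd(osd, num_pg_by_osd.get(osd, 0)) for osd in osd_ids]
    # list.sort() is stable, so OSDs with equal PG counts keep their
    # original relative order within the bucket.
    tmp.sort(key=lambda entry: entry.num_pgs)
    return [entry.osd_id for entry in tmp]


if __name__ == "__main__":
    # OSD 0 is absent from the dict and is treated as hosting no PGs.
    print(order_osds_by_pg_count([3, 1, 2, 0], {3: 40, 1: 0, 2: 12}))
```

Because the trim logic removes entries from the end of the vector, the OSDs hosting the most PGs are the first to be dropped from the candidate set.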
Implement a new Mgr command called 'ok-to-upgrade' that returns a set of OSDs
within the provided CRUSH bucket that are safe to upgrade without reducing
immediate data availability.
The command accepts the following as input:
- CRUSH bucket name (required)
- The CRUSH bucket type is limited to 'rack', 'chassis', 'host' and 'osd'.
This is to prevent users from specifying a bucket type higher up the tree
which could result in performance issues if the number of OSDs in the
bucket is very high.
- The new Ceph version to check against. The format accepted is the short
form of the Ceph version, e.g. 20.3.0-3803-g63ca1ffb5a2. (required)
- The maximum number of OSDs to consider if specified. (optional)
Implementation Details:
After sanity checks on the provided parameters, the following steps are
performed:
1. The set of OSDs within the CRUSH bucket is first determined.
2. From the main set of OSDs, a filtered set of OSDs not yet running the new
Ceph version is created.
- For this purpose, the OSD's 'ceph_version_short' string is read from
the metadata using a new method called
DaemonServer::get_osd_metadata(). The information is determined
from the DaemonStatePtr maintained within the DaemonServer.
3. If all OSDs are already running the new Ceph version, a success report is
generated and returned.
4. If OSDs are not running the new Ceph version, a new set (to_upgrade) is
created.
5. If the current version cannot be determined, an error is logged and the
output report with 'bad_no_version' field populated with the OSD in question
is generated.
6. On the new set (to_upgrade), the existing logic in _check_offlines_pgs() is
executed to see if stopping any or all OSDs in the set as part of the upgrade
can reduce immediate data availability.
- If data availability is impacted, then the number of OSDs in the filtered
set is reduced by a factor defined by a new config option called
'mgr_osd_upgrade_check_convergence_factor' which is set to 0.8 by default.
- The logic in _check_offlines_pgs() is repeated for the new set.
- The above is repeated until a safe subset of OSDs that can be stopped for
upgrade is found. Each iteration reduces the number of OSDs to check by
the convergence factor mentioned above.
7. It must be noted that the default value of
'mgr_osd_upgrade_check_convergence_factor' is on the higher side in order to
help determine an optimal set of OSDs to upgrade. In other words, a higher
convergence factor would help maximize the number of OSDs to upgrade. In this
case, the number of iterations and therefore the time taken to determine the
OSDs to upgrade is proportional to the number of OSDs in the CRUSH bucket.
The converse is true if a lower convergence factor is used.
8. If the number of OSDs determined is lower than the 'max' specified, then an
additional loop is executed to determine if other children of the CRUSH
bucket can be added to the existing set.
9. Once a viable set is determined, an output report similar to the following is
generated:
A standalone test is introduced that exercises the logic for both replicated
and erasure-coded pools by manipulating the min_size of a pool and checking for
upgradability. The test also performs other basic sanity checks and covers error
conditions.
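The convergence loop from step 6 of the implementation details can be sketched in Python as follows. This is only an illustration under stated assumptions: find_safe_upgrade_set() and the is_safe callback (standing in for the offline PG check) are hypothetical names, and truncating the reduced size with int() is an assumption, since the commit message does not say how the size is rounded.

```python
def find_safe_upgrade_set(to_upgrade, is_safe, convergence_factor=0.8):
    """Trim 'to_upgrade' from the end until is_safe() accepts the set.

    is_safe is a hypothetical callback standing in for the offline PG
    check; it returns True when stopping the given OSDs is safe.
    Returns the safe prefix, or an empty list if none is found.
    """
    candidates = list(to_upgrade)
    while candidates:
        if is_safe(candidates):
            return candidates
        # Shrink by the convergence factor (rounding down is an assumption).
        new_len = int(len(candidates) * convergence_factor)
        # Guarantee progress even for factors at or close to 1.0.
        if new_len >= len(candidates):
            new_len = len(candidates) - 1
        candidates = candidates[:new_len]
    return []


if __name__ == "__main__":
    # Pretend that at most 4 OSDs can be stopped at once.
    print(find_safe_upgrade_set(list(range(10)), lambda osds: len(osds) <= 4))
```

With ten candidates and the default factor of 0.8, the set sizes checked in this sketch are 10, 8, 6 and 4. A lower factor reaches a safe set in fewer iterations but may skip over a larger set that would also have been safe.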
The output shown below is for a cluster running on a single node with 10 OSDs
and with replicated pool configuration:
mgr/DaemonServer: Modify offline_pg_report to handle set or vector types
The offline_pg_report structure to be used by both the 'ok-to-stop' and
'ok-to-upgrade' commands is modified to handle either std::set or std::vector
type containers. This is necessitated due to the differences in the way
both commands work. For the 'ok-to-upgrade' command logic to work optimally,
the items in the specified crush bucket including items found in the subtree
must be strictly ordered. The earlier std::set container re-orders the items
upon insertion by sorting them, which causes the offline PG check to
report sub-optimal results.
Therefore, the offline_pg_report struct is modified to use
std::variant<std::vector<int>, std::set<int>> as a ContainerType and handled
accordingly in dump() using std::visit(). This ensures backward compatibility
with the existing 'ok-to-stop' command while catering to the requirements of
the new 'ok-to-upgrade' command.
Conflicts:
src/mgr/DaemonServer.cc
- In DaemonServer::_check_offlines_pgs(), resolve a compiler error
when printing the list of osds which is of type std::variant. Since the
tentacle branch has not yet transitioned to use std::variant
instead of boost::variant, the operator<< is not handled like in the
main branch (see commit: 09d2731). Therefore, use std::visit() to
print the list of osds that could be of types std::set or
std::vector by using a lambda and passing the CephContext pointer
and a pointer to the offline pg report.
The existing StandardPolicy that is exposed as RBD_LOCK_MODE_EXCLUSIVE
argument to rbd_lock_acquire() disables automatic exclusive lock
transitions with "permanent" semantics: any request to release the lock
causes the peer to error out immediately. Such a lock owner can
perform maintenance operations that are proxied from other peers, but
any write-like I/O issued by other peers will fail with EROFS.
This isn't suitable for use cases where one of the peers wants to
manage exclusive lock manually (i.e. rbd_lock_acquire() is used) but
the lock is acquired only for very short periods of time. The rest of
the time the lock is expected to be held by other peers that stay in
the default "auto" mode (AutomaticPolicy) and run as usual, completely
unconcerned with each other or the manual-mode peer. However, these
peers become acutely aware of the manual-mode peer because when it grabs
the lock with RBD_LOCK_MODE_EXCLUSIVE their I/O gets disrupted: higher
layers translate EROFS into generic EIO, filesystems shut down, etc.
Add a new TransientPolicy exposed as RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT
to allow disabling automatic exclusive lock transitions with semantics
that would cause the other peers to block waiting for the lock to be
released by the manual-mode peer. This is intended to be a low-level
interface -- no attempt to safeguard against potential misuse causing
e.g. indefinite blocking is made.
It's possible to switch between RBD_LOCK_MODE_EXCLUSIVE and
RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT modes of operation both while the
lock is held and after it's released.
Ilya Dryomov [Mon, 19 Jan 2026 16:43:41 +0000 (17:43 +0100)]
librbd: prepare lock_acquire() for changing between policies
In preparation for adding a new TransientPolicy, get rid of the check
implemented in terms of exclusive_lock::Policy::may_auto_request_lock()
that essentially makes it so that exclusive lock policy on a given
image handle can be changed from the default AutomaticPolicy only once.
In order to effect another change, a new image handle would have been
needed, which is pretty suboptimal.
Ilya Dryomov [Mon, 22 Dec 2025 18:07:27 +0000 (19:07 +0100)]
librbd: fix RequestLockPayload log message in ImageWatcher
exclusive_lock::Policy::lock_requested() isn't guaranteed to queue
the release of exclusive lock (and in fact only one of the two existing
implementations does that). Instead of talking about the lock, log the
response to the notification.
Ilya Dryomov [Mon, 22 Dec 2025 16:22:53 +0000 (17:22 +0100)]
librbd: amend error message in lock_acquire()
... since it went stale with commit 2914eef50d69 ("rbd: Changed
exclusive-lock implementation to use the new managed-lock"). In the
context of exclusive lock, requesting the lock refers to a specific
action which may or may not be performed as part of acquiring the lock
and lock_acquire() doesn't get visibility into that.
Ilya Dryomov [Tue, 11 Nov 2025 20:39:58 +0000 (21:39 +0100)]
qa/valgrind.supp: make gcm_cipher_internal suppression more resilient
gcm_cipher_internal() and ossl_gcm_stream_final() make it to the stack
trace only on CentOS Stream 9. On Ubuntu 22.04 and Rocky 10, it looks
as follows:
Thread 4 msgr-worker-1:
Conditional jump or move depends on uninitialised value(s)
at 0x70A36D4: ??? (in /usr/lib64/libcrypto.so.3.2.2)
by 0x70A39A1: ??? (in /usr/lib64/libcrypto.so.3.2.2)
by 0x6F8A09C: EVP_DecryptFinal_ex (in /usr/lib64/libcrypto.so.3.2.2)
by 0xB498C1F: ceph::crypto::onwire::AES128GCM_OnWireRxHandler::authenticated_decrypt_update_final(ceph::buffer::v15_2_0::list&) (crypto_onwire.cc:271)
by 0xB4992D7: ceph::msgr::v2::FrameAssembler::disassemble_preamble(ceph::buffer::v15_2_0::list&) (frames_v2.cc:281)
by 0xB482D98: ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int) (ProtocolV2.cc:1149)
by 0xB475318: ProtocolV2::run_continuation(Ct<ProtocolV2>&) (ProtocolV2.cc:54)
by 0xB457012: AsyncConnection::process() (AsyncConnection.cc:495)
by 0xB49E61A: EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (Event.cc:492)
by 0xB49EA9D: UnknownInlinedFun (Stack.cc:50)
by 0xB49EA9D: UnknownInlinedFun (invoke.h:61)
by 0xB49EA9D: UnknownInlinedFun (invoke.h:111)
by 0xB49EA9D: std::_Function_handler<void (), NetworkStack::add_thread(Worker*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:290)
by 0xBB11063: ??? (in /usr/lib64/libstdc++.so.6.0.33)
by 0x4F17119: start_thread (in /usr/lib64/libc.so.6)
The proposal to amend the existing suppression so that it's tied to the
specific callsite rather than libcrypto internals [1] received a thumbs
up from Radoslaw.
Ilya Dryomov [Tue, 11 Nov 2025 15:33:16 +0000 (16:33 +0100)]
qa/tasks/qemu: install genisoimage package
genisoimage is expected to be included in our base images but currently
isn't on Rocky 10. Since it's quite a niche thing, let's install the
package explicitly.
Ville Ojamo [Mon, 5 Jan 2026 06:10:45 +0000 (13:10 +0700)]
doc: Remove sphinxcontrib-seqdiag Python package from RTD builds
This is a proactive PR to avoid breaking docs builds when Setuptools 81
starts to be used in the RTD builds process.
The sphinxcontrib-seqdiag Python package is not compatible with
Setuptools 81 or later due to use of pkg_resources:
https://setuptools.pypa.io/en/latest/pkg_resources.html
Setuptools 81 release should be imminent, with the Python deprecation
warning stating pkg_resources "removal as early as 2025-11-30".
Seqdiag seems to be unmaintained, with the latest update on PyPI in
the year 2021 and also no updates to the seqdiag git repo.
There are no seqdiag directives left in the docs after the last seqdiags
were removed in PR #52308.
Two other options would exist for fixing the situation (see PR for
discussion) but this seems to be the suitable one.
Kyr Shatskyy [Fri, 21 Nov 2025 21:20:04 +0000 (22:20 +0100)]
qa/workunits/rgw: drop netstat usage
`netstat` is deprecated in modern Linux distributions and usually
requires an extra package to be installed.
Usually that package is `net-tools`; however, on openSUSE for example,
`net-tools` does not include `netstat`. Thus, let us use `ss` as
an alternative.
When using `netstat -nltp` we get lines like:
'tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 25156/valgrind.bin \ntcp6 0 0 :::443 :::* LISTEN 25156/valgrind.bin \n'
When using `ss -nltp` we get lines like:
'LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("memcheck-amd64-",pid=66045,fd=72))'
so we need to filter processes by `memcheck`. However, the subsequent
parsing code works the same way as it did for netstat.
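As a rough illustration of the filtering described above, the following Python sketch extracts the listening ports owned by a `memcheck` process from `ss -nltp` style output. It is not the workunit's code: the listening_ports() helper is a hypothetical name, and the column layout is assumed to match the sample line quoted above.

```python
def listening_ports(ss_output, process_substr="memcheck"):
    """Return the set of local ports in LISTEN state whose process
    column contains process_substr.

    Assumes 'ss -nltp' style lines:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port Process
    """
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a listener.
        if len(fields) < 6 or fields[0] != "LISTEN":
            continue
        # The process column may contain spaces, so rejoin the tail.
        if process_substr not in " ".join(fields[5:]):
            continue
        # The port follows the last colon (also holds for '[::]:443').
        port = fields[3].rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return ports


if __name__ == "__main__":
    sample = 'LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("memcheck-amd64-",pid=66045,fd=72))'
    print(listening_ports(sample))
```

Matching on the substring `memcheck` rather than `valgrind` reflects the truncated process name that `ss` reports in the sample line.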
Alex Ainscow [Fri, 2 Jan 2026 18:47:37 +0000 (18:47 +0000)]
osd/ECUtil: Fix erase_after_ro_offset length calculation and add tests
System test logs showed EC recovery failures with assertion errors when
recovering small objects (smaller than stripe width) in EC pools.
The recovery would fail with "shard_size >= tobj_size" assertions
because shards that should be empty incorrectly contained data.
The primary change in this commit fixes a bug in
shard_extent_map_t::erase_after_ro_offset() where the length
calculation was incorrect:
When ro_offset < ro_start, the incorrect calculation caused data that
should be erased to remain on shards, leading to recovery failures.
Additionally, this commit adds 13 comprehensive unit tests to TestECUtil
that thoroughly exercise erase_after_ro_offset across various edge cases,
including the critical scenario of objects smaller than stripe width where
some shards should remain empty. These tests successfully catch the bug
when it is re-introduced.
Note: The unit tests in this commit were generated with assistance from
an LLM (Large Language Model) and subsequently validated and refined.
Fixes: https://tracker.ceph.com/issues/74329
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit a60c86ae0a8db3f37832de779341d1df7510d9dd)
Afreen Misbah [Mon, 2 Feb 2026 18:56:10 +0000 (00:26 +0530)]
mgr/dashboard: Reverting server_addr to traddr
- the server_addr is used as traddr
- this change is not backported yet (https://github.com/ceph/ceph/pull/66822), hence renaming to the convention used in tentacle
Afreen Misbah [Tue, 13 Jan 2026 20:47:40 +0000 (02:17 +0530)]
mgr/dashboard: Add productive card component
- add generic productive card component
- based on carbon design system
- there are two versions of the card: with shadow (tinted effect) and without.
- applies the gray10 theme, as decided by the new designs.
Ilya Dryomov [Fri, 30 Jan 2026 15:32:35 +0000 (16:32 +0100)]
qa/tasks/rbd_mirror_thrash: don't use random.randrange() on floats
This stopped working in Python 3.12:
Changed in version 3.12: Automatic conversion of non-integer types
is no longer supported. Calls such as randrange(10.0) and
randrange(Fraction(10, 1)) now raise a TypeError.
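A minimal sketch of two ways to avoid passing floats to random.randrange() is shown below. The helper names are hypothetical and this is not necessarily the change made in rbd_mirror_thrash: random.uniform() is used when a float delay is wanted, and an explicit int() conversion when an integer is wanted.

```python
import random


def random_float_delay(min_delay, max_delay):
    """Pick a float delay in [min_delay, max_delay].

    random.uniform() accepts float bounds on all Python 3 versions.
    """
    return random.uniform(min_delay, max_delay)


def random_int_delay(max_delay):
    """Pick an integer delay in [0, int(max_delay)).

    The explicit int() conversion replaces the automatic conversion
    that random.randrange() no longer performs as of Python 3.12.
    """
    return random.randrange(int(max_delay))


if __name__ == "__main__":
    print(random_float_delay(1.5, 3.0))
    print(random_int_delay(10.0))
```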
Afreen Misbah [Mon, 15 Dec 2025 15:53:44 +0000 (21:23 +0530)]
mgr/dashboard: Fix display of IP address in host page
- Hosts data is getting merged with hosts' facts, which do not send the address, hence it is not displayed in the UI
- Hence the value is empty in the API
- Caused by https://github.com/ceph/ceph/pull/65102
Ilya Dryomov [Thu, 29 Jan 2026 20:41:03 +0000 (21:41 +0100)]
qa/workunits/rbd: reduce randomized sleeps in live import tests
These tests were tuned for slower hardware than what we have now.
Currently "rbd migration execute" always finishes (successfully) before
the NBD server is killed.
Ilya Dryomov [Wed, 28 Jan 2026 09:41:13 +0000 (10:41 +0100)]
qa/workunits/rbd: drop randomized sleeps in "big image" tests
These tests were tuned for slower hardware than what we have now.
Even without these the image is often 25-30% synced by the time the
test gets to the "non-primary snapshot in question is still being
synced" assert.
Ilya Dryomov [Tue, 27 Jan 2026 20:56:23 +0000 (21:56 +0100)]
qa/workunits/rbd: avoid unnecessary sleeping in stop_mirror()
There is no need to wait for anything if -KILL is passed for sig
because the process would disappear immediately. In teuthology runs
where multiple rbd-mirror daemons are deployed (and therefore need to
be stopped when stop_mirrors() is called by the test), it causes
gratuitous delays of 4+ seconds.
Leonid Chernin [Thu, 9 Oct 2025 05:24:20 +0000 (08:24 +0300)]
nvmeofgw: fast-failover changes
beacon timeouts are now also measured in the prepare_beacon function so that
shorter timeout values can be implemented correctly
default failover detection time was set to 7 sec
default beacon tick was set to 1 second
changed condition for detection of ceph slowness in NVMeofgwMon
Afreen Misbah [Wed, 28 Jan 2026 09:59:08 +0000 (15:29 +0530)]
mgr/dashboard: fetch all namespaces in a gateway group
- adds a new API /api/gateway_group/{group}/namespace
- updates tests
- needed for UI flows and in general to fetch all namespaces; the existing API could not be changed because backward compatibility must be maintained
- in a followup PR will add server side pagination
Jon Bailey [Tue, 27 Jan 2026 14:29:05 +0000 (14:29 +0000)]
osd: Fix issue where it is possible for stats to be recovered incorrectly during merge operations.
To hit the problem, after you take a snapshot, you:
* Perform a write
* Perform a partial write that only involves the primary
* Perform a partial write that only involves a non-primary
* Primary goes down
* Primary comes up
* Primary goes through peering and chooses a non-primary shard as its peering partner
The result of these operations is the stats reporting a size difference equal to the partial write that only involves the primary, as the non-primary is not aware of the clone operation by design and so that missing update is copied to the osd. This commit prevents it by invalidating the stats in the case where this happens. There will be a future commit to further narrow the set of cases where stats invalidations can happen.