mgr/DaemonServer: Re-order OSDs in crush bucket to maximize OSDs for upgrade
DaemonServer::_maximize_ok_to_upgrade_set() attempts to find which OSDs
from the initial set found as part of _populate_crush_bucket_osds() can be
upgraded as part of the initial phase. If the initial set results in failure,
the convergence logic trims the 'to_upgrade' vector from the end until a safe
set is found.
Therefore, it would be advantageous to sort the OSDs by the ascending number
of PGs hosted by the OSDs. By placing OSDs with smallest (or no PGs) at the
beginning of the vector, the trim logic along with _check_offlines_pgs() will
have the best chance of finding OSDs to upgrade as it approaches a grouping
of OSDs that have the smallest or no PGs.
To achieve the above, a temporary vector of struct pgs_per_osd is created and
sorted for a given crush bucket. The sorted OSDs are pushed to the main
crush_bucket_osds that is eventually used to run the _check_offlines_pgs()
logic to find a safe set of OSDs to upgrade.
pgmap is passed to _populate_crush_bucket_osds() to utilize get_num_pg_by_osd()
for the above logic to work.
Implement a new Mgr command called 'ok-to-upgrade' that returns a set of OSDs
within the provided CRUSH bucket that are safe to upgrade without reducing
immediate data availability.
The command accepts the following as input:
- CRUSH bucket name (required)
- The CRUSH bucket type is limited to 'rack', 'chassis', 'host' and 'osd'.
This is to prevent users from specifying a bucket type higher up the tree
which could result in performance issues if the number of OSDs in the
bucket is very high.
- The new Ceph version to check against. The format accepted is the short
form of the Ceph version, for e.g. 20.3.0-3803-g63ca1ffb5a2. (required)
- The maximum number of OSDs to consider if specified. (optional)
Implementation Details:
After sanity checks on the provided parameters, the following steps are
performed:
1. The set of OSDs within the CRUSH bucket is first determined.
2. From the main set of OSDs, a filtered set of OSDs not yet running the new
Ceph version is created.
- For this purpose, the OSD's 'ceph_version_short' string is read from
the metadata. For this purpose a new method called
DaemonServer::get_osd_metadata() is used. The information is determined
from the DaemonStatePtr maintained within the DaemonServer.
3. If all OSDs are already running the new Ceph version, a success report is
generated and returned.
4. If OSDs are not running the new Ceph version, a new set (to_upgrade) is
created.
5. If the current version cannot be determined, an error is logged and the
output report with 'bad_no_version' field populated with the OSD in question
is generated.
6. On the new set (to_upgrade), the existing logic in _check_offline_pgs() is
executed to see if stopping any or all OSDs in the set as part of the upgrade
can reduce immediate data availability.
- If data availability is impacted, then the number of OSDs in the filtered
set is reduced by a factor defined by a new config option called
'mgr_osd_upgrade_check_convergence_factor' which is set to 0.8 by default.
- The logic in _check_offline_pgs() is repeated for the new set.
- The above is repeated until a safe subset of OSDs that can be stopped for
upgrade is found. Each iteration reduces the number of OSDs to check by
the convergence factor mentioned above.
7. It must be noted that the default value of
'mgr_osd_upgrade_check_convergence_factor' is on the higher side in order to
help determine an optimal set of OSDs to upgrade. In other words, a higher
convergence factor would help maximize the number of OSDs to upgrade. In this
case, the number of iterations and therefore the time taken to determine the
OSDs to upgrade is proportional to the number of OSDs in the CRUSH bucket.
The converse is true if a lower convergence factor is used.
8. If the number of OSDs determined is lower than the 'max' specified, then an
additional loop is executed to determine if other children of the CRUSH
bucket can be added to the existing set.
9. Once a viable set is determined, an output report similar to the following is
generated:
A standalone test is introduced that exercises the logic for both replicated
and erasure-coded pools by manipulating the min_size for a pool and check for
upgradability. The tests also performs other basic sanity checks and error
conditions.
The output shown below is for a cluster running on a single node with 10 OSDs
and with replicated pool configuration:
mgr/DaemonServer: Modify offline_pg_report to handle set or vector types
The offline_pg_report structure to be used by both the 'ok-to-stop' and
ok-to-upgrade' commands is modified to handle either std::set or std::vector
type containers. This is necessitated due to the differences in the way
both commands work. For the 'ok-to-upgrade' command logic to work optimally,
the items in the specified crush bucket including items found in the subtree
must be strictly ordered. The earlier std::set container re-orders the items
upon insertion by sorting the items which results in the offline pg check to
report sub-optimal results.
Therefore, the offline_pg_report struct is modified to use
std::variant<std::vector<int>, std::set<int>> as a ContainerType and handled
accordingly in dump() using std::visit(). This ensures backward compatibility
with the existing 'ok-to-stop' command while catering to the requirements of
the new 'ok-to-upgrade' command.
Patrick Donnelly [Wed, 11 Feb 2026 19:05:07 +0000 (14:05 -0500)]
Merge PR #67011 into main
* refs/pull/67011/head:
qa/multisite: use boto3's ClientError in place of assert_raises from tools.py.
qa/multisite: test fixes
qa/multisite: boto3 in tests.py
qa/multisite: zone files use boto3 resource api
qa/multisite: switch to boto3 in multisite test libraries
Xavi Hernandez [Wed, 5 Nov 2025 09:05:45 +0000 (10:05 +0100)]
libcephfs_proxy: add the number of supported operations to negotiation
The v0 negotiation structure has been modified to hold the total number
of operations and callbacks supported by the peer. The changes are done in
a way that it's completely transparent and harmless for a peer expecting
the previous definition.
This will be useful to quickly check if the daemon supports some
operation, or the client supports some callback before sending them.
Shraddha Agrawal [Tue, 10 Feb 2026 13:02:10 +0000 (18:32 +0530)]
cephadm, ceph-volume: add tests for crimson OSD support
This commit adds tests for the crimson OSD support in cephadm and ceph-volume.
The following tests are added for the same:
1. cephadm: DriveGroupSpec validation checks for osd_type.
2. cephadm: entrypoint verification in runfile.
3. cephadm to ceph-volume: command verification when osd_type is specified in spec.
4. ceph-volume: binary selection verification for mkfs cmd.
Afreen Misbah [Sun, 1 Feb 2026 23:47:23 +0000 (05:17 +0530)]
mgr/dashboard: Add step two of subsystem create form
- add steps to add initiators
- can add by input field
- added right influencer (right panel) in tearsheet component
- added unit tests
- includes api updates
* refs/pull/67251/head:
qa: set column for insertion
qa: bail sqlite3 on any error
qa: use actual sqlite3 blob instead of string
test: use json_extract instead of awkward json_tree
Use OSDType.value when building the ceph-volume lvm batch command
so that the CLI receives "classic" or "crimson" instead of the enum
repr ("OSDType.classic").
We observed in Seastore, deletion of a large batch (default osd_target_transaction_size=30)
can take a significant amount of time.
Because this happens inside the peering_pp.process stage, it blocks the PG's peering pipeline.
During this block, any incoming OSDMap updates (PGAdvanceMap) are stalled behind the deletion work.
This eventually causes a global OSD-wide map progression hang because
the OSD cannot advance past an epoch until all PGs have processed
it.
To fix this, we are reducing osd_target_transaction_size to 5 to lower
conflict rates and allow deletion transactions to complete.