mgr/DaemonServer: Implement ok-to-upgrade command

author Sridhar Seshasayee <sseshasa@redhat.com>

Mon, 27 Oct 2025 16:34:54 +0000 (22:04 +0530)

committer Sridhar Seshasayee <sseshasa@redhat.com>

Mon, 23 Feb 2026 07:13:16 +0000 (12:43 +0530)
diff --git a/PendingReleaseNotes b/PendingReleaseNotes

index 8060eb62b6e61f21c92f63f0eb0134df33de28a1..36e414648ef290b5396e836391efbee1c195e8e2 100644 (file)
--- a/PendingReleaseNotes
+++ b/PendingReleaseNotes
@@ -247,6 +247,18 @@
    to ensure compatibility during upgrades, but can be disabled once old usage logs
    are no longer present to avoid performance overhead.
  
+* MGR: A new command, `ceph osd ok-to-upgrade`, has been added that allows
+  users and orchestration tools to determine a safe set of OSDs within a CRUSH
+  bucket to upgrade simultaneously without impacting data availability. To help
+  converge to a safe set, a new config option
+  ``mgr_osd_upgrade_check_convergence_factor`` is introduced. This option can be
+  modified (if necessary) to help converge to an optimal set. Higher values
+  maximize the set of OSDs to upgrade at the cost of longer command response
+  times. Conversely, a lower value improves the command response time but
+  results in a non-optimal or smaller set of OSDs which impacts the overall time
+  to upgrade all OSDs in the cluster. For more details see tracker:
+  https://tracker.ceph.com/issues/73031.
+
  >=19.2.1
  
  * CephFS: The `fs subvolume create` command now allows tagging subvolumes through option
diff --git a/doc/man/8/ceph.rst b/doc/man/8/ceph.rst

index 0afc3cbbe266c8ae453a477161639c1778bb40c5..cec4300bc452d742fdc6f68b54129fec4fb3048f 100644 (file)
--- a/doc/man/8/ceph.rst
+++ b/doc/man/8/ceph.rst
@@ -37,7 +37,7 @@ Synopsis
  
  | **ceph** **mon** [ *add* \| *dump* \| *enable_stretch_mode* \| *getmap* \| *remove* \| *stat* ] ...
  
-| **ceph** **osd** [ *blocklist* \| *blocked-by* \| *create* \| *new* \| *deep-scrub* \| *df* \| *down* \| *dump* \| *erasure-code-profile* \| *find* \| *getcrushmap* \| *getmap* \| *getmaxosd* \| *in* \| *ls* \| *lspools* \| *map* \| *metadata* \| *ok-to-stop* \| *out* \| *pause* \| *perf* \| *pg-temp* \| *force-create-pg* \| *primary-affinity* \| *primary-temp* \| *repair* \| *reweight* \| *reweight-by-pg* \| *rm* \| *destroy* \| *purge* \| *safe-to-destroy* \| *scrub* \| *set* \| *setcrushmap* \| *setmaxosd*  \| *stat* \| *tree* \| *unpause* \| *unset* ] ...
+| **ceph** **osd** [ *blocklist* \| *blocked-by* \| *create* \| *new* \| *deep-scrub* \| *df* \| *down* \| *dump* \| *erasure-code-profile* \| *find* \| *getcrushmap* \| *getmap* \| *getmaxosd* \| *in* \| *ls* \| *lspools* \| *map* \| *metadata* \| *ok-to-stop* \| *ok-to-upgrade* \| *out* \| *pause* \| *perf* \| *pg-temp* \| *force-create-pg* \| *primary-affinity* \| *primary-temp* \| *repair* \| *reweight* \| *reweight-by-pg* \| *rm* \| *destroy* \| *purge* \| *safe-to-destroy* \| *scrub* \| *set* \| *setcrushmap* \| *setmaxosd*  \| *stat* \| *tree* \| *unpause* \| *unset* ] ...
  
  | **ceph** **osd** **crush** [ *add* \| *add-bucket* \| *create-or-move* \| *dump* \| *get-tunable* \| *link* \| *move* \| *remove* \| *rename-bucket* \| *reweight* \| *reweight-all* \| *reweight-subtree* \| *rm* \| *rule* \| *set* \| *set-tunable* \| *show-tunables* \| *tunables* \| *unlink* ] ...
  
@@ -1113,6 +1113,53 @@ Usage::
  
    ceph osd ok-to-stop <id> [<ids>...] [--max <num>]
  
+Subcommand ``ok-to-upgrade`` determines a safe set of OSDs within the
+specified CRUSH bucket that can be upgraded simultaneously without impacting
+cluster data availability: all data remains readable and writeable. Data
+redundancy may be reduced, with some PGs in a degraded (but active) state. The
+command checks the Ceph version running on each OSD against the specified
+version and considers only those OSDs that still need to be upgraded. The
+command returns a success code and lists the OSD(s) in the response if it finds
+a safe set to upgrade. Otherwise, or if no conclusion can be drawn, it returns
+an error code and an informative message.
+
+The type of the CRUSH bucket passed to the command must be one of 'rack',
+'chassis', 'host' or 'osd'. This restriction avoids performance issues with
+larger failure domains, where the number of OSDs to check could be very high,
+and helps manage any failures during upgrades.
+
+The argument ``<new_ceph_version_short>`` must be given in the short form of the
+Ceph version string. The format is the same as the value of the
+``ceph_version_short`` key in the output of the ``ceph osd metadata <id>``
+command, where ``id`` is the OSD number.
+
+When ``--max <num>`` is provided, the command returns up to <num> OSD IDs, found
+either within the provided CRUSH bucket or elsewhere in the CRUSH hierarchy,
+that can be stopped for upgrade simultaneously. For example, this logic can be
+triggered by specifying a single starting OSD and a max number. The search then
+extends from the given bucket to its parent CRUSH buckets, and additional OSDs
+are drawn from those locations.
+
+The command automatically determines a safe set of OSDs to upgrade found in the
+provided CRUSH bucket. If not all OSDs in the CRUSH bucket can be upgraded
+simultaneously, the command uses the config option
+``mgr_osd_upgrade_check_convergence_factor`` to progressively reduce the set of
+OSDs to check until a safe set is found. Note that the default value is on the
+higher side to help determine an optimal set of OSDs to upgrade. A higher
+convergence factor will help maximize the number of OSDs to upgrade at the cost
+of more iterations and time to find the set. The converse is true if a lower
+convergence factor is used. A lower value should be used only if the command is
+sluggish to respond.
+
+Note that this command leverages the underlying logic of the ``ok-to-stop``
+command. The key difference is that the ``ok-to-upgrade`` command operates on
+the OSDs found in the specified CRUSH bucket and considers adjacent CRUSH
+locations only if necessary to satisfy the ``--max`` criterion.
+
+Usage::
+
+  ceph osd ok-to-upgrade <crush_bucket_name> <new_ceph_version_short> [--max <num>]
+
  Subcommand ``pause`` pauses osd.
  
  Usage::
diff --git a/qa/standalone/misc/ok-to-upgrade.sh b/qa/standalone/misc/ok-to-upgrade.sh

new file mode 100755 (executable)

index 0000000..3436e48
--- /dev/null
+++ b/qa/standalone/misc/ok-to-upgrade.sh
@@ -0,0 +1,294 @@
+#!/usr/bin/env bash
+#
+# Copyright (C) 2025 IBM
+#
+# Author: Sridhar Seshasayee <sseshasa@redhat.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU Library Public License as published by
+# the Free Software Foundation; either version 2, or (at your option)
+# any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU Library Public License for more details.
+#
+
+source $CEPH_ROOT/qa/standalone/ceph-helpers.sh
+
+function run() {
+    local dir=$1
+    shift
+
+    export CEPH_MON="127.0.0.1:7170" # git grep '\<7170\>' : there must be only one
+    export CEPH_ARGS
+    CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none "
+    export ORIG_CEPH_ARGS="$CEPH_ARGS"
+
+    local funcs=${@:-$(set | ${SED} -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')}
+    for func in $funcs ; do
+        setup $dir || return 1
+        $func $dir || return 1
+        kill_daemons $dir KILL || return 1
+        teardown $dir || return 1
+    done
+}
+
+function TEST_ok_to_upgrade_invalid_args() {
+    local dir=$1
+    CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON "
+
+    run_mon $dir a --public-addr=$CEPH_MON || return 1
+    run_mgr $dir x || return 1
+    run_osd $dir 0 --osd-mclock-skip-benchmark=true || return 1
+
+    # test with no args
+    ! ceph osd ok-to-upgrade || return 1
+
+    # test with invalid crush bucket name
+    local crush_bucket="foo"
+    local ceph_version_short="01.2.3-1234-g1234deed"
+    ! ceph osd ok-to-upgrade $crush_bucket $ceph_version_short || return 1
+
+    # test with 'root' crush bucket name
+    crush_bucket="default"
+    ! ceph osd ok-to-upgrade $crush_bucket $ceph_version_short || return 1
+
+    # test with invalid ceph_version formats
+    crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }')
+    ceph_versions=("" "foo" "20" "20.3.0" "20.3.0-1234" \
+                   "20.3.0-1234-g" "20.1.0-145.el")
+    for ver in "${ceph_versions[@]}"; do
+      ! ceph osd ok-to-upgrade $crush_bucket $ver || return 1
+    done
+
+    # Invalid max parameter
+    max=-20
+    ! ceph osd ok-to-upgrade $crush_bucket $ceph_version_short $max || return 1
+}
+
+function TEST_ok_to_upgrade_replicated_pool() {
+    local dir=$1
+    local poolname="test"
+    local OSDS=10
+    local ceph_version="01.2.3-1234-g1234deed"
+
+    CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON "
+
+    run_mon $dir a --public-addr=$CEPH_MON || return 1
+    run_mgr $dir x || return 1
+
+    for osd in $(seq 0 $(expr $OSDS - 1))
+    do
+      run_osd $dir $osd --osd-mclock-skip-benchmark=true || return 1
+    done
+
+    create_pool $poolname 32 32
+    ceph osd pool set $poolname min_size 1
+    sleep 5
+
+    wait_for_clean || return 1
+
+    # Test for upgradability with min_size=1
+    local exp_osds_upgradable=2
+    local crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }')
+    local res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    # Specifying hostname as the crush bucket with a 3x replicated pool on 10 OSDs
+    # and with the default 'mgr_osd_upgrade_check_convergence_factor' would result
+    # in 4 OSDs being reported as upgradable.
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = true || return 1
+    local num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $num_osds_upgradable -ge $exp_osds_upgradable || return 1
+    local num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for upgradability with min_size=1, 1 OSD to upgrade and max=2.
+    # This tests the functionality of the 'max' parameter and checks the
+    # logic to find more OSDs in the crush bucket.
+    local max=2
+    exp_osds_upgradable=2
+    crush_bucket="osd.0"
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version $max --format=json)
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = true || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    test $max = $num_osds_upgradable || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for upgradability with min_size=2
+    ceph osd pool set $poolname min_size 2
+    sleep 5
+    wait_for_clean || return 1
+    exp_osds_upgradable=1
+    crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }')
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    # 3 OSDs should be reported as upgradable.
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = true || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $num_osds_upgradable -ge $exp_osds_upgradable || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for upgradability with min_size=3
+    ceph osd pool set $poolname min_size 3
+    sleep 5
+    wait_for_clean || return 1
+    exp_osds_upgradable=0
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    # No OSD should be reported as upgradable.
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = false || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for condition when all OSDs are running desired version.
+    upgrade_version=$(ceph osd metadata 0 --format=json | \
+      jq '.ceph_version_short' | sed 's/"//g')
+    res=$(ceph osd ok-to-upgrade $crush_bucket $upgrade_version --format=json)
+    test $(echo $res | jq '.all_osds_upgraded') = true || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = false || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $num_osds_upgradable -eq 0 || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq $OSDS || return 1
+}
+
+function TEST_ok_to_upgrade_erasure_pool() {
+    local dir=$1
+    local poolname="ec"
+    local OSDS=10
+    local ceph_version="01.2.3-1234-g1234deed"
+
+    CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON "
+
+    run_mon $dir a --public-addr=$CEPH_MON || return 1
+    run_mgr $dir x || return 1
+
+    for osd in $(seq 0 $(expr $OSDS - 1))
+    do
+      run_osd $dir $osd --osd-mclock-skip-benchmark=true || return 1
+    done
+
+    ceph osd erasure-code-profile set ec-profile m=3 k=5 crush-failure-domain=osd || return 1
+    ceph osd pool create $poolname erasure ec-profile || return 1
+    ceph osd pool set $poolname min_size 5
+    sleep 5
+
+    wait_for_clean || return 1
+
+    # Test for upgradability with min_size=5
+    local exp_osds_upgradable=3
+    local crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }')
+    local res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    # Specifying hostname as the crush bucket with a ec5+3 pool on 10 OSDs
+    # and with the default 'mgr_osd_upgrade_check_convergence_factor' would result
+    # in 3 OSDs being reported as upgradable.
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = true || return 1
+    local num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    local num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for upgradability with min_size=5, 1 OSD to upgrade and max=3.
+    # This tests the functionality of the 'max' parameter and also checks
+    # the logic to find more OSDs in the crush bucket.
+    local max=3
+    crush_bucket="osd.0"
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version $max --format=json)
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = true || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    test $max = $num_osds_upgradable || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for upgradability with min_size=6
+    ceph osd pool set $poolname min_size 6
+    sleep 5
+    wait_for_clean || return 1
+    exp_osds_upgradable=2
+    crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }')
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    # 2 OSDs should be reported as upgradable.
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = true || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for upgradability with min_size=8
+    ceph osd pool set $poolname min_size 8
+    sleep 5
+    wait_for_clean || return 1
+    exp_osds_upgradable=0
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    # No OSD should be reported as upgradable.
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = false || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq 0 || return 1
+
+    # Test for condition when all OSDs are running desired version.
+    ceph_version=$(ceph osd metadata 0 --format=json | \
+      jq '.ceph_version_short' | sed 's/"//g')
+    res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    test $(echo $res | jq '.all_osds_upgraded') = true || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = false || return 1
+    num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $num_osds_upgradable -eq 0 || return 1
+    num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc)
+    test $num_osds_upgraded -eq $OSDS || return 1
+}
+
+function TEST_ok_to_upgrade_bad_osd_version() {
+    local dir=$1
+    local poolname="test"
+    local OSDS=3
+    local ceph_version="01.2.3-1234-g1234deed"
+
+    CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON "
+
+    run_mon $dir a --public-addr=$CEPH_MON || return 1
+    run_mgr $dir x || return 1
+
+    for osd in $(seq 0 $(expr $OSDS - 1))
+    do
+      run_osd $dir $osd --osd-mclock-skip-benchmark=true || return 1
+    done
+
+    create_pool $poolname 8 8
+    ceph osd pool set $poolname min_size 1
+    sleep 5
+
+    wait_for_clean || return 1
+
+    # Set the option to enable testing metadata errors
+    ceph config set mgr mgr_test_metadata_error true
+
+    # Test for upgradability with min_size=1
+    local exp_osds_upgradable=0
+    local exp_osds_bad_version=3
+    local crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }')
+    local res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json)
+    test $(echo $res | jq '.all_osds_upgraded') = false || return 1
+    test $(echo $res | jq '.ok_to_upgrade') = false || return 1
+    local num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc)
+    test $exp_osds_upgradable = $num_osds_upgradable || return 1
+    local num_osds_bad_version=$(echo $res | jq '.bad_no_version | length' | bc)
+    test $num_osds_bad_version -eq 3 || return 1
+}
+
+
+main ok-to-upgrade "$@"
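
The version strings rejected in `TEST_ok_to_upgrade_invalid_args` follow from the pattern that the command applies to its version argument. As a rough standalone sketch, the check can be reproduced outside the manager. The pattern below is copied from `ceph_version_pattern` in the `DaemonServer.cc` part of this commit; the helper name `is_valid_version_short` is hypothetical and not part of Ceph.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of Ceph: returns true if `v` has the short
// version form that `ceph osd ok-to-upgrade` accepts.
bool is_valid_version_short(const std::string& v)
{
  // Same pattern as ceph_version_pattern in the commit: three dotted numbers,
  // a build number, then either a Git suffix (-g<hex>) or an OS suffix
  // (.el<digits><letters>).
  static const std::regex pattern(
    R"(^(\d+)\.(\d+)\.(\d+)-(\d+)(-g[0-9a-f]+|\.el\d+[a-z]+)$)");
  return std::regex_match(v, pattern);
}
```

Under this pattern, `20.3.0-3803-g63ca1ffb5a2` and `20.1.0-144.el9cp` are accepted, while `20.3.0`, `20.3.0-1234-g` and `20.1.0-145.el` are rejected, which is consistent with the cases the test script expects to fail.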
diff --git a/src/common/options/mgr.yaml.in b/src/common/options/mgr.yaml.in

index c6bdee1d156dad88ea09473ce6d421396a24090c..658a0160987effc27fb846210880b28ad76bd6d5 100644 (file)
--- a/src/common/options/mgr.yaml.in
+++ b/src/common/options/mgr.yaml.in
@@ -379,3 +379,27 @@ options:
    services:
    - mgr
    with_legacy: true
+- name: mgr_osd_upgrade_check_convergence_factor
+  type: float
+  level: advanced
+  desc: The factor used to converge to a subset of OSDs within a CRUSH bucket
+    that can be upgraded without affecting immediate data availability.
+  fmt_desc: The factor used in calculations to converge to a subset of OSDs that
+    can be safely upgraded simultaneously. Each iteration of the calculation
+    reduces the candidate set by this factor until a safe subset is found. The
+    smaller the factor, the fewer iterations are needed to find a safe set, but
+    the number of OSDs found may not be optimal. Conversely, a larger factor
+    takes more iterations and time to find a safe set, but the number of OSDs
+    found is closer to optimal.
+  default: 0.8
+  min: 0.1
+  max: 0.9
+  services:
+  - mgr
+- name: mgr_test_metadata_error
+  type: bool
+  level: dev
+  desc: Used for simulating errors during operations involving metadata.
+  default: false
+  services:
+  - mgr
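
The trade-off described for `mgr_osd_upgrade_check_convergence_factor` can be illustrated with a minimal sketch of the reduction loop. This is a simplified model under stated assumptions, not the manager's code: `find_safe_prefix` and the `SafetyCheck` callback are hypothetical names, and the callback stands in for the PG availability check that the manager performs on each candidate set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the PG availability check on a candidate set.
using SafetyCheck = std::function<bool(const std::vector<int>&)>;

// Shrinks the candidate list to a prefix of itself, multiplying the size by
// `factor` after every failed check, until the check passes. Returns an empty
// vector if even a single OSD is not safe. `attempts`, if given, receives the
// number of checks performed.
std::vector<int> find_safe_prefix(std::vector<int> osds, double factor,
                                  const SafetyCheck& is_safe,
                                  int* attempts = nullptr)
{
  assert(factor > 0.0 && factor < 1.0);  // the option itself allows 0.1 to 0.9
  std::size_t count = osds.size();
  int tries = 0;
  while (!osds.empty()) {
    ++tries;
    if (is_safe(osds)) {
      break;               // found a safe subset
    }
    if (count == 1) {
      osds.clear();        // not even one OSD can be taken down safely
      break;
    }
    // Reduce the subset size by the factor, but never below one OSD.
    count = std::max<std::size_t>(1, static_cast<std::size_t>(count * factor));
    osds.resize(count);    // keep the first `count` OSDs, preserving order
  }
  if (attempts) {
    *attempts = tries;
  }
  return osds;
}
```

With 10 candidate OSDs and a check that accepts at most 3 of them, a factor of 0.8 tries sets of 10, 8, 6, 4 and 3 OSDs and returns 3, whereas a factor of 0.5 tries 10, 5 and 2 and returns only 2. The larger factor costs more checks but lands closer to the largest safe set.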
diff --git a/src/mgr/DaemonServer.cc b/src/mgr/DaemonServer.cc

index 9dbcff6feb8213411216b5339a6b51093a96c845..9f20dc0dad3967f03ab0ed63350784afb1e24737 100644 (file)
--- a/src/mgr/DaemonServer.cc
+++ b/src/mgr/DaemonServer.cc
@@ -1164,6 +1164,335 @@ void DaemonServer::_maximize_ok_to_stop_set(
    }
  }
  
+void DaemonServer::_update_upgraded_osds(
+  const std::vector<int>& orig_osds,
+  const std::vector<int>& to_upgrade,
+  const std::vector<int>& upgraded,
+  const std::vector<int>& version_unknown,
+  upgrade_osd_report *report)
+{
+  // reset output
+  *report = upgrade_osd_report();
+  report->osds = orig_osds;
+  report->ok_upgrade = to_upgrade;
+  report->ok_upgraded = upgraded;
+  report->bad_no_version = version_unknown;
+}
+
+bool DaemonServer::_valid_bucket_type_for_upgrade_check(
+  std::string_view bucket_type_str)
+{
+  if (bucket_type_str.empty()) {
+    dout(20) << "bucket type string is empty!" << dendl;
+    return false;
+  }
+
+  return (bucket_type_str == "rack" || bucket_type_str == "chassis" ||
+          bucket_type_str == "host" || bucket_type_str == "osd");
+}
+
+int DaemonServer::_populate_crush_bucket_osds(
+  const int item_id,
+  const OSDMap& osdmap,
+  std::vector<int>& crush_bucket_osds,
+  std::ostream *ss)
+{
+  int r = 0;
+  int btype = osdmap.crush->get_bucket_type(item_id);
+  if (btype < 0) {
+    // For negative type an OSD may be assumed
+    btype = 0;
+  }
+  std::string item_name = osdmap.crush->get_item_name(item_id);
+  std::string bucket_type_str = osdmap.crush->get_type_name(btype);
+  if (!_valid_bucket_type_for_upgrade_check(bucket_type_str)) {
+    ostringstream os;
+    os << "crush bucket \"" << item_name << "\" of type "
+       << "\"" << bucket_type_str << "\" is incompatible for "
+       << "upgradability check; valid types are: 'rack', 'chassis', "
+       << "'host' and 'osd'";
+    if (ss) {
+      *ss << os.str();
+    }
+    dout(20) << os.str() << dendl;
+    return -EINVAL;
+  }
+  dout(20) << "bucket type of parent " << item_name << " is "
+             << bucket_type_str << dendl;
+
+  std::vector<std::string> bucket_names;
+  // get candidate additions that are beneath this point in the tree
+  if (bucket_type_str == "rack" || bucket_type_str == "chassis") {
+    std::list<int> crush_bucket_children;
+    // Get the list of children
+    if (osdmap.crush->get_children(item_id, &crush_bucket_children) <= 0) {
+      ostringstream os;
+      os << "crush bucket \"" << item_name << "\" of type: "
+         << bucket_type_str << " has no children!";
+      if (ss) {
+        *ss << os.str();
+      }
+      dout(20) << os.str() << dendl;
+      return -ENOENT;
+    }
+    // create a list of bucket names pertaining to each child in the tree
+    for (const auto &child : crush_bucket_children) {
+      bucket_names.push_back(osdmap.crush->get_item_name(child));
+    }
+  } else if (bucket_type_str == "host" || bucket_type_str == "osd") {
+    bucket_names.push_back(item_name);
+  }
+  // get osds under each child bucket
+  std::set<int> bucket_osds;
+  for (const auto &item : bucket_names) {
+    r = osdmap.get_osds_by_bucket_name(item, &bucket_osds);
+    if (r < 0) {
+      ostringstream os;
+      os << "cannot parse crush bucket:\"" << item
+         << "\" of type: " << bucket_type_str << ". "
+         << "got error code: " << r;
+      if (ss) {
+        *ss << os.str();
+      }
+      dout(20) << os.str() << dendl;
+      return r;
+    }
+    // The osds are pushed to the referenced crush_bucket_osds
+    // vector to maintain the order of osds according to the
+    // child order. This helps optimize the result of
+    // _check_offlines_pgs() down the line.
+    for (const auto &osd : bucket_osds) {
+      crush_bucket_osds.push_back(osd);
+    }
+    dout(20) << "Picked children: " << bucket_osds
+             << " from parent: " << item << dendl;
+  }
+  return r;
+}
+
+void DaemonServer::_maximize_ok_to_upgrade_set(
+  const std::vector<int>& orig_osds,
+  unsigned max,
+  const OSDMap& osdmap,
+  const PGMap& pgmap,
+  std::string_view ceph_version_new,
+  upgrade_osd_report *out_osd_report,
+  offline_pg_report *out_pg_report,
+  std::ostream *ss)
+{
+  std::vector<int> to_upgrade;
+  std::vector<int> upgraded;
+  std::vector<int> version_unknown;
+
+  dout(20) << "orig_osds " << orig_osds
+           << " new ceph_version " << ceph_version_new << dendl;
+  // Filter osds not yet running the new ceph_version.
+  // Limit the check for safe upgrade to only the set
+  // of OSDs that are still running the older version.
+  for (const auto& osd : orig_osds) {
+    auto osd_id = "osd." + std::to_string(osd);
+    auto ver = get_osd_metadata("ceph_version_short", osd_id);
+    if (ver.has_value()) {
+      if (*ver != ceph_version_new) {
+        dout(20) << "found " << osd_id << " to upgrade" << dendl;
+        to_upgrade.push_back(osd);
+      } else {
+        dout(20) << osd_id << " is already running the new version("
+                 << *ver << ")" << dendl;
+        upgraded.push_back(osd);
+      }
+    } else {
+      derr << "couldn't determine 'ceph_version_short' for "
+           << osd_id << dendl;
+      version_unknown.push_back(osd);
+    }
+  }
+
+  // Check if all OSDs are upgraded
+  _update_upgraded_osds(orig_osds, to_upgrade, upgraded,
+    version_unknown, out_osd_report);
+  if (!out_osd_report->bad_no_version.empty()) {
+    dout(20) << "'ceph_version_short' on osds couldn't be determined" << dendl;
+    return;
+  }
+  if (out_osd_report->all_osds_upgraded()) {
+    dout(20) << "all osds are upgraded!" << dendl;
+    return;
+  }
+
+  // Re-try until we can find a safe subset of OSDs to upgrade.
+  // On each attempt reduce the original set of OSDs to check by a
+  // factor defined by 'mgr_osd_upgrade_check_convergence_factor'.
+  // If no safe number can be found after all attempts, a minimum of
+  // 1 OSD is attempted.
+  const double convergence_factor =
+    g_conf().get_val<double>("mgr_osd_upgrade_check_convergence_factor");
+  size_t osd_subset_count = to_upgrade.size();
+  while (true) {
+    // Check impact to PGs with the filtered set. Use the existing
+    // ok-to-stop logic for this purpose.
+    _check_offlines_pgs(to_upgrade, osdmap, pgmap, out_pg_report);
+    if (!out_pg_report->ok_to_stop()) {
+      if (osd_subset_count == 1) {
+        // This means that there's no safe set of OSDs to upgrade.
+        // This probably indicates a problem with the cluster configuration.
+        to_upgrade.clear();
+        _update_upgraded_osds(orig_osds, to_upgrade, upgraded,
+          version_unknown, out_osd_report);
+        return;
+      }
+      // Reduce the number of OSDs in the set by the convergence factor.
+      osd_subset_count = std::max<size_t>(
+        1, static_cast<size_t>(osd_subset_count * convergence_factor));
+      // Prune the 'to-upgrade' set to hold the new subset of OSDs
+      auto start_it = std::next(to_upgrade.begin(), osd_subset_count);
+      auto end_it = to_upgrade.end();
+      to_upgrade.erase(start_it, end_it);
+      // reset pg report
+      *out_pg_report = offline_pg_report();
+    } else {
+      _update_upgraded_osds(orig_osds, to_upgrade, upgraded,
+        version_unknown, out_osd_report);
+      if (out_osd_report->ok_to_upgrade()) {
+        // Found a safe subset! Break and generate the output.
+        dout(20) << "found " << osd_subset_count << " OSDs that are safe to "
+                 << "upgrade" << dendl;
+        break;
+      }
+    }
+  }
+  if (to_upgrade.size() >= max) {
+    // already at max
+    dout(20) << "to_upgrade(" << to_upgrade.size() << ") >= "
+             <<  " max(" << max << ")" << dendl;
+    return;
+  }
+
+  /**
+   * Semi-arbitrarily start with the first OSD in the 'to_upgrade'
+   * vector and see if we can add more OSDs to upgrade. A vector is
+   * used instead of a set to preserve the order of OSDs according
+   * to the order of the parent and its child buckets. This order
+   * ensures that the offline PGs check can correctly determine the
+   * outcome of stopping a set of OSDs from a specific bucket.
+   */
+  offline_pg_report _pg_report;
+  upgrade_osd_report _osd_report;
+  std::vector<int> osds = to_upgrade;
+  int parent = *osds.begin();
+  std::vector<int> children;
+
+  dout(20) << "Trying to add more children..." << dendl;
+  while (true) {
+    // identify the next parent
+    int r = osdmap.crush->get_immediate_parent_id(parent, &parent);
+    if (r < 0) {
+      dout(20) << "No parent found for item id: " << parent << dendl;
+      return;  // just go with what we have so far!
+    }
+
+    // get candidate additions that are beneath this point in the tree
+    children.clear();
+    r = _populate_crush_bucket_osds(parent, osdmap, children);
+    if (r != 0) {
+      return; // just go with what we have so far!
+    }
+
+    // try adding in more osds from the list of children
+    // determined above to maximize the upgrade set.
+    int failed = 0;  // how many children we failed to add to our set
+    for (auto o : children) {
+      auto it = std::find(osds.begin(), osds.end(), o);
+      bool can_add_osd = (it == osds.end());
+      if (o >= 0 && osdmap.is_up(o) && can_add_osd) {
+        osds.push_back(o);
+        _check_offlines_pgs(osds, osdmap, pgmap, &_pg_report);
+        if (!_pg_report.ok_to_stop()) {
+          osds.pop_back();
+          ++failed;
+          continue;
+        }
+        _update_upgraded_osds(orig_osds, osds, upgraded,
+          version_unknown, &_osd_report);
+        *out_pg_report = _pg_report;
+        *out_osd_report = _osd_report;
+        if (osds.size() == max) {
+          dout(20) << " hit max" << dendl;
+          if (out_osd_report->ok_to_upgrade()) {
+            // Found additional children that can be upgraded
+            dout(20) << "found " << osds.size() - to_upgrade.size()
+                     << " additional OSD(s) to upgrade" << dendl;
+          }
+          return;  // yay, we hit the max
+        }
+      }
+    }
+
+    if (failed) {
+      // we hit some failures; go with what we have
+      dout(20) << " hit some peer failures" << dendl;
+      return;
+    }
+  }
+}
+
+std::optional<std::string> DaemonServer::get_osd_metadata(
+  const std::string& name,
+  const std::string& osd_id)
+{
+    if (g_conf().get_val<bool>("mgr_test_metadata_error")) {
+      return std::nullopt;
+    }
+
+    auto [key, valid] = DaemonKey::parse(osd_id);
+    if (!valid) {
+      derr << "invalid daemon name: use <type>.<id>" << dendl;
+      return std::nullopt;
+    }
+    DaemonStatePtr daemon = daemon_state.get(key);
+    if (!daemon) {
+      derr << "daemon " << osd_id << " not found!" << dendl;
+      return std::nullopt;
+    }
+
+    std::lock_guard l(daemon->lock);
+    auto p = daemon->metadata.find(name);
+    if (p != daemon->metadata.end() && !p->second.empty()) {
+      return p->second;
+    }
+    return std::nullopt;
+}
+
+void upgrade_osd_report::dump(Formatter *f) const {
+  f->dump_bool("ok_to_upgrade", ok_to_upgrade());
+  f->dump_bool("all_osds_upgraded", all_osds_upgraded());
+
+  f->open_array_section("osds_in_crush_bucket");
+  for (auto o : osds) {
+    f->dump_int("osd", o);
+  }
+  f->close_section();
+
+  f->open_array_section("osds_ok_to_upgrade");
+  for (auto o : ok_upgrade) {
+    f->dump_int("ok_upgrade", o);
+  }
+  f->close_section();
+
+  f->open_array_section("osds_upgraded");
+  for (auto o : ok_upgraded) {
+    f->dump_int("ok_upgraded", o);
+  }
+  f->close_section();
+
+  f->open_array_section("bad_no_version");
+  for (auto o : bad_no_version) {
+    f->dump_int("bad_no_version", o);
+  }
+  f->close_section();
+}
+
  bool DaemonServer::_handle_command(
    std::shared_ptr<CommandContext>& cmdctx)
  {
@@ -1915,6 +2244,108 @@ bool DaemonServer::_handle_command(
        cmdctx->reply(0, ss);
      }
      return true;
+  } else if (prefix == "osd ok-to-upgrade") {
+    std::string crush_bucket_name;
+    cmd_getval(cmdctx->cmdmap, "crush_bucket", crush_bucket_name);
+    std::string ceph_version;
+    cmd_getval(cmdctx->cmdmap, "ceph_version", ceph_version);
+    int64_t max = 1;
+    cmd_getval(cmdctx->cmdmap, "max", max);
+    int r;
+    std::vector<int> osds_in_crush_bucket;
+    // Validate max parameter
+    if (max < 0) {
+      ss << "Invalid 'max' value: " << max << ". 'max' must be non-negative.";
+      cmdctx->reply(-EINVAL, ss);
+      return true;
+    }
+    // Validate ceph_version format. The pattern is generic and matches
+    // the upstream and downstream version formats. Note that the suffix
+    // matches either the upstream Git format or the downstream OS format.
+    std::regex ceph_version_pattern
+      (R"(^(\d+)\.(\d+)\.(\d+)-(\d+)(-g[0-9a-f]+|\.el\d+[a-z]+)$)");
+    std::smatch matches;
+    if (!std::regex_match(ceph_version, matches, ceph_version_pattern)) {
+      ss << "Invalid Ceph version (short) format. The format to use is the"
+         << " same as 'ceph_version_short' found in OSD metadata."
+         << " Examples: \"20.3.0-3803-g63ca1ffb5a2\", \"20.1.0-144.el9cp\".";
+      cmdctx->reply(-EINVAL, ss);
+      return true;
+    }
+    // Validate the crush bucket name & type. For this command the
+    // bucket type is limited to 'rack', 'chassis', 'host' or 'osd'.
+    // This is to help limit the number of OSDs and avoid
+    // performance issues during the upgrade check.
+    cluster_state.with_osdmap([&](const OSDMap& osdmap) {
+        // Validate crush bucket
+        if (!osdmap.crush->name_exists(crush_bucket_name)) {
+          ss << "\"" << crush_bucket_name << "\" does not exist";
+          r = -ENOENT;
+          return;
+        }
+        int id = osdmap.crush->get_item_id(crush_bucket_name);
+        // get candidate additions that are beneath this point in the tree
+        r = _populate_crush_bucket_osds(id, osdmap, osds_in_crush_bucket, &ss);
+        if (r != 0) {
+          return;
+        }
+    });
+    if (r < 0) {
+      cmdctx->reply(r, ss);
+      return true;
+    }
+    dout(20) << "Crush Bucket OSDs: " << osds_in_crush_bucket << dendl;
+    if ((int)osds_in_crush_bucket.size() == 0) {
+      ss << "no osds found in crush bucket: \"" << crush_bucket_name << "\"";
+      cmdctx->reply(-ENOENT, ss);
+      return true;
+    }
+    if (max < (int)osds_in_crush_bucket.size()) {
+      max = osds_in_crush_bucket.size();
+    }
+    upgrade_osd_report osd_upgrade_report;
+    offline_pg_report pg_offline_report;
+    cluster_state.with_osdmap_and_pgmap([&](
+      const OSDMap& osdmap, const PGMap& pg_map) {
+        _maximize_ok_to_upgrade_set(
+          osds_in_crush_bucket, max, osdmap, pg_map, ceph_version,
+          &osd_upgrade_report, &pg_offline_report, &ss);
+      });
+    if (!f) {
+      f.reset(Formatter::create("json"));
+    }
+    f->dump_object("ok_to_upgrade", osd_upgrade_report);
+    f->flush(cmdctx->odata);
+    cmdctx->odata.append("\n");
+    if (!osd_upgrade_report.ok_to_upgrade()) {
+      if (!pg_offline_report.unknown.empty()) {
+        ss << pg_offline_report.unknown.size() << " pgs have unknown state; "
+           << "cannot draw any conclusions at this time; re-try after pgs "
+           << "transition to known states";
+        cmdctx->reply(-EBUSY, ss);
+      }
+      if (!osd_upgrade_report.bad_no_version.empty()) {
+        ss << osd_upgrade_report.bad_no_version.size()
+           << " osds have unknown version; cannot draw any conclusions";
+        cmdctx->reply(-EAGAIN, ss);
+      }
+      if (!pg_offline_report.ok_to_stop()) {
+        ss << "unsafe to upgrade osd(s) at this time ("
+           << pg_offline_report.not_ok.size()
+           << " PGs are or would become offline)";
+        cmdctx->reply(-EBUSY, ss);
+      }
+      // ok_to_upgrade() would be false in case all osds are upgraded
+      if (osd_upgrade_report.all_osds_upgraded()) {
+        ss << "all " << osds_in_crush_bucket.size()
+           << " osd(s) are running the new Ceph version("
+           << ceph_version << ")";
+        cmdctx->reply(0, ss);
+      }
+    } else {
+      cmdctx->reply(0, ss);
+    }
+    return true;
    } else if (prefix == "pg force-recovery" ||
              prefix == "pg force-backfill" ||
              prefix == "pg cancel-force-recovery" ||
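
Note on the `ceph_version` check in the hunk above: the following is a minimal standalone sketch of the same validation, not code from the patch. The helper name `is_valid_ceph_version_short` is hypothetical; only the regular expression is taken from the handler. It may help when checking which version strings the command would accept.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the patch). It reuses the pattern from the
// "osd ok-to-upgrade" handler: <major>.<minor>.<patch>-<build> followed by
// either a Git suffix ("-g<hex>") or an OS suffix (".el<digits><letters>").
bool is_valid_ceph_version_short(const std::string& version)
{
  static const std::regex pattern(
    R"(^(\d+)\.(\d+)\.(\d+)-(\d+)(-g[0-9a-f]+|\.el\d+[a-z]+)$)");
  // regex_match requires the whole string to match, as in the handler.
  return std::regex_match(version, pattern);
}
```

With this pattern, the two examples quoted in the error message ("20.3.0-3803-g63ca1ffb5a2" and "20.1.0-144.el9cp") are accepted, while a bare "20.1.0" is rejected because the build number and suffix are mandatory. A suffix such as ".el9" with no trailing letters is also rejected, since `[a-z]+` needs at least one letter.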
diff --git a/src/mgr/DaemonServer.h b/src/mgr/DaemonServer.h

index 94b046332f3249da9997aa228a2e512b78126a65..edc1fe35fa2990985fa6336936b9846fb3a4cea4 100644 (file)
--- a/src/mgr/DaemonServer.h
+++ b/src/mgr/DaemonServer.h
@@ -126,6 +126,22 @@ struct offline_pg_report {
    }
  };
  
+struct upgrade_osd_report {
+  std::vector<int> osds;
+  std::vector<int> ok_upgrade, ok_upgraded, bad_no_version;
+
+  bool ok_to_upgrade() const {
+    return !ok_upgrade.empty() && bad_no_version.empty();
+  }
+
+  bool all_osds_upgraded() const {
+    return ((osds.size() == ok_upgraded.size()) &&
+            ok_upgrade.empty() && bad_no_version.empty());
+  }
+
+  void dump(Formatter *f) const;
+};
+
  /**
   * Server used in ceph-mgr to communicate with Ceph daemons like
   * MDSs and OSDs.
@@ -194,6 +210,31 @@ private:
      const OSDMap& osdmap,
      const PGMap& pgmap,
      offline_pg_report *report);
+  void _maximize_ok_to_upgrade_set(
+    const std::vector<int>& orig_osds,
+    unsigned max,
+    const OSDMap& osdmap,
+    const PGMap& pgmap,
+    std::string_view ceph_version_new,
+    upgrade_osd_report *osd_report,
+    offline_pg_report *pg_report,
+    std::ostream *ss);
+  std::optional<std::string> get_osd_metadata(
+    const std::string& name,
+    const std::string& osd_id);
+  void _update_upgraded_osds(
+    const std::vector<int>& orig_osds,
+    const std::vector<int>& to_upgrade,
+    const std::vector<int>& upgraded,
+    const std::vector<int>& version_unknown,
+    upgrade_osd_report *osd_report);
+  bool _valid_bucket_type_for_upgrade_check(
+    std::string_view bucket_type_str);
+  int _populate_crush_bucket_osds(
+    const int item_id,
+    const OSDMap& osdmap,
+    std::vector<int>& crush_bucket_osds,
+    std::ostream *ss = nullptr);
  
    utime_t started_at;
    std::atomic<bool> pgmap_ready;
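
Note on `upgrade_osd_report` in the hunk above: the sketch below copies the two predicates into a standalone struct so their behaviour can be examined without the rest of the mgr. The name `upgrade_osd_report_sketch` is hypothetical and the `dump()` method is omitted because it depends on Ceph's `Formatter`. This is an illustration under those assumptions, not the patch's code.

```cpp
#include <vector>

// Hypothetical standalone copy of the predicates in upgrade_osd_report.
struct upgrade_osd_report_sketch {
  std::vector<int> osds;            // all OSDs found in the CRUSH bucket
  std::vector<int> ok_upgrade;      // OSDs that are safe to upgrade now
  std::vector<int> ok_upgraded;     // OSDs already running the new version
  std::vector<int> bad_no_version;  // OSDs whose version could not be read

  // True only if there is something to upgrade and no version is unknown.
  bool ok_to_upgrade() const {
    return !ok_upgrade.empty() && bad_no_version.empty();
  }

  // True only if every OSD in the bucket is already on the new version.
  bool all_osds_upgraded() const {
    return (osds.size() == ok_upgraded.size()) &&
           ok_upgrade.empty() && bad_no_version.empty();
  }
};
```

The two predicates cannot both be true: `ok_to_upgrade()` needs a non-empty `ok_upgrade`, while `all_osds_upgraded()` needs it empty. This matches the handler, which only checks `all_osds_upgraded()` inside the branch taken when `ok_to_upgrade()` is false.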
diff --git a/src/mgr/MgrCommands.h b/src/mgr/MgrCommands.h

index 7adb215da01972f8bc42d9ac67a1261f0835f5f3..f44c615baa3e227bb725286adf3ad0853d97e09a 100644 (file)
--- a/src/mgr/MgrCommands.h
+++ b/src/mgr/MgrCommands.h
@@ -161,6 +161,14 @@ COMMAND("osd ok-to-stop name=ids,type=CephString,n=N "\
         "name=max,type=CephInt,req=false",
         "check whether osd(s) can be safely stopped without reducing immediate"\
         " data availability", "osd", "r")
+COMMAND("osd ok-to-upgrade " \
+        "name=crush_bucket,type=CephString " \
+        "name=ceph_version,type=CephString " \
+        "name=max,type=CephInt,req=false",
+        "determine a safe number of osd(s) subject to a maximum (if specified)" \
+        " within the provided CRUSH bucket that can be safely" \
+        " upgraded without reducing immediate data availability",
+        "osd", "r")
  
  COMMAND("osd scrub " \
         "name=who,type=CephString", \