From c63b188a9fb1431827b2299c5dfa2074c078854a Mon Sep 17 00:00:00 2001 From: Sridhar Seshasayee Date: Mon, 27 Oct 2025 22:04:54 +0530 Subject: [PATCH] mgr/DaemonServer: Implement ok-to-upgrade command Implement a new Mgr command called 'ok-to-upgrade' that returns a set of OSDs within the provided CRUSH bucket that are safe to upgrade without reducing immediate data availability. The command accepts the following as input: - CRUSH bucket name (required) - The CRUSH bucket type is limited to 'rack', 'chassis', 'host' and 'osd'. This prevents users from specifying a bucket type higher up the tree, which could result in performance issues if the number of OSDs in the bucket is very high. - The new Ceph version to check against. The format accepted is the short form of the Ceph version, e.g. 20.3.0-3803-g63ca1ffb5a2. (required) - The maximum number of OSDs to consider, if specified. (optional) Implementation Details: After sanity checks on the provided parameters, the following steps are performed: 1. The set of OSDs within the CRUSH bucket is first determined. 2. From the main set of OSDs, a filtered set of OSDs not yet running the new Ceph version is created. - For this purpose, the OSD's 'ceph_version_short' string is read from the metadata using a new method, DaemonServer::get_osd_metadata(). The information is determined from the DaemonStatePtr maintained within the DaemonServer. 3. If all OSDs are already running the new Ceph version, a success report is generated and returned. 4. If OSDs are not running the new Ceph version, a new set (to_upgrade) is created. 5. If the current version cannot be determined, an error is logged and the output report is generated with the 'bad_no_version' field populated with the OSDs in question. 6. On the new set (to_upgrade), the existing logic in _check_offlines_pgs() is executed to see if stopping any or all OSDs in the set as part of the upgrade would reduce immediate data availability. 
- If data availability is impacted, then the number of OSDs in the filtered set is reduced by a factor defined by a new config option called 'mgr_osd_upgrade_check_convergence_factor' which is set to 0.8 by default. - The logic in _check_offlines_pgs() is repeated for the new set. - The above is repeated until a safe subset of OSDs that can be stopped for upgrade is found. Each iteration reduces the number of OSDs to check by the convergence factor mentioned above. 7. It must be noted that the default value of 'mgr_osd_upgrade_check_convergence_factor' is on the higher side in order to help determine an optimal set of OSDs to upgrade. In other words, a higher convergence factor helps maximize the number of OSDs to upgrade. In this case, the number of iterations, and therefore the time taken to determine the OSDs to upgrade, is proportional to the number of OSDs in the CRUSH bucket. The converse is true if a lower convergence factor is used. 8. If the number of OSDs determined is lower than the 'max' specified, then an additional loop is executed to determine if other children of the CRUSH bucket can be added to the existing set. 9. Once a viable set is determined, an output report similar to the following is generated: A standalone test is introduced that exercises the logic for both replicated and erasure-coded pools by manipulating the min_size of a pool and checking for upgradability. The test also performs other basic sanity checks and exercises error conditions. 
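The reduction loop described in steps 6 and 7 can be sketched as follows. This is a standalone illustration, not code from the patch: `converge_to_safe_set` and the `ok_to_stop` predicate are hypothetical stand-ins for the real `_maximize_ok_to_upgrade_set()` / `_check_offlines_pgs()` machinery.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the convergence loop: keep shrinking the candidate set by the
// convergence factor until the (stand-in) availability check passes,
// bottoming out at a single OSD. An empty result means no safe set exists.
std::vector<int> converge_to_safe_set(
    std::vector<int> to_upgrade,
    double convergence_factor,                    // e.g. 0.8 (the default)
    bool (*ok_to_stop)(const std::vector<int>&))  // stand-in for _check_offlines_pgs()
{
  size_t count = to_upgrade.size();
  while (!to_upgrade.empty() && !ok_to_stop(to_upgrade)) {
    if (count == 1) {
      return {};  // not even one OSD can be stopped safely
    }
    // Reduce the candidate count by the convergence factor, floor at 1,
    // and keep only the first 'count' OSDs (preserving bucket order).
    count = std::max<size_t>(
        1, static_cast<size_t>(count * convergence_factor));
    to_upgrade.erase(to_upgrade.begin() + count, to_upgrade.end());
  }
  return to_upgrade;
}
```

With the default factor of 0.8 and 10 candidate OSDs, the candidate counts tried are 10, 8, 6, 4, 3, 2, 1, which is why a higher factor costs more iterations but converges on a larger safe set.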
The output shown below is for a cluster running on a single node with 10 OSDs and with replicated pool configuration: $ ceph osd ok-to-upgrade incerta06 01.00.00-gversion-test --format=json {"ok_to_upgrade":true,"all_osds_upgraded":false,\ "osds_in_crush_bucket":[0,1,2,3,4,5,6,7,8,9],\ "osds_ok_to_upgrade":[0],"osds_upgraded":[],"bad_no_version":[]} The following report is shown if all OSDs are running the desired Ceph version: $ ceph osd ok-to-upgrade --crush_bucket localrack \ --ceph_version 20.3.0-3803-g63ca1ffb5a2 {"ok_to_upgrade":false,"all_osds_upgraded":true,\ "osds_in_crush_bucket":[0,1,2,3,4,5,6,7,8,9],"osds_ok_to_upgrade":[],\ "osds_upgraded":[0,1,2,3,4,5,6,7,8,9],"bad_no_version":[]} Fixes: https://tracker.ceph.com/issues/73031 Signed-off-by: Sridhar Seshasayee --- PendingReleaseNotes | 12 + doc/man/8/ceph.rst | 49 +++- qa/standalone/misc/ok-to-upgrade.sh | 294 +++++++++++++++++++ src/common/options/mgr.yaml.in | 24 ++ src/mgr/DaemonServer.cc | 431 ++++++++++++++++++++++++++++ src/mgr/DaemonServer.h | 41 +++ src/mgr/MgrCommands.h | 8 + 7 files changed, 858 insertions(+), 1 deletion(-) create mode 100755 qa/standalone/misc/ok-to-upgrade.sh diff --git a/PendingReleaseNotes b/PendingReleaseNotes index 8060eb62b6e..36e414648ef 100644 --- a/PendingReleaseNotes +++ b/PendingReleaseNotes @@ -247,6 +247,18 @@ to ensure compatibility during upgrades, but can be disabled once old usage logs are no longer present to avoid performance overhead. +* MGR: A new command, `ceph osd ok-to-upgrade`, has been added that allows + users and orchestration tools to determine a safe set of OSDs within a CRUSH + bucket to upgrade simultaneously without impacting data availability. To help + converge to a safe set, a new config option + ``mgr_osd_upgrade_check_convergence_factor`` is introduced. This option can be + modified (if necessary) to help converge to an optimal set. Higher values + maximize the set of OSDs to upgrade at the cost of longer command response + times. 
Conversely, a lower value improves the command response time but + results in a non-optimal or smaller set of OSDs which impacts the overall time + to upgrade all OSDs in the cluster. For more details see tracker: + https://tracker.ceph.com/issues/73031. + >=19.2.1 * CephFS: The `fs subvolume create` command now allows tagging subvolumes through option diff --git a/doc/man/8/ceph.rst b/doc/man/8/ceph.rst index 0afc3cbbe26..cec4300bc45 100644 --- a/doc/man/8/ceph.rst +++ b/doc/man/8/ceph.rst @@ -37,7 +37,7 @@ Synopsis | **ceph** **mon** [ *add* \| *dump* \| *enable_stretch_mode* \| *getmap* \| *remove* \| *stat* ] ... -| **ceph** **osd** [ *blocklist* \| *blocked-by* \| *create* \| *new* \| *deep-scrub* \| *df* \| *down* \| *dump* \| *erasure-code-profile* \| *find* \| *getcrushmap* \| *getmap* \| *getmaxosd* \| *in* \| *ls* \| *lspools* \| *map* \| *metadata* \| *ok-to-stop* \| *out* \| *pause* \| *perf* \| *pg-temp* \| *force-create-pg* \| *primary-affinity* \| *primary-temp* \| *repair* \| *reweight* \| *reweight-by-pg* \| *rm* \| *destroy* \| *purge* \| *safe-to-destroy* \| *scrub* \| *set* \| *setcrushmap* \| *setmaxosd* \| *stat* \| *tree* \| *unpause* \| *unset* ] ... +| **ceph** **osd** [ *blocklist* \| *blocked-by* \| *create* \| *new* \| *deep-scrub* \| *df* \| *down* \| *dump* \| *erasure-code-profile* \| *find* \| *getcrushmap* \| *getmap* \| *getmaxosd* \| *in* \| *ls* \| *lspools* \| *map* \| *metadata* \| *ok-to-stop* \| *ok-to-upgrade* \| *out* \| *pause* \| *perf* \| *pg-temp* \| *force-create-pg* \| *primary-affinity* \| *primary-temp* \| *repair* \| *reweight* \| *reweight-by-pg* \| *rm* \| *destroy* \| *purge* \| *safe-to-destroy* \| *scrub* \| *set* \| *setcrushmap* \| *setmaxosd* \| *stat* \| *tree* \| *unpause* \| *unset* ] ... 
| **ceph** **osd** **crush** [ *add* \| *add-bucket* \| *create-or-move* \| *dump* \| *get-tunable* \| *link* \| *move* \| *remove* \| *rename-bucket* \| *reweight* \| *reweight-all* \| *reweight-subtree* \| *rm* \| *rule* \| *set* \| *set-tunable* \| *show-tunables* \| *tunables* \| *unlink* ] ... @@ -1113,6 +1113,53 @@ Usage:: ceph osd ok-to-stop <id> [<ids>...] [--max <num>] +Subcommand ``ok-to-upgrade`` determines a safe set of OSDs found within the +specified CRUSH bucket to upgrade simultaneously without impacting cluster +data availability and with all data remaining readable and writeable. Data +redundancy may be reduced with some PGs in degraded (but active) state. The +command checks the Ceph version running on the OSDs against the specified +version and filters those still needing upgrade. The command returns a +success code if it finds a safe set of OSD(s) to upgrade and shows the list +of OSD(s) in the response, or an error code and an informative message if no +safe set is found or no conclusion can be drawn. + +The CRUSH bucket types passed to the command can be one of 'rack', 'chassis', +'host' or 'osd'. This restriction is to avoid performance issues with larger +failure domains where the number of OSDs to check could be very high, and to +help manage any failures during upgrades. + +The expected format of the option ``<ceph_version>`` is the short form +of the Ceph version string. The version string format is similar to the value of +the ``ceph_version_short`` key seen in the output of the ``ceph osd metadata <id>`` +command where ``id`` is the OSD number. + +When ``--max <num>`` is provided, up to ``<num>`` OSD IDs found either within the +provided CRUSH bucket or across the CRUSH hierarchy that can be stopped for +upgrade simultaneously will be returned. This logic can for example be triggered +by specifying a single starting OSD and a max number. The search then spans both +within and across the CRUSH hierarchy and additional OSDs are drawn from those +locations. 
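The short-version format described above can be checked with a pattern like the one below. This is a self-contained sketch mirroring the validation this patch adds to DaemonServer.cc; `valid_short_version` is an illustrative name, not part of the patch.

```cpp
#include <regex>
#include <string>

// Sketch of the 'ceph_version_short' format check: MAJOR.MINOR.PATCH-COMMITS
// followed by either an upstream git suffix (-g<sha>) or a downstream OS
// suffix (.el9cp and similar), as accepted by the ok-to-upgrade command.
static bool valid_short_version(const std::string& v) {
  static const std::regex pat(
      R"(^(\d+)\.(\d+)\.(\d+)-(\d+)(-g[0-9a-f]+|\.el\d+[a-z]+)$)");
  return std::regex_match(v, pat);
}
```

For example, `20.3.0-3803-g63ca1ffb5a2` and `20.1.0-144.el9cp` are accepted, while a bare release string such as `20.3.0` is rejected.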
+ +The command automatically determines a safe set of OSDs to upgrade found in the +provided CRUSH bucket. If not all OSDs in the CRUSH bucket can be upgraded +simultaneously, the command uses the config option +``mgr_osd_upgrade_check_convergence_factor`` to progressively reduce the set of +OSDs to check until a safe set is found. Note that the default value is on the +higher side to help determine an optimal set of OSDs to upgrade. A higher +convergence factor will help maximize the number of OSDs to upgrade at the cost +of more iterations and time to find the set. The converse is true if a lower +convergence factor is used. A lower value should be used only if the command is +sluggish to respond. + +It must be noted that this command leverages the underlying logic of the +``ok-to-stop`` command. The key difference is that the ``ok-to-upgrade`` command +operates strictly on the OSDs found in the CRUSH bucket and considers adjacent +CRUSH locations if necessary to satisfy the ``--max`` criteria. + +Usage:: + + ceph osd ok-to-upgrade <crush_bucket> <ceph_version> [--max <num>] + Subcommand ``pause`` pauses osd. Usage:: diff --git a/qa/standalone/misc/ok-to-upgrade.sh b/qa/standalone/misc/ok-to-upgrade.sh new file mode 100755 index 00000000000..3436e4840c2 --- /dev/null +++ b/qa/standalone/misc/ok-to-upgrade.sh @@ -0,0 +1,294 @@ +#!/usr/bin/env bash +# +# Copyright (C) 2025 IBM +# +# Author: Sridhar Seshasayee +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU Library Public License as published by +# the Free Software Foundation; either version 2, or (at your option) +# any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU Library Public License for more details. 
+# + +source $CEPH_ROOT/qa/standalone/ceph-helpers.sh + +function run() { + local dir=$1 + shift + + export CEPH_MON="127.0.0.1:7170" # git grep '\<7170\>' : there must be only one + export CEPH_ARGS + CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none " + export ORIG_CEPH_ARGS="$CEPH_ARGS" + + local funcs=${@:-$(set | ${SED} -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')} + for func in $funcs ; do + setup $dir || return 1 + $func $dir || return 1 + kill_daemons $dir KILL || return 1 + teardown $dir || return 1 + done +} + +function TEST_ok_to_upgrade_invalid_args() { + local dir=$1 + CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON " + + run_mon $dir a --public-addr=$CEPH_MON || return 1 + run_mgr $dir x || return 1 + run_osd $dir 0 --osd-mclock-skip-benchmark=true || return 1 + + # test with no args + ! ceph osd ok-to-upgrade || return 1 + + # test with invalid crush bucket name + local crush_bucket="foo" + local ceph_version_short="01.2.3-1234-g1234deed" + ! ceph osd ok-to-upgrade $crush_bucket $ceph_version_short || return 1 + + # test with 'root' crush bucket name + crush_bucket="default" + ! ceph osd ok-to-upgrade $crush_bucket $ceph_version_short || return 1 + + # test with invalid ceph_version formats + crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }') + ceph_versions=("" "foo" "20" "20.3.0" "20.3.0-1234" \ + "20.3.0-1234-g" "20.1.0-145.el") + for ver in "${ceph_versions[@]}"; do + ! ceph osd ok-to-upgrade $crush_bucket $ver || return 1 + done + + # Invalid max parameter + max=-20 + ! 
ceph osd ok-to-upgrade $crush_bucket $ver $max || return 1 +} + +function TEST_ok_to_upgrade_replicated_pool() { + local dir=$1 + local poolname="test" + local OSDS=10 + local ceph_version="01.2.3-1234-g1234deed" + + CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON " + + run_mon $dir a --public-addr=$CEPH_MON || return 1 + run_mgr $dir x || return 1 + + for osd in $(seq 0 $(expr $OSDS - 1)) + do + run_osd $dir $osd --osd-mclock-skip-benchmark=true || return 1 + done + + create_pool $poolname 32 32 + ceph osd pool set $poolname min_size 1 + sleep 5 + + wait_for_clean || return 1 + + # Test for upgradability with min_size=1 + local exp_osds_upgradable=2 + local crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }') + local res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + # Specifying hostname as the crush bucket with a 3x replicated pool on 10 OSDs + # and with the default 'mgr_osd_upgrade_check_convergence_factor' would result + # in 4 OSDs being reported as upgradable. + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = true || return 1 + local num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $num_osds_upgradable -ge $exp_osds_upgradable || return 1 + local num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for upgradability with min_size=1, 1 OSD to upgrade and max=2. + # This tests the functionality of the 'max' parameter and checks the + # logic to find more OSDs in the crush bucket. 
+ local max=2 + exp_osds_upgradable=2 + crush_bucket="osd.0" + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version $max --format=json) + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = true || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + test $max = $num_osds_upgradable || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for upgradability with min_size=2 + ceph osd pool set $poolname min_size 2 + sleep 5 + wait_for_clean || return 1 + exp_osds_upgradable=1 + crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }') + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + # 3 OSDs should be reported as upgradable. + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = true || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $num_osds_upgradable -ge $exp_osds_upgradable || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for upgradability with min_size=3 + ceph osd pool set $poolname min_size 3 + sleep 5 + wait_for_clean || return 1 + exp_osds_upgradable=0 + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + # No OSD should be reported as upgradable. + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = false || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for condition when all OSDs are running desired version. 
+ upgrade_version=$(ceph osd metadata 0 --format=json | \ + jq '.ceph_version_short' | sed 's/"//g') + res=$(ceph osd ok-to-upgrade $crush_bucket $upgrade_version --format=json) + test $(echo $res | jq '.all_osds_upgraded') = true || return 1 + test $(echo $res | jq '.ok_to_upgrade') = false || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $num_osds_upgradable -eq 0 || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq $OSDS || return 1 +} + +function TEST_ok_to_upgrade_erasure_pool() { + local dir=$1 + local poolname="ec" + local OSDS=10 + local ceph_version="01.2.3-1234-g1234deed" + + CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON " + + run_mon $dir a --public-addr=$CEPH_MON || return 1 + run_mgr $dir x || return 1 + + for osd in $(seq 0 $(expr $OSDS - 1)) + do + run_osd $dir $osd --osd-mclock-skip-benchmark=true || return 1 + done + + ceph osd erasure-code-profile set ec-profile m=3 k=5 crush-failure-domain=osd || return 1 + ceph osd pool create $poolname erasure ec-profile || return 1 + ceph osd pool set $poolname min_size 5 + sleep 5 + + wait_for_clean || return 1 + + # Test for upgradability with min_size=5 + local exp_osds_upgradable=3 + local crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }') + local res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + # Specifying hostname as the crush bucket with a ec5+3 pool on 10 OSDs + # and with the default 'mgr_osd_upgrade_check_convergence_factor' would result + # in 3 OSDs being reported as upgradable. 
+ test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = true || return 1 + local num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + local num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for upgradability with min_size=5, 1 OSD to upgrade and max=3. + # This tests the functionality of the 'max' parameter and also checks + # the logic to find more OSDs in the crush bucket. + local max=3 + crush_bucket="osd.0" + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version $max --format=json) + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = true || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + test $max = $num_osds_upgradable || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for upgradability with min_size=6 + ceph osd pool set $poolname min_size 6 + sleep 5 + wait_for_clean || return 1 + exp_osds_upgradable=2 + crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }') + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + # 2 OSDs should be reported as upgradable. 
+ test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = true || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for upgradability with min_size=8 + ceph osd pool set $poolname min_size 8 + sleep 5 + wait_for_clean || return 1 + exp_osds_upgradable=0 + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + # No OSD should be reported as upgradable. + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = false || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq 0 || return 1 + + # Test for condition when all OSDs are running desired version. 
+ ceph_version=$(ceph osd metadata 0 --format=json | \ + jq '.ceph_version_short' | sed 's/"//g') + res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + test $(echo $res | jq '.all_osds_upgraded') = true || return 1 + test $(echo $res | jq '.ok_to_upgrade') = false || return 1 + num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $num_osds_upgradable -eq 0 || return 1 + num_osds_upgraded=$(echo $res | jq '.osds_upgraded | length' | bc) + test $num_osds_upgraded -eq $OSDS || return 1 +} + +function TEST_ok_to_upgrade_bad_osd_version() { + local dir=$1 + local poolname="test" + local OSDS=3 + local ceph_version="01.2.3-1234-g1234deed" + + CEPH_ARGS="$ORIG_CEPH_ARGS --mon-host=$CEPH_MON " + + run_mon $dir a --public-addr=$CEPH_MON || return 1 + run_mgr $dir x || return 1 + + for osd in $(seq 0 $(expr $OSDS - 1)) + do + run_osd $dir $osd --osd-mclock-skip-benchmark=true || return 1 + done + + create_pool $poolname 8 8 + ceph osd pool set $poolname min_size 1 + sleep 5 + + wait_for_clean || return 1 + + # Set the option to enable testing metadata errors + ceph config set mgr mgr_test_metadata_error true + + # Test for upgradability with min_size=1 + local exp_osds_upgradable=0 + local exp_osds_bad_version=3 + local crush_bucket=$(ceph osd tree | grep host | awk '{ print $4 }') + local res=$(ceph osd ok-to-upgrade $crush_bucket $ceph_version --format=json) + test $(echo $res | jq '.all_osds_upgraded') = false || return 1 + test $(echo $res | jq '.ok_to_upgrade') = false || return 1 + local num_osds_upgradable=$(echo $res | jq '.osds_ok_to_upgrade | length' | bc) + test $exp_osds_upgradable = $num_osds_upgradable || return 1 + local num_osds_bad_version=$(echo $res | jq '.bad_no_version | length' | bc) + test $num_osds_bad_version -eq 3 || return 1 +} + + +main ok-to-upgrade "$@" diff --git a/src/common/options/mgr.yaml.in b/src/common/options/mgr.yaml.in index c6bdee1d156..658a0160987 100644 --- 
a/src/common/options/mgr.yaml.in +++ b/src/common/options/mgr.yaml.in @@ -379,3 +379,27 @@ options: services: - mgr with_legacy: true +- name: mgr_osd_upgrade_check_convergence_factor + type: float + level: advanced + desc: The factor used to converge to a subset of OSDs within a CRUSH bucket + that can be upgraded without affecting immediate data availability. + fmt_desc: The factor used in calculations to converge to a subset of OSDs that + can be safely upgraded simultaneously. Each iteration of the calculation + uses this factor until a safe subset is found. The smaller the factor, the + lower the number of iterations needed to find a safe set, but the number of + OSDs found may not be optimal. Conversely, with a larger factor, a higher + number of iterations and time would be consumed to find a safe set. The + number of OSDs found in this case would be closer to optimal. + default: 0.8 + min: 0.1 + max: 0.9 + services: + - mgr +- name: mgr_test_metadata_error + type: bool + level: dev + desc: Used for simulating errors during operations involving metadata. + default: false + services: + - mgr diff --git a/src/mgr/DaemonServer.cc b/src/mgr/DaemonServer.cc index 9dbcff6feb8..9f20dc0dad3 100644 --- a/src/mgr/DaemonServer.cc +++ b/src/mgr/DaemonServer.cc @@ -1164,6 +1164,335 @@ void DaemonServer::_maximize_ok_to_stop_set( } } +void DaemonServer::_update_upgraded_osds( + const std::vector<int>& orig_osds, + const std::vector<int>& to_upgrade, + const std::vector<int>& upgraded, + const std::vector<int>& version_unknown, + upgrade_osd_report *report) +{ + // reset output + *report = upgrade_osd_report(); + report->osds = orig_osds; + report->ok_upgrade = to_upgrade; + report->ok_upgraded = upgraded; + report->bad_no_version = version_unknown; +} + +bool DaemonServer::_valid_bucket_type_for_upgrade_check( + std::string_view bucket_type_str) +{ + if (bucket_type_str.empty()) { + dout(20) << "bucket type string is empty!" 
<< dendl; + return false; + } + + return (bucket_type_str == "rack" || bucket_type_str == "chassis" || + bucket_type_str == "host" || bucket_type_str == "osd"); +} + +int DaemonServer::_populate_crush_bucket_osds( + const int item_id, + const OSDMap& osdmap, + std::vector<int>& crush_bucket_osds, + std::ostream *ss) +{ + int r = 0; + int btype = osdmap.crush->get_bucket_type(item_id); + if (btype < 0) { + // For negative type an OSD may be assumed + btype = 0; + } + std::string item_name = osdmap.crush->get_item_name(item_id); + std::string bucket_type_str = osdmap.crush->get_type_name(btype); + if (!_valid_bucket_type_for_upgrade_check(bucket_type_str)) { + ostringstream os; + os << "crush bucket \"" << item_name << "\" of type " + << "\"" << bucket_type_str << "\" is incompatible for " + << "upgradability check; valid types are: 'rack', 'chassis', " + << "'host' and 'osd'"; + if (ss) { + *ss << os.str(); + } + dout(20) << os.str() << dendl; + return -EINVAL; + } + dout(20) << "bucket type of parent " << item_name << " is " + << bucket_type_str << dendl; + + std::vector<std::string> bucket_names; + // get candidate additions that are beneath this point in the tree + if (bucket_type_str == "rack" || bucket_type_str == "chassis") { + std::list<int> crush_bucket_children; + // Get the list of children + if (osdmap.crush->get_children(item_id, &crush_bucket_children) <= 0) { + ostringstream os; + os << "crush bucket \"" << item_name << "\" of type: " + << bucket_type_str << " has no children!"; + if (ss) { + *ss << os.str(); + } + dout(20) << os.str() << dendl; + return -ENOENT; + } + // create a list of bucket names pertaining to each child in the tree + for (const auto &child : crush_bucket_children) { + bucket_names.push_back(osdmap.crush->get_item_name(child)); + } + } else if (bucket_type_str == "host" || bucket_type_str == "osd") { + bucket_names.push_back(item_name); + } + // get osds under each child bucket + std::set<int> bucket_osds; + for (const auto &item : bucket_names) { + r = 
osdmap.get_osds_by_bucket_name(item, &bucket_osds); + if (r < 0) { + ostringstream os; + os << "cannot parse crush bucket:\"" << item + << "\" of type: " << bucket_type_str << ". " + << "got error code: " << r; + if (ss) { + *ss << os.str(); + } + dout(20) << os.str() << dendl; + return r; + } + // The osds are pushed to the referenced crush_bucket_osds + // vector to maintain the order of osds according to the + // child order. This helps optimize the result of + // _check_offlines_pgs() down the line. + for (const auto &osd : bucket_osds) { + crush_bucket_osds.push_back(osd); + } + dout(20) << "Picked children: " << bucket_osds + << " from parent: " << item << dendl; + } + return r; +} + +void DaemonServer::_maximize_ok_to_upgrade_set( + const std::vector<int>& orig_osds, + unsigned max, + const OSDMap& osdmap, + const PGMap& pgmap, + std::string_view ceph_version_new, + upgrade_osd_report *out_osd_report, + offline_pg_report *out_pg_report, + std::ostream *ss) +{ + std::vector<int> to_upgrade; + std::vector<int> upgraded; + std::vector<int> version_unknown; + + dout(20) << "orig_osds " << orig_osds + << " new ceph_version " << ceph_version_new << dendl; + // Filter osds not yet running the new ceph_version. + // Limit the check for safe upgrade to only the set + // of OSDs that are still running the older version. + for (const auto& osd : orig_osds) { + auto osd_id = "osd." 
+ std::to_string(osd); + auto ver = get_osd_metadata("ceph_version_short", osd_id); + if (ver.has_value()) { + if (*ver != ceph_version_new) { + dout(20) << "found " << osd_id << " to upgrade" << dendl; + to_upgrade.push_back(osd); + } else { + dout(20) << osd_id << " is already running the new version(" + << *ver << ")" << dendl; + upgraded.push_back(osd); + } + } else { + derr << "couldn't determine 'ceph_version_short' for " + << osd_id << dendl; + version_unknown.push_back(osd); + } + } + + // Check if all OSDs are upgraded + _update_upgraded_osds(orig_osds, to_upgrade, upgraded, + version_unknown, out_osd_report); + if (!out_osd_report->bad_no_version.empty()) { + dout(20) << "'ceph_version_short' on osds couldn't be determined" << dendl; + return; + } + if (out_osd_report->all_osds_upgraded()) { + dout(20) << "all osds are upgraded!" << dendl; + return; + } + + // Re-try until we can find a safe subset of OSDs to upgrade. + // On each attempt reduce the original set of OSDs to check by a + // factor defined by 'mgr_osd_upgrade_check_convergence_factor'. + // If no safe number can be found after all attempts, a minimum of + // 1 OSD is attempted. + const double convergence_factor = + g_conf().get_val<double>("mgr_osd_upgrade_check_convergence_factor"); + size_t osd_subset_count = to_upgrade.size(); + while (true) { + // Check impact to PGs with the filtered set. Use the existing + // ok-to-stop logic for this purpose. + _check_offlines_pgs(to_upgrade, osdmap, pgmap, out_pg_report); + if (!out_pg_report->ok_to_stop()) { + if (osd_subset_count == 1) { + // This means that there's no safe set of OSDs to upgrade. + // This probably indicates a problem with the cluster configuration. + to_upgrade.clear(); + _update_upgraded_osds(orig_osds, to_upgrade, upgraded, + version_unknown, out_osd_report); + return; + } + // Reduce the number of OSDs in the set by the convergence factor. 
+ osd_subset_count = std::max<size_t>( + 1, static_cast<size_t>(osd_subset_count * convergence_factor)); + // Prune the 'to-upgrade' set to hold the new subset of OSDs + auto start_it = std::next(to_upgrade.begin(), osd_subset_count); + auto end_it = to_upgrade.end(); + to_upgrade.erase(start_it, end_it); + // reset pg report + *out_pg_report = offline_pg_report(); + } else { + _update_upgraded_osds(orig_osds, to_upgrade, upgraded, + version_unknown, out_osd_report); + if (out_osd_report->ok_to_upgrade()) { + // Found a safe subset! Break and generate the output. + dout(20) << "found " << osd_subset_count << " OSDs that are safe to " + << "upgrade" << dendl; + break; + } + } + } + if (to_upgrade.size() >= max) { + // already at max + dout(20) << "to_upgrade(" << to_upgrade.size() << ") >= " + << " max(" << max << ")" << dendl; + return; + } + + /** + * semi-arbitrarily start with the first osd in the 'to_upgrade' + * vector and see if we can add more osds to upgrade. The reason + * for using a vector instead of set is to preserve the order of + * OSDs according to the order of other parent and their child + * buckets. This order ensures that the offline pgs check can + * correctly determine the outcome of a set of OSDs stopped from + * a specific bucket. + */ + offline_pg_report _pg_report; + upgrade_osd_report _osd_report; + std::vector<int> osds = to_upgrade; + int parent = *osds.begin(); + std::vector<int> children; + + dout(20) << "Trying to add more children..." << dendl; + while (true) { + // identify the next parent + int r = osdmap.crush->get_immediate_parent_id(parent, &parent); + if (r < 0) { + dout(20) << "No parent found for item id: " << parent << dendl; + return; // just go with what we have so far! + } + + // get candidate additions that are beneath this point in the tree + children.clear(); + r = _populate_crush_bucket_osds(parent, osdmap, children); + if (r != 0) { + return; // just go with what we have so far! 
+ } + + // try adding in more osds from the list of children + // determined above to maximize the upgrade set. + int failed = 0; // how many children we failed to add to our set + for (auto o : children) { + auto it = std::find(osds.begin(), osds.end(), o); + bool can_add_osd = (it == osds.end()); + if (o >= 0 && osdmap.is_up(o) && can_add_osd) { + osds.push_back(o); + _check_offlines_pgs(osds, osdmap, pgmap, &_pg_report); + if (!_pg_report.ok_to_stop()) { + osds.pop_back(); + ++failed; + continue; + } + _update_upgraded_osds(orig_osds, osds, upgraded, + version_unknown, &_osd_report); + *out_pg_report = _pg_report; + *out_osd_report = _osd_report; + if (osds.size() == max) { + dout(20) << " hit max" << dendl; + if (out_osd_report->ok_to_upgrade()) { + // Found additional children that can be upgraded + dout(20) << "found " << osds.size() - to_upgrade.size() + << " additional OSD(s) to upgrade" << dendl; + } + return; // yay, we hit the max + } + } + } + + if (failed) { + // we hit some failures; go with what we have + dout(20) << " hit some peer failures" << dendl; + return; + } + } +} + +std::optional<std::string> DaemonServer::get_osd_metadata( + const std::string& name, + const std::string& osd_id) +{ + if (g_conf().get_val<bool>("mgr_test_metadata_error")) { + return std::nullopt; + } + + auto [key, valid] = DaemonKey::parse(osd_id); + if (!valid) { + derr << "invalid daemon name: use <type>.<id>" << dendl; + return std::nullopt; + } + DaemonStatePtr daemon = daemon_state.get(key); + if (!daemon) { + derr << "daemon " << osd_id << " not found!" 
+         << dendl;
+    return std::nullopt;
+  }
+
+  std::lock_guard l(daemon->lock);
+  auto p = daemon->metadata.find(name);
+  if (p != daemon->metadata.end() && !p->second.empty()) {
+    return p->second;
+  }
+  return std::nullopt;
+}
+
+void upgrade_osd_report::dump(Formatter *f) const {
+  f->dump_bool("ok_to_upgrade", ok_to_upgrade());
+  f->dump_bool("all_osds_upgraded", all_osds_upgraded());
+
+  f->open_array_section("osds_in_crush_bucket");
+  for (auto o : osds) {
+    f->dump_int("osd", o);
+  }
+  f->close_section();
+
+  f->open_array_section("osds_ok_to_upgrade");
+  for (auto o : ok_upgrade) {
+    f->dump_int("ok_upgrade", o);
+  }
+  f->close_section();
+
+  f->open_array_section("osds_upgraded");
+  for (auto o : ok_upgraded) {
+    f->dump_int("ok_upgraded", o);
+  }
+  f->close_section();
+
+  f->open_array_section("bad_no_version");
+  for (auto o : bad_no_version) {
+    f->dump_int("bad_no_version", o);
+  }
+  f->close_section();
+}
+
 bool DaemonServer::_handle_command(
   std::shared_ptr<CommandContext>& cmdctx)
 {
@@ -1915,6 +2244,108 @@ bool DaemonServer::_handle_command(
       cmdctx->reply(0, ss);
     }
     return true;
+  } else if (prefix == "osd ok-to-upgrade") {
+    std::string crush_bucket_name;
+    cmd_getval(cmdctx->cmdmap, "crush_bucket", crush_bucket_name);
+    std::string ceph_version;
+    cmd_getval(cmdctx->cmdmap, "ceph_version", ceph_version);
+    int64_t max = 1;
+    cmd_getval(cmdctx->cmdmap, "max", max);
+    int r;
+    std::vector<int> osds_in_crush_bucket;
+    // Validate max parameter
+    if (max < 0) {
+      ss << "Invalid 'max' value: " << max << ". 'max' must be non-negative.";
+      cmdctx->reply(-EINVAL, ss);
+      return true;
+    }
+    // Validate ceph_version format. The pattern is generic and matches
+    // the upstream and downstream version formats. Note that the suffix
+    // matches either the upstream Git format or the downstream OS format.
+    std::regex ceph_version_pattern
+      (R"(^(\d+)\.(\d+)\.(\d+)-(\d+)(-g[0-9a-f]+|\.el\d+[a-z]+)$)");
+    std::smatch matches;
+    if (!std::regex_match(ceph_version, matches, ceph_version_pattern)) {
+      ss << "Invalid Ceph version (short) format. The format to use is the"
+         << " same as 'ceph_version_short' found in OSD metadata."
+         << " Examples: \"20.3.0-3803-g63ca1ffb5a2\", \"20.1.0-144.el9cp\".";
+      cmdctx->reply(-EINVAL, ss);
+      return true;
+    }
+    // Validate the crush bucket name & type. For this command the
+    // bucket type is limited to 'rack', 'chassis', 'host' or 'osd'.
+    // This is to help limit the number of OSDs and avoid
+    // performance issues during the upgrade check.
+    cluster_state.with_osdmap([&](const OSDMap& osdmap) {
+      // Validate crush bucket
+      if (!osdmap.crush->name_exists(crush_bucket_name)) {
+        ss << "\"" << crush_bucket_name << "\" does not exist";
+        r = -ENOENT;
+        return;
+      }
+      int id = osdmap.crush->get_item_id(crush_bucket_name);
+      // get the OSDs that are beneath this point in the tree
+      r = _populate_crush_bucket_osds(id, osdmap, osds_in_crush_bucket, &ss);
+      if (r != 0) {
+        return;
+      }
+    });
+    if (r < 0) {
+      cmdctx->reply(r, ss);
+      return true;
+    }
+    dout(20) << "Crush Bucket OSDs: " << osds_in_crush_bucket << dendl;
+    if (osds_in_crush_bucket.empty()) {
+      ss << "no osds found in crush bucket: \"" << crush_bucket_name << "\"";
+      cmdctx->reply(-ENOENT, ss);
+      return true;
+    }
+    if (max < (int)osds_in_crush_bucket.size()) {
+      max = osds_in_crush_bucket.size();
+    }
+    upgrade_osd_report osd_upgrade_report;
+    offline_pg_report pg_offline_report;
+    cluster_state.with_osdmap_and_pgmap([&](
+        const OSDMap& osdmap, const PGMap& pg_map) {
+      _maximize_ok_to_upgrade_set(
+          osds_in_crush_bucket, max, osdmap, pg_map, ceph_version,
+          &osd_upgrade_report, &pg_offline_report, &ss);
+    });
+    if (!f) {
+      f.reset(Formatter::create("json"));
+    }
+    f->dump_object("ok_to_upgrade", osd_upgrade_report);
+    f->flush(cmdctx->odata);
+
+    cmdctx->odata.append("\n");
+    if (!osd_upgrade_report.ok_to_upgrade()) {
+      if (!pg_offline_report.unknown.empty()) {
+        ss << pg_offline_report.unknown.size() << " pgs have unknown state; "
+           << "cannot draw any conclusions at this time; re-try after pgs "
+           << "transition to known states";
+        cmdctx->reply(-EBUSY, ss);
+        return true;
+      }
+      if (!osd_upgrade_report.bad_no_version.empty()) {
+        ss << osd_upgrade_report.bad_no_version.size()
+           << " osds have unknown version; cannot draw any conclusions";
+        cmdctx->reply(-EAGAIN, ss);
+        return true;
+      }
+      if (!pg_offline_report.ok_to_stop()) {
+        ss << "unsafe to upgrade osd(s) at this time ("
+           << pg_offline_report.not_ok.size()
+           << " PGs are or would become offline)";
+        cmdctx->reply(-EBUSY, ss);
+        return true;
+      }
+      // ok_to_upgrade() would be false in case all osds are upgraded
+      if (osd_upgrade_report.all_osds_upgraded()) {
+        ss << "all " << osds_in_crush_bucket.size()
+           << " osd(s) are running the new Ceph version ("
+           << ceph_version << ")";
+        cmdctx->reply(0, ss);
+        return true;
+      }
+    } else {
+      cmdctx->reply(0, ss);
+    }
+    return true;
   } else if (prefix == "pg force-recovery" ||
              prefix == "pg force-backfill" ||
              prefix == "pg cancel-force-recovery" ||
diff --git a/src/mgr/DaemonServer.h b/src/mgr/DaemonServer.h
index 94b046332f3..edc1fe35fa2 100644
--- a/src/mgr/DaemonServer.h
+++ b/src/mgr/DaemonServer.h
@@ -126,6 +126,22 @@ struct offline_pg_report {
   }
 };
 
+struct upgrade_osd_report {
+  std::vector<int> osds;
+  std::vector<int> ok_upgrade, ok_upgraded, bad_no_version;
+
+  bool ok_to_upgrade() const {
+    return !ok_upgrade.empty() && bad_no_version.empty();
+  }
+
+  bool all_osds_upgraded() const {
+    return ((osds.size() == ok_upgraded.size()) &&
+            ok_upgrade.empty() && bad_no_version.empty());
+  }
+
+  void dump(Formatter *f) const;
+};
+
 /**
  * Server used in ceph-mgr to communicate with Ceph daemons like
  * MDSs and OSDs.
@@ -194,6 +210,31 @@ private:
     const OSDMap& osdmap,
     const PGMap& pgmap,
     offline_pg_report *report);
+  void _maximize_ok_to_upgrade_set(
+    const std::vector<int>& orig_osds,
+    unsigned max,
+    const OSDMap& osdmap,
+    const PGMap& pgmap,
+    std::string_view ceph_version_new,
+    upgrade_osd_report *osd_report,
+    offline_pg_report *pg_report,
+    std::ostream *ss);
+  std::optional<std::string> get_osd_metadata(
+    const std::string& name,
+    const std::string& osd_id);
+  void _update_upgraded_osds(
+    const std::vector<int>& orig_osds,
+    const std::vector<int>& to_upgrade,
+    const std::vector<int>& upgraded,
+    const std::vector<int>& version_unknown,
+    upgrade_osd_report *osd_report);
+  bool _valid_bucket_type_for_upgrade_check(
+    std::string_view bucket_type_str);
+  int _populate_crush_bucket_osds(
+    const int item_id,
+    const OSDMap& osdmap,
+    std::vector<int>& crush_bucket_osds,
+    std::ostream *ss = nullptr);
 
   utime_t started_at;
   std::atomic<bool> pgmap_ready;
diff --git a/src/mgr/MgrCommands.h b/src/mgr/MgrCommands.h
index 7adb215da01..f44c615baa3 100644
--- a/src/mgr/MgrCommands.h
+++ b/src/mgr/MgrCommands.h
@@ -161,6 +161,14 @@ COMMAND("osd ok-to-stop name=ids,type=CephString,n=N "\
        "name=max,type=CephInt,req=false",
        "check whether osd(s) can be safely stopped without reducing immediate"\
        " data availability", "osd", "r")
+COMMAND("osd ok-to-upgrade " \
+       "name=crush_bucket,type=CephString " \
+       "name=ceph_version,type=CephString " \
+       "name=max,type=CephInt,req=false",
+       "determine a safe number of osd(s), subject to a maximum (if specified)," \
+       " within the provided CRUSH bucket that can be safely" \
+       " upgraded without reducing immediate data availability",
+       "osd", "r")
 COMMAND("osd scrub " \
        "name=who,type=CephString", \
--
2.47.3