From: Sridhar Seshasayee
Date: Fri, 16 Apr 2021 11:02:58 +0000 (+0530)
Subject: osd: Override recovery/backfill/sleep options with mclock scheduler.
X-Git-Tag: v17.1.0~2106^2~1
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=8876d27f389ce034732cc7b34869e970a3dc138f;p=ceph.git

osd: Override recovery/backfill/sleep options with mclock scheduler.

Make the osd_*_sleep options modifiable at runtime and add them to the
set of tracked conf keys. This ensures that the sleep options can be
disabled or overridden during OSD bring-up when the mclock scheduler is
employed.

Introduce OSD::maybe_override_options_for_qos(). If the mclock
scheduler is enabled, this method:

- overrides "osd_recovery_max_active" with a high limit of 1000,
- overrides the "osd_max_backfills" option with a high limit of 1000
  and sets the corresponding local and remote AsyncReserver objects to
  the same value (1000),
- disables the osd_*_sleep options so that the mclock scheduler can
  provide the appropriate QoS.

The method is called in the following scenarios:

- after all the op shards are brought up during OSD initialization, and
- in OSD::handle_conf_change(), to override any QoS-related settings
  that the user intended to change.

Modify the mclock config reference to accurately reflect which options
can be changed when using mclock's "custom" profile, and clean up some
whitespace.

Fixes: https://tracker.ceph.com/issues/50501
Signed-off-by: Sridhar Seshasayee
---

diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst
index 96f68f0e1646..4927b8a0790c 100644
--- a/doc/rados/configuration/mclock-config-ref.rst
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -5,7 +5,7 @@
 .. index:: mclock; configuration
 
 Mclock profiles mask the low level details from users, making it
-easier for them to configure mclock. 
+easier for them to configure mclock.
 
 To use mclock, you must provide the following input parameters:
 
@@ -18,7 +18,7 @@ lower-level mclock and Ceph parameters.
 The parameters applied by the mclock profile make it possible to tune the QoS
 between client I/O, recovery/backfill operations, and other background
 operations (for example, scrub, snap trim, and PG deletion). These background
 activities are considered best-effort internal
-clients of Ceph. 
+clients of Ceph.
 
 .. index:: mclock; profile definition
 
@@ -39,7 +39,7 @@ some Ceph-configuration parameters are transparently applied.
 
 The low-level mclock resource control parameters are the *reservation*,
 *limit*, and *weight* that provide control of the resource shares, as
-described in the `OSD Config Reference`_.
+described in the :ref:`dmclock-qos` section.
 
 .. index:: mclock; profile types
 
@@ -64,14 +64,14 @@ mclock profiles can be broadly classified into two types,
   This profile allocates equal reservation to client ops and background
   recovery ops.
 
-- **Custom**: This profile gives users complete control over all mclock and
-  Ceph configuration parameters. Using this profile is not recommended without
+- **Custom**: This profile gives users complete control over all the mclock
+  configuration parameters. Using this profile is not recommended without
   a deep understanding of mclock and related Ceph-configuration options.
 
 .. note:: Across the built-in profiles, internal clients of mclock (for example
-          "scrub", "snap trim", and "pg deletion") are given slightly lower 
-          reservations, but higher weight and no limit. This ensures that 
-          these operations are able to complete quickly if there are no other 
+          "scrub", "snap trim", and "pg deletion") are given slightly lower
+          reservations, but higher weight and no limit. This ensures that
+          these operations are able to complete quickly if there are no other
           competing services.
 
@@ -111,8 +111,8 @@ there might be instances that necessitate giving higher
 allocations to either client ops or recovery ops.
 In order to deal with such a situation, you can enable one of the alternate
 built-in profiles mentioned above.
 
-If a built-in profile is active, the following Ceph config sleep options will
-be disabled,
+If any mClock profile (including "custom") is active, the following Ceph config
+sleep options will be disabled:
 
 - :confval:`osd_recovery_sleep`
 - :confval:`osd_recovery_sleep_hdd`
@@ -154,7 +154,7 @@ Any existing benchmarking tool can be used for this purpose.
 The following steps use the *Ceph Benchmarking Tool* (cbt_). Regardless of the
 tool used, the steps described below remain the same.
 
-As already described in the `OSD Config Reference`_ section, the number of
+As already described in the :ref:`dmclock-qos` section, the number of
 shards and the bluestore's throttle parameters have an impact on the mclock
 op queues. Therefore, it is critical to set these values carefully in order to
 maximize the impact of the mclock scheduler.
@@ -290,5 +290,3 @@ mClock Config Options
 .. confval:: osd_mclock_cost_per_byte_usec
 .. confval:: osd_mclock_cost_per_byte_usec_hdd
 .. confval:: osd_mclock_cost_per_byte_usec_ssd
-
-.. _OSD Config Reference: ../osd-config-ref#dmclock-qos
diff --git a/src/common/options/global.yaml.in b/src/common/options/global.yaml.in
index 459740da59ff..105098639ba0 100644
--- a/src/common/options/global.yaml.in
+++ b/src/common/options/global.yaml.in
@@ -4833,12 +4833,16 @@ options:
     Increasing this value will slow down snap trimming.
     This option overrides backend specific variants.
   default: 0
+  flags:
+  - runtime
   with_legacy: true
 - name: osd_snap_trim_sleep_hdd
   type: float
   level: advanced
   desc: Time in seconds to sleep before next snap trim for HDDs
   default: 5
+  flags:
+  - runtime
 - name: osd_snap_trim_sleep_ssd
   type: float
   level: advanced
@@ -4846,6 +4850,8 @@ options:
   fmt_desc: Time in seconds to sleep before next snap trim op for SSD OSDs
     (including NVMe).
   default: 0
+  flags:
+  - runtime
 - name: osd_snap_trim_sleep_hybrid
   type: float
   level: advanced
@@ -4854,6 +4860,8 @@ options:
   fmt_desc: Time in seconds to sleep before next snap trim op when OSD data is
     on an HDD and the OSD journal or WAL+DB is on an SSD.
   default: 2
+  flags:
+  - runtime
 - name: osd_scrub_invalid_stats
   type: bool
   level: advanced
@@ -5279,6 +5287,8 @@ options:
   fmt_desc: Time to sleep before scrubbing the next group of chunks. Increasing
     this value will slow down the overall rate of scrubbing so that client operations will be less impacted.
   default: 0
+  flags:
+  - runtime
   with_legacy: true
 # more sleep between [deep]scrub ops
 - name: osd_scrub_extended_sleep
@@ -5761,22 +5771,30 @@ options:
   fmt_desc: Time in seconds to sleep before the next removal transaction. This
     throttles the PG deletion process.
   default: 0
+  flags:
+  - runtime
 - name: osd_delete_sleep_hdd
   type: float
   level: advanced
   desc: Time in seconds to sleep before next removal transaction for HDDs
   default: 5
+  flags:
+  - runtime
 - name: osd_delete_sleep_ssd
   type: float
   level: advanced
   desc: Time in seconds to sleep before next removal transaction for SSDs
   default: 1
+  flags:
+  - runtime
 - name: osd_delete_sleep_hybrid
   type: float
   level: advanced
   desc: Time in seconds to sleep before next removal transaction when OSD data
     is on HDD and OSD journal or WAL+DB is on SSD
   default: 1
+  flags:
+  - runtime
 # what % full makes an OSD "full" (failsafe)
 - name: osd_failsafe_full_ratio
   type: float
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index d7d038efcc6c..e23a870a43c1 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -2322,6 +2322,9 @@ OSD::OSD(CephContext *cct_, ObjectStore *store_,
       this);
     shards.push_back(one_shard);
   }
+
+  // override some config options if mclock is enabled on all the shards
+  maybe_override_options_for_qos();
 }
 
 OSD::~OSD()
@@ -9889,6 +9892,15 @@ const char** OSD::get_tracked_conf_keys() const
     "osd_recovery_sleep_hdd",
     "osd_recovery_sleep_ssd",
     "osd_recovery_sleep_hybrid",
+    "osd_delete_sleep",
+    "osd_delete_sleep_hdd",
+    "osd_delete_sleep_ssd",
+    "osd_delete_sleep_hybrid",
+    "osd_snap_trim_sleep",
+    "osd_snap_trim_sleep_hdd",
+    "osd_snap_trim_sleep_ssd",
+    "osd_snap_trim_sleep_hybrid",
+    "osd_scrub_sleep",
     "osd_recovery_max_active",
     "osd_recovery_max_active_hdd",
     "osd_recovery_max_active_ssd",
@@ -9938,46 +9950,9 @@ void OSD::handle_conf_change(const ConfigProxy& conf,
       changed.count("osd_recovery_max_active") ||
       changed.count("osd_recovery_max_active_hdd") ||
       changed.count("osd_recovery_max_active_ssd")) {
-    if (cct->_conf.get_val<std::string>("osd_op_queue") == "mclock_scheduler" &&
-        cct->_conf.get_val<std::string>("osd_mclock_profile") != "custom") {
-      // Set ceph config option to meet QoS goals
-      // Set high value for recovery max active
-      uint32_t recovery_max_active = 1000;
-      if (cct->_conf->osd_recovery_max_active) {
-        cct->_conf.set_val(
-          "osd_recovery_max_active", std::to_string(recovery_max_active));
-      }
-      if (store_is_rotational) {
-        cct->_conf.set_val(
-          "osd_recovery_max_active_hdd", std::to_string(recovery_max_active));
-      } else {
-        cct->_conf.set_val(
-          "osd_recovery_max_active_ssd", std::to_string(recovery_max_active));
-      }
-      // Set high value for osd_max_backfills
-      cct->_conf.set_val("osd_max_backfills", std::to_string(1000));
-
-      // Disable recovery sleep
-      cct->_conf.set_val("osd_recovery_sleep", std::to_string(0));
-      cct->_conf.set_val("osd_recovery_sleep_hdd", std::to_string(0));
-      cct->_conf.set_val("osd_recovery_sleep_ssd", std::to_string(0));
-      cct->_conf.set_val("osd_recovery_sleep_hybrid", std::to_string(0));
-
-      // Disable delete sleep
-      cct->_conf.set_val("osd_delete_sleep", std::to_string(0));
-      cct->_conf.set_val("osd_delete_sleep_hdd", std::to_string(0));
-      cct->_conf.set_val("osd_delete_sleep_ssd", std::to_string(0));
-      cct->_conf.set_val("osd_delete_sleep_hybrid", std::to_string(0));
-
-      // Disable snap trim sleep
-      cct->_conf.set_val("osd_snap_trim_sleep", std::to_string(0));
-      cct->_conf.set_val("osd_snap_trim_sleep_hdd", std::to_string(0));
-      cct->_conf.set_val("osd_snap_trim_sleep_ssd", std::to_string(0));
-      cct->_conf.set_val("osd_snap_trim_sleep_hybrid", std::to_string(0));
-
-      // Disable scrub sleep
-      cct->_conf.set_val("osd_scrub_sleep", std::to_string(0));
-    } else {
+    if (!maybe_override_options_for_qos() &&
+        changed.count("osd_max_backfills")) {
+      // Scheduler is not "mclock". Fall back to earlier behavior
       service.local_reserver.set_max(cct->_conf->osd_max_backfills);
       service.remote_reserver.set_max(cct->_conf->osd_max_backfills);
     }
@@ -10071,6 +10046,54 @@ void OSD::handle_conf_change(const ConfigProxy& conf,
   }
 }
 
+bool OSD::maybe_override_options_for_qos()
+{
+  // If the enabled scheduler is mclock, override the recovery, backfill
+  // and sleep options so that mclock can meet the QoS goals.
+  if (cct->_conf.get_val<std::string>("osd_op_queue") == "mclock_scheduler") {
+    dout(1) << __func__
+            << ": Changing recovery/backfill/sleep settings for QoS" << dendl;
+
+    // Set high value for recovery max active
+    uint32_t rec_max_active = 1000;
+    cct->_conf.set_val(
+      "osd_recovery_max_active", std::to_string(rec_max_active));
+    cct->_conf.set_val(
+      "osd_recovery_max_active_hdd", std::to_string(rec_max_active));
+    cct->_conf.set_val(
+      "osd_recovery_max_active_ssd", std::to_string(rec_max_active));
+
+    // Set high value for osd_max_backfills
+    uint32_t max_backfills = 1000;
+    cct->_conf.set_val("osd_max_backfills", std::to_string(max_backfills));
+    service.local_reserver.set_max(max_backfills);
+    service.remote_reserver.set_max(max_backfills);
+
+    // Disable recovery sleep
+    cct->_conf.set_val("osd_recovery_sleep", std::to_string(0));
+    cct->_conf.set_val("osd_recovery_sleep_hdd", std::to_string(0));
+    cct->_conf.set_val("osd_recovery_sleep_ssd", std::to_string(0));
+    cct->_conf.set_val("osd_recovery_sleep_hybrid", std::to_string(0));
+
+    // Disable delete sleep
+    cct->_conf.set_val("osd_delete_sleep", std::to_string(0));
+    cct->_conf.set_val("osd_delete_sleep_hdd", std::to_string(0));
+    cct->_conf.set_val("osd_delete_sleep_ssd", std::to_string(0));
+    cct->_conf.set_val("osd_delete_sleep_hybrid", std::to_string(0));
+
+    // Disable snap trim sleep
+    cct->_conf.set_val("osd_snap_trim_sleep", std::to_string(0));
+    cct->_conf.set_val("osd_snap_trim_sleep_hdd", std::to_string(0));
+    cct->_conf.set_val("osd_snap_trim_sleep_ssd", std::to_string(0));
+    cct->_conf.set_val("osd_snap_trim_sleep_hybrid", std::to_string(0));
+
+    // Disable scrub sleep
+    cct->_conf.set_val("osd_scrub_sleep", std::to_string(0));
+    return true;
+  }
+  return false;
+}
+
 void OSD::update_log_config()
 {
   map<string,string> log_to_monitors;
diff --git a/src/osd/OSD.h b/src/osd/OSD.h
index 0d9c7f810204..5e405f3c18d7 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -2059,6 +2059,7 @@ private:
 
   float get_osd_snap_trim_sleep();
   int get_recovery_max_active();
+  bool maybe_override_options_for_qos();
 
   void scrub_purged_snaps();
   void probe_smart(const std::string& devid, std::ostream& ss);