From: Sridhar Seshasayee
Date: Mon, 10 Oct 2022 13:18:13 +0000 (+0530)
Subject: doc: Update mClock config reference doc to reflect new max recovery limits
X-Git-Tag: v17.2.6~276^2~2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=d8be167eb32ef5b60da3f2edfac34b1208b2069e;p=ceph.git

doc: Update mClock config reference doc to reflect new max recovery limits

Document the following:
- New max backfill/recovery defaults for mClock.
- Steps to modify the backfill/recovery defaults.
  - Modify defaults using the new osd_mclock_override_recovery_settings option.
- Steps to mitigate unrealistic OSD bench results used to set the OSD capacity.
  - New capacity threshold options for ssd/hdd.

Fixes: https://tracker.ceph.com/issues/57529
Signed-off-by: Sridhar Seshasayee
(cherry picked from commit a4c2e28877daee0b5ad88c2ea9d359d567d473be)
---

diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst
index 1040b2e66c2e..f2a1603b5dc8 100644
--- a/doc/rados/configuration/mclock-config-ref.rst
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -88,6 +88,14 @@ Users can choose between the following built-in profile types:
 .. note:: The values mentioned in the tables below represent the percentage
           of the total IOPS capacity of the OSD allocated for the service type.
 
+By default, the *high_client_ops* profile is enabled to ensure that a larger
+chunk of the bandwidth allocation goes to client ops. Background recovery ops
+are given lower allocation (and therefore take a longer time to complete). But
+there might be instances that necessitate giving higher allocations to either
+client ops or recovery ops. In order to deal with such a situation, the
+alternate built-in profiles may be enabled by following the steps mentioned
+in the next sections.
+
 high_client_ops (*default*)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 This profile optimizes client performance over background activities by
@@ -143,10 +151,7 @@ within the OSD.
 +------------------------+-------------+--------+-------+
 
 .. note:: Across the built-in profiles, internal background best-effort clients
-          of mclock ("scrub", "snap trim", and "pg deletion") are given lower
-          reservations but no limits(MAX). This ensures that requests from such
-          clients are able to complete quickly if there are no other competing
-          operations.
+          of mclock include "scrub", "snap trim", and "pg deletion" operations.
 
 
 Custom Profile
@@ -158,9 +163,13 @@ users, who understand mclock and Ceph related configuration options.
 
 .. index:: mclock; built-in profiles
 
-mClock Built-in Profiles
-========================
+mClock Built-in Profiles - Locked Config Options
+================================================
+The following sections describe the config options that are locked to certain
+values in order to ensure that the mClock scheduler is able to provide
+predictable QoS.
 
+mClock Config Options
+---------------------
 When a built-in profile is enabled, the mClock scheduler calculates the low
 level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
 enabled for each client type. The mclock parameters are calculated based on
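+
+For illustration, assuming the :confval:`osd_mclock_profile` option and the
+*high_recovery_ops* built-in profile name, a cluster-wide switch to an
+alternate profile could look like this:
+
+.. prompt:: bash #
+
+   ceph config set osd osd_mclock_profile high_recovery_ops
+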
@@ -177,24 +186,40 @@ config parameters cannot be modified when using any of the built-in profiles:
 - :confval:`osd_mclock_scheduler_background_best_effort_wgt`
 - :confval:`osd_mclock_scheduler_background_best_effort_lim`
 
-The following Ceph options will not be modifiable by the user:
+Recovery/Backfill Options
+-------------------------
+The following recovery and backfill related Ceph options are set to new
+defaults for mClock:
 
 - :confval:`osd_max_backfills`
 - :confval:`osd_recovery_max_active`
-
-This is because the above options are internally modified by the mclock
-scheduler in order to maximize the impact of the set profile.
-
-By default, the *high_client_ops* profile is enabled to ensure that a larger
-chunk of the bandwidth allocation goes to client ops. Background recovery ops
-are given lower allocation (and therefore take a longer time to complete). But
-there might be instances that necessitate giving higher allocations to either
-client ops or recovery ops. In order to deal with such a situation, the
-alternate built-in profiles may be enabled by following the steps mentioned
-in the next section.
-
+- :confval:`osd_recovery_max_active_hdd`
+- :confval:`osd_recovery_max_active_ssd`
+
+The following table shows the new mClock defaults. This is done to maximize
+the impact of the built-in profile:
+
++-----------------------------------------+------------------+----------------+
+| Config Option                           | Original Default | mClock Default |
++=========================================+==================+================+
+| :confval:`osd_max_backfills`            | 1                | 10             |
++-----------------------------------------+------------------+----------------+
+| :confval:`osd_recovery_max_active`      | 0                | 0              |
++-----------------------------------------+------------------+----------------+
+| :confval:`osd_recovery_max_active_hdd`  | 3                | 10             |
++-----------------------------------------+------------------+----------------+
+| :confval:`osd_recovery_max_active_ssd`  | 10               | 20             |
++-----------------------------------------+------------------+----------------+
+
+The above mClock defaults can be modified, if necessary, by enabling
+:confval:`osd_mclock_override_recovery_settings` (default: false). The steps
+for this are discussed in the
+`Steps to Modify mClock Max Backfills/Recovery Limits`_ section.
+
+Sleep Options
+-------------
 If any mClock profile (including "custom") is active, the following Ceph config
-sleep options will be disabled,
+sleep options are disabled (set to 0):
 
 - :confval:`osd_recovery_sleep`
 - :confval:`osd_recovery_sleep_hdd`
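+
+For example, with a built-in mClock profile active, the running value of one
+of the sleep options listed above should show up as 0 on any OSD (``osd.0``
+is used here only as an example):
+
+.. prompt:: bash #
+
+   ceph config show osd.0 osd_recovery_sleep
+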
@@ -386,6 +411,70 @@ The individual QoS-related config options for the *custom* profile can also be
 modified ephemerally using the above commands.
 
 
+Steps to Modify mClock Max Backfills/Recovery Limits
+====================================================
+
+This section describes the steps to modify the default max backfills or
+recovery limits if the need arises.
+
+.. warning:: This section is intended for advanced users or for experimental
+   testing. The recommendation is to retain the defaults as-is on a running
+   cluster, since modifying them could result in unexpected performance
+   outcomes. The values may be modified only if the cluster is unable to cope
+   with (or shows poor performance under) the default settings, or for
+   performing experiments on a test cluster.
+
+.. important:: The max backfill/recovery options that can be modified are
+   listed in section `Recovery/Backfill Options`_. Modification of the mClock
+   default backfill/recovery limits is gated by the
+   :confval:`osd_mclock_override_recovery_settings` option, which is set to
+   *false* by default. Attempting to modify any default recovery/backfill
+   limit without setting the gating option will reset that limit back to its
+   mClock default, and a warning message will be logged in the cluster log.
+   Note that it may take a few seconds for the default value to come back
+   into effect. Verify the limit using the *config show* command as shown
+   below.
+
+#. Set the :confval:`osd_mclock_override_recovery_settings` config option on
+   all osds to *true* using:
+
+   .. prompt:: bash #
+
+      ceph config set osd osd_mclock_override_recovery_settings true
+
+#. Set the desired max backfill/recovery option using:
+
+   .. prompt:: bash #
+
+      ceph config set osd osd_max_backfills <value>
+
+   For example, the following command modifies the :confval:`osd_max_backfills`
+   option on all osds to 5.
+
+   .. prompt:: bash #
+
+      ceph config set osd osd_max_backfills 5
+
+#. Wait for a few seconds and verify the running configuration for a specific
+   OSD using:
+
+   .. prompt:: bash #
+
+      ceph config show osd.N | grep osd_max_backfills
+
+   For example, the following command shows the running configuration of
+   :confval:`osd_max_backfills` on osd.0.
+
+   .. prompt:: bash #
+
+      ceph config show osd.0 | grep osd_max_backfills
+
+#. Reset the :confval:`osd_mclock_override_recovery_settings` config option on
+   all osds to *false* using:
+
+   .. prompt:: bash #
+
+      ceph config set osd osd_mclock_override_recovery_settings false
+
 
 OSD Capacity Determination (Automated)
 ======================================
@@ -413,6 +502,46 @@ node whose underlying device type is SSD:
 
    ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
 
+Mitigation of Unrealistic OSD Capacity From Automated Test
+----------------------------------------------------------
+Under certain conditions, the OSD bench tool may show unrealistic or inflated
+results depending on the drive configuration and other environment-related
+conditions. To mitigate the performance impact of such an unrealistic
+capacity, two threshold config options, one for each OSD device type, are
+defined and used:
+
+- :confval:`osd_mclock_iops_capacity_threshold_hdd` = 500
+- :confval:`osd_mclock_iops_capacity_threshold_ssd` = 80000
+
+The following automated step is performed:
+
+Fallback to using default OSD capacity (automated)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If OSD bench reports a measurement that exceeds the above threshold value for
+the underlying device type, the fallback mechanism reverts to the default
+value of :confval:`osd_mclock_max_capacity_iops_hdd` or
+:confval:`osd_mclock_max_capacity_iops_ssd`. The threshold config options
+can be reconfigured based on the type of drive used. Additionally, a cluster
+warning is logged in case the measurement exceeds the threshold. For example, ::
+
+  2022-10-27T15:30:23.270+0000 7f9b5dbe95c0 0 log_channel(cluster) log [WRN]
+  : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of
+  25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000
+  IOPS. The recommendation is to establish the osd's IOPS capacity using other
+  benchmark tools (e.g. Fio) and then override
+  osd_mclock_max_capacity_iops_[hdd|ssd].
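+
+For example, if an SSD-backed OSD is known to legitimately sustain more IOPS
+than the default threshold, the ssd threshold could be raised; the value of
+120000 below is only an illustrative figure:
+
+.. prompt:: bash #
+
+   ceph config set osd osd_mclock_iops_capacity_threshold_ssd 120000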
+
+If the default capacity doesn't accurately represent the OSD's capacity, the
+following additional step is recommended to address this:
+
+Run custom drive benchmark if defaults are not accurate (manual)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If the default OSD capacity is not accurate, the recommendation is to run a
+custom benchmark using your preferred tool (e.g. Fio) on the drive and then
+override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described
+in the `Specifying Max OSD Capacity`_ section.
+
+This step is highly recommended until an alternate mechanism is available.
 
 Steps to Manually Benchmark an OSD (Optional)
 =============================================
@@ -426,9 +555,10 @@ Steps to Manually Benchmark an OSD (Optional)
    `Specifying Max OSD Capacity`_.
 
 
-Any existing benchmarking tool can be used for this purpose. In this case, the
-steps use the *Ceph OSD Bench* command described in the next section. Regardless
-of the tool/command used, the steps outlined further below remain the same.
+Any existing benchmarking tool (e.g. Fio) can be used for this purpose. In this
+case, the steps use the *Ceph OSD Bench* command described in the next section.
+Regardless of the tool/command used, the steps outlined further below remain
+the same.
 
 As already described in the :ref:`dmclock-qos` section, the number of shards
 and the bluestore's throttle parameters have an impact on the mclock op
@@ -551,5 +681,8 @@ mClock Config Options
 .. confval:: osd_mclock_cost_per_byte_usec_ssd
 .. confval:: osd_mclock_force_run_benchmark_on_init
 .. confval:: osd_mclock_skip_benchmark
+.. confval:: osd_mclock_override_recovery_settings
+.. confval:: osd_mclock_iops_capacity_threshold_hdd
+.. confval:: osd_mclock_iops_capacity_threshold_ssd
 
 .. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
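+
+As an illustration of the override recommended above, once a custom benchmark
+(e.g. Fio) has produced a reliable IOPS figure for the drive backing an OSD,
+the capacity could be set along these lines (``osd.0`` and the value 25000
+are example values only):
+
+.. prompt:: bash #
+
+   ceph config set osd.0 osd_mclock_max_capacity_iops_ssd 25000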