From 76420f9d5911ca3dc4129e6dfe7c44dc4734b700 Mon Sep 17 00:00:00 2001
From: Sridhar Seshasayee
Date: Wed, 12 May 2021 20:20:20 +0530
Subject: [PATCH] doc: Update mclock-config-ref to reflect automated OSD benchmarking

Signed-off-by: Sridhar Seshasayee
---
 doc/rados/configuration/mclock-config-ref.rst | 234 ++++++++++--------
 doc/rados/operations/control.rst              |   2 +
 2 files changed, 135 insertions(+), 101 deletions(-)

diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst
index 4927b8a0790c9..9e16ee2db0262 100644
--- a/doc/rados/configuration/mclock-config-ref.rst
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -7,11 +7,12 @@
 Mclock profiles mask the low level details from users, making it easier for
 them to configure mclock.
 
-To use mclock, you must provide the following input parameters:
+The following input parameters are required for a mclock profile to configure
+the QoS related parameters:
 
-* total capacity of each OSD
+* total capacity (IOPS) of each OSD (determined automatically)
 
-* an mclock profile to enable
+* an mclock profile type to enable
 
 Using the settings in the specified profile, the OSD determines and applies the
 lower-level mclock and Ceph parameters. The parameters applied by the mclock
@@ -31,11 +32,11 @@ Ceph cluster enables the throttling of the operations(IOPS) belonging to
 different client classes (background recovery, scrub, snaptrim, client op,
 osd subop)”*.
 
-The mclock profile uses the capacity limits and the mclock profile selected by
-the user to determine the low-level mclock resource control parameters.
+The mclock profile uses the capacity limits and the mclock profile type selected
+by the user to determine the low-level mclock resource control parameters.
 
-Depending on the profile, lower-level mclock resource-control parameters and
-some Ceph-configuration parameters are transparently applied.
+Depending on the profile type, lower-level mclock resource-control parameters
+and some Ceph-configuration parameters are transparently applied.
 
 The low-level mclock resource control parameters are the *reservation*,
 *limit*, and *weight* that provide control of the resource shares, as
@@ -56,7 +57,7 @@ mclock profiles can be broadly classified into two types,
   as compared to background recoveries and other internal clients within
   Ceph. This profile is enabled by default.
 - **high_recovery_ops**:
-  This profile allocates more reservation to background recoveries as 
+  This profile allocates more reservation to background recoveries as
   compared to external clients and other internal clients within Ceph. For
   example, an admin may enable this profile temporarily to speed-up background
   recoveries during non-peak hours.
@@ -109,7 +110,8 @@ chunk of the bandwidth allocation goes to client ops. Background recovery ops
 are given lower allocation (and therefore take a longer time to complete). But
 there might be instances that necessitate giving higher allocations to either
 client ops or recovery ops. In order to deal with such a situation, you can
-enable one of the alternate built-in profiles mentioned above.
+enable one of the alternate built-in profiles by following the steps mentioned
+in the next section.
 
 If any mClock profile (including "custom") is active, the following Ceph config
 sleep options will be disabled,
@@ -139,20 +141,64 @@ all its clients.
 Steps to Enable mClock Profile
 ==============================
 
-The following sections outline the steps required to enable a mclock profile.
+As already mentioned, the default mclock profile is set to *high_client_ops*.
+The other values for the built-in profiles include *balanced* and
+*high_recovery_ops*.
+
+If there is a requirement to change the default profile, then the option
+:confval:`osd_mclock_profile` may be set at runtime by using the following
+command:
+
+  .. prompt:: bash #
+
+    ceph config set [global,osd] osd_mclock_profile <value>
+
+For example, to change the profile to allow faster recoveries, the following
+command can be used to switch to the *high_recovery_ops* profile:
+
+  .. prompt:: bash #
+
+    ceph config set osd osd_mclock_profile high_recovery_ops
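+
+To confirm that the profile change has taken effect, you can query the value
+seen by a running OSD (``osd.0`` below is only an example id; substitute any
+OSD in your cluster):
+
+  .. prompt:: bash #
+
+    ceph config show osd.0 osd_mclock_profile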
 
-Determining OSD Capacity Using Benchmark Tests
-----------------------------------------------
+.. note:: The *custom* profile is not recommended unless you are an advanced
+   user.
 
-To allow mclock to fulfill its QoS goals across its clients, it is most
-important to have a good understanding of each OSD's capacity in terms of its
-baseline throughputs (IOPS) across the Ceph nodes. To determine this capacity,
-you must perform appropriate benchmarking tests. The steps for performing these
-benchmarking tests are broadly outlined below.
+And that's it! You are ready to run workloads on the cluster and check if the
+QoS requirements are being met.
+
+
+OSD Capacity Determination (Automated)
+======================================
+
+The OSD capacity in terms of total IOPS is determined automatically during OSD
+initialization. This is achieved by running the OSD bench tool and overriding
+the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
+depending on the device type. No other action/input is expected from the user
+to set the OSD capacity. You may verify the capacity of an OSD after the
+cluster is brought up by using the following command:
+
+  .. prompt:: bash #
+
+    ceph config show osd.x osd_mclock_max_capacity_iops_[hdd, ssd]
 
-Any existing benchmarking tool can be used for this purpose. The following
-steps use the *Ceph Benchmarking Tool* (cbt_). Regardless of the tool
-used, the steps described below remain the same.
+For example, the following command shows the max capacity for osd.0 on a Ceph
+node whose underlying device type is SSD:
+
+  .. prompt:: bash #
+
+    ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
+
+
+Steps to Manually Benchmark an OSD (Optional)
+=============================================
+
+.. note:: These steps are only necessary if you want to override the OSD
+   capacity already determined automatically during OSD initialization.
+   Otherwise, you may skip this section entirely.
+
+Any existing benchmarking tool can be used for this purpose. In this case, the
+steps use the *Ceph OSD Bench* command described in the next section. Regardless
+of the tool/command used, the steps outlined further below remain the same.
 
 As already described in the :ref:`dmclock-qos` section, the number of shards
 and the bluestore's throttle parameters have an impact on the mclock op
@@ -167,112 +213,99 @@ maximize the impact of the mclock scheduler.
 
 :Bluestore Throttle Parameters:
   We recommend using the default values as defined by
-  :confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`. But
-  these parameters may also be determined during the benchmarking phase as
-  described below.
-
-Benchmarking Test Steps Using CBT
-`````````````````````````````````
-
-The steps below use the default shards and detail the steps used to determine the
-correct bluestore throttle values.
-
-.. note:: These steps, although manual in April 2021, will be automated in the future.
-
-1. On the Ceph node hosting the OSDs, download cbt_ from git.
-2. Install cbt and all the dependencies mentioned on the cbt github page.
-3. Construct the Ceph configuration file and the cbt yaml file.
-4. Ensure that the bluestore throttle options ( i.e.
-   :confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`) are
-   set to the default values.
-5. Ensure that the test is performed on similar device types to get reliable
-   OSD capacity data.
-6. The OSDs can be grouped together with the desired replication factor for the
-   test to ensure reliability of OSD capacity data.
-7. After ensuring that the OSDs nodes are in the desired configuration, run a
-   simple 4KiB random write workload on the OSD(s) for 300 secs.
-8. Note the overall throughput(IOPS) obtained from the cbt output file. This
-   value is the baseline throughput(IOPS) when the default bluestore
-   throttle options are in effect.
-9. If the intent is to determine the bluestore throttle values for your
-   environment, then set the two options, :confval:`bluestore_throttle_bytes` and
-   :confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each to begin
-   with. Otherwise, you may skip to the next section.
-10. Run the 4KiB random write workload as before on the OSD(s) for 300 secs.
-11. Note the overall throughput from the cbt log files and compare the value
-    against the baseline throughput in step 8.
-12. If the throughput doesn't match with the baseline, increment the bluestore
-    throttle options by 2x and repeat steps 9 through 11 until the obtained
-    throughput is very close to the baseline value.
-
-For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for
-both bluestore throttle and deferred bytes was determined to maximize the impact
-of mclock. For HDDs, the corresponding value was 40 MiB, where the overall
-throughput was roughly equal to the baseline throughput. Note that in general
-for HDDs, the bluestore throttle values are expected to be higher when compared
-to SSDs.
-
-.. _cbt: https://github.com/ceph/cbt
+  :confval:`bluestore_throttle_bytes` and
+  :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be
+  determined during the benchmarking phase as described below.
+
+OSD Bench Command Syntax
+````````````````````````
 
-Specifying Max OSD Capacity
-----------------------------
+The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
+used for benchmarking is shown below:
 
-The steps in this section may be performed only if the max osd capacity is
-different from the default values (SSDs: 21500 IOPS and HDDs: 315 IOPS). The
-option ``osd_mclock_max_capacity_iops_[hdd, ssd]`` can be set by specifying it
-in either the **[global]** section or in a specific OSD section (**[osd.x]** of
-your Ceph configuration file).
+.. prompt:: bash #
 
-Alternatively, commands of the following form may be used:
+   ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
 
-  .. prompt:: bash #
+where,
 
-    ceph config set [global, osd] osd_mclock_max_capacity_iops_[hdd,ssd] <value>
+* ``TOTAL_BYTES``: Total number of bytes to write
+* ``BYTES_PER_WRITE``: Block size per write
+* ``OBJ_SIZE``: Bytes per object
+* ``NUM_OBJS``: Number of objects to write
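+
+As a rough illustration of how these arguments relate to the measured IOPS
+(the timing figure below is invented purely for the arithmetic), a run such as
+``ceph tell osd.0 bench 12288000 4096 4194304 100`` issues 12288000 / 4096 =
+3000 writes of 4 KiB each; if the reported elapsed time were 2.5 seconds, that
+would correspond to roughly 3000 / 2.5 = 1200 write IOPS.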
 
-For example, the following command sets the max capacity for all the OSDs in a
-Ceph node whose underlying device type is SSDs:
+Benchmarking Test Steps Using OSD Bench
+```````````````````````````````````````
 
-   .. prompt:: bash #
+The steps below use the default shards and detail the steps used to determine
 
-    ceph config set osd osd_mclock_max_capacity_iops_ssd 25000
+the correct bluestore throttle values (optional).
+
+#. Bring up your Ceph cluster and log in to the Ceph node hosting the OSDs that
+   you wish to benchmark.
+#. Run a simple 4KiB random write workload on an OSD using the following
+   commands:
 
-To set the capacity for a specific OSD (for example "osd.0") whose underlying
-device type is HDD, use a command like this:
+   .. note:: Before running the test, caches must be cleared to get an
+      accurate measurement.
 
-   .. prompt:: bash #
+   For example, if you are running the benchmark test on osd.0, run the following
+   commands:
 
-    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
+   .. prompt:: bash #
+
+      ceph tell osd.0 cache drop
 
-Specifying Which mClock Profile to Enable
------------------------------------------
+   .. prompt:: bash #
 
-As already mentioned, the default mclock profile is set to *high_client_ops*.
-The other values for the built-in profiles include *balanced* and
-*high_recovery_ops*.
+      ceph tell osd.0 bench 12288000 4096 4194304 100
 
-If there is a requirement to change the default profile, then the option
-:confval:`osd_mclock_profile` may be set in the **[global]** or **[osd]** section of
-your Ceph configuration file before bringing up your cluster.
+#. Note the overall throughput (IOPS) obtained from the output of the osd bench
+   command. This value is the baseline throughput (IOPS) when the default
+   bluestore throttle options are in effect.
+#. If the intent is to determine the bluestore throttle values for your
+   environment, then set the two options, :confval:`bluestore_throttle_bytes`
+   and :confval:`bluestore_throttle_deferred_bytes` to 32 KiB (32768 bytes) each
+   to begin with (example commands are sketched below). Otherwise, you may skip
+   to the next section.
+#. Run the 4KiB random write test as before using OSD bench.
+#. Note the overall throughput from the output and compare the value
+   against the baseline throughput recorded in step 3.
+#. If the throughput doesn't match with the baseline, increment the bluestore
+   throttle options by 2x and repeat steps 5 through 7 until the obtained
+   throughput is very close to the baseline value.
+
+For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
+for both bluestore throttle and deferred bytes was determined to maximize the
+impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
+overall throughput was roughly equal to the baseline throughput. Note that in
+general for HDDs, the bluestore throttle values are expected to be higher when
+compared to SSDs.
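+
+As a sketch of the commands for step 4 above (assuming the OSD being
+benchmarked is ``osd.0``; any OSD id works, and depending on your release an
+OSD restart may be needed before the new values take effect):
+
+  .. prompt:: bash #
+
+    ceph config set osd.0 bluestore_throttle_bytes 32768
+
+  .. prompt:: bash #
+
+    ceph config set osd.0 bluestore_throttle_deferred_bytes 32768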
 
-Alternatively, to change the profile during runtime, use the following command:
+
+Specifying Max OSD Capacity
+````````````````````````````
 
+The steps in this section may be performed only if you want to override the
+max osd capacity automatically determined during OSD initialization. The option
+``osd_mclock_max_capacity_iops_[hdd, ssd]`` can be set by running the
+following command:
 
   .. prompt:: bash #
 
-    ceph config set [global,osd] osd_mclock_profile <value>
+    ceph config set [global,osd] osd_mclock_max_capacity_iops_[hdd,ssd] <value>
 
-For example, to change the profile to allow faster recoveries, the following
-command can be used to switch to the *high_recovery_ops* profile:
+For example, the following command sets the max capacity for all the OSDs in a
+Ceph node whose underlying device type is SSDs:
 
   .. prompt:: bash #
 
-    ceph config set osd osd_mclock_profile high_recovery_ops
+    ceph config set osd osd_mclock_max_capacity_iops_ssd 25000
 
-.. note:: The *custom* profile is not recommended unless you are an advanced user.
+To set the capacity for a specific OSD (for example "osd.0") whose underlying
+device type is HDD, use a command like this:
 
-And that's it! You are ready to run workloads on the cluster and check if the
-QoS requirements are being met.
+
+  .. prompt:: bash #
+
+    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
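+
+If you later decide that the automatically determined capacity should be used
+again, one option (a sketch, shown for "osd.0" and an HDD-backed OSD) is to
+remove the explicitly set value; depending on your release, the OSD may need a
+restart before a freshly measured value takes effect:
+
+  .. prompt:: bash #
+
+    ceph config rm osd.0 osd_mclock_max_capacity_iops_hdd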
 
 .. index:: mclock; config settings
 
@@ -281,7 +314,6 @@ mClock Config Options
 =====================
 
 .. confval:: osd_mclock_profile
-.. confval:: osd_mclock_max_capacity_iops
 .. confval:: osd_mclock_max_capacity_iops_hdd
 .. confval:: osd_mclock_max_capacity_iops_ssd
 .. confval:: osd_mclock_cost_per_io_usec
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst
index 126f72bc66eb6..c5f911f81ca5b 100644
--- a/doc/rados/operations/control.rst
+++ b/doc/rados/operations/control.rst
@@ -95,6 +95,8 @@ or delete them if they were just created. ::
 
 	ceph pg {pgid} mark_unfound_lost revert|delete
 
+.. _osd-subsystem:
+
 OSD Subsystem
 =============
 
-- 
2.47.3