From 0d81e721378e6d7a647c5a4f6aab3cede1a828d3 Mon Sep 17 00:00:00 2001
From: Sridhar Seshasayee
Date: Tue, 3 Sep 2024 11:09:08 +0530
Subject: [PATCH] common/options: Change HDD OSD shard configuration defaults
 for mClock

Based on tests performed at scale on an HDD-based cluster, it was found
that scheduling with mClock was not optimal with multiple OSD shards.
For example, in the scaled cluster with multiple OSD node failures, the
client throughput was found to be inconsistent across test runs, coupled
with multiple reported slow requests.

However, the same test with a single OSD shard and with multiple worker
threads yielded significantly more consistent client and recovery
throughput across multiple test runs.

For more details see https://tracker.ceph.com/issues/66289.

Therefore, as an interim measure until the issue with multiple OSD shards
(or multiple mClock queues per OSD) is investigated and fixed, the
following change to the default HDD OSD shard configuration is made:

- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)

The other changes in this commit include:

- Doc change to the OSD and mClock config reference describing this change.
- OSD troubleshooting entry on the procedure to change the shard
  configuration for clusters affected by this issue running on older
  releases.
- Add release note for this change.

Fixes: https://tracker.ceph.com/issues/66289
Signed-off-by: Sridhar Seshasayee

# Conflicts:
#	doc/rados/troubleshooting/troubleshooting-osd.rst
---
 PendingReleaseNotes                           | 13 +++++
 doc/rados/configuration/mclock-config-ref.rst | 54 +++++++++++++++++++
 doc/rados/configuration/osd-config-ref.rst    | 11 ++++
 .../troubleshooting/troubleshooting-osd.rst   | 40 ++++++++++++++
 src/common/options/osd.yaml.in                | 11 +++-
 5 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/PendingReleaseNotes b/PendingReleaseNotes
index 5853b7e290bc8..eb9d58bbba756 100644
--- a/PendingReleaseNotes
+++ b/PendingReleaseNotes
@@ -12,6 +12,19 @@
   of the column showing the state of a group snapshot in the unformatted CLI
   output is changed from 'STATUS' to 'STATE'. The state of a group snapshot
   that was shown as 'ok' is now shown as 'complete', which is more descriptive.
+* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
+  that scheduling with mClock was not optimal with multiple OSD shards. For
+  example, in the test cluster with multiple OSD node failures, the client
+  throughput was found to be inconsistent across test runs, coupled with
+  multiple reported slow requests. However, the same test with a single OSD
+  shard and with multiple worker threads yielded significantly more consistent
+  client and recovery throughput across multiple test runs.
+  Therefore, as an interim measure until the issue with multiple OSD shards
+  (or multiple mClock queues per OSD) is investigated and fixed, the following
+  change to the default HDD OSD shard configuration is made:
+  - osd_op_num_shards_hdd = 1 (was 5)
+  - osd_op_num_threads_per_shard_hdd = 5 (was 1)
+  For more details see https://tracker.ceph.com/issues/66289.
 
 >=19.0.0
 
diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst
index a338aa6da5616..12af2522e17ad 100644
--- a/doc/rados/configuration/mclock-config-ref.rst
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -164,6 +164,60 @@ parameters.
 This profile should be used with caution and is meant for advanced users, who
 understand mclock and Ceph related configuration options.
 
+.. index:: mclock; shard config for HDD clusters
+
+.. _mclock-hdd-cfg:
+
+OSD Shard Configuration For HDD-Based Clusters With mClock
+==========================================================
+Each OSD is configured with one or more shards to perform tasks. Each shard
+comprises a unique queue to handle various types of OSD-specific operations
+like client I/O, recovery, and scrub. The scheduling of these operations in
+the queue is performed by a scheduler - in this case, the mClock scheduler.
+
+For HDD-based OSDs, the number of shards is controlled by the
+:confval:`osd_op_num_shards_hdd` option. Items are queued and dequeued by one
+or more worker threads, and this is controlled by the
+:confval:`osd_op_num_threads_per_shard_hdd` option.
+
+As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed
+determines the impact of the mClock queue. In general, a lower number of shards
+increases the impact of the mClock queues with respect to scheduling accuracy,
+provided there are enough worker threads per shard to process the items in the
+mClock queue.
+
+Based on tests performed at scale with small objects in the range
+[1 KiB - 256 KiB] on an HDD-based cluster (192 OSDs, 8 nodes,
+150 million objects), it was found that scheduling with mClock was not optimal
+with multiple OSD shards. For example, in this cluster with multiple OSD node
+failures, the client throughput was found to be inconsistent across test runs,
+coupled with multiple reported slow requests. For more details
+see https://tracker.ceph.com/issues/66289. With multiple shards, the situation
+was exacerbated when the MAX limit was allocated to both the client and
+background recovery classes of operations. During the OSD failure phase, since
+both client and recovery ops competed directly for the full bandwidth of the
+OSDs, there was no predictability with respect to the throughput of either
+class of service.
+
+However, the same test with a single OSD shard and with multiple worker threads
+yielded significantly better results in terms of consistency of client and
+recovery throughput across multiple test runs. Refer to the tracker above for
+more details. As a sanity check, the same test executed with this shard
+configuration and large objects in the range [1 MiB - 256 MiB] yielded similar
+results.
+
+Therefore, as an interim measure until the issue with multiple OSD shards
+(or multiple mClock queues per OSD) is investigated and fixed, the following
+change to the default HDD OSD shard configuration is made:
+
++---------------------------------------------+------------------+----------------+
+| Config Option                               | Old Default      | New Default    |
++=============================================+==================+================+
+| :confval:`osd_op_num_shards_hdd`            | 5                | 1              |
++---------------------------------------------+------------------+----------------+
+| :confval:`osd_op_num_threads_per_shard_hdd` | 1                | 5              |
++---------------------------------------------+------------------+----------------+
+
 .. index:: mclock; built-in profiles
 
 mClock Built-in Profiles - Locked Config Options
diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst
index 5064a95851cae..5c90a90aaf7fd 100644
--- a/doc/rados/configuration/osd-config-ref.rst
+++ b/doc/rados/configuration/osd-config-ref.rst
@@ -189,6 +189,9 @@ Operations
 .. confval:: osd_op_num_shards
 .. confval:: osd_op_num_shards_hdd
 .. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_num_threads_per_shard
+.. confval:: osd_op_num_threads_per_shard_hdd
+.. confval:: osd_op_num_threads_per_shard_ssd
 .. confval:: osd_op_queue
 .. confval:: osd_op_queue_cut_off
 .. confval:: osd_client_op_priority
@@ -292,6 +295,9 @@ of the current time. The ultimate lesson is that values for weight should not
 be too large. They should be under the number of requests one expects to be
 serviced each second.
 
+
+.. _dmclock-qos-caveats:
+
 Caveats
 ```````
 
@@ -303,6 +309,11 @@ number of shards can be controlled with the configuration options
 :confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
 :confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
 impact of the mClock queues, but may have other deleterious effects.
+This is especially the case if there are insufficient shard worker threads.
+The number of shard worker threads can be controlled with the configuration
+options :confval:`osd_op_num_threads_per_shard`,
+:confval:`osd_op_num_threads_per_shard_hdd`, and
+:confval:`osd_op_num_threads_per_shard_ssd`.
 
 Second, requests are transferred from the operation queue to the
 operation sequencer, in which they go through the phases of
diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst
index d237226ea92cd..37de718a01288 100644
--- a/doc/rados/troubleshooting/troubleshooting-osd.rst
+++ b/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -618,6 +618,7 @@ Possible causes include:
 - A bug in the kernel file system (check ``dmesg`` output)
 - An overloaded cluster (check system load, iostat, etc.)
 - A bug in the ``ceph-osd`` daemon.
+- A suboptimal OSD shard configuration (on an HDD-based cluster with the mClock scheduler)
 
 Possible solutions:
 
@@ -626,6 +627,8 @@ Possible solutions:
 - Upgrade Ceph
 - Restart OSDs
 - Replace failed or failing components
+- Override the OSD shard configuration (on an HDD-based cluster with the mClock
+  scheduler); see :ref:`mclock-tblshoot-hdd-shard-config` for the resolution
 
 Debugging Slow Requests
 -----------------------
@@ -680,6 +683,43 @@
 Although some of these events may appear redundant, they cross important
 boundaries in the internal code (such as passing data across locks into new
 threads).
 
+.. _mclock-tblshoot-hdd-shard-config:
+
+Slow Requests or Slow Recovery With mClock Scheduler
+----------------------------------------------------
+
+.. note:: This troubleshooting applies only to HDD-based clusters running the
+   mClock scheduler with the following OSD shard configuration:
+   ``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
+   See :ref:`mclock-hdd-cfg` for details on why the default OSD HDD shard
+   configuration for mClock was changed.
+
+On scaled HDD-based clusters with the mClock scheduler enabled and under a
+multiple OSD node failure condition, the following may be reported or observed:
+
+- Slow requests: these also manifest as degraded client I/O performance.
+- Slow background recoveries: lower than expected recovery throughput.
+
+**Troubleshooting Steps:**
+
+#. Verify from the OSD events that the slow requests are predominantly of type
+   ``queued_for_pg``.
+#. Verify whether the reported recovery rate is significantly lower than the
+   rate expected given the QoS allocations for the background recovery service.
+
+If either of the above is true, the following resolution may be applied. Note
+that this is disruptive, as it involves OSD restarts. Run the following
+commands to change the default OSD shard configuration for HDDs:
+
+.. prompt:: bash
+
+   ceph config set osd osd_op_num_shards_hdd 1
+   ceph config set osd osd_op_num_threads_per_shard_hdd 5
+
+The above configuration does not take effect immediately and requires a restart
+of the OSDs in the environment. To make this process minimally disruptive, the
+OSDs may be restarted in a carefully staggered manner.
+
 .. _rados_tshooting_flapping_osd:
 
 Flapping OSDs
diff --git a/src/common/options/osd.yaml.in b/src/common/options/osd.yaml.in
index 268a89154de5a..3703290c549aa 100644
--- a/src/common/options/osd.yaml.in
+++ b/src/common/options/osd.yaml.in
@@ -834,6 +834,9 @@ options:
 - name: osd_op_num_threads_per_shard
   type: int
   level: advanced
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD.
+    Each worker thread, when operational, processes items in the shard queue.
+    This setting overrides _ssd and _hdd if non-zero.
   default: 0
   flags:
   - startup
@@ -841,7 +844,9 @@
 - name: osd_op_num_threads_per_shard_hdd
   type: int
   level: advanced
-  default: 1
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+    (for rotational media).
+  default: 5
   see_also:
   - osd_op_num_threads_per_shard
   flags:
@@ -850,6 +855,8 @@
 - name: osd_op_num_threads_per_shard_ssd
   type: int
   level: advanced
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+    (for solid state media).
   default: 2
   see_also:
   - osd_op_num_threads_per_shard
@@ -870,7 +877,7 @@
   type: int
   level: advanced
   fmt_desc: the number of shards allocated for a given OSD (for rotational media).
-  default: 5
+  default: 1
   see_also:
   - osd_op_num_shards
   flags:
-- 
2.39.5
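
For operators applying the override described in the troubleshooting entry on
an existing cluster, a minimal sketch of the set, verify, and staggered-restart
sequence might look like the following. The option names and values come from
this patch; everything else is an assumption: it presumes a cephadm-managed
cluster (so that ``ceph orch daemon restart`` is available) and uses a simple
wait for ``HEALTH_OK`` between restarts as a stand-in for whatever readiness
check an operator prefers.

    #!/usr/bin/env bash
    # Sketch only: apply the new HDD shard defaults and restart OSDs one at a
    # time. Assumes a cephadm-managed cluster where `ceph orch daemon restart`
    # works; the HEALTH_OK wait is a deliberately conservative settle check.
    set -euo pipefail

    # Store the new defaults in the monitors' central config database.
    ceph config set osd osd_op_num_shards_hdd 1
    ceph config set osd osd_op_num_threads_per_shard_hdd 5

    # Confirm the values that OSDs will pick up on their next restart.
    ceph config get osd osd_op_num_shards_hdd
    ceph config get osd osd_op_num_threads_per_shard_hdd

    # Staggered restart: one OSD at a time, waiting for the cluster to settle
    # before moving on to the next daemon.
    for osd_id in $(ceph osd ls); do
        ceph orch daemon restart "osd.${osd_id}"
        until ceph health | grep -q HEALTH_OK; do
            sleep 30
        done
    done

Restarting strictly one OSD at a time trades total rollout time for minimal
impact on client I/O; on large clusters an operator might instead restart one
failure domain (for example, one host) at a time.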