common/options: Change HDD OSD shard configuration defaults for mClock

author Sridhar Seshasayee <sseshasa@redhat.com>

Tue, 3 Sep 2024 05:39:08 +0000 (11:09 +0530)

committer Sridhar Seshasayee <sseshasa@redhat.com>

Tue, 3 Sep 2024 05:39:08 +0000 (11:09 +0530)
diff --git a/PendingReleaseNotes b/PendingReleaseNotes

index 5853b7e290bc86385e122c288440e3334071b35a..eb9d58bbba75676f8422040b91b436821d9832b3 100644 (file)
--- a/PendingReleaseNotes
+++ b/PendingReleaseNotes
@@ -12,6 +12,19 @@
    of the column showing the state of a group snapshot in the unformatted CLI
    output is changed from 'STATUS' to 'STATE'. The state of a group snapshot
    that was shown as 'ok' is now shown as 'complete', which is more descriptive.
+* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
+  that scheduling with mClock was not optimal with multiple OSD shards. For
+  example, in the test cluster with multiple OSD node failures, the client
+  throughput was found to be inconsistent across test runs coupled with multiple
+  reported slow requests. However, the same test with a single OSD shard and
+  with multiple worker threads yielded significantly better results in terms of
+  consistency of client and recovery throughput across multiple test runs.
+  Therefore, as an interim measure until the issue with multiple OSD shards
+  (or multiple mClock queues per OSD) is investigated and fixed, the following
+  change to the default HDD OSD shard configuration is made:
+   - osd_op_num_shards_hdd = 1 (was 5)
+   - osd_op_num_threads_per_shard_hdd = 5 (was 1)
+  For more details see https://tracker.ceph.com/issues/66289.
  
  >=19.0.0
  
diff --git a/doc/rados/configuration/mclock-config-ref.rst b/doc/rados/configuration/mclock-config-ref.rst

index a338aa6da56167398cf7eabcfdfb69f2433d7aef..12af2522e17adcbb7c7d53a86d8ff389cb4a5ca8 100644 (file)
--- a/doc/rados/configuration/mclock-config-ref.rst
+++ b/doc/rados/configuration/mclock-config-ref.rst
@@ -164,6 +164,60 @@ parameters. This profile should be used with caution and is meant for advanced
  users, who understand mclock and Ceph related configuration options.
  
  
+.. index:: mclock; shard config for HDD clusters
+
+.. _mclock-hdd-cfg:
+
+OSD Shard Configuration For HDD Based Clusters With mClock
+==========================================================
+Each OSD is configured with one or more shards to perform tasks. Each shard
+comprises a unique queue to handle various types of OSD specific operations
+like client I/O, recovery, scrub and so on. The scheduling of these operations
+in the queue is performed by a scheduler - in this case the mClock scheduler.
+
+For HDD based OSDs, the number of shards is controlled by configuration
+:confval:`osd_op_num_shards_hdd`. Items are queued and dequeued by one or
+more worker threads and this is controlled by configuration
+:confval:`osd_op_num_threads_per_shard_hdd`.
+
+As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed
+determines the impact of the mClock queues. In general, a lower number of shards
+increases the impact of the mClock queues with respect to scheduling accuracy,
+provided that there are enough worker threads per shard to process the items
+in the mClock queue.
+
+Based on tests performed at scale with small objects in the range
+[1 KiB - 256 KiB] on an HDD-based cluster (192 OSDs, 8 nodes,
+150 million objects), it was found that scheduling with mClock was not optimal
+with multiple OSD shards. For example, in this cluster with multiple OSD node
+failures, the client throughput was found to be inconsistent across test runs
+and was accompanied by multiple reported slow requests. For more details
+see https://tracker.ceph.com/issues/66289. With multiple shards, the situation
+was exacerbated when the MAX limit was allocated to both the client and
+background recovery classes of operations. During the OSD failure phase, since
+client and recovery ops were in direct competition to utilize the full
+bandwidth of the OSDs, the throughput of either class of service was
+unpredictable.
+
+However, the same test with a single OSD shard and with multiple worker threads
+yielded significantly better results in terms of consistency of client and
+recovery throughput across multiple test runs. Please refer to the tracker
+above for more details. As a sanity check, the same test executed using this
+shard configuration with large objects in the range [1 MiB - 256 MiB] yielded
+similar results.
+
+Therefore, as an interim measure until the issue with multiple OSD shards
+(or multiple mClock queues per OSD) is investigated and fixed, the following
+change to the default HDD OSD shard configuration is made:
+
++---------------------------------------------+------------------+----------------+
+|  Config Option                              | Old Default      | New Default    |
++=============================================+==================+================+
+| :confval:`osd_op_num_shards_hdd`            | 5                | 1              |
++---------------------------------------------+------------------+----------------+
+| :confval:`osd_op_num_threads_per_shard_hdd` | 1                | 5              |
++---------------------------------------------+------------------+----------------+
+
  .. index:: mclock; built-in profiles
  
  mClock Built-in Profiles -  Locked Config Options
diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst

index 5064a95851cae4d173b6f642cf2c5247bb6de64a..5c90a90aaf7fd49c2280438d03dc4848eeaa6d59 100644 (file)
--- a/doc/rados/configuration/osd-config-ref.rst
+++ b/doc/rados/configuration/osd-config-ref.rst
@@ -189,6 +189,9 @@ Operations
  .. confval:: osd_op_num_shards
  .. confval:: osd_op_num_shards_hdd
  .. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_num_threads_per_shard
+.. confval:: osd_op_num_threads_per_shard_hdd
+.. confval:: osd_op_num_threads_per_shard_ssd
  .. confval:: osd_op_queue
  .. confval:: osd_op_queue_cut_off
  .. confval:: osd_client_op_priority
@@ -292,6 +295,9 @@ of the current time. The ultimate lesson is that values for weight
  should not be too large. They should be under the number of requests
  one expects to be serviced each second.
  
+
+.. _dmclock-qos-caveats:
+
  Caveats
  ```````
  
@@ -303,6 +309,11 @@ number of shards can be controlled with the configuration options
  :confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
  :confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
  impact of the mClock queues, but may have other deleterious effects.
+This is especially the case if there are insufficient shard worker
+threads. The number of shard worker threads can be controlled with the
+configuration options :confval:`osd_op_num_threads_per_shard`,
+:confval:`osd_op_num_threads_per_shard_hdd` and
+:confval:`osd_op_num_threads_per_shard_ssd`.
  
  Second, requests are transferred from the operation queue to the
  operation sequencer, in which they go through the phases of
diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst

index d237226ea92cdd7b7ce88a70c3b4095a7d97976d..37de718a012882b4fe9b177ea048881d3a2fc222 100644 (file)
--- a/doc/rados/troubleshooting/troubleshooting-osd.rst
+++ b/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -618,6 +618,7 @@ Possible causes include:
  - A bug in the kernel file system (check ``dmesg`` output)
  - An overloaded cluster (check system load, iostat, etc.)
  - A bug in the ``ceph-osd`` daemon.
+- Suboptimal OSD shard configuration (on HDD based clusters with the mClock scheduler)
  
  Possible solutions:
  
@@ -626,6 +627,8 @@ Possible solutions:
  - Upgrade Ceph
  - Restart OSDs
  - Replace failed or failing components
+- Override OSD shard configuration (on HDD based clusters with the mClock scheduler)
+    - See :ref:`mclock-tblshoot-hdd-shard-config` for resolution
  
  Debugging Slow Requests
  -----------------------
@@ -680,6 +683,43 @@ Although some of these events may appear redundant, they cross important
  boundaries in the internal code (such as passing data across locks into new
  threads).
  
+.. _mclock-tblshoot-hdd-shard-config:
+
+Slow Requests or Slow Recovery With mClock Scheduler
+----------------------------------------------------
+
+.. note:: This troubleshooting is applicable only for HDD based clusters running
+   the mClock scheduler with the following OSD shard configuration:
+   ``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
+   Also, see :ref:`mclock-hdd-cfg` for the reason behind the change made to the
+   default OSD HDD shard configuration for mClock.
+
+On scaled HDD based clusters with the mClock scheduler enabled and with multiple
+OSD node failures, the following could be reported or observed:
+
+- slow requests: These also manifest as degraded client I/O performance.
+- slow background recoveries: Lower than expected recovery throughput.
+
+**Troubleshooting Steps:**
+
+#. Verify from OSD events that the slow requests are predominantly of type
+   ``queued_for_pg``.
+#. Verify if the reported recovery rate is significantly lower than the expected
+   rate considering the QoS allocations for background recovery service.
+
+If either of the above conditions is true, then the following resolution may be
+applied. Note that this is disruptive as it involves OSD restarts. Run the
+following commands to change the default OSD shard configuration for HDDs:
+
+.. prompt:: bash
+
+   ceph config set osd osd_op_num_shards_hdd 1
+   ceph config set osd osd_op_num_threads_per_shard_hdd 5
+
+The above configuration does not take effect immediately; it requires a
+restart of the OSDs in the environment. To keep this process as non-disruptive
+as possible, restart the OSDs in a carefully staggered manner.
+
  .. _rados_tshooting_flapping_osd:
  
  Flapping OSDs
diff --git a/src/common/options/osd.yaml.in b/src/common/options/osd.yaml.in

index 268a89154de5acef38d93ff696bfd239c6c4d92c..3703290c549aacb4a6cc4feba95921e0b2c3e799 100644 (file)
--- a/src/common/options/osd.yaml.in
+++ b/src/common/options/osd.yaml.in
@@ -834,6 +834,9 @@ options:
  - name: osd_op_num_threads_per_shard
    type: int
    level: advanced
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD.
+    Each worker thread when operational processes items in the shard queue.
+    This setting overrides _ssd and _hdd if non-zero.
    default: 0
    flags:
    - startup
@@ -841,7 +844,9 @@ options:
  - name: osd_op_num_threads_per_shard_hdd
    type: int
    level: advanced
-  default: 1
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+    (for rotational media).
+  default: 5
    see_also:
    - osd_op_num_threads_per_shard
    flags:
@@ -850,6 +855,8 @@ options:
  - name: osd_op_num_threads_per_shard_ssd
    type: int
    level: advanced
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+    (for solid state media).
    default: 2
    see_also:
    - osd_op_num_threads_per_shard
@@ -870,7 +877,7 @@ options:
    type: int
    level: advanced
    fmt_desc: the number of shards allocated for a given OSD (for rotational media).
-  default: 5
+  default: 1
    see_also:
    - osd_op_num_shards
    flags:
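
Illustrative note (not part of the patch): the sketch below models the two rules
this change relies on. First, the total number of OSD op worker threads is the
number of shards multiplied by the threads per shard, so the old HDD defaults
(5 x 1) and the new ones (1 x 5) give the same thread count and differ only in
the number of mClock queues. Second, as the option descriptions state, a
non-zero generic option overrides the _hdd and _ssd variants. The names
ShardConfig and effective_value are hypothetical helpers written for this
example; they are not Ceph APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardConfig:
    """Hypothetical model of an OSD shard configuration (not a Ceph type)."""
    num_shards: int          # one op queue (mClock queue) per shard
    threads_per_shard: int   # worker threads servicing each shard's queue

    @property
    def total_worker_threads(self) -> int:
        # Every shard gets its own set of worker threads.
        return self.num_shards * self.threads_per_shard


def effective_value(generic: int, hdd: int, ssd: int, rotational: bool) -> int:
    """Resolve an option such as osd_op_num_threads_per_shard.

    A non-zero generic value overrides both media-specific variants;
    otherwise the _hdd or _ssd value is chosen by media type.
    """
    if generic != 0:
        return generic
    return hdd if rotational else ssd


# HDD defaults before and after this change.
OLD_HDD = ShardConfig(num_shards=5, threads_per_shard=1)
NEW_HDD = ShardConfig(num_shards=1, threads_per_shard=5)

if __name__ == "__main__":
    for label, cfg in (("old", OLD_HDD), ("new", NEW_HDD)):
        print(f"{label}: {cfg.num_shards} queue(s), "
              f"{cfg.total_worker_threads} worker thread(s) in total")
```

Under these assumptions the change keeps the thread count constant while reducing
the number of mClock queues per HDD OSD from five to one, which matches the
intent described in the release note.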