Based on tests performed at scale on an HDD-based cluster, it was found
that scheduling with mClock was not optimal with multiple OSD shards. For
example, in the scaled cluster with multiple OSD node failures, the client
throughput was found to be inconsistent across test runs, coupled with
multiple reported slow requests.
However, the same test with a single OSD shard and with multiple worker
threads yielded significantly better results in terms of consistency of
client and recovery throughput across multiple test runs.
For more details see https://tracker.ceph.com/issues/66289.
Therefore, as an interim measure until the issue with multiple OSD shards
(or multiple mClock queues per OSD) is investigated and fixed, the
following change to the default HDD OSD shard configuration is made:
- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)
The other changes in this commit include:
- Doc change to the OSD and mClock config reference describing
this change.
- OSD troubleshooting entry on the procedure to change the shard
configuration for clusters affected by this issue running on older
releases.
- Add release note for this change.
Fixes: https://tracker.ceph.com/issues/66289
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 0d81e721378e6d7a647c5a4f6aab3cede1a828d3)
Conflicts:
doc/rados/troubleshooting/troubleshooting-osd.rst
- Included the troubleshooting entry before the "Flapping OSDs" section.
PendingReleaseNotes
- Moved the release note under 19.0.0 section.
confirmation flag when some MDSs exhibit health warning MDS_TRIM or
MDS_CACHE_OVERSIZED. This is to prevent accidental MDS failover causing
further delays in recovery.
+* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
+ that scheduling with mClock was not optimal with multiple OSD shards. For
+ example, in the test cluster with multiple OSD node failures, the client
+ throughput was found to be inconsistent across test runs, coupled with multiple
+ reported slow requests. However, the same test with a single OSD shard and
+ with multiple worker threads yielded significantly better results in terms of
+ consistency of client and recovery throughput across multiple test runs.
+ Therefore, as an interim measure until the issue with multiple OSD shards
+ (or multiple mClock queues per OSD) is investigated and fixed, the following
+ change to the default HDD OSD shard configuration is made:
+ - osd_op_num_shards_hdd = 1 (was 5)
+ - osd_op_num_threads_per_shard_hdd = 5 (was 1)
+ For more details see https://tracker.ceph.com/issues/66289.
>=18.0.0
users, who understand mclock and Ceph related configuration options.
+.. index:: mclock; shard config for HDD clusters
+
+.. _mclock-hdd-cfg:
+
+OSD Shard Configuration For HDD Based Clusters With mClock
+==========================================================
+
+Each OSD is configured with one or more shards to perform tasks. Each shard
+comprises a unique queue to handle various types of OSD-specific operations
+such as client I/O, recovery, scrub and so on. The scheduling of these
+operations in the queue is performed by a scheduler - in this case the mClock
+scheduler.
+
+For HDD-based OSDs, the number of shards is controlled by the
+:confval:`osd_op_num_shards_hdd` configuration option. Items are queued and
+dequeued by one or more worker threads per shard, and the number of worker
+threads is controlled by :confval:`osd_op_num_threads_per_shard_hdd`.
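+
+For example, the centrally configured (or default) values of these options, as
+well as the values a specific running OSD is using, can be inspected as shown
+in the sketch below, where ``osd.0`` is simply a placeholder for any OSD in
+the cluster:
+
+.. prompt:: bash
+
+ ceph config get osd osd_op_num_shards_hdd
+ ceph config get osd osd_op_num_threads_per_shard_hdd
+ ceph config show osd.0 osd_op_num_shards_hdd
+ ceph config show osd.0 osd_op_num_threads_per_shard_hdd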
+
+As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed
+determines the impact of the mClock queue. In general, a lower number of shards
+increases the impact of the mClock queues with respect to scheduling accuracy,
+provided there are enough worker threads per shard to process the items in the
+mClock queue.
+
+Based on tests performed at scale with small objects in the range
+[1 KiB - 256 KiB] on an HDD-based cluster (192 OSDs, 8 nodes,
+150 million objects), it was found that scheduling with mClock was not optimal
+with multiple OSD shards. For example, in this cluster with multiple OSD node
+failures, the client throughput was found to be inconsistent across test runs,
+coupled with multiple reported slow requests. For more details see
+https://tracker.ceph.com/issues/66289. With multiple shards, the situation was
+exacerbated when the MAX limit was allocated to both the client and background
+recovery classes of operations. During the OSD failure phase, since both client
+and recovery ops were in direct competition to utilize the full bandwidth of
+the OSDs, there was no predictability with respect to the throughput of either
+class of service.
+
+However, the same test with a single OSD shard and with multiple worker threads
+yielded significantly better results in terms of consistency of client and
+recovery throughput across multiple test runs. Please refer to the tracker
+above for more details. As a sanity check, the same test executed using this
+shard configuration with large objects in the range [1 MiB - 256 MiB] yielded
+similar results.
+
+Therefore, as an interim measure until the issue with multiple OSD shards
+(or multiple mClock queues per OSD) is investigated and fixed, the following
+change to the default HDD OSD shard configuration is made:
+
++---------------------------------------------+------------------+----------------+
+| Config Option | Old Default | New Default |
++=============================================+==================+================+
+| :confval:`osd_op_num_shards_hdd` | 5 | 1 |
++---------------------------------------------+------------------+----------------+
+| :confval:`osd_op_num_threads_per_shard_hdd` | 1 | 5 |
++---------------------------------------------+------------------+----------------+
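+
+After upgrading to a release that carries this change, the new defaults (or any
+local overrides) can be confirmed as sketched below; ``osd.0`` is a placeholder,
+and the ``ceph daemon`` commands must be run on the host where that OSD is
+running because they use the OSD's admin socket:
+
+.. prompt:: bash
+
+ ceph config help osd_op_num_shards_hdd
+ ceph daemon osd.0 config get osd_op_num_shards_hdd
+ ceph daemon osd.0 config get osd_op_num_threads_per_shard_hdd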
+
.. index:: mclock; built-in profiles
mClock Built-in Profiles - Locked Config Options
.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_num_threads_per_shard
+.. confval:: osd_op_num_threads_per_shard_hdd
+.. confval:: osd_op_num_threads_per_shard_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
should not be too large. They should be under the number of requests
one expects to be serviced each second.
+
+.. _dmclock-qos-caveats:
+
Caveats
```````
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.
+This is especially the case if there are insufficient shard worker
+threads. The number of shard worker threads can be controlled with the
+configuration options :confval:`osd_op_num_threads_per_shard`,
+:confval:`osd_op_num_threads_per_shard_hdd`, and
+:confval:`osd_op_num_threads_per_shard_ssd`.
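+
+For illustration only (the value shown is not a recommendation), the worker
+thread count for rotational media could be overridden as follows; note that a
+non-zero :confval:`osd_op_num_threads_per_shard` takes precedence over the
+media-specific variants, and the new values take effect only after the
+affected OSDs are restarted:
+
+.. prompt:: bash
+
+ ceph config set osd osd_op_num_threads_per_shard_hdd 5
+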
Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.
+- Suboptimal OSD shard configuration (on HDD-based clusters with the mClock scheduler)
Possible solutions:
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
+- Override OSD shard configuration (on HDD-based clusters with the mClock scheduler)
+ - See :ref:`mclock-tblshoot-hdd-shard-config` for resolution
Debugging Slow Requests
-----------------------
boundaries in the internal code (such as passing data across locks into new
threads).
+.. _mclock-tblshoot-hdd-shard-config:
+
+Slow Requests or Slow Recovery With mClock Scheduler
+----------------------------------------------------
+
+.. note:: This troubleshooting applies only to HDD-based clusters running the
+ mClock scheduler and with the following OSD shard configuration:
+ ``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
+ See :ref:`mclock-hdd-cfg` for details about the reason for the change made
+ to the default OSD HDD shard configuration for mClock.
+
+On scaled HDD-based clusters with the mClock scheduler enabled and under
+multiple OSD node failure conditions, the following may be reported or observed
+(a quick way to check for these symptoms is shown after the list):
+
+- slow requests: These also manifest as degraded client I/O performance.
+- slow background recoveries: Lower-than-expected recovery throughput.
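+
+For example, both symptoms typically surface in the cluster health and status
+output; slow requests appear as health warnings and the current recovery rate
+is visible in the status summary:
+
+.. prompt:: bash
+
+ ceph health detail
+ ceph status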
+
+**Troubleshooting Steps:**
+
+#. Verify from OSD events that the slow requests are predominantly of type
+ ``queued_for_pg`` (a sketch of this check follows these steps).
+#. Verify whether the reported recovery rate is significantly lower than the
+ expected rate, considering the QoS allocations for the background recovery
+ service.
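+
+A minimal sketch of the first check is shown below; ``osd.0`` is a placeholder,
+and the commands must be run on the host where that OSD is running because they
+use the OSD's admin socket:
+
+.. prompt:: bash
+
+ ceph daemon osd.0 dump_ops_in_flight | grep -c queued_for_pg
+ ceph daemon osd.0 dump_historic_ops | grep -c queued_for_pg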
+
+If either of the above is true, then the following resolution may be applied.
+Note that this is disruptive because it involves OSD restarts. Run the
+following commands to change the default OSD shard configuration for HDDs:
+
+.. prompt:: bash
+
+ ceph config set osd osd_op_num_shards_hdd 1
+ ceph config set osd osd_op_num_threads_per_shard_hdd 5
+
+The above configuration changes do not take effect until the OSDs in the
+environment are restarted. To keep this process minimally disruptive, the OSDs
+may be restarted in a carefully staggered manner.
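+
+One possible way to stagger the restarts, assuming a cephadm-managed cluster
+(``osd.0`` is a placeholder; repeat for each OSD and wait for the cluster to
+return to a healthy state before proceeding to the next one), is:
+
+.. prompt:: bash
+
+ ceph osd ok-to-stop 0
+ ceph orch daemon restart osd.0
+
+On non-cephadm deployments, the equivalent OSD service restart mechanism of
+the environment may be used instead.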
+
.. _rados_tshooting_flapping_osd:
Flapping OSDs
- name: osd_op_num_threads_per_shard
type: int
level: advanced
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD.
+ Each worker thread, when operational, processes items in the shard queue.
+ This setting overrides _ssd and _hdd if non-zero.
default: 0
flags:
- startup
- name: osd_op_num_threads_per_shard_hdd
type: int
level: advanced
- default: 1
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+ (for rotational media).
+ default: 5
see_also:
- osd_op_num_threads_per_shard
flags:
- name: osd_op_num_threads_per_shard_ssd
type: int
level: advanced
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+ (for solid state media).
default: 2
see_also:
- osd_op_num_threads_per_shard
type: int
level: advanced
fmt_desc: the number of shards allocated for a given OSD (for rotational media).
- default: 5
+ default: 1
see_also:
- osd_op_num_shards
flags: