Based on tests performed at scale on an HDD-based cluster, it was found
that scheduling with mClock was not optimal with multiple OSD shards. For
example, in the scaled cluster with multiple OSD node failures, the client
throughput was found to be inconsistent across test runs, coupled with
multiple reported slow requests.
However, the same test with a single OSD shard and with multiple worker
threads yielded significantly better results in terms of consistency of
client and recovery throughput across multiple test runs.
For more details see https://tracker.ceph.com/issues/66289.
Therefore, as an interim measure until the issue with multiple OSD shards
(or multiple mClock queues per OSD) is investigated and fixed, the
following change to the default HDD OSD shard configuration is made:
- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)
The other changes in this commit include:
- Doc change to the OSD and mClock config reference describing
this change.
- OSD troubleshooting entry on the procedure to change the shard
configuration for clusters affected by this issue running on older
releases.
- Add release note for this change.
Fixes: https://tracker.ceph.com/issues/66289
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 0d81e721378e6d7a647c5a4f6aab3cede1a828d3)
Conflicts:
doc/rados/troubleshooting/troubleshooting-osd.rst
- Included the troubleshooting entry before the "Flapping OSDs" section.
PendingReleaseNotes
- Moved the release note under 19.0.0 section.
confirmation flag when some MDSs exhibit health warning MDS_TRIM or
MDS_CACHE_OVERSIZED. This is to prevent accidental MDS failover causing
further delays in recovery.
+* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
+ that scheduling with mClock was not optimal with multiple OSD shards. For
+ example, in the test cluster with multiple OSD node failures, the client
+ throughput was found to be inconsistent across test runs, coupled with multiple
+ reported slow requests. However, the same test with a single OSD shard and
+ with multiple worker threads yielded significantly better results in terms of
+ consistency of client and recovery throughput across multiple test runs.
+ Therefore, as an interim measure until the issue with multiple OSD shards
+ (or multiple mClock queues per OSD) is investigated and fixed, the following
+ change to the default HDD OSD shard configuration is made:
+ - osd_op_num_shards_hdd = 1 (was 5)
+ - osd_op_num_threads_per_shard_hdd = 5 (was 1)
+ For more details see https://tracker.ceph.com/issues/66289.
>=18.0.0
users, who understand mclock and Ceph related configuration options.
+.. index:: mclock; shard config for HDD clusters
+
+.. _mclock-hdd-cfg:
+
+OSD Shard Configuration For HDD Based Clusters With mClock
+==========================================================
+
+Each OSD is configured with one or more shards to perform tasks. Each shard
+comprises a unique queue to handle various types of OSD-specific operations
+such as client I/O, recovery, scrub and so on. The scheduling of these
+operations in the queue is performed by a scheduler - in this case the mClock
+scheduler.
+
+For HDD-based OSDs, the number of shards is controlled by the
+:confval:`osd_op_num_shards_hdd` configuration option. Items are queued and
+dequeued by one or more worker threads per shard, and the number of worker
+threads is controlled by :confval:`osd_op_num_threads_per_shard_hdd`.
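+
+For example, the centrally configured (or default) values of these options, as
+well as the values a specific running OSD is using, can be inspected as shown
+in the sketch below, where ``osd.0`` is simply a placeholder for any OSD in
+the cluster:
+
+.. prompt:: bash
+
+ ceph config get osd osd_op_num_shards_hdd
+ ceph config get osd osd_op_num_threads_per_shard_hdd
+ ceph config show osd.0 osd_op_num_shards_hdd
+ ceph config show osd.0 osd_op_num_threads_per_shard_hdd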
+
+As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed
+determines the impact of the mClock queue. In general, a lower number of shards
+increases the impact of the mClock queues with respect to scheduling accuracy,
+provided there are enough worker threads per shard to process the items in the
+mClock queue.
+
+Based on tests performed at scale with small objects in the range
+[1 KiB - 256 KiB] on an HDD-based cluster (192 OSDs, 8 nodes,
+150 million objects), it was found that scheduling with mClock was not optimal
+with multiple OSD shards. For example, in this cluster with multiple OSD node
+failures, the client throughput was found to be inconsistent across test runs,
+coupled with multiple reported slow requests. For more details see
+https://tracker.ceph.com/issues/66289. With multiple shards, the situation was
+exacerbated when the MAX limit was allocated to both the client and background
+recovery classes of operations. During the OSD failure phase, since both client
+and recovery ops were in direct competition to utilize the full bandwidth of
+the OSDs, there was no predictability with respect to the throughput of either
+class of service.
+
+However, the same test with a single OSD shard and with multiple worker threads
+yielded significantly better results in terms of consistency of client and
+recovery throughput across multiple test runs. Please refer to the tracker
+above for more details. As a sanity check, the same test executed using this
+shard configuration with large objects in the range [1 MiB - 256 MiB] yielded
+similar results.
+
+Therefore, as an interim measure until the issue with multiple OSD shards
+(or multiple mClock queues per OSD) is investigated and fixed, the following
+change to the default HDD OSD shard configuration is made:
+
++---------------------------------------------+------------------+----------------+
+| Config Option | Old Default | New Default |
++=============================================+==================+================+
+| :confval:`osd_op_num_shards_hdd` | 5 | 1 |
++---------------------------------------------+------------------+----------------+
+| :confval:`osd_op_num_threads_per_shard_hdd` | 1 | 5 |
++---------------------------------------------+------------------+----------------+
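+
+After upgrading to a release that carries this change, the new defaults (or any
+local overrides) can be confirmed as sketched below; ``osd.0`` is a placeholder,
+and the ``ceph daemon`` commands must be run on the host where that OSD is
+running because they use the OSD's admin socket:
+
+.. prompt:: bash
+
+ ceph config help osd_op_num_shards_hdd
+ ceph daemon osd.0 config get osd_op_num_shards_hdd
+ ceph daemon osd.0 config get osd_op_num_threads_per_shard_hdd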
+
.. index:: mclock; built-in profiles
mClock Built-in Profiles - Locked Config Options
.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_num_threads_per_shard
+.. confval:: osd_op_num_threads_per_shard_hdd
+.. confval:: osd_op_num_threads_per_shard_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
should not be too large. They should be under the number of requests
one expects to be serviced each second.
+
+.. _dmclock-qos-caveats:
+
Caveats
```````
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.
+This is especially the case if there are insufficient shard worker
+threads. The number of shard worker threads can be controlled with the
+configuration options :confval:`osd_op_num_threads_per_shard`,
+:confval:`osd_op_num_threads_per_shard_hdd`, and
+:confval:`osd_op_num_threads_per_shard_ssd`.
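+
+For illustration only (the value shown is not a recommendation), the worker
+thread count for rotational media could be overridden as follows; note that a
+non-zero :confval:`osd_op_num_threads_per_shard` takes precedence over the
+media-specific variants, and the new values take effect only after the
+affected OSDs are restarted:
+
+.. prompt:: bash
+
+ ceph config set osd osd_op_num_threads_per_shard_hdd 5
+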
Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.
+- Suboptimal OSD shard configuration (on HDD-based clusters with the mClock scheduler)
Possible solutions:
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
+- Override OSD shard configuration (on HDD-based clusters with the mClock scheduler)
+ - See :ref:`mclock-tblshoot-hdd-shard-config` for resolution
Debugging Slow Requests
-----------------------
boundaries in the internal code (such as passing data across locks into new
threads).
+.. _mclock-tblshoot-hdd-shard-config:
+
+Slow Requests or Slow Recovery With mClock Scheduler
+----------------------------------------------------
+
+.. note:: This troubleshooting applies only to HDD-based clusters running the
+ mClock scheduler and with the following OSD shard configuration:
+ ``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
+ See :ref:`mclock-hdd-cfg` for details about the reason for the change made
+ to the default OSD HDD shard configuration for mClock.
+
+On scaled HDD-based clusters with the mClock scheduler enabled and under
+multiple OSD node failure conditions, the following may be reported or observed
+(a quick way to check for these symptoms is shown after the list):
+
+- slow requests: These also manifest as degraded client I/O performance.
+- slow background recoveries: Lower-than-expected recovery throughput.
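+
+For example, both symptoms typically surface in the cluster health and status
+output; slow requests appear as health warnings and the current recovery rate
+is visible in the status summary:
+
+.. prompt:: bash
+
+ ceph health detail
+ ceph status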
+
+**Troubleshooting Steps:**
+
+#. Verify from OSD events that the slow requests are predominantly of type
+ ``queued_for_pg`` (a sketch of this check follows these steps).
+#. Verify whether the reported recovery rate is significantly lower than the
+ expected rate, considering the QoS allocations for the background recovery
+ service.
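+
+A minimal sketch of the first check is shown below; ``osd.0`` is a placeholder,
+and the commands must be run on the host where that OSD is running because they
+use the OSD's admin socket:
+
+.. prompt:: bash
+
+ ceph daemon osd.0 dump_ops_in_flight | grep -c queued_for_pg
+ ceph daemon osd.0 dump_historic_ops | grep -c queued_for_pg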
+
+If either of the above is true, then the following resolution may be applied.
+Note that this is disruptive because it involves OSD restarts. Run the
+following commands to change the default OSD shard configuration for HDDs:
+
+.. prompt:: bash
+
+ ceph config set osd osd_op_num_shards_hdd 1
+ ceph config set osd osd_op_num_threads_per_shard_hdd 5
+
+The above configuration changes do not take effect until the OSDs in the
+environment are restarted. To keep this process minimally disruptive, the OSDs
+may be restarted in a carefully staggered manner.
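+
+One possible way to stagger the restarts, assuming a cephadm-managed cluster
+(``osd.0`` is a placeholder; repeat for each OSD and wait for the cluster to
+return to a healthy state before proceeding to the next one), is:
+
+.. prompt:: bash
+
+ ceph osd ok-to-stop 0
+ ceph orch daemon restart osd.0
+
+On non-cephadm deployments, the equivalent OSD service restart mechanism of
+the environment may be used instead.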
+
.. _rados_tshooting_flapping_osd:
Flapping OSDs
- name: osd_op_num_threads_per_shard
type: int
level: advanced
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD.
+ Each worker thread, when operational, processes items in the shard queue.
+ This setting overrides _ssd and _hdd if non-zero.
default: 0
flags:
- startup
- name: osd_op_num_threads_per_shard_hdd
type: int
level: advanced
- default: 1
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+ (for rotational media).
+ default: 5
see_also:
- osd_op_num_threads_per_shard
flags:
- name: osd_op_num_threads_per_shard_ssd
type: int
level: advanced
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+ (for solid state media).
default: 2
see_also:
- osd_op_num_threads_per_shard
type: int
level: advanced
fmt_desc: the number of shards allocated for a given OSD (for rotational media).
- default: 5
+ default: 1
see_also:
- osd_op_num_shards
flags: