of the column showing the state of a group snapshot in the unformatted CLI
output is changed from 'STATUS' to 'STATE'. The state of a group snapshot
that was shown as 'ok' is now shown as 'complete', which is more descriptive.
+* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
+  that scheduling with mClock was not optimal with multiple OSD shards. For
+  example, in the test cluster with multiple OSD node failures, the client
+  throughput was found to be inconsistent across test runs, along with
+  multiple reported slow requests. However, the same test with a single OSD
+  shard and with multiple worker threads yielded significantly better results
+  in terms of consistency of client and recovery throughput across multiple
+  test runs. Therefore, as an interim measure until the issue with multiple
+  OSD shards (or multiple mClock queues per OSD) is investigated and fixed,
+  the following change to the default HDD OSD shard configuration is made:
+
+  - osd_op_num_shards_hdd = 1 (was 5)
+  - osd_op_num_threads_per_shard_hdd = 5 (was 1)
+
+  For more details see https://tracker.ceph.com/issues/66289.
>=19.0.0
users, who understand mclock and Ceph related configuration options.
+.. index:: mclock; shard config for HDD clusters
+
+.. _mclock-hdd-cfg:
+
+OSD Shard Configuration For HDD-Based Clusters With mClock
+==========================================================
+Each OSD is configured with one or more shards to perform tasks. Each shard
+comprises a unique queue to handle various types of OSD-specific operations
+like client I/O, recovery, scrub, and so on. The scheduling of these
+operations in the queue is performed by a scheduler, in this case the mClock
+scheduler.
+
+For HDD-based OSDs, the number of shards is controlled by the configuration
+option :confval:`osd_op_num_shards_hdd`. Items are queued and dequeued by one
+or more worker threads, and the number of worker threads is controlled by the
+configuration option :confval:`osd_op_num_threads_per_shard_hdd`.
+
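+For example, the values of these options that apply to OSDs can be queried
+with ``ceph config get``:
+
+.. prompt:: bash
+
+   ceph config get osd osd_op_num_shards_hdd
+   ceph config get osd osd_op_num_threads_per_shard_hdd
+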
+As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed
+determines the impact of the mClock queue. In general, a lower number of
+shards increases the impact of the mClock queues with respect to scheduling
+accuracy, provided there are enough worker threads per shard to process the
+items in the mClock queue.
+
+Based on tests performed at scale with small objects in the range
+[1 KiB - 256 KiB] on an HDD-based cluster (192 OSDs, 8 nodes,
+150 million objects), it was found that scheduling with mClock was not optimal
+with multiple OSD shards. For example, in this cluster with multiple OSD node
+failures, the client throughput was found to be inconsistent across test runs,
+along with multiple reported slow requests. For more details
+see https://tracker.ceph.com/issues/66289. With multiple shards, the situation
+was exacerbated when the MAX limit was allocated to both the client and
+background recovery classes of operations. During the OSD failure phase, since
+both client and recovery ops were in direct competition to utilize the full
+bandwidth of the OSDs, there was no predictability with respect to the
+throughput of either class of service.
+
+However, the same test with a single OSD shard and with multiple worker
+threads yielded significantly better results in terms of consistency of client
+and recovery throughput across multiple test runs. Please refer to the tracker
+above for more details. As a sanity check, the same test executed using this
+shard configuration with large objects in the range [1 MiB - 256 MiB] yielded
+similar results.
+
+Therefore, as an interim measure until the issue with multiple OSD shards
+(or multiple mClock queues per OSD) is investigated and fixed, the following
+change to the default HDD OSD shard configuration is made:
+
++---------------------------------------------+------------------+----------------+
+| Config Option | Old Default | New Default |
++=============================================+==================+================+
+| :confval:`osd_op_num_shards_hdd` | 5 | 1 |
++---------------------------------------------+------------------+----------------+
+| :confval:`osd_op_num_threads_per_shard_hdd` | 1 | 5 |
++---------------------------------------------+------------------+----------------+
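+
+To verify the values in effect on a running OSD, the OSD's admin socket can be
+queried. The following is a minimal example; ``osd.0`` is a placeholder, and
+the commands must be run on the host where that OSD is running:
+
+.. prompt:: bash
+
+   ceph daemon osd.0 config get osd_op_num_shards_hdd
+   ceph daemon osd.0 config get osd_op_num_threads_per_shard_hdd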
+
.. index:: mclock; built-in profiles
mClock Built-in Profiles - Locked Config Options
.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_num_threads_per_shard
+.. confval:: osd_op_num_threads_per_shard_hdd
+.. confval:: osd_op_num_threads_per_shard_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
should not be too large. They should be under the number of requests
one expects to be serviced each second.
+
+.. _dmclock-qos-caveats:
+
Caveats
```````
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.
+This is especially the case if there are insufficient shard worker
+threads. The number of shard worker threads can be controlled with the
+configuration options :confval:`osd_op_num_threads_per_shard`,
+:confval:`osd_op_num_threads_per_shard_hdd` and
+:confval:`osd_op_num_threads_per_shard_ssd`.
Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.
+- A suboptimal OSD shard configuration (on HDD-based clusters with the mClock
+  scheduler)
Possible solutions:
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
+- Override the OSD shard configuration (on HDD-based clusters with the mClock
+  scheduler); see :ref:`mclock-tblshoot-hdd-shard-config` for the resolution
Debugging Slow Requests
-----------------------
boundaries in the internal code (such as passing data across locks into new
threads).
+.. _mclock-tblshoot-hdd-shard-config:
+
+Slow Requests or Slow Recovery With mClock Scheduler
+----------------------------------------------------
+
+.. note:: This troubleshooting is applicable only to HDD-based clusters running
+   the mClock scheduler with the following OSD shard configuration:
+   ``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
+   Also, see :ref:`mclock-hdd-cfg` for details on the reason for the change to
+   the default OSD HDD shard configuration for mClock.
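+
+As a quick applicability check, the active scheduler and the backing media
+type of an OSD can be confirmed before proceeding. The following is a minimal
+example; ``osd.0`` is a placeholder, and the ``ceph daemon`` command must be
+run on the host where that OSD is running:
+
+.. prompt:: bash
+
+   ceph daemon osd.0 config get osd_op_queue
+   ceph osd metadata osd.0 | grep -i rotational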
+
+On scaled HDD-based clusters with the mClock scheduler enabled and under
+multiple OSD node failures, the following could be reported or observed:
+
+- Slow requests: these also manifest as degraded client I/O performance.
+- Slow background recoveries: lower than expected recovery throughput.
+
+**Troubleshooting Steps:**
+
+#. Verify from the OSD events that the slow requests are predominantly of type
+   ``queued_for_pg`` (see the example after this list).
+#. Verify whether the reported recovery rate is significantly lower than the
+   expected rate, considering the QoS allocations for the background recovery
+   service.
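+
+For the first step, the event history of recent operations can be inspected
+through the OSD's admin socket; the following is only a minimal sketch, where
+``osd.0`` is a placeholder and the first command is run on the host where that
+OSD is running. For the second step, the current recovery rate is reported by
+``ceph status``:
+
+.. prompt:: bash
+
+   ceph daemon osd.0 dump_historic_ops | grep -c queued_for_pg
+   ceph status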
+
+If either of the above conditions is true, the following resolution may be
+applied. Note that it is disruptive because it involves OSD restarts. Run the
+following commands to change the default OSD shard configuration for HDDs:
+
+.. prompt:: bash
+
+ ceph config set osd osd_op_num_shards_hdd 1
+ ceph config set osd osd_op_num_threads_per_shard_hdd 5
+
+The above configuration will not take effect immediately; it requires a
+restart of the OSDs in the environment. To keep this process as minimally
+disruptive as possible, the OSDs may be restarted in a carefully staggered
+manner, as shown in the example below.
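+
+As an illustration only, on a cephadm-managed cluster this could be done by
+restarting one OSD at a time and waiting for the placement groups to return to
+``active+clean`` before moving on to the next OSD (``osd.0`` is a placeholder):
+
+.. prompt:: bash
+
+   ceph orch daemon restart osd.0
+   ceph status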
+
.. _rados_tshooting_flapping_osd:
Flapping OSDs
- name: osd_op_num_threads_per_shard
type: int
level: advanced
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD.
+ Each worker thread, when operational, processes items in the shard queue.
+ This setting overrides the _ssd and _hdd variants if non-zero.
default: 0
flags:
- startup
- name: osd_op_num_threads_per_shard_hdd
type: int
level: advanced
- default: 1
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+ (for rotational media).
+ default: 5
see_also:
- osd_op_num_threads_per_shard
flags:
- name: osd_op_num_threads_per_shard_ssd
type: int
level: advanced
+ fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+ (for solid state media).
default: 2
see_also:
- osd_op_num_threads_per_shard
type: int
level: advanced
fmt_desc: the number of shards allocated for a given OSD (for rotational media).
- default: 5
+ default: 1
see_also:
- osd_op_num_shards
flags: