``osd client message size cap``
:Description: The largest client data message allowed in memory.
-:Type: 64-bit Integer Unsigned
+:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``
token bucket system which, when there are sufficient tokens, will
dequeue high priority queues first. If there are not enough
tokens available, queues are dequeued from low priority to high priority.
- The new WeightedPriorityQueue (``wpq``) dequeues all priorities in
+ The WeightedPriorityQueue (``wpq``) dequeues all priorities in
relation to their priorities to prevent starvation of any queue.
WPQ should help in cases where a few OSDs are more overloaded
- than others. Requires a restart.
+ than others. The new mClock based OpClassQueue
+ (``mclock_opclass``) prioritizes operations based on which class
+ they belong to (recovery, scrub, snaptrim, client op, osd subop).
+ The mClock based ClientQueue (``mclock_client``) also
+ incorporates the client identifier in order to promote fairness
+ between clients. See `QoS Based on mClock`_. Requires a restart.
:Type: String
-:Valid Choices: prio, wpq
+:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``
:Type: 32-bit Integer
:Default: ``5``
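+
+For example, a minimal ``ceph.conf`` sketch selecting the operation queue
+implementation described above might look like the following (illustrative
+only; the OSDs must be restarted for the setting to take effect)::
+
+    [osd]
+    osd op queue = mclock_opclass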
+
+QoS Based on mClock
+-------------------
+
+The QoS support of Ceph is implemented using a queueing scheduler based on
+`the dmClock algorithm`_. This algorithm allocates the I/O resources of the
+Ceph cluster in proportion to weights, and enforces the constraints of minimum
+reservation and maximum limitation, so that the services can compete for the
+resources fairly. Currently the Ceph services involving I/O resources are
+categorized into the following buckets:
+
+- client op: the iops issued by a client
+- osd subop: the iops issued by a primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed (see the short illustration after this
+   list).
+
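+For instance, if one service has weight 9 and another has weight 1 (as in the
+example below), capacity left over once both reservations are satisfied is
+shared between them roughly in proportion 9:1, i.e. about 9/(9+1) = 90% and
+1/(9+1) = 10%, subject to each service's limit. This is a simplified reading
+of the weight tag, intended only as an illustration.
+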
+In Ceph, operations are graded with a "cost", and the resources allocated for
+serving the various services are consumed by these "costs". So, for example,
+the more reservation a service has, the more resources it is guaranteed to
+possess, as long as it requires them. Assume there are two services, recovery
+and client ops, with the following tags:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery never takes more than 5 units of
+resources, even if it requires that much and no other service is competing
+with it. Conversely, if clients start to issue a large number of I/O requests,
+they cannot exhaust all the I/O resources either: 1 unit of resources is
+always allocated for recovery jobs, so recovery will not be starved even in a
+cluster with high load. In the meantime, client ops can enjoy a larger portion
+of the I/O resources, because their weight is "9" while their competitor's is
+"1". Client ops are not clamped by their limit setting in this example (a
+limit of 0 imposes no upper bound), so they can make use of all the resources
+if there is no recovery ongoing.
+
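+As a concrete sketch, the example tags above could be expressed with the
+``osd op queue mclock *`` options documented below (the numbers are the
+abstract units from the example, not recommended production values)::
+
+    [osd]
+    osd op queue = mclock_opclass
+    # recovery: (r:1, l:5, w:1)
+    osd op queue mclock recov res = 1.0
+    osd op queue mclock recov lim = 5.0
+    osd op queue mclock recov wgt = 1.0
+    # client ops: (r:2, l:0, w:9)
+    osd op queue mclock client op res = 2.0
+    osd op queue mclock client op lim = 0.0
+    osd op queue mclock client op wgt = 9.0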
+
+``osd push per object cost``
+
+:Description: The overhead for serving a push op.
+
+:Type: Unsigned Integer
+:Default: 1000
+
+``osd recovery max chunk``
+
+:Description: The maximum total size of data chunks a recovery op can carry.
+
+:Type: Unsigned Integer
+:Default: 8 MiB
+
+
+``osd op queue mclock client op res``
+
+:Description: The reservation for client ops.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock client op wgt``
+
+:Description: The weight for client ops.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock client op lim``
+
+:Description: The limit for client ops.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop res``
+
+:Description: The reservation for osd subops.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop wgt``
+
+:Description: The weight for osd subops.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock osd subop lim``
+
+:Description: The limit for osd subops.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap res``
+
+:Description: The reservation for snap trimming.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap wgt``
+
+:Description: The weight for snap trimming.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock snap lim``
+
+:Description: The limit for snap trimming.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock recov res``
+
+:Description: The reservation for recovery.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock recov wgt``
+
+:Description: The weight for recovery.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock recov lim``
+
+:Description: The limit for recovery.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock scrub res``
+
+:Description: The reservation for scrub jobs.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock scrub wgt``
+
+:Description: The weight for scrub jobs.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock scrub lim``
+
+:Description: The limit for scrub jobs.
+
+:Type: Float
+:Default: 0.001
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+
+
.. index:: OSD; backfilling
Backfilling
``osd recovery max chunk``
:Description: The maximum size of a recovered chunk of data to push.
-:Type: 64-bit Integer Unsigned
+:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``
:Description: The maximum number of recovery operations per OSD that will be
newly started when an OSD is recovering.
-:Type: 64-bit Integer Unsigned
+:Type: 64-bit Unsigned Integer
:Default: ``1``
``osd default notify timeout``
:Description: The OSD default notification timeout (in seconds).
-:Type: 32-bit Integer Unsigned
+:Type: 32-bit Unsigned Integer
:Default: ``30``