``osd client message size cap``
:Description: The largest client data message allowed in memory.
-:Type: 64-bit Integer Unsigned
+:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``
token bucket system which, when there are sufficient tokens, will
dequeue high priority queues first. If there are not enough
tokens available, queues are dequeued from low priority to high priority.
- The new WeightedPriorityQueue (``wpq``) dequeues all priorities in
+ The WeightedPriorityQueue (``wpq``) dequeues all priorities in
relation to their priorities to prevent starvation of any queue.
WPQ should help in cases where a few OSDs are more overloaded
- than others. Requires a restart.
+ than others. The new mClock based OpClassQueue
+ (``mclock_opclass``) prioritizes operations based on which class
+ they belong to (recovery, scrub, snaptrim, client op, osd subop).
+ The mClock based ClientQueue (``mclock_client``) also
+ incorporates the client identifier in order to promote fairness
+ between clients. See `QoS Based on mClock`_. Requires a restart.
:Type: String
-:Valid Choices: prio, wpq
+:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``
:Type: 32-bit Integer
:Default: ``5``
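+
+For example, a minimal ``ceph.conf`` sketch selecting the operation queue
+implementation described above might look like the following (illustrative
+only; the OSDs must be restarted for the setting to take effect)::
+
+    [osd]
+    osd op queue = mclock_opclass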
+
+QoS Based on mClock
+-------------------
+
+The QoS support of Ceph is implemented using a queueing scheduler based on
+`the dmClock algorithm`_. This algorithm allocates the I/O resources of the
+Ceph cluster in proportion to weights, and enforces the constraints of minimum
+reservation and maximum limitation, so that the services can compete for the
+resources fairly. Currently the Ceph services involving I/O resources are
+categorized into the following buckets:
+
+- client op: the iops issued by a client
+- osd subop: the iops issued by a primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed (see the short illustration after this
+   list).
+
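+For instance, if one service has weight 9 and another has weight 1 (as in the
+example below), capacity left over once both reservations are satisfied is
+shared between them roughly in proportion 9:1, i.e. about 9/(9+1) = 90% and
+1/(9+1) = 10%, subject to each service's limit. This is a simplified reading
+of the weight tag, intended only as an illustration.
+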
+In Ceph, operations are graded with a "cost", and the resources allocated for
+serving the various services are consumed by these "costs". So, for example,
+the more reservation a service has, the more resources it is guaranteed to
+possess, as long as it requires them. Assume there are two services, recovery
+and client ops, with the following tags:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery never takes more than 5 units of
+resources, even if it requires that much and no other service is competing
+with it. Conversely, if clients start to issue a large number of I/O requests,
+they cannot exhaust all the I/O resources either: 1 unit of resources is
+always allocated for recovery jobs, so recovery will not be starved even in a
+cluster with high load. In the meantime, client ops can enjoy a larger portion
+of the I/O resources, because their weight is "9" while their competitor's is
+"1". Client ops are not clamped by their limit setting in this example (a
+limit of 0 imposes no upper bound), so they can make use of all the resources
+if there is no recovery ongoing.
+
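+As a concrete sketch, the example tags above could be expressed with the
+``osd op queue mclock *`` options documented below (the numbers are the
+abstract units from the example, not recommended production values)::
+
+    [osd]
+    osd op queue = mclock_opclass
+    # recovery: (r:1, l:5, w:1)
+    osd op queue mclock recov res = 1.0
+    osd op queue mclock recov lim = 5.0
+    osd op queue mclock recov wgt = 1.0
+    # client ops: (r:2, l:0, w:9)
+    osd op queue mclock client op res = 2.0
+    osd op queue mclock client op lim = 0.0
+    osd op queue mclock client op wgt = 9.0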
+
+``osd push per object cost``
+
+:Description: The overhead for serving a push op.
+
+:Type: Unsigned Integer
+:Default: 1000
+
+``osd recovery max chunk``
+
+:Description: The maximum total size of data chunks a recovery op can carry.
+
+:Type: Unsigned Integer
+:Default: 8 MiB
+
+
+``osd op queue mclock client op res``
+
+:Description: The reservation for client ops.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock client op wgt``
+
+:Description: The weight for client ops.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock client op lim``
+
+:Description: The limit for client ops.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop res``
+
+:Description: The reservation for osd subops.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop wgt``
+
+:Description: The weight for osd subops.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock osd subop lim``
+
+:Description: The limit for osd subops.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap res``
+
+:Description: The reservation for snap trimming.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap wgt``
+
+:Description: The weight for snap trimming.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock snap lim``
+
+:Description: The limit for snap trimming.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock recov res``
+
+:Description: The reservation for recovery.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock recov wgt``
+
+:Description: The weight for recovery.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock recov lim``
+
+:Description: The limit for recovery.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock scrub res``
+
+:Description: The reservation for scrub jobs.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock scrub wgt``
+
+:Description: The weight for scrub jobs.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock scrub lim``
+
+:Description: The limit for scrub jobs.
+
+:Type: Float
+:Default: 0.001
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+
+
.. index:: OSD; backfilling
Backfilling
``osd recovery max chunk``
:Description: The maximum size of a recovered chunk of data to push.
-:Type: 64-bit Integer Unsigned
+:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``
:Description: The maximum number of recovery operations per OSD that will be
newly started when an OSD is recovering.
-:Type: 64-bit Integer Unsigned
+:Type: 64-bit Unsigned Integer
:Default: ``1``
``osd default notify timeout``
:Description: The OSD default notification timeout (in seconds).
-:Type: 32-bit Integer Unsigned
+:Type: 32-bit Unsigned Integer
:Default: ``30``