From 81d3def3c26c32d4bce2bb6eed0436676d175986 Mon Sep 17 00:00:00 2001
From: "J. Eric Ivancich"
Date: Mon, 31 Jul 2017 13:18:02 -0400
Subject: [PATCH] doc: Describe mClock's use within Ceph in great detail.

It seems many are initially unclear as to how the current
implementations of the mclock op queues work, so we need to document
them to avoid confusion.

Signed-off-by: J. Eric Ivancich
---
 doc/rados/configuration/osd-config-ref.rst | 126 +++++++++++++++++----
 1 file changed, 106 insertions(+), 20 deletions(-)

diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst
index 03d6e2d30b1..30e679fd273 100644
--- a/doc/rados/configuration/osd-config-ref.rst
+++ b/doc/rados/configuration/osd-config-ref.rst
@@ -532,12 +532,19 @@ recovery operations to ensure optimal performance during recovery.
 QoS Based on mClock
 -------------------
 
-The QoS support of Ceph is implemented using a queueing scheduler based on
-`the dmClock algorithm`_. This algorithm allocates the I/O resources of the
-Ceph cluster in proportion to weights, and enforces the constraits of minimum
-reservation and maximum limitation, so that the services can compete for the
-resources fairly. Currently, the Ceph services involving I/O resources are
-categorized into following buckets:
+Ceph's use of mClock is currently in the experimental phase and should
+be approached with an exploratory mindset.
+
+Core Concepts
+`````````````
+
+The QoS support of Ceph is implemented using a queueing scheduler
+based on `the dmClock algorithm`_. This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraints of minimum reservation and maximum limitation, so
+that the services can compete for the resources fairly. Currently,
+the *mclock_opclass* operation queue divides the Ceph services
+involving I/O resources into the following buckets:
 
 - client op: the iops issued by client
 - osd subop: the iops issued by primary OSD
@@ -553,24 +560,103 @@ words, the share of each type of service is controlled by three tags:
 #. weight: the proportional share of capacity if extra capacity or system
    oversubscribed.
 
-In Ceph, operations are graded with "cost". And the resources allocated for
-serving various services are consumed by these "costs". So, for example, the
-more reservation a services has, the more resource it is guaranteed to possess,
-as long as it requires. Assuming there are 2 services: recovery and client ops:
+In Ceph, operations are graded with a "cost", and the resources
+allocated for serving the various services are consumed by these
+costs. So, for example, the more reservation a service has, the more
+resources it is guaranteed to possess, as long as it requires them.
+Assume there are 2 services, recovery and client ops, with the
+following tags (a configuration sketch follows the list):
 
 - recovery: (r:1, l:5, w:1)
 - client ops: (r:2, l:0, w:9)
 
-The settings above ensure that the recovery won't take never more than 5 units
-of resources even if it requires so, and no other services are competing with
-it. But if the clients start to issue large amount of I/O requests, neither
-will they exhaust all the I/O resources. 1 unit of resources is always
-allocated for recovery jobs. So the recovery jobs won't be starved even in
-a cluster with high load. And in the meantime, the client ops can enjoy
-a larger portion of the I/O resource, because its weight is "9", while its
-competitor "1". In the case of client ops, it is not clamped by the limit
-setting, so it can make use of all the resources if there is no recovery
-ongoing.
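+Expressed in terms of the experimental *mclock_opclass* configuration
+options, the example tag values above might be written roughly as in
+the following sketch. The option names are assumptions drawn from the
+current experimental implementation and may change; the values simply
+restate the example and are not recommendations::
+
+    [osd]
+    osd_op_queue = mclock_opclass
+
+    # recovery: reservation 1, limit 5, weight 1
+    osd_op_queue_mclock_recov_res = 1.0
+    osd_op_queue_mclock_recov_lim = 5.0
+    osd_op_queue_mclock_recov_wgt = 1.0
+
+    # client ops: reservation 2, no limit, weight 9
+    osd_op_queue_mclock_client_op_res = 2.0
+    osd_op_queue_mclock_client_op_lim = 0.0
+    osd_op_queue_mclock_client_op_wgt = 9.0
+
+A limit of 0 here mirrors the client ops entry in the list above,
+which is not clamped by a limit.
+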
+The settings above ensure that recovery won't get more than 5 requests
+per second serviced, even if it asks for more and no other service is
+competing with it (see the CURRENT IMPLEMENTATION NOTE below). But if
+clients start to issue a large number of I/O requests, they won't
+exhaust all the I/O resources either. One request per second is
+always allocated to recovery jobs as long as there are any such
+requests. So recovery jobs won't be starved even in a cluster under
+high load. Meanwhile, client ops can enjoy a larger portion of the
+I/O resources, because their weight is "9" while their competitor's
+is "1". Client ops are not clamped by their limit setting, so they
+can make use of all the resources if no recovery is ongoing.
+
+Along with *mclock_opclass*, another mclock operation queue named
+*mclock_client* is available. It divides operations not only by
+category but also by the client making the request. This helps not
+only to manage the distribution of resources spent on the different
+classes of operations but also to ensure fairness among clients.
+
+CURRENT IMPLEMENTATION NOTE: the current experimental implementation
+does not enforce the limit values. As a first approximation, we
+decided not to prevent operations that would otherwise enter the
+operation sequencer from doing so.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per second.
+The weight, however, does not technically have a unit; the weights
+are relative to one another. So if one class of requests has a weight
+of 1 and another a weight of 9, then requests in the second class
+should be executed at a 9 to 1 ratio relative to the first class.
+However, that ratio holds only once the reservations are met, and the
+counts include the operations executed during the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due to how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means that if *W* is sufficiently large, and therefore *1/W*
+sufficiently small, the calculated tag may never be used, because it
+will always be superseded by the current time. The ultimate lesson is
+that values for weight should not be too large. They should be kept
+under the number of requests one expects to be serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op
+queues within Ceph. First, requests to an OSD are sharded by their
+placement group identifier. Each shard has its own mClock queue, and
+these queues neither interact nor share information with each other.
+The number of shards can be controlled with the configuration options
+``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
+``osd_op_num_shards_ssd``. A lower number of shards will increase the
+impact of the mClock queues, but may have other deleterious effects.
+
+Second, requests are transferred from the operation queue to the
+operation sequencer, in which they go through the phases of
+execution. The operation queue is where mClock resides and where
+mClock determines the next op to transfer to the operation sequencer.
+The number of operations allowed in the operation sequencer is a
+complex issue. In general, we want to keep enough operations in the
+sequencer that it is always getting work done on some operations
+while it is waiting for disk and network access to complete on other
+operations. On the other hand, once an operation is transferred to
+the operation sequencer, mClock no longer has control over it.
+Therefore, to maximize the impact of mClock, we want to keep as few
+operations in the operation sequencer as possible. So we have an
+inherent tension.
+
+The configuration options that influence the number of operations in
+the operation sequencer are ``bluestore_throttle_bytes``,
+``bluestore_throttle_deferred_bytes``,
+``bluestore_throttle_cost_per_io``,
+``bluestore_throttle_cost_per_io_hdd``, and
+``bluestore_throttle_cost_per_io_ssd``.
+
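+For experimentation, one might lower both the shard count and the
+BlueStore throttles to increase the influence of the mClock queues,
+as in the following ``ceph.conf`` sketch. The values are purely
+illustrative assumptions, not recommendations::
+
+    [osd]
+    # fewer shards means more requests meet in a single mClock queue
+    # (see the caveat above about possible deleterious effects)
+    osd_op_num_shards = 1
+
+    # smaller throttles keep fewer operations in the operation
+    # sequencer, leaving more of them subject to mClock ordering
+    bluestore_throttle_bytes = 32768
+    bluestore_throttle_deferred_bytes = 32768
+    bluestore_throttle_cost_per_io = 4
+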
+A third factor that affects the impact of the mClock algorithm is
+that we're using a distributed system, where requests are made to
+multiple OSDs and each OSD has (or can have) multiple shards. Yet
+we're currently using the mClock algorithm, which is not distributed
+(note: dmClock is the distributed version of mClock).
+
+Various organizations and individuals are currently experimenting
+with mClock as it exists in this code base, along with their own
+modifications to the code base. We hope you'll share your experiences
+with your mClock and dmClock experiments on the ceph-devel mailing
+list.
 
 ``osd push per object cost``

-- 
2.39.5