From 8eb68353963e0ab16f4a3e0afb9d250e4bf7fb08 Mon Sep 17 00:00:00 2001
From: Samuel Just <sjust@redhat.com>
Date: Thu, 1 Feb 2024 11:38:13 -0800
Subject: [PATCH] doc/dev/crush-msr.rst: add developer summary of crush msr

Signed-off-by: Samuel Just <sjust@redhat.com>
---
 doc/dev/crush-msr.rst | 140 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 doc/dev/crush-msr.rst

diff --git a/doc/dev/crush-msr.rst b/doc/dev/crush-msr.rst
new file mode 100644
index 00000000000..a18b56740c2
--- /dev/null
+++ b/doc/dev/crush-msr.rst
@@ -0,0 +1,140 @@
+============================
+CRUSH MSR (Multi-step Retry)
+============================
+
+Motivation
+----------
+
+Conventional CRUSH has an important limitation: rules with
+multiple `choose` steps which hit an `out` osd cannot retry
+prior steps.  As an example, with a rule like
+::
+
+    rule replicated_rule_1 {
+        ...
+        step take default class hdd
+        step chooseleaf firstn 3 type host
+        step emit
+    }
+
+one might expect that if all of the OSDs on a particular host
+are marked out, mappings including those OSDs would end up
+on another host (provided that there are enough hosts).  Indeed,
+that's what will happen.  Moreover, if 1/8 OSDs on a host are
+marked out, roughly 1/8 of the PGs mapped to that host will end
+up remapped to some other host keeping overall per-OSD utilization
+even.
+
+Suppose, instead, the rule were written like this:
+::
+
+    rule replicated_rule_1 {
+        ...
+        step take default class hdd
+        step choose firstn 3 type host
+        step choose firstn 1 type osd
+        step emit
+    }
+
+The behavior would be very similar as long as no OSDs are marked
+out.  However, if an OSD is marked out, any PGs mapped to that
+OSD will be remapped to other OSDs on the same host resulting in
+those OSDs being over-utilized relative to OSDs on other hosts.
+Moreover, if all of the OSDs on a host are marked out, mappings
+that happen to hit that host will fail resulting in undersized PGs.
+
+As long as the goal is to split N OSDs between N failure domains,
+the solution is simply to use the `chooseleaf` variant above.  However,
+consider a use case where we want to split an 8+6 EC encoding over 4
+hosts in order to tolerate the loss of a host and an OSD on another
+host with 1.75x storage overhead.  The rule would have to look
+something like:
+::
+
+    rule ecpool-86 {
+        ...
+        step take default class hdd
+        step choose indep 4 type host
+        step choose indep 4 type osd
+        step emit
+    }
+
+This does split up to 16 OSDs between 4 hosts (with an 8+6 code,
+it would put 4 OSDs on each of the first 3 and 2 on the last) and
+meets our failure requirements.  However, for the reasons outlined
+above, it will behave poorly as OSDs are marked out if there are
+other hosts to rebalance to.  `chooseleaf` is not a solution here
+because it does not support mapping more than one leaf below the
+specified type.
+
+MSR
+---
+
+CRUSH MSR (Multi-step Retry) rules solve the above problem by using a
+different descent algorithm which retries all of the steps upon
+hitting an out OSD.  Where classic CRUSH is breadth first (for each
+step, it fully populates the vector before proceeding to the next
+step), MSR rules are depth first -- for each choice, we recursively
+descend through all of the steps before continuing with the next
+choice.  The above use case can be satisfied with the following rule:
+
+::
+
+    rule ecpool-86 {
+        type msr_indep
+        ...
+        step take default class hdd
+        step choosemsr 4 type host
+        step choosemsr 4 type osd
+        step emit
+    }
+
+As with the `chooseleaf` example at the top, as OSDs are marked out,
+those OSDs are be remapped proportionately to other hosts so long as
+there are extras available.  For details on how that works while
+still preserving failure domain isolation, see the comments in
+mapper.c:crush_msr_choose.
+
+Rule Structure
+--------------
+
+CRUSH MSR rules are crush rules with type CRUSH_RULE_TYPE_MSR_FIRSTN
+or CRUSH_RULE_TYPE_MSR_INDEP (see mapper.c: rule_type_is_msr).  Unlike
+with classic crush rules, individual steps do not specify firstn or
+indep.  The output order is instead defined by the rule type for the
+whole rule.
+
+MSR rules have some structural differences from conventional rules:
+
+- The rule type determines whether the mapping is FIRSTN or INDEP.
+  Because the descent can retry steps, it doesn't really make sense
+  for steps to individually specify output order and I'm not really
+  aware of any use cases that would benefit from it.
+- MSR rules *must* be structured as a (possibly empty) prefix of
+  config steps (CRUSH_RULE_SET_CHOOSE_MSR*) followed by a sequence of
+  EMIT blocks each comprised of a TAKE step, a sequence of CHOOSE_MSR
+  steps, and ended by an EMIT step.
+- MSR steps must be `choosemsr`.  `choose` and `chooseleaf` are not
+  permitted.
+
+Working Space
+-------------
+
+MSR rules also have different requirements for working space.
+Conventional CRUSH requires 3 vectors of size result_max to use for
+working space -- two to alternate as it processes each rule and one,
+additionally, for `chooseleaf`.  MSR rules need N vectors where N is the
+number of `choosemsr` steps in the longest EMIT block since it needs to
+retain all of the choices made as part of each descent.
+
+See mapper.h/c:crush_work_size, crush_msr_scan_rule for details.
+
+Implementation
+--------------
+
+mapper.h/c:crush_do_rule internally branches to
+mapper.c:crush_msr_do_rule for rules of type CRUSH_RULE_TYPE_MSR_*
+(see mapper.c:rule_type_is_msr).
+
+MSR related functions in mapper.c are annotated with more details
+about the algorithm.
-- 
2.39.5