--- /dev/null
+============================
+CRUSH MSR (Multi-step Retry)
+============================
+
+Motivation
+----------
+
+Conventional CRUSH has an important limitation: rules with
+multiple `choose` steps cannot retry earlier steps when a later
+step hits an `out` OSD. As an example, with a rule like
+::
+
+ rule replicated_rule_1 {
+     ...
+     step take default class hdd
+     step chooseleaf firstn 3 type host
+     step emit
+ }
+
+one might expect that if all of the OSDs on a particular host
+are marked out, mappings including those OSDs would end up
+on another host (provided that there are enough hosts). Indeed,
+that's what will happen. Moreover, if 1/8 of the OSDs on a host are
+marked out, roughly 1/8 of the PGs mapped to that host will end
+up remapped to other hosts, keeping overall per-OSD utilization
+even.
+
+Suppose, instead, the rule were written like this:
+::
+
+ rule replicated_rule_1 {
+     ...
+     step take default class hdd
+     step choose firstn 3 type host
+     step choose firstn 1 type osd
+     step emit
+ }
+
+The behavior would be very similar as long as no OSDs are marked
+out. However, if an OSD is marked out, any PGs mapped to that
+OSD will be remapped to other OSDs on the same host, resulting in
+those OSDs being over-utilized relative to OSDs on other hosts.
+Moreover, if all of the OSDs on a host are marked out, mappings
+that happen to hit that host will fail, resulting in undersized PGs.
+
+As long as the goal is to split N OSDs between N failure domains,
+the solution is simply to use the `chooseleaf` variant above. However,
+consider a use case where we want to split an 8+6 EC encoding over 4
+hosts in order to tolerate the loss of a host and an OSD on another
+host with 1.75x storage overhead. The rule would have to look
+something like:
+::
+
+ rule ecpool-86 {
+     ...
+     step take default class hdd
+     step choose indep 4 type host
+     step choose indep 4 type osd
+     step emit
+ }
+
+This does split up to 16 OSDs between 4 hosts (with an 8+6 code,
+it would put 4 OSDs on each of the first 3 hosts and 2 on the last)
+and meets our failure requirements. However, for the reasons outlined
+above, it will behave poorly as OSDs are marked out, even when there
+are other hosts available to rebalance to. `chooseleaf` is not a
+solution here because it cannot map more than one leaf under each
+bucket of the specified type.
+
+MSR
+---
+
+CRUSH MSR (Multi-step Retry) rules solve the above problem by using a
+different descent algorithm which retries all of the steps upon
+hitting an out OSD. Where classic CRUSH is breadth first (for each
+step, it fully populates the vector before proceeding to the next
+step), MSR rules are depth first -- for each choice, we recursively
+descend through all of the steps before continuing with the next
+choice. The above use case can be satisfied with the following rule:
+
+::
+
+ rule ecpool-86 {
+     type msr_indep
+     ...
+     step take default class hdd
+     step choosemsr 4 type host
+     step choosemsr 4 type osd
+     step emit
+ }
+
+As with the `chooseleaf` example at the top, as OSDs are marked out,
+mappings that include them are remapped proportionately to other hosts
+so long as there are extra hosts available. For details on how that
+works while still preserving failure domain isolation, see the
+comments in mapper.c:crush_msr_choose.
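+
+As a rough illustration of the difference in descent order, consider
+the following toy C program. It is not the mapper.c implementation:
+the pick_host/pick_osd helpers are hypothetical stand-ins for the
+CRUSH hash, and bucket weights, collision handling, and failure-domain
+deduplication are all omitted. It only shows why retrying from the
+host step lets a result slot escape a host whose OSDs are all out,
+while retrying only the inner OSD step cannot.
+
+::
+
+ #include <stdbool.h>
+ #include <stdio.h>
+
+ #define HOSTS 4
+ #define OSDS_PER_HOST 2
+
+ /* out[h][o] marks OSD o on host h as marked out. */
+ static bool out[HOSTS][OSDS_PER_HOST];
+
+ /* Hypothetical stand-ins for the CRUSH hash of (pg, attempt). */
+ static int pick_host(int attempt) { return attempt % HOSTS; }
+ static int pick_osd(int attempt) { return attempt % OSDS_PER_HOST; }
+
+ /* Classic flavour: the host step has already committed to a host, so
+  * retries can only pick a different OSD under that same host.  If
+  * every OSD on the host is out, the slot cannot move elsewhere. */
+ static int classic_choose(int slot)
+ {
+     int host = pick_host(slot);
+     for (int retry = 0; retry < OSDS_PER_HOST; retry++) {
+         int osd = pick_osd(slot + retry);
+         if (!out[host][osd])
+             return host * OSDS_PER_HOST + osd;
+     }
+     return -1; /* mapping failure: undersized PG */
+ }
+
+ /* MSR flavour: on hitting an out OSD, retry the whole descent from
+  * the host step, so the slot can migrate to a different host. */
+ static int msr_choose(int slot)
+ {
+     for (int retry = 0; retry < HOSTS * OSDS_PER_HOST; retry++) {
+         int host = pick_host(slot + retry);
+         int osd = pick_osd(slot + retry);
+         if (!out[host][osd])
+             return host * OSDS_PER_HOST + osd;
+     }
+     return -1;
+ }
+
+ int main(void)
+ {
+     out[1][0] = out[1][1] = true; /* mark every OSD on host 1 out */
+     printf("classic: slot 1 -> %d\n", classic_choose(1));
+     printf("msr:     slot 1 -> %d\n", msr_choose(1));
+     return 0;
+ }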
+
+Rule Structure
+--------------
+
+CRUSH MSR rules are CRUSH rules with type CRUSH_RULE_TYPE_MSR_FIRSTN
+or CRUSH_RULE_TYPE_MSR_INDEP (see mapper.c:rule_type_is_msr). Unlike
+with classic CRUSH rules, individual steps do not specify firstn or
+indep. The output order is instead defined by the rule type for the
+whole rule.
+
+MSR rules have some structural differences from conventional rules:
+
+- The rule type determines whether the mapping is FIRSTN or INDEP.
+  Because the descent can retry steps, it does not make sense for
+  individual steps to specify their own output order, and we are not
+  aware of any use case that would benefit from doing so.
+- MSR rules *must* be structured as a (possibly empty) prefix of
+  config steps (CRUSH_RULE_SET_CHOOSE_MSR*) followed by a sequence of
+  EMIT blocks, each comprising a TAKE step, a sequence of CHOOSE_MSR
+  steps, and a final EMIT step (see the sketch after this list).
+- MSR steps must be `choosemsr`. `choose` and `chooseleaf` are not
+  permitted.
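+
+To make the required shape concrete, the following is a hedged sketch
+of such a structural check over a simplified step encoding. The
+step_op enum and msr_rule_shape_ok are illustrative names only, not
+the crush.h opcodes or the actual validation logic; see
+mapper.c:crush_msr_scan_rule and mapper.c:rule_type_is_msr for the
+real thing.
+
+::
+
+ #include <stdbool.h>
+ #include <stddef.h>
+
+ /* Illustrative stand-ins for the crush.h step opcodes. */
+ enum step_op { OP_SET_MSR_CONFIG, OP_TAKE, OP_CHOOSE_MSR, OP_EMIT };
+
+ /* Accepts: [config]* ( TAKE CHOOSE_MSR+ EMIT )+ */
+ static bool msr_rule_shape_ok(const enum step_op *steps, size_t len)
+ {
+     size_t i = 0;
+     while (i < len && steps[i] == OP_SET_MSR_CONFIG)
+         i++;                      /* optional config prefix */
+     if (i == len)
+         return false;             /* at least one EMIT block required */
+     while (i < len) {
+         if (steps[i++] != OP_TAKE)
+             return false;         /* each block starts with TAKE */
+         size_t nchoose = 0;
+         while (i < len && steps[i] == OP_CHOOSE_MSR) {
+             i++;
+             nchoose++;
+         }
+         if (nchoose == 0 || i >= len || steps[i++] != OP_EMIT)
+             return false;         /* then CHOOSE_MSR+ and a closing EMIT */
+     }
+     return true;
+ }
+
+ int main(void)
+ {
+     /* Shape of the ecpool-86 example above: take, choosemsr, choosemsr, emit. */
+     const enum step_op ec[] = { OP_TAKE, OP_CHOOSE_MSR, OP_CHOOSE_MSR, OP_EMIT };
+     return msr_rule_shape_ok(ec, sizeof(ec) / sizeof(ec[0])) ? 0 : 1;
+ }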
+
+Working Space
+-------------
+
+MSR rules also have different working space requirements.
+Conventional CRUSH requires 3 vectors of size result_max for working
+space -- two to alternate between as it processes each rule and,
+additionally, one for `chooseleaf`. MSR rules need N vectors, where N
+is the number of `choosemsr` steps in the longest EMIT block, since
+they need to retain all of the choices made as part of each descent.
+
+See mapper.h/c:crush_work_size, crush_msr_scan_rule for details.
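+
+As a hedged illustration of that sizing rule (not the actual
+crush_work_size logic), the following sketch counts the `choosemsr`
+steps in the longest EMIT block, using the same illustrative step
+encoding as the sketch above:
+
+::
+
+ #include <stddef.h>
+
+ /* Illustrative stand-ins for the crush.h step opcodes (as above). */
+ enum step_op { OP_SET_MSR_CONFIG, OP_TAKE, OP_CHOOSE_MSR, OP_EMIT };
+
+ /* N = choosemsr steps in the longest EMIT block: each descent keeps
+  * one working vector per choosemsr step so that every choice made on
+  * the way down can be revisited on retry. */
+ static size_t msr_working_vectors(const enum step_op *steps, size_t len)
+ {
+     size_t max_choose = 0, cur = 0;
+     for (size_t i = 0; i < len; i++) {
+         if (steps[i] == OP_CHOOSE_MSR) {
+             if (++cur > max_choose)
+                 max_choose = cur;
+         } else if (steps[i] == OP_EMIT) {
+             cur = 0;              /* the next EMIT block counts afresh */
+         }
+     }
+     return max_choose;
+ }
+
+ int main(void)
+ {
+     /* ecpool-86 above: take, choosemsr host, choosemsr osd, emit => 2. */
+     const enum step_op ec[] = { OP_TAKE, OP_CHOOSE_MSR, OP_CHOOSE_MSR, OP_EMIT };
+     return msr_working_vectors(ec, 4) == 2 ? 0 : 1;
+ }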
+
+Implementation
+--------------
+
+mapper.h/c:crush_do_rule internally branches to
+mapper.c:crush_msr_do_rule for rules of type CRUSH_RULE_TYPE_MSR_*
+(see mapper.c:rule_type_is_msr).
+
+The MSR-related functions in mapper.c are annotated with more details
+about the algorithm.