--- /dev/null
+Manually editing a CRUSH Map
+============================
+
+.. note:: Manually editing the CRUSH map is considered an advanced
+ administrator operation. All CRUSH changes that are
+ necessary for the overwhelming majority of installations are
+ possible via the standard ceph CLI and do not require manual
+ CRUSH map edits. If you have identified a use case where
+ manual edits *are* necessary, consider contacting the Ceph
+ developers so that future versions of Ceph can make this
+ unnecessary.
+
+To edit an existing CRUSH map:
+
+#. `Get the CRUSH map`_.
+#. `Decompile`_ the CRUSH map.
+#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
+#. `Recompile`_ the CRUSH map.
+#. `Set the CRUSH map`_.
+
+To activate CRUSH map rules for a specific pool, identify the common ruleset
+number for those rules and specify that ruleset number for the pool. See `Set
+Pool Values`_ for details.
+
+.. _Get the CRUSH map: #getcrushmap
+.. _Decompile: #decompilecrushmap
+.. _Devices: #crushmapdevices
+.. _Buckets: #crushmapbuckets
+.. _Rules: #crushmaprules
+.. _Recompile: #compilecrushmap
+.. _Set the CRUSH map: #setcrushmap
+.. _Set Pool Values: ../pools#setpoolvalues
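+
+For example, a typical editing session might look like the following
+(the file names used here are arbitrary)::
+
+ ceph osd getcrushmap -o crushmap.bin
+ crushtool -d crushmap.bin -o crushmap.txt
+ # edit crushmap.txt with a text editor
+ crushtool -c crushmap.txt -o crushmap-new.bin
+ ceph osd setcrushmap -i crushmap-new.bin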
+
+.. _getcrushmap:
+
+Get a CRUSH Map
+---------------
+
+To get the CRUSH map for your cluster, execute the following::
+
+ ceph osd getcrushmap -o {compiled-crushmap-filename}
+
+Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
+the CRUSH map is in a compiled form, you must decompile it before you can
+edit it.
+
+.. _decompilecrushmap:
+
+Decompile a CRUSH Map
+---------------------
+
+To decompile a CRUSH map, execute the following::
+
+ crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
+
+
+Sections
+--------
+
+There are six main sections to a CRUSH Map.
+
+#. **tunables:** The preamble at the top of the map describes any *tunables*
+ for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
+ correct for old bugs, optimizations, or other changes in behavior that have
+ been made over the years to improve CRUSH's behavior.
+
+#. **devices:** Devices are individual ``ceph-osd`` daemons that can
+ store data.
+
+#. **types**: Bucket ``types`` define the types of buckets used in
+ your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
+ of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
+ their assigned weights.
+
+#. **buckets:** Once you define bucket types, you must define each node
+ in the hierarchy, its type, and which devices or other nodes it
+ contains.
+
+#. **rules:** Rules define policy about how data is distributed across
+ devices in the hierarchy.
+
+#. **choose_args:** Choose_args are alternative weights associated with
+ the hierarchy that have been adjusted to optimize data placement. A single
+ choose_args map can be used for the entire cluster, or one can be
+ created for each individual pool.
+
+
+.. _crushmapdevices:
+
+CRUSH Map Devices
+-----------------
+
+Devices are individual ``ceph-osd`` daemons that can store data. You
+will normally have one defined here for each OSD daemon in your
+cluster. Devices are identified by an id (a non-negative integer) and
+a name, normally ``osd.N`` where ``N`` is the device id.
+
+Devices may also have a *device class* associated with them (e.g.,
+``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
+crush rule.
+
+::
+
+ # devices
+ device {num} {osd.name} [class {class}]
+
+For example::
+
+ # devices
+ device 0 osd.0 class ssd
+ device 1 osd.1 class hdd
+ device 2 osd.2
+ device 3 osd.3
+
+In most cases, each device maps to a single ``ceph-osd`` daemon. This
+is normally a single storage device, a pair of devices (for example,
+one for data and one for a journal or metadata), or in some cases a
+small RAID device.
+
+
+
+
+
+CRUSH Map Bucket Types
+----------------------
+
+The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
+a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
+physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
+Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
+media.
+
+.. tip:: The term "bucket" used in the context of CRUSH means a node in
+ the hierarchy, i.e. a location or a piece of physical hardware. It
+ is a different concept from the term "bucket" when used in the
+ context of RADOS Gateway APIs.
+
+To add a bucket type to the CRUSH map, create a new line under your list of
+bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
+By convention, there is one leaf bucket and it is ``type 0``; however, you may
+give it any name you like (e.g., osd, disk, drive, storage, etc.)::
+
+ #types
+ type {num} {bucket-name}
+
+For example::
+
+ # types
+ type 0 osd
+ type 1 host
+ type 2 chassis
+ type 3 rack
+ type 4 row
+ type 5 pdu
+ type 6 pod
+ type 7 room
+ type 8 datacenter
+ type 9 region
+ type 10 root
+
+
+
+.. _crushmapbuckets:
+
+CRUSH Map Bucket Hierarchy
+--------------------------
+
+The CRUSH algorithm distributes data objects among storage devices according
+to a per-device weight value, approximating a uniform probability distribution.
+CRUSH distributes objects and their replicas according to the hierarchical
+cluster map you define. Your CRUSH map represents the available storage
+devices and the logical elements that contain them.
+
+To map placement groups to OSDs across failure domains, a CRUSH map defines a
+hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
+map). The purpose of creating a bucket hierarchy is to segregate the
+leaf nodes by their failure domains, such as hosts, chassis, racks, power
+distribution units, pods, rows, rooms, and data centers. With the exception of
+the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
+you may define it according to your own needs.
+
+We recommend adapting your CRUSH map to your firm's hardware naming conventions
+and using instance names that reflect the physical hardware. Your naming
+practice can make it easier to administer the cluster and troubleshoot
+problems when an OSD and/or other hardware malfunctions and the administrator
+needs access to physical hardware.
+
+In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
+and two node buckets named ``host`` and ``rack`` respectively.
+
+.. ditaa::
+ +-----------+
+ | {o}rack |
+ | Bucket |
+ +-----+-----+
+ |
+ +---------------+---------------+
+ | |
+ +-----+-----+ +-----+-----+
+ | {o}host | | {o}host |
+ | Bucket | | Bucket |
+ +-----+-----+ +-----+-----+
+ | |
+ +-------+-------+ +-------+-------+
+ | | | |
+ +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
+ | osd | | osd | | osd | | osd |
+ | Bucket | | Bucket | | Bucket | | Bucket |
+ +-----------+ +-----------+ +-----------+ +-----------+
+
+.. note:: The higher numbered ``rack`` bucket type aggregates the lower
+ numbered ``host`` bucket type.
+
+Since leaf nodes reflect storage devices declared under the ``#devices`` list
+at the beginning of the CRUSH map, you do not need to declare them as bucket
+instances. The second lowest bucket type in your hierarchy usually aggregates
+the devices (i.e., it's usually the computer containing the storage media, and
+uses whatever term you prefer to describe it, such as "node", "computer",
+"server," "host", "machine", etc.). In high density environments, it is
+increasingly common to see multiple hosts/nodes per chassis. You should account
+for chassis failure too--e.g., the need to pull a chassis if a node fails may
+result in bringing down numerous hosts/nodes and their OSDs.
+
+When declaring a bucket instance, you must specify its type, give it a unique
+name (string), assign it a unique ID expressed as a negative integer (optional),
+specify a weight relative to the total capacity/capability of its item(s),
+specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
+reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
+The items may consist of node buckets or leaves. Items may have a weight that
+reflects the relative weight of the item.
+
+You may declare a node bucket with the following syntax::
+
+ [bucket-type] [bucket-name] {
+ id [a unique negative numeric ID]
+ weight [the relative capacity/capability of the item(s)]
+ alg [the bucket algorithm: uniform | list | tree | straw ]
+ hash [the hash type: 0 by default]
+ item [item-name] weight [weight]
+ }
+
+For example, using the diagram above, we would define two host buckets
+and one rack bucket. The OSDs are declared as items within the host buckets::
+
+ host node1 {
+ id -1
+ alg straw
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host node2 {
+ id -2
+ alg straw
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ rack rack1 {
+ id -3
+ alg straw
+ hash 0
+ item node1 weight 2.00
+ item node2 weight 2.00
+ }
+
+.. note:: In the foregoing example, note that the rack bucket does not contain
+ any OSDs. Rather it contains lower level host buckets, and includes the
+ sum total of their weight in the item entry.
+
+.. topic:: Bucket Types
+
+ Ceph supports four bucket types, each representing a tradeoff between
+ performance and reorganization efficiency. If you are unsure of which bucket
+ type to use, we recommend using a ``straw`` bucket. For a detailed
+ discussion of bucket types, refer to
+ `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
+ and more specifically to **Section 3.4**. The bucket types are:
+
+ #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
+ weight. For example, when firms commission or decommission hardware, they
+ typically do so with many machines that have exactly the same physical
+ configuration (e.g., bulk purchases). When storage devices have exactly
+ the same weight, you may use the ``uniform`` bucket type, which allows
+ CRUSH to map replicas into uniform buckets in constant time. With
+ non-uniform weights, you should use another bucket algorithm.
+
+ #. **List**: List buckets aggregate their content as linked lists. Based on
+ the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
+ a list is a natural and intuitive choice for an **expanding cluster**:
+ either an object is relocated to the newest device with some appropriate
+ probability, or it remains on the older devices as before. The result is
+ optimal data migration when items are added to the bucket. Items removed
+ from the middle or tail of the list, however, can result in a significant
+ amount of unnecessary movement, making list buckets most suitable for
+ circumstances in which they **never (or very rarely) shrink**.
+
+ #. **Tree**: Tree buckets use a binary search tree. They are more efficient
+ than list buckets when a bucket contains a larger set of items. Based on
+ the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
+ tree buckets reduce the placement time to O(log :sub:`n`), making them
+ suitable for managing much larger sets of devices or nested buckets.
+
+ #. **Straw:** List and Tree buckets use a divide and conquer strategy
+ in a way that either gives certain items precedence (e.g., those
+ at the beginning of a list) or obviates the need to consider entire
+ subtrees of items at all. That improves the performance of the replica
+ placement process, but can also introduce suboptimal reorganization
+ behavior when the contents of a bucket change due to an addition, removal,
+ or re-weighting of an item. The straw bucket type allows all items to
+ fairly “compete” against each other for replica placement through a
+ process analogous to a draw of straws.
+
+.. topic:: Hash
+
+ Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
+ Enter ``0`` as your hash setting to select ``rjenkins1``.
+
+
+.. _weightingbucketitems:
+
+.. topic:: Weighting Bucket Items
+
+ Ceph expresses bucket weights as doubles, which allows for fine
+ weighting. A weight is the relative difference between device capacities. We
+ recommend using ``1.00`` as the relative weight for a 1TB storage device.
+ In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
+ and a weight of ``3.00`` would represent approximately 3TB. Higher level
+ buckets have a weight that is the sum total of the leaf items aggregated by
+ the bucket.
+
+ A bucket item weight is one dimensional, but you may also calculate your
+ item weights to reflect the performance of the storage drive. For example,
+ if you have many 1TB drives where some have relatively low data transfer
+ rate and the others have a relatively high data transfer rate, you may
+ weight them differently, even though they have the same capacity (e.g.,
+ a weight of 0.80 for the first set of drives with lower total throughput,
+ and 1.20 for the second set of drives with higher total throughput).
+
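+For example, here is a sketch of a host bucket that contains one 1TB
+device and one 3TB device (the names and IDs here are illustrative)::
+
+ host mixed-node {
+ id -10
+ alg straw
+ hash 0
+ item osd.10 weight 1.00
+ item osd.11 weight 3.00
+ }
+
+An item referencing this host in a higher level bucket would then carry
+the aggregate weight, e.g. ``item mixed-node weight 4.00``.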
+
+.. _crushmaprules:
+
+CRUSH Map Rules
+---------------
+
+CRUSH maps support the notion of 'CRUSH rules', which are the rules that
+determine data placement for a pool. For large clusters, you will likely create
+many pools where each pool may have its own CRUSH ruleset and rules. The default
+CRUSH map has a rule for each pool, and one ruleset assigned to each of the
+default pools.
+
+.. note:: In most cases, you will not need to modify the default rules. When
+ you create a new pool, its default ruleset is ``0``.
+
+
+CRUSH rules define placement and replication strategies or distribution policies
+that allow you to specify exactly how CRUSH places object replicas. For
+example, you might create a rule selecting a pair of targets for 2-way
+mirroring, another rule for selecting three targets in two different data
+centers for 3-way mirroring, and yet another rule for erasure coding over six
+storage devices. For a detailed discussion of CRUSH rules, refer to
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
+and more specifically to **Section 3.2**.
+
+A rule takes the following form::
+
+ rule <rulename> {
+
+ ruleset <ruleset>
+ type [ replicated | erasure ]
+ min_size <min-size>
+ max_size <max-size>
+ step take <bucket-name> [class <device-class>]
+ step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
+ step emit
+ }
+
+
+``ruleset``
+
+:Description: A means of classifying a rule as belonging to a set of rules.
+ Activated by `setting the ruleset in a pool`_.
+
+:Purpose: A component of the rule mask.
+:Type: Integer
+:Required: Yes
+:Default: 0
+
+.. _setting the ruleset in a pool: ../pools#setpoolvalues
+
+
+``type``
+
+:Description: Describes whether the rule is for a replicated pool or an
+ erasure-coded pool.
+
+:Purpose: A component of the rule mask.
+:Type: String
+:Required: Yes
+:Default: ``replicated``
+:Valid Values: Currently only ``replicated`` and ``erasure``
+
+``min_size``
+
+:Description: If a pool makes fewer replicas than this number, CRUSH will
+ **NOT** select this rule.
+
+:Type: Integer
+:Purpose: A component of the rule mask.
+:Required: Yes
+:Default: ``1``
+
+``max_size``
+
+:Description: If a pool makes more replicas than this number, CRUSH will
+ **NOT** select this rule.
+
+:Type: Integer
+:Purpose: A component of the rule mask.
+:Required: Yes
+:Default: 10
+
+
+``step take <bucket-name> [class <device-class>]``
+
+:Description: Takes a bucket name, and begins iterating down the tree.
+ If the ``device-class`` is specified, it must match
+ a class previously used when defining a device. All
+ devices that do not belong to the class are excluded.
+:Purpose: A component of the rule.
+:Required: Yes
+:Example: ``step take data``
+
+
+``step choose firstn {num} type {bucket-type}``
+
+:Description: Selects the number of buckets of the given type. The number is
+ usually the number of replicas in the pool (i.e., pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
+ - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
+ - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|``.
+
+:Purpose: A component of the rule.
+:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step choose firstn 1 type row``
+
+
+``step chooseleaf firstn {num} type {bucket-type}``
+
+:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
+ node from the subtree of each bucket in the set of buckets. The
+ number of buckets in the set is usually the number of replicas in
+ the pool (i.e., pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
+ - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
+ - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|``.
+
+:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
+:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step chooseleaf firstn 0 type row``
+
+
+
+``step emit``
+
+:Description: Outputs the current value and empties the stack. Typically used
+ at the end of a rule, but may also be used to pick from different
+ trees in the same rule.
+
+:Purpose: A component of the rule.
+:Prerequisite: Follows ``step choose``.
+:Example: ``step emit``
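+
+For example, the following hypothetical rule (assuming a ``default`` root
+and the ``rack`` and ``host`` bucket types shown earlier) selects two racks
+and then two OSDs on separate hosts within each rack, for a total of four
+OSDs::
+
+ rule replicated-two-racks {
+ ruleset 6
+ type replicated
+ min_size 2
+ max_size 4
+ step take default
+ step choose firstn 2 type rack
+ step chooseleaf firstn 2 type host
+ step emit
+ }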
+
+.. important:: To apply one or more rules with a common ruleset number to a
+ pool, set the pool's ruleset number to that value.
+
+
+Placing Different Pools on Different OSDs
+==========================================
+
+Suppose you want to have most pools default to OSDs backed by large hard drives,
+but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
+It's possible to have multiple independent CRUSH hierarchies within the same
+CRUSH map. Define two hierarchies with two different root nodes--one for hard
+disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
+below::
+
+ device 0 osd.0
+ device 1 osd.1
+ device 2 osd.2
+ device 3 osd.3
+ device 4 osd.4
+ device 5 osd.5
+ device 6 osd.6
+ device 7 osd.7
+
+ host ceph-osd-ssd-server-1 {
+ id -1
+ alg straw
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host ceph-osd-ssd-server-2 {
+ id -2
+ alg straw
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ host ceph-osd-platter-server-1 {
+ id -3
+ alg straw
+ hash 0
+ item osd.4 weight 1.00
+ item osd.5 weight 1.00
+ }
+
+ host ceph-osd-platter-server-2 {
+ id -4
+ alg straw
+ hash 0
+ item osd.6 weight 1.00
+ item osd.7 weight 1.00
+ }
+
+ root platter {
+ id -5
+ alg straw
+ hash 0
+ item ceph-osd-platter-server-1 weight 2.00
+ item ceph-osd-platter-server-2 weight 2.00
+ }
+
+ root ssd {
+ id -6
+ alg straw
+ hash 0
+ item ceph-osd-ssd-server-1 weight 2.00
+ item ceph-osd-ssd-server-2 weight 2.00
+ }
+
+ rule data {
+ ruleset 0
+ type replicated
+ min_size 2
+ max_size 2
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule metadata {
+ ruleset 1
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule rbd {
+ ruleset 2
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule platter {
+ ruleset 3
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule ssd {
+ ruleset 4
+ type replicated
+ min_size 0
+ max_size 4
+ step take ssd
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule ssd-primary {
+ ruleset 5
+ type replicated
+ min_size 5
+ max_size 10
+ step take ssd
+ step chooseleaf firstn 1 type host
+ step emit
+ step take platter
+ step chooseleaf firstn -1 type host
+ step emit
+ }
+
+You can then set a pool to use the SSD rule with::
+
+ ceph osd pool set <poolname> crush_ruleset 4
+
+Similarly, using the ``ssd-primary`` rule will cause each placement group in the
+pool to be placed with an SSD as the primary and platters as the replicas.
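+
+For example, to use the ``ssd-primary`` rule defined above for a pool::
+
+ ceph osd pool set <poolname> crush_ruleset 5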
+
+
+Tuning CRUSH, the hard way
+--------------------------
+
+If you can ensure that all clients are running recent code, you can
+adjust the tunables by extracting the CRUSH map, modifying the values,
+and reinjecting it into the cluster.
+
+* Extract the latest CRUSH map::
+
+ ceph osd getcrushmap -o /tmp/crush
+
+* Adjust tunables. These values appear to offer the best behavior
+ for both large and small clusters we tested with. You will need to
+ additionally specify the ``--enable-unsafe-tunables`` argument to
+ ``crushtool`` for this to work. Please use this option with
+ extreme care::
+
+ crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
+
+* Reinject modified map::
+
+ ceph osd setcrushmap -i /tmp/crush.new
+
+Legacy values
+-------------
+
+For reference, the legacy values for the CRUSH tunables can be set
+with::
+
+ crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
+
+Again, the special ``--enable-unsafe-tunables`` option is required.
+Further, as noted above, be careful running old versions of the
+``ceph-osd`` daemon after reverting to legacy values as the feature
+bit is not perfectly enforced.
identifier (the OSD number), and the daemon type is typically ``osd``.
-Manually editing a CRUSH Map
-============================
-
-.. note:: Manually editing the CRUSH map is considered an advanced
- administrator operation. All CRUSH changes that are
- necessary for the overwhelming majority of installations are
- possible via the standard ceph CLI and do not require manual
- CRUSH map edits. If you have identified a use case where
- manual edits *are* necessary, consider contacting the Ceph
- developers so that future versions of Ceph can make this
- unnecessary.
-
-To edit an existing CRUSH map:
-
-#. `Get the CRUSH map`_.
-#. `Decompile`_ the CRUSH map.
-#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
-#. `Recompile`_ the CRUSH map.
-#. `Set the CRUSH map`_.
-
-To activate CRUSH map rules for a specific pool, identify the common ruleset
-number for those rules and specify that ruleset number for the pool. See `Set
-Pool Values`_ for details.
-
-.. _Get the CRUSH map: #getcrushmap
-.. _Decompile: #decompilecrushmap
-.. _Devices: #crushmapdevices
-.. _Buckets: #crushmapbuckets
-.. _Rules: #crushmaprules
-.. _Recompile: #compilecrushmap
-.. _Set the CRUSH map: #setcrushmap
-.. _Set Pool Values: ../pools#setpoolvalues
-
-.. _getcrushmap:
-
-Get a CRUSH Map
----------------
-
-To get the CRUSH map for your cluster, execute the following::
-
- ceph osd getcrushmap -o {compiled-crushmap-filename}
-
-Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
-the CRUSH map is in a compiled form, you must decompile it first before you can
-edit it.
-
-.. _decompilecrushmap:
-
-Decompile a CRUSH Map
----------------------
-
-To decompile a CRUSH map, execute the following::
-
- crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
-
-Ceph will decompile (-d) the compiled CRUSH map and output (-o) it to the
-filename you specified.
-
-
-.. _compilecrushmap:
-
-Compile a CRUSH Map
--------------------
-
-To compile a CRUSH map, execute the following::
-
- crushtool -c {decompiled-crush-map-filename} -o {compiled-crush-map-filename}
-
-Ceph will store a compiled CRUSH map to the filename you specified.
-
-
-.. _setcrushmap:
-
-Set a CRUSH Map
----------------
-
-To set the CRUSH map for your cluster, execute the following::
-
- ceph osd setcrushmap -i {compiled-crushmap-filename}
-
-Ceph will input the compiled CRUSH map of the filename you specified as the
-CRUSH map for the cluster.
-
-
-
-CRUSH Map Parameters
-====================
-
-There are four main sections to a CRUSH Map.
-
-#. **Devices:** Devices consist of any object storage device--i.e., the storage
- drive corresponding to a ``ceph-osd`` daemon. You should have a device for
- each OSD daemon in your Ceph configuration file.
-
-#. **Bucket Types**: Bucket ``types`` define the types of buckets used in your
- CRUSH hierarchy. Buckets consist of a hierarchical aggregation of storage
- locations (e.g., rows, racks, chassis, hosts, etc.) and their assigned
- weights.
-
-#. **Bucket Instances:** Once you define bucket types, you must declare bucket
- instances for your hosts, and any other failure domain partitioning
- you choose.
-
-#. **Rules:** Rules consist of the manner of selecting buckets.
-
-If you launched Ceph using one of our Quick Start guides, you'll notice
-that you didn't need to create a CRUSH map. Ceph's deployment tools generate
-a default CRUSH map that lists devices from the OSDs you defined in your
-Ceph configuration file, and it declares a bucket for each host you specified
-in the ``[osd]`` sections of your Ceph configuration file. You should create
-your own CRUSH maps with buckets that reflect your cluster's failure domains
-to better ensure data safety and availability.
+CRUSH structure
+===============
-.. note:: The generated CRUSH map doesn't take your larger grained failure
- domains into account. So you should modify your CRUSH map to account for
- larger grained failure domains such as chassis, racks, rows, data
- centers, etc.
+The CRUSH map consists of, loosely speaking, a hierarchy describing
+the physical topology of the cluster, and a set of rules defining
+policy about how we place data on those devices. The hierarchy has
+devices (``ceph-osd`` daemons) at the leaves, and internal nodes
+corresponding to other physical features or groupings: hosts, racks,
+rows, datacenters, and so on. The rules describe how replicas are
+placed in terms of that hierarchy (e.g., 'three replicas in different
+racks').
+Devices
+-------
+Devices are individual ``ceph-osd`` daemons that can store data. You
+will normally have one defined here for each OSD daemon in your
+cluster. Devices are identified by an id (a non-negative integer) and
+a name, normally ``osd.N`` where ``N`` is the device id.
-.. _crushmapdevices:
+Devices may also have a *device class* associated with them (e.g.,
+``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
+crush rule.
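+
+Device classes are typically assigned when an OSD is created, but they can
+also be inspected and adjusted from the CLI. For example (a sketch; the
+class and OSD names are illustrative)::
+
+ ceph osd crush class ls
+ ceph osd crush set-device-class ssd osd.0 osd.1
+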
-CRUSH Map Devices
+Types and Buckets
-----------------
-To map placement groups to OSDs, a CRUSH map requires a list of OSD devices
-(i.e., the names of the OSD daemons from the Ceph configuration file). The list
-of devices appears first in the CRUSH map. To declare a device in the CRUSH map,
-create a new line under your list of devices, enter ``device`` followed by a
-unique numeric ID, followed by the corresponding ``ceph-osd`` daemon instance.
-The device class can optionaly be added to group devices so they can be
-conveniently targetted by a crush rule.
-
-::
-
- #devices
- device {num} {osd.name} [class {class}]
-
-For example::
-
- #devices
- device 0 osd.0 class ssd
- device 1 osd.1 class hdd
- device 2 osd.2
- device 3 osd.3
-
-As a general rule, an OSD daemon maps to a single storage drive or to a RAID.
-
-
-CRUSH Map Bucket Types
-----------------------
-
-The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
-a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
-physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
-Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
-media.
-
-.. tip:: The term "bucket" used in the context of CRUSH means a node in
- the hierarchy, i.e. a location or a piece of physical hardware. It
- is a different concept from the term "bucket" when used in the
- context of RADOS Gateway APIs.
-
-To add a bucket type to the CRUSH map, create a new line under your list of
-bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
-By convention, there is one leaf bucket and it is ``type 0``; however, you may
-give it any name you like (e.g., osd, disk, drive, storage, etc.)::
-
- #types
- type {num} {bucket-name}
-
-For example::
-
- # types
- type 0 osd
- type 1 host
- type 2 chassis
- type 3 rack
- type 4 row
- type 5 pdu
- type 6 pod
- type 7 room
- type 8 datacenter
- type 9 region
- type 10 root
-
-
-
-.. _crushmapbuckets:
-
-CRUSH Map Bucket Hierarchy
---------------------------
-
-The CRUSH algorithm distributes data objects among storage devices according
-to a per-device weight value, approximating a uniform probability distribution.
-CRUSH distributes objects and their replicas according to the hierarchical
-cluster map you define. Your CRUSH map represents the available storage
-devices and the logical elements that contain them.
-
-To map placement groups to OSDs across failure domains, a CRUSH map defines a
-hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
-map). The purpose of creating a bucket hierarchy is to segregate the
-leaf nodes by their failure domains, such as hosts, chassis, racks, power
-distribution units, pods, rows, rooms, and data centers. With the exception of
-the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
-you may define it according to your own needs.
-
-We recommend adapting your CRUSH map to your firms's hardware naming conventions
-and using instances names that reflect the physical hardware. Your naming
-practice can make it easier to administer the cluster and troubleshoot
-problems when an OSD and/or other hardware malfunctions and the administrator
-need access to physical hardware.
-
-In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
-and two node buckets named ``host`` and ``rack`` respectively.
-
-.. ditaa::
- +-----------+
- | {o}rack |
- | Bucket |
- +-----+-----+
+A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
+racks, rows, etc. The CRUSH map defines a series of *types* that are
+used to describe these nodes. By default, these types include:
+
+- osd (or device)
+- host
+- chassis
+- rack
+- row
+- pdu
+- pod
+- room
+- datacenter
+- region
+- root
+
+Most clusters make use of only a handful of these types, and others
+can be defined as needed.
+
+The hierarchy is built with devices (normally type ``osd``) at the
+leaves, interior nodes with non-device types, and a root node of type
+``root``. For example,
+
+.. ditaa::
+
+ +-----------------+
+ | {o}root default |
+ +--------+--------+
|
+---------------+---------------+
| |
- +-----+-----+ +-----+-----+
- | {o}host | | {o}host |
- | Bucket | | Bucket |
- +-----+-----+ +-----+-----+
+ +-------+-------+ +-----+-------+
+ | {o}host foo | | {o}host bar |
+ +-------+-------+ +-----+-------+
| |
+-------+-------+ +-------+-------+
| | | |
+-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
- | osd | | osd | | osd | | osd |
- | Bucket | | Bucket | | Bucket | | Bucket |
+ | osd.0 | | osd.1 | | osd.2 | | osd.3 |
+-----------+ +-----------+ +-----------+ +-----------+
-.. note:: The higher numbered ``rack`` bucket type aggregates the lower
- numbered ``host`` bucket type.
-
-Since leaf nodes reflect storage devices declared under the ``#devices`` list
-at the beginning of the CRUSH map, you do not need to declare them as bucket
-instances. The second lowest bucket type in your hierarchy usually aggregates
-the devices (i.e., it's usually the computer containing the storage media, and
-uses whatever term you prefer to describe it, such as "node", "computer",
-"server," "host", "machine", etc.). In high density environments, it is
-increasingly common to see multiple hosts/nodes per chassis. You should account
-for chassis failure too--e.g., the need to pull a chassis if a node fails may
-result in bringing down numerous hosts/nodes and their OSDs.
-
-When declaring a bucket instance, you must specify its type, give it a unique
-name (string), assign it a unique ID expressed as a negative integer (optional),
-specify a weight relative to the total capacity/capability of its item(s),
-specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
-reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
-The items may consist of node buckets or leaves. Items may have a weight that
-reflects the relative weight of the item.
-
-You may declare a node bucket with the following syntax::
-
- [bucket-type] [bucket-name] {
- id [a unique negative numeric ID]
- weight [the relative capacity/capability of the item(s)]
- alg [the bucket type: uniform | list | tree | straw ]
- hash [the hash type: 0 by default]
- item [item-name] weight [weight]
- }
-
-For example, using the diagram above, we would define two host buckets
-and one rack bucket. The OSDs are declared as items within the host buckets::
-
- host node1 {
- id -1
- alg straw
- hash 0
- item osd.0 weight 1.00
- item osd.1 weight 1.00
- }
-
- host node2 {
- id -2
- alg straw
- hash 0
- item osd.2 weight 1.00
- item osd.3 weight 1.00
- }
-
- rack rack1 {
- id -3
- alg straw
- hash 0
- item node1 weight 2.00
- item node2 weight 2.00
- }
-
-.. note:: In the foregoing example, note that the rack bucket does not contain
- any OSDs. Rather it contains lower level host buckets, and includes the
- sum total of their weight in the item entry.
-
-.. topic:: Bucket Types
-
- Ceph supports four bucket types, each representing a tradeoff between
- performance and reorganization efficiency. If you are unsure of which bucket
- type to use, we recommend using a ``straw`` bucket. For a detailed
- discussion of bucket types, refer to
- `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
- and more specifically to **Section 3.4**. The bucket types are:
+Each node (device or bucket) in the hierarchy has a *weight*
+associated with it, indicating the relative proportion of the total
+data that device or hierarchy subtree should store. Weights are set
+at the leaves, indicating the size of the device, and automatically
+sum up the tree from there, such that the weight of the default node
+will be the total of all devices contained beneath it. Normally
+weights are in units of terabytes (TB).
+
+You can get a simple view of the CRUSH hierarchy for your cluster,
+including the weights, with::
+
+ ceph osd crush tree
+
+Rules
+-----
+
+Rules define policy about how data is distributed across the devices
+in the hierarchy.
+
+CRUSH rules define placement and replication strategies or
+distribution policies that allow you to specify exactly how CRUSH
+places object replicas. For example, you might create a rule selecting
+a pair of targets for 2-way mirroring, another rule for selecting
+three targets in two different data centers for 3-way mirroring, and
+yet another rule for erasure coding over six storage devices. For a
+detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
+Scalable, Decentralized Placement of Replicated Data`_, and more
+specifically to **Section 3.2**.
+
+In almost all cases, CRUSH rules can be created via the CLI by
+specifying the *pool type* they will be used for (replicated or
+erasure coded), the *failure domain*, and optionally a *device class*.
+In rare cases rules must be written by manually editing the CRUSH map.
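+
+For example, a replicated rule that targets SSD-class devices under the
+``default`` root, with ``host`` as the failure domain, could typically be
+created with something like the following (the rule name is arbitrary)::
+
+ ceph osd crush rule create-replicated fast-ssd default host ssd
+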
- #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
- weight. For example, when firms commission or decommission hardware, they
- typically do so with many machines that have exactly the same physical
- configuration (e.g., bulk purchases). When storage devices have exactly
- the same weight, you may use the ``uniform`` bucket type, which allows
- CRUSH to map replicas into uniform buckets in constant time. With
- non-uniform weights, you should use another bucket algorithm.
-
- #. **List**: List buckets aggregate their content as linked lists. Based on
- the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
- a list is a natural and intuitive choice for an **expanding cluster**:
- either an object is relocated to the newest device with some appropriate
- probability, or it remains on the older devices as before. The result is
- optimal data migration when items are added to the bucket. Items removed
- from the middle or tail of the list, however, can result in a significant
- amount of unnecessary movement, making list buckets most suitable for
- circumstances in which they **never (or very rarely) shrink**.
-
- #. **Tree**: Tree buckets use a binary search tree. They are more efficient
- than list buckets when a bucket contains a larger set of items. Based on
- the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
- tree buckets reduce the placement time to O(log :sub:`n`), making them
- suitable for managing much larger sets of devices or nested buckets.
-
- #. **Straw:** List and Tree buckets use a divide and conquer strategy
- in a way that either gives certain items precedence (e.g., those
- at the beginning of a list) or obviates the need to consider entire
- subtrees of items at all. That improves the performance of the replica
- placement process, but can also introduce suboptimal reorganization
- behavior when the contents of a bucket change due an addition, removal,
- or re-weighting of an item. The straw bucket type allows all items to
- fairly “compete” against each other for replica placement through a
- process analogous to a draw of straws.
-
-.. topic:: Hash
-
- Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
- Enter ``0`` as your hash setting to select ``rjenkins1``.
-
-
-.. _weightingbucketitems:
-
-.. topic:: Weighting Bucket Items
-
- Ceph expresses bucket weights as doubles, which allows for fine
- weighting. A weight is the relative difference between device capacities. We
- recommend using ``1.00`` as the relative weight for a 1TB storage device.
- In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
- and a weight of ``3.00`` would represent approximately 3TB. Higher level
- buckets have a weight that is the sum total of the leaf items aggregated by
- the bucket.
-
- A bucket item weight is one dimensional, but you may also calculate your
- item weights to reflect the performance of the storage drive. For example,
- if you have many 1TB drives where some have relatively low data transfer
- rate and the others have a relatively high data transfer rate, you may
- weight them differently, even though they have the same capacity (e.g.,
- a weight of 0.80 for the first set of drives with lower total throughput,
- and 1.20 for the second set of drives with higher total throughput).
-
-
-.. _crushmaprules:
-
-CRUSH Map Rules
----------------
-
-CRUSH maps support the notion of 'CRUSH rules', which are the rules that
-determine data placement for a pool. For large clusters, you will likely create
-many pools where each pool may have its own CRUSH ruleset and rules. The default
-CRUSH map has a rule for each pool, and one ruleset assigned to each of the
-default pools.
-
-.. note:: In most cases, you will not need to modify the default rules. When
- you create a new pool, its default ruleset is ``0``.
-
-
-CRUSH rules define placement and replication strategies or distribution policies
-that allow you to specify exactly how CRUSH places object replicas. For
-example, you might create a rule selecting a pair of targets for 2-way
-mirroring, another rule for selecting three targets in two different data
-centers for 3-way mirroring, and yet another rule for erasure coding over six
-storage devices. For a detailed discussion of CRUSH rules, refer to
-`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
-and more specifically to **Section 3.2**.
-
-A rule takes the following form::
-
- rule <rulename> {
-
- ruleset <ruleset>
- type [ replicated | erasure ]
- min_size <min-size>
- max_size <max-size>
- step take <bucket-name> [class <device-class>]
- step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
- step emit
- }
-
-
-``ruleset``
+You can see what rules are defined for your cluster with::
-:Description: A means of classifying a rule as belonging to a set of rules.
- Activated by `setting the ruleset in a pool`_.
+ ceph osd crush rule ls
-:Purpose: A component of the rule mask.
-:Type: Integer
-:Required: Yes
-:Default: 0
-
-.. _setting the ruleset in a pool: ../pools#setpoolvalues
-
-
-``type``
-
-:Description: Describes a rule for either a storage drive (replicated)
- or a RAID.
-
-:Purpose: A component of the rule mask.
-:Type: String
-:Required: Yes
-:Default: ``replicated``
-:Valid Values: Currently only ``replicated`` and ``erasure``
-
-``min_size``
-
-:Description: If a pool makes fewer replicas than this number, CRUSH will
- **NOT** select this rule.
-
-:Type: Integer
-:Purpose: A component of the rule mask.
-:Required: Yes
-:Default: ``1``
-
-``max_size``
-
-:Description: If a pool makes more replicas than this number, CRUSH will
- **NOT** select this rule.
-
-:Type: Integer
-:Purpose: A component of the rule mask.
-:Required: Yes
-:Default: 10
-
-
-``step take <bucket-name> [class <device-class>]``
-
-:Description: Takes a bucket name, and begins iterating down the tree.
- If the ``device-class`` is specified, it must match
- a class previously used when defining a device. All
- devices that do not belong to the class are excluded.
-:Purpose: A component of the rule.
-:Required: Yes
-:Example: ``step take data``
+You can view the contents of the rules with::
+ ceph osd crush rule dump
-``step choose firstn {num} type {bucket-type}``
-
-:Description: Selects the number of buckets of the given type. The number is
- usually the number of replicas in the pool (i.e., pool size).
-
- - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
- - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
- - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.
-
-:Purpose: A component of the rule.
-:Prerequisite: Follows ``step take`` or ``step choose``.
-:Example: ``step choose firstn 1 type row``
-
-
-``step chooseleaf firstn {num} type {bucket-type}``
-
-:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
- node from the subtree of each bucket in the set of buckets. The
- number of buckets in the set is usually the number of replicas in
- the pool (i.e., pool size).
-
- - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
- - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
- - If ``{num} < 0``, it means ``pool-num-replicas - {num}``.
-
-:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
-:Prerequisite: Follows ``step take`` or ``step choose``.
-:Example: ``step chooseleaf firstn 0 type row``
-
-
-
-``step emit``
-
-:Description: Outputs the current value and empties the stack. Typically used
- at the end of a rule, but may also be used to pick from different
- trees in the same rule.
-
-:Purpose: A component of the rule.
-:Prerequisite: Follows ``step choose``.
-:Example: ``step emit``
-
-.. important:: To activate one or more rules with a common ruleset number to a
- pool, set the ruleset number of the pool.
-
-
-
-Primary Affinity
-================
-
-When a Ceph Client reads or writes data, it always contacts the primary OSD in
-the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
-OSD isn't well suited to act as a primary compared to other OSDs (e.g., it has
-a slow disk or a slow controller). To prevent performance bottlenecks
-(especially on read operations) while maximizing utilization of your hardware,
-you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
-the OSD as a primary in an acting set. ::
-
- ceph osd primary-affinity <osd-id> <weight>
-
-Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
-may set the OSD primary range from ``0-1``, where ``0`` means that the OSD may
-**NOT** be used as a primary and ``1`` means that an OSD may be used as a
-primary. When the weight is ``< 1``, it is less likely that CRUSH will select
-the Ceph OSD Daemon to act as a primary.
+Weight sets
+------------
-Placing Different Pools on Different OSDS:
-==========================================
-
-Suppose you want to have most pools default to OSDs backed by large hard drives,
-but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
-It's possible to have multiple independent CRUSH hierarchies within the same
-CRUSH map. Define two hierarchies with two different root nodes--one for hard
-disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
-below::
-
- device 0 osd.0
- device 1 osd.1
- device 2 osd.2
- device 3 osd.3
- device 4 osd.4
- device 5 osd.5
- device 6 osd.6
- device 7 osd.7
-
- host ceph-osd-ssd-server-1 {
- id -1
- alg straw
- hash 0
- item osd.0 weight 1.00
- item osd.1 weight 1.00
- }
-
- host ceph-osd-ssd-server-2 {
- id -2
- alg straw
- hash 0
- item osd.2 weight 1.00
- item osd.3 weight 1.00
- }
-
- host ceph-osd-platter-server-1 {
- id -3
- alg straw
- hash 0
- item osd.4 weight 1.00
- item osd.5 weight 1.00
- }
-
- host ceph-osd-platter-server-2 {
- id -4
- alg straw
- hash 0
- item osd.6 weight 1.00
- item osd.7 weight 1.00
- }
-
- root platter {
- id -5
- alg straw
- hash 0
- item ceph-osd-platter-server-1 weight 2.00
- item ceph-osd-platter-server-2 weight 2.00
- }
-
- root ssd {
- id -6
- alg straw
- hash 0
- item ceph-osd-ssd-server-1 weight 2.00
- item ceph-osd-ssd-server-2 weight 2.00
- }
-
- rule data {
- ruleset 0
- type replicated
- min_size 2
- max_size 2
- step take platter
- step chooseleaf firstn 0 type host
- step emit
- }
-
- rule metadata {
- ruleset 1
- type replicated
- min_size 0
- max_size 10
- step take platter
- step chooseleaf firstn 0 type host
- step emit
- }
-
- rule rbd {
- ruleset 2
- type replicated
- min_size 0
- max_size 10
- step take platter
- step chooseleaf firstn 0 type host
- step emit
- }
-
- rule platter {
- ruleset 3
- type replicated
- min_size 0
- max_size 10
- step take platter
- step chooseleaf firstn 0 type host
- step emit
- }
-
- rule ssd {
- ruleset 4
- type replicated
- min_size 0
- max_size 4
- step take ssd
- step chooseleaf firstn 0 type host
- step emit
- }
-
- rule ssd-primary {
- ruleset 5
- type replicated
- min_size 5
- max_size 10
- step take ssd
- step chooseleaf firstn 1 type host
- step emit
- step take platter
- step chooseleaf firstn -1 type host
- step emit
- }
-
-You can then set a pool to use the SSD rule by::
-
- ceph osd pool set <poolname> crush_ruleset 4
-
-Similarly, using the ``ssd-primary`` rule will cause each placement group in the
-pool to be placed with an SSD as the primary and platters as the replicas.
+A *weight set* is an alternative set of weights to use when
+calculating data placement. The normal weights associated with each
+device in the CRUSH map are set based on the device size and indicate
+how much data we *should* be storing where. However, because CRUSH is
+based on a pseudorandom placement process, there is always some
+variation from this ideal distribution, the same way that rolling a
+die sixty times will not result in rolling exactly 10 ones and 10
+sixes. Weight sets allow the cluster to do a numerical optimization
+based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
+a balanced distribution.
+
+There are two types of weight sets supported:
+
+ #. A **compat** weight set is a single alternative set of weights for
+ each device and node in the cluster. This is not well-suited for
+ correcting for all anomalies (for example, placement groups for
+ different pools may be different sizes and have different load
+ levels, but will be mostly treated the same by the balancer).
+ However, compat weight sets have the huge advantage that they are
+ *backward compatible* with previous versions of Ceph, which means
+ that even though weight sets were first introduced in Luminous
+ v12.2.z, older clients (e.g., firefly) can still connect to the
+ cluster when a compat weight set is being used to balance data.
+ #. A **per-pool** weight set is more flexible in that it allows
+ placement to be optimized for each data pool. Additionally,
+ weights can be adjusted for each position of placement, allowing
+ the optimizer to correct for a subtle skew of data toward devices
+ with small weights relative to their peers (an effect that is
+ usually only apparent in very large clusters but which can cause
+ balancing problems).
+
+When weight sets are in use, the weights associated with each node in
+the hierarchy are visible as a separate column (labeled either
+``(compat)`` or the pool name) in the output of the command::
+
+ ceph osd crush tree
+
+Although weight sets can be set up and manipulated by hand, it is
+recommended that the *balancer* module be enabled to do so
+automatically.
+
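+For example, the balancer can typically be enabled and told to optimize
+using compat weight sets with commands along these lines (a sketch; see
+the balancer module documentation for details)::
+
+ ceph mgr module enable balancer
+ ceph balancer mode crush-compat
+ ceph balancer on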
+
+Modifying the CRUSH map
+=======================
.. _addosd:
Add/Move an OSD
-===============
+---------------
-To add or move an OSD in the CRUSH map of a running cluster, execute the
-``ceph osd crush set``. For Argonaut (v 0.48), execute the following::
+.. note:: OSDs are normally automatically added to the CRUSH map when
+ the OSD is created. This command is rarely needed.
- ceph osd crush set {id} {name} {weight} pool={pool-name} [{bucket-type}={bucket-name} ...]
-
-For Bobtail (v 0.56), execute the following::
+To add or move an OSD in the CRUSH map of a running cluster::
- ceph osd crush set {id-or-name} {weight} root={pool-name} [{bucket-type}={bucket-name} ...]
+ ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
Where:
-``id``
-
-:Description: The numeric ID of the OSD.
-:Type: Integer
-:Required: Yes
-:Example: ``0``
-
-
``name``
:Description: The full name of the OSD.
``weight``
-:Description: The CRUSH weight for the OSD.
+:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``
``root``
-:Description: The root of the tree in which the OSD resides.
+:Description: The root node of the tree in which the OSD resides (normally ``default``).
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
-The following example adds ``osd.0`` to the hierarchy, or moves the OSD from a
-previous location. ::
+The following example adds ``osd.0`` to the hierarchy, or moves the
+OSD from a previous location. ::
- ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
+ ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
-Adjust an OSD's CRUSH Weight
-============================
+Adjust OSD weight
+-----------------
+
+.. note:: Normally OSDs automatically add themselves to the CRUSH map
+ with the correct weight when they are created. This command
+ is rarely needed.
To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
the following::
- ceph osd crush reweight {name} {weight}
+ ceph osd crush reweight {name} {weight}
Where:
.. _removeosd:
Remove an OSD
-=============
+-------------
+
+.. note:: OSDs are normally removed from the CRUSH map as part of the
+ ``ceph osd purge`` command. This command is rarely needed.
-To remove an OSD from the CRUSH map of a running cluster, execute the following::
+To remove an OSD from the CRUSH map of a running cluster, execute the
+following::
- ceph osd crush remove {name}
+ ceph osd crush remove {name}
Where:
:Required: Yes
:Example: ``osd.0``
+
Add a Bucket
-============
+------------
+
+.. note:: Buckets are normally implicitly created when an OSD is added
+ that specifies a ``{bucket-type}={bucket-name}`` as part of its
+ location and a bucket with that name does not already exist. This
+ command is typically used when manually adjusting the structure of the
+ hierarchy after OSDs have been created (for example, to move a
+ series of hosts underneath a new rack-level bucket).
-To add a bucket in the CRUSH map of a running cluster, execute the ``ceph osd crush add-bucket`` command::
+To add a bucket in the CRUSH map of a running cluster, execute the
+``ceph osd crush add-bucket`` command::
- ceph osd crush add-bucket {bucket-name} {bucket-type}
+ ceph osd crush add-bucket {bucket-name} {bucket-type}
Where:
The following example adds the ``rack12`` bucket to the hierarchy::
- ceph osd crush add-bucket rack12 rack
+ ceph osd crush add-bucket rack12 rack
Move a Bucket
-=============
+-------------
-To move a bucket to a different location or position in the CRUSH map hierarchy,
-execute the following::
+To move a bucket to a different location or position in the CRUSH map
+hierarchy, execute the following::
- ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
+ ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
Where:
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
Remove a Bucket
-===============
+---------------
To remove a bucket from the CRUSH map hierarchy, execute the following::
- ceph osd crush remove {bucket-name}
+ ceph osd crush remove {bucket-name}
.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
The following example removes the ``rack12`` bucket from the hierarchy::
- ceph osd crush remove rack12
+ ceph osd crush remove rack12
+
+
Tunables
========
* ``argonaut``: the legacy values supported by the original argonaut release
* ``bobtail``: the values supported by the bobtail release
* ``firefly``: the values supported by the firefly release
+ * ``hammer``: the values supported by the hammer release
+ * ``jewel``: the values supported by the jewel release
* ``optimal``: the best (ie optimal) values of the current version of Ceph
* ``default``: the default values of a new cluster installed from
scratch. These values, which depend on the current version of Ceph,
Note that this may result in some data movement.
-Tuning CRUSH, the hard way
---------------------------
-
-If you can ensure that all clients are running recent code, you can
-adjust the tunables by extracting the CRUSH map, modifying the values,
-and reinjecting it into the cluster.
-
-* Extract the latest CRUSH map::
-
- ceph osd getcrushmap -o /tmp/crush
-
-* Adjust tunables. These values appear to offer the best behavior
- for both large and small clusters we tested with. You will need to
- additionally specify the ``--enable-unsafe-tunables`` argument to
- ``crushtool`` for this to work. Please use this option with
- extreme care.::
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
- crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
-* Reinject modified map::
+Primary Affinity
+================
- ceph osd setcrushmap -i /tmp/crush.new
+When a Ceph Client reads or writes data, it always contacts the primary OSD in
+the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
+OSD isn't well suited to act as a primary compared to other OSDs (e.g., it has
+a slow disk or a slow controller). To prevent performance bottlenecks
+(especially on read operations) while maximizing utilization of your hardware,
+you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
+the OSD as a primary in an acting set. ::
-Legacy values
--------------
+ ceph osd primary-affinity <osd-id> <weight>
-For reference, the legacy values for the CRUSH tunables can be set
-with::
+Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
+may set the OSD primary affinity in the range ``0-1``, where ``0`` means that the OSD may
+**NOT** be used as a primary and ``1`` means that an OSD may be used as a
+primary. When the weight is ``< 1``, it is less likely that CRUSH will select
+the Ceph OSD Daemon to act as a primary.
- crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
-Again, the special ``--enable-unsafe-tunables`` option is required.
-Further, as noted above, be careful running old versions of the
-``ceph-osd`` daemon after reverting to legacy values as the feature
-bit is not perfectly enforced.
-.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf