============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
computes storage locations in order to determine how to store and retrieve
data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. By using an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
distributing the data across the cluster in accordance with configured
replication policy and failure domains. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a
hierarchy of "buckets" (``host``\s, ``rack``\s), and rules that govern how
CRUSH replicates data within the cluster's pools. By reflecting the underlying
physical organization of the installation, CRUSH can model (and thereby
address) the potential for correlated device failures. Some factors relevant
to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
power source, shared networking, and failure domains. By encoding this
information into the CRUSH map, CRUSH placement policies distribute object
replicas across failure domains while maintaining the desired distribution. For
example, to address the possibility of concurrent failures, it might be
desirable to ensure that data replicas are on devices that reside in or rely
upon different shelves, racks, power supplies, controllers, or physical
locations.

When OSDs are deployed, they are automatically added to the CRUSH map under a
``host`` bucket that is named for the node on which the OSDs run. This
behavior, combined with the configured CRUSH failure domain, ensures that
replicas or erasure-code shards are distributed across hosts and that the
failure of a single host or other kinds of failures will not affect
availability. For larger clusters, administrators must carefully consider their
choice of failure domain. For example, distributing replicas across racks is
typical for mid- to large-sized clusters.

CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is referred to as its
``CRUSH location``. The specification of a CRUSH location takes the form of a
list of key-value pairs. For example, if an OSD is in a particular row, rack,
chassis, and host, and is also part of the 'default' CRUSH root (which is the
case for most clusters), its CRUSH location can be specified as follows::

   root=default row=a rack=a2 chassis=a2a host=a2a1

.. note::

   #. The order of the keys does not matter.
   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
      types suffice for nearly all clusters, but can be customized by
      modifying the CRUSH map.
   #. Not all keys need to be specified. For example, by default, Ceph
      automatically sets an ``OSD``'s location as ``root=default
      host=HOSTNAME`` (as determined by the output of ``hostname -s``).

The CRUSH location for an OSD can be modified by adding the ``crush location``
option in ``ceph.conf``. When this option has been added, every time the OSD
starts it verifies that it is in the correct location in the CRUSH map and
moves itself if it is not. To disable this automatic CRUSH map management, add
the following to the ``ceph.conf`` configuration file in the ``[osd]``
section::

   osd crush update on start = false

Note that this action is unnecessary in most cases.
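
As an illustration only (the bucket names below are placeholders that match the
example earlier in this section), the CRUSH location of a single OSD could be
pinned in ``ceph.conf`` like this::

   [osd.0]
   crush location = root=default row=a rack=a2 chassis=a2a host=a2a1
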
Custom location hooks
---------------------

A custom location hook can be used to generate a more complete CRUSH location
on startup. The CRUSH location is determined by, in order of preference:

#. A ``crush location`` option in ``ceph.conf``
#. A default of ``root=default host=HOSTNAME`` where the hostname is determined
   by the output of the ``hostname -s`` command

A script can be written to provide additional location fields (for example,
``rack`` or ``datacenter``) and the hook can be enabled via the following
config option::

   crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (see below). The hook outputs a single
line to ``stdout`` that contains the CRUSH location description. The output
resembles the following::

   --cluster CLUSTER --id ID --type TYPE

Here the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier or (in the case of OSDs) the OSD number, and the daemon type is
``osd``, ``mds``, ``mgr``, or ``mon``.

For example, a simple hook that specifies a rack location via a value in the
file ``/etc/rack`` might be as follows::

   #!/bin/sh
   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
CRUSH structure
===============

The CRUSH map consists of (1) a hierarchy that describes the physical topology
of the cluster and (2) a set of rules that defines data placement policy. The
hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
other physical features or groupings: hosts, racks, rows, data centers, and so
on. The rules determine how replicas are placed in terms of that hierarchy (for
example, 'three replicas in different racks').

Devices
-------

Devices are individual OSDs that store data (usually one device for each
storage drive). Devices are identified by an ``id`` (a non-negative integer)
and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).

In Luminous and later releases, OSDs can have a *device class* assigned (for
example, ``hdd`` or ``ssd`` or ``nvme``), allowing them to be targeted by CRUSH
rules. Device classes are especially useful when mixing device types within
hosts.

.. _crush_map_default_types:

Types and Buckets
-----------------

"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
*types* that are used to identify these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``region``
- ``root``

Most clusters use only a handful of these types, and other types can be defined
as needed.

The hierarchy is built with devices (normally of type ``osd``) at the leaves
and non-device types as the internal nodes. The root node is of type ``root``.
For example:

.. ditaa::

                           +-----------------+
                           |{o}root default  |
                           +--------+--------+
                                    |
                    +---------------+---------------+
                    |                               |
             +------+------+                 +------+------+
             |{o}host foo  |                 |{o}host bar  |
             +------+------+                 +------+------+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
      +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight* that indicates the
relative proportion of the total data that should be stored by that device or
hierarchy subtree. Weights are set at the leaves, indicating the size of the
device. These weights automatically sum in an 'up the tree' direction: that is,
the weight of the ``root`` node will be the sum of the weights of all devices
contained under it. Weights are typically measured in tebibytes (TiB). For
example, if each of the four OSDs in the hierarchy shown above were a 1 TiB
device with weight 1.0, each host bucket would have weight 2.0 and the ``root``
node would have weight 4.0.

To get a simple view of the cluster's CRUSH hierarchy, including weights, run
the following command:

.. prompt:: bash $

   ceph osd tree

Rules
-----

CRUSH rules define policy governing how data is distributed across the devices
in the hierarchy. The rules define placement as well as replication strategies
or distribution policies that allow you to specify exactly how CRUSH places
data replicas. For example, you might create one rule selecting a pair of
targets for two-way mirroring, another rule for selecting three targets in two
different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

CRUSH rules can be created via the command-line by specifying the *pool type*
that they will govern (replicated or erasure coded), the *failure domain*, and
optionally a *device class*. In rare cases, CRUSH rules must be created by
manually editing the CRUSH map.

To see the rules that are defined for the cluster, run the following command:

.. prompt:: bash $

   ceph osd crush rule ls

To view the contents of the rules, run the following command:

.. prompt:: bash $

   ceph osd crush rule dump
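
For example, the following commands create a replicated rule that places copies
on distinct hosts under the ``default`` root and then display its definition
(the rule name ``replicated_hosts`` is arbitrary and used here only for
illustration):

.. prompt:: bash $

   ceph osd crush rule create-replicated replicated_hosts default host
   ceph osd crush rule dump replicated_hosts
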
Device classes
--------------

Each device can optionally have a *class* assigned. By default, OSDs
automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush set-device-class <class> <osd-name> [...]
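
For example, to mark two OSDs (here ``osd.2`` and ``osd.3``, chosen only for
illustration) as SSD-backed, assuming that no class is currently set for them:

.. prompt:: bash $

   ceph osd crush set-device-class ssd osd.2 osd.3
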
Once a device class has been set, it cannot be changed to another class until
the old class is unset. To remove the old class of one or more OSDs, run a
command of the following form:

.. prompt:: bash $

   ceph osd crush rm-device-class <osd-name> [...]

This restriction allows administrators to set device classes that won't be
changed on OSD restart or by a script.

To create a placement rule that targets a specific device class, run a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

To apply the new placement rule to a specific pool, run a command of the
following form:

.. prompt:: bash $

   ceph osd pool set <pool-name> crush_rule <rule-name>
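
For example, to direct a hypothetical pool named ``mypool`` onto SSDs only, you
might create a rule named ``fast`` (an arbitrary name) that draws from the
``default`` root, uses ``host`` as the failure domain, and targets the ``ssd``
device class, and then assign that rule to the pool:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set mypool crush_rule fast
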

Device classes are implemented by creating one or more "shadow" CRUSH
hierarchies. For each device class in use, there will be a shadow hierarchy
that contains only devices of that class. CRUSH rules can then distribute data
across the relevant shadow hierarchy. This approach is fully backward
compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
items displayed, run the following command:

.. prompt:: bash #

   ceph osd crush tree --show-shadow
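
In the output of this command, shadow items have the device class appended to
the bucket name: for example, ``default~ssd`` is the shadow of the ``default``
root restricted to ``ssd`` devices.
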
Some older clusters that were created before the Luminous release rely on
manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
clusters, there is a *reclassify* tool available that can help them transition
to device classes without triggering unwanted data movement (see
:ref:`crush-reclassify`).

Weight sets
-----------

A *weight set* is an alternative set of weights to use when calculating data
placement. The normal weights associated with each device in the CRUSH map are
set in accordance with the device size and indicate how much data should be
stored where. However, because CRUSH is a probabilistic pseudorandom placement
process, there is always some variation from this ideal distribution (in the
same way that rolling a die sixty times will likely not result in exactly ten
ones and ten sixes). Weight sets allow the cluster to perform numerical
optimization based on the specifics of your cluster (for example: hierarchy,
pools) to achieve a balanced distribution.

Ceph supports two types of weight sets:

#. A **compat** weight set is a single alternative set of weights for each
   device and each node in the cluster. Compat weight sets cannot be expected
   to correct all anomalies (for example, PGs for different pools might be of
   different sizes and have different load levels, but are mostly treated
   alike by the balancer). However, they have the major advantage of being
   *backward compatible* with previous versions of Ceph. This means that even
   though weight sets were first introduced in Luminous v12.2.z, older
   clients (for example, Firefly) can still connect to the cluster when a
   compat weight set is being used to balance data.

#. A **per-pool** weight set is more flexible in that it allows placement to
   be optimized for each data pool. Additionally, weights can be adjusted
   for each position of placement, allowing the optimizer to correct for a
   subtle skew of data toward devices with small weights relative to their
   peers (an effect that is usually apparent only in very large clusters
   but that can cause balancing problems).

When weight sets are in use, the weights associated with each node in the
hierarchy are visible in a separate column (labeled either as ``(compat)`` or
as the pool name) in the output of the following command:

.. prompt:: bash #

   ceph osd tree

If both *compat* and *per-pool* weight sets are in use, data placement for a
particular pool will use its own per-pool weight set if present. If only
*compat* weight sets are in use, data placement will use the compat weight set.
If neither is in use, data placement will use the normal CRUSH weights.

Although weight sets can be set up and adjusted manually, we recommend enabling
the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
cluster is running Luminous or a later release.
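
For example, a minimal way to have the balancer manage a compat weight set
automatically is to select the ``crush-compat`` mode (the balancer mode that
optimizes weight sets; the newer ``upmap`` mode does not use them) and then
turn the balancer on:

.. prompt:: bash $

   ceph balancer mode crush-compat
   ceph balancer on
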
Modifying the CRUSH map