From: Sage Weil Date: Fri, 14 Jul 2017 22:06:15 +0000 (-0400) Subject: doc/rados/operations/crush-map*: restructure and clean up X-Git-Tag: v12.1.2~150^2~6 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=2e8afd16a93cf352d7bb08a2aa684daa1b9efcd8;p=ceph.git doc/rados/operations/crush-map*: restructure and clean up Signed-off-by: Sage Weil --- diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 000000000000..52222703823c --- /dev/null +++ b/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,654 @@ +Manually editing a CRUSH Map +============================ + +.. note:: Manually editing the CRUSH map is considered an advanced + administrator operation. All CRUSH changes that are + necessary for the overwhelming majority of installations are + possible via the standard ceph CLI and do not require manual + CRUSH map edits. If you have identified a use case where + manual edits *are* necessary, consider contacting the Ceph + developers so that future versions of Ceph can make this + unnecessary. + +To edit an existing CRUSH map: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +To activate CRUSH map rules for a specific pool, identify the common ruleset +number for those rules and specify that ruleset number for the pool. See `Set +Pool Values`_ for details. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get a CRUSH Map +--------------- + +To get the CRUSH map for your cluster, execute the following:: + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since +the CRUSH map is in a compiled form, you must decompile it first before you can +edit it. + +.. _decompilecrushmap: + +Decompile a CRUSH Map +--------------------- + +To decompile a CRUSH map, execute the following:: + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + + +Sections +-------- + +There are six main sections to a CRUSH Map. + +#. **tunables:** The preamble at the top of the map described any *tunables* + for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These + correct for old bugs, optimizations, or other changes in behavior that have + been made over the years to improve CRUSH's behavior. + +#. **devices:** Devices are individual ``ceph-osd`` daemons that can + store data. + +#. **types**: Bucket ``types`` define the types of buckets used in + your CRUSH hierarchy. Buckets consist of a hierarchical aggregation + of storage locations (e.g., rows, racks, chassis, hosts, etc.) and + their assigned weights. + +#. **buckets:** Once you define bucket types, you must define each node + in the hierarchy, its type, and which devices or other nodes it + containes. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** Choose_args are alternative weights associated with + the hierarchy that have been adjusted to optimize data placement. A single + choose_args map can be used for the entire cluster, or one can be + created for each individual pool. + + +.. _crushmapdevices: + +CRUSH Map Devices +----------------- + +Devices are individual ``ceph-osd`` daemons that can store data. You +will normally have one defined here for each OSD daemon in your +cluster. Devices are identified by an id (a non-negative integer) and +a name, normally ``osd.N`` where ``N`` is the device id. + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targetted by a +crush rule. + +:: + + # devices + device {num} {osd.name} [class {class}] + +For example:: + + # devices + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a single ``ceph-osd`` daemon. This +is normally a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a +small RAID device. + + + + + +CRUSH Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate +a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent +physical locations in a hierarchy. Nodes aggregate other nodes or leaves. +Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage +media. + +.. tip:: The term "bucket" used in the context of CRUSH means a node in + the hierarchy, i.e. a location or a piece of physical hardware. It + is a different concept from the term "bucket" when used in the + context of RADOS Gateway APIs. + +To add a bucket type to the CRUSH map, create a new line under your list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is one leaf bucket and it is ``type 0``; however, you may +give it any name you like (e.g., osd, disk, drive, storage, etc.):: + + #types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 region + type 10 root + + + +.. _crushmapbuckets: + +CRUSH Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according +to a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. Your CRUSH map represents the available storage +devices and the logical elements that contain them. + +To map placement groups to OSDs across failure domains, a CRUSH map defines a +hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH +map). The purpose of creating a bucket hierarchy is to segregate the +leaf nodes by their failure domains, such as hosts, chassis, racks, power +distribution units, pods, rows, rooms, and data centers. With the exception of +the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your firms's hardware naming conventions +and using instances names that reflect the physical hardware. Your naming +practice can make it easier to administer the cluster and troubleshoot +problems when an OSD and/or other hardware malfunctions and the administrator +need access to physical hardware. + +In the following example, the bucket hierarchy has a leaf bucket named ``osd``, +and two node buckets named ``host`` and ``rack`` respectively. + +.. ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher numbered ``rack`` bucket type aggregates the lower + numbered ``host`` bucket type. + +Since leaf nodes reflect storage devices declared under the ``#devices`` list +at the beginning of the CRUSH map, you do not need to declare them as bucket +instances. The second lowest bucket type in your hierarchy usually aggregates +the devices (i.e., it's usually the computer containing the storage media, and +uses whatever term you prefer to describe it, such as "node", "computer", +"server," "host", "machine", etc.). In high density environments, it is +increasingly common to see multiple hosts/nodes per chassis. You should account +for chassis failure too--e.g., the need to pull a chassis if a node fails may +result in bringing down numerous hosts/nodes and their OSDs. + +When declaring a bucket instance, you must specify its type, give it a unique +name (string), assign it a unique ID expressed as a negative integer (optional), +specify a weight relative to the total capacity/capability of its item(s), +specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``, +reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. +The items may consist of node buckets or leaves. Items may have a weight that +reflects the relative weight of the item. + +You may declare a node bucket with the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, using the diagram above, we would define two host buckets +and one rack bucket. The OSDs are declared as items within the host buckets:: + + host node1 { + id -1 + alg straw + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In the foregoing example, note that the rack bucket does not contain + any OSDs. Rather it contains lower level host buckets, and includes the + sum total of their weight in the item entry. + +.. topic:: Bucket Types + + Ceph supports four bucket types, each representing a tradeoff between + performance and reorganization efficiency. If you are unsure of which bucket + type to use, we recommend using a ``straw`` bucket. For a detailed + discussion of bucket types, refer to + `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, + and more specifically to **Section 3.4**. The bucket types are: + + #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same + weight. For example, when firms commission or decommission hardware, they + typically do so with many machines that have exactly the same physical + configuration (e.g., bulk purchases). When storage devices have exactly + the same weight, you may use the ``uniform`` bucket type, which allows + CRUSH to map replicas into uniform buckets in constant time. With + non-uniform weights, you should use another bucket algorithm. + + #. **List**: List buckets aggregate their content as linked lists. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, + a list is a natural and intuitive choice for an **expanding cluster**: + either an object is relocated to the newest device with some appropriate + probability, or it remains on the older devices as before. The result is + optimal data migration when items are added to the bucket. Items removed + from the middle or tail of the list, however, can result in a significant + amount of unnecessary movement, making list buckets most suitable for + circumstances in which they **never (or very rarely) shrink**. + + #. **Tree**: Tree buckets use a binary search tree. They are more efficient + than list buckets when a bucket contains a larger set of items. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, + tree buckets reduce the placement time to O(log :sub:`n`), making them + suitable for managing much larger sets of devices or nested buckets. + + #. **Straw:** List and Tree buckets use a divide and conquer strategy + in a way that either gives certain items precedence (e.g., those + at the beginning of a list) or obviates the need to consider entire + subtrees of items at all. That improves the performance of the replica + placement process, but can also introduce suboptimal reorganization + behavior when the contents of a bucket change due an addition, removal, + or re-weighting of an item. The straw bucket type allows all items to + fairly “compete” against each other for replica placement through a + process analogous to a draw of straws. + +.. topic:: Hash + + Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. + Enter ``0`` as your hash setting to select ``rjenkins1``. + + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine + weighting. A weight is the relative difference between device capacities. We + recommend using ``1.00`` as the relative weight for a 1TB storage device. + In such a scenario, a weight of ``0.5`` would represent approximately 500GB, + and a weight of ``3.00`` would represent approximately 3TB. Higher level + buckets have a weight that is the sum total of the leaf items aggregated by + the bucket. + + A bucket item weight is one dimensional, but you may also calculate your + item weights to reflect the performance of the storage drive. For example, + if you have many 1TB drives where some have relatively low data transfer + rate and the others have a relatively high data transfer rate, you may + weight them differently, even though they have the same capacity (e.g., + a weight of 0.80 for the first set of drives with lower total throughput, + and 1.20 for the second set of drives with higher total throughput). + + +.. _crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps support the notion of 'CRUSH rules', which are the rules that +determine data placement for a pool. For large clusters, you will likely create +many pools where each pool may have its own CRUSH ruleset and rules. The default +CRUSH map has a rule for each pool, and one ruleset assigned to each of the +default pools. + +.. note:: In most cases, you will not need to modify the default rules. When + you create a new pool, its default ruleset is ``0``. + + +CRUSH rules define placement and replication strategies or distribution policies +that allow you to specify exactly how CRUSH places object replicas. For +example, you might create a rule selecting a pair of targets for 2-way +mirroring, another rule for selecting three targets in two different data +centers for 3-way mirroring, and yet another rule for erasure coding over six +storage devices. For a detailed discussion of CRUSH rules, refer to +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, +and more specifically to **Section 3.2**. + +A rule takes the following form:: + + rule { + + ruleset + type [ replicated | erasure ] + min_size + max_size + step take [class ] + step [choose|chooseleaf] [firstn|indep] + step emit + } + + +``ruleset`` + +:Description: A means of classifying a rule as belonging to a set of rules. + Activated by `setting the ruleset in a pool`_. + +:Purpose: A component of the rule mask. +:Type: Integer +:Required: Yes +:Default: 0 + +.. _setting the ruleset in a pool: ../pools#setpoolvalues + + +``type`` + +:Description: Describes a rule for either a storage drive (replicated) + or a RAID. + +:Purpose: A component of the rule mask. +:Type: String +:Required: Yes +:Default: ``replicated`` +:Valid Values: Currently only ``replicated`` and ``erasure`` + +``min_size`` + +:Description: If a pool makes fewer replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: ``1`` + +``max_size`` + +:Description: If a pool makes more replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: 10 + + +``step take [class ]`` + +:Description: Takes a bucket name, and begins iterating down the tree. + If the ``device-class`` is specified, it must match + a class previously used when defining a device. All + devices that do not belong to the class are excluded. +:Purpose: A component of the rule. +:Required: Yes +:Example: ``step take data`` + + +``step choose firstn {num} type {bucket-type}`` + +:Description: Selects the number of buckets of the given type. The number is + usually the number of replicas in the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + +:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf + node from the subtree of each bucket in the set of buckets. The + number of buckets in the set is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. Usage removes the need to select a device using two steps. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step chooseleaf firstn 0 type row`` + + + +``step emit`` + +:Description: Outputs the current value and empties the stack. Typically used + at the end of a rule, but may also be used to pick from different + trees in the same rule. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step choose``. +:Example: ``step emit`` + +.. important:: To activate one or more rules with a common ruleset number to a + pool, set the ruleset number of the pool. + + +Placing Different Pools on Different OSDS: +========================================== + +Suppose you want to have most pools default to OSDs backed by large hard drives, +but have some pools mapped to OSDs backed by fast solid-state drives (SSDs). +It's possible to have multiple independent CRUSH hierarchies within the same +CRUSH map. Define two hierarchies with two different root nodes--one for hard +disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown +below:: + + device 0 osd.0 + device 1 osd.1 + device 2 osd.2 + device 3 osd.3 + device 4 osd.4 + device 5 osd.5 + device 6 osd.6 + device 7 osd.7 + + host ceph-osd-ssd-server-1 { + id -1 + alg straw + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host ceph-osd-ssd-server-2 { + id -2 + alg straw + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + host ceph-osd-platter-server-1 { + id -3 + alg straw + hash 0 + item osd.4 weight 1.00 + item osd.5 weight 1.00 + } + + host ceph-osd-platter-server-2 { + id -4 + alg straw + hash 0 + item osd.6 weight 1.00 + item osd.7 weight 1.00 + } + + root platter { + id -5 + alg straw + hash 0 + item ceph-osd-platter-server-1 weight 2.00 + item ceph-osd-platter-server-2 weight 2.00 + } + + root ssd { + id -6 + alg straw + hash 0 + item ceph-osd-ssd-server-1 weight 2.00 + item ceph-osd-ssd-server-2 weight 2.00 + } + + rule data { + ruleset 0 + type replicated + min_size 2 + max_size 2 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule metadata { + ruleset 1 + type replicated + min_size 0 + max_size 10 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule rbd { + ruleset 2 + type replicated + min_size 0 + max_size 10 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule platter { + ruleset 3 + type replicated + min_size 0 + max_size 10 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule ssd { + ruleset 4 + type replicated + min_size 0 + max_size 4 + step take ssd + step chooseleaf firstn 0 type host + step emit + } + + rule ssd-primary { + ruleset 5 + type replicated + min_size 5 + max_size 10 + step take ssd + step chooseleaf firstn 1 type host + step emit + step take platter + step chooseleaf firstn -1 type host + step emit + } + +You can then set a pool to use the SSD rule by:: + + ceph osd pool set crush_ruleset 4 + +Similarly, using the ``ssd-primary`` rule will cause each placement group in the +pool to be placed with an SSD as the primary and platters as the replicas. + + +Tuning CRUSH, the hard way +-------------------------- + +If you can ensure that all clients are running recent code, you can +adjust the tunables by extracting the CRUSH map, modifying the values, +and reinjecting it into the cluster. + +* Extract the latest CRUSH map:: + + ceph osd getcrushmap -o /tmp/crush + +* Adjust tunables. These values appear to offer the best behavior + for both large and small clusters we tested with. You will need to + additionally specify the ``--enable-unsafe-tunables`` argument to + ``crushtool`` for this to work. Please use this option with + extreme care.:: + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +* Reinject modified map:: + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +For reference, the legacy values for the CRUSH tunables can be set +with:: + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +Again, the special ``--enable-unsafe-tunables`` option is required. +Further, as noted above, be careful running old versions of the +``ceph-osd`` daemon after reverting to legacy values as the feature +bit is not perfectly enforced. diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst index 76f6519b947d..fa4f86e15cab 100644 --- a/doc/rados/operations/crush-map.rst +++ b/doc/rados/operations/crush-map.rst @@ -95,695 +95,180 @@ where the cluster name is typically 'ceph', the id is the daemon identifier (the OSD number), and the daemon type is typically ``osd``. -Manually editing a CRUSH Map -============================ - -.. note:: Manually editing the CRUSH map is considered an advanced - administrator operation. All CRUSH changes that are - necessary for the overwhelming majority of installations are - possible via the standard ceph CLI and do not require manual - CRUSH map edits. If you have identified a use case where - manual edits *are* necessary, consider contacting the Ceph - developers so that future versions of Ceph can make this - unnecessary. - -To edit an existing CRUSH map: - -#. `Get the CRUSH map`_. -#. `Decompile`_ the CRUSH map. -#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. -#. `Recompile`_ the CRUSH map. -#. `Set the CRUSH map`_. - -To activate CRUSH map rules for a specific pool, identify the common ruleset -number for those rules and specify that ruleset number for the pool. See `Set -Pool Values`_ for details. - -.. _Get the CRUSH map: #getcrushmap -.. _Decompile: #decompilecrushmap -.. _Devices: #crushmapdevices -.. _Buckets: #crushmapbuckets -.. _Rules: #crushmaprules -.. _Recompile: #compilecrushmap -.. _Set the CRUSH map: #setcrushmap -.. _Set Pool Values: ../pools#setpoolvalues - -.. _getcrushmap: - -Get a CRUSH Map ---------------- - -To get the CRUSH map for your cluster, execute the following:: - - ceph osd getcrushmap -o {compiled-crushmap-filename} - -Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since -the CRUSH map is in a compiled form, you must decompile it first before you can -edit it. - -.. _decompilecrushmap: - -Decompile a CRUSH Map ---------------------- - -To decompile a CRUSH map, execute the following:: - - crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} - -Ceph will decompile (-d) the compiled CRUSH map and output (-o) it to the -filename you specified. - - -.. _compilecrushmap: - -Compile a CRUSH Map -------------------- - -To compile a CRUSH map, execute the following:: - - crushtool -c {decompiled-crush-map-filename} -o {compiled-crush-map-filename} - -Ceph will store a compiled CRUSH map to the filename you specified. - - -.. _setcrushmap: - -Set a CRUSH Map ---------------- - -To set the CRUSH map for your cluster, execute the following:: - - ceph osd setcrushmap -i {compiled-crushmap-filename} - -Ceph will input the compiled CRUSH map of the filename you specified as the -CRUSH map for the cluster. - - - -CRUSH Map Parameters -==================== - -There are four main sections to a CRUSH Map. - -#. **Devices:** Devices consist of any object storage device--i.e., the storage - drive corresponding to a ``ceph-osd`` daemon. You should have a device for - each OSD daemon in your Ceph configuration file. - -#. **Bucket Types**: Bucket ``types`` define the types of buckets used in your - CRUSH hierarchy. Buckets consist of a hierarchical aggregation of storage - locations (e.g., rows, racks, chassis, hosts, etc.) and their assigned - weights. - -#. **Bucket Instances:** Once you define bucket types, you must declare bucket - instances for your hosts, and any other failure domain partitioning - you choose. - -#. **Rules:** Rules consist of the manner of selecting buckets. - -If you launched Ceph using one of our Quick Start guides, you'll notice -that you didn't need to create a CRUSH map. Ceph's deployment tools generate -a default CRUSH map that lists devices from the OSDs you defined in your -Ceph configuration file, and it declares a bucket for each host you specified -in the ``[osd]`` sections of your Ceph configuration file. You should create -your own CRUSH maps with buckets that reflect your cluster's failure domains -to better ensure data safety and availability. +CRUSH structure +=============== -.. note:: The generated CRUSH map doesn't take your larger grained failure - domains into account. So you should modify your CRUSH map to account for - larger grained failure domains such as chassis, racks, rows, data - centers, etc. +The CRUSH map consists of, loosely speaking, a hierarchy describing +the physical topology of the cluster, and a set of rules defining +policy about how we place data on those devices. The hierarchy has +devices (``ceph-osd`` daemons) at the leaves, and internal nodes +corresponding to other physical features or groupings: hosts, racks, +rows, datacenters, and so on. The rules describe how replicas are +placed in terms of that hierarchy (e.g., 'three replicas in different +racks'). +Devices +------- +Devices are individual ``ceph-osd`` daemons that can store data. You +will normally have one defined here for each OSD daemon in your +cluster. Devices are identified by an id (a non-negative integer) and +a name, normally ``osd.N`` where ``N`` is the device id. -.. _crushmapdevices: +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targetted by a +crush rule. -CRUSH Map Devices +Types and Buckets ----------------- -To map placement groups to OSDs, a CRUSH map requires a list of OSD devices -(i.e., the names of the OSD daemons from the Ceph configuration file). The list -of devices appears first in the CRUSH map. To declare a device in the CRUSH map, -create a new line under your list of devices, enter ``device`` followed by a -unique numeric ID, followed by the corresponding ``ceph-osd`` daemon instance. -The device class can optionaly be added to group devices so they can be -conveniently targetted by a crush rule. - -:: - - #devices - device {num} {osd.name} [class {class}] - -For example:: - - #devices - device 0 osd.0 class ssd - device 1 osd.1 class hdd - device 2 osd.2 - device 3 osd.3 - -As a general rule, an OSD daemon maps to a single storage drive or to a RAID. - - -CRUSH Map Bucket Types ----------------------- - -The second list in the CRUSH map defines 'bucket' types. Buckets facilitate -a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent -physical locations in a hierarchy. Nodes aggregate other nodes or leaves. -Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage -media. - -.. tip:: The term "bucket" used in the context of CRUSH means a node in - the hierarchy, i.e. a location or a piece of physical hardware. It - is a different concept from the term "bucket" when used in the - context of RADOS Gateway APIs. - -To add a bucket type to the CRUSH map, create a new line under your list of -bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. -By convention, there is one leaf bucket and it is ``type 0``; however, you may -give it any name you like (e.g., osd, disk, drive, storage, etc.):: - - #types - type {num} {bucket-name} - -For example:: - - # types - type 0 osd - type 1 host - type 2 chassis - type 3 rack - type 4 row - type 5 pdu - type 6 pod - type 7 room - type 8 datacenter - type 9 region - type 10 root - - - -.. _crushmapbuckets: - -CRUSH Map Bucket Hierarchy --------------------------- - -The CRUSH algorithm distributes data objects among storage devices according -to a per-device weight value, approximating a uniform probability distribution. -CRUSH distributes objects and their replicas according to the hierarchical -cluster map you define. Your CRUSH map represents the available storage -devices and the logical elements that contain them. - -To map placement groups to OSDs across failure domains, a CRUSH map defines a -hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH -map). The purpose of creating a bucket hierarchy is to segregate the -leaf nodes by their failure domains, such as hosts, chassis, racks, power -distribution units, pods, rows, rooms, and data centers. With the exception of -the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and -you may define it according to your own needs. - -We recommend adapting your CRUSH map to your firms's hardware naming conventions -and using instances names that reflect the physical hardware. Your naming -practice can make it easier to administer the cluster and troubleshoot -problems when an OSD and/or other hardware malfunctions and the administrator -need access to physical hardware. - -In the following example, the bucket hierarchy has a leaf bucket named ``osd``, -and two node buckets named ``host`` and ``rack`` respectively. - -.. ditaa:: - +-----------+ - | {o}rack | - | Bucket | - +-----+-----+ +A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, +racks, rows, etc. The CRUSH map defines a series of *types* that are +used to describe these nodes. By default, these types include: + +- osd (or device) +- host +- chassis +- rack +- row +- pdu +- pod +- room +- datacenter +- region +- root + +Most clusters make use of only a handful of these types, and others +can be defined as needed. + +The hierarchy is built with devices (normally type ``osd``) at the +leaves, interior nodes with non-device types, and a root node of type +``root``. For example, + +.. ditaa:: + + +-----------------+ + | {o}root default | + +--------+--------+ | +---------------+---------------+ | | - +-----+-----+ +-----+-----+ - | {o}host | | {o}host | - | Bucket | | Bucket | - +-----+-----+ +-----+-----+ + +-------+-------+ +-----+-------+ + | {o}host foo | | {o}host bar | + +-------+-------+ +-----+-------+ | | +-------+-------+ +-------+-------+ | | | | +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ - | osd | | osd | | osd | | osd | - | Bucket | | Bucket | | Bucket | | Bucket | + | osd.0 | | osd.1 | | osd.2 | | osd.3 | +-----------+ +-----------+ +-----------+ +-----------+ -.. note:: The higher numbered ``rack`` bucket type aggregates the lower - numbered ``host`` bucket type. - -Since leaf nodes reflect storage devices declared under the ``#devices`` list -at the beginning of the CRUSH map, you do not need to declare them as bucket -instances. The second lowest bucket type in your hierarchy usually aggregates -the devices (i.e., it's usually the computer containing the storage media, and -uses whatever term you prefer to describe it, such as "node", "computer", -"server," "host", "machine", etc.). In high density environments, it is -increasingly common to see multiple hosts/nodes per chassis. You should account -for chassis failure too--e.g., the need to pull a chassis if a node fails may -result in bringing down numerous hosts/nodes and their OSDs. - -When declaring a bucket instance, you must specify its type, give it a unique -name (string), assign it a unique ID expressed as a negative integer (optional), -specify a weight relative to the total capacity/capability of its item(s), -specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``, -reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. -The items may consist of node buckets or leaves. Items may have a weight that -reflects the relative weight of the item. - -You may declare a node bucket with the following syntax:: - - [bucket-type] [bucket-name] { - id [a unique negative numeric ID] - weight [the relative capacity/capability of the item(s)] - alg [the bucket type: uniform | list | tree | straw ] - hash [the hash type: 0 by default] - item [item-name] weight [weight] - } - -For example, using the diagram above, we would define two host buckets -and one rack bucket. The OSDs are declared as items within the host buckets:: - - host node1 { - id -1 - alg straw - hash 0 - item osd.0 weight 1.00 - item osd.1 weight 1.00 - } - - host node2 { - id -2 - alg straw - hash 0 - item osd.2 weight 1.00 - item osd.3 weight 1.00 - } - - rack rack1 { - id -3 - alg straw - hash 0 - item node1 weight 2.00 - item node2 weight 2.00 - } - -.. note:: In the foregoing example, note that the rack bucket does not contain - any OSDs. Rather it contains lower level host buckets, and includes the - sum total of their weight in the item entry. - -.. topic:: Bucket Types - - Ceph supports four bucket types, each representing a tradeoff between - performance and reorganization efficiency. If you are unsure of which bucket - type to use, we recommend using a ``straw`` bucket. For a detailed - discussion of bucket types, refer to - `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, - and more specifically to **Section 3.4**. The bucket types are: +Each node (device or bucket) in the hierarchy has a *weight* +associated with it, indicating the relative proportion of the total +data that device or hierarchy subtree should store. Weights are set +at the leaves, indicating the size of the device, and automatically +sum up the tree from there, such that the weight of the default node +will be the total of all devices contained beneath it. Normally +weights are in units of terabytes (TB). + +You can get a simple view the CRUSH hierarchy for your cluster, +including the weights, with:: + + ceph osd crush tree + +Rules +----- + +Rules define policy about how data is distributed across the devices +in the hierarchy. + +CRUSH rules define placement and replication strategies or +distribution policies that allow you to specify exactly how CRUSH +places object replicas. For example, you might create a rule selecting +a pair of targets for 2-way mirroring, another rule for selecting +three targets in two different data centers for 3-way mirroring, and +yet another rule for erasure coding over six storage devices. For a +detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, +Scalable, Decentralized Placement of Replicated Data`_, and more +specifically to **Section 3.2**. + +In almost all cases, CRUSH rules can be created via the CLI by +specifying the *pool type* they will be used for (replicated or +erasure coded), the *failure domain*, and optionally a *device class*. +In rare cases rules must be written by hand by manually editing the +CRUSH map. - #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same - weight. For example, when firms commission or decommission hardware, they - typically do so with many machines that have exactly the same physical - configuration (e.g., bulk purchases). When storage devices have exactly - the same weight, you may use the ``uniform`` bucket type, which allows - CRUSH to map replicas into uniform buckets in constant time. With - non-uniform weights, you should use another bucket algorithm. - - #. **List**: List buckets aggregate their content as linked lists. Based on - the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, - a list is a natural and intuitive choice for an **expanding cluster**: - either an object is relocated to the newest device with some appropriate - probability, or it remains on the older devices as before. The result is - optimal data migration when items are added to the bucket. Items removed - from the middle or tail of the list, however, can result in a significant - amount of unnecessary movement, making list buckets most suitable for - circumstances in which they **never (or very rarely) shrink**. - - #. **Tree**: Tree buckets use a binary search tree. They are more efficient - than list buckets when a bucket contains a larger set of items. Based on - the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, - tree buckets reduce the placement time to O(log :sub:`n`), making them - suitable for managing much larger sets of devices or nested buckets. - - #. **Straw:** List and Tree buckets use a divide and conquer strategy - in a way that either gives certain items precedence (e.g., those - at the beginning of a list) or obviates the need to consider entire - subtrees of items at all. That improves the performance of the replica - placement process, but can also introduce suboptimal reorganization - behavior when the contents of a bucket change due an addition, removal, - or re-weighting of an item. The straw bucket type allows all items to - fairly “compete” against each other for replica placement through a - process analogous to a draw of straws. - -.. topic:: Hash - - Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. - Enter ``0`` as your hash setting to select ``rjenkins1``. - - -.. _weightingbucketitems: - -.. topic:: Weighting Bucket Items - - Ceph expresses bucket weights as doubles, which allows for fine - weighting. A weight is the relative difference between device capacities. We - recommend using ``1.00`` as the relative weight for a 1TB storage device. - In such a scenario, a weight of ``0.5`` would represent approximately 500GB, - and a weight of ``3.00`` would represent approximately 3TB. Higher level - buckets have a weight that is the sum total of the leaf items aggregated by - the bucket. - - A bucket item weight is one dimensional, but you may also calculate your - item weights to reflect the performance of the storage drive. For example, - if you have many 1TB drives where some have relatively low data transfer - rate and the others have a relatively high data transfer rate, you may - weight them differently, even though they have the same capacity (e.g., - a weight of 0.80 for the first set of drives with lower total throughput, - and 1.20 for the second set of drives with higher total throughput). - - -.. _crushmaprules: - -CRUSH Map Rules ---------------- - -CRUSH maps support the notion of 'CRUSH rules', which are the rules that -determine data placement for a pool. For large clusters, you will likely create -many pools where each pool may have its own CRUSH ruleset and rules. The default -CRUSH map has a rule for each pool, and one ruleset assigned to each of the -default pools. - -.. note:: In most cases, you will not need to modify the default rules. When - you create a new pool, its default ruleset is ``0``. - - -CRUSH rules define placement and replication strategies or distribution policies -that allow you to specify exactly how CRUSH places object replicas. For -example, you might create a rule selecting a pair of targets for 2-way -mirroring, another rule for selecting three targets in two different data -centers for 3-way mirroring, and yet another rule for erasure coding over six -storage devices. For a detailed discussion of CRUSH rules, refer to -`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, -and more specifically to **Section 3.2**. - -A rule takes the following form:: - - rule { - - ruleset - type [ replicated | erasure ] - min_size - max_size - step take [class ] - step [choose|chooseleaf] [firstn|indep] - step emit - } - - -``ruleset`` +You can see what rules are defined for your cluster with:: -:Description: A means of classifying a rule as belonging to a set of rules. - Activated by `setting the ruleset in a pool`_. + ceph osd crush rule ls -:Purpose: A component of the rule mask. -:Type: Integer -:Required: Yes -:Default: 0 - -.. _setting the ruleset in a pool: ../pools#setpoolvalues - - -``type`` - -:Description: Describes a rule for either a storage drive (replicated) - or a RAID. - -:Purpose: A component of the rule mask. -:Type: String -:Required: Yes -:Default: ``replicated`` -:Valid Values: Currently only ``replicated`` and ``erasure`` - -``min_size`` - -:Description: If a pool makes fewer replicas than this number, CRUSH will - **NOT** select this rule. - -:Type: Integer -:Purpose: A component of the rule mask. -:Required: Yes -:Default: ``1`` - -``max_size`` - -:Description: If a pool makes more replicas than this number, CRUSH will - **NOT** select this rule. - -:Type: Integer -:Purpose: A component of the rule mask. -:Required: Yes -:Default: 10 - - -``step take [class ]`` - -:Description: Takes a bucket name, and begins iterating down the tree. - If the ``device-class`` is specified, it must match - a class previously used when defining a device. All - devices that do not belong to the class are excluded. -:Purpose: A component of the rule. -:Required: Yes -:Example: ``step take data`` +You can view the contents of the rules with:: + ceph osd crush rule dump -``step choose firstn {num} type {bucket-type}`` - -:Description: Selects the number of buckets of the given type. The number is - usually the number of replicas in the pool (i.e., pool size). - - - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). - - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. - - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. - -:Purpose: A component of the rule. -:Prerequisite: Follows ``step take`` or ``step choose``. -:Example: ``step choose firstn 1 type row`` - - -``step chooseleaf firstn {num} type {bucket-type}`` - -:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf - node from the subtree of each bucket in the set of buckets. The - number of buckets in the set is usually the number of replicas in - the pool (i.e., pool size). - - - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). - - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. - - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. - -:Purpose: A component of the rule. Usage removes the need to select a device using two steps. -:Prerequisite: Follows ``step take`` or ``step choose``. -:Example: ``step chooseleaf firstn 0 type row`` - - - -``step emit`` - -:Description: Outputs the current value and empties the stack. Typically used - at the end of a rule, but may also be used to pick from different - trees in the same rule. - -:Purpose: A component of the rule. -:Prerequisite: Follows ``step choose``. -:Example: ``step emit`` - -.. important:: To activate one or more rules with a common ruleset number to a - pool, set the ruleset number of the pool. - - - -Primary Affinity -================ - -When a Ceph Client reads or writes data, it always contacts the primary OSD in -the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an -OSD isn't well suited to act as a primary compared to other OSDs (e.g., it has -a slow disk or a slow controller). To prevent performance bottlenecks -(especially on read operations) while maximizing utilization of your hardware, -you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use -the OSD as a primary in an acting set. :: - - ceph osd primary-affinity - -Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You -may set the OSD primary range from ``0-1``, where ``0`` means that the OSD may -**NOT** be used as a primary and ``1`` means that an OSD may be used as a -primary. When the weight is ``< 1``, it is less likely that CRUSH will select -the Ceph OSD Daemon to act as a primary. +Weights sets +------------ -Placing Different Pools on Different OSDS: -========================================== - -Suppose you want to have most pools default to OSDs backed by large hard drives, -but have some pools mapped to OSDs backed by fast solid-state drives (SSDs). -It's possible to have multiple independent CRUSH hierarchies within the same -CRUSH map. Define two hierarchies with two different root nodes--one for hard -disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown -below:: - - device 0 osd.0 - device 1 osd.1 - device 2 osd.2 - device 3 osd.3 - device 4 osd.4 - device 5 osd.5 - device 6 osd.6 - device 7 osd.7 - - host ceph-osd-ssd-server-1 { - id -1 - alg straw - hash 0 - item osd.0 weight 1.00 - item osd.1 weight 1.00 - } - - host ceph-osd-ssd-server-2 { - id -2 - alg straw - hash 0 - item osd.2 weight 1.00 - item osd.3 weight 1.00 - } - - host ceph-osd-platter-server-1 { - id -3 - alg straw - hash 0 - item osd.4 weight 1.00 - item osd.5 weight 1.00 - } - - host ceph-osd-platter-server-2 { - id -4 - alg straw - hash 0 - item osd.6 weight 1.00 - item osd.7 weight 1.00 - } - - root platter { - id -5 - alg straw - hash 0 - item ceph-osd-platter-server-1 weight 2.00 - item ceph-osd-platter-server-2 weight 2.00 - } - - root ssd { - id -6 - alg straw - hash 0 - item ceph-osd-ssd-server-1 weight 2.00 - item ceph-osd-ssd-server-2 weight 2.00 - } - - rule data { - ruleset 0 - type replicated - min_size 2 - max_size 2 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule metadata { - ruleset 1 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule rbd { - ruleset 2 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule platter { - ruleset 3 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule ssd { - ruleset 4 - type replicated - min_size 0 - max_size 4 - step take ssd - step chooseleaf firstn 0 type host - step emit - } - - rule ssd-primary { - ruleset 5 - type replicated - min_size 5 - max_size 10 - step take ssd - step chooseleaf firstn 1 type host - step emit - step take platter - step chooseleaf firstn -1 type host - step emit - } - -You can then set a pool to use the SSD rule by:: - - ceph osd pool set crush_ruleset 4 - -Similarly, using the ``ssd-primary`` rule will cause each placement group in the -pool to be placed with an SSD as the primary and platters as the replicas. +A *weight set* is an alternative set of weights to use when +calculating data placement. The normal weights associated with each +device in the CRUSH map are set based on the device size and indicate +how much data we *should* be storing where. However, because CRUSH is +based on a pseudorandom placement process, there is always some +variation from this ideal distribution, the same way that rolling a +die sixty times will not result in rolling exactly 10 ones and 10 +sixes. Weight sets allow the cluster to do a numerical optimization +based on the specifics of your cluster (hierarchy, pools, etc.) to achieve +a balanced distribution. + +There are two types of weight sets supported: + + #. A **compat** weight set is a single alternative set of weights for + each device and node in the cluster. This is not well-suited for + correcting for all anomalies (for example, placement groups for + different pools may be different sizes and have different load + levels, but will be mostly treated the same by the balancer). + However, compat weight sets have the huge advantage that they are + *backward compatible* with previous versions of Ceph, which means + that even though weight sets were first introduced in Luminous + v12.2.z, older clients (e.g., firefly) can still connect to the + cluster when a compat weight set is being used to balance data. + #. A **per-pool** weight set is more flexible in that it allows + placement to be optimized for each data pool. Additionally, + weights can be adjusted for each position of placement, allowing + the optimizer to correct for a suble skew of data toward devices + with small weights relative to their peers (and effect that is + usually only apparently in very large clusters but which can cause + balancing problems). + +When weight sets are in use, the weights associated with each node in +the hierarchy is visible as a separate column (labeled either +``(compat)`` or the pool name) from the command:: + + ceph osd crush tree + +Although weight sets can be set up and manipulated by hand, it is +recommended that the *balancer* module be enabled to do so +automatically. + + +Modifying the CRUSH map +======================= .. _addosd: Add/Move an OSD -=============== +--------------- -To add or move an OSD in the CRUSH map of a running cluster, execute the -``ceph osd crush set``. For Argonaut (v 0.48), execute the following:: +.. note: OSDs are normally automatically added to the CRUSH map when + the OSD is created. This command is rarely needed. - ceph osd crush set {id} {name} {weight} pool={pool-name} [{bucket-type}={bucket-name} ...] - -For Bobtail (v 0.56), execute the following:: +To add or move an OSD in the CRUSH map of a running cluster:: - ceph osd crush set {id-or-name} {weight} root={pool-name} [{bucket-type}={bucket-name} ...] + ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] Where: -``id`` - -:Description: The numeric ID of the OSD. -:Type: Integer -:Required: Yes -:Example: ``0`` - - ``name`` :Description: The full name of the OSD. @@ -794,7 +279,7 @@ Where: ``weight`` -:Description: The CRUSH weight for the OSD. +:Description: The CRUSH weight for the OSD, normally its size measure in terabytes (TB). :Type: Double :Required: Yes :Example: ``2.0`` @@ -802,7 +287,7 @@ Where: ``root`` -:Description: The root of the tree in which the OSD resides. +:Description: The root node of the tree in which the OSD resides (normally ``default``) :Type: Key/value pair. :Required: Yes :Example: ``root=default`` @@ -816,19 +301,23 @@ Where: :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` -The following example adds ``osd.0`` to the hierarchy, or moves the OSD from a -previous location. :: +The following example adds ``osd.0`` to the hierarchy, or moves the +OSD from a previous location. :: - ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 -Adjust an OSD's CRUSH Weight -============================ +Adjust OSD weight +----------------- + +.. note: Normally OSDs automatically add themselves to the CRUSH map + with the correct weight when they are created. This command + is rarely needed. To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute the following:: - ceph osd crush reweight {name} {weight} + ceph osd crush reweight {name} {weight} Where: @@ -851,11 +340,15 @@ Where: .. _removeosd: Remove an OSD -============= +------------- + +.. note: OSDs are normally removed from the CRUSH as part of the + ``ceph osd purge`` command. This command is rarely needed. -To remove an OSD from the CRUSH map of a running cluster, execute the following:: +To remove an OSD from the CRUSH map of a running cluster, execute the +following:: - ceph osd crush remove {name} + ceph osd crush remove {name} Where: @@ -866,12 +359,21 @@ Where: :Required: Yes :Example: ``osd.0`` + Add a Bucket -============ +------------ + +.. note: Buckets are normally implicitly created when an OSD is added + that specifies a ``{bucket-type}={bucket-name}`` as part of its + location and a bucket with that name does not already exist. This + command is typically used when manually adjusting the structure of the + hierarchy after OSDs have been created (for example, to move a + series of hosts underneath a new rack-level bucket). -To add a bucket in the CRUSH map of a running cluster, execute the ``ceph osd crush add-bucket`` command:: +To add a bucket in the CRUSH map of a running cluster, execute the +``ceph osd crush add-bucket`` command:: - ceph osd crush add-bucket {bucket-name} {bucket-type} + ceph osd crush add-bucket {bucket-name} {bucket-type} Where: @@ -893,15 +395,15 @@ Where: The following example adds the ``rack12`` bucket to the hierarchy:: - ceph osd crush add-bucket rack12 rack + ceph osd crush add-bucket rack12 rack Move a Bucket -============= +------------- -To move a bucket to a different location or position in the CRUSH map hierarchy, -execute the following:: +To move a bucket to a different location or position in the CRUSH map +hierarchy, execute the following:: - ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] Where: @@ -920,11 +422,11 @@ Where: :Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` Remove a Bucket -=============== +--------------- To remove a bucket from the CRUSH map hierarchy, execute the following:: - ceph osd crush remove {bucket-name} + ceph osd crush remove {bucket-name} .. note:: A bucket must be empty before removing it from the CRUSH hierarchy. @@ -939,7 +441,9 @@ Where: The following example removes the ``rack12`` bucket from the hierarchy:: - ceph osd crush remove rack12 + ceph osd crush remove rack12 + + Tunables ======== @@ -1196,6 +700,8 @@ profile. Those are: * ``argonaut``: the legacy values supported by the original argonaut release * ``bobtail``: the values supported by the bobtail release * ``firefly``: the values supported by the firefly release + * ``hammer``: the values supported by the hammer release + * ``jewel``: the values supported by the jewel release * ``optimal``: the best (ie optimal) values of the current version of Ceph * ``default``: the default values of a new cluster installed from scratch. These values, which depend on the current version of Ceph, @@ -1211,40 +717,27 @@ You can select a profile on a running cluster with the command:: Note that this may result in some data movement. -Tuning CRUSH, the hard way --------------------------- - -If you can ensure that all clients are running recent code, you can -adjust the tunables by extracting the CRUSH map, modifying the values, -and reinjecting it into the cluster. - -* Extract the latest CRUSH map:: - - ceph osd getcrushmap -o /tmp/crush - -* Adjust tunables. These values appear to offer the best behavior - for both large and small clusters we tested with. You will need to - additionally specify the ``--enable-unsafe-tunables`` argument to - ``crushtool`` for this to work. Please use this option with - extreme care.:: +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf - crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new -* Reinject modified map:: +Primary Affinity +================ - ceph osd setcrushmap -i /tmp/crush.new +When a Ceph Client reads or writes data, it always contacts the primary OSD in +the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an +OSD isn't well suited to act as a primary compared to other OSDs (e.g., it has +a slow disk or a slow controller). To prevent performance bottlenecks +(especially on read operations) while maximizing utilization of your hardware, +you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use +the OSD as a primary in an acting set. :: -Legacy values -------------- + ceph osd primary-affinity -For reference, the legacy values for the CRUSH tunables can be set -with:: +Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You +may set the OSD primary range from ``0-1``, where ``0`` means that the OSD may +**NOT** be used as a primary and ``1`` means that an OSD may be used as a +primary. When the weight is ``< 1``, it is less likely that CRUSH will select +the Ceph OSD Daemon to act as a primary. - crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy -Again, the special ``--enable-unsafe-tunables`` option is required. -Further, as noted above, be careful running old versions of the -``ceph-osd`` daemon after reverting to legacy values as the feature -bit is not perfectly enforced. -.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst index a21f12eced08..a24e25cc9af5 100644 --- a/doc/rados/operations/index.rst +++ b/doc/rados/operations/index.rst @@ -36,6 +36,7 @@ CRUSH algorithm. cache-tiering placement-groups crush-map + crush-map-edits