From 1ceeab30ebde767aaf675fe46e79b9b7d28b881a Mon Sep 17 00:00:00 2001
From: Zac Dover
Date: Sat, 27 May 2023 04:44:18 +1000
Subject: [PATCH] doc/rados: edit bluestore-config-ref.rst (2 of x)

Edit the second part of
doc/rados/configuration/bluestore-config-ref.rst.

https://tracker.ceph.com/issues/58485

Co-authored-by: Anthony D'Atri
Signed-off-by: Zac Dover
---
 .../configuration/bluestore-config-ref.rst    | 321 +++++++++---------
 1 file changed, 167 insertions(+), 154 deletions(-)

diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst
index bbcedad8ce40c..3707be1aa9bc6 100644
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -279,37 +279,38 @@ configuration option or via the global configuration option. For example:
 
 .. confval:: bluestore_csum_type
 
-inline compression
+Inline Compression
 ==================
 
-bluestore supports inline compression using `snappy`, `zlib`, or
-`lz4`. please note that the `lz4` compression plugin is not
-distributed in the official release.
+BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.
 
-whether data in bluestore is compressed is determined by a combination
-of the *compression mode* and any hints associated with a write
-operation. the modes are:
+Whether data in BlueStore is compressed is determined by two factors: (1) the
+*compression mode* and (2) any client hints associated with a write operation.
+The compression modes are as follows:
 
-* **none**: never compress data.
-* **passive**: do not compress data unless the write operation has a
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
   *compressible* hint set.
-* **aggressive**: compress data unless the write operation has an
+* **aggressive**: Do compress data unless the write operation has an
  *incompressible* hint set.
-* **force**: try to compress data no matter what.
+* **force**: Try to compress data no matter what.
 
-for more information about the *compressible* and *incompressible* io
-hints, see :c:func:`rados_set_alloc_hint`.
+For more information about the *compressible* and *incompressible* I/O hints,
+see :c:func:`rados_set_alloc_hint`.
 
-note that regardless of the mode, if the size of the data chunk is not
-reduced sufficiently it will not be used and the original
-(uncompressed) data will be stored. for example, if the ``bluestore
-compression required ratio`` is set to ``.7`` then the compressed data
-must be 70% of the size of the original (or smaller).
+Note that data in BlueStore will be compressed only if compression reduces the
+size of the data chunk sufficiently (as determined by the ``bluestore
+compression required ratio`` setting). No matter which compression mode is in
+use, if the compressed chunk does not meet this ratio, it is discarded and the
+original (uncompressed) data is stored instead. For example, if ``bluestore
+compression required ratio`` is set to ``.7``, then data compression will take
+place only if the size of the compressed data is no more than 70% of the size
+of the original data.
 
-the *compression mode*, *compression algorithm*, *compression required
-ratio*, *min blob size*, and *max blob size* can be set either via a
-per-pool property or a global config option. pool properties can be
-set with:
+The *compression mode*, *compression algorithm*, *compression required ratio*,
+*min blob size*, and *max blob size* settings can be specified either via a
+per-pool property or via a global config option. To specify pool properties,
+run the following commands:
 
 .. prompt:: bash $
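+
+As a concrete and purely illustrative example, for a hypothetical pool named
+``testpool``, the per-pool settings listed above might be applied as follows
+(the values shown are examples, not recommendations):
+
+.. prompt:: bash $
+
+   ceph osd pool set testpool compression_algorithm snappy
+   ceph osd pool set testpool compression_mode aggressive
+   ceph osd pool set testpool compression_required_ratio .875
+   ceph osd pool set testpool compression_min_blob_size 8192
+   ceph osd pool set testpool compression_max_blob_size 131072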
@@ -331,28 +332,31 @@ set with:
 
 .. _bluestore-rocksdb-sharding:
 
-rocksdb sharding
+RocksDB Sharding
 ================
 
-internally bluestore uses multiple types of key-value data,
-stored in rocksdb. each data type in bluestore is assigned a
-unique prefix. until pacific all key-value data was stored in
-single rocksdb column family: 'default'. since pacific,
-bluestore can divide this data into multiple rocksdb column
-families. when keys have similar access frequency, modification
-frequency and lifetime, bluestore benefits from better caching
-and more precise compaction. this improves performance, and also
-requires less disk space during compaction, since each column
-family is smaller and can compact independent of others.
-
-osds deployed in pacific or later use rocksdb sharding by default.
-if ceph is upgraded to pacific from a previous version, sharding is off.
-
-to enable sharding and apply the pacific defaults, stop an osd and run
+BlueStore maintains several types of internal key-value data, all of which are
+stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
+Prior to the Pacific release, all key-value data was stored in a single RocksDB
+column family: 'default'. In Pacific and later releases, however, BlueStore can
+divide key-value data into several RocksDB column families. BlueStore achieves
+better caching and more precise compaction when keys are similar: specifically,
+when keys have similar access frequency, similar modification frequency, and a
+similar lifetime. Under such conditions, performance is improved and less disk
+space is required during compaction (because each column family is smaller and
+is able to compact independently of the others).
+
+OSDs deployed in Pacific or later releases use RocksDB sharding by default.
+However, if Ceph has been upgraded from an earlier release to Pacific or a
+later release, sharding remains disabled on any OSDs that were created before
+Pacific.
+
+To enable sharding and apply the Pacific defaults to a specific OSD, stop the
+OSD and run the following command:
 
 .. prompt:: bash #
 
-    ceph-bluestore-tool \
+   ceph-bluestore-tool \
     --path <data path> \
     --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
     reshard
@@ -360,7 +364,7 @@ to enable sharding and apply the pacific defaults, stop an osd and run
 
 .. confval:: bluestore_rocksdb_cf
 .. confval:: bluestore_rocksdb_cfs
 
-throttling
+Throttling
 ==========
 
 .. confval:: bluestore_throttle_bytes
@@ -369,167 +373,176 @@ throttling
 .. confval:: bluestore_throttle_cost_per_io_hdd
 .. confval:: bluestore_throttle_cost_per_io_ssd
 
-spdk usage
-==================
+SPDK Usage
+==========
 
-if you want to use the spdk driver for nvme devices, you must prepare your system.
-refer to `spdk document`__ for more details.
+To use the SPDK driver for NVMe devices, you must first prepare your system.
+See the `SPDK document`__ for details.
 
 .. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
 
-spdk offers a script to configure the device automatically. users can run the
-script as root:
+SPDK offers a script that will configure the device automatically. Run this
+script with root permissions:
 
 .. prompt:: bash $
 
   sudo src/spdk/scripts/setup.sh
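+
+Broadly speaking, this script unbinds the targeted NVMe devices from their
+kernel drivers and reserves hugepages for SPDK. As a quick, purely illustrative
+sanity check that hugepages were in fact reserved (the exact counts vary by
+system), you can inspect ``/proc/meminfo``:
+
+.. prompt:: bash $
+
+   grep Huge /proc/meminfo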
 
-you will need to specify the subject nvme device's device selector with
-the "spdk:" prefix for ``bluestore_block_path``.
+You will need to specify the subject NVMe device's device selector with the
+"spdk:" prefix for ``bluestore_block_path``.
 
-for example, you can find the device selector of an intel pcie ssd with:
+In the following example, you first find the device selector of an Intel NVMe
+SSD by running the following command:
 
 .. prompt:: bash $
 
   lspci -mm -n -D -d 8086:0953
 
-the device selector always has the form of ``dddd:bb:dd.ff`` or ``dddd.bb.dd.ff``.
+The form of the device selector is either ``DDDD:BB:DD.FF`` or
+``DDDD.BB.DD.FF``.
 
-and then set::
+Next, supposing that ``0000:01:00.0`` is the device selector found in the
+output of the ``lspci`` command, you can specify the device selector as
+follows::
 
   bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"
 
-where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
-command above.
-
-you may also specify a remote nvmeof target over the tcp transport as in the
+You may also specify a remote NVMeoF target over the TCP transport, as in the
 following example::
 
   bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
 
-to run multiple spdk instances per node, you must specify the
-amount of dpdk memory in mb that each instance will use, to make sure each
-instance uses its own dpdk memory.
+To run multiple SPDK instances per node, you must make sure each instance uses
+its own DPDK memory by specifying for each instance the amount of DPDK memory
+(in MB) that the instance will use.
 
-in most cases, a single device can be used for data, db, and wal. we describe
-this strategy as *colocating* these components. be sure to enter the below
-settings to ensure that all ios are issued through spdk.::
+In most cases, a single device can be used for data, DB, and WAL. We describe
+this strategy as *colocating* these components. Be sure to enter the following
+settings to ensure that all I/Os are issued through SPDK::
 
   bluestore_block_db_path = ""
   bluestore_block_db_size = 0
   bluestore_block_wal_path = ""
   bluestore_block_wal_size = 0
 
-otherwise, the current implementation will populate the spdk map files with
-kernel file system symbols and will use the kernel driver to issue db/wal io.
-
-minimum allocation size
-========================
-
-there is a configured minimum amount of storage that bluestore will allocate on
-an osd. in practice, this is the least amount of capacity that a rados object
-can consume. the value of :confval:`bluestore_min_alloc_size` is derived from the
-value of :confval:`bluestore_min_alloc_size_hdd` or :confval:`bluestore_min_alloc_size_ssd`
-depending on the osd's ``rotational`` attribute. this means that when an osd
-is created on an hdd, bluestore will be initialized with the current value
-of :confval:`bluestore_min_alloc_size_hdd`, and ssd osds (including nvme devices)
-with the value of :confval:`bluestore_min_alloc_size_ssd`.
-
-through the mimic release, the default values were 64kb and 16kb for rotational
-(hdd) and non-rotational (ssd) media respectively. octopus changed the default
-for ssd (non-rotational) media to 4kb, and pacific changed the default for hdd
-(rotational) media to 4kb as well.
-
-these changes were driven by space amplification experienced by ceph rados
-gateway (rgw) deployments that host large numbers of small files
-(s3/swift objects).
-
-for example, when an rgw client stores a 1kb s3 object, it is written to a
-single rados object. with the default :confval:`min_alloc_size` value, 4kb of
-underlying drive space is allocated. this means that roughly
-(4kb - 1kb) == 3kb of that rados object's allocated space is never used, which corresponds to 300%
-overhead or 25% efficiency. similarly, a 5kb user object will be stored
-as one 4kb and one 1kb rados object, again stranding 4kb of device capacity,
-though in this case the overhead is a much smaller percentage. think of this
-in terms of the remainder from a modulus operation. the overhead *percentage*
-thus decreases rapidly as user object size increases.
-
-an easily missed additional subtlety is that this
-takes place for *each* replica. so when using the default three copies of
-data (3r), a 1kb s3 object actually consumes 12kb of storage device
-capacity, with 11kb of overhead. if erasure coding (ec) is used instead of replication, the
-amplification may be even higher: for a ``k=4,m=2`` pool, our 1kb s3 object
-will allocate (6 * 4kb) = 24kb of device capacity.
-
-when an rgw bucket pool contains many relatively large user objects, the effect
-of this phenomenon is often negligible, but should be considered for deployments
-that expect a significant fraction of relatively small objects.
-
-the 4kb default value aligns well with conventional hdd and ssd devices. some
-new coarse-iu (indirection unit) qlc ssds however perform and wear best
-when :confval:`bluestore_min_alloc_size_ssd`
-is set at osd creation to match the device's iu:. 8kb, 16kb, or even 64kb.
-these novel storage drives allow one to achieve read performance competitive
-with conventional tlc ssds and write performance faster than hdds, with
-high density and lower cost than tlc ssds.
-
-note that when creating osds on these devices, one must carefully apply the
-non-default value only to appropriate devices, and not to conventional ssd and
-hdd devices. this may be done through careful ordering of osd creation, custom
-osd device classes, and especially by the use of central configuration _masks_.
-
-quincy and later releases add
-the :confval:`bluestore_use_optimal_io_size_for_min_alloc_size`
-option that enables automatic discovery of the appropriate value as each osd is
-created. note that the use of ``bcache``, ``opencas``, ``dmcrypt``,
-``ata over ethernet``, `iscsi`, or other device layering / abstraction
-technologies may confound the determination of appropriate values. osds
-deployed on top of vmware storage have been reported to also
-sometimes report a ``rotational`` attribute that does not match the underlying
-hardware.
-
-we suggest inspecting such osds at startup via logs and admin sockets to ensure that
-behavior is appropriate. note that this also may not work as desired with
-older kernels. you can check for this by examining the presence and value
-of ``/sys/block/<device>/queue/optimal_io_size``.
-
-you may also inspect a given osd:
+If these settings are not entered, then the current implementation will
+populate the SPDK map files with kernel file system symbols and will use the
+kernel driver to issue DB/WAL I/Os.
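+
+Putting the SPDK settings together, a minimal and purely illustrative
+``ceph.conf`` fragment for such a colocated SPDK OSD might look like this (the
+section name and the PCIe address are examples only)::
+
+    [osd.0]
+        bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"
+        bluestore_block_db_path = ""
+        bluestore_block_db_size = 0
+        bluestore_block_wal_path = ""
+        bluestore_block_wal_size = 0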
+
+Minimum Allocation Size
+=======================
+
+There is a configured minimum amount of storage that BlueStore allocates on an
+underlying storage device. In practice, this is the least amount of capacity
+that even a tiny RADOS object can consume on each OSD's primary device. The
+configuration option in question, :confval:`bluestore_min_alloc_size`, derives
+its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
+:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
+attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
+the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
+(including NVMe devices), BlueStore is initialized with the current value of
+:confval:`bluestore_min_alloc_size_ssd`.
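+
+The ``rotational`` attribute consulted here reflects the kernel's view of the
+underlying device. For a hypothetical device named ``sdb``, a quick check might
+look like this (``1`` indicates rotational/HDD, ``0`` indicates
+non-rotational/SSD):
+
+.. prompt:: bash $
+
+   cat /sys/block/sdb/queue/rotational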
+
+In Mimic and earlier releases, the default values were 64KB for rotational
+media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
+changed the default value for non-rotational media (SSD) to 4KB, and the
+Pacific release changed the default value for rotational media (HDD) to 4KB.
+
+These changes were driven by space amplification that was experienced by Ceph
+RADOS Gateway (RGW) deployments that hosted large numbers of small files
+(S3/Swift objects).
+
+For example, when an RGW client stores a 1 KB S3 object, that object is written
+to a single RADOS object. In accordance with the default
+:confval:`bluestore_min_alloc_size` value, 4 KB of underlying drive space is
+allocated. This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated
+but never used: this corresponds to 300% overhead or 25% efficiency. Similarly,
+a 5 KB user object will be stored as two RADOS objects, a 4 KB RADOS object and
+a 1 KB RADOS object, with the result that 4 KB of device capacity is stranded.
+In this case, however, the overhead percentage is much smaller. Think of this
+in terms of the remainder from a modulus operation. The overhead *percentage*
+thus decreases rapidly as object size increases.
+
+There is an additional subtlety that is easily missed: the amplification
+phenomenon just described takes place for *each* replica. For example, when
+using the default of three copies of data (3R), a 1 KB S3 object actually
+strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
+instead of replication, the amplification might be even higher: for a ``k=4,
+m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
+of device capacity.
+
+When an RGW bucket pool contains many relatively large user objects, the effect
+of this phenomenon is often negligible. However, with deployments that can
+expect a significant fraction of relatively small user objects, the effect
+should be taken into consideration.
+
+The 4KB default value aligns well with conventional HDD and SSD devices.
+However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
+best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
+to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
+storage drives can achieve read performance that is competitive with that of
+conventional TLC SSDs and write performance that is faster than that of HDDs,
+with higher density and lower cost than TLC SSDs.
+
+Note that when creating OSDs on these novel devices, one must be careful to
+apply the non-default value only to appropriate devices, and not to
+conventional HDD and SSD devices. Mistakes can be avoided through careful
+ordering of OSD creation, through custom OSD device classes, and especially by
+the use of central configuration *masks*.
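+
+For instance, if such drives have been assigned a custom device class named
+``qlc`` (both the class name and the value below are illustrative only), a
+central configuration mask might be used so that only OSDs of that class pick
+up the non-default value at creation time:
+
+.. prompt:: bash #
+
+   ceph config set osd/class:qlc bluestore_min_alloc_size_ssd 16384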
+
+In Quincy and later releases, you can use the
+:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
+automatic discovery of the correct value as each OSD is created. Note that the
+use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or
+other device-layering and abstraction technologies might confound the
+determination of correct values. Moreover, OSDs deployed on top of VMware
+storage have sometimes been found to report a ``rotational`` attribute that
+does not match the underlying hardware.
+
+We suggest inspecting such OSDs at startup via logs and admin sockets in order
+to ensure that their behavior is correct. Be aware that this kind of inspection
+might not work as expected with older kernels. To check for this issue, examine
+the presence and value of ``/sys/block/<device>/queue/optimal_io_size``.
+
+.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
+   baked into each OSD is conveniently reported by ``ceph osd metadata``.
+
+To inspect a specific OSD, run the following command:
 
 .. prompt:: bash #
 
-   ceph osd metadata osd.1701 | grep rotational
-
-this space amplification may manifest as an unusually high ratio of raw to
-stored data reported by ``ceph df``. ``ceph osd df`` may also report
-anomalously high ``%use`` / ``var`` values when
-compared to other, ostensibly identical osds. a pool using osds with
-mismatched ``min_alloc_size`` values may experience unexpected balancer
-behavior as well.
+   ceph osd metadata osd.1701 | egrep rotational\|alloc
 
-note that this bluestore attribute takes effect *only* at osd creation; if
-changed later, a given osd's behavior will not change unless / until it is
-destroyed and redeployed with the appropriate option value(s). upgrading
-to a later ceph release will *not* change the value used by osds deployed
-under older releases or with other settings.
+This space amplification might manifest as an unusually high ratio of raw to
+stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
+values reported by ``ceph osd df`` that are unusually high in comparison to
+other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
+behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.
+
+This BlueStore attribute takes effect *only* at OSD creation; if the attribute
+is changed later, a specific OSD's behavior will not change unless and until
+the OSD is destroyed and redeployed with the appropriate option value(s).
+Upgrading to a later Ceph release will *not* change the value used by OSDs that
+were deployed under older releases or with other settings.
 
 .. confval:: bluestore_min_alloc_size
 .. confval:: bluestore_min_alloc_size_hdd
 .. confval:: bluestore_min_alloc_size_ssd
 .. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
 
-dsa (data streaming accelerator usage)
+DSA (Data Streaming Accelerator) Usage
 ======================================
 
-if you want to use the dml library to drive dsa device for offloading
-read/write operations on persist memory in bluestore. you need to install
-`dml`_ and `idxd-config`_ library in your machine with spr (sapphire rapids) cpu.
+If you want to use the DML library to drive the DSA device for offloading
+read/write operations on persistent memory (PMEM) in BlueStore, you need to
+install `DML`_ and the `idxd-config`_ library. This will work only on machines
+that have an SPR (Sapphire Rapids) CPU.
 
 .. _dml: https://github.com/intel/dml
 .. _idxd-config: https://github.com/intel/idxd-config
 
-after installing the dml software, you need to configure the shared
-work queues (wqs) with the following wq configuration example via accel-config tool:
+After installing the DML software, configure the shared work queues (WQs) with
+reference to the following WQ configuration example:
 
 .. prompt:: bash $
-- 
2.39.5