-==========================
-BlueStore Config Reference
-==========================
+==================================
+ BlueStore Configuration Reference
+==================================
Devices
=======
-BlueStore manages either one, two, or (in certain cases) three storage
-devices.
-
-In the simplest case, BlueStore consumes a single (primary) storage device.
-The storage device is normally used as a whole, occupying the full device that
-is managed directly by BlueStore. This *primary device* is normally identified
-by a ``block`` symlink in the data directory.
-
-The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
-when ``ceph-volume`` activates it) with all the common OSD files that hold
-information about the OSD, like: its identifier, which cluster it belongs to,
-and its private keyring.
-
-It is also possible to deploy BlueStore across one or two additional devices:
-
-* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
- used for BlueStore's internal journal or write-ahead log. It is only useful
- to use a WAL device if the device is faster than the primary device (e.g.,
- when it is on an SSD and the primary device is an HDD).
+BlueStore manages either one, two, or in certain cases three storage devices.
+These *devices* are "devices" in the Linux/Unix sense. This means that they are
+assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
+entire storage drive, or a partition of a storage drive, or a logical volume.
+BlueStore does not create or mount a conventional file system on devices that
+it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.
+
+In the simplest case, BlueStore consumes all of a single storage device. This
+device is known as the *primary device*. The primary device is identified by
+the ``block`` symlink in the data directory.
+
+The data directory is a ``tmpfs`` mount. At boot time, or when ``ceph-volume``
+activates the OSD, the data directory is populated with metadata files and
+links that hold information about the OSD: for example, the OSD's identifier,
+the name of the cluster that the OSD belongs to, and the OSD's private keyring.
+
+In more complicated cases, BlueStore is deployed across one or two additional
+devices:
+
+* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
+ directory) can be used to separate out BlueStore's internal journal or
+ write-ahead log. Using a WAL device is advantageous only if the WAL device
+ is faster than the primary device (for example, if the WAL device is an SSD
+ and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
- for storing BlueStore's internal metadata. BlueStore (or rather, the
- embedded RocksDB) will put as much metadata as it can on the DB device to
- improve performance. If the DB device fills up, metadata will spill back
- onto the primary device (where it would have been otherwise). Again, it is
- only helpful to provision a DB device if it is faster than the primary
- device.
-
-If there is only a small amount of fast storage available (e.g., less
-than a gigabyte), we recommend using it as a WAL device. If there is
-more, provisioning a DB device makes more sense. The BlueStore
-journal will always be placed on the fastest device available, so
-using a DB device will provide the same benefit that the WAL device
-would while *also* allowing additional metadata to be stored there (if
-it will fit). This means that if a DB device is specified but an explicit
-WAL device is not, the WAL will be implicitly colocated with the DB on the faster
-device.
-
-A single-device (colocated) BlueStore OSD can be provisioned with:
+ to store BlueStore's internal metadata. BlueStore (or more precisely, the
+ embedded RocksDB) will put as much metadata as it can on the DB device in
+ order to improve performance. If the DB device becomes full, metadata will
+ spill back onto the primary device (where it would have been located in the
+ absence of the DB device). Again, it is advantageous to provision a DB device
+ only if it is faster than the primary device.
+
+If there is only a small amount of fast storage available (for example, less
+than a gigabyte), we recommend using the available space as a WAL device. But
+if more fast storage is available, it makes more sense to provision a DB
+device. Because the BlueStore journal is always placed on the fastest device
+available, using a DB device provides the same benefit that using a WAL device
+would, while *also* allowing additional metadata to be stored off the primary
+device (provided that it fits). DB devices make this possible because whenever
+a DB device is specified but an explicit WAL device is not, the WAL will be
+implicitly colocated with the DB on the faster device.
+
+To provision a single-device (colocated) BlueStore OSD, run the following
+command:
.. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device>
-To specify a WAL device and/or DB device:
+To specify a WAL device or DB device, run the following command:
.. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
-.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
- devices can be existing logical volumes or GPT partitions.
+.. note:: The option ``--data`` can take as its argument any of the
+ following devices: logical volumes specified using *vg/lv* notation,
+ existing logical volumes, and GPT partitions.
+
+
Provisioning strategies
-----------------------
-Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
-which had just one), there are two common arrangements that should help clarify
-the deployment strategy:
+
+BlueStore differs from Filestore in that there are several ways to deploy a
+BlueStore OSD. However, the overall deployment strategy for BlueStore can be
+clarified by examining just these two common arrangements:
.. _bluestore-single-type-device-config:
**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
-If all devices are the same type, for example all rotational drives, and
-there are no fast devices to use for metadata, it makes sense to specify the
-block device only and to not separate ``block.db`` or ``block.wal``. The
-:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like:
+If all devices are of the same type (for example, they are all HDDs), and if
+there are no fast devices available for the storage of metadata, then it makes
+sense to specify the block device only and to leave ``block.db`` and
+``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
+``/dev/sda`` device is as follows:
.. prompt:: bash $
ceph-volume lvm create --bluestore --data /dev/sda
-If logical volumes have already been created for each device, (a single LV
-using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
-``ceph-vg/block-lv`` would look like:
+If the devices to be used for a BlueStore OSD are pre-created logical volumes,
+then the :ref:`ceph-volume-lvm` call for a logical volume named
+``ceph-vg/block-lv`` is as follows:
.. prompt:: bash $
**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
-If you have a mix of fast and slow devices (SSD / NVMe and rotational),
-it is recommended to place ``block.db`` on the faster device while ``block``
-(data) lives on the slower (spinning drive).
-You must create these volume groups and logical volumes manually as
-the ``ceph-volume`` tool is currently not able to do so automatically.
+If you have a mix of fast and slow devices (for example, SSD and HDD), then we
+recommend placing ``block.db`` on the faster device while ``block`` (that is,
+the data) is stored on the slower device (that is, the rotational drive).
-For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``)
-and one (fast) solid state drive (``sdx``). First create the volume groups:
+You must create these volume groups and these logical volumes manually,
+because the ``ceph-volume`` tool is currently unable to create them
+automatically.
+
+The following procedure illustrates the manual creation of volume groups and
+logical volumes. For this example, we shall assume four rotational drives
+(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
+to create the volume groups, run the following commands:
.. prompt:: bash $
vgcreate ceph-block-2 /dev/sdc
vgcreate ceph-block-3 /dev/sdd
-Now create the logical volumes for ``block``:
+Next, to create the logical volumes for ``block``, run the following commands:
.. prompt:: bash $
lvcreate -l 100%FREE -n block-2 ceph-block-2
lvcreate -l 100%FREE -n block-3 ceph-block-3
-We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
-SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
+Because there are four HDDs, there will be four OSDs. Supposing that there is a
+200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
+the following commands:
.. prompt:: bash $
lvcreate -L 50GB -n db-2 ceph-db-0
lvcreate -L 50GB -n db-3 ceph-db-0
-Finally, create the 4 OSDs with ``ceph-volume``:
+Finally, to create the four OSDs, run the following commands:
.. prompt:: bash $
ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
-These operations should end up creating four OSDs, with ``block`` on the slower
-rotational drives with a 50 GB logical volume (DB) for each on the solid state
-drive.
+After this procedure is finished, there should be four OSDs, ``block`` should
+be on the four HDDs, and each OSD should have a 50GB logical volume
+(specifically, a DB device) on the shared SSD.
Sizing
======
-When using a :ref:`mixed spinning and solid drive setup
-<bluestore-mixed-device-config>` it is important to make a large enough
-``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have
-*as large as possible* logical volumes.
-
-The general recommendation is to have ``block.db`` size in between 1% to 4%
-of ``block`` size. For RGW workloads, it is recommended that the ``block.db``
-size isn't smaller than 4% of ``block``, because RGW heavily uses it to store
-metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
-be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough.
-
-In older releases, internal level sizes mean that the DB can fully utilize only
-specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
-etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
-so forth. Most deployments will not substantially benefit from sizing to
-accommodate L3 and higher, though DB compaction can be facilitated by doubling
-these figures to 6GB, 60GB, and 600GB.
-
-Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
-enable better utilization of arbitrary DB device sizes, and the Pacific
-release brings experimental dynamic level support. Users of older releases may
-thus wish to plan ahead by provisioning larger DB devices today so that their
-benefits may be realized with future upgrades.
-
-When *not* using a mix of fast and slow devices, it isn't required to create
-separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
-automatically colocate these within the space of ``block``.
-
+When using a :ref:`mixed spinning-and-solid-drive setup
+<bluestore-mixed-device-config>`, it is important to make a large enough
+``block.db`` logical volume for BlueStore. The logical volume associated with
+``block.db`` should be *as large as possible*.
+
+It is generally recommended that the size of ``block.db`` be somewhere between
+1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
+the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
+use of ``block.db`` to store metadata (in particular, omap keys). For example,
+if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
+40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
+2% of the ``block`` size.
+
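+As an illustration of the 4% guideline, a 4TB ``block`` device calls for a
+``block.db`` logical volume of roughly 160GB. Assuming a volume group named
+``ceph-db-0`` (the name here is only an example), such a volume could be
+created by running the following command:
+
+.. prompt:: bash $
+
+   lvcreate -L 160GB -n db-0 ceph-db-0
+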
+In older releases, internal level sizes are such that the DB can fully utilize
+only those specific partition / logical volume sizes that correspond to sums of
+L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
+3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
+sizing that accommodates L3 and higher, though DB compaction can be facilitated
+by doubling these figures to 6GB, 60GB, and 600GB.
+
+Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
+for better utilization of arbitrarily sized DB devices. Moreover, the Pacific
+release brings experimental dynamic-level support. Because of these advances,
+users of older releases might want to plan ahead by provisioning larger DB
+devices today so that the benefits of scale can be realized when upgrades are
+made in the future.
+
+When *not* using a mix of fast and slow devices, there is no requirement to
+create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
+will automatically colocate these devices within the space of ``block``.
Automatic Cache Sizing
======================
-BlueStore can be configured to automatically resize its caches when TCMalloc
-is configured as the memory allocator and the ``bluestore_cache_autotune``
-setting is enabled. This option is currently enabled by default. BlueStore
-will attempt to keep OSD heap memory usage under a designated target size via
-the ``osd_memory_target`` configuration option. This is a best effort
-algorithm and caches will not shrink smaller than the amount specified by
-``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
-of priorities. If priority information is not available, the
-``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
-used as fallbacks.
+BlueStore can be configured to automatically resize its caches, provided that
+certain conditions are met: TCMalloc must be configured as the memory allocator
+and the ``bluestore_cache_autotune`` configuration option must be enabled (note
+that it is currently enabled by default). When automatic cache sizing is in
+effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
+size (as determined by ``osd_memory_target``). This approach makes use of a
+best-effort algorithm and caches do not shrink smaller than the size defined by
+the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
+with a hierarchy of priorities. But if priority information is not available,
+the values specified in the ``bluestore_cache_meta_ratio`` and
+``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
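+
+For example, to give every OSD a memory target of 6 GiB (a value chosen purely
+for illustration), you could run the following command:
+
+.. prompt:: bash $
+
+   ceph config set osd osd_memory_target 6442450944
+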
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
Manual Cache Sizing
===================
-The amount of memory consumed by each OSD for BlueStore caches is
-determined by the ``bluestore_cache_size`` configuration option. If
-that config option is not set (i.e., remains at 0), there is a
-different default value that is used depending on whether an HDD or
-SSD is used for the primary device (set by the
-``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
-options).
+The amount of memory consumed by each OSD for its BlueStore cache is
+determined by the ``bluestore_cache_size`` configuration option. If that option
+has not been specified (that is, if it remains at 0), then Ceph uses a
+different configuration option to determine the default memory budget:
+``bluestore_cache_size_hdd`` if the primary device is an HDD, or
+``bluestore_cache_size_ssd`` if the primary device is an SSD.
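+
+For example, to give HDD-backed OSDs an explicit 2 GiB cache (again, a value
+chosen purely for illustration), you could run the following command:
+
+.. prompt:: bash $
+
+   ceph config set osd bluestore_cache_size_hdd 2147483648
+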
-BlueStore and the rest of the Ceph OSD daemon do the best they can
-to work within this memory budget. Note that on top of the configured
-cache size, there is also memory consumed by the OSD itself, and
-some additional utilization due to memory fragmentation and other
-allocator overhead.
+BlueStore and the rest of the Ceph OSD daemon make every effort to work within
+this memory budget. Note that in addition to the configured cache size, there
+is also memory consumed by the OSD itself. There is additional utilization due
+to memory fragmentation and other allocator overhead.
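+
+To see how the memory of a running OSD is actually divided up, you can query
+the OSD's admin socket (the OSD id ``0`` below is only an example):
+
+.. prompt:: bash #
+
+   ceph daemon osd.0 dump_mempools
+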
-The configured cache memory budget can be used in a few different ways:
+The configured cache-memory budget can be used to store the following types of
+data:
-* Key/Value metadata (i.e., RocksDB's internal cache)
+* Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata
-* BlueStore data (i.e., recently read or written object data)
+* BlueStore data (that is, recently read or recently written object data)
-Cache memory usage is governed by the following options:
-``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``.
-The fraction of the cache devoted to data
-is governed by the effective bluestore cache size (depending on
-``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
-device) as well as the meta and kv ratios.
-The data fraction can be calculated by
-``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``
+Cache memory usage is governed by the configuration options
+``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
+of the cache that is reserved for data is governed by both the effective
+BlueStore cache size (which depends on the relevant
+``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
+device) and the "meta" and "kv" ratios. This data fraction can be calculated
+with the following formula: ``<effective_cache_size> * (1 -
+bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
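+
+For example, given an effective cache size of 3 GiB, a meta ratio of 0.45, and
+a kv ratio of 0.45 (both ratios here are purely illustrative), the data
+fraction would be 3 GiB * (1 - 0.45 - 0.45) = 0.3 GiB.
+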
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
Checksums
=========
-BlueStore checksums all metadata and data written to disk. Metadata
-checksumming is handled by RocksDB and uses `crc32c`. Data
-checksumming is done by BlueStore and can make use of `crc32c`,
-`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
-suitable for most purposes.
-
-Full data checksumming does increase the amount of metadata that
-BlueStore must store and manage. When possible, e.g., when clients
-hint that data is written and read sequentially, BlueStore will
-checksum larger blocks, but in many cases it must store a checksum
-value (usually 4 bytes) for every 4 kilobyte block of data.
-
-It is possible to use a smaller checksum value by truncating the
-checksum to two or one byte, reducing the metadata overhead. The
-trade-off is that the probability that a random error will not be
-detected is higher with a smaller checksum, going from about one in
-four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
-16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
-The smaller checksum values can be used by selecting `crc32c_16` or
-`crc32c_8` as the checksum algorithm.
-
-The *checksum algorithm* can be set either via a per-pool
-``csum_type`` property or the global config option. For example:
+BlueStore checksums all metadata and all data written to disk. Metadata
+checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
+contrast, data checksumming is handled by BlueStore and can use `crc32c`,
+`xxhash32`, or `xxhash64`. The default checksum algorithm is `crc32c`, which
+is suitable for most purposes.
+
+Full data checksumming increases the amount of metadata that BlueStore must
+store and manage. Whenever possible (for example, when clients hint that data
+is written and read sequentially), BlueStore will checksum larger blocks. In
+many cases, however, it must store a checksum value (usually 4 bytes) for every
+4 KB block of data.
+
+It is possible to use a smaller checksum value by truncating the checksum to
+one or two bytes, thereby reducing the metadata overhead. The drawback of this
+approach is that it increases the probability of a random error going
+undetected: from about one in four billion with a 32-bit (4 byte) checksum to
+one in 65,536 with a 16-bit (2 byte) checksum or one in 256 with an 8-bit
+(1 byte) checksum. To use the smaller checksum values, select `crc32c_16` or
+`crc32c_8` as the checksum algorithm.
+
+The *checksum algorithm* can be specified either via a per-pool ``csum_type``
+property or via the global configuration option. For example:
.. prompt:: bash $
.. confval:: bluestore_csum_type
-Inline Compression
+Inline Compression
==================
-BlueStore supports inline compression using `snappy`, `zlib`, or
-`lz4`. Please note that the `lz4` compression plugin is not
+BlueStore supports inline compression using `snappy`, `zlib`, or
+`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.
-Whether data in BlueStore is compressed is determined by a combination
+Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
-operation. The modes are:
+operation. The modes are:
-* **none**: Never compress data.
-* **passive**: Do not compress data unless the write operation has a
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
*compressible* hint set.
-* **aggressive**: Compress data unless the write operation has an
+* **aggressive**: Compress data unless the write operation has an
*incompressible* hint set.
-* **force**: Try to compress data no matter what.
+* **force**: Try to compress data no matter what.
-For more information about the *compressible* and *incompressible* IO
+For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.
-Note that regardless of the mode, if the size of the data chunk is not
+Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
-(uncompressed) data will be stored. For example, if the ``bluestore
+(uncompressed) data will be stored. For example, if the ``bluestore
compression required ratio`` is set to ``.7`` then the compressed data
must be 70% of the size of the original (or smaller).
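+
+For example, with a required ratio of ``.7``, a 64KB chunk that compresses to
+40KB (62.5% of the original size) is stored in compressed form, whereas a
+chunk that compresses only to 50KB (roughly 78% of the original size) is
+stored uncompressed.
+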
-The *compression mode*, *compression algorithm*, *compression required
+The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
-per-pool property or a global config option. Pool properties can be
+per-pool property or a global config option. Pool properties can be
set with:
.. prompt:: bash $
.. _bluestore-rocksdb-sharding:
-RocksDB Sharding
+RocksDB Sharding
================
-Internally BlueStore uses multiple types of key-value data,
-stored in RocksDB. Each data type in BlueStore is assigned a
-unique prefix. Until Pacific all key-value data was stored in
-single RocksDB column family: 'default'. Since Pacific,
-BlueStore can divide this data into multiple RocksDB column
-families. When keys have similar access frequency, modification
-frequency and lifetime, BlueStore benefits from better caching
-and more precise compaction. This improves performance, and also
+Internally BlueStore uses multiple types of key-value data,
+stored in RocksDB. Each data type in BlueStore is assigned a
+unique prefix. Until Pacific all key-value data was stored in
+a single RocksDB column family: 'default'. Since Pacific,
+BlueStore can divide this data into multiple RocksDB column
+families. When keys have similar access frequency, modification
+frequency, and lifetime, BlueStore benefits from better caching
+and more precise compaction. This improves performance, and also
requires less disk space during compaction, since each column
family is smaller and can compact independent of others.
-OSDs deployed in Pacific or later use RocksDB sharding by default.
-If Ceph is upgraded to Pacific from a previous version, sharding is off.
+OSDs deployed in Pacific or later use RocksDB sharding by default.
+If Ceph is upgraded to Pacific from a previous version, sharding is off.
-To enable sharding and apply the Pacific defaults, stop an OSD and run
+To enable sharding and apply the Pacific defaults, stop an OSD and run the
+following command:
.. prompt:: bash #
ceph-bluestore-tool \
--path <data path> \
- --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
+   --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
reshard
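+
+If your release provides the ``show-sharding`` command of
+``ceph-bluestore-tool``, you can check the sharding definition that an
+(offline) OSD is currently using by running it against the same data path:
+
+.. prompt:: bash #
+
+   ceph-bluestore-tool --path <data path> show-sharding
+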
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs
-Throttling
+Throttling
==========
.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd
-SPDK Usage
+SPDK Usage
==================
-If you want to use the SPDK driver for NVMe devices, you must prepare your system.
-Refer to `SPDK document`__ for more details.
+If you want to use the SPDK driver for NVMe devices, you must prepare your
+system. Refer to the `SPDK document`__ for more details.
.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
-SPDK offers a script to configure the device automatically. Users can run the
+SPDK offers a script to configure the device automatically. Users can run the
script as root:
.. prompt:: bash $
sudo src/spdk/scripts/setup.sh
-You will need to specify the subject NVMe device's device selector with
+You will need to specify the subject NVMe device's device selector with
the "spdk:" prefix for ``bluestore_block_path``.
-For example, you can find the device selector of an Intel PCIe SSD with:
+For example, you can find the device selector of an Intel PCIe SSD with:
.. prompt:: bash $
- lspci -mm -n -D -d 8086:0953
+   lspci -mm -n -D -d 8086:0953
-The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.
+The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.
and then set::
bluestore_block_path = spdk:0000:01:00.0
-Where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
+Where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
command above.
To run multiple SPDK instances per node, you must specify the
amount of dpdk memory in MB that each instance will use, to make sure each
instance uses its own dpdk memory
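+
+A minimal sketch of such a setting, assuming that the ``bluestore_spdk_mem``
+option (expressed in MB) is the relevant knob, might look like this::
+
+    bluestore_spdk_mem = 2048
+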
-In most cases, a single device can be used for data, DB, and WAL. We describe
-this strategy as *colocating* these components. Be sure to enter the below
-settings to ensure that all IOs are issued through SPDK.::
+In most cases, a single device can be used for data, DB, and WAL. We describe
+this strategy as *colocating* these components. Be sure to enter the below
+settings to ensure that all IOs are issued through SPDK::
bluestore_block_db_path = ""
bluestore_block_db_size = 0
bluestore_block_wal_path = ""
bluestore_block_wal_size = 0
-Otherwise, the current implementation will populate the SPDK map files with
-kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
+Otherwise, the current implementation will populate the SPDK map files with
+kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
-Minimum Allocation Size
+Minimum Allocation Size
========================
-There is a configured minimum amount of storage that BlueStore will allocate on
-an OSD. In practice, this is the least amount of capacity that a RADOS object
-can consume. The value of :confval:`bluestore_min_alloc_size` is derived from the
+There is a configured minimum amount of storage that BlueStore will allocate on
+an OSD. In practice, this is the least amount of capacity that a RADOS object
+can consume. The value of :confval:`bluestore_min_alloc_size` is derived from the
value of :confval:`bluestore_min_alloc_size_hdd` or :confval:`bluestore_min_alloc_size_ssd`
-depending on the OSD's ``rotational`` attribute. This means that when an OSD
-is created on an HDD, BlueStore will be initialized with the current value
-of :confval:`bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices)
+depending on the OSD's ``rotational`` attribute. This means that when an OSD
+is created on an HDD, BlueStore will be initialized with the current value
+of :confval:`bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices)
with the value of :confval:`bluestore_min_alloc_size_ssd`.
-Through the Mimic release, the default values were 64KB and 16KB for rotational
-(HDD) and non-rotational (SSD) media respectively. Octopus changed the default
-for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
-(rotational) media to 4KB as well.
+Through the Mimic release, the default values were 64KB and 16KB for rotational
+(HDD) and non-rotational (SSD) media respectively. Octopus changed the default
+for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
+(rotational) media to 4KB as well.
-These changes were driven by space amplification experienced by Ceph RADOS
-GateWay (RGW) deployments that host large numbers of small files
-(S3/Swift objects).
+These changes were driven by space amplification experienced by Ceph RADOS
+Gateway (RGW) deployments that host large numbers of small files
+(S3/Swift objects).
For example, when an RGW client stores a 1KB S3 object, it is written to a
single RADOS object. With the default :confval:`min_alloc_size` value, 4KB of
underlying drive space is allocated. This means that roughly
amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object
will allocate (6 * 4KB) = 24KB of device capacity.
-When an RGW bucket pool contains many relatively large user objects, the effect
+When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible, but should be considered for deployments
that expect a significant fraction of relatively small objects.
-The 4KB default value aligns well with conventional HDD and SSD devices. Some
-new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best
+The 4KB default value aligns well with conventional HDD and SSD devices. Some
+new coarse-IU (Indirection Unit) QLC SSDs, however, perform and wear best
when :confval:`bluestore_min_alloc_size_ssd`
-is set at OSD creation to match the device's IU:. 8KB, 16KB, or even 64KB.
-These novel storage drives allow one to achieve read performance competitive
-with conventional TLC SSDs and write performance faster than HDDs, with
-high density and lower cost than TLC SSDs.
+is set at OSD creation to match the device's IU: 8KB, 16KB, or even 64KB.
+These novel storage drives allow one to achieve read performance competitive
+with conventional TLC SSDs and write performance faster than HDDs, with
+high density and lower cost than TLC SSDs.
-Note that when creating OSDs on these devices, one must carefully apply the
-non-default value only to appropriate devices, and not to conventional SSD and
-HDD devices. This may be done through careful ordering of OSD creation, custom
-OSD device classes, and especially by the use of central configuration _masks_.
+Note that when creating OSDs on these devices, one must carefully apply the
+non-default value only to appropriate devices, and not to conventional SSD and
+HDD devices. This may be done through careful ordering of OSD creation, custom
+OSD device classes, and especially by the use of central configuration *masks*.
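+
+For example, a central configuration mask keyed on a device class could be
+applied before the relevant OSDs are created. In the following sketch, the
+device class ``qlc`` and the value ``16384`` (16KB) are purely illustrative:
+
+.. prompt:: bash #
+
+   ceph config set osd/class:qlc bluestore_min_alloc_size_ssd 16384
+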
-Quincy and later releases add
+Quincy and later releases add
the :confval:`bluestore_use_optimal_io_size_for_min_alloc_size`
-option that enables automatic discovery of the appropriate value as each OSD is
-created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
-``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction
-technologies may confound the determination of appropriate values. OSDs
-deployed on top of VMware storage have been reported to also
+option that enables automatic discovery of the appropriate value as each OSD is
+created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
+``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction
+technologies may confound the determination of appropriate values. OSDs
+deployed on top of VMware storage have been reported to also
sometimes report a ``rotational`` attribute that does not match the underlying
hardware.
-We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that
-behavior is appropriate. Note that this also may not work as desired with
-older kernels. You can check for this by examining the presence and value
+We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that
+behavior is appropriate. Note that this also may not work as desired with
+older kernels. You can check for this by examining the presence and value
of ``/sys/block/<drive>/queue/optimal_io_size``.
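+
+For example, to check a specific drive (the drive name ``sda`` below is only
+an illustration), run the following command:
+
+.. prompt:: bash $
+
+   cat /sys/block/sda/queue/optimal_io_size
+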
-You may also inspect a given OSD:
+You may also inspect a given OSD:
.. prompt:: bash #
ceph osd metadata osd.1701 | grep rotational
-This space amplification may manifest as an unusually high ratio of raw to
+This space amplification may manifest as an unusually high ratio of raw to
stored data reported by ``ceph df``. ``ceph osd df`` may also report
-anomalously high ``%USE`` / ``VAR`` values when
-compared to other, ostensibly identical OSDs. A pool using OSDs with
+anomalously high ``%USE`` / ``VAR`` values when
+compared to other, ostensibly identical OSDs. A pool using OSDs with
mismatched ``min_alloc_size`` values may experience unexpected balancer
behavior as well.
-Note that this BlueStore attribute takes effect *only* at OSD creation; if
-changed later, a given OSD's behavior will not change unless / until it is
-destroyed and redeployed with the appropriate option value(s). Upgrading
-to a later Ceph release will *not* change the value used by OSDs deployed
+Note that this BlueStore attribute takes effect *only* at OSD creation; if
+changed later, a given OSD's behavior will not change unless / until it is
+destroyed and redeployed with the appropriate option value(s). Upgrading
+to a later Ceph release will *not* change the value used by OSDs deployed
under older releases or with other settings.
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
-DSA (Data Streaming Accelerator Usage)
+DSA (Data Streaming Accelerator Usage)
======================================
-If you want to use the DML library to drive DSA device for offloading
-read/write operations on Persist memory in Bluestore. You need to install
-`DML`_ and `idxd-config`_ library in your machine with SPR (Sapphire Rapids) CPU.
+If you want to use the DML library to drive a DSA device for offloading
+read/write operations on persistent memory in BlueStore, you need to install
+the `DML`_ and `idxd-config`_ libraries on a machine with an SPR (Sapphire Rapids) CPU.
-.. _DML: https://github.com/intel/DML
+.. _DML: https://github.com/intel/DML
.. _idxd-config: https://github.com/intel/idxd-config
-After installing the DML software, you need to configure the shared
-work queues (WQs) with the following WQ configuration example via accel-config tool:
+After installing the DML software, you need to configure the shared
+work queues (WQs) with the following WQ configuration example via the accel-config tool:
.. prompt:: bash $
- accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
+   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
accel-config config-engine dsa0/engine0.1 --group-id=1
accel-config enable-device dsa0
accel-config enable-wq dsa0/wq0.1