doc/rados: edit bluestore-config-ref.rst (1 of x)

author Zac Dover <zac.dover@proton.me>

Fri, 26 May 2023 08:59:36 +0000 (18:59 +1000)

committer Zac Dover <zac.dover@proton.me>

Fri, 26 May 2023 21:50:50 +0000 (07:50 +1000)
author Zac Dover <zac.dover@proton.me>
Fri, 26 May 2023 08:59:36 +0000 (18:59 +1000)
committer Zac Dover <zac.dover@proton.me>
Fri, 26 May 2023 21:50:50 +0000 (07:50 +1000)
diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst

index 37b3c43d7ca3fabffcd897bc2c4756bb422c6d70..bbcedad8ce40c0f26c94816216871627d0bf5173 100644 (file)
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -1,84 +1,95 @@
-==========================
-BlueStore Config Reference
-==========================
+==================================
+ BlueStore Configuration Reference 
+==================================
  
  Devices
  =======
  
-BlueStore manages either one, two, or (in certain cases) three storage
-devices.
-
-In the simplest case, BlueStore consumes a single (primary) storage device.
-The storage device is normally used as a whole, occupying the full device that
-is managed directly by BlueStore. This *primary device* is normally identified
-by a ``block`` symlink in the data directory.
-
-The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
-when ``ceph-volume`` activates it) with all the common OSD files that hold
-information about the OSD, like: its identifier, which cluster it belongs to,
-and its private keyring.
-
-It is also possible to deploy BlueStore across one or two additional devices:
-
-* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
-  used for BlueStore's internal journal or write-ahead log. It is only useful
-  to use a WAL device if the device is faster than the primary device (e.g.,
-  when it is on an SSD and the primary device is an HDD).
+BlueStore manages either one, two, or in certain cases three storage devices.
+These *devices* are "devices" in the Linux/Unix sense. This means that they are
+assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
+entire storage drive, or a partition of a storage drive, or a logical volume.
+BlueStore does not create or mount a conventional file system on devices that
+it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.
+
+In the simplest case, BlueStore consumes all of a single storage device. This
+device is known as the *primary device*. The primary device is identified by
+the ``block`` symlink in the data directory.
+
+The data directory is a ``tmpfs`` mount. When this data directory is booted or
+activated by ``ceph-volume``, it is populated with metadata files and links
+that hold information about the OSD: for example, the OSD's identifier, the
+name of the cluster that the OSD belongs to, and the OSD's private keyring.
+
+In more complicated cases, BlueStore is deployed across one or two additional
+devices:
+
+* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
+  directory) can be used to separate out BlueStore's internal journal or
+  write-ahead log. Using a WAL device is advantageous only if the WAL device
+  is faster than the primary device (for example, if the WAL device is an SSD
+  and the primary device is an HDD).
  * A *DB device* (identified as ``block.db`` in the data directory) can be used
-  for storing BlueStore's internal metadata.  BlueStore (or rather, the
-  embedded RocksDB) will put as much metadata as it can on the DB device to
-  improve performance.  If the DB device fills up, metadata will spill back
-  onto the primary device (where it would have been otherwise).  Again, it is
-  only helpful to provision a DB device if it is faster than the primary
-  device.
-
-If there is only a small amount of fast storage available (e.g., less
-than a gigabyte), we recommend using it as a WAL device.  If there is
-more, provisioning a DB device makes more sense.  The BlueStore
-journal will always be placed on the fastest device available, so
-using a DB device will provide the same benefit that the WAL device
-would while *also* allowing additional metadata to be stored there (if
-it will fit).  This means that if a DB device is specified but an explicit
-WAL device is not, the WAL will be implicitly colocated with the DB on the faster
-device.
-
-A single-device (colocated) BlueStore OSD can be provisioned with:
+  to store BlueStore's internal metadata. BlueStore (or more precisely, the
+  embedded RocksDB) will put as much metadata as it can on the DB device in
+  order to improve performance. If the DB device becomes full, metadata will
+  spill back onto the primary device (where it would have been located in the
+  absence of the DB device). Again, it is advantageous to provision a DB device
+  only if it is faster than the primary device.
+
+If there is only a small amount of fast storage available (for example, less
+than a gigabyte), we recommend using the available space as a WAL device. But
+if more fast storage is available, it makes more sense to provision a DB
+device. Because the BlueStore journal is always placed on the fastest device
+available, using a DB device provides the same benefit that using a WAL device
+would, while *also* allowing additional metadata to be stored off the primary
+device (provided that it fits). DB devices make this possible because whenever
+a DB device is specified but an explicit WAL device is not, the WAL will be
+implicitly colocated with the DB on the faster device.
+
+To provision a single-device (colocated) BlueStore OSD, run the following
+command:
  
  .. prompt:: bash $
  
     ceph-volume lvm prepare --bluestore --data <device>
  
-To specify a WAL device and/or DB device:
+To specify a WAL device or DB device, run the following command:
     
  .. prompt:: bash $
  
     ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
  
-.. note:: ``--data`` can be a Logical Volume using  *vg/lv* notation. Other
-          devices can be existing logical volumes or GPT partitions.
+.. note:: The option ``--data`` can take as its argument any of the the
+   following devices: logical volumes specified using *vg/lv* notation,
+   existing logical volumes, and GPT partitions.
+
+
  
  Provisioning strategies
  -----------------------
-Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
-which had just one), there are two common arrangements that should help clarify
-the deployment strategy:
+
+BlueStore differs from Filestore in that there are several ways to deploy a
+BlueStore OSD. However, the overall deployment strategy for BlueStore can be
+clarified by examining just these two common arrangements:
  
  .. _bluestore-single-type-device-config:
  
  **block (data) only**
  ^^^^^^^^^^^^^^^^^^^^^
-If all devices are the same type, for example all rotational drives, and
-there are no fast devices to use for metadata, it makes sense to specify the
-block device only and to not separate ``block.db`` or ``block.wal``. The
-:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like:
+If all devices are of the same type (for example, they are all HDDs), and if
+there are no fast devices available for the storage of metadata, then it makes
+sense to specify the block device only and to leave ``block.db`` and
+``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
+``/dev/sda`` device is as follows:
  
  .. prompt:: bash $
  
     ceph-volume lvm create --bluestore --data /dev/sda
  
-If logical volumes have already been created for each device, (a single LV
-using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
-``ceph-vg/block-lv`` would look like:
+If the devices to be used for a BlueStore OSD are pre-created logical volumes,
+then the :ref:`ceph-volume-lvm` call for an logical volume named
+``ceph-vg/block-lv`` is as follows:
  
  .. prompt:: bash $
  
@@ -88,15 +99,18 @@ using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
  
  **block and block.db**
  ^^^^^^^^^^^^^^^^^^^^^^
-If you have a mix of fast and slow devices (SSD / NVMe and rotational),
-it is recommended to place ``block.db`` on the faster device while ``block``
-(data) lives on the slower (spinning drive).
  
-You must create these volume groups and logical volumes manually as 
-the ``ceph-volume`` tool is currently not able to do so automatically.
+If you have a mix of fast and slow devices (for example, SSD or HDD), then we
+recommend placing ``block.db`` on the faster device while ``block`` (that is,
+the data) is stored on the slower device (that is, the rotational drive).
  
-For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``)
-and one (fast) solid state drive (``sdx``). First create the volume groups:
+You must create these volume groups and these logical volumes manually. as The
+``ceph-volume`` tool is currently unable to do so [create them?] automatically.
+
+The following procedure illustrates the manual creation of volume groups and
+logical volumes.  For this example, we shall assume four rotational drives
+(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
+to create the volume groups, run the following commands:
  
  .. prompt:: bash $
  
@@ -105,7 +119,7 @@ and one (fast) solid state drive (``sdx``). First create the volume groups:
     vgcreate ceph-block-2 /dev/sdc
     vgcreate ceph-block-3 /dev/sdd
  
-Now create the logical volumes for ``block``:
+Next, to create the logical volumes for ``block``, run the following commands:
  
  .. prompt:: bash $
  
@@ -114,8 +128,9 @@ Now create the logical volumes for ``block``:
     lvcreate -l 100%FREE -n block-2 ceph-block-2
     lvcreate -l 100%FREE -n block-3 ceph-block-3
  
-We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
-SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
+Because there are four HDDs, there will be four OSDs. Supposing that there is a
+200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
+the following commands:
  
  .. prompt:: bash $
  
@@ -125,7 +140,7 @@ SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
     lvcreate -L 50GB -n db-2 ceph-db-0
     lvcreate -L 50GB -n db-3 ceph-db-0
  
-Finally, create the 4 OSDs with ``ceph-volume``:
+Finally, to create the four OSDs, run the following commands:
  
  .. prompt:: bash $
  
@@ -134,54 +149,57 @@ Finally, create the 4 OSDs with ``ceph-volume``:
     ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
     ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
  
-These operations should end up creating four OSDs, with ``block`` on the slower
-rotational drives with a 50 GB logical volume (DB) for each on the solid state
-drive.
+After this procedure is finished, there should be four OSDs, ``block`` should
+be on the four HDDs, and each HDD should have a 50GB logical volume
+(specifically, a DB device) on the shared SSD.
  
  Sizing
  ======
-When using a :ref:`mixed spinning and solid drive setup
-<bluestore-mixed-device-config>` it is important to make a large enough
-``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have
-*as large as possible* logical volumes.
-
-The general recommendation is to have ``block.db`` size in between 1% to 4%
-of ``block`` size. For RGW workloads, it is recommended that the ``block.db``
-size isn't smaller than 4% of ``block``, because RGW heavily uses it to store
-metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
-be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough.
-
-In older releases, internal level sizes mean that the DB can fully utilize only
-specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
-etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
-so forth.  Most deployments will not substantially benefit from sizing to
-accommodate L3 and higher, though DB compaction can be facilitated by doubling
-these figures to 6GB, 60GB, and 600GB.
-
-Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
-enable better utilization of arbitrary DB device sizes, and the Pacific
-release brings experimental dynamic level support.  Users of older releases may
-thus wish to plan ahead by provisioning larger DB devices today so that their
-benefits may be realized with future upgrades.
-
-When *not* using a mix of fast and slow devices, it isn't required to create
-separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
-automatically colocate these within the space of ``block``.
-
+When using a :ref:`mixed spinning-and-solid-drive setup
+<bluestore-mixed-device-config>`, it is important to make a large enough
+``block.db`` logical volume for BlueStore. The logical volumes associated with
+``block.db`` should have logical volumes that are *as large as possible*.
+
+It is generally recommended that the size of ``block.db`` be somewhere between
+1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
+the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
+use of ``block.db`` to store metadata (in particular, omap keys). For example,
+if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
+40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
+2% of the ``block`` size.
+
+In older releases, internal level sizes are such that the DB can fully utilize
+only those specific partition / logical volume sizes that correspond to sums of
+L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
+3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
+sizing that accommodates L3 and higher, though DB compaction can be facilitated
+by doubling these figures to 6GB, 60GB, and 600GB.
+
+Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
+for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
+release brings experimental dynamic-level support. Because of these advances,
+users of older releases might want to plan ahead by provisioning larger DB
+devices today so that the benefits of scale can be realized when upgrades are
+made in the future.
+
+When *not* using a mix of fast and slow devices, there is no requirement to
+create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
+will automatically colocate these devices within the space of ``block``.
  
  Automatic Cache Sizing
  ======================
  
-BlueStore can be configured to automatically resize its caches when TCMalloc
-is configured as the memory allocator and the ``bluestore_cache_autotune``
-setting is enabled.  This option is currently enabled by default.  BlueStore
-will attempt to keep OSD heap memory usage under a designated target size via
-the ``osd_memory_target`` configuration option.  This is a best effort
-algorithm and caches will not shrink smaller than the amount specified by
-``osd_memory_cache_min``.  Cache ratios will be chosen based on a hierarchy
-of priorities.  If priority information is not available, the
-``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
-used as fallbacks.
+BlueStore can be configured to automatically resize its caches, provided that
+certain conditions are met: TCMalloc must be configured as the memory allocator
+and the ``bluestore_cache_autotune`` configuration option must be enabled (note
+that it is currently enabled by default). When automatic cache sizing is in
+effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
+size (as determined by ``osd_memory_target``). This approach makes use of a
+best-effort algorithm and caches do not shrink smaller than the size defined by
+the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
+with a hierarchy of priorities.  But if priority information is not available,
+the values specified in the ``bluestore_cache_meta_ratio`` and
+``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
  
  .. confval:: bluestore_cache_autotune
  .. confval:: osd_memory_target
@@ -195,34 +213,33 @@ used as fallbacks.
  Manual Cache Sizing
  ===================
  
-The amount of memory consumed by each OSD for BlueStore caches is
-determined by the ``bluestore_cache_size`` configuration option.  If
-that config option is not set (i.e., remains at 0), there is a
-different default value that is used depending on whether an HDD or
-SSD is used for the primary device (set by the
-``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
-options).
+The amount of memory consumed by each OSD to be used for its BlueStore cache is
+determined by the ``bluestore_cache_size`` configuration option. If that option
+has not been specified (that is, if it remains at 0), then Ceph uses a
+different configuration option to determine the default memory budget:
+``bluestore_cache_size_hdd`` if the primary device is an HDD, or
+``bluestore_cache_size_ssd`` if the primary device is an SSD.
  
-BlueStore and the rest of the Ceph OSD daemon do the best they can
-to work within this memory budget.  Note that on top of the configured
-cache size, there is also memory consumed by the OSD itself, and
-some additional utilization due to memory fragmentation and other
-allocator overhead.
+BlueStore and the rest of the Ceph OSD daemon make every effort to work within
+this memory budget. Note that in addition to the configured cache size, there
+is also memory consumed by the OSD itself. There is additional utilization due
+to memory fragmentation and other allocator overhead. 
  
-The configured cache memory budget can be used in a few different ways:
+The configured cache-memory budget can be used to store the following types of
+things:
  
-* Key/Value metadata (i.e., RocksDB's internal cache)
+* Key/Value metadata (that is, RocksDB's internal cache)
  * BlueStore metadata
-* BlueStore data (i.e., recently read or written object data)
+* BlueStore data (that is, recently read or recently written object data)
  
-Cache memory usage is governed by the following options:
-``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``.
-The fraction of the cache devoted to data
-is governed by the effective bluestore cache size (depending on
-``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
-device) as well as the meta and kv ratios.
-The data fraction can be calculated by
-``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``
+Cache memory usage is governed by the configuration options
+``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``.  The fraction
+of the cache that is reserved for data is governed by both the effective
+BlueStore cache size (which depends on the relevant
+``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
+device) and the "meta" and "kv" ratios.  This data fraction can be calculated
+with the following formula: ``<effective_cache_size> * (1 -
+bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
  
  .. confval:: bluestore_cache_size
  .. confval:: bluestore_cache_size_hdd
@@ -233,29 +250,28 @@ The data fraction can be calculated by
  Checksums
  =========
  
-BlueStore checksums all metadata and data written to disk.  Metadata
-checksumming is handled by RocksDB and uses `crc32c`. Data
-checksumming is done by BlueStore and can make use of `crc32c`,
-`xxhash32`, or `xxhash64`.  The default is `crc32c` and should be
-suitable for most purposes.
-
-Full data checksumming does increase the amount of metadata that
-BlueStore must store and manage.  When possible, e.g., when clients
-hint that data is written and read sequentially, BlueStore will
-checksum larger blocks, but in many cases it must store a checksum
-value (usually 4 bytes) for every 4 kilobyte block of data.
-
-It is possible to use a smaller checksum value by truncating the
-checksum to two or one byte, reducing the metadata overhead.  The
-trade-off is that the probability that a random error will not be
-detected is higher with a smaller checksum, going from about one in
-four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
-16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
-The smaller checksum values can be used by selecting `crc32c_16` or
-`crc32c_8` as the checksum algorithm.
-
-The *checksum algorithm* can be set either via a per-pool
-``csum_type`` property or the global config option.  For example:
+BlueStore checksums all metadata and all data written to disk. Metadata
+checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
+contrast, data checksumming is handled by BlueStore and can use either
+`crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default
+checksum algorithm and it is suitable for most purposes.
+
+Full data checksumming increases the amount of metadata that BlueStore must
+store and manage. Whenever possible (for example, when clients hint that data
+is written and read sequentially), BlueStore will checksum larger blocks. In
+many cases, however, it must store a checksum value (usually 4 bytes) for every
+4 KB block of data.
+
+It is possible to obtain a smaller checksum value by truncating the checksum to
+one or two bytes and reducing the metadata overhead.  A drawback of this
+approach is that it increases the probability of a random error going
+undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
+65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
+checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
+as the checksum algorithm.
+
+The *checksum algorithm* can be specified either via a per-pool ``csum_type``
+configuration option or via the global configuration option. For example:
  
  .. prompt:: bash $
  
@@ -263,36 +279,36 @@ The *checksum algorithm* can be set either via a per-pool
  
  .. confval:: bluestore_csum_type
  
-Inline Compression
+inline compression
  ==================
  
-BlueStore supports inline compression using `snappy`, `zlib`, or
-`lz4`. Please note that the `lz4` compression plugin is not
+bluestore supports inline compression using `snappy`, `zlib`, or
+`lz4`. please note that the `lz4` compression plugin is not
  distributed in the official release.
  
-Whether data in BlueStore is compressed is determined by a combination
+whether data in bluestore is compressed is determined by a combination
  of the *compression mode* and any hints associated with a write
-operation.  The modes are:
+operation.  the modes are:
  
-* **none**: Never compress data.
-* **passive**: Do not compress data unless the write operation has a
+* **none**: never compress data.
+* **passive**: do not compress data unless the write operation has a
    *compressible* hint set.
-* **aggressive**: Compress data unless the write operation has an
+* **aggressive**: compress data unless the write operation has an
    *incompressible* hint set.
-* **force**: Try to compress data no matter what.
+* **force**: try to compress data no matter what.
  
-For more information about the *compressible* and *incompressible* IO
+for more information about the *compressible* and *incompressible* io
  hints, see :c:func:`rados_set_alloc_hint`.
  
-Note that regardless of the mode, if the size of the data chunk is not
+note that regardless of the mode, if the size of the data chunk is not
  reduced sufficiently it will not be used and the original
-(uncompressed) data will be stored.  For example, if the ``bluestore
+(uncompressed) data will be stored.  for example, if the ``bluestore
  compression required ratio`` is set to ``.7`` then the compressed data
  must be 70% of the size of the original (or smaller).
  
-The *compression mode*, *compression algorithm*, *compression required
+the *compression mode*, *compression algorithm*, *compression required
  ratio*, *min blob size*, and *max blob size* can be set either via a
-per-pool property or a global config option.  Pool properties can be
+per-pool property or a global config option.  pool properties can be
  set with:
  
  .. prompt:: bash $
@@ -315,36 +331,36 @@ set with:
  
  .. _bluestore-rocksdb-sharding:
  
-RocksDB Sharding
+rocksdb sharding
  ================
  
-Internally BlueStore uses multiple types of key-value data,
-stored in RocksDB.  Each data type in BlueStore is assigned a
-unique prefix. Until Pacific all key-value data was stored in
-single RocksDB column family: 'default'.  Since Pacific,
-BlueStore can divide this data into multiple RocksDB column
-families. When keys have similar access frequency, modification
-frequency and lifetime, BlueStore benefits from better caching
-and more precise compaction. This improves performance, and also
+internally bluestore uses multiple types of key-value data,
+stored in rocksdb.  each data type in bluestore is assigned a
+unique prefix. until pacific all key-value data was stored in
+single rocksdb column family: 'default'.  since pacific,
+bluestore can divide this data into multiple rocksdb column
+families. when keys have similar access frequency, modification
+frequency and lifetime, bluestore benefits from better caching
+and more precise compaction. this improves performance, and also
  requires less disk space during compaction, since each column
  family is smaller and can compact independent of others.
  
-OSDs deployed in Pacific or later use RocksDB sharding by default.
-If Ceph is upgraded to Pacific from a previous version, sharding is off.
+osds deployed in pacific or later use rocksdb sharding by default.
+if ceph is upgraded to pacific from a previous version, sharding is off.
  
-To enable sharding and apply the Pacific defaults, stop an OSD and run
+to enable sharding and apply the pacific defaults, stop an osd and run
  
      .. prompt:: bash #
  
        ceph-bluestore-tool \
          --path <data path> \
-        --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
+        --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
          reshard
  
  .. confval:: bluestore_rocksdb_cf
  .. confval:: bluestore_rocksdb_cfs
  
-Throttling
+throttling
  ==========
  
  .. confval:: bluestore_throttle_bytes
@@ -353,147 +369,147 @@ Throttling
  .. confval:: bluestore_throttle_cost_per_io_hdd
  .. confval:: bluestore_throttle_cost_per_io_ssd
  
-SPDK Usage
+spdk usage
  ==================
  
-If you want to use the SPDK driver for NVMe devices, you must prepare your system.
-Refer to `SPDK document`__ for more details.
+if you want to use the spdk driver for nvme devices, you must prepare your system.
+refer to `spdk document`__ for more details.
  
  .. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
  
-SPDK offers a script to configure the device automatically. Users can run the
+spdk offers a script to configure the device automatically. users can run the
  script as root:
  
  .. prompt:: bash $
  
     sudo src/spdk/scripts/setup.sh
  
-You will need to specify the subject NVMe device's device selector with
+you will need to specify the subject nvme device's device selector with
  the "spdk:" prefix for ``bluestore_block_path``.
  
-For example, you can find the device selector of an Intel PCIe SSD with:
+for example, you can find the device selector of an intel pcie ssd with:
  
  .. prompt:: bash $
  
-   lspci -mm -n -D -d 8086:0953
+   lspci -mm -n -d -d 8086:0953
  
-The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.
+the device selector always has the form of ``dddd:bb:dd.ff`` or ``dddd.bb.dd.ff``.
  
  and then set::
  
-  bluestore_block_path = "spdk:trtype:PCIe traddr:0000:01:00.0"
+  bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"
  
-Where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
+where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
  command above.
  
-You may also specify a remote NVMeoF target over the TCP transport as in the
+you may also specify a remote nvmeof target over the tcp transport as in the
  following example::
  
-  bluestore_block_path = "spdk:trtype:TCP traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
+  bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
  
-To run multiple SPDK instances per node, you must specify the
-amount of dpdk memory in MB that each instance will use, to make sure each
-instance uses its own DPDK memory.
+to run multiple spdk instances per node, you must specify the
+amount of dpdk memory in mb that each instance will use, to make sure each
+instance uses its own dpdk memory.
  
-In most cases, a single device can be used for data, DB, and WAL.  We describe
-this strategy as *colocating* these components. Be sure to enter the below
-settings to ensure that all IOs are issued through SPDK.::
+in most cases, a single device can be used for data, db, and wal.  we describe
+this strategy as *colocating* these components. be sure to enter the below
+settings to ensure that all ios are issued through spdk.::
  
    bluestore_block_db_path = ""
    bluestore_block_db_size = 0
    bluestore_block_wal_path = ""
    bluestore_block_wal_size = 0
  
-Otherwise, the current implementation will populate the SPDK map files with
-kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
+otherwise, the current implementation will populate the spdk map files with
+kernel file system symbols and will use the kernel driver to issue db/wal io.
  
-Minimum Allocation Size
+minimum allocation size
  ========================
  
-There is a configured minimum amount of storage that BlueStore will allocate on
-an OSD.  In practice, this is the least amount of capacity that a RADOS object
-can consume.  The value of :confval:`bluestore_min_alloc_size` is derived from the
+there is a configured minimum amount of storage that bluestore will allocate on
+an osd.  in practice, this is the least amount of capacity that a rados object
+can consume.  the value of :confval:`bluestore_min_alloc_size` is derived from the
  value of :confval:`bluestore_min_alloc_size_hdd` or :confval:`bluestore_min_alloc_size_ssd`
-depending on the OSD's ``rotational`` attribute.  This means that when an OSD
-is created on an HDD, BlueStore will be initialized with the current value
-of :confval:`bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices)
+depending on the osd's ``rotational`` attribute.  this means that when an osd
+is created on an hdd, bluestore will be initialized with the current value
+of :confval:`bluestore_min_alloc_size_hdd`, and ssd osds (including nvme devices)
  with the value of :confval:`bluestore_min_alloc_size_ssd`.
  
-Through the Mimic release, the default values were 64KB and 16KB for rotational
-(HDD) and non-rotational (SSD) media respectively.  Octopus changed the default
-for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
-(rotational) media to 4KB as well.
-
-These changes were driven by space amplification experienced by Ceph RADOS
-GateWay (RGW) deployments that host large numbers of small files
-(S3/Swift objects).
-
-For example, when an RGW client stores a 1KB S3 object, it is written to a
-single RADOS object.  With the default :confval:`min_alloc_size` value, 4KB of
-underlying drive space is allocated.  This means that roughly
-(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300%
-overhead or 25% efficiency. Similarly, a 5KB user object will be stored
-as one 4KB and one 1KB RADOS object, again stranding 4KB of device capacity,
-though in this case the overhead is a much smaller percentage.  Think of this
-in terms of the remainder from a modulus operation. The overhead *percentage*
+through the mimic release, the default values were 64kb and 16kb for rotational
+(hdd) and non-rotational (ssd) media respectively.  octopus changed the default
+for ssd (non-rotational) media to 4kb, and pacific changed the default for hdd
+(rotational) media to 4kb as well.
+
+these changes were driven by space amplification experienced by ceph rados
+gateway (rgw) deployments that host large numbers of small files
+(s3/swift objects).
+
+for example, when an rgw client stores a 1kb s3 object, it is written to a
+single rados object.  with the default :confval:`min_alloc_size` value, 4kb of
+underlying drive space is allocated.  this means that roughly
+(4kb - 1kb) == 3kb of that rados object's allocated space is never used, which corresponds to 300%
+overhead or 25% efficiency. similarly, a 5kb user object will be stored
+as one 4kb and one 1kb rados object, again stranding 4kb of device capacity,
+though in this case the overhead is a much smaller percentage.  think of this
+in terms of the remainder from a modulus operation. the overhead *percentage*
  thus decreases rapidly as user object size increases.
  
-An easily missed additional subtlety is that this
-takes place for *each* replica.  So when using the default three copies of
-data (3R), a 1KB S3 object actually consumes roughly 9KB of storage device
-capacity.  If erasure coding (EC) is used instead of replication, the
-amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object
-will allocate (6 * 4KB) = 24KB of device capacity.
+an easily missed additional subtlety is that this
+takes place for *each* replica.  so when using the default three copies of
+data (3r), a 1kb s3 object actually consumes 12kb of storage device
+capacity, with 11kb of overhead.  if erasure coding (ec) is used instead of replication, the
+amplification may be even higher: for a ``k=4,m=2`` pool, our 1kb s3 object
+will allocate (6 * 4kb) = 24kb of device capacity.
  
-When an RGW bucket pool contains many relatively large user objects, the effect
+when an rgw bucket pool contains many relatively large user objects, the effect
  of this phenomenon is often negligible, but should be considered for deployments
  that expect a significant fraction of relatively small objects.
  
-The 4KB default value aligns well with conventional HDD and SSD devices.  Some
-new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best
+the 4kb default value aligns well with conventional hdd and ssd devices.  some
+new coarse-iu (indirection unit) qlc ssds however perform and wear best
  when :confval:`bluestore_min_alloc_size_ssd`
-is set at OSD creation to match the device's IU:. 8KB, 16KB, or even 64KB.
-These novel storage drives allow one to achieve read performance competitive
-with conventional TLC SSDs and write performance faster than HDDs, with
-high density and lower cost than TLC SSDs.
+is set at osd creation to match the device's iu:. 8kb, 16kb, or even 64kb.
+these novel storage drives allow one to achieve read performance competitive
+with conventional tlc ssds and write performance faster than hdds, with
+high density and lower cost than tlc ssds.
  
-Note that when creating OSDs on these devices, one must carefully apply the
-non-default value only to appropriate devices, and not to conventional SSD and
-HDD devices.  This may be done through careful ordering of OSD creation, custom
-OSD device classes, and especially by the use of central configuration _masks_.
+note that when creating osds on these devices, one must carefully apply the
+non-default value only to appropriate devices, and not to conventional ssd and
+hdd devices.  this may be done through careful ordering of osd creation, custom
+osd device classes, and especially by the use of central configuration _masks_.
  
-Quincy and later releases add
+quincy and later releases add
  the :confval:`bluestore_use_optimal_io_size_for_min_alloc_size`
-option that enables automatic discovery of the appropriate value as each OSD is
-created.  Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
-``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction
-technologies may confound the determination of appropriate values. OSDs
-deployed on top of VMware storage have been reported to also
+option that enables automatic discovery of the appropriate value as each osd is
+created.  note that the use of ``bcache``, ``opencas``, ``dmcrypt``,
+``ata over ethernet``, `iscsi`, or other device layering / abstraction
+technologies may confound the determination of appropriate values. osds
+deployed on top of vmware storage have been reported to also
  sometimes report a ``rotational`` attribute that does not match the underlying
  hardware.
  
-We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that
-behavior is appropriate.  Note that this also may not work as desired with
-older kernels.  You can check for this by examining the presence and value
+we suggest inspecting such osds at startup via logs and admin sockets to ensure that
+behavior is appropriate.  note that this also may not work as desired with
+older kernels.  you can check for this by examining the presence and value
  of ``/sys/block/<drive>/queue/optimal_io_size``.
  
-You may also inspect a given OSD:
+you may also inspect a given osd:
  
  .. prompt:: bash #
  
     ceph osd metadata osd.1701 | grep rotational
  
-This space amplification may manifest as an unusually high ratio of raw to
+this space amplification may manifest as an unusually high ratio of raw to
  stored data reported by ``ceph df``.  ``ceph osd df`` may also report
-anomalously high ``%USE`` / ``VAR`` values when
-compared to other, ostensibly identical OSDs.  A pool using OSDs with
+anomalously high ``%use`` / ``var`` values when
+compared to other, ostensibly identical osds.  a pool using osds with
  mismatched ``min_alloc_size`` values may experience unexpected balancer
  behavior as well.
  
-Note that this BlueStore attribute takes effect *only* at OSD creation; if
-changed later, a given OSD's behavior will not change unless / until it is
-destroyed and redeployed with the appropriate option value(s).  Upgrading
-to a later Ceph release will *not* change the value used by OSDs deployed
+note that this bluestore attribute takes effect *only* at osd creation; if
+changed later, a given osd's behavior will not change unless / until it is
+destroyed and redeployed with the appropriate option value(s).  upgrading
+to a later ceph release will *not* change the value used by osds deployed
  under older releases or with other settings.
  
  
@@ -502,22 +518,22 @@ under older releases or with other settings.
  .. confval:: bluestore_min_alloc_size_ssd
  .. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
  
-DSA (Data Streaming Accelerator Usage)
+dsa (data streaming accelerator usage)
  ======================================
  
-If you want to use the DML library to drive DSA device for offloading
-read/write operations on Persist memory in Bluestore. You need to install
-`DML`_ and `idxd-config`_ library in your machine with SPR (Sapphire Rapids) CPU.
+if you want to use the dml library to drive dsa device for offloading
+read/write operations on persist memory in bluestore. you need to install
+`dml`_ and `idxd-config`_ library in your machine with spr (sapphire rapids) cpu.
  
-.. _DML: https://github.com/intel/DML
+.. _dml: https://github.com/intel/dml
  .. _idxd-config: https://github.com/intel/idxd-config
  
-After installing the DML software, you need to configure the shared
-work queues (WQs) with the following WQ configuration example via accel-config tool:
+after installing the dml software, you need to configure the shared
+work queues (wqs) with the following wq configuration example via accel-config tool:
  
  .. prompt:: bash $
  
-   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
+   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
     accel-config config-engine dsa0/engine0.1 --group-id=1
     accel-config enable-device dsa0
     accel-config enable-wq dsa0/wq0.1
author	Zac Dover <zac.dover@proton.me>
	Fri, 26 May 2023 08:59:36 +0000 (18:59 +1000)
committer	Zac Dover <zac.dover@proton.me>
	Fri, 26 May 2023 21:50:50 +0000 (07:50 +1000)