doc/rados: add prompts to bluestore-config-ref.rst

author Zac Dover <zac.dover@gmail.com>

Wed, 21 Dec 2022 07:41:04 +0000 (17:41 +1000)

committer Zac Dover <zac.dover@gmail.com>

Wed, 21 Dec 2022 22:05:37 +0000 (08:05 +1000)
diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst

index cf6f63c20aedd377f05d41030f1a6e3ad972fcb7..c927066bef79f5b5580a0c4647a10808092a4eed 100644 (file)
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -42,13 +42,17 @@ it will fit).  This means that if a DB device is specified but an explicit
  WAL device is not, the WAL will be implicitly colocated with the DB on the faster
  device.
  
-A single-device (colocated) BlueStore OSD can be provisioned with::
+A single-device (colocated) BlueStore OSD can be provisioned with:
  
-  ceph-volume lvm prepare --bluestore --data <device>
+.. prompt:: bash $
  
-To specify a WAL device and/or DB device, ::
+   ceph-volume lvm prepare --bluestore --data <device>
  
-  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
+To specify a WAL device and/or DB device:
+   
+.. prompt:: bash $
+
+   ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
  
  .. note:: ``--data`` can be a Logical Volume using  *vg/lv* notation. Other
            devices can be existing logical volumes or GPT partitions.
@@ -66,15 +70,19 @@ the deployment strategy:
  If all devices are the same type, for example all rotational drives, and
  there are no fast devices to use for metadata, it makes sense to specify the
  block device only and to not separate ``block.db`` or ``block.wal``. The
-:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::
+:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like:
+
+.. prompt:: bash $
  
-    ceph-volume lvm create --bluestore --data /dev/sda
+   ceph-volume lvm create --bluestore --data /dev/sda
  
  If logical volumes have already been created for each device, (a single LV
  using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
-``ceph-vg/block-lv`` would look like::
+``ceph-vg/block-lv`` would look like:
+
+.. prompt:: bash $
  
-    ceph-volume lvm create --bluestore --data ceph-vg/block-lv
+   ceph-volume lvm create --bluestore --data ceph-vg/block-lv
  
  .. _bluestore-mixed-device-config:
  
@@ -88,35 +96,43 @@ You must create these volume groups and logical volumes manually as
  the ``ceph-volume`` tool is currently not able to do so automatically.
  
  For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``)
-and one (fast) solid state drive (``sdx``). First create the volume groups::
+and one (fast) solid state drive (``sdx``). First create the volume groups:
  
-    $ vgcreate ceph-block-0 /dev/sda
-    $ vgcreate ceph-block-1 /dev/sdb
-    $ vgcreate ceph-block-2 /dev/sdc
-    $ vgcreate ceph-block-3 /dev/sdd
+.. prompt:: bash $
  
-Now create the logical volumes for ``block``::
+   vgcreate ceph-block-0 /dev/sda
+   vgcreate ceph-block-1 /dev/sdb
+   vgcreate ceph-block-2 /dev/sdc
+   vgcreate ceph-block-3 /dev/sdd
  
-    $ lvcreate -l 100%FREE -n block-0 ceph-block-0
-    $ lvcreate -l 100%FREE -n block-1 ceph-block-1
-    $ lvcreate -l 100%FREE -n block-2 ceph-block-2
-    $ lvcreate -l 100%FREE -n block-3 ceph-block-3
+Now create the logical volumes for ``block``:
+
+.. prompt:: bash $
+
+   lvcreate -l 100%FREE -n block-0 ceph-block-0
+   lvcreate -l 100%FREE -n block-1 ceph-block-1
+   lvcreate -l 100%FREE -n block-2 ceph-block-2
+   lvcreate -l 100%FREE -n block-3 ceph-block-3
  
  We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
-SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::
+SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
+
+.. prompt:: bash $
  
-    $ vgcreate ceph-db-0 /dev/sdx
-    $ lvcreate -L 50GB -n db-0 ceph-db-0
-    $ lvcreate -L 50GB -n db-1 ceph-db-0
-    $ lvcreate -L 50GB -n db-2 ceph-db-0
-    $ lvcreate -L 50GB -n db-3 ceph-db-0
+   vgcreate ceph-db-0 /dev/sdx
+   lvcreate -L 50GB -n db-0 ceph-db-0
+   lvcreate -L 50GB -n db-1 ceph-db-0
+   lvcreate -L 50GB -n db-2 ceph-db-0
+   lvcreate -L 50GB -n db-3 ceph-db-0
  
-Finally, create the 4 OSDs with ``ceph-volume``::
+Finally, create the 4 OSDs with ``ceph-volume``:
  
-    $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
-    $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
-    $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
-    $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
+.. prompt:: bash $
+
+   ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
+   ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
+   ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
+   ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
  
  These operations should end up creating four OSDs, with ``block`` on the slower
  rotational drives with a 50 GB logical volume (DB) for each on the solid state
@@ -239,9 +255,11 @@ The smaller checksum values can be used by selecting `crc32c_16` or
  `crc32c_8` as the checksum algorithm.
  
  The *checksum algorithm* can be set either via a per-pool
-``csum_type`` property or the global config option.  For example, ::
+``csum_type`` property or the global config option.  For example:
+
+.. prompt:: bash $
  
-  ceph osd pool set <pool-name> csum_type <algorithm>
+   ceph osd pool set <pool-name> csum_type <algorithm>
  
  .. confval:: bluestore_csum_type
  
@@ -275,13 +293,15 @@ must be 70% of the size of the original (or smaller).
  The *compression mode*, *compression algorithm*, *compression required
  ratio*, *min blob size*, and *max blob size* can be set either via a
  per-pool property or a global config option.  Pool properties can be
-set with::
+set with:
+
+.. prompt:: bash $
  
-  ceph osd pool set <pool-name> compression_algorithm <algorithm>
-  ceph osd pool set <pool-name> compression_mode <mode>
-  ceph osd pool set <pool-name> compression_required_ratio <ratio>
-  ceph osd pool set <pool-name> compression_min_blob_size <size>
-  ceph osd pool set <pool-name> compression_max_blob_size <size>
+   ceph osd pool set <pool-name> compression_algorithm <algorithm>
+   ceph osd pool set <pool-name> compression_mode <mode>
+   ceph osd pool set <pool-name> compression_required_ratio <ratio>
+   ceph osd pool set <pool-name> compression_min_blob_size <size>
+   ceph osd pool set <pool-name> compression_max_blob_size <size>
  
  .. confval:: bluestore_compression_algorithm
  .. confval:: bluestore_compression_mode
@@ -342,16 +362,20 @@ Refer to `SPDK document`__ for more details.
  .. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
  
  SPDK offers a script to configure the device automatically. Users can run the
-script as root::
+script as root:
  
-  $ sudo src/spdk/scripts/setup.sh
+.. prompt:: bash $
+
+   sudo src/spdk/scripts/setup.sh
  
  You will need to specify the subject NVMe device's device selector with
  the "spdk:" prefix for ``bluestore_block_path``.
  
-For example, you can find the device selector of an Intel PCIe SSD with::
+For example, you can find the device selector of an Intel PCIe SSD with:
+
+.. prompt:: bash $
  
-  $ lspci -mm -n -D -d 8086:0953
+   lspci -mm -n -D -d 8086:0953
  
  The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.
  
@@ -377,3 +401,118 @@ settings to ensure that all IOs are issued through SPDK.::
  
  Otherwise, the current implementation will populate the SPDK map files with
  kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
+
+Minimum Allocation Size
+========================
+
+There is a configured minimum amount of storage that BlueStore will allocate on
+an OSD.  In practice, this is the least amount of capacity that a RADOS object
+can consume.  The value of :confval:`bluestore_min_alloc_size` is derived from the
+value of :confval:`bluestore_min_alloc_size_hdd` or :confval:`bluestore_min_alloc_size_ssd`
+depending on the OSD's ``rotational`` attribute.  This means that when an OSD
+is created on an HDD, BlueStore will be initialized with the current value
+of :confval:`bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices)
+with the value of :confval:`bluestore_min_alloc_size_ssd`.
+
+Through the Mimic release, the default values were 64KB and 16KB for rotational
+(HDD) and non-rotational (SSD) media respectively.  Octopus changed the default
+for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
+(rotational) media to 4KB as well.
+
+These changes were driven by space amplification experienced by Ceph RADOS
+Gateway (RGW) deployments that host large numbers of small files
+(S3/Swift objects).
+
+For example, when an RGW client stores a 1KB S3 object, it is written to a
+single RADOS object.  With the default :confval:`bluestore_min_alloc_size`
+value, 4KB of underlying drive space is allocated.  This means that roughly
+(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300%
+overhead or 25% efficiency. Similarly, a 5KB user object will be stored
+as one 4KB and one 1KB RADOS object, again stranding 3KB of device capacity,
+though in this case the overhead is a much smaller percentage.  Think of this
+in terms of the remainder from a modulus operation. The overhead *percentage*
+thus decreases rapidly as user object size increases.
+
+An easily missed additional subtlety is that this
+takes place for *each* replica.  So when using the default three copies of
+data (3R), a 1KB S3 object actually consumes roughly 12KB of storage device
+capacity, of which roughly 9KB is stranded.  If erasure coding (EC) is used
+instead of replication, the amplification may be even higher: for a
+``k=4,m=2`` pool, our 1KB S3 object will allocate (6 * 4KB) = 24KB of device
+capacity.
+
+When an RGW bucket pool contains many relatively large user objects, the effect
+of this phenomenon is often negligible, but should be considered for deployments
+that expect a significant fraction of relatively small objects.
+
+The 4KB default value aligns well with conventional HDD and SSD devices.  Some
+new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best
+when :confval:`bluestore_min_alloc_size_ssd`
+is set at OSD creation to match the device's IU: 8KB, 16KB, or even 64KB.
+These novel storage drives allow one to achieve read performance competitive
+with conventional TLC SSDs and write performance faster than HDDs, with
+high density and lower cost than TLC SSDs.
+
+Note that when creating OSDs on these devices, one must carefully apply the
+non-default value only to appropriate devices, and not to conventional SSD and
+HDD devices.  This may be done through careful ordering of OSD creation, custom
+OSD device classes, and especially by the use of central configuration *masks*.
+
+Quincy and later releases add
+the :confval:`bluestore_use_optimal_io_size_for_min_alloc_size`
+option that enables automatic discovery of the appropriate value as each OSD is
+created.  Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
+``ATA over Ethernet``, ``iSCSI``, or other device layering / abstraction
+technologies may confound the determination of appropriate values. OSDs
+deployed on top of VMware storage have been reported to also
+sometimes report a ``rotational`` attribute that does not match the underlying
+hardware.
+
+We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that
+behavior is appropriate.  Note that this also may not work as desired with
+older kernels.  You can check for this by examining the presence and value
+of ``/sys/block/<drive>/queue/optimal_io_size``.
+
+You may also inspect a given OSD:
+
+.. prompt:: bash #
+
+   ceph osd metadata osd.1701 | grep rotational
+
+This space amplification may manifest as an unusually high ratio of raw to
+stored data reported by ``ceph df``.  ``ceph osd df`` may also report
+anomalously high ``%USE`` / ``VAR`` values when
+compared to other, ostensibly identical OSDs.  A pool using OSDs with
+mismatched ``min_alloc_size`` values may experience unexpected balancer
+behavior as well.
+
+Note that this BlueStore attribute takes effect *only* at OSD creation; if
+changed later, a given OSD's behavior will not change unless / until it is
+destroyed and redeployed with the appropriate option value(s).  Upgrading
+to a later Ceph release will *not* change the value used by OSDs deployed
+under older releases or with other settings.
+
+
+.. confval:: bluestore_min_alloc_size
+.. confval:: bluestore_min_alloc_size_hdd
+.. confval:: bluestore_min_alloc_size_ssd
+.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
+
+DSA (Data Streaming Accelerator) Usage
+======================================
+
+If you want to use the DML library to drive a DSA device for offloading
+read/write operations on persistent memory in BlueStore, you need to install
+the `DML`_ and `idxd-config`_ libraries on a machine with an SPR (Sapphire Rapids) CPU.
+
+.. _DML: https://github.com/intel/DML
+.. _idxd-config: https://github.com/intel/idxd-config
+
+After installing the DML software, you need to configure the shared
+work queues (WQs) with the ``accel-config`` tool, as in the following example:
+
+.. prompt:: bash $
+
+   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
+   accel-config config-engine dsa0/engine0.1 --group-id=1
+   accel-config enable-device dsa0
+   accel-config enable-wq dsa0/wq0.1
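
The space-amplification arithmetic in the "Minimum Allocation Size" section added by this diff can be checked with a short calculation. The following Python sketch is not part of the commit and is not Ceph code; the function names are hypothetical. It assumes that every RADOS object (or EC shard) is rounded up to a whole number of ``min_alloc_size`` units, and that an erasure-coded object is split evenly across ``k`` data shards. It ignores the EC stripe unit and any BlueStore metadata, so treat the results as rough estimates only.

```python
KIB = 1024


def allocated(size, min_alloc=4 * KIB):
    """Round size (bytes) up to a whole number of allocation units."""
    return -(-size // min_alloc) * min_alloc  # integer ceiling division


def replicated_footprint(size, copies=3, min_alloc=4 * KIB):
    """Raw capacity consumed when every replica stores the full object."""
    return copies * allocated(size, min_alloc)


def ec_footprint(size, k, m, min_alloc=4 * KIB):
    """Raw capacity for a k+m erasure-coded object (simplified model:
    the object is split evenly across k data shards, and all k+m shards
    are rounded up to the allocation unit)."""
    shard = -(-size // k)
    return (k + m) * allocated(shard, min_alloc)


if __name__ == "__main__":
    one_kib = 1 * KIB
    print("1KB object, one copy:", allocated(one_kib) // KIB, "KB")
    print("5KB object, one copy:", allocated(5 * KIB) // KIB, "KB")
    print("1KB object, 3R:", replicated_footprint(one_kib) // KIB, "KB")
    print("1KB object, EC 4+2:", ec_footprint(one_kib, 4, 2) // KIB, "KB")
```

Passing a larger ``min_alloc`` (for example ``64 * KIB``) shows how a coarse allocation unit increases the footprint of small objects, which is why non-default values should be applied only to the devices that need them.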