From 2534df34b12bebbcd6ff94770c19d42f660ec086 Mon Sep 17 00:00:00 2001
From: Alfredo Deza
Date: Tue, 24 Jul 2018 12:51:56 -0400
Subject: [PATCH] doc/bluestore expand on configuration with fast and slow
 devices

Signed-off-by: Alfredo Deza
---
 .../configuration/bluestore-config-ref.rst | 89 ++++++++++++++++---
 1 file changed, 78 insertions(+), 11 deletions(-)

diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst
index 0ebe9af6bdee4..a272666aad183 100644
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -20,17 +20,17 @@ and its private keyring.
 
 It is also possible to deploy BlueStore across two additional devices:
 
-* A *WAL device* can be used for BlueStore's internal journal or
-  write-ahead log. It is identified by the ``block.wal`` symlink in
-  the data directory. It is only useful to use a WAL device if the
-  device is faster than the primary device (e.g., when it is on an SSD
-  and the primary device is an HDD).
-* A *DB device* can be used for storing BlueStore's internal metadata.
-  BlueStore (or rather, the embedded RocksDB) will put as much
-  metadata as it can on the DB device to improve performance. If the
-  DB device fills up, metadata will spill back onto the primary device
-  (where it would have been otherwise). Again, it is only helpful to
-  provision a DB device if it is faster than the primary device.
+* A *WAL device* (identified as ``block.wal`` in the data directory) can be
+  used for BlueStore's internal journal or write-ahead log. A WAL device is
+  only useful if it is faster than the primary device (e.g., when it is on an
+  SSD and the primary device is an HDD).
+* A *DB device* (identified as ``block.db`` in the data directory) can be used
+  for storing BlueStore's internal metadata. BlueStore (or rather, the
+  embedded RocksDB) will put as much metadata as it can on the DB device to
+  improve performance. If the DB device fills up, metadata will spill back
+  onto the primary device (where it would have been otherwise). Again, it is
+  only helpful to provision a DB device if it is faster than the primary
+  device.
 
 If there is only a small amount of fast storage available (e.g., less
 than a gigabyte), we recommend using it as a WAL device. If there is
@@ -51,6 +51,73 @@ To specify a WAL device and/or DB device, ::
 .. note:: --data can be a Logical Volume using the vg/lv notation. Other devices
           can be existing logical volumes or GPT partitions
 
+Provisioning strategies
+-----------------------
+Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
+which had just one), the two common use cases below should help clarify the
+initial deployment strategy:
+
+**block (data) only**
+^^^^^^^^^^^^^^^^^^^^^
+If all the devices are the same type, for example all spinning drives, and
+there are no fast devices to combine with them, it makes sense to deploy with
+``block`` only and not try to separate ``block.db`` or ``block.wal``.
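+
+One way to confirm that all the devices on a node are of the same type is to
+check the rotational flag that the kernel reports for each disk, for example
+with a standard tool such as ``lsblk``::
+
+    $ lsblk -d -o NAME,ROTA,SIZE,TYPE
+
+A ``ROTA`` value of ``1`` indicates a spinning drive, while ``0`` indicates a
+solid state device. This check is only a convenience and is not required by
+``ceph-volume``.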
+The :ref:`ceph-volume-lvm` call for a single ``/dev/sda`` device would look
+like::
+
+    ceph-volume lvm create --bluestore --data /dev/sda
+
+If logical volumes have already been created for each device (one LV using
+100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
+``ceph-vg/block-lv`` would look like::
+
+    ceph-volume lvm create --bluestore --data ceph-vg/block-lv
+
+
+**block and block.db**
+^^^^^^^^^^^^^^^^^^^^^^
+If there is a mix of fast and slow devices (spinning and solid state), it is
+recommended to place ``block.db`` on the faster device while ``block`` (data)
+lives on the slower device (the spinning drive). The ``block.db`` logical
+volume should be as large as possible to avoid the performance penalty of
+metadata spilling over onto the slow device. The ``ceph-volume`` tool is
+currently not able to create these automatically, so the volume groups and
+logical volumes need to be created manually.
+
+For the example below, let us assume four spinning drives (``sda``, ``sdb``,
+``sdc``, and ``sdd``) and one solid state drive (``sdx``). First create the
+volume groups::
+
+    $ vgcreate ceph-block-0 /dev/sda
+    $ vgcreate ceph-block-1 /dev/sdb
+    $ vgcreate ceph-block-2 /dev/sdc
+    $ vgcreate ceph-block-3 /dev/sdd
+
+Now create the logical volumes for ``block``::
+
+    $ lvcreate -l 100%FREE -n block-0 ceph-block-0
+    $ lvcreate -l 100%FREE -n block-1 ceph-block-1
+    $ lvcreate -l 100%FREE -n block-2 ceph-block-2
+    $ lvcreate -l 100%FREE -n block-3 ceph-block-3
+
+We are creating four OSDs for the four slow spinning devices, so assuming a
+200GB SSD in ``/dev/sdx``, we will create four logical volumes of 50GB each::
+
+    $ vgcreate ceph-db-0 /dev/sdx
+    $ lvcreate -L 50GB -n db-0 ceph-db-0
+    $ lvcreate -L 50GB -n db-1 ceph-db-0
+    $ lvcreate -L 50GB -n db-2 ceph-db-0
+    $ lvcreate -L 50GB -n db-3 ceph-db-0
+
+Finally, create the four OSDs with ``ceph-volume``::
+
+    $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
+    $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
+    $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
+    $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
+
+These operations should end up creating four OSDs, with ``block`` on the
+slower spinning drives and a 50GB logical volume (``block.db``) for each OSD
+on the solid state drive.
+
+
 Cache size
 ==========
-- 
2.39.5