From f2bcd0250bf2751f4f739ba788f68d1bb6cf297e Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Thu, 3 Aug 2017 09:21:18 -0400
Subject: [PATCH] doc/rados/configuration/bluestore-config-ref: devices,
 checksumming, cache

Signed-off-by: Sage Weil
---
 .../configuration/bluestore-config-ref.rst | 206 +++++++++++++++++-
 1 file changed, 204 insertions(+), 2 deletions(-)

diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst
index 254f99a133223..86c44ce00e7ed 100644
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -2,11 +2,202 @@
 BlueStore Config Reference
 ==========================
 
+Devices
+=======
+
+BlueStore manages either one, two, or (in certain cases) three storage
+devices.
+
+In the simplest case, BlueStore consumes a single (primary) storage
+device. The storage device is normally partitioned into two parts:
+
+#. A small partition is formatted with XFS and contains basic metadata
+   for the OSD. This *data directory* includes information about the
+   OSD (its identifier, which cluster it belongs to, and its private
+   keyring).
+#. The rest of the device is a large partition that is managed
+   directly by BlueStore and contains all of the actual data. This
+   *main device* is normally identified by a ``block`` symlink in the
+   data directory.
+
+It is also possible to deploy BlueStore across one or two additional
+devices:
+
+* A *WAL device* can be used for BlueStore's internal journal or
+  write-ahead log. It is identified by the ``block.wal`` symlink in
+  the data directory. It is only useful to use a WAL device if the
+  device is faster than the primary device (e.g., when it is on an SSD
+  and the primary device is an HDD).
+* A *DB device* can be used for storing BlueStore's internal metadata.
+  BlueStore (or rather, the embedded RocksDB) will put as much
+  metadata as it can on the DB device to improve performance.
+  If the DB device fills up, metadata will spill back onto the primary
+  device (where it would have been otherwise). Again, it is only
+  helpful to provision a DB device if it is faster than the primary
+  device.
+
+If there is only a small amount of fast storage available (e.g., less
+than a gigabyte), we recommend using it as a WAL device. If there is
+more, provisioning a DB device makes more sense. The BlueStore
+journal will always be placed on the fastest device available, so
+using a DB device will provide the same benefit that the WAL device
+would while *also* allowing additional metadata to be stored there (if
+it will fit).
+
+A single-device BlueStore OSD can be provisioned with::
+
+  ceph-disk prepare --bluestore <device>
+
+To specify a WAL device and/or DB device, ::
+
+  ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
+
+Cache size
+==========
+
+The amount of memory consumed by each OSD for BlueStore's cache is
+determined by the ``bluestore_cache_size`` configuration option. If
+that config option is not set (i.e., it remains at 0), a different
+default value is used depending on whether an HDD or SSD is used for
+the primary device (set by the ``bluestore_cache_size_ssd`` and
+``bluestore_cache_size_hdd`` config options).
+
+BlueStore and the rest of the Ceph OSD currently do the best they can
+to stick to the budgeted memory. Note that on top of the configured
+cache size, there is also memory consumed by the OSD itself, and
+generally some overhead due to memory fragmentation and other
+allocator overhead.
+
+The configured cache memory budget can be used in a few different ways:
+
+* Key/Value metadata (i.e., RocksDB's internal cache)
+* BlueStore metadata
+* BlueStore data (i.e., recently read or written object data)
+
+Cache memory usage is governed by the following options:
+``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and
+``bluestore_cache_kv_max``.
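As a rough illustration of how these three options interact, consider the following sketch. It is not Ceph code: the function and variable names are hypothetical, and the assumption that memory freed by the kv cap is left to the data cache is a simplification, not BlueStore's actual accounting.

```python
# Hypothetical sketch of dividing a BlueStore cache budget.
# Assumes the kv share is capped at kv_max and that the remainder
# (after kv and meta) goes to the data cache; actual accounting may differ.

def cache_split(cache_size, meta_ratio=0.01, kv_ratio=0.99,
                kv_max=512 * 1024 * 1024):
    kv = min(cache_size * kv_ratio, kv_max)   # RocksDB cache, capped
    meta = cache_size * meta_ratio            # BlueStore metadata
    data = cache_size - kv - meta             # remainder for object data
    return kv, meta, data

# With a 1 GB budget (the HDD default), the kv share hits the 512 MB cap:
kv, meta, data = cache_split(1 * 1024 * 1024 * 1024)
print(kv / 2**20, meta / 2**20, data / 2**20)  # 512.0 MB kv
```

With the defaults and a 1 GB budget, the cap (rather than the 0.99 ratio) determines the kv share, which is why the cap matters most on large cache sizes.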
+The fraction of the cache devoted to data is 1.0 minus the meta and
+kv ratios. The memory devoted to kv metadata (the RocksDB cache) is
+capped by ``bluestore_cache_kv_max``, since our testing indicates
+there are diminishing returns beyond a certain point.
+
+``bluestore_cache_size``
+
+:Description: The amount of memory BlueStore will use for its cache.
+              If zero, ``bluestore_cache_size_hdd`` or
+              ``bluestore_cache_size_ssd`` will be used instead.
+:Type: Integer
+:Required: Yes
+:Default: ``0``
+
+``bluestore_cache_size_hdd``
+
+:Description: The default amount of memory BlueStore will use for its
+              cache when backed by an HDD.
+:Type: Integer
+:Required: Yes
+:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)
+
+``bluestore_cache_size_ssd``
+
+:Description: The default amount of memory BlueStore will use for its
+              cache when backed by an SSD.
+:Type: Integer
+:Required: Yes
+:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)
+
+``bluestore_cache_meta_ratio``
+
+:Description: The ratio of the cache devoted to metadata.
+:Type: Floating point
+:Required: Yes
+:Default: ``.01``
+
+``bluestore_cache_kv_ratio``
+
+:Description: The ratio of the cache devoted to key/value data
+              (RocksDB).
+:Type: Floating point
+:Required: Yes
+:Default: ``.99``
+
+``bluestore_cache_kv_max``
+
+:Description: The maximum amount of cache devoted to key/value data
+              (RocksDB).
+:Type: Integer
+:Required: Yes
+:Default: ``512 * 1024 * 1024`` (512 MB)
+
+
+Checksums
+=========
+
+BlueStore checksums all metadata and data written to disk. Metadata
+checksumming is handled by RocksDB and uses `crc32c`. Data
+checksumming is done by BlueStore and can make use of `crc32c`,
+`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
+suitable for most purposes.
+
+Full data checksumming does increase the amount of metadata that
+BlueStore must store and manage.
+When possible, e.g., when clients hint that data is written and read
+sequentially, BlueStore will checksum larger blocks, but in many cases
+it must store a checksum value (usually 4 bytes) for every 4 kilobyte
+block of data.
+
+It is possible to use a smaller checksum value by truncating the
+checksum to two or one byte, reducing the metadata overhead. The
+trade-off is that the probability that a random error will go
+undetected is higher with a smaller checksum: from about one in four
+billion with a 32-bit (4 byte) checksum, to one in 65,536 for a
+16-bit (2 byte) checksum, or one in 256 for an 8-bit (1 byte)
+checksum. The smaller checksum values can be used by selecting
+`crc32c_16` or `crc32c_8` as the checksum algorithm.
+
+The *checksum algorithm* can be set either via a per-pool
+``csum_type`` property or the global config option. For example, ::
+
+  ceph osd pool set <pool-name> csum_type <algorithm>
+
+``bluestore_csum_type``
+
+:Description: The default checksum algorithm to use.
+:Type: String
+:Required: Yes
+:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
+:Default: ``crc32c``
+
 
 Inline Compression
 ==================
 
-BlueStore supports inline compression using snappy, zlib, or LZ4. Please note,
-the lz4 compression plugin is not distributed in the official release.
+BlueStore supports inline compression using `snappy`, `zlib`, or
+`lz4`. Please note that the `lz4` compression plugin is not
+distributed in the official release.
+
+Whether data in BlueStore is compressed is determined by a combination
+of the *compression mode* and any hints associated with a write
+operation. The modes are:
+
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
+  *compressible* hint set.
+* **aggressive**: Compress data unless the write operation has an
+  *incompressible* hint set.
+* **force**: Try to compress data no matter what.
+
+For more information about the *compressible* and *incompressible* IO
+hints, see :doc:`/api/librados/#rados_set_alloc_hint`.
+
+Note that regardless of the mode, if the size of the data chunk is not
+reduced sufficiently it will not be used and the original
+(uncompressed) data will be stored. For example, if the ``bluestore
+compression required ratio`` is set to ``.7``, then the compressed
+data must be 70% of the size of the original (or smaller).
+
+The *compression mode*, *compression algorithm*, *compression required
+ratio*, *min blob size*, and *max blob size* can be set either via a
+per-pool property or a global config option. Pool properties can be
+set with::
+
+  ceph osd pool set <pool-name> compression_algorithm <algorithm>
+  ceph osd pool set <pool-name> compression_mode <mode>
+  ceph osd pool set <pool-name> compression_required_ratio <ratio>
+  ceph osd pool set <pool-name> compression_min_blob_size <size>
+  ceph osd pool set <pool-name> compression_max_blob_size <size>
 
 ``bluestore compression algorithm``
 
@@ -33,6 +224,17 @@ the lz4 compression plugin is not distributed in the official release.
 :Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
 :Default: ``none``
 
+``bluestore compression required ratio``
+
+:Description: The ratio of the size of the data chunk after
+              compression relative to the original size must be at
+              most this value in order to store the compressed
+              version.
+
+:Type: Floating point
+:Required: No
+:Default: ``.875``
+
 ``bluestore compression min blob size``
 
 :Description: Chunks smaller than this are never compressed.
-- 
2.39.5