From: Adam Kupczyk Date: Tue, 9 Jun 2026 18:16:03 +0000 (+0200) Subject: doc/rados/bluestore: Fast onode recovery X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=d7fedf1ecead7ad74579c2236f9fa175bc0c0ef4;p=ceph.git doc/rados/bluestore: Fast onode recovery Add new page of documentation about new feature: multithread onode recovery. Signed-off-by: Adam Kupczyk --- diff --git a/doc/rados/bluestore/fast-onode-scan.rst b/doc/rados/bluestore/fast-onode-scan.rst new file mode 100644 index 00000000000..f82f21aab9d --- /dev/null +++ b/doc/rados/bluestore/fast-onode-scan.rst @@ -0,0 +1,168 @@ +============================ + Multithread Onode Scanning +============================ + +.. index:: bluestore; rocksdb; allocator + +Since Pacific release BlueStore has the option to select 5-10% latency reduction +at a cost of significantly longer ``OSD`` recovery after crash. + +.. confval:: bluestore_allocation_from_file + +During crash recovery BlueStore must reconstruct allocator state by scanning all stored objects. +On systems with millions of objects this process can take several minutes. +This page presents new multithreaded onode recovery that significantly reduces recovery time. + +Allocator +========= + +In BlueStore ``Allocator`` is responsible for tracking unused allocation units on the disk. +Up to 3 allocators can be in use concurrently, one for Main, one for DB, and one for WAL. +Only Main device can contain RADOS object, and only Main's allocator is relevant here. +When BlueStore is running, allocator state is fully loaded to memory. That way +allocator can quickly respond when there is a need to allocate disk space for object, +or when object no longer exists on disk. + +RocksDB persisted allocator +--------------------------- + +Original design envisioned that allocator state updates are stored in RocksDB's ``b`` column. +When RADOS object change allocates or releases allocation unit the relevant state update information +is kept together with object-modifying RocksDB atomic transaction. The information transfer is unidirectional, +BlueStore is updating allocator state in RocksDB, but never retrieving it. +The exception is OSD bootup when BlueStore retrieves allocator state from RocksDB ``b`` column. + +File persisted allocator +------------------------ + +It was tested that keeping RocksDB busy updating ``b`` column with allocation data +has observable cost on write latency and cpu burden on compaction. +To get that extra performance back the new default mode is to update in-memory allocator state only. +Instead, on shutdown state is saved to BlueFS file named ``ALLOCATOR_NCB_DIR/ALLOCATOR_NCB_FILE``. + +Crash recovery +-------------- + +If BlueStore was not shutdown orderly there is no file to read allocator state from. +Since it is the objects that define what is in use, the recovery procedure iterates over all objects and extracts used allocations. + +Multi-thread recovery +===================== + +Iterating over all onodes in the RocksDB can take several minutes. +To help with that a multi-threaded recovery procedure is created. +It is significantly different procedure than the original one. +There is just one configurable that selects and controls recovery. + +.. confval:: bluestore_allocation_recovery_threads + + +Performance +----------- + +Example recovery timings. +OSD with NVME SSD, hosting 7.5M RBD objects, 4.6TiB used. + ++------------------------+------------+------------+ +| recovery mechanism | threads | time(s) | ++========================+============+============+ +| original | 1 | 76.0 | ++------------------------+------------+------------+ +| multithread | 1 | 55.8 | ++------------------------+------------+------------+ +| multithread | 4 | 18.1 | ++------------------------+------------+------------+ +| multithread | 8 | 10.8 | ++------------------------+------------+------------+ +| multithread | 12 | 8.0 | ++------------------------+------------+------------+ +| multithread | 16 | 6.8 | ++------------------------+------------+------------+ +| multithread | 32 | 5.3 | ++------------------------+------------+------------+ + +Testing +------- + +If long recovery time has been a deciding factor for staying with allocations in RocksDB, +there is a procedure for measuring directly recovery time. + +.. prompt:: bash # + + ceph-bluestore-tool --path recovery-compare + +.. code-block:: + + recovery-compare bluestore compare new and legacy onode recovery + Legacy recovery start + Legacy recovery result=0 took 76.033992 seconds + Legacy recovery stats= + ========================================================== + onode_count = 7500217 + shard_count = 9754950 + shared_blob_count = 0 + compressed_blob_count = 0 + spanning_blob_count = 264519 + skipped_illegal_extent = 100815206 + extent_count = 188685236 + insert_count = 0 + store store_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 1 store_statfs(0x0/0x0/0x0, data 0x90220/0x94000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 2 store_statfs(0x0/0x0/0x0, data 0x4348a70e000/0x4c7ee684000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 18446744073709551615 store_statfs(0x0/0x0/0x0, data 0x26a3f/0x150000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + ========================================================== + + bluestore_allocation_recovery_threads = 0 + No multithread recovery to compare. + recovery-compare success + +.. prompt:: bash # + + ceph-bluestore-tool --path --bluestore_allocation_recovery_threads=12 recovery-compare + +.. code-block:: + + recovery-compare bluestore compare new and legacy onode recovery + New recovery start + New recovery result=0 took 8.033309 seconds + New recovery stats= + ========================================================== + onode_count = 7500217 + shard_count = 9754950 + shared_blob_count = 0 + compressed_blob_count = 0 + spanning_blob_count = 264519 + skipped_illegal_extent = 0 + extent_count = 188685236 + insert_count = 0 + store store_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 1 store_statfs(0x0/0x0/0x0, data 0x90220/0x94000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 2 store_statfs(0x0/0x0/0x0, data 0x4348a70e000/0x4c7ee684000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 18446744073709551615 store_statfs(0x0/0x0/0x0, data 0x26a3f/0x150000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + ========================================================== + + Legacy recovery start + Legacy recovery result=0 took 62.210723 seconds + Legacy recovery stats= + ========================================================== + onode_count = 7500217 + shard_count = 9754950 + shared_blob_count = 0 + compressed_blob_count = 0 + spanning_blob_count = 264519 + skipped_illegal_extent = 100815206 + extent_count = 188685236 + insert_count = 0 + store store_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 1 store_statfs(0x0/0x0/0x0, data 0x90220/0x94000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 2 store_statfs(0x0/0x0/0x0, data 0x4348a70e000/0x4c7ee684000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + pool 18446744073709551615 store_statfs(0x0/0x0/0x0, data 0x26a3f/0x150000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0) + ========================================================== + + FSstats the same. + Allocators the same. + recovery-compare success + +One can see above discrepancy between legacy recovery times: 76.0s vs 62.2s. +This is an effect of RocksDB having the data already cached by new recovery when legacy recovery is the second one. + diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst index 01b83f95472..0367b449416 100644 --- a/doc/rados/configuration/index.rst +++ b/doc/rados/configuration/index.rst @@ -22,6 +22,7 @@ For general object store configuration, refer to the following: Storage devices BlueStore RocksDB cache <../bluestore/rocksdb-config> BlueFS Spillover Cleaner <../bluestore/bluefs-spillover-cleaner> + Fast Crash Recovery for file-stored allocations <../bluestore/fast-onode-scan> ceph-conf