doc/rados/bluestore: Fast onode recovery

author Adam Kupczyk <akupczyk@ibm.com>

Tue, 9 Jun 2026 18:16:03 +0000 (20:16 +0200)

committer Adam Kupczyk <akupczyk@ibm.com>

Fri, 12 Jun 2026 17:12:35 +0000 (19:12 +0200)
author Adam Kupczyk <akupczyk@ibm.com>
Tue, 9 Jun 2026 18:16:03 +0000 (20:16 +0200)
committer Adam Kupczyk <akupczyk@ibm.com>
Fri, 12 Jun 2026 17:12:35 +0000 (19:12 +0200)
diff --git a/doc/rados/bluestore/fast-onode-scan.rst b/doc/rados/bluestore/fast-onode-scan.rst

new file mode 100644 (file)

index 0000000..f82f21a
--- /dev/null
+++ b/doc/rados/bluestore/fast-onode-scan.rst
@@ -0,0 +1,168 @@
+============================
+ Multithread Onode Scanning
+============================
+
+.. index:: bluestore; rocksdb; allocator
+
+Since Pacific release BlueStore has the option to select 5-10% latency reduction
+at a cost of significantly longer ``OSD`` recovery after crash.
+
+.. confval:: bluestore_allocation_from_file
+
+During crash recovery BlueStore must reconstruct allocator state by scanning all stored objects.
+On systems with millions of objects this process can take several minutes.
+This page presents new multithreaded onode recovery that significantly reduces recovery time.
+
+Allocator
+=========
+
+In BlueStore ``Allocator`` is responsible for tracking unused allocation units on the disk.
+Up to 3 allocators can be in use concurrently, one for Main, one for DB, and one for WAL.
+Only Main device can contain RADOS object, and only Main's allocator is relevant here.
+When BlueStore is running, allocator state is fully loaded to memory. That way
+allocator can quickly respond when there is a need to allocate disk space for object,
+or when object no longer exists on disk.
+
+RocksDB persisted allocator
+---------------------------
+
+Original design envisioned that allocator state updates are stored in RocksDB's ``b`` column.
+When RADOS object change allocates or releases allocation unit the relevant state update information
+is kept together with object-modifying RocksDB atomic transaction. The information transfer is unidirectional,
+BlueStore is updating allocator state in RocksDB, but never retrieving it.
+The exception is OSD bootup when BlueStore retrieves allocator state from RocksDB ``b`` column.
+
+File persisted allocator
+------------------------
+
+It was tested that keeping RocksDB busy updating ``b`` column with allocation data
+has observable cost on write latency and cpu burden on compaction.
+To get that extra performance back the new default mode is to update in-memory allocator state only.
+Instead, on shutdown state is saved to BlueFS file named ``ALLOCATOR_NCB_DIR/ALLOCATOR_NCB_FILE``.
+
+Crash recovery
+--------------
+
+If BlueStore was not shutdown orderly there is no file to read allocator state from.
+Since it is the objects that define what is in use, the recovery procedure iterates over all objects and extracts used allocations.
+
+Multi-thread recovery
+=====================
+
+Iterating over all onodes in the RocksDB can take several minutes.
+To help with that a multi-threaded recovery procedure is created.
+It is significantly different procedure than the original one.
+There is just one configurable that selects and controls recovery.
+
+.. confval:: bluestore_allocation_recovery_threads
+
+
+Performance
+-----------
+
+Example recovery timings.
+OSD with NVME SSD, hosting 7.5M RBD objects, 4.6TiB used.
+
++------------------------+------------+------------+
+| recovery mechanism     | threads    | time(s)    |
++========================+============+============+
+| original               | 1          | 76.0       |
++------------------------+------------+------------+
+| multithread            | 1          | 55.8       |
++------------------------+------------+------------+
+| multithread            | 4          | 18.1       |
++------------------------+------------+------------+
+| multithread            | 8          | 10.8       |
++------------------------+------------+------------+
+| multithread            | 12         | 8.0        |
++------------------------+------------+------------+
+| multithread            | 16         | 6.8        |
++------------------------+------------+------------+
+| multithread            | 32         | 5.3        |
++------------------------+------------+------------+
+
+Testing
+-------
+
+If long recovery time has been a deciding factor for staying with allocations in RocksDB,
+there is a procedure for measuring directly recovery time.
+
+.. prompt:: bash #
+
+  ceph-bluestore-tool --path <osd-data-path> recovery-compare
+
+.. code-block::
+
+  recovery-compare bluestore compare new and legacy onode recovery
+  Legacy recovery start
+  Legacy recovery result=0 took 76.033992 seconds
+  Legacy recovery stats=
+  ==========================================================
+  onode_count             =    7500217
+  shard_count             =    9754950
+  shared_blob_count       =          0
+  compressed_blob_count   =          0
+  spanning_blob_count     =     264519
+  skipped_illegal_extent  =  100815206
+  extent_count            =  188685236
+  insert_count            =          0
+  store store_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 1 store_statfs(0x0/0x0/0x0, data 0x90220/0x94000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 2 store_statfs(0x0/0x0/0x0, data 0x4348a70e000/0x4c7ee684000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 18446744073709551615 store_statfs(0x0/0x0/0x0, data 0x26a3f/0x150000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  ==========================================================
+
+  bluestore_allocation_recovery_threads = 0
+  No multithread recovery to compare.
+  recovery-compare success
+
+.. prompt:: bash #
+
+  ceph-bluestore-tool --path <osd-data-path> --bluestore_allocation_recovery_threads=12 recovery-compare
+
+.. code-block::
+
+  recovery-compare bluestore compare new and legacy onode recovery
+  New recovery start
+  New recovery result=0 took 8.033309 seconds
+  New recovery stats=
+  ==========================================================
+  onode_count             =    7500217
+  shard_count             =    9754950
+  shared_blob_count       =          0
+  compressed_blob_count   =          0
+  spanning_blob_count     =     264519
+  skipped_illegal_extent  =          0
+  extent_count            =  188685236
+  insert_count            =          0
+  store store_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 1 store_statfs(0x0/0x0/0x0, data 0x90220/0x94000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 2 store_statfs(0x0/0x0/0x0, data 0x4348a70e000/0x4c7ee684000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 18446744073709551615 store_statfs(0x0/0x0/0x0, data 0x26a3f/0x150000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  ==========================================================
+
+  Legacy recovery start
+  Legacy recovery result=0 took 62.210723 seconds
+  Legacy recovery stats=
+  ==========================================================
+  onode_count             =    7500217
+  shard_count             =    9754950
+  shared_blob_count       =          0
+  compressed_blob_count   =          0
+  spanning_blob_count     =     264519
+  skipped_illegal_extent  =  100815206
+  extent_count            =  188685236
+  insert_count            =          0
+  store store_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 1 store_statfs(0x0/0x0/0x0, data 0x90220/0x94000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 2 store_statfs(0x0/0x0/0x0, data 0x4348a70e000/0x4c7ee684000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  pool 18446744073709551615 store_statfs(0x0/0x0/0x0, data 0x26a3f/0x150000, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
+  ==========================================================
+
+  FSstats the same.
+  Allocators the same.
+  recovery-compare success
+
+One can see above discrepancy between legacy recovery times: 76.0s vs 62.2s.
+This is an effect of RocksDB having the data already cached by new recovery when legacy recovery is the second one.
+
diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst

index 01b83f95472ac043b664298a4fb6afd6980b5607..0367b4494165d337bf11aba80a1b59469302ac52 100644 (file)
--- a/doc/rados/configuration/index.rst
+++ b/doc/rados/configuration/index.rst
@@ -22,6 +22,7 @@ For general object store configuration, refer to the following:
     Storage devices <storage-devices>
     BlueStore RocksDB cache <../bluestore/rocksdb-config>
     BlueFS Spillover Cleaner <../bluestore/bluefs-spillover-cleaner>
+   Fast Crash Recovery for file-stored allocations <../bluestore/fast-onode-scan>
     ceph-conf
author	Adam Kupczyk <akupczyk@ibm.com>
	Tue, 9 Jun 2026 18:16:03 +0000 (20:16 +0200)
committer	Adam Kupczyk <akupczyk@ibm.com>
	Fri, 12 Jun 2026 17:12:35 +0000 (19:12 +0200)
doc/rados/bluestore/fast-onode-scan.rst	[new file with mode: 0644]	patch \| blob
doc/rados/configuration/index.rst		patch \| blob \| history