doc/rados/bluestore: RockDB cache shards, perf counters

author Adam Kupczyk <akupczyk@ibm.com>

Mon, 4 May 2026 18:47:12 +0000 (20:47 +0200)

committer Adam Kupczyk <akupczyk@ibm.com>

Wed, 27 May 2026 11:51:33 +0000 (13:51 +0200)
author Adam Kupczyk <akupczyk@ibm.com>
Mon, 4 May 2026 18:47:12 +0000 (20:47 +0200)
committer Adam Kupczyk <akupczyk@ibm.com>
Wed, 27 May 2026 11:51:33 +0000 (13:51 +0200)
diff --git a/doc/rados/bluestore/rocksdb-config.rst b/doc/rados/bluestore/rocksdb-config.rst

new file mode 100644 (file)

index 0000000..9ba5372
--- /dev/null
+++ b/doc/rados/bluestore/rocksdb-config.rst
@@ -0,0 +1,140 @@
+============================
+ RocksDB Config Reference
+============================
+
+.. note:: As of the Tentacle release, two Ceph services use RocksDB: Monitors and OSDs.
+   This document focuses on OSD's RocksDB.
+
+.. index:: bluestore; rocksdb caching
+
+RocksDB caching
+===============
+
+RocksDB caching is based on preserving parts of ``.sst`` files in block cache.
+For more details, see the `Block-Cache wiki <https://github.com/facebook/rocksdb/wiki/Block-Cache>`_.
+Ceph implements its own flavor of block cache.
+See the `source code <https://github.com/ceph/ceph/tree/main/src/kv/rocksdb_cache>`_ for more details.
+This custom implementation brings together RocksDB block cache, BlueStore metadata cache,
+and BlueStore data cache to compete for available memory.
+
+Cache sharding
+--------------
+
+As the default RocksDB block cache, Ceph block cache is sharded.
+Sharding is controlled by configuration. The purpose of sharding is to
+streamline multi-threaded access.
+
+.. confval:: rocksdb_cache_shard_bits
+
+
+Perf counters
+-------------
+
+Ceph RocksDB cache operations are tracked by performance counters.
+In the default configuration BlueStore creates two block caches:
+``O`` for onodes, and ``default`` for everything else.
+The sections created in performance counters are named ``rocksdb-cache-O`` and
+``rocksdb-cache-default``.
+
+.. prompt:: bash #
+
+  ceph tell osd.0 perf dump rocksdb-cache-O
+
+.. code-block:: 
+
+  "rocksdb-cache-O": {
+    "capacity": 134217728,
+    "usage": 134182832,
+    "pinned": 0,
+    "elems": 24502,
+    "inserts": 25806978,
+    "lookups": 150436987,
+    "hits": 124629911,
+    "misses": 25807076
+  }
+
+Values ``capacity``, ``usage``, ``pinned`` and ``elems`` reflect the current state of the cache.
+Values ``inserts``, ``lookups``, ``hits`` and ``misses`` are increased on relevant event.
+
+.. index:: rocksdb; perf counters
+
+Admin commands
+--------------
+
+Performance counters show a brief summation, but each cache shard has its own stats.
+To list RocksDB onode block cache details for each shard, run an admin socket command:
+
+.. prompt:: bash #
+
+  ceph tell osd.0 rocksdb show cache O
+
+.. code-block::
+
+    shard  capacity     usage   pinned   elems  inserts  lookups     hits   misses
+        0  13631488  11076400        0    2099   136987   822679   685923   136756
+        1  13631488  11549712        0    2043   133359   571500   438383   133117
+        2  13631488  11060608        0    2232   135076   908468   773313   135155
+        3  13631488  11166896        0    2269   134006   427070   293147   133923
+        4  13631488  11117984        0    2297   133367   700242   567318   132924
+        5  13631488  11306672        0    2155   137501  1130135   991810   138325
+        6  13631488  11506512        0    2353   134515   662792   528514   134278
+        7  13631488  11093856        0    2316   135348   718971   583421   135550
+        8  13631488  11660624        0    2424   137363  1092043   954248   137795
+        9  13631488  10962000        0    2561   131982   431702   300467   131235
+       10  13631488  11379392        0    1916   134543   477118   342854   134264
+       11  13631488  11294272        0    2555   134508   512393   378337   134056
+       12  13631488  11277136        0    2079   137312  1131571   993692   137879
+       13  13631488  10887776        0    2543   134001   567073   432903   134170
+       14  13631488  10986528        0    2394   133288   584452   451018   133434
+       15  13631488  11954464        0    2456   134615   708285   573374   134911
+
+Counters that support clearing can be reset to zero by running a command:
+
+.. prompt:: bash #
+
+   ceph tell osd.0 rocksdb reset cache O
+
+.. index:: bluestore; rocksdb; admin commands
+
+Optimum shard count
+-------------------
+
+In most cases 16 shards as defined by ``rocksdb_cache_shard_bits=4`` is a good choice.
+Large OSDs can easily accommodate millions of objects and thus having millions of keys
+to encode onode metadata. While the number of keys is not directly a problem,
+it causes RocksDB to create very large index blocks.
+During leveled compaction RocksDB merges two levels. Two index blocks are
+needed at the same time.
+A problem appears if both index blocks belong to the same shard
+and their total size is more than capacity of the shard.
+Both index blocks cannot fit in the shard simultaneously and access to index block
+causes the other block to be evicted from the cache.
+The constant thrashing persists until the current RocksDB compaction step finishes.
+
+Thrashing detection
+*******************
+
+.. prompt:: bash #
+
+  ceph tell osd.0 rocksdb show cache O
+
+.. code-block::
+
+    shard  capacity     usage   pinned   elems  inserts  lookups     hits   misses
+      ...
+        2  13631488  11060608        0    2232   135076   908468   773313   135155
+        3  13631488   8166896        0       1   134006   427070   293147   133923
+        4  13631488  11117984        0    2297   133367   700242   567318   132924
+      ...
+
+It is likely that shard 3 is doing constant eviction. To verify, reset counters to zero
+and observe the relation between ``misses`` and ``hits``.
+Also, the perf counter for ``bluefs.read_bytes`` will be increasing very fast as RocksDB is
+reading the same index blocks over and over again.
+
+Mitigation
+**********
+
+More like a workaround. Reduce ``rocksdb_cache_shard_bits``.
+This will have a slight negative effect on the baseline RocksDB performance,
+as fewer shards means more opportunity for lock contention.
diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst

index 715b999d1869d970dc2ebff9986669bc919b3cc5..6c73c67ed9a0e117ad9ce162f99aa0e574f812c0 100644 (file)
--- a/doc/rados/configuration/index.rst
+++ b/doc/rados/configuration/index.rst
@@ -20,6 +20,7 @@ For general object store configuration, refer to the following:
     :maxdepth: 1
  
     Storage devices <storage-devices>
+   BlueStore RocksDB cache <../bluestore/rocksdb-config>
     ceph-conf
author	Adam Kupczyk <akupczyk@ibm.com>
	Mon, 4 May 2026 18:47:12 +0000 (20:47 +0200)
committer	Adam Kupczyk <akupczyk@ibm.com>
	Wed, 27 May 2026 11:51:33 +0000 (13:51 +0200)
doc/rados/bluestore/rocksdb-config.rst	[new file with mode: 0644]	patch \| blob
doc/rados/configuration/index.rst		patch \| blob \| history