From: Adam Kupczyk Date: Mon, 4 May 2026 18:47:12 +0000 (+0200) Subject: doc/rados/bluestore: RockDB cache shards, perf counters X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=1378e2656809ab396d4618dbe64ffb0b17cb4055;p=ceph.git doc/rados/bluestore: RockDB cache shards, perf counters Document role of RocksDB cache shards. Explain how to monitor cache status and performance. Give explanation how to detect cache flickering. Signed-off-by: Adam Kupczyk --- diff --git a/doc/rados/bluestore/rocksdb-config.rst b/doc/rados/bluestore/rocksdb-config.rst new file mode 100644 index 00000000000..9ba5372610a --- /dev/null +++ b/doc/rados/bluestore/rocksdb-config.rst @@ -0,0 +1,140 @@ +============================ + RocksDB Config Reference +============================ + +.. note:: As of the Tentacle release, two Ceph services use RocksDB: Monitors and OSDs. + This document focuses on OSD's RocksDB. + +.. index:: bluestore; rocksdb caching + +RocksDB caching +=============== + +RocksDB caching is based on preserving parts of ``.sst`` files in block cache. +For more details, see the `Block-Cache wiki `_. +Ceph implements its own flavor of block cache. +See the `source code `_ for more details. +This custom implementation brings together RocksDB block cache, BlueStore metadata cache, +and BlueStore data cache to compete for available memory. + +Cache sharding +-------------- + +As the default RocksDB block cache, Ceph block cache is sharded. +Sharding is controlled by configuration. The purpose of sharding is to +streamline multi-threaded access. + +.. confval:: rocksdb_cache_shard_bits + + +Perf counters +------------- + +Ceph RocksDB cache operations are tracked by performance counters. +In the default configuration BlueStore creates two block caches: +``O`` for onodes, and ``default`` for everything else. +The sections created in performance counters are named ``rocksdb-cache-O`` and +``rocksdb-cache-default``. + +.. prompt:: bash # + + ceph tell osd.0 perf dump rocksdb-cache-O + +.. code-block:: + + "rocksdb-cache-O": { + "capacity": 134217728, + "usage": 134182832, + "pinned": 0, + "elems": 24502, + "inserts": 25806978, + "lookups": 150436987, + "hits": 124629911, + "misses": 25807076 + } + +Values ``capacity``, ``usage``, ``pinned`` and ``elems`` reflect the current state of the cache. +Values ``inserts``, ``lookups``, ``hits`` and ``misses`` are increased on relevant event. + +.. index:: rocksdb; perf counters + +Admin commands +-------------- + +Performance counters show a brief summation, but each cache shard has its own stats. +To list RocksDB onode block cache details for each shard, run an admin socket command: + +.. prompt:: bash # + + ceph tell osd.0 rocksdb show cache O + +.. code-block:: + + shard capacity usage pinned elems inserts lookups hits misses + 0 13631488 11076400 0 2099 136987 822679 685923 136756 + 1 13631488 11549712 0 2043 133359 571500 438383 133117 + 2 13631488 11060608 0 2232 135076 908468 773313 135155 + 3 13631488 11166896 0 2269 134006 427070 293147 133923 + 4 13631488 11117984 0 2297 133367 700242 567318 132924 + 5 13631488 11306672 0 2155 137501 1130135 991810 138325 + 6 13631488 11506512 0 2353 134515 662792 528514 134278 + 7 13631488 11093856 0 2316 135348 718971 583421 135550 + 8 13631488 11660624 0 2424 137363 1092043 954248 137795 + 9 13631488 10962000 0 2561 131982 431702 300467 131235 + 10 13631488 11379392 0 1916 134543 477118 342854 134264 + 11 13631488 11294272 0 2555 134508 512393 378337 134056 + 12 13631488 11277136 0 2079 137312 1131571 993692 137879 + 13 13631488 10887776 0 2543 134001 567073 432903 134170 + 14 13631488 10986528 0 2394 133288 584452 451018 133434 + 15 13631488 11954464 0 2456 134615 708285 573374 134911 + +Counters that support clearing can be reset to zero by running a command: + +.. prompt:: bash # + + ceph tell osd.0 rocksdb reset cache O + +.. index:: bluestore; rocksdb; admin commands + +Optimum shard count +------------------- + +In most cases 16 shards as defined by ``rocksdb_cache_shard_bits=4`` is a good choice. +Large OSDs can easily accommodate millions of objects and thus having millions of keys +to encode onode metadata. While the number of keys is not directly a problem, +it causes RocksDB to create very large index blocks. +During leveled compaction RocksDB merges two levels. Two index blocks are +needed at the same time. +A problem appears if both index blocks belong to the same shard +and their total size is more than capacity of the shard. +Both index blocks cannot fit in the shard simultaneously and access to index block +causes the other block to be evicted from the cache. +The constant thrashing persists until the current RocksDB compaction step finishes. + +Thrashing detection +******************* + +.. prompt:: bash # + + ceph tell osd.0 rocksdb show cache O + +.. code-block:: + + shard capacity usage pinned elems inserts lookups hits misses + ... + 2 13631488 11060608 0 2232 135076 908468 773313 135155 + 3 13631488 8166896 0 1 134006 427070 293147 133923 + 4 13631488 11117984 0 2297 133367 700242 567318 132924 + ... + +It is likely that shard 3 is doing constant eviction. To verify, reset counters to zero +and observe the relation between ``misses`` and ``hits``. +Also, the perf counter for ``bluefs.read_bytes`` will be increasing very fast as RocksDB is +reading the same index blocks over and over again. + +Mitigation +********** + +More like a workaround. Reduce ``rocksdb_cache_shard_bits``. +This will have a slight negative effect on the baseline RocksDB performance, +as fewer shards means more opportunity for lock contention. diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst index 715b999d186..6c73c67ed9a 100644 --- a/doc/rados/configuration/index.rst +++ b/doc/rados/configuration/index.rst @@ -20,6 +20,7 @@ For general object store configuration, refer to the following: :maxdepth: 1 Storage devices + BlueStore RocksDB cache <../bluestore/rocksdb-config> ceph-conf