
[BlueStore]: [Remove Allocations from RocksDB]

Currently BlueStore keeps its allocation info inside RocksDB.
BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write is performed, which delays the write path and adds significant load on CPU, memory, and disk.
Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.

The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with all the OSD allocation state, in a single step during umount().
This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where umount() is not called.
We added code that performs a full allocation-map rebuild from the information stored inside the ONodes; it is used in those failure cases.
After a graceful shutdown there is no need for recovery: we simply read the allocation-map from the flat file where it was stored during umount(). (In fact, this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB.)
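
The two startup paths described above can be sketched roughly as follows. This is a minimal illustration, not the BlueStore code: `Extent`, `FreeMap`, `mark_used`, `rebuild_from_used`, and `open_allocation_map` are hypothetical names, the saved map stands in for the flat file written at umount(), and the used-extent list stands in for what a scan of the ONodes would produce.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Hypothetical types for illustration only; not the BlueStore API.
struct Extent {
    uint64_t offset;
    uint64_t length;
};

// Free-space map: offset -> length of each free extent, ordered by offset.
using FreeMap = std::map<uint64_t, uint64_t>;

// Remove [off, off+len) from the free map, splitting any free extent it overlaps.
void mark_used(FreeMap& free_map, uint64_t off, uint64_t len) {
    if (len == 0) return;
    const uint64_t end = off + len;
    auto it = free_map.upper_bound(off);
    if (it != free_map.begin()) --it;  // this extent may contain 'off'
    while (it != free_map.end() && it->first < end) {
        const uint64_t f_off = it->first;
        const uint64_t f_end = f_off + it->second;
        if (f_end <= off) { ++it; continue; }  // entirely before the used range
        it = free_map.erase(it);
        if (f_off < off) free_map.emplace(f_off, off - f_off);  // left remainder
        if (f_end > end) free_map.emplace(end, f_end - end);    // right remainder
    }
}

// Recovery path: start with the whole device free, then subtract every
// extent that the on-disk metadata says is in use.
FreeMap rebuild_from_used(uint64_t device_size, const std::vector<Extent>& used) {
    FreeMap free_map;
    if (device_size > 0) free_map.emplace(0, device_size);
    for (const Extent& e : used) mark_used(free_map, e.offset, e.length);
    return free_map;
}

// Startup: use the map saved at umount() if one is available (graceful
// shutdown); otherwise fall back to the full rebuild.
FreeMap open_allocation_map(const std::optional<FreeMap>& saved,
                            uint64_t device_size,
                            const std::vector<Extent>& used_from_onodes) {
    if (saved) return *saved;
    return rebuild_from_used(device_size, used_from_onodes);
}
```

In this sketch the rebuild cost grows with the number of used extents, while the saved-map path is a single read, which is consistent with the boot-time difference noted above.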

Open Issues:

There is a bug in the src/stop.sh script: it kills ceph without invoking umount(), which means anyone using it will always trigger the recovery path.
Adam Kupczyk is fixing this issue in a separate PR.
A simple workaround is to add a call to 'killall -15 ceph-osd' before calling src/stop.sh.

Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and without a call to umount().
This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
We plan on adding a follow-up PR that does the following in Fast-Shutdown and Ceph Suicide:

Block the OSD queues from accepting any new requests
Delete all queued items that we have not started yet
Drain all in-flight tasks
Call umount() (and destage the allocation-map)
If drain didn't complete within a predefined time-limit (say 3 minutes) -> kill the OSD
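
The planned sequence listed above can be sketched as follows. This is only an illustration of the control flow under stated assumptions: `OpQueue`, `fast_shutdown`, and `ShutdownResult` are hypothetical names, not Ceph classes, and the umount and kill actions are passed in as callbacks.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an OSD work queue; not a Ceph class.
class OpQueue {
public:
    // Returns false once the queue has been blocked for shutdown.
    bool submit(std::function<void()> op) {
        std::lock_guard<std::mutex> l(m_);
        if (blocked_) return false;
        pending_.push_back(std::move(op));
        return true;
    }

    // A worker takes the next queued op; from here on it counts as in-flight.
    bool start_next(std::function<void()>& op) {
        std::lock_guard<std::mutex> l(m_);
        if (pending_.empty()) return false;
        op = std::move(pending_.front());
        pending_.pop_front();
        ++in_flight_;
        return true;
    }

    // A worker reports that an in-flight op has completed.
    void finish_one() {
        std::lock_guard<std::mutex> l(m_);
        if (--in_flight_ == 0) drained_.notify_all();
    }

    // Block new requests, drop ops that have not started, then wait for
    // in-flight ops. Returns true if the drain finished within 'limit'.
    bool block_and_drain(std::chrono::milliseconds limit) {
        std::unique_lock<std::mutex> l(m_);
        blocked_ = true;
        pending_.clear();
        return drained_.wait_for(l, limit, [this] { return in_flight_ == 0; });
    }

private:
    std::mutex m_;
    std::condition_variable drained_;
    std::deque<std::function<void()>> pending_;
    std::size_t in_flight_ = 0;
    bool blocked_ = false;
};

enum class ShutdownResult { CleanUmount, Killed };

// If the drain completes in time, call umount (which would destage the
// allocation-map); otherwise give up and kill the OSD.
ShutdownResult fast_shutdown(OpQueue& q, std::chrono::milliseconds limit,
                             const std::function<void()>& umount,
                             const std::function<void()>& kill_osd) {
    if (q.block_and_drain(limit)) {
        umount();
        return ShutdownResult::CleanUmount;
    }
    kill_osd();
    return ShutdownResult::Killed;
}
```

With a time limit on the drain, a stuck in-flight task costs at most one full recovery on the next start rather than hanging the shutdown indefinitely.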

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
create allocator from on-disk onodes and BlueFS inodes
change allocator + add stat counters + report illegal physical-extents
compare allocator after rebuild from ONodes
prevent collection from being open twice
removed FSCK repo check for null-fm
Bug-Fix: don't add BlueFS allocation to shared allocator
add configuration option to commit to No-Column-B
Only invalidate allocation file after opening rocksdb in read-write mode
fix tests not to expect failure in cases inapplicable to null-allocator
accept non-existing allocation file and don't fail the invalidation as it could happen legally
don't commit to null-fm when db is opened in repair-mode
add a reverse mechanism from null_fm to real_fm (using RocksDB)
Using Ceph encode/decode, adding more info to header/trailer, add crc protection
Code cleanup

some changes requested by Adam (cleanup and style changes)

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>

author	Gabriel BenHanokh <benhanokh@gmail.com>
	Thu, 14 Jan 2021 06:59:35 +0000 (08:59 +0200)
committer	Gabriel BenHanokh <benhanokh@gmail.com>
	Wed, 11 Aug 2021 13:53:09 +0000 (16:53 +0300)
commit	272160ab5e44f03982c7de10abad047d2d08a87f
tree	9c9147c02aa17bca6f2c72aa4b01184e52c3cce5
parent	eb37e5b2e946e11ba31e0692030569cff7643b4b

doc/man/8/ceph-bluestore-tool.rst
src/common/options/global.yaml.in
src/os/bluestore/BitmapFreelistManager.cc
src/os/bluestore/BlueFS.cc
src/os/bluestore/BlueFS.h
src/os/bluestore/BlueStore.cc
src/os/bluestore/BlueStore.h
src/os/bluestore/FreelistManager.h
src/os/bluestore/bluestore_tool.cc
src/test/objectstore/store_test.cc