From: Sage Weil
Date: Sat, 14 May 2016 12:23:15 +0000 (-0400)
Subject: doc/dev/bluestore: write path notes
X-Git-Tag: v11.0.0~359^2~87
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=c71975aa4dc208a0c0fb07971ba9e2f2433e93bb;p=ceph.git

doc/dev/bluestore: write path notes

Signed-off-by: Sage Weil
---

diff --git a/doc/dev/bluestore.rst b/doc/dev/bluestore.rst
new file mode 100644
index 000000000000..756877353731
--- /dev/null
+++ b/doc/dev/bluestore.rst
@@ -0,0 +1,85 @@
+===================
+BlueStore Internals
+===================
+
+
+Small write strategies
+----------------------
+
+* *U*: Uncompressed write of a complete, new blob.
+
+  - write to new blob
+  - kv commit
+
+* *P*: Uncompressed partial write to unused region of an existing
+  blob.
+
+  - write to unused chunk(s) of existing blob
+  - kv commit
+
+* *W*: WAL overwrite: commit intent to overwrite, then overwrite
+  async. Must be chunk_size = MAX(block_size, csum_block_size)
+  aligned.
+
+  - kv commit
+  - wal overwrite (chunk-aligned) of existing blob
+
+* *N*: Uncompressed partial write to a new blob. Initially sparsely
+  utilized. Future writes will either be *P* or *W*.
+
+  - write into a new (sparse) blob
+  - kv commit
+
+* *R+W*: Read partial chunk, then to WAL overwrite.
+
+  - read (out to chunk boundaries)
+  - kv commit
+  - wal overwrite (chunk-aligned) of existing blob
+
+* *C*: Compress data, write to new blob.
+
+  - compress and write to new blob
+  - kv commit
+
+Possible future modes
+---------------------
+
+* *F*: Fragment lextent space by writing small piece of data into a
+  piecemeal blob (that collects random, noncontiguous bits of data we
+  need to write).
+
+  - write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)
+  - kv commit
+
+* *X*: WAL read/modify/write on a single block (like legacy
+  bluestore). No checksum.
+
+  - kv commit
+  - wal read/modify/write
+
+Mapping
+-------
+
+This very roughly maps the type of write onto what we do when we
+encounter a given blob. In practice it's a bit more complicated since there
+might be several blobs to consider (e.g., we might be able to *W* into one or
+*P* into another), but it should communicate a rough idea of strategy.
+
++--------------------------+--------+--------------+-------------+--------------+---------------+
+|                          | raw    | raw (cached) | csum (4 KB) | csum (16 KB) | comp (128 KB) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 128+ KB (over)write      | U      | U            | U           | U            | C             |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 64 KB (over)write        | U      | U            | U           | U            | U or C        |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 4 KB overwrite           | W      | W            | W           | R+W          | P|N (F?)      |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte overwrite       | R+W    | W            | R+W         | R+W          | P|N (F?)      |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte append          | R+W    | W            | R+W         | R+W          | P|N (F?)      |
++--------------------------+--------+--------------+-------------+--------------+---------------+
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 4 KB clone overwrite     | P|N    | P|N          | P|N         | P|N          | N (F?)        |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte clone overwrite | P|N    | P|N          | P|N         | P|N          | N (F?)        |
++--------------------------+--------+--------------+-------------+--------------+---------------+
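
Note on the chunk-alignment rule: the choice between *W* and *R+W* in the notes depends only on whether the overwrite is aligned to chunk_size = MAX(block_size, csum_block_size). The sketch below is not BlueStore code. It is a minimal Python model of that one rule, assuming an overwrite of an existing, uncompressed, uncached, unshared blob. The function names are hypothetical, and the *P*, *N*, *C* and clone cases are deliberately left out.

```python
from typing import Tuple


def chunk_size(block_size: int, csum_block_size: int = 0) -> int:
    """chunk_size = MAX(block_size, csum_block_size), as stated for *W*."""
    return max(block_size, csum_block_size)


def chunk_extent(offset: int, length: int, chunk: int) -> Tuple[int, int]:
    """Expand [offset, offset + length) out to chunk boundaries.

    Returns (aligned_offset, aligned_length). For *R+W* this is the
    region that is read first and then overwritten via the WAL.
    Assumes length > 0.
    """
    start = (offset // chunk) * chunk
    end = -(-(offset + length) // chunk) * chunk  # round up to a chunk multiple
    return start, end - start


def overwrite_strategy(offset: int, length: int,
                       block_size: int, csum_block_size: int = 0) -> str:
    """Return "W" for a chunk-aligned overwrite, else "R+W".

    Hypothetical helper: it ignores caching, compression and clones,
    all of which change the answer in the mapping table.
    """
    chunk = chunk_size(block_size, csum_block_size)
    if offset % chunk == 0 and length % chunk == 0:
        return "W"
    return "R+W"


if __name__ == "__main__":
    # 4 KB overwrite with a 4 KB block size and 16 KB checksum blocks
    print(overwrite_strategy(0, 4096, 4096, 16384))  # R+W
    print(chunk_extent(0, 4096, 16384))              # (0, 16384)
```

Under these assumptions the model agrees with the "raw", "csum (4 KB)" and "csum (16 KB)" columns of the mapping table for the 4 KB and 100 byte overwrite rows. The "raw (cached)" column differs for 100 byte writes because the cached data makes the read unnecessary, which this sketch does not model.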