--- /dev/null
+=================================
+ The Logical Address in SeaStore
+=================================
+
+The laddr (the key of the LBA btree) is the central part of the
+extent locating process. The legacy implementation of laddr
+allocation, however, has several defects, such as the difficulty of
+supporting bit mapping and an inefficient cloning process. This
+document proposes a new design to address these issues.
+
+Defects of Legacy Laddr Hint
+============================
+
+One logical extent is normally derived from an onode, so to allocate a
+laddr, we need to use the information of the corresponding onode to
+construct a laddr hint and use it to search the LBA btree for a
+suitable laddr. (Not all logical extents are related to an onode, but
+those are only a small part of the total logical extents, so we can
+ignore them for now.)
+
+The legacy laddr hint is constructed as follows (note that each laddr
+represents one 4KiB physical block on the disk):
+
+::
+
+ [shard:8][pool:8][crush:32][zero:16]
+
+Each pair of square brackets represents a property stored within the
+laddr integer. Each property contains a name and the number of bits it
+uses.
+
+When a laddr allocation conflicts, the strategy is to search linearly
+backwards from the hint for free space.
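
The legacy packing above can be sketched as follows (a minimal
illustration of the bit layout; ``make_legacy_hint`` is a hypothetical
name, not the actual SeaStore helper):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the legacy 64-bit hint: [shard:8][pool:8][crush:32][zero:16].
// The leftmost field in the layout occupies the most significant bits.
uint64_t make_legacy_hint(uint8_t shard, uint8_t pool, uint32_t crush) {
  return (uint64_t(shard) << 56)   // [shard:8], bits 63..56
       | (uint64_t(pool)  << 48)   // [pool:8], bits 55..48
       | (uint64_t(crush) << 16);  // [crush:32]; the low 16 bits stay zero
}
```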
+
+This laddr layout and conflict policy have several problems:
+
+#. The pool field is short; pools with id above 127 are used for
+ temporary recovery data by convention, which causes normal object
+ data to be mixed with temporary recovery data, and when the pool id
+ is larger than 255, data from different pools are mixed together.
+#. The shard field is long; we are unlikely to use more than 128
+ shards for EC.
+#. The snap info is missing; when taking a snapshot of an onode, the
+   clones share the same hint, which makes the cloning process
+   inefficient.
+#. A hint covers a 256MiB logical address space, which means that once
+   an onode has more than 16 snapshots (each onode reserves 16MiB of
+   space, not counting metadata extents under the same onode), it will
+   conflict with other onodes' laddr hints.
+#. The most significant bits in the crush field are the pg id, which
+   makes the distribution of laddr uniform at the pool level; on each
+   OSD, however, the laddr distribution is sparse, and the remaining
+   bits make conflicts more frequent as the pg id grows.
+
+Considering these issues, we can conclude that 64-bit length is
+insufficient if we want to maintain semantic information within laddr
+(i.e., objects under the same pg share the same laddr prefix).
+
+128-bit static layout laddr design
+==================================
+
+laddr uses a 128-bit integer to represent the address value
+internally. Besides the basic properties inherited from integers,
+such as strong ordering and arithmetic operations, certain types of
+laddr also include properties derived from user data (RADOS objects).
+The static layout design ensures these properties are deterministic
+and predictable, allowing further optimizations based on laddr.
+
+Overview
+--------
+
+The layout of laddr consists of three parts:
+
+::
+
+ [upgrade:1][object_info:76][object_content:51]
+
+When the object info for different objects does not conflict within an
+OSD, we can obtain a useful property: each RADOS object and its
+head/clone can have a unique laddr prefix within an OSD.
+
+Based on this property, we can:
+
+#. Group different snapshots of a RADOS object under the same laddr
+ prefix, which could speed up the cloning process.
+#. If the RADOS object data/metadata of a head/clone share the same
+ prefix, removing a head/clone will also be possible via range
+ deletion.
+#. Track frequently accessed objects using laddr without keeping the
+ full object name.
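
As a sketch of the prefix property (assuming GCC/Clang ``__uint128_t``;
the 51-bit content width comes from the layout above, and the function
names are illustrative, not the actual SeaStore code):

```cpp
#include <cassert>

using u128 = __uint128_t;

// The object prefix is the upgrade bit plus the 76 object info bits,
// i.e. everything above the 51 object content bits.
constexpr u128 object_prefix(u128 laddr) {
  constexpr u128 content_mask = (u128(1) << 51) - 1;
  return laddr & ~content_mask;
}

// Two laddrs belong to the same RADOS object iff their prefixes match,
// so all extents of one object fall into one contiguous laddr range.
constexpr bool same_object(u128 a, u128 b) {
  return object_prefix(a) == object_prefix(b);
}
```

This is what makes range deletion of a head/clone possible: every
extent under one object sits between ``prefix`` and ``prefix + 2^51``.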
+
+Object Info
+-----------
+
+The definition of this property is:
+
+::
+
+ [shard:6][pool:12][reverse_hash:32][local_object_id:26]
+
+The shard, pool, and reverse hash come from the information of the
+RADOS object. The local object id is a random number to identify a
+unique object within seastore. Two different RADOS objects should
+never share the same reverse hash and local object id within a pool,
+which is handled by the allocation process.
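
A hypothetical packing of the object info fields (bit offsets follow
from placing the upgrade bit at 127 and the 51 content bits at the
bottom; ``make_object_info`` is an illustrative name only):

```cpp
#include <cassert>
#include <cstdint>

using u128 = __uint128_t;

// Pack [shard:6][pool:12][reverse_hash:32][local_object_id:26]
// into bits 126..51 of a 128-bit laddr value.
u128 make_object_info(uint8_t shard, uint16_t pool,
                      uint32_t reversed_hash, uint32_t local_object_id) {
  u128 v = 0;
  v |= u128(shard & 0x3f)                << 121;  // [shard:6]
  v |= u128(pool & 0xfff)                << 109;  // [pool:12]
  v |= u128(reversed_hash)               << 77;   // [reverse_hash:32]
  v |= u128(local_object_id & 0x3ffffff) << 51;   // [local_object_id:26]
  return v;
}
```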
+
+An object info with all bits set to 1 represents the invalid prefix;
+RADOS logical extents must not use it, to avoid sharing the same
+prefix with ``L_ADDR_NULL``.
+
+There are two special rules for global metadata logical extents:
+
+#. RootMetaBlock and CollectionNode: When a laddr is used to represent
+ these two types of extents, all object info bits should be zero,
+ and the RADOS object data should never use this prefix.
+#. SeastoreNodeExtent: Since this type of extent is used to store the
+   ghobject, its hint is derived from the first slot of the ghobject
+   it stores, so the local object id and local clone id (see below)
+   should be zero. The RADOS object data should not use this prefix
+   either. NOTE: It's possible that shard, pool, and hash are all
+   zero; we allow such extents to mix with RootMetaBlock and
+   CollectionNode.
+
+This layout allows:
+
+- 2\ :sup:`12`\ =4096 pools per cluster
+- 2\ :sup:`6`-1=63 shards per pool for EC
+- 2\ :sup:`16`\ =65536 pgs per pool and OSD (the most significant 16
+ bits in reversed hash are represented as pg id internally)
+- 2\ :sup:`42`\ =4T objects per pg (the remaining 16 bits of hash + 26
+ bits object id)
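
The bit budget and the capacities above can be double-checked
mechanically (a sketch, mirroring the widths stated in this document):

```cpp
#include <cstdint>

// Field widths must fill the 128-bit laddr exactly.
static_assert(6 + 12 + 32 + 26 == 76, "object info width");
static_assert(23 + 1 + 27 == 51, "object content width");
static_assert(1 + 76 + 51 == 128, "total laddr width");

// Capacities claimed above.
static_assert((uint64_t(1) << 12) == 4096, "pools per cluster");
static_assert((uint64_t(1) << 6) - 1 == 63, "shards per pool");
static_assert((uint64_t(1) << 16) == 65536, "pgs per pool and OSD");
static_assert((uint64_t(1) << 42) == 4398046511104, "~4T objects per pg");
```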
+
+Object Content
+--------------
+
+Global metadata logical extents use these bits as block address
+directly, while the RADOS objects further divide these bits into:
+
+::
+
+ [local_clone_id:23][is_metadata:1][blocks:27]
+
+Like the local object id, each clone/snapshot of a RADOS object has a
+unique local clone id under the same object laddr prefix. When
+creating a new snapshot, take a random local clone id as the new base
+address for the snap object. As mentioned above, the local clone id
+bits must not be all 1.
+
+The indirect mapping of clone objects can only store the local clone
+id of its intermediate key; see ``pladdr_t::build_laddr()``.
+
+The remaining 28 bits (the ``is_metadata`` flag plus a 27-bit block
+address) are used to represent the address of concrete data extents.
+Each address represents one 4KiB block on disk.
+
+When ``is_metadata`` is true, the remaining bits represent the address
+of an Omap*Node. Take a random value as the address when allocating a
+new omap extent.
+
+It's possible for OMapInnerNode to store only the block offset field
+in each extent rather than the full 128-bit laddr to locate its
+children, which would increase the fan-out of the OMap inner node.
+
+When ``is_metadata`` is false, the remaining bits represent the
+address of ObjectDataBlock.
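
For data extents, the block address converts to a byte offset via the
4KiB unit shift; a minimal sketch (constants assumed from the layout
above, names illustrative):

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned UNIT_SHIFT = 12;                     // 4KiB blocks
constexpr uint64_t BLOCK_MASK = (uint64_t(1) << 27) - 1; // [blocks:27]

// Convert a 27-bit block address to a byte offset and back.
constexpr uint64_t blocks_to_bytes(uint64_t blocks) {
  return (blocks & BLOCK_MASK) << UNIT_SHIFT;
}
constexpr uint64_t bytes_to_blocks(uint64_t bytes) {
  return (bytes >> UNIT_SHIFT) & BLOCK_MASK;
}
```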
+
+This layout allows:
+
+#. 2\ :sup:`23`\ =8M clones per object
+#. 2\ :sup:`27`\ =128M blocks per clone of an object (128M \* 4KiB =
+ 512GiB)
+
+Conflict ratio
+--------------
+
+The allocation of local object id, local clone id, and metadata blocks
+relies on random selection for now. We expect a first-attempt success
+ratio of ~90% so that address allocation won't cause performance
+issues; keeping the success ratio at 90% (i.e., at most ~10% of each
+id space occupied) means:
+
+- objects per pg < 400G
+- clones per object < 800K
+- metadata per object < 50GiB
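
These limits follow from keeping each id space at most ~10% occupied,
so that a uniformly random pick hits a free slot with ~90% probability;
a rough sketch of the arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// ~90% first-try success <=> at most ~10% of each id space occupied.
constexpr uint64_t max_objects_per_pg    = (uint64_t(1) << 42) / 10; // ~440G
constexpr uint64_t max_clones_per_object = (uint64_t(1) << 23) / 10; // ~838K
// 10% of the 2^27 metadata block addresses, 4KiB per block:
constexpr uint64_t max_metadata_bytes = ((uint64_t(1) << 27) / 10) * 4096;
```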
+
+Upgrade
+-------
+
+This bit is reserved for layout updates.
+
+If the layout of laddr changes in the future, this bit will be used to
+transition addresses from the old layout to the new layout.
+
+TODO: Implement fsck process to support layout upgrades.
+
+Summary
+-------
+
+- For RootMetaBlock and CollectionNode:
+
+ ::
+
+ [upgrade:1][all_zero:76][offset:51]
+
+- For SeastoreNodeExtent:
+
+ ::
+
+ [upgrade:1][shard:6][pool:12][reverse_hash:32][zero:26][offset:51]
+
+- For RADOS extents (OMapInnerNode, OMapLeafNode, and ObjectDataBlock):
+
+ ::
+
+ [upgrade:1][shard:6][pool:12][reverse_hash:32][local_object_id:26][local_clone_id:23][is_metadata:1][blocks:27]
+
+  The local object id is non-zero.
return ceph_le32(len);
}
-// logical addr, see LBAManager, TransactionManager
+using local_object_id_t = uint32_t;
+// LOCAL_OBJECT_ID_ZERO is reserved for SeastoreNodeExtent
+constexpr local_object_id_t LOCAL_OBJECT_ID_ZERO = 0;
+constexpr local_object_id_t LOCAL_OBJECT_ID_NULL =
+ std::numeric_limits<uint32_t>::max();
+using local_object_id_le_t = ceph_le32;
+
+using local_clone_id_t = uint32_t;
+constexpr local_clone_id_t LOCAL_CLONE_ID_NULL =
+ std::numeric_limits<uint32_t>::max();
+using local_clone_id_le_t = ceph_le32;
+
+using laddr_shard_t = int8_t;
+using laddr_pool_t = int64_t;
+// Note: this is the reversed version of the object hash
+using laddr_crush_hash_t = uint32_t;
+
+// refer to doc/dev/crimson/seastore_laddr.rst
class laddr_t {
public:
#if defined (__SIZEOF_INT128__) && !SEASTORE_LADDR_USE_BOOST_U128
static constexpr unsigned UNIT_SIZE = 1 << UNIT_SHIFT; // 4096
static constexpr unsigned UNIT_MASK = UNIT_SIZE - 1;
+ // This factory is only used in nbd driver and test cases.
+ // Cast the byte offset to valid rados object extents format.
static laddr_t from_byte_offset(loffset_t value) {
assert((value & UNIT_MASK) == 0);
- return laddr_t(value >> UNIT_SHIFT);
+  // convert the byte offset to a block offset.
+  value >>= UNIT_SHIFT;
+ laddr_t addr;
+ // set block offset directly.
+ addr.value = value & layout::BlockOffsetSpec::MASK;
+ // move the remaining bits to reversed hash field.
+ // 12(UNIT_SHIFT) + 27(block offset) + 32(reversed hash) = 71 > 64
+ // the rest bits won't overflow
+ addr.value |= Unsigned(value >> layout::BlockOffsetSpec::length)
+ << (layout::ObjectContentSpec::length + layout::LocalObjectIdSpec::length);
+ // don't use LOCAL_OBJECT_ID_ZERO, as it is for onode extents.
+ addr.set_local_object_id(1);
+ assert(addr.is_object_address());
+ return addr;
}
static constexpr laddr_t from_raw_uint(Unsigned v) {
return laddr_t(v);
}
+  // Return whether this address belongs to global metadata (RootMetaBlock
+  // or CollectionNode).
+  // Always ignore the upgrade bit.
+ bool is_global_address() const {
+ return (value & layout::ObjectInfoSpec::MASK) == 0;
+ }
+
+  // Return whether this address belongs to SeastoreNodeExtent.
+ // Always ignore the upgrade bit.
+ bool is_onode_extent_address() const {
+ return get_local_object_id() == LOCAL_OBJECT_ID_ZERO;
+ }
+
+  // Return whether this address belongs to a rados object.
+ // Always ignore the upgrade bit.
+ bool is_object_address() const {
+ return !is_global_address()
+ && !is_onode_extent_address()
+ && get_object_info() != layout::ObjectInfoSpec::MAX;
+ }
+
+ // Upgrade bit with object info bits
+ laddr_t get_object_prefix() const {
+ auto ret = *this;
+ ret.value &= layout::OBJECT_PREFIX_MASK;
+ return ret;
+ }
+
+ // Object prefix with local_clone_id
+ laddr_t get_clone_prefix() const {
+ auto ret = *this;
+ ret.value &= layout::CLONE_PREFIX_MASK;
+ return ret;
+ }
+
+ Unsigned get_object_info() const {
+ return layout::ObjectInfoSpec::get<Unsigned>(value);
+ }
+
+ laddr_shard_t get_shard() const {
+ return layout::ShardSpec::get<laddr_shard_t>(value);
+ }
+ void set_shard(laddr_shard_t shard) {
+ layout::ShardSpec::set(value, static_cast<Unsigned>(shard));
+ }
+  // Shard has a similar truncation problem to pool.
+ bool match_shard_bits(laddr_shard_t shard) const {
+    // The most significant bit of laddr_shard_t would be discarded if it were
+    // cast from boost::multiprecision::uint128_t directly, so we cast it to
+    // uint8_t.
+ auto unsigned_shard = static_cast<uint8_t>(shard);
+ return (unsigned_shard & layout::ShardSpec::MAX)
+ == layout::ShardSpec::get<uint8_t>(value);
+ }
+
+ laddr_pool_t get_pool() const {
+ return layout::PoolSpec::get<laddr_pool_t>(value);
+ }
+ void set_pool(laddr_pool_t pool) {
+ layout::PoolSpec::set(value, static_cast<Unsigned>(pool));
+ }
+  // The pool field uses 12 bits, so we can't tell whether the real
+  // pool id is -1 or 4095. If their bits match, the pool ids match.
+ bool match_pool_bits(laddr_pool_t pool) const {
+ auto unsigned_pool = static_cast<uint64_t>(pool);
+ return (unsigned_pool & layout::PoolSpec::MAX)
+ == layout::PoolSpec::get<uint64_t>(value);
+ }
+
+ laddr_crush_hash_t get_reversed_hash() const {
+ return layout::ReversedHashSpec::get<laddr_crush_hash_t>(value);
+ }
+ void set_reversed_hash(laddr_crush_hash_t hash) {
+ return layout::ReversedHashSpec::set(value, hash);
+ }
+
+ Unsigned get_object_content() const {
+ return layout::ObjectContentSpec::get<Unsigned>(value);
+ }
+ void set_object_content(Unsigned v) {
+ layout::ObjectContentSpec::set(value, v);
+ }
+ laddr_t with_object_content(Unsigned v) const {
+ auto ret = *this;
+ ret.set_object_content(v);
+ return ret;
+ }
+
+ local_object_id_t get_local_object_id() const {
+ return layout::LocalObjectIdSpec::get<local_object_id_t>(value);
+ }
+ void set_local_object_id(local_object_id_t id) {
+ layout::LocalObjectIdSpec::set(value, id);
+ }
+ laddr_t with_local_object_id(local_object_id_t id) const {
+ auto ret = *this;
+ ret.set_local_object_id(id);
+ return ret;
+ }
+
+ bool is_metadata() const {
+ return layout::MetadataFlagSpec::get<bool>(value);
+ }
+ void set_metadata(bool md) {
+ layout::MetadataFlagSpec::set(value, md);
+ }
+ laddr_t with_metadata() const {
+ auto ret = *this;
+ ret.set_metadata(true);
+ return ret;
+ }
+ laddr_t without_metadata() const {
+ auto ret = *this;
+ ret.set_metadata(false);
+ return ret;
+ }
+
+ local_clone_id_t get_local_clone_id() const {
+ return layout::LocalCloneIdSpec::get<local_clone_id_t>(value);
+ }
+ void set_local_clone_id(local_clone_id_t id) {
+ layout::LocalCloneIdSpec::set(value, id);
+ }
+ laddr_t with_local_clone_id(local_clone_id_t id) const {
+ auto ret = *this;
+ ret.set_local_clone_id(id);
+ return ret;
+ }
+
+ // The result is related to clone prefix
+ loffset_t get_offset_bytes() const {
+ return layout::BlockOffsetSpec::get<loffset_t>(value) << UNIT_SHIFT;
+ }
+ loffset_t get_offset_blocks() const {
+ return layout::BlockOffsetSpec::get<loffset_t>(value);
+ }
+ void set_offset_by_bytes(loffset_t offset) {
+ assert(p2align(uint64_t(offset), uint64_t(UNIT_SIZE)) == offset);
+ offset >>= UNIT_SHIFT;
+ layout::BlockOffsetSpec::set(value, offset);
+ }
+ void set_offset_by_blocks(loffset_t offset) {
+ layout::BlockOffsetSpec::set(value, offset);
+ }
+ laddr_t with_offset_by_bytes(loffset_t offset) const {
+ auto ret = *this;
+ ret.set_offset_by_bytes(offset);
+ return ret;
+ }
+ laddr_t with_offset_by_blocks(loffset_t offset) const {
+ auto ret = *this;
+ ret.set_offset_by_blocks(offset);
+ return ret;
+ }
+
/// laddr_t works like primitive integer type, encode/decode it manually
void encode(::ceph::buffer::list::contiguous_appender& p) const {
auto lo = get_low64();
// Prevent direct construction of laddr_t with an integer,
// always use laddr_t::from_raw_uint instead.
constexpr explicit laddr_t(Unsigned value) : value(value) {}
+
+ template <int LENGTH, int OFFSET>
+ struct FieldSpec {
+ static constexpr int length = LENGTH;
+ static constexpr int offset = OFFSET;
+ static constexpr Unsigned MAX = (Unsigned(1) << LENGTH) - 1;
+ static constexpr Unsigned MASK = MAX << OFFSET;
+
+ template <typename ReturnType>
+ static ReturnType get(const Unsigned &laddr_value) {
+ return static_cast<ReturnType>((laddr_value & MASK) >> OFFSET);
+ }
+ static void set(Unsigned &laddr_value, Unsigned field_value) {
+ laddr_value &= ~MASK;
+ laddr_value |= (field_value << OFFSET) & MASK;
+ }
+ };
+
+ // The upgrade bit position never changes
+ using UpgradeFlagSpec = FieldSpec<1, 127>;
+
+ struct layout_v1 {
+ // object info:
+ // [shard:6][pool:12][reverse_hash:32][local_object_id:26]
+ // object content:
+ // [local_clone_id:23][is_metadata:1][block_address:27]
+
+ using ObjectInfoSpec = FieldSpec<76, 51>;
+ using ObjectContentSpec = FieldSpec<51, 0>;
+
+ using ShardSpec = FieldSpec<6, 121>;
+ using PoolSpec = FieldSpec<12, 109>;
+ using ReversedHashSpec = FieldSpec<32, 77>;
+ using LocalObjectIdSpec = FieldSpec<26, 51>;
+
+ using LocalCloneIdSpec = FieldSpec<23, 28>;
+ using MetadataFlagSpec = FieldSpec<1, 27>;
+ using BlockOffsetSpec = FieldSpec<27, 0>;
+
+ static constexpr Unsigned OBJECT_PREFIX_MASK =
+ UpgradeFlagSpec::MASK | ObjectInfoSpec::MASK;
+ static constexpr Unsigned CLONE_PREFIX_MASK =
+ OBJECT_PREFIX_MASK | LocalCloneIdSpec::MASK;
+ };
+
+ // Always alias to the latest layout implementation.
+ // All accesses, except for fsck, to the laddr fields should use this alias.
+ using layout = layout_v1;
+
Unsigned value;
};
using laddr_offset_t = laddr_t::laddr_offset_t;
};
struct __attribute__((packed)) pladdr_le_t {
- ceph_le64 pladdr = ceph_le64(PL_ADDR_NULL);
+ ceph_le64 pladdr = ceph_le64(0);
addr_type_t addr_type = addr_type_t::MAX;
pladdr_le_t() = default;