From: Zhang Song Date: Wed, 14 May 2025 08:26:26 +0000 (+0800) Subject: crimson/os/seastore: introduce static layout of laddr_t X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=271cddcdb1d24c36735918db07417f9a8dc4d5ea;p=ceph-ci.git crimson/os/seastore: introduce static layout of laddr_t Signed-off-by: Zhang Song --- diff --git a/doc/dev/crimson/seastore_laddr.rst b/doc/dev/crimson/seastore_laddr.rst new file mode 100644 index 00000000000..058344d4344 --- /dev/null +++ b/doc/dev/crimson/seastore_laddr.rst @@ -0,0 +1,215 @@ +================================= + The Logical Address in SeaStore +================================= + +The laddr (LBA btree) is the central part of the extent locating +process. The legacy implementation of laddr allocation, however, has +some defects, such as difficulty in supporting bit mapping, an +inefficient cloning process, and so on. This document proposes a new +design to address these issues. + +Defects of Legacy Laddr Hint +============================ + +One logical extent is normally derived from an onode, so to allocate a +laddr, we need to use the information of the corresponding onode to +construct a laddr hint and use it to search the LBA btree for a +suitable laddr. (Not all logical extents are related to an onode, but +those are only a small fraction of the total logical extents, so we +can ignore them for now). + +The legacy laddr hint is constructed as follows (note that each laddr +represents one 4KiB physical block on the disk): + +:: + + [shard:8][pool:8][crush:32][zero:16] + +Each pair of square brackets represents a property stored within the +laddr integer. Each property contains a name and the number of bits it +uses. + +When a laddr allocation conflicts, the strategy is to search backwards +linearly from the hint for free space. + +This laddr layout and conflict policy have several problems: + +#. 
The pool field is too short; pools with id above 127 are used for + temporary recovery data by convention, which causes normal object + data to be mixed with temporary recovery data, and when the pool id + is larger than 255, data from different pools are mixed together. #. The shard field is too long; we are unlikely to use more than 128 + shards for EC. #. The snap info is missing; when taking a snapshot of an onode, the + original and the snapshot share the same hint, which makes the + cloning process inefficient. #. A hint can represent only a 256MiB logical address space, so when + an onode has more than 16 snapshots (each onode reserves 16MiB of + space, without considering metadata extents under the same onode), + it will conflict with other onodes' laddr hints. #. The most significant bits in the crush field are the pg id, which + makes the distribution of laddr uniform at the pool level, but on + each OSD, the laddr distribution is sparse. The remaining bits make + conflicts more frequent as the pg id grows. + +Considering these issues, we can conclude that 64 bits are +insufficient if we want to maintain semantic information within laddr +(i.e., objects under the same pg share the same laddr prefix). + +128-bit static layout laddr design +================================== + +The laddr uses a 128-bit integer to represent the address value +internally. Besides the basic properties inherited from integers, +such as strong ordering and arithmetic operations, certain types of +laddr also include properties derived from user data (RADOS objects). +The static layout design ensures these properties are deterministic +and predictable, allowing further optimizations based on laddr.
+ +Overview +-------- + +The layout of laddr consists of three parts: + +:: + + [upgrade:1][object_info:76][object_content:51] + +When the object info of different objects does not conflict within an +OSD, we obtain a useful property: each RADOS object and its head/clone +can have a unique laddr prefix within an OSD. + +Based on this property, we can: + +#. Group different snapshots of a RADOS object under the same laddr + prefix, which could speed up the cloning process. +#. If the RADOS object data/metadata of a head/clone share the same + prefix, removing a head/clone also becomes possible via range + deletion. +#. Track frequently accessed objects using laddr without keeping the + full object name. + +Object Info +----------- + +The definition of this property is: + +:: + + [shard:6][pool:12][reverse_hash:32][local_object_id:26] + +The shard, pool, and reverse hash come from the information of the +RADOS object. The local object id is a random number identifying a +unique object within SeaStore. Two different RADOS objects should +never share the same reverse hash and local object id within a pool, +which is enforced by the allocation process. + +An object info with all bits set to 1 is the invalid prefix, and RADOS +logical extents must not use it, to avoid sharing the same prefix with +``L_ADDR_NULL``. + +There are two special rules for global metadata logical extents: + +#. RootMetaBlock and CollectionNode: When a laddr is used to represent + these two types of extents, all object info bits should be zero, + and RADOS object data should never use this prefix. +#. SeastoreNodeExtent: Since this type of extent is used to store the + ghobject, its hint is derived from the first slot of the ghobject + it stores, so the local object id and local clone id (see below) + should be zero. RADOS object data should not use this prefix + either. 
NOTE: It's possible that shard, pool, and hash are zero; we + allow them to mix with RootMetaBlock and CollectionNode. + +This layout allows: + +- 2\ :sup:`12`\ =4096 pools per cluster +- 2\ :sup:`6`-1=63 shards per pool for EC +- 2\ :sup:`16`\ =65536 pgs per pool and OSD (the most significant 16 + bits in the reversed hash are represented as the pg id internally) +- 2\ :sup:`42`\ =4T objects per pg (the remaining 16 bits of hash + 26 + bits of object id) + +Object Content +-------------- + +Global metadata logical extents use these bits as a block address +directly, while RADOS objects further divide these bits into: + +:: + + [local_clone_id:23][is_metadata:1][blocks:27] + +Like the local object id, each clone/snapshot of a RADOS object has a +unique local clone id under the same object laddr prefix. When +creating a new snapshot, take a random local clone id as the new base +address for the snap object. As mentioned above, the local clone id +bits must not be all 1. + +The indirect mapping of clone objects can only store the local clone +id of its intermediate key; see ``pladdr_t::build_laddr()``. + +The remaining 27 bits are used to represent the address of concrete +data extents. Each address represents one 4KiB block on disk. + +When ``is_metadata`` is true, the block address bits represent the +address of an Omap*Node. Take a random value as the address when +allocating a new omap extent. + +It's possible for OMapInnerNode to store only the block offset field +in each extent rather than the full 128-bit laddr to locate its +children, which would increase the fan-out of the OMap inner node. + +When ``is_metadata`` is false, the block address bits represent the +address of an ObjectDataBlock. + +This layout allows: + +#. 2\ :sup:`23`\ =8M clones per object +#. 2\ :sup:`27`\ =128M blocks per clone of an object (128M \* 4KiB = + 512GiB) + +Conflict ratio +-------------- + +The allocation of local object id, local clone id, and metadata blocks +requires random selection for now. 
We expect the success ratio to be +~90% so that address allocation won't cause performance issues; a 90% +success ratio means: + +- objects per pg < 400G +- clones per object < 800K +- metadata of object < 50GiB + +Upgrade +------- + +This bit is reserved for layout updates. + +If the layout of laddr changes in the future, this bit will be used to +transition addresses from the old layout to the new layout. + +TODO: Implement fsck process to support layout upgrades. + +Summary +------- + +- For RootMetaBlock and CollectionNode: + + :: + + [upgrade:1][all_zero:76][offset:51] + +- For SeastoreNodeExtent: + + :: + + [upgrade:1][shard:6][pool:12][reverse_hash:32][zero:26][offset:51] + +- For RADOS extents (OMapInnerNode, OmapLeafNode, and ObjectDataBlock): + + :: + + [upgrade:1][shard:6][pool:12][reverse_hash:32][local_object_id:26][local_clone_id:23][is_metadata:1][blocks:27] + + local object id is non-zero. diff --git a/src/crimson/os/seastore/seastore_types.cc b/src/crimson/os/seastore/seastore_types.cc index eed901e2e1f..173c2e46938 100644 --- a/src/crimson/os/seastore/seastore_types.cc +++ b/src/crimson/os/seastore/seastore_types.cc @@ -5,6 +5,7 @@ #include +#include "common/hobject.h" #include "crimson/common/log.h" namespace { @@ -17,6 +18,10 @@ seastar::logger& journal_logger() { namespace crimson::os::seastore { +static_assert(sizeof(laddr_shard_t) == sizeof(ghobject_t().shard_id.id)); +static_assert(sizeof(laddr_pool_t) == sizeof(ghobject_t().hobj.pool)); +static_assert(sizeof(laddr_crush_hash_t) == sizeof(ghobject_t().hobj.get_bitwise_key_u32())); + bool is_aligned(uint64_t offset, uint64_t alignment) { return (offset % alignment) == 0; @@ -127,7 +132,20 @@ std::ostream &operator<<(std::ostream &out, const laddr_t &laddr) { if (laddr == L_ADDR_NULL) { return out << "L_ADDR_NULL"; } else { - return laddr_formatter_t::format(out, laddr.value); + laddr_formatter_t::format(out, laddr.value); + if (!laddr.is_global_address()) { + out << std::hex << '(' << 
static_cast<unsigned>(laddr.get_shard()) + << ',' << laddr.get_pool() + << ',' << laddr.get_reversed_hash(); + if (laddr.is_object_address()) { + out << ',' << laddr.get_local_object_id() + << ',' << laddr.get_local_clone_id() + << ',' << laddr.is_metadata() + << ',' << laddr.get_offset_bytes(); + } + out << ')' << std::dec; + } + return out; } } diff --git a/src/crimson/os/seastore/seastore_types.h b/src/crimson/os/seastore/seastore_types.h index 5776ea6150f..7da19386a4e 100644 --- a/src/crimson/os/seastore/seastore_types.h +++ b/src/crimson/os/seastore/seastore_types.h @@ -1096,7 +1096,24 @@ inline extent_len_le_t init_extent_len_le(extent_len_t len) { return ceph_le32(len); } -// logical addr, see LBAManager, TransactionManager +using local_object_id_t = uint32_t; +// LOCAL_OBJECT_ID_ZERO is reserved for SeastoreNodeExtent +constexpr local_object_id_t LOCAL_OBJECT_ID_ZERO = 0; +constexpr local_object_id_t LOCAL_OBJECT_ID_NULL = + std::numeric_limits<local_object_id_t>::max(); +using local_object_id_le_t = ceph_le32; + +using local_clone_id_t = uint32_t; +constexpr local_clone_id_t LOCAL_CLONE_ID_NULL = + std::numeric_limits<local_clone_id_t>::max(); +using local_clone_id_le_t = ceph_le32; + +using laddr_shard_t = int8_t; +using laddr_pool_t = int64_t; +// Note: this is the reversed version of the object hash +using laddr_crush_hash_t = uint32_t; + +// refer to doc/dev/crimson/seastore_laddr.rst class laddr_t { public: #if defined (__SIZEOF_INT128__) && !SEASTORE_LADDR_USE_BOOST_U128 @@ -1125,15 +1142,185 @@ public: static constexpr unsigned UNIT_SIZE = 1 << UNIT_SHIFT; // 4096 static constexpr unsigned UNIT_MASK = UNIT_SIZE - 1; + // This factory is only used in the nbd driver and test cases. + // Convert the byte offset into the valid rados object extent format. static laddr_t from_byte_offset(loffset_t value) { assert((value & UNIT_MASK) == 0); - return laddr_t(value >> UNIT_SHIFT); + // convert the byte offset into a block offset. + value >>= UNIT_SHIFT; + laddr_t addr; + // set the block offset directly.
+ addr.value = value & layout::BlockOffsetSpec::MASK; + // move the remaining bits into the reversed hash field. + // 12(UNIT_SHIFT) + 27(block offset) + 32(reversed hash) = 71 > 64, + // so the remaining bits won't overflow. + addr.value |= Unsigned(value >> layout::BlockOffsetSpec::length) + << (layout::ObjectContentSpec::length + layout::LocalObjectIdSpec::length); + // don't use LOCAL_OBJECT_ID_ZERO, as it is reserved for onode extents. + addr.set_local_object_id(1); + assert(addr.is_object_address()); + return addr; } + static constexpr laddr_t from_raw_uint(Unsigned v) { return laddr_t(v); } + + // Return whether this address belongs to global metadata (RootMetaBlock + // or CollectionNode). + // Always ignores the upgrade bit. + bool is_global_address() const { + return (value & layout::ObjectInfoSpec::MASK) == 0; + } + + // Return whether this address belongs to a SeastoreNodeExtent. + // Always ignores the upgrade bit. + bool is_onode_extent_address() const { + return get_local_object_id() == LOCAL_OBJECT_ID_ZERO; + } + + // Return whether this address belongs to a rados object. + // Always ignores the upgrade bit. + bool is_object_address() const { + return !is_global_address() + && !is_onode_extent_address() + && get_object_info() != layout::ObjectInfoSpec::MAX; + } + + // Upgrade bit with object info bits + laddr_t get_object_prefix() const { + auto ret = *this; + ret.value &= layout::OBJECT_PREFIX_MASK; + return ret; + } + + // Object prefix with local_clone_id + laddr_t get_clone_prefix() const { + auto ret = *this; + ret.value &= layout::CLONE_PREFIX_MASK; + return ret; + } + + Unsigned get_object_info() const { + return layout::ObjectInfoSpec::get(value); + } + + laddr_shard_t get_shard() const { + return layout::ShardSpec::get(value); + } + void set_shard(laddr_shard_t shard) { + layout::ShardSpec::set(value, static_cast<uint8_t>(shard)); + } + // Shard has the same truncation problem as pool; see match_pool_bits. 
+ bool match_shard_bits(laddr_shard_t shard) const { + // The most significant bit of laddr_shard_t would be discarded if it + // were cast from boost::multiprecision::uint128_t directly, so we + // cast the shard to uint8_t instead. + auto unsigned_shard = static_cast<uint8_t>(shard); + return (unsigned_shard & layout::ShardSpec::MAX) + == layout::ShardSpec::get(value); + } + + laddr_pool_t get_pool() const { + return layout::PoolSpec::get(value); + } + void set_pool(laddr_pool_t pool) { + layout::PoolSpec::set(value, static_cast<uint64_t>(pool)); + } + // The pool field uses 12 bits, so we can't tell whether the real + // pool id is -1 or 4095. If their bits match, the pool ids match. + bool match_pool_bits(laddr_pool_t pool) const { + auto unsigned_pool = static_cast<uint64_t>(pool); + return (unsigned_pool & layout::PoolSpec::MAX) + == layout::PoolSpec::get(value); + } + + laddr_crush_hash_t get_reversed_hash() const { + return layout::ReversedHashSpec::get(value); + } + void set_reversed_hash(laddr_crush_hash_t hash) { + return layout::ReversedHashSpec::set(value, hash); + } + + Unsigned get_object_content() const { + return layout::ObjectContentSpec::get(value); + } + void set_object_content(Unsigned v) { + layout::ObjectContentSpec::set(value, v); + } + laddr_t with_object_content(Unsigned v) const { + auto ret = *this; + ret.set_object_content(v); + return ret; + } + + local_object_id_t get_local_object_id() const { + return layout::LocalObjectIdSpec::get(value); + } + void set_local_object_id(local_object_id_t id) { + layout::LocalObjectIdSpec::set(value, id); + } + laddr_t with_local_object_id(local_object_id_t id) const { + auto ret = *this; + ret.set_local_object_id(id); + return ret; + } + + bool is_metadata() const { + return layout::MetadataFlagSpec::get(value); + } + void set_metadata(bool md) { + layout::MetadataFlagSpec::set(value, md); + } + laddr_t with_metadata() const { + auto ret = *this; + ret.set_metadata(true); + return ret; + } + laddr_t without_metadata() const { + auto ret = 
*this; + ret.set_metadata(false); + return ret; + } + + local_clone_id_t get_local_clone_id() const { + return layout::LocalCloneIdSpec::get(value); + } + void set_local_clone_id(local_clone_id_t id) { + layout::LocalCloneIdSpec::set(value, id); + } + laddr_t with_local_clone_id(local_clone_id_t id) const { + auto ret = *this; + ret.set_local_clone_id(id); + return ret; + } + + // The result is relative to the clone prefix. + loffset_t get_offset_bytes() const { + return layout::BlockOffsetSpec::get(value) << UNIT_SHIFT; + } + loffset_t get_offset_blocks() const { + return layout::BlockOffsetSpec::get(value); + } + void set_offset_by_bytes(loffset_t offset) { + assert(p2align(uint64_t(offset), uint64_t(UNIT_SIZE)) == offset); + offset >>= UNIT_SHIFT; + layout::BlockOffsetSpec::set(value, offset); + } + void set_offset_by_blocks(loffset_t offset) { + layout::BlockOffsetSpec::set(value, offset); + } + laddr_t with_offset_by_bytes(loffset_t offset) const { + auto ret = *this; + ret.set_offset_by_bytes(offset); + return ret; + } + laddr_t with_offset_by_blocks(loffset_t offset) const { + auto ret = *this; + ret.set_offset_by_blocks(offset); + return ret; + } + /// laddr_t works like a primitive integer type; encode/decode it manually void encode(::ceph::buffer::list::contiguous_appender& p) const { auto lo = get_low64(); @@ -1369,6 +1556,55 @@ private: // Prevent direct construction of laddr_t with an integer, // always use laddr_t::from_raw_uint instead. 
constexpr explicit laddr_t(Unsigned value) : value(value) {} + + template <int LENGTH, int OFFSET> + struct FieldSpec { + static constexpr int length = LENGTH; + static constexpr int offset = OFFSET; + static constexpr Unsigned MAX = (Unsigned(1) << LENGTH) - 1; + static constexpr Unsigned MASK = MAX << OFFSET; + + template <typename ReturnType = Unsigned> + static ReturnType get(const Unsigned &laddr_value) { + return static_cast<ReturnType>((laddr_value & MASK) >> OFFSET); + } + static void set(Unsigned &laddr_value, Unsigned field_value) { + laddr_value &= ~MASK; + laddr_value |= (field_value << OFFSET) & MASK; + } + }; + + // The upgrade bit position never changes. + using UpgradeFlagSpec = FieldSpec<1, 127>; + + struct layout_v1 { + // object info: + // [shard:6][pool:12][reverse_hash:32][local_object_id:26] + // object content: + // [local_clone_id:23][is_metadata:1][block_address:27] + + using ObjectInfoSpec = FieldSpec<76, 51>; + using ObjectContentSpec = FieldSpec<51, 0>; + + using ShardSpec = FieldSpec<6, 121>; + using PoolSpec = FieldSpec<12, 109>; + using ReversedHashSpec = FieldSpec<32, 77>; + using LocalObjectIdSpec = FieldSpec<26, 51>; + + using LocalCloneIdSpec = FieldSpec<23, 28>; + using MetadataFlagSpec = FieldSpec<1, 27>; + using BlockOffsetSpec = FieldSpec<27, 0>; + + static constexpr Unsigned OBJECT_PREFIX_MASK = + UpgradeFlagSpec::MASK | ObjectInfoSpec::MASK; + static constexpr Unsigned CLONE_PREFIX_MASK = + OBJECT_PREFIX_MASK | LocalCloneIdSpec::MASK; + }; + + // Always alias to the latest layout implementation. + // All accesses to the laddr fields, except for fsck, should use this + // alias. + using layout = layout_v1; + Unsigned value; }; using laddr_offset_t = laddr_t::laddr_offset_t; @@ -1452,7 +1688,7 @@ enum class addr_type_t : uint8_t { }; struct __attribute__((packed)) pladdr_le_t { - ceph_le64 pladdr = ceph_le64(PL_ADDR_NULL); + ceph_le64 pladdr = ceph_le64(0); addr_type_t addr_type = addr_type_t::MAX; pladdr_le_t() = default;