osd: Optimised EC avoids ever reading more than k shards (if the plugin supports it).
Plugins which support partial reads should never need more than k shards
to read the data, even if some shards have failed. However, rebalancing
commonly requests k + m shards, as very frequently all shards are being
moved. If this occurs and all k + m shards are online, the read is
satisfied by reading ALL k + m shards rather than just k. This commit
fixes that issue.
The problem is that we do not want to change the API shared with the legacy
EC code, so we cannot update the plugin behaviour here. Instead, the EC code
itself reduces the number of shards it tells minimum_to_decode about.
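
A minimal sketch of that clamping, using hypothetical names (the real code
uses Ceph's own shard-set types, not std::set):

    #include <set>

    // Before asking the plugin's minimum_to_decode() what to read, restrict
    // the set of available shards to the first k. For an MDS code any k
    // shards suffice to decode, and since shards 0..k-1 are the data shards
    // and std::set iterates in ascending order, data shards are preferred.
    std::set<int> clamp_available_shards(const std::set<int> &available,
                                         unsigned k) {
      std::set<int> clamped;
      for (int shard : available) {
        if (clamped.size() >= k) {
          break;  // k shards are enough; drop the surplus reads
        }
        clamped.insert(shard);
      }
      return clamped;
    }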
In a comment we note that bitset_set performance could be improved using
_pdep_u64. This would require fiddly platform-specific code and would likely
show no performance improvement for most applications: the majority of calls
to this function pass a bitset with <=n set bits and never enter the if
statement. When there are >n bits set we are about to save one or more read
I/Os, so the cost of the for loop is insignificant compared with that saving.
I have left the comment in as a hint to future users of this function.
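
For reference, a sketch of the kind of trimming involved, using std::bitset
and a hypothetical name rather than the actual bitset_set code:

    #include <bitset>
    #include <cstddef>

    // Trim a shard bitset so that at most n of its bits remain set, keeping
    // the lowest-indexed shards. The common case (<= n set bits) returns
    // immediately; otherwise every bit cleared saves a read I/O, which
    // dwarfs the cost of the loop.
    template <std::size_t N>
    void trim_to_n_set_bits(std::bitset<N> &bits, std::size_t n) {
      if (bits.count() <= n) {
        return;  // common case: nothing to trim
      }
      std::size_t kept = 0;
      for (std::size_t i = 0; i < N; ++i) {
        if (!bits.test(i)) {
          continue;
        }
        if (kept < n) {
          ++kept;          // keep the first n set bits
        } else {
          bits.reset(i);   // clear the surplus shards
        }
      }
      // On x86 with BMI2 the lowest n set bits of a 64-bit word x can be
      // kept in one step, _pdep_u64((1ULL << n) - 1, x) for n < 64, but
      // that is the fiddly platform-specific code referred to above.
    }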
Further notes from a review comment are worth recording:
- If performance is limited by the drives, then fewer read I/Os are a clear advantage.
- If performance is limited by the network, then fewer remote read I/Os are a clear advantage.
- If performance is limited by the CPU, then the CPU cost of m unnecessary remote
read I/Os (messenger + bluestore) is almost certainly higher than the cost of doing
an extra encode operation to calculate the coding parities.
- If performance is limited by system memory bandwidth, the encode + crc generation
has less overhead than the read + bluestore crc check + messenger overheads.
Longer term this logic should probably be pushed into the plugins, in particular
to give LRC the opportunity to optimise for locality of the shards. The reason
for not doing this now is that it would be messy: the legacy EC code cannot
support this optimisation and LRC is not yet optimising for locality.