From dc2cf00e890463e6702a15fda56ed7c699287fe9 Mon Sep 17 00:00:00 2001
From: Shai Fultheim <shai.fultheim@gmail.com>
Date: Sun, 24 May 2026 14:19:56 +0300
Subject: [PATCH] crimson/os/seastore: enforce capacity in
 RBMCleaner::try_reserve_projected_usage

RBMCleaner::try_reserve_projected_usage always returned true and just
incremented stats.projected_used_bytes. The EPM BackgroundProcess
relies on the return value to block IO when the device is full, so
this effectively disabled backpressure for the RANDOM_BLOCK_SSD
backend: concurrent transactions could each reserve unbounded amounts,
and the over-commit surfaced downstream as `unexpected enospc` asserts
in the data path (object_data_handler.cc and friends, where ENOSPC is
treated as crimson::ct_error::enospc::assert_failure because the
existing infrastructure assumes ENOSPC is impossible). The OSD aborted
under sustained random-write workloads that exceeded RBM capacity.

Compute the device's data capacity as total - journal, set aside 5% of
it as headroom (for metadata writes and for fragmentation slack the AVL
allocator cannot pack into), and reject any reservation for which
used + projected + requested + headroom would exceed that capacity.
The existing EPM blocking-IO path
(extent_placement_manager.cc:726) already queues the IO until
release_projected_usage wakes it, so no caller-side changes are needed.
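
For reference, the admission rule can be sketched as a standalone
function. This is an illustration only, not the SeaStore code: the
RbmUsage struct and would_admit() are hypothetical names, and the
fields stand in for get_total_bytes(), get_journal_bytes(),
stats.used_bytes and stats.projected_used_bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the counters the check reads; not a SeaStore type.
struct RbmUsage {
  std::uint64_t total_bytes;           // whole device
  std::uint64_t journal_bytes;         // reserved for the journal
  std::uint64_t used_bytes;            // committed usage
  std::uint64_t projected_used_bytes;  // already-reserved, not yet committed
};

// Returns true if a new reservation of `request` bytes would be admitted.
bool would_admit(const RbmUsage& u, std::uint64_t request) {
  assert(u.total_bytes > u.journal_bytes);
  const std::uint64_t data_capacity = u.total_bytes - u.journal_bytes;
  const std::uint64_t headroom = data_capacity / 20;  // 5%, integer division
  const std::uint64_t wanted =
      u.used_bytes + u.projected_used_bytes + request;
  // Reject when usage plus headroom would exceed the data capacity.
  return wanted + headroom <= data_capacity;
}
```

Assuming the 90 GB device and 3 GB journal of the test setup are
binary gigabytes, data capacity is 87 GiB and headroom is about
4.35 GiB, so reservations are admitted until used + projected reaches
roughly 82.65 GiB.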

This is the minimal fix to keep the OSD alive under sustained random
writes. It converts a crash into a stall: once the device fills and
the cleaner has nothing to free (RBMCleaner::clean_space is still a
TODO), new writes block indefinitely instead of crashing. Verified
with an 8-job, 1 MB random-write fio run (--size 63g, 90 GB RBM, 3 GB
journal): 68 GB of user data written, host WAF 1.696, the OSD survives,
and the watchdog kills fio after the slow-ops timeout. Without this
patch the same workload asserts in the data path.

The headroom is intentionally generous (5%) because there is no GC
yet; once RBMCleaner::clean_space() exists, the headroom can shrink.

Fixes: https://tracker.ceph.com/issues/75598

Signed-off-by: Shai Fultheim <shai.fultheim@gmail.com>
---
 src/crimson/os/seastore/async_cleaner.cc | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/src/crimson/os/seastore/async_cleaner.cc b/src/crimson/os/seastore/async_cleaner.cc
index f0e36f82da11..58d52b2ba05a 100644
--- a/src/crimson/os/seastore/async_cleaner.cc
+++ b/src/crimson/os/seastore/async_cleaner.cc
@@ -2031,6 +2031,27 @@ void RBMCleaner::commit_space_used(paddr_t addr, extent_len_t len)
 bool RBMCleaner::try_reserve_projected_usage(std::size_t projected_usage)
 {
   assert(background_callback->is_ready());
+
+  // Capacity check. Without this, concurrent transactions over-commit the
+  // RBM device: each reservation succeeds, but the cleaner has no
+  // clean_space() yet, so a write that physically can't be served reaches
+  // the allocator and surfaces as `unexpected enospc` asserts in the data
+  // path (object_data_handler.cc et al.). Return false so that the EPM
+  // BackgroundProcess blocks the IO until committed transactions free space.
+  //
+  // Headroom carves out room for metadata writes (LBA btree, backref) and
+  // for fragmentation slack the allocator can't pack into. 5% is a starting
+  // point; until RBMCleaner::clean_space() exists we cannot reclaim from
+  // fragmented free space, so headroom doubles as a fragmentation guard.
+  assert(get_total_bytes() > get_journal_bytes());
+  auto data_capacity = get_total_bytes() - get_journal_bytes();
+  auto headroom = data_capacity / 20;
+  auto committed_and_projected = stats.used_bytes
+                               + stats.projected_used_bytes
+                               + projected_usage;
+  if (committed_and_projected + headroom > data_capacity) {
+    return false;
+  }
   stats.projected_used_bytes += projected_usage;
   return true;
 }
-- 
2.47.3