From: Shai Fultheim <shai.fultheim@gmail.com>
Date: Tue, 19 May 2026 22:53:21 +0000 (+0300)
Subject: crimson/os/seastore: adaptive cleaner gc_max from observed user-burst peak
X-Git-Tag: v21.0.1~97^2~1
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=995177acf0e1437eaa1d3b4a722cd3dc2555df25;p=ceph.git

crimson/os/seastore: adaptive cleaner gc_max from observed user-burst peak

The previous commit adapts `hard_limit` to track the cleaner's observed
open-segment peak, removing the hard-coded `.10` floor and cutting WAF
~43%. With hard_limit adaptive, the remaining WAF lever is `gc_max` —
the threshold that gates when the cleaner runs in non-emergency mode
and therefore the cluster's steady-state operating fill. Lower gc_max
= higher fill = more dead bytes per reclaim cycle = fewer live bytes
copied = lower GC component of WAF.

The hard-coded default of `0.15` (cleaner triggers at 85% segment
fill) is over-provisioned for the typical cluster. On the bench
workload the empirically optimal `gc_max` is about 0.08, so the
default 0.15 leaves ~7% of cluster space unused and pays ~1.5x of WAF
for the privilege.

This commit makes gc_max adaptive: each window it decays from its
initial static value toward an observation-derived floor:

  target_floor = hard_limit + (peak_projected_used / total)

The floor is the smallest gap the cluster needs to absorb its observed
worst-case in-flight user reservation. `peak_projected_used` is tracked
across the cluster's lifetime with a slow exponential decay applied
each adjust cycle.

Decay rate
==========

The decay multiplier is `0.995` per 30 s elapsed window. The decay is
applied lazily: each call to `maybe_adjust_thresholds()` raises 0.995
to the actual elapsed seconds / 30. This way the decay catches up
correctly even if the background process was idle and the hook went
uncalled for many cycles. A naive per-call multiplication would freeze
the decay during idle phases (the issue observed in v1 testing where
peak stayed at its high-water mark across a 45-minute idle window).
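
As a minimal sketch of the lazy decay described above (the helper and
constant names here are illustrative, not the ones used in the patch),
a single call covering a long idle gap gives the same result as many
30 s calls, whereas a naive per-call multiply would apply only one
0.995 step for the whole gap:

```cpp
#include <cassert>
#include <cmath>

// Illustrative constants mirroring the values quoted in the text.
constexpr double kDecayPerWindow = 0.995;
constexpr double kWindowSec = 30.0;

// Elapsed-time-based decay: scale the peak by 0.995^(elapsed / 30 s),
// so one late call covers every window that was missed while idle.
inline double decay_peak(double peak, double elapsed_sec) {
  assert(elapsed_sec >= 0.0);
  return peak * std::pow(kDecayPerWindow, elapsed_sec / kWindowSec);
}
```

For a 45-minute idle gap (90 windows), `decay_peak(peak, 2700.0)`
leaves about 64% of the peak; a per-call multiply invoked once after
the gap would leave 99.5%.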

Decay timeline (fraction of the original value remaining; this holds
for any call interval of maybe_adjust_thresholds(), since the decay is
elapsed-time-based):

  - half-life: log(0.5) / log(0.995) ≈ 138 windows ≈ 69 min (roughly 1 hour)
  - peak retention timeline:
       5 min  → 95 %
      30 min  → 74 %
       1 hour → 55 %
       4 hours →  9 %
      12 hours →  0.07 %
      24 hours → effectively 0

So a single observed peak influences gc_max strongly for ~1 hour,
noticeably for ~4 hours, and is essentially forgotten within a day.
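
The retention at any elapsed time follows directly from the 0.995
per-30 s multiplier. The sketch below is only a way to recompute such
figures; the function and constant names are hypothetical and do not
appear in the patch:

```cpp
#include <cassert>
#include <cmath>

// Illustrative constants taken from the text.
constexpr double kDecay = 0.995;        // multiplier per window
constexpr double kWindowMinutes = 0.5;  // 30 s window, in minutes

// Fraction of an observed peak still remaining after `minutes` of decay.
inline double peak_retention(double minutes) {
  assert(minutes >= 0.0);
  return std::pow(kDecay, minutes / kWindowMinutes);
}

// Minutes needed for a peak to decay to half of its original value.
inline double half_life_minutes() {
  return kWindowMinutes * std::log(0.5) / std::log(kDecay);
}
```

For example, `peak_retention(60.0)` evaluates to roughly 0.55 and
`half_life_minutes()` to roughly 69.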

This is sized to be much longer than transient bench phases (a peak
retains about 85% of its true value over a 16 min bench, so it never
ages out prematurely) yet much shorter than workload-shift timescales
(a workload that genuinely eases sees gc_max shrink within hours).

Re-discovery
============

The decay lets gc_max eventually re-discover lower floors when a
workload genuinely eases, while preserving observed peaks long enough
that transient bursts inside a steady workload don't roll out
prematurely.

gc_max is bounded below by the floor at all times, so the workload's
observed needs are always satisfied without static tuning. Each
window, gc_max moves halfway toward the floor (`gc_max = max(floor,
(gc_max + floor) / 2)`). This is binary-search-style convergence:
distance to floor halves per window. When the floor rises (workload
reveals a new peak), gc_max jumps up to meet it immediately. When the
floor falls (peaks have decayed below current gc_max), gc_max closes
half of the remaining gap to the lower value in each following window.

Bootstrap safety: gc_max retains the existing static initial value
(0.15), so a freshly mounted cleaner runs at the same operating point
as today's code until observations have accumulated. This avoids the
"cluster crashes before adaptive sees a workload" failure mode that
naive `gc_max = hard_limit + observed` produces.

Implementation
==============

The change adds a single double member to SegmentCleaner,
`peak_projected_used_decayed`, which is raised to
`max(current, projected_used_bytes)` on each
`try_reserve_projected_usage()` call. `maybe_adjust_thresholds()`
applies `std::pow(0.995, elapsed_sec / 30.0)` decay on each invocation
that passes the 30 s gate (at least 30 s apart in steady state, longer
if the cleaner was idle). The floor uses this value directly.

Bench measurements (qa/standalone/crimson randwrite, 1 MiB writes,
32 GiB per-OSD null_blk, 70% fill, 1280 GiB write target):

  Configuration                          | WAF     | Duration | Status
  ---------------------------------------|---------|----------|---------
  Static defaults (gc_max=.15, hard=.10) |   5.749 |   33 min | clean
  Manual tuned (gc_max=.08, hard=.02)    |   2.926 |   16 min | clean
  Adaptive hard_limit only               |   3.276 |   17 min | clean
  Adaptive hard_limit + gc_max (HEAD)    |   2.829 |   17 min | clean

Adaptive gc_max reduces WAF a further 14% vs hard_limit-only (3.276 ->
2.829) and slightly beats the hand-tuned manual point (2.926). The
per-OSD adaptation captures workload asymmetry that uniform static
defaults can't: on the bench's PG-imbalanced setup the lightly-loaded
osd.0 settled at gc_max=0.026 (much tighter than the manual 0.08)
while osd.1 took the full traffic and settled at gc_max=0.084. Both
extract maximum efficiency for their actual load instead of running
at worst-case-conservative values.

A separate decay-validation run (45-minute idle interlude between two
heavy phases) confirmed that the lazy decay catches up correctly even
when the background process was dormant during the idle phase.

No new workload-tuned constants are introduced. The literal numbers
in this commit are:
  - the 30 s window from the previous commit (time scale of the
    feedback loop)
  - the binary-search halving rate (control geometry, not workload-
    specific; could be 1/3 or 1/4 with similar convergence)
  - the 0.995 decay rate (per-window multiplier; gives the ~1-hour
    half-life and ~24-hour full-forget behaviour described above;
    recompile-only)

The existing `get_default()` value of `0.15` is left untouched as the
bootstrap initial — operators who disable adaptive control (future
config knob) revert to today's exact behaviour.

Signed-off-by: Shai Fultheim <shai.fultheim@gmail.com>
---

diff --git a/src/crimson/os/seastore/async_cleaner.cc b/src/crimson/os/seastore/async_cleaner.cc
index 87f63dcbcf3..e72b3d9d7d9 100644
--- a/src/crimson/os/seastore/async_cleaner.cc
+++ b/src/crimson/os/seastore/async_cleaner.cc
@@ -1,6 +1,8 @@
 // -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:nil -*-
 // vim: ts=8 sw=2 sts=2 expandtab
 
+#include <cmath>
+
 #include <fmt/chrono.h>
 #include <seastar/core/metrics.hh>
 
@@ -1135,14 +1137,19 @@ void SegmentCleaner::maybe_adjust_thresholds()
       peak_open_segments_window, segments.get_num_open());
 
   // Only recompute hard_limit every 30s.
-  using namespace std::chrono_literals;
   LOG_PREFIX(SegmentCleaner::maybe_adjust_thresholds);
   auto now = seastar::lowres_clock::now();
-  if (adaptive_last_time != seastar::lowres_clock::time_point{} &&
-      now - adaptive_last_time < 30s) {
-    return;
+  double elapsed_sec = 0.0;
+  if (adaptive_last_time != seastar::lowres_clock::time_point{}) {
+    elapsed_sec = std::chrono::duration<double>(
+        now - adaptive_last_time).count();
+    if (elapsed_sec < 30.0) {
+      return;
+    }
   }
   adaptive_last_time = now;
+  double old_hard_limit = config.available_ratio_hard_limit;
+  double old_gc_max = config.available_ratio_gc_max;
 
   // Architectural floor: named writers (journal + hot/cold gens + metadata).
   auto hot = crimson::common::get_conf<uint64_t>(
@@ -1156,34 +1163,54 @@ void SegmentCleaner::maybe_adjust_thresholds()
     return;
   }
 
-  // hard_limit = (max(peak, named) + 1) * seg / total. "+1" is the minimum
+  double segment_ratio =
+      static_cast<double>(seg_size) /
+      static_cast<double>(total_bytes);
+
+  // hard_limit = (max(peak, named) + 1) * segment_ratio. "+1" is the minimum
   // safety unit: allow one more open segment than ever observed.
   std::size_t observed_peak =
       std::max<std::size_t>(peak_open_segments_window, named_writers);
   double new_hard_limit =
-      static_cast<double>(observed_peak + 1) *
-      static_cast<double>(seg_size) /
-      static_cast<double>(total_bytes);
+      static_cast<double>(observed_peak + 1) * segment_ratio;
 
   double crash_floor =
-      static_cast<double>(named_writers) *
-      static_cast<double>(seg_size) /
-      static_cast<double>(total_bytes);
+      static_cast<double>(named_writers) * segment_ratio;
   new_hard_limit = std::max(new_hard_limit, crash_floor);
 
-  // Keep gc_max strictly greater than hard_limit.
+  // Apply lazy decay covering elapsed time (allows gc_max to gradually fall
+  // when workload eases) so peaks fade even when the background process was
+  // idle and this hook went uncalled for many cycles.
+  if (elapsed_sec > 0.0) {
+    peak_projected_used_decayed *= std::pow(0.995, elapsed_sec / 30.0);
+  }
+
+  // gc_max decays halfway each window toward (hard_limit + recent peak burst).
+  double burst_floor_ratio =
+      peak_projected_used_decayed /
+      static_cast<double>(total_bytes);
+  double target_gc_max = new_hard_limit + burst_floor_ratio;
+  double decayed_gc_max =
+      (config.available_ratio_gc_max + target_gc_max) / 2.0;
+  config.available_ratio_gc_max = std::max(decayed_gc_max, target_gc_max);
   if (config.available_ratio_gc_max <= new_hard_limit) {
-    config.available_ratio_gc_max = new_hard_limit + 0.001;
+    config.available_ratio_gc_max = new_hard_limit + segment_ratio;
   }
   config.available_ratio_hard_limit = new_hard_limit;
 
-  INFO("[ADAPTIVE_GC] peak_open={} named={} hard_limit={:.4f} "
-       "gc_max={:.4f} crash_floor={:.4f}",
-       peak_open_segments_window, named_writers,
-       config.available_ratio_hard_limit,
-       config.available_ratio_gc_max, crash_floor);
+  if (old_hard_limit != new_hard_limit || old_gc_max != config.available_ratio_gc_max) {
+    INFO("[ADAPTIVE_GC] update: hard_limit {:.4f} -> {:.4f}, gc_max {:.4f} -> {:.4f} "
+         "(peak_open={} named={} peak_proj_decayed={:.0f} crash_floor={:.4f})",
+         old_hard_limit, new_hard_limit,
+         old_gc_max, config.available_ratio_gc_max,
+         peak_open_segments_window, named_writers,
+         peak_projected_used_decayed, crash_floor);
+  } else {
+    DEBUG("[ADAPTIVE_GC] no-op: hard_limit {:.4f}, gc_max {:.4f}",
+          old_hard_limit, old_gc_max);
+  }
 
-  // Reset window: record current open count as the new baseline.
+  // Reset per-window open-segment peak.
   peak_open_segments_window = segments.get_num_open();
 }
 
@@ -1883,6 +1910,11 @@ bool SegmentCleaner::try_reserve_projected_usage(std::size_t projected_usage)
 {
   assert(background_callback->is_ready());
   stats.projected_used_bytes += projected_usage;
+  // Update decayed peak; the slow decay in maybe_adjust_thresholds() lets old
+  // peaks fade so gc_max can eventually re-discover lower floors.
+  peak_projected_used_decayed = std::max(
+      peak_projected_used_decayed,
+      static_cast<double>(stats.projected_used_bytes));
   if (should_block_io_on_clean()) {
     stats.projected_used_bytes -= projected_usage;
     return false;
diff --git a/src/crimson/os/seastore/async_cleaner.h b/src/crimson/os/seastore/async_cleaner.h
index e8134a10b84..8006215c320 100644
--- a/src/crimson/os/seastore/async_cleaner.h
+++ b/src/crimson/os/seastore/async_cleaner.h
@@ -1669,6 +1669,11 @@ private:
   std::size_t peak_open_segments_window = 0;
   seastar::lowres_clock::time_point adaptive_last_time;
 
+  // Peak projected_used with slow exponential decay per adjust cycle. Decay
+  // 0.5% per 30s window = half-life ~1 hour: long enough not to forget peaks
+  // mid-workload, short enough to re-discover lower floors over time.
+  double peak_projected_used_decayed = 0.0;
+
   SegmentManagerGroupRef sm_group;
   BackrefManager &backref_manager;