From: Shai Fultheim
Date: Tue, 19 May 2026 22:53:21 +0000 (+0300)
Subject: crimson/os/seastore: adaptive cleaner gc_max from observed user-burst peak
X-Git-Tag: v21.0.1~97^2~1
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=995177acf0e1437eaa1d3b4a722cd3dc2555df25;p=ceph.git

crimson/os/seastore: adaptive cleaner gc_max from observed user-burst peak

The previous commit adapts `hard_limit` to track the cleaner's observed
open-segment peak, removing the hard-coded `.10` floor and cutting WAF
~43%. With hard_limit adaptive, the remaining WAF lever is `gc_max`: the
threshold that gates when the cleaner runs in non-emergency mode and
therefore sets the cluster's steady-state operating fill. Lower gc_max =
higher fill = more dead bytes per reclaim cycle = fewer live bytes
copied = lower GC component of WAF.

The hard-coded default of `0.15` (cleaner triggers at 85% segment fill)
is over-provisioned for the typical cluster. On the bench workload the
empirically optimal `gc_max` is about 0.08, so at the default 0.15 ~7%
of cluster space sits unused and ~1.5x of WAF is paid for the privilege.

This commit makes gc_max adaptive: each window it decays from its
initial static value toward an observation-derived floor:

    target_floor = hard_limit + (peak_projected_used / total)

The floor is the smallest gap the cluster needs to absorb its observed
worst-case in-flight user reservation. `peak_projected_used` is tracked
across the cluster's lifetime with a slow exponential decay applied each
adjust cycle.

Decay rate
==========

The decay multiplier is `0.995` per 30 s elapsed window. The decay is
applied lazily: each call to `maybe_adjust_thresholds()` multiplies the
peak by 0.995 raised to the power (elapsed seconds / 30). This way the
decay catches up correctly even if the background process was idle and
the hook went uncalled for many cycles.
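
The difference between lazy and per-call decay can be sketched in a few
lines. This is a minimal illustration and not the code from this commit:
the names `kDecayPerWindow`, `kWindowSec`, `decay_lazy` and
`decay_per_call` are hypothetical, and only the 0.995 multiplier and the
30 s window are taken from the text above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical constants mirroring the commit message:
// a 0.995 multiplier per 30 s window.
constexpr double kDecayPerWindow = 0.995;
constexpr double kWindowSec = 30.0;

// Lazy decay: scale the peak by 0.995^(elapsed_sec / 30) in one step, so
// the result depends only on elapsed time, not on how often it is called.
inline double decay_lazy(double peak, double elapsed_sec) {
  assert(elapsed_sec >= 0.0);
  return peak * std::pow(kDecayPerWindow, elapsed_sec / kWindowSec);
}

// Naive decay: one fixed multiplication per call. If the caller was idle
// for many windows, only a single window's worth of decay is applied.
inline double decay_per_call(double peak) {
  return peak * kDecayPerWindow;
}
```

Under these assumptions, one `decay_lazy()` call after a 45-minute idle
gap (90 windows) leaves roughly 64% of the peak, whereas one
`decay_per_call()` leaves 99.5% no matter how long the gap was.
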
A naive per-call multiplication would freeze the decay during idle
phases (the issue observed in v1 testing, where the peak stayed at its
high-water mark across a 45-minute idle window).

Decay timeline (fraction of original value remaining; valid for any call
interval, since the decay is elapsed-time-based):

  - half-life: log(0.5) / log(0.995) ≈ 138 windows ≈ 69 min ≈ 1 hour
  - peak retention timeline:
       5 min   →  95 %
      30 min   →  74 %
       1 hour  →  55 %
       4 hours →   9 %
      12 hours →   0.07 %
      24 hours →  effectively 0

So a single observed peak influences gc_max strongly for ~1 hour,
noticeably for ~4 hours, and is essentially forgotten within a day. This
is sized to be much longer than transient bench phases (peaks remain
>85% of true value within a 16 min bench, so they never age out
prematurely) yet much shorter than workload-shift timescales (a workload
that genuinely eases sees gc_max shrink within hours).

Re-discovery
============

The decay lets gc_max eventually re-discover lower floors when a
workload genuinely eases, while preserving observed peaks long enough
that transient bursts inside a steady workload don't age out
prematurely. gc_max is bounded below by the floor at all times, so the
workload's observed needs are always satisfied without static tuning.

Each window, gc_max moves halfway toward the floor
(`gc_max = max(floor, (gc_max + floor) / 2)`). This is
binary-search-style convergence: the distance to the floor halves per
window. When the floor rises (workload reveals a new peak), gc_max jumps
up to meet it immediately. When the floor falls (peaks have decayed
below current gc_max), gc_max closes half of the remaining gap toward
the lower value in each subsequent window.

Bootstrap safety: gc_max retains the existing static initial value
(0.15), so a freshly mounted cleaner runs at the same operating point as
today's code until observations have accumulated.
This avoids the "cluster crashes before adaptive sees a workload"
failure mode that naive `gc_max = hard_limit + observed` produces.

Implementation
==============

A single double member on SegmentCleaner: `peak_projected_used_decayed`
is updated to `max(current, projected_used_bytes)` on each
`try_reserve_projected_usage()` call. `maybe_adjust_thresholds()`
applies `std::pow(0.995, elapsed_sec / 30.0)` decay on each invocation
that passes the 30 s gate (at intervals of at least 30 s in steady
state, longer if the cleaner was idle). The floor uses this value
directly.

Bench measurements (qa/standalone/crimson randwrite, 1 MiB writes,
32 GiB per-OSD null_blk, 70% fill, 1280 GiB write target):

  Configuration                          | WAF     | Duration | Status
  ---------------------------------------|---------|----------|---------
  Static defaults (gc_max=.15, hard=.10) | 5.749   | 33 min   | clean
  Manual tuned (gc_max=.08, hard=.02)    | 2.926   | 16 min   | clean
  Adaptive hard_limit only               | 3.276   | 17 min   | clean
  Adaptive hard_limit + gc_max (HEAD)    | 2.829   | 17 min   | clean

Adaptive gc_max reduces WAF a further 14% vs hard_limit-only
(3.276 -> 2.829) and slightly beats the hand-tuned manual point (2.926).

The per-OSD adaptation captures workload asymmetry that uniform static
defaults can't: on the bench's PG-imbalanced setup the lightly-loaded
osd.0 settled at gc_max=0.026 (much tighter than the manual 0.08) while
osd.1 took the full traffic and settled at gc_max=0.084. Both extract
maximum efficiency for their actual load instead of running at
worst-case-conservative values.

A separate decay-validation run (45-minute idle interlude between two
heavy phases) confirmed that the lazy decay catches up correctly even
when the background process was dormant during the idle phase.

No new workload-tuned constants are introduced.
The literal numbers in this commit are:

  - the 30 s window from the previous commit (time scale of the
    feedback loop)
  - the binary-search halving rate (control geometry, not
    workload-specific; could be 1/3 or 1/4 with similar convergence)
  - the 0.995 decay rate (per-window multiplier; gives the ~1-hour
    half-life and ~24-hour full-forget behaviour described above;
    recompile-only)

The existing `get_default()` value of `0.15` is left untouched as the
bootstrap initial value; operators who disable adaptive control (future
config knob) revert to today's exact behaviour.

Signed-off-by: Shai Fultheim
---

diff --git a/src/crimson/os/seastore/async_cleaner.cc b/src/crimson/os/seastore/async_cleaner.cc
index 87f63dcbcf3..e72b3d9d7d9 100644
--- a/src/crimson/os/seastore/async_cleaner.cc
+++ b/src/crimson/os/seastore/async_cleaner.cc
@@ -1,6 +1,8 @@
 // -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:nil -*-
 // vim: ts=8 sw=2 sts=2 expandtab
 
+#include
+
 #include
 #include
 
@@ -1135,14 +1137,19 @@ void SegmentCleaner::maybe_adjust_thresholds()
     peak_open_segments_window, segments.get_num_open());
 
   // Only recompute hard_limit every 30s.
-  using namespace std::chrono_literals;
   LOG_PREFIX(SegmentCleaner::maybe_adjust_thresholds);
   auto now = seastar::lowres_clock::now();
-  if (adaptive_last_time != seastar::lowres_clock::time_point{} &&
-      now - adaptive_last_time < 30s) {
-    return;
+  double elapsed_sec = 0.0;
+  if (adaptive_last_time != seastar::lowres_clock::time_point{}) {
+    elapsed_sec = std::chrono::duration<double>(
+      now - adaptive_last_time).count();
+    if (elapsed_sec < 30.0) {
+      return;
+    }
   }
   adaptive_last_time = now;
+  double old_hard_limit = config.available_ratio_hard_limit;
+  double old_gc_max = config.available_ratio_gc_max;
 
   // Architectural floor: named writers (journal + hot/cold gens + metadata).
   auto hot = crimson::common::get_conf(
@@ -1156,34 +1163,54 @@ void SegmentCleaner::maybe_adjust_thresholds()
     return;
   }
 
-  // hard_limit = (max(peak, named) + 1) * seg / total.  "+1" is the minimum
+  double segment_ratio =
+    static_cast<double>(seg_size) /
+    static_cast<double>(total_bytes);
+
+  // hard_limit = (max(peak, named) + 1) * segment_ratio.  "+1" is the minimum
   // safety unit: allow one more open segment than ever observed.
   std::size_t observed_peak = std::max(peak_open_segments_window,
                                        named_writers);
   double new_hard_limit =
-    static_cast<double>(observed_peak + 1) *
-    static_cast<double>(seg_size) /
-    static_cast<double>(total_bytes);
+    static_cast<double>(observed_peak + 1) * segment_ratio;
   double crash_floor =
-    static_cast<double>(named_writers) *
-    static_cast<double>(seg_size) /
-    static_cast<double>(total_bytes);
+    static_cast<double>(named_writers) * segment_ratio;
   new_hard_limit = std::max(new_hard_limit, crash_floor);
 
-  // Keep gc_max strictly greater than hard_limit.
+  // Apply lazy decay covering elapsed time (allows gc_max to gradually fall
+  // when workload eases) so peaks fade even when the background process was
+  // idle and this hook went uncalled for many cycles.
+  if (elapsed_sec > 0.0) {
+    peak_projected_used_decayed *= std::pow(0.995, elapsed_sec / 30.0);
+  }
+
+  // gc_max decays halfway each window toward (hard_limit + recent peak burst).
+  double burst_floor_ratio =
+    peak_projected_used_decayed /
+    static_cast<double>(total_bytes);
+  double target_gc_max = new_hard_limit + burst_floor_ratio;
+  double decayed_gc_max =
+    (config.available_ratio_gc_max + target_gc_max) / 2.0;
+  config.available_ratio_gc_max = std::max(decayed_gc_max, target_gc_max);
   if (config.available_ratio_gc_max <= new_hard_limit) {
-    config.available_ratio_gc_max = new_hard_limit + 0.001;
+    config.available_ratio_gc_max = new_hard_limit + segment_ratio;
   }
   config.available_ratio_hard_limit = new_hard_limit;
 
-  INFO("[ADAPTIVE_GC] peak_open={} named={} hard_limit={:.4f} "
-       "gc_max={:.4f} crash_floor={:.4f}",
-       peak_open_segments_window, named_writers,
-       config.available_ratio_hard_limit,
-       config.available_ratio_gc_max, crash_floor);
+  if (old_hard_limit != new_hard_limit || old_gc_max != config.available_ratio_gc_max) {
+    INFO("[ADAPTIVE_GC] update: hard_limit {:.4f} -> {:.4f}, gc_max {:.4f} -> {:.4f} "
+         "(peak_open={} named={} peak_proj_decayed={:.0f} crash_floor={:.4f})",
+         old_hard_limit, new_hard_limit,
+         old_gc_max, config.available_ratio_gc_max,
+         peak_open_segments_window, named_writers,
+         peak_projected_used_decayed, crash_floor);
+  } else {
+    DEBUG("[ADAPTIVE_GC] no-op: hard_limit {:.4f}, gc_max {:.4f}",
+          old_hard_limit, old_gc_max);
+  }
 
-  // Reset window: record current open count as the new baseline.
+  // Reset per-window open-segment peak.
   peak_open_segments_window = segments.get_num_open();
 }
 
@@ -1883,6 +1910,11 @@ bool SegmentCleaner::try_reserve_projected_usage(std::size_t projected_usage)
 {
   assert(background_callback->is_ready());
   stats.projected_used_bytes += projected_usage;
+  // Update decayed peak; the slow decay in maybe_adjust_thresholds() lets old
+  // peaks fade so gc_max can eventually re-discover lower floors.
+  peak_projected_used_decayed = std::max(
+    peak_projected_used_decayed,
+    static_cast<double>(stats.projected_used_bytes));
   if (should_block_io_on_clean()) {
     stats.projected_used_bytes -= projected_usage;
     return false;
diff --git a/src/crimson/os/seastore/async_cleaner.h b/src/crimson/os/seastore/async_cleaner.h
index e8134a10b84..8006215c320 100644
--- a/src/crimson/os/seastore/async_cleaner.h
+++ b/src/crimson/os/seastore/async_cleaner.h
@@ -1669,6 +1669,11 @@ private:
   std::size_t peak_open_segments_window = 0;
   seastar::lowres_clock::time_point adaptive_last_time;
 
+  // Peak projected_used with slow exponential decay per adjust cycle.  Decay
+  // 0.5% per 30s window = half-life ~1 hour: long enough not to forget peaks
+  // mid-workload, short enough to re-discover lower floors over time.
+  double peak_projected_used_decayed = 0.0;
+
   SegmentManagerGroupRef sm_group;
   BackrefManager &backref_manager;