From: Shai Fultheim
Date: Sun, 17 May 2026 09:40:44 +0000 (+0300)
Subject: crimson/os/seastore: auto-tune cleaner gc segment pick under random-write
X-Git-Tag: v21.0.1~130^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=refs%2Fpull%2F68964%2Fhead;p=ceph.git

crimson/os/seastore: auto-tune cleaner gc segment pick under random-write

SegmentCleaner uses one of three configurable gc formulas to select the
next segment to reclaim: GREEDY (lowest util wins), COST_BENEFIT
((1-u)*age/(2u)), or BENEFIT (an age-weighted quadratic in util).
COST_BENEFIT is the default and the right choice for journaling / LIFO
workloads, where old segments accumulate more dead bytes than young
ones: age predicts deadness, so an old high-util segment is worth
reclaiming because its util will keep rising as long as we wait.

That assumption breaks under random-write at high cluster fill. Dead
bytes spread uniformly across segments regardless of age, so age stops
predicting future deadness, and (1-u)/(2u) becomes the only term that
distinguishes candidates. With every segment in the 0.7-0.94 util band,
(1-u)/(2u) ranges from 0.227 to 0.032, a 7x spread the formula can
easily lose to a 7x age difference. Result: a 0.94-util old segment
scores higher than a 0.68-util young one, even though reclaiming the
0.68 segment would free 5x more space (32% of a 64 MB segment vs 6%).

Observed in qa/standalone/crimson randwrite at ~70% full: with the
unmodified formula, cleaner picks settled on 0.92-0.94 util segments
freeing ~4 MB net each; net free rate collapsed to single-digit KB/s
even though the cleaner was running cycles at ~30 µs each. fio's stall
watchdog killed the bench after 535 GB user written (target 1280 GB).
Switching gc_formula = greedy by hand let the bench complete the target.

This patch detects the mis-selection at runtime and overrides the
formula's pick with the greedy choice only when the difference is
significant.

In get_next_reclaim_segment() we already iterate all closed reclaimable
segments to find the formula's max-score candidate; in the same pass we
now also track the lowest-util candidate (what GREEDY would have
picked). After the loop, if greedy's free-fraction (1 - greedy_util) is
at least seastore_segment_cleaner_gc_autotune_ratio times the formula's
pick's free-fraction (default 2.0), we swap to greedy. Since all
segments share the same size, comparing free-fractions is equivalent to
comparing freed bytes; the fraction form avoids an unnecessary
multiplication.

The full design rationale (regime-by-regime behaviour, safety guard
against picked_free near zero, score-recompute on override, threshold
calibration) lives in doc/dev/crimson/seastore.rst under the new
"Cleaner GC autotune" section. The code references it from short inline
comments.

Configurable knobs:

  * seastore_segment_cleaner_gc_autotune (bool, default true):
    operators can disable the override entirely to honor the configured
    formula unconditionally. Ignored when gc_formula = greedy.
  * seastore_segment_cleaner_gc_autotune_ratio (float, default 2.0,
    min 1.0): operators can tune the override threshold. Higher is more
    conservative (preserves age weighting more aggressively); lower is
    more aggressive (behaviour converges toward pure greedy).

The override predicate is factored into a static helper
`SegmentCleaner::should_override_to_greedy(picked_free, greedy_free,
ratio)` so the call site stays readable and the predicate is
independently testable.

With this change the qa/standalone/crimson randwrite bench at 70% fill
completes the target run rather than stalling at the 500-600 GB mark,
with the override firing reliably under high uniform alive_ratio and
not firing under low or non-uniform alive_ratio.

Override behaviour can be observed with debug_seastore_cleaner=20.
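
The scoring arithmetic described in this message can be checked with a
small standalone sketch. It is not part of the patch: cost_benefit only
restates the (1-u)*age/(2u) formula quoted above, free_fraction and
old_outscores_young are hypothetical helper names, ages are in arbitrary
units, and the 8x age ratio used below is an assumed value chosen for
illustration.

```cpp
#include <cassert>

// COST_BENEFIT score as quoted in the commit message: (1-u)*age/(2u).
// Standalone restatement for illustration; requires 0 < util <= 1.
double cost_benefit(double util, double age) {
  assert(util > 0.0 && util <= 1.0);
  return (1.0 - util) * age / (2.0 * util);
}

// Fraction of a segment that reclaiming it would free.
double free_fraction(double util) {
  return 1.0 - util;
}

// True when the old segment outscores the young one under COST_BENEFIT,
// regardless of how much space each would actually free.
bool old_outscores_young(double old_util, double old_age,
                         double young_util, double young_age) {
  return cost_benefit(old_util, old_age) >
         cost_benefit(young_util, young_age);
}
```

With an old segment at util 0.94 that is 8 times older than a young
segment at util 0.68, old_outscores_young(0.94, 8.0, 0.68, 1.0) holds
(roughly 0.255 vs 0.235), while free_fraction gives 0.32 vs 0.06. That
is the gap of about 5x in freed space that the override is meant to
catch.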
Signed-off-by: Shai Fultheim
---

diff --git a/doc/dev/crimson/seastore.rst b/doc/dev/crimson/seastore.rst
index ae429158ea3..cc54904cff7 100644
--- a/doc/dev/crimson/seastore.rst
+++ b/doc/dev/crimson/seastore.rst
@@ -622,6 +622,65 @@ ExtentPlacementManager is responsible for:
   and physical extents are updated accordingly. The SegmmentCleaner is also responisble
   for throttling GC work in order to avoid abrupt pauses and maintain smooth IO latenices.

+.. _cleaner-gc-autotune:
+
+**Cleaner GC autotune**:
+
+  ``SegmentCleaner::get_next_reclaim_segment()`` chooses the next segment to
+  reclaim using one of three configurable formulas selected by
+  ``seastore_segment_cleaner_gc_formula``: ``GREEDY`` (lowest utilization
+  wins), ``COST_BENEFIT`` (``(1-u) * age / (2u)``), or ``BENEFIT``
+  (age-weighted quadratic). ``COST_BENEFIT`` is the default and the right
+  call for journaling / LIFO workloads where age predicts future
+  dead-byte accumulation.
+
+  That assumption breaks under random-write at high cluster fill. Dead
+  bytes spread uniformly across segments regardless of age, so age stops
+  predicting future deadness, and ``(1-u)/(2u)`` becomes the only term that
+  distinguishes candidates. With every segment in the 0.7-0.94 utilization
+  band, ``(1-u)/(2u)`` ranges from 0.227 to 0.032 -- a 7x spread the
+  formula can easily lose to a 7x age difference. A 0.94-util old segment
+  then outscores a 0.68-util young one, even though reclaiming the 0.68
+  segment would free 5x more space.
+
+  The autotune override detects this mis-selection at runtime. In the
+  same pass that scores segments by the configured formula, it also
+  tracks the lowest-utilization candidate (what ``GREEDY`` would pick).
+  After the pass, if greedy's free-fraction (``1 - util``) is at least
+  ``seastore_segment_cleaner_gc_autotune_ratio`` times the formula's
+  pick's free-fraction (default 2.0), the override swaps the formula's
+  pick for greedy. Since all segments share the same size, comparing
+  free-fractions is equivalent to comparing freed bytes.
+
+  Behaviour by regime:
+
+  - **Low alive_ratio**: many low-util candidates exist; the formula's
+    age-preferred pick is typically within ~30% of greedy in
+    free-fraction. The override does not fire and age weighting is
+    preserved.
+  - **High alive_ratio with non-uniform utilisation** (hot/cold mix):
+    greedy and the formula converge on the same segment in most cases;
+    when they differ, the formula's choice is usually within 2x. The
+    override rarely fires.
+  - **High alive_ratio with uniform utilisation** (the failure regime
+    the autotune targets): greedy's pick exceeds the formula's by 3-5x
+    routinely. The override fires reliably; net free per reclaim jumps
+    from 4-6 MB to 14-22 MB.
+
+  Configurable:
+
+  - ``seastore_segment_cleaner_gc_autotune`` (bool, default true):
+    operators can disable the override unconditionally.
+  - ``seastore_segment_cleaner_gc_autotune_ratio`` (float, default 2.0,
+    min 1.0): operators can tune the threshold; higher is more
+    conservative (preserves age weighting more aggressively).
+
+  A safety guard skips the override when the formula's pick has
+  free-fraction below ``1/1024`` of a segment, because the ratio
+  comparison is meaningless against a near-zero denominator. On
+  override the formula's score for the chosen segment is recomputed
+  so the value logged after selection stays consistent.
+
 **Tiering**:

 .. note::
diff --git a/src/common/options/crimson.yaml.in b/src/common/options/crimson.yaml.in
index c543b96b5de..850f39014d0 100644
--- a/src/common/options/crimson.yaml.in
+++ b/src/common/options/crimson.yaml.in
@@ -284,6 +284,39 @@ options:
   - greedy
   - cost_benefit
   - benefit
+- name: seastore_segment_cleaner_gc_autotune
+  type: bool
+  level: advanced
+  desc: When the configured gc formula (cost_benefit or benefit) picks a segment
+    whose free-space fraction (1 - utilization) is at least
+    seastore_segment_cleaner_gc_autotune_ratio times smaller than the
+    lowest-utilization candidate, override the pick with the greedy choice.
+  long_desc: COST_BENEFIT and BENEFIT weight segment age, which is the right
+    call when age predicts dead-byte accumulation (journaling / LIFO
+    workloads). Under random-write at high alive_ratio dead bytes
+    spread uniformly across segments, age stops predicting deadness,
+    and the formula can pick a high-util old segment whose reclaim
+    frees several times less space than the lowest-util candidate.
+    When this option is enabled the cleaner detects the mis-selection
+    at runtime and overrides the formula's pick with the greedy
+    choice. Disable to honor the configured formula unconditionally.
+    Ignored when seastore_segment_cleaner_gc_formula = greedy.
+  default: true
+- name: seastore_segment_cleaner_gc_autotune_ratio
+  type: float
+  level: advanced
+  desc: Override threshold for the gc auto-tune. The configured formula's
+    pick is overridden with the greedy candidate when greedy's free
+    fraction is at least this ratio times the formula's pick's free
+    fraction.
+  long_desc: Higher is more conservative (override fires less often, the
+    configured formula's age weighting is preserved more
+    aggressively). Lower is more aggressive (override fires more
+    often, behaviour converges toward pure greedy). The default
+    (2.0) captures the random-write failure regime while staying
+    clear of normal-operation fluctuations.
+  default: 2.0
+  min: 1.0
 - name: seastore_data_delta_based_overwrite
   type: size
   level: dev
diff --git a/src/crimson/os/seastore/async_cleaner.cc b/src/crimson/os/seastore/async_cleaner.cc
index 880c9f7861e..7c36897aeef 100644
--- a/src/crimson/os/seastore/async_cleaner.cc
+++ b/src/crimson/os/seastore/async_cleaner.cc
@@ -1759,15 +1759,52 @@ segment_id_t SegmentCleaner::get_next_reclaim_segment() const
   } else {
     bound_time = NULL_TIME;
   }
+  // Track the configured formula's best-scoring candidate alongside the
+  // greedy choice (lowest utilization / highest free fraction).
+  // See doc/dev/crimson/seastore.rst#cleaner-gc-autotune.
+  segment_id_t greedy_id = NULL_SEG_ID;
+  double greedy_min_util = 1.0;
   for (auto& [_id, segment_info] : segments) {
     if (segment_info.is_closed() &&
         (trimmer == nullptr ||
          !segment_info.is_in_journal(trimmer->get_journal_tail()))) {
+      // Track the configured formula's best-scoring reclaim candidate.
       double benefit_cost = calc_gc_benefit_cost(_id, now_time, bound_time);
       if (benefit_cost > max_benefit_cost) {
         id = _id;
         max_benefit_cost = benefit_cost;
       }
+      // Track the greedy candidate (lowest utilization / highest free fraction).
+      double util = calc_utilization(_id);
+      if (util < greedy_min_util) {
+        greedy_id = _id;
+        greedy_min_util = util;
+      }
+    }
+  }
+  // Autotune override: prefer greedy when its pick would free far more.
+  // See doc/dev/crimson/seastore.rst#cleaner-gc-autotune.
+  const bool autotune_enabled =
+    crimson::common::get_conf<bool>(
+      "seastore_segment_cleaner_gc_autotune");
+  if (autotune_enabled &&
+      gc_formula != gc_formula_t::GREEDY &&
+      id != NULL_SEG_ID && greedy_id != NULL_SEG_ID && id != greedy_id) {
+    double picked_util = calc_utilization(id);
+    double picked_free = 1.0 - picked_util;
+    double greedy_free = 1.0 - greedy_min_util;
+    const double ratio = crimson::common::get_conf<double>(
+      "seastore_segment_cleaner_gc_autotune_ratio");
+    if (should_override_to_greedy(picked_free, greedy_free, ratio)) {
+      DEBUG("auto-tune: formula picked seg {} (util {:.3f}, free {:.3f}),"
+            " overriding with greedy seg {} (util {:.3f}, free {:.3f})",
+            id, picked_util, picked_free,
+            greedy_id, greedy_min_util, greedy_free);
+      id = greedy_id;
+      // Recompute the formula score for the chosen segment so the
+      // value logged below stays semantically consistent.
+      max_benefit_cost =
+        calc_gc_benefit_cost(greedy_id, now_time, bound_time);
     }
   }
   if (id != NULL_SEG_ID) {
diff --git a/src/crimson/os/seastore/async_cleaner.h b/src/crimson/os/seastore/async_cleaner.h
index cd3ee9e1a3a..3f8ab98d19c 100644
--- a/src/crimson/os/seastore/async_cleaner.h
+++ b/src/crimson/os/seastore/async_cleaner.h
@@ -1446,6 +1446,18 @@ public:

   clean_space_ret clean_space() final;

+  // Predicate for the autotune override: returns true when greedy's pick frees
+  // significantly more space than the formula's pick.
+  // See doc/dev/crimson/seastore.rst#cleaner-gc-autotune.
+  static bool should_override_to_greedy(
+    double picked_free, double greedy_free, double ratio) {
+    // Guard against picked_free near zero (1/1024 of a segment): the ratio
+    // comparison is meaningless against a near-zero denominator.
+    constexpr double kMinPickedFreeForRatio = 1.0 / 1024.0;
+    return picked_free >= kMinPickedFreeForRatio &&
+           greedy_free >= ratio * picked_free;
+  }
+
   const std::set<device_id_t>& get_device_ids() const final {
     return sm_group->get_device_ids();
   }
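
As a reading aid, the single-pass selection and the override in the
hunks above can be sketched outside the Ceph tree. Only
should_override_to_greedy is taken from the patch (copied from the
async_cleaner.h hunk). Everything else is a hypothetical stand-in: Seg,
kNoSeg, cost_benefit and pick_segment are illustrative names, the score
follows the COST_BENEFIT formula quoted in the commit message, the
initial score of 0.0 is an assumption, and the real code reads
utilization and configuration through SegmentCleaner rather than from
plain structs and arguments.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a closed, reclaimable segment.
struct Seg {
  int id;
  double util;  // live fraction, assumed in (0, 1]
  double age;   // arbitrary units
};

constexpr int kNoSeg = -1;  // plays the role of NULL_SEG_ID in this sketch

// COST_BENEFIT as quoted in the commit message: (1-u)*age/(2u).
double cost_benefit(const Seg& s) {
  assert(s.util > 0.0 && s.util <= 1.0);
  return (1.0 - s.util) * s.age / (2.0 * s.util);
}

// Predicate copied from the async_cleaner.h hunk.
bool should_override_to_greedy(
    double picked_free, double greedy_free, double ratio) {
  constexpr double kMinPickedFreeForRatio = 1.0 / 1024.0;
  return picked_free >= kMinPickedFreeForRatio &&
         greedy_free >= ratio * picked_free;
}

// One pass tracks both the best formula score and the lowest utilization;
// the override then decides which of the two candidates to return.
int pick_segment(const std::vector<Seg>& segs, bool autotune, double ratio) {
  int id = kNoSeg;
  int greedy_id = kNoSeg;
  double max_score = 0.0;  // assumed initial value for this sketch
  double picked_util = 1.0;
  double greedy_min_util = 1.0;
  for (const Seg& s : segs) {
    const double score = cost_benefit(s);
    if (score > max_score) {
      id = s.id;
      max_score = score;
      picked_util = s.util;
    }
    if (s.util < greedy_min_util) {
      greedy_id = s.id;
      greedy_min_util = s.util;
    }
  }
  if (autotune && id != kNoSeg && greedy_id != kNoSeg && id != greedy_id &&
      should_override_to_greedy(
          1.0 - picked_util, 1.0 - greedy_min_util, ratio)) {
    id = greedy_id;
  }
  return id;
}
```

Under these assumptions, pick_segment({{1, 0.94, 8.0}, {2, 0.68, 1.0}},
true, 2.0) returns segment 2 (the override fires, since 0.32 is more
than twice 0.06), and the same call with autotune set to false returns
segment 1. Raising the ratio to 6.0 also keeps segment 1, which matches
the "higher is more conservative" description of the threshold.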