Dhairya Parmar [Wed, 20 May 2026 21:18:15 +0000 (02:48 +0530)]
mds: persist session auth_name in ESession journal event
So that it can be applied to the freshly created session. ESession::replay
recreates the session when the OMAP version has fallen behind the ESession
cmapv; without the auth_name, the newly created session would be rejected
as a target when a client tries to reclaim it.
Kefu Chai [Mon, 1 Jun 2026 10:40:06 +0000 (18:40 +0800)]
qa/cephadm: query iSCSI gateway FQDN from inside the container
rbd-target-api validates that the gateway hostname supplied by gwcli
matches the container's own socket.getfqdn(). Running the same call on
the host can return a different value when the host and container resolve
names differently (e.g. on Rocky 10), causing gateway creation to fail
with HTTP 400 and all subsequent gwcli configuration to break silently.
Query the FQDN from inside the iSCSI container directly so the value is
always consistent with what rbd-target-api expects. This also removes the
"run twice" workaround, which was compensating for host-side DNS
warm-up flakiness rather than addressing the underlying mismatch.
Kefu Chai [Mon, 1 Jun 2026 05:19:04 +0000 (13:19 +0800)]
test/libcephfs: reduce SnapDiffDeletionRecreation bulk_count on Windows
this test timed out on Windows. HugeSnapDiffLargeDelta, at half
the file count, passed in 508 seconds on the same run, suggesting this
test takes ~17 minutes on Windows -- beyond the test runner limit.
we haven't profiled the Windows client yet, but the likely culprit is
EventPoll, the Windows messenger backend, which scans the entire poll
array on every event_wait() and poll_ctl() call rather than using a
keyed data structure.
in this change, we reduce bulk_count to 1 << 12 on Windows. the unique
thing this test covers is the deletion-recreation pattern: a name that
exists as a file in snap1, gets deleted, and reappears as a directory in
snap2 -- it must show up in the diff with both snapids. 4096 produces
1024 such pairs, which is enough to exercise that logic. multi-fragment
snapdiff is already covered by HugeSnapDiffLargeDelta, which derives its
file count from mds_bal_split_size and mds_bal_fragment_fast_factor
explicitly to trigger fragmentation.
Sun Yuechi [Sat, 30 May 2026 06:15:12 +0000 (14:15 +0800)]
cmake: add WITH_SYSTEM_SPDK to link a system-installed SPDK
By default ceph builds the bundled src/spdk fork via BuildSPDK. Add a
WITH_SYSTEM_SPDK option that instead locates a distro-provided SPDK
through a new Findspdk.cmake (pkg-config based, modelled on
Finddpdk.cmake), exposing the same spdk::spdk target.
Sun Yuechi [Sat, 30 May 2026 06:11:11 +0000 (14:11 +0800)]
blk/spdk: support both old and new spdk_env_opts member names
SPDK 21.01 renamed two struct spdk_env_opts members: pci_whitelist ->
pci_allowed and master_core -> main_core. Guard the assignments in
NVMEDevice with SPDK_VERSION.
Jacques Heunis [Thu, 15 Jan 2026 12:11:11 +0000 (12:11 +0000)]
tools/rados: Remove plain text snippets from rados bench JSON output
`rados bench` emits performance stats as its output. It is very helpful
for this output to be in a machine-readable format and the CLI provides
the `--format=json` flag to achieve this.
There are some logs that do not respect the formatter flag though, as
they provide status updates as the tool is running and do not form part
of the output dataset. This prevents the contents of stdout from being
valid JSON which destroys the machine-readability of the output.
To resolve this we gate those status messages behind a check for the
formatter. If any specific formatter is provided we do not emit the
status logs. This leaves the plaintext output largely untouched while
helping the machine-readable output to be well-formed.
Fixes: https://tracker.ceph.com/issues/74370
Signed-off-by: Jacques Heunis <jheunis@bloomberg.net>
Jamie Pryde [Fri, 29 May 2026 11:44:56 +0000 (12:44 +0100)]
qa: Ignore deprecated EC plugin warning in teuthology tests
Add DEPRECATED_EC_PLUGIN to the list of health warnings to
ignore in the thrash-erasure-code-* tests that use deprecated
plugins or techniques. It is expected that this warning will
be raised.
Sun Yuechi [Fri, 29 May 2026 10:39:51 +0000 (18:39 +0800)]
rgw: move SWIFT error_handler out-of-line to fix link failure
The two error_handler overrides are defined inline in rgw_rest_swift.h
and delegate to RGWSwiftWebsiteHandler::error_handler, a non-virtual
function defined in rgw_rest_swift.cc (librgw_a.a). Because the header
is included by rgw_rest.cc, the inline bodies are emitted in
librgw_common.a, which then ODR-uses that symbol across archives.
The link line lists librgw_a.a before librgw_common.a, and GNU ld only
pulls archive members on demand: when librgw_a.a is scanned nothing yet
references RGWSwiftWebsiteHandler::error_handler, so rgw_rest_swift.cc.o
is dropped and the symbol is later unresolved. This shows up as a link
failure with gcc 16 -O2.
Move the two bodies into rgw_rest_swift.cc next to the function they
call, so the ODR-use stays within the same object and the build no
longer depends on archive scan order. No functional change.
Vallari Agrawal [Wed, 27 May 2026 12:17:55 +0000 (17:47 +0530)]
qa/suites/nvmeof: ignore "have only 1 nvmeof gateway"
Add "have only 1 nvmeof gateway" to ignorelist.
NVMEOF_SINGLE_GATEWAY is already part of ignorelist
but tests sometimes fail on "have only 1 nvmeof gateway".
Thrasher or scalability tests can trigger this, but there
are enough asserts to ensure all expected gateways are
up, so we can safely ignore this healthcheck warning.
Redouane Kachach [Fri, 29 May 2026 09:09:44 +0000 (11:09 +0200)]
qa/tasks: capture CommandCrashedError when running nvme list cmd
The safe_while retry loop does not catch exceptions, so a
CommandCrashedError from `nvme list` bypasses it entirely. Catch
CommandCrashedError and continue the retry loop instead.
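A minimal sketch of the retry pattern described above, in plain Python. The `CommandCrashedError` class here is a local stand-in, and `retry_nvme_list` and `run_nvme_list` are hypothetical names; the real task relies on teuthology's `safe_while` and its own exception type.

```python
import time


class CommandCrashedError(Exception):
    """Local stand-in for the teuthology exception of the same name."""


def retry_nvme_list(run_nvme_list, tries=5, sleep=0.0):
    """Call run_nvme_list() until it returns, treating a crash as retryable."""
    for attempt in range(1, tries + 1):
        try:
            return run_nvme_list()
        except CommandCrashedError:
            # Without this handler the exception escapes the loop on the
            # first crash and no retry ever happens.
            if attempt == tries:
                raise
            time.sleep(sleep)
```

The handler re-raises on the last attempt so that a command which keeps crashing still fails the task instead of being silently swallowed.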
Shai Fultheim [Tue, 19 May 2026 22:53:21 +0000 (01:53 +0300)]
crimson/os/seastore: adaptive cleaner gc_max from observed user-burst peak
The previous commit adapts `hard_limit` to track the cleaner's observed
open-segment peak, removing the hard-coded `.10` floor and cutting WAF
~43%. With hard_limit adaptive, the remaining WAF lever is `gc_max` —
the threshold that gates when the cleaner runs in non-emergency mode
and therefore the cluster's steady-state operating fill. Lower gc_max
= higher fill = more dead bytes per reclaim cycle = fewer live bytes
copied = lower GC component of WAF.
The hard-coded default of `0.15` (cleaner triggers at 85% segment
fill) is over-provisioned for the typical cluster. On the bench
workload the empirically optimal `gc_max` is about 0.08, which at the
default 0.15 means ~7% of cluster space sits unused and ~1.5x of WAF
is paid for the privilege.
This commit makes gc_max adaptive: it decays each window from its
initial static value toward an observation-derived floor.
The floor is the smallest gap the cluster needs to absorb its observed
worst-case in-flight user reservation. `peak_projected_used` is tracked
across the cluster's lifetime with a slow exponential decay applied
each adjust cycle.
Decay rate
==========
The decay multiplier is `0.995` per 30 s elapsed window. The decay is
applied lazily: each call to `maybe_adjust_thresholds()` raises 0.995
to the actual elapsed seconds / 30. This way the decay catches up
correctly even if the background process was idle and the hook went
uncalled for many cycles. A naive per-call multiplication would freeze
the decay during idle phases (the issue observed in v1 testing where
peak stayed at its high-water mark across a 45-minute idle window).
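The difference between lazy and per-call decay can be illustrated with a short Python sketch. The 0.995 multiplier and the 30 s window come from the text above; the function and constant names are illustrative and are not taken from the SegmentCleaner code.

```python
DECAY_PER_WINDOW = 0.995   # multiplier per 30 s window, from the commit message
WINDOW_SEC = 30.0


def decay_peak(peak, elapsed_sec):
    """Apply the decay for the whole elapsed interval in one step."""
    return peak * DECAY_PER_WINDOW ** (elapsed_sec / WINDOW_SEC)


# One call after a 45-minute idle gap ...
lazy = decay_peak(1.0, 45 * 60)

# ... matches 90 back-to-back calls spaced 30 s apart.
stepwise = 1.0
for _ in range(90):
    stepwise = decay_peak(stepwise, WINDOW_SEC)
```

Because the exponent is derived from elapsed time, the result does not depend on how often the hook happens to be called, which is the property the idle-window test was checking.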
Decay timeline (fraction of original value remaining, on a system
where maybe_adjust_thresholds is called at least every 30 s during
idle — or any interval, since the decay is now elapsed-time-based):
So a single observed peak influences gc_max strongly for ~1 hour,
noticeably for ~4 hours, and is essentially forgotten within a day.
This is sized to be much longer than transient bench phases (peaks
remain >92% of true value within a 16 min bench, never roll out
prematurely) yet much shorter than workload-shift timescales (a
workload that genuinely eases sees gc_max shrink within hours).
Re-discovery
============
The decay lets gc_max eventually re-discover lower floors when a
workload genuinely eases, while preserving observed peaks long enough
that transient bursts inside a steady workload don't roll out
prematurely.
gc_max is bounded below by the floor at all times — so the workload's
observed needs are always satisfied without static tuning. Each
window, gc_max moves halfway toward the floor (`gc_max = max(floor,
(gc_max + floor) / 2)`). This is binary-search-style convergence:
distance to floor halves per window. When the floor rises (workload
reveals a new peak), gc_max jumps up to meet it immediately. When the
floor falls (peaks have decayed below current gc_max), gc_max halves
toward the lower value over the next several windows.
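The update rule quoted above can be sketched in a few lines of Python. Only the `max(floor, (gc_max + floor) / 2)` expression comes from the commit message; the function names and the tolerance-based loop are illustrative assumptions.

```python
def step_gc_max(gc_max, floor):
    """One window of the update rule quoted above."""
    return max(floor, (gc_max + floor) / 2)


def windows_until_close(gc_max, floor, tolerance):
    """Count windows until gc_max is within `tolerance` of a fixed floor."""
    windows = 0
    while gc_max - floor > tolerance:
        gc_max = step_gc_max(gc_max, floor)
        windows += 1
    return windows
```

With a fixed floor the gap halves every window, so the number of windows needed grows only logarithmically with the starting distance. When the floor is above the current value, the `max()` makes gc_max jump to it in a single step.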
Bootstrap safety: gc_max retains the existing static initial value
(0.15), so a freshly mounted cleaner runs at the same operating point
as today's code until observations have accumulated. This avoids the
"cluster crashes before adaptive sees a workload" failure mode that
naive `gc_max = hard_limit + observed` produces.
Implementation
==============
A single double member on SegmentCleaner: `peak_projected_used_decayed`
is updated to `max(current, projected_used_bytes)` on each
`try_reserve_projected_usage()` call. `maybe_adjust_thresholds()`
applies `std::pow(0.995, elapsed_sec / 30.0)` decay on each invocation
(every ≥30 s in steady state, longer if the cleaner was idle). The
floor uses this value directly.
Configuration                          | WAF   | Duration | Status
---------------------------------------|-------|----------|-------
Static defaults (gc_max=.15, hard=.10) | 5.749 | 33 min   | clean
Manual tuned (gc_max=.08, hard=.02)    | 2.926 | 16 min   | clean
Adaptive hard_limit only               | 3.276 | 17 min   | clean
Adaptive hard_limit + gc_max (HEAD)    | 2.829 | 17 min   | clean
Adaptive gc_max reduces WAF a further 14% vs hard_limit-only (3.276 ->
2.829) and slightly beats the hand-tuned manual point (2.926). The
per-OSD adaptation captures workload asymmetry that uniform static
defaults can't: on the bench's PG-imbalanced setup the lightly-loaded
osd.0 settled at gc_max=0.026 (much tighter than the manual 0.08)
while osd.1 took the full traffic and settled at gc_max=0.084. Both
extract maximum efficiency for their actual load instead of running
at worst-case-conservative values.
A separate decay-validation run (45-minute idle interlude between two
heavy phases) confirmed that the lazy decay catches up correctly even
when the background process was dormant during the idle phase.
No new workload-tuned constants are introduced. The literal numbers
in this commit are:
- the 30 s window from the previous commit (time scale of the
  feedback loop)
- the binary-search halving rate (control geometry, not
  workload-specific; could be 1/3 or 1/4 with similar convergence)
- the 0.995 decay rate (per-window multiplier; gives the ~1-hour
  half-life and ~24-hour full-forget behaviour described above;
  recompile-only)
The existing `get_default()` value of `0.15` is left untouched as the
bootstrap initial — operators who disable adaptive control (future
config knob) revert to today's exact behaviour.
Shai Fultheim [Tue, 19 May 2026 10:55:02 +0000 (13:55 +0300)]
crimson/os/seastore: adaptive cleaner hard_limit from observed open-segment peak
The cleaner's `available_ratio_hard_limit` controls when user IO blocks
(once projected_aratio < hard_limit). Setting it too high causes
unnecessary blocks during transient pressure; setting it too low risks
running out of free segments for the cleaner's own working set, which
aborts the OSD with "seastore device size setting is too small".
The current default of `0.10` was chosen empirically and does not scale
with cluster geometry. On a 32 GiB cluster with default 64 MiB segments,
`0.10` reserves ~3 GiB of always-empty space. The cleaner's actual
named-writer working set is 1 journal + `seastore_hot_tier_generations`
hot writers + `seastore_cold_tier_generations` cold writers + 1
metadata writer = (hot + cold + 2) segments. For the typical defaults
(5 hot, 3 cold) that is 10 segments = 640 MiB on a 32 GiB OSD = 2.0%.
Reserving 10% leaves ~80% of that "headroom" sitting unused, which
causes the cluster to operate at lower fill, accumulate fewer dead
bytes per segment, and pay 4-5x WAF on garbage collection cycles.
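The working-set arithmetic above can be checked with a short Python sketch. The writer counts, segment size and device size are the ones quoted in the text; the function name is illustrative.

```python
MIB = 1 << 20
GIB = 1 << 30


def working_set_fraction(hot, cold, segment_bytes, device_bytes):
    """Fraction of the device taken by the named-writer working set."""
    # 1 journal + hot writers + cold writers + 1 metadata writer
    segments = hot + cold + 2
    return segments * segment_bytes / device_bytes


needed = working_set_fraction(hot=5, cold=3,
                              segment_bytes=64 * MIB,
                              device_bytes=32 * GIB)

# Share of the static 10% reserve that the working set does not need.
unused_share = (0.10 - needed) / 0.10
```

Here `needed` comes out just under 2% of the device, and `unused_share` is roughly the 80% of headroom the text describes as sitting idle.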
This commit makes hard_limit adaptive: track the peak open-segment
count observed during each 30 s window, then derive
where the "+ 1" segment is the minimum safety unit (one more open
segment than ever observed). The `named_writers` count is the
architectural floor below which the cleaner cannot allocate; staying
above it prevents the abort. `observed_peak` floats to track the
actual transient overhead introduced by segment transitions in the
running workload.
Implementation
==============
`AsyncCleaner::maybe_adjust_thresholds()` is added as a virtual no-op
hook; `SegmentCleaner` overrides it. The hook is invoked once per
`BackgroundProcess::run()` iteration. Each call samples the current
open-segment count into the rolling window peak. Every 30 s, the
window's peak is consumed to recompute hard_limit, and the window
resets.
`config_t config` loses its `const` qualifier; the only mutation is
this hook, which is the single writer in the cleaner's shard.
This commit only adapts `hard_limit`. `gc_max` remains at its existing
default (0.15). A follow-up commit will add adaptive `gc_max` driven
by observed user-burst and cleaner-cycle peaks; that is where the
remaining WAF reduction lives.
Bench measurements
==================
qa/standalone/crimson randwrite at 70% fill, 1 MiB writes, 32 GiB
per-OSD null_blk backing, 1280 GiB write target. Comparison against
the same workload with static `hard_limit = 0.10`:
WAF drops 43 % and end-to-end throughput nearly doubles. The mechanism
is that fewer projected_aratio dips cross the (much lower) block
threshold, so the cluster spends less time in the block-recover-block
cycle that bloats device_written without progressing user_written.
No new workload-tuned constants are introduced. The two literal
numbers in the algorithm are the 30 s recompute interval (time scale
of the feedback loop, not workload-specific) and the `+ 1 segment`
safety unit (the smallest possible buffer in units the cleaner can
allocate).
ceph-volume: retry lvs after empty result and "devices file is missing" stderr
When LVM's devices file is out of sync with the runtime device view (common
in teuthology/container namespaces with multipath), `lvs` can exit 0 with
empty stdout and only stderr warnings about missing mapper entries.
This can leave get_lvs() empty and cause Device() to fall through to lsblk on
a vg/lv path, which can produce a misleading "not a block device" error.
With this fix, ceph-volume retries once with 'use_devicesfile=0' when it
detects this specific pattern.
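A rough sketch of the retry condition, in Python. This is not the ceph-volume implementation: `list_lvs` and the injected `run` callable are hypothetical, and the `--config devices/use_devicesfile=0` spelling of the override is an assumption, since the commit message only names `use_devicesfile=0`.

```python
def list_lvs(run):
    """Run lvs once, and once more without the devices file if needed.

    `run` is a hypothetical callable that takes an argv list and returns
    a (stdout, stderr, returncode) tuple.
    """
    base = ["lvs", "--noheadings"]
    out, err, rc = run(base)
    # Retry only for the specific pattern: success, no output, and the
    # "devices file is missing" warning on stderr.
    if rc == 0 and not out.strip() and "devices file is missing" in err:
        # Assumed spelling of the override (see lead-in).
        out, err, rc = run(base + ["--config", "devices/use_devicesfile=0"])
    return out, err, rc
```

Keeping the condition this narrow means a genuinely empty `lvs` result, with no warning on stderr, is still reported as empty rather than retried.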
Patrick Donnelly [Wed, 27 May 2026 03:05:22 +0000 (23:05 -0400)]
qa/tasks/cbt: construct venv just for cbt
So we no longer need to install system-wide.
Avoids errors like on Ubuntu 24.04:
2026-05-24T13:48:19.681 DEBUG:teuthology.orchestra.run.trial043:> python3 -m pip install -r /home/ubuntu/cephtest/cbt/requirements.txt
2026-05-24T13:48:19.861 INFO:teuthology.orchestra.run.trial043.stderr:error: externally-managed-environment
2026-05-24T13:48:19.861 INFO:teuthology.orchestra.run.trial043.stderr:
2026-05-24T13:48:19.861 INFO:teuthology.orchestra.run.trial043.stderr:× This environment is externally managed
2026-05-24T13:48:19.861 INFO:teuthology.orchestra.run.trial043.stderr:╰─> To install Python packages system-wide, try apt install
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: python3-xyz, where xyz is the package you are trying to
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: install.
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr:
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: If you wish to install a non-Debian-packaged Python package,
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: create a virtual environment using python3 -m venv path/to/venv.
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: sure you have python3-full installed.
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr:
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: If you wish to install a non-Debian packaged Python application,
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: it may be easiest to use pipx install xyz, which will manage a
2026-05-24T13:48:19.862 INFO:teuthology.orchestra.run.trial043.stderr: virtual environment for you. Make sure you have pipx installed.
2026-05-24T13:48:19.863 INFO:teuthology.orchestra.run.trial043.stderr:
2026-05-24T13:48:19.863 INFO:teuthology.orchestra.run.trial043.stderr: See /usr/share/doc/python3.12/README.venv for more information.
2026-05-24T13:48:19.863 INFO:teuthology.orchestra.run.trial043.stderr:
2026-05-24T13:48:19.863 INFO:teuthology.orchestra.run.trial043.stderr:note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
2026-05-24T13:48:19.863 INFO:teuthology.orchestra.run.trial043.stderr:hint: See PEP 668 for the detailed specification.
2026-05-24T13:48:19.883 DEBUG:teuthology.orchestra.run:got remote process result: 1
2026-05-24T13:48:19.883 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_3686f8793d626abcf5a0018da0a50786e41fed9d/teuthology/run_tasks.py", line 112, in run_tasks
manager.__enter__()
File "/home/teuthworker/src/git.ceph.com_teuthology_3686f8793d626abcf5a0018da0a50786e41fed9d/teuthology/task/__init__.py", line 122, in __enter__
self.setup()
File "/home/teuthworker/src/github.com_ceph_ceph-c_1bc3c25246d3a6fbc360dc78d9b4b51200743391/qa/tasks/cbt.py", line 173, in setup
self.install_dependencies()
File "/home/teuthworker/src/github.com_ceph_ceph-c_1bc3c25246d3a6fbc360dc78d9b4b51200743391/qa/tasks/cbt.py", line 112, in install_dependencies
self.first_mon.run(args=pip_install_cmd)
File "/home/teuthworker/src/git.ceph.com_teuthology_3686f8793d626abcf5a0018da0a50786e41fed9d/teuthology/orchestra/remote.py", line 596, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/teuthworker/src/git.ceph.com_teuthology_3686f8793d626abcf5a0018da0a50786e41fed9d/teuthology/orchestra/run.py", line 461, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_teuthology_3686f8793d626abcf5a0018da0a50786e41fed9d/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_teuthology_3686f8793d626abcf5a0018da0a50786e41fed9d/teuthology/orchestra/run.py", line 181, in _raise_for_status
raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on trial043 with status 1: 'python3 -m pip install -r /home/ubuntu/cephtest/cbt/requirements.txt'
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Patrick Donnelly [Wed, 27 May 2026 02:21:12 +0000 (22:21 -0400)]
qa/distros: use consistent naming
Put the release name in the yaml name so it's easy to read from the job
description. "ubuntu_latest" means different things depending on the
Ceph release.
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Adam King [Tue, 16 Sep 2025 16:07:36 +0000 (12:07 -0400)]
qa/tasks/nvme_loop: fix nvme loop task for ubuntu noble
Compared to older distros, this one complains if
you include `-q hostnqn` in the nvme connect command,
saying "Failed to write to /dev/nvme-fabrics: Invalid argument".
Removing that argument gets past that error and
doesn't seem to have any downsides.
Nitzan Mordechai [Wed, 27 May 2026 11:48:14 +0000 (11:48 +0000)]
mgr/ThreadMonitor: monitor interval running in seconds and not nanoseconds
The ctor accidentally used mgr_module_monitor_interval as nanoseconds;
it must be interpreted as seconds.
Also, prevent a high-CPU loop when read_process_statm fails inside the
while loop.
7f739adae2 dropped the last log call from get_segment_manager(), after
which `LOG_PREFIX(SegmentManager::get_segment_manager)` and
`SET_SUBSYS(seastore_device)` had no remaining users under `HAVE_ZNS`,
generating:
Kefu Chai [Tue, 26 May 2026 14:01:41 +0000 (22:01 +0800)]
crimson/seastore: make RecordSubmitter::wait_available() idempotent
Under sustained 4K randwrite workloads that roll journal segments
frequently, crimson-osd hits
```
crimson/os/seastore/journal/record_submitter.cc:198:
FAILED ceph_assert(!is_available())
```
and, in release builds without assertions, a downstream
`boost::throw_exception<std::length_error>` from
`seastar::shared_promise::get_shared_future()` called on a
disengaged `std::optional` in the same code path.
`RecordSubmitter::roll_segment()` arms wait_available_promise on entry,
then chains `journal_allocator.roll().safe_then(...)` whose continuation
sets the promise's value and resets the optional. The background
continuation can resolve before the subsequent `wait_available()` call
is entered -- the optional gets reset, `is_available()` becomes true
again, and `wait_available()`'s `assert(!is_available())` fires. The
brittle invariant being assumed
> .safe_then's continuation will not run before its outer call returns
is not part of seastar's contract.
Honour the documented contract instead. record_submitter.h
says:
> wait for available if cannot submit, should check
> is_available() again when the future is resolved.
The postcondition is "available when resolved"; the precondition
"unavailable when called" was incidental. Make `wait_available()`
idempotent: if `is_available()` is already true on entry, return a
ready future immediately. All three external callers
- `RecordSubmitter::roll_segment`
- `CircularBoundedJournal::submit_record`
- `SegmentedOolWriter::do_write`
re-check `is_available()` on the next iteration or in the chained
continuation and dispatch correctly.
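The idempotent behaviour can be illustrated with a toy asyncio model in Python. This is only an analogy for the seastar code: the `Submitter` class and its `start_roll` and `finish_roll` methods are hypothetical stand-ins, and an `asyncio.Future` plays the role of the optional shared promise.

```python
import asyncio


class Submitter:
    """Toy stand-in for RecordSubmitter; only the availability gate is modelled."""

    def __init__(self):
        self._rolling = None  # a Future while a roll is in flight, else None

    def is_available(self):
        return self._rolling is None

    def start_roll(self):
        self._rolling = asyncio.get_running_loop().create_future()

    def finish_roll(self):
        fut, self._rolling = self._rolling, None
        fut.set_result(None)

    async def wait_available(self):
        # Idempotent: if the roll already completed, return at once
        # instead of asserting that we are still unavailable.
        if self.is_available():
            return
        await self._rolling


async def demo():
    s = Submitter()
    s.start_roll()
    s.finish_roll()            # roll completes before anyone waits
    await s.wait_available()   # returns immediately, no assertion
    s.start_roll()
    asyncio.get_running_loop().call_soon(s.finish_roll)
    await s.wait_available()   # resolves once the roll finishes
    return s.is_available()
```

The first `wait_available()` call in `demo()` models the race from the commit message, where the roll has already finished by the time the waiter arrives. Callers still re-check `is_available()` afterwards, as the header comment requires.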
Validated by running a 96-job fio randwrite bench to confirm
the fix in operation; pre-patch the assert fires within ~30 min
and kills the OSD.