Patrick Donnelly [Mon, 18 May 2026 14:20:08 +0000 (10:20 -0400)]
Merge PR #68937 into main
* refs/pull/68937/head:
.github/workflows/releng-audit: group events to serialize executions
.github/workflows/releng-audit: remove override on reopen
.github/workflows/releng-audit: refactor auth check to function
Kobi Ginon [Mon, 18 May 2026 13:45:32 +0000 (16:45 +0300)]
cephadm: disable UDP in samples/nfs.json for test_cephadm Ganesha
test_cephadm.sh deploys NFS through cephadm _orch deploy using
src/cephadm/samples/nfs.json. That sample is separate from the mgr
ganesha.conf.j2 template, which already sets Enable_UDP = false.
Without that setting, Ganesha on Rocky 10 (ceph-ci image) fails during
startup with "Cannot register NFS V3 on UDP", so test_cephadm.sh never
sees ganesha.nfsd listening on port 2049.
Add Protocols = 3, 4 and Enable_UDP = false to NFS_CORE_PARAM so the
sample matches the orchestrator defaults.
Fixes: https://tracker.ceph.com/issues/76295
Signed-off-by: Kobi Ginon <kginon@redhat.com>
Shai Fultheim [Sun, 17 May 2026 08:27:00 +0000 (11:27 +0300)]
crimson/os/seastore: yield to user IO between cleaner cycles
After the deadlock fix in the preceding commit ("fix IO-block deadlock
when cleaner is sleeping"), the cleaner stays awake while user IO is
blocked, but a second symptom appears at high alive_ratio (~0.79): the
cleaner's segment-allocate-and-fill loop runs tightly enough that the
user-IO continuation scheduled by maybe_wake_blocked_io() never gets a
chance to retry try_reserve_io() before the cleaner consumes the
projected_avail headroom again on its next iteration. User IO wakes,
sees projected_avail still below hard_limit, re-blocks immediately.
In the qa/standalone/crimson randwrite bench this manifests as: cluster
makes 500-700 GB of progress, then user_written counter freezes for
~75 seconds (watchdog window) while the cleaner is fully busy.
In BackgroundProcess::run(), after each do_background_cycle, if user IO
is currently blocked, yield to the reactor. That gives the woken
user-IO continuation a chance to slot in and complete a reservation
before the cleaner starts its next reservation-consuming cycle.
With this change, the same bench runs 19 minutes (vs 11-16 min) and
writes 785 GB user (vs 506-692 GB) before the next cluster limit hits,
which is the inherent throughput cap at alive_ratio 0.79 where each
reclaim only frees ~21% of segment size — not a coordination bug.
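The scheduling effect described above can be illustrated outside Seastar. This is a minimal sketch that uses Python's asyncio as a stand-in for the reactor; `run`, `cleaner`, and `user_io_retry` are hypothetical names, not Crimson code, and the unconditional `yield_between_cycles` flag simplifies the "only if user IO is blocked" condition from the commit.

```python
import asyncio


async def run(yield_between_cycles: bool) -> list[str]:
    """Return the order in which cleaner cycles and the user IO retry execute."""
    events: list[str] = []
    pending: list[asyncio.Task] = []

    async def user_io_retry() -> None:
        # Stands in for the continuation scheduled by maybe_wake_blocked_io().
        events.append("user-io")

    async def cleaner(cycles: int) -> None:
        for i in range(cycles):
            events.append(f"clean-{i}")
            if i == 0:
                # The first cycle wakes the blocked user IO (schedules its retry).
                pending.append(asyncio.create_task(user_io_retry()))
            if yield_between_cycles:
                # Hand control back to the event loop so ready tasks can run.
                await asyncio.sleep(0)

    await cleaner(3)
    await asyncio.gather(*pending)  # let the retry finish if it has not run yet
    return events


if __name__ == "__main__":
    print(asyncio.run(run(yield_between_cycles=False)))
    print(asyncio.run(run(yield_between_cycles=True)))
```

Without the yield, the retry runs only after every cleaner cycle has finished; with it, the retry slots in right after the cycle that woke it.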
Shai Fultheim [Sun, 17 May 2026 04:43:19 +0000 (07:43 +0300)]
crimson/os/seastore: fix IO-block deadlock when cleaner is sleeping
Two coordinated changes that together close a stall observed at high
alive_ratio in the qa/standalone/crimson randwrite bench (one OSD
frozen for 70+ minutes, alive_ratio ~0.79, projected_avail_ratio ~0.10,
slow_ops accumulating indefinitely).
1. SegmentCleaner::should_clean_space() used segments.get_available_ratio()
(actual ratio) while should_block_io_on_clean() used
get_projected_available_ratio() (actual minus in-flight reservations).
When the actual ratio sat just above available_ratio_hard_limit but
the projected ratio dipped below it, IO would block while the cleaner
slept. Make should_clean_space() also trip on the projected ratio.
2. BackgroundProcess::reserve_projected_usage() did not wake the
background process when an IO blocked. With the cleaner asleep and
all IO blocked, nothing called maybe_wake_blocked_io() (no
release_projected_usage runs without completing IO; no segment
release runs without the cleaner). Kick do_wake_background() at the
point of blocking, so the cleaner re-evaluates and runs.
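The mismatch in point 1 can be sketched as follows. This is an illustration under stated assumptions, not the SeaStore code: the `SpaceState` class, its `reserved_ratio` field, and the numbers are hypothetical, and each predicate is reduced to a single hard-limit comparison.

```python
from dataclasses import dataclass


@dataclass
class SpaceState:
    available_ratio: float  # actual free-space ratio
    reserved_ratio: float   # in-flight reservations (hypothetical field)
    hard_limit: float       # stands in for available_ratio_hard_limit

    @property
    def projected_available_ratio(self) -> float:
        # Actual ratio minus in-flight reservations.
        return self.available_ratio - self.reserved_ratio


def should_block_io(s: SpaceState) -> bool:
    return s.projected_available_ratio < s.hard_limit


def should_clean_space_old(s: SpaceState) -> bool:
    # Before the fix: only the actual ratio was consulted.
    return s.available_ratio < s.hard_limit


def should_clean_space_new(s: SpaceState) -> bool:
    # After the fix: also trip on the projected ratio.
    return should_clean_space_old(s) or s.projected_available_ratio < s.hard_limit


# Actual ratio just above the limit, projected ratio just below it.
state = SpaceState(available_ratio=0.12, reserved_ratio=0.04, hard_limit=0.10)
stalled_before = should_block_io(state) and not should_clean_space_old(state)
stalled_after = should_block_io(state) and not should_clean_space_new(state)
```

With these values IO blocks while the old predicate lets the cleaner sleep; the new predicate agrees with the blocking check, so the cleaner runs.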
Afreen Misbah [Mon, 18 May 2026 10:01:58 +0000 (15:31 +0530)]
mgr/dashboard: fix logs e2e tests after carbonization
Update e2e test selectors to match the new Carbon component structure.
The .card-body and .message classes were replaced with .log-viewer
and .log-entry__message after carbonizing the logs component.
Assisted-by: Claude
Signed-off-by: Afreen Misbah <afreen@ibm.com>
Afreen Misbah [Sun, 17 May 2026 14:53:54 +0000 (20:23 +0530)]
mgr/dashboard: Carbonize upgrade page
- Made cluster status clickable to navigate to overview when not HEALTH_OK
- Replaced Bootstrap classes with Carbon design tokens
- Updated upgrade.component.scss to use CSS custom properties
Assisted-by: Claude
Signed-off-by: Afreen Misbah <afreenmisbah@ibm.com>
Afreen Misbah [Tue, 12 May 2026 12:07:39 +0000 (17:37 +0530)]
mgr/dashboard: Fix edit and delete access for pool-manager role
Fixes https://tracker.ceph.com/issues/76561
- allows deleting pools in pool-manager role by bypassing config-opt read permissions
- allows editing in pool-manager role, which was failing due to missing rbd mirroring permissions
- fixes a pool edit-mode bug where editing both compression and name at once failed because of an if-else logic error
Kefu Chai [Wed, 6 May 2026 02:08:20 +0000 (10:08 +0800)]
cmake/BuildISAL: build and install library targets only
Skip building the igzip executables; Ceph only needs libisal.la.
This should speed up the build slightly, since we no longer build the
executables previously built with "make".
When the pg has been deleted while PGAdvanceMap was queued, start()
takes an early return and skips the map-advance loop. The PeeringCtx
handed to the operation may already carry a transaction and queued
peering messages, so returning without calling complete_rctx() would
drop them. Dispatch the rctx before bailing out.
Also leave the PGPeeringPipeline::process stage via handle.complete()
instead of relying on the exit_handle defer's handle.exit(). The
pg-deleted path is a graceful completion, not an op failure, so it
should match the normal completion path; handle.exit() is documented
for the failure case only.
Kefu Chai [Fri, 8 May 2026 12:43:12 +0000 (20:43 +0800)]
crimson/osd: make PGAdvanceMap idempotent
PGAdvanceMap is scheduled by two independent callers: pg creation
(do_init=true, to=current_epoch) and broadcast_map_to_pgs. They do
not coordinate, so a broadcast advance can race with an init advance
that has already pushed the pg past the broadcast's target. The op
carried a std::optional<from> that was overwritten at start-time,
guarded by ceph_assert(from <= to) which fires in this race.
The "from" parameter was never really an input. It was always
re-read inside the pipeline from pg->get_osdmap_epoch(); the value
passed in (when there was one) was discarded. Drop the member and
let the op contract be: "ensure pg has processed osdmaps up to at
least 'to'". If the pg is already past 'to', skip. This matches
OSD::advance_pg() in the classic OSD and removes the need to
serialize the two callers.
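The "ensure pg has processed osdmaps up to at least 'to'" contract can be sketched like this. It is a minimal illustration, not the Crimson implementation; the `PG` class and `advance_map` function are hypothetical stand-ins.

```python
class PG:
    """Hypothetical stand-in for a placement group that tracks its osdmap epoch."""

    def __init__(self, epoch: int) -> None:
        self.osdmap_epoch = epoch


def advance_map(pg: PG, to: int) -> int:
    """Advance pg to at least epoch `to`; return how many epochs were applied."""
    start = pg.osdmap_epoch  # always re-read from the pg, never passed in
    if start >= to:
        # Another caller already moved the pg past `to`: nothing to do.
        return 0
    for epoch in range(start + 1, to + 1):
        pg.osdmap_epoch = epoch  # process one osdmap at a time
    return to - start


pg = PG(epoch=5)
applied_init = advance_map(pg, to=10)      # init advance reaches epoch 10
applied_broadcast = advance_map(pg, to=8)  # stale broadcast target is skipped
```

Because the starting epoch is read from the pg and a stale target is a no-op, the two callers can run in either order without an assertion on `from <= to`.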
Shai Fultheim [Sat, 16 May 2026 20:17:59 +0000 (23:17 +0300)]
crimson/os/seastore: fix cleaner space leak from shadowed result list
TransactionManager::get_extents_if_live() declared an inner
std::list<CachedExtentRef> res inside the "extent is cached" branch
that shadowed the outer res returned by the coroutine. When the
queried extent was present in the cache, it was moved into the inner
list and immediately discarded, and the empty outer list was returned
to the caller.
The async cleaner uses this result to decide whether to rewrite an
extent or treat it as dead. For recently-allocated LBA tree internal
nodes (still hot in cache), the shadowed return caused the cleaner to
skip them, so mark_space_free() never paired with the earlier
mark_space_used(). Each affected reclaim leaked exactly one extent
(4 KiB for LADDR_INTERNAL), tripping the live_bytes != 0 assertion in
SegmentCleaner::clean_space() (async_cleaner.cc:1441) once a victim
segment with such a leftover was selected.
The reproducer (at ~70% full) deterministically aborted within ~3
minutes before this fix; with the fix the OSDs run cleanly past the
trigger point.
crimson/os/seastore: use configured device type to select segment manager
In get_segment_manager(), trust the user-specified device type rather
than probing the device for ZNS zones. This simplifies device type
selection. More importantly, the change avoids opening a block file
that mkfs has not yet created: we start() seastore and then call mkfs,
but starting seastore requires creating the right segment manager, so
the probe currently runs against a block file that mkfs() has yet to
create.
Kefu Chai [Sat, 16 May 2026 02:53:41 +0000 (10:53 +0800)]
doc/dev: refresh vstart.sh options in dev_cluster_deployment
Bring doc/dev/dev_cluster_deployment.rst back in line with the current
src/vstart.sh:
* drop the removed -K/--kstore objectstore backend
* drop -N/--not-new, which was dropped in 8dd2e418; reusing the existing
cluster config is simply the default when -n is not given
* correct the --rgw_frontend default from civetweb to beast
* note that -b/--bluestore is the default objectstore backend
* update the example and add a note that a fresh build needs -n on the
first run, while later runs can omit it
* note that the option list is not exhaustive and point at src/vstart.sh
src/cephadm: added ceph-exporter to post-rotate signal list
As the title says, this change adds ceph-exporter to the logrotate
post-rotate signal list, so that ceph-exporter keeps writing to a new log
file after log rotation. Currently no new log file is written after rotation
unless ceph-exporter is added to logrotate.d manually.
Kobi Ginon [Fri, 15 May 2026 16:22:30 +0000 (19:22 +0300)]
cephadm: fix mgr ports list growth; add unit tests (#76564)
Problem
-------
MgrService.prepare_create built module ports from ``mgr services`` but
only assigned them when non-empty, then always appended
service_discovery_port. After rehydration from mgr/cephadm/host.*, an
empty ``mgr services`` response left stale ports in place and appended
another 8765 each redeploy (https://tracker.ceph.com/issues/76564).
Fix
---
Always set ``daemon_spec.ports = ports + [service_discovery_port]`` so
each prepare uses a fresh list plus exactly one discovery port.
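The difference between the old and new assignment can be sketched in isolation. This is a simplified model of the behaviour described above, not the MgrService code; the function names and the example port 8443 are hypothetical, while 8765 is the discovery port mentioned in the problem statement.

```python
SERVICE_DISCOVERY_PORT = 8765  # value taken from the problem description


def prepare_ports_old(carried_over: list[int], module_ports: list[int]) -> list[int]:
    """Old behaviour: module ports replace the list only when non-empty."""
    ports = list(carried_over)  # ports rehydrated from the stored spec
    if module_ports:
        ports = list(module_ports)
    ports.append(SERVICE_DISCOVERY_PORT)  # always appended, even if already present
    return ports


def prepare_ports_new(module_ports: list[int]) -> list[int]:
    """New behaviour: a fresh list plus exactly one discovery port."""
    return list(module_ports) + [SERVICE_DISCOVERY_PORT]


# Three redeploys with an empty ``mgr services`` response.
old_ports = [8443]
for _ in range(3):
    old_ports = prepare_ports_old(old_ports, [])

new_ports = prepare_ports_new([])
```

In the old model the carried-over list gains one more 8765 per redeploy; the new model returns the same list no matter how many times it runs.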
Tests
-----
Add src/pybind/mgr/cephadm/tests/services/test_mgr.py: empty vs
non-empty ``mgr services`` with carried-over duplicate ports. Cases
consolidated from parallel PR https://github.com/ceph/ceph/pull/68879 .
Authors
-------
Kobi Ginon (@kginonredhat) — fix + test integration for this PR.
Raimund Sacherer (@rsacherer) — original fix + tests in #68879;
coordinated to land a single PR (#68915).
Bill Scales [Fri, 15 May 2026 14:39:25 +0000 (15:39 +0100)]
osd: Fix bug when calculating min_peer_features
PeeringState calculates the minimum set of features for the set
of OSDs within a PG. When the peer info has already been cached, a bug
leaves those peers' features out of the calculation. This can lead to
the min feature set including features that not all OSDs have.
Previously this just made some asserts less aggressive than they
should have been. Pull request https://github.com/ceph/ceph/pull/57740
uses min_peer_features to decide how to encode messages to other OSDs.
Midway through an upgrade this bug can cause an OSD to send
the wrong version of a message to a downlevel OSD causing
it to abort.
Fixes: https://tracker.ceph.com/issues/76600
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
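The effect of skipping cached peers can be sketched with feature sets modelled as bitmasks combined by bitwise AND. That modelling, the `Peer` tuple, and the bit values are assumptions of this sketch, not the PeeringState code.

```python
from typing import NamedTuple


class Peer(NamedTuple):
    features: int  # hypothetical feature bitmask advertised by the peer
    cached: bool   # whether the peer's info was already cached


def min_features_buggy(own: int, peers: list[Peer]) -> int:
    """Model of the bug: peers with cached info are left out."""
    result = own
    for peer in peers:
        if peer.cached:
            continue  # cached peers never contribute their features
        result &= peer.features
    return result


def min_features_fixed(own: int, peers: list[Peer]) -> int:
    """Every peer contributes, cached or not."""
    result = own
    for peer in peers:
        result &= peer.features
    return result


peers = [
    Peer(features=0b111, cached=False),  # upgraded OSD
    Peer(features=0b011, cached=True),   # downlevel OSD, info already cached
]
buggy = min_features_buggy(0b111, peers)
fixed = min_features_fixed(0b111, peers)
```

In the buggy model the result still advertises the top bit even though the downlevel peer lacks it, which is the condition under which a message could be encoded in a version that peer cannot decode.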
Casey Bodley [Fri, 15 May 2026 14:40:50 +0000 (10:40 -0400)]
rgw/beast: add ssl_ciphersuites option for tls 1.3
the existing ssl_ciphers option is passed to `SSL_CTX_set_cipher_list()`
which only applies to "TLSv1.2 and below". there's a separate
`SSL_CTX_set_ciphersuites()` for TLSv1.3
because the frontend's default configuration for `ssl_options` accepts
both 1.2 and 1.3, users may need to specify ciphers for each. that's why
`ssl_ciphersuites` is introduced as a separate option
Matthew N. Heler [Fri, 15 May 2026 11:11:35 +0000 (06:11 -0500)]
doc/rados/configuration: recommend wpq for EC clusters seeing slow ops
On large EC clusters, mClock currently routes recovery EC sub-reads
through the immediate queue, skipping throttling. When many OSDs read
from one source during recovery, that source's high-priority queue
saturates and starves client work, producing slow ops. Recommend
falling back to wpq in the mClock config reference until the
scheduler treats those reads as background.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Ronen Friedman [Wed, 13 May 2026 08:06:17 +0000 (08:06 +0000)]
qa/crimson: disable test-coredump
This test is there to aid in debugging coredump creation and collection.
It always fails (as it intentionally leaves the coredump in place for collection).
Disabling it in the suite to avoid noise, but leaving the test in place for manual
runs when needed.
Ronen Friedman [Sun, 10 May 2026 13:16:02 +0000 (13:16 +0000)]
qa/crimson: add coredump generation test using ASOK assert
Trigger a crash on a Crimson OSD via the admin socket 'assert'
command and verify the OSD goes down and a coredump is produced.
Exercises the debug_asok_assert_abort path added in the companion
commit.
For non-ASCII object keys, raw UTF-8 bytes end up in the signed
x-amz-meta-rgwx-source-key header. Strict S3-compatible backends
normalize non-ASCII bytes when verifying SigV4, producing a signature
mismatch -> HTTP 403, surfaced in LC as -EACCES (-13).
url_encode() the value before signing. The header is write-only,
so no decode is needed.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
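The idea of percent-encoding the key before it is signed can be sketched with Python's standard library. `encode_source_key` is a hypothetical helper, and `urllib.parse.quote` is only a stand-in: RGW's url_encode() may differ in which characters it leaves unescaped.

```python
from urllib.parse import quote


def encode_source_key(key: str) -> str:
    """Percent-encode the UTF-8 bytes of an object key so the header value is ASCII."""
    # safe="" also escapes "/", so the whole key becomes a single opaque token.
    return quote(key, safe="")


ascii_key = encode_source_key("plain-key_1.txt")
accented = encode_source_key("é")
mixed = encode_source_key("dir/naïve")
```

Once the value is pure ASCII, a backend that normalizes non-ASCII bytes during SigV4 verification sees the same bytes that were signed.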
rgw: group lifecycle versioned deletes to reduce OLH contention
When multiple versions of the same key expire together, each delete
does a read-modify-write of the OLH on the same bucket index shard.
Buffer versions of the same key during listing and flush on key change.
Groups with multiple versions get pre-evaluated, then hard deletes go
through rgw::multi_delete::dispatch() which skips OLH updates on all
but the last delete. LCOpRule::process() is split into evaluate() and
execute() to support this two-phase pattern.
Non-versioned buckets and single-version groups are unchanged.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
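The buffer-and-flush-on-key-change pattern can be sketched with `itertools.groupby`, which groups consecutive items sharing a key. This is an illustration only; `plan_deletes` and the tuple layout are hypothetical, and the real evaluate()/execute() split is not modelled.

```python
from itertools import groupby


def plan_deletes(versions: list[tuple[str, str]]) -> list[tuple[str, str, bool]]:
    """Plan deletes for (key, version_id) pairs given in listing order.

    Returns (key, version_id, update_olh) tuples, where update_olh is True
    only for the last delete within each run of the same key.
    """
    plan: list[tuple[str, str, bool]] = []
    for key, group in groupby(versions, key=lambda v: v[0]):
        batch = list(group)  # buffered versions, flushed on key change
        last = len(batch) - 1
        for i, (_, version_id) in enumerate(batch):
            plan.append((key, version_id, i == last))
    return plan


listing = [("a", "v3"), ("a", "v2"), ("a", "v1"), ("b", "v1")]
plan = plan_deletes(listing)
```

A key with three expiring versions yields one OLH update instead of three, while a single-version key still gets exactly one, matching the statement that single-version groups are unchanged.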
mgr/dashboard: adding daemon_name as an arg to nvmeof get bundle API
When cephadm-signed certificates are in use, we need to know exactly which
nvmeof daemon is being used so that we get the correct certificates for
that daemon in particular
rgw: extract multi-delete OLH grouping for use by lifecycle
Move the OLH-aware dispatch logic out of RGWDeleteMultiObj into a
standalone rgw::multi_delete::dispatch() so lifecycle expiration
can group versioned deletes of the same key and skip redundant
OLH updates.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Patrick Donnelly [Thu, 14 May 2026 17:26:33 +0000 (13:26 -0400)]
.github/workflows/releng-audit: consolidate into single job
In order to make this a required check someday, we can't have the main
job ever be skipped. So, consolidate into a single job and skip actions
based on the router logic.
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Vallari Agrawal [Thu, 14 May 2026 11:36:00 +0000 (17:06 +0530)]
mgr/dashboard: show warning message in nvmeof cli
If the return status is 0 and error_message is set, then it is
a warning message from the gateway; add it to the output string
for plain text output.
It is already there for JSON output.
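The rule above can be sketched as a small formatting helper. `format_plain` and the "Warning:" prefix are hypothetical and do not reflect the dashboard's actual output format.

```python
def format_plain(status: int, error_message: str, output: str) -> str:
    """Append a gateway warning to plain-text output when the call succeeded."""
    if status == 0 and error_message:
        # Success plus a message means a warning, not a failure.
        return f"{output}\nWarning: {error_message}"
    return output


with_warning = format_plain(0, "listener already exists", "done")
without_warning = format_plain(0, "", "done")
```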
Venky Shankar [Thu, 26 Feb 2026 14:43:19 +0000 (20:13 +0530)]
tools/cephfs: always execute scan_{extents,inodes,frags} and cleanup
Even when the number of objects reported from pool stats is zero.
Pool stats metrics are delayed and cannot be fully relied on for
accuracy. Trusting the number of objects (esp. when reported as
zero) could result in missed steps during data-scan execution.
So, we try to do two things now:
1. ProgressTracker::display_progress() will display progress only
when the total item count exceeds the number of progress items.
2. Refresh total object count during each iteration of processing
objects. This might be a bit too much, so we probably need to
do this periodically rather than on each iteration.
Venky Shankar [Thu, 14 May 2026 09:27:21 +0000 (14:57 +0530)]
Merge PR #64774 into main
* refs/pull/64774/head:
test_cephfs.py: delete purge_dir() helper method, use rmtree() instead
test_cephfs.py: remove rendundant call to purge_dir()
test_cephfs.py: test rmtree on root
pybind/cephfs: don't attempt to unlink root in rmtree
test_cephfs.py: test rmtree with and without should_cancel
pybind/cephfs: make should_cancel option parameter for rmtree()
mgr/volumes: clone using cptree() from cephfs python bindings
test_cephfs: add unit tests for cptree() in cephfs python bindings
test/pybind/assertions: add helper method assert_less
pybind/cephfs: use depth-first, non-recursive approach for cloning
test_cephfs: call object setup/teardown for all tests in TestWithRootUser
test_cephfs.py: add tests for utimensat()
pybind/cephfs: add python bindings for utimensat()
qa/cephfs: add tests for chownat()
pybind/cephfs: add python bindings for chownat()
test_cephfs.py: add tests for chmodat()
pybind/cephfs: add python bindings for chmodat()
test_cephfs.py: add tests for symlinkat()
pybind/cephfs: add python binding for symlinkat()
test_cephfs.py: add test for readlinkat()
pybind/cephfs: add python binding for readlinkat()
pybind/cephfs: add tests for statxat()
pybind/cephfs: add python bindings for statxat()
test_cephfs.py: add tests for mkdirat()
pybind/cephfs: add python binding for mkdirat()
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jos Collin <jcollin@redhat.com>
Redouane Kachach [Wed, 11 Feb 2026 13:36:01 +0000 (14:36 +0100)]
mgr/cephadm: cleanup leftover certs/keys after cert_src changes
This PR improves certificate cleanup when a service switches
certificate sources (cephadm-signed <-> inline/reference). It also adds
best-effort post-remove helpers to purge stale cephadm-managed
cert/key pairs. Inline-stored (non-editable) certs/keys are removed,
while referenced/user-managed (editable) credentials are preserved.
Redouane Kachach [Wed, 11 Feb 2026 11:17:55 +0000 (12:17 +0100)]
mgr/cephadm: adding tls fields as deps for services with TLS support
This is especially important for inline certificates, so the certmgr
store is updated automatically whenever the user changes the values in
the spec and reapplies it.