mgr/cephadm: plumb force_delete_data through daemon/service removal
This PR wires the already existing `force_delete_data` flag in the
binary through cephadm’s daemon and service removal paths, so that
commands such as `ceph orch rm service` or equivalent daemon removal
can explicitly ask for data deletion instead of the default "move
under <fsid>/removed/" for daemons such as Prometheus, OSD and mon.
Keep parsed command data alive while running hooks to avoid a
stack-use-after-return in Formatter::create().
Return -EAGAIN from PGCommand when the OSDMap is not ready.
Crimson OSD was missing the PG admin/tell hooks that classic OSD exposes, and it
did not accept the legacy `rados_pg_command()` / `ceph pg <pgid> <cmd>` JSON form
(e.g. `{"prefix":"pg","pgid":"1.0","cmd":"query",...}`), so `ceph pg <pgid> query`
failed.
Adds a `pg` old-form wrapper hook that exists to advertise the classic
`pgid` + `cmd` + optional `arg` signature. The runtime dispatch rewrites
this to the real subcommand.
This updates parse_cmd to rewrite `prefix=pg` requests to the requested
subcommand and remap the generic `arg` field to the concrete parameter
names (`offset` for `list_unfound`, `mulcmd` for `mark_unfound_lost`)
so validation/parsing is unambiguous.
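The rewrite described above can be sketched in Python as follows. This is a minimal illustration, not the Crimson parse_cmd code: the function name `rewrite_legacy_pg_command`, the `ARG_REMAP` table and the dict-based command map are hypothetical; only the field names (`prefix`, `pgid`, `cmd`, `arg`) and the two `arg` remappings come from the description above.

```python
# Hypothetical sketch of the legacy "pg" command rewrite; not the C++ parse_cmd.

# Concrete parameter name that the generic "arg" field maps to, per subcommand.
ARG_REMAP = {
    "list_unfound": "offset",
    "mark_unfound_lost": "mulcmd",
}


def rewrite_legacy_pg_command(cmdmap: dict) -> dict:
    """Rewrite {"prefix": "pg", "cmd": X, ...} into {"prefix": X, ...}."""
    if cmdmap.get("prefix") != "pg" or "cmd" not in cmdmap:
        return dict(cmdmap)  # not the legacy form; pass through unchanged

    # Copy everything except the legacy-only fields.
    rewritten = {k: v for k, v in cmdmap.items() if k not in ("cmd", "arg")}
    subcommand = cmdmap["cmd"]
    rewritten["prefix"] = subcommand

    if "arg" in cmdmap:
        # Assumption: subcommands without a remap entry keep the name "arg".
        rewritten[ARG_REMAP.get(subcommand, "arg")] = cmdmap["arg"]
    return rewritten
```

With this sketch, `{"prefix": "pg", "pgid": "1.0", "cmd": "query"}` becomes `{"prefix": "query", "pgid": "1.0"}`, so the per-subcommand validation only ever sees the new form.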
Add a standalone concept page for the OSDMap require_osd_release field,
the upgrade-gate counterpart to require_min_compat_client. Cover:
- how to set it and how to check it;
- the full set of pre-commit guards the monitor runs, rendered as a
table with each guard's error text and bypass status;
- which commands and features become available as the flag is raised,
per release;
- the OSD boot window that refuses OSDs more than two releases ahead
of the flag;
- the OSD_UPGRADE_FINISHED health warning that prompts admins to set
the flag after an upgrade;
- the initial value on new clusters and the two mon_debug_* knobs
that override it for testing.
Also cross-link the new page from the related-flags table on
require-min-compat-client.rst, and from the rados operations index.
Add a standalone concept page for the OSDMap require_min_compat_client
field, covering: how to set and check it, the non-monotonic lowering
behavior (with the features-in-use floor derived from
OSDMap::get_min_compat_client()), and the operator commands it gates.
Include tables for the floor-pinning features and the flag-gated
commands, so operators can reason about transitions without reading
OSDMonitor.cc.
Cross-reference to the CephFS per-filesystem required_client_features
mechanism, which is the MDSMap-side equivalent for client-protocol
features. Add an anchor on the existing CephFS Required Client Features
section so the cross-reference resolves.
Link the new page from the rados operations index.
doc: document ceph nvmeof CLI subcommands for target configuration
Replaces verbose podman run container commands with native ceph nvmeof
CLI subcommands. The nvmeof-cli container approach is preserved as an
alternative in a note block, with a clarification that its option names
differ from the ceph nvmeof CLI.
Direct leak of 25846 byte(s) in 28 object(s) allocated from:
#0 malloc
#1 /usr/bin/python3.10+0x16d7be
Direct leak of 20456 byte(s) in 29 object(s):
#1 PyDict_Copy /usr/bin/python3.10+0x16ae06
Direct leak of 8456 byte(s) in 14 object(s):
#1 _PyObject_GC_NewVar /usr/bin/python3.10+0x14fc57
The unittests show the same shape against /lib/.../libpython3.10.so.
These are CPython 3.10 runtime artefacts, not Ceph bugs. In CPython
3.10, Py_FinalizeEx() leaves a set of interpreter-internal allocations
(module namespace dict copies, GC-tracked variable-size containers,
type-method caches, interned strings) heap-allocated until process
exit; the OS reclaims them. PEP 683 (Immortal Objects, accepted for
Python 3.12) extends pylifecycle.c::finalize_modules() to deallocate
these remaining objects during runtime shutdown:
"During runtime shutdown, the strategy will be to first let the
runtime try to do its best effort of deallocating these instances
normally. Most of the module deallocation will now be handled by
pylifecycle.c:finalize_modules() where we clean up the remaining
modules as best as we can."
-- PEP 683, https://peps.python.org/pep-0683/
The qa/lsan.supp file has carried the empirical note "python3.12
doesn't leak anything" since the file was introduced (commit
8c099a534044bf, "asan: add qa/lsan.supp for leak sanitizer
suppressions", Mar 2023). An empirical reproduction with a minimal
Py_Initialize()/Py_FinalizeEx() program built against Python 3.13
under -fsanitize=address reports zero leaks; the same minimal test
against Python 3.10 (CI environment) reports the leaks shown above.
Add three suppressions to qa/lsan.supp's existing 3.9-3.11 block:
- leak:^PyDict_Copy and leak:^_PyObject_GC_NewVar match the two
exported CPython functions visible in the leak stacks.
- leak:python3.10 substring-matches the module path in the stack
frame, catching unsymbolised offsets in both the /usr/bin/python3.10
binary and the libpython3.10.so.1.0 shared object. Mirrors the
existing leak:libsqlite3.so pattern earlier in the file.
The whole block (including the existing PyMem_Malloc entry above) can
be removed once CI runs on Python >= 3.12.
test/osd: fix Message and Connection refcount leaks
unittest_backend_basics and unittest_ecfailover_with_peering fail under
ASan with megabytes of leaks per run, all originating in MockMessenger
and the listeners that feed into it. A representative LSan stack:
Three add_ref/consume mismatches conspire to leak the sender Message,
the decoded receiver Message, and its MockConnection:
- MockPGBackendListener::send_message receives a raw `m` with its +1
from `new T(...)` and wraps it in `MessageRef(m)` with the default
add_ref=true, bumping instead of consuming the +1. Switch to
add_ref=false.
- MockPeeringListener::send_cluster_message detaches an intrusive_ptr
into a raw pointer and then hands it to MockMessenger, which bumps
with add_ref=true; the +1 transferred via detach() is never
consumed. Pass `m.get()` so the caller's intrusive_ptr keeps
ownership and releases on scope exit.
- MockMessenger's receiver lambda lets decode_message()'s +1 (the
Ceph convention) fall out of scope, and constructs the
`new MockConnection(from_osd)` ConnectionRef with the default
add_ref=true so the +1 from `new` leaks too. Construct
ConnectionRef with add_ref=false, and wrap the decoded Message in a
`MessageRef{decoded_msg, /*add_ref=*/false}` so the smart pointer
releases the +1 on every exit path.
While at it, switch the handler API from raw `Message*` / `MsgType*`
to `MessageRef` / `boost::intrusive_ptr<MsgType>` so lifetime is
managed by the smart pointer end-to-end. Handlers that need to extend
the Message's lifetime call `m.detach()` to transfer the +1 to the
raw pointer the consumer takes ownership of: e.g. OpRequest stores
`Message*` and put()s in its destructor, so PGBackendTestFixture.cc
now passes `m.detach()` into `create_request<OpRequest, Message*>(...)`
in place of an explicit `intrusive_ptr_add_ref(m)` followed by `m`.
The peering handler in ECPeeringTestFixture.cc does not store the
Message; it just lets the typed intrusive_ptr drop on scope exit.
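The add_ref/consume mismatch can be illustrated with a toy Python model. This is only a sketch of the counting rules: `RefCounted` and `Ref` are hypothetical stand-ins for a Ceph ref-counted object and `boost::intrusive_ptr`, and `release()` models the smart pointer's destructor running at scope exit.

```python
class RefCounted:
    """Toy intrusively ref-counted object; starts at 1, like `new T(...)`."""

    def __init__(self):
        self.nref = 1


class Ref:
    """Toy stand-in for boost::intrusive_ptr<T> (hypothetical model)."""

    def __init__(self, obj, add_ref=True):
        self.obj = obj
        if obj is not None and add_ref:
            obj.nref += 1  # bump: the caller keeps its own +1

    def release(self):
        """Models the destructor: drop the reference this handle holds."""
        if self.obj is not None:
            self.obj.nref -= 1
            self.obj = None

    def detach(self):
        """Hand the held +1 to the caller without decrementing."""
        obj, self.obj = self.obj, None
        return obj


def refs_left_after_scope(add_ref: bool) -> int:
    msg = RefCounted()               # +1 from "new"
    ref = Ref(msg, add_ref=add_ref)  # wrap the raw pointer
    ref.release()                    # smart pointer leaves scope
    return msg.nref                  # 0 means freed, >0 means leaked


def refs_left_after_detach() -> int:
    msg = RefCounted()
    raw = Ref(msg, add_ref=False).detach()  # +1 now owned by the raw pointer
    raw.nref -= 1                           # the consumer's put()
    return raw.nref
```

In this model `refs_left_after_scope(True)` leaves one reference behind (the leak fixed here), while `refs_left_after_scope(False)` and `refs_left_after_detach()` both end at zero.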
doc/scripts: use raw string for regex in gen_state_diagram.py
Python 3.12 emits SyntaxWarning for invalid escape sequences in ordinary
string literals. The re.search() call on line 162 was the only pattern
in the file passed as a non-raw string, causing:
doc/scripts/gen_state_diagram.py:162: SyntaxWarning: invalid escape
sequence '\s'
i = re.search("return\s+transit<\s*(\w*)\s*>()", line)
Add the r"" prefix to match the other re.search / re.finditer / re.sub
call sites in the same file. No behavior change; \s was already being
interpreted as a regex whitespace class because Python leaves unknown
escapes untouched, but this will become a SyntaxError in a future
Python release.
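For illustration, the corrected call shape looks like the sketch below. The input line is a made-up example, and `TRANSIT_RE` is a hypothetical name; the pattern itself is the one quoted in the warning, with the r"" prefix added.

```python
import re

# Raw string: the backslashes reach the regex engine unchanged, and
# Python 3.12 no longer warns about "\s" being an invalid string escape.
# Note the trailing "()" is an empty regex group, not literal parentheses.
TRANSIT_RE = re.compile(r"return\s+transit<\s*(\w*)\s*>()")

line = "    return transit< Active >();"  # hypothetical input line
match = TRANSIT_RE.search(line)
state = match.group(1) if match else None
```

Because Python leaves unknown escapes in place, `"return\\s+"` and `r"return\s+"` denote the same string; the raw form simply avoids the warning.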
rgw: multisite sync data_log error handling broken in tentacle
The data_log error handling was broken in a recent refactor such that it
is not tolerant of missing data_log rados objects. This restores the
tolerant behaviour.
Fixes: https://tracker.ceph.com/issues/76228
Signed-off-by: Alexander Hussein-Kershaw <alexander.husseinkershaw@alianza.com>
Matthew N. Heler [Fri, 17 Apr 2026 18:13:52 +0000 (13:13 -0500)]
rgw/lc: add coroutine support for cloud-transition and cloud-restore
LCWorker's cloud-tiering was blocking the io_context on every HTTP
call, so rgw_lc_max_wp_worker coroutines ended up running one at a
time instead of in parallel. Same story on the restore side.
Pass the worker's optional_yield through RGWLCCloudTierCtx and use it
from the stream classes, the multipart and status-obj helpers, and
cloud_tier_restore. Transitions and restores now actually yield.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Add unit tests to cover the raw prepare help text for --osd-fsid,
assert generate_uuid is used when no osd_fsid is supplied and
assert an externally provided osd_fsid is passed through to
create_id without generating a new UUID.
ceph-volume: add --osd-fsid support to raw mode prepare
The LVM mode already supports --osd-fsid to allow external tools
(e.g., Kubernetes operators) to pre-register an OSD ID+UUID via
"ceph osd new" and then pass both to ceph-volume, ensuring the
operator retains full control of the OSD ID lifecycle and can
reliably clean up on prepare failure (no orphan OSDs).
The raw mode was missing this support: prepare() unconditionally
called system.generate_uuid(), ignoring any --osd-fsid value.
When an operator pre-registered osd.N with uuid_A and then ran
"ceph-volume raw prepare --osd-id N --dmcrypt", ceph-volume
generated uuid_B internally and called "ceph osd new uuid_B N",
which failed with EINVAL because the ID was already registered
with a different UUID.
This commit:
- Adds --osd-fsid argument to the raw mode argument parser
(devices/raw/common.py), consistent with the LVM mode.
- Changes raw.prepare() to honor an externally provided osd_fsid,
falling back to generate_uuid() only when none is given
(objectstore/raw.py), consistent with the LVM mode.
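The fallback logic can be sketched as below. This is not the ceph-volume code: `build_parser`, `resolve_osd_fsid` and the local `generate_uuid` are hypothetical stand-ins (the last one for `system.generate_uuid()`), and the parser only carries the two options relevant here.

```python
import argparse
import uuid


def generate_uuid() -> str:
    # Stand-in for ceph-volume's system.generate_uuid().
    return str(uuid.uuid4())


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, trimmed-down parser with only the options discussed here.
    parser = argparse.ArgumentParser(prog="raw-prepare-sketch")
    parser.add_argument("--osd-id", help="reuse an existing OSD id")
    parser.add_argument("--osd-fsid", help="reuse an existing OSD fsid (UUID)")
    return parser


def resolve_osd_fsid(args: argparse.Namespace) -> str:
    # Honor an externally provided fsid; generate one only when none is given.
    return args.osd_fsid or generate_uuid()
```

An operator that pre-registered osd.N with uuid_A would pass `--osd-id N --osd-fsid uuid_A`, so the UUID later handed to "ceph osd new" matches the registered one.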
Ronen Friedman [Mon, 9 Mar 2026 17:23:18 +0000 (17:23 +0000)]
crimson/osd: use a unified super-block for devices
This commit refactors the on-hardware super-block structure
used by seastore into a unified format that
can accommodate all three device types (HDD, ZBD, RBM).
All devices now have a 60-byte header at address 0,
similar to the existing BlueStore layout. A 23-byte magic
string ("CRIMSON_DEVICE") is placed at the beginning of
the header, followed by 37 bytes of null padding (to
match the existing 60 bytes of the super-block), and
then the DENC-encoded device_superblock_t structure starting
at offset 60.
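The header layout can be sketched with Python's struct module. This is an illustration of the byte arithmetic only: the text does not say how the 14-character string fills the 23-byte magic field, so null padding is assumed here, and `build_header` / `split_device_image` are hypothetical helper names.

```python
import struct

HEADER_SIZE = 60
MAGIC_FIELD_SIZE = 23
PADDING_SIZE = HEADER_SIZE - MAGIC_FIELD_SIZE  # 37 bytes of null padding
MAGIC = b"CRIMSON_DEVICE"

# "23s" null-pads the magic to 23 bytes (assumption); "37x" emits 37 zero bytes.
HEADER = struct.Struct(f"<{MAGIC_FIELD_SIZE}s{PADDING_SIZE}x")


def build_header() -> bytes:
    """Return the 60-byte header placed at address 0."""
    return HEADER.pack(MAGIC)


def split_device_image(image: bytes) -> bytes:
    """Check the magic and return the bytes starting at offset 60."""
    if len(image) < HEADER_SIZE:
        raise ValueError("image shorter than the 60-byte header")
    (magic_field,) = HEADER.unpack(image[:HEADER_SIZE])
    if magic_field.rstrip(b"\0") != MAGIC:
        raise ValueError("bad super-block magic")
    # The encoded device_superblock_t would be decoded from here on.
    return image[HEADER_SIZE:]
```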
A unified device_config_t is now used for all device types.
The per-shard data structure is also unified, now including a union
of all relevant fields for each device type.
We are also adding a check for the super-block magic value in the
RBMDevice::read_rbm_superblock() method, similar to the existing check
in SegmentManager::read_segment_manager_superblock().
Presigned URLs using SigV2 do not contain x-amz-credential, causing the
log record field Authentication Type to be incorrectly set to '-'.
The fix checks for the presence of the x-amz-expires and Expires
parameters instead.
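A minimal Python sketch of the detection follows. It is not the RGW code: `auth_type_for_log` is a hypothetical helper, the case-insensitive parameter match is an assumption, and the returned labels (`AuthHeader`, `QueryString`, `-`) are assumed to follow the S3 server access log format.

```python
from urllib.parse import parse_qs, urlsplit


def auth_type_for_log(url: str, has_auth_header: bool = False) -> str:
    """Classify a request for the Authentication Type log field."""
    if has_auth_header:
        return "AuthHeader"
    query = urlsplit(url).query
    # Assumption: parameter names are compared case-insensitively.
    params = {name.lower() for name in parse_qs(query, keep_blank_values=True)}
    # SigV4 presigned URLs carry X-Amz-Expires; SigV2 ones carry Expires.
    if "x-amz-expires" in params or "expires" in params:
        return "QueryString"
    return "-"
```

Keying on the expiry parameters covers both signature versions, whereas a check for x-amz-credential only ever matches SigV4.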
John Mulligan [Thu, 23 Apr 2026 21:37:28 +0000 (17:37 -0400)]
CODEOWNERS: add an smb group for various smb related files
Add a new smb group that covers parts of orch that manage smb as well as
the cephfs proxy. This will help automatically notify smb focused devs
on PRs.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
rgw: account presigned POST bytes_received in usage log
Presigned POST uploads reported bytes_received=0 in the usage log, while
equivalent PUT uploads reported the full object size.
The AccountingFilter starts disabled, and only the PUT path explicitly
enables it around its body read. POST bodies are consumed via
RGWPostObj_ObjStore::read_with_boundary, which did not toggle accounting
around its recv_body calls, so every byte pulled off the wire during
multipart parsing was silently dropped by the accounter.
Wrap the two recv_body calls in read_with_boundary with
ACCOUNTING_IO(s)->set_account(true/false), mirroring the PUT pattern in
RGWPutObj_ObjStore::get_data. POST uploads now account correctly; PUT
behavior is unchanged.
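The effect of the toggle can be shown with a toy Python model. The class below is a hypothetical stand-in for the accounting filter, not the RGW implementation, and `accounted` is an illustrative helper that plays the role of the set_account(true/false) pair.

```python
from contextlib import contextmanager


class ToyAccountingFilter:
    """Hypothetical model: counts received bytes only while enabled."""

    def __init__(self):
        self.enabled = False      # the filter starts disabled
        self.bytes_received = 0

    def set_account(self, enabled: bool) -> None:
        self.enabled = enabled

    def recv_body(self, chunk: bytes) -> int:
        if self.enabled:
            self.bytes_received += len(chunk)
        return len(chunk)


@contextmanager
def accounted(io: ToyAccountingFilter):
    """Enable accounting around a body read, then restore it."""
    io.set_account(True)
    try:
        yield io
    finally:
        io.set_account(False)


def read_post_body(io: ToyAccountingFilter, chunks, wrap: bool) -> int:
    for chunk in chunks:
        if wrap:
            with accounted(io):
                io.recv_body(chunk)
        else:
            io.recv_body(chunk)  # bytes are read but never counted
    return io.bytes_received
```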
Signed-off-by: Lumir Sliva <61183145+lumir-sliva@users.noreply.github.com>
Tracker: https://tracker.ceph.com/issues/76157
Reported-by: Maxim Monin
rgw: read_obj_policy() consults s3:prefix when deciding between 403/404
when read_obj_policy() gets ENOENT, it only returns 404 NoSuchKey if the
requester has s3:ListBucket permission. however, policy that allows
s3:ListBucket may be conditional on the s3:prefix to restrict listings
to certain paths/object names. add the requested object name to the iam
environment as s3:prefix to match aws behavior here
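the decision can be sketched in python with a toy policy model. this is not the rgw policy engine: `allows_list_bucket` and `status_for_missing_object` are hypothetical names, and the only condition modelled is a prefix match on s3:prefix.

```python
from typing import Optional


def allows_list_bucket(allowed_prefix: Optional[str], env: dict) -> bool:
    """Toy s3:ListBucket statement, optionally conditional on s3:prefix."""
    if allowed_prefix is None:
        return True  # unconditional allow
    # Models a StringLike condition of the form "<allowed_prefix>*".
    return env.get("s3:prefix", "").startswith(allowed_prefix)


def status_for_missing_object(object_name: str,
                              has_list_statement: bool,
                              allowed_prefix: Optional[str] = None) -> int:
    """Pick 404 or 403 after the object lookup returned ENOENT."""
    # The requested object name is added to the environment as s3:prefix.
    env = {"s3:prefix": object_name}
    if has_list_statement and allows_list_bucket(allowed_prefix, env):
        return 404  # requester may list this path: NoSuchKey
    return 403      # do not reveal whether the key exists
```

with a policy that only allows listing under `public/`, a missing `public/a.txt` yields 404 while a missing `private/a.txt` yields 403.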
* refs/pull/65656/head:
client: do not allow zero-length reads
src/test: test zero-length async-fsync read using ceph_ll_nonblocking_readv_writev
src/test: test zero-length async-fsync read using ll_preadv_pwritev
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Christopher Hoffman <choffman@redhat.com>
Generalize prepare_rewrite_publish_to_prior() into stage_visibility_handoff().
* Introduce should_use_no_conflict_publish().
* Replace is_rewrite_transaction() checks with should_use_no_conflict_publish(),
so adding new no-conflict users becomes straightforward.
* Stop committing metadata (commit_state + sync_checksum) during prepare_record()
(pre-commit). While it is correct for rewrite, doing it pre-commit doesn't buy
us anything today because readers are still blocked until the publish finishes.
Moving the metadata commit to the after-commit phase would also make future
non-rewrite users easier to support.
This is a prep step for expanding no-conflict publish coverage.
The test has been flaky on loaded CI, failing at different assertions
depending on where the expiration budget runs out:
[ RUN ] QuiesceDbTest.RepeatedQuiesceAwait
src/test/mds/TestQuiesceDb.cc:1112: Failure
Expected equality of these values:
ERR(115)
Which is: Operation now in progress(115)
run_request(...)
Which is: Connection timed out(110)
[ FAILED ] QuiesceDbTest.RepeatedQuiesceAwait (627 ms)
The test proves its central invariant -- "await resets the expiration
timer" -- indirectly, by having the set survive a series of
sleep_for(expiration/2) + await pairs. With expiration=0.1s that
leaves only ~20ms above the 80ms consumed by the two 40ms
release-awaits, and the margin is regularly eaten by scheduler
jitter. A similar overrun inside the quiesce loop produces the other
variant, where sleep_for(expiration/2) overshoots and the gap
between two awaits exceeds the expiration.
The per-iteration budget needs to cover ~3 CFS schedule latencies
(test wakes from sleep_for, manager wakes on notify, test wakes on
completion) plus mutex/queue processing, which is roughly 20ms
nominal on Linux with sched_latency_ns=6ms and up to 30-40ms under
heavy contention. Kernels built with a lower CONFIG_HZ coarsen
ceph::coarse_real_clock's CLOCK_REALTIME_COARSE reads, but that
affects only read precision, not the dominant scheduling-latency
term.
Fix by:
1. Raising expiration from 0.1s to 0.2s, so the sleep=expiration/2
margin is ~3x the worst-case per-iteration overhead.
2. Reducing the quiesce loop from 10 to 3 iterations, since three
iterations already span 1.5x expiration cumulatively -- enough
to prove the resets are extending the deadline.
3. Replacing the 2-iteration release-EINPROGRESS loop with a single
release+await(sec(0)) and an at_age equality assertion, so
"release does not reset the timer" is checked directly rather
than inferred from multiple awaits racing the expiration.
4. Using await = 2*expiration for the final ETIMEDOUT so the
expiration is guaranteed to fire inside the wait window rather
than on its boundary.
5. Tracking at_age across the quiesce loop and asserting it advances.
The loop's ASSERT_EQ(OK(),...) already fails if resets stop
working, but the at_age check also catches regressions where
at_age is updated to a stuck or stale value.
The test now runs in ~500ms, comparable to the original.
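The timing budget above can be checked with a few lines of arithmetic. The helper names are hypothetical, and the 35ms worst-case overhead used in the example is simply the midpoint of the 30-40ms range quoted above.

```python
def release_margin_ms(expiration_ms: int, release_await_ms: int, awaits: int) -> int:
    """Slack left after the release-awaits have consumed their share."""
    return expiration_ms - awaits * release_await_ms


def loop_span_in_expirations(iterations: int, expiration_ms: int) -> float:
    """Cumulative sleep of the quiesce loop, in units of the expiration."""
    sleep_ms = expiration_ms / 2
    return iterations * sleep_ms / expiration_ms


def sleep_margin_ratio(expiration_ms: int, worst_overhead_ms: int) -> float:
    """How many worst-case overheads fit in the slack after each sleep."""
    margin_ms = expiration_ms - expiration_ms / 2
    return margin_ms / worst_overhead_ms


old_margin = release_margin_ms(100, 40, 2)   # original test: 20ms of slack
new_span = loop_span_in_expirations(3, 200)  # three iterations: 1.5x expiration
new_ratio = sleep_margin_ratio(200, 35)      # roughly 3x the worst-case overhead
```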
test/cli/radosgw-admin: align help golden with period/zone delete
Update help.t expected output to match the inline help text updated in
this PR: period rm -> period delete and zone rm -> zone delete.

This keeps the CLI golden test consistent with radosgw-admin --help
output and addresses make check failures for this branch.
ceph-volume: has_bluestore_label checks all bluestore label replica offsets
BlueStore replicates the block device label at fixed offsets (0 and
multiples of 1 GB up to 1000 GB). has_bluestore_label() only read the
first 22 bytes of the device, so disks with a wiped primary label but
intact replicas were missed.
With this commit, has_bluestore_label() scans each known offset with
seek/read and compares the ASCII prefix as bytes.
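A minimal sketch of such a scan is shown below. It is not the ceph-volume implementation: the offset list (0, 1, 10, 100 and 1000 GiB) is an assumption made for illustration, as is skipping offsets that lie beyond the end of the device; the 22-byte prefix is the ASCII string "bluestore block device".

```python
import os

BLUESTORE_LABEL_PREFIX = b"bluestore block device"  # 22 ASCII bytes
GIB = 1 << 30
# Assumed replica offsets; the authoritative list lives in the Ceph sources.
LABEL_OFFSETS = (0, 1 * GIB, 10 * GIB, 100 * GIB, 1000 * GIB)


def has_bluestore_label(path: str, offsets=LABEL_OFFSETS) -> bool:
    """Return True if any known offset starts with the BlueStore label."""
    try:
        with open(path, "rb") as dev:
            size = dev.seek(0, os.SEEK_END)
            for offset in offsets:
                if offset + len(BLUESTORE_LABEL_PREFIX) > size:
                    continue  # replica location is past the end of the device
                dev.seek(offset)
                if dev.read(len(BLUESTORE_LABEL_PREFIX)) == BLUESTORE_LABEL_PREFIX:
                    return True
    except OSError:
        return False
    return False
```

Passing `offsets` explicitly makes the function easy to exercise against a small regular file in a unit test.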
Adam Kupczyk [Thu, 16 Apr 2026 09:25:19 +0000 (09:25 +0000)]
extblkdev/fcm: Replace errors with health warning
The plugin no longer asserts or fails to load;
instead it raises the following health warnings:
EXTBLKDEV: multivolume fcm will not work properly
EXTBLKDEV: failed accessing FCM utilization log
EXTBLKDEV: bdev_enable_discard not enabled - free space will leak
Adam Kupczyk [Tue, 14 Apr 2026 17:57:42 +0000 (17:57 +0000)]
blk: Expand collect_alerts to allow specialization
Previously BlockDevice::collect_alerts had a fixed
implementation.
Make BlockDevice::collect_alerts virtual, so KernelDevice can
override it.
This commit changes the error message emitted when the device's block
size is smaller than the minimum expected by seastore. This is done to
improve usability and provide an actionable error message.