Oguzhan Ozmen [Tue, 19 May 2026 22:12:35 +0000 (22:12 +0000)]
rgw/datalog: DataLogBackends::trim_entries: fix crash when target_gen > head_gen
When a cluster has no sync zones (single-zone), DataLogTrimCR passes
max_marker() as the trim marker, which encodes target_gen = UINT64_MAX
from gencursor(). In DataLogBackends::trim_entries, after trimming the
head (last) generation, the break condition
if (be->gen_id == target_gen)
is false (e.g. 0 != UINT64_MAX), so the loop attempts its increment
expression:
be = upper_bound(be->gen_id)->second
upper_bound(head_gen) returns end(), and dereferencing end()->second
causes a crash.
Fix: also break when be->gen_id >= head_gen. Once we've trimmed the
head generation there are no further backends in the map, so the
upper_bound dereference in the loop increment will be skipped.
This is a general bug that affects any cluster using max_marker() as a
trim target (i.e. every single-zone deployment).
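The control flow described above can be modelled outside of RGW. The following is a minimal Python sketch, not the actual C++ code: a sorted list stands in for the backend map, bisect_right plays the role of upper_bound, and the names trim_generations and gen_ids are hypothetical. It only illustrates why the extra head_gen check keeps the loop from stepping past the last generation.

```python
import bisect

# Value that max_marker() encodes as target_gen, per the commit message.
UINT64_MAX = 2**64 - 1


def trim_generations(gen_ids, target_gen):
    """Return the generation ids a trim pass would visit, oldest first.

    gen_ids is a sorted, non-empty list standing in for the backend map keys.
    """
    head_gen = gen_ids[-1]
    visited = []
    idx = 0
    while True:
        gen = gen_ids[idx]
        visited.append(gen)  # stand-in for trimming this backend
        # The original condition was only `gen == target_gen`. The added
        # head_gen check stops the loop before it advances past the end.
        if gen == target_gen or gen >= head_gen:
            break
        # Counterpart of upper_bound(gen): first key strictly greater than gen.
        idx = bisect.bisect_right(gen_ids, gen)
    return visited
```

With a single generation and target_gen = UINT64_MAX, the loop visits generation 0 and stops; without the head_gen check, the index would move past the end of the list, which is the Python analogue of dereferencing end().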
Oguzhan Ozmen [Tue, 19 May 2026 23:43:58 +0000 (23:43 +0000)]
test/rgw/datalog: test for trim_entries with max_marker
Verify that DataLogBackends::trim_entries does not crash when called
with max_marker() on a single-generation cluster. The bug causes
upper_bound(head_gen)->second to dereference end() (SIGSEGV)
because the only break condition checked be->gen_id == target_gen,
which is never true when target_gen is UINT64_MAX as encoded by
max_marker() and the cluster has only generation 0.
Oguzhan Ozmen [Tue, 12 May 2026 19:43:13 +0000 (19:43 +0000)]
neocls log trimming (time based): fix infinite loop on ENODATA
This is essentially the same fix as the previous commit.
The time-based use_awaitable_t overload of trim() has the same
issue as the marker-based overload: the try-catch for ENODATA is inside
the for(;;) loop, so ENODATA is caught and swallowed, causing the loop
to retry forever.
Oguzhan Ozmen [Tue, 12 May 2026 19:38:12 +0000 (19:38 +0000)]
neocls log trimming (marker based): fix infinite loop on ENODATA
The use_awaitable_t overload of trim() has the try-catch for
ENODATA (no_message_available) inside the for(;;) loop. When
cls_log_trim returns ENODATA (i.e., nothing left to trim), the exception is
caught and silently swallowed: execution falls through the catch block
back to for(;;), retrying the trim forever.
This should be rare, as three conditions must all be met in a
single-site cluster (one that was once configured as multisite):
- a realm/period exists
- zone endpoints are configured
- data_log* objects exist
In a properly set up multisite cluster, data churn is continuous, so the
issue is hard to notice. In the case the client reported, the multisite
cluster had been reverted to single site, so the data_log objects are
always empty; hence, the issue is pronounced.
This fix adds co_return inside the catch block so ENODATA exits the loop.
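The shape of the bug and of the fix can be sketched in Python. This is an illustrative model only, not the neocls coroutine: FakeLog, trim_once, and trim_loop are hypothetical names, and the max_calls guard exists solely so that a regression would fail fast instead of hanging.

```python
import errno

# ENODATA is defined on Linux; fall back to its Linux value elsewhere.
ENODATA = getattr(errno, "ENODATA", 61)


class FakeLog:
    """Hypothetical log object; each trim call removes up to `batch` entries."""

    def __init__(self, entries, batch=4):
        self.entries = list(entries)
        self.batch = batch
        self.calls = 0

    def trim_once(self):
        self.calls += 1
        if not self.entries:
            # Models cls_log_trim reporting that nothing is left to trim.
            raise OSError(ENODATA, "nothing left to trim")
        del self.entries[: self.batch]


def trim_loop(log, max_calls=1000):
    """Keep trimming until the log reports ENODATA."""
    while True:
        if log.calls >= max_calls:
            raise RuntimeError("trim loop did not terminate")
        try:
            log.trim_once()
        except OSError as e:
            if e.errno != ENODATA:
                raise
            # Counterpart of the co_return added by the fix. Without this
            # return, control would fall back to the top of the loop.
            return
```

Replacing the `return` with `pass` reproduces the reported behaviour in this model: the loop keeps calling trim_once() on an empty log until the guard trips.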
Oguzhan Ozmen [Tue, 12 May 2026 19:31:04 +0000 (19:31 +0000)]
test/neocls/log trimming: reproduce log trimming can go into an infinite loop
Add two tests that call the trim() loop function directly (the
use_awaitable_t overloads) rather than the single-op wrapper used by
existing tests. Two test cases for the marker-based overload:
- trim_loop_all_entries_by_marker: writes 10 entries, trims all, verifies
the loop terminates and entries are gone.
- trim_loop_empty_log_by_marker: trims an empty log object, verifying
the loop terminates on immediate ENODATA.
Without the fix in the following commits, both tests hang indefinitely.
- start a vstart cluster
- run the test: [build] $ ./bin/ceph_test_neocls_log
- the test introduced in this commit stalls forever:
...
[ RUN      ] neocls_log.trim_loop_all_entries_by_marker <-- stalls forever
Add a standalone concept page for the OSDMap require_osd_release field,
the upgrade-gate counterpart to require_min_compat_client. Cover:
- how to set it and how to check it;
- the full set of pre-commit guards the monitor runs, rendered as a
table with each guard's error text and bypass status;
- which commands and features become available as the flag is raised,
per release;
- the OSD boot window that refuses OSDs more than two releases ahead
of the flag;
- the OSD_UPGRADE_FINISHED health warning that prompts admins to set
the flag after an upgrade;
- the initial value on new clusters and the two mon_debug_* knobs
that override it for testing.
Also cross-link the new page from the related-flags table on
require-min-compat-client.rst, and from the rados operations index.
Add a standalone concept page for the OSDMap require_min_compat_client
field, covering: how to set and check it, the non-monotonic lowering
behavior (with the features-in-use floor derived from
OSDMap::get_min_compat_client()), and the operator commands it gates.
Include tables for the floor-pinning features and the flag-gated
commands, so operators can reason about transitions without reading
OSDMonitor.cc.
Cross-reference to the CephFS per-filesystem required_client_features
mechanism, which is the MDSMap-side equivalent for client-protocol
features. Add an anchor on the existing CephFS Required Client Features
section so the cross-reference resolves.
Link the new page from the rados operations index.
doc: document ceph nvmeof CLI subcommands for target configuration
Replaces verbose podman run container commands with native ceph nvmeof
CLI subcommands. The nvmeof-cli container approach is preserved as an
alternative in a note block, with a clarification that its option names
differ from the ceph nvmeof CLI.
doc/scripts: use raw string for regex in gen_state_diagram.py
Python 3.12 emits SyntaxWarning for invalid escape sequences in ordinary
string literals. The re.search() call on line 162 was the only pattern
in the file passed as a non-raw string, causing:
doc/scripts/gen_state_diagram.py:162: SyntaxWarning: invalid escape
sequence '\s'
i = re.search("return\s+transit<\s*(\w*)\s*>()", line)
Add the r"" prefix to match the other re.search / re.finditer / re.sub
call sites in the same file. No behavior change; \s was already being
interpreted as a regex whitespace class because Python leaves unknown
escapes untouched, but this will become a SyntaxError in a future
Python release.
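For illustration, here is the same pattern written as a raw string, together with a small helper. The helper name transit_target is hypothetical and is not part of gen_state_diagram.py; only the pattern itself comes from the commit message.

```python
import re

# Raw string: backslashes reach the regex engine unchanged, and the Python
# compiler has no escape sequences to warn about.
TRANSIT_RE = re.compile(r"return\s+transit<\s*(\w*)\s*>()")


def transit_target(line):
    """Return the state name inside `return transit< ... >`, or None."""
    m = TRANSIT_RE.search(line)
    return m.group(1) if m else None
```

The raw literal is the same string as the explicitly escaped form "return\\s+transit<\\s*(\\w*)\\s*>()", which is why the change does not alter matching behaviour.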
Add unit tests to cover the raw prepare help text for --osd-fsid,
assert generate_uuid is used when no osd_fsid is supplied and
assert an externally provided osd_fsid is passed through to
create_id without generating a new UUID.
ceph-volume: add --osd-fsid support to raw mode prepare
The LVM mode already supports --osd-fsid to allow external tools
(e.g., Kubernetes operators) to pre-register an OSD ID+UUID via
"ceph osd new" and then pass both to ceph-volume, ensuring the
operator retains full control of the OSD ID lifecycle and can
reliably clean up on prepare failure (no orphan OSDs).
The raw mode was missing this support: prepare() unconditionally
called system.generate_uuid(), ignoring any --osd-fsid value.
When an operator pre-registered osd.N with uuid_A and then ran
"ceph-volume raw prepare --osd-id N --dmcrypt", ceph-volume
generated uuid_B internally and called "ceph osd new uuid_B N",
which failed with EINVAL because the ID was already registered
with a different UUID.
This commit:
- Adds --osd-fsid argument to the raw mode argument parser
(devices/raw/common.py), consistent with the LVM mode.
- Changes raw.prepare() to honor an externally provided osd_fsid,
falling back to generate_uuid() only when none is given
(objectstore/raw.py), consistent with the LVM mode.
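The fallback behaviour can be sketched as follows. This is a simplified stand-in, not the ceph-volume code: build_parser and choose_osd_fsid are hypothetical names, and uuid.uuid4() is assumed here as a substitute for system.generate_uuid().

```python
import argparse
import uuid


def build_parser():
    """Tiny parser carrying only the two options discussed in the commit."""
    parser = argparse.ArgumentParser(prog="raw-prepare-sketch")
    parser.add_argument("--osd-id", help="reuse an existing OSD id")
    parser.add_argument("--osd-fsid", help="reuse an existing OSD fsid")
    return parser


def choose_osd_fsid(args, generate_uuid=lambda: str(uuid.uuid4())):
    """Honor an externally provided fsid; generate one only when none is given."""
    return args.osd_fsid or generate_uuid()
```

Passing the generator as a parameter keeps the sketch easy to test: a caller can check that no new UUID is generated when --osd-fsid is supplied.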
Ronen Friedman [Mon, 9 Mar 2026 17:23:18 +0000 (17:23 +0000)]
crimson/osd: use a unified super-block for devices
This commit refactors the on-hardware super-block structure
used by the seastore to a unified format that
can accommodate all three device types (HDD, ZBD, RBM).
All devices now have a 60-byte header at address 0,
similar to the existing BlueStore layout. A 23-byte magic
string ("CRIMSON_DEVICE") is placed at the beginning of
the header, followed by 37 bytes of null padding (to
match the existing 60 bytes of the super-block), and
then the DENC-encoded device_superblock_t structure starting
at offset 60.
A unified device_config_t is now used for all device types.
The per-shard data structure is also unified, now including a union
of all relevant fields for each device type.
We are also adding a check for the super-block magic value in the
RBMDevice::read_rbm_superblock() method, similar to the existing check
in SegmentManager::read_segment_manager_superblock().
John Mulligan [Thu, 23 Apr 2026 21:37:28 +0000 (17:37 -0400)]
CODEOWNERS: add an smb group for various smb related files
Add a new smb group that covers parts of orch that manage smb as well as
the cephfs proxy. This will help automatically notify smb focused devs
on PRs.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
* refs/pull/65656/head:
client: do not allow zero‑length reads
src/test: test zero-length async-fsync read using ceph_ll_nonblocking_readv_writev
src/test: test zero-length async-fsync read using ll_preadv_pwritev
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Christopher Hoffman <choffman@redhat.com>
Generalize prepare_rewrite_publish_to_prior() into stage_visibility_handoff().
* Introduce should_use_no_conflict_publish().
* Replace is_rewrite_transaction() checks with should_use_no_conflict_publish(),
so adding new no-conflict users becomes straightforward.
* Stop committing metadata (commit_state + sync_checksum) during prepare_record()
(pre-commit). While it is correct for rewrite, doing it pre-commit doesn't buy
us anything today because readers are still blocked until the publish finishes.
Moving the metadata commit to the post-commit phase would also make future
non-rewrite users easier to support.
This is a prep step for expanding no-conflict publish coverage.
test/cli/radosgw-admin: align help golden with period/zone delete
Update help.t expected output to match the inline help text updated in this PR: period rm -> period delete and zone rm -> zone delete.
This keeps the CLI golden test consistent with radosgw-admin --help output and addresses make check failures for this branch.
ceph-volume: has_bluestore_label checks all bluestore label replica offsets
BlueStore replicates the block device label at fixed offsets (0 and
multiples of 1 GB up to 1000 GB). has_bluestore_label() previously read
only the first 22 bytes of the device, so disks with a wiped primary
label but intact replicas were missed.
With this commit, has_bluestore_label() scans each known offset with
seek/read and compares the ASCII prefix as bytes.
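A minimal sketch of such a scan is shown below. It is not the ceph-volume implementation: has_label is a hypothetical name, the 22-byte prefix is assumed to be the ASCII string "bluestore block device", and the offsets in LABEL_OFFSETS are assumptions for illustration; the authoritative list is the one BlueStore itself defines.

```python
import os

# Assumed 22-byte ASCII prefix of a BlueStore label.
BLUESTORE_PREFIX = b"bluestore block device"
GIB = 1 << 30
# Assumed replica offsets, for illustration only.
LABEL_OFFSETS = (0, 1 * GIB, 10 * GIB, 100 * GIB, 1000 * GIB)


def has_label(path, offsets=LABEL_OFFSETS, prefix=BLUESTORE_PREFIX):
    """Return True if any candidate offset starts with the label prefix."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        for offset in offsets:
            # Skip offsets that lie beyond the end of a small device.
            if offset + len(prefix) > size:
                continue
            f.seek(offset)
            # Compare bytes to bytes; no decoding is involved.
            if f.read(len(prefix)) == prefix:
                return True
    return False
```

Because the offsets are a parameter, the function can be exercised against a small regular file by passing test offsets instead of the multi-gigabyte defaults.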
Adam Kupczyk [Thu, 16 Apr 2026 09:25:19 +0000 (09:25 +0000)]
extblkdev/fcm: Replace errors with health warning
The plugin no longer asserts or fails to load;
instead, it raises the following health warnings:
EXTBLKDEV: multivolume fcm will not work properly
EXTBLKDEV: failed accessing FCM utilization log
EXTBLKDEV: bdev_enable_discard not enabled - free space will leak
Adam Kupczyk [Tue, 14 Apr 2026 17:57:42 +0000 (17:57 +0000)]
blk: Expand collect_alerts to allow specialization
Previously, BlockDevice::collect_alerts had a fixed
implementation.
Make BlockDevice::collect_alerts virtual so that KernelDevice can
override it.
This commit changes the error message emitted when the device's block
size is smaller than the minimum expected by seastore. This is done to
improve usability and provide an actionable error message.
mgr/dashboard: Round off y-axis value of area chart
- by default the y-axis is set to 1 for all
- the value round-off for the area chart is separated from the y-axis ticks
- also fixes a bug where all IOPS y-ticks were repeated (1, 1, 0, 0)