Redouane Kachach [Wed, 11 Feb 2026 13:36:01 +0000 (14:36 +0100)]
mgr/cephadm: cleanup leftover certs/keys after cert_src changes
This PR improves certificate cleanup when a service switches
certificate sources (cephadm-signed <-> inline/reference). It also adds
best-effort post-remove helpers to purge stale cephadm-managed
cert/key pairs. Inline-stored (non-editable) certs/keys are removed,
while referenced/user-managed (editable) credentials are preserved.
Redouane Kachach [Wed, 11 Feb 2026 11:17:55 +0000 (12:17 +0100)]
mgr/cephadm: adding tls fields as deps for services with TLS support
This is especially important for inline certificates, so the certmgr
store is updated automatically whenever the user changes the values in
the spec and reapplies it.
Patrick Donnelly [Wed, 13 May 2026 19:48:41 +0000 (15:48 -0400)]
Merge PR #68781 into main
* refs/pull/68781/head:
doc/governance: remove Sam from CSC
Reviewed-by: Joseph Mundackal <jmundackal@bloomberg.net>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Kefu Chai [Wed, 13 May 2026 11:09:37 +0000 (19:09 +0800)]
crimson/osd: fix crash in committed_osd_maps when an OSD is removed
OSDMap::is_down(osd) is defined as !is_up(osd), and is_up() gates on
exists(osd). This means is_down() returns true for OSDs that have
been *removed* from the map (EXISTS flag cleared), not just marked
down.
committed_osd_maps() iterates over epochs [first, last], and for each
epoch over all OSDs in old_map, calling get_cluster_addrs() for any
OSD that was up in old_map and is_down() in the current epoch.
get_cluster_addrs() asserts exists(osd), so when that OSD has been
removed the assertion fires.
# Two rapid OSDMap changes; the monitor batches them into one message.
ceph osd down 2
ceph osd purge 2 --yes-i-really-mean-it
# osd.0 and osd.1 call committed_osd_maps(N, N+1). Before this fix
# old_map is set once before the loop and never updated, so in
# iteration N+1 the comparison is still old_map(N-1) vs osdmap(N+1):
#
# old_map->is_up(2)=true (osd.2 was up at N-1)
# osdmap->exists(2)=false (purged in N+1)
# osdmap->is_down(2)=true (!is_up, since !exists -> true)
# -> get_cluster_addrs(2) asserts -> crash
#
# OSDMap.h: ceph_assert(exists(osd)) [in get_cluster_addrs()]
# Signal 6 (SIGABRT)
Note: 'ceph osd destroy' does NOT clear the EXISTS flag; it only sets
CEPH_OSD_DESTROYED. The EXISTS flag is cleared by 'osd rm', which
'osd purge' calls internally after 'osd destroy'.
Fix: advance old_map at the end of each iteration so the comparison
is pairwise (N-1 vs N, then N vs N+1, ...), matching classic
OSD::advance_map at src/osd/OSD.cc:8615. In the reproducer,
iteration N marks osd.2 down using osdmap(N) (where osd.2 still
exists), then sets old_map = osdmap(N). Iteration N+1 starts with
old_map(N)->is_up(2)=false (osd.2 was DOWN in N), so the condition
short-circuits and get_cluster_addrs() is never called on the new
map.
No explicit !exists branch is needed. The monitor produces a
separate epoch for each of 'osd down' / 'osd destroy' / 'osd rm', so
an OSD can only transition UP -> REMOVED through at least one
intermediate DOWN epoch in any batched MOSDMap, and the pairwise
comparison short-circuits before the assert can fire.
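The difference between the fixed and the pairwise comparison can be modelled outside of Crimson. The following is a minimal Python sketch, not the actual C++ code: `ToyOSDMap` and `committed_maps` are hypothetical names, and only the `exists`/`is_up`/`is_down`/`get_cluster_addrs` semantics described above are reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSDMap:
    """Toy stand-in for OSDMap: tracks which OSD ids exist and which are up."""
    existing: set = field(default_factory=set)
    up: set = field(default_factory=set)

    def exists(self, osd):
        return osd in self.existing

    def is_up(self, osd):
        return self.exists(osd) and osd in self.up

    def is_down(self, osd):
        # Same definition as quoted above: also true for removed OSDs.
        return not self.is_up(osd)

    def get_cluster_addrs(self, osd):
        assert self.exists(osd), f"osd.{osd} does not exist"
        return f"addr-of-osd.{osd}"


def committed_maps(old_map, new_maps, pairwise):
    """Return the OSDs found newly down while walking new_maps in order."""
    went_down = []
    for osdmap in new_maps:
        for osd in sorted(old_map.existing):
            if old_map.is_up(osd) and osdmap.is_down(osd):
                osdmap.get_cluster_addrs(osd)
                went_down.append(osd)
        if pairwise:
            old_map = osdmap  # advance so the next comparison is N vs N+1
    return went_down


prev = ToyOSDMap(existing={0, 1, 2}, up={0, 1, 2})  # epoch N-1
down = ToyOSDMap(existing={0, 1, 2}, up={0, 1})     # epoch N: osd.2 marked down
gone = ToyOSDMap(existing={0, 1}, up={0, 1})        # epoch N+1: osd.2 removed

print(committed_maps(prev, [down, gone], pairwise=True))
```

Calling `committed_maps(prev, [down, gone], pairwise=False)` in this model trips the assertion in `get_cluster_addrs`, which corresponds to the crash in the reproducer.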
Jacques Heunis [Wed, 13 May 2026 09:14:05 +0000 (09:14 +0000)]
rgw: Fix ops logs sometimes having several entries per line.
Although not explicitly documented, the RGW ops log is generally
formatted with one entry per line. This makes it work well with log
shipping/ingestion services (many of which default to treating each line
as a separate entry) and in particular works well with log servers that
index based on the available JSON fields.
The current implementation separates the log call from the resulting
disk IO: Logs are written to a buffer and a separate thread flushes them
to disk (possibly in batches). The current code appends a newline only
at the end of the batch being flushed to disk and in many cases when
under load this means that several log entries are concatenated onto a
single line, which complicates attempts to process those logs.
This PR separates the addition of a new line from the flush to disk,
appending a newline after every log entry but still only flushing at the
end of the batch to avoid additional IO overhead.
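As an illustration of the idea only (the RGW code is C++ and structured differently), a buffered logger that terminates each entry when it is queued but still issues one write and one flush per batch could look like this. `BatchedLineLog` and its methods are hypothetical names.

```python
import io


class BatchedLineLog:
    """Hypothetical buffered log: one newline per entry, one flush per batch."""

    def __init__(self, sink):
        self.sink = sink      # any object with write() and flush()
        self.pending = []

    def log(self, entry):
        # Terminate every entry as it is queued, not when the batch is flushed.
        self.pending.append(entry + "\n")

    def flush(self):
        # A single write and flush per batch keeps the IO cost unchanged.
        self.sink.write("".join(self.pending))
        self.sink.flush()
        self.pending.clear()


sink = io.StringIO()
log = BatchedLineLog(sink)
log.log('{"op": "get_obj"}')
log.log('{"op": "put_obj"}')
log.flush()
print(sink.getvalue().splitlines())
```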
Fixes: https://tracker.ceph.com/issues/76566
Signed-off-by: Jacques Heunis <jheunis@bloomberg.net>
Kefu Chai [Tue, 5 May 2026 01:36:01 +0000 (09:36 +0800)]
pybind/mgr/status: drop asserts that fight the defaultdict defaults
The 'assert metadata' checks in the status module were actually fighting
against our own defaults. Since an empty defaultdict is falsy, these
asserts would blow up the whole command if a single daemon was down
after a mgr restart.
This drops those four grumpy asserts. Now, instead of a traceback,
`ceph osd status` and `ceph fs status` will just show a blank hostname
or "unknown" version as intended.
The trigger is common in practice: any mgr restart leaves daemons
that are currently down without metadata in daemon_state, since
they never reconnect via MMgrOpen to repopulate it. After such a
restart, `ceph osd status` and `ceph fs status` blow up:
```
Error EINVAL: Traceback (most recent call last):
...
File ".../status/module.py", line 340, in handle_osd_status
assert metadata
AssertionError
```
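The underlying Python behaviour is easy to reproduce in isolation. In this small sketch, `hostname_for` is a hypothetical helper and `defaultdict(str)` is an assumed shape for the default; the point is only that an empty defaultdict is falsy even though lookups on it are safe.

```python
from collections import defaultdict


def hostname_for(metadata):
    # An empty defaultdict is falsy, so `assert metadata` would raise here
    # even though the lookup below is perfectly safe.
    return metadata["hostname"]


empty = defaultdict(str)          # assumed default for a daemon without metadata
print(bool(empty))                # False
print(repr(hostname_for(empty)))  # ''
```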
Kefu Chai [Tue, 5 May 2026 01:35:00 +0000 (09:35 +0800)]
mgr: narrow get_metadata return type with @overload
Enable type narrowing for get_metadata() when a non-None default is
provided. Previously, the return type was always `Optional[Dict[str, str]]`,
forcing callers to use defensive `assert metadata` checks even when
a result was guaranteed.
The wrapper returns either the metadata from `_ceph_get_metadata()` or the
caller-supplied default. Providing an `@overload` allows type checkers to
prove the result is non-None, avoiding invalid assertions for falsy
defaults (like an empty defaultdict).
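A simplified sketch of the overload pattern follows. The signature is reduced to a single `key` argument for illustration, and `_lookup` and `_FAKE_STORE` are hypothetical stand-ins for the underlying lookup, so this is not the actual MgrModule API.

```python
from typing import Dict, Optional, overload

# Hypothetical backing store standing in for the real metadata source.
_FAKE_STORE: Dict[str, Dict[str, str]] = {"osd.0": {"hostname": "node1"}}


def _lookup(key: str) -> Optional[Dict[str, str]]:
    return _FAKE_STORE.get(key)


@overload
def get_metadata(key: str, default: None = None) -> Optional[Dict[str, str]]: ...
@overload
def get_metadata(key: str, default: Dict[str, str]) -> Dict[str, str]: ...


def get_metadata(key: str,
                 default: Optional[Dict[str, str]] = None
                 ) -> Optional[Dict[str, str]]:
    # Return the stored metadata, or the caller-supplied default if missing.
    found = _lookup(key)
    return default if found is None else found


# A type checker can treat this result as Dict[str, str], so no assert is needed.
meta = get_metadata("osd.9", {})
print(meta.get("hostname", "unknown"))
```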
Niklas Hambüchen [Sat, 27 Dec 2025 13:05:19 +0000 (14:05 +0100)]
doc: Document that client_dirsize_rbytes confuses rsync
This is important to document because otherwise the immediate question
one has is "why _wouldn't_ I enable this?".
At the same time, being able to use tools like rsync is a common
motivation for using CephFS.
Unfortunately the only source for this I could find is the presentation
"CephFS: Architecture Introduction & New Features" by Greg Farnum:
https://ceph.io/assets/pdfs/events/2025/ceph-days-silicon-valley/10%20-%20Greg%20-%20CephFS.pdf
Kefu Chai [Wed, 13 May 2026 05:02:16 +0000 (13:02 +0800)]
crimson/osd: disable ofstream buffering to fix concurrent logging
seastar::logger::do_log() writes to the shared static _out pointer from
every shard's reactor thread with no lock. std::cerr is safe in this
setting because it is unbuffered: each write maps to a single write(2)
syscall, which POSIX serializes at the kernel level. A buffered
std::ofstream is not safe: multiple shards concurrently advance the
filebuf's put pointer (pptr) past the end-of-buffer marker (epptr),
causing _M_convert_to_external to compute a length longer than the
8192-byte internal buffer and write past it.
The C++ standard does not provide thread-safety guarantees for
std::ofstream. [res.on.data.races] (C++23 §16.4.6.10) specifies that
concurrent non-const access to a standard library object from multiple
threads is a data race with undefined behavior. std::basic_filebuf,
which std::ofstream owns internally, maintains mutable state (pptr,
epptr, _M_buf, and the codec state) that is updated on every write with
no synchronization. std::cerr is an explicit exception: [iostream.objects]
guarantees that concurrent writes to the standard stream objects are
safe, and cerr achieves this by being unbuffered (no pptr/epptr to race
on) and writing through a single atomic write(2) per flush.
This manifested as a heap-buffer-overflow in basic_filebuf::overflow()
via seastar::logger::do_log() during a multi-shard run, reported against
the build that included the dangling-pointer fix (6680e02d041). The
dangling-pointer bug had masked this latent race by crashing before
multiple shards came online.
In this change, we fix this by calling pubsetbuf(nullptr, 0) immediately
after opening the file. This suppresses _M_allocate_internal_buffer and
makes each operator<< call fall through to a single write(2) syscall,
matching std::cerr's thread-safety guarantee.
This fix is necessary for all production deployments, not just
pathological configurations: log_file has a daemon_default of
/var/log/ceph/$cluster-$name.log in global.yaml.in ($cluster is always
"ceph" in modern Ceph, as customized cluster names have been deprecated).
Every crimson-osd process therefore opens a log_file_stream by default,
and every multi-shard run is exposed to this race.
An alternative would be to follow ScyllaDB's approach: its service file
has no StandardOutput or StandardError directives, so systemd connects
the process to the journal, and the logger keeps _out pointing at
std::cerr. This sidesteps the buffered-ofstream problem entirely. For
Crimson to adopt that model it would need to respect log_to_file and
log_to_stderr (which it currently ignores, checking only log_file), and
a dedicated ceph-crimson-osd@.service unit would be needed so that a
StandardError=append:/var/log/ceph/ceph-osd.%i.log directive could be
added without affecting the classic OSD. That is a larger refactor;
pubsetbuf(nullptr, 0) is the minimal correct fix for now.
Kefu Chai [Tue, 12 May 2026 09:55:06 +0000 (17:55 +0800)]
crimson/osd: inline log file stream setup to fix dangling pointer
maybe_set_logger() called logger().set_ostream() with a reference to a
local ofstream before returning it by value. Correctness relied on NRVO:
without it, _out would point to a moved-from object on a dead stack frame,
causing undefined behaviour that manifests as a heap-buffer-overflow under
ASan with GCC 14 (libasan.so.8).
Note that seastar::logger::_out is a static member shared by all logger
instances, so the dangling pointer affects every logger (seastore, network,
OSD), explaining why the crash appears across subsystems.
Inline the setup so log_file_stream and reset_logger share the same scope.
set_ostream() is now called with an unambiguously live object, with no
dependence on copy elision.
Kefu Chai [Wed, 13 May 2026 04:32:33 +0000 (12:32 +0800)]
crimson/osd: drop redundant trailing co_return in pg_advance_map
check_for_splits() and split_pg() both ended with a bare co_return
that the compiler inserts implicitly for a coroutine returning
seastar::future<>. Remove the redundant statements.
Oguzhan Ozmen [Tue, 12 May 2026 19:31:04 +0000 (19:31 +0000)]
test/neocls/log trimming: reproduce log trimming can go into an infinite loop
Add two tests that call the trim() loop function directly (the
use_awaitable_t overloads) rather than the single-op wrapper used by
existing tests. Two test cases for the marker-based overload:
- trim_loop_all_entries_by_marker: writes 10 entries, trims all, verifies
the loop terminates and entries are gone.
- trim_loop_empty_log_by_marker: trims an empty log object, verifying
the loop terminates on immediate ENODATA.
Without the fix in the following commits, both tests hang indefinitely.
- start a vstart cluster
- run the test: [build] $ ./bin/ceph_test_neocls_log
- the test introduced in this commit stalls forever:
...
[ RUN      ] neocls_log.trim_loop_all_entries_by_marker <-- stalls forever
Venky Shankar [Tue, 12 May 2026 15:26:29 +0000 (20:56 +0530)]
Merge PR #68128 into main
* refs/pull/68128/head:
qa: Fix checksum calculation on empty directories
qa: Add mirror test for snapshot with only dir
tools/cephfs_mirror: Fix sync hang
Raimund Sacherer [Tue, 12 May 2026 09:07:30 +0000 (11:07 +0200)]
python-common/drive_selection: keep existing-OSD devices past limit
assign_devices() breaks out of disk iteration on the first hit from
_limit_reached(). When that happens, any later existing-OSD device
for the current spec is silently dropped from the selection, and
ceph-volume's lvm batch loses sight of it.
Only break when the candidate is not an existing OSD for this
service_id. Existing-for-this-spec devices continue to be added past
the limit; they are already accounted for through existing_daemons.
Complements d3f1a0e1c0b ("fix limit with existing devices", 2023),
which excluded ceph devices from the limit count. That fix prevents
the break from firing in most cases; this one keeps the iteration
useful when it does fire anyway.
Needs review: interaction across spec shapes (with/without explicit
limit:, with/without existing_daemons) should be looked at.
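The changed break condition can be sketched as a toy loop. This is not the real assign_devices(): the function signature, the `new_count` bookkeeping, and the device names are assumptions made for illustration.

```python
def assign_devices(disks, limit, existing_for_spec):
    """Toy selection loop; names are illustrative, not the real method."""
    selected = []
    new_count = 0  # existing devices are not counted against the limit
    for disk in disks:
        is_existing = disk in existing_for_spec
        if new_count >= limit and not is_existing:
            break  # limit reached and this candidate is not already ours
        selected.append(disk)
        if not is_existing:
            new_count += 1
    return selected


# sdb already backs an OSD for this spec, so it is kept past the limit of 1.
print(assign_devices(["sda", "sdb", "sdc"], limit=1, existing_for_spec={"sdb"}))
```

With the previous behaviour (break on the first limit hit regardless of the candidate), the same input would stop at sdb and drop it from the selection.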
Raimund Sacherer [Tue, 12 May 2026 09:05:53 +0000 (11:05 +0200)]
ceph-volume: tolerate <=1% short-fall on requested db/wal size
When requested_size (e.g. 1 GiB) slightly exceeds abs_size (e.g.
1023.3 MiB after space is lost to PE alignment), get_physical_fast_allocs()
called exit(1) and aborted the whole batch.
If the short-fall is within 1%, scale down to abs_size
with an info log instead of aborting. Anything larger still hits the
existing error path.
Needs review: confirm 1% is the right threshold (maybe a lower percentage is
sufficient) and that no caller assumes abs_size == requested_size after this branch.
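The tolerance check can be written down as a small standalone function. This is a sketch under assumptions: `resolve_fast_size` is a hypothetical name, the short-fall is measured relative to requested_size (the commit message does not say which base is used), and a ValueError stands in for the existing error path.

```python
MIB = 1024 ** 2


def resolve_fast_size(requested_size, abs_size, tolerance=0.01):
    """Return the size to allocate, or raise if the short-fall is too large."""
    if requested_size <= abs_size:
        return requested_size
    # Assumption: the short-fall is expressed as a fraction of the request.
    shortfall = (requested_size - abs_size) / requested_size
    if shortfall <= tolerance:
        return abs_size  # scale down instead of aborting the batch
    raise ValueError(
        f"requested {requested_size} bytes but only {abs_size} are available")


# 1 GiB requested, 1023.3 MiB available: roughly a 0.07% short-fall.
print(resolve_fast_size(1024 * MIB, int(1023.3 * MIB)))
```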
Raimund Sacherer [Tue, 12 May 2026 09:05:26 +0000 (11:05 +0200)]
ceph-volume: allocate db/wal slot on partial fast-device VG
On single-OSD redeploy where the fast device VG already has
DB LVs for sibling OSDs, get_physical_fast_allocs() returned an empty
list and ceph-volume fell back to a co-located OSD.
Two fixes in get_physical_fast_allocs():
- abs_size = dev_size / slots_for_vg can exceed vg_free when other
slots are still in use, so the while-loop never enters. Fall back
to abs_size = free_size / fast_slots_per_device.
- The loop counter was occupied_slots = len(dev.lvs), so on a partial
VG the loop was aborted prematurely. Count only slots
allocated in this call (new_slots) instead.
The initial issue was the silent creation of OSDs without a DB, which
was fixed in commit 5c700ed7d64. After applying that fix, no OSDs
were deployed at all.
Tested on RHCS 8 lab cluster (12 HDDs / 4 SSDs across 3 hosts,
db_slots: 6, encrypted)
Needs review: confirm that new_slots matches the original intent
of the per-batch cap when multiple OSDs are deployed in one call.
Olivier Chaze [Tue, 12 May 2026 14:32:10 +0000 (16:32 +0200)]
doc/rgw: warn about rgw_usage_max_shards consistency
Add documentation warnings explaining that all RGW daemons and
radosgw-admin commands must use the same rgw_usage_max_shards value.
Mismatched shard counts cause writes and reads/trim to target different
objects, resulting in seemingly empty usage logs or failed cleanup.
Also document the --rgw-usage-max-shards command-line parameter for
radosgw-admin as an alternative to global config.
Kefu Chai [Tue, 12 May 2026 09:17:56 +0000 (17:17 +0800)]
debian: drop explicit libprotobuf dependency from ceph-osd-crimson
The ceph-osd-crimson package already lists ${shlibs:Depends} in its
Depends field, which generates the correct libprotobuf dependency for
the target distribution at build time (e.g. libprotobuf32t64 on
Trixie/Noble). The hardcoded libprotobuf23 entry is redundant and
breaks installations on distributions where protobuf ships under a
different package name.
Afreen Misbah [Tue, 5 May 2026 21:05:11 +0000 (02:35 +0530)]
mgr/dashboard: Updates to empty state component
- added state for no storage in empty state component
- extended the icon component to take into account the scenario of button with icon
- fix unit tests
Shweta Bhosale [Mon, 11 May 2026 10:02:14 +0000 (15:32 +0530)]
mgr/nfs: reuse CephfsClient for path checks and earmark resolver
cephfs_path_is_dir defined an inner function decorated with lru_cache, so
each call got a new function object with an empty cache, and
CephfsClient(mgr) ran every time. Moved the caching to a module-level
cephfs_client_for_mgr(mgr), which cephfs_path_is_dir now calls.
Passed that shared client into CephFSEarmarkResolver from the NFS module so
export create/apply does not construct a separate CephfsClient for
earmarks.
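The caching pitfall is generic Python behaviour and can be demonstrated without the NFS module. In this sketch `FakeClient`, `path_check_uncached`, and `client_for_mgr` are hypothetical names; only the placement of the `lru_cache` decorator mirrors the description above.

```python
from functools import lru_cache

constructed = []


class FakeClient:
    """Hypothetical stand-in for an expensive client object."""

    def __init__(self, mgr):
        constructed.append(mgr)


def path_check_uncached(mgr):
    @lru_cache(maxsize=1)
    def client(m):  # a fresh function, with a fresh cache, on every call
        return FakeClient(m)
    return client(mgr)


@lru_cache(maxsize=1)
def client_for_mgr(mgr):  # module-level: the cache survives across calls
    return FakeClient(mgr)


path_check_uncached("mgr")
path_check_uncached("mgr")
inner_builds = len(constructed)

constructed.clear()
client_for_mgr("mgr")
client_for_mgr("mgr")
shared_builds = len(constructed)

print(inner_builds, shared_builds)
```

The inner-function variant constructs the client on both calls, while the module-level variant constructs it once.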
Kefu Chai [Mon, 11 May 2026 05:46:25 +0000 (13:46 +0800)]
crimson: consolidate the return paths of get_segment_manager()
before this change, two branches both return `BlockSegmentManager`,
which is redundant. in this change, consolidate them so that the
`HAVE_ZNS` path becomes an early return. this improves readability.
Kefu Chai [Mon, 11 May 2026 05:27:42 +0000 (13:27 +0800)]
crimson: abort on ioctl(BLKGETNRZONES) failure
previously, we did not check the return value of ioctl(BLKGETNRZONES).
we query the number of zones of the storage device to determine which
seastore backend to use. the only possible error from this ioctl is
-EFAULT (invalid user pointer), which indicates a programming error
and should never happen in practice. use ceph_assert() to catch this.
Kefu Chai [Mon, 11 May 2026 05:07:25 +0000 (13:07 +0800)]
crimson: use uint32_t when calling ioctl(BLKGETNRZONES)
before this change, we pass a pointer to a `size_t` to
ioctl(BLKGETNRZONES), but in the Linux kernel,
include/uapi/linux/blkzoned.h:
```c
#define BLKGETNRZONES _IOR(0x12, 133, __u32)
```
this API reads 32 bits of data into the pointer. on 64-bit
architectures, size_t is 64 bits. fortunately, we initialize
nr_zones with 0, so the upper 32 bits remain zero. this works
on little-endian systems, but not on big-endian systems. it is
also semantically wrong. we should pass a pointer to a 32-bit
value when calling ioctl(BLKGETNRZONES).
in this change, we change the type of nr_zones from size_t to
uint32_t to match what the Linux kernel expects.
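The endianness argument can be checked with a short simulation. This sketch does not call ioctl(); `read_nr_zones` is a hypothetical helper that writes a 32-bit value at offset 0 of a zeroed 8-byte slot and then reads the slot back as a 64-bit integer, using an explicit byte order for both steps.

```python
import struct


def read_nr_zones(byte_order, zones=7):
    """Simulate a 32-bit write into a zero-initialized 64-bit variable."""
    slot = bytearray(8)                                 # size_t nr_zones = 0
    struct.pack_into(byte_order + "I", slot, 0, zones)  # 32-bit write at offset 0
    return struct.unpack(byte_order + "Q", bytes(slot))[0]  # 64-bit read


print(read_nr_zones("<"))  # little-endian: the value lands in the low half
print(read_nr_zones(">"))  # big-endian: the value lands in the high half
```

In the little-endian case the caller reads back 7; in the big-endian case the same write is interpreted as 7 shifted left by 32 bits.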
```
[1/3] Building CXX object src/crimson/os/seastore/CMakeFiles/crimson-seastore.dir/segment_manager.cc.o
/home/kefu/dev/ceph/src/crimson/os/seastore/segment_manager.cc:45:15: warning: lambda capture 'FNAME' is not used [-Wunused-lambda-capture]
45 | ).then([FNAME,
| ^
```
but we went further by coroutinizing the whole method. because the return
value of ioctl() was not checked before this change, and clang correctly
flagged this with a warning, we mark it with `[[maybe_unused]]`; we
will fix it in a separate change.
Added the following test cases:
- Test success when explicitly supplied tiebreaker
- Test success when auto-selecting tiebreaker monitor
- Test success with minimal valid configuration (1 monitor per zone)
- Test success with auto-selection and minimal config (1 monitor per zone)
- Test success when strategy is automatically changed to CONNECTIVITY
- Test failure when auto-selecting and tiebreaker is in a data zone
- Test failure when explicitly specifying tiebreaker in a data zone
- Test failure when multiple potential tiebreakers exist
- Test failure when one data zone has 0 monitors
- Test failure when tiebreaker monitor doesn't exist
1. enable_stretch_mode no longer requires a tiebreaker mon to be supplied
2. enable_stretch_mode automatically sets the monitor election strategy
to Connectivity if it is not already set.
3. Move away from "sites" and use "zones" instead throughout the doc
Ceph will try to select a tiebreaker mon that resides in
the crush <dividing_bucket> type but doesn't belong
to any of the data sites in which the OSDs reside.
Also created a helper function
`MonmapMonitor::validate_and_enable_stretch_mode`
inside `MonmapMonitor::try_enable_stretch_mode`,
making the logic unit-testable.
Moreover, ceph mon enable_stretch_mode will
automatically set monitor election strategy to Connectivity.
We now also enforce that at least 1 monitor exists for each data zone.
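The selection and validation rules listed in the test cases above can be summarised in a short sketch. This is an illustrative model, not the MonmapMonitor implementation: `select_tiebreaker`, its arguments, and the error messages are all hypothetical.

```python
def select_tiebreaker(mon_zones, data_zones, tiebreaker=None):
    """mon_zones maps monitor name -> zone; data_zones are zones holding OSDs."""
    # Every data zone must have at least one monitor.
    for zone in sorted(data_zones):
        if zone not in mon_zones.values():
            raise ValueError(f"data zone {zone!r} has no monitor")
    if tiebreaker is not None:
        if tiebreaker not in mon_zones:
            raise ValueError(f"monitor {tiebreaker!r} does not exist")
        if mon_zones[tiebreaker] in data_zones:
            raise ValueError(f"monitor {tiebreaker!r} is in a data zone")
        return tiebreaker
    # Auto-selection: exactly one monitor must live outside the data zones.
    candidates = [m for m, z in mon_zones.items() if z not in data_zones]
    if len(candidates) != 1:
        raise ValueError("expected exactly one monitor outside the data zones")
    return candidates[0]


mons = {"a": "dc1", "b": "dc2", "c": "dc3"}
print(select_tiebreaker(mons, {"dc1", "dc2"}))
```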