Kefu Chai [Wed, 13 May 2026 11:09:37 +0000 (19:09 +0800)]
crimson/osd: fix crash in committed_osd_maps when an OSD is removed
OSDMap::is_down(osd) is defined as !is_up(osd), and is_up() gates on
exists(osd). This means is_down() returns true for OSDs that have
been *removed* from the map (EXISTS flag cleared), not just marked
down.
committed_osd_maps() iterates over epochs [first, last], and for each
epoch over all OSDs in old_map, calling get_cluster_addrs() for any
OSD that was up in old_map and is_down() in the current epoch.
get_cluster_addrs() asserts exists(osd), so when that OSD has been
removed the assertion fires.
# Two rapid OSDMap changes; the monitor batches them into one message.
ceph osd down 2
ceph osd purge 2 --yes-i-really-mean-it
# osd.0 and osd.1 call committed_osd_maps(N, N+1). Before this fix
# old_map is set once before the loop and never updated, so in
# iteration N+1 the comparison is still old_map(N-1) vs osdmap(N+1):
#
# old_map->is_up(2)=true (osd.2 was up at N-1)
# osdmap->exists(2)=false (purged in N+1)
# osdmap->is_down(2)=true (!is_up, since !exists -> true)
# -> get_cluster_addrs(2) asserts -> crash
#
# OSDMap.h: ceph_assert(exists(osd)) [in get_cluster_addrs()]
# Signal 6 (SIGABRT)
Note: 'ceph osd destroy' does NOT clear the EXISTS flag; it only sets
CEPH_OSD_DESTROYED. The EXISTS flag is cleared by 'osd rm', which
'osd purge' calls internally after 'osd destroy'.
Fix: advance old_map at the end of each iteration so the comparison
is pairwise (N-1 vs N, then N vs N+1, ...), matching classic
OSD::advance_map at src/osd/OSD.cc:8615. In the reproducer,
iteration N marks osd.2 down using osdmap(N) (where osd.2 still
exists), then sets old_map = osdmap(N). Iteration N+1 starts with
old_map(N)->is_up(2)=false (osd.2 was DOWN in N), so the condition
short-circuits and get_cluster_addrs() is never called on the new
map.
No explicit !exists branch is needed. The monitor produces a
separate epoch for each of 'osd down' / 'osd destroy' / 'osd rm', so
an OSD can only transition UP -> REMOVED through at least one
intermediate DOWN epoch in any batched MOSDMap, and the pairwise
comparison short-circuits before the assert can fire.
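
The difference between the stale and the pairwise comparison can be sketched with a toy model. This is a minimal sketch, not the crimson code: `ToyMap`, `count_marked_down` and the `pairwise` flag are hypothetical names, and a thrown exception stands in for ceph_assert().

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for OSDMap (hypothetical, not the Ceph class).
// A missing key models an OSD whose EXISTS flag has been cleared.
struct ToyMap {
  std::map<int, bool> osds;  // osd id -> up flag

  bool exists(int osd) const { return osds.count(osd) != 0; }
  bool is_up(int osd) const { return exists(osd) && osds.at(osd); }
  // Mirrors the definition in the commit message: also true for removed OSDs.
  bool is_down(int osd) const { return !is_up(osd); }
  std::string get_cluster_addrs(int osd) const {
    // Stand-in for ceph_assert(exists(osd)).
    if (!exists(osd)) throw std::logic_error("exists(osd) assertion failed");
    return "addrs-of-osd." + std::to_string(osd);
  }
};

// Walks the epochs and counts OSDs that went from up to down.
// With pairwise == false, old_map is never advanced (the old behaviour).
int count_marked_down(ToyMap old_map, const std::vector<ToyMap>& epochs,
                      bool pairwise) {
  int marked = 0;
  for (const ToyMap& osdmap : epochs) {
    for (const auto& kv : old_map.osds) {
      const int osd = kv.first;
      const bool was_up = kv.second;
      if (was_up && osdmap.is_down(osd)) {
        osdmap.get_cluster_addrs(osd);  // throws if osd was removed
        ++marked;
      }
    }
    if (pairwise) {
      old_map = osdmap;  // compare N-1 vs N, then N vs N+1, ...
    }
  }
  return marked;
}
```

With epochs N (osd.2 down) and N+1 (osd.2 removed), the pairwise walk marks osd.2 down once and never asks the newer map for its addresses, while the non-advancing walk throws in the second iteration.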
Venky Shankar [Tue, 12 May 2026 15:26:29 +0000 (20:56 +0530)]
Merge PR #68128 into main
* refs/pull/68128/head:
qa: Fix checksum calculation on empty directories
qa: Add mirror test for snapshot with only dir
tools/cephfs_mirror: Fix sync hang
Kefu Chai [Tue, 12 May 2026 09:17:56 +0000 (17:17 +0800)]
debian: drop explicit libprotobuf dependency from ceph-osd-crimson
The ceph-osd-crimson package already lists ${shlibs:Depends} in its
Depends field, which generates the correct libprotobuf dependency for
the target distribution at build time (e.g. libprotobuf32t64 on
Trixie/Noble). The hardcoded libprotobuf23 entry is redundant and
breaks installations on distributions where protobuf ships under a
different package name.
Afreen Misbah [Tue, 5 May 2026 21:05:11 +0000 (02:35 +0530)]
mgr/dashboard: Updates to empty state component
- added state for no storage in empty state component
- extended the icon component to take into account the scenario of button with icon
- fix unit tests
Kefu Chai [Mon, 11 May 2026 05:46:25 +0000 (13:46 +0800)]
crimson: consolidate the return paths of get_segment_manager()
before this change, two branches both return `BlockSegmentManager`,
which is redundant. in this change, consolidate them so that the
`HAVE_ZNS` path becomes an early return. this improves readability.
Kefu Chai [Mon, 11 May 2026 05:27:42 +0000 (13:27 +0800)]
crimson: abort on ioctl(BLKGETNRZONES) failure
previously, we did not check the return value of ioctl(BLKGETNRZONES).
we query the number of zones of the storage device to determine which
seastore backend to use. the only possible error from this ioctl is
-EFAULT (invalid user pointer), which indicates a programming error
and should never happen in practice. use ceph_assert() to catch this.
Kefu Chai [Mon, 11 May 2026 05:07:25 +0000 (13:07 +0800)]
crimson: use uint32_t when calling ioctl(BLKGETNRZONES)
before this change, we pass a pointer to a `size_t` to
ioctl(BLKGETNRZONES), but in the Linux kernel,
include/uapi/linux/blkzoned.h:
```c
#define BLKGETNRZONES _IOR(0x12, 133, __u32)
```
this API reads 32 bits of data into the pointer. on 64-bit
architectures, size_t is 64 bits. fortunately, we initialize
nr_zones with 0, so the upper 32 bits remain zero. this works
on little-endian systems, but not on big-endian systems. it is
also semantically wrong. we should pass a pointer to a 32-bit
value when calling ioctl(BLKGETNRZONES).
in this change, we change the type of nr_zones from size_t to
uint32_t to match what the Linux kernel expects.
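
The endianness argument above can be illustrated without a zoned device. This is a sketch under stated assumptions: `fake_get_nr_zones` is a hypothetical helper that imitates a kernel copying exactly 32 bits to the caller's pointer, and `std::uint64_t` stands in for a 64-bit `size_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for ioctl(BLKGETNRZONES): writes exactly 4 bytes
// to the destination, as a kernel handling a __u32 argument would.
inline void fake_get_nr_zones(void* out, std::uint32_t zones) {
  std::memcpy(out, &zones, sizeof(zones));
}

// Detects the byte order of the host at run time.
inline bool host_is_little_endian() {
  const std::uint16_t probe = 1;
  unsigned char first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Old pattern: 64-bit destination, zero-initialized.
// The 4 written bytes land in the low half only on little-endian hosts.
inline std::uint64_t read_into_64bit(std::uint32_t zones) {
  std::uint64_t nr_zones = 0;
  fake_get_nr_zones(&nr_zones, zones);
  return nr_zones;
}

// New pattern: the destination width matches what is written.
inline std::uint32_t read_into_32bit(std::uint32_t zones) {
  std::uint32_t nr_zones = 0;
  fake_get_nr_zones(&nr_zones, zones);
  return nr_zones;
}
```

On a little-endian host both helpers return the zone count; on a big-endian host the 64-bit variant returns the count shifted left by 32 bits, which is why only the 32-bit destination is portable.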
```
[1/3] Building CXX object src/crimson/os/seastore/CMakeFiles/crimson-seastore.dir/segment_manager.cc.o
/home/kefu/dev/ceph/src/crimson/os/seastore/segment_manager.cc:45:15: warning: lambda capture 'FNAME' is not used [-Wunused-lambda-capture]
45 | ).then([FNAME,
| ^
```
but we went further by coroutinizing the whole method. because the return
value of ioctl() is not checked before this change, and clang correctly
flags this with a warning, we mark it with `[[maybe_unused]]` for now and
will fix it in a separate change.
Kefu Chai [Sat, 9 May 2026 06:39:17 +0000 (14:39 +0800)]
rgw/d4n: fix deprecated async_run overload in RedisPool
The async_run overload taking a logger argument is deprecated since
Boost 1.89. Use the 2-arg async_run(config, token) overload when
building with Boost >= 1.89, and fall back to the 3-arg overload
for Boost 1.87-1.88.
See https://www.boost.org/doc/libs/1_89_0/libs/redis/doc/html/redis/reference/boost/redis/basic_connection/async_run-04.html
A PGAdvanceMap queued by broadcast_map_to_pgs can sit behind in-flight
DeleteSome events on the peering pipeline holding a Ref<PG>. When it
finally runs, the collection has already been removed in seastore and
PGAdvanceMap drives handle_advance_map / check_for_splits on a stale
PG, thereby issuing ops on a collection that no longer exists and crashing the OSD.
Following Classic OSD, set peering_state.set_delete_complete() in PG::do_delete_work's
final batch and bail out of PGAdvanceMap::start when pg->is_deleted() is true.
Jon Bailey [Thu, 7 May 2026 12:28:01 +0000 (13:28 +0100)]
doc: Clarification of text in ec stretch cluster design
The description of min_size in the EC Cluster Design doc did not make clear what we intend to develop. This commit clarifies that text for readers.
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Kefu Chai [Sun, 29 Mar 2026 05:47:47 +0000 (13:47 +0800)]
crimson/osd: acquire throttle when scanning replica/primary for backfill
The backfill state machine called budget_available() before deciding to
scan, but request_primary_scan() and request_replica_scan() never
actually acquired the throttle slot. This meant scans could proceed
without any resource reservation, defeating the QoS intent of the
throttler introduced in 791772f1c0.
In this change, we fix this by acquiring the throttle before initiating
each scan.
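
The gap between checking the budget and reserving it can be sketched with a toy counter. This is a hypothetical, synchronous simplification: `ToyThrottle`, `try_acquire`, `release` and the two `start_scan_*` helpers are illustrative names, not the crimson throttler interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counting throttle with a fixed number of slots.
class ToyThrottle {
  std::size_t in_flight = 0;
  std::size_t limit;

public:
  explicit ToyThrottle(std::size_t limit) : limit(limit) {}

  // Only answers whether a slot is free; reserves nothing.
  bool budget_available() const { return in_flight < limit; }

  // Reserves a slot if one is free.
  bool try_acquire() {
    if (!budget_available()) {
      return false;
    }
    ++in_flight;
    return true;
  }

  // Returns a previously acquired slot.
  void release() {
    assert(in_flight > 0);
    --in_flight;
  }

  std::size_t used() const { return in_flight; }
};

// Old pattern: the scan starts after a check, but no slot is held.
inline bool start_scan_check_only(ToyThrottle& t) {
  return t.budget_available();
}

// Fixed pattern: the scan starts only if a slot was actually acquired.
inline bool start_scan_with_acquire(ToyThrottle& t) {
  return t.try_acquire();
}
```

With a limit of one, the check-only helper admits any number of concurrent scans because the counter never moves, whereas the acquiring helper admits a second scan only after the first one calls release().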
John Mulligan [Mon, 20 Apr 2026 20:07:19 +0000 (16:07 -0400)]
mgr/smb: add --wildcard and --recursive to smb cluster rm
Add new --wildcard and --recursive flags to the smb cluster rm
subcommands. These allow deleting clusters in bulk. The --wildcard
option works like the same option for share rm in that it allows the use
of globbing for the cluster IDs; this includes '*' to delete all
clusters. The --recursive option tells the command to also delete all
child resources (shares) when deleting a cluster.
This was previously doable by streaming the output of `ceph smb show
...` through (sed or) jq and flipping the intent to removed and piping
that to `ceph smb apply`, but that approach is neither obvious nor as
easy to document as these new options.
Signed-off-by: John Mulligan <jmulligan@redhat.com>