Kotresh HR [Mon, 25 May 2026 18:22:29 +0000 (23:52 +0530)]
doc: Update the mirroring doc with new metrics fields
Update the mirroring documentation and the release
notes to cover the newly introduced metrics and their
availability via the 'fs mirror peer status' asok
interface.
Kotresh HR [Fri, 5 Jun 2026 14:23:14 +0000 (19:53 +0530)]
tools/cephfs_mirror: Nest peer_status metrics by dir path and peer uuid
Restructure peer_status output so mirrored directory paths can be
shared by multiple peers without key collisions. Metrics are grouped
as metrics/<dir_path>/peer/<peer_uuid>/ instead of flat dir keys.
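The nesting described above can be sketched as follows. This is only an illustration of the layout, not the cephfs-mirror code: the function name build_peer_status, the sample paths and peer UUIDs, and the "state" field are all hypothetical.

```python
def build_peer_status(samples):
    """Group metrics as metrics/<dir_path>/peer/<peer_uuid>/.

    samples is an iterable of (dir_path, peer_uuid, metrics) tuples.
    All names here are hypothetical and used for illustration only.
    """
    status = {"metrics": {}}
    for dir_path, peer_uuid, metrics in samples:
        # One entry per directory, holding a "peer" map keyed by peer uuid.
        peers = status["metrics"].setdefault(dir_path, {"peer": {}})["peer"]
        peers[peer_uuid] = metrics
    return status


# The same directory mirrored to two peers no longer collides on one key.
samples = [
    ("/volumes/a", "peer-uuid-1", {"state": "syncing"}),
    ("/volumes/a", "peer-uuid-2", {"state": "idle"}),
]
status = build_peer_status(samples)
```

With a flat layout keyed only by directory path, the second sample would overwrite the first; keying by peer uuid under each path keeps both.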
Kotresh HR [Sat, 28 Mar 2026 11:23:33 +0000 (16:53 +0530)]
tools/cephfs_mirror: Add eta metrics
Add an estimated time of completion for the snapshot
currently being synced. The calculation uses the
average read/write throughput since the start of the
snapshot sync, not the current read/write throughput,
so the ETA reflects the overall rate rather than
short-term changes in speed.
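A minimal sketch of an ETA based on average throughput since the start of the sync. The commit does not give the exact formula, so the function name eta_seconds, its parameters, and the use of remaining bytes divided by the average rate are assumptions.

```python
def eta_seconds(total_bytes, sync_bytes, elapsed_seconds):
    """Estimate remaining seconds from the average rate since sync start.

    Hypothetical helper: it deliberately ignores the current rate and
    uses only bytes synced so far over elapsed time.
    """
    if sync_bytes <= 0 or elapsed_seconds <= 0:
        return None  # no progress yet, so no meaningful estimate
    average_rate = sync_bytes / elapsed_seconds  # bytes per second
    remaining = max(total_bytes - sync_bytes, 0)
    return remaining / average_rate


# 250 of 1000 bytes synced in 10 s: average 25 B/s, 750 bytes left.
print(eta_seconds(1000, 250, 10))
```

Because the average covers the whole sync, a recent slowdown or speedup shifts the estimate only gradually.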
Kotresh HR [Sat, 28 Mar 2026 10:57:02 +0000 (16:27 +0530)]
tools/cephfs_mirror: Add read/write throughput
The read throughput added measures the bytes
read per second from the source ceph filesystem.
Similarly, the write throughput added measures
the bytes written per second to the remote ceph
filesystem. Both are derived from the time spent
in preadv and pwritev calls.
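One way to picture a throughput figure derived from time spent in I/O calls is to accumulate bytes and call durations separately and divide at read-out time. The class and method names below are hypothetical, and the real accounting in cephfs-mirror may differ.

```python
class ThroughputMeter:
    """Accumulate bytes moved and the time spent inside I/O calls."""

    def __init__(self):
        self.total_bytes = 0
        self.total_seconds = 0.0

    def record(self, nbytes, seconds):
        # Called once per I/O call with its byte count and duration.
        self.total_bytes += nbytes
        self.total_seconds += seconds

    def bytes_per_second(self):
        # Time outside the I/O calls is not counted in the denominator.
        if self.total_seconds <= 0:
            return 0.0
        return self.total_bytes / self.total_seconds


read_meter = ThroughputMeter()
read_meter.record(4096, 0.5)  # e.g. one read call taking 0.5 s
read_meter.record(4096, 0.5)
```

A second meter of the same shape would track the write side.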
sync-mode:
---------
The 'sync-mode: full/delta' field is added to peer status.
'delta' means that blockdiff is used along with snapdiff
to sync the files, whereas 'full' means the full
directory is crawled and each file is synced
entirely.
crawl:
-----
The state can be in-progress/completed. This
identifies whether the crawler thread is done
queuing the files for data sync threads.
The crawl duration is also shown.
If the crawl is in-progress, the duration
shows the time elapsed since the
start of the crawl. If the crawl state is
completed, the duration indicates the total
time taken for the crawl.
The crawl duration is shown in "d h m s" format.
The existing 'sync_duration' in last_synced_snap
is also formatted this way.
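A small sketch of a "d h m s" formatter. The exact output produced by cephfs-mirror is not shown in this commit message, so the function name format_duration and the precise spacing and zero handling are assumptions.

```python
def format_duration(total_seconds):
    """Render a duration in seconds as '<d>d <h>h <m>m <s>s'."""
    seconds = int(total_seconds)
    days, rem = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


print(format_duration(93784))  # 1 day, 2 hours, 3 minutes, 4 seconds
```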
The values are as below. When crawl state is
completed, the 'total_files' metric doesn't
grow anymore.
crawl_duration:
--------------
The crawl_duration of last snapshot is saved in last_synced_snap
section as well.
Kotresh HR [Mon, 16 Feb 2026 10:59:31 +0000 (16:29 +0530)]
tools/cephfs_mirror: Add inprogress bytes and files metric
Add the following mirroring progress metrics to
current_syncing_snap:
bytes:
  sync_bytes - bytes synced till now
  total_bytes - total bytes to be synced
  sync_percent - percentage of bytes synced till now
files:
  total_files - total files to be synced
  sync_files - files synced till now
  sync_percent - percentage of files synced till now
sync_files and sync_bytes are also stored in last_synced_snap section
after the snapshot is synced.
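The metrics above could be assembled roughly as below. The field names come from this commit message; the helper names, the rounding to two decimals, and the handling of a zero total are assumptions.

```python
def sync_percent(synced, total):
    """Percentage synced so far; 0.0 when the total is not yet known."""
    if total <= 0:
        return 0.0
    return round(100.0 * synced / total, 2)


def progress(sync_bytes, total_bytes, sync_files, total_files):
    # Mirrors the bytes/files grouping listed in the commit message.
    return {
        "bytes": {
            "sync_bytes": sync_bytes,
            "total_bytes": total_bytes,
            "sync_percent": sync_percent(sync_bytes, total_bytes),
        },
        "files": {
            "sync_files": sync_files,
            "total_files": total_files,
            "sync_percent": sync_percent(sync_files, total_files),
        },
    }


current = progress(sync_bytes=512, total_bytes=2048, sync_files=3, total_files=12)
```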
Patrick Donnelly [Mon, 18 May 2026 14:20:08 +0000 (10:20 -0400)]
Merge PR #68937 into main
* refs/pull/68937/head:
.github/workflows/releng-audit: group events to serialize executions
.github/workflows/releng-audit: remove override on reopen
.github/workflows/releng-audit: refactor auth check to function
Afreen Misbah [Mon, 18 May 2026 10:01:58 +0000 (15:31 +0530)]
mgr/dashboard: fix logs e2e tests after carbonization
Update e2e test selectors to match the new Carbon component structure.
The .card-body and .message classes were replaced with .log-viewer
and .log-entry__message after carbonizing the logs component.
Assisted-by: Claude
Signed-off-by: Afreen Misbah <afreen@ibm.com>
Afreen Misbah [Sun, 17 May 2026 14:53:54 +0000 (20:23 +0530)]
mgr/dashboard: Carbonize upgrade page
- Made cluster status clickable to navigate to overview when not HEALTH_OK
- Replaced Bootstrap classes with Carbon design tokens
- Updated upgrade.component.scss to use CSS custom properties
Assisted-by: Claude
Signed-off-by: Afreen Misbah <afreenmisbah@ibm.com>
Afreen Misbah [Tue, 12 May 2026 12:07:39 +0000 (17:37 +0530)]
mgr/dashboard: Fix edit and delete access for pool-manager role
Fixes https://tracker.ceph.com/issues/76561
- allows deleting pools in pool-manager role by bypassing config-opt read permissions
- allows editing in pool-manager role, which was failing due to missing rbd mirroring permissions
- fixes a bug in pool edit mode where editing both compression and name at once fails due to an if-else logic bug
Kefu Chai [Wed, 6 May 2026 02:08:20 +0000 (10:08 +0800)]
cmake/BuildISAL: build and install library targets only
Skip building the igzip executables; Ceph only needs libisal.la.
This should speed up the build a little bit, as we no longer build the
executables previously built with "make".
Shai Fultheim [Sat, 16 May 2026 20:17:59 +0000 (23:17 +0300)]
crimson/os/seastore: fix cleaner space leak from shadowed result list
TransactionManager::get_extents_if_live() declared an inner
std::list<CachedExtentRef> res inside the "extent is cached" branch
that shadowed the outer res returned by the coroutine. When the
queried extent was present in the cache, it was moved into the inner
list and immediately discarded, and the empty outer list was returned
to the caller.
The async cleaner uses this result to decide whether to rewrite an
extent or treat it as dead. For recently-allocated LBA tree internal
nodes (still hot in cache), the shadowed return caused the cleaner to
skip them, so mark_space_free() never paired with the earlier
mark_space_used(). Each affected reclaim leaked exactly one extent
(4 KiB for LADDR_INTERNAL), tripping the live_bytes != 0 assertion in
SegmentCleaner::clean_space() (async_cleaner.cc:1441) once a victim
segment with such a leftover was selected.
The reproducer (at ~70% full) deterministically aborted within ~3
minutes before this fix; with the fix the OSDs run cleanly past the
trigger point.
Kefu Chai [Sat, 16 May 2026 02:53:41 +0000 (10:53 +0800)]
doc/dev: refresh vstart.sh options in dev_cluster_deployment
Bring doc/dev/dev_cluster_deployment.rst back in line with the current
src/vstart.sh:
* drop the removed -K/--kstore objectstore backend
* drop -N/--not-new, which was removed in 8dd2e418; reusing the existing
cluster config is simply the default when -n is not given
* correct the --rgw_frontend default from civetweb to beast
* note that -b/--bluestore is the default objectstore backend
* update the example and add a note that a fresh build needs -n on the
first run, while later runs can omit it
* note that the option list is not exhaustive and point at src/vstart.sh
Matthew N. Heler [Fri, 15 May 2026 11:11:35 +0000 (06:11 -0500)]
doc/rados/configuration: recommend wpq for EC clusters seeing slow ops
On large EC clusters, mClock currently routes recovery EC sub-reads
through the immediate queue, skipping throttling. When many OSDs read
from one source during recovery, that source's high-priority queue
saturates and starves client work, producing slow ops. Recommend
falling back to wpq in the mClock config reference until the
scheduler treats those reads as background.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
mgr/dashboard: adding daemon_name as an arg to nvmeof get bundle API
When cephadm-signed certificates are in use, we need to know exactly which
nvmeof daemon is being used so that we get the correct certificates for
that daemon in particular.
Patrick Donnelly [Thu, 14 May 2026 17:26:33 +0000 (13:26 -0400)]
.github/workflows/releng-audit: consolidate into single job
In order to make this a required check someday, we can't have the main
job ever be skipped. So, consolidate into a single job and skip actions
based on the router logic.
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Venky Shankar [Thu, 14 May 2026 09:27:21 +0000 (14:57 +0530)]
Merge PR #64774 into main
* refs/pull/64774/head:
test_cephfs.py: delete purge_dir() helper method, use rmtree() instead
test_cephfs.py: remove redundant call to purge_dir()
test_cephfs.py: test rmtree on root
pybind/cephfs: don't attempt to unlink root in rmtree
test_cephfs.py: test rmtree with and without should_cancel
pybind/cephfs: make should_cancel option parameter for rmtree()
mgr/volumes: clone using cptree() from cephfs python bindings
test_cephfs: add unit tests for cptree() in cephfs python bindings
test/pybind/assertions: add helper method assert_less
pybind/cephfs: use depth-first, non-recursive approach for cloning
test_cephfs: call object setup/teardown for all tests in TestWithRootUser
test_cephfs.py: add tests for utimensat()
pybind/cephfs: add python bindings for utimensat()
qa/cephfs: add tests for chownat()
pybind/cephfs: add python bindings for chownat()
test_cephfs.py: add tests for chmodat()
pybind/cephfs: add python bindings for chmodat()
test_cephfs.py: add tests for symlinkat()
pybind/cephfs: add python binding for symlinkat()
test_cephfs.py: add test for readlinkat()
pybind/cephfs: add python binding for readlinkat()
pybind/cephfs: add tests for statxat()
pybind/cephfs: add python bindings for statxat()
test_cephfs.py: add tests for mkdirat()
pybind/cephfs: add python binding for mkdirat()
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jos Collin <jcollin@redhat.com>
Patrick Donnelly [Wed, 13 May 2026 19:48:41 +0000 (15:48 -0400)]
Merge PR #68781 into main
* refs/pull/68781/head:
doc/governance: remove Sam from CSC
Reviewed-by: Joseph Mundackal <jmundackal@bloomberg.net>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Kefu Chai [Tue, 5 May 2026 01:36:01 +0000 (09:36 +0800)]
pybind/mgr/status: drop asserts that fight the defaultdict defaults
The 'assert metadata' checks in the status module were actually fighting
against our own defaults. Since an empty defaultdict is falsy, these
asserts would blow up the whole command if a single daemon was down
after a mgr restart.
This drops those four grumpy asserts. Now, instead of a traceback,
`ceph osd status` and `ceph fs status` will just show a blank hostname
or "unknown" version as intended.
The trigger is common in practice: any mgr restart leaves daemons
that are currently down without metadata in daemon_state, since
they never reconnect via MMgrOpen to repopulate it. After such a
restart, `ceph osd status` and `ceph fs status` blow up:
```
Error EINVAL: Traceback (most recent call last):
...
File ".../status/module.py", line 340, in handle_osd_status
assert metadata
AssertionError
```
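The failure mode can be reproduced outside the mgr with a few lines. This is a standalone sketch, not the status module code: the describe helper is hypothetical and the metadata keys are used only for illustration.

```python
from collections import defaultdict


def describe(metadata):
    """Read display fields without asserting that metadata is truthy."""
    hostname = metadata["hostname"]  # '' when the key is missing
    version = metadata["ceph_version"] or "unknown"
    return hostname, version


# A daemon with no stored metadata gets the empty default.
empty = defaultdict(str)

# An empty defaultdict is falsy, so `assert empty` would raise
# AssertionError even though it is a perfectly usable default.
is_falsy = not empty

result = describe(defaultdict(str))
```

Dropping the assert lets the empty default flow through and yield a blank hostname and an "unknown" version instead of a traceback.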
Kefu Chai [Tue, 5 May 2026 01:35:00 +0000 (09:35 +0800)]
mgr: narrow get_metadata return type with @overload
Enable type narrowing for get_metadata() when a non-None default is
provided. Previously, the return type was always `Optional[Dict[str, str]]`,
forcing callers to use defensive `assert metadata` checks even when
a result was guaranteed.
The wrapper returns either the metadata from `_ceph_get_metadata()` or the
caller-supplied default. Providing an `@overload` allows type checkers to
prove the result is non-None, avoiding invalid assertions for falsy
defaults (like an empty defaultdict).
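A simplified sketch of the overload pattern described above. The module-level function, the _STORE table, and the _lookup helper are hypothetical stand-ins; the real method lives on the mgr module class and calls `_ceph_get_metadata()`.

```python
from typing import Dict, Optional, overload

# Hypothetical stand-in for the daemon metadata the mgr would hold.
_STORE: Dict[tuple, Dict[str, str]] = {("osd", "0"): {"hostname": "host-a"}}


def _lookup(svc_type: str, svc_id: str) -> Optional[Dict[str, str]]:
    return _STORE.get((svc_type, svc_id))


@overload
def get_metadata(svc_type: str, svc_id: str) -> Optional[Dict[str, str]]: ...
@overload
def get_metadata(svc_type: str, svc_id: str,
                 default: Dict[str, str]) -> Dict[str, str]: ...


def get_metadata(svc_type: str, svc_id: str,
                 default: Optional[Dict[str, str]] = None
                 ) -> Optional[Dict[str, str]]:
    # Return stored metadata, or the caller-supplied default if absent.
    metadata = _lookup(svc_type, svc_id)
    if metadata is None:
        return default
    return metadata


# With a non-None default, a type checker can infer Dict[str, str],
# so callers need no `assert metadata` before indexing.
missing = get_metadata("osd", "9", {})
```

At runtime only the final definition is used; the two `@overload` stubs exist purely for the type checker.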