Patrick Donnelly [Mon, 23 Feb 2026 15:29:13 +0000 (10:29 -0500)]
Merge PR #67135 into main
* refs/pull/67135/head:
pybind: remove compile_time_env parameter from setup.py files
pybind/rados,rgw: replace Tempita errno checks with C preprocessor
pybind/cephfs: replace deprecated IF with C preprocessor macro
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
cls/rgw_gc/cls_rgw_gc: read config via cls_get_config
Commit https://github.com/ceph/ceph/commit/3877c1e37f2fa4e1574b57f05132288f210835a7
added a new way for CLS code to gain access to the global configuration (`g_ceph_context`).
The `cls_rgw_gc_queue_init` method does not use the new `cls_get_config` CLS call
but instead uses `g_ceph_context` directly.
The Crimson OSD implementation does **not** support `g_ceph_context`, which results in a
SIGSEGV crash due to a null-pointer access. Switching to `cls_get_config`, as `cls_rgw.cc`
already does, allows both OSD implementations to access the configuration safely.
This approach is well-defined because of the two orthogonal implementations of objclass.cc:
the classical OSD uses `src/osd/objclass.cc`, while the Crimson OSD uses `src/crimson/osd/objclass.cc`.
mgr/DaemonServer: Re-order OSDs in crush bucket to maximize OSDs for upgrade
DaemonServer::_maximize_ok_to_upgrade_set() attempts to find which OSDs
from the initial set found as part of _populate_crush_bucket_osds() can be
upgraded as part of the initial phase. If the initial set results in failure,
the convergence logic trims the 'to_upgrade' vector from the end until a safe
set is found.
Therefore, it is advantageous to sort the OSDs by the ascending number
of PGs they host. By placing the OSDs with the fewest (or no) PGs at the
beginning of the vector, the trim logic along with _check_offlines_pgs() will
have the best chance of finding OSDs to upgrade as it approaches a grouping
of OSDs that have few or no PGs.
To achieve the above, a temporary vector of struct pgs_per_osd is created and
sorted for a given crush bucket. The sorted OSDs are pushed to the main
crush_bucket_osds that is eventually used to run the _check_offlines_pgs()
logic to find a safe set of OSDs to upgrade.
pgmap is passed to _populate_crush_bucket_osds() to utilize get_num_pg_by_osd()
for the above logic to work.
Implement a new Mgr command called 'ok-to-upgrade' that returns a set of OSDs
within the provided CRUSH bucket that are safe to upgrade without reducing
immediate data availability.
The command accepts the following as input:
- CRUSH bucket name (required)
- The CRUSH bucket type is limited to 'rack', 'chassis', 'host' and 'osd'.
This is to prevent users from specifying a bucket type higher up the tree
which could result in performance issues if the number of OSDs in the
bucket is very high.
- The new Ceph version to check against. The accepted format is the short
form of the Ceph version, e.g. 20.3.0-3803-g63ca1ffb5a2. (required)
- The maximum number of OSDs to consider. (optional)
Implementation Details:
After sanity checks on the provided parameters, the following steps are
performed:
1. The set of OSDs within the CRUSH bucket is first determined.
2. From the main set of OSDs, a filtered set of OSDs not yet running the new
Ceph version is created.
- For this purpose, the OSD's 'ceph_version_short' string is read from
the metadata via a new method, DaemonServer::get_osd_metadata(). The
information is obtained from the DaemonStatePtr maintained within the
DaemonServer.
3. If all OSDs are already running the new Ceph version, a success report is
generated and returned.
4. If OSDs are not running the new Ceph version, a new set (to_upgrade) is
created.
5. If an OSD's current version cannot be determined, an error is logged and the
output report is generated with the 'bad_no_version' field populated with the
OSD in question.
6. On the new set (to_upgrade), the existing logic in _check_offlines_pgs() is
executed to see if stopping any or all OSDs in the set as part of the upgrade
would reduce immediate data availability.
- If data availability is impacted, then the number of OSDs in the filtered
set is reduced by a factor defined by a new config option called
'mgr_osd_upgrade_check_convergence_factor' which is set to 0.8 by default.
- The logic in _check_offlines_pgs() is repeated for the new set.
- The above is repeated until a safe subset of OSDs that can be stopped for
upgrade is found. Each iteration reduces the number of OSDs to check by
the convergence factor mentioned above.
7. It must be noted that the default value of
'mgr_osd_upgrade_check_convergence_factor' is on the higher side in order to
help determine an optimal set of OSDs to upgrade. In other words, a higher
convergence factor would help maximize the number of OSDs to upgrade. In this
case, the number of iterations and therefore the time taken to determine the
OSDs to upgrade is proportional to the number of OSDs in the CRUSH bucket.
The converse is true if a lower convergence factor is used.
8. If the number of OSDs determined is lower than the 'max' specified, then an
additional loop is executed to determine if other children of the CRUSH
bucket can be added to the existing set.
9. Once a viable set is determined, an output report similar to the following is
generated:
A standalone test is introduced that exercises the logic for both replicated
and erasure-coded pools by manipulating the min_size for a pool and checking
for upgradability. The test also performs other basic sanity checks and
exercises error conditions.
The output shown below is for a cluster running on a single node with 10 OSDs
and with replicated pool configuration:
mgr/DaemonServer: Modify offline_pg_report to handle set or vector types
The offline_pg_report structure to be used by both the 'ok-to-stop' and
'ok-to-upgrade' commands is modified to handle either std::set or std::vector
type containers. This is necessitated by differences in the way the two
commands work. For the 'ok-to-upgrade' command logic to work optimally,
the items in the specified crush bucket, including items found in the subtree,
must be strictly ordered. The earlier std::set container re-orders the items
upon insertion by sorting them, which causes the offline PG check to
report sub-optimal results.
Therefore, the offline_pg_report struct is modified to use
std::variant<std::vector<int>, std::set<int>> as a ContainerType and handled
accordingly in dump() using std::visit(). This ensures backward compatibility
with the existing 'ok-to-stop' command while catering to the requirements of
the new 'ok-to-upgrade' command.
Kotresh HR [Sun, 15 Feb 2026 18:41:51 +0000 (00:11 +0530)]
tools/cephfs_mirror: Fix lock order issue
Lock order 1:
InstanceWatcher::m_lock ----> FSMirror::m_lock
Lock order 2:
FSMirror::m_lock -----> InstanceWatcher::m_lock
Lock order 1 is where the abort happens, during blocklisting:
InstanceWatcher::handle_rewatch_complete() acquires
InstanceWatcher::m_lock and calls m_elistener.set_blocklisted_ts(),
which tries to acquire FSMirror::m_lock.
Lock order 2 exists in the mirror peer status command:
FSMirror::mirror_status(Formatter *f) takes FSMirror::m_lock
and calls is_blocklisted(), which takes InstanceWatcher::m_lock.
Fix:
FSMirror::m_blocklisted_ts and FSMirror::m_failed_ts are converted
to std::atomic, and the scope of m_lock is fixed in
InstanceWatcher::handle_rewatch_complete() and
MirrorWatcher::handle_rewatch_complete().
Look at the tracker for traceback and further details.
Kotresh HR [Sat, 21 Feb 2026 15:55:30 +0000 (21:25 +0530)]
tools/cephfs_mirror: Do remote fs sync once instead of fsync on each fd
Do the remote fs sync once just before taking the snapshot,
as it is faster than doing an fsync on each fd after
file copy.
Moreover, all the datasync threads use the same single libcephfs
connection, and doing ceph_fsync concurrently on different fds over
a single libcephfs connection can cause a hang, as observed in
testing (see below). This issue is tracked at
https://tracker.ceph.com/issues/75070
-----
Thread 2 (Thread 0xffff644cc400 (LWP 74020) "d_replayer-0"):
0 0x0000ffff8e82656c in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
1 0x0000ffff8e828ff0 [PAC] in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libc.so.6
2 0x0000ffff8fc90fd4 [PAC] in ceph::condition_variable_debug::wait ...
3 0x0000ffff9080fc9c in ceph::condition_variable_debug::wait<Client::wait_on_context_list ...
4 Client::wait_on_context_list ... at /lsandbox/upstream/ceph/src/client/Client.cc:4540
5 0x0000ffff9083fae8 in Client::_fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13299
6 0x0000ffff90840278 in Client::_fsync ...
7 0x0000ffff90840514 in Client::fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13042
8 0x0000ffff907f06e0 in ceph_fsync ... at /lsandbox/upstream/ceph/src/libcephfs.cc:316
9 0x0000aaaaad5b2f88 in cephfs::mirror::PeerReplayer::copy_to_remote ...
----
Kotresh HR [Sun, 15 Feb 2026 09:37:09 +0000 (15:07 +0530)]
tools/cephfs_mirror: Don't use blockdiff on smaller files
Introduce a new configuration option,
'cephfs_mirror_blockdiff_min_file_size', to control the minimum file
size above which block-level diff is used during CephFS mirroring.
Files smaller than the configured threshold are synchronized using
full file copy, while larger files attempt block-level delta sync.
This provides better flexibility across environments with varying
file size distributions and performance constraints.
The default value is set to 16_M (16 MiB). The value is read once
at the beginning of every snapshot sync.
Kotresh HR [Sat, 21 Feb 2026 15:40:08 +0000 (21:10 +0530)]
tools/cephfs_mirror: Handle shutdown/blocklist/cancel at syncm dataq wait
1. Add is_stopping() predicate at sdq_cv wait
2. Use the existing should_backoff() routine to check for
shutdown/blocklist/cancel and set the corresponding errors.
3. Handle notify logic at the end
4. In shutdown(), notify all syncm's sdq_cv wait
Kotresh HR [Sun, 22 Feb 2026 18:10:32 +0000 (23:40 +0530)]
tools/cephfs_mirror: Handle shutdown/blocklist at syncm_q wait
1. Convert the smq_cv.wait to a timed wait, since blocklisting has no
predicate to evaluate; evaluate is_shutdown() as the predicate.
When either of the two is true, set the corresponding error and
backoff flag in all the syncm objects. The last data sync thread
wakes up all the crawler threads. This is necessary to wake up
the crawler threads whose data queue was not picked up by any
datasync thread.
2. In shutdown(), change the join order: join the datasync threads
first. The idea is to kill the datasync threads before the crawler
threads, since the datasync threads are an extension of the crawler
threads and the reverse order might cause issues. Also wake up the
smq_cv wait on shutdown.
Kotresh HR [Sat, 21 Feb 2026 14:06:39 +0000 (19:36 +0530)]
tools/cephfs_mirror: Monitor num of active datasync threads
Introduce an atomic counter in PeerReplayer to track the number of
active SnapshotDataSyncThread instances.
The counter is incremented when a datasync thread enters its entry()
function and decremented automatically on exit via a small RAII guard
(DataSyncThreadGuard). This ensures accurate accounting even in the
presence of early returns or future refactoring.
This change helps in handling of shutdown and blocklist scenarios.
At the time of shutdown or blocklisting, datasync threads may still
be processing multiple jobs across different SyncMechanism instances.
It is therefore essential that only the final exiting datasync thread
performs the notifications for all relevant waiters, including the
syncm data queue, syncm queue, and m_cond.
This approach ensures orderly teardown by keeping crawler threads
active until all datasync threads have completed execution.
Terminating crawler threads prematurely, before the datasync threads have
exited, can lead to inconsistencies, as crawler threads deregister the
mirroring directory while datasync threads may still be accessing it.
Kotresh HR [Sat, 21 Feb 2026 14:03:39 +0000 (19:33 +0530)]
tools/cephfs_mirror: Store a reference of PeerReplayer object in SyncMechanism
Store a reference to the PeerReplayer object in SyncMechanism.
This allows the SyncMechanism object to call functions of PeerReplayer.
This is required in multiple places, such as handling
shutdown/blocklist/cancel, where should_backoff() needs to be
called by the syncm object while the data sync threads pop the dataq.
Kotresh HR [Sun, 15 Feb 2026 03:09:54 +0000 (08:39 +0530)]
tools/cephfs_mirror: Make PeerReplayer::m_stopping atomic
Make PeerReplayer::m_stopping a std::atomic and make it
independent of m_lock. This allows 'm_stopping' to be used
as a predicate in any conditional wait that doesn't use
m_lock.
Kotresh HR [Sat, 21 Feb 2026 13:51:02 +0000 (19:21 +0530)]
tools/cephfs_mirror: Fix assert while opening handles
Issue:
When the crawler or a datasync thread encounters an error,
it is possible that the crawler gets notified by a datasync
thread and bails out, resulting in the unregistration of the
particular dir_root. The other datasync threads might
still hold the same syncm object and try to open the
handles, at which point the following assert is hit.
ceph_assert(it != m_registered.end());
Cause:
This happens because the in_flight counter in the syncm object
only tracked whether a thread was processing an actual job from
the data queue.
Fix:
Make the in_flight counter in the syncm object track active
references to the syncm object, i.e., increment it as soon as a
datasync thread gets a reference to the object and decrement it
when the reference is dropped.
Kotresh HR [Sat, 21 Feb 2026 10:36:31 +0000 (16:06 +0530)]
tools/cephfs_mirror: Fix dequeue of syncm on error
When an error is encountered in the crawler thread or a datasync
thread while processing a syncm object, it is possible
that multiple datasync threads attempt to dequeue the
syncm object. Though this is safe, add a condition to avoid
it.
Kotresh HR [Sat, 21 Feb 2026 10:27:42 +0000 (15:57 +0530)]
tools/cephfs_mirror: Handle errors in crawler thread
Any error encountered in crawler threads should be
communicated to the data sync threads by marking the
crawl error in the corresponding syncm object. The
data sync threads would finish pending jobs, dequeue
the syncm object and notify crawler to bail out.
Kotresh HR [Sat, 21 Feb 2026 10:18:56 +0000 (15:48 +0530)]
tools/cephfs_mirror: Handle error in datasync thread
On any error encountered in the datasync threads while syncing
a particular syncm dataq, mark the datasync error and
communicate it to the corresponding syncm's crawler,
which is waiting to take a snapshot. The crawler will log
the error and bail out.
There is a global queue of SyncMechanism objects (syncm). Each syncm
object represents a single snapshot being synced, and each syncm
object owns m_sync_dataq, the list of files in the snapshot
to be synced.
The data sync threads should consume the next syncm job
if the present syncm has no pending work. This can evidently
happen if the last file being synced in the present syncm
job is a large file from its syncm_dataq. In this case, one
data sync thread is busy syncing the large file while the rest of the
data sync threads just wait for it to finish to avoid a busy loop.
Instead, the idle data sync threads could start consuming the next
syncm job.
This brings a change to the data structure.
- syncm_q has to be a std::deque instead of a std::queue, as a syncm in
the middle can finish syncing first and needs to be removed before
the front element.
Afreen Misbah [Mon, 16 Feb 2026 13:57:24 +0000 (19:27 +0530)]
mgr/dashboard: Add health check panel
Fixes https://tracker.ceph.com/issues/74958
- adds health check panel in the overview dashboard
- updates tests
- refactors component as per modern Angular convention
- using onPush CDS in Overview component
- using view model pattern to aggregate data for rendering
Kotresh HR [Sat, 21 Feb 2026 08:28:47 +0000 (13:58 +0530)]
tools/cephfs_mirror: Synchronize taking snapshot
The crawler/entry-creation thread needs to wait until
all the data is synced by the datasync threads before taking
the snapshot. This patch adds the necessary conditions
for this.
It is important for the conditional flag to be part
of SyncMechanism and not part of the PeerReplayer class.
The following bug would be hit if it were part of the
PeerReplayer class.
When multiple directories are configured for mirroring as below
/d0 /d1 /d2
Crawler1 Crawler2 Crawler3
DoneEntryOps DoneEntryOps DoneEntryOps
WaitForSafeSnap WaitForSafeSnap WaitForSafeSnap
When all crawler threads are waiting as above, the data sync threads
that are done processing /d1 would notify, waking up all the crawlers
and causing spurious/unwanted wake-ups and half-baked snapshots.
Kotresh HR [Sat, 21 Feb 2026 08:18:15 +0000 (13:48 +0530)]
tools/cephfs_mirror: Fix data sync threads completion logic
We need to know exactly when all data sync threads complete
the processing of a syncm. If a few threads finish the
job, they all need to wait for the threads still processing
that syncm to complete; otherwise the finished threads
would busy-loop until the processing threads finish.
Only after all threads finish processing can the crawler
thread be notified to take the snapshot.
Kotresh HR [Tue, 9 Dec 2025 10:05:08 +0000 (15:35 +0530)]
tools/cephfs_mirror: Mark crawl finished
After entry operations are synced and stack is empty,
mark the crawl as finished so that the data sync threads'
wait logic works correctly and doesn't wait indefinitely.
Kotresh HR [Wed, 14 Jan 2026 08:47:07 +0000 (14:17 +0530)]
tools/cephfs_mirror: Add SyncMechanism Queue
Add a queue of shared_ptr of type SyncMechanism.
Since it holds shared_ptrs, the queue can hold
both RemoteSync and SnapDiffSync objects.
Each SyncMechanism holds the queue of SyncEntry
items to be synced by the data sync threads.
The SyncMechanism queue needs to hold shared_ptrs because
all the data sync threads need to access the SyncMechanism
object to process the SyncEntry queue.
This patch sets up the building blocks for the same.
Kotresh HR [Wed, 14 Jan 2026 08:27:34 +0000 (13:57 +0530)]
tools/cephfs_mirror: Use the existing m_lock and m_cond
The entire snapshot is synced outside the lock.
The m_lock and m_cond pair is used by the data sync
threads along with the crawler threads to work well
with all terminal conditions, like shutdown, and with
the existing data structures.
Ramana Raja [Wed, 24 Dec 2025 10:24:50 +0000 (05:24 -0500)]
mgr/rbd_support: Fix "start-time" arg behavior
The "start-time" argument, optionally passed when adding or removing an
mirror image snapshot schedule or a trash purge schedule, does not
behave as intended. It is meant to schedule an initial operation at a
specific time of day in a given time zone. Instead, it offsets the
schedule’s anchor time. By default, the scheduler uses the UNIX epoch as
the anchor to calculate recurring schedule times, and "start-time"
simply shifts this anchor away from UTC, which can confuse users. For
example:
```
$ # current time
$ date --universal
Wed Dec 10 05:55:21 PM UTC 2025
$ rbd mirror snapshot schedule add -p data --image img1 1h 19:00Z
$ rbd mirror snapshot schedule ls -p data --image img1
every 15m starting at 19:00:00+00:00
```
A user might assume that the scheduler will run the first snapshot each
day at 19:00 UTC and then run snapshots every 15 minutes. Instead, the
scheduler runs the first snapshot at 18:00 UTC and then continues at the
configured interval:
```
$ rbd mirror snapshot schedule status -p data --image img1
SCHEDULE TIME IMAGE
2025-12-10 18:00:00 data/img1
```
Additionally, the "start-time" argument accepts a full ISO 8601
timestamp but silently ignores everything except hour, minute, and time
zone. Even time zone handling is incorrect: specifying "23:00-01:00"
with an interval of "1d" results in a snapshot taken once per day at
22:00 UTC rather than 00:00 UTC, because only utcoffset.seconds is used
while utcoffset.days is ignored.
Fix:
Similar to the handling of the "start" argument in the FS snap-schedule
manager module, require "start-time" to use an ISO 8601 date-time format
with a mandatory date component. Time and time zone are optional and
default to 00:00 and UTC respectively.
The "start-time" now defines the anchor time used to compute recurring
schedule times. The default anchor remains the UNIX epoch. Existing
on-disk schedules with legacy-format "start-time" values are updated to
include the date Jan 1, 1970.
The `snap schedule ls` output now displays "start-time" with date and
time in the format "%Y-%m-%d %H:%M:00". The display time is in UTC.
Fixes: https://tracker.ceph.com/issues/74192
Signed-off-by: Ramana Raja <rraja@redhat.com>
common/options: change osd_target_transaction_size from int to uint
Change osd_target_transaction_size from signed int to unsigned int to
match the return type of Transaction::get_num_opts() (ceph_le64).
This change:
- Eliminates compiler warnings when comparing signed/unsigned values
- Enables automatic size conversion (e.g., "4_K" → 4096) via y2c.py
for improved administrator usability
- Maintains type consistency throughout the codebase
Kefu Chai [Tue, 17 Feb 2026 11:41:32 +0000 (19:41 +0800)]
librbd/pwl: fix memory leaks in discard operations
Fix memory leak in librbd persistent write log (PWL) cache discard
operations by properly completing request objects.
ASan reported the following leaks in unittest_librbd:
Direct leak of 240 byte(s) in 1 object(s) allocated from:
#0 operator new(unsigned long)
#1 librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::discard(...)
/ceph/src/librbd/cache/pwl/AbstractWriteLog.cc:935:5
#2 TestMockCacheReplicatedWriteLog_discard_Test::TestBody()
/ceph/src/test/librbd/cache/pwl/test_mock_ReplicatedWriteLog.cc:534:7
Plus multiple indirect leaks totaling 2,076 bytes through the
shared_ptr reference chain.
Root cause:
C_DiscardRequest objects were never deleted because their complete()
method was never called. The on_write_persist callback released the
BlockGuard cell but didn't call complete() to trigger self-deletion.
Write requests use WriteLogOperationSet which takes the request as
its on_finish callback, ensuring complete() is eventually called.
Discard requests don't use WriteLogOperationSet and must explicitly
call complete() in their on_write_persist callback.
Solution:
Call discard_req->complete(r) in the on_write_persist callback and
move cell release into finish_req() -- mirroring how C_WriteRequest
handles it. The complete() -> finish() -> finish_req() chain ensures
the cell is released after the user request is completed, preserving
the same ordering as write requests.
Test results:
- Before: 2,316 bytes leaked in 15 allocations
- After: 0 bytes leaked
- unittest_librbd discard tests pass successfully with ASan
os/Transaction: change get_num_ops() return type to uint64_t
Change Transaction::get_num_ops() to return uint64_t instead of int
to match the underlying data.ops type (ceph_le<__u64>) and eliminate
compiler warnings about signed/unsigned comparison.
Fixes warning in ECTransaction.cc:
```
/home/kefu/dev/ceph/src/osd/ECTransaction.cc: In constructor ‘ECTransaction::Generate::Generate(PGTransaction&, ceph::ErasureCodeInterfaceRef&, pg_t&, const ECUtil::stripe_info_t&, const std::map<hobject_t, ECUtil::shard_extent_map_t>&, std::map<hobject_t, ECUtil::shard_extent_map_t>*, shard_id_map<ceph::os::Transaction>&, const OSDMapRef&, const hobject_t&, PGTransaction::ObjectOperation&, ECTransaction::WritePlanObj&, DoutPrefixProvider*, pg_log_entry_t*)’:
/home/kefu/dev/ceph/src/osd/ECTransaction.cc:589:25: warning: comparison of integer expressions of different signedness: ‘int’ and ‘__gnu_cxx::__alloc_traits<std::allocator<unsigned int>, unsigned int>::value_type’ {aka ‘unsigned int’} [-Wsign-compare]
589 | if (t.get_num_ops() > old_transaction_counts[int(shard)] &&
```
Imran Imtiaz [Fri, 20 Feb 2026 10:57:15 +0000 (10:57 +0000)]
mgr/dashboard: add schedule_level to image API for pool/cluster snapshot schedule
Add optional schedule_level param (image|pool|cluster) to
PUT /api/block/image/{image_spec}. Removes more-specific schedules
before setting at the chosen level. Backward compatible when omitted.
Fixes: https://tracker.ceph.com/issues/75043
Assisted-by: Cursor AI
Signed-off-by: Imran Imtiaz <imran.imtiaz@uk.ibm.com>