Kotresh HR [Sat, 21 Feb 2026 13:51:02 +0000 (19:21 +0530)]
tools/cephfs_mirror: Fix assert while opening handles
Issue:
When the crawler or a datasync thread encountered an error,
it's possible that the crawler gets notified by a datasync
thread and bails out resulting in the unregister of the
particular dir_root. The other datasync threads might
still hold the same syncm object and tries to open the
handles during which the following assert is hit.
ceph_assert(it != m_registered.end());
Cause:
This happens because the in_flight counter in syncm object
was tracking if it's processing the actual job from the data
queue.
Fix:
Make in_flight counter in syncm object to track the active
syncm object i.e, inrement as soon as the datasync thread
get a reference to it and decrement when it goes out of
reference.
Kotresh HR [Sat, 21 Feb 2026 10:36:31 +0000 (16:06 +0530)]
tools/cephfs_mirror: Fix dequeue of syncm on error
On error encountered in crawler thread or datasync
thread while processing a syncm object, it's possible
that multiple datasync threads attempts the dequeue of
syncm object. Though it's safe, add a condition to avoid
it.
Kotresh HR [Sat, 21 Feb 2026 10:27:42 +0000 (15:57 +0530)]
tools/cephfs_mirror: Handle errors in crawler thread
Any error encountered in crawler threads should be
communicated to the data sync threads by marking the
crawl error in the corresponding syncm object. The
data sync threads would finish pending jobs, dequeue
the syncm object and notify crawler to bail out.
Kotresh HR [Sat, 21 Feb 2026 10:18:56 +0000 (15:48 +0530)]
tools/cephfs_mirror: Handle error in datasync thread
On any error encountered in datasync threads while syncing
a particular syncm dataq, mark the datasync error and
communicate the error to the corresponding syncm's crawler
which is waiting to take a snaphsot. The crawler will log
the error and bail out.
There is global queue of SyncMechanism objects(syncm). Each syncm
object represents a single snapshot being synced and each syncm
object owns m_sync_dataq representing list of files in the snapshot
to be synced.
The data sync threads should consume the next syncm job
if the present syncm has no pending work. This can evidently
happen if the last file being synced in the present syncm
job is a large file from it's syncm_dataq. In this case, one
data sync thread is busy syncing the large file, the rest of
data sync threads just wait for it to finish to avoid busy loop.
Instead, the idle data sync threads could start consuming the next
syncm job.
This brings in a change to data structure.
- syncm_q has to be std::deque instead of std::queue as syncm in the
middle can finish syncing first and that needs to be removed before
the front
Kotresh HR [Sat, 21 Feb 2026 08:28:47 +0000 (13:58 +0530)]
tools/cephfs_mirror: Synchronize taking snapshot
The crawler/entry creation thread needs to wait until
all the data is synced by datasync threads to take
the snapshot. This patch adds the necessary conditions
for the same.
It is important for the conditional flag to be part
of SyncMechanism and not part of PeerReplayer class.
The following bug would be hit if it were part of
PeerReplayer class.
When multiple directories are confiugred for mirroring as below
/d0 /d1 /d2
Crawler1 Crawler2 Crawler3
DoneEntryOps DoneEntryOps DoneEntryOps
WaitForSafeSnap WaitForSafeSnap WaitForSafeSnap
When all crawler threads are waiting at above, the data sync threads
which is done processing /d1, would notify, waking up all the crawlers
causing spurious/unwanted wake up and half baked snapshots.
Kotresh HR [Sat, 21 Feb 2026 08:18:15 +0000 (13:48 +0530)]
tools/cephfs_mirror: Fix data sync threads completion logic
We need to exactly know when all data threads completes
the processing of a syncm. If a few threads finishes the
job, they all need to wait for the in processing threads
of that syncm to complete. Otherwise the finished threads
would be busy loop until in processing threads finishes.
And only after all threads finishes processing, the crawler
thread can be notified to take the snapshot.
Kotresh HR [Tue, 9 Dec 2025 10:05:08 +0000 (15:35 +0530)]
tools/cephfs_mirror: Mark crawl finished
After entry operations are synced and stack is empty,
mark the crawl as finished so the data sync threads'
wait logic works correctly and doesn't indefinitely wait.
Kotresh HR [Wed, 14 Jan 2026 08:47:07 +0000 (14:17 +0530)]
tools/cephfs_mirror: Add SyncMechanism Queue
Add a queue of shared_ptr of type SyncMechanism.
Since it's shared_ptr, the queue can hold both
shared_ptr to both RemoteSync and SnapDiffSync objects.
Each SyncMechanism holds the queue for the SyncEntry
items to be synced using the data sync threads.
The SyncMechanism queue needs to be shared_ptr because
all the data sync threads needs to access the object
of SyncMechanism to process the SyncEntry Queue.
This patch sets up the building blocks for the same.
Kotresh HR [Wed, 14 Jan 2026 08:27:34 +0000 (13:57 +0530)]
tools/cephfs_mirror: Use the existing m_lock and m_cond
The entire snapshot is synced outside the lock.
The m_lock and m_cond pair is used for data sync
threads along with crawler threads to work well
with all terminal conditions like shutdown and
existing data structures.
Ramana Raja [Wed, 24 Dec 2025 10:24:50 +0000 (05:24 -0500)]
mgr/rbd_support: Fix "start-time" arg behavior
The "start-time" argument, optionally passed when adding or removing an
mirror image snapshot schedule or a trash purge schedule, does not
behave as intended. It is meant to schedule an initial operation at a
specific time of day in a given time zone. Instead, it offsets the
schedule’s anchor time. By default, the scheduler uses the UNIX epoch as
the anchor to calculate recurring schedule times, and "start-time"
simply shifts this anchor away from UTC, which can confuse users. For
example:
```
$ # current time
$ date --universal
Wed Dec 10 05:55:21 PM UTC 2025
$ rbd mirror snapshot schedule add -p data --image img1 1h 19:00Z
$ rbd mirror snapshot schedule ls -p data --image img1
every 15m starting at 19:00:00+00:00
```
A user might assume that the scheduler will run the first snapshot each
day at 19:00 UTC and then run snapshots every 15 minutes. Instead, the
scheduler runs the first snapshot at 18:00 UTC and then continues at the
configured interval:
```
$ rbd mirror snapshot schedule status -p data --image img1
SCHEDULE TIME IMAGE
2025-12-10 18:00:00 data/img1
```
Additionally, the "start-time" argument accepts a full ISO 8601
timestamp but silently ignores everything except hour, minute, and time
zone. Even time zone handling is incorrect: specifying "23:00-01:00"
with an interval of "1d" results in a snapshot taken once per day at
22:00 UTC rather than 00:00 UTC, because only utcoffset.seconds is used
while utcoffset.days is ignored.
Fix:
Similar to the handling of the "start" argument in the FS snap-schedule
manager module, require "start-time" to use an ISO 8601 date-time format
with a mandatory date component. Time and time zone are optional and
default to 00:00 and UTC respectively.
The "start-time" now defines the anchor time used to compute recurring
schedule times. The default anchor remains the UNIX epoch. Existing
on-disk schedules with legacy-format "start-time" values are updated to
include the date Jan 1, 1970.
The `snap schedule ls` output now displays "start-time" with date and
time in the format "%Y-%m-%d %H:%M:00". The display time is in UTC.
Fixes: https://tracker.ceph.com/issues/74192 Signed-off-by: Ramana Raja <rraja@redhat.com>
Github allows to add a instructions file to each repo
(.github/copilot-instructions.md) to improve the behavior
of Copilot Reviews and Agent.
These instructions can also be customized per path, filetype, etc.:
https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions
This commit was authored through a Github Agent session: https://github.com/ceph/ceph/tasks/edeca07b-eabd-477c-917a-a18e72a0e2c2
This commit makes it log the http error with the code and the reason
in sessionservice_discover() and log the error code along with the
body in query() for 5xx responses.
node-proxy: encapsulate send logic in dedicated method
Move the "send data to mgr when inventory changed" logic from main()
into a dedicated method _try_send_update().
This flattens the reporter loop and keeps main() to a single call under
the lock.
- use warning for bad request in the API,
when thread is not alive and for retry failure,
- use error for OOB load failure,
- use info for backoff interval,
- use debug in send attempts and for member fetch
this commit fixes mypy errors by adding explicit types for get_path
and get_* getters methods, extending SystemBackend with
start/shutdown and declaring _ca_temp_file on NodeProxyManager
node-proxy: split out config, bootstrap and redfish logic
refactor config, bootstrap, redfish layer, and monitoring:
this:
- adds a config module (CephadmCofnig, load_cephadm_config and
get_node_proxy_config) and protocols for api/reporter.
- extracts redfish logic to redfish.py
- adds a vendor registry with entrypoints.
- simplifies main() and NodeProxyManager().
This commit renames CONFIG to DEFAULTS and add load_config() with
deep merge, refactor Config to use path + defaults and makes
node-proxy config path configurable via bootstrap JSON or env.
node-proxy: introduce component spec registry and overrides for updates
This change introduces a single COMPONENT_SPECS dict and get_update_spec(component)
as the single source of truth for RedFish component update config (collection, path,
fields, attribute). To support hardware that uses different paths or attributes,
get_component_spec_overrides() allows overriding only those fields (via dataclasses.replace())
without duplicating the rest of the spec.
All _update_network, _update_power, etc. now call _run_update(component).
For instance, AtollonSystem uses this to set the power path to 'PowerSubsystem'.
mgr/cephadm: safe status/health access in node-proxy agent and inventory
This adds helpers in NodeProxyEndpoint and NodeProxyCache to safely
read status.health and status.state.
In NodeProxyEndpoint, methods _get_health_value() and _get_state_value()
are used in get_nok_members() to avoid KeyError on malformed data.
In NodeProxyCache, _get_health_value(), _has_health_value(),
_is_error_status(), and _is_unknown_status() are used in fullreport()
and when filtering 'non ok' members instead of accessing
status['status']['health'] inline.
node-proxy: narrow build_data exception handling and re-raise
With this commit, it catches only KeyError, TypeError, and
AttributeError in build_data() instead of Exception, and
re-raise after logging so callers get the actual error.
node-proxy: refactor Endpoint/EndpointMgr and fix chassis paths
This commit refactors EndpointMgr and Endpoint to use explicit dicts
instead of dynamic attributes. It also fixes member path filtering
so chassis endpoints use Chassis paths.
node-proxy: reduce log verbosity for missing optional fields
Change missing field logging from warning to debug level in
RedfishDellSystem, as missing optional fields can be expected behavior
and and doesn't require warning level logging.
Shraddha Agrawal [Wed, 11 Feb 2026 14:23:39 +0000 (19:53 +0530)]
cephadm: add tests for seastore support
This commits adds the following tests:
1. cephadm: JSON roundtrip of a spec with objecstore=seastore.
2. cephadm: validation checks for objecstore values.
3. cephadm to ceph-volume: cmd checks if objecstore=seastore is set.