Casey Bodley [Tue, 13 May 2025 13:42:32 +0000 (09:42 -0400)]
rgw/zone: remove duplicated startup logic in RGWSI_Zone
SiteConfig has already loaded the correct configuration without all of
the crazy search_realm_with_zone() stuff, which is now confused by
defaults. Remove all of this duplicated logic and rely on SiteConfig.
Removes the functions search_realm_with_zone(), create_default_zg() and
init_default_zone(), which are no longer used.
- regression from https://github.com/ceph/ceph/pull/64477/files
- removing frontend validations, as these values are volatile and require changes every release. Nvmeof is setting these and validating them as well.
Redouane Kachach [Fri, 14 Nov 2025 12:06:59 +0000 (13:06 +0100)]
mgr/cephadm: don't remove TLS certs if svc still has daemons on host
This change fixes an issue in cephadm where cephadm-signed (and
inline) TLS certificates could be removed while a service was still
running on the same host. During a rolling transition from HTTP to
HTTPS (e.g. RGW moving from port 80 to 443 with ssl: true), the
previous post_remove() logic deletes the service’s cephadm-signed
cert/key as soon as any daemon is removed, even if a new HTTPS daemon
for the same service is being deployed on that host. In practice this
leads to the certificate being created for the new daemon and then
immediately deleted from the certmgr store.
The new behavior makes post_remove() more conservative: before
removing a cephadm-signed or inline certificate, it checks whether
there are any remaining daemons for that service on the same host. If
there are, the cert/key is left in place because it may still be in
use (for example during an HTTP->HTTPS rollout). Certificates are only
cleaned up once the last daemon for that service disappears from the
host (and, when the service no longer uses SSL). This preserves
correct TLS behavior during service transitions while still
ensuring certificates are eventually garbage-collected when no longer
needed.
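The conservative check described above can be sketched roughly as
follows. This is a minimal illustration, not the actual cephadm code:
`DaemonInfo` and `can_remove_cert` are hypothetical names, and the
additional condition about the service no longer using SSL is left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DaemonInfo:
    """Hypothetical stand-in for a daemon record (not a cephadm class)."""
    service_name: str
    hostname: str


def can_remove_cert(service_name: str, hostname: str,
                    remaining: Iterable[DaemonInfo]) -> bool:
    """Return True only if no daemon of the service is left on the host."""
    return not any(
        d.service_name == service_name and d.hostname == hostname
        for d in remaining
    )


# The old HTTP daemon was removed, but the new HTTPS daemon of the same
# service is already present on host1, so the cert must stay there.
remaining = [DaemonInfo("rgw.foo", "host1")]
keep_on_host1 = not can_remove_cert("rgw.foo", "host1", remaining)
cleanup_on_host2 = can_remove_cert("rgw.foo", "host2", remaining)
```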
Aashish Sharma [Wed, 12 Nov 2025 10:57:58 +0000 (16:27 +0530)]
mgr/cephadm: Fix RGW zone endpoint auto-update logic in _update_rgw_endpoints method
Issue: The existing implementation does not re-attempt endpoint updates when no RGW daemons were found for a service or the daemon deployment is still in progress. The zone is being modified with an empty endpoint array in this case.
Fix: Added conditional checks to retry the update if no daemons are found.
We need to hide multi-cluster context-switcher from the downstream 9.0
clusters in case a cluster with an existing multi-cluster setup upgrades
to this version.
Redouane Kachach [Thu, 23 Oct 2025 11:10:49 +0000 (13:10 +0200)]
mgr/cephadm: add tombstones to persist certs info after mgr failover
Runtime-added TLS object names were lost across mgr restarts/failovers
since they existed only in memory. We now write a tombstone to the KV
store whenever a new certificate is registered (empty map for
service/host scope; minimal JSON for global), so the object name is
restored during load().
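A rough sketch of the tombstone idea, assuming a plain dict in place of
the mgr KV store. The class name, the key prefix and the shape of the
global-scope placeholder are all assumptions made for illustration; the
commit only states that an empty map or a minimal JSON value is written.

```python
import json
from typing import Dict, Set

KEY_PREFIX = "cert_store.cert."  # hypothetical key prefix


class CertNameRegistry:
    """Hypothetical registry that remembers TLS object names across restarts."""

    def __init__(self, kv_store: Dict[str, str]) -> None:
        self.kv_store = kv_store
        self.known_names: Set[str] = set()

    def register(self, name: str, is_global: bool) -> None:
        self.known_names.add(name)
        key = KEY_PREFIX + name
        if key not in self.kv_store:
            # Persist a placeholder so the name survives a failover.
            # The global-scope placeholder shape is an assumption.
            tombstone = {"placeholder": True} if is_global else {}
            self.kv_store[key] = json.dumps(tombstone)

    def load(self) -> None:
        # Restore every name that has an entry in the store.
        for key in self.kv_store:
            if key.startswith(KEY_PREFIX):
                self.known_names.add(key[len(KEY_PREFIX):])


kv: Dict[str, str] = {}
CertNameRegistry(kv).register("custom_cert", is_global=False)

# Simulate a mgr failover: a fresh registry sharing the same KV store.
fresh = CertNameRegistry(kv)
names_before_load = set(fresh.known_names)
fresh.load()
```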
Redouane Kachach [Thu, 23 Oct 2025 11:06:08 +0000 (13:06 +0200)]
mgr/cephadm: fix objects_by_names initialization + some improvements
This commit includes the following changes:
1. Fix objects_by_names initialization, as we were using a dict for all
cases, including global-scoped objects, which is not correct. For
those cases an empty TLSObject instance must be used.
2. Add sanity checks to the load() method to avoid loading incorrect
and malformed entries.
3. Add some helper functions to avoid code repetition.
While creating the service without providing the allowlist domain, the
UI fails with an error which is logged in the mgr log
```
Nov 05 04:11:56 ceph-node-00 ceph-mgr[1587]: [dashboard ERROR frontend.error] (https://192.168.100.100:8443/#/services/(modal:create)): Cannot read properties of null (reading 'split')
TypeError: Cannot read properties of null (reading 'split')
at ServiceFormComponent.onSubmit (https://192.168.100.100:8443/src_bootstrap_ts.js:31997:74)
at ServiceFormComponent_Template_cd_form_button_panel_submitActionEvent_60_listener (https://192.168.100.100:8443/src_bootstrap_ts.js:34168:83)
at executeListenerWithErrorHandling (https://192.168.100.100:8443/node_modules_angular_core_fesm2022_core_mjs.js:26276:12)
at Object.wrapListenerIn_markDirtyAndPreventDefault [as next] (https://192.168.100.100:8443/node_modules_angular_core_fesm2022_core_mjs.js:26308:18)
at SafeSubscriber.__tryOrUnsub (https://192.168.100.100:8443/default-node_modules_rxjs__esm2015_internal_AsyncSubject_js-node_modules_rxjs__esm2015_intern-7c6e1a.js:960:10)
at SafeSubscriber.next (https://192.168.100.100:8443/default-node_modules_rxjs__esm2015_internal_AsyncSubject_js-node_modules_rxjs__esm2015_intern-7c6e1a.js:900:14)
at Subscriber._next (https://192.168.100.100:8443/default-node_modules_rxjs__esm2015_internal_AsyncSubject_js-node_modules_rxjs__esm2015_intern-7c6e1a.js:847:22)
at Subscriber.next (https://192.168.100.100:8443/default-node_modules_rxjs__esm2015_internal_AsyncSubject_js-node_modules_rxjs__esm2015_intern-7c6e1a.js:824:12)
at EventEmitter_.next (https://192.168.100.100:8443/default-node_modules_rxjs__esm2015_internal_AsyncSubject_js-node_modules_rxjs__esm2015_intern-7c6e1a.js:604:17)
at EventEmitter_.emit (https://192.168.100.100:8443/node_modules_angular_core_fesm2022_core_mjs.js:7069:13)
```
Suyash Dongre [Wed, 20 Aug 2025 17:52:41 +0000 (23:22 +0530)]
Check if `HTTP_X_AMZ_COPY_SOURCE` header is empty
The issue was that the `HTTP_X_AMZ_COPY_SOURCE` header could be present but empty (i.e., an empty string rather than NULL). The code only checked if the pointer was not NULL, but didn't verify that the string had content. When an empty string was passed to RGWCopyObj::parse_copy_location(), it would eventually try to access name_str[0] on an empty string, causing a crash.
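The difference between "present" and "non-empty" can be illustrated
with a small Python analog. This is not the RGW C++ code: both function
names are hypothetical, and the leading-slash handling only stands in
for whatever parse_copy_location() does with name_str[0].

```python
from typing import Mapping, Optional


def copy_source_from_env(env: Mapping[str, str]) -> Optional[str]:
    """Return the copy source, treating a missing or empty header alike."""
    value = env.get("HTTP_X_AMZ_COPY_SOURCE")
    if not value:  # rejects both None and ""
        return None
    return value


def strip_leading_slash(name: str) -> str:
    # Indexing name[0] is only safe because callers pass a non-empty
    # string; on "" this would raise IndexError.
    return name[1:] if name[0] == "/" else name


source = copy_source_from_env({"HTTP_X_AMZ_COPY_SOURCE": "/bucket/key"})
location = strip_leading_slash(source) if source is not None else None
```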
Changes include:
Added styles in rh_overrides for btn-tertiary to fix the styles on the multisite page, and also added the `btn-group` class to make the buttons look like before.
Regression introduced by the previous PR https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/1323
Adam King [Wed, 29 Oct 2025 19:27:09 +0000 (15:27 -0400)]
cephadm: mount nvmeof conf under /src/
The current downstream nvmeof container builds for
9.0 seem to be using /src/ as the home directory for
the container rather than /remote-source/ceph-nvmeof/app/
This is effectively the reverse of the issue
seen in https://bugzilla.redhat.com/show_bug.cgi?id=2240588
Shweta Bhosale [Fri, 24 Oct 2025 11:00:16 +0000 (16:30 +0530)]
mgr/cephadm: For updating NFS backends in HAProxy, send a SIGHUP signal to reload the configuration instead of restarting
Fixes: https://tracker.ceph.com/issues/73633
Signed-off-by: Shweta Bhosale <Shweta.Bhosale1@ibm.com>
Resolves: rhbz#2401776
rgw/dedup: fixes an assertion failure from __snprintf_chk in fortified mode when handling dedup cluster shard token OIDs.
The issue stems from buffer size validation in string operations.
- remove 'Z' from rbd APIs, which now return `aware` timestamps
- `datetime.utcfromtimestamp` is deprecated, so use `datetime.fromtimestamp(timestamp, tz=timezone.utc)`, which returns an `aware` timestamp, and remove the 'Z'.
- similarly, `datetime.utcnow()` is deprecated; migrated to `datetime.now(timezone.utc)`
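A short sketch of the replacement calls, using only the standard
library. The epoch value 0 is just an example input; an aware timestamp
already carries its UTC offset when formatted, which is why a trailing
'Z' is no longer appended.

```python
from datetime import datetime, timedelta, timezone

timestamp = 0  # example input: seconds since the Unix epoch

# Replaces datetime.utcfromtimestamp(timestamp), which returned a naive value.
aware = datetime.fromtimestamp(timestamp, tz=timezone.utc)

# Replaces datetime.utcnow(), which also returned a naive value.
now = datetime.now(timezone.utc)

# The offset is part of the formatted string, so no 'Z' suffix is needed.
formatted = aware.isoformat()  # '1970-01-01T00:00:00+00:00'
```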
This commit refactors setup_metadata_devices into smaller helper methods.
It keeps the distinction between existing logical volumes and raw devices
explicit, centralizes tag handling and path assignment to make the
control flow obvious and separates responsibilities for checking, creating,
and tagging devices.
Adam King [Fri, 10 Oct 2025 20:46:03 +0000 (16:46 -0400)]
python-common/cryptotools: add funcs for call_home_agent crypto activities
So that cephadm and the call_home_agent modules aren't both attempting
to import cryptography libraries that cause https://tracker.ceph.com/issues/64213
Introduce a termination_grace_period field in service spec to define how long the
orchestrator should wait for a service to shut down gracefully before forcefully terminating it.
The value is plumbed mgr -> cephadm and written into 'unit.stop' as 'podman stop -t <N>'
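As a rough illustration of the last step, the stop line could be
rendered like this. `render_stop_command` and the container name are
hypothetical; only the `podman stop -t <N>` form comes from the commit
message.

```python
from typing import List, Optional


def render_stop_command(container_name: str,
                        termination_grace_period: Optional[int] = None) -> str:
    """Build the stop line, adding '-t <N>' only when a grace period is set."""
    cmd: List[str] = ["podman", "stop"]
    if termination_grace_period is not None:
        cmd += ["-t", str(termination_grace_period)]
    cmd.append(container_name)
    return " ".join(cmd)


stop_line = render_stop_command("ceph-rgw-foo", termination_grace_period=60)
default_line = render_stop_command("ceph-rgw-foo")
```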
mgr/cephadm: add the VIP to the internal mgmt-gateway cert SAN list
Include the VIP as part of the mgmt-gateway internal server
certificate SAN list when operating in HA mode. Otherwise
the communication between internal services might fail.
Yuval Lifshitz [Sun, 12 Oct 2025 14:14:36 +0000 (14:14 +0000)]
rgw/logging: fix race condition when name update returns ECANCELED
* when we get ECANCELED indication from the name set operation we should
bail out and not continue with the rollover
* this fix revealed a hidden bug where we do not check the existing temp
name when we do conf change cleanup (rollover)
Adam King [Fri, 10 Oct 2025 14:48:35 +0000 (10:48 -0400)]
mgr/orchestrator: stop passing "default_flow_style" flag to yaml dump
This seems to not be compatible with pyyaml 6.0
```
File "/lib/python3.12/site-packages/ceph/deployment/service_spec.py", line 1350, in __repr__
y = yaml.dump(cast(dict, self), default_flow_style=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib64/python3.12/site-packages/yaml/__init__.py", line 253, in dump
return dump_all([data], stream, Dumper=Dumper, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib64/python3.12/site-packages/yaml/__init__.py", line 241, in dump_all
dumper.represent(data)
File "/lib64/python3.12/site-packages/yaml/representer.py", line 28, in represent
self.serialize(node)
File "/lib64/python3.12/site-packages/yaml/serializer.py", line 54, in serialize
self.serialize_node(node, None, None)
File "/lib64/python3.12/site-packages/yaml/serializer.py", line 104, in serialize_node
self.emit(MappingStartEvent(alias, node.tag, implicit,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Prepared.__init__() got an unexpected keyword argument 'flow_style'
```
Dropping the flag didn't seem to cause any issues with making our specs look
readable in the logs or being able to round-trip specs
when using `ceph orch ls --export` (minus the known bug
around doing so with multi-line certs)
rgw/lc: At least wait for |rgw_lc_lock_max_time| while trying to fetch the lc-shard lock to get or update the bucket status.
Currently each lc worker tries for 1 second to get the lock on the lc_shard to decide which bucket to process, and again for 1 second to update the bucket status once the bucket has been lc processed. However, when multiple rgws are running lc, the shard is often locked by another lc worker, or rados is slow, so the lock is not acquired within 1 second. The worker then either skips processing the bucket or skips updating the bucket status, resulting in missed LC runs or a stale bucket status.
So in the worst case, when another lc worker is already processing a shard, wait for up to rgw_lc_lock_max_time to get the lock, since any given worker can hold a given shard for at most rgw_lc_lock_max_time.
Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
(cherry picked from commit 937ac626afd3bf443edf96aa177854e8eb291af5)
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
rgw/lc: if the bucket's last lc processing time is earlier than the start time of the current LC session, continue processing the bucket for LC even if the status is not in the initialized state.
Currently the logic inside expired_session() considers an LC session valid for almost 2-3 days, so for a bucket where the post-processing status update fails, the next lc session skips the bucket because expired_session() returns false, as it multiplies num_seconds_day * 2. Instead of hardcoding the logic to 2 days, store the start time of each lc session and compare the bucket update time with the lc start time; if the bucket process time is earlier than the current lc start time, the bucket can be processed, as the previous session has already expired.
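The comparison described above can be sketched as below. This is a
Python illustration, not the RGW implementation: the function name, the
boolean status flag and the example timestamps are assumptions.

```python
from datetime import datetime, timezone


def should_process_bucket(last_processed: datetime,
                          session_start: datetime,
                          status_initialized: bool) -> bool:
    """Process the bucket if it is initialized or was last handled before this session."""
    if status_initialized:
        return True
    # A bucket touched before the current session started belongs to an
    # expired session, so it is eligible again.
    return last_processed < session_start


session_start = datetime(2025, 1, 2, tzinfo=timezone.utc)
stale_bucket = datetime(2025, 1, 1, 0, 30, tzinfo=timezone.utc)
fresh_bucket = datetime(2025, 1, 2, 0, 5, tzinfo=timezone.utc)

process_stale = should_process_bucket(stale_bucket, session_start, False)
process_fresh = should_process_bucket(fresh_bucket, session_start, False)
```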
client: adjust `Fb` cap ref count check during synchronous fsync()
The cephfs client holds a ref on Fb caps when handing out a write delegation[0].
An fsync from a (Ganesha) client holding a write delegation will block indefinitely[1]
waiting for the cap ref for Fb to drop to 0, which will never happen until the
delegation is returned/recalled.
If an inode has been write delegated, adjust the cap reference count
check in fsync() accordingly.
Note: This only works for synchronous fsync() since `client_lock` is
held for the entire duration of the call (at least along the path leading
up to the reference count check). Asynchronous fsync() needs to be fixed
separately (as that can drop `client_lock`).
Nizamudeen A [Thu, 11 Sep 2025 05:29:47 +0000 (10:59 +0530)]
mgr/dashboard: improve search and pagination behavior
Add a throttle to the pagination cycle so that if you repeatedly try to
cycle through the pages, it increases the delay. Doing this because,
unlike search, the button click to change page is deliberate and the
first click on the button should respond immediately.
Another thing is that the search with a keyword stores every keystroke
typed in the search field and then, after the debounce interval, sends
all those requests one by one.
For example: if I type 222, it waits 1s for the
debounce timer and then sends a request to find the osd with id 2 first, then
again 2 and then again 2. Instead it should only send 222 at the end.
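The intended debounce behavior can be modeled with a small pure
function. This is a Python sketch of the idea only; the dashboard
frontend itself is not written in Python, and `debounce` plus the
example timings are assumptions.

```python
from typing import List, Tuple


def debounce(events: List[Tuple[float, str]], interval: float) -> List[str]:
    """Emit a value only if no newer event arrives within `interval` seconds.

    `events` is a time-ordered list of (timestamp, current field value).
    """
    emitted: List[str] = []
    for i, (t, value) in enumerate(events):
        is_last = i == len(events) - 1
        if is_last or events[i + 1][0] - t >= interval:
            emitted.append(value)
    return emitted


# Typing "222" quickly: three keystrokes within the 1s debounce window.
keystrokes = [(0.0, "2"), (0.2, "22"), (0.4, "222")]
requests = debounce(keystrokes, interval=1.0)  # ['222']
```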
Nizamudeen A [Thu, 11 Sep 2025 04:13:13 +0000 (09:43 +0530)]
mgr/dashboard: fix missing schedule interval in rbd API
Fetch the rbd image schedule interval through the rbd_support module's
schedule list command.
GET /api/rbd will have the following field per image:
```
"schedule_info": {
    "image": "rbd/rbd_1",
    "schedule_time": "2025-09-11 03:00:00",
    "schedule_interval": [
        {
            "interval": "5d",
            "start_time": null
        },
        {
            "interval": "3h",
            "start_time": null
        }
    ]
},
```
Also fixes the UI where the schedule interval was missing in the form and
disables editing of the schedule_interval.
Extended the same thing to the `GET /api/pool` endpoint.
Commit includes changes:
1) Renaming Topic to Notification destination
2) Renaming Tiering to Storage class
3) Renaming Users to User Management
4) fix storage class table refresh after delete
5) Also made changes to internal routing for topic and storage class
rgw/dedup: Grant dedup process full RGW permissions.
This is necessary to allow for the creation of intermediate SLAB objects on systems configured with Ceph authentication.
Fixes: https://tracker.ceph.com/issues/72894
Signed-off-by: Mark Kogan <mkogan@ibm.com>
Update PendingReleaseNotes
Co-authored-by: Yuval Lifshitz <yuvalif@yahoo.com>
Signed-off-by: Mark Kogan <31659604+mkogan1@users.noreply.github.com>
(cherry picked from commit dae572d50080609c77d7131cfc99b1fb3f16d31b)
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Resolves: rhbz#2393790
Rishabh Dave [Thu, 8 May 2025 15:05:39 +0000 (20:35 +0530)]
mgr/vol: keep clone source info even after cloning is finished
Instead of removing the information regarding source of a cloned
subvolume from the .meta file after the cloning has finished, keep it as
it is as the user may find it useful.
Justin Caratzas [Mon, 6 Oct 2025 23:25:44 +0000 (19:25 -0400)]
mgr/dashboard: add an option to control the dashboard crypto caller
Add a mgr config option `crypto_caller` that lets a ceph user override
the default behavior of using the remote crypto caller. Supported
values are `internal` and `remote`.
Justin Caratzas [Mon, 6 Oct 2025 23:25:44 +0000 (19:25 -0400)]
mgr/cephadm: always use the internal cryptocaller
The cephadm module needs to use the python cryptography module for ssh (via
asyncssh) and thus there's no need to use the remote crypto caller in
cephadm. Configure cephadm to always use the internal cryptocaller.
Justin Caratzas [Mon, 6 Oct 2025 23:25:44 +0000 (19:25 -0400)]
python-common/cryptotools: catch all failures to read cert
Previously, the internal crypto caller would catch (and convert) some
errors when reading the cert but not all cases. Move the logic to catch
the errors to a common location and do it once consistently.
Justin Caratzas [Mon, 6 Oct 2025 23:25:44 +0000 (19:25 -0400)]
python-common/cryptotools: unify and organize all endpoint functions
Lightly reorganize and make the "endpoint" functions in cryptotools.py more
consistent and uniform. Use small functions for input and output
handling so that the handling is done the same way throughout. Pass a
pre-constructed crypto caller via the args to the endpoint functions.
Make generating the private key its own named function rather than
one single (and only) function with overloaded behavior controlled by
a cli switch.
Justin Caratzas [Mon, 6 Oct 2025 23:25:44 +0000 (19:25 -0400)]
pybind/mgr: fix test case in test_tls.py
Why violate the typing in a test? mypy never noticed this because tests
are not type checked but there seems to be no need to turn a str into
bytes to pass to a function that is typed only as taking str!