Matthew N. Heler [Thu, 26 Feb 2026 01:03:56 +0000 (19:03 -0600)]
rgw: add RestoreStatus support to object listings
S3 clients can request restore status in listing responses through the
x-amz-optional-object-attributes header, but we had no support for it.
This stores the restore state in the bucket index so listings can
include <RestoreStatus> without having to read each object's attrs
individually.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Add script to test for CRUSH retry exhaustion in stretch mode with
2 datacenters. Tests unbiased stretch rules by running multiple
iterations of PG mappings and checking for collisions that exceed
the 50-try limit.
Also add --show-retry-exhaustion flag to crushtool to detect and
report when CRUSH mapping hits the maximum retry limit.
mgr/cephadm: replace md5_hash with FIPS-safe config_hash
Replace md5_hash() usages in cephadm dependency hashing with an
algorithm-agnostic config_hash() helper. config_hash() is backed by
SHA-256, making dependency hash generation unconditionally FIPS-safe
while preserving change-detection behavior.
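A minimal sketch of what a SHA-256-backed helper could look like. The signature and the JSON canonicalization are assumptions made for illustration; only the name config_hash() and the use of SHA-256 come from the commit message.

```python
import hashlib
import json
from typing import Any


def config_hash(data: Any) -> str:
    """Return a hex digest used only for change detection.

    Hypothetical signature: the real helper may accept different inputs.
    """
    if isinstance(data, bytes):
        payload = data
    elif isinstance(data, str):
        payload = data.encode("utf-8")
    else:
        # Assumption: structured input is serialized with sorted keys so
        # that equal content always produces the same digest.
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
    # SHA-256 is a FIPS-approved algorithm, unlike MD5.
    return hashlib.sha256(payload).hexdigest()
```

Callers only compare digests for equality, so swapping the algorithm keeps change detection intact as long as both sides of the comparison use the same helper.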
Ville Ojamo [Wed, 22 Apr 2026 06:51:34 +0000 (13:51 +0700)]
doc/rados: improve troubleshooting-mon.rst
Don't ceph tell mon_status and then claim it passes the help command.
Improve language and link to cephadm doc on asok usage. Add label and
note about accessing asok from the host in troubleshooting.rst.
Capitalize and use double backticks consistently.
Add some missing articles and other minor word changes.
Fix indentation.
Use ref and link definitions consistently, use automatic bold.
Use privileged prompts for CLI commands where necessary.
Remove spaces at end of lines and change tabs to four spaces.
Signed-off-by: Ville Ojamo <git2233+ceph@ojamo.eu>
Afreen Misbah [Fri, 27 Mar 2026 16:06:38 +0000 (21:36 +0530)]
mgr/dashboard: Add gray10 theme base color to all pages
- applies #f4f4f4 ($background) to all pages as the base page color
- previously the base page color was white
- also updates tabs/navs/tables CSS to adapt
- fixes some spacing in alerts tabs and nvmeof
Afreen Misbah [Thu, 26 Mar 2026 13:25:18 +0000 (18:55 +0530)]
mgr/dashboard: Remove tooltip and popover defaults
Fixes https://tracker.ceph.com/issues/75410
These defaults are not required: Carbon already applies a dark color to tooltips, and going forward we want to align with CDS.
If anything breaks, add the fix in the component that uses it.
The objectstore tool tests restart the OSDs without allowing enough
time for GC to run, which can lead to no-OOL-segments conditions on restart. This
adds a gc_before_restart option to the test config, which when set
to true will run crimson-objectstore-tool --op gc on each OSD
before restarting them.
crimson/tools/objectstore: add GC operation to crimson-objectstore-tool
This adds a GC operation to the crimson-objectstore-tool, allowing
us to trigger GC cycles on demand during testing. This will
help reduce segment pressure and avoid 'no-segments' conditions.
mgr/cephadm: Skip RDMA device check for NFS during upgrade
During an image upgrade, prepare_create runs on the asyncio event-loop
thread while an outer wait_async is active. Calling wait_async again for
cephadm list-rdma on that thread blocks the loop and can hang or time out.
Matthew Heler [Sun, 26 Apr 2026 21:00:44 +0000 (16:00 -0500)]
rgw/lc: drop per-bucket LC counters to PRIO_DEBUGONLY
mgr's perf-schema bridge silently drops labeled counters on the way
out, so shipping the per-bucket LC counters up through MgrReport just
costs ingest memory for data mgr can't expose anyway. ceph-exporter
already handles labeled counters via the daemon admin socket, so make
that the only path.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Joshua Blanch [Sat, 24 Jan 2026 16:53:14 +0000 (16:53 +0000)]
mgr/cephadm: remove SSH error logs from health detail when host is unreachable
The HostConnectionError exception includes verbose logs from asyncssh, which
create noise when looking at ceph health detail. This moves the SSH logs
to log.exception() and keeps them from appearing under `health detail`.
mgr/dashboard: use cephadm root CA for RGW SSL and improve error handling
Problem: Dashboard fails to access object pages when RGW is deployed with SSL using cephadm-signed certificates.
Root cause: RGW REST API connection fails with SSL certificate verification error because the cephadm root CA certificate that signed the RGW SSL certificates is not in the dashboard's trust store.
Code Fixes:
1. rgw_client.py:
Added _get_ssl_ca_bundle() which fetches the cephadm root CA certificate from the cert store and writes it atomically (via a temp file and os.replace) to a fixed path (/tmp/ceph-dashboard-ca/rgw-cephadm-root-ca.pem), returning the file path for SSL verification.
Notes:
- The file is written once per mgr process lifetime and reused by all RgwClient instances. On mgr restart it is refetched and overwritten.
- A dedicated subdirectory (/tmp/ceph-dashboard-ca/) is used because /tmp has the sticky bit set, which prevents os.replace from overwriting files owned by a different user.
2. rest_client.py
Fixed a secondary crash in handle_connection_error: when the initial SSL error occurred, the error handler itself crashed while processing the exception, because it assumed reason.args[0] was always a string, whereas for SSL errors it is an SSLError object.
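The two fixes above can be sketched as follows. This is an illustration under stated assumptions, not the dashboard code: write_file_atomically() and describe_reason() are hypothetical names, and only the temp-file-plus-os.replace approach and the reason.args[0] behavior come from the commit message.

```python
import os
import ssl
import tempfile


def write_file_atomically(path: str, data: bytes) -> str:
    """Write data to path through a temp file and os.replace (hypothetical helper)."""
    directory = os.path.dirname(path)
    # A dedicated subdirectory avoids the sticky-bit restriction on /tmp.
    os.makedirs(directory, mode=0o700, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return path


def describe_reason(reason: BaseException) -> str:
    """Return printable text without assuming args[0] is a string."""
    first = reason.args[0] if reason.args else reason
    # For SSL failures args[0] can itself be an exception (e.g. ssl.SSLError).
    return first if isinstance(first, str) else str(first)
```

Readers of the target path see either the old file or the complete new one, never a partially written certificate.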
cephadm: replace call_throws with call in command_inspect_image
Problem:
During the upgrade, when inspecting the new ceph image for the first time, an error is printed to the ceph-mgr log instead of displaying a user-friendly message.
Root cause: During an upgrade, inspect-image is called on each node to check if the target image exists locally before pulling it. This flow, where inspect-image always precedes the pull, occurs on nodes other than the first.
Code Fixes:
1. src/cephadm/cephadm.py:
Replace call_throws with call in command_inspect_image. call_throws raises a RuntimeError on any non-zero exit code, producing a full traceback in the logs. call returns the exit code instead of raising, so the function exits cleanly with errno.ENOENT when the image is not found.
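The difference between the two helpers can be sketched as below. These are simplified stand-ins written for illustration: the real cephadm helpers take more parameters, and inspect_image() here is a hypothetical reduction of command_inspect_image.

```python
import errno
import subprocess
import sys
from typing import List, Tuple


def call(cmd: List[str]) -> Tuple[str, str, int]:
    """Run cmd and return (stdout, stderr, exit code) without raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout, proc.stderr, proc.returncode


def call_throws(cmd: List[str]) -> Tuple[str, str, int]:
    """Run cmd and raise RuntimeError on any non-zero exit code."""
    out, err, code = call(cmd)
    if code != 0:
        raise RuntimeError("Failed command: " + " ".join(cmd))
    return out, err, code


def inspect_image(cmd: List[str]) -> int:
    """Return 0 when the inspect command succeeds, ENOENT otherwise."""
    _out, _err, code = call(cmd)
    if code != 0:
        # An image that is not present locally is an expected outcome
        # here, so report it through the return value, not a traceback.
        return errno.ENOENT
    return 0
```

With call, the caller decides whether a non-zero exit is an error, which suits a probe that is expected to fail before the first pull.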
cephadm: convert lists back to tuples when loading last_client_files
Problem: ceph mgr fail or a restart of the active ceph mgr causes client files to be recreated unnecessarily on _admin hosts. Files such as /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring are rewritten even when their content has not changed.
Root cause:
update_client_file() stores client file metadata as a Python tuple (digest, mode, uid, gid).
When save_host() persists this to the mon store via json.dumps(), the tuple is serialized as a JSON array since JSON has no tuple type.
On mgr failover or restart, cache.load() deserializes the data with json.loads(), which returns a Python list instead of a tuple.
The comparison in _write_client_files(): match = old_files[path] == (digest, mode, uid, gid) then compares a list (from JSON) against a tuple (freshly built), which always evaluates to False.
This causes every client file to be rewritten on every mgr failover or restart.
Code Fixes:
1. src/pybind/mgr/cephadm/inventory.py:
Convert the deserialized lists back to tuples when loading last_client_files.
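The root cause is easy to reproduce in isolation. The sketch below uses a flat path-to-metadata mapping and a hypothetical restore_client_files() helper; the layout of the real cache may differ.

```python
import json
from typing import Dict, List, Tuple

ClientFileMeta = Tuple[str, int, int, int]  # (digest, mode, uid, gid)


def restore_client_files(raw: Dict[str, List]) -> Dict[str, ClientFileMeta]:
    """Convert JSON arrays back into tuples after loading (hypothetical helper)."""
    return {path: tuple(meta) for path, meta in raw.items()}


path = "/etc/ceph/ceph.conf"
saved = {path: ("abc123", 0o644, 0, 0)}

# JSON has no tuple type, so the round trip yields a list.
loaded = json.loads(json.dumps(saved))
stale_match = loaded[path] == ("abc123", 0o644, 0, 0)

# After conversion the comparison behaves as intended.
restored = restore_client_files(loaded)
fixed_match = restored[path] == ("abc123", 0o644, 0, 0)
```

In Python a list never compares equal to a tuple, even with identical elements, so stale_match is False while fixed_match is True.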
Shweta Bhosale [Wed, 18 Feb 2026 14:29:58 +0000 (19:59 +0530)]
mgr/nfs: 1. Removed the option to enable and disable cluster-wide QoS; it is now enabled by default.
2. Removed the cluster_enable_qos field from the cluster-level block, as it was confusing users.
3. Use "global" instead of "cluster" when showing cluster-level QoS values in export qos get.
Shweta Bhosale [Thu, 6 Nov 2025 13:04:19 +0000 (18:34 +0530)]
mgr/cephadm: support nfs cluster level qos
Added the CEPH_NODES_LIST block below to ganesha.conf, and enable_cluster_qos to the cluster-level QoS block:
CEPH_NODES_LIST {
Ceph_Nodes = 192.168.100.100, 192.168.100.101, 192.168.100.102;
}
Fixes: https://tracker.ceph.com/issues/69861
Signed-off-by: Shweta Bhosale <Shweta.Bhosale1@ibm.com>
mgr/cephadm: Changes to add NFS cluster qos inter node communication port in spec
mgr/nfs: Addressed review comments for cluster level qos support
mgr/nfs: add enable_cluster_qos = true while enabling qos
Shweta Bhosale [Wed, 19 Mar 2025 11:16:10 +0000 (16:46 +0530)]
mgr/nfs: When cluster-level QoS is disabled and an export still has QoS parameters, allow the nfs export apply command if the file contains the same QoS block that is already set
mgr/cephadm: plumb force_delete_data through daemon/service removal
This PR wires the existing `force_delete_data` flag in the binary
through cephadm's daemon and service removal paths, so that
commands such as `ceph orch rm service` or the equivalent daemon removal
can explicitly ask for data deletion instead of the default "move
under <fsid>/removed/" for daemons such as Prometheus, osd, and mon.
Keep parsed command data alive while running hooks to avoid a
stack-use-after-return in Formatter::create().
Return -EAGAIN from PGCommand when the OSDMap is not ready.
Crimson OSD was missing the PG admin/tell hooks that classic OSD exposes, and it
did not accept the legacy `rados_pg_command()` / `ceph pg <pgid> <cmd>` JSON form
(e.g. `{"prefix":"pg","pgid":"1.0","cmd":"query",...}`), so `ceph pg <pgid> query`
failed.
Adds a `pg` old-form wrapper hook that exists only to advertise the
classic `pgid` + `cmd` + optional `arg` signature. The runtime dispatch
rewrites this to the real subcommand.
This updates parse_cmd to rewrite `prefix=pg` requests to the requested
subcommand and remap the generic `arg` field to the concrete parameter
names (`offset` for `list_unfound`, `mulcmd` for `mark_unfound_lost`)
so validation/parsing is unambiguous.
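The rewrite can be pictured with the sketch below. It is written in Python purely for illustration (Crimson itself is C++), and rewrite_legacy_pg_command() and the exact shape of the rewritten request are assumptions; only the `arg` remappings come from the commit message.

```python
from typing import Any, Dict

# Concrete parameter names that the generic `arg` field maps to.
ARG_NAME_BY_SUBCOMMAND = {
    "list_unfound": "offset",
    "mark_unfound_lost": "mulcmd",
}


def rewrite_legacy_pg_command(cmd: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a legacy prefix=pg request to its subcommand (hypothetical helper)."""
    if cmd.get("prefix") != "pg" or "cmd" not in cmd:
        # Not the legacy form: pass it through unchanged.
        return dict(cmd)
    subcommand = cmd["cmd"]
    rewritten = {k: v for k, v in cmd.items() if k not in ("cmd", "arg")}
    # Assumption: the subcommand name becomes the new prefix.
    rewritten["prefix"] = subcommand
    if "arg" in cmd:
        # Remap the generic field so later validation sees a concrete name.
        target = ARG_NAME_BY_SUBCOMMAND.get(subcommand, "arg")
        rewritten[target] = cmd["arg"]
    return rewritten
```

For example, a request carrying `cmd` = `list_unfound` and an `arg` value comes out with that value under `offset`, so the subcommand's own parser does not need to know about the legacy form.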
Add a standalone concept page for the OSDMap require_osd_release field,
the upgrade-gate counterpart to require_min_compat_client. Cover:
- how to set it and how to check it;
- the full set of pre-commit guards the monitor runs, rendered as a
table with each guard's error text and bypass status;
- which commands and features become available as the flag is raised,
per release;
- the OSD boot window that refuses OSDs more than two releases ahead
of the flag;
- the OSD_UPGRADE_FINISHED health warning that prompts admins to set
the flag after an upgrade;
- the initial value on new clusters and the two mon_debug_* knobs
that override it for testing.
Also cross-link the new page from the related-flags table on
require-min-compat-client.rst, and from the rados operations index.
Add a standalone concept page for the OSDMap require_min_compat_client
field, covering: how to set and check it, the non-monotonic lowering
behavior (with the features-in-use floor derived from
OSDMap::get_min_compat_client()), and the operator commands it gates.
Include tables for the floor-pinning features and the flag-gated
commands, so operators can reason about transitions without reading
OSDMonitor.cc.
Cross-reference the CephFS per-filesystem required_client_features
mechanism, which is the MDSMap-side equivalent for client-protocol
features. Add an anchor on the existing CephFS Required Client Features
section so the cross-reference resolves.
Link the new page from the rados operations index.
doc: document ceph nvmeof CLI subcommands for target configuration
Replaces verbose podman run container commands with native ceph nvmeof
CLI subcommands. The nvmeof-cli container approach is preserved as an
alternative in a note block, with a clarification that its option names
differ from the ceph nvmeof CLI.