rgw: add GCM hardware acceleration support via CryptoAccel
Extend the CryptoAccel plugin system to support AES-256-GCM encryption,
following the same pattern established for CBC.
The CryptoAccel base class now includes GCM constants (12-byte nonce,
16-byte tag) and pure virtual methods for gcm_encrypt, gcm_decrypt,
and their batch variants. All derived classes must implement these
methods, maintaining consistency with how CBC is handled.
OpenSSL serves as the fallback when ISA-L is unavailable, using the
EVP API with proper AAD handling. QAT stubs return false since GCM
requires different session setup than CBC; a note has been added to
the QAT acceleration documentation clarifying this limitation.
The RGW integration follows the CBC pattern closely. The previous
gcm_encrypt_chunk and gcm_decrypt_chunk functions have been unified
into gcm_transform() with two overloads: one for EVP-only operation
and one that uses the accelerator exclusively when available, falling
back to EVP only when no accelerator can be loaded. Static assertions
ensure the nonce and tag sizes stay consistent between the acceleration
layer and RGW.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Wed, 28 Jan 2026 04:06:17 +0000 (22:06 -0600)]
rgw: add AES-256-GCM (AEAD) support for server-side encryption
This adds GCM as an alternative to the existing CBC cipher for SSE-C,
SSE-KMS, SSE-S3, and RGW-AUTO. GCM provides authenticated encryption,
meaning it detects tampering during decryption rather than silently
returning corrupted data.
The new rgw_crypt_sse_algorithm config option controls which cipher is
used for new uploads. The default remains aes-256-cbc for backward
compatibility with older RGW versions in mixed clusters. Once all nodes
are upgraded, administrators can enable aes-256-gcm for new objects.
Existing CBC-encrypted objects continue to decrypt correctly regardless
of this setting.
GCM encrypts in 4KB chunks, each producing 4112 bytes of ciphertext
(4096 plaintext + 16-byte authentication tag). This means encrypted
objects are larger than their plaintext. To preserve correct behavior:
- RGW_ATTR_CRYPT_ORIGINAL_SIZE stores the plaintext size
- Content-Length and bucket listings report the plaintext size
- Range requests translate plaintext offsets to storage offsets
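The size arithmetic above can be sketched as follows. This is a minimal illustration, not the RGW implementation: the function names are hypothetical, and the handling of a short final chunk (its length plus one 16-byte tag) is an assumption that the commit message does not spell out.

```python
CHUNK_SIZE = 4096                        # plaintext bytes per chunk (from the commit message)
TAG_SIZE = 16                            # GCM authentication tag per chunk
ENC_CHUNK_SIZE = CHUNK_SIZE + TAG_SIZE   # 4112 bytes stored per full chunk


def plaintext_to_encrypted_size(plain_size: int) -> int:
    """Stored size of an object whose plaintext is plain_size bytes."""
    full, rest = divmod(plain_size, CHUNK_SIZE)
    # Assumption: a partial final chunk is stored as its bytes plus its own tag.
    return full * ENC_CHUNK_SIZE + (rest + TAG_SIZE if rest else 0)


def plaintext_range_to_storage_range(first: int, last: int):
    """Map an inclusive plaintext byte range to the half-open storage range
    [start, end) covering every encrypted chunk that the range touches.
    A caller would still clamp `end` to the stored object size."""
    start_chunk = first // CHUNK_SIZE
    end_chunk = last // CHUNK_SIZE
    return start_chunk * ENC_CHUNK_SIZE, (end_chunk + 1) * ENC_CHUNK_SIZE
```

Whole chunks have to be fetched because a tag authenticates its entire chunk; the requested bytes are then sliced out of the decrypted plaintext.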
Each object gets a random 12-byte nonce stored in RGW_ATTR_CRYPT_NONCE.
This nonce serves two purposes: it's combined with chunk indices to
derive unique IVs for each encrypted block, and for SSE-C it's included
in the key derivation to bind ciphertext to object identity. Moving
encrypted data at the RADOS level causes decryption to fail rather than
silently producing garbage.
Multipart uploads derive per-part keys and use the S3 part number in
IV derivation to guarantee unique IVs across parts. The actual part
numbers are stored in RGW_ATTR_CRYPT_PART_NUMS during CompleteMultipart
to handle non-contiguous uploads (e.g., parts 1, 3, 5).
The implementation uses generic AEAD abstractions (is_aead_mode(),
aead_plaintext_to_encrypted_size(), etc.) so that adding other
authenticated ciphers like ChaCha20-Poly1305 in the future requires
only implementing the cipher itself—the size handling, range request
translation, and multipart machinery will work unchanged.
Originally-by: Kyle Bader <kbader@ibm.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Expose the existing radosgw-admin dedup commands (stats, estimate, exec,
abort, pause, resume, throttle) as HTTP Admin OPS endpoints under
/{admin}/dedup, following the same pattern used by ratelimit, usage, and
other admin REST APIs.
New files:
- rgw_rest_dedup.h: RGWHandler_Dedup and RGWRESTMgr_Dedup
- rgw_rest_dedup.cc: REST op classes calling the same cluster:: backend
functions as radosgw-admin
API summary:
- GET /dedup?op=stats - collect and display dedup statistics
- GET /dedup?op=throttle - display throttle settings
- POST /dedup?op=estimate - start dedup estimate session
- POST /dedup?op=exec - start full dedup (requires yes-i-really-mean-it)
- POST /dedup?op=abort - abort active dedup session
- POST /dedup?op=pause - pause active dedup session
- POST /dedup?op=resume - resume paused dedup session
- POST /dedup?op=throttle - set throttle limits
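As a rough illustration of the API summary, the sketch below builds the request line for each operation. The helper name and the default `admin` prefix are assumptions; request signing, and the way `yes-i-really-mean-it` is passed to `exec`, are deliberately left out because the commit message does not describe them.

```python
# (method, op) pairs taken from the API summary above.
DEDUP_OPS = {
    ("GET", "stats"),
    ("GET", "throttle"),
    ("POST", "estimate"),
    ("POST", "exec"),
    ("POST", "abort"),
    ("POST", "pause"),
    ("POST", "resume"),
    ("POST", "throttle"),
}


def dedup_request_line(method: str, op: str, admin_prefix: str = "admin") -> str:
    """Return 'METHOD /{admin}/dedup?op=...' for a supported operation.
    admin_prefix is a hypothetical default for the {admin} path element."""
    if (method, op) not in DEDUP_OPS:
        raise ValueError(f"unsupported dedup operation: {method} op={op}")
    return f"{method} /{admin_prefix}/dedup?op={op}"
```

Note that `throttle` is the only operation available under both methods: GET reads the settings and POST changes them.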
Documentation added to doc/radosgw/adminops.rst with cross-reference
from doc/radosgw/s3_objects_dedup.rst.
i would prefer to run the s3control test coverage in rgw/verify, but it
depends on rgw_dns_name configuration and support for wildcard dns which
breaks most of the other rgw/verify test cases
the vhost-style transformations ran in RGWREST::preprocess() before we
even route the request, so they applied to every REST API in radosgw
vhost-style requests are specific to the S3 API, so they should only
apply after being routed to RGWRESTMgr_S3
extract the vhost logic from RGWREST::preprocess() into
rgw_rest_transform_s3_vhost_style(), and call that only from
RGWRESTMgr_S3::get_resource_mgr_as_default()
url-decoding of request_uri into decoded_uri is now duplicated: once in
preprocess() to apply to all requests, then again after the vhost-style
transform rewrites the request_uri
avoid allocating a list of strings to parse the comma-separated
rgw_enable_apis configuration
the range returned by ceph::split() has no size() function, so change
the calculation to not require it - `size() - distance(begin(), pos)`
is the same thing as `distance(pos, end())`
Casey Bodley [Mon, 30 Jun 2025 22:06:08 +0000 (18:06 -0400)]
rgw: add helper for bucket + account PublicAccessBlock config
get_public_access_conf() takes an optional account, and checks
RGW_ATTR_PUBLIC_ACCESS on that in addition to the bucket. if both attrs
are found, return the union of their configurations
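One way to picture the union is shown below, as a sketch rather than the RGW code. The four flags are the standard S3 PublicAccessBlock settings; the Python types, the `None` convention for a missing attr, and the all-false default when neither attr exists are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PublicAccessBlock:
    block_public_acls: bool = False
    ignore_public_acls: bool = False
    block_public_policy: bool = False
    restrict_public_buckets: bool = False

    def union(self, other: "PublicAccessBlock") -> "PublicAccessBlock":
        # A restriction applies if either configuration enables it.
        return PublicAccessBlock(
            self.block_public_acls or other.block_public_acls,
            self.ignore_public_acls or other.ignore_public_acls,
            self.block_public_policy or other.block_public_policy,
            self.restrict_public_buckets or other.restrict_public_buckets,
        )


def get_public_access_conf(
    bucket_conf: Optional[PublicAccessBlock],
    account_conf: Optional[PublicAccessBlock] = None,
) -> PublicAccessBlock:
    """None stands for 'attr not found' (hypothetical convention)."""
    if bucket_conf is not None and account_conf is not None:
        return bucket_conf.union(account_conf)
    if bucket_conf is not None:
        return bucket_conf
    if account_conf is not None:
        return account_conf
    return PublicAccessBlock()  # assumption: nothing is blocked by default
```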
Oguzhan Ozmen [Tue, 19 May 2026 22:12:35 +0000 (22:12 +0000)]
rgw/datalog: DataLogBackends::trim_entries: fix crash when target_gen > head_gen
When a cluster has no sync zones (single-zone), DataLogTrimCR passes
max_marker() as the trim marker, which encodes target_gen = UINT64_MAX
from gencursor(). In DataLogBackends::trim_entries, after trimming the
head (last) generation, the break condition
if (be->gen_id == target_gen)
is false (e.g. 0 != UINT64_MAX), so the loop attempts its increment
expression:
be = upper_bound(be->gen_id)->second
upper_bound(head_gen) returns end(), and dereferencing end()->second
causes a crash.
Fix: also break when be->gen_id >= head_gen. Once we've trimmed the
head generation there are no further backends in the map, so the
upper_bound dereference in the loop increment will be skipped.
This is a general bug that affects any cluster using max_marker() as a
trim target (i.e. every single-zone deployment).
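The loop shape described above can be modeled in a few lines. This is a simplified sketch, not the DataLogBackends code: the backend map is reduced to a sorted list of generation ids, `bisect_right` stands in for `upper_bound`, and an `IndexError` stands in for dereferencing `end()`.

```python
import bisect

UINT64_MAX = 2**64 - 1


def trim_generations(gens, target_gen, guard=True):
    """Walk the sorted generation ids the way the trim loop does and return
    the generations visited. guard=False models the pre-fix behavior."""
    head_gen = gens[-1]
    trimmed = []
    i = 0
    while True:
        gen = gens[i]
        trimmed.append(gen)
        if gen == target_gen:
            break
        if guard and gen >= head_gen:
            break  # the fix: nothing exists past the head generation
        i = bisect.bisect_right(gens, gen)  # upper_bound(gen)
        if i == len(gens):
            raise IndexError("dereferenced end()")  # the crash, in the C++ code
    return trimmed
```

With a single generation and `target_gen = UINT64_MAX`, the guarded version returns after trimming generation 0, while the unguarded version runs off the end of the map.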
Oguzhan Ozmen [Tue, 19 May 2026 23:43:58 +0000 (23:43 +0000)]
test/rgw/datalog: test for trim_entries with max_marker
Verify that DataLogBackends::trim_entries does not crash when called
with max_marker() on a single-generation cluster. The bug causes
upper_bound(head_gen)->second to dereference end() (SIGSEGV)
because the only break condition checked be->gen_id == target_gen,
which is never true when target_gen is UINT64_MAX as encoded by
max_marker() and the cluster has only generation 0.
Kefu Chai [Wed, 20 May 2026 07:55:58 +0000 (15:55 +0800)]
crimson/osd: use store-specific max_object_size for the OSD-layer write check
is_offset_and_length_valid() checked write sizes against
osd_max_object_size (128 MiB), but SeaStore caps per-onode laddr space
at seastore_default_max_object_size (16 MiB). Writes between the two
limits pass the OSD check, reach SeaStore, and trip
prepare_data_reservation()'s ceph_assert(), crashing the OSD and its
replicas.
Add FuturizedStore::Shard::get_max_object_size() (returns
osd_max_object_size by default) and override it in SeaStore::Shard to
return min(osd_max_object_size, max_object_size). Convert
is_offset_and_length_valid() from a static function to a PGBackend
member that queries the store, so EFBIG reaches the client before the
write ever hits the store.
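The effect of querying the store for its limit can be sketched as below. The class names are hypothetical stand-ins, and the validity check is reduced to a single comparison; the real function may test offset and length separately as well.

```python
MiB = 1 << 20
OSD_MAX_OBJECT_SIZE = 128 * MiB       # osd_max_object_size, per the message
SEASTORE_MAX_OBJECT_SIZE = 16 * MiB   # seastore_default_max_object_size


class GenericStore:
    """Stand-in for a store that uses the default limit."""
    def get_max_object_size(self) -> int:
        return OSD_MAX_OBJECT_SIZE


class SeaStoreLike(GenericStore):
    """Stand-in for a store with a tighter per-object limit."""
    def get_max_object_size(self) -> int:
        return min(OSD_MAX_OBJECT_SIZE, SEASTORE_MAX_OBJECT_SIZE)


def is_offset_and_length_valid(store, offset: int, length: int) -> bool:
    # Simplified check: the write must end within the store's limit.
    return offset + length <= store.get_max_object_size()
```

A 32 MiB write is accepted against the generic limit but rejected for the SeaStore-like store, which is where the client would see EFBIG instead of the OSD hitting the assert.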
Kefu Chai [Wed, 20 May 2026 04:03:12 +0000 (12:03 +0800)]
crimson/osd: only unblock wait_for_active_blocker on replica when ACTIVE
ReplicaActive::react(ActivateCommitted) sets ACTIVE or PEERED before
calling on_activate_committed(). Without a guard, an unconditional
unblock() on the PEERED path resets the promise, causing ops that
arrive afterward to park indefinitely (until the next on_change()).
The primary already has this guard in on_activate_complete(); mirror it
on the replica side.
Kefu Chai [Wed, 20 May 2026 02:22:24 +0000 (10:22 +0800)]
crimson/osd: wake pgs_creating waiters in PGMap::pg_loaded()
wait_for_pg() parks callers on pgs_creating[pgid] when the PG isn't in
pgs yet. pg_created() wakes those waiters; pg_loaded() didn't. An op
that races ahead of a PG load hangs indefinitely at CreateOrWaitPG.
Add the symmetric wake-up in pg_loaded() with a conditional find --
unlike pg_created(), a loaded PG may have no waiters at all.
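The missing wake-up can be illustrated with asyncio futures standing in for the Seastar promises. This is only a model of the described behavior: the class layout and method bodies are assumptions, not the crimson PGMap code.

```python
import asyncio


class PGMap:
    def __init__(self):
        self.pgs = {}            # pgid -> pg
        self.pgs_creating = {}   # pgid -> future that waiters park on

    async def wait_for_pg(self, pgid):
        if pgid in self.pgs:
            return self.pgs[pgid]
        fut = self.pgs_creating.get(pgid)
        if fut is None:
            fut = asyncio.get_running_loop().create_future()
            self.pgs_creating[pgid] = fut
        return await fut

    def pg_created(self, pgid, pg):
        self.pgs[pgid] = pg
        # Modeled as in the message: creation always has a pgs_creating entry.
        self.pgs_creating.pop(pgid).set_result(pg)

    def pg_loaded(self, pgid, pg):
        self.pgs[pgid] = pg
        # Conditional find: a loaded PG may have no waiters at all.
        fut = self.pgs_creating.pop(pgid, None)
        if fut is not None:
            fut.set_result(pg)
```

Without the last three lines of `pg_loaded()`, a caller that reached `wait_for_pg()` before the load finished would stay parked on its future.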
Kefu Chai [Wed, 20 May 2026 07:36:51 +0000 (15:36 +0800)]
crimson/seastore: reject oversized writes and zeros instead of aborting
prepare_data_reservation() ceph_assert()s the request fits within
seastore_default_max_object_size (16 MiB), but the OSD validates writes
against osd_max_object_size (128 MiB). Anything between the two limits
passes OSD validation then trips the assert, crashing the OSD and its
replicas.
_zero() already returned EIO for this case; mirror that in _write() and
fix _zero()'s off-by-one (>= should be >, matching the <= in the assert).
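The off-by-one is easiest to see at the exact boundary. In the sketch below the three predicates are hypothetical reductions of the assert and of the old and new `_zero()` checks; the precise expressions in SeaStore may differ.

```python
MAX_OBJECT_SIZE = 16 * (1 << 20)  # seastore_default_max_object_size (16 MiB)


def assert_holds(offset: int, length: int) -> bool:
    # What the assert requires: the request ends at or before the limit.
    return offset + length <= MAX_OBJECT_SIZE


def rejects_before_fix(offset: int, length: int) -> bool:
    return offset + length >= MAX_OBJECT_SIZE   # also rejects an exact fit


def rejects_after_fix(offset: int, length: int) -> bool:
    return offset + length > MAX_OBJECT_SIZE    # exact complement of the assert
```

A request that ends exactly at 16 MiB satisfies the assert, so only the `>` form rejects precisely the requests that the assert would trip on.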
Jamie Pryde [Wed, 20 May 2026 10:31:53 +0000 (11:31 +0100)]
mon: Add health checker for deprecated EC plugins and techniques
We want to reduce the number of EC plugins and techniques we support
in order to focus dev and test effort on the ones that are most
useful.
We are deprecating the following plugins and techniques in Umbrella,
and dropping support for them in the V release:
* shec
* clay
* all non-reed_sol_van jerasure techniques
This commit adds a health checker to print a warning message if the cluster
is using any of the deprecated plugins/techniques and instructs the user
to migrate objects to a different pool.
Matan Breizman [Wed, 20 May 2026 08:22:22 +0000 (11:22 +0300)]
crimson/osd: call pubsetbuf() before open()
Move rdbuf()->pubsetbuf(nullptr, 0) before ofstream::open() since libstdc++
may ignore setbuf() once the filebuf is associated with a file.
```
setbuf() may only be called when the std::basic_filebuf is not associated with a file (has no effect otherwise)
```
https://en.cppreference.com/cpp/io/basic_filebuf/setbuf
Ville Ojamo [Wed, 20 May 2026 06:26:08 +0000 (13:26 +0700)]
doc/rados: move label to right place in pools.rst
The label is named setting and not unsetting, so move it from the
unsetting section to the setting section.
The label is used only once and the context in which it is used is also
more fitting for the setting section.
Signed-off-by: Ville Ojamo <git2233+ceph@ojamo.eu>
Ronen Friedman [Sat, 16 May 2026 14:39:51 +0000 (14:39 +0000)]
crimson/osd: decouple snap trim initiation from scrub completion
Add SnapTrimInitiate operation so kick_snap_trim() no longer calls
on_active_actmap() inline during scrub completion, which nested
conflicting with_interruption contexts and hit an assertion.
Kefu Chai [Wed, 20 May 2026 00:55:55 +0000 (08:55 +0800)]
crimson/seastore: clamp block_size to laddr_t::UNIT_SIZE on small-LBA devices
Seastar's file::disk_write_dma_alignment() faithfully reports what the
kernel exposes for the underlying device. On block devices in 512-byte
LBA mode (the factory default for many NVMe SSDs), it correctly returns
512.
SeaStore's internal addressing, however, operates at a 4 KiB page
granularity defined by laddr_t::UNIT_SIZE, and SeaStore::_mount() asserts
that block_size >= UNIT_SIZE. As a result, ceph-osd-crimson --mkfs
--osd-objectstore seastore aborts on any device shipped in 512-LBA
mode:
seastore.cc:344 ceph_assert(block_size >= laddr_t::UNIT_SIZE)
seastore requires a device block size of at least 4096 bytes,
but the primary device at '/var/lib/ceph/osd/ceph-N/block'
reports block_size=512
The reported alignment is not wrong; it is the minimum Linux enforces
for O_DIRECT on that device, so users with optimization in mind can
override it through io_properties.yaml. But SeaStore can run correctly
on 512-LBA devices as long as it issues only 4 KiB-aligned I/O (which
is also 512-aligned, so the device is happy). Clamp the captured
block_size to laddr_t::UNIT_SIZE so SeaStore can host an OSD on
512-LBA storage without operator intervention, while still honoring
larger device-reported alignments when present.
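The clamp amounts to taking the larger of the reported alignment and the addressing unit. A minimal sketch, with a hypothetical function name:

```python
UNIT_SIZE = 4096  # laddr_t::UNIT_SIZE, per the message


def effective_block_size(reported_alignment: int) -> int:
    """Raise small device alignments to UNIT_SIZE; keep larger ones as reported."""
    return max(reported_alignment, UNIT_SIZE)
```

Because 4096 is a multiple of 512, every I/O aligned to the clamped block size also satisfies the 512-byte O_DIRECT alignment of the device.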
Oguzhan Ozmen [Tue, 12 May 2026 19:43:13 +0000 (19:43 +0000)]
neocls log trimming (time based): fix infinite loop on ENODATA
This is essentially the same as the previous commit.
The time-based use_awaitable_t overload of trim() has the same
issue as the marker-based overload: the try-catch for ENODATA is inside
the for(;;) loop, so ENODATA is caught and swallowed, causing the loop
to retry forever.
Oguzhan Ozmen [Tue, 12 May 2026 19:38:12 +0000 (19:38 +0000)]
neocls log trimming (marker based): fix infinite loop on ENODATA
The use_awaitable_t overload of trim() has the try-catch for
ENODATA (no_message_available) inside the for(;;) loop. When
cls_log_trim returns ENODATA (i.e., nothing left to trim), the exception is
caught and silently swallowed; execution falls through the catch block
back to for(;;), retrying the trim forever.
This should be a rare condition, as three conditions must all be met in a
single-site cluster (one that was once configured as multisite):
- a realm/period exists
- zone endpoints are configured
- data_log* objects exist
In a properly set up multisite cluster, data churn is continuous, so the
issue is hard to notice. In the case a client reported, the multisite
cluster was reverted back to single site, so the data_logs have no data
all the time; hence, the issue is pronounced.
This fix adds co_return inside the catch block so ENODATA exits the loop.
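The control flow can be reproduced with an ordinary loop and exception. In this sketch `make_trim()` is a hypothetical stand-in for cls_log_trim, and the iteration cap exists only so that the buggy variant terminates for demonstration purposes.

```python
import errno


def make_trim(batches: int):
    """Return a callable that succeeds `batches` times, then raises ENODATA."""
    state = {"left": batches}

    def trim_once():
        if state["left"] == 0:
            raise OSError(errno.ENODATA, "nothing left to trim")
        state["left"] -= 1

    return trim_once


def trim_all(trim_once, fixed=True, max_iterations=1000):
    """Call trim_once until ENODATA; return the number of calls made."""
    calls = 0
    while True:
        if calls >= max_iterations:
            raise RuntimeError("would loop forever")
        calls += 1
        try:
            trim_once()
        except OSError as e:
            if e.errno != errno.ENODATA:
                raise
            if fixed:
                return calls  # the fix: leave the loop on ENODATA
            # pre-fix behavior: swallow the error and fall back into the loop
```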
Jaya Prakash [Tue, 19 May 2026 17:02:22 +0000 (17:02 +0000)]
qa: fix TEST_mon_features persistent feature checks in mon/misc.sh
The test reused the "tentacle" jq filter while validating
"nvmeof_beacon_diff", causing the comparison to fail. Also fix
the umbrella feature validation and update the expected
persistent feature count.
mgr/DaemonServer: auto-tune stats period when message queue gets backed up
The mgr can get overwhelmed when there's a lot of cluster activity and
daemons are sending stats reports faster than we can process them.
This commit adds logic to monitor the messenger queue depth and bump
up mgr_stats_period when things get congested. This reduces the
frequency of daemon stat reports, allowing the mgr to process existing
reports without being overwhelmed by new ones. The period automatically
scales back down when the queue clears up.
Adds mgr_stats_period_autotune (on by default) and a queue threshold
setting. The maximum period is capped at 60 seconds to prevent excessive
stat delays.
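A possible shape for such a policy is sketched below. Only the 60-second cap and the back-off-then-recover behavior come from the commit message; the doubling and halving steps, the function name, and the parameters are assumptions made for illustration.

```python
MAX_STATS_PERIOD = 60  # seconds, the cap stated in the message


def next_stats_period(current: int, base: int, queue_depth: int, threshold: int) -> int:
    """Return the next stats period given the current messenger queue depth."""
    if queue_depth > threshold:
        # Congested: ask daemons to report less often (assumed doubling).
        return min(current * 2, MAX_STATS_PERIOD)
    # Queue has cleared: step back toward the configured base period.
    return max(current // 2, base)
```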
Patrick Donnelly [Tue, 19 May 2026 14:10:33 +0000 (10:10 -0400)]
.github/workflows/releng-audit: update workflows
To avoid this warning:
> Warning: Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/checkout@v3, actions/setup-python@v4. Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Kefu Chai [Tue, 19 May 2026 12:58:10 +0000 (20:58 +0800)]
debian/rules: strip ceph-osd-classic and ceph-osd-crimson
override_dh_strip enumerates each binary package explicitly. It was not
updated when ceph-osd was split into the ceph-osd-classic and
ceph-osd-crimson implementation packages, so the OSD binaries in those
two packages are shipped unstripped (ceph-osd-crimson installs at ~4.6
GiB) and their -dbg packages are left empty.
Add the missing dh_strip invocations so the OSD binaries are stripped
and their debug symbols land in the corresponding -dbg packages, as is
already done for every other binary package.
Afreen Misbah [Mon, 18 May 2026 20:06:35 +0000 (01:36 +0530)]
mgr/dashboard: fix remaining FA icon references and test failures
- Fix icon size mismatches and HTML lint errors
- Fix remaining FA icon references in tests
- Replace FA icons with Carbon in upgrade component:
use cds-inline-loading for spinners, cd-icon for status icons
- Update test selectors for Carbon icon queries
Fixes: https://tracker.ceph.com/issues/76631
Signed-off-by: Afreen Misbah <afreen23@gmail.com>
Assisted-by: Claude
Afreen Misbah [Sun, 17 May 2026 16:43:59 +0000 (22:13 +0530)]
mgr/dashboard: fix filter icon alignment in table toolbar
Replace Bootstrap inline styles with proper CSS class for filter
icon and select dropdowns alignment. Created filter-wrapper class
to properly align filter icon with select elements using flexbox.
Signed-off-by: Afreen Misbah <afreen@ibm.com>
Assisted-by: Claude
Fixes: https://tracker.ceph.com/issues/76631
Afreen Misbah [Sun, 17 May 2026 15:07:45 +0000 (20:37 +0530)]
mgr/dashboard: fix missing loader and zone group icon
- Add state="active" to cds-inline-loading in card-row component
to properly show loading spinner for table row actions
- Replace parentChild icon with clusterIcon (web-services--cluster)
for zone group representation in RGW multisite
- Remove parentChild from Icons enum and replace with
WebServicesCluster in components.module.ts
- Import ComponentsModule in rgw.module.ts for cd-icon support
Signed-off-by: Afreen Misbah <afreen@ibm.com>
Assisted-by: Claude
Fixes: https://tracker.ceph.com/issues/76631
Added LoadingModule and InlineLoadingModule imports to:
- block.module.ts
- cephfs.module.ts
- cluster.module.ts
(rgw.module.ts and components.module.ts already had them)
Signed-off-by: Afreen Misbah <afreen@ibm.com>
Assisted-by: Claude
Fixes: https://tracker.ceph.com/issues/76631
Afreen Misbah [Sun, 17 May 2026 00:14:41 +0000 (05:44 +0530)]
mgr/dashboard: remove font awesome references
- Remove .fa and .fa-* class styles from component SCSS files
- Remove FA icon spacing rules from global styles
- Clean up .fa-stack styles (FA stacking feature)
- Remove FA-specific color styles
- Remove FA icons
Signed-off-by: Afreen Misbah <afreen@ibm.com>
Assisted-by: Claude
Fixes: https://tracker.ceph.com/issues/76631
Bill Scales [Tue, 19 May 2026 06:05:13 +0000 (07:05 +0100)]
doc/dev/internals: Improve Ceph Internals TOC
The Ceph internals section of the docs is a bit of a mess
as far as the table of contents is concerned. This commit
tries to add a bit more structure grouping topics by
area and trying to arrange them in a more logical order.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
rgw/dedup: add --allow/deny-bucket-list and --allow/deny-storage-class-list to dedup commands
Resolves: bz#2413730
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
Casey Bodley [Tue, 12 May 2026 18:58:16 +0000 (14:58 -0400)]
librados/asio: clear cancellation slot in associated executor
the librados callback function `AsyncOp::aio_dispatch()` runs on
Objecter's finisher strand executor, and dispatches the completion
handler to its associated executor
asio cancellation is not thread-safe, so should be synchronized on that
associated executor. move the call to `slot.clear()` from that librados
callback into the AsyncHandler wrapper so it doesn't run until we've
switched to the correct executor
because our `op_cancellation` handler depends on an `AioCompletion`
pointer, we have to clear the cancellation slot before that
`AioCompletion` lifetime ends