This commit introduces performance counters for individual Ceph mgr modules.
These counters allow monitoring module behavior, debugging latency issues,
and identifying performance bottlenecks, all without modifying the modules themselves.
The following counters are now exposed under:
> ceph daemon mgr.<id> perf dump
Example structure:
"mgr_module_<module_name>": {
"notify_avg_usec": { <- Average time spent handling notify events
"avgcount": 0,
"sum": 0
},
"cmd_avg_usec": { <- Average time spent processing CLI/admin commands
"avgcount": 0,
"sum": 0
},
"serve_avg_usec": { <- Average time spent in module serve loop (if applicable)
"avgcount": 0,
"sum": 0
},
"alive": 1, <- Module is alive (1 = running, 0 = exited)
"cpu_usage": 0, <- CPU usage in percent
"mem_rss_change": 0, <- Memory RSS change in bytes
"mem_rss_current": 490737664 <- Memory RSS current in bytes
}
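A minimal sketch of how the averaged counters above could be read, assuming `sum` accumulates microseconds (as the `_usec` suffix suggests) and `avgcount` is the number of samples. The module name `balancer` and the sample values are made up for illustration; they are not real output.

```python
import json

AVG_COUNTERS = ("notify_avg_usec", "cmd_avg_usec", "serve_avg_usec")

def average_usec(counter):
    """Return sum / avgcount, or None when no samples were recorded."""
    count = counter.get("avgcount", 0)
    if count == 0:
        return None
    return counter["sum"] / count

def module_latencies(perf_dump, module_name):
    """Compute the average of each *_avg_usec counter for one mgr module."""
    section = perf_dump["mgr_module_" + module_name]
    return {name: average_usec(section[name]) for name in AVG_COUNTERS}

# Hypothetical perf dump fragment (illustrative values only).
sample = json.loads("""
{
  "mgr_module_balancer": {
    "notify_avg_usec": {"avgcount": 4, "sum": 120},
    "cmd_avg_usec": {"avgcount": 0, "sum": 0},
    "serve_avg_usec": {"avgcount": 2, "sum": 500},
    "alive": 1
  }
}
""")

latencies = module_latencies(sample, "balancer")
print(latencies)
```

Returning None for a zero `avgcount` keeps "no samples yet" distinct from a true average of zero.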
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
Conflicts:
src/mgr/ActivePyModules.cc - finisher.queue changed by 63859, adding py_module to the parameter list
src/mgr/PyModuleRegistry.cc - check_all_modules_started added by 63859
Sun Yuechi [Mon, 25 May 2026 07:07:36 +0000 (15:07 +0800)]
isa-l: enable on RISC-V
ISA-L v2.32.0 added RISC-V support. Enable the ISA-L erasure code
plugin and the zlib compressor on RISC-V when RVV is available.
RVV is detected via the existing ceph_arch_riscv_probe() path added
in 01dc12ad565, so the same Linux 6.5+ requirement applies; on older
kernels the RVV path stays disabled.
ceph-volume: OSD mapper lifecycle (LVM + raw) for activate
This adds small helpers so activate can consistently bring the OSD device
stack online (LVM lvchange, optional mapper open) and tear it down again,
with refresh in between. Same idea for the raw path. Crypto is handled
inside that flow when the OSD is encrypted.
Kefu Chai [Sun, 24 May 2026 08:25:46 +0000 (16:25 +0800)]
rgw: bump Apache Arrow submodule from 17.0.0 to 19.0.1
When WITH_SYSTEM_ARROW is false, Ceph builds Arrow from the bundled
src/apache submodule. Our CI uses ubuntu:jammy as the base image, which
does not package libarrow-dev, so the bundled path is always taken there.
Arrow 17.0.0 vendors a copy of Thrift whose download URLs are no longer
reachable, breaking CI builds that try to fetch them at configure time.
Bump arrow submodule to 19.0.1, the latest Arrow release that:
- builds successfully on ubuntu:jammy, and
- requires only CMake 3.22 (the version shipped by ubuntu:jammy)
See also
CMake version shipped by ubuntu:jammy
- https://packages.ubuntu.com/jammy/cmake
Kefu Chai [Fri, 22 May 2026 11:01:17 +0000 (19:01 +0800)]
crimson/scrub: fix assert in PGScrubber::release_range() on interval change
when an interval change occurs while ScrubReserveRange is still
waiting to acquire background_process_lock, ChunkState::exit()
calls release_range() but blocked is not yet set. this triggers
ceph_assert(blocked) in release_range().
fix by checking if blocked is set before asserting. if blocked is
not set, the range was never reserved, so release_range() is a
no-op. ScrubReserveRange's finally block handles lock cleanup in
this case.
Jan Radon [Fri, 15 May 2026 13:42:08 +0000 (15:42 +0200)]
feat(rgw/kafka): add mTLS client certificate authentication for Kafka notifications
Add support for mutual TLS (mTLS) client certificate authentication
when publishing bucket notifications to Kafka brokers. RGW can now
present a client certificate and private key to authenticate with
brokers that require ssl.client.auth=required.
Changes:
- Add ssl-certificate-location, ssl-key-location, and ssl-key-password
topic attributes for configuring client certificates
- Validate that ssl_certificate and ssl_key are provided together
- Include ssl_key_password in connection identity (hash/equality)
- Add kafka-security.sh script for generating broker and client TLS certs
- Add mTLS test (test_notification_kafka_security_ssl_mtls) using
use_mtls=True flag on the existing SSL security path
- Update RGW notifications documentation with mTLS parameters
Fixes: http://tracker.ceph.com/issues/67427
Signed-off-by: Jan Radon <jan.fabian.radon@sap.com>
Adam Kupczyk [Thu, 21 May 2026 12:27:39 +0000 (12:27 +0000)]
os/bluestore/BlueFS: Simplify flush functions
Removed the 'offset'~'length' parameters from flush-related functions;
a single 'end' now marks the limit of the requested flush.
This simplifies the logic, as we cannot flush random file ranges anyway.
Adam Kupczyk [Thu, 27 Nov 2025 07:56:02 +0000 (07:56 +0000)]
os/bluestore/bluefs: Simplify flush procedure
Refactor FileWriter:
1) add get_flush_offset(), has_unflushed_data()
2) renamed flush_buffer()->get_flush_buffer()
3) refactored logic of flush:
- removed bufferlist tail
- use buffer to store data that has to be reused
Kefu Chai [Sat, 9 May 2026 05:01:04 +0000 (13:01 +0800)]
cls: remove unused variable
to silence the following warning:
```
/home/kefu/dev/ceph/src/cls/rgw/cls_rgw_types.cc: In static member function ‘static std::__cxx11::list<rgw_bucket_dir> rgw_bucket_dir::generate_test_instances()’:
/home/kefu/dev/ceph/src/cls/rgw/cls_rgw_types.cc:736:11: warning: variable ‘i’ set but not used [-Wunused-but-set-variable=]
736 | uint8_t i = 0;
| ^
```
The new implementation retires an absent extent by constructing a real
empty extent and adding it to the transaction's retired_set, instead of
creating a retired placeholder.
osd/scrub: limit scrubbing under snap-trimming overload
When the snap-trim queues are long, scrubbing is likely to
make things worse. This change adds a new scrubbing restriction
for that case, and prevents periodic scrubs from starting when
the total snap-trim queue length across all PGs exceeds a
configurable threshold.
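The restriction described above amounts to a simple gate. The sketch below is illustrative only: the function and parameter names are hypothetical and do not correspond to the actual OSD code or config option.

```python
def periodic_scrub_allowed(snap_trim_queue_lens, max_total_queue_len):
    """Return False when the summed snap-trim queue length exceeds the threshold.

    snap_trim_queue_lens: per-PG snap-trim queue lengths (hypothetical input).
    max_total_queue_len:  the configurable threshold (hypothetical name).
    """
    total = sum(snap_trim_queue_lens)
    return total <= max_total_queue_len

# Three PGs with 10, 20 and 5 queued snaps against a threshold of 30.
print(periodic_scrub_allowed([10, 20, 5], 30))
```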
Jamie Pryde [Wed, 20 May 2026 14:53:42 +0000 (15:53 +0100)]
doc: deprecate EC plugins and techniques
We want to reduce the number of EC plugins and techniques we support
in order to focus dev and test effort on the ones that are most
useful.
We are deprecating the following plugins and techniques in Umbrella,
and dropping support for them in the V release:
* shec
* clay
* all non-reed_sol_van jerasure techniques
This commit updates the documentation to reflect these changes.
Casey Bodley [Thu, 21 May 2026 13:54:16 +0000 (09:54 -0400)]
qa/rgw: ignore 'keytool: command not found' errors
this 'keytool' invocation was moved from qa/tasks/s3tests_java.py to
qa/tasks/rgw.py so that it would also cover the java checksum tests
but that means it runs for any rgw job with https enabled, even if it
doesn't install or use any java stuff. the 'keytool' command itself
comes from jdk packages which aren't installed by default
ignore errors from this command so that subsuites can use https without
installing java
Krunal Chheda [Wed, 20 May 2026 18:14:22 +0000 (14:14 -0400)]
rgw/notification: fix zero eventTime in bucket notifications on concurrent PUT race
When concurrent PUTs target the same object key, RADOS may return
-ECANCELED to the losing writers. In that path *meta.mtime was never
populated from meta.set_mtime, leaving mtime at epoch (zero), which
propagated into bucket notification eventTime as
"1970-01-01T00:00:00.000Z".
Fix: set *meta.mtime from meta.set_mtime before returning 0 in the
ECANCELED/ENOENT/EEXIST early-return block, matching the behaviour of
the successful write path.
Also add a regression test that fires 20 concurrent threads writing
the same key and asserts no event in the persistent queue carries a
zero eventTime.
Maodi Ma [Wed, 5 Nov 2025 02:35:46 +0000 (02:35 +0000)]
common: enable AVX512+VPCLMULQDQ for crc32c performance on x86
- Add crc32_iscsi_by16_10 in src/isa-l into candidates for ceph_crc32c
- Add hardware capability check for AVX512 instructions before registering
- Add NASM feature check to ensure compatibility and to enable
AS_FEATURE_LEVEL in crc32_iscsi_by16_10.asm
Venky Shankar [Thu, 7 May 2026 09:47:38 +0000 (15:17 +0530)]
mds: prevent CDir omap commit with empty updates/removals/header
Empty `stales` and `to_remove` cause `size` to be initialized with
sizeof(fnode_t). If the encoded inode size (plus fnode_t size) exceeds
max_dir_commit_size, commit_one() is called as a non-header update
with empty `_set` and `_rm` sets causing the MDS to assert.
While this patch prevents the assert, it is unknown at this point as
to why the encoded inode size is so large. We have seen it before
once, but there is a lack of debug information to dig into. This fix
will prevent the assert, however, the MDS would go read-only due to
the large rados operation size, but at least we will have a live
system to debug at that point.
ShreeJejurikar [Wed, 13 May 2026 13:05:39 +0000 (18:35 +0530)]
rgw/logging: use assumed-role ARN as Requester for STS requests
When a request is made with STS temporary credentials, the bucket logging
Requester field was being set to the underlying user ID instead of the
assumed-role ARN. Per the AWS S3 server-access-log spec, the Requester
field should contain the assumed-role ARN (e.g.
arn:aws:sts::<account>:assumed-role/<role>/<session>) for STS-credentialed
requests.
Detect TYPE_ROLE identities via s->auth.identity->get_identity_type() and
use the ARN returned by Identity::get_caller_identity() (already
implemented by RoleApplier in the expected AWS format) instead of falling
straight through to s->user->get_id(). Existing behavior for account- and
user-scoped requests is unchanged.
an.groshev [Tue, 17 Feb 2026 08:26:46 +0000 (11:26 +0300)]
logrotate: send SIGHUP to ceph-exporter on log rotation
ceph-exporter registers a SIGHUP handler that reopens its log files,
but it was missing from the postrotate killall/pkill list. Without the
signal, the daemon keeps an open fd to the already-rotated file and
continues writing there, causing /var/log/ceph to fill up.
Matthew N. Heler [Mon, 30 Mar 2026 23:44:36 +0000 (18:44 -0500)]
rgw: write RGW_ATTR_CRYPT_PREFETCH_ALIGN for AEAD ciphers
Store the plaintext and encrypted block sizes at upload time so
future cls prefetch ops can compute on-disk read ranges from
xattrs without instantiating a cipher.
Only written for size-expanding ciphers (GCM). CBC objects have
no attr — plaintext and ciphertext sizes are identical.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Sun, 29 Mar 2026 02:27:57 +0000 (21:27 -0500)]
rgw: add range projection helpers for encrypted and compressed objects
Add stateless helpers that project plaintext byte ranges to on-disk
byte ranges for compressed and encrypted objects. fixup_range()
delegates to these for range computation.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Sun, 29 Mar 2026 18:48:01 +0000 (13:48 -0500)]
rgw: use stored plaintext size for AEAD segment validation
The SLO/DLO size check was converting encrypted size to plaintext
via rgw_get_aead_decrypted_size(), which overestimates for multipart
objects without CRYPT_PARTS. Use the stored CRYPT_ORIGINAL_SIZE
attr instead; it's exact and already in the attrs.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Thu, 19 Mar 2026 01:46:26 +0000 (20:46 -0500)]
rgw: replace GCM nonce with salt-based key derivation
Move randomness from the GCM IV into key derivation. Each object
now gets a 32-byte random salt stored in RGW_ATTR_CRYPT_SALT, fed
into HMAC-SHA256 alongside bucket_id and object name to produce a
unique per-object key. The GCM IV is a deterministic counter from
the chunk position, which is safe because the key never repeats.
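A rough sketch of the scheme described above, using Python's standard hmac module. The way the salt, bucket_id and object name are concatenated, the 12-byte big-endian counter IV, and all names other than the ones quoted from the commit are assumptions for illustration; the real derive_object_key() may lay out its inputs differently.

```python
import hashlib
import hmac
import os
import struct

SALT_SIZE = 32  # per-object random salt, as stated in the commit message
IV_SIZE = 12    # assumed GCM IV length

def new_salt():
    """Generate a fresh random salt for a new object."""
    return os.urandom(SALT_SIZE)

def derive_object_key(base_key, salt, bucket_id, object_name):
    """HMAC-SHA256 over salt + bucket_id + object name (layout is assumed)."""
    msg = salt
    for field in (bucket_id.encode(), object_name.encode()):
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        msg += struct.pack(">I", len(field)) + field
    return hmac.new(base_key, msg, hashlib.sha256).digest()

def chunk_iv(chunk_index):
    """Deterministic IV: the chunk position encoded as a big-endian counter."""
    return chunk_index.to_bytes(IV_SIZE, "big")

base_key = b"k" * 32
key_a = derive_object_key(base_key, b"\x01" * SALT_SIZE, "bucket-id-1", "obj")
key_b = derive_object_key(base_key, b"\x02" * SALT_SIZE, "bucket-id-1", "obj")
print(key_a != key_b, chunk_iv(1).hex())
```

Because every object gets its own salt, two objects never share a derived key, so reusing the same counter IVs across objects does not repeat a (key, IV) pair.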
All GCM modes (SSE-C, SSE-KMS, SSE-S3, RGW-AUTO) now go through
derive_object_key() before any encrypt or decrypt operation.
Rename AES_GCM_NONCE_SIZE to AES_GCM_IV_SIZE across CryptoAccel
backends (isa-l, openssl, qat) to reflect what it actually is.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Wed, 18 Mar 2026 23:51:49 +0000 (18:51 -0500)]
rgw: use bucket_id instead of bucket name in GCM key derivation
The bucket name isn't globally unique, i.e. different tenants can
have the same bucket name. Using bucket_id (which is globally
unique and includes tenant context) prevents cross-tenant key
collisions in the HMAC-SHA256 derivation.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Sat, 21 Feb 2026 15:27:14 +0000 (09:27 -0600)]
rgw: optimize GCM encrypt/decrypt hot path
Reduce per-chunk overhead by hoisting accelerator resolution and
EVP context creation out of the chunk loop, replacing ct_memeq with
memcmp, linearizing input before the chunk loop, and eliminating
unnecessary tag copies in the ISA-L path. Also rewrites IV derivation
to use cached native arithmetic instead of a per-chunk byte-at-a-time
loop, and aligns the output buffer to 64 bytes for optimal SIMD stores.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
qa/rgw: test GCM encryption in existing crypt and multisite suites
Add an aes facet to the rgw/crypt and rgw/multisite suites so
teuthology runs them with both the default cipher (CBC) and with
rgw_crypt_sse_algorithm set to aes-256-gcm.
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
rgw: add GCM hardware acceleration support via CryptoAccel
Extend the CryptoAccel plugin system to support AES-256-GCM encryption,
following the same pattern established for CBC.
The CryptoAccel base class now includes GCM constants (12-byte nonce,
16-byte tag) and pure virtual methods for gcm_encrypt, gcm_decrypt,
and their batch variants. All derived classes must implement these
methods, maintaining consistency with how CBC is handled.
OpenSSL serves as the fallback when ISA-L is unavailable, using the
EVP API with proper AAD handling. QAT stubs return false since GCM
requires different session setup than CBC; a note has been added to
the QAT acceleration documentation clarifying this limitation.
The RGW integration follows the CBC pattern closely. The previous
gcm_encrypt_chunk and gcm_decrypt_chunk functions have been unified
into gcm_transform() with two overloads: one for EVP-only operation
and one that uses the accelerator exclusively when available, falling
back to EVP only when no accelerator can be loaded. Static assertions
ensure the nonce and tag sizes stay consistent between the acceleration
layer and RGW.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Matthew N. Heler [Wed, 28 Jan 2026 04:06:17 +0000 (22:06 -0600)]
rgw: add AES-256-GCM (AEAD) support for server-side encryption
This adds GCM as an alternative to the existing CBC cipher for SSE-C,
SSE-KMS, SSE-S3, and RGW-AUTO. GCM provides authenticated encryption,
meaning it detects tampering during decryption rather than silently
returning corrupted data.
The new rgw_crypt_sse_algorithm config option controls which cipher is
used for new uploads. The default remains aes-256-cbc for backward
compatibility with older RGW versions in mixed clusters. Once all nodes
are upgraded, administrators can enable aes-256-gcm for new objects.
Existing CBC-encrypted objects continue to decrypt correctly regardless
of this setting.
GCM encrypts in 4KB chunks, each producing 4112 bytes of ciphertext
(4096 plaintext + 16-byte authentication tag). This means encrypted
objects are larger than their plaintext. To preserve correct behavior:
- RGW_ATTR_CRYPT_ORIGINAL_SIZE stores the plaintext size
- Content-Length and bucket listings report the plaintext size
- Range requests translate plaintext offsets to storage offsets
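The size arithmetic above can be sketched as follows. This assumes that a final partial chunk also carries its own 16-byte tag, that an empty object stores no ciphertext, and that ranges are inclusive byte offsets; the function names are hypothetical, not the RGW helpers.

```python
CHUNK = 4096               # plaintext bytes per chunk
TAG = 16                   # authentication tag appended to each chunk
ENC_CHUNK = CHUNK + TAG    # 4112 bytes of ciphertext per full chunk

def encrypted_size(plain_size):
    """Stored size of an object with plain_size bytes of plaintext."""
    full, rem = divmod(plain_size, CHUNK)
    # Assumption: a trailing partial chunk is stored as rem bytes plus a tag.
    return full * ENC_CHUNK + (rem + TAG if rem else 0)

def storage_range(ofs, end, plain_size):
    """Map an inclusive plaintext range to the inclusive stored range.

    Whole chunks must be read, because a tag authenticates its entire chunk.
    """
    first_chunk = ofs // CHUNK
    last_chunk = end // CHUNK
    start = first_chunk * ENC_CHUNK
    stop = min((last_chunk + 1) * ENC_CHUNK, encrypted_size(plain_size)) - 1
    return start, stop

print(encrypted_size(10000))            # two full chunks plus a partial one
print(storage_range(5000, 5010, 10000)) # a range inside the second chunk
```

After decrypting the returned chunks, the caller would still need to trim the plaintext back to the requested offsets.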
Each object gets a random 12-byte nonce stored in RGW_ATTR_CRYPT_NONCE.
This nonce serves two purposes: it's combined with chunk indices to
derive unique IVs for each encrypted block, and for SSE-C it's included
in the key derivation to bind ciphertext to object identity. Moving
encrypted data at the RADOS level causes decryption to fail rather than
silently producing garbage.
Multipart uploads derive per-part keys and use the S3 part number in
IV derivation to guarantee unique IVs across parts. The actual part
numbers are stored in RGW_ATTR_CRYPT_PART_NUMS during CompleteMultipart
to handle non-contiguous uploads (e.g., parts 1, 3, 5).
The implementation uses generic AEAD abstractions (is_aead_mode(),
aead_plaintext_to_encrypted_size(), etc.) so that adding other
authenticated ciphers like ChaCha20-Poly1305 in the future requires
only implementing the cipher itself—the size handling, range request
translation, and multipart machinery will work unchanged.
Originally-by: Kyle Bader <kbader@ibm.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Matthew N. Heler <matthew.heler@hotmail.com>
Expose the existing radosgw-admin dedup commands (stats, estimate, exec,
abort, pause, resume, throttle) as HTTP Admin OPS endpoints under
/{admin}/dedup, following the same pattern used by ratelimit, usage, and
other admin REST APIs.
New files:
- rgw_rest_dedup.h: RGWHandler_Dedup and RGWRESTMgr_Dedup
- rgw_rest_dedup.cc: REST op classes calling the same cluster:: backend
functions as radosgw-admin
API summary:
- GET /dedup?op=stats - collect and display dedup statistics
- GET /dedup?op=throttle - display throttle settings
- POST /dedup?op=estimate - start dedup estimate session
- POST /dedup?op=exec - start full dedup (requires yes-i-really-mean-it)
- POST /dedup?op=abort - abort active dedup session
- POST /dedup?op=pause - pause active dedup session
- POST /dedup?op=resume - resume paused dedup session
- POST /dedup?op=throttle - set throttle limits
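The API summary above can be captured as a small lookup table. The sketch below only builds the method and path for a request; it assumes the admin prefix is "admin" and leaves out request signing and any extra parameters (such as how yes-i-really-mean-it or throttle limits are passed), which the summary does not specify. The helper name is hypothetical.

```python
from urllib.parse import urlencode

# Allowed HTTP methods per op, copied from the API summary.
DEDUP_OPS = {
    "stats": {"GET"},
    "throttle": {"GET", "POST"},
    "estimate": {"POST"},
    "exec": {"POST"},
    "abort": {"POST"},
    "pause": {"POST"},
    "resume": {"POST"},
}

def dedup_request(op, method, admin_prefix="admin"):
    """Return (method, path) for a dedup op, rejecting unlisted combinations."""
    method = method.upper()
    if method not in DEDUP_OPS.get(op, set()):
        raise ValueError(f"{method} is not listed for op={op!r}")
    query = urlencode({"op": op})
    return method, f"/{admin_prefix}/dedup?{query}"

print(dedup_request("stats", "GET"))
print(dedup_request("throttle", "post"))
```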
Documentation added to doc/radosgw/adminops.rst with cross-reference
from doc/radosgw/s3_objects_dedup.rst.
i would prefer to run the s3control test coverage in rgw/verify, but it
depends on rgw_dns_name configuration and support for wildcard dns which
breaks most of the other rgw/verify test cases
the vhost-style transformations ran in RGWREST::preprocess() before we
even route the request, so applied to every REST API in radosgw
vhost-style requests are specific to the S3 API, so they should only
apply after being routed to RGWRESTMgr_S3
extract the vhost logic from RGWREST::preprocess() into
rgw_rest_transform_s3_vhost_style(), and call that only from
RGWRESTMgr_S3::get_resource_mgr_as_default()
url-decoding of request_uri into decoded_uri is now duplicated: once in
preprocess() to apply to all requests, then again after the vhost-style
transform rewrites the request_uri
avoid allocating a list of strings to parse the comma-separated
rgw_enable_apis configuration
the range returned by ceph::split() has no size() function, so change
the calculation to not require it - `size() - distance(begin(), pos)`
is the same thing as `distance(pos, end())`
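The same identity can be illustrated in Python with a lazy splitter that, like the range from ceph::split(), has no length. This is only an analogy, not the C++ change itself; `split_lazy` and `remaining_from` are hypothetical names, and the API list is an example value.

```python
def split_lazy(text, sep=","):
    """Yield tokens one at a time; the generator has no len()."""
    start = 0
    while True:
        end = text.find(sep, start)
        if end == -1:
            yield text[start:]
            return
        yield text[start:end]
        start = end + 1

def remaining_from(tokens, wanted):
    """Count tokens from `wanted` to the end: the distance(pos, end()) form."""
    it = iter(tokens)
    for tok in it:
        if tok == wanted:
            # pos points at `wanted`; count it plus everything after it.
            return 1 + sum(1 for _ in it)
    return 0

apis = "s3,admin,swift"
# Eager form: size() - distance(begin(), pos), which needs a sized container.
as_list = apis.split(",")
eager = len(as_list) - as_list.index("admin")
# Lazy form: distance(pos, end()), which needs no size at all.
lazy = remaining_from(split_lazy(apis), "admin")
print(eager, lazy)
```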
Casey Bodley [Mon, 30 Jun 2025 22:06:08 +0000 (18:06 -0400)]
rgw: add helper for bucket + account PublicAccessBlock config
get_public_access_conf() takes an optional account, and checks
RGW_ATTR_PUBLIC_ACCESS on that in addition to the bucket. if both attrs
are found, return the union of their configurations
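One way to read "the union of their configurations" is a per-flag OR over the four S3 PublicAccessBlock settings. The sketch below follows that reading as an assumption; the Python function mirrors the helper's name for readability only, and the real signature and attribute decoding are not shown in the commit message.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class PublicAccessBlock:
    # The four flags of the S3 PublicAccessBlock configuration.
    block_public_acls: bool = False
    ignore_public_acls: bool = False
    block_public_policy: bool = False
    restrict_public_buckets: bool = False

def get_public_access_conf(bucket_conf: Optional[PublicAccessBlock],
                           account_conf: Optional[PublicAccessBlock] = None):
    """Merge bucket and account settings; a flag is on if either side sets it."""
    bucket_conf = bucket_conf or PublicAccessBlock()
    account_conf = account_conf or PublicAccessBlock()
    merged = {
        f.name: getattr(bucket_conf, f.name) or getattr(account_conf, f.name)
        for f in fields(PublicAccessBlock)
    }
    return PublicAccessBlock(**merged)

bucket = PublicAccessBlock(block_public_acls=True)
account = PublicAccessBlock(restrict_public_buckets=True)
print(get_public_access_conf(bucket, account))
```

With this reading, an account-level setting can only tighten what a bucket allows, never loosen it.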