Oguzhan Ozmen [Tue, 24 Mar 2026 17:54:22 +0000 (17:54 +0000)]
doc: add PendingReleaseNotes entry for rgw multisite DNS endpoint resolution
Documents the new rgw_rest_conn_connect_to_resolved_ips feature, which
enables RGW to resolve the HTTP endpoints of RGW services, such as
multisite, into all of their IP addresses and to distribute requests
across them using round-robin with per-IP health tracking. This supports
DNS service discovery deployments without external load balancers.
Oguzhan Ozmen [Wed, 6 May 2026 19:17:55 +0000 (19:17 +0000)]
rgw/multisite: fix endpoint unreachable detection in RGWRESTConn sync paths
The checks that decide whether to call set_endpoint_unconnectable()
were comparing against -EIO, but the actual error codes returned on
connection failure changed after commit 37352a9074 ("rgw: change
rgw_http_error_to_errno default to -ERR_INTERNAL_ERROR").
- complete_request() -> wait() returns req_data->ret which is set to
rgw_http_error_to_errno(0) = -ERR_INTERNAL_ERROR when http_status is 0
(TCP connect failed). Fix the six call sites to check -ERR_INTERNAL_ERROR.
- forward_request() returns tl::unexpected(-ERR_SERVICE_UNAVAILABLE)
when http_status == 0 (no HTTP response received at all). Fix the two
forward/forward_iam conditionals to check -ERR_SERVICE_UNAVAILABLE.
Without this fix, connection failures are never detected in the sync
paths, so set_endpoint_unconnectable() is never called and the
IP failover / retry logic is effectively dead.
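As a rough illustration of why the stale comparison left the failover logic dead, the following Python sketch models the check. The numeric values of ERR_INTERNAL_ERROR and ERR_SERVICE_UNAVAILABLE are placeholders chosen for the example, and http_error_to_errno() is a toy stand-in covering only the two cases needed here, not the real rgw_http_error_to_errno().

```python
import errno

# Placeholder values for illustration only; the real constants are
# defined in the RGW sources and may differ.
ERR_INTERNAL_ERROR = 2200
ERR_SERVICE_UNAVAILABLE = 2201


def http_error_to_errno(http_status: int) -> int:
    """Toy stand-in: success maps to 0, everything else to the default."""
    if 200 <= http_status < 300:
        return 0
    return -ERR_INTERNAL_ERROR


def is_unconnectable_stale(ret: int) -> bool:
    # Old check: compares against -EIO, which is no longer returned.
    return ret == -errno.EIO


def is_unconnectable_fixed(ret: int) -> bool:
    # New check: compares against the current default error code.
    return ret == -ERR_INTERNAL_ERROR


# http_status == 0 models "TCP connect failed, no HTTP response".
ret = http_error_to_errno(0)
print(is_unconnectable_stale(ret), is_unconnectable_fixed(ret))
```

With the stale check the connection failure goes unnoticed, so nothing would ever mark the endpoint as unconnectable.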
The .h coroutine paths were already fixed by dbb409e21b9 ("rgw: fix
endpoint detection in RGWRESTConn") but that commit missed all .cc
sync paths.
Replace raw string manipulation in RGWEndpoint with boost::urls::url.
URL path/query/host changes now use set_path(), set_query(), set_host()
instead of string concatenation.
Rename original_url to endpoint_url_lookup_id to clarify its role as a
health-tracking key for ResolvedEndpoint lookup in
set_endpoint_unconnectable().
Oguzhan Ozmen [Tue, 24 Mar 2026 15:52:38 +0000 (15:52 +0000)]
doc/radosgw: expose rgw_rest_conn_connect_to_resolved_ips and rgw_rest_conn_ip_fail_timeout_secs
Expose the confval entries for "rgw_rest_conn_connect_to_resolved_ips"
and "rgw_rest_conn_ip_fail_timeout_secs" so that they appear in the
"Ceph Object Gateway Config Reference", as these are meant to be client-facing configs.
Oguzhan Ozmen [Tue, 3 Mar 2026 20:46:38 +0000 (20:46 +0000)]
rgw: rename round-robin counters for brevity
Rename endpoint_round_robin_counter to endpoint_rr_index and
endpoint_ips_round_robin_counter to ip_rr_index for shorter,
cleaner variable names while maintaining clarity.
Oguzhan Ozmen [Tue, 3 Mar 2026 16:49:21 +0000 (16:49 +0000)]
rgw/zone: increase visibility into zone connections via admin socket
Adds a new admin socket command to dump zone connection details
including endpoints, resolved IPs, and health status. Useful for
debugging multisite connectivity issues.
Usage: ceph daemon <radosgw.asok> zone connections
Oguzhan Ozmen [Tue, 3 Mar 2026 00:45:59 +0000 (00:45 +0000)]
rgw/rest: track connection failures per-IP instead of per-endpoint
Previously, when a connection to a zone endpoint failed, the entire
endpoint was marked as unavailable for a timeout period. Since we now
resolve endpoints to all their IP addresses (via DNS A/AAAA records),
we can be more granular: track failures at the individual IP level.
Introduce ResolvedIP struct that pairs each IP's connect_to string
with its own failure timestamp. When selecting an IP for a request,
round-robin skips IPs that have recently failed, allowing traffic to
continue flowing to healthy nodes even when some are down.
An endpoint-level last_failure_time is maintained as a fast-path
optimization to avoid scanning all IPs when none have failed recently.
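The selection scheme described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the RGW implementation: the class layout, the pick_ip() and mark_failed() helpers, the explicit "now" argument, and returning None when every IP has failed recently are all hypothetical choices made for the example. Only the names ResolvedIP, ResolvedEndpoint, connect_to, last_failure_time and ip_rr_index come from the commit messages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResolvedIP:
    connect_to: str                             # address to connect to
    last_failure_time: Optional[float] = None   # None means "never failed"


@dataclass
class ResolvedEndpoint:
    url: str
    ips: List[ResolvedIP]
    last_failure_time: Optional[float] = None   # endpoint-level fast path
    ip_rr_index: int = 0

    def mark_failed(self, ip: ResolvedIP, now: float) -> None:
        # Record the failure on the IP and on the endpoint as a whole.
        ip.last_failure_time = now
        self.last_failure_time = now

    def pick_ip(self, now: float, fail_timeout: float) -> Optional[ResolvedIP]:
        def recent(ts: Optional[float]) -> bool:
            return ts is not None and now - ts < fail_timeout

        # Fast path: if nothing failed recently, skip the per-IP checks.
        need_check = recent(self.last_failure_time)
        for _ in range(len(self.ips)):
            ip = self.ips[self.ip_rr_index % len(self.ips)]
            self.ip_rr_index += 1
            if not need_check or not recent(ip.last_failure_time):
                return ip
        return None  # assumption: no healthy IP is reported as None
```

For example, with three IPs and the second one marked as failed, successive pick_ip() calls inside the timeout window alternate between the first and third IP; once the timeout has elapsed, the second IP is eligible again.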
Oguzhan Ozmen [Mon, 2 Mar 2026 18:11:27 +0000 (18:11 +0000)]
rgw/rest: consolidate endpoint_urls and resolved_endpoints into single vector
Previously RGWRESTConn stored endpoints in two data structures:
- endpoint_urls: vector<string> for ordered round-robin iteration
- resolved_endpoints: unordered_map<string, ResolvedEndpoint> for lookup
This was redundant since the URL was stored in both places.
Oguzhan Ozmen [Sun, 8 Feb 2026 00:30:13 +0000 (00:30 +0000)]
rgw: add operator<< for RGWEndpoint and simplify logging
Add ostream operator<< to RGWEndpoint struct for convenient logging of
endpoint details (url, original_url when different, and connect_to).
Update log statements across rgw_http_client.cc, rgw_rest_client.cc,
and rgw_rest_conn.cc to use the new operator for cleaner, more
consistent output.
Add unittest_rgw_http_client to test RGWEndpoint functionality.
Oguzhan Ozmen [Sat, 7 Feb 2026 17:33:42 +0000 (17:33 +0000)]
rgw: track original URL within RGWEndpoint instead of separate member (refactor)
Refactor endpoint tracking by adding original_url to RGWEndpoint struct
instead of maintaining a separate endpoint_orig member in RGWHTTPClient.
This simplifies the code by having each endpoint self-track its original
URL, which is needed for connection status lookups after URL modifications.
Oguzhan Ozmen [Sat, 7 Feb 2026 01:45:20 +0000 (01:45 +0000)]
rgw/rest: consolidate endpoint status tracking into ResolvedEndpoint
Refactor RGWRESTConn to eliminate the separate endpoints_status map by
moving the connection status (std::atomic<ceph::real_time>) directly
into the ResolvedEndpoint struct. This reduces redundancy and simplifies
endpoint state management.
Dhairya Parmar [Wed, 20 May 2026 21:18:15 +0000 (02:48 +0530)]
mds: persist session auth_name in ESession journal event
So that it can be applied to the freshly created session. A session is
recreated in ESession::replay when the OMAP version has fallen behind
the ESession cmapv; without the auth_name, the newly created session
would be rejected as a target when a client tries to reclaim it.
Kefu Chai [Mon, 1 Jun 2026 10:40:06 +0000 (18:40 +0800)]
qa/cephadm: query iSCSI gateway FQDN from inside the container
rbd-target-api validates that the gateway hostname supplied by gwcli
matches the container's own socket.getfqdn(). Running the same call on
the host can return a different value when the host and container resolve
names differently (e.g. on Rocky 10), causing gateway creation to fail
with HTTP 400 and all subsequent gwcli configuration to break silently.
Query the FQDN from inside the iSCSI container directly so the value is
always consistent with what rbd-target-api expects. This also removes the
"run twice" workaround, which was compensating for host-side DNS
warm-up flakiness rather than addressing the underlying mismatch.
Kefu Chai [Mon, 1 Jun 2026 05:19:04 +0000 (13:19 +0800)]
test/libcephfs: reduce SnapDiffDeletionRecreation bulk_count on Windows
this test timed out on Windows, while HugeSnapDiffLargeDelta, at half
the file count, passed in 508 seconds on the same run, suggesting this
test takes ~17 minutes on Windows -- beyond the test runner limit.
we haven't profiled the Windows client yet, but the likely culprit is
EventPoll, the Windows messenger backend, which scans the entire poll
array on every event_wait() and poll_ctl() call rather than using a
keyed data structure.
in this change, we reduce bulk_count to 1 << 12 on Windows. the unique
thing this test covers is the deletion-recreation pattern: a name that
exists as a file in snap1, gets deleted, and reappears as a directory in
snap2 -- it must show up in the diff with both snapids. 4096 produces
1024 such pairs, which is enough to exercise that logic. multi-fragment
snapdiff is already covered by HugeSnapDiffLargeDelta, which derives its
file count from mds_bal_split_size and mds_bal_fragment_fast_factor
explicitly to trigger fragmentation.
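The runtime estimate and the new count can be checked with a few lines of Python. The sketch assumes runtime scales linearly with file count, which is the assumption behind the "~17 minutes" figure rather than a measured result; the variable names are made up for the example.

```python
# New Windows value for bulk_count.
bulk_count_windows = 1 << 12

# HugeSnapDiffLargeDelta ran at half the file count and took 508 seconds.
half_count_runtime_s = 508

# Assumption: doubling the file count doubles the runtime.
estimated_runtime_s = 2 * half_count_runtime_s
estimated_runtime_min = estimated_runtime_s / 60

print(bulk_count_windows, estimated_runtime_s, round(estimated_runtime_min, 1))
# prints: 4096 1016 16.9
```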
Sun Yuechi [Sat, 30 May 2026 06:15:12 +0000 (14:15 +0800)]
cmake: add WITH_SYSTEM_SPDK to link a system-installed SPDK
By default ceph builds the bundled src/spdk fork via BuildSPDK. Add a
WITH_SYSTEM_SPDK option that instead locates a distro-provided SPDK
through a new Findspdk.cmake (pkg-config based, modelled on
Finddpdk.cmake), exposing the same spdk::spdk target.
Sun Yuechi [Sat, 30 May 2026 06:11:11 +0000 (14:11 +0800)]
blk/spdk: support both old and new spdk_env_opts member names
SPDK 21.01 renamed two struct spdk_env_opts members: pci_whitelist ->
pci_allowed and master_core -> main_core. Guard the assignments in
NVMEDevice with SPDK_VERSION.
The POSIXBucket copy constructor incorrectly calls .get() on a
temporary unique_ptr returned by clone(), causing immediate
deletion of the Directory object. This leaves a dangling pointer
that triggers a segfault during destruction.
Matt Benjamin [Tue, 3 Feb 2026 22:12:22 +0000 (17:12 -0500)]
posixdriver: fix cksum_type, flags propagation
Posixdriver doesn't serialize POSIXMultipartUpload, but rather a
member mp_obj of type POSIXMPObj--so to avoid losing the latter's
inherited cksum_type and cksum_flags members (which are already
copied in), copy them out in POSIXMultipartUpload::get_info() which
we need to call to copy out dest_placement anyway.
(oops, cksum_type was copied in, but not cksum_flags)
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Sun, 15 Feb 2026 20:56:03 +0000 (15:56 -0500)]
posixdriver: fix cache fill of versioned buckets
This change completes the original intent (hypothesized) to
conditionally set the FLAG_CURRENT bit on just the current
entries during bucket listing cache fill.
This avoids interning 2 copies of the current version of each
object in the listing cache, and also correctly sets the
FLAG_CURRENT bit as required--so the current versions are correctly
reported in versioned listings.
Janky logic to find the current version by explicitly chasing
the symlink target and saving it outside the enumeration scope
has been replaced with a proper call to stat() provided by Dang.
Symlink::fill_cache() is no longer used, so removed.
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Sun, 15 Feb 2026 15:21:28 +0000 (10:21 -0500)]
posixdriver: add bde.flags to bucket cache serde cycle
The upstream logic (mostly?) correctly uses bde.flags when filling
the cache for versioned objects, but cache ser(de)ialization has
been discarding that member.
This change suppresses the visible result where RGW incorrectly produces
multiple versions in non-versioned listings because none uniquely sets
FLAG_CURRENT.
Cached listings for versions are still incorrect in containing an
extra entry for the "current" version with an empty instance
(from the Symlink)--the visible effect being that list-object-versions
output is incorrect (no entry is sent with IsLatest, after the
empty instance version has been filtered out).
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Jacques Heunis [Thu, 15 Jan 2026 12:11:11 +0000 (12:11 +0000)]
tools/rados: Remove plain text snippets from rados bench JSON output
`rados bench` emits performance stats as its output. It is very helpful
for this output to be in a machine-readable format and the CLI provides
the `--format=json` flag to achieve this.
There are some logs that do not respect the formatter flag though, as
they provide status updates as the tool is running and do not form part
of the output dataset. This prevents the contents of stdout from being
valid JSON which destroys the machine-readability of the output.
To resolve this we gate those status messages behind a check for the
formatter. If any specific formatter is provided we do not emit the
status logs. This leaves the plaintext output largely untouched while
helping the machine-readable output to be well-formed.
Fixes: https://tracker.ceph.com/issues/74370
Signed-off-by: Jacques Heunis <jheunis@bloomberg.net>
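The gating idea in the rados bench change can be sketched in Python. This is an illustration of the pattern only: run_bench(), its parameters, and the message texts are hypothetical and do not reflect the actual rados tool code or its output fields.

```python
import io
import json


def run_bench(out, formatter=None, ticks=3):
    """Toy benchmark loop: progress lines are emitted only in plain mode."""
    total_ops = 0
    for i in range(ticks):
        total_ops += 1
        if formatter is None:
            # Status updates would corrupt structured output, so they are
            # skipped whenever a formatter was requested.
            out.write(f"tick {i}: {total_ops} ops so far\n")
    if formatter == "json":
        out.write(json.dumps({"ops": total_ops}) + "\n")
    else:
        out.write(f"Total ops: {total_ops}\n")


buf = io.StringIO()
run_bench(buf, formatter="json")
print(json.loads(buf.getvalue()))  # stdout content parses as JSON
```

Because the progress lines are suppressed only when a formatter is set, the plain-text output keeps its status updates.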
Jamie Pryde [Fri, 29 May 2026 11:44:56 +0000 (12:44 +0100)]
qa: Ignore deprecated EC plugin warning in teuthology tests
Add DEPRECATED_EC_PLUGIN to the list of health warnings to
ignore in the thrash-erasure-code-* tests that use deprecated
plugins or techniques. It is expected that this warning will
be raised.
Sun Yuechi [Fri, 29 May 2026 10:39:51 +0000 (18:39 +0800)]
rgw: move SWIFT error_handler out-of-line to fix link failure
The two error_handler overrides are defined inline in rgw_rest_swift.h
and delegate to RGWSwiftWebsiteHandler::error_handler, a non-virtual
function defined in rgw_rest_swift.cc (librgw_a.a). Because the header
is included by rgw_rest.cc, the inline bodies are emitted in
librgw_common.a, which then ODR-uses that symbol across archives.
The link line lists librgw_a.a before librgw_common.a, and GNU ld only
pulls archive members on demand: when librgw_a.a is scanned nothing yet
references RGWSwiftWebsiteHandler::error_handler, so rgw_rest_swift.cc.o
is dropped and the symbol is later unresolved. This shows up as a link
failure with gcc 16 -O2.
Move the two bodies into rgw_rest_swift.cc next to the function they
call, so the ODR-use stays within the same object and the build no
longer depends on archive scan order. No functional change.
Vallari Agrawal [Wed, 27 May 2026 12:17:55 +0000 (17:47 +0530)]
qa/suites/nvmeof: ignore "have only 1 nvmeof gateway"
Add "have only 1 nvmeof gateway" to ignorelist.
NVMEOF_SINGLE_GATEWAY is already part of ignorelist
but tests sometimes fail on "have only 1 nvmeof gateway".
Thrasher or scalability tests can trigger this, but there
are enough asserts to ensure all expected gateways are
up, so we can safely ignore this healthcheck warning.
Redouane Kachach [Fri, 29 May 2026 09:09:44 +0000 (11:09 +0200)]
qa/tasks: capture CommandCrashedError when running nvme list cmd
The safe_while retry loop does not catch exceptions, so a
CommandCrashedError from `nvme list` bypasses it entirely. Catch
CommandCrashedError and continue the retry loop instead.
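A simplified Python sketch of the pattern: the exception is caught inside the loop body so that a crash counts as one failed attempt instead of aborting the loop. The retry() helper and the local CommandCrashedError class are stand-ins written for this example; they are not the teuthology safe_while API, and a real loop would also sleep between attempts.

```python
class CommandCrashedError(Exception):
    """Local stand-in for the exception named in the commit message."""


def retry(action, tries):
    """Call action() up to `tries` times, treating a crash as a retry."""
    last_error = None
    for _ in range(tries):
        try:
            return action()
        except CommandCrashedError as e:
            # Without this handler the exception would escape the loop
            # and no further attempts would be made.
            last_error = e
            continue
    raise RuntimeError(f"gave up after {tries} tries") from last_error


calls = {"n": 0}


def flaky_nvme_list():
    # Hypothetical command wrapper: crashes twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise CommandCrashedError("nvme list crashed")
    return "nvme0n1"


print(retry(flaky_nvme_list, tries=5))  # prints: nvme0n1
```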