matt benjamin [Fri, 16 May 2025 16:02:20 +0000 (12:02 -0400)]
rgw: defensive fix for crash attemping part-copy of '%' versioned obj
The proximate cause of the issue actually appears to be in recognizing
the key.name of the object, only failing in rgw_rados due to an assert
on key.name being non-empty.
Signed-off-by: matt benjamin <mbenjamin@redhat.com>
(cherry picked from commit 5111b625a174aa2eaeb4be943dec9fe4b9d948af) Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Mon, 30 Jun 2025 14:26:25 +0000 (10:26 -0400)]
rgwlc: fix removal of delete markers (SAL)
S3 delete markers do not have head objects, and SAL's Object::load_obj_state()
returns -ENOENT in this case. Handle this case in LC's remove_expired_obj().
Matt Benjamin [Sun, 10 Aug 2025 18:05:43 +0000 (14:05 -0400)]
rgw:chksum: pull up aws-sdk-java-v2 and fix S3Builder invocation
This commit pulls up aws-sdk-java-v2 to 2.32.2, which has trailing header
formatting previously seen with golang v2 sdk--for which the upstream
*Reef* logic is not present (see prior commit by Yixin Jin).
And it fixes the construction of S3Client to accept endpoint self-signed
certificates--logic which is present in the main function example code
in jcksum.java, but somehow not in putobjects.java (anymore?).
Yixin Jin [Sun, 10 Aug 2025 15:59:18 +0000 (11:59 -0400)]
rgw:cksum: fix two checksum-trailer related signing issues
1. return error code on signature mismatch (should be 400,
XAmzContentSHA256Mismatch
2. reorder final chunk extraction and signing to better address
what we were handling as a special case of a few trailing bytes--
this is arising because the implementer was working against Reef,
which I guess doesn't have the extra extraction logic (c.f.,
ceph/main and its upstream backport)
(A change to catch rgw::io::Exception at rgw_process_authenticated
has been removed, as it is already handled in the only applicable
path.)
Matt Benjamin [Sun, 18 May 2025 01:02:34 +0000 (21:02 -0400)]
rgw: aws-chunked need not supply any content-length
The updated logic for aws chunked handling (2024) appears sufficient
to handle the cases produced by aws-sdk-go-v2.
Note that https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html
states that "For all requests, you must include the
x-amz-decoded-content-length header, specifying the size of the object in
bytes." (accessed 5/17/2025) (but now we do not enforce it).
Matt Benjamin [Sat, 17 May 2025 19:52:20 +0000 (15:52 -0400)]
rgw: recognize checksum from x-amz-checksum-{type} alone
Some SDKs may send x-amz-checksum-algorithm or
x-amz-sdk-checksum-algorithm regardless as well, but those are
only required if the checksum header is in the trailer section.
Max Kellermann [Thu, 24 Apr 2025 11:22:55 +0000 (13:22 +0200)]
rgw/rgw_cksum: work around -Wsometimes-uninitialized
clang complains that `cck3` might not be initialized:
```
/home/jenkins-build/build/workspace/ceph-api/src/rgw/rgw_cksum.cc:74:2: error: variable 'cck3' is used uninitialized whenever switch default is taken [-Werror,-Wsometimes-uninitialized]
74 | default:
| ^~~~~~~
/home/jenkins-build/build/workspace/ceph-api/src/rgw/rgw_cksum.cc:78:31: note: uninitialized use occurs here
78 | cck3 = rgw::digest::byteswap(cck3);
| ^~~~
/home/jenkins-build/build/workspace/ceph-api/src/rgw/rgw_cksum.cc:61:15: note: initialize the variable 'cck3' to silence this warning
61 | uint32_t cck3;
| ^
| = 0
```
The `default:` case however is not reachable because `ck1.type` has
already been checked. Adding initializers to `cck3` would only hide
potential future bugs, therefore I suggest just bailing out of the
function for this unreachable piece of code. With C++23, we could use
`std::unreachable()` instead.
Sachin Punadikar [Thu, 21 Aug 2025 10:09:17 +0000 (06:09 -0400)]
NFS CONF: Disable dentry caching in Ganesha
Disbale dentry caching in Ganesha. This caching leads to inconsistent
directory listing to connected NFS clients.
Fixes - https://tracker.ceph.com/issues/72797
rgw/multisite: handle secondary zone's response appropriately
depending on primary zone's version.
decode primary's response only when generate-key is true.
rgw/multisite: forward create_key request to master, fetch the newly created key
and store it on secondary. also, include 'create_date' in the user info response to
help identify timestamp of each key.
Mark Kogan [Mon, 27 May 2024 17:01:01 +0000 (17:01 +0000)]
rgw: qat: if necesary initialize the `qat` supplemental group
when RGW is started as an entry point of a container the shell
does not have the opportunity to initialize the supplemental groups
hence the `sudo usermod -a -G qat <USER>` has not taken effect,
a call to `man 3 initgroups` is necessary
John Mulligan [Tue, 22 Jul 2025 23:24:11 +0000 (19:24 -0400)]
mgr/smb: add new cephfs parameter for getting fscrypt keys
Add a new field to the cephfs configuration section for shares. This
section selects the keybridge scope and key name to use when acquiring
the key to use for fscrypt.
John Mulligan [Tue, 22 Jul 2025 23:22:15 +0000 (19:22 -0400)]
mgr/smb: add keybridge configuration to cluster resource
Add keybridge service configuration classes and parameters to the
resources module. This supports enabling the keybridge, setting up
scopes for the keybridge and it's access control.
A helper class is added that parses and helps validate the scope names.
John Mulligan [Wed, 16 Jul 2025 21:55:44 +0000 (17:55 -0400)]
mgr/smb: add enums that will be used for configuring keybridge
Add a pair of enum types that will be used for configuring the
keybridge. The scope type identifies what kind of scope is being
used. The peer policy can be used to allow a dev or other user
more access to the keybridge api for development purposes.
John Mulligan [Fri, 18 Jul 2025 14:23:31 +0000 (10:23 -0400)]
mgr/smb: fix a resource error unpacking str instead of list
Add special handling for the case where a string is passed instead of a
list. Without this fix a string will be converted into a list of single
letter items, something pretty much no one ever wants. Raise an
exception instead.
John Mulligan [Fri, 18 Jul 2025 16:20:17 +0000 (12:20 -0400)]
cephadm: add keybridge sidecar to smb daemon module
The keybridge uses the sambacc configuration but can also be passed
CLI options. Since cephadm writes the cert files, cephadm must also
pass the file names to use to the container args.
John Mulligan [Wed, 16 Jul 2025 21:08:49 +0000 (17:08 -0400)]
python-common/deployment: add keybridge feature to smb service spec
The keybridge sidecar is enabled by the keybridge feature flag.
This sidecar will be used to help fetch keys over various protocols
for the ceph module to use to set up fs encryption.
Adam Kupczyk [Mon, 31 Mar 2025 11:38:08 +0000 (13:38 +0200)]
os/bluestore: Add ability to ignore BlueFS zombie files
Under normal circumstances BlueFS _replay() procedure
does not allow for zombie files to be present.
Zombie files are files that are declared but are not attached to any name+dir.
One exception if special BlueFS Log file (ino 1).
This change introduces configurable 'bluefs_log_replay_remove_zombie_files'.
When set to 'true', instead of refusing to mount, BlueFS logs the error and ignores the file.
This is equivalent of removing it, with the distinction being that until BlueFS log is
compacted, one cannot revert the option back, or the problem will reemerge.
Jamie Pryde [Wed, 13 Aug 2025 10:57:40 +0000 (11:57 +0100)]
erasure-code: use cauchy if K and M values are not supported by ISA-L reed_sol_van
ISA-L supports a limited set of K and M values when using a vandermonde matrix. There are no such limitations when using a cauchy matrix. If the user specifies reed_sol_van (or does not specify a technique and relies on the default reed_sol_van setting) and an unsupported K/M combination, then we will automatically switch the technique for the new EC profile to cauchy. Benchmarking has not shown any noticeable performance differences between ISA-L in reed_sol_van vs cauchy modes.
osd: stop scrub_purged_snaps() from ignoring osd_beacon_report_interval
OSD beacons could be burdersome to the enitre cluster, as they lead
to generation of new `OSDMap` epochs. Therefore their frequency is
restricted through `osd_beacon_report_interval` to 5 mins by default.
Unfortunately, the `OSD::send_purged_snaps()` is unaware about this
policy with the net result being storm of OSDMaps. This patch unifies
its behavior with `OSD::tick_without_osd_lock()`.
mgr/DaemonState: Minimise time we hold the DaemonStateIndex lock
Calling back into python functions whilst holding the lock can result in
this thread being queued for the GIL and resulting in extended delays
for threads waiting to acquire the lock.
Jon Bailey [Wed, 20 Aug 2025 10:11:09 +0000 (11:11 +0100)]
osd: Reduce the amount of status invalidations when rolling shards forwards during peering
Currently stats invalidations happen during peering when rolling forward shards.
We can reduce this so we only invalidate the stats when we don't have any other shards at the version we want to roll the stats forwards to.
In the cases where we have a shard with the stats at the correct version, we use those stats instead of invalidating.
If we do not have any shards with the correct version of stats, we do the invalidate as before.
* Current primary shard has been absent so has missed the latest few writes
* All the recent writes are partial writes that have not updated shard X
* All the recent writes have completed
The authorative shard is chosen from the set of primary-capable shards
that have the highest last epoch started, these have all got log entries
for the recent writes.
The get log shard is chosen from the set of shards that have the highest
last epoch started, this chooses shard X because its furthest behind
The primary shard last update is not less than get log shard last
update so this if statement decides that it has a good enough log:
We then proceed through peering using the primary log and the
log from shard X. Neither have details about the recent writes
which are then incorrectly rolled back.
The if statement should be looking at last_update for the
authorative shard rather than the get_log_shard, the code
would then realize that it needs to get the log from the
authorative shard first and then have a second pass
where it gets the log from the get log shard.
Peering would then have information about the partial writes
(obtained from the authorative shards log) and could correctly
roll these writes forward by deducing that the get_log_shard
didn't have these log entries because they were partial writes.
Alex Ainscow [Fri, 8 Aug 2025 09:25:53 +0000 (10:25 +0100)]
osd: Fix segfault in EC debug string
The old debug_string implementation was potentially reading up to 3
bytes off the end of an array. It was also doing lots of unnecessary
bufferlist reconstructs. This refactor of this function fixes both
issues.
Bill Scales [Fri, 8 Aug 2025 08:58:14 +0000 (09:58 +0100)]
osd: Optimized EC backfill interval has wrong versions
Bug in the optimized EC code creating the backfill
interval on the primary. It is creating a map with
the object version for each backfilling shard. When
there are multiple backfill targets the code was
overwriting oi.version with the version
for a shard that has had partial writes which
can result in the object not being backfilled.
Can manifest as a data integirty issue, scrub
error or snapshot corruption.
Bill Scales [Mon, 4 Aug 2025 15:24:41 +0000 (16:24 +0100)]
osd: Optimized EC choose_acting needs to use best primary shard
There have been a couple of corner case bugs with choose_acting
with optimized EC pools in the scenario where a new primary
with no existing log is choosen and find_best_info selects
a non-primary shard as the authorative shard.
Non-primary shards don't have a full log so in this scenario
we need to get the log from a shard that does have a complete
log first (so our log is ahead or eqivalent to authorative shard)
and then repeat the get log for the authorative shard.
Problems arise if we make different decisions about the acting
set and backfill/recovery based on these two different shards.
In one bug we osicillated between two different primaries
because one primary used one shard to making peering decisions
and the other primary used the other shard, resulting in
looping flip/flop changes to the acting_set.
In another bug we used one shard to decide that we could do
async recovery but then tried to get the log from another
shard and asserted because we didn't have enough history in
the log to do recovery and should have choosen to do a backfill.
This change makes optimized EC pools always choose the
best !non_primary shard when making decisions about peering
(irrespective of whether the primary has a full log or not).
The best overall shard is now only used for get log when
deciding how far to rollback the log.
It also sets repeat_getlog to false if peering fails because
the PG is incomplete to avoid looping forever trying to get
the log.
Alex Ainscow [Fri, 1 Aug 2025 14:09:58 +0000 (15:09 +0100)]
osd: Do not sent PDWs if read count > k
The main point of PDW (as currently implemented) is to reduce the amount
of reading performed by the primary when preparing for a read-modify-write (RMW).
It was making the assumption that if any recovery was required by a
conventional RMW, then a PDW is always better. This was an incorrect assumption
as a conventional RMW performs at most K reads for any plugin which
supports PDW. As such, we tweak this logic to perform a conventional RMW
if the PDW is going to read k or more shards.
This should improve performance in some minor areas.
Alex Ainscow [Wed, 18 Jun 2025 19:46:49 +0000 (20:46 +0100)]
osd: Fix decode for some extent cache reads.
The extent cache in EC can cause the backend to perform some surprising reads. Some
of the patterns were discovered in test that caused the decode to attempt to
decode more data than was anticipated during the read planning, leading to an
assert. This simple fix reduces the scope of the decode to the minimum.
Bill Scales [Fri, 1 Aug 2025 10:48:18 +0000 (11:48 +0100)]
osd: Optimized EC calculate_maxles_and_minlua needs to use ...
exclude_nonprimary_shards
When an optimized EC pool is searching for the best shard that
isn't a non-primary shard then the calculation for maxles and
minlua needs to exclude nonprimary-shards
This bug was seen in a test run where activating a PG was
interrupted by a new epoch and only a couple of non-primary
shards became active and updated les. In the next epoch
a new primary (without log) failed to find a shard that
wasn't non-primary with the latest les. The les of
non-primary shards should be ignored when looking for
an appropriate shard to get the full log from.
This is safe because an epoch cannot start I/O without
at least K shards that have updated les, and there
are always K-1 non-primary shards. If I/O has started
then we will find the latest les even if we skip
non-primary shards. If I/O has not started then the
latest les ignoring non-primary shards is the
last epoch in which I/O was started and has a good
enough log+missing list.
Bill Scales [Fri, 1 Aug 2025 09:39:16 +0000 (10:39 +0100)]
osd: Optimized EC choose_async_recovery_ec must use auth_shard
Optimized EC pools modify how GetLog and choose_acting work,
if the auth_shard is a non-primary shard and the (new) primary
is behind the auth_shard then we cannot just get the log from
the non-primary shard because it will be missing entries for
partial writes. Instead we need to get the log from a shard
that has the full log first and then repeat GetLog to get
the log from the auth_shard.
choose_acting was modifying auth_shard in the case where
we need to get the log from another shard first. This is
wrong - the remainder of the logic in choose_acting and
in particular choose_async_recovery_ec needs to use the
auth_shard to calculate what the acting set will be.
Using a different shard occasional can cause a
different acting set to be selected (because of
thresholds about the number of log entries behind
a shard needs to be to perform async recovery) and
this can lead to two shards flip/flopping with
different opinions about what the acting set should be.
Fix is to separate out which shard will be returned
to GetLog from the auth_shard which will be used
for acting set calculations.
Bill Scales [Fri, 1 Aug 2025 09:22:47 +0000 (10:22 +0100)]
osd: Optimized EC don't try to trim past crt
If there is an exceptionally long sequence of partial writes
that did not update a shard that is followed by a full write
then it is possible that the log trim point is ahead of the
previous write to the shard (and hence crt). We cannot trim
beyond crt. In this scenario its fine to limit the trim to crt
because the shard doesn't have any of the log entries for the
partial writes so there is nothing more to trim.
Bill Scales [Fri, 1 Aug 2025 08:56:23 +0000 (09:56 +0100)]
osd: Optimized EC missing call to apply_pwlc after updating pwlc
update_peer_info was updating pwlc with a newer version received
from another shard, but failed to update the peer_info's to
reflect the new pwlc by calling apply_pwlc.
Scenario was primary receiving an update from shard X which had
newer information about shard Y. The code was calling apply_pwlc
for shard X but not for shard Y.
The fix simplifies the logic in update_peer_info - if we are
the primary update all peer_info's that have pwlc. If we
are a non-primary and there is pwlc then update info.
Bill Scales [Wed, 30 Jul 2025 11:44:10 +0000 (12:44 +0100)]
osd: Optimized EC don't apply pwlc for divergent writes
Split pwlc epoch into a separate variable so that we
can use epoch and version number when comparing if
last_update is within a pwlc range. This ensures that
pwlc is not applied to a shard that has a divergent
write, but still tracks the most recent update of pwlc.
Bill Scales [Wed, 30 Jul 2025 11:41:34 +0000 (12:41 +0100)]
osd: Optimized EC present_shards no longer needed
present_shards is no longer needed in the PG log entry, this has been
replaced with code in proc_master_log that calculates which shards were
in the last epoch started and are still present.
Bill Scales [Mon, 28 Jul 2025 08:26:36 +0000 (09:26 +0100)]
osd: Optimized EC proc_master_log fix roll-forward logic when shard is absent
Fix bug in optimized EC code where proc_master_log incorrectly did not
roll forward a write if one of the written shards is missing in the current
epoch and there is a stray version of that shard that did not receive the
write.
As long as the currently present shards that participated in les and were
updated by a write have the update then the write should be rolled-forward.
Bill Scales [Mon, 28 Jul 2025 08:21:54 +0000 (09:21 +0100)]
osd: Refactor find_best_info and choose_acting
Refactor find_best_info to have separate function to calculate
maxles and minlua. The refactor makes history_les_bound
optional, tidy up the choose_acting interface removing this
where it is not used.
Bill Scales [Thu, 17 Jul 2025 18:17:27 +0000 (19:17 +0100)]
osd: EC Optimizations proc_master_log boundary case bug fixes
Fix a couple of bugs in proc_master_log for optimized EC
pools dealing with boundary conditions such as an empty
log and merging two logs that diverge from the very first
entry.
Refactor the code to handle the boundary conditions and
neaten up the code.
Predicate the code block with if (pool.info.allows_ecoptimizations())
to make it clear this code path is only for optimized EC pools.
Jon Bailey [Fri, 25 Jul 2025 13:16:35 +0000 (14:16 +0100)]
osd: Invalidate stats during peering if we are rolling a shard forwards.
This change will mean we always recalculate stats upon rolling stats forwards. This prevent the situation where we end up with incorrect statistics due to where we always take the stats of the oldest shard during peering; causing outdated pg stats being applied for cases where the oldest shards are shards that don't see partial writes where num_bytes has changed on other places after that point on that shard.