Naman Munet [Mon, 7 Jul 2025 09:26:49 +0000 (14:56 +0530)]
mgr/dashboard: differentiate account users from rgw users in bucket form
fixes: https://tracker.ceph.com/issues/71523
commit includes:
1) Added a checkbox to select an account user, plus a dropdown listing the account users
2) Also fixed bucket replication, which was throwing an 'invalidBucketARN' error
osd/scrub: allow auto-repair on operator-initiated scrubs
Previously, operator-initiated scrubs would never auto-repair, regardless
of the value of the 'osd_scrub_auto_repair' config option. In practice this
was less confusing to the operator than it sounds, as most operator commands
would in fact initiate a regular periodic scrub. However, that quirk has since
been fixed: operator commands now trigger genuine 'op-initiated' scrubs. Thus
the need for this patch.
The original bug was fixed in https://github.com/ceph/ceph/pull/54615,
but was unfortunately re-introduced later on.
Fixes: https://tracker.ceph.com/issues/72178
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 97de817ad1c253ee1c7c9c9302981ad2435301b9)
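A minimal sketch of the restored behaviour, with hypothetical names (the real
gate lives in the OSD scrub scheduler):

    #include <cassert>

    // Illustrative only: 'auto_repair_cfg' mirrors the
    // 'osd_scrub_auto_repair' option; 'op_initiated' marks
    // operator-initiated scrubs.
    bool allow_auto_repair(bool auto_repair_cfg, bool op_initiated) {
        // Before the fix: return auto_repair_cfg && !op_initiated;
        (void)op_initiated;          // no longer disqualifies auto-repair
        return auto_repair_cfg;
    }

    int main() {
        assert(allow_auto_repair(true, /*op_initiated=*/true));
    }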
Patrick Donnelly [Fri, 27 Jun 2025 18:46:07 +0000 (14:46 -0400)]
mds: nudge log for unstable locks after early_reply
A getattr/lookup can cause a wrlock or xlock to become unstable after a request
(like rename) acquires it but before early reply. The MDS will not nudge the
log in this situation and the getattr/lookup will need to wait for the eventual
journal flush before the lock is released.
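A hedged sketch of the nudge, using illustrative types rather than the actual
MDS classes:

    #include <functional>
    #include <vector>

    struct Lock { bool stable; };  // stand-in for an MDS lock state

    // After an early reply, if the request left any lock unstable, flush
    // ("nudge") the journal so waiting getattr/lookup requests are not
    // stuck until the next periodic flush.
    void after_early_reply(const std::vector<Lock>& held,
                           const std::function<void()>& nudge_log) {
        for (const auto& l : held)
            if (!l.stable) { nudge_log(); return; }
    }

    int main() {
        bool nudged = false;
        after_early_reply({{true}, {false}}, [&] { nudged = true; });
        return nudged ? 0 : 1;
    }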
Patrick Donnelly [Fri, 27 Jun 2025 18:38:17 +0000 (14:38 -0400)]
mds: allow disabling batch ops
To address a current bug, and future ones, where batching lookup/getattr does
not help "kick" the MDS into switching state more quickly (e.g. flushing the
MDS journal).
Zac Dover [Wed, 11 Jun 2025 12:44:32 +0000 (22:44 +1000)]
doc/rados/ops: edit cache-tiering.rst
Add material to doc/rados/operations/cache-tiering.rst, as suggested by
Anthony D'Atri in
https://github.com/ceph/ceph/pull/63745#discussion_r2127887785.
Ville Ojamo [Wed, 30 Apr 2025 18:17:14 +0000 (01:17 +0700)]
doc/radosgw: Improve rgw-cache.rst
Try to improve the language by completely rewriting some sentences.
Attempt to format the document more like the rest of the docs.
Fix several errors in punctuation, capitalization, spacing, etc.
Use blocks with bash prompts for CLI commands instead of hardcoded
prompts.
Fix section hierarchy and section title underline lengths.
Use an admonition.
Soumya Koduri [Fri, 23 May 2025 21:39:50 +0000 (03:09 +0530)]
rgw/restore: Use strtoull to read sizes up to 2^64
Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Matt Benjamin <mbenjamin@redhat.com>
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit b3c867a121a7315b5a9e2d30d0af44c08676f8ca)
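The point of the change, in miniature: strtol-family parsers are signed and
clamp at 2^63 - 1, while strtoull covers the full unsigned 64-bit range:

    #include <cstdio>
    #include <cstdlib>

    int main() {
        const char* s = "18446744073709551615";  // 2^64 - 1
        unsigned long long size = std::strtoull(s, nullptr, 10);
        std::printf("%llu\n", size);  // prints 18446744073709551615;
                                      // strtol would clamp at LONG_MAX
    }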
Soumya Koduri [Fri, 23 May 2025 20:25:30 +0000 (01:55 +0530)]
rgw/cloud-restore: Handle failure with adding restore entry
In case adding a restore entry to the FIFO fails, reset the `restore_status`
of that object to "RestoreFailed" so that the restore process can be
retried by the end S3 user.
Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit 9974f51eb61603b8117d7b50e6b0b4614fcce721)
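A condensed sketch of the failure path (types and names are illustrative; the
real code lives in the rgw_restore* files):

    #include <string>

    enum class RestoreStatus { None, RestoreAlreadyInProgress, RestoreFailed };

    struct RestoreEntry { std::string bucket, object; };

    int push_to_fifo(const RestoreEntry&) { return -5; }  // stub: simulate EIO

    int queue_restore(const RestoreEntry& e, RestoreStatus& status) {
        int r = push_to_fifo(e);
        if (r < 0)
            status = RestoreStatus::RestoreFailed;  // lets the S3 user retry
        return r;
    }

    int main() {
        RestoreStatus s = RestoreStatus::None;
        return queue_restore({"b", "o"}, s) < 0 &&
               s == RestoreStatus::RestoreFailed ? 0 : 1;
    }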
rgw/cloud-restore: Support restoration of objects transitioned to Glacier/Tape endpoint
Restoration of objects from certain cloud services (like Glacier/Tape) could
take a significant amount of time (even days). Hence, store the state of such
restore requests and process them periodically.
Brief summary of changes
* Refactored existing restore code to consolidate and move all restore processing into rgw_restore* file/class
* RGWRestore class is defined to manage the restoration of objects.
* Lastly, for SAL_RADOS, FIFO is used to store and read restore entries.
Currently, this PR handles storing the state of restore requests sent to the
cloud-glacier tier type, which need async processing.
The changes are tested with AWS Glacier Flexible Retrieval with tier_type Expedited and Standard.
Reviewed-by: Matt Benjamin <mbenjamin@redhat.com>
Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Reviewed-by: Daniel Gryniewicz <dang@redhat.com>
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit ef96bb0d6137bacf45b9ee2f99ad5bcd8b3b6add)
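A rough sketch of the async flow (illustrative; per the summary above, in
SAL_RADOS the pending queue is a FIFO):

    #include <cstddef>
    #include <deque>
    #include <string>

    struct RestoreReq { std::string obj; };

    // Stub standing in for a poll of the cloud tier's restore status.
    bool cloud_restore_complete(const RestoreReq&) { return false; }

    // Periodic pass: re-check each persisted request; requeue the ones
    // Glacier/Tape has not finished yet (this can take days).
    void process_pending(std::deque<RestoreReq>& fifo) {
        for (std::size_t i = 0, n = fifo.size(); i < n; ++i) {
            RestoreReq req = fifo.front();
            fifo.pop_front();
            if (!cloud_restore_complete(req))
                fifo.push_back(req);
        }
    }

    int main() {
        std::deque<RestoreReq> fifo{{"obj1"}};
        process_pending(fifo);
        return fifo.size() == 1 ? 0 : 1;
    }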
qa/standalone/scrub: fix "scrubbed in 0ms" in osd-scrub-test.sh
The specific test looks for a 'last scrub duration' higher than
0 as a sign that the scrub actually ran. Previous code fixes
guaranteed that even a scrub duration as low as 1ms would be
reported as "1" (1s). However, none of the 15 objects created
in this test were designated for the tested PG, which remained
empty. As a result, the scrub duration was reported as "0".
The fix is to create a large enough number of objects so that
at least one of them is mapped to the tested PG.
Alex Ainscow [Fri, 6 Jun 2025 11:09:04 +0000 (12:09 +0100)]
osd: Optimised EC should avoid decodes off the end of objects.
This was a particular edge case whereby the need to perform both an encode and
a decode as part of recovery caused EC to attempt a decode off the end of the
shard, despite this being unnecessary.
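The essence of the guard, sketched with made-up names:

    #include <algorithm>
    #include <cstdint>

    struct Extent { uint64_t off, len; };

    // Clamp a decode request to the shard's real size so recovery never
    // decodes past the end of the object.
    Extent clamp_to_shard(Extent want, uint64_t shard_size) {
        uint64_t end = std::min(want.off + want.len, shard_size);
        return { want.off, end > want.off ? end - want.off : 0 };
    }

    int main() {
        Extent e = clamp_to_shard({8192, 8192}, 12288);
        return e.len == 4096 ? 0 : 1;
    }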
Alex Ainscow [Fri, 23 May 2025 08:59:31 +0000 (09:59 +0100)]
osd: During recovery, pass "for_recovery" when attempting re-reads
get_all_remaining_reads() is used for two purposes:
1. If a read unexpectedly fails, recover from other shards.
2. If a shard is missing, but is allowed to be missing (typically
due to unequal shard sizes), we rely on this function to return
no new reads, without an error.
The test failure we saw was case (2), but I think case (1) is important too.
Most of the time we probably would not notice, but if insufficient redundancy
exists and for_recovery is not set, this will result in recovery failing.
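In miniature, the distinction the flag makes (hypothetical helper):

    #include <set>

    // A shard that is absent is an error for a normal re-read, but is
    // acceptable during recovery (shards can legitimately differ in size).
    bool missing_shard_is_error(int shard, const std::set<int>& present,
                                bool for_recovery) {
        if (present.count(shard)) return false;
        return !for_recovery;
    }

    int main() {
        return missing_shard_is_error(3, {0, 1, 2}, /*for_recovery=*/true)
                   ? 1 : 0;
    }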
Alex Ainscow [Fri, 2 May 2025 09:11:45 +0000 (10:11 +0100)]
osd: Improve backfill in new EC.
In old EC, the full stripe was always read and written. In new EC, we only attempt
to recover the shards that were missing. If an old OSD is available, the read can
be directed there.
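A sketch of the source-selection idea (names invented; details simplified):

    // With new EC, only the missing shards are rebuilt. If an old-style
    // OSD still holds a copy of a missing shard, read it directly rather
    // than decoding it from the other shards.
    enum class Source { direct_old_osd, decode_from_peers };

    Source pick_source(bool old_osd_has_shard) {
        return old_osd_has_shard ? Source::direct_old_osd
                                 : Source::decode_from_peers;
    }

    int main() {
        return pick_source(true) == Source::direct_old_osd ? 0 : 1;
    }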
Alex Ainscow [Thu, 8 May 2025 15:14:03 +0000 (16:14 +0100)]
osd: Cope with empty reads from an OSD without panic.
If a ReadOp from EC contains two objects, where one object reads from only a
single shard but other objects require other shards, then this bug can be hit.
The fix should make the issue clear.
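A compressed illustration of the shape of the bug (illustrative containers,
not the real ECCommon types):

    #include <map>
    #include <string>
    #include <vector>

    // obj -> (shard -> bytes read); one ReadOp can span several objects.
    using ReadResults =
        std::map<std::string, std::map<int, std::vector<char>>>;

    // An object that needed only a single shard can come back with no
    // data for the other shards; treat that as "nothing to do" rather
    // than an inconsistency.
    void finish_reads(const ReadResults& results) {
        for (const auto& [oid, per_shard] : results) {
            (void)oid;
            if (per_shard.empty())
                continue;  // legitimately empty read for this object
            // ... decode / deliver per_shard here ...
        }
    }

    int main() {
        finish_reads({{"a", {}}, {"b", {{0, {'x'}}}}});
    }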
Alex Ainscow [Thu, 8 May 2025 13:22:36 +0000 (14:22 +0100)]
osd: Recover non-primary shards with the correct version.
Scrub revealed a bug whereby the non-primary shards were being given a
version number in the OI which did not match the expected version in
the authoritative OI.
A secondary issue is that all attributes were being pushed to the non-primary
shards, whereas only the OI is actually needed.
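Sketched with simplified types; "_" is the xattr key under which the object
info (OI) is stored:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using AttrMap = std::map<std::string, std::vector<uint8_t>>;

    // When pushing to a non-primary shard, send only the OI attribute
    // (carrying the authoritative version), not the whole attribute set.
    AttrMap attrs_for_push(const AttrMap& all, bool non_primary_shard) {
        if (!non_primary_shard)
            return all;
        AttrMap oi_only;
        if (auto it = all.find("_"); it != all.end())
            oi_only.insert(*it);
        return oi_only;
    }

    int main() {
        AttrMap all{{"_", {1}}, {"snapset", {2}}};
        return attrs_for_push(all, true).size() == 1 ? 0 : 1;
    }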
Alex Ainscow [Wed, 7 May 2025 09:33:24 +0000 (10:33 +0100)]
osd: Do not do a read-modify-write if op.delete_first is set
Some client OPs are able to generate transactions which delete an
object and then write it again. This is used by the copy-from ops.
If such a write is not 4k aligned, then the new EC code was incorrectly doing
a read-modify-write on the misaligned 4k region. This caused some garbage to
be written to the backend OSD, off the end of the object. This is only a
problem if the object is later extended without the end being written.
The problematic sequence is:
1. Create two objects (A and B) of size X and Y where:
X > Y, (Y % 4096) != 0
2. copy_from OP B -> A
3. Extend B without writing offset Y+1
This will result in a corrupt data buffer at Y+1 without this fix.
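The guard, reduced to arithmetic (EC_ALIGN is the 4k write alignment assumed
here):

    #include <cstdint>

    constexpr uint64_t EC_ALIGN = 4096;

    // If the transaction deletes the object first there is no old data to
    // merge with, so a misaligned tail must be zero-padded, not read back.
    bool needs_rmw(uint64_t write_end, bool delete_first) {
        if (delete_first)
            return false;                    // nothing on disk to read
        return (write_end % EC_ALIGN) != 0;  // misaligned tail needs RMW
    }

    int main() {
        return needs_rmw(6000, /*delete_first=*/true) ? 1 : 0;
    }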
Alex Ainscow [Thu, 1 May 2025 09:09:15 +0000 (10:09 +0100)]
osd: Fix off-by-one error in shard_extent_map.
Inserting the first parity buffer was causing the ro-range within the SEM to
be incorrectly calculated. A simple fix; I have added some unit tests to
defend against this error in the future.
Alex Ainscow [Fri, 25 Apr 2025 18:15:34 +0000 (19:15 +0100)]
osd: Minor performance improvement in ECUtil.cc
Code changes to prevent creating and then erasing an empty shard.
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit c2d6414f659b123fa7060442bff7a90a7ceeb7c0)
Alex Ainscow [Thu, 24 Apr 2025 13:02:41 +0000 (14:02 +0100)]
osd: clone + delete ops should invalidate source new EC extent cache.
The op.is_delete() function only returns true if the op is not ALSO
doing something else (e.g. a clone). This causes issues with clearing
the new EC extent cache.
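The distinction, reduced to a predicate (field names invented for
illustration):

    struct Op { bool deletes_object = false; bool also_clones = false; };

    // op.is_delete() is false when the op also clones, so the extent cache
    // must be invalidated whenever the source object's data is removed,
    // not only when the op is *purely* a delete.
    bool must_invalidate_extent_cache(const Op& op) {
        // buggy form: return op.deletes_object && !op.also_clones;
        return op.deletes_object;
    }

    int main() {
        Op clone_and_delete{true, true};
        return must_invalidate_extent_cache(clone_and_delete) ? 0 : 1;
    }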
Alex Ainscow [Wed, 23 Apr 2025 14:41:11 +0000 (15:41 +0100)]
osd: Fix parity updates in truncates.
Previously in optimised EC, when truncating to a partial
stripe, the parity was not being updated. This fix reads
the non-truncated data from the final stripe and calculates
parity updates, which are written to the parity shards.
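The stripe arithmetic behind the fix, with assumed chunk count and size:

    #include <cstdint>

    constexpr uint64_t CHUNK = 4096;  // chunk size (assumed)
    constexpr uint64_t K = 4;         // data shards (assumed)

    // After truncating into a partial stripe, the surviving bytes of the
    // final stripe still feed into parity, so they are read back and the
    // parity chunks recomputed and rewritten.
    void final_stripe_range(uint64_t new_size, uint64_t& off, uint64_t& len) {
        const uint64_t stripe = CHUNK * K;        // 16 KiB per stripe here
        off = (new_size / stripe) * stripe;       // start of final stripe
        len = new_size - off;                     // bytes to re-encode
    }

    int main() {
        uint64_t off = 0, len = 0;
        final_stripe_range(20000, off, len);      // off=16384, len=3616
        return off == 16384 && len == 3616 ? 0 : 1;
    }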
Alex Ainscow [Tue, 22 Apr 2025 12:41:19 +0000 (13:41 +0100)]
osd: Fix EC cache invalidation bug
With optimised EC, there were two bugs with cache invalidation:
1. If two invalidates were in the queue, it's possible the second
invalidate might be cleared by the first.
2. Reads were being requested even when the size was being reduced.
Also added a few debug improvements and some new asserts.
Alex Ainscow [Wed, 9 Apr 2025 12:49:49 +0000 (13:49 +0100)]
osd: Make EC alignment independent of page size.
Code which manipulates full pages is often faster. To exploit this,
optimised EC was written to use 4k alignment wherever possible.
When inputs are not aligned, they are quickly aligned to 4k.
Not all architectures use 4k page sizes. Some POWER architectures, for
example, have a 64k page size. In such situations it is unlikely that
using 64k page alignment would provide any performance boost; indeed it
is likely to hurt performance significantly. As such, EC has been changed
to maintain its own internal alignment (4k), which can be configured.
This has the added advantage that we can potentially tweak this value in
the future.
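In effect, the alignment constant is decoupled from the platform page size; a
minimal sketch:

    #include <cstdint>

    // Was effectively tied to the page size (64k on some POWER machines);
    // now an EC-internal, configurable value.
    constexpr uint64_t EC_ALIGN = 4096;

    constexpr uint64_t align_up(uint64_t v) {
        return (v + EC_ALIGN - 1) & ~(EC_ALIGN - 1);
    }

    static_assert(align_up(1) == 4096);
    static_assert(align_up(8192) == 8192);

    int main() {}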
Alex Ainscow [Wed, 16 Apr 2025 09:41:48 +0000 (10:41 +0100)]
osd: Fix Truncates in Optimised EC
The previous truncate code attempted to perform a non-aligned truncate by
creating a zero buffer at the end of the object and writing it out.
The new code first truncates to the exact size of the user object, then grows
the object to the required 4k alignment. This simpler arrangement also
simplifies the rollback.
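The new order of operations, sketched on a byte vector standing in for the
object:

    #include <cstdint>
    #include <vector>

    constexpr uint64_t EC_ALIGN = 4096;

    // Truncate to the exact user size first, then grow with zeroes to the
    // 4k boundary (the old code instead wrote a zero buffer at the tail).
    void truncate_aligned(std::vector<uint8_t>& obj, uint64_t new_size) {
        obj.resize(new_size);                              // exact truncate
        uint64_t aligned = (new_size + EC_ALIGN - 1) & ~(EC_ALIGN - 1);
        obj.resize(aligned, 0);                            // grow, zero-fill
    }

    int main() {
        std::vector<uint8_t> obj(10000, 0xff);
        truncate_aligned(obj, 6000);
        return obj.size() == 8192 && obj[6000] == 0 ? 0 : 1;
    }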
Bill Scales [Fri, 6 Jun 2025 12:28:14 +0000 (13:28 +0100)]
osd: EC optimizations fix bug when recovering only partial write objects
PGLog::reset_complete_to does not handle the scenario where, for every
missing object, the most recent update is a partial write that skips the
shard being recovered. In this scenario the oldest need is newer than the
newest log entry. Setting last_complete to the head of the log confuses
other code into thinking that recovery has completed.
The fix is to hold last_complete one entry behind the head of the log
until all missing objects have been recovered.
PGLog::recover_got already does this when an object is recovered and the
remaining objects to recover match this scenario, so this fix just makes
reset_complete_to behave the same way as recover_got.
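The invariant, reduced to a one-liner (plain integers stand in for
eversion_t):

    #include <cstdint>

    // While any objects are still missing, last_complete is pinned one
    // entry behind the log head; only an empty missing set lets it reach
    // the head. This mirrors what PGLog::recover_got already does.
    uint64_t clamped_last_complete(uint64_t head, uint64_t prev_entry,
                                   bool missing_empty) {
        return missing_empty ? head : prev_entry;
    }

    int main() {
        return clamped_last_complete(100, 99, false) == 99 ? 0 : 1;
    }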