Alex Ainscow [Wed, 18 Jun 2025 19:46:49 +0000 (20:46 +0100)]
osd: Multiple Decode fixes.
Fix 1:
These are multiple fixes that affected the same code. To simplify review
and understanding of the changes, they have been merged into a single
commit.
What happened in the defect (k=6, m=4):
1. State is: fast_reads = true, shards 0,4,5,6,7,8 are available. Shard 1 is missing this object.
2. Shard 5 only needs zeros, so its read is dropped. The other sub-read messages are sent.
3. The object on shard 1 completes recovery (so becomes not-missing).
4. The read completes; the completion logic notices that it only has 5 reads, so it calculates what it needs to re-read.
5. It calculates that it needs 0,1,4,5,6,7 - and so wants to read shard 1.
6. The code assumes that enough reads should already have been performed, so it refuses to issue another read and instead generates an EIO.
The problem here is some "lazy" code in step (4). What it should be doing is
working out that it can use the zero buffers and not calling
get_remaining_reads(). Instead, it calls get_remaining_reads() and, if there is
no work to do, assumes it already has everything and completes the read with
success. This assumption mostly works - but the combination of fast_reads
picking fewer than k shards to read from AND an object completing recovery in
parallel causes problems.
The solution is to wait for all reads to complete and then treat any remaining
zero buffers as completed reads. This should then cause the plugin to declare
success.
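As a rough sketch of the intended rule (the types and names below are
hypothetical, not the actual ECBackend interfaces):

    #include <cstddef>
    #include <set>

    using shard_id_t = int;

    struct read_state_t {
      std::set<shard_id_t> outstanding;  // sub-reads still in flight
      std::set<shard_id_t> completed;    // sub-reads that returned data
      std::set<shard_id_t> zero_only;    // reads dropped because the
                                         // shard decodes to zeros
    };

    // Only once every outstanding read has completed do we count the
    // zero buffers as completed reads; if that reaches k shards, the
    // decode can succeed without calling get_remaining_reads() at all.
    bool can_complete(const read_state_t& rs, std::size_t k) {
      if (!rs.outstanding.empty()) {
        return false;  // wait for all reads before deciding anything
      }
      return rs.completed.size() + rs.zero_only.size() >= k;
    }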
Fix 2:
There are decodes in new EC which can occur when fewer than k
shards have been read. These are reads in the last stripe, where,
for decoding purposes, the data past the end of the shard can
be considered zeros. EC does not read these, but instead relies
on the decode function inventing the zero buffers.
This was working correctly when fast reads were turned off, but
if such an IO was encountered with fast reads turned on, the
logic was disabled and the IO returned an EIO.
This commit fixes that logic, so that if all reads have completed
and send_all_remaining_reads conveys that no new reads were
requested, decode will still be possible.
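The zero-buffer invention for the tail stripe amounts to something like
the following (a minimal sketch with made-up names, not the real decode
path):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Extend a short shard buffer to the full stripe-unit length with
    // zeros, so the plugin can decode the last stripe even though the
    // bytes past logical end-of-shard were never read from disk.
    std::vector<uint8_t> zero_extend(std::vector<uint8_t> buf,
                                     std::size_t stripe_unit) {
      if (buf.size() < stripe_unit) {
        buf.resize(stripe_unit, 0);  // bytes past shard end are zeros
      }
      return buf;
    }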
Fix 3:
When reading the end of an object whose shards are unequally sized,
we pad out the end of the decode with zeros, to provide
the correct data to the plugin.
Previously, the code decided not to add the zeros to "needed"
shards. This caused a problem where, for some parity-only
decodes, an incomplete set of zeros was generated, fooling the
_decode function into thinking that the entire shard was zeros.
The fix has to cope with the case where the only data
needed from the shard is the padding itself.
The comments around the new code describe the logic behind
the change.
This makes the encode-specific use case of padding out the
to-be-decoded shards unnecessary, as this is performed by the
pad_on_shards function below.
Also fixed some logic where the need_set being passed to the decode
function did not contain the extra shards needed for the decode.
This need_set is actually ignored by all the plugins as far as I
know, but leaving it wrong does not seem helpful if it is needed
in the future.
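The padding-only case can be pictured like this (a sketch with invented
names; the real logic lives around pad_on_shards):

    #include <algorithm>
    #include <cstddef>

    struct extent_t { std::size_t off, len; };

    // How many of the requested bytes must be supplied as zero padding
    // because they lie past the end of this (shorter) shard. Note the
    // padding-only case: the whole extent may sit past shard_size, and
    // we must still generate the zeros (and keep the shard "needed")
    // rather than dropping it.
    std::size_t padding_bytes(const extent_t& want, std::size_t shard_size) {
      std::size_t end = want.off + want.len;
      if (end <= shard_size) return 0;              // no padding needed
      return end - std::max(want.off, shard_size);  // may equal want.len
    }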
Fix 4: Extend reads when recovering parity
Here is an example use case which was going wrong:
1. Start with 3+2 EC; shards 0,3,4 are 8k, shards 1,2 are 4k.
2. Perform a recovery where we recover 2 and 4. 2 is missing; 4 can be copied from another OSD.
3. Recovery works out that it can do the whole recovery with shards 0,1,3 (note: not 4).
4. So the "need" set is 0,1,3, the "want" set is 2,4 and the "have" set is 0,1,3,4.
5. The logic in get_all_avail_shards then tries to work out the extents it needs - it only looks at 2, because we "have" 4.
6. The result is that we end up reading 4k on 0,1,3, then attempting to recover 8k on shard 4 from this... which clearly does not work.
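Put differently, the read length has to be driven by every shard being
rebuilt, including ones we already "have". A toy illustration
(hypothetical structures, not the real get_all_avail_shards):

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <set>

    using shard_id_t = int;

    // Read enough from each "need" shard to rebuild the largest shard
    // in "want" -- even a want shard we already have a copy of (shard 4
    // in the example) dictates the read length, since 8k of parity
    // cannot be rebuilt from 4k reads.
    std::size_t read_length(
        const std::set<shard_id_t>& want,
        const std::map<shard_id_t, std::size_t>& shard_sizes) {
      std::size_t len = 0;
      for (shard_id_t s : want) {
        auto it = shard_sizes.find(s);
        if (it != shard_sizes.end()) {
          len = std::max(len, it->second);
        }
      }
      return len;  // 8k in the example above, not 4k
    }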
Fix 5: Round up padding to 4k alignment in EC
pad_on_shards was not aligning to 4k, but the decode/encode functions were.
This meant that we assigned a new buffer and then had to add another one
afterwards; aligning the padding to 4k up front avoids the second buffer,
which should be faster.
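The alignment itself is the usual power-of-two round-up; a sketch (the
constant name is mine, not Ceph's):

    #include <cstddef>

    constexpr std::size_t EC_ALIGN = 4096;  // 4k

    // Round a padding length up to the next 4k boundary so the buffer
    // is already aligned when it reaches the encode/decode functions.
    constexpr std::size_t align_up_4k(std::size_t len) {
      return (len + EC_ALIGN - 1) & ~(EC_ALIGN - 1);
    }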
Fix 6: Do not invent encode buffers before doing decode.
In this bug, during recovery, we could potentially be creating
unwanted encode buffers and using them to decode data buffers.
This fix simply removes the bad code, as there is new code above
which is already doing the correct action.
Fix 7: Fix miscompare with missing decodes.
In this case, two OSDs failed at once. One was replaced and the other was not.
This caused us to attempt to encode a missing shard while another shard was
missing, leading to a miscompare because the recovery failed to do the decode
properly before doing the encode.
Bill Scales [Wed, 18 Jun 2025 11:11:51 +0000 (12:11 +0100)]
osd: Optimized EC pools bug fix when repeating GetLog
When the primary shard of an optimized EC pool does not have
a copy of the log it may need to repeat the GetLog peering
step twice, the first time to get a full copy of a log from
a shard that sees all log entries and then a second time
to get the "best" log from a nonprimary shard which may
have a partial log due to partial writes.
A side effect of repeating GetLog is that the missing
list is collected for both the "best" shard and the
shard that provides a full copy of the log. This latter
missing list confuses later steps in the peering
process and may cause this shard to complete writes
and end up diverging from the primary. Discarding
this missing list causes Peering to behave the same as if
the GetLog step did not need to be repeated.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Alex Ainscow [Wed, 11 Jun 2025 15:30:40 +0000 (16:30 +0100)]
osd: Fix attribute recovery in rare recovery scenario
When recovering attributes, we read them from the first potential primary, then
if that read fails, we attempt to read from another potential primary.
The problem is that the code which calculates which shards to read for a recovery
only takes into account *data* and not where the attributes are. As such, if the
second read only required a non-primary, then the attribute read fails and the
OSD panics.
The fix is to detect this scenario and perform an empty read to that shard, which
the attribute-read code can use for attribute reads.
The code was incorrectly interpreting a failed attribute read on recovery as
meaning a "fast_read". Also, no attribute recovery would occur in this case.
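Conceptually the fix looks something like this (a sketch; all names here
are invented, not the real recovery code):

    #include <set>

    using shard_id_t = int;

    struct recovery_reads_t {
      std::set<shard_id_t> data_reads;  // shards chosen for data
    };

    // If none of the shards we are reading data from can serve the
    // attribute read, add an empty read to one that can, purely so the
    // attribute-read code has a shard to fetch the attrs from.
    void ensure_attr_source(recovery_reads_t& reads,
                            const std::set<shard_id_t>& attr_capable) {
      for (shard_id_t s : reads.data_reads) {
        if (attr_capable.count(s)) return;  // already covered
      }
      if (!attr_capable.empty()) {
        reads.data_reads.insert(*attr_capable.begin());  // empty read
      }
    }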
Bill Scales [Wed, 11 Jun 2025 14:53:48 +0000 (15:53 +0100)]
osd: EC Optimizations fix bugs in applying pwlc to update info and log
1. Refactor the code that applies pwlc to update info and log so that there
is one function rather than multiple copies of the code.
2. pwlc information is used to track shards that have not been updated by
partial writes. It is used to advance last_complete (and last_update and
the log head) to account for log entries that the shard missed. It was
only being applied if last_complete matched the range of partial writes
recorded in pwlc. When a shard has missing objects, last_complete is
deliberately held before the oldest need, and this stops pwlc being
applied. This is wrong - pwlc can still try to update last_update and
the log head even if it cannot update last_complete.
3. When the primary receives info (and pwlc) information from OSD x(y)
it uses the pwlc information to update x(y)'s info. During backfill
there may be other shards z(y) which should also be updated using the
pwlc information.
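A very simplified view of the decoupling described in (2) (all fields and
semantics here are loose stand-ins for the real peering structures; a
sketch, not the actual change):

    #include <cstdint>

    using eversion_t = uint64_t;  // stand-in for Ceph's eversion_t

    struct shard_info_t {
      eversion_t last_update;
      eversion_t last_complete;  // may be pinned by missing objects
    };

    // pwlc records that this shard missed nothing in (start, end].
    void apply_pwlc(shard_info_t& info, eversion_t start, eversion_t end,
                    bool held_back_by_needs) {
      // last_update (and the log head) may always be rolled forward...
      if (info.last_update >= start && info.last_update < end) {
        info.last_update = end;
      }
      // ...while last_complete stays pinned before the oldest need.
      if (!held_back_by_needs &&
          info.last_complete >= start && info.last_complete < end) {
        info.last_complete = end;
      }
    }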
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
In the EnableRequest state machine, clean up the handling of the async
request to fetch the mirror image, particularly when a non-primary image
is being created by the rbd-mirror daemon.
Edit the section "Data Pool Damage" in doc/cephfs/disaster-recovery.rst.
This commit is part of the project of improving the data-recovery parts
of the CephFS documentation, as requested in the Ceph Power Users
Feedback Summary in mid-2025.
auth: remove unused AuthTicket::renew_after member variable
The AuthTicket::renew_after field is only set in init_timestamps() and
read by dump() for debugging purposes. It has no functional use cases
and causes encoding/decoding inconsistencies.
During decoding, this field remains unchanged, creating discrepancies
between original and decoded values. This issue was masked because
check-generated.sh and readable.sh reused struct instances, preserving
stale field values across encode/decode cycles.
An upcoming change will allocate fresh instances for each decode
operation, which would expose these inconsistent values.
Remove the unused field to eliminate the encoding inconsistency and
simplify the codebase.
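The class of bug is easy to reproduce in isolation. A generic sketch (not
Ceph's actual encoding macros): a field that is populated at init time but
never encoded simply keeps whatever value the destination instance already
had.

    #include <cstdint>
    #include <cstring>

    struct ticket_t {
      uint64_t expires = 0;
      uint64_t renew_after = 0;  // set at init, but never encoded

      void encode(uint8_t* out) const {
        std::memcpy(out, &expires, sizeof(expires));  // renew_after omitted
      }
      void decode(const uint8_t* in) {
        std::memcpy(&expires, in, sizeof(expires));   // renew_after untouched
      }
    };

    // A fresh instance decodes with renew_after == 0 while the original
    // held a real timestamp -- exactly the original-vs-decoded
    // discrepancy that reused struct instances in the tests masked.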
Alex Ainscow [Mon, 21 Jul 2025 07:17:57 +0000 (08:17 +0100)]
osd: Replace deprecated std::aligned_storage_t with alignas
C++23 has been enabled, causing deprecated warnings. Following the
"possible implementation" in the C++ docs, I have replaced the last
remaining aligned_storage_t.
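The replacement follows the pattern suggested alongside the deprecated
trait: swap std::aligned_storage_t<Len, Align> for a byte array carrying
the alignment via alignas. A generic sketch (not the actual Ceph code):

    #include <cstddef>
    #include <new>
    #include <utility>

    template <typename T>
    class static_slot {
      // Before (deprecated in C++23):
      //   std::aligned_storage_t<sizeof(T), alignof(T)> storage_;
      // After: a plain byte array, aligned via alignas.
      alignas(T) std::byte storage_[sizeof(T)];
    public:
      template <typename... Args>
      T* construct(Args&&... args) {
        return ::new (static_cast<void*>(storage_))
            T(std::forward<Args>(args)...);
      }
      void destroy(T* p) { p->~T(); }
    };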
* refs/pull/63214/head:
release note: add a note that "subvolume info" cmd output can also...
doc/cephfs: update docs since "subvolume info" cmd output can also...
qa/cephfs: add test to check clone source info's present in...
mgr/vol: show clone source info in "subvolume info" cmd output
mgr/vol: keep clone source info even after cloning is finished
* refs/pull/58564/head:
client: clamp sizes to INT_MAX in sync i/o code paths
client: restrict bufferlist to total write size
src/test: test sync/async i/o code paths with huge (4GiB) buffers
- for arm64, make check is hitting the following (Use `node --trace-warnings ...` to show where the warning was created):
  NX Cannot find module @rollup/rollup-linux-arm64-gnu. npm has a bug related to optional dependencies (https://github.com/npm/cli/issues/4828). Please try `npm i` again after removing both package-lock.json and node_modules directory.
  Pass --verbose to see the stacktrace.
- due to this, make check is failing on arm64
- added the fix as per https://github.com/vitejs/vite/discussions/15532#discussioncomment-13369584
- it then fails with:
  NX Falling back to ts-node for local typescript execution. This may be a little slower.
  NX Cannot find module '@rspack/binding-linux-arm64-gnu'
- the above fix failed, asking for more deps from rollup, so the whole rollup package was added
this suite hasn't provided much benefit since it was added, and has
recently become more of a maintenance burden:
* https://tracker.ceph.com/issues/71584
* https://tracker.ceph.com/issues/72179
osd/scrub: allow auto-repair on operator-initiated scrubs
Previously, operator-initiated scrubs would not auto-repair, regardless
of the value of the 'osd_scrub_auto_repair' config option. This was
less confusing to the operator than it could have been, as most
operator commands would in fact cause a regular periodic scrub
to be initiated. However, that quirk is now fixed: operator commands
now trigger 'op-initiated' scrubs. Thus the need for this patch.
The original bug was fixed in https://github.com/ceph/ceph/pull/54615,
but was unfortunately re-introduced later on.
Fixes: https://tracker.ceph.com/issues/72178
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Kushal Deb [Wed, 21 May 2025 10:01:06 +0000 (15:31 +0530)]
improve error handling and add --resume flag to 'ceph rgw realm bootstrap' for partial recovery
This patch enhances the `ceph rgw realm bootstrap` command by improving error
messaging and introducing a `--resume` flag to support recovery from partial
bootstrap failures.
librbd: images aren't closed in group_snap_*_by_record() on error
Fixes a memory leak and handles a resource leak scenario when at least one
IoCtx is not created successfully. This is done by returning an error before
opening any image.
Changes are made in group_snap_remove_by_record and group_snap_rollback_by_record.
John Mulligan [Wed, 16 Jul 2025 18:07:33 +0000 (14:07 -0400)]
build-integration-branch: allow setting git trailer on final commit
After the last commit is made, provide a simple mechanism for adding
git trailers to the commit message. Git trailers [1] are metadata
that tools may make use of. In particular, we add a few of the
trailers documented by ceph-build here [2], as well as allowing
arbitrary trailers for future changes (before this code can be
updated), advanced trailer use, or other unrelated purposes.
John Mulligan [Wed, 16 Jul 2025 17:39:27 +0000 (13:39 -0400)]
build-integration-branch: convert to argparse
Convert build-integration-branch to use the stdlib argparse module.
Argparse is:
* Part of the python standard library and available since 3.2
* Well documented as a stdlib component
* Widely used
* Fairly simple and direct
docopt is:
* Clever
* Not documented as a dependency of this script (so I bet most users
are relying on the fallback behavior)
* Of questionable maintenance status with:
- No releases since 2014
- Only four PRs merged since 2019
- Last merged PR recommended an alternate repo and then disappeared
  from the commit history of the master branch, indicating a possible
  maintainership/status discrepancy
- Only a couple of commits merged since 2018 (visible on github)
* In my opinion: not particularly ergonomic esp. wrt dictionary based
key access
I feel pretty comfortable making this conversion as I think it will
make the script easier to maintain and extend.
Signed-off-by: John Mulligan <jmulligan@redhat.com>