Jon Bailey [Fri, 25 Jul 2025 13:16:35 +0000 (14:16 +0100)]
osd: Invalidate stats during peering if we are rolling a shard forwards.
This change means we always recalculate stats when rolling a shard forwards. It prevents the situation where we end up with incorrect statistics because peering always takes the stats of the oldest shard: when the oldest shards are ones that did not see partial writes, and num_bytes has since changed elsewhere, outdated pg stats were being applied.
mgr/dashboard: Enable the rgw module automatically in the primary and secondary clusters if not enabled during multi-site automation
1. Enable the rgw module automatically in the primary and secondary clusters if not enabled during multi-site automation
2. Improve progress bar descriptions and add sub-descriptions for steps
libcephfs_proxy: fix userperm pointer decoding for older protocols
The random data used to decode pointers coming from the old protocol was
taken from the client instead of from the global_random data, which is the
correct source.
libcephfs_proxy: remove unnecessary protocol references in daemon
With the new protocol structure definitions, it's not necessary to
explicitly access each field inside its version substructure (v0, for
example). Now all fields of the latest version are declared inside an
anonymous substructure that can be accessed without a prefix.
libcephfs_proxy: remove unnecessary protocol references in client
With the new protocol structure definitions, it's not necessary to
explicitly access each field inside its version substructure (v0, for
example). Now all fields of the latest version are declared inside an
anonymous substructure that can be accessed without a prefix.
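Both commits rely on the same pattern; a few-line sketch with hypothetical field names (not the actual libcephfs_proxy definitions) shows it. The latest version's fields sit in an anonymous substructure inside a union, so they are accessed without a prefix, while the old layout stays addressable as `v0`. Anonymous structs are standard C11 and a widely supported GCC/Clang extension in C++.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical protocol structure, invented for illustration.
struct proxy_request {
    uint32_t version;
    union {
        struct {             // v0: original layout, still addressable for old peers
            uint64_t userperm;
            uint32_t flags;
        } v0;
        struct {             // latest version: anonymous, fields need no prefix
            uint64_t userperm;
            uint32_t flags;
            uint32_t mode;   // field added by the newer protocol version
        };
    };
};

int main() {
    proxy_request req{};
    req.version = 1;
    req.userperm = 42;       // no version prefix thanks to the anonymous struct
    req.mode = 0755;
    std::printf("userperm=%llu mode=%o\n",
                (unsigned long long)req.userperm, (unsigned)req.mode);
}
```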
libcephfs_proxy: fix protocol structures for backward compatibility
The structures used for transferring data between the proxy client and
the proxy daemon had been reworked in a recent change to be able to
expand the protocol. This caused an inconsistency in the size of the
data transferred when communicating with a peer using the older version.
The result was that the peer receiving the data with an unexpected size
was closing the connection, causing unexpected errors.
The discrepancy in size is the result of how compilers pad structures
combined with the change in the structure layout introduced when
extending the protocol. After those changes, the computation of the size
of each version of the structures was no longer done correctly.
This change makes the layout equal to the older version, so that
computing the size of the structures becomes easier and doesn't depend
on unexpected paddings.
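A self-contained illustration of the padding problem (layouts invented for the example, not the real proxy structures): compilers insert padding to satisfy alignment, so reworking how versioned fields are nested can change `sizeof` even when the field list looks identical. Pinning sizes and offsets with `static_assert` catches such regressions at compile time rather than as closed connections.

```cpp
#include <cstddef>
#include <cstdint>

// Old on-wire layout: a 4-byte field followed by an 8-byte field means the
// compiler inserts 4 bytes of padding before 'ptr', giving sizeof == 16.
struct wire_v0 {
    uint32_t op;
    uint64_t ptr;
};

// A rework that nests the v0 fields differently can move where padding
// lands, silently changing the byte count a peer on the old protocol
// expects, so it closes the connection on the size mismatch.
struct wire_reworked {
    struct {
        uint32_t op;
        uint64_t ptr;
    } v0;
    uint32_t extra;          // new field added when extending the protocol
};

// Pin the layout so an accidental size change fails the build instead of
// breaking old peers at runtime.
static_assert(sizeof(wire_v0) == 16, "v0 wire size must not change");
static_assert(offsetof(wire_v0, ptr) == 8, "v0 field offsets must not change");

int main() {}
```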
This refactors redundant device setup calls in the LvmBlueStore class:
Calling the same function twice with different arguments for the WAL
and DB devices was inefficient and unnecessary.
The new implementation simplifies the logic by directly accessing
`self.args`, removing the need to pass arguments manually.
ftbfs:
```
/home/jenkins-build/build/workspace/ceph-pull-requests/src/neorados/cls/fifo/detail/fifo.h:630:14: error: no member named 'parse' in namespace 'ceph'; did you mean 'pause'?
630 | auto n = ceph::parse<decltype(m.num)>(num);
| ^~~~~~~~~~~
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit 3df1133c17b174c27c250cf7ac018199cc40b15b)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
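For context, `ceph::parse` is a checked string-to-number helper; a rough, self-contained approximation (not the actual Ceph implementation) built on `std::from_chars` looks like this:

```cpp
#include <charconv>
#include <cstdio>
#include <optional>
#include <string_view>

// Parse the *entire* string into an integral type; nullopt on any failure.
template <typename T>
std::optional<T> parse_num(std::string_view s) {
    T value{};
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    if (ec != std::errc{} || ptr != s.data() + s.size())
        return std::nullopt;
    return value;
}

int main() {
    if (auto n = parse_num<unsigned>("630"))
        std::printf("parsed %u\n", *n);
}
```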
Adam C. Emerson [Mon, 30 Jun 2025 20:54:46 +0000 (16:54 -0400)]
rgw/datalog: Manage and shut down tasks properly
This is slightly ugly but good enough for now. Make sure we can block
when shutting down background tasks.
Remove a few `driver` parameters that are unused. This lets us
simplify the IAM Policy and Lua tests and not construct stores we
never use. (Which is good since we aren't running them under a cluster.)
Adam C. Emerson [Fri, 11 Jul 2025 18:57:02 +0000 (14:57 -0400)]
neorados/fifo: Rewrite as proper I/O object
Split nominal handle object and reference-counted
implementation. While we're at it, add lazy-open functionality.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 3097297dd39432d172d69454419fa83a908075f6)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
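The handle/implementation split follows the usual Asio I/O-object shape; here is a minimal sketch with invented names (not the actual neorados FIFO interface): a cheap, copyable handle forwards to a reference-counted implementation, and the open is deferred until first use.

```cpp
#include <iostream>
#include <memory>
#include <string>

// Reference-counted implementation: owns the real state and does the work.
class fifo_impl {
    std::string name_;
    bool opened_ = false;
public:
    explicit fifo_impl(std::string name) : name_(std::move(name)) {}
    void ensure_open() {                    // lazy open: first use triggers it
        if (!opened_) {
            std::cout << "opening " << name_ << "\n";
            opened_ = true;
        }
    }
    void push(const std::string& entry) {
        ensure_open();
        std::cout << "pushed " << entry << " to " << name_ << "\n";
    }
};

// Nominal handle: cheap to copy, holds no I/O state of its own.
class fifo {
    std::shared_ptr<fifo_impl> impl_;
public:
    explicit fifo(std::string name)
      : impl_(std::make_shared<fifo_impl>(std::move(name))) {}
    void push(const std::string& entry) { impl_->push(entry); }
};

int main() {
    fifo f("example");
    fifo g = f;           // copies share one reference-counted implementation
    g.push("record");     // the open happens lazily, exactly once
}
```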
Adam C. Emerson [Thu, 26 Jun 2025 17:58:57 +0000 (13:58 -0400)]
{neorados,osdc}: Support subsystem cancellation
Tag operations with a subsystem so we can cancel them all in one go.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 2526eb573b789b33b7d9ebf1169491f13e2318bb)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Conflicts:
src/rgw/driver/rados/rgw_service.cc
src/rgw/rgw_sal.cc
- `#ifdef`s for standalone Rados
src/rgw/driver/rados/rgw_datalog.cc
- Periodic re-run of recovery removed in main and pending backport
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
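A toy sketch of what subsystem tagging buys (registry and names are hypothetical, not the osdc code): each in-flight operation registers a cancel hook under its subsystem tag, and shutdown cancels the whole subsystem in one call.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy registry keyed by subsystem tag.
class op_registry {
    std::unordered_map<std::string, std::vector<std::function<void()>>> ops_;
public:
    void register_op(const std::string& subsystem, std::function<void()> cancel) {
        ops_[subsystem].push_back(std::move(cancel));
    }
    void cancel_subsystem(const std::string& subsystem) {
        auto it = ops_.find(subsystem);
        if (it == ops_.end())
            return;
        for (auto& cancel : it->second)
            cancel();                       // cancel them all in one go
        ops_.erase(it);
    }
};

int main() {
    op_registry reg;
    reg.register_op("datalog", [] { std::cout << "datalog op cancelled\n"; });
    reg.register_op("datalog", [] { std::cout << "another op cancelled\n"; });
    reg.cancel_subsystem("datalog");        // shutdown path: one call, all ops
}
```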
Adam C. Emerson [Fri, 30 May 2025 20:54:45 +0000 (16:54 -0400)]
neorados: Hold reference to implementation across operations
Asynchrony combined with cancellations keeps leading to occasional
lifetime issues, so follow the best practices of Asio I/O objects by
having completions keep a reference live.
The original NeoRados backing implements Asio's two-phase shutdown
properly.
The RadosClient backing does not, because it shares an Objecter with
completions that do not belong to it. In practice I don't think this
will matter since librados and neorados get shut down around the same
time.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 57c9723928b4d2b2148ca0dd4d505acdc071f8eb)
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
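The lifetime rule can be shown in miniature (invented names, assuming the usual `shared_from_this` idiom rather than the actual neorados internals): each completion captures a reference to the implementation, so the impl cannot be destroyed while completions are outstanding.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// Each pending completion captures a shared_ptr to the implementation, so
// dropping the user-facing handle cannot destroy the impl mid-operation.
class client_impl : public std::enable_shared_from_this<client_impl> {
public:
    ~client_impl() { std::cout << "impl destroyed\n"; }
    void async_op(std::vector<std::function<void()>>& pending) {
        auto self = shared_from_this();     // keep a reference live
        pending.push_back([self] {          // the completion owns 'self'
            std::cout << "completion ran; impl still alive\n";
        });
    }
};

int main() {
    std::vector<std::function<void()>> pending;
    {
        auto impl = std::make_shared<client_impl>();
        impl->async_op(pending);
    }                        // last handle dropped here, but the impl survives
    for (auto& c : pending)
        c();                 // ...until every completion has run
    pending.clear();         // now the reference count hits zero
}
```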
John Mulligan [Mon, 8 Sep 2025 18:13:59 +0000 (14:13 -0400)]
tentacle: update formatting to match across tentacle branches
I managed to create a bit of a mess with formatting changes after
a fix was cherry picked to `tentacle-release`. This change makes
the formatting on `tentacle-release` match that of `tentacle`.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Laura Flores [Fri, 5 Sep 2025 21:46:20 +0000 (16:46 -0500)]
doc/rados/operations: add kernel client procedure to read balancer documentation
As of now, the kernel client does not support `pg-upmap-primary`. I have
added some troubleshooting steps to help users who are unable to
mount images and filesystems with the kernel client while using `pg-upmap-primary`.
Once the feature is supported by the kernel client, users will be able
to perform mounts along with `pg-upmap-primary`.
N Balachandran [Thu, 28 Aug 2025 06:22:23 +0000 (11:52 +0530)]
rgw/logging: fixes data loss during rollover
Multiple threads attempting to roll over the same log object can result
in the creation of numerous orphan tail objects, each with a single record.
This occurs when a NULL RGWObjVersionTracker is used during the creation of
a new logging object. These records are inaccessible, leading to data loss,
which is particularly critical in Journal mode.
Furthermore, valid log tail objects may be added to the Garbage Collection (GC)
list, exacerbating data loss.
Fixes: https://tracker.ceph.com/issues/72740
Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
(cherry picked from commit eea6525c031ae93f4ae846b06d55831e658faa2c)
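The race has the shape of a lost compare-and-swap; a minimal sketch (an atomic stand-in for the version tracker, not the actual RGW code) shows why a guarded create lets exactly one rollover succeed, while an unguarded one lets every racer "win" and orphan the others' tails.

```cpp
#include <atomic>
#include <cstdio>

// Atomic stand-in for the head object's version. compare_exchange models the
// exclusive-create semantics a non-NULL version tracker provides.
std::atomic<int> head_version{0};

// Roll over to a new log object only if nobody else already did, i.e. the
// version still matches what this thread last read. Exactly one racer wins.
bool rollover(int seen_version) {
    int expected = seen_version;
    return head_version.compare_exchange_strong(expected, seen_version + 1);
}

int main() {
    int seen = head_version.load();
    // Two racers holding the same 'seen' version: only one may create the new
    // logging object; the loser must re-read and write to the winner's object
    // instead of orphaning a tail of its own.
    std::printf("racer A wins: %d\n", rollover(seen));  // 1
    std::printf("racer B wins: %d\n", rollover(seen));  // 0 -> retry
}
```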
Bill Scales [Wed, 16 Jul 2025 14:55:40 +0000 (15:55 +0100)]
osd: Optimized EC invalid pwlc for shards doing backfill/async
Shards performing backfill or async recovery receive log entries
(but not transactions) for updates to missing/yet to be backfilled
objects. These log entries get applied and completed immediately
because there is nothing that can be rolled back. This causes
pwlc to advance too early and causes problems if other shards
do not complete the update and end up rolling it backwards.
This fix sets pwlc to be invalid when such a log entry is
applied and completed; it then remains invalid until the
next interval, when peering runs again. Other shards will
continue to update pwlc, and any complete subset of shards
in a future interval will include at least one shard that
has continued to update pwlc.
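A compressed sketch of that state machine (invented names; the real logic lives in the EC peering and completion paths): an apply-and-complete without a transaction invalidates the shard's pwlc until the next interval.

```cpp
#include <cstdio>

// When a backfilling/async-recovery shard applies-and-completes an entry it
// never received a transaction for, its pwlc is invalidated and stays so
// until peering in the next interval, so the prematurely advanced value can
// never be trusted by later peering.
struct shard_pwlc {
    bool valid = true;
    int completed_to = 0;

    void on_apply_complete(bool had_transaction, int version) {
        if (!had_transaction) {          // missing / yet-to-backfill object
            valid = false;               // invalid until the next interval
            return;
        }
        if (valid)
            completed_to = version;      // normal advance
    }
    void on_new_interval() { valid = true; }   // peering re-derives pwlc
};

int main() {
    shard_pwlc s;
    s.on_apply_complete(true, 784);      // ordinary write: pwlc advances
    s.on_apply_complete(false, 785);     // no transaction: invalidate pwlc
    std::printf("valid=%d completed_to=%d\n", s.valid, s.completed_to);  // 0 784
}
```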
Bill Scales [Wed, 16 Jul 2025 14:05:16 +0000 (15:05 +0100)]
osd: Optimized EC add_log_entry should not skip partial writes
Undo a previous attempt at a fix that made add_log_entry skip adding partial
writes to the log if the write did not update this shard. The only case where
this code path executed is when a partial write was to an object that needs
backfilling or async recovery. For async recovery we need to keep the
log entry because it is needed to update the missing list. For backfill it
does no harm to keep the log entry.
Bill Scales [Wed, 16 Jul 2025 13:55:49 +0000 (14:55 +0100)]
osd: Optimized EC apply_pwlc needs to be careful about advancing last_complete
Fix bug in apply_pwlc where the primary was advancing last_complete for a
shard doing async recovery so that last_complete became equal to last_update,
making it think that recovery had completed. It is only valid to advance
last_complete if it is equal to last_update.
Tidy up the logging in this function: consecutive calls
often logged that pwlc could advance on the 1st call and then that it
could not on the 2nd call. We only want one log message.
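A compressed sketch of the corrected guard (simplified eversion type and invented function shape, not the actual apply_pwlc): last_complete only advances when it has caught up with last_update.

```cpp
#include <cstdio>
#include <utility>

using eversion = std::pair<int, int>;   // (epoch, version), heavily simplified

// A shard's last_complete may only be advanced when it has caught up with
// last_update; otherwise async recovery looks finished when it is not.
void maybe_advance(eversion& last_complete, eversion& last_update,
                   const eversion& pwlc_to) {
    if (last_complete == last_update && pwlc_to > last_update) {
        last_update = pwlc_to;           // shard has everything: move together
        last_complete = pwlc_to;
        std::printf("advanced to (%d,%d)\n", pwlc_to.first, pwlc_to.second);
    }
    // else: the shard is still recovering; advancing last_complete here would
    // wrongly signal that recovery has completed.
}

int main() {
    eversion lc{204, 780}, lu{204, 784};
    maybe_advance(lc, lu, {204, 790});   // no-op: lc != lu (async recovery)
    lc = lu;
    maybe_advance(lc, lu, {204, 790});   // now both advance together
}
```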
Alex Ainscow [Mon, 14 Jul 2025 22:57:49 +0000 (23:57 +0100)]
osd: Always send EC messages to all shards following error.
Explanation of the bug being fixed:
Log entry 204'784 is an error - "_rollback_to deleting head on smithi17019940-573 because
got ENOENT|whiteout on find_object_context" so this log entry is generated outside of EC by
PrimaryLogPG. It should be applied to all shards, however osd 13(2) was a little slow and
the update got interrupted by a new epoch so it didn't apply it. All the other shards
marked it as applied and completed (there isn't the usual interlock that EC has of making
sure all shards apply the update before any complete it).
We then processed 4 partial writes applying and completing them (they didn't update osd
13(2)), then we have a new epoch and go through peering. Peering says osd 13(2) didn't see
update 204'784 (it didn't) and therefore the error log entry and the 4 partial writes need
to be rolled back. The other shards had completed those 4 partial writes so we end up with
4 missing objects on all the shards which become unfound objects.
I think the underlying bug means that log entry 204'784 isn't really complete and may
"disappear" from the log in a subsequent peering cycle. Trying to forcefully rollback a
logged error doesn't generate a missing object or a miscompare, so the consequences of the
bug are hidden. It is however tripping up the new EC code where proc_master_log is being
much stricter about what a completed write means.
Fix:
After generating a logged error we could force the next write to EC to update metadata on
all shards even if it's a partial write. This means this write won't complete unless all
shards see the logged error. This will make new EC behave the same as old EC. There is
already an interlock with EC (call_write_ordered) which is called just before generating
the log error that ensures that any in-flight writes complete before submitting the log
error. We could set a boolean flag here (at the point call_write_ordered is called is fine;
we don't need to wait for the callback) to say the next write has to go to all shards. The
flag can be cleared when we generate the transactions for the next write, or when we get an
on_change notification (peering will then clear up the mess).
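The proposed flag logic can be sketched as follows (condensed and with invented names; the real interlock lives in the EC write path):

```cpp
#include <cstdio>

// After a logged error, force the next write to touch every shard even if it
// would otherwise be partial, so it cannot complete until all shards have
// seen the error entry.
struct ec_interlock {
    bool force_full_write = false;

    void on_logged_error() { force_full_write = true; }  // where call_write_ordered runs

    bool next_write_targets_all_shards(bool is_partial) {
        if (force_full_write) {
            force_full_write = false;    // cleared once the write is planned
            return true;                 // metadata update goes to all shards
        }
        return !is_partial;
    }

    void on_change() { force_full_write = false; }  // peering cleans up instead
};

int main() {
    ec_interlock ec;
    ec.on_logged_error();
    std::printf("next write full? %d\n", ec.next_write_targets_all_shards(true));   // 1
    std::printf("later write full? %d\n", ec.next_write_targets_all_shards(true));  // 0
}
```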
Alex Ainscow [Mon, 14 Jul 2025 15:40:22 +0000 (16:40 +0100)]
osd: Attribute re-reads in optimised EC
There were some bugs in attribute reads during recovery in optimised
EC where the attribute read failed. There were two scenarios:
1. It was not necessary to do any further reads to recover the data. This
can happen during recovery of many shards.
2. The re-read could be honoured from non-primary shards. There are
sometimes multiple copies of the shard which can be used, so a failed read
on one OSD can be replaced by a read from another.
Bill Scales [Fri, 11 Jul 2025 11:59:40 +0000 (12:59 +0100)]
osd: EC optimizations keep log entries on all shards
When a shard is backfilling it is given log entries
for partial writes even if they do not apply to the
shard. The code was updating the missing list but
discarding the log entry. This is wrong because the
update can be rolled backwards and the log entry is
required to revert the update to the missing list.
Keeping the log entry has a small but insignificant
performance impact.
Bill Scales [Mon, 7 Jul 2025 20:13:59 +0000 (21:13 +0100)]
mon: Optimized EC pools preprocess_pgtemp incorrectly rejecting pgtemp as nop
Optimized EC pools store pgtemp with primary shards first; this was not
being taken into account by OSDMonitor::preprocess_pgtemp, which meant
that the change of pgtemp from [None,2,4] to [None,4,2] for a 2+1 pool
was being rejected as a nop because the primary first encoded version
of [None,2,4] is [None,4,2].
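A small sketch of the nop check (the primary-first encoding below is a stand-in, not the actual OSDMap encoding): the proposed pg_temp must be converted to the stored encoding before comparison, otherwise a real change compares equal to the stored value and is rejected.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Stand-in encoding: pretend primary-first order for a 2+1 pool swaps the
// last two shards. The real mapping lives in the pool metadata.
std::vector<int> primary_first(std::vector<int> v) {
    if (v.size() == 3)
        std::swap(v[1], v[2]);
    return v;
}

int main() {
    // Stored pg_temp is kept primary-first: encoding of [None,2,4] is [None,4,2].
    std::vector<int> stored = {-1, 4, 2};
    // Proposed pg_temp, in normal shard order: [None,4,2] -- a real change.
    std::vector<int> proposed = {-1, 4, 2};
    // Naive comparison sees identical vectors and rejects the change as a nop.
    std::printf("naive nop? %d\n", proposed == stored);                  // 1 (wrong)
    // Encoding the proposal first reveals the difference.
    std::printf("real nop?  %d\n", primary_first(proposed) == stored);   // 0
}
```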
Bill Scales [Fri, 4 Jul 2025 10:51:05 +0000 (11:51 +0100)]
osd: EC Optimizations proc_master_log bug fixes
1. proc_master_log can roll forward full-writes that have
been applied to all shards but not yet completed. Add a
new function consider_adjusting_pwlc to roll-forward
pwlc. Later partial_write can be called to process the
same writes again. This can result in pwlc being rolled
backwards. Modify partial_write so it does not undo pwlc.
2. At the end of proc_master_log we want the new
authoritative view of pwlc to persist - this may be
better or worse than the stale view of pwlc held by
other shards. consider_rollback_pwlc sometimes
updated the epoch in the toversion (second value of the
range fromversion-toversion). We now always do this.
Updating toversion.epoch causes problems because this
version sometimes gets copied to last_update and
last_complete - using the wrong epoch here messes
everything up in later peering cycles. Instead we
now update fromversion.epoch. This requires changes
to apply_pwlc and an assert in Stray::react(const MInfoRec&).
3. Calling apply_pwlc at the end of proc_master_log is
too early - updating last_update and last_complete here
breaks GetMissing. We need to do this later when activating
(change to search_missing and activate).
4. proc_master_log is calling partial_write with the
wrong previous version - this causes problems after a
split when the log is sparsely populated.
5. Merging PGs is not setting up pwlc correctly, which
can cause issues in future peering cycles. The
pwlc can simply be reset; we need to update the epoch
to make sure this view of pwlc persists vs stale
pwlc from other shards.
Alex Ainscow [Tue, 1 Jul 2025 14:51:58 +0000 (15:51 +0100)]
osd: Relax assertion that all recoveries require a read.
If multiple objects are being read as part of the same recovery (this happens
when recovering some snapshots) and a read fails, then some reads from other
shards will be necessary. However, some objects may not need a read. In this
case it is only important that at least one read message is sent, rather than
one read message per object.
Alex Ainscow [Sun, 29 Jun 2025 21:54:51 +0000 (22:54 +0100)]
osd: Truncate coding shards to minimal size
Scrub detected a bug where if an object was truncated to a size where the first
shard is smaller than the chunk size (only possible for >4k chunk sizes), then
the coding shards were being aligned to the chunk size, rather than to 4k.
This fix changes how the write plan is calculated so that the correct size is written.
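The sizing rule reduces to an align-up choice; a sketch with illustrative numbers (not the actual write-plan code):

```cpp
#include <cstdint>
#include <cstdio>

// Round n up to a multiple of a.
constexpr uint64_t align_up(uint64_t n, uint64_t a) { return (n + a - 1) / a * a; }

int main() {
    const uint64_t chunk_size = 16 * 1024;  // a >4k chunk, where the bug shows
    const uint64_t shard0_size = 3 * 1024;  // object truncated mid-first-shard

    // Buggy plan: coding shards aligned to the chunk size.
    std::printf("buggy coding shard size:   %llu\n",
                (unsigned long long)align_up(shard0_size, chunk_size));  // 16384
    // Fixed plan: coding shards aligned to 4k, the minimal size.
    std::printf("correct coding shard size: %llu\n",
                (unsigned long long)align_up(shard0_size, 4096));        // 4096
}
```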
Fix some unusual scenarios where peering was incorrectly
declaring that objects were missing on stray shards. When
proc_master_log rolls forward partial writes it needs to
update pwlc exactly the same way as if the write had been
completed. This ensures that stray shards that were not
updated because of partial writes do not cause objects
to be incorrectly marked as missing.
The fix also means some code in GetMissing which was trying
to do a similar thing for shards that were acting,
recovering or backfilling (but not stray) can be deleted.
Bill Scales [Wed, 25 Jun 2025 09:26:17 +0000 (10:26 +0100)]
osdc: Optimized EC pools routing bug
Fix bug with routing to an acting set like [None,Y,X,X]p(X)
for a 3+1 optimized pool where osd X is representing more
than one shard. For an optimized EC pool we want it to
choose shard 3 because shard 2 is a non-primary. If we
just search the acting set for the first OSD that matches
X this will pick shard 2, so we have to convert the order
to primary-first, then find the matching OSD, and then
convert this back to the normal ordering to get shard 3.
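A sketch of the ordering dance (the primary-first permutation below is invented for the example, not the real mapping):

```cpp
#include <cstdio>
#include <vector>

// Search the acting set in primary-first order so an OSD appearing more than
// once resolves to its primary shard, then report the normal-order shard id.
int find_shard(const std::vector<int>& acting, int osd,
               const std::vector<int>& primary_first_order) {
    for (int shard : primary_first_order)        // shard ids, primary-first
        if (acting[shard] == osd)
            return shard;                        // first match wins
    return -1;
}

int main() {
    // 3+1 pool, acting set [None,Y,X,X]p(X): osd X serves shards 2 and 3, but
    // shard 2 is a non-primary, so routing must resolve X to shard 3.
    std::vector<int> acting = {-1, 7, 9, 9};     // None=-1, Y=7, X=9
    std::vector<int> pf_order = {0, 1, 3, 2};    // invented primary-first order
    std::printf("chosen shard: %d\n", find_shard(acting, 9, pf_order));  // 3
}
```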
Bill Scales [Mon, 23 Jun 2025 10:36:37 +0000 (11:36 +0100)]
mon: Optimized EC clean_temps needs to permit primary change
Optimized EC pools were blocking clean_temps from clearing pg_temp
when up == acting but up_primary != acting_primary because optimized
pools sometimes use pg_temp to force a change of primary shard.
However this can block merges which require the two PGs being
merged to have the same primary. Relax clean_temps to permit
pg_temp to be cleared so long as the new primary is not a
non-primary shard.
Bill Scales [Mon, 23 Jun 2025 09:24:17 +0000 (10:24 +0100)]
osd: Optimized EC pools - fix overaggressive assert in read_log_and_missing
Non-primary shards may not be updated because of partial writes. This means
that the OI version for an object on these shards may be stale. An assert
in read_log_and_missing was checking that the OI version matched the have
version in a missing entry. The missing entry calculates the have version
using the prior_version from a log entry; this does not take partial
writes into account, so it can be ahead of the stale OI version.
Relax the assert for optimized pools to require have >= oi.version
Bill Scales [Mon, 23 Jun 2025 09:12:10 +0000 (10:12 +0100)]
osd: rewind_divergent_log needs to dirty log if crt or rollback_info_trimmed_to changes
PGLog::rewind_divergent_log was only causing the log to be marked
dirty and checkpointed if there were divergent entries. However
after a PG split it is possible that the log can be rewound
modifying crt and/or rollback_info_trimmed_to without creating
divergent entries because the entries being rolled back were
all split into the other PG.
Failing to checkpoint the log generates a window where if the OSD
is reset you can end up with crt (and rollback_info_trimmed_to) > head.
One consequence of this is asserts like
ceph_assert(rollback_info_trimmed_to == head); firing.
Fixes: https://tracker.ceph.com/issues/55141
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d8f78adf85f8cb11deeae3683a28db92046779b5)
Alex Ainscow [Fri, 20 Jun 2025 20:47:32 +0000 (21:47 +0100)]
osd: Correct truncate logic for new EC
The clone logic in the truncate was only cloning from the truncate
to the end of the pre-truncate object. If the next shard was being
truncated to a shorter length (which is common), then this shard
has a larger clone.
The rollback, however, can only be given a single range, so it was
given a range which covers all clones. The problem is that if shard
0 is rolled back, then some empty space from the clone was copied
to shard 0.
The fix is easy: calculate the full clone length and apply it to all shards, so it matches the rollback.
Bill Scales [Thu, 19 Jun 2025 13:26:04 +0000 (14:26 +0100)]
osd: Do not apply log entry if shard not written
This was a failed test, where the primary concluded that all objects were present
despite one missing object on the non-primary shard.
The problem was caused because log entries are sent to an unwritten shard if that
shard is missing the object, in order to update the version number in its missing
entry. However, the log entry should not actually be added to the log.
Further testing showed there are other scenarios where log entries are sent to
unwritten shards (for example a clone + partial_write in the same transaction);
these scenarios do not want to add the log entry either.
Bill Scales [Thu, 19 Jun 2025 12:41:17 +0000 (13:41 +0100)]
osd: EC Optimizations proc_replica_log needs to apply pwlc
PeeringState::proc_replica_log needs to apply pwlc before
calling PGLog so that any partial writes that have occurred
are taken into account when working out where a replica/stray
has diverged from the primary.