2 weeks ago client: move mref_reader check in statfs out of helper
Christopher Hoffman [Mon, 11 Aug 2025 14:30:05 +0000 (14:30 +0000)]
client: move mref_reader check in statfs out of helper

Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 9ab1209a68d450eca9bf915a605200ab92e53926)

2 weeks ago test: Add test for libcephfs statfs
Christopher Hoffman [Wed, 6 Aug 2025 15:47:46 +0000 (15:47 +0000)]
test: Add test for libcephfs statfs

Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 61c13a9d097353a924436f9e6c2b95984d12485d)

2 weeks ago client: get quota root based off of provided inode in statfs
Christopher Hoffman [Tue, 5 Aug 2025 19:39:29 +0000 (19:39 +0000)]
client: get quota root based off of provided inode in statfs

In statfs, get the quota_root for the provided inode. Check whether a quota
is applied directly to the inode; if not, walk back up the tree to find any
quota set higher up.

Fixes: https://tracker.ceph.com/issues/72355
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 85f9fc29d202e1b050e50ad8bae13d7751ef28db)
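
A minimal sketch of the reverse tree walk described in this commit, using hypothetical `Inode`/`QuotaInfo` stand-ins rather than the real Client types: start at the supplied inode and walk parent pointers until a directly applied quota is found.
```
#include <cstdint>

// Hypothetical stand-ins for the client's inode/quota types.
struct QuotaInfo {
  uint64_t max_bytes = 0;
  uint64_t max_files = 0;
  bool is_enabled() const { return max_bytes || max_files; }
};

struct Inode {
  QuotaInfo quota;
  Inode* parent = nullptr;  // nullptr at the filesystem root
};

// Return the nearest ancestor (or the inode itself) with a quota set,
// or nullptr if no quota applies anywhere up the tree.
Inode* get_quota_root(Inode* in) {
  for (Inode* cur = in; cur; cur = cur->parent) {
    if (cur->quota.is_enabled())
      return cur;
  }
  return nullptr;
}
```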

2 weeks ago client: use path supplied in statfs
Christopher Hoffman [Tue, 5 Aug 2025 19:34:45 +0000 (19:34 +0000)]
client: use path supplied in statfs

If a path is provided, use it in statfs. Replumb the internal statfs helper
for internal use only, so it can be shared by ll_statfs and statfs.

Fixes: https://tracker.ceph.com/issues/72355
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 0c6a8add81c61960b733ce3ec38f7bbb3b5e93e9)

2 weeks ago mdstypes: Dump export_ephemeral_random_pin as double
Enrico Bocchi [Sun, 15 Jun 2025 20:55:22 +0000 (22:55 +0200)]
mdstypes: Dump export_ephemeral_random_pin as double

Fixes: https://tracker.ceph.com/issues/71944
Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit 12b9d735006b5ee8b47e49d9aa6b693f7ce7b952)

2 weeks ago release note: add a note that "subvolume info" cmd output can also...
Rishabh Dave [Wed, 4 Jun 2025 06:32:54 +0000 (12:02 +0530)]
release note: add a note that "subvolume info" cmd output can also...

contain "source field" in it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 37244a7972b9842e3cb8f8ddbd4a1d9ca875bbe3)

2 weeks ago doc/cephfs: update docs since "subvolume info" cmd output can also...
Rishabh Dave [Wed, 4 Jun 2025 06:28:11 +0000 (11:58 +0530)]
doc/cephfs: update docs since "subvolume info" cmd output can also...

contain the "source" field in it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 0a029149df824ef0f026af912925d809b31862ef)

2 weeks ago qa/cephfs: add test to check clone source info's present in...
Rishabh Dave [Fri, 9 May 2025 16:02:27 +0000 (21:32 +0530)]
qa/cephfs: add test to check clone source info's present in...

.meta file of a cloned subvolume after cloning has finished, and in the
output of the "ceph fs subvolume info" command.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 44cacaf1e46b46f96ac3641cad02f27c4c753532)

2 weeks ago mgr/vol: show clone source info in "subvolume info" cmd output
Rishabh Dave [Fri, 9 May 2025 16:40:47 +0000 (22:10 +0530)]
mgr/vol: show clone source info in "subvolume info" cmd output

Include clone source information in the output of the "ceph fs subvolume info"
command so that users can access this information conveniently.

Fixes: https://tracker.ceph.com/issues/71266
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 0ef6da69d993ca58270010e0b458bad0dff29034)

2 weeks ago mgr/vol: keep clone source info even after cloning is finished
Rishabh Dave [Thu, 8 May 2025 15:05:39 +0000 (20:35 +0530)]
mgr/vol: keep clone source info even after cloning is finished

Instead of removing the information about the source of a cloned subvolume
from the .meta file after cloning has finished, keep it, as users may find
it useful.

Fixes: https://tracker.ceph.com/issues/71266
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit bbacfdae1a2e7eb60f91b75852bcd1096b6e3c84)

2 weeks ago qa: Add test for subvolume_ls on osd full
Kotresh HR [Thu, 24 Jul 2025 17:31:12 +0000 (17:31 +0000)]
qa: Add test for subvolume_ls on osd full

Fixes: https://tracker.ceph.com/issues/72260
Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 8547e57ebc4022ca6750149f49b68599a8af712e)

2 weeks ago mds: Fix readdir when osd is full.
Kotresh HR [Thu, 24 Jul 2025 09:33:06 +0000 (09:33 +0000)]
mds: Fix readdir when osd is full.

Problem:
The readdir wouldn't list all the entries in the directory
when the OSD is full and rstats are enabled.

Cause:
The issue happens only in a multi-MDS CephFS cluster. If rstats
are enabled, readdir requests the 'Fa' cap on every dentry,
basically to fetch the size of the directories. Note that 'Fa' is
CEPH_CAP_GWREXTEND, which maps to CEPH_CAP_FILE_WREXTEND and is
used by CEPH_STAT_RSTAT.

The request for the cap is a getattr call and it need not go to
the auth MDS. If rstats are enabled, the getattr goes with
the mask CEPH_STAT_RSTAT, which mandates auth-mds handling in
'handle_client_getattr' so that the request gets forwarded to the
auth MDS if the current one is not the auth. But if the OSD is full,
the inode is fetched in 'dispatch_client_request' before calling
the handler function of the respective op, to check FULL cap
access for certain metadata write operations. If the inode doesn't
exist, ESTALE is returned. This is wrong for operations like
getattr, where the inode might not be in memory on the non-auth
MDS; returning ESTALE is confusing and the client won't retry.
This was introduced by commit 6db81d8479b539d, which fixed
subvolume deletion when the OSD is full.

Fix:
Fetch the inode required for the FULL cap access check only for the
relevant operations in the OSD-full scenario. This makes sense because
these operations are mostly preceded by a lookup that loads the inode
into memory, or they handle ESTALE gracefully.

Fixes: https://tracker.ceph.com/issues/72260
Introduced-by: 6db81d8479b539d3ca6b98dc244c525e71a36437
Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 1ca8f334f944ff78ba12894f385ffb8c1932901c)
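
A rough sketch of the gating described in the fix, with hypothetical op identifiers rather than the real MDS opcodes: only metadata-writing operations need the inode fetched for the FULL-cap access check, so a read like getattr is no longer failed with ESTALE on a non-auth MDS.
```
#include <set>

// Hypothetical op identifiers -- not the actual MDS opcodes.
enum class Op { GETATTR, LOOKUP, SETATTR, MKDIR, UNLINK };

// Only ops that write metadata need the FULL-cap access check (and hence
// the inode fetch) when the OSDs are full; read-only ops are skipped so a
// non-auth MDS without the inode in memory does not return ESTALE.
bool needs_full_cap_check(Op op) {
  static const std::set<Op> write_ops = {Op::SETATTR, Op::MKDIR, Op::UNLINK};
  return write_ops.count(op) > 0;
}
```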

2 weeks ago qa/cephfs: fix test_subvolume_group_charmap_inheritance test
Venky Shankar [Thu, 11 Sep 2025 03:35:19 +0000 (03:35 +0000)]
qa/cephfs: fix test_subvolume_group_charmap_inheritance test

Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit fe3d6417bfaef571a9bb4093b5a8dfdb3cc3e59d)

2 weeks ago doc: add name mangling documentation for subvolume group creation
Xavi Hernandez [Thu, 3 Jul 2025 08:34:37 +0000 (10:34 +0200)]
doc: add name mangling documentation for subvolume group creation

Signed-off-by: Xavi Hernandez <xhernandez@gmail.com>
(cherry picked from commit b47bbf8afdfcb81ee8aed7ef9c27b45dd8d5a589)

2 weeks ago qa: add tests for name mangling in subvolume group creation
Xavi Hernandez [Thu, 3 Jul 2025 08:33:49 +0000 (10:33 +0200)]
qa: add tests for name mangling in subvolume group creation

Signed-off-by: Xavi Hernandez <xhernandez@gmail.com>
(cherry picked from commit ea2d8e9fc04f249d576e4799c3bdc44302cf1226)

2 weeks ago pybind/mgr: add name mangling options to subvolume group creation
Xavi Hernandez [Thu, 3 Jul 2025 08:27:10 +0000 (10:27 +0200)]
pybind/mgr: add name mangling options to subvolume group creation

Signed-off-by: Xavi Hernandez <xhernandez@gmail.com>
(cherry picked from commit f98990ac1bbdf4ca0f05ea0336289cb32001159f)

2 weeks ago mgr/volumes: Fix json.loads for test on mon caps
Enrico Bocchi [Tue, 5 Nov 2024 08:26:04 +0000 (09:26 +0100)]
mgr/volumes: Fix json.loads for test on mon caps

Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit b008ef9eb690618608f902c67f8df1fb8a587e33)

2 weeks ago mgr/volumes: Add test for mon caps if auth key has remaining mds/osd caps
Enrico Bocchi [Wed, 16 Oct 2024 09:40:26 +0000 (11:40 +0200)]
mgr/volumes: Add test for mon caps if auth key has remaining mds/osd caps

Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit 403d5411364e2fddd70d98a6f120b26e416c1d99)

2 weeks ago mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps
Enrico Bocchi [Mon, 26 Aug 2024 11:30:02 +0000 (13:30 +0200)]
mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps

Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit 0882bbe8a4470f82993d87b7c02b19aa7fe7fbcc)

2 weeks ago workunits/rados: remove cache tier test
Nitzan Mordechai [Tue, 15 Jul 2025 10:58:40 +0000 (10:58 +0000)]
workunits/rados: remove cache tier test

Fixes: https://tracker.ceph.com/issues/71930
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
(cherry picked from commit 2b57f435a1de1a99dd7bcb47478938965587713b)

2 weeks ago qa/suites/rados/thrash-old-clients: Add OSD warnings to ignore list
Naveen Naidu [Mon, 21 Jul 2025 10:55:01 +0000 (16:25 +0530)]
qa/suites/rados/thrash-old-clients: Add OSD warnings to ignore list

Add these to ignorelist:
- OSD_HOST_DOWN
- OSD_ROOT_DOWN

These warnings cause Teuthology tests to fail during OSD setup. They
are unrelated to RADOS testing and occur due to the nature of
thrashing tests, where OSDs temporarily mark themselves as down.

If an OSD is the only one in the cluster at the time, the cluster
may incorrectly detect the host as down, even though other OSDs
are still starting up.

Fixes: https://tracker.ceph.com/issues/70972
Signed-off-by: Naveen Naidu <naveen.naidu@ibm.com>
(cherry picked from commit 36b96ad56f0d364930b8841ebe36335756e489a9)

2 weeks ago blk/kernel: improve DiscardThread life cycle.
Igor Fedotov [Fri, 4 Jul 2025 12:15:26 +0000 (15:15 +0300)]
blk/kernel: improve DiscardThread life cycle.

This will eliminate a potential race between thread startup and its
removal.

Relates-to: https://tracker.ceph.com/issues/71800
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 69369b151b96ca74bffb9d72f4c249f48fde2845)

2 weeks ago mgr/dashboard: fix missing schedule interval in rbd API
Nizamudeen A [Thu, 11 Sep 2025 04:13:13 +0000 (09:43 +0530)]
mgr/dashboard: fix missing schedule interval in rbd API

Fetch the rbd image schedule interval through the rbd_support module's
schedule list command.

GET /api/rbd will have the following field per image
```
"schedule_info": {
                    "image": "rbd/rbd_1",
                    "schedule_time": "2025-09-11 03:00:00",
                    "schedule_interval": [
                        {
                            "interval": "5d",
                            "start_time": null
                        },
                        {
                            "interval": "3h",
                            "start_time": null
                        }
                    ]
                },
```

Also fixes the UI, where the schedule interval was missing in the form,
and disables editing of the schedule_interval.

Extended the same thing to the `GET /api/pool` endpoint.

Fixes: https://tracker.ceph.com/issues/72977
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 72cebf0126bd07f7d42b0ae7b68646c527044942)

2 weeks ago options/mon: disable availability tracking by default
Shraddha Agrawal [Tue, 16 Sep 2025 13:52:27 +0000 (19:22 +0530)]
options/mon: disable availability tracking by default

Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit ef7effaa33bd6b936d7433e668d36f80ed7bee65)

2 weeks ago osd: Optimized EC incorrectly rolled backwards write
Bill Scales [Wed, 27 Aug 2025 13:44:08 +0000 (14:44 +0100)]
osd: Optimized EC incorrectly rolled backwards write

A bug in choose_acting in this scenario:

* Current primary shard has been absent so has missed the latest few writes
* All the recent writes are partial writes that have not updated shard X
* All the recent writes have completed

The authoritative shard is chosen from the set of primary-capable shards
that have the highest last_epoch_started; these all have log entries
for the recent writes.

The get-log shard is chosen from the set of shards that have the highest
last_epoch_started; this chooses shard X because it is furthest behind.

The primary shard's last_update is not less than the get-log shard's
last_update, so this if statement decides that it has a good enough log:

if ((repeat_getlog != nullptr) &&
    get_log_shard != all_info.end() &&
    (info.last_update < get_log_shard->second.last_update) &&
    pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {

We then proceed through peering using the primary log and the
log from shard X. Neither has details about the recent writes,
which are then incorrectly rolled back.

The if statement should be looking at last_update for the
authoritative shard rather than the get_log_shard; the code
would then realize that it needs to get the log from the
authoritative shard first and then have a second pass
where it gets the log from the get-log shard.

Peering would then have information about the partial writes
(obtained from the authoritative shard's log) and could correctly
roll these writes forward by deducing that the get_log_shard
didn't have these log entries because they were partial writes.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit ac4e0926bbac4ee4d8e33110b8a434495d730770)
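
A self-contained sketch of the corrected comparison, using simplified stand-in types (not the real peering structures): the decision to fetch a full log first should compare the primary's last_update against the authoritative shard, not against the get-log shard.
```
// Simplified model of the check; eversion_t/ShardInfo here are stand-ins.
struct eversion_t {
  unsigned epoch = 0;
  unsigned version = 0;
  bool operator<(const eversion_t& o) const {
    return epoch != o.epoch ? epoch < o.epoch : version < o.version;
  }
};

struct ShardInfo {
  eversion_t last_update;
  bool nonprimary = false;
};

// True when the primary must first fetch a complete log from another shard
// before repeating GetLog against the authoritative shard.
bool need_full_log_first(const eversion_t& primary_last_update,
                         const ShardInfo& auth_shard,
                         const ShardInfo& get_log_shard) {
  // The buggy version compared against get_log_shard.last_update, which is
  // false when the get-log shard is the one furthest behind.
  return primary_last_update < auth_shard.last_update &&
         get_log_shard.nonprimary;
}
```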

2 weeks ago osd: Clear zero_for_decode for shards where read failed on recovery
Alex Ainscow [Tue, 12 Aug 2025 16:12:45 +0000 (17:12 +0100)]
osd: Clear zero_for_decode for shards where read failed on recovery

Not clearing this can lead to a failed decode, which panics, rather than
a recovery or IO failure.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 6365803275b1b6a142200cc2db9735d48c86ae03)

2 weeks ago osd: Reduce buffer-printing debug strings to debug level 30
Alex Ainscow [Fri, 8 Aug 2025 15:20:32 +0000 (16:20 +0100)]
osd: Reduce buffer-printing debug strings to debug level 30

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
# Conflicts:
# src/osd/ECBackend.cc
(cherry picked from commit b4ab3b1dcef59a19c67bb3b9e3f90dfa09c4f30b)

2 weeks ago osd: Fix segfault in EC debug string
Alex Ainscow [Fri, 8 Aug 2025 09:25:53 +0000 (10:25 +0100)]
osd: Fix segfault in EC debug string

The old debug_string implementation was potentially reading up to 3
bytes off the end of an array. It was also doing lots of unnecessary
bufferlist reconstructs. This refactor of this function fixes both
issues.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit da3ccdf4d03e40b747f8876449199102e53e00ce)

2 weeks ago osd: Optimized EC backfill interval has wrong versions
Bill Scales [Fri, 8 Aug 2025 08:58:14 +0000 (09:58 +0100)]
osd: Optimized EC backfill interval has wrong versions

Bug in the optimized EC code that creates the backfill
interval on the primary. It creates a map with
the object version for each backfilling shard. When
there are multiple backfill targets, the code was
overwriting oi.version with the version
for a shard that has had partial writes, which
can result in the object not being backfilled.

This can manifest as a data integrity issue, scrub
error or snapshot corruption.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit acca514f9a3d0995b7329f4577f6881ba093a429)

2 weeks ago osd: Optimized EC choose_acting needs to use best primary shard
Bill Scales [Mon, 4 Aug 2025 15:24:41 +0000 (16:24 +0100)]
osd: Optimized EC choose_acting needs to use best primary shard

There have been a couple of corner case bugs with choose_acting
with optimized EC pools in the scenario where a new primary
with no existing log is chosen and find_best_info selects
a non-primary shard as the authoritative shard.

Non-primary shards don't have a full log, so in this scenario
we need to get the log from a shard that does have a complete
log first (so our log is ahead of or equivalent to the authoritative
shard) and then repeat the get log for the authoritative shard.

Problems arise if we make different decisions about the acting
set and backfill/recovery based on these two different shards.
In one bug we oscillated between two different primaries
because one primary used one shard to make peering decisions
and the other primary used the other shard, resulting in
looping flip/flop changes to the acting_set.

In another bug we used one shard to decide that we could do
async recovery but then tried to get the log from another
shard and asserted because we didn't have enough history in
the log to do recovery and should have chosen to do a backfill.

This change makes optimized EC pools always choose the
best !non_primary shard when making decisions about peering
(irrespective of whether the primary has a full log or not).
The best overall shard is now only used for get log when
deciding how far to roll back the log.

It also sets repeat_getlog to false if peering fails because
the PG is incomplete to avoid looping forever trying to get
the log.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit f3f45c2ef3e3dd7c7f556b286be21bd5a7620ef7)

2 weeks ago osd: Do not send PDWs if read count > k
Alex Ainscow [Fri, 1 Aug 2025 14:09:58 +0000 (15:09 +0100)]
osd: Do not send PDWs if read count > k

The main point of PDW (as currently implemented) is to reduce the amount
of reading performed by the primary when preparing for a read-modify-write (RMW).

It was making the assumption that if any recovery was required by a
conventional RMW, then a PDW is always better. This was an incorrect
assumption, as a conventional RMW performs at most k reads for any plugin
which supports PDW. As such, we tweak this logic to perform a conventional
RMW if the PDW is going to read k or more shards.

This should improve performance in some minor areas.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit cffd10f3cc82e0aef29209e6e823b92bdb0291ce)
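
The tweaked heuristic boils down to a single comparison; a sketch with hypothetical parameter names:
```
#include <cstdint>

// Prefer a parity-delta-write only when it reads fewer shards than the
// k reads a conventional read-modify-write would need at most.
bool use_parity_delta_write(uint32_t pdw_shard_reads, uint32_t k) {
  return pdw_shard_reads < k;
}
```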

2 weeks ago osd: Fix decode for some extent cache reads.
Alex Ainscow [Wed, 18 Jun 2025 19:46:49 +0000 (20:46 +0100)]
osd: Fix decode for some extent cache reads.

The extent cache in EC can cause the backend to perform some surprising reads.
Some of the patterns discovered in testing caused the decode to attempt to
decode more data than was anticipated during the read planning, leading to an
assert. This simple fix reduces the scope of the decode to the minimum.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 2ab45a22397112916bbcdb82adb85f99599e03c0)

2 weeks ago osd: Optimized EC calculate_maxles_and_minlua needs to use ...
Bill Scales [Fri, 1 Aug 2025 10:48:18 +0000 (11:48 +0100)]
osd: Optimized EC calculate_maxles_and_minlua needs to use exclude_nonprimary_shards

When an optimized EC pool is searching for the best shard that
isn't a non-primary shard, the calculation for maxles and
minlua needs to exclude non-primary shards.

This bug was seen in a test run where activating a PG was
interrupted by a new epoch and only a couple of non-primary
shards became active and updated les. In the next epoch
a new primary (without log) failed to find a shard that
wasn't non-primary with the latest les. The les of
non-primary shards should be ignored when looking for
an appropriate shard to get the full log from.

This is safe because an epoch cannot start I/O without
at least K shards that have updated les, and there
are always K-1 non-primary shards. If I/O has started
then we will find the latest les even if we skip
non-primary shards. If I/O has not started then the
latest les ignoring non-primary shards is the
last epoch in which I/O was started and has a good
enough log+missing list.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 72d55eec85afa4c00fac8dc18a1fb49751e61985)
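
An illustrative loop for the calculation described above, with simplified stand-in fields: the maximum last_epoch_started is taken only over shards that are not non-primary.
```
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in per-shard summary; not the real peering info structure.
struct ShardSummary {
  uint32_t last_epoch_started = 0;
  bool nonprimary = false;
};

// maxles with exclude_nonprimary_shards semantics: non-primary shards may
// have bumped les in a partially activated epoch, so they are skipped.
uint32_t max_les_excluding_nonprimary(const std::vector<ShardSummary>& shards) {
  uint32_t maxles = 0;
  for (const auto& s : shards) {
    if (s.nonprimary)
      continue;
    maxles = std::max(maxles, s.last_epoch_started);
  }
  return maxles;
}
```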

2 weeks ago osd: Optimized EC choose_async_recovery_ec must use auth_shard
Bill Scales [Fri, 1 Aug 2025 09:39:16 +0000 (10:39 +0100)]
osd: Optimized EC choose_async_recovery_ec must use auth_shard

Optimized EC pools modify how GetLog and choose_acting work:
if the auth_shard is a non-primary shard and the (new) primary
is behind the auth_shard, then we cannot just get the log from
the non-primary shard because it will be missing entries for
partial writes. Instead we need to get the log from a shard
that has the full log first and then repeat GetLog to get
the log from the auth_shard.

choose_acting was modifying auth_shard in the case where
we need to get the log from another shard first. This is
wrong - the remainder of the logic in choose_acting, and
in particular choose_async_recovery_ec, needs to use the
auth_shard to calculate what the acting set will be.
Using a different shard can occasionally cause a
different acting set to be selected (because of
thresholds on how many log entries behind
a shard needs to be to perform async recovery), and
this can lead to two shards flip/flopping with
different opinions about what the acting set should be.

The fix is to separate the shard that will be returned
to GetLog from the auth_shard, which will be used
for acting set calculations.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 3c2161ee7350a05e0d81a23ce24cd0712dfef5fb)

2 weeks ago osd: Optimized EC don't try to trim past crt
Bill Scales [Fri, 1 Aug 2025 09:22:47 +0000 (10:22 +0100)]
osd: Optimized EC don't try to trim past crt

If an exceptionally long sequence of partial writes that did not
update a shard is followed by a full write, then it is possible
that the log trim point is ahead of the previous write to the
shard (and hence crt). We cannot trim beyond crt. In this scenario
it's fine to limit the trim to crt, because the shard doesn't have
any of the log entries for the partial writes, so there is nothing
more to trim.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 645cdf9f61e79764eca019f58a4d9c6b51768c81)
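
The guard amounts to clamping the trim point; a minimal sketch with hypothetical version values:
```
#include <algorithm>
#include <cstdint>

// Never trim beyond can_rollback_to (crt). If a long run of partial writes
// left crt behind the requested trim point, limiting the trim to crt loses
// nothing: this shard has no log entries for those partial writes anyway.
uint64_t clamp_trim_to(uint64_t requested_trim_to, uint64_t crt) {
  return std::min(requested_trim_to, crt);
}
```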

2 weeks ago osd: Optimized EC missing call to apply_pwlc after updating pwlc
Bill Scales [Fri, 1 Aug 2025 08:56:23 +0000 (09:56 +0100)]
osd: Optimized EC missing call to apply_pwlc after updating pwlc

update_peer_info was updating pwlc with a newer version received
from another shard, but failed to update the peer_info's to
reflect the new pwlc by calling apply_pwlc.

The scenario was the primary receiving an update from shard X which had
newer information about shard Y. The code was calling apply_pwlc
for shard X but not for shard Y.

The fix simplifies the logic in update_peer_info - if we are
the primary, update all peer_infos that have pwlc. If we
are a non-primary and there is pwlc, then update info.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d19f3a3bcbb848e530e4d31cbfe195973fa9a144)

2 weeks ago osd: Optimized EC don't apply pwlc for divergent writes
Bill Scales [Wed, 30 Jul 2025 11:44:10 +0000 (12:44 +0100)]
osd: Optimized EC don't apply pwlc for divergent writes

Split pwlc epoch into a separate variable so that we
can use epoch and version number when comparing if
last_update is within a pwlc range. This ensures that
pwlc is not applied to a shard that has a divergent
write, but still tracks the most recent update of pwlc.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d634f824f229677aa6df7dded57352f7a59f3597)

2 weeks ago osd: Optimized EC present_shards no longer needed
Bill Scales [Wed, 30 Jul 2025 11:41:34 +0000 (12:41 +0100)]
osd: Optimized EC present_shards no longer needed

present_shards is no longer needed in the PG log entry; it has been
replaced with code in proc_master_log that calculates which shards were
in the last epoch started and are still present.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 880a17e39626d99a0b6cc8259523daa83c72802c)

2 weeks ago osd: Optimized EC proc_master_log fix roll-forward logic when shard is absent
Bill Scales [Mon, 28 Jul 2025 08:26:36 +0000 (09:26 +0100)]
osd: Optimized EC proc_master_log fix roll-forward logic when shard is absent

Fix bug in optimized EC code where proc_master_log incorrectly did not
roll forward a write if one of the written shards is missing in the current
epoch and there is a stray version of that shard that did not receive the
write.

As long as the currently present shards that participated in les and were
updated by a write have the update, the write should be rolled forward.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit e0e8117769a8b30b2856f940ab9fc00ad1e04f63)

2 weeks ago osd: Refactor find_best_info and choose_acting
Bill Scales [Mon, 28 Jul 2025 08:21:54 +0000 (09:21 +0100)]
osd: Refactor find_best_info and choose_acting

Refactor find_best_info to have a separate function that calculates
maxles and minlua. The refactor makes history_les_bound
optional and tidies up the choose_acting interface, removing it
where it is not used.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit f1826fdbf136dc7c96756f0fb8a047c9d9dda82a)

2 weeks ago osd: EC Optimizations proc_master_log boundary case bug fixes
Bill Scales [Thu, 17 Jul 2025 18:17:27 +0000 (19:17 +0100)]
osd: EC Optimizations proc_master_log boundary case bug fixes

Fix a couple of bugs in proc_master_log for optimized EC
pools dealing with boundary conditions such as an empty
log and merging two logs that diverge from the very first
entry.

Refactor the code to handle the boundary conditions and
neaten it up.

Predicate the code block with if (pool.info.allows_ecoptimizations())
to make it clear this code path is only for optimized EC pools.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 1b44fd9991f5f46b969911440363563ddfad94ad)

2 weeks ago osd: Invalidate stats during peering if we are rolling a shard forwards.
Jon Bailey [Fri, 25 Jul 2025 13:16:35 +0000 (14:16 +0100)]
osd: Invalidate stats during peering if we are rolling a shard forwards.

This change means we always recalculate stats when rolling a shard forwards. It prevents the situation where we end up with incorrect statistics because we always take the stats of the oldest shard during peering, which caused outdated PG stats to be applied when the oldest shards are ones that do not see partial writes and num_bytes has since changed elsewhere on that shard.

Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
(cherry picked from commit b178ce476f4a5b2bb0743e36d78f3a6e23ad5506)

2 weeks ago osd: ECTransaction.h includes OSDMap.h
Radoslaw Zarzynski [Wed, 21 May 2025 16:33:15 +0000 (16:33 +0000)]
osd: ECTransaction.h includes OSDMap.h

Needed for crimson.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 6dd393e37f6afb9063c4bed3e573557bd0efb6bd)

2 weeks ago osd: bypass messenger for local EC reads
Radoslaw Zarzynski [Mon, 21 Apr 2025 08:49:55 +0000 (08:49 +0000)]
osd: bypass messenger for local EC reads

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit b07d1f67625c8b621b2ebf5a7f744c588cae99d3)

2 weeks ago osd: fix buildability after get_write_plan() shuffling
Radoslaw Zarzynski [Fri, 18 Jul 2025 10:35:09 +0000 (10:35 +0000)]
osd: fix buildability after get_write_plan() shuffling

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 7f4cb19251345849736e83bd0c7cc15ccdcdf48b)

2 weeks ago osd: just shuffle get_write_plan() from ECBackend to ECCommon
Radoslaw Zarzynski [Sun, 11 May 2025 10:40:55 +0000 (10:40 +0000)]
osd: just shuffle get_write_plan() from ECBackend to ECCommon

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 9d5bf623537b8ee29e000504d752ace1c05964d7)

2 weeks ago osd: prepare get_write_plan() for moving from ECBackend to ECCommon
Radoslaw Zarzynski [Sun, 11 May 2025 09:20:29 +0000 (09:20 +0000)]
osd: prepare get_write_plan() for moving from ECBackend to ECCommon

For the sake of sharing with crimson.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit dc5b0910a500363b62cfda8be44b4bed634f9cd6)

2 weeks ago osd: separate producing EC's WritePlan out into a dedicated method
Radoslaw Zarzynski [Sun, 11 May 2025 06:51:23 +0000 (06:51 +0000)]
osd: separate producing EC's WritePlan out into a dedicated method

For the sake of sharing with crimson in next commits.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit e06c0c6dd08fd6d2418a189532171553d63a9deb)

2 weeks ago osd: fix unused variable warning in ClientReadCompleter
Radoslaw Zarzynski [Wed, 23 Apr 2025 11:42:00 +0000 (11:42 +0000)]
osd: fix unused variable warning in ClientReadCompleter

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit eb3a3bb3a70e6674f6e23a88dd1b2b86551efda2)

2 weeks ago osd: shuffle ECCommon::RecoveryBackend from ECBackend.cc to ECCommon.cc
Radoslaw Zarzynski [Thu, 9 May 2024 21:00:05 +0000 (21:00 +0000)]
osd: shuffle ECCommon::RecoveryBackend from ECBackend.cc to ECCommon.cc

It's just code movement; there are no changes apart from that.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit ef644c9d29b8adaef228a20fc96830724d1fc3f5)

2 weeks ago osd: drop junky `#if 1` in recovery backend
Radoslaw Zarzynski [Thu, 9 May 2024 20:32:32 +0000 (20:32 +0000)]
osd: drop junky `#if 1` in recovery backend

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit d43bded4a02532c4612d53fc4418db8e4e829c3f)

2 weeks ago osd: move ECCommon::RecoveryBackend from ECBackend.cc to ECCommon.cc
Radoslaw Zarzynski [Thu, 9 May 2024 19:11:14 +0000 (19:11 +0000)]
osd: move ECCommon::RecoveryBackend from ECBackend.cc to ECCommon.cc

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit debd035a650768ead64e0707028bb862f4767bef)

2 weeks ago osd: replace get_obc() with maybe_load_obc() in EC recovery
Radoslaw Zarzynski [Thu, 9 May 2024 19:09:50 +0000 (19:09 +0000)]
osd: replace get_obc() with maybe_load_obc() in EC recovery

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 266773625f997ff6a1fda82b201e023948a5c081)

2 weeks ago osd: abstract sending MOSDPGPush during EC recovery
Radoslaw Zarzynski [Thu, 9 May 2024 19:07:32 +0000 (19:07 +0000)]
osd: abstract sending MOSDPGPush during EC recovery

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 1d54eaff41ec8d880bcf9149e4c71114e0ffdc09)

2 weeks ago osd: prepare ECCommon::RecoveryBackend for shuffling to ECCommon.cc
Radoslaw Zarzynski [Tue, 26 Mar 2024 14:28:16 +0000 (14:28 +0000)]
osd: prepare ECCommon::RecoveryBackend for shuffling to ECCommon.cc

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit e3ade5167d3671524eb372a028157f2a46e7a219)

2 weeks ago osd: squeeze RecoveryHandle out of ECCommon::RecoveryBackend
Radoslaw Zarzynski [Tue, 26 Mar 2024 14:20:56 +0000 (14:20 +0000)]
osd: squeeze RecoveryHandle out of ECCommon::RecoveryBackend

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 1e0feb73a4b91bd8b7b3ecc164d28fe005b97ed1)

2 weeks ago osd: just shuffle RecoveryMessages to ECCommon.h
Radosław Zarzyński [Wed, 27 Sep 2023 12:17:06 +0000 (14:17 +0200)]
osd: just shuffle RecoveryMessages to ECCommon.h

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit bc28c16a9a83b0f12d3d6463eaeacbab40b0890b)

2 weeks ago osd: prepare RecoveryMessages for shuffling to ECCommon.h
Radoslaw Zarzynski [Tue, 26 Mar 2024 11:59:42 +0000 (11:59 +0000)]
osd: prepare RecoveryMessages for shuffling to ECCommon.h

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 0581926035113b1a9cb38f76233242d6b32a7dc6)

2 weeks ago osd: ECCommon::RecoveryBackend doesn't depend on ECBackend anymore
Radoslaw Zarzynski [Mon, 25 Mar 2024 13:02:07 +0000 (13:02 +0000)]
osd: ECCommon::RecoveryBackend doesn't depend on ECBackend anymore

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 6ead960b23a95211847250d90e3d2945c6254345)

2 weeks ago osd: fix buildability after the RecoveryBackend shuffling
Radoslaw Zarzynski [Fri, 18 Apr 2025 08:42:18 +0000 (08:42 +0000)]
osd: fix buildability after the RecoveryBackend shuffling

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit c9d18cf3024e5ba681bed5dc315f70527e99b3f1)

2 weeks ago osd: just shuffle RecoveryBackend from ECBackend.h to ECCommon.h
Radoslaw Zarzynski [Mon, 25 Mar 2024 11:08:23 +0000 (11:08 +0000)]
osd: just shuffle RecoveryBackend from ECBackend.h to ECCommon.h

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 98480d2f75a7b99aa72562a6a6daa5f39db3d425)

2 weeks ago doc: add config option and usage docs
Shraddha Agrawal [Thu, 21 Aug 2025 12:38:18 +0000 (18:08 +0530)]
doc: add config option and usage docs

This commit adds docs for the newly introduced config option and
updates the feature docs on how to use it.

Fixes: https://tracker.ceph.com/issues/72619
Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit 1cbe41bde12eb1d0437b746164edb689393cc5ad)

2 weeks ago mon/MgrStatMonitor: update availability score after configured interval
Shraddha Agrawal [Thu, 21 Aug 2025 12:27:52 +0000 (17:57 +0530)]
mon/MgrStatMonitor: update availability score after configured interval

This commit ensures that data availability status is only updated
after the configured interval has elapsed. By default this interval
is set to 1s.

Fixes: https://tracker.ceph.com/issues/72619
Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit e9e3d90210922f336950d02bedca2f09d4463dfe)
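
A minimal sketch of the interval gate, using a hypothetical helper rather than the actual MgrStatMonitor code: the availability score is refreshed only once the configured interval (1s by default) has elapsed since the previous update.
```
#include <chrono>

// Returns true when enough time has passed since the last availability
// update for another update to be performed.
bool should_update_availability(std::chrono::steady_clock::time_point last_update,
                                std::chrono::milliseconds interval) {
  return std::chrono::steady_clock::now() - last_update >= interval;
}
```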

2 weeks ago mon/MgrStatMonitor: add pool_availability_update_interval config option
Shraddha Agrawal [Thu, 21 Aug 2025 11:51:49 +0000 (17:21 +0530)]
mon/MgrStatMonitor: add pool_availability_update_interval config option

This commit adds a dynamic config option to change the
interval at which the data availability score is updated.

Fixes: https://tracker.ceph.com/issues/72619
Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit 37173c8be118795af6218ffad1e67a95a935a394)

2 weeks ago rgw/s3: remove 'aws-chunked' from Content-Encoding response
Casey Bodley [Thu, 10 Jul 2025 13:47:04 +0000 (09:47 -0400)]
rgw/s3: remove 'aws-chunked' from Content-Encoding response

PutObject stores some of the generic http request headers in object
attrs so they can be returned as response headers in Get/HeadObject

S3 has its own `aws-chunked` value for the `Content-Encoding` header,
which it says does _not_ get stored with the object or returned with
Get/HeadObject

We've been storing this header with objects forever, so omitting the
value on PutObject doesn't fix the issue for existing objects. Instead,
add the necessary filtering to Get/HeadObject.

Fixes: https://tracker.ceph.com/issues/21128
Signed-off-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 4a802b864a6f9d45506197dfb1bc23cf852e51f8)
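
An illustrative filter (not the actual RGW code) for the response-side fix: drop the `aws-chunked` token from a stored `Content-Encoding` value before echoing it back on GetObject/HeadObject.
```
#include <sstream>
#include <string>
#include <vector>

std::string filter_aws_chunked(const std::string& content_encoding) {
  std::vector<std::string> kept;
  std::istringstream in(content_encoding);
  std::string token;
  while (std::getline(in, token, ',')) {
    // Trim surrounding whitespace from each comma-separated encoding.
    const auto b = token.find_first_not_of(" \t");
    if (b == std::string::npos)
      continue;
    const auto e = token.find_last_not_of(" \t");
    token = token.substr(b, e - b + 1);
    if (token != "aws-chunked")
      kept.push_back(token);
  }
  std::string out;
  for (size_t i = 0; i < kept.size(); ++i)
    out += (i ? ", " : "") + kept[i];
  return out;  // e.g. "aws-chunked, gzip" -> "gzip"
}
```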

2 weeks ago nvmeofgw: fixing GW delete issues
Leonid Chernin [Tue, 24 Jun 2025 13:00:49 +0000 (16:00 +0300)]
nvmeofgw: fixing GW delete issues

1. Fix the issue where a GW is deleted based on invalid subsystem info.
2. In track_deleting_gws, break from the loop only if the delete was
   really done.
3. Fix the published rebalance index - publish the ana-group instead of
   the index.
4. Do not dump the gw-id string after the GW was removed.

Fixes: https://tracker.ceph.com/issues/71896
Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 77a11a7206748fa4be383da9f00a5df50e437e4a)

2 weeks ago nvmeofgw: cleanup pending map upon monitor restart
Leonid Chernin [Tue, 5 Aug 2025 10:19:59 +0000 (13:19 +0300)]
nvmeofgw: cleanup pending map upon monitor restart

Fixes: https://tracker.ceph.com/issues/72434

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 924acd1f2c11784790abb2b9c5ff5dacd32934e1)

2 weeks ago mgr/prometheus: fix enabled_modules check for smb metadata
Avan Thakkar [Mon, 18 Aug 2025 09:12:12 +0000 (14:42 +0530)]
mgr/prometheus: fix enabled_modules check for smb metadata

For get_smb_metadata, check the enabled modules from mgr_map instead of the
available modules, to get the list of mgr modules that are enabled on the active mgr.

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit ea255486133c1057194db1ea56b117e498f4a34b)

2 weeks ago mgr/prometheus: add share name as label to SMB_METADATA metric
Avan Thakkar [Thu, 24 Jul 2025 13:25:10 +0000 (18:55 +0530)]
mgr/prometheus: add share name as label to SMB_METADATA metric

Fixes: https://tracker.ceph.com/issues/72068
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit fcce415a351c12826bc94faeae02c9000975a245)

2 weeks ago mgr/prometheus: add smb_metadata metric
Avan Thakkar [Thu, 10 Jul 2025 12:07:59 +0000 (17:37 +0530)]
mgr/prometheus: add smb_metadata metric

Exposed SMB metadata metric including labels:
- CephFS volume, subvolume group and subvolume
- SMB cluster ID (as netbiosname)
- SMB version

Fixes: https://tracker.ceph.com/issues/72068
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit e214717fea850ae2121fa566f60b696ec8ddd7a2)

2 weeks ago mgr/cephadm: don't use list_servers to get active mgr host for prometheus SD config
Adam King [Wed, 30 Jul 2025 19:51:11 +0000 (15:51 -0400)]
mgr/cephadm: don't use list_servers to get active mgr host for prometheus SD config

Having a lot of calls into list_servers causes issues with
the core ceph mgr on large clusters. Additionally, we were
using it purely to get the active mgr's host here, which
cephadm should be able to do without needing a mgr API call.

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 726bb5a95de7857c220953a1ed26ed3263213c6f)

2 weeks ago mgr/cephadm: add interval control for stray daemon checks
Adam King [Wed, 30 Jul 2025 19:49:20 +0000 (15:49 -0400)]
mgr/cephadm: add interval control for stray daemon checks

Primarily to avoid running list_servers (which we need for
stray daemon checks, since the whole point is
to check against a source that isn't cephadm). It was
found that on larger clusters, calling into list_servers
often can cause issues with the core ceph mgr.

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit ee0364761e1ee29e6ad527dddd0eafc01c1f1aaa)

2 weeks ago osd: Optimized EC invalid pwlc for shards doing backfill/async
Bill Scales [Wed, 16 Jul 2025 14:55:40 +0000 (15:55 +0100)]
osd: Optimized EC invalid pwlc for shards doing backfill/async

Shards performing backfill or async recovery receive log entries
(but not transactions) for updates to missing/yet to be backfilled
objects. These log entries get applied and completed immediately
because there is nothing that can be rolled back. This causes
pwlc to advance too early and causes problems if other shards
do not complete the update and end up rolling it backwards.

This fix sets pwlc to be invalid when such a log entry is
applied and completed, and it then remains invalid until the
next interval, when peering runs again. Other shards will
continue to update pwlc, and any complete subset of shards
in a future interval will include at least one shard that
has continued to update pwlc.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 534fc76d40a86a49bfabab247d3a703cbb575e27)

2 weeks ago osd: Optimized EC add_log_entry should not skip partial writes
Bill Scales [Wed, 16 Jul 2025 14:05:16 +0000 (15:05 +0100)]
osd: Optimized EC add_log_entry should not skip partial writes

Undo a previous attempt at a fix that made add_log_entry skip adding partial
writes to the log if the write did not update this shard. The only case where
this code path executed is when a partial write was to an object that needs
backfilling or async recovery. For async recovery we need to keep the
log entry because it is needed to update the missing list. For backfill it
doesn't harm to keep the log entry.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 9f0e883b710a06e3371bc7e0681e034727447f27)

2 weeks ago osd: Optimized EC apply_pwlc needs to be careful about advancing last_complete
Bill Scales [Wed, 16 Jul 2025 13:55:49 +0000 (14:55 +0100)]
osd: Optimized EC apply_pwlc needs to be careful about advancing last_complete

Fix bug in apply_pwlc where the primary was advancing last_complete for a
shard doing async recovery, so that last_complete became equal to last_update
and it then thought that recovery had completed. It is only valid to advance
last_complete if it is equal to last_update.

Tidy up the logging in this function as consecutive calls to this function
often logged that it could advance on the 1st call and then that it could not
on the 2nd call. We only want one log message.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 0b19ed49ff76d7470bfcbd7f26ea0c7e5a2bc358)
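
A sketch of the guard in simplified form (plain integer versions rather than the real version type): pwlc is only allowed to advance a shard's last_complete/last_update when the two are already equal, so a shard mid async recovery is not mistaken for fully recovered.
```
#include <cstdint>

// Simplified: versions are plain integers here rather than eversion_t.
void maybe_apply_pwlc(uint64_t& last_update,
                      uint64_t& last_complete,
                      uint64_t pwlc_to) {
  if (last_complete == last_update && pwlc_to > last_update) {
    last_update = pwlc_to;
    last_complete = pwlc_to;
  }
  // If last_complete < last_update the shard is still recovering, so
  // advancing last_complete here would falsely signal completed recovery.
}
```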

2 weeks ago osd: Use std::cmp_greater to avoid signedness warnings.
Alex Ainscow [Tue, 15 Jul 2025 10:50:20 +0000 (11:50 +0100)]
osd: Use std::cmp_greater to avoid signedness warnings.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 846879e6c2ec4ab5a65040981a617f4b603c379a)

2 weeks ago osd: Always send EC messages to all shards following error.
Alex Ainscow [Mon, 14 Jul 2025 22:57:49 +0000 (23:57 +0100)]
osd: Always send EC messages to all shards following error.

Explanation of bug which is being fixed:

Log entry 204'784 is an error - "_rollback_to deleting head on smithi17019940-573 because
got ENOENT|whiteout on find_object_context" - so this log entry is generated outside of EC by
PrimaryLogPG. It should be applied to all shards; however, osd 13(2) was a little slow and
the update got interrupted by a new epoch, so it didn't apply it. All the other shards
marked it as applied and completed (there isn't the usual interlock that EC has of making
sure all shards apply the update before any complete it).

We then processed 4 partial writes applying and completing them (they didn't update osd
13(2)), then we have a new epoch and go through peering. Peering says osd 13(2) didn't see
update 204'784 (it didn't) and therefore the error log entry and the 4 partial writes need
to be rolled back. The other shards had completed those 4 partial writes so we end up with
4 missing objects on all the shards which become unfound objects.

I think the underlying bug means that log entry 204'784 isn't really complete and may
"disappear" from the log in a subsequent peering cycle. Trying to forcefully roll back a
logged error doesn't generate a missing object or a miscompare, so the consequences of the
bug are hidden. It is, however, tripping up the new EC code, where proc_master_log is being
much stricter about what a completed write means.

Fix:
After generating a logged error we could force the next write to EC to update metadata on
all shards, even if it's a partial write. This means this write won't complete unless all
shards see the logged error. This will make new EC behave the same as old EC. There is
already an interlock with EC (call_write_ordered), which is called just before generating
the log error, that ensures that any in-flight writes complete before submitting the log
error. We could set a boolean flag here (at the point call_write_ordered is called is fine,
we don't need to wait for the callback) to say the next write has to go to all shards. The
flag can be cleared if we generate the transactions for the next write, or we get an
on_change notification (peering will then clear up the mess).

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 4948f74331c13cd93086b057e0f25a59573e3167)

2 weeks ago osd: Attribute re-reads in optimised EC
Alex Ainscow [Mon, 14 Jul 2025 15:40:22 +0000 (16:40 +0100)]
osd: Attribute re-reads in optimised EC

There were some bugs in attribute reads during recovery in optimised
EC where the attribute read failed. There were two scenarios:

1. It was not necessary to do any further reads to recover the data. This
can happen during recovery of many shards.
2. The re-read could be honoured from non-primary shards. There are
sometimes multiple copies of the shard which can be used, so a failed read
on one OSD can be replaced by a read from another.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 417fb71c9b5628726d3217909ba1b6d3e7bf251a)

2 weeks ago osd: EC optimizations keep log entries on all shards
Bill Scales [Fri, 11 Jul 2025 11:59:40 +0000 (12:59 +0100)]
osd: EC optimizations keep log entries on all shards

When a shard is backfilling it gets given log entries
for partial writes even if they do not apply to the
shard. The code was updating the missing list but
discarding the log entry. This is wrong because the
update can be rolled backwards and the log entry is
required to revert the update to the missing list.
Keeping the log entry has a small but insignificant
performance impact.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 1fa5302092cbbb37357142d01ca008cae29d4f5e)

2 weeks ago osd: Remove some extraneous references to hinfo.
Alex Ainscow [Fri, 11 Jul 2025 09:58:32 +0000 (09:58 +0000)]
osd: Remove some extraneous references to hinfo.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 1c2476560f1c4b2eec1c074a6eead5520b5474eb)

2 weeks ago mon: Optimized EC pools preprocess_pgtemp incorrectly rejecting pgtemp as nop
Bill Scales [Mon, 7 Jul 2025 20:13:59 +0000 (21:13 +0100)]
mon: Optimized EC pools preprocess_pgtemp incorrectly rejecting pgtemp as nop

Optimized EC pools store pgtemp with primary shards first; this was not
being taken into account by OSDMonitor::preprocess_pgtemp, which meant
that the change of pgtemp from [None,2,4] to [None,4,2] for a 2+1 pool
was being rejected as a nop, because the primary-first encoded version
of [None,2,4] is [None,4,2].

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 00aa1933d3457c377d9483072e663442a4ff8ffd)

2 weeks ago osd: EC Optimizations proc_master_log bug fixes
Bill Scales [Fri, 4 Jul 2025 10:51:05 +0000 (11:51 +0100)]
osd: EC Optimizations proc_master_log bug fixes

1. proc_master_log can roll forward full-writes that have
been applied to all shards but not yet completed. Add a
new function consider_adjusting_pwlc to roll-forward
pwlc. Later partial_write can be called to process the
same writes again. This can result in pwlc being rolled
backwards. Modify partial_write so it does not undo pwlc.

2. At the end of proc_master_log we want the new
authoritative view of pwlc to persist - this may be
better or worse than the stale view of pwlc held by
other shards. consider_rollback_pwlc sometimes
updated the epoch in the toversion (second value of the
range fromversion-toversion). We now always do this.
Updating toversion.epoch causes problems because this
version sometimes gets copied to last_update and
last_complete - using the wrong epoch here messes
everything up in later peering cycles. Instead we
now update fromversion.epoch. This requires changes
to apply_pwlc and an assert in Stray::react(const MInfoRec&).

3. Calling apply_pwlc at the end of proc_master_log is
too early - updating last_update and last_complete here
breaks GetMissing. We need to do this later when activating
(change to search_missing and activate)

4. proc_master_log is calling partial_write with the
wrong previous version - this causes problems after a
split when the log is sparsely populated.

5. Merging PGs was not setting up pwlc correctly, which
can cause issues in future peering cycles. The
pwlc can simply be reset; we need to update the epoch
to make sure this view of pwlc persists vs stale
pwlc from other shards.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 0b8593a0112e31705acb581ac388a4ef1df31b4b)

2 weeks ago osd: Fix issue where not all shards are receiving setattr when it's sent to an object...
Jon Bailey [Thu, 3 Jul 2025 13:24:41 +0000 (14:24 +0100)]
osd: Fix issue where not all shards are receiving setattr when it's sent to an object with the whiteout flag set.

Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
(cherry picked from commit 89fef784aa46e74dd05ef8f1bff16f357016dfc3)

2 weeks ago osd: Relax assertion that all recoveries require a read.
Alex Ainscow [Tue, 1 Jul 2025 14:51:58 +0000 (15:51 +0100)]
osd: Relax assertion that all recoveries require a read.

If multiple objects are being read as part of the same recovery (this happens
when recovering some snapshots) and a read fails, then some reads from other
shards will be necessary. However, some objects may not need to be read. In this
case it is only important that at least one read message is sent, rather than
one read message per object.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 9f9ea6ddd38ebf6ae7159855267e61858bb2b7fc)

2 weeks ago osd: Recovery of zero length reads when we add a new OSD without an interval.
Alex Ainscow [Tue, 1 Jul 2025 14:49:20 +0000 (15:49 +0100)]
osd: Recovery of zero length reads when we add a new OSD without an interval.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 3493d13d733454bb75616c628e25b2fa94dcb400)

2 weeks ago osd: Relax PGLog assert when ec optimisations are enabled on a pool.
Alex Ainscow [Mon, 30 Jun 2025 13:31:21 +0000 (14:31 +0100)]
osd: Relax PGLog assert when ec optimisations are enabled on a pool.

The versions on partial shards are permitted to be behind, so we need
to relax several asserts; this is another example.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 0c89e7ef2ab48199ee3f7296cf1cb44c9aeec667)

2 weeks ago osd: Truncate coding shards to minimal size
Alex Ainscow [Sun, 29 Jun 2025 21:54:51 +0000 (22:54 +0100)]
osd: Truncate coding shards to minimal size

Scrub detected a bug where if an object was truncated to a size where the first
shard is smaller than the chunk size (only possible for >4k chunk sizes), then
the coding shards were being aligned to the chunk size, rather than to 4k.

This fix changes how the write plan is calculated so that the correct size is written.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit a39a309631482b0caa071d586f192cd19a7ae470)

2 weeks ago osd: EC Optimizations fix peering bug causing unfound objects
Bill Scales [Fri, 27 Jun 2025 12:35:58 +0000 (13:35 +0100)]
osd: EC Optimizations fix peering bug causing unfound objects

Fix some unusual scenarios where peering was incorrectly
declaring that objects were missing on stray shards. When
proc_master_log rolls forward partial writes it needs to
update pwlc in exactly the same way as if the write had been
completed. This ensures that stray shards that were not
updated because of partial writes do not cause objects
to be incorrectly marked as missing.

The fix also means some code in GetMissing which was trying
to do a similar thing for shards that were acting,
recovering or backfilling (but not stray) can be deleted.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 83a9c0a9f8e9ed4f514adb32f1ae2df1602c3f88)

2 weeks ago osdc: Optimized EC pools routing bug
Bill Scales [Wed, 25 Jun 2025 09:26:17 +0000 (10:26 +0100)]
osdc: Optimized EC pools routing bug

Fix bug with routing to an acting set like [None,Y,X,X]p(X)
for a 3+1 optimized pool where osd X is representing more
than one shard. For an optimized EC pool we want it to
choose shard 3 because shard 2 is a non-primary. If we
just search the acting set for the first OSD that matches
X this will pick shard 2, so we have to convert the order
to primaries first, then find the matching OSD, and then
convert this back to the normal ordering to get shard 3.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 3310f97859109090706b84824cac2f8a6cfe6928)

2 weeks ago mon: Optimized EC clean_temps needs to permit primary change
Bill Scales [Mon, 23 Jun 2025 10:36:37 +0000 (11:36 +0100)]
mon: Optimized EC clean_temps needs to permit primary change

Optimized EC pools were blocking clean_temps from clearing pg_temp
when up == acting but up_primary != acting_primary because optimized
pools sometimes use pg_temp to force a change of primary shard.

However this can block merges which require the two PGs being
merged to have the same primary. Relax clean_temps to permit
pg_temp to be cleared so long as the new primary is not a
non-primary shard.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit ce53276191e60375486f75d93508690f780bee21)

2 weeks ago osd: Optimized EC pools - fix overaggressive assert in read_log_and_missing
Bill Scales [Mon, 23 Jun 2025 09:24:17 +0000 (10:24 +0100)]
osd: Optimized EC pools - fix overaggressive assert in read_log_and_missing

Non-primary shards may not be updated because of partial writes. This means
that the OI version for an object on these shards may be stale. An assert
in read_log_and_missing was checking that the OI version matched the have
version in a missing entry. The missing entry calculates the have version
using the prior_version from a log entry; this does not take partial writes
into account, so it can be ahead of the stale OI version.

Relax the assert for optimized pools to require have >= oi.version

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 74e138a7c1f8b7e375568c6811a60f6bdad181b3)

2 weeks ago osd: rewind_divergent_log needs to dirty log if crt changes or ...
Bill Scales [Mon, 23 Jun 2025 09:12:10 +0000 (10:12 +0100)]
osd: rewind_divergent_log needs to dirty log if crt changes or rollback_info_trimmed_to changes

PGLog::rewind_divergent_log was only causing the log to be marked
dirty and checkpointed if there were divergent entries. However
after a PG split it is possible that the log can be rewound
modifying crt and/or rollback_info_trimmed_to without creating
divergent entries because the entries being rolled back were
all split into the other PG.

Failing to checkpoint the log generates a window where if the OSD
is reset you can end up with crt (and rollback_info_trimmed_to) > head.
One consequence of this is asserts like
ceph_assert(rollback_info_trimmed_to == head); firing.

Fixes: https://tracker.ceph.com/issues/55141
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d8f78adf85f8cb11deeae3683a28db92046779b5)

2 weeks ago osd: Correct truncate logic for new EC
Alex Ainscow [Fri, 20 Jun 2025 20:47:32 +0000 (21:47 +0100)]
osd: Correct truncate logic for new EC

The clone logic in the truncate was only cloning from the truncate
point to the end of the pre-truncate object. If the next shard was being
truncated to a shorter length (which is common), then this shard
has a larger clone.

The rollback, however, can only be given a single range, so it was
given a range which covers all clones.  The problem is that if shard
0 is rolled back, then some empty space from the clone was copied
to shard 0.

The fix is easy - calculate the full clone length and apply it to all shards, so it matches the rollback.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 5d7588c051b31098c9970877ab6a784967ff94c8)
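
A sketch of the corrected clone-length calculation, with hypothetical per-shard sizes: take the largest (old size - new size) across shards and use that single length everywhere, so every shard's clone matches the one rollback range.
```
#include <algorithm>
#include <cstdint>
#include <vector>

// old_sizes/new_sizes give each shard's length before and after the
// truncate. Returns the single clone length to apply to all shards.
uint64_t full_clone_length(const std::vector<uint64_t>& old_sizes,
                           const std::vector<uint64_t>& new_sizes) {
  uint64_t len = 0;
  for (size_t i = 0; i < old_sizes.size() && i < new_sizes.size(); ++i) {
    if (old_sizes[i] > new_sizes[i])
      len = std::max(len, old_sizes[i] - new_sizes[i]);
  }
  return len;
}
```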

2 weeks ago osd: Fix incorrect invalidate_crc during slice iterate
Alex Ainscow [Fri, 20 Jun 2025 10:48:59 +0000 (11:48 +0100)]
osd: Fix incorrect invalidate_crc during slice iterate

The CRCs were being invalidated at the wrong point, so the last CRC was
not being invalidated.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 564b53c446201ed33b5345c936c3c4b5d32bdaab)

2 weeks ago osd: Do not apply log entry if shard not written
Bill Scales [Thu, 19 Jun 2025 13:26:04 +0000 (14:26 +0100)]
osd: Do not apply log entry if shard not written

This was a failed test, where the primary concluded that all objects were present
despite one object missing on the non-primary shard.

The problem was caused because the log entries are sent to the unwritten shards if that
shard is missing in order to update the version number in the missing object. However,
the log entry should not actually be added to the log.

Further testing showed there are other scenarios where log entries are sent to
unwritten shards (for example a clone + partial_write in the same transaction);
these scenarios do not want to add the log entry either.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 24cd772f2099aa5f7dfeb7609522f770d0ae1115)

2 weeks ago osd: EC Optimizations proc_replica_log needs to apply pwlc
Bill Scales [Thu, 19 Jun 2025 12:41:17 +0000 (13:41 +0100)]
osd: EC Optimizations proc_replica_log needs to apply pwlc

PeeringState::proc_replica_log needs to apply pwlc before
calling PGLog so that any partial writes that have occurred
are taken into account when working out where a replica/stray
has diverged from the primary.
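
A minimal sketch of the ordering, with hypothetical names (ReplicaInfoSketch,
apply_pwlc_sketch) standing in for the real peering structures:

    #include <cstdint>

    using eversion_t = uint64_t;  // stand-in for the real version type

    struct ReplicaInfoSketch {
      eversion_t last_update = 0;
      eversion_t last_complete = 0;
    };

    // Fold the partial-write last-complete (pwlc) information into the replica's
    // info so the divergence point is computed with partial writes included.
    void apply_pwlc_sketch(ReplicaInfoSketch& info, eversion_t pwlc)
    {
      if (pwlc > info.last_update) {
        info.last_update = pwlc;
        info.last_complete = pwlc;
      }
    }

    void proc_replica_log_sketch(ReplicaInfoSketch& replica_info, eversion_t pwlc)
    {
      apply_pwlc_sketch(replica_info, pwlc);  // must happen before the log comparison
      // ... then hand replica_info to the PGLog divergence calculation ...
    }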

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 6c3c0a88b68e2548df670dbe9797d54f89259398)

2 weeks agoosd: Multiple Decode fixes.
Alex Ainscow [Wed, 18 Jun 2025 19:46:49 +0000 (20:46 +0100)]
osd: Multiple Decode fixes.

These are multiple fixes that affected the same code. To simplify review
and understanding of the changes, they have been merged into a single
commit.

Fix 1:

What happened in the defect (k=6, m=4):

1. State is: fast_reads = true, shards 0,4,5,6,7,8 are available. Shard 1 is missing this object.
2. Shard 5 only needs zeros, so its read is dropped. The other sub-read messages are sent.
3. The object on shard 1 completes recovery (so becomes not-missing).
4. The read completes; the completion notices that it only has 5 reads, so it calculates what it needs to re-read.
5. It calculates that it needs 0,1,4,5,6,7 - and so wants to read shard 1.
6. The code assumes that enough reads should already have been performed, so it refuses to do another read and instead generates an EIO.

The problem here is some "lazy" code in step (4).  What it should be doing is working out that it
can use the zero buffers and not calling get_remaining_reads().  Instead, what it attempts to do is
call get_remaining_reads() and, if there is no work to do, assume it has everything
already and complete the read with success.  This assumption mostly works - but the
combination of fast_reads picking fewer than k shards to read from AND an object completing
recovery in parallel causes an issue.

The solution is to wait for all reads to complete and then assume that any remaining zero buffers
count as completed reads.  This should then cause the plugin to declare "success".
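
A minimal sketch of that completion rule; ReadStateSketch and decode_can_proceed
are hypothetical names, not the plugin's real interface:

    #include <cstddef>

    struct ReadStateSketch {
      std::size_t reads_issued = 0;
      std::size_t reads_completed = 0;
      std::size_t zero_only_shards = 0;  // reads dropped because only zeros were needed
      std::size_t shards_needed = 0;     // k shards required for the decode
    };

    bool decode_can_proceed(const ReadStateSketch& s)
    {
      if (s.reads_completed < s.reads_issued) {
        return false;  // wait for every outstanding read before deciding
      }
      // Remaining zero buffers count as completed reads for decode purposes.
      return s.reads_completed + s.zero_only_shards >= s.shards_needed;
    }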

Fix 2:

There are decodes in new EC which can occur when fewer than k
shards have been read.  These are reads in the last stripe, where,
for decoding purposes, the data past the end of the shard can
be considered zeros. EC does not read these zeros, but instead relies
on the decode function inventing the zero buffers.

This was working correctly when fast reads were turned off, but
if such an IO was encountered with fast reads turned on, the
logic was disabled and the IO returned an EIO.

This commit fixes that logic, so that if all reads have completed
and send_all_remaining_reads conveys that no new reads were
requested, then decode will still be possible.

Fix 3:

When reading the end of an object whose shards are unequally sized,
we pad out the end of the decode with zeros, to provide
the correct data to the plugin.

Previously, the code decided not to add the zeros to "needed"
shards.  This caused a problem where for some parity-only
decodes, an incomplete set of zeros was generated, fooling the
_decode function into thinking that the entire shard was zeros.

In the fix, we need to cope with the case where the only data
needed from the shard is the padding itself.

The comments around the new code describe the logic behind
the change.

This makes the encode-specific use case of padding out the
to-be-decoded shards unnecessary, as this is performed by the
pad_on_shards function below.

Also fixed some logic where the need_set being passed
to the decode function did not contain the extra shards needed
for the decode. This need_set is actually ignored by all the
plugins as far as I know, but being wrong does not seem
helpful if it is needed in the future.

Fix 4: Extend reads when recovering parity

Here is an example use case which was going wrong (a sketch of the extent
calculation follows the list):
1. Start with 3+2 EC; shards 0,3,4 are 8k, shards 1,2 are 4k.
2. Perform a recovery, where we recover 2 and 4.  2 is missing, 4 can be copied from another OSD.
3. Recovery works out that it can do the whole recovery with shards 0,1,3 (note: not 4).
4. So the "need" set is 0,1,3, the "want" set is 2,4 and the "have" set is 0,1,3,4,5.
5. The logic in get_all_avail_shards then tries to work out the extents it needs - it only looks at 2, because we "have" 4.
6. The result is that we end up reading 4k on 0,1,3, then attempt to recover 8k on shard 4 from this... which clearly does not work.
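
As referenced above, a minimal sketch of the extent sizing; the names
(read_length_for_recovery, shard_sizes) are hypothetical:

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <set>

    // Size the reads to cover every shard being rebuilt (the "want" set), not
    // just the shards that are absent, so an 8k shard is not reconstructed from
    // 4k reads.
    uint64_t read_length_for_recovery(const std::set<int>& want,
                                      const std::map<int, uint64_t>& shard_sizes)
    {
      uint64_t len = 0;
      for (int shard : want) {
        auto it = shard_sizes.find(shard);
        if (it != shard_sizes.end()) {
          len = std::max(len, it->second);  // cover the largest shard being recovered
        }
      }
      return len;
    }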

Fix 5: Round up padding to 4k alignment in EC

pad_on_shards was not aligning to 4k, but the decode/encode functions were. This meant that
we assigned a new buffer and then appended another one afterwards; aligning the padding up front
avoids the second buffer and should be faster.
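
A minimal sketch of the alignment, assuming a 4k chunk alignment constant
(CHUNK_ALIGN is a hypothetical name):

    #include <cstdint>

    constexpr uint64_t CHUNK_ALIGN = 4096;

    // Round the padding length up to the 4k boundary the encode/decode paths
    // expect, so one correctly sized buffer is appended rather than two.
    uint64_t padded_length(uint64_t len)
    {
      return (len + CHUNK_ALIGN - 1) & ~(CHUNK_ALIGN - 1);
    }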

Fix 6: Do not invent encode buffers before doing decode.

In this bug, during recovery, we could potentially be creating
unwanted encode buffers and using them to decode data buffers.

This fix simply removes the bad code, as there is new code above
which is already doing the correct action.

Fix 7: Fix miscompare with missing decodes.

In this case, two OSDs failed at once. One was replaced and the other was not.

This caused us to attempt to encode a missing shard while another shard was missing, which
caused a miscompare because the recovery failed to do the decode properly before doing an encode.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit c116b8615d68a3926dc78a4965cc0a28ff85d4f2)

2 weeks agoosd: Optimized EC pools bug fix when repeating GetLog
Bill Scales [Wed, 18 Jun 2025 11:11:51 +0000 (12:11 +0100)]
osd: Optimized EC pools bug fix when repeating GetLog

When the primary shard of an optimized EC pool does not have
a copy of the log it may need to repeat the GetLog peering
step twice, the first time to get a full copy of a log from
a shard that sees all log entries and then a second time
to get the "best" log from a nonprimary shard which may
have a partial log due to partial writes.

A side effect of repeating GetLog is that the missing
list is collected for both the "best" shard and the
shard that provides a full copy of the log. This latter
missing list confuses later steps in the peering
process and may cause this shard to complete writes
and end up diverging from the primary. Discarding
this missing list causes Peering to behave the same as if
the GetLog step did not need to be repeated.
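
A minimal sketch of the discard, with hypothetical names (MissingSetSketch,
peer_missing keyed by shard id) rather than the real peering state:

    #include <map>
    #include <set>
    #include <string>

    using shard_id = int;  // stand-in for the real shard id type

    struct MissingSetSketch {
      std::set<std::string> objects;
    };

    // Keep only the missing list from the "best" shard; the one gathered from
    // the shard that merely supplied a full log copy is dropped, so peering
    // proceeds as if GetLog had only run once.
    void discard_full_log_source_missing(
        std::map<shard_id, MissingSetSketch>& peer_missing,
        shard_id full_log_source,
        shard_id best_shard)
    {
      if (full_log_source != best_shard) {
        peer_missing.erase(full_log_source);
      }
    }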

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d2ba0f932c61746b51d0d427056a53e24db6ea0f)

2 weeks agoosd: Fix attribute recover in rare recovery scenario
Alex Ainscow [Wed, 11 Jun 2025 15:30:40 +0000 (16:30 +0100)]
osd: Fix attribute recover in rare recovery scenario

When recovering attributes, we read them from the first potential primary, then,
if that read fails, attempt to read from another potential primary.

The problem is that the code which calculates which shards to read for a recovery
only takes into account *data* and not where the attributes are.  As such, if the
second read only required a non-primary, then the attribute read fails and the
OSD panics.

The fix is to detect this scenario and perform an empty read to that shard, which
the attribute-read code can use for attribute reads.
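
A minimal sketch of the idea, with hypothetical names (ReadRequestSketch,
ensure_attr_read) rather than the actual recovery read plan code:

    #include <cstdint>
    #include <map>

    using shard_id = int;  // stand-in for the real shard id type

    struct ReadRequestSketch {
      uint64_t off = 0;
      uint64_t len = 0;        // zero length: no data, attributes only
      bool want_attrs = false;
    };

    // If no data read was planned for the shard that must supply the attributes,
    // queue an empty read so the attribute-read path has a request to use.
    void ensure_attr_read(std::map<shard_id, ReadRequestSketch>& reads,
                          shard_id attr_source)
    {
      auto it = reads.try_emplace(attr_source).first;  // empty read if absent
      it->second.want_attrs = true;
    }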

Code was incorrectly interpreting a failed attribute read on recovery as
meaning a "fast_read". Also, no attribute recovery would occur in this case.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 98eae78f7629295800cb7dbb252cac7d0feff680)

2 weeks agoosd: code clean up and debug in optimised EC
Alex Ainscow [Wed, 11 Jun 2025 15:23:08 +0000 (16:23 +0100)]
osd: code clean up and debug in optimised EC

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 1352868cec8d644e5fff68df7050c52bb4ed7e65)