osd/scrub: a single counters selection mechanism - step 1
Following the preceding PR, the Scrubber now employs
two methods for selecting the specific subset of performance
counters to update (the replicated pool set or the EC one).
The first method uses labeled counters, with four label combinations
(Primary/Replica x Replicated/EC pool). The second method
names the specific OSD counters to use in ScrubIoCounterSet
objects, then selects the appropriate set based on the pool type.
This commit is the first step toward unifying the two
methods: discarding the use of labeled counters and relying
only on named OSD counters.
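As a rough illustration of the named-counter approach (the struct
layout, field names, and selection helper below are assumptions made
for the sketch, not the actual ScrubIoCounterSet definition):

    // Illustrative sketch only: a named set of OSD perf counters per
    // pool type, selected once per scrub based on the pool the PG
    // belongs to. Field names and the selection helper are assumptions.
    struct ScrubIoCounterSet {
      int read_bytes;   // index of the "bytes read by scrub" counter
      int read_ops;     // index of the "read ops issued by scrub" counter
    };

    // one set for replicated pools, one for EC pools
    const ScrubIoCounterSet replicated_set{/*read_bytes=*/1, /*read_ops=*/2};
    const ScrubIoCounterSet ec_set{/*read_bytes=*/3, /*read_ops=*/4};

    const ScrubIoCounterSet& select_counters(bool pool_is_ec)
    {
      // no labels involved: the caller picks the set matching the pool
      return pool_is_ec ? ec_set : replicated_set;
    }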
osd/scrub: perf-counters for I/O performed by the scrubber
Define two sets of performance counters to track I/O performed
by the scrubber: one set used when scrubbing a PG in a
replicated pool, and one for EC PGs.
The ceph_ll_io_info structure has recently been extended to support
zerocopy operations. The proxy was initializing only the known members,
so after zerocopy support was added it was passing garbage in some
fields, causing failures.
This patch clears the whole structure so that every field is
initialized to its default value.
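The fix follows the usual defensive pattern for structures that keep
growing new members over time; a minimal sketch, with a stand-in struct
rather than the real ceph_ll_io_info:

    #include <cstring>

    // Stand-in for a structure that grows new members over time;
    // the real ceph_ll_io_info has more fields than shown here.
    struct io_info {
      void *buf;
      long long off;
      int flags;
      // ...members added later (e.g. zerocopy-related) also start zeroed
    };

    void submit_io(void *buf, long long off)
    {
      io_info info{};   // value-initialization zeroes every member,
                        // including ones this code does not know about
      // std::memset(&info, 0, sizeof(info));  // equivalent C-style form
      info.buf = buf;
      info.off = off;
      // ...hand off to the I/O layer
    }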
osd/scrub: additional configuration params to trigger scrub reschedule
Adding the following parameters to the (small) set of configuration
options that, if changed, trigger re-computation of the next scrub
schedule:
- osd_scrub_interval_randomize_ratio,
- osd_deep_scrub_interval_cv, and
- osd_deep_scrub_interval (which was missing from the list of
parameters watched by the OSD).
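A hedged sketch of the trigger check: the config-observer callback
receives the set of changed keys and recomputes the scrub schedule when
any watched option is among them. Only the option names come from the
list above; the helper name is illustrative:

    #include <set>
    #include <string>

    // Illustrative only: report whether any schedule-relevant option
    // changed. The surrounding function name is an assumption, not the
    // actual OSD code.
    bool scrub_schedule_affected(const std::set<std::string>& changed)
    {
      static const char* watched[] = {
        "osd_scrub_interval_randomize_ratio",
        "osd_deep_scrub_interval_cv",
        "osd_deep_scrub_interval",
      };
      for (const char* opt : watched) {
        if (changed.count(opt)) {
          return true;   // caller would recompute the next scrub schedule
        }
      }
      return false;
    }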
Bill Scales [Mon, 31 Mar 2025 08:17:35 +0000 (09:17 +0100)]
test: Add unittests for pgtemp_primaryfirst/pgtemp_undo_primaryfirst
Add unittests for pgtemp_primaryfirst and pgtemp_undo_primaryfirst
to prove the latter is a reverse transform and that neither has any
effect until an optimized EC pool configures non-primary shards.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Bill Scales [Thu, 6 Mar 2025 12:20:52 +0000 (12:20 +0000)]
osd: Restrict choice of primary shard for ec_optimizations pools
Pools with ec_optimizations enabled have restrictions on which
shards are permitted to become the primary because not all shards
are updated for every I/O.
To preserve backwards compatibility with downlevel clients,
pg_temp is used as the method to override the selection of
the primary by OSDMap. Directly changing the logic in OSDMap
would have meant that all clients needed to be upgraded to
tentacle before optimized EC pools could be used, so this
approach was discounted.
Using primary_temp to set the primary for an EC pool is
not reliable because under error conditions an OSD can store
multiple shards for the same PG and primary_temp cannot
define which of these shards will be chosen.
For optimized EC pools pg_temp is shuffled so that the
non-primary shards are listed last. This means that the
existing logic in OSDMap that picks the first available
shard as the primary will avoid selecting a non-primary
shard. OSDMonitor applies the shuffle when pg_temp is set; it is
then reverted in PeeringState when initializing the
acting set after OSDMap has selected the primary.
PeeringState::choose_acting is modified to set pg_temp if
OSDMap has selected a non-primary shard; this will cause
a new OSDMap to be published, which will persuade
OSDMap to select a primary shard instead.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
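A hedged sketch of the shuffle and its inverse on a simplified model,
where pg_temp is a list of (shard, osd) pairs in shard order; the names
and types are illustrative, not the actual pgtemp_primaryfirst /
pgtemp_undo_primaryfirst signatures:

    #include <algorithm>
    #include <cassert>
    #include <utility>
    #include <vector>

    // Simplified model: pg_temp as (shard, osd) pairs, initially in
    // shard order.
    using entry = std::pair<int, int>;          // {shard id, osd id}
    using pg_temp_t = std::vector<entry>;

    // Move non-primary shards to the back so OSDMap's "first available"
    // rule cannot pick one of them as primary. Stable: relative order
    // of the remaining shards is preserved.
    pg_temp_t primaryfirst(pg_temp_t v, const std::vector<bool>& nonprimary)
    {
      std::stable_partition(v.begin(), v.end(),
        [&](const entry& e) { return !nonprimary[e.first]; });
      return v;
    }

    // Reverse the transform by restoring shard order (the original layout).
    pg_temp_t undo_primaryfirst(pg_temp_t v)
    {
      std::sort(v.begin(), v.end(),
        [](const entry& a, const entry& b) { return a.first < b.first; });
      return v;
    }

    int main()
    {
      pg_temp_t pgtemp = {{0, 7}, {1, 3}, {2, 5}, {3, 9}};
      std::vector<bool> nonprimary = {false, true, false, true};

      pg_temp_t shuffled = primaryfirst(pgtemp, nonprimary);
      // shards 0 and 2 now come first; 1 and 3 are listed last
      assert(undo_primaryfirst(shuffled) == pgtemp);   // round-trips
      // with no non-primary shards configured, the shuffle is a no-op
      assert(primaryfirst(pgtemp, {false, false, false, false}) == pgtemp);
    }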
cls/rbd: drop overzealous CLS_ERR message in mirror_remote_namespace_get()
Currently it unnecessarily floods the log of the OSD that hosts
the rbd_mirroring object with "No such file or directory" errors. Just
drop it as read_key() already logs all errors except ENOENT.
Credit to N Balachandran <nibalach@redhat.com> for spotting this.
It looks like at some point the centos9 image started shipping with
curl-minimal, which conflicts with the regular curl package. Asking dnf to find
the binary avoids this, since both packages provide it. Since we were already
doing this with rpmbuild, we can go ahead and fold wget into that as
well, in case something similar happens there.
Zack Cerza [Fri, 7 Mar 2025 20:53:23 +0000 (13:53 -0700)]
make-debs.sh: Optionally take debian version
Our existing CI builds have names like:
ceph-base_20.0.0-194-g6efaea33-1jammy_amd64.deb
Before this change, the packages produced by make-debs.sh are named like:
ceph-base_20.0.0-158-gb0de3a42-1_amd64.deb
This way we can pass e.g. "jammy" to end up with names compatible with our CI
builds.
Ville Ojamo [Thu, 10 Apr 2025 08:09:11 +0000 (15:09 +0700)]
doc/ceph-volume: Promptify commands and fix formatting
Use the more modern prompt block for CLI
commands, and fix a missing newline and messed-up
line breaks.
Also change existing prompts so that they are all
indented with the same number of spaces.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Adam Kupczyk [Tue, 1 Apr 2025 14:01:23 +0000 (14:01 +0000)]
os/bluestore/bluefs: Fix race condition between truncate() and unlink()
It was possible for unlink() to interrupt an ongoing truncate().
As a result, unlink() finishes properly, but truncate() is not aware
of it and:
1) updates a file that has already been removed
2) releases the same allocations again
Now fixed by checking whether the file has been deleted under the FILE lock.
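A minimal sketch of the fix pattern (the type and member names are
stand-ins, not the actual BlueFS code): truncate() re-checks a deleted
flag after taking the per-file lock and bails out if unlink() has
already removed the file:

    #include <mutex>

    // Stand-in for a BlueFS file node; names are illustrative.
    struct FileRef {
      std::mutex file_lock;   // the per-file ("FILE") lock
      bool deleted = false;   // set by unlink() once the file is gone
      // ...extents, size, etc.
    };

    void do_unlink(FileRef& f)
    {
      std::lock_guard<std::mutex> l(f.file_lock);
      f.deleted = true;
      // ...release allocations, drop the directory entry
    }

    void do_truncate(FileRef& f /*, new size */)
    {
      std::lock_guard<std::mutex> l(f.file_lock);
      if (f.deleted) {
        // unlink() won the race: the file and its allocations are
        // already gone, so updating it or releasing the same extents
        // again would be wrong.
        return;
      }
      // ...safe to update metadata and release trailing allocations
    }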
Casey Bodley [Thu, 7 Nov 2024 20:36:45 +0000 (15:36 -0500)]
rgw/rados: add concurrent io algorithms for sharded data
cls/rgw provides the base class CLSRGWConcurrentIO as a swiss army knife
for bucket index operations that visit every shard object. while it uses
asynchronous librados requests to perform the io, it blocks on a
condition variable when waiting for the AioCompletions.
for use in coroutines, we need a version of this that suspends instead
of blocking. and to support both stackful and stackless coroutines, we
want a fully-generic async interface templated on CompletionToken.
while the CLSRGWConcurrentIO algorithm works for all current uses
(reads and writes, with/without retries, with/without cleanup), i chose
to break this into 3 algorithms with well-defined semantics:
1. reads: to produce a successful result, all shard operations must
succeed. so any shard's failure causes the rest to be cancelled or
skipped. supports retries for ListBucket (RGWBIAdvanceAndRetryError).
2. writes: even if some shards fail, we still want to visit every shard
before returning the error. supports retries for log trimming
operations (repeat until ENODATA).
3. revertible writes: similar to reads, requires all shard operations to
succeed. on any failure, the side effects of any successful writes
must be reverted before returning. only used by IndexInit (any created
shards are removed on failure).
each algorithm provides a pure virtual base class that must be
implemented for each type of operation, similar to how existing
operations inherit from CLSRGWConcurrentIO.
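A hedged sketch of the three completion semantics, using a synchronous
stand-in; the real algorithms are asynchronous and templated on
CompletionToken, and the class/function names below are illustrative
only:

    #include <vector>

    // Illustrative per-shard operation; the real base classes issue
    // async librados requests and support retry/cleanup hooks.
    struct ShardOp {
      virtual ~ShardOp() = default;
      virtual int run(int shard) = 0;        // 0 on success, -errno on failure
      virtual void revert(int /*shard*/) {}  // undo a successful write
    };

    // 1. reads: all shards must succeed; stop at the first failure.
    int run_reads(ShardOp& op, int num_shards)
    {
      for (int shard = 0; shard < num_shards; ++shard) {
        if (int r = op.run(shard); r < 0) {
          return r;                 // remaining shards are skipped/cancelled
        }
      }
      return 0;
    }

    // 2. writes: visit every shard even after a failure; return the
    // first error seen.
    int run_writes(ShardOp& op, int num_shards)
    {
      int result = 0;
      for (int shard = 0; shard < num_shards; ++shard) {
        if (int r = op.run(shard); r < 0 && result == 0) {
          result = r;
        }
      }
      return result;
    }

    // 3. revertible writes: like reads, but undo completed shards on
    // failure (e.g. IndexInit removing any shards it created).
    int run_revertible_writes(ShardOp& op, int num_shards)
    {
      for (int shard = 0; shard < num_shards; ++shard) {
        if (int r = op.run(shard); r < 0) {
          for (int done = shard - 1; done >= 0; --done) {
            op.revert(done);
          }
          return r;
        }
      }
      return 0;
    }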
mgr/dashboard: Fix empty ceph version in GET api/hosts
Fixes https://tracker.ceph.com/issues/70821
Due to the pagination, the host list is being fetched from the orchestrator, which caused a regression: in the orchestrator list the ceph version is always marked empty.
Caused by https://github.com/ceph/ceph/pull/52154
Also fixed tests, as the new version addition was causing the whole json object mock to fail in tests.
test/librbd/test_notify.py: drop RBD_DISABLE_UPDATE_FEATURES
This was put in place in commit 9c0b239d70cd ("qa/upgrade:
conditionally disable update_features tests") to paper over a backwards
compatibility issue that arose from commit 01ff1530544c ("librbd: make
all maintenance op notifications async"). It's not needed in squid or
later because upgrades from octopus are tested only until reef.
test/librbd/test_notify.py: force line-buffered output
"master" and "slave" invocations are intended to run in parallel and
coordinate between themselves. Ensure that their respective output is
properly timestamped and ordered in the teuthology.log file.