src/mon/PGMap.cc: check unfound objects in `get_unavailable_pg_in_pool_map`
If a pool has any PG with unfound objects, we should consider
it unavailable for the availability score. If a PG has unfound
objects, it will be recorded in PGMap.
In `get_unavailable_pg_in_pool_map`, if a PG has unfound objects,
we add it to `pool_pg_unavailable_map`.
When we update the `pg_stat` we don't
check whether the pg state is in `stale`.
Therefore, the attribute `last_unstale`
will always get updated even if the pg
state actually contains `stale`.
Solution:
Place a condition to only update
the attribute `last_unstale` when
the pg truly doesn't have `stale`
in its state.
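A minimal sketch of the guard described above. The struct, the state bits and the function name are hypothetical stand-ins for illustration, not the actual `pg_stat_t` code:

```cpp
#include <cstdint>

// Hypothetical stand-ins; the real state flags and pg_stat_t are defined in Ceph.
constexpr uint64_t STATE_ACTIVE = 1ull << 0;
constexpr uint64_t STATE_STALE  = 1ull << 1;

struct PgStatSketch {
  uint64_t state = 0;
  uint64_t last_unstale = 0;  // timestamp, simplified to an integer
};

// Refresh last_unstale only when the stale bit is absent from the state.
inline void update_last_unstale(PgStatSketch& s, uint64_t now) {
  if ((s.state & STATE_STALE) == 0) {
    s.last_unstale = now;
  }
}
```

With this guard, a pg whose state contains `stale` keeps its previous `last_unstale` value.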
Kamoltat [Thu, 26 Oct 2023 19:08:37 +0000 (19:08 +0000)]
src/mon/PGMap.cc: init pool_availability
Added PoolAvailability Struct
Modified PGMap.cc to include a k,v map:
`pool_availability`.
The key being the `poolid` and value
is `PoolAvailability`
Init the function:
`PGMap::get_unavailable_pg_in_pool_map()`
to identify and aggregate all the PGs we
mark as `unavailable` as well as the pool
that associates with the unavailable PG.
Also, included `pool_availability`
to `PGMapDigest::dump()`.
Adam Kupczyk [Mon, 10 Jun 2024 16:03:24 +0000 (16:03 +0000)]
os/bluestore: Add admin socket commands to inspect onode metadata
Add admin socket commands:
1) bluestore collections
Lists collections.
2) bluestore list <coll> [start object] [max count]
Lists collection coll starting from object (optional). Default 100 entries. 0 = unlimited.
3) bluestore onode metadata <object>
Prints onode metadata as seen by BlueStore.
It might happen (usually in tests) that 2 BlueStore instances are created at the same time.
Since admin commands are unique, it fails to register.
Use first register to detect whether we can register at all.
Adam Kupczyk [Tue, 8 Apr 2025 08:36:21 +0000 (08:36 +0000)]
os/bluestore: Add do_write_v2_compressed()
Modify do_write_v2() to branch into do_write_v2_compressed().
Segmented and regular cases are recognized and handled properly.
New do_write_v2_compressed() oversees compression / recompression.
Make one Estimator per Collection.
This makes it possible for the estimator to learn collection-specific compressibility.
In write_v2_compressed use compressor already selected in choose_write_options.
Make Collection create Estimator on first use.
Adam Kupczyk [Tue, 8 Apr 2025 11:03:22 +0000 (11:03 +0000)]
os/bluestore/compression: Main part of recompression feature
Add a recompression scanner that looks around the write region to see how much
would be gained if we read some of the surrounding data and wrote it as well.
Added Compression.h / Compression.cc.
Added debug_bluestore_compression dout.
Created Scanner class.
Provides write_lookaround() for scanning loaded extents.
Adam Kupczyk [Wed, 9 Apr 2025 16:03:52 +0000 (16:03 +0000)]
os/bluestore/compression: Estimator class
Add CMake rules to compile.
Add bluestore_compression dout subsys.
Created Estimator class.
It is used by Scanner to decide whether a specific extent should be recompressed.
Prepare for a future machine learning / adaptive algorithm for estimation.
So far the logic of Estimator is relatively simple:
it learns expected recompression values and uses them to predict in subsequent iterations.
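One simple way such learn-then-predict behaviour could look is an exponential moving average of the observed compression ratio. This is an illustrative assumption, not the actual Estimator logic; the class name, the 0.5 prior and the `alpha` parameter are all hypothetical:

```cpp
#include <cstddef>

// Hypothetical sketch: tracks an expected compressed/raw ratio.
class EstimatorSketch {
  double expected_ratio_ = 0.5;  // assumed prior before anything is learned
  double alpha_;                 // weight given to the newest observation
public:
  explicit EstimatorSketch(double alpha = 0.25) : alpha_(alpha) {}

  // Fold one observed (raw, compressed) result into the expectation.
  void learn(std::size_t raw, std::size_t compressed) {
    if (raw == 0) return;
    const double observed =
        static_cast<double>(compressed) / static_cast<double>(raw);
    expected_ratio_ = (1.0 - alpha_) * expected_ratio_ + alpha_ * observed;
  }

  // Predict the compressed size of a future write of `raw` bytes.
  std::size_t predict(std::size_t raw) const {
    return static_cast<std::size_t>(expected_ratio_ * static_cast<double>(raw));
  }

  double expected_ratio() const { return expected_ratio_; }
};
```

Keeping one such object per collection would let each collection converge on its own ratio.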
Max Kellermann [Thu, 24 Apr 2025 05:17:48 +0000 (07:17 +0200)]
mds/Locker: use ceph_abort_msg() instead of ceph_assert()
This ceph_assert() always fails, but depending on the configuration
value `ceph_assert_supresssions`, execution may continue, but the
`dir` variable is left uninitialized. This leads to a compiler
warning:
/home/jenkins-build/build/workspace/ceph-api/src/mds/Locker.cc:451:22: error: variable 'dir' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
clang then suggests to nullptr-initialize the variable:
/home/jenkins-build/build/workspace/ceph-api/src/mds/Locker.cc:447:11: note: initialize the variable 'dir' to silence this warning
447 | CDir *dir;
| ^
| = nullptr
This, however, is a very bad idea because all this does is suppress
the warning; it still crashes the process.
Since there's no recovery from this problem, let's switch to
ceph_abort_msg() which is [[noreturn]] and the compiler can deduce
that `dir` is always initialized when it's used.
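The effect of [[noreturn]] on this analysis can be sketched in isolation. `abort_msg_sketch()` is a hypothetical stand-in for ceph_abort_msg(), and `Dir` / `dir_id_for()` are made up for illustration:

```cpp
#include <cstdio>
#include <cstdlib>

struct Dir { int id; };

// Hypothetical stand-in for ceph_abort_msg(); the attribute is the point.
[[noreturn]] inline void abort_msg_sketch(const char* msg) {
  std::fprintf(stderr, "abort: %s\n", msg);
  std::abort();
}

inline int dir_id_for(int kind, Dir* a, Dir* b) {
  Dir* dir;  // deliberately left without an initializer
  if (kind == 0) {
    dir = a;
  } else if (kind == 1) {
    dir = b;
  } else {
    // Control never leaves this branch, so every path that reaches
    // the return statement below has assigned `dir`.
    abort_msg_sketch("unexpected kind");
  }
  return dir->id;
}
```

If the last branch called a function that could return, the compiler would have to assume `dir` may be read uninitialized.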
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Merge pull request #62693 from ronen-fr/wip-rf-iocnt
osd/scrub: performance counters for I/O performed by the scrubber
Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com>
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
N Balachandran [Mon, 21 Apr 2025 11:34:08 +0000 (17:04 +0530)]
rbd: display correct mirror state when creating
The mirror image state is set to MIRROR_IMAGE_STATE_CREATING
when the image is first created on the secondary, but was displayed
as "unknown" by the rbd info command. This has been fixed.
Fixes: https://tracker.ceph.com/issues/70963
Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
HealthMonitor: Add topology-aware netsplit detection and warning
Problem:
Currently, Ceph cannot detect and report network partitions (netsplits)
between monitors in different topology locations in a consolidated way.
While stretch mode can handle partitions through monitor elections,
users lack visibility into the topology-level view of network
disconnections, making troubleshooting difficult.
Solution:
This implementation adds a hierarchical netsplit detection mechanism that:
- Uses DirectedGraph structure for netsplit detection
- Maps monitor disconnections to relevant CRUSH topology levels
- Aggregates individual disconnections into location-level reports when appropriate
- Detects complete location-level netsplits when ALL monitors between locations
cannot communicate
- Reports specific topology locations experiencing complete communication failures
- Falls back to individual monitor-level reporting for partial disconnections
- Handles monitors with missing location data gracefully
- Leverages HealthMonitor::check_for_mon_down to receive a set of down monitors,
efficiently avoiding false netsplit reports for monitors already known to be down
- Implements smart filtering that correctly excludes down monitors from location-based
analysis, ensuring accurate netsplit reporting at both individual and topology levels
The implementation produces user-friendly health warnings:
1. For complete location netsplits: "Netsplit detected between dc1 and dc2"
2. For individual monitor disconnections: "Netsplit detected between mon.a and mon.d"
Performance considerations:
- Time complexity: O(m²) where m is the number of monitors
- Space complexity: O(m²) for connection tracking
- Practical impact is minimal as monitor count is typically small (3-7)
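The aggregation rule described above (a location-level report only when ALL live monitor pairs between two locations are disconnected, individual reports otherwise) can be sketched as follows. This is a simplified illustration that uses a plain set of pairs rather than the DirectedGraph structure; the function and type names are hypothetical:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using MonPair = std::pair<std::string, std::string>;

inline MonPair ordered(const std::string& a, const std::string& b) {
  return a < b ? MonPair{a, b} : MonPair{b, a};
}

// mon_location maps monitor name -> location ("" or missing = unknown).
inline std::vector<std::string> netsplit_reports(
    const std::map<std::string, std::string>& mon_location,
    const std::set<MonPair>& disconnected,
    const std::set<std::string>& down) {
  auto location_of = [&](const std::string& mon) {
    auto it = mon_location.find(mon);
    return it == mon_location.end() ? std::string() : it->second;
  };

  // Live (not down) monitors per known location.
  std::map<std::string, std::set<std::string>> live;
  for (const auto& [mon, loc] : mon_location) {
    if (!loc.empty() && down.count(mon) == 0) live[loc].insert(mon);
  }

  std::map<MonPair, std::set<MonPair>> by_location;
  std::set<MonPair> individual;
  for (const auto& p : disconnected) {
    // Monitors already known to be down never count as a netsplit.
    if (down.count(p.first) || down.count(p.second)) continue;
    const MonPair mons = ordered(p.first, p.second);
    const std::string la = location_of(mons.first);
    const std::string lb = location_of(mons.second);
    if (la.empty() || lb.empty() || la == lb) {
      individual.insert(mons);  // no usable topology information
    } else {
      by_location[ordered(la, lb)].insert(mons);
    }
  }

  std::vector<std::string> out;
  for (const auto& [locs, pairs] : by_location) {
    const std::size_t all =
        live[locs.first].size() * live[locs.second].size();
    if (pairs.size() == all) {
      out.push_back("Netsplit detected between " + locs.first + " and " +
                    locs.second);
    } else {
      individual.insert(pairs.begin(), pairs.end());  // partial: fall back
    }
  }
  for (const auto& m : individual) {
    out.push_back("Netsplit detected between mon." + m.first + " and mon." +
                  m.second);
  }
  return out;
}
```

Each disconnected pair is visited once, so the work stays within the O(m²) bound stated above.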
librbd: disallow "rbd trash mv" if image is in a group
Removing an image that is a member of a group has always been
disallowed. However, moving an image that is a member of a group to
trash is currently allowed and this is deceptive -- the only reason for
a user to move an image to trash should be the intent to remove it.
More importantly, group APIs operate in terms of image names -- there
are no corresponding variants that would operate in terms of image IDs.
For example, even though internally GroupImageSpec struct stores an
image ID, the public rbd_group_image_info_t struct insists on an image
name. When rbd_group_image_list() encounters a trashed member image
(i.e. one that doesn't have a name), it just fails with ENOENT and no
listing gets produced at all until the offending image is restored from
trash. Something like this can be very hard to debug for an average
user, so let's make rbd_trash_move() fail with EMLINK the same way as
rbd_remove() does in this scenario.
The one case where moving a member image to trash makes sense is live
migration where the source image gets trashed to be almost immediately
replaced by the destination image as part of preparing migration.
EMLINK is returned by rbd_remove() if the image is a member of a group.
Add a dedicated exception similar to ImageBusy or ImageHasSnapshots and
a test for it.
This came up during: https://tracker.ceph.com/issues/69406#note-25
Where an "assert_all" was called but didn't cause an abort.
Added "ignore_assert_all" to showcase this scenario along with any
other case where we are expected to abort.
The tests could be used to verify errorator's aborting behavior.
We should also invoke the errfunc (which aborts) when the
return type is no_touch_error_marker.
Added comments explaining:
* why it's forbidden to return void
* why std::is_same_v<return_t, no_touch_error_marker> is checked
Ville Ojamo [Tue, 22 Apr 2025 12:09:23 +0000 (19:09 +0700)]
doc/radosgw: Fix indentation in admin.rst
Indent the CLI command continuation lines correctly to start at the same
position as the other such commands, add one space on 2 lines.
Introduced in #62877.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Alex Ainscow [Mon, 7 Apr 2025 08:20:44 +0000 (09:20 +0100)]
osd: Install stub extent cache in OSD.
The extent cache in new EC is a per-OSD-shard cache which caches
reads used by read-modify-write to improve performance of sequential
IO. We want to provide a single PR with all of EC in it, so this
PR provides a non-functional stub to allow all the non-EC code to
be installed.
Alex Ainscow [Thu, 27 Mar 2025 15:37:57 +0000 (15:37 +0000)]
osd: New options for configuring new EC
Adding three new configuration options which will apply once new EC
is in place:
osd_pool_default_flag_ec_optimizations
This allows EC optimizations to be turned on by default.
ec_extent_cache_size
This allows the user to specify the size of the per-shard extent cache if
they feel that the default 10MiB is too large or too small.
The default value may well change following more extensive testing.
ec_pd_write_mode
This is a development flag for testing the parity delta write RMW mechanism
within the EC code. Setting to anything other than 0 will cause performance
problems. It is provided as a test mechanism for performance and
teuthology. Performance may wish to turn off all PDW writes for a particular
IO pattern. This will allow us to determine if the automatic mode should be
using conventional RMW writes. The force-on mode allows testing on more
unusual scenarios and on smaller configurations.
Finally, we tweak the way optimisations are enabled, so as to be common between
enabling and default-enabled.
Alex Ainscow [Thu, 27 Mar 2025 14:38:44 +0000 (14:38 +0000)]
crimson/osd: Add scrub stubs for crimson and classic, ready for new EC
The new optimised EC code is not backward compatible with old EC code.
Before this commit there is some stub code which assumes that an hinfo
xattr will exist and can be used for scrub. This is no longer the case in new EC.
We plan to first make the scrubbing changes for new EC in classic and will
subsequently port to crimson. It will not look like the code here, so there is
little point in keeping it.
Additionally, add some stubs for scrub in classic optimized EC.
There will be a later PR specifically for dealing with scrubbing in
new EC, which will fix all the FIXMEs in classic.
The crimson code will be fixed up at a later date and will only
support optimised EC.
Alex Ainscow [Fri, 28 Mar 2025 13:46:30 +0000 (13:46 +0000)]
common: Generalise to_interval_set to allow more interval_set implementations.
This generalises to_interval_set so that the interval set does not need to share a
common internal map structure with interval_map. The implementation is achieved
through iteration, so there is no requirement for the old restriction.
Alex Ainscow [Thu, 27 Mar 2025 11:44:29 +0000 (11:44 +0000)]
common: bitset_set
This bitset_set change relaxes policing of bitset_set, so that
out-of-range can be queried in the contains interface. This means
that callers can simplify calls. For example:
if ((key == invalid) || !set.contains(key)) {
  do_stuff
}
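A minimal sketch of the relaxed behaviour, built on std::bitset. `SmallSetSketch` is a hypothetical stand-in and not the real bitset_set interface:

```cpp
#include <bitset>
#include <cstddef>

template <std::size_t N>
class SmallSetSketch {
  std::bitset<N> bits_;
public:
  // std::bitset::set throws std::out_of_range for key >= N,
  // so inserts are still policed in this sketch.
  void insert(std::size_t key) { bits_.set(key); }

  // Out-of-range keys are simply reported as absent.
  bool contains(std::size_t key) const {
    return key < N && bits_.test(key);
  }
};
```

With a query like this, a caller whose `invalid` key lies outside the valid range could drop the separate `key == invalid` comparison and rely on `!set.contains(key)` alone.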
Bill Scales [Wed, 26 Mar 2025 13:43:43 +0000 (13:43 +0000)]
osd: EC Optimizations: proc_master_log changes for partial logs
proc_master_log is part of the peering process that merges
the authoritative log (in the case of EC pools the log of the
shard missing the most updates) into the primary log.
When there are partial writes it is likely that the
authoritative log is behind because of partial writes that
did not update that shard. proc_master_log works out where
the logs diverge and then studies each additional log entry
to see if all the updates made in that log entry have been
applied. If any shard is missing an update then that log
entry (and all subsequent entries) need to be rolled back,
otherwise the entry can be rolled forward and included in
the authoritative log.
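The roll-forward / roll-back decision described above can be sketched as follows. This is a heavily simplified illustration: versions are plain integers, each shard is summarised by the newest version it has applied, and all names are hypothetical rather than taken from proc_master_log:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct LogEntrySketch {
  uint64_t version;
  std::set<int> written_shards;  // shards this (possibly partial) write touched
};

// Returns the newest version that can be rolled forward.
// Entries newer than the returned version must be rolled back.
inline uint64_t roll_forward_point(
    uint64_t auth_head,                           // where the logs diverge
    const std::vector<LogEntrySketch>& extra,     // newer entries, oldest first
    const std::map<int, uint64_t>& shard_applied) // shard -> newest applied
{
  uint64_t head = auth_head;
  for (const auto& e : extra) {
    for (int shard : e.written_shards) {
      auto it = shard_applied.find(shard);
      if (it == shard_applied.end() || it->second < e.version) {
        // A shard is missing this update: this entry and all
        // subsequent entries are rolled back.
        return head;
      }
    }
    head = e.version;  // every touched shard has it: roll forward
  }
  return head;
}
```

For example, with entries 11 (shards 0,1), 12 (shard 2) and 13 (shards 0,2), where shard 0 has applied only up to 11, entries 11 and 12 roll forward and 13 is rolled back.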
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>