Dan Mick [Tue, 7 Oct 2025 03:50:17 +0000 (20:50 -0700)]
make-debs.sh: use ID instead of NAME for workdir
NAME is "for the user", and as such, Debian's is "Debian GNU/Linux",
which isn't friendly for making a pathname. ID is more like what
we want (lowercase, no spaces, limited special characters),
in the two cases we care most about, 'ubuntu' and 'debian'.
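As a rough illustration of the NAME/ID difference (make-debs.sh itself is a shell script; this is only a Python sketch), the snippet below parses os-release style text and picks the field that is safe to use in a path. The function names and the sample text are hypothetical.

```python
import shlex


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex strips the optional shell-style quotes around the value.
        parts = shlex.split(raw)
        fields[key] = parts[0] if parts else ""
    return fields


def workdir_component(fields):
    """Use ID (lowercase, no spaces) rather than the user-facing NAME."""
    return fields["ID"]


# Hypothetical sample input, modelled on a Debian os-release file.
SAMPLE = 'NAME="Debian GNU/Linux"\nID=debian\n'
info = parse_os_release(SAMPLE)
```

With the sample above, `info["NAME"]` contains a space and a slash, while `workdir_component(info)` yields a plain lowercase word.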
Kyr Shatskyy [Thu, 30 Oct 2025 11:58:05 +0000 (12:58 +0100)]
qa/cephfs: lua to respect missing kernel in yaml
When teuthology-suite is called with the '-k none' option, which is valid,
no kernel record is created in the job config.
However, in some test cases the lua premerge dies with an exception:
KeyError: 'kernel'
and when branch is not set for '-k none' and kernel client is
overridden:
KeyError: 'branch'
so teuthology-suite quits unexpectedly without scheduling any jobs.
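The actual fix lives in the suite's lua fragment. As a Python sketch of the same defensive pattern, the helper below tolerates a job config with no 'kernel' record and a kernel record with no 'branch'. The helper name and the dict layout are assumptions made for illustration.

```python
def kernel_branch(job_config, default=None):
    """Return the kernel branch from a job config, or `default` if absent.

    Assumes the config is a plain dict that may lack 'kernel' entirely
    (as with '-k none') or may carry a kernel record without 'branch'.
    """
    kernel = job_config.get("kernel") or {}
    return kernel.get("branch", default)
```

Using `.get()` at both levels avoids both KeyError cases quoted above, so the caller can decide what a missing branch means instead of aborting the scheduling run.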
Patrick Donnelly [Fri, 22 May 2026 16:14:15 +0000 (12:14 -0400)]
qa/suites/upgrade: add "OBJECT_UNFOUND" to ignorelists
The thrashing in the upgrade tests has been configured to be very aggressive;
the tests are permitted to stop up to 4 of the 8 OSDs, so health warnings of
this kind are expected.
This commit also cleans up some expected filesystem and pg peering warnings
in the upgrade tests.
Patrick Donnelly [Wed, 22 Apr 2026 17:50:39 +0000 (13:50 -0400)]
Merge PR #68116 into squid
* refs/pull/68116/head:
ceph-volume: single lvs call to speed up exclude_lvm_osd_devices
ceph-volume: avoid Device() instantiation in lvm OSD filtering
ceph-volume: avoid RuntimeError on ceph-volume raw list with non-existent loop devices
Patrick Donnelly [Wed, 22 Apr 2026 15:27:18 +0000 (11:27 -0400)]
Merge PR #68451 into squid
* refs/pull/68451/head:
qa/suites/orch/cephadm: replace "reef" with "v18.2.8"
qa/suites/fs/upgrade/mds_upgrade_sequence: replace "reef" with "v18.2.8"
qa/suites/upgrade: use tagged versions of reef
qa/suites/orch/cephadm: replace "quincy" with "v17.2.8"
qa/suites/fs/upgrade/mds_upgrade_sequence: replace "quincy" with "v17.2.8"
qa/suites/upgrade/telemetry-upgrade: ignore expected health warning
qa/suites/upgrade: use tagged versions of quincy
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Yuri Weinstein <yweins@redhat.com>
qa/tests: ignore 'pg stuck peering' during upgrade tests
This warning is part of PG_AVAILABILITY, which we already ignore.
The detail message 'pg .* is stuck peering' does not always include
PG_AVAILABILITY explicitly, so add it separately to the ignorelist.
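A small sketch of why the extra entry is needed, assuming ignorelist entries are applied as regular expressions searched against each log line. The matching helper and both sample log lines are hypothetical.

```python
import re

# The two patterns named in the commit message.
IGNORELIST = [r"PG_AVAILABILITY", r"pg .* is stuck peering"]


def is_ignored(log_line, patterns=IGNORELIST):
    """True if any ignorelist pattern matches somewhere in the line."""
    return any(re.search(pattern, log_line) for pattern in patterns)


# Hypothetical detail line that never mentions PG_AVAILABILITY itself.
DETAIL_LINE = "pg 1.2 is stuck peering for 61s, current state peering"
```

A line like `DETAIL_LINE` is only caught by the second pattern, which is the case the commit describes.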
qa/tests: v19.2.0 lacks the fix from tracker #66019, so NeoRadosCls.RemoteReads
fails during P2P upgrade tests. Bump the starting version to v19.2.1, which
includes the fix. New upgrade path: v19.2.1 -> v19.2.3 -> squid-latest.
Laura Flores [Fri, 17 Apr 2026 15:20:09 +0000 (10:20 -0500)]
qa/suites/fs/upgrade/mds_upgrade_sequence: replace "reef" with "v18.2.8"
Since Reef is EOL, the "reef" tag was removed from quay.ceph.io.
The solution is to replace it with a test for the last point release,
which is "v18.2.8".
Fixes: https://tracker.ceph.com/issues/76028
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 17 Apr 2026 15:09:20 +0000 (10:09 -0500)]
qa/suites/fs/upgrade/mds_upgrade_sequence: replace "quincy" with "v17.2.8"
Since Quincy is EOL, the "quincy" tag was removed from quay.ceph.io.
The solution is to replace it with a test for the last point release
with a container image, which is "v17.2.8".
Fixes: https://tracker.ceph.com/issues/76028
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 17 Apr 2026 15:03:58 +0000 (10:03 -0500)]
qa/suites/upgrade/telemetry-upgrade: ignore expected health warning
"Telemetry requires re-opt-in" briefly shows up when we upgrade.
The test already re-opts in to telemetry to get rid of this warning,
but the cluster badness check still sometimes picks it up.
Fixes: https://tracker.ceph.com/issues/76028
Signed-off-by: Laura Flores <lflores@ibm.com>
Anoop C S [Fri, 5 Dec 2025 09:25:58 +0000 (14:55 +0530)]
mon/MonMap: Dump addr in backward compatible format
Prior to c5b43e9b2765ff98419c649a5ae53ec16601975d, we dumped only the
legacy string component from public_addrs as `addr`. Ensure that this
backward compatible filtering is retained when dumping MonMap.
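The real change is in the C++ MonMap dump code. The Python sketch below only illustrates the idea of filtering an address vector down to its legacy entry. The list-of-dicts layout, the "v1"/"v2" type labels, the sample addresses, and the helper name are all assumptions for illustration.

```python
def legacy_addr(addrvec):
    """Return the address of the first legacy ("v1") entry, or None.

    `addrvec` is assumed to be a list of dicts with 'type' and 'addr' keys.
    """
    for entry in addrvec:
        if entry.get("type") == "v1":
            return entry.get("addr")
    return None


# Hypothetical address vector with one msgr2 and one legacy entry.
SAMPLE_ADDRVEC = [
    {"type": "v2", "addr": "192.0.2.10:3300"},
    {"type": "v1", "addr": "192.0.2.10:6789"},
]
```

Consumers that only understand the older single `addr` field would then see the legacy entry instead of the full vector.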
Patrick Donnelly [Fri, 10 Apr 2026 17:23:45 +0000 (13:23 -0400)]
Merge PR #66541 into squid
* refs/pull/66541/head:
include: detect corrupt frag from byteswap
mds: dump frag_t as an object
common/frag: produce valid fragments for test instances
common: simplify fragment printing
common: properly convert frag_t to net/store endianness
mds: include sysinfo in status command output
include/frag.h: un-inline methods to reduce header dependencies
* refs/pull/67322/head:
qa: set column for insertion
qa: bail sqlite3 on any error
qa: use actual sqlite3 blob instead of string
test: use json_extract instead of awkward json_tree
Previously s3tests_java.py set JAVA_HOME using the `alternatives`
command. That had issues in that `alternatives` is not present on all
Ubuntu systems, and some installations of Java don't update
alternatives. So instead we look for a "java-8" jvm in /usr/lib/jvm/
and set JAVA_HOME to the first one we find.
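A minimal sketch of that lookup, assuming "java-8" appears literally in the directory name. The glob pattern and the function name are illustrative, not taken from s3tests_java.py.

```python
import glob
import os


def find_java8_home(jvm_root="/usr/lib/jvm"):
    """Return the first directory under jvm_root whose name contains
    'java-8', or None if there is no such directory."""
    candidates = sorted(glob.glob(os.path.join(jvm_root, "*java-8*")))
    directories = [path for path in candidates if os.path.isdir(path)]
    return directories[0] if directories else None
```

A caller could then set `os.environ["JAVA_HOME"]` to the result when it is not None. Sorting the candidates makes "the first one we find" deterministic across runs.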
* refs/pull/61894/head:
pybind/rados: add note for reversed arguments to WriteOp.zero()
test/pybind/test_rados.py: add test for reversed arguments offset,length in WriteOp.zero
pybind/rados: fix the incorrect order of offset,length in WriteOp.zero
Wang Chao [Sat, 10 Aug 2024 11:40:52 +0000 (19:40 +0800)]
pybind/rados: fix the incorrect order of offset,length in WriteOp.zero
The offset and length parameters in the rados pybind `WriteOp.zero()` method are passed to the rados_write_op_zero() function in the wrong order.
The incorrect order causes OP_ZERO to not work correctly when using the rados pybind.
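To make the effect of the swap concrete, the sketch below uses a pure-Python stand-in for a zero operation on a byte buffer. It does not call librados; the function name and buffer are hypothetical.

```python
def zero_range(buf, offset, length):
    """Stand-in for a zero operation: clear `length` bytes at `offset`.

    The range is clamped to the buffer so the buffer never grows.
    """
    end = min(offset + length, len(buf))
    if offset < end:
        buf[offset:end] = bytes(end - offset)


# Intended call: zero 8 bytes starting at offset 4.
correct = bytearray(b"\xff" * 16)
zero_range(correct, 4, 8)

# Same values with the two arguments swapped, as in the bug.
swapped = bytearray(b"\xff" * 16)
zero_range(swapped, 8, 4)
```

The two buffers end up different: the swapped call clears a shorter range at a later offset, which is the kind of silent misbehaviour the fix removes.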
qa/tasks/backfill_toofull.py: Fix assert failures with & without compression
The following issues with the test are addressed:
1. The test was encountering assertion failure (assert backfillfull < 0.9) with
compression enabled. This was because the condition was not factoring in the
compression ratio. Without it the backfillfull ratio can easily exceed 1. By
factoring in the compression ratio, the backfillfull ratio will be in the
range (0 - n), where n can vary depending on the type of compression used.
2. The main contributing factor for (1) above is the amount of data written to
the pool. The writes were time-bound earlier leading to excess data and
eventually the assertion failure. By limiting the data written to the OSDs
to 50% of the OSD capacity in the first phase and only 20% in the re-write
phase, the outcome of the test is more deterministic regardless of
compression being enabled or not.
3. A potential false cluster error is avoided by swapping the setting of
the nearfull-ratio and backfill-ratio after the re-write phase.
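The arithmetic behind points 1 and 2 can be sketched as follows. The commit does not give the exact formula used in backfill_toofull.py, so both helpers, their names, and the definition of compression ratio as logical size divided by stored size are assumptions.

```python
def phase_budget(osd_capacity_bytes, percent):
    """Bytes to write in a phase, as a whole-number percent of capacity."""
    return osd_capacity_bytes * percent // 100


def stored_ratio(logical_bytes, compression_ratio, capacity_bytes):
    """Estimated fraction of capacity consumed after compression.

    compression_ratio is assumed to be logical size / stored size,
    so 1.0 means no compression.
    """
    return (logical_bytes / compression_ratio) / capacity_bytes
```

Bounding the writes by bytes (50% in the first phase, 20% in the re-write phase) rather than by time keeps the amount of data the same on fast and slow test nodes.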
ceph-volume: single lvs call to speed up exclude_lvm_osd_devices
Cache all LVs from one global `lvs` call and use it in
`filter_lvm_osd_devices` to avoid repeated subprocesses and speed up
`ceph-volume generic activate` significantly.
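The caching idea can be sketched as below. This is not the ceph-volume implementation: the class name is hypothetical, and the listing command is injected as a callable so the sketch runs without LVM installed.

```python
class LvCache:
    """Run the LV listing once and answer later lookups from memory."""

    def __init__(self, list_lvs):
        # list_lvs: callable returning an iterable of LV device paths,
        # e.g. a thin wrapper around a single global `lvs` invocation.
        self._list_lvs = list_lvs
        self._paths = None

    def is_lv(self, path):
        if self._paths is None:
            # First lookup pays for the one subprocess call.
            self._paths = frozenset(self._list_lvs())
        return path in self._paths
```

Every device checked after the first one costs a set lookup instead of a new subprocess, which is where the speed-up comes from.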
ceph-volume: avoid Device() instantiation in lvm OSD filtering
Replace Device() instantiation with direct LVM API calls to reduce
subprocess overhead.
Use a sysfs check first, then query LVM only for actual LVs.
Cache the LVM mappers list to avoid repeated sysfs reads.
Patrick Donnelly [Sat, 28 Mar 2026 07:24:42 +0000 (12:54 +0530)]
Merge PR #67884 into squid
* refs/pull/67884/head:
qa/standalone: shorten bluefs test durations
qa/standalone: increase WAL volume size to 1GB
qa/standalone: fix bluefs expand test case
Patrick Donnelly [Thu, 19 Mar 2026 15:00:04 +0000 (11:00 -0400)]
Merge PR #66838 into squid
* refs/pull/66838/head:
os/bluestore: rename row names in RocksDBBlueFSVolumeSelector.
test/bluestore: add volume selector tests
os/bluestore:fix bluestore_volume_selection_reserved_factor usage
os/bluestore: print the first RocksDB level which doesn't fit into fast
The readable.sh script has forward incompat checks, but no
backward incompat checks.
This fix will:
1. Add check for backward_incompat directory for each type for specific
objects or all objects with the same type and skip those objects from being tested.
2. Add version comparison helper functions (version_lt, version_le, version_ge,
versions_span) for robust version handling
3. Replace 'sort -n' with 'sort -V' for proper version number sorting
4. Add CORPUS_PATH environment variable to allow teuthology tests to execute this script
5. Improve readability of the script
The difference between backward and forward incompat:
- forward_incompat: Marks objects from older versions that newer ceph-dencoder
versions cannot read. Example: Version 19.2.x objects marked incompat at version 20.2.x
means ceph-dencoder v20.2.x+ can't decode them. Skip when testing old objects
with a new ceph-dencoder.
- backward_incompat: Marks objects from newer versions that older ceph-dencoder
versions cannot read. Example: Version 19.2.x objects marked backward_incompat at v19.2.x
means ceph-dencoder < v19.2.x can't decode them. Skip when testing new objects
with an old ceph-dencoder.
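readable.sh is a shell script; the Python sketch below only mirrors the comparison helpers and the two skip rules described above. The helper bodies, the skip function names, and the concrete version numbers in the examples are assumptions. `versions_span` is left out because the text does not define it.

```python
def parse_version(version):
    """Turn '19.2.10' or 'v19.2.10' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def version_lt(a, b):
    return parse_version(a) < parse_version(b)


def version_le(a, b):
    return parse_version(a) <= parse_version(b)


def version_ge(a, b):
    return parse_version(a) >= parse_version(b)


def skip_forward_incompat(dencoder_version, marked_at):
    """Old object marked incompat at marked_at: skip on that version or newer."""
    return version_ge(dencoder_version, marked_at)


def skip_backward_incompat(dencoder_version, marked_at):
    """New object marked backward_incompat at marked_at: skip on older versions."""
    return version_lt(dencoder_version, marked_at)


# Hypothetical version list showing numeric ordering of components.
VERSIONS = ["19.2.10", "19.2.9", "20.2.0", "19.2.1"]
ORDERED = sorted(VERSIONS, key=parse_version)
```

Comparing component by component as integers places 19.2.9 before 19.2.10, which a plain string sort does not.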
NitzanMordhai [Mon, 2 Feb 2026 07:34:24 +0000 (07:34 +0000)]
workunits/dencoder: use readable.sh script instead of python script
The python script test_readable.py was added for backward and forward
compatibility. Maintaining two scripts that ultimately do the same thing is
wasteful, so revert to readable.sh and drop the python script.
2026-02-08T13:02:24.439 INFO:tasks.workunit.client.0.trial031.stderr:Parse error near line 2: no such column: "start" - should this be a string literal in single-quotes?
test: use json_extract instead of awkward json_tree
Ideally this should port better across sqlite3 versions. The sqlite3
on rocky10 failed because it started requiring components of the keys
to be quoted:
sqlite> select * from p as a, p as b where a.i=1 and b.i = 2 and a.fullkey = '$."libcephsqlite_vfs"."opf_sync".avgcount' and b.fullkey = '$."libcephsqlite_vfs"."opf_sync".avgcount';
i key value type atom id parent fullkey path i key value type atom id parent fullkey
- -------- ----- ------- ---- --- ------ ----------------------------------------- -------------------------------- - -------- ----- ------- ---- --- ------ ------------------
1 avgcount 4 integer 4 581 570 $."libcephsqlite_vfs"."opf_sync".avgcount $."libcephsqlite_vfs"."opf_sync" 2 avgcount 5 integer 5 581 570 $."libcephsqlite_v
Fixes: https://tracker.ceph.com/issues/74755
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit f304daa74ace4e6b856b585d71b8ff9c6e8a024a)
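A minimal sketch of the json_extract approach using Python's sqlite3 module, assuming the underlying SQLite build includes the JSON functions. The JSON document is a hypothetical stand-in shaped after the key path shown in the query output above; it is not the test's real data.

```python
import json
import sqlite3

# Hypothetical perf-counter document with the same nesting as the
# fullkey '$."libcephsqlite_vfs"."opf_sync".avgcount'.
DOC = json.dumps({"libcephsqlite_vfs": {"opf_sync": {"avgcount": 4}}})

conn = sqlite3.connect(":memory:")
(avgcount,) = conn.execute(
    "SELECT json_extract(?, '$.libcephsqlite_vfs.opf_sync.avgcount')",
    (DOC,),
).fetchone()
conn.close()
```

Because json_extract returns the value at a path directly, the query no longer depends on how a particular sqlite3 version renders the `fullkey` column of json_tree.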
Patrick Donnelly [Wed, 19 Nov 2025 23:16:21 +0000 (18:16 -0500)]
mon/HealthMonitor: avoid MON_DOWN for freshly added Monitor
In testing, we often have the scenario where cephadm has created a
cluster but doesn't add more monitors until well past
mon_down_mkfs_grace. This causes useless MON_DOWN warnings to be thrown
which fails QA jobs. Avoid this situation entirely by giving a
reasonable grace period for a monitor added to the MonMap to join
quorum.
Fixes: https://tracker.ceph.com/issues/73934
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit b028a41e1f000b87aab3f263ab3259a0ca439555)
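The grace-period check can be sketched as below. The real logic is in the C++ HealthMonitor; the function name, its parameters, and the time units are hypothetical, and the commit message does not name the setting that controls the new grace period.

```python
def should_report_mon_down(in_quorum, now, added_at, grace_seconds):
    """Decide whether a monitor outside quorum should count toward MON_DOWN.

    A monitor that was only recently added to the MonMap is given
    grace_seconds to join quorum before it is reported.
    """
    if in_quorum:
        return False
    return (now - added_at) >= grace_seconds
```

Measuring the grace from the time the monitor was added, rather than from cluster creation, covers monitors that cephadm adds long after the initial bootstrap.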