Laura Flores [Fri, 17 Apr 2026 15:20:09 +0000 (10:20 -0500)]
qa/suites/fs/upgrade/mds_upgrade_sequence: replace "reef" with "v18.2.8"
Since Reef is EOL, the "reef" tag was removed from quay.ceph.io.
The solution is to replace it with a test for the last point release,
which is "v18.2.8".
Fixes: https://tracker.ceph.com/issues/76028
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 17 Apr 2026 15:09:20 +0000 (10:09 -0500)]
qa/suites/fs/upgrade/mds_upgrade_sequence: replace "quincy" with "v17.2.8"
Since Quincy is EOL, the "quincy" tag was removed from quay.ceph.io.
The solution is to replace it with a test for the last point release
with a container image, which is "v17.2.8".
Fixes: https://tracker.ceph.com/issues/76028
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 17 Apr 2026 15:03:58 +0000 (10:03 -0500)]
qa/suites/upgrade/telemetry-upgrade: ignore expected health warning
"Telemetry requires re-opt-in" briefly shows up when we upgrade.
The test already re-opts in to telemetry to get rid of this warning,
but the cluster badness check still sometimes picks it up.
Fixes: https://tracker.ceph.com/issues/76028
Signed-off-by: Laura Flores <lflores@ibm.com>
Anoop C S [Fri, 5 Dec 2025 09:25:58 +0000 (14:55 +0530)]
mon/MonMap: Dump addr in backward compatible format
Prior to c5b43e9b2765ff98419c649a5ae53ec16601975d, we dumped only the
legacy string component from public_addrs as `addr`. Ensure that this
backward compatible filtering is retained when dumping MonMap.
Patrick Donnelly [Fri, 10 Apr 2026 17:23:45 +0000 (13:23 -0400)]
Merge PR #66541 into squid
* refs/pull/66541/head:
include: detect corrupt frag from byteswap
mds: dump frag_t as an object
common/frag: produce valid fragments for test instances
common: simplify fragment printing
common: properly convert frag_t to net/store endianness
mds: include sysinfo in status command output
include/frag.h: un-inline methods to reduce header dependencies
* refs/pull/67322/head:
qa: set column for insertion
qa: bail sqlite3 on any error
qa: use actual sqlite3 blob instead of string
test: use json_extract instead of awkward json_tree
Previously s3tests_java.py set JAVA_HOME using the `alternatives`
command. That had issues in that `alternatives` is not present on all
Ubuntu systems, and some installations of Java don't update
alternatives. So instead we look for a "java-8" jvm in /usr/lib/jvm/
and set JAVA_HOME to the first one we find.
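The lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code from s3tests_java.py: the function name `find_java8_home`, the `*java-8*` glob pattern, and the use of sorted order to define "first" are all assumptions.

```python
import glob
import os
from typing import Optional


def find_java8_home(jvm_root: str = "/usr/lib/jvm") -> Optional[str]:
    """Return the first directory under jvm_root whose name contains "java-8".

    Hypothetical helper: sorting is used only to make "first" deterministic.
    """
    candidates = sorted(
        path
        for path in glob.glob(os.path.join(jvm_root, "*java-8*"))
        if os.path.isdir(path)
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    java_home = find_java8_home()
    if java_home is not None:
        # Export the result for child processes, as the commit describes.
        os.environ["JAVA_HOME"] = java_home
    print("JAVA_HOME =", java_home)
```

Taking the JVM root as a parameter keeps the helper independent of `alternatives` and makes it easy to exercise against a temporary directory.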
* refs/pull/61894/head:
pybind/rados: add note for reversed arguments to WriteOp.zero()
test/pybind/test_rados.py: add test for reversed arguments offset,length in WriteOp.zero
pybind/rados: fix the incorrect order of offset,length in WriteOp.zero
Wang Chao [Sat, 10 Aug 2024 11:40:52 +0000 (19:40 +0800)]
pybind/rados: fix the incorrect order of offset,length in WriteOp.zero
The offset and length parameters in the rados pybind `WriteOp.zero()` method are passed to the rados_write_op_zero() function in the wrong order.
As a result, OP_ZERO does not work correctly when used through the rados Python bindings.
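To illustrate why swapped arguments matter, here is a self-contained simulation that does not touch a cluster or the real bindings. `zero_range`, `zero_wrapper_buggy`, and `zero_wrapper_fixed` are hypothetical stand-ins operating on a `bytearray`, not the rados API.

```python
def zero_range(data: bytearray, offset: int, length: int) -> None:
    """Stand-in for a zero operation: clear `length` bytes starting at `offset`."""
    end = min(offset + length, len(data))
    for i in range(offset, end):
        data[i] = 0


def zero_wrapper_buggy(data: bytearray, offset: int, length: int) -> None:
    # Hypothetical wrapper reproducing the class of bug: arguments reversed.
    zero_range(data, length, offset)


def zero_wrapper_fixed(data: bytearray, offset: int, length: int) -> None:
    # Arguments forwarded in the order the callee expects.
    zero_range(data, offset, length)


buggy = bytearray(b"ABCDEFGHIJ")
fixed = bytearray(b"ABCDEFGHIJ")
zero_wrapper_buggy(buggy, offset=2, length=5)
zero_wrapper_fixed(fixed, offset=2, length=5)
print(buggy)  # two bytes cleared starting at index 5
print(fixed)  # five bytes cleared starting at index 2
```

With the arguments reversed, the wrong region is zeroed and the requested region is left intact, which is the kind of silent misbehavior the fix addresses.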
qa/tasks/backfill_toofull.py: Fix assert failures with & without compression
The following issues with the test are addressed:
1. The test was encountering assertion failure (assert backfillfull < 0.9) with
compression enabled. This was because the condition was not factoring in the
compression ratio. Without it the backfillfull ratio can easily exceed 1. By
factoring in the compression ratio, the backfillfull ratio will be in the
range (0 - n), where n can vary depending on the type of compression used.
2. The main contributing factor for (1) above is the amount of data written to
the pool. The writes were time-bound earlier leading to excess data and
eventually the assertion failure. By limiting the data written to the OSDs
to 50% of the OSD capacity in the first phase and only 20% in the re-write
phase, the outcome of the test is more deterministic regardless of
compression being enabled or not.
3. A potential false cluster error is avoided by swapping the setting of
the nearfull-ratio and backfill-ratio after the re-write phase.
Patrick Donnelly [Sat, 28 Mar 2026 07:24:42 +0000 (12:54 +0530)]
Merge PR #67884 into squid
* refs/pull/67884/head:
qa/standalone: shorten bluefs test durations
qa/standalone: increase WAL volume size to 1GB
qa/standalone: fix bluefs expand test case
Patrick Donnelly [Thu, 19 Mar 2026 15:00:04 +0000 (11:00 -0400)]
Merge PR #66838 into squid
* refs/pull/66838/head:
os/bluestore: rename row names in RocksDBBlueFSVolumeSelector.
test/bluestore: add volume selector tests
os/bluestore:fix bluestore_volume_selection_reserved_factor usage
os/bluestore: print the first RocksDB level which doesn't fit into fast
The readable.sh script has forward incompat checks, but no
backward incompat checks.
This fix will:
1. Add check for backward_incompat directory for each type for specific
objects or all objects with the same type and skip those objects from being tested.
2. Add version comparison helper functions (version_lt, version_le, version_ge,
versions_span) for robust version handling
3. Replace 'sort -n' with 'sort -V' for proper version number sorting
4. Add CORPUS_PATH environment variable to allow teuthology tests to execute this script
5. Improve readability of the script
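As an illustration of items 2 and 3, the following Python sketch mirrors the idea of the comparison helpers and of version-aware sorting. It is not the shell code from readable.sh: `parse_version` and `sort_versions` are hypothetical names, only purely numeric versions are handled, and `versions_span` is omitted because its semantics are not described here.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "v19.2.10" or "19.2.10" into (19, 2, 10)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def version_lt(a: str, b: str) -> bool:
    return parse_version(a) < parse_version(b)


def version_le(a: str, b: str) -> bool:
    return parse_version(a) <= parse_version(b)


def version_ge(a: str, b: str) -> bool:
    return parse_version(a) >= parse_version(b)


def sort_versions(versions: List[str]) -> List[str]:
    """Sort by numeric components, similar in spirit to `sort -V`."""
    return sorted(versions, key=parse_version)


versions = ["19.2.10", "19.2.9", "18.2.8", "20.2.0"]
print(sorted(versions))         # plain string order puts 19.2.10 before 19.2.9
print(sort_versions(versions))  # version-aware order puts 19.2.9 before 19.2.10
```

Comparing component by component is what keeps 19.2.10 after 19.2.9, which a character-wise comparison gets wrong.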
The difference between backward and forward incompat:
- forward_incompat: Marks objects from older versions that newer ceph-dencoder
versions cannot read. Example: Version 19.2.x objects marked incompat at version 20.2.x
means ceph-dencoder v20.2.x+ can't decode them. Skip when testing old objects
with a new ceph-dencoder.
- backward_incompat: Marks objects from newer versions that older ceph-dencoder
versions cannot read. Example: Version 19.2.x objects marked backward_incompat at v19.2.x
means ceph-dencoder < v19.2.x can't decode them. Skip when testing new objects
with an old ceph-dencoder.
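The two skip rules can be summarized in a small sketch. This is an interpretation of the description above, not the logic of readable.sh: `should_skip` and its parameters are hypothetical, and each marker is modeled simply as the version at which the incompatibility was recorded.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "v19.2.1" or "19.2.1" into (19, 2, 1)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def should_skip(
    dencoder_version: str,
    forward_incompat_at: Optional[str] = None,
    backward_incompat_at: Optional[str] = None,
) -> bool:
    """Decide whether an object should be skipped for this ceph-dencoder version."""
    dencoder = parse_version(dencoder_version)
    # forward_incompat: dencoders at or after the marker cannot read the object.
    if forward_incompat_at is not None and dencoder >= parse_version(forward_incompat_at):
        return True
    # backward_incompat: dencoders older than the marker cannot read the object.
    if backward_incompat_at is not None and dencoder < parse_version(backward_incompat_at):
        return True
    return False


print(should_skip("20.2.0", forward_incompat_at="20.2.0"))   # new dencoder, old object
print(should_skip("19.1.0", backward_incompat_at="19.2.0"))  # old dencoder, new object
print(should_skip("19.2.1", backward_incompat_at="19.2.0"))  # dencoder is new enough
```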
NitzanMordhai [Mon, 2 Feb 2026 07:34:24 +0000 (07:34 +0000)]
workunits/dencoder: use readable.sh script instead of python script
The Python script test_readable.py was added for backward and forward
compatibility. Maintaining two scripts that ultimately do the same thing
is wasteful, so revert to readable.sh and drop the Python script.
2026-02-08T13:02:24.439 INFO:tasks.workunit.client.0.trial031.stderr:Parse error near line 2: no such column: "start" - should this be a string literal in single-quotes?
test: use json_extract instead of awkward json_tree
Ideally this should port better across sqlite3 versions. The sqlite3
on rocky10 failed because it started requiring components of the keys
to be quoted:
sqlite> select * from p as a, p as b where a.i=1 and b.i = 2 and a.fullkey = '$."libcephsqlite_vfs"."opf_sync".avgcount' and b.fullkey = '$."libcephsqlite_vfs"."opf_sync".avgcount';
i key value type atom id parent fullkey path i key value type atom id parent fullkey
- -------- ----- ------- ---- --- ------ ----------------------------------------- -------------------------------- - -------- ----- ------- ---- --- ------ ------------------
1 avgcount 4 integer 4 581 570 $."libcephsqlite_vfs"."opf_sync".avgcount $."libcephsqlite_vfs"."opf_sync" 2 avgcount 5 integer 5 581 570 $."libcephsqlite_v
Fixes: https://tracker.ceph.com/issues/74755
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit f304daa74ace4e6b856b585d71b8ff9c6e8a024a)
Patrick Donnelly [Wed, 19 Nov 2025 23:16:21 +0000 (18:16 -0500)]
mon/HealthMonitor: avoid MON_DOWN for freshly added Monitor
In testing, we often have the scenario where cephadm has created a
cluster but doesn't add more monitors until well past
mon_down_mkfs_grace. This causes useless MON_DOWN warnings to be thrown
which fails QA jobs. Avoid this situation entirely by giving a
reasonable grace period for a monitor added to the MonMap to join
quorum.
Fixes: https://tracker.ceph.com/issues/73934
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit b028a41e1f000b87aab3f263ab3259a0ca439555)
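The grace-period idea in the commit above can be sketched as a simple predicate. This is only an illustration of the concept, not the HealthMonitor code: the function name, its parameters, and the 60-second grace value used in the example are hypothetical.

```python
def should_warn_mon_down(
    in_quorum: bool, added_at: float, now: float, grace: float
) -> bool:
    """Return True if a monitor missing from quorum should raise MON_DOWN.

    Hypothetical model: a monitor added to the MonMap less than `grace`
    seconds ago is given time to join quorum before any warning is raised.
    """
    if in_quorum:
        return False
    return (now - added_at) > grace


# A monitor added 10 seconds ago is still within a 60-second grace period.
print(should_warn_mon_down(False, added_at=100.0, now=110.0, grace=60.0))
# The same monitor still out of quorum after 5 minutes triggers the warning.
print(should_warn_mon_down(False, added_at=100.0, now=400.0, grace=60.0))
```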
Patrick Donnelly [Thu, 13 Nov 2025 19:51:20 +0000 (14:51 -0500)]
include: detect corrupt frag from byteswap
If a big-endian MDS writes frag_t values into the metadata pool, these
will persist and confuse the MDS after it tries properly parsing them as
little-endian. Fortunately detecting this situation is fairly easy as we
restrict the number of bits and the number of bits restricts the mask
value.
Fixes: https://tracker.ceph.com/issues/73792
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 6bf91e4f6e49d99711b8be845eb77c883d662704)
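A rough sketch of the detection idea follows. It assumes a 32-bit encoding in which the top 8 bits hold the bit count (at most 24) and the low 24 bits hold the value, with only the top `bits` bits of that 24-bit field allowed to be set. That layout, and the names `make_frag`, `looks_valid`, and `byteswap32`, are assumptions for illustration and not the code in include/frag.h.

```python
import struct

MAX_BITS = 24          # assumed maximum bit count
VALUE_MASK = 0xFFFFFF  # assumed 24-bit value field


def make_frag(value: int, bits: int) -> int:
    """Encode a fragment, keeping only the top `bits` bits of the value."""
    assert 0 <= bits <= MAX_BITS
    keep = (VALUE_MASK << (MAX_BITS - bits)) & VALUE_MASK
    return (bits << MAX_BITS) | (value & keep)


def looks_valid(enc: int) -> bool:
    """Check that the bit count is in range and the value respects its mask."""
    bits = enc >> MAX_BITS
    value = enc & VALUE_MASK
    if bits > MAX_BITS:
        return False
    # Bits below the top `bits` positions must all be zero.
    return (value & (VALUE_MASK >> bits)) == 0


def byteswap32(enc: int) -> int:
    """Reinterpret a 32-bit value with the opposite byte order."""
    return struct.unpack("<I", struct.pack(">I", enc))[0]


frag = make_frag(0x800000, 1)
print(hex(frag), looks_valid(frag))
swapped = byteswap32(frag)
print(hex(swapped), looks_valid(swapped))
```

In this example the swapped encoding reports a bit count of zero together with a non-zero value, which the mask check rejects. Encodings that are unchanged by a byteswap, such as zero, cannot be distinguished this way.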
Patrick Donnelly [Wed, 18 Mar 2026 00:22:23 +0000 (20:22 -0400)]
Merge PR #67623 into squid
* refs/pull/67623/head:
mgr/orchestrator: make group parameter optional for nvmeof (squid)
pybind/mgr/orchestrator/module.py: NvmeofServiceSpec service_id
Nitzan Mordechai [Mon, 17 Nov 2025 11:51:14 +0000 (11:51 +0000)]
qa/tasks/mgr: test_module_selftest set influx hostname to avoid warnings
The self-test hits the MGR_INFLUX_NO_SERVER error because no hostname
is configured. The following command adds a test hostname so that the
error does not appear and fail the test.
Ilya Dryomov [Sun, 1 Mar 2026 21:55:52 +0000 (22:55 +0100)]
qa/workunits/rbd: short-circuit status() if "ceph -s" fails
In mirror-thrash tests, status() can be invoked after one of the
clusters is effectively stopped due to a watchdog bark:
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.rbd_mirror.[cluster2] failed
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
...
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ status
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ local cluster daemon image_pool image_ns image
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ for cluster in ${CLUSTER1} ${CLUSTER2}
In this scenario all commands that are invoked from the loop body
are going to time out anyway.
Ilya Dryomov [Sun, 1 Mar 2026 16:45:51 +0000 (17:45 +0100)]
qa: rbd_mirror_fsx_compare.sh doesn't error out as expected
In mirror-thrash tests, one of the clusters can be effectively stopped
due to a watchdog bark while rbd_mirror_fsx_compare.sh is running and is
in the middle of the "wait for all images" loop:
In this scenario "rbd ls" is going to time out repeatedly, turning the
loop into up to a ~60-hour sleep (up to 720 iterations with a 5-minute
timeout + 10-second sleep per iteration).
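The worst-case duration quoted above follows directly from the figures in the commit message; the variable names below are only for illustration.

```python
ITERATIONS = 720    # maximum loop iterations
TIMEOUT_S = 5 * 60  # "rbd ls" timeout per iteration, in seconds
SLEEP_S = 10        # sleep per iteration, in seconds

total_seconds = ITERATIONS * (TIMEOUT_S + SLEEP_S)
total_hours = total_seconds / 3600

print(total_seconds, "seconds")
print(total_hours, "hours")
```

This works out to 223200 seconds, or 62 hours, consistent with the "~60-hour" estimate.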
Ilya Dryomov [Fri, 27 Feb 2026 14:18:27 +0000 (15:18 +0100)]
qa/tasks: make rbd_mirror_thrash inherit from ThrasherGreenlet
Commit 21b4b89e5280 ("qa/tasks: watchdog terminate thrasher") made it
required for a thrasher to have stop_and_join() method, but the
preceding commit a035b5a22fb8 ("thrashers: standardize stop and join
method names") failed to add it to rbd_mirror_thrash (whether as an
ad-hoc implementation or by way of inheriting from ThrasherGreenlet).
Later on, commit 783f0e3a9903 ("qa: Adding a new class for the
daemonwatchdog to monitor") worsened the issue by expanding the use
of stop_and_join() to all watchdog barks rather than just the case of
a thrasher throwing an exception, which is something that practically
never happens.
The slow requests occurred because, during the test, 16 concurrent 4 MB
writes were running while recovery and backfill were disabled. At the
same time, osd.0 was marked out and then back in, causing PG remapping.
Because recovery and backfill were disabled, some PGs could not restore
their replicas after the remap, leaving them in degraded/remapped
states. As a result, a batch of writes remained stuck in the replicated
write path, leading to an IO stall and slow ops being reported.
The solution is to ignore this warning, since we are testing the
progress module, not the write paths of OSDs. We intentionally disable
backfill and recovery to prevent the recovery event from finishing
quickly, prolonging it until the progress event pops up.
qa: make test_progress atomically capture OSD marked in/out events
Problem:
Test had a race condition where events could complete and disappear
between checking the event count and fetching the event, causing
test failures.
Solution:
Refactor to atomically capture events during the wait condition check.
Added helper methods _wait_for_osd_marked_out_event() and
_wait_for_osd_marked_in_event() that capture events at the moment
they're detected, eliminating the race window.
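The capture-during-wait pattern can be sketched as below. This is not the test_progress code: `wait_for_event`, the `FlakyEvents` fake, and the event fields `id` and `message` are hypothetical, chosen only to show how checking and capturing from the same snapshot avoids the race.

```python
import time
from typing import Callable, Dict, List


def wait_for_event(
    list_events: Callable[[], List[Dict[str, str]]],
    keyword: str,
    timeout: float = 5.0,
    period: float = 0.01,
) -> Dict[str, str]:
    """Poll until an event whose message contains `keyword` is seen, and return it.

    The event is captured from the same snapshot that satisfied the check,
    so it cannot disappear between "detect" and "fetch".
    """
    deadline = time.monotonic() + timeout
    while True:
        for event in list_events():
            if keyword in event.get("message", ""):
                return event
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no event matching {keyword!r}")
        time.sleep(period)


class FlakyEvents:
    """Fake event source: the event is visible on exactly one poll."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> List[Dict[str, str]]:
        self.calls += 1
        if self.calls == 3:
            return [{"id": "ev1", "message": "osd.0 marked out"}]
        return []


source = FlakyEvents()
captured = wait_for_event(source, "marked out")
print(captured["id"], "captured on poll", source.calls)
```

A check-then-fetch approach would issue a second query after the third poll and find nothing, which is the failure mode the refactor removes.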