mon/MonMap: Dump addr in backward compatible format

Prior to c5b43e9b2765ff98419c649a5ae53ec16601975d, we dumped only the
legacy string component from public_addrs as `addr`. Ensure that this
backward compatible filtering is retained when dumping MonMap.

Signed-off-by: Anoop C S <anoopcs@cryptolab.net>
(cherry picked from commit 01c85255bc9b266ecf9bd3b58d2a8d4cb4650d7f)

Merge PR #66335 into squid

* refs/pull/66335/head:
RGW:fix obj by multipart upload cant get tag

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #67994 into squid

* refs/pull/67994/head:
test/rgw/kafka: fix kafka relase to more recent one

Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>

Merge PR #66541 into squid

* refs/pull/66541/head:
include: detect corrupt frag from byteswap
mds: dump frag_t as an object
common/frag: produce valid fragments for test instances
common: simplify fragment printing
common: properly convert frag_t to net/store endianness
mds: include sysinfo in status command output
include/frag.h: un-inline methods to reduce header dependencies

Reviewed-by: Yuri Weinstein <yweins@redhat.com>

Merge PR #68227 into squid

* refs/pull/68227/head:
rgw: enhanced java s3-tests change setting of JAVA_HOME
rgw: java s3-tests change setting of JAVA_HOME

Reviewed-by: Casey Bodley <cbodley@redhat.com>

Merge PR #67322 into squid

* refs/pull/67322/head:
qa: set column for insertion
qa: bail sqlite3 on any error
qa: use actual sqlite3 blob instead of string
test: use json_extract instead of awkward json_tree

Reviewed-by: Shraddha Agrawal <shraddhaag@ibm.com>

Merge PR #67324 into squid

* refs/pull/67324/head:
mon/HealthMonitor: avoid MON_DOWN for freshly added Monitor
mon: add time_added to mon_info_t
common/options: add missing runtime flag
mon/MonMap: cleanup initialization

Reviewed-by: Shraddha Agrawal <shraddhaag@ibm.com>

Merge PR #62061 into squid

* refs/pull/62061/head:
tools: respect set features when adding addresses

rgw: enhanced java s3-tests change setting of JAVA_HOME

Under CentOS 9 the Java 8 version is recognized by the substring
"java-1.8" rather than "java-8". So the grep has been modified to
accept either.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
(cherry picked from commit a49d4446e4d84b28273b460b85a193011a9c4ed8)

rgw: java s3-tests change setting of JAVA_HOME

Previously s3tests_java.py set JAVA_HOME using the `alternatives`
command. That had issues in that `alternatives` is not present on all
Ubuntu systems, and some installations of Java don't update
alternatives. So instead we look for a "java-8" jvm in /usr/lib/jvm/
and set JAVA_HOME to the first one we find.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
(cherry picked from commit b8e2796270f4558b406411682a9b916109d0c530)
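
The lookup described in the two JAVA_HOME commits above can be sketched
in Python as follows. This is an illustration only, not the code in
s3tests_java.py: the function name find_java8_home and its jvm_root
parameter are hypothetical, and sorting the directory listing so that
"first" is deterministic is an assumption.

```python
import os
import re

# Matches both naming schemes mentioned above: "java-8" and "java-1.8".
JAVA8_PATTERN = re.compile(r"java-(1\.)?8")


def find_java8_home(jvm_root="/usr/lib/jvm"):
    """Return the first Java 8 directory under jvm_root, or None."""
    try:
        entries = sorted(os.listdir(jvm_root))
    except FileNotFoundError:
        return None
    for entry in entries:
        path = os.path.join(jvm_root, entry)
        if JAVA8_PATTERN.search(entry) and os.path.isdir(path):
            return path
    return None
```

A caller would export the returned path as JAVA_HOME and fail the task
if the function returns None.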

Merge PR #67800 into squid

* refs/pull/67800/head:
qa/tasks/mgr: test_module_selftest set influx hostname to avoid warnings

Reviewed-by: Laura Flores <lflores@redhat.com>

Merge PR #61894 into squid

* refs/pull/61894/head:
pybind/rados: add note for reversed arguments to WriteOp.zero()
test/pybind/test_rados.py: add test for reversed arguments offset,length in WriteOp.zero
pybind/rados: fix the incorrect order of offset,length in WriteOp.zero

Reviewed-by: Laura Flores <lflores@redhat.com>

pybind/rados: add note for reversed arguments to WriteOp.zero()

Signed-off-by: Wang Chao <sean10reborn@gmail.com>
(cherry picked from commit e9ca8a01323d49c656c54d622a34280adc5b244b)

test/pybind/test_rados.py: add test for reversed arguments offset,length in WriteOp.zero

Before the fix, zero(0, 2) would have no effect, and read would get '12345' instead of the expected '\x00\x00345'.

Signed-off-by: Wang Chao <sean10reborn@gmail.com>
(cherry picked from commit 3a27c3e58fca96d0f0c80a1d264cb3f5f156f5c3)

pybind/rados: fix the incorrect order of offset,length in WriteOp.zero

The offset and length parameters in the rados pybind `WriteOp.zero()` method were being passed to the rados_write_op_zero() function in the wrong order.
As a result, OP_ZERO did not work correctly when issued through the rados Python binding.

Signed-off-by: Wang Chao <sean10reborn@gmail.com>
(cherry picked from commit 049d7d35abe0aa2560e3bb9d4fafb43eefb4a0ed)
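
The effect of the swapped arguments can be reproduced without a cluster.
The sketch below is a toy model, not the binding itself:
fake_write_op_zero is a hypothetical stand-in for the C call that
operates on a local bytearray, and zero_buggy / zero_fixed are
hypothetical wrappers that only differ in argument order.

```python
def fake_write_op_zero(data: bytearray, offset: int, length: int) -> None:
    """Stand-in for the C call: zero `length` bytes starting at `offset`."""
    data[offset:offset + length] = bytes(length)


def zero_buggy(data: bytearray, offset: int, length: int) -> None:
    # Arguments forwarded in the wrong order, as described above.
    fake_write_op_zero(data, length, offset)


def zero_fixed(data: bytearray, offset: int, length: int) -> None:
    # Arguments forwarded in the order the callee expects.
    fake_write_op_zero(data, offset, length)


before = bytearray(b"12345")
zero_buggy(before, 0, 2)   # actually zeroes 0 bytes at offset 2

after = bytearray(b"12345")
zero_fixed(after, 0, 2)    # zeroes 2 bytes at offset 0
```

With the swapped order the buffer is left as '12345'; with the fixed
order it becomes '\x00\x00345', matching the test case described above.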

tools: respect set features when adding addresses

Fixes: https://tracker.ceph.com/issues/53751
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
(cherry picked from commit 19545eb9864b002c1a37d4f2509d1b2baa833128)

Merge PR #67527 into squid

* refs/pull/67527/head:
mgr/Mgr.cc: clear daemon health metrics instead of removing down/out osd from daemon state

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #67450 into squid

* refs/pull/67450/head:
qa/rgw: bucket notifications use pynose

Reviewed-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>

Merge PR #67575 into squid

* refs/pull/67575/head:
rgw/notification: fix reserved_size drift in 2pc_queue causing ENOSPC errors
rgw/notification: Prevent reserved_size leak by decrementing overhead on commit/abort.

Reviewed-by: Yuval Lifshitz <ylifshit@redhat.com>

Merge PR #67398 into squid

* refs/pull/67398/head:
os/bluestore: Fix default base size for histogram

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>

Merge PR #67884 into squid

* refs/pull/67884/head:
qa/standalone: shorten bluefs test durations
qa/standalone: increase WAL volume size to 1GB
qa/standalone: fix bluefs expand test case

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>

Merge PR #67392 into squid

* refs/pull/67392/head:
test/encoding/readable: Add backward incompat checks
workunits/dencoder: use readable.sh script instead of python script

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

test/rgw/kafka: fix kafka relase to more recent one

Fixes: https://tracker.ceph.com/issues/75323
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit dc412a7e519d037acbcac8a92c7ecf2dbde9875a)

Merge PR #66884 into squid

* refs/pull/66884/head:
Squid: mgr/dashboard: Changing placement of a mds to label - creates a new mds-service, mds.label

Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Afreen Misbah <afreen@ibm.com>

Merge PR #62454 into squid

* refs/pull/62454/head:
mgr/dashboard: add types for mgr-module list
mgr/dashboard: fix access control permissions for roles

Reviewed-by: Afreen Misbah <afreen@ibm.com>

Merge PR #67796 into squid

* refs/pull/67796/head:
qa/workunits/rbd: fix unbound variable in status()
qa/workunits/rbd: short-circuit status() if "ceph -s" fails
qa: rbd_mirror_fsx_compare.sh doesn't error out as expected

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #67794 into squid

* refs/pull/67794/head:
qa/tasks: make rbd_mirror_thrash inherit from ThrasherGreenlet

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #67704 into squid

* refs/pull/67704/head:
librbd/cache/pwl: WriteLogOperationSet::cell can be garbage

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #66838 into squid

* refs/pull/66838/head:
os/bluestore: rename row names in RocksDBBlueFSVolumeSelector.
test/bluestore: add volume selector tests
os/bluestore:fix bluestore_volume_selection_reserved_factor usage
os/bluestore: print the first RocksDB level which doesn't fit into fast

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>

qa/standalone: shorten bluefs test durations

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 5a901808dfe03dad5e34ef6374e34c0c03766e96)

qa/standalone: increase WAL volume size to 1GB

to avoid unexpected test case failures due to ENOSPC.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 2de79c64420ffba91becdf29f2d4f6b2d5931830)

qa/standalone: fix bluefs expand test case

Fixes: https://tracker.ceph.com/issues/74525
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 9fc57f9ed1c61d54ca8ecd9e1b98782eee13848a)

Squid: mgr/dashboard: Changing placement of a mds to label - creates a new mds-service, mds.label

Fixes: https://tracker.ceph.com/issues/74376
Signed-off-by: Dnyaneshwari Talwekar <dtalweka@redhat.com>

Merge PR #63344 into squid

* refs/pull/63344/head:
mgr/DaemonServer: fixed mistype for mgr_osd_messages

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #61417 into squid

* refs/pull/61417/head:
qa/cephfs: update ignorelist

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #59688 into squid

* refs/pull/59688/head:
qa: some test set `refuse_client_session`, so the cluster log is expected

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #64686 into squid

* refs/pull/64686/head:
mon/MgrMonitor: add a space before "is already disabled"

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #65298 into squid

* refs/pull/65298/head:
qa/suites/upgrade: update ignorelist with cephfs specific warnings (under stress-split)
qa/suites/upgrade: add "Replacing daemon mds" to ignorelist

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #65758 into squid

* refs/pull/65758/head:
.github: pin GH Actions to SHA-1 commit

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #66126 into squid

* refs/pull/66126/head:
qa: ignore cluster warning (evicting unresponsive ...) with tasks/mgr-osd-full

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

os/bluestore: rename row names in RocksDBBlueFSVolumeSelector.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit a9f591f4e1cb1e364879165250c55cb0f841d64f)

test/bluestore: add volume selector tests

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 158d1550a021ed60e5ad1c565b247e5b0b6d5946)

Conflicts:
src/test/objectstore/CMakeLists.txt - allocsim not present in Squid

os/bluestore:fix bluestore_volume_selection_reserved_factor usage

Fixes: https://tracker.ceph.com/issues/71368
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 43d7864093f92977a3fd084bbfd65229244b1cc9)

os/bluestore: print the first RocksDB level which doesn't fit into fast
device by default.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit d95aa620b315d9261cb50b0465ecfd2b6b534a60)

Merge PR #66915 into squid

* refs/pull/66915/head:
monc: synchronize tick() of MonClient with shutdown()

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

Merge PR #66973 into squid

* refs/pull/66973/head:
qa/tasks/thrashosds-health: whitelist PG_BACKFILL_FULL

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #60391 into squid

* refs/pull/60391/head:
qa/cephfs: ignore when specific OSD is reported down during upgrade

Merge PR #63026 into squid

* refs/pull/63026/head:
qa/workunits/cephtool: add extra privileges to cephtool script

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Kamoltat Sirivadhna <ksirivad@redhat.com>

test/encoding/readable: Add backward incompat checks

The readable.sh script has forward incompat checks, but no
backward incompat checks.

This fix will:
1. Add a check for a backward_incompat directory for each type, covering either specific
    objects or all objects of that type, and skip those objects during testing.
2. Add version comparison helper functions (version_lt, version_le, version_ge,
    versions_span) for robust version handling
3. Replace 'sort -n' with 'sort -V' for proper version number sorting
4. Add CORPUS_PATH environment variable to allow teuthology tests to execute this script
5. Improve readability of the script

The difference between backward and forward incompat:
- forward_incompat: Marks objects from older versions that newer ceph-dencoder
  versions cannot read. Example: Version 19.2.x objects marked incompat at version 20.2.x
  means ceph-dencoder v20.2.x+ can't decode them. Skip when testing old objects
  with a new ceph-dencoder.
- backward_incompat: Marks objects from newer versions that older ceph-dencoder
  versions cannot read. Example: Version 19.2.x objects marked backward_incompat at v19.2.x
  means ceph-dencoder < v19.2.x can't decode them. Skip when testing new objects
  with an old ceph-dencoder.

Fixes: https://tracker.ceph.com/issues/74074
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
(cherry picked from commit 011b25d8038e0f0bd3272fa57b0c7e068feb130c)
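
The version helpers and the skip rule above can be sketched in Python.
This is not a port of readable.sh: parse_version and should_skip are
hypothetical names, the helpers assume purely numeric dotted versions,
and versions_span is left out because its semantics are not described
here.

```python
def parse_version(version):
    """Turn '19.2.10' into (19, 2, 10) so components compare numerically."""
    return tuple(int(part) for part in version.split("."))


def version_lt(a, b):
    return parse_version(a) < parse_version(b)


def version_le(a, b):
    return parse_version(a) <= parse_version(b)


def version_ge(a, b):
    return parse_version(a) >= parse_version(b)


def should_skip(kind, marked_at, dencoder_version):
    """Apply the forward/backward incompat rule described above."""
    if kind == "forward_incompat":
        # ceph-dencoder at or after the marked version cannot decode.
        return version_ge(dencoder_version, marked_at)
    if kind == "backward_incompat":
        # ceph-dencoder before the marked version cannot decode.
        return version_lt(dencoder_version, marked_at)
    raise ValueError(f"unknown incompat kind: {kind}")


versions = ["19.2.10", "19.2.2", "18.2.7", "20.2.0"]
ordered = sorted(versions, key=parse_version)
```

Comparing component by component places 19.2.2 before 19.2.10, which is
the ordering a version-aware sort is expected to give for dotted version
strings; a plain string comparison would reverse those two.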

workunits/dencoder: use readable.sh script instead of python script

The python script test_readable.py was added for backward and forward
compatibility. Maintaining two scripts that ultimately do the same thing
is wasteful, so revert to readable.sh and drop the python script.

https://tracker.ceph.com/issues/74074
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
(cherry picked from commit 9d289ed14e79fa8008ba30b77b425a4508030110)

qa/rgw: bucket notifications use pynose

nose incompatibility in multisite tests was fixed by switching to pynose
in https://github.com/ceph/teuthology/pull/1947, so I'm trying the same
here.

Fixes: https://tracker.ceph.com/issues/74573
Signed-off-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 915a5309a639333839829b5a554f3fdb6c560464)

Merge PR #63018 into squid

* refs/pull/63018/head:
qa/workunits/fs/misc: remove data pool cleanup

Merge PR #61302 into squid

* refs/pull/61302/head:
qa: do not fail cephfs QA tests for slow bluestore ops

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>

qa: set column for insertion

2026-02-08T13:02:24.439 INFO:tasks.workunit.client.0.trial031.stderr:Parse error near line 2: no such column: "start" - should this be a string literal in single-quotes?

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 6ebb625669afd6e112be26ff87a1e61cfc4ee979)

qa: bail sqlite3 on any error

Otherwise it will wrongly proceed executing the next SQL statement.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit e6d10c23eb3c2b8b4aae15146f24bbfcf65ad1b6)

qa: use actual sqlite3 blob instead of string

No functional change.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 1e0114bf2795e5c19348cec0eafa7d351d22dc81)

test: use json_extract instead of awkward json_tree

Ideally this should port better across sqlite3 versions. The sqlite3
on rocky10 failed because it started requiring components of the keys
to be quoted:

    sqlite> select * from p as a, p as b where a.i=1 and b.i = 2 and a.fullkey = '$."libcephsqlite_vfs"."opf_sync".avgcount' and b.fullkey = '$."libcephsqlite_vfs"."opf_sync".avgcount';
    i  key       value  type     atom  id   parent  fullkey                                    path                              i  key       value  type     atom  id   parent  fullkey
    -  --------  -----  -------  ----  ---  ------  -----------------------------------------  --------------------------------  -  --------  -----  -------  ----  ---  ------  ------------------
    1  avgcount  4      integer  4     581  570     $."libcephsqlite_vfs"."opf_sync".avgcount  $."libcephsqlite_vfs"."opf_sync"  2  avgcount  5      integer  5     581  570     $."libcephsqlite_v

Fixes: https://tracker.ceph.com/issues/74755
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit f304daa74ace4e6b856b585d71b8ff9c6e8a024a)
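
For comparison, a direct json_extract lookup needs no fullkey matching
at all. The sketch below uses Python's sqlite3 module rather than the
sqlite3 shell used by the test, assumes the SQLite build includes the
JSON functions, and uses a hypothetical perf dump shaped like the keys
in the query above.

```python
import json
import sqlite3

# Hypothetical perf dump shaped like the keys in the query above.
perf = {"libcephsqlite_vfs": {"opf_sync": {"avgcount": 4}}}

conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT json_extract(?, '$.libcephsqlite_vfs.opf_sync.avgcount')",
    (json.dumps(perf),),
).fetchone()
avgcount = row[0]
conn.close()
```

The path is resolved by json_extract itself, so the query does not
depend on how a given sqlite3 version renders fullkey.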

mon/HealthMonitor: avoid MON_DOWN for freshly added Monitor

In testing, we often have the scenario where cephadm has created a
cluster but doesn't add more monitors until well past
mon_down_mkfs_grace. This causes useless MON_DOWN warnings to be thrown
which fails QA jobs. Avoid this situation entirely by giving a
reasonable grace period for a monitor added to the MonMap to join
quorum.

Fixes: https://tracker.ceph.com/issues/73934
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit b028a41e1f000b87aab3f263ab3259a0ca439555)
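
The grace-period check can be sketched as below. This is a simplified
model, not the HealthMonitor code: should_report_mon_down and the
added_grace parameter are hypothetical names, and timestamps are plain
seconds. It relies only on knowing when the monitor was added to the
map (time_added).

```python
def should_report_mon_down(now, time_added, in_quorum, added_grace):
    """Decide whether a monitor outside quorum deserves a MON_DOWN warning.

    now, time_added: timestamps in seconds.
    added_grace: hypothetical grace period in seconds for new monitors.
    """
    if in_quorum:
        return False
    # A monitor that was only just added to the MonMap is still joining.
    return (now - time_added) > added_grace
```

A monitor added 10 seconds ago with a 60-second grace is not reported;
the same monitor still outside quorum after 120 seconds is.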

mon: add time_added to mon_info_t

So we know when the Monitor was added to the map.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit c5b43e9b2765ff98419c649a5ae53ec16601975d)

Conflicts:
src/mon/MonMap.cc: generate_test_instances refactor missing

common/options: add missing runtime flag

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 62c449e208aef23df35c311020f1518f62d3f013)

mon/MonMap: cleanup initialization

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 42a37916ff5c69d7df4d917cd9e143b4e92d389f)

include: detect corrupt frag from byteswap

If a big-endian MDS writes frag_t values into the metadata pool, these
will persist and confuse the MDS after it tries properly parsing them as
little-endian. Fortunately detecting this situation is fairly easy as we
restrict the number of bits and the number of bits restricts the mask
value.

Fixes: https://tracker.ceph.com/issues/73792
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 6bf91e4f6e49d99711b8be845eb77c883d662704)
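
A sketch of why the check is cheap, assuming the frag encoding keeps the
bit count in the top 8 bits of a 32-bit word and the value in the low
24 bits, with at most 24 bits in use. That layout and all function names
below are assumptions for illustration; this is not the code added by
the commit.

```python
import struct


def frag_make(bits, value):
    """Build an encoded frag; only the top `bits` of the value are kept."""
    mask = (0xFFFFFF << (24 - bits)) & 0xFFFFFF
    return (bits << 24) | (value & mask)


def frag_is_valid(enc):
    """True if the bit count is in range and the value fits its mask."""
    bits = enc >> 24
    if bits > 24:
        return False
    mask = (0xFFFFFF << (24 - bits)) & 0xFFFFFF
    value = enc & 0xFFFFFF
    return (value & ~mask) == 0


def byteswap32(enc):
    """Reinterpret a big-endian 32-bit word as little-endian."""
    return struct.unpack("<I", struct.pack(">I", enc))[0]


good = frag_make(1, 0x800000)   # 0x01800000
swapped = byteswap32(good)      # 0x00008001: bits == 0 but value != 0
```

Here the swapped word claims zero bits yet carries a non-zero value, so
frag_is_valid rejects it. The check is a heuristic: a word such as 0
byteswaps to itself and stays valid.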

mds: dump frag_t as an object

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 1d526c50de0712a180db1b6fa39ae6f51e346c3c)

Conflicts:
src/mds/mdstypes.h: add dump method

common/frag: produce valid fragments for test instances

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 4fb5ad4a258a3c447e9d7413ce87fcdf6a89d056)

Conflicts:
src/common/frag.cc: missing test instance refactor
src/mds/mdstypes.cc: missing test instance refactor

Conflicts:
src/mds/mdstypes.cc

common: simplify fragment printing

There's better tooling for this now and we can avoid magic numbers.

Fixes: https://tracker.ceph.com/issues/73792
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 647de21c85f14d67e7941428c3af2ebeef39ad4f)

common: properly convert frag_t to net/store endianness

The MDS/client are already accidentally doing the right thing unless
they are running on a big-endian machine.

Credit to Venky Shankar for originally hypothesizing an endianness issue
with the frag_t.

Fixes: https://tracker.ceph.com/issues/73792
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 9e3837c837bc9f76805f998dd06fe386dce35722)

mds: include sysinfo in status command output

Of particular interest is the CPU architecture and endianness.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit fa4078adfc4e54d0bdd472437c7dcd8bc55ba4dd)

include/frag.h: un-inline methods to reduce header dependencies

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
(cherry picked from commit 5f1a893dc54dc579a8428100496adc27d638aab9)

Merge PR #67001 into squid

* refs/pull/67001/head:
doc: fetch releases from main branch

Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>

Merge PR #67623 into squid

* refs/pull/67623/head:
mgr/orchestrator: make group parameter optional for nvmeof (squid)
pybind/mgr/orchestrator/module.py: NvmeofServiceSpec service_id

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #66964 into squid

* refs/pull/66964/head:
monitoring: upgrade grafana version to 12.3.1

Reviewed-by: Afreen Misbah <afreen@ibm.com>

Merge PR #66990 into squid

* refs/pull/66990/head:
monitoring: fix rgw_servers filtering in rgw sync overview grafana

Reviewed-by: Afreen Misbah <afreen@ibm.com>

qa/tasks/mgr: test_module_selftest set influx hostname to avoid warnings

The self-test will hit the MGR_INFLUX_NO_SERVER error since we don't
have a hostname configured. The added command sets a test hostname so
the error won't appear and fail the test.

Fixes: https://tracker.ceph.com/issues/72747
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
(cherry picked from commit 6b170bb5366ec13239b768f7c344fa5e842af7ff)

qa/workunits/rbd: fix unbound variable in status()

It was missed in commit 5fe64fa806f3 ("qa: rbd_mirror.sh: change
parameters to cluster rather than daemon name").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 1a280b9a320d51bdc4cb80be9bdd6ae265151132)

qa/workunits/rbd: short-circuit status() if "ceph -s" fails

In mirror-thrash tests, status() can be invoked after one of the
clusters is effectively stopped due to a watchdog bark:

2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.rbd_mirror.[cluster2] failed
2026-03-01T22:27:38.633 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
...
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ status
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ local cluster daemon image_pool image_ns image
2026-03-01T22:32:46.964 INFO:tasks.workunit.cluster1.client.mirror.trial199.stderr:+ for cluster in ${CLUSTER1} ${CLUSTER2}

In this scenario all commands that are invoked from the loop body
are going to time out anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 82717e43a08a1262987f5e271fd72d4433c4fb3b)

qa: rbd_mirror_fsx_compare.sh doesn't error out as expected

In mirror-thrash tests, one of the clusters can be effectively stopped
due to a watchdog bark while rbd_mirror_fsx_compare.sh is running and is
in the middle of the "wait for all images" loop:

2026-03-01T12:55:35.059 INFO:tasks.workunit.cluster1.client.mirror.trial055.stderr:+ retrying_seconds=1040
2026-03-01T12:55:35.060 INFO:tasks.workunit.cluster1.client.mirror.trial055.stderr:+ '[' 1040 -le 7200 ']'
2026-03-01T12:55:35.060 INFO:tasks.workunit.cluster1.client.mirror.trial055.stderr:++ rbd --cluster cluster2 --pool mirror ls
2026-03-01T12:55:35.060 INFO:tasks.workunit.cluster1.client.mirror.trial055.stderr:++ wc -l
2026-03-01T12:55:35.084 INFO:tasks.workunit.cluster1.client.mirror.trial055.stderr:+ '[' 290 -ge 292 ']'
2026-03-01T12:55:35.084 INFO:tasks.workunit.cluster1.client.mirror.trial055.stderr:+ sleep 10
...
2026-03-01T12:55:49.568 INFO:tasks.daemonwatchdog.daemon_watchdog:thrasher.rbd_mirror.[cluster2] failed
2026-03-01T12:55:49.568 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons

In this scenario "rbd ls" is going to time out repeatedly, turning the
loop into up to a ~60-hour sleep (up to 720 iterations with a 5-minute
timeout + 10-second sleep per iteration).

Fixes: https://tracker.ceph.com/issues/75239
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 81a5906f0d1cc844bb4ef16aae9ace3e7d371ac2)
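
The ~60-hour figure follows from the numbers above. A quick check,
assuming the retry counter advances only by the 10-second sleep (which
is what gives 720 iterations for the 7200-second bound); the constant
names are made up for this sketch.

```python
MAX_RETRY_SECONDS = 7200      # loop bound seen in the log above
SLEEP_SECONDS = 10            # "sleep 10" per iteration
RBD_LS_TIMEOUT_SECONDS = 300  # 5-minute timeout per "rbd ls"

iterations = MAX_RETRY_SECONDS // SLEEP_SECONDS
worst_case_seconds = iterations * (RBD_LS_TIMEOUT_SECONDS + SLEEP_SECONDS)
worst_case_hours = worst_case_seconds / 3600
```

That is 720 iterations of 310 seconds each, or 62 hours in the worst
case.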

qa/tasks: make rbd_mirror_thrash inherit from ThrasherGreenlet

Commit 21b4b89e5280 ("qa/tasks: watchdog terminate thrasher") made it
required for a thrasher to have stop_and_join() method, but the
preceding commit a035b5a22fb8 ("thrashers: standardize stop and join
method names") failed to add it to rbd_mirror_thrash (whether as an
ad-hoc implementation or by way of inheriting from ThrasherGreenlet).
Later on, commit 783f0e3a9903 ("qa: Adding a new class for the
daemonwatchdog to monitor") worsened the issue by expanding the use
of stop_and_join() to all watchdog barks rather than just the case of
a thrasher throwing an exception which is something that practically
never happens.

Fixes: https://tracker.ceph.com/issues/75200
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 3ebe3a0a43251b0f126497d4100bd1af9ca8afc5)

Merge PR #66829 into squid

* refs/pull/66829/head:
monitoring: fix CephPgImbalance alert rule expression

Reviewed-by: Afreen Misbah <afreen@ibm.com>

Merge PR #66897 into squid

* refs/pull/66897/head:
common: drop stack singleton object of temp messenger for foreground ceph daemons

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

Merge PR #67066 into squid

* refs/pull/67066/head:
qa: Disable OSD benchmark from running for tests.

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

Merge PR #67278 into squid

* refs/pull/67278/head:
librbd: introduce RBD_LOCK_MODE_EXCLUSIVE_TRANSIENT
librbd: prepare lock_acquire() for changing between policies
librbd: fix RequestLockPayload log message in ImageWatcher
librbd: amend error message in lock_acquire()

Reviewed-by: Ramana Raja <rraja@redhat.com>

Merge PR #67280 into squid

* refs/pull/67280/head:
qa/valgrind.supp: make gcm_cipher_internal suppression more resilient

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>

Merge PR #67356 into squid

* refs/pull/67356/head:
osd/PrimaryLogPG: encode an empty data_bl for empty sparse reads

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

Merge PR #67454 into squid

* refs/pull/67454/head:
qa: krbd_rxbounce.sh: do more reads to generate more errors

Reviewed-by: Ramana Raja <rraja@redhat.com>

Merge PR #66985 into squid

* refs/pull/66985/head:
monitoring: make cluster matcher backward compatible for pre-7.1 metrics

Reviewed-by: Afreen Misbah <afreen@ibm.com>

Merge PR #67761 into squid

* refs/pull/67761/head:
qa: whitelist slow requests progress.yaml
qa: make test_progress atomically capture OSD marked in/out events

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Shraddha Agrawal <shraddhaag@ibm.com>

qa: whitelist slow requests progress.yaml

Slow requests were reported because the test ran 16 concurrent 4 MB
writes while recovery and backfill were disabled. At the same time,
osd.0 was marked out and then back in, causing PG remapping. With
recovery and backfill disabled, some PGs could not restore their
replicas after the remap and stayed in degraded/remapped states. As a
result, a batch of writes remained stuck in the replicated write path,
leading to an IO stall and slow ops being reported.

The solution is to ignore this warning, since we are testing the
progress module, not the write paths of OSDs. We intentionally disable
backfill and recovery to keep the recovery event from finishing
quickly, prolonging it until the progress event pops up.

Fixes: https://tracker.ceph.com/issues/70320
Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
(cherry picked from commit 6b0c943c8bd004665529c5c5786ecec42bcc9ff7)

qa: make test_progress atomically capture OSD marked in/out events

Problem:
Test had a race condition where events could complete and disappear
between checking the event count and fetching the event, causing
test failures.

Solution:
Refactor to atomically capture events during the wait condition check.
Added helper methods _wait_for_osd_marked_out_event() and
_wait_for_osd_marked_in_event() that capture events at the moment
they're detected, eliminating the race window.

Fixes: https://tracker.ceph.com/issues/70320
Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
(cherry picked from commit 0ef66f6f2e1881061ecb49e457bb2b9061c0260b)
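
The capture-on-detect pattern can be sketched as follows. This is a
generic illustration, not the test_progress code: wait_for_event,
FakeProgress and the event message are hypothetical. The point is that
the event is returned from the same poll that detected it, so a later
fetch cannot come back empty.

```python
import time


def wait_for_event(get_events, predicate, timeout=5.0, interval=0.01):
    """Poll until an event matches, returning the event seen at that moment."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for event in get_events():
            if predicate(event):
                return event  # captured in the same poll that detected it
        time.sleep(interval)
    raise TimeoutError("no matching event appeared")


class FakeProgress:
    """Hypothetical event source whose event vanishes after one poll."""

    def __init__(self):
        self.polls = 0

    def events(self):
        self.polls += 1
        if self.polls == 1:
            return [{"message": "osd.0 marked out"}]
        return []  # the event has already completed and disappeared


source = FakeProgress()
captured = wait_for_event(
    source.events, lambda ev: "marked out" in ev["message"]
)
```

A check-then-fetch version would see the event on the first poll and
find nothing on the second; here the first poll already hands back the
event.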

Merge PR #66480 into squid

* refs/pull/66480/head:
mgr/dashboard: service creation fails if service name is same as service type

Reviewed-by: Afreen Misbah <afreen@ibm.com>

Merge PR #67582 into squid

* refs/pull/67582/head:
librbd/mirror: detect trashed snapshots in UnlinkPeerRequest

Reviewed-by: Ramana Raja <rraja@redhat.com>

Merge PR #67580 into squid

* refs/pull/67580/head:
librbd: don't complete ImageUpdateWatchers::shut_down() prematurely

Reviewed-by: Ramana Raja <rraja@redhat.com>

mgr/dashboard: service creation fails if service name is same as service type

Fixes: https://tracker.ceph.com/issues/73948
Signed-off-by: Naman Munet <naman.munet@ibm.com>
(cherry picked from commit 57d081d6b5efcbeac6c60e73d50aa5f1f8cab560)

mgr/orchestrator: make group parameter optional for nvmeof (squid)

Add default value for group parameter in nvmeof commands to maintain
backward compatibility with existing squid tests and deployments.

Context:
--------
On main branch, when commit 6bee4e10f7f added the group parameter, the
tests were subsequently updated to provide the group argument explicitly:

  Main test: ceph orch apply nvmeof foo default
  Expected: nvmeof.foo.default

However, on squid branch, the existing tests still use the older syntax
without specifying a group:

  Squid test: ceph orch apply nvmeof foo
  Expected: nvmeof.foo

The previous cherry-pick (e1612d048a1) fixed the service_id construction
logic to handle empty groups correctly, but the group parameter was still
required without a default value, causing "ceph orch apply nvmeof foo" to
fail with EINVAL (missing required argument).

This commit adds the missing default value (group: str = '') to make the
parameter optional, maintaining backward compatibility with existing squid
tests and user scripts that don't specify a group.

With both changes:
1. Cherry-picked e1612d048a1: service_id logic handles empty group
2. This commit: group parameter has default value ''

Result:
  "ceph orch apply nvmeof foo" works (creates nvmeof.foo)
  "ceph orch apply nvmeof foo mygroup" also works (creates nvmeof.foo.mygroup)

Test: qa/suites/orch/cephadm/smoke-roleless/2-services/nvmeof.yaml
Fixes job 50373 failure from test run dgalloway-2026-02-13_23:06:25

Please note, this change was not cherry-picked from main branch, because
main intentionally still requires the CLI group argument for orch
apply/add nvmeof, and its tests were updated accordingly.
On squid, however, the earlier cherry-pick 6bee4e10 introduced the
required group parameter, but squid still has the old test/behavior
(ceph orch apply nvmeof foo expecting nvmeof.foo) and does not contain
the later main commits 3e5e85aadc1 and b377085c302.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
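
The resulting naming behavior can be sketched as below. This is not the
orchestrator module code: nvmeof_service_name and its parameter names
are hypothetical, and only the default value group='' and the expected
names come from the description above.

```python
def nvmeof_service_name(name, group=""):
    """Build the service name; an empty group adds no suffix."""
    parts = ["nvmeof", name]
    if group:
        parts.append(group)
    return ".".join(parts)
```

Calling it with just "foo" gives nvmeof.foo, and adding "mygroup" gives
nvmeof.foo.mygroup, matching the two cases listed under Result.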

librbd/cache/pwl: WriteLogOperationSet::cell can be garbage

The pointer is never initialized but gets printed by operator<<.
Luckily outside of that it's unused.

Fixes: https://tracker.ceph.com/issues/74971
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit bffa11487cb7d68c0aa39994f50fbc3b4b00e415)

mgr/dashboard: add types for mgr-module list

Also introduce a const for rgw.

Fixes: https://tracker.ceph.com/issues/70331
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 3d6de8a669887c57711f176b3a75f2f2a635a23e)

Conflicts:
src/pybind/mgr/dashboard/frontend/src/app/ceph/rgw/rgw-multisite-details/rgw-multisite-details.component.ts
- kept only the import that is relevant
src/pybind/mgr/dashboard/frontend/src/app/shared/api/mgr-module.service.ts
- same as above

mgr/dashboard: fix access control permissions for roles

Since prometheus is being used in the dashboard page we need to make
sure every role has prometheus read only access so that the dashboard
page can load the utilization metrics.

I also saw a permission issue with the osd settings endpoint when it's
trying to get the nearfull/full ratio. So instead of failing the entire
page, I am proceeding with a chart that doesn't have those details when
the user doesn't have permission to access the config opt.

The Multisite page was not accessible in the case of rgw-manager or
read-only user because it's trying to show the status of the rgw module.
This is also now gracefully handled to show the alert only when the user
has sufficient permission.

Fixes: https://tracker.ceph.com/issues/70331
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit f4bc03e4040ca32591d9b46b79309b162c3942db)

Conflicts:
src/pybind/mgr/dashboard/frontend/src/app/ceph/dashboard-v3/dashboard/dashboard-v3.component.ts
- kept changes only relevant to bug fix and ignored the other changes
like h/w monitoring
src/pybind/mgr/dashboard/frontend/src/app/ceph/rgw/rgw-multisite-details/rgw-multisite-details.component.html
- ignored multisite wizard changes
src/pybind/mgr/dashboard/frontend/src/app/core/navigation/administration/administration.component.html
- kept the current changes since carbon is not there in squid which
means this issue is not present
src/pybind/mgr/dashboard/frontend/src/app/core/navigation/navigation/navigation.component.html
- kept the current changes for the same reason above
src/pybind/mgr/dashboard/services/access_control.py
- ignored the SMB role manager and kept only what's available in squid

pybind/mgr/orchestrator/module.py: NvmeofServiceSpec service_id

- make service_id better aligned with default/empty group
(https://github.com/ceph/ceph/commit/f6d552d7c777f1160545188dcffa6b685b05ca8a)
- fix service_id in nvmeof daemon add

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e1612d048a102a716aaa8b5d0d91a45525828664)

librbd/mirror: detect trashed snapshots in UnlinkPeerRequest

If two instances of UnlinkPeerRequest race with each other (e.g. due
to rbd-mirror daemon unlinking from a previous mirror snapshot and the
user taking another mirror snapshot at the same time), the snapshot that
UnlinkPeerRequest was created for may be in the process of being removed
(which may mean trashed by SnapshotRemoveRequest::trash_snap()) or fully
removed by the time unlink_peer() grabs the image lock. Because trashed
snapshots weren't handled explicitly, UnlinkPeerRequest could spuriously
fail with EINVAL ("not mirror snapshot" case) instead of the expected
ENOENT ("missing snapshot" case). This in turn could lead to spurious
ImageReplayer failures with it stopping prematurely.

Fixes: https://tracker.ceph.com/issues/68279
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 3596ca077097a4e0ff8e8d05a410c2044332391e)

librbd: don't complete ImageUpdateWatchers::shut_down() prematurely

ImageUpdateWatchers::flush() requests aren't tracked with
m_in_flight-like mechanism the way ImageUpdateWatchers::send_notify()
requests are, but in both cases callbacks that represent delayed work
that is very likely to (indirectly) reference ImageCtx are involved.
When the image is getting closed, ImageUpdateWatchers::shut_down() is
called before anything that belongs to ImageCtx is destroyed. However,
the shutdown can complete prematurely in the face of a pending flush if
one gets sent shortly before CloseRequest is invoked. The callback for
that flush will then race with CloseRequest and may execute after parts
of or even the entire ImageCtx is destroyed, leading to use-after-free
and various segfaults.

Fixes: https://tracker.ceph.com/issues/75161
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 3ea6ee62aa339d1ad9976fdcc6e207a505f9bf44)

rgw/notification: fix reserved_size drift in 2pc_queue causing ENOSPC errors

The urgent_data.reserved_size field was accumulating incorrect values over time due to a mismatch between what was added during reserve() and what was subtracted during commit()/abort(). This caused the reserved_size to grow unbounded, eventually hitting the queue capacity limit and returning ENOSPC errors even when the queue had plenty of actual space.

Solution:
Add a one-time self-healing capability: the reservation value is recalculated during reserve() and the counter is updated with the correct value.

Signed-off-by: Krunal Chheda <kchheda3@bloomberg.net>
(cherry picked from commit 7f4eaee30cba6efd3e0acc5b3c315c182a3bc8d9)
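
How such a drift builds up, and how a recalculation clears it, can be
shown with a toy model. This is not the cls_2pc_queue code: ToyQueue,
OVERHEAD and the specific mismatch (commit releasing the payload but not
the per-reservation overhead) are assumptions chosen for illustration.

```python
import errno


class ToyQueue:
    """Toy model of reservation accounting; not the cls_2pc_queue code."""

    OVERHEAD = 8  # hypothetical per-reservation bookkeeping bytes

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved_size = 0
        self.outstanding = {}  # reservation id -> payload size
        self.next_id = 1

    def reserve(self, size):
        needed = size + self.OVERHEAD
        if self.reserved_size + needed > self.capacity:
            raise OSError(errno.ENOSPC, "no space left in queue")
        res_id = self.next_id
        self.next_id += 1
        self.outstanding[res_id] = size
        self.reserved_size += needed
        return res_id

    def commit_buggy(self, res_id):
        # Releases only the payload, so OVERHEAD leaks on every cycle.
        self.reserved_size -= self.outstanding.pop(res_id)

    def commit_fixed(self, res_id):
        # Releases exactly what reserve() added.
        self.reserved_size -= self.outstanding.pop(res_id) + self.OVERHEAD

    def recalculate(self):
        """Self-heal: rebuild the counter from outstanding reservations."""
        self.reserved_size = sum(
            size + self.OVERHEAD for size in self.outstanding.values()
        )


queue = ToyQueue(capacity=100)
for _ in range(5):
    queue.commit_buggy(queue.reserve(10))
drifted = queue.reserved_size   # leaked overhead with nothing outstanding

queue.recalculate()
healed = queue.reserved_size
```

After five reserve/commit cycles the buggy path leaves 40 bytes counted
as reserved with no reservation outstanding; enough cycles would make
reserve() raise ENOSPC on an empty queue. Recalculating from the
outstanding reservations brings the counter back to 0.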