Ville Ojamo [Mon, 5 Jan 2026 06:10:45 +0000 (13:10 +0700)]
doc: Remove sphinxcontrib-seqdiag Python package from RTD builds
This is a proactive PR to avoid breaking docs builds when Setuptools 81
starts to be used in the RTD builds process.
The sphnixcontrib-seqdiag Python package is not compatible with
Setuptools 81 or later due to use of pkg_resources:
https://setuptools.pypa.io/en/latest/pkg_resources.html
Setuptools 81 release should be imminent, with the Python deprecation
warning stating pkg_resources "removal as early as 2025-11-30".
Seqdiag seems to be unmaintained with the latest update at Pypi in
the year 2021 and also no updates to the seqdiag git repo.
There are no seqdiag directives left in the docs after last seqdiags
were removed in PR #52308.
Two other options would exist for fixing the situation (see PR for
discussion) but this seems to be the suitable one.
Patrick Donnelly [Wed, 18 Feb 2026 15:41:32 +0000 (10:41 -0500)]
Merge PR #64815 into squid
* refs/pull/64815/head:
The compilation of ISAL compress in the current code depends on the macro HAVE_NASM_X64_AVX2. However, the macro HAVE_NASM_X64_AVX2 has been removed, resulting in the compression not using ISAL even if the compressor_zlib_isal parameter is set to true.
Kyr Shatskyy [Fri, 21 Nov 2025 21:20:04 +0000 (22:20 +0100)]
qa/workunits/rgw: drop netstat usage
The `netstat` is deprecated now in modern Linux and usually
requires an extra package dependency to be installed.
Usually it is `net-tools`, however, for example, opensuse,
`netstat` does not present in it. Thus, let us use `ss` as
an alternative.
When using `netstat -nltp` we get lines like:
'tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 25156/valgrind.bin \ntcp6 0 0 :::443 :::* LISTEN 25156/valgrind.bin \n'
When using `ss -nltp` we get lines like:
'LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("memcheck-amd64-",pid=66045,fd=72))'
so we need to filter processes by `memcheck`. However further
parsing code works equivalently as for netstat.
Ilya Dryomov [Fri, 30 Jan 2026 15:32:35 +0000 (16:32 +0100)]
qa/tasks/rbd_mirror_thrash: don't use random.randrange() on floats
This stopped working in Python 3.12:
Changed in version 3.12: Automatic conversion of non-integer types
is no longer supported. Calls such as randrange(10.0) and
randrange(Fraction(10, 1)) now raise a TypeError.
Ilya Dryomov [Thu, 29 Jan 2026 20:41:03 +0000 (21:41 +0100)]
qa/workunits/rbd: reduce randomized sleeps in live import tests
These tests were tuned for slower hardware than what we have now.
Currently "rbd migration execute" always finishes (successfully) before
the NBD server is killed.
Conflicts:
qa/workunits/rbd/cli_migration.sh [ commit afc89fdde80f
("qa/workunits/rbd: add test_import_nbd_stream_disconnected()")
was originally skipped due to NBD stream not being in squid at
the time ]
Ilya Dryomov [Wed, 28 Jan 2026 09:41:13 +0000 (10:41 +0100)]
qa/workunits/rbd: drop randomized sleeps in "big image" tests
These tests were tuned for slower hardware than what we have now.
Even without these the image is often 25-30% synced by the time the
test gets to the "non-primary snapshot in question is still being
synced" assert.
Ilya Dryomov [Tue, 27 Jan 2026 20:56:23 +0000 (21:56 +0100)]
qa/workunits/rbd: avoid unnecessary sleeping in stop_mirror()
There is no need to wait for anything if -KILL is passed for sig
because the process would disappear immediately. In teuthology runs
where multiple rbd-mirror daemons are deployed (and therefore need to
be stopped when stop_mirrors() is called by the test), it causes
gratuitous delays of 4+ seconds.
Ilya Dryomov [Fri, 23 Jan 2026 13:48:53 +0000 (14:48 +0100)]
qa: don't assume that /dev/sda or /dev/vda is present in unmap.t
Instead of hard-coding the block device name, use the block device that
is backing the filesystem that the test is running on. We can be quite
sure it won't be an RBD device ;)
Ilya Dryomov [Wed, 21 Jan 2026 18:41:41 +0000 (19:41 +0100)]
qa: krbd_blkroset.t: eliminate a race in the open_count test
Even at QD=1, dd may take less than 10 seconds to work its way to the
end of a 10M image, producing "No space left on device" error instead
of the expected "Operation not permitted" error which is supposed to
arise from the device getting marked read-only while opened.
Ville Ojamo [Mon, 19 Jan 2026 13:06:46 +0000 (20:06 +0700)]
doc/cephadm: remove sections not apply to Squid in rgw.rst
4949311 backported changes that do not apply to Squid.
PR #63073 body and the commit referenced therein as cherry-pick do not
correspond to the diff. Remove the additions that do not apply to Squid:
- Wildcard SAN feature in 3c24753 only since Tentacle.
- Shutdown delay feature in b84bb72 only since Tentacle.
The third feature doc addition is valid, d620ba6 was backported to Squid
in PR #61350 for disable multisite sync traffic, commit 59b3f28. This
backport cherry-picked only the feature addition and missed the docs
commit 8878619. Leave this section in.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Kefu Chai [Wed, 24 Dec 2025 05:55:26 +0000 (13:55 +0800)]
debian/control: add iproute2 to build dependencies
Test scripts like qa/tasks/cephfs/mount.py expect the ip command to be
available in the container environment. Without it, tests fail with:
```
/bin/bash: line 1: ip: command not found
File "/ceph/qa/tasks/cephfs/mount.py", line 96, in cleanup_stale_netnses_and_bridge
p = remote.run(args=['ip', 'netns', 'list'],
...
teuthology.exceptions.CommandFailedError: Command failed with status 127: 'ip netns list'
```
Add iproute2 to the debian package build dependencies when the
<pkg.ceph.check> build profile is enabled. This ensures the package is
available during container-based builds, since buildcontainer-setup.sh
→ script/run-make.sh → install-deps.sh → debian/control → generated
dependency package chain respects build profiles configured via
`FOR_MAKE_CHECK` and `WITH_CRIMSON` environment variables set in
Dockerfile.build.
David Galloway [Tue, 16 Dec 2025 22:08:00 +0000 (17:08 -0500)]
install-deps: Replace apt-mirror
apt-mirror.front.sepia.ceph.com has happened to always work because we set up CNAMEs to gitbuilder.ceph.com.
That host is making its way to a new home upstate (literally and figuratively) so we'll get rid of the front subdomain since it's publicly accessible anyway and add TLS while we're at it.
Ilya Dryomov [Tue, 9 Dec 2025 14:22:02 +0000 (15:22 +0100)]
librbd: fix ExclusiveLock::accept_request() when !is_state_locked()
To accept an async request, two conditions must be met: a) exclusive
lock must be a firm STATE_LOCKED state and b) async requests shouldn't
be blocked or if they are blocked there should be an exception in place
for a given request_type. If a) is met but b) isn't, ret_val is set
to m_request_blocked_ret_val, as expected -- the reason for denying
the request is that async requests are blocked. However, if a) isn't
met, ret_val also gets set to m_request_blocked_ret_val. This is wrong
because the reason for denying the request in this case isn't that
async requests are blocked (they may or may not be) but a much heavier
circumstance of exclusive lock being in a transient state or not held
at all.
In such scenarios, whether async requests are blocked or not isn't
relevant and ExclusiveLock::accept_request() behaving otherwise can
lead to bogus "duplicate lock owners detected" errors getting raised
during an attempt to handle any maintenance operation notification in
ImageWatcher::handle_operation_request(). This error isn't considered
retryable so the entire operation that needed the exclusive lock would
be spuriously failed with EINVAL.
Adam Kupczyk [Tue, 15 Apr 2025 08:37:25 +0000 (08:37 +0000)]
os/bluestore: Fix dirty_range in BlueStore::_do_remove
dirty_range used to have length = 1 byte.
This is good if whole extent is inside shard.
But this has proven not to be the case.
dirty_range(offset, length) is slower only when it crosses shard.
Rishabh Dave [Tue, 18 Feb 2025 12:30:03 +0000 (18:00 +0530)]
qa/cephfs: ignore warning that pg is stuck peering for upgrade jobs
Health warning "pg .* is stuck peering" is seen while Ceph cluster is
under the upgrade process during fs/upgrade QA job. Being an expected
warning, it should be added to the ignorelist.
And besides this one, we already ignore more severe warnings ("pg is
stuck inactive" and "pg is degrarded") for fs/upgrade jobs.
Fixes: https://tracker.ceph.com/issues/70023 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 9748de76e02254c6dc284dcc20ec5d5761760dcb)
Conflicts:
qa/cephfs/overrides/pg_health.yaml
- Line before the point where the patch was to be applied is different
comapred to main branch.
In statfs, when the quota root for a dir is discovered,
it uses that dir to base values for max_files and max_bytes.
This can be an issue when a dir is found with only one of two potential quota
fields. Take for instance, a dir with only max_files set and parent dir
has only max_bytes set. During a statfs call, it will then use the max_files
value for provided dir, but does not have a value for max_bytes. In this case,
this behavior will cause the size of the filesystem to be displayed.
Instead, find the quota root for max_files and max_bytes separately. This will
allow for mixed quotas to inherit missing values from its parent. In the above
example, max_files from current dir and max_bytes from parent dir will be
displayed.
Fixes: https://tracker.ceph.com/issues/73487 Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit dd02ea9b18502b87ce815eba4286ae3516e334b3)
In cases where there is a single element in a batch_op_map,new_batch_head
is a nullptr, when this is retried at Finisher we'd hit one of the asserts when
dereferencing
In SingletonClient::init(), objecter->start() called before
monc->authenticate(), it makes conns of monc authencated before
monc->authenticate() called if mons reply faster, in this case,
monc will not subsribe monmap/config.
Naveen Naidu [Thu, 5 Dec 2024 04:21:21 +0000 (09:51 +0530)]
qa/workunit: update telemetry quincy/reef workunits with "basic_stretch_cluster" collection
Note, this is not a clean cherry pick. The 4dac20e updated the
`test_telemetry_reef_x.sh` and `test_telemetry_squid_x.sh` upgrade
workunits. These upgrade workunits test the upgrade of a cluster from
reef and squid (X-2) releases to the X version of cluster.
Since we are cherry picking the commit to squid (X release), we would
instead have to update the workunit files of quicy and reef i,e the
(X-2) releases.
DanWritesCode [Mon, 18 Dec 2023 21:09:07 +0000 (16:09 -0500)]
osd: add clear_shards_repaired command
This command will allow us to clear the OSD_TOO_MANY_REPAIRS alert
by setting the shard repair count to 0. This will help in cases where
the alert was a false positive, or a condition that has since cleared
at the disk level. Often, zeroing out the repair count is
better than muting the alert or restarting the OSD.
Fixes: https://tracker.ceph.com/issues/54182 Co-authored-by: David Zafman <dzafman@redhat.com> Signed-off-by: Daniel Radjenovic <dradjenovic@digitalocean.com>
(cherry picked from commit 78d6bfe54c3b9b60fab36a640b1ce77c8f022fa9)
mds: client is evicted when an export subtree task is interrupted
The importer will force open some sessions provided by the exporter but the client does not know about
the new sessions until the exporter notifies it, and the notifications cannot be sent if the exporter
is interrupted. The client does not renew the sessions regularly that it does not know about, so the client
will be evicted by the importer after `session_autoclose` seconds (300 seconds by default).
The sessions that are forced opened in the importer need to be closed when the import process is reversed.
Zhansong Gao [Fri, 26 May 2023 04:20:17 +0000 (12:20 +0800)]
mds: session in the importing state cannot be cleared if an export subtree task is interrupted while the state of importer is acking
The related sessions in the importer are in the importing state(`Session::is_importing` return true) when the state of importer is `acking`,
`Migrator::import_reverse` called by `MDCache::handle_resolve` should reverse the process to clear the importing state if the exporter restarts
at this time, but it doesn't do that actually because of its bug. And it will cause these sessions to not be cleared when the client is
unmounted(evicted or timeout) until the mds is restarted.
The bug in `import_reverse` is that it contains the code to handle state `IMPORT_ACKING` but it will never be executed because
the state is modified to `IMPORT_ABORTING` at the beginning. Move `stat.state = IMPORT_ABORTING` to the end of import_reverse
so that it can handle the state `IMPORT_ACKING`.
Casey Bodley [Fri, 3 Oct 2025 16:24:18 +0000 (12:24 -0400)]
rgw: fix 'bucket rm --bypass-gc' for copied objects
the `--bypass-gc` argument to `radosgw-admin bucket rm` causes us to
call `RadosBucket::remove_bypass_gc()`, which loops over the tail
objects and removes each with `RGWRados::delete_raw_obj_aio()`
however, this was removing the objects with `cls_rgw_remove_obj()`,
which is for head objects, not tails. tail objects must be removed with
`cls_refcount_put()`, which preserves them until the last copy is
removed
rename `delete_raw_obj_aio()` to `delete_tail_obj_aio()` to clarify its
purpose
Nitzan Mordechai [Wed, 22 Oct 2025 05:41:56 +0000 (05:41 +0000)]
tasks/cbt_performance: Tolerate exceptions during performance data updates
If an exception occurs during the POST request to update CBT performance,
log the error instead of failing the entire job. This ensures that
intermittent update failures do not block the main workflow.
The unlink subcommand did not handle unsharded bucket indices
appropriately. These are when the number of shards listed in the
bucket instance object is 0. In that case there will actually be 1
shard.
When number of shards as 0 is passed into the function that maps
object names to shards, it returns -1. And that was not handled
properly. That is now fixed.
Henry Richter [Wed, 8 Oct 2025 23:00:34 +0000 (01:00 +0200)]
rgw: asio/beast add ssl hot-reload
Adds the `ssl_reload` config option to the beast frontend.
This sets an interval in seconds to periodically reload the ssl context to pick up changes without restarting. It can be disabled (default) be setting it to `0`.
Nitzan Mordechai [Tue, 10 Dec 2024 09:04:34 +0000 (09:04 +0000)]
msg/async: race condition between reset_recv_state and shutdown_connections
when shutting down monitors and valgrind is involved, we can,
sometimes, to hit race condition and locks that causing the shutdown
process to hang for a long time.
reset_recv_state - issuing a message without proper locks that
causing the shutdown to hang during shutdown connection (drain network)