In this commit, I added pattern matching for warnings related to
degraded PGs that show up in the cluster log detail. In these tests,
we are intentionally restarting OSDs to upgrade them, which leads to
these states showing up in the cluster log. So, the warnings are
expected and can be ignored in the context of an upgrade.
Fixes: https://tracker.ceph.com/issues/67881
Signed-off-by: Laura Flores <lflores@ibm.com>
Laura Flores [Fri, 7 Feb 2025 21:53:21 +0000 (15:53 -0600)]
qa/tasks: improve ignorelist for thrashing OSDs
This yaml file is used in rados/thrash-old-clients.
In this commit, I added pattern matching for warnings related to
degraded PGs that show up in the cluster log detail. In these tests,
we are intentionally marking down or killing OSDs, which leads to
these states showing up in the cluster log. So, the warnings are
expected and can be ignored in the context of OSD thrashing.
Fixes: https://tracker.ceph.com/issues/67913
Signed-off-by: Laura Flores <lflores@ibm.com>
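For illustration, the matching both ignorelist commits rely on can be sketched
as below. The regex patterns are examples only (PG_DEGRADED, PG_AVAILABILITY
and OSD_DOWN are standard Ceph health check names); the actual yaml entries
added by these commits may differ.

    import re

    # Example patterns only; the real entries live in the qa suite yaml.
    IGNORELIST = [
        r"\(PG_DEGRADED\)",
        r"\(PG_AVAILABILITY\)",
        r"\(OSD_DOWN\)",
    ]

    def is_ignored(cluster_log_line: str) -> bool:
        """Return True if a cluster log warning matches an ignorelist pattern."""
        return any(re.search(pattern, cluster_log_line) for pattern in IGNORELIST)

    # e.g. is_ignored("Health check failed: Degraded data redundancy (PG_DEGRADED)") -> True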
* refs/pull/61562/head:
qa: remove redundant and broken test
mds: skip scrubbing damaged dirfrag
tools/cephfs/DataScan: test equality of link including frag
tools/cephfs/DataScan: skip linkages that have been removed
tools/cephfs/DataScan: do not error out when failing to read a dentry
tools/cephfs/DataScan: create all ancestors during scan_inodes
tools/cephfs/DataScan: cleanup debug prints
qa: remove old MovedDir test
qa: add data scan tests for ancestry rebuild
qa: make the directory non-empty to force migration
qa: avoid unnecessary mds restart
John Mulligan [Tue, 20 Aug 2024 19:01:05 +0000 (15:01 -0400)]
src/script: add a script to help build ceph using containers
The build-with-container script tries to encapsulate nearly all major
build tasks using docker/podman containers. If there's no build image
locally, it will create one for you. It provides targets for building
(make), testing (make check), and building rpm or deb packages, and it
is designed to be fairly easy to extend.
View the comment at the top of the source file for usage details.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Tue, 20 Aug 2024 19:00:57 +0000 (15:00 -0400)]
build: add files needed to create a build container
A build container contains all the tools and dependencies needed to
build ceph. This commit provides a Containerfile and a small script that
helps bootstrap the container setup. The script installs a few extra
things we need before farming most of the work out to install-deps.sh.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Sat, 14 Sep 2024 10:31:23 +0000 (06:31 -0400)]
build: small script tweak to allow different build dirs
Move the mkdir line to allow for other build dir naming schemes outside
of what appears in the .gitignore file. A tiny bit of added flexibility
at little cost.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Mon, 14 Nov 2022 15:57:25 +0000 (10:57 -0500)]
src/script: add helper function has_build_dir
This function returns successfully if $BUILD_DIR exists and is valid.
This is a useful building block for automation around the build and
can be used to avoid re-running commands that fail if the build dir
already exists.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Adam King [Wed, 29 Jan 2025 20:48:53 +0000 (15:48 -0500)]
mgr/cephadm: continue in nfs service purge if grace file is already deleted
The test_nfs task we run in teuthology creates and removes a number of
nfs clusters during the task. I think it's possible based on timing for
it to end up in a situation where it tries to remove an nfs service before
the grace file has been created. In that case, cephadm doesn't know it
hasn't created the grace file and just repeatedly fails forever attempting
to remove the nonexistent file. This patch adds handling for the error
case where we get a nonzero rc but the error message implies the command
failed because the file already does not exist.
Fixes: https://tracker.ceph.com/issues/69736
Signed-off-by: Adam King <adking@redhat.com>
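A minimal sketch of the error-tolerant removal described above, assuming a
hypothetical run_rados_rm() helper and error strings; the actual mgr/cephadm
code and its exact message matching differ.

    # Sketch only; run_rados_rm() and the matched error text are assumptions.
    def remove_grace_file(run_rados_rm, grace_obj: str) -> None:
        rc, err = run_rados_rm(grace_obj)
        if rc != 0:
            if "No such file or directory" in err or "does not exist" in err:
                # The grace object was never created, so there is nothing to
                # purge; treat this as success instead of retrying forever.
                return
            raise RuntimeError(f"failed to remove {grace_obj}: {err}")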
Vallari Agrawal [Tue, 4 Feb 2025 07:50:18 +0000 (13:20 +0530)]
qa/suites/nvmeof: use SCALING_DELAYS: '120'
Increase the delays for qa/workunits/nvmeof/scalability_test.sh,
as namespace rebalancing takes more time. After upscaling, a
gateway may initially be in the 'CREATED' state, which is valid
during gateway initialization, but the state should then progress
to 'AVAILABLE' within a couple of seconds.
Kushal Deb [Fri, 29 Nov 2024 08:38:51 +0000 (14:08 +0530)]
cephadm: Add pre_remove and ensure deployment values are reset and API settings are updated when removing Prometheus or Alertmanager daemons
This fixes an issue where the dashboard API settings were not updated
properly when the active Prometheus or Alertmanager daemon was removed.
With this change, when the active daemon is removed, the settings are
reconfigured to point to a remaining daemon, or reset if no daemons are
available. This avoids dashboard errors like "404 Not Found" caused by
stale API host settings.
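A minimal sketch of that reconfiguration logic, assuming hypothetical
set_dashboard_api_host() and reset_dashboard_api_host() helpers; the real
cephadm pre_remove hook is structured differently.

    # Illustrative only; the helper names below are assumptions.
    def pre_remove(daemon, remaining_daemons, set_dashboard_api_host, reset_dashboard_api_host):
        """Keep the dashboard API host setting valid when a Prometheus or
        Alertmanager daemon is removed."""
        survivors = [d for d in remaining_daemons if d is not daemon]
        if survivors:
            # Point the dashboard at one of the remaining daemons.
            set_dashboard_api_host(survivors[0])
        else:
            # No daemons left: reset the setting so the dashboard does not
            # keep calling a stale host ("404 Not Found").
            reset_dashboard_api_host()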
Zac Dover [Sat, 1 Feb 2025 21:50:07 +0000 (07:50 +1000)]
doc/cephadm: clarify "Monitoring OSD State"
Change "Remove an OSD" to "Monitoring OSD State During OSD Removal" and
reword a sentence so that it more clearly refers to the process under
discussion.
Let ReplicatedRecoveryBackend::handle_recovery_op route pushes
to handle_push or handle_pull_response, instead of doing that
routing inside ReplicatedRecoveryBackend::handle_push.
Adam King [Wed, 29 Jan 2025 17:02:50 +0000 (12:02 -0500)]
mgr/cephadm: add Server_Scope = <fsid> to NFSv4 section of ganesha conf
From the ganesha team
"""
In the NFSv4 param block, we need a parameter Server_Scope set to some value common among all servers in a cluster.
The default with it blank is to use the hostname which may be different for each server in the cluster.
"""
This is related to ongoing work on high availability nfs. From the cephadm side
we just need to make sure all nfs daemons in the cluster end up with
the same value for the Server_Scope field. This patch uses the cluster
id (which we already brought into the template as the "namespace" attribute)
as that common value.
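As a sketch of the intended result (not the actual cephadm template), the
rendered NFSv4 block carries the cluster id as the shared scope, roughly:

    # Hypothetical rendering helper; the real template lives in mgr/cephadm
    # and its exact contents differ.
    NFSV4_BLOCK = 'NFSv4 {{\n    Server_Scope = "{scope}";\n}}\n'

    def render_nfsv4_block(cluster_id: str) -> str:
        # Every daemon in the cluster renders the same cluster_id here,
        # so they all share one Server_Scope value.
        return NFSV4_BLOCK.format(scope=cluster_id)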
Ilya Dryomov [Thu, 30 Jan 2025 19:30:18 +0000 (20:30 +0100)]
doc/rbd: use https links in live import examples
Even though it's explicitly said that the "http" stream can be used to
import via both HTTP and HTTPS, it can still be confusing that "type":
"http" is expected to go with "url": "https://...". Switch the example
URLs from HTTP to HTTPS to make it more obvious.
Adam King [Thu, 30 Jan 2025 14:15:37 +0000 (09:15 -0500)]
mgr/cephadm: create OSD daemon deploy specs through make_daemon_spec
That function handles setting up the extra container/entrypoint
args for the daemon during initial deployment. Constructing the
CephadmDaemonDeploySpec directly in the OSD deployment
workflow means initial deployments of OSDs won't have the
extra container/entrypoint args from the spec.
Fixes: https://tracker.ceph.com/issues/69734
Signed-off-by: Adam King <adking@redhat.com>
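A purely illustrative sketch of the difference; the types and signatures
below are assumptions, not the real mgr/cephadm classes.

    from dataclasses import dataclass, field
    from typing import List

    # Stand-in types for illustration only.
    @dataclass
    class ServiceSpec:
        extra_container_args: List[str] = field(default_factory=list)
        extra_entrypoint_args: List[str] = field(default_factory=list)

    @dataclass
    class DeploySpec:
        host: str
        daemon_id: str
        extra_container_args: List[str] = field(default_factory=list)
        extra_entrypoint_args: List[str] = field(default_factory=list)

    def make_daemon_spec(spec: ServiceSpec, host: str, daemon_id: str) -> DeploySpec:
        # This is the step that copies the extra args from the service spec
        # onto the per-daemon deploy spec.
        return DeploySpec(host=host, daemon_id=daemon_id,
                          extra_container_args=list(spec.extra_container_args),
                          extra_entrypoint_args=list(spec.extra_entrypoint_args))

    # Building DeploySpec(host, daemon_id) directly, as the OSD path used to,
    # skips that copy, so deployed OSDs miss the extra args.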
When the NBD server is killed, nbd_pread() can set errno to at least
ENOTCONN, EINVAL and 0, which is supposed to stand for "no additional
errno information is available for this error". Add a test to ensure
that the "rbd migration execute" command always fails and that the image
isn't transitioned to MIGRATION_STATE_EXECUTED in this scenario.
Ilya Dryomov [Wed, 29 Jan 2025 11:56:34 +0000 (12:56 +0100)]
librbd: stop filtering async request error codes
The roots of this go back to 2015 when snap create was changed to
filter EEXIST in commit 63f6c9bac9a4 ("librbd: fixed snap create race
conditions") and flatten to filter EINVAL in commit ef7e210c3f74
("librbd: better handling for duplicate flatten requests"). From there
this pattern made it to most other operations that can be proxied
including "rbd migration execute".
The motivation was to suppress generation of an "expected" error in
response to a duplicate async request notification for the operation.
However, doing this at the top of the handler (right before returning
to the caller) and for an error as generic as EINVAL is super fragile.
It's trivial for an error that is being filtered to sneak in with
a lower level change completely unnoticed. For example, live migration
recently added NBD stream which is implemented on top of libnbd and it
turns out that some libnbd APIs return EINVAL on various occasions when
the NBD endpoint disappears and an error like ENOTCONN would make more
sense. If this occurs during "rbd migration execute" operation, the
rest of librbd never learns that migration was disrupted and the image
is transitioned to MIGRATION_STATE_EXECUTED, thus handing a partially
imported (read: corrupted) image to the user.
Luckily, with commits 07fbc4b71df4 ("librbd: track complete async
operation requests") and 96bc20445afb ("librbd: track complete async
operation return code"), the scenario which originally prompted error
code filtering isn't an issue anymore. Despite a few shortcomings
(e.g. when an async request notification is acked with result 0, it's
impossible to tell whether a) a new operation was kicked off, b) there
is an operation that is still in progress or c) it's for an operation
that completed earlier but hasn't "expired" yet), even just commit 07fbc4b71df4 by itself prevents a duplicate notification from kicking
off a second operation that could generate an error for something that
actually succeeded. With that in mind, eradicate error code filtering
from Operations class.
Vallari Agrawal [Tue, 28 Jan 2025 09:18:15 +0000 (14:48 +0530)]
qa/tasks/nvmeof.py: Fix do_checks() method
All checks currently run on the initiator node. Now
run all "ceph" commands on one of the gateway hosts
instead of the initiator nodes, and run the "nvme list"
and "nvme list-subsys" checks on the initiator node.
Add a retry (5 times) to do_checks if any command fails.
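A minimal sketch of the retry behaviour, assuming commands are run with
subprocess; the actual qa/tasks/nvmeof.py code uses the teuthology
remote-execution helpers instead.

    import subprocess
    import time

    # Sketch only: retry a check command up to 5 times before giving up.
    def run_check(cmd, retries=5, delay=5):
        for attempt in range(1, retries + 1):
            try:
                return subprocess.run(cmd, check=True, capture_output=True, text=True)
            except subprocess.CalledProcessError:
                if attempt == retries:
                    raise
                time.sleep(delay)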