Adam King [Wed, 13 Mar 2024 19:30:25 +0000 (15:30 -0400)]
mgr/cephadm: refresh public_network for config checks before checking
The place it was being run before meant it would only grab the
public_network setting once at startup of the module. This meant
if a user changed the setting, which they are likely to do if they
get the warning, cephadm would ignore the change and continue
reporting that the hosts don't match up with the old setting
for the public_network. This moves the call to refresh the
setting to right before we actually run the checks. It does
mean we'll do the `ceph config dump --format json` call
each serve loop iteration, but I've found that only tends
to take a few milliseconds, which is nothing compared to
the time to refresh other things we check during the serve
loop.
I additionally modified the use of this option to use
the attribute on the mgr, rather than calling
`get_module_option`. This was just to get it more in
line with how we tend to handle other config options.
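A minimal, hypothetical sketch of the reordering (class, method and
attribute names here are illustrative, not the actual mgr/cephadm code):

    class ConfigChecks:
        def __init__(self, mgr):
            self.mgr = mgr
            self.public_network_list = None

        def run_checks(self, hosts):
            # Refresh the setting right before the checks run, on every
            # serve-loop iteration, instead of once at module startup. The
            # underlying 'ceph config dump --format json' call only takes a
            # few milliseconds, which is cheap compared to the other
            # refreshes done in the serve loop.
            self.public_network_list = self.mgr.get_public_networks()
            return [h for h in hosts
                    if not self._host_matches(h, self.public_network_list)]

        def _host_matches(self, host, networks):
            ...  # per-host network membership check elided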
Ramana Raja [Thu, 29 Feb 2024 17:12:19 +0000 (12:12 -0500)]
qa/suites: add diff-continuous and compare-mirror-image tests
... to rbd and krbd suites respectively.
This allows the compare-mirror-image tests introduced in ea3a567
to be run against various kernel branches, e.g., the testing branch,
and allows the diff_continuous test in the rbd suite to run against
the distro kernel.
After some tests, it turns out that, depending on the hardware,
the 'Location' header returned by the server after login can differ.
I noticed the following:
The endpoint passed down to util.query() is wrong:
it passes the full URL (scheme://addr:port/path) when it should only
pass the path. The cause is that RedFishClient.login() basically stores
the value of the Location header in `self.location`.
The consequence of this is that the client is unable to properly log out.
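A small, hedged sketch of the idea behind the fix: keep only the path
component of the 'Location' header so util.query() is handed a path
rather than a full URL (the helper name is hypothetical, not the actual
node-proxy code):

    from urllib.parse import urlparse

    def location_to_endpoint(location_header: str) -> str:
        # some BMCs return just '/redfish/v1/SessionService/Sessions/1',
        # others a full 'scheme://addr:port/path' URL; keep only the path
        return urlparse(location_header).path

    assert location_to_endpoint(
        '/redfish/v1/SessionService/Sessions/1'
    ) == '/redfish/v1/SessionService/Sessions/1'
    assert location_to_endpoint(
        'https://10.0.0.1:443/redfish/v1/SessionService/Sessions/1'
    ) == '/redfish/v1/SessionService/Sessions/1'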
Venky Shankar [Mon, 4 Mar 2024 13:23:53 +0000 (18:53 +0530)]
mds: disable `defer_client_eviction_on_laggy_osds` by default
This config can result in a single client holding up the MDS from
servicing other clients: once a client's eviction is deferred due to
laggy OSD(s), a new client's cap acquire request can be blocked until
the laggy client resumes operation, i.e., when the laggy OSD is no
longer considered laggy.
Disable the config by default until the issue is fixed.
Ramana Raja [Thu, 25 May 2023 16:48:12 +0000 (16:48 +0000)]
qa: Add tests to validate syncing of images using rbd-mirror
Introduce functional tests to validate that the images under
workloads are correctly mirrored between two clusters using snapshot
based mirroring.
Run workload on a primary image using a krbd or nbd client. Take
mirror snapshots of the image under workload. Unmount the mapped image
and calculate its MD5 checksum before demoting it. After demotion,
wait for the mirror status of the image to be 'up+unknown' in both
the clusters. This is to make sure that the non-primary image in the
other cluster is ready to be promoted. Now promote the non-primary
image in the other cluster. Map the promoted image and calculate its
MD5 checksum. Verify that the checksums of the demoted and promoted
images in the two clusters are the same.
The above test is run as part of two different workunits:
- a workunit that validates the syncing of multiple mirrored images
with workloads running on them
- another workunit that validates the syncing of a single mirrored
image with a workload running on it, where the image is alternately
set as primary in each of the two clusters, as happens during
failover and failback scenarios.
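A rough, hypothetical Python sketch of that flow (the actual workunit is
a shell script under qa/workunits/rbd/; the cluster names 'site-a' and
'site-b' and the 'pool/image' name are assumptions for illustration):

    import hashlib
    import subprocess

    def rbd(cluster, *args):
        # thin wrapper around the rbd CLI for a given cluster
        return subprocess.run(['rbd', '--cluster', cluster, *args],
                              check=True, capture_output=True, text=True).stdout

    def md5_of_device(dev, chunk=4 * 1024 * 1024):
        h = hashlib.md5()
        with open(dev, 'rb') as f:
            for block in iter(lambda: f.read(chunk), b''):
                h.update(block)
        return h.hexdigest()

    # primary cluster: workload stopped and filesystem unmounted, image
    # still mapped at /dev/rbd/pool/image; snapshot, checksum, then demote
    rbd('site-a', 'mirror', 'image', 'snapshot', 'pool/image')
    src_md5 = md5_of_device('/dev/rbd/pool/image')
    rbd('site-a', 'mirror', 'image', 'demote', 'pool/image')

    # wait until 'rbd mirror image status' reports 'up+unknown' on both
    # clusters (polling elided), then promote, map and verify on the peer
    rbd('site-b', 'mirror', 'image', 'promote', 'pool/image')
    dev = rbd('site-b', 'device', 'map', 'pool/image').strip()
    assert md5_of_device(dev) == src_md5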
Fixes: https://tracker.ceph.com/issues/61617
Signed-off-by: Ramana Raja <rraja@redhat.com>
Co-authored-by: Ilya Dryomov <idryomov@redhat.com>
Co-authored-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit b7aae5c3c5a1dd24c4cb7ceb499292af00bae680)
Cherry-pick notes:
- In qa/workunits/rbd/compare_mirror_images.sh, replace
`wait_for_replaying_status_in_pool_dir` with `wait_for_status_in_pool_dir`
Commit 3fd8a03, which added `wait_for_replaying_status_in_pool_dir`,
was not backported.
Adam King [Fri, 16 Feb 2024 16:24:32 +0000 (11:24 -0500)]
mgr/cephadm: catch CancelledError in asyncio timeout handler
Specifically, concurrent.futures.CancelledError. At least on
Python 3.9, this error can be raised when certain commands
being run asynchronously fail. Not catching this results in
the whole cephadm module crashing with something like:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 94, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 267, in refresh
    r = self._refresh_facts(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 370, in _refresh_facts
    val = self.mgr.wait_async(self._run_cephadm_json(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 671, in wait_async
    return self.event_loop.get_result(coro, timeout)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 64, in get_result
    return future.result(timeout)
  File "/lib64/python3.9/concurrent/futures/_base.py", line 444, in result
    raise CancelledError()
concurrent.futures._base.CancelledError
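A small, hedged sketch (assumed helper shape, not the actual ssh.py code)
of catching the CancelledError alongside the timeout so it surfaces as a
handled error instead of crashing the module:

    import asyncio
    import concurrent.futures

    def get_result(loop: asyncio.AbstractEventLoop, coro, timeout: int):
        # the coroutine runs on a dedicated event loop in another thread
        future = asyncio.run_coroutine_threadsafe(coro, loop)
        try:
            return future.result(timeout)
        except (concurrent.futures.TimeoutError,
                concurrent.futures.CancelledError) as e:
            # a cancelled command is reported the same way as a timed-out
            # one rather than propagating and taking the module down
            raise TimeoutError(f'command timed out or was cancelled: {e!r}')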
Zac Dover [Wed, 13 Mar 2024 12:04:35 +0000 (22:04 +1000)]
doc/dev: backport zipapp docs to reef
Backport the docs changes in https://github.com/ceph/ceph/pull/54173 to
the Reef release branch. This was not previously done because the docs
changes in PR#54173 were bundled with code changes.
Adam King [Mon, 6 Nov 2023 16:19:09 +0000 (11:19 -0500)]
qa/cephadm: adjust host drain test to handle explicit placement warning
Since we're adding a warning if any host is listed explicitly
in the placement of any service when removing the host,
we need to adjust the host drain test that removes a host
without the --force flag to not have the explicit hostname
in the placement for the mon service.
For example, if a service's placement explicitly lists host3 and
you then run `ceph orch host drain host3`, cephadm will remove
the daemon from that host and the placement will no longer match
anything. This is definitely an issue that users should be able
to bypass, as it generally isn't serious, but it is good to let
them know they have the host listed explicitly in placements like
this when they want to drain it.
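A hypothetical sketch of the kind of check behind the warning (data
structures and names are illustrative, not cephadm's actual API):

    def explicit_placement_warnings(host: str, specs: dict) -> list:
        # warn for every service whose placement explicitly names the host
        warnings = []
        for service_name, spec in specs.items():
            if host in spec.get('placement', {}).get('hosts', []):
                warnings.append(
                    f'host {host} is listed explicitly in the placement of '
                    f'{service_name}; removing it will leave that placement '
                    f'matching fewer (or no) hosts')
        return warnings

    specs = {'mon': {'placement': {'hosts': ['host1', 'host2', 'host3']}}}
    print(explicit_placement_warnings('host3', specs))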
Adam King [Fri, 29 Sep 2023 20:09:48 +0000 (16:09 -0400)]
qa/cephadm: add teuthology test for host draining
This was a gap in our testing in general, but I'm
adding it here right now specifically to use it
to test the "--rm-crush-entry" flag in a follow-up
commit.
Adam King [Fri, 29 Sep 2023 18:39:10 +0000 (14:39 -0400)]
mgr/cephadm: add --rm-crush-entry flag to host removal
This will tell cephadm to try to remove the
crush bucket for the host at the end of the host
removal process. If this fails, we still consider the
host to have been successfully removed from
cephadm's POV, but the user will get back an error
message telling them we failed to remove the
host from the crush map.
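A hedged sketch of the behaviour described above (the host-removal
helper is hypothetical; only `check_mon_command` mirrors the real mgr
module API):

    def remove_host(mgr, hostname: str, rm_crush_entry: bool = False) -> str:
        mgr.remove_host_from_inventory(hostname)  # hypothetical host removal
        if rm_crush_entry:
            try:
                # roughly the equivalent of 'ceph osd crush remove <hostname>'
                mgr.check_mon_command({'prefix': 'osd crush remove',
                                       'name': hostname})
            except Exception as e:
                # the host still counts as removed from cephadm's POV;
                # only the crush map cleanup failed
                return (f'Removed host {hostname}, but failed to remove it '
                        f'from the crush map: {e}')
        return f'Removed host {hostname}'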
Adam King [Thu, 15 Feb 2024 14:24:23 +0000 (09:24 -0500)]
qa/cephadm: don't test certain workunits with agent
There are a handful of workunits that don't work
with or don't make sense with the agent.
The test for the cephadm timeout only works if
the mgr directly runs ceph-volume inventory which
it won't do with the agent present. The adoption
test is just running direct cephadm commands that
are irrelevant to the agent. The test_orch_cli tests
rely on refresh timings that are different with
the agent running, causing spurious failures.
Seena Fallah [Sun, 11 Feb 2024 21:50:05 +0000 (22:50 +0100)]
cephadm: remove restriction for crush device classes
A restriction was introduced in
https://github.com/ceph/ceph/commit/6c6cb2f5130dbcf8e42cf03666173948411fc92b
that doesn't let OSDs be created with custom crush device classes.
The crush device class is the key that lets CRUSH distinguish between
multiple storage classes, so it must accept any custom name.
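A hypothetical before/after illustration of lifting the restriction
(the allow-list shown is made up for the example):

    def validate_crush_device_class(device_class: str) -> str:
        # before: reject anything outside a fixed set, e.g.
        #     if device_class not in ('hdd', 'ssd', 'nvme'):
        #         raise ValueError(f'invalid device class {device_class}')
        # after: accept any non-empty custom name ('fast-meta', 'archive', ...)
        if not device_class:
            raise ValueError('crush device class must not be empty')
        return device_class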
Adam King [Wed, 14 Feb 2024 17:02:09 +0000 (12:02 -0500)]
cephadm: rm podman-auth.json if removing last cluster
We have points in rm-cluster where we check that
there are no other clusters on the host. If that
is the case, we can also clear /etc/ceph/podman-auth.json,
which gets written out when we log in to a registry
while using podman.
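A minimal sketch of the cleanup, assuming the rest of rm-cluster has
already determined whether any other clusters remain on the host:

    import os

    PODMAN_AUTH = '/etc/ceph/podman-auth.json'

    def maybe_rm_podman_auth(other_clusters_present: bool) -> None:
        # only safe to remove the registry login once no cluster is left
        if not other_clusters_present and os.path.exists(PODMAN_AUTH):
            os.remove(PODMAN_AUTH)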
Adam King [Sun, 10 Mar 2024 20:42:51 +0000 (16:42 -0400)]
cephadm: create ceph-exporter sock dir if it's not present
Since this is usually /var/run/ceph/, which ends up getting
created by other daemons as well, it was common to see
ceph-exporter fail to deploy and then deploy fine later
once other daemons were deployed on the host. I don't see any
reason we can't just try to make the directory here instead
of bailing out.
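A minimal sketch of the idea, assuming the socket directory path is
already known at deploy time:

    import os

    def ensure_sock_dir(sock_dir: str = '/var/run/ceph/') -> None:
        # normally created by other daemons; create it ourselves if
        # ceph-exporter happens to be the first daemon on the host
        os.makedirs(sock_dir, mode=0o755, exist_ok=True)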
This patch had to be rewritten for reef, as it depended on
changes in cephadm that will not be backported to reef.
Fixes: https://tracker.ceph.com/issues/64491
Signed-off-by: Adam King <adking@redhat.com>
Paul Cuzner [Thu, 21 Dec 2023 01:12:45 +0000 (20:12 -0500)]
orchestrator: Add summary line to orch device ls
This patch just adds a summary line to the plain
text output of `orch device ls` when the `--summary`
switch is given. This helps to quickly understand your
device counts when managing hosts with many devices.
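A hypothetical sketch of how such a summary line could be computed from
the device listing (field names are assumptions, not the actual
orchestrator data model):

    def device_summary(devices: list) -> str:
        total = len(devices)
        available = sum(1 for d in devices if d.get('available'))
        return f'{total} devices found ({available} available)'

    print(device_summary([{'available': True}, {'available': False}]))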