Nizamudeen A [Thu, 7 Mar 2024 08:43:54 +0000 (14:13 +0530)]
mgr/dashboard: disable applitools e2e
Temporarily disabling this so the CI can turn green. Meanwhile, I'll
research a proper way to handle the applitools e2es, which I'll track
at https://tracker.ceph.com/issues/64783
Zac Dover [Wed, 13 Mar 2024 12:04:35 +0000 (22:04 +1000)]
doc/dev: backport zipapp docs to reef
Backport the docs changes in https://github.com/ceph/ceph/pull/54173 to
the Reef release branch. This was not previously done because the docs
changes in PR#54173 were bundled with code changes.
Adam King [Mon, 6 Nov 2023 16:19:09 +0000 (11:19 -0500)]
qa/cephadm: adjust host drain test to handle explicit placement warning
Since we're adding a warning when a host being removed is listed
explicitly in the placement of any service, we need to adjust the
host drain test that removes a host without the --force flag so that
the mon service placement no longer names that host explicitly.
For example, if a service placement explicitly lists host3 and you
then run `ceph orch host drain host3`, cephadm will remove the daemon
from that host and the placement will then match nothing.
This is an issue that should be bypassable, since it generally isn't
serious, but it's good to let users know they have the host listed
explicitly in placements like this when they want to drain it.
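As an illustration, a minimal sketch of the kind of check such a
warning implies; the helper name and spec structure below are
assumptions, not cephadm's actual implementation:
```
# Hypothetical sketch only: warn when a host being drained is named
# explicitly in any service placement.
def explicit_placement_warnings(hostname, service_specs):
    warnings = []
    for spec in service_specs:
        explicit_hosts = getattr(spec.placement, 'hosts', None) or []
        if any(getattr(h, 'hostname', h) == hostname for h in explicit_hosts):
            warnings.append(
                f'host {hostname} is listed explicitly in the placement '
                f'of service {spec.service_name()}; draining it will leave '
                f'that placement matching fewer (or no) hosts'
            )
    return warnings
```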
Adam King [Fri, 29 Sep 2023 20:09:48 +0000 (16:09 -0400)]
qa/cephadm: add teuthology test for host draining
This was a gap in our testing in general, but I'm adding it here
right now specifically to test the "--rm-crush-entry" flag in a
follow-up commit.
Adam King [Fri, 29 Sep 2023 18:39:10 +0000 (14:39 -0400)]
mgr/cephadm: add --rm-crush-entry flag to host removal
This tells cephadm to try to remove the crush bucket for the host at
the end of the host removal process. If this fails, we still consider
the host as having been successfully removed from cephadm's POV, but
the user will get back an error message telling them we failed to
remove the host from the crush map.
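A rough sketch of the behavior described above; the surrounding
removal step and names are assumptions, only `osd crush remove` is
the real mon command:
```
# Illustrative sketch: attempt to remove the host's crush bucket after the
# host itself has been removed; a failure is reported, not treated as fatal.
def remove_host(mgr, hostname: str, rm_crush_entry: bool = False) -> str:
    mgr.inventory.rm_host(hostname)  # hypothetical stand-in for the real removal steps
    if rm_crush_entry:
        try:
            mgr.check_mon_command({
                'prefix': 'osd crush remove',
                'name': hostname,
            })
        except Exception as e:
            return (f'Removed host {hostname}, but failed to remove the '
                    f'crush bucket: {e}')
    return f'Removed host {hostname}'
```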
Adam King [Thu, 15 Feb 2024 14:24:23 +0000 (09:24 -0500)]
qa/cephadm: don't test certain workunits with agent
There are a handful of workunits that either don't work with or don't
make sense with the agent. The test for the cephadm timeout only
works if the mgr directly runs ceph-volume inventory, which it won't
do with the agent present. The adoption test just runs direct cephadm
commands that are irrelevant to the agent. The test_orch_cli tests
rely on refresh timings that differ when the agent is running,
causing spurious failures.
Seena Fallah [Sun, 11 Feb 2024 21:50:05 +0000 (22:50 +0100)]
cephadm: remove restriction for crush device classes
A restriction was introduced in https://github.com/ceph/ceph/commit/6c6cb2f5130dbcf8e42cf03666173948411fc92b that doesn't let OSDs be created with custom crush device classes.
The crush device class is the key that lets CRUSH distinguish between multiple storage classes, so it must accept arbitrary custom names.
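For context, a hedged sketch of the sort of validation being dropped;
the name list and function are illustrative, not the actual code from
the linked commit:
```
# Illustrative only: restricting class names to a fixed set rejects custom
# classes, which CRUSH itself accepts, so no such check should be applied.
KNOWN_DEVICE_CLASSES = ('hdd', 'ssd', 'nvme')

def validate_crush_device_class(name: str) -> str:
    # Previously something along these lines would have rejected custom names:
    #     if name not in KNOWN_DEVICE_CLASSES:
    #         raise ValueError(f'unknown crush device class: {name}')
    # Any name is now passed through unchanged.
    return name
```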
Adam King [Wed, 14 Feb 2024 17:02:09 +0000 (12:02 -0500)]
cephadm: rm podman-auth.json if removing last cluster
We have points in rm-cluster where we check that there are no other
clusters on the host. If that is the case, we can also clear
/etc/ceph/podman-auth.json, which gets written out when we log in to
a registry while using podman.
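A minimal sketch of the cleanup, assuming the path and helper shape;
the real rm-cluster code differs:
```
# Sketch: remove podman's registry auth file only when no other Ceph
# clusters remain on the host (path and helper name are assumptions).
import os

def maybe_remove_podman_auth(other_clusters_present: bool) -> None:
    auth_file = '/etc/ceph/podman-auth.json'  # written by `podman login`
    if not other_clusters_present and os.path.exists(auth_file):
        os.remove(auth_file)
```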
Adam King [Sun, 10 Mar 2024 20:42:51 +0000 (16:42 -0400)]
cephadm: create ceph-exporter sock dir if it's not present
Since this is usually /var/run/ceph/, which ends up getting created
by other daemons as well, it was common to see ceph-exporter fail to
deploy and then deploy fine later once other daemons had been
deployed on the host. I don't see any reason we can't just try to
make the directory here instead of bailing out.
This patch had to be rewritten for reef, as it depended on
changes in cephadm that will not be backported to reef.
Fixes: https://tracker.ceph.com/issues/64491
Signed-off-by: Adam King <adking@redhat.com>
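In essence the fix amounts to something like the following sketch
(the directory default and helper name are assumptions):
```
# Sketch: create the ceph-exporter socket directory if it's missing instead
# of failing the deployment when it's absent.
import os

def ensure_sock_dir(sock_dir: str = '/var/run/ceph/') -> None:
    os.makedirs(sock_dir, exist_ok=True)
```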
Paul Cuzner [Thu, 21 Dec 2023 01:12:45 +0000 (20:12 -0500)]
orchestrator: Add summary line to orch device ls
This patch just adds a summary line to the plain-text output of orch
device ls when the --summary switch is given. This helps to quickly
understand your device counts when managing hosts with many devices.
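A hedged sketch of what such a summary line could compute; the field
names are assumptions, not the orchestrator's actual data model:
```
# Illustrative sketch: one trailing line summarizing device counts across hosts.
def device_summary_line(hosts) -> str:
    total = sum(len(h.devices) for h in hosts)
    available = sum(1 for h in hosts for d in h.devices if d.available)
    return f'{total} device(s) found across {len(hosts)} host(s); {available} available'
```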
Adam King [Mon, 27 Nov 2023 20:04:42 +0000 (15:04 -0500)]
python-common/drive_selection: fix limit with existing devices
When devices have already been used for OSDs, they are still
allowed to pass filtering as they are still needed for the
resulting ceph-volume lvm batch command. This was causing an
issue with limit, however. Limit counts both the devices we've found
that match the filter and the existing OSD daemons tied to the spec.
This allows double counting of devices that have already been used
for OSDs, as they are counted both as existing devices and as devices
that match the filter. To avoid this issue, devices should only be
counted towards the limit if they are not already part of an OSD.
An additional note: The limit feature is only applied for
data devices, so there is no need to worry about the effect
of this change on selection of db, wal, or journal devices.
Also, we would still not want to count these devices if they ended up
passing the data device filter but had previously been used for a
db/wal/journal device.
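A minimal sketch of the adjusted limit check, with hypothetical names
rather than the actual DriveSelection internals:
```
# Hypothetical sketch: a device only counts toward the spec's limit if it
# isn't already part of an existing OSD.
def limit_reached(limit, matching_devices, paths_already_used_by_osds) -> bool:
    if not limit:
        return False
    new_devices = [d for d in matching_devices
                   if d.path not in paths_already_used_by_osds]
    return len(new_devices) >= limit
```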
When no `service_id` is provided in the (osd) service spec, the OSDs
are created with the "osdspec_affinity" attribute set to the string
"None".
The DriveSelection class relies on comparing the actual value of this
attribute with the value of the service_id, which is the Python value
`None` in that case.
If any existing deployments were created without the service_id
attribute, we now have to support this case and make sure the check
won't filter out devices unexpectedly.
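A hedged sketch of the comparison being made tolerant; the function
and normalization below are illustrative, not the DriveSelection code
itself:
```
# Sketch: existing OSDs from such deployments carry osdspec_affinity == 'None'
# (a string), while the spec's service_id is the Python value None.
def affinity_matches(osd_affinity, service_id) -> bool:
    if osd_affinity in (None, '', 'None'):
        # Treat the literal string 'None' like an unset affinity so devices
        # from older deployments aren't filtered out unexpectedly.
        return service_id in (None, '')
    return osd_affinity == str(service_id)
```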
mgr/dashboard: discovery service (port 8765) fails on ipv6 only clusters
With ms_bind_ipv4=false and ipv6=true, the code that the Ceph
dashboard runs for the discovery service (port 8765) fails because it
requests the address of the mgr container, which returns an IPv6
address while the mgr code expects an IPv4 address.
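One way to make such address handling family-agnostic is sketched
below; the helper is an assumption, not the dashboard's actual fix:
```
# Sketch: build a host:port target that also works for IPv6 literals by
# wrapping them in brackets.
import ipaddress

def format_addr(host: str, port: int) -> str:
    try:
        if ipaddress.ip_address(host).version == 6:
            return f'[{host}]:{port}'
    except ValueError:
        pass  # not a literal IP address (e.g. a hostname)
    return f'{host}:{port}'
```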
Adam King [Wed, 18 Oct 2023 18:00:05 +0000 (14:00 -0400)]
mgr/cephadm: update timestamp on repeat daemon/service events
If you have a daemon/service event and then an identical event
happens later (e.g. the same daemon is redeployed multiple times),
the events are not updated for the repeat instances. In cases like
this I think it makes more sense to update the timestamp so users can
see the most recent time the event happened.
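A minimal sketch of the idea, with an assumed in-memory event
structure rather than cephadm's actual event store:
```
# Sketch: when an identical event recurs, refresh its timestamp instead of
# leaving the old entry untouched.
from datetime import datetime, timezone

def record_event(events: list, entity: str, level: str, message: str) -> dict:
    for ev in events:
        if (ev['entity'], ev['level'], ev['message']) == (entity, level, message):
            ev['created'] = datetime.now(timezone.utc)  # update repeat instance
            return ev
    ev = {'entity': entity, 'level': level, 'message': message,
          'created': datetime.now(timezone.utc)}
    events.append(ev)
    return ev
```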
Adam King [Tue, 9 May 2023 19:06:41 +0000 (15:06 -0400)]
mgr/cephadm: make jaeger-collector urls a dep for jaeger-agent
The jaeger-agents need to know the URL for the collector(s) that have
been deployed. If a collector moves, or we deployed the agents before
the collector, we need to reconfigure the agents with updated info
about the collectors. Failure to do so can leave the jaeger-agents
down, reporting
```
Could not create collector proxy","error":"at least one collector hostPort address is required when resolver is not available"
```
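Conceptually, the dependency list for a jaeger-agent now includes the
deployed collectors, roughly as in this hedged sketch (the cache
accessor is an assumption):
```
# Sketch: derive the jaeger-agent's deps from the currently deployed
# collectors so a change in collectors triggers a reconfig of the agents.
def jaeger_agent_deps(daemon_cache) -> list:
    deps = []
    for dd in daemon_cache.get_daemons_by_type('jaeger-collector'):
        if dd.hostname and dd.ports:
            deps.append(f'{dd.hostname}:{dd.ports[0]}')
    return sorted(deps)
```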
Zac Dover [Thu, 7 Mar 2024 03:01:47 +0000 (13:01 +1000)]
doc/start: add Slack invite link
Add a link to the ceph-storage Slack invitation page. Previously the
link went to a plain old "this is the ceph-storage Slack" page that did
not direct the reader to sign up.
Although this seems to be a bug in RedFish, we need to handle the
case when it happens; otherwise it makes the mgr orchestrator module
throw an error.
The idea here is to create a new status, "unknown", for when we can't
fetch the real status of a component.
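A minimal sketch of the fallback, assuming a dict-like RedFish
payload; the real module's data model differs:
```
# Sketch: report 'unknown' when a component's real status can't be fetched.
def component_health(component: dict) -> str:
    status = component.get('Status') or {}
    return status.get('Health') or 'unknown'
```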
Adam King [Mon, 26 Jun 2023 20:42:52 +0000 (16:42 -0400)]
python-common/service_spec: add extra_entrypoint_args to CephExporter Spec
Similar to the mon, there's no reason for Ceph Exporter in particular
not to have this; it's just missing because of the timing of when it
was merged in.
Adam King [Mon, 19 Jun 2023 20:07:31 +0000 (16:07 -0400)]
mgr/cephadm: add extra_entrypoint_args to mon spec
There was no reason for the mon spec to not include
this option. I believe this was just an oversight caused
by the addition of the mon spec and extra_entrypoint_args
in separate PRs around the same time.
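For illustration, a hedged example of a mon spec carrying the option;
the specific argument is made up, and constructor details may vary
between releases:
```
# Example sketch: a mon service spec that passes extra_entrypoint_args,
# built from the documented service-spec JSON/YAML fields.
from ceph.deployment.service_spec import ServiceSpec

spec = ServiceSpec.from_json({
    'service_type': 'mon',
    'placement': {'count': 3},
    'extra_entrypoint_args': ['--illustrative-extra-arg=value'],
})
```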