Zac Dover [Wed, 13 Mar 2024 17:25:06 +0000 (03:25 +1000)]
doc/cephadm: explain different methods of cephadm delivery
Explain that only in Reef and later releases is cephadm distributed as
an executable compiled from source code. This note is to go into Quincy
and only into Quincy, to direct new users of Ceph whom circumstance has
delivered into the hands of Quincy and who might have the wrong idea
that the documentation of Reef and later releases applies to their
release.
Adam King [Fri, 29 Sep 2023 20:09:48 +0000 (16:09 -0400)]
qa/cephadm: add teuthology test for host draining
This was a gap in our testing in general, but I'm
adding it here right now specifically so it can be used
to test the "--rm-crush-entry" flag in a follow-up
commit.
Adam King [Fri, 29 Sep 2023 18:39:10 +0000 (14:39 -0400)]
mgr/cephadm: add --rm-crush-entry flag to host removal
This will tell cephadm to try to remove the
crush bucket for the host at the end of the host
removal process. If this fails, we still consider the
host to have been successfully removed from
cephadm's POV, but the user will get back an error
message telling them we failed to remove the
host from the crush map.
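A minimal sketch of the best-effort cleanup described above (helper names and the exact message are assumptions, not the actual cephadm code):

    def remove_host(self, hostname: str, rm_crush_entry: bool = False) -> str:
        self._remove_host_from_cephadm(hostname)  # assumed existing removal flow
        if rm_crush_entry:
            try:
                # 'ceph osd crush remove <name>' deletes the host's crush bucket
                self.check_mon_command({'prefix': 'osd crush remove', 'name': hostname})
            except Exception as e:
                return (f'Removed host {hostname}, but failed to remove it '
                        f'from the crush map: {e}')
        return f'Removed host {hostname}'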
Adam King [Wed, 18 Oct 2023 18:00:05 +0000 (14:00 -0400)]
mgr/cephadm: update timestamp on repeat daemon/service events
If you have a daemon/service event and then an identical
event happens later (e.g. the same daemon is redeployed
multiple times) the events are not updated on the repeat
instances. In cases like this I think it makes more
sense to update the timestamp so users can see the most
recent time the event happened.
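Roughly, the de-duplication behaviour described above amounts to the following sketch (the event object and its fields are illustrative):

    import datetime

    def register_event(events: list, new_event) -> None:
        for existing in events:
            if existing.matches(new_event):  # same subject/level/message (assumed helper)
                # identical event seen again: refresh the timestamp so the
                # most recent occurrence is visible to the user
                existing.created = datetime.datetime.now(datetime.timezone.utc)
                return
        events.append(new_event)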
Ronen Friedman [Mon, 22 May 2023 15:09:28 +0000 (18:09 +0300)]
osd/scrub: increasing max_osd_scrubs to 3
Bug reports seem to hint that the current default value of
'1' is too low: the cluster is susceptible to scrub scheduling
delays and issues stemming from local software/networking/hardware
problems, even when these affect only a very small number of OSDs.
Squid will include a major overhaul of the way scrubs are counted
in the cluster, providing a better solution to the problem. For
now, modifying the default is an effective stop-gap measure.
which is what the migration is actually concerned about
(verification of the rgw_frontend_type in these specs).
In the case where the spec is simpler, we should
just leave the spec alone and move on. Unfortunately
the current code assumes the field will always be
there and hits an unhandled KeyError when trying to
migrate the simpler specs. This causes the
cephadm module to crash shortly after starting an
upgrade to a version that includes this migration,
and it's very difficult to find the root cause. This
can be worked around by adding fields to the rgw
spec before upgrade so the "spec" field exists in
the spec and the migration works as intended.
This commit fixes the migration in the simple
case as well as adding testing for that case to
both the unit tests and the orch/cephadm teuthology
upgrade tests.
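A sketch of the defensive handling described above (names are illustrative; the actual migration logic is not reproduced here):

    def migrate_rgw_spec(spec_dict: dict) -> dict:
        if 'spec' not in spec_dict:
            # simple spec with no nested 'spec' section: nothing to verify
            return spec_dict
        frontend_type = spec_dict['spec'].get('rgw_frontend_type')
        if frontend_type is not None:
            # only specs carrying the field need the verification step
            spec_dict['spec']['rgw_frontend_type'] = frontend_type.strip()
        return spec_dict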
Seena Fallah [Sun, 11 Feb 2024 21:50:05 +0000 (22:50 +0100)]
cephadm: remove restriction for crush device classes
A restriction was introduced here (https://github.com/ceph/ceph/commit/6c6cb2f5130dbcf8e42cf03666173948411fc92b) that doesn't let OSDs be created with custom crush device classes.
The crush device class is the key that helps CRUSH distinguish between multiple storage classes, so it must accept arbitrary custom names.
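For illustration, an OSD spec using a custom crush device class of the kind the removed restriction rejected (values are made up):

    osd_spec = {
        'service_type': 'osd',
        'service_id': 'fast_tier',
        'placement': {'host_pattern': '*'},
        'data_devices': {'rotational': 0},
        'crush_device_class': 'nvme-fast',  # custom class name, not just hdd/ssd/nvme
    }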
Adam King [Mon, 27 Nov 2023 20:04:42 +0000 (15:04 -0500)]
python-common/drive_selection: fix limit with existing devices
When devices have already been used for OSDs, they are still
allowed to pass filtering as they are still needed for the
resulting ceph-volume lvm batch command. This was causing an
issue with limit, however. The limit calculation adds together the
devices we've found that match the filter and the existing OSD
daemons tied to the spec. This double counts devices that have been
used for OSDs, as they are counted both as existing devices
and as filter matches. To avoid this issue, devices
should only be counted towards the limit if they are not already
part of an OSD.
An additional note: The limit feature is only applied for
data devices, so there is no need to worry about the effect
of this change on selection of db, wal, or journal devices.
Also, we would still not want to count these devices if they
did end up passing the data device filter but had been used
for a db/wal/journal device previously.
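An illustrative sketch of the corrected accounting (device attributes and names are assumptions):

    def reached_limit(limit, matched_devices, existing_daemon_count):
        # devices already holding an OSD still pass the filter (ceph-volume
        # needs them in the batch command) but must not count toward the limit
        new_devices = [d for d in matched_devices if not d.used_by_ceph]
        return limit is not None and len(new_devices) + existing_daemon_count >= limit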
When no `service_id` is provided in an OSD service spec, the resulting
OSDs are created with their "osdspec_affinity" attribute set to the
string "None".
The DriveSelection class relies on comparing the actual
value of this attribute with the value of the service_id, which has
the python type `None` in that case.
If any existing deployments were created without the service_id
attribute, we now have to support this case and make sure the check
won't filter out devices unexpectedly.
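The comparison issue described above boils down to something like this sketch (illustrative only):

    def affinity_matches(osdspec_affinity: str, service_id) -> bool:
        # older deployments may carry the literal string 'None', which should
        # be treated the same as an unset service_id
        if osdspec_affinity in ('None', ''):
            return service_id in (None, '')
        return osdspec_affinity == str(service_id)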
Zac Dover [Thu, 7 Mar 2024 03:01:47 +0000 (13:01 +1000)]
doc/start: add Slack invite link
Add a link to the ceph-storage Slack invitation page. Previously the
link went to a plain old "this is the ceph-storage Slack" page that did
not direct the reader to sign up.
Adam King [Mon, 21 Aug 2023 17:48:56 +0000 (13:48 -0400)]
cephadm: make custom_configs work for tcmu-runner container
This is intended to be a temporary workaround to allow
custom config files to be mounted into
the tcmu-runner container. The hope is to refactor
cephadm's iscsi handling for squid, but a patch
like this could be useful for iscsi in older
releases where currently custom config files
are unusable for the tcmu-runner container.
What this patch actually does is have us write the
custom config files to a dir for the tcmu-runner
container so that the rest of the logic works without
change. I thought this would be easier to remove later
than a patch that integrates more with the container
mounts or general deployment.
Adam King [Tue, 13 Jun 2023 23:54:30 +0000 (19:54 -0400)]
cephadm: run tcmu-runner through script to do restart on failure
Currently, cephadm runs tcmu-runner as a background
process inside the unit file deployed for iscsi
(rbd-target-api is the primary process). This means that
if tcmu-runner crashes for whatever reason, systemd
will not attempt to restart it. This commit sets
up a script to serve as the container entrypoint
for the tcmu-runner container that will run
tcmu-runner and also restart it on failure
(unless there are too many failures in a short
period, at which point it gives up).
The hope is to eventually drop use of this script
for a better solution in squid onward, but this
should be helpful on older releases (quincy and
pacific at least) where we won't be able to
bring in that better solution.
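A rough Python rendering of the restart-on-failure loop described above (the real entrypoint script, its language and its thresholds are not shown here; the values below are assumptions):

    import subprocess
    import time

    MAX_FAILURES = 5      # assumed threshold
    WINDOW_SECONDS = 120  # assumed "short period"

    failure_times = []
    while True:
        rc = subprocess.call(['/usr/bin/tcmu-runner'])
        now = time.time()
        failure_times = [t for t in failure_times if now - t < WINDOW_SECONDS] + [now]
        if rc == 0 or len(failure_times) >= MAX_FAILURES:
            break  # clean exit, or too many recent failures: give up
        time.sleep(1)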
Adam King [Tue, 1 Aug 2023 21:43:36 +0000 (17:43 -0400)]
mgr/cephadm: filter hosts that can't support VIP for ingress
Keepalive daemons need the host to have an interface
on which they can set up their VIP. If a host
does not have any interface that can work, we should
filter it out
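A sketch of the filtering rule described above (helper and parameter names are assumptions):

    import ipaddress

    def host_can_hold_vip(host_interface_ips, vip_with_prefix):
        # a host qualifies only if at least one of its interface IPs sits in
        # the VIP's subnet, e.g. vip_with_prefix = '192.168.122.100/24'
        vip_net = ipaddress.ip_interface(vip_with_prefix).network
        return any(ipaddress.ip_address(ip) in vip_net for ip in host_interface_ips)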
Adam King [Tue, 1 Aug 2023 20:32:06 +0000 (16:32 -0400)]
mgr/cephadm: select IPs/interface based on VIP for keepalive conf
We need to make sure the keepalive conf sets
the unicast src and peer IPs to be the ones
in the same subnet as the VIP we're setting up,
and that it specifies the correct interface. Otherwise,
the keepalive daemons don't speak to each other
properly and all end up going into MASTER state.
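Roughly, the selection described above can be sketched like this (names are illustrative, not the actual cephadm code):

    import ipaddress

    def pick_keepalived_ips(host_ips, this_host, vip_with_prefix):
        # host_ips: mapping of hostname -> list of that host's IPs
        net = ipaddress.ip_interface(vip_with_prefix).network
        in_subnet = {
            host: next((ip for ip in ips if ipaddress.ip_address(ip) in net), None)
            for host, ips in host_ips.items()
        }
        unicast_src = in_subnet[this_host]
        unicast_peers = [ip for host, ip in in_subnet.items()
                         if host != this_host and ip is not None]
        return unicast_src, unicast_peers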
Adam King [Fri, 2 Jun 2023 00:06:35 +0000 (20:06 -0400)]
cephadm: add tcmu-runner to logrotate config
This process could be used to set up tcmu-runner
to log to a file, much like other ceph daemons:
- create the /etc/tcmu directory
- create the /etc/tcmu/tcmu.conf file with default options
- set the log dir to /var/log
- set the log level to 4
- add -v /etc/tcmu:/etc/tcmu to the tcmu-runner container podman line in unit.run
In order to support this (mostly for debugging) we should
add tcmu-runner to the logrotate config.
Adam King [Fri, 7 Jul 2023 15:03:56 +0000 (11:03 -0400)]
qa/cephadm: add test for ca signed keys
Test that bootstraps with a CA signed key using
the use_ca_signed_key cephadm override. Then follows
up by doing a check-host on each host which verifies
the cephadm mgr module can reach and authenticate with
the nodes using the new key setup.
This probably should really be a workunit, but
I didn't want to create a full new section for
this test and I needed a section that didn't
already run the cephadm task for every test. I could
see this being moved into some sort of
"test_special_deployment_scenarios" section in the future
Adam King [Fri, 7 Jul 2023 14:36:39 +0000 (10:36 -0400)]
qa/cephadm: add ca signed key to cephadm task
To allow bootstrapping a cluster using a CA signed
key instead of the standard pubkey authentication.
Will allow explicit testing of this as we add support
for it
Adam King [Sat, 3 Jun 2023 18:39:05 +0000 (14:39 -0400)]
doc/cephadm: document how to pass self made SSH key pairs to bootstrap
This didn't seem to exist in the install section of
the cephadm docs. Wanted to add it in before adding
documentation for bootstrapping with CA signed keys.
Adam King [Thu, 1 Jun 2023 23:23:45 +0000 (19:23 -0400)]
mgr/cephadm: add is_host_<status> functions to HostCache
A bunch of places were doing list comprehensions to see if a host
was unreachable/draining/schedulable by hostname. This is meant to
replace all those instances of list comprehension with a function
call that does the same thing.
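A sketch of the helper style this describes, following the is_host_<status> naming from the commit title (the underlying getter names are assumptions):

    class HostCache:
        # ... existing cache state and methods ...

        def is_host_unreachable(self, hostname: str) -> bool:
            return hostname in [h.hostname for h in self.get_unreachable_hosts()]

        def is_host_draining(self, hostname: str) -> bool:
            return hostname in [h.hostname for h in self.get_draining_hosts()]

        def is_host_schedulable(self, hostname: str) -> bool:
            return hostname in [h.hostname for h in self.get_schedulable_hosts()]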
Redouane Kachach [Wed, 26 Oct 2022 09:33:38 +0000 (11:33 +0200)]
mgr/cephadm: Adding extra arguments support for RGW frontend
Fixes: https://tracker.ceph.com/issues/57931
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 2c46c0741962e0e6a5ddbc960dfd21948daf0947)
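As an illustration of the feature, an RGW spec passing extra frontend arguments might look like the following; the field name rgw_frontend_extra_args and the values shown are assumptions, not taken from this commit:

    rgw_spec = {
        'service_type': 'rgw',
        'service_id': 'myrealm.myzone',
        'placement': {'count': 2},
        'spec': {
            # extra arguments appended to the rgw frontend configuration
            'rgw_frontend_extra_args': ['request_timeout_ms=30000'],
        },
    }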
Error EINVAL: ServiceSpec: 'dict' object has no attribute 'validate'
which is not a useful error message. This is caused by the
spec handling assuming all OSD-specific fields are either defined
in the 'spec' section or outside of it, but not a mix of the two.
We could also just consider these specs to be invalid
and raise a better error message, but it seems easier
to make the minor adjustment needed for them to work, given there doesn't
seem to be an issue with mixing the styles in specs for
other service types.
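For illustration, the kind of 'mixed' spec described above, with some OSD fields at the top level and some nested under 'spec' (field values are made up), plus the gist of the adjustment:

    mixed_spec = {
        'service_type': 'osd',
        'service_id': 'example',
        'placement': {'host_pattern': '*'},
        'data_devices': {'all': True},  # OSD-specific field given at the top level
        'spec': {
            'encrypted': True,          # OSD-specific field given inside 'spec'
        },
    }

    def flatten(spec: dict) -> dict:
        # fold the nested section into the top level before building the spec
        # object, instead of handing a plain dict to code expecting a spec
        nested = spec.pop('spec', {})
        spec.update(nested)
        return spec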
Adam King [Mon, 5 Jun 2023 17:18:06 +0000 (13:18 -0400)]
python-common/drive_selection: lower log level of limit policy message
This gets logged every time cephadm tries to apply a
relevant OSD spec and ends up spamming the logs. There's no reason
we really need this to be at info rather than debug level,
so let's lower it.
Adam King [Thu, 13 Apr 2023 17:54:00 +0000 (13:54 -0400)]
cephadm: open ports in firewall when adopting monitoring stack daemons
Otherwise we risk prometheus/alertmanager/grafana
not functioning properly after adoption because the necessary
ports are not open in the firewall.
Adam King [Thu, 13 Apr 2023 17:05:11 +0000 (13:05 -0400)]
cephadm: still try to open ports in firewall on redeploy/reconfig
Prior to this patch we were discarding the provided
ports on reconfig and redeploy in order to not fail
thinking there was a port conflict with the instance
of the daemon we were about to reconfig/redeploy. However,
it's still desirable for us to make sure the firewall ports
are open when we do a reconfig/redeploy, so this refactors
the port handling approach to have it do that but
still avoid checking for port conflicts. It also includes
an update of the type signature of deploy_daemon
to the py3 style. That wasn't needed for the change,
but since I was adding an argument there I thought we might
as well do it now.
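A sketch of the refactored behaviour (helper names are assumptions): firewall ports are opened on every deploy, while the in-use check is skipped for reconfig/redeploy of an existing daemon.

    def deploy_daemon(ctx, daemon_id: str, ports=None, reconfig: bool = False) -> None:
        ports = ports or []
        if ports:
            open_firewall_ports(ports)          # assumed helper wrapping the firewall tooling
            if not reconfig:
                # only a brand-new daemon can genuinely conflict on these ports
                assert_ports_not_in_use(ports)  # assumed helper
        # ... rest of the deployment logic ...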
Adam King [Mon, 19 Jun 2023 20:07:31 +0000 (16:07 -0400)]
mgr/cephadm: add extra_entrypoint_args to mon spec
There was no reason for the mon spec to not include
this option. I believe this was just an oversight caused
by the addition of the mon spec and extra_entrypoint_args
in separate PRs around the same time.
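For illustration, a mon spec using the option (placement and argument values are made up):

    mon_spec = {
        'service_type': 'mon',
        'placement': {'hosts': ['host1', 'host2', 'host3']},
        # arbitrary extra arguments appended to the container entrypoint
        'extra_entrypoint_args': ['--debug_ms=10'],
    }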
Adam King [Mon, 19 Jun 2023 19:46:45 +0000 (15:46 -0400)]
mgr/cephadm: add extra_container_args and custom_configs to CustomContainer
CustomContainer was skipped previously for the extra_container_args
and custom_configs feature as these could already be done
using other fields within the custom container service spec
(the "args" and "files" fields respectively). It seems
desirable for us to allow setting these things for custom
containers the same as for other services for uniformity's sake,
and this allows us to use custom containers to test
these features.
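An illustrative custom container spec using the newly allowed fields (image, arguments and file content are made up):

    container_spec = {
        'service_type': 'container',
        'service_id': 'example',
        'image': 'quay.io/example/image:latest',
        'extra_container_args': ['--cpus=2'],
        'custom_configs': [
            {'mount_path': '/etc/example/example.conf',
             'content': 'key = value\n'},
        ],
    }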
Zac Dover [Mon, 4 Mar 2024 10:41:16 +0000 (20:41 +1000)]
doc/rados: link to pg setting commands
Link to the instructions for manually setting the number of PGs per
pool, from the mention of placement groups. These instructions are
included here in response to a request from Ronen Friedman on the
occasion of the removal of links to the PGcalc (see
https://github.com/ceph/ceph/pull/55899#pullrequestreview-1912940118).
Afreen [Tue, 6 Feb 2024 09:43:58 +0000 (15:13 +0530)]
mgr/dashboard: fix error while accessing roles tab when policy attached
Fixes https://tracker.ceph.com/issues/64270
Issue:
======
Accessing the Object->Users->Roles tab caused a 500 internal server error.
This is due to the "PermissionPolicies" attached to the role; the
backend was not handling this field for rgw roles.
Fix:
====
Added "PermissionPolicies" as the valid field in backend and updated
frontend to render the attached policy in formatted JSON
Zac Dover [Sun, 3 Mar 2024 10:28:00 +0000 (20:28 +1000)]
doc/rados: remove PGcalc from docs
Remove mention of the "PG calc" tool from the documentation. I have
removed all mention of it in one fell swoop so that posterity can easily
restore mention of this tool if we decide we need to do so.
Zac Dover [Fri, 1 Mar 2024 12:11:14 +0000 (22:11 +1000)]
doc/install: add manual RADOSGW install procedure
Add a manual RADOSGW installation procedure to
doc/install/manual-deployment.rst. This procedure was developed by Janne
Johansson and reported to the ceph-users mailing list on 29 Jan 2024
here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/LB3YRIKAPOHXYCW7MKLVUJPYWYRQVARU/
Co-authored-by: Janne Johansson <icepic.dz@gmail.com>
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 565bc9503838906995fa48f59debcd2843775b18)