Adam King [Thu, 15 Feb 2024 14:24:23 +0000 (09:24 -0500)]
qa/cephadm: don't test certain workunits with agent
There are a handful of workunits that don't work
with or don't make sense with the agent.
The test for the cephadm timeout only works if
the mgr directly runs ceph-volume inventory which
it won't do with the agent present. The adoption
test is just running direct cephadm commands that
are irrelevant to the agent. The test_orch_cli tests
rely on refresh timings that are different with
the agent running, causing spurious failures.
Adam King [Wed, 18 Oct 2023 18:00:05 +0000 (14:00 -0400)]
mgr/cephadm: update timestamp on repeat daemon/service events
If you have a daemon/service event and then an identical
event happens later (e.g. the same daemon is redeployed
multiple times) the events are not updated on the repeat
instances. In cases like this I think it makes more
sense to update the timestamp so users can see the most
recent time the event happened.
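A minimal sketch of the behaviour described above, assuming a simple in-memory event list. The `Event` and `EventLog` names and their fields are hypothetical and are not taken from the cephadm source; the point is only that a repeat of an identical event refreshes the stored timestamp instead of being dropped.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Event:
    # Hypothetical event record; the field names are illustrative only.
    subject: str
    level: str
    message: str
    created: datetime

    def same_as(self, other: "Event") -> bool:
        # Two events are "identical" if everything but the timestamp matches.
        return (self.subject, self.level, self.message) == (
            other.subject, other.level, other.message)


class EventLog:
    def __init__(self) -> None:
        self.events: List[Event] = []

    def add(self, event: Event) -> None:
        for existing in self.events:
            if existing.same_as(event):
                # Repeat event: keep one entry, but show the latest time.
                existing.created = event.created
                return
        self.events.append(event)
```

With this shape, redeploying the same daemon twice leaves a single entry whose `created` field is the time of the second redeploy.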
Adam King [Tue, 9 May 2023 19:06:41 +0000 (15:06 -0400)]
mgr/cephadm: make jaeger-collector urls a dep for jaeger-agent
The jaeger-agents need to know the URL for the collector(s)
that have been deployed. If a collector moves, or we deployed
the agents before the collector, we need to reconfig the agents
with updated info about the collectors. Failure to do so can
leave the jaeger-agents down, reporting
```
Could not create collector proxy","error":"at least one collector hostPort address is required when resolver is not available"
```
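The dependency idea can be sketched roughly as follows. This is an illustration under assumptions, not the cephadm implementation: `jaeger_agent_deps` and `needs_reconfig` are hypothetical helpers, and the URLs in the usage note are made-up values. The agent's dependency list is derived from the collector URLs, and a change in that list is what signals that a reconfig is needed.

```python
from typing import Iterable, List


def jaeger_agent_deps(collector_urls: Iterable[str]) -> List[str]:
    # Hypothetical helper: the agent's deps are the de-duplicated,
    # sorted collector URLs, so ordering alone never causes a reconfig.
    return sorted(set(collector_urls))


def needs_reconfig(last_deps: Iterable[str],
                   collector_urls: Iterable[str]) -> bool:
    # Reconfig when the collectors the agent was configured with differ
    # from the collectors that exist now (moved, added, or removed).
    return sorted(set(last_deps)) != jaeger_agent_deps(collector_urls)
```

For example, an agent deployed before any collector has empty deps, so `needs_reconfig([], ["host1:14250"])` is true once a collector appears.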
Zac Dover [Thu, 7 Mar 2024 03:01:47 +0000 (13:01 +1000)]
doc/start: add Slack invite link
Add a link to the ceph-storage Slack invitation page. Previously the
link went to a plain old "this is the ceph-storage Slack" page that did
not direct the reader to sign up.
Although this seems to be a bug from RedFish, we need to handle
the case when it happens otherwise it makes the mgr orchestrator module
throw an error.
The idea here is to create a new status "unknown" when we can't fetch the
real status of a component.
Venky Shankar [Thu, 25 Jan 2024 09:32:33 +0000 (15:02 +0530)]
qa: remove error string checks and check w/ return value
I ran into this failure once #54972 was merged. The test is validating
the error string returned due to the failed mount. There aren't any
return value checks - which is a _more_ important check. Generic error
string checks will fail once an error string is changed (typo, etc.).
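A small sketch of the difference between the two kinds of check. The failing command here is a stand-in Python child process, not the real mount helper, and the exit status 32 and the error text are arbitrary illustrative values; `run_failing_mount` and `mount_failed` are hypothetical names.

```python
import subprocess
import sys


def run_failing_mount() -> subprocess.CompletedProcess:
    # Stand-in for a mount attempt that fails: writes an error message
    # to stderr and exits non-zero (values chosen for illustration).
    script = "import sys; sys.stderr.write('mount error'); sys.exit(32)"
    return subprocess.run([sys.executable, "-c", script],
                          capture_output=True, text=True)


def mount_failed(proc: subprocess.CompletedProcess) -> bool:
    # Robust check: depends only on the return value, so rewording
    # the error message does not break the test.
    return proc.returncode != 0
```

A check such as `"mount error" in proc.stderr` would pass today but break as soon as the message text changes, whereas `mount_failed(proc)` keeps working.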
Zac Dover [Mon, 4 Mar 2024 10:41:16 +0000 (20:41 +1000)]
doc/rados: link to pg setting commands
Link to the instructions for manually setting the number of PGs per
pool, from the mention of placement groups. These instructions are
included here in response to a request from Ronen Friedman on the
occasion of the removal of links to the PGcalc (see
https://github.com/ceph/ceph/pull/55899#pullrequestreview-1912940118).
Zac Dover [Sun, 3 Mar 2024 10:28:00 +0000 (20:28 +1000)]
doc/rados: remove PGcalc from docs
Remove mention of the "PG calc" tool from the documentation. I have
removed all mention of this in one fell swoop to help posterity restore
mention of this tool if we decide we need to do so.
Zac Dover [Fri, 1 Mar 2024 12:11:14 +0000 (22:11 +1000)]
doc/install: add manual RADOSGW install procedure
Add a manual RADOSGW installation procedure to
doc/install/manual-deployment.rst. This procedure was developed by Janne
Johansson and reported to the ceph-users mailing list on 29 Jan 2024
here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/LB3YRIKAPOHXYCW7MKLVUJPYWYRQVARU/
Co-authored-by: Janne Johansson <icepic.dz@gmail.com>
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 565bc9503838906995fa48f59debcd2843775b18)
Zac Dover [Thu, 29 Feb 2024 08:08:10 +0000 (18:08 +1000)]
doc/glossary: improve "MDS" entry
Improve the entry for "MDS" in doc/glossary.rst by linking to the
"ceph-mds" man page and mentioning the relationship between clients and
MDS (or MDSes).
where the value ends up as a floating point value
after converting to a string (which is necessary to actually
pass it to the binary). By setting the field to be an
int, we should be able to avoid this.
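To illustrate the effect, here is a hedged sketch with a hypothetical `Spec` class and a hypothetical `--timeout` argument; neither name comes from the commit. It shows only the general Python behaviour: a float stringifies with a decimal point, while an int does not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Spec:
    # Hypothetical spec with one numeric field that is later passed
    # to a binary as a command-line string.
    timeout: int

    def __post_init__(self) -> None:
        # Coerce to int so a value like 10.0 cannot leak through.
        self.timeout = int(self.timeout)


def to_args(spec: Spec) -> List[str]:
    # The value must be converted to a string to be passed on.
    return ["--timeout", str(spec.timeout)]
```

Without the coercion, `str(10.0)` yields `"10.0"`, which a binary expecting an integer may reject.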
Adam King [Sat, 4 Nov 2023 22:45:17 +0000 (18:45 -0400)]
qa/cephadm: add test for cephadm asyncio based timeout
Adds a test that will set the default cephadm command
timeout and then force a timeout to occur by holding
the cephadm lock and triggering a device refresh.
This works because cephadm ceph-volume commands
require the cephadm lock to run, so the command will
timeout waiting for the lock to become available.
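The mechanism can be sketched with plain asyncio, assuming an in-process `asyncio.Lock` in place of the real cephadm lock. `refresh_devices` and `demo` are hypothetical names; the sketch only shows that a command which needs a held lock cannot finish before the timeout fires.

```python
import asyncio


async def refresh_devices(lock: asyncio.Lock) -> str:
    # Stand-in for a command that must take the lock before it can run.
    async with lock:
        return "inventory"


async def demo(timeout: float, hold_lock: bool) -> str:
    lock = asyncio.Lock()
    if hold_lock:
        # The test holds the lock so the command can never acquire it.
        await lock.acquire()
    try:
        return await asyncio.wait_for(refresh_devices(lock), timeout)
    except asyncio.TimeoutError:
        return "timed out"
    finally:
        if hold_lock:
            lock.release()
```

Running `asyncio.run(demo(0.05, hold_lock=True))` forces the timeout path, while `hold_lock=False` lets the command complete normally.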
Adam King [Wed, 7 Jun 2023 14:33:13 +0000 (10:33 -0400)]
mgr/cephadm: Also catch concurrent.futures.TimeoutError for timeouts
On Python 3.6, which Ceph currently uses for its
container builds (these are based on CentOS 8 Stream builds,
hence the Python version), the exception raised by a timeout
from a concurrent.futures.Future is successfully caught by
looking for asyncio.TimeoutError. However, in builds with
later Python versions, e.g. 3.9.16, the timeout is no
longer caught. This results in situations like:
    Traceback (most recent call last):
      File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
        return f(*arg)
      File "/usr/share/ceph/mgr/cephadm/serve.py", line 241, in refresh
        r = self._refresh_host_devices(host)
      File "/usr/share/ceph/mgr/cephadm/serve.py", line 352, in _refresh_host_devices
        devices = self.mgr.wait_async(self._run_cephadm_json(
      File "/usr/share/ceph/mgr/cephadm/module.py", line 635, in wait_async
        return self.event_loop.get_result(coro, timeout)
      File "/usr/share/ceph/mgr/cephadm/ssh.py", line 63, in get_result
        return future.result(timeout)
      File "/lib64/python3.9/concurrent/futures/_base.py", line 448, in result
        raise TimeoutError()
    concurrent.futures._base.TimeoutError
which causes the cephadm module to crash whenever one of these
command timeouts happens. This patch also catches the
newer exception type so it works on later Python versions as well.
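A minimal sketch of the pattern, not the actual patch: `get_result` here is a free function and `CommandTimeout` is a hypothetical exception name. Listing both exception types in one `except` clause is harmless on Python versions where the two names refer to the same class and necessary on versions where they differ.

```python
import asyncio
import concurrent.futures


class CommandTimeout(Exception):
    # Hypothetical error type used to report the timeout cleanly.
    pass


def get_result(future: concurrent.futures.Future, timeout: float):
    try:
        return future.result(timeout)
    except (asyncio.TimeoutError, concurrent.futures.TimeoutError) as exc:
        # Catch both names so the timeout is handled regardless of
        # which class this Python version raises from Future.result().
        raise CommandTimeout(f"timed out after {timeout}s") from exc
```

A future that never completes then raises `CommandTimeout` instead of an uncaught `concurrent.futures.TimeoutError`.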
Kefu Chai [Sat, 3 Feb 2024 05:46:05 +0000 (13:46 +0800)]
debian/cephadm.postinst: stop using adduser --gecos
--gecos option of adduser is deprecated in debian/bookworm, and
will be removed in debian/trixie,
see https://manpages.debian.org/bookworm/adduser/adduser.8.en.html.
so to be future-proof, let's switch to `usermod --comment`. please
note that we still need to support ubuntu/jammy, which is used in
our CI, and the `adduser` shipped by ubuntu/jammy does not support
`--comment` yet, so we cannot use that option of `adduser`.
Kefu Chai [Wed, 17 Jan 2024 15:47:39 +0000 (23:47 +0800)]
debian/cephadm.postinst: specify --home when adduser
quote from adduser/NEWS.Debian.gz:
> System user home defaults to /nonexistent if --home is not specified.
> Packages that call adduser to create system accounts should explicitly
> specify a location for /home (see Lintian check
> maintainer-script-lacks-home-in-adduser).
so let's follow this change in adduser. otherwise "cephadm"
would have a $HOME at `/nonexistent`.
Kefu Chai [Wed, 17 Jan 2024 15:36:12 +0000 (23:36 +0800)]
debian/ceph-common.postinst: set user directory using adduser
now that adduser allows us to set its home directory, we can do
this using adduser instead of using usermod. this change also
silences the warning from lintian
"maintainer-script-lacks-home-in-adduser". lintian complains if
`adduser --system` is called without passing `--home` option.
also, take this opportunity to s/-c/--comment/ in the command line
of `usermod`, for better readability.
Kefu Chai [Wed, 17 Jan 2024 15:09:02 +0000 (23:09 +0800)]
debian/control: add adduser to Depends of cephadm and ceph-common
in `debian/ceph-common.postinst` and `debian/cephadm.postinst`, we
use `adduser --system` to create the system user when configuring
the corresponding package.
before this change, the dependency is not listed in the runtime
`Depends` section of ceph-common and cephadm.
in this change, the dependency is added. this is also suggested
by Securing Debian Manual, see
https://www.debian.org/doc/manuals/securing-debian-manual/bpp-lower-privs.en.html
Zac Dover [Mon, 26 Feb 2024 10:03:48 +0000 (20:03 +1000)]
doc/rados: add "change public network" procedure
Add a procedure to /doc/rados/operations/add-or-rm-mons.rst that
explains how to change the public_network in a Ceph cluster deployed
with cephadm. This procedure was developed by Eugen Block, and can be
seen in its original form here:
https://heiterbiswolkig.blogs.nde.ag/2024/02/22/cephadm-change-public-network/
Casey Bodley [Mon, 26 Feb 2024 14:38:52 +0000 (09:38 -0500)]
test/rgw: increase timeouts in unittest_rgw_dmclock_scheduler
1ms sleeps are generally below the timer's resolution. increase run_for()
durations to 50ms to make the tests far less sensitive to timing. in
practice, none of the sleeps actually wait the full 50ms