Patrick Donnelly [Thu, 21 Dec 2023 13:48:33 +0000 (08:48 -0500)]
pybind/mgr/devicehealth: skip legacy objects that cannot be loaded
After the fix, the test log looks like:
2023-12-21T16:09:28.804+0000 7fbe7fd86700 0 [devicehealth DEBUG root] loading object ABC_DEADB33F_FA
2023-12-21T16:09:28.805+0000 7fbe7fd86700 0 [devicehealth DEBUG root] object rados.Object(ioctx=<rados.Ioctx object at 0x7fbeee0c4668>,key=ABC_DEADB33F_FA,nspace=--default--,locator=None) does not exist because it is deleted in HEAD
2023-12-21T16:09:28.805+0000 7fbe7fd86700 0 [devicehealth DEBUG root] finished reading legacy pool, complete = True
Credit to Greg Farnum for postulating the cause.
Fixes: https://tracker.ceph.com/issues/63882
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 5e6fc0bf5f52732966d5cf2987e679abee8a384d)
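The skip behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `ObjectNotFound` is a local stand-in for `rados.ObjectNotFound`, `read_object` is a hypothetical callable standing in for the read op, and the key `XYZ_0001` is made up for the example.

```python
class ObjectNotFound(Exception):
    """Stand-in for rados.ObjectNotFound so the sketch runs without librados."""


def load_legacy_objects(keys, read_object, log=print):
    """Try to load every key; skip objects whose read raises ObjectNotFound.

    read_object is a hypothetical callable standing in for the real read op.
    """
    loaded = []
    for key in keys:
        log(f"loading object {key}")
        try:
            loaded.append((key, read_object(key)))
        except ObjectNotFound:
            # The object was listed but is deleted in HEAD: skip it
            # instead of letting the exception abort the whole load.
            log(f"object {key} does not exist because it is deleted in HEAD")
    return loaded


def fake_read(key):
    # Simulated pool in which one listed object can no longer be read.
    if key == "ABC_DEADB33F_FA":
        raise ObjectNotFound(key)
    return {"key": key}


messages = []
result = load_legacy_objects(
    ["ABC_DEADB33F_FA", "XYZ_0001"], fake_read, messages.append
)
```

The point of the design is that one unreadable legacy object no longer prevents the remaining objects from being processed.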
Patrick Donnelly [Thu, 21 Dec 2023 15:39:03 +0000 (10:39 -0500)]
qa: test devicehealth legacy load of deleted snap obj
Failure without fix looks like:
2023-12-21T16:05:55.737+0000 7fbe585b0700 0 [devicehealth DEBUG root] loading object ABC_DEADB33F_FA
2023-12-21T16:05:55.737+0000 7fbe585b0700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.x: [errno 2] RADOS object not found (Failed to operate read op for oid ABC_DEADB33F_FA)
2023-12-21T16:05:55.737+0000 7fbe585b0700 -1 devicehealth.serve:
2023-12-21T16:05:55.737+0000 7fbe585b0700 -1 Traceback (most recent call last):
  File "/home/pdonnell/ceph/src/pybind/mgr/devicehealth/module.py", line 394, in serve
    self._do_serve()
  File "/home/pdonnell/ceph/src/pybind/mgr/mgr_module.py", line 524, in check
    return func(self, *args, **kwargs)
  File "/home/pdonnell/ceph/src/pybind/mgr/devicehealth/module.py", line 354, in _do_serve
    finished_loading_legacy = self.check_legacy_pool()
  File "/home/pdonnell/ceph/src/pybind/mgr/devicehealth/module.py", line 326, in check_legacy_pool
    if self._load_legacy_object(ioctx, obj.key):
  File "/home/pdonnell/ceph/src/pybind/mgr/devicehealth/module.py", line 300, in _load_legacy_object
    ioctx.operate_read_op(op, oid)
  File "rados.pyx", line 3723, in rados.Ioctx.operate_read_op
rados.ObjectNotFound: [errno 2] RADOS object not found (Failed to operate read op for oid ABC_DEADB33F_FA)
Nizamudeen A [Tue, 19 Mar 2024 14:57:13 +0000 (20:27 +0530)]
mgr/dashboard: rm warning/error threshold for cpu usage
For multi-core CPUs the value can exceed 100%, so it doesn't make
sense to show a warning/error when the usage is at or above 100%.
Hence removing the thresholds.
Ivo Almeida [Wed, 21 Feb 2024 13:02:19 +0000 (13:02 +0000)]
mgr/dashboard: fix retention add for subvolume
- Added parameters for subvolume and subvolume group when adding a new
snap schedule.
- Added call to remove retention policies when removing a snap schedule
in case it is the last one with same path
Fixes: https://tracker.ceph.com/issues/64524
Signed-off-by: Ivo Almeida <ialmeida@redhat.com>
(cherry picked from commit 80e1207f4b536fe6edbc81e61cbf951e135eba54)
Adam King [Wed, 13 Mar 2024 19:30:25 +0000 (15:30 -0400)]
mgr/cephadm: refresh public_network for config checks before checking
The place it was being run before meant it would only grab the
public_network setting once at startup of the module. This meant
if a user changed the setting, which they are likely to do if they
get the warning, cephadm would ignore the change and continue
reporting that the hosts don't match up with the old setting
for the public_network. This moves the call to refresh the
setting to right before we actually run the checks. It does
mean we'll do the `ceph config dump --format json` call
each serve loop iteration, but I've found that only tends
to take a few milliseconds, which is nothing compared to
the time to refresh other things we check during the serve
loop.
I additionally modified the use of this option to use
the attribute on the mgr, rather than calling
`get_module_option`. This was just to get it more in
line with how we tend to handle other config options
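The difference between reading the setting once and refreshing it right before the checks can be sketched like this. It is a simplified model under stated assumptions, not cephadm's code: `ConfigChecker`, `fetch_config`, and the host/network data are hypothetical, with `fetch_config` standing in for the `ceph config dump --format json` call.

```python
class ConfigChecker:
    """Hypothetical checker; fetch_config stands in for `ceph config dump`."""

    def __init__(self, fetch_config):
        self.fetch_config = fetch_config
        self.public_network = None

    def refresh(self):
        # Re-read the setting and keep it as a plain attribute.
        self.public_network = self.fetch_config().get("public_network")

    def run_checks(self, host_networks):
        # Refresh right before checking so a changed setting is picked up.
        self.refresh()
        return sorted(
            host
            for host, nets in host_networks.items()
            if self.public_network not in nets
        )


cluster_config = {"public_network": "10.0.0.0/24"}
checker = ConfigChecker(lambda: dict(cluster_config))
hosts = {"vm-00": ["10.0.0.0/24"], "vm-01": ["10.1.0.0/24"]}

before = checker.run_checks(hosts)
cluster_config["public_network"] = "10.1.0.0/24"  # user changes the setting
after = checker.run_checks(hosts)
```

Had `refresh()` been called only in the constructor, the second run would still report against the old network.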
Florent Carli [Tue, 12 Mar 2024 17:31:16 +0000 (18:31 +0100)]
cephadm.py: add timemaster to timesync services list
On Debian/Ubuntu, if you need PTP, it's possible to use the linuxptp package for time synchronization.
In that case the systemd service is called timemaster and is a wrapper for chrony/ntpd/phc2sys/ptp4l.
where the networks is set and the
"only_bind_port_on_networks" option is
set to true, the grafana daemon will bind
to its port (3000 in this case since it's
the default and I didn't set a port) only
on an IP from that network. I tested this
by holding port 3000 on an IP from a different
network on the host and then deploying
grafana. Without this patch it would have
failed with a port conflict error.
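Choosing a bind address from a configured network, as described above, can be illustrated with the standard `ipaddress` module. This is only a sketch of the idea: `pick_bind_ip` and the sample addresses are hypothetical and are not taken from cephadm.

```python
import ipaddress


def pick_bind_ip(host_ips, networks):
    """Return the first host IP inside one of the networks, else None."""
    nets = [ipaddress.ip_network(n) for n in networks]
    for ip in host_ips:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in nets):
            return ip
    return None


# The host has addresses on two networks; only one is in the spec's list.
bind_ip = pick_bind_ip(["192.168.5.10", "10.0.0.7"], ["10.0.0.0/24"])
no_match = pick_bind_ip(["192.168.5.10"], ["10.0.0.0/24"])
```

Binding only to the selected address leaves the same port free on the host's other addresses, which is why the conflicting listener in the test above no longer causes a failure.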
Nizamudeen A [Wed, 18 Oct 2023 06:38:21 +0000 (12:08 +0530)]
mgr/dashboard: support rgw roles updating
Right now only the modification of max_session_duration is supported via
the roles update command. To update, we need to use the `policy modify`
command, which is not added in this PR. That should be done separately.
Adam King [Fri, 1 Mar 2024 18:22:44 +0000 (13:22 -0500)]
cephadm: improve cephadm pull usage message
Generally, it's uncommon for users to run this
directly, but in case they need to for debugging
purposes, we should include how to pass the
image to be pulled in the usage message.
Additionally, include that this is only to be used
for pulling ceph images in the help message, as
that isn't necessarily clear. Pulling anything
else will result in a traceback as it tries
to run `ceph --version` inside the container.
cephadm: adjust the ingress ha proxy health check interval
Currently the health checker uses the default interval of 2s, so it sends
a list-bucket request every 2s. This seems too frequent and should be
adjustable. Hence introducing a new setting, health_check_interval, in
the ingress spec for haproxy.
Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Apply suggestions from code review
Co-authored-by: Adam King <47704447+adk3798@users.noreply.github.com>
Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
(cherry picked from commit 75327c5b56591c6a29ad47745df24d16320f5a99)
Ramana Raja [Thu, 29 Feb 2024 17:12:19 +0000 (12:12 -0500)]
qa/suites: add diff-continuous and compare-mirror-image tests
... to rbd and krbd suites respectively.
This allows the compare-mirror-image tests introduced in ea3a567
to be run against various kernel branches, e.g., testing branch.
And allows diff_continuous test in rbd_suite to run against distro
kernel.
After some tests, it turns out that, depending on the hardware,
the 'Location' header returned by the server after login can be different.
I could notice the following:
the endpoint passed down to util.query() is wrong:
it passes the full URL (scheme://addr:port/path) where it should only
pass the path. The cause is that RedFishClient.login() basically stores
the value of the Location header in `self.location`.
As a consequence, the client is unable to log out properly.
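One way to normalize a Location value to a path, whether the server returned a full URL or a bare path, is sketched below. `location_to_path` is a hypothetical helper and the session URL is illustrative; this is not necessarily how the fix was implemented.

```python
from urllib.parse import urlsplit


def location_to_path(location):
    """Return only the path component of a Location header value."""
    # A bare path has no scheme or netloc, so it comes back unchanged.
    return urlsplit(location).path


full = location_to_path(
    "https://10.10.10.10:443/redfish/v1/SessionService/Sessions/42"
)
bare = location_to_path("/redfish/v1/SessionService/Sessions/42")
```

Both forms yield the same path, so the caller can always pass a path-only endpoint.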
and it placed the node-exporter daemons on vm-00
and vm-02 but not vm-01. Obviously there are more
advanced scenarios that justify this than listing
two hosts, but using "|" as an OR like that is an
example of something you can't do with the fnmatch
version of the host pattern
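The limitation mentioned above can be reproduced with the Python standard library alone. The host names follow the example in the text; the use of `re.fullmatch` here is an assumption for illustration and may differ from how cephadm applies the pattern.

```python
import fnmatch
import re

hosts = ["vm-00", "vm-01", "vm-02"]
pattern = "vm-00|vm-02"

# fnmatch treats "|" as a literal character, so nothing matches.
glob_matches = [h for h in hosts if fnmatch.fnmatch(h, pattern)]

# As a regular expression, "|" is alternation.
regex_matches = [h for h in hosts if re.fullmatch(pattern, h)]
```

With the glob form, expressing "vm-00 or vm-02" requires a pattern that happens to match both names or listing the hosts explicitly.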
Nizamudeen A [Thu, 7 Mar 2024 08:43:54 +0000 (14:13 +0530)]
mgr/dashboard: disable applitools e2e
Temporarily disabling this so the CI can turn green. In the meantime I'll
look into a proper way to handle the applitools e2e tests, which I'll track
at https://tracker.ceph.com/issues/64783
Ilya Dryomov [Wed, 28 Feb 2024 13:20:16 +0000 (14:20 +0100)]
librbd: don't clip expanded diff on truncate in ObjectListSnapsRequest
If the diff was expanded due to LIST_SNAPS_FLAG_WHOLE_OBJECT, clipping
it when handling a truncate is wrong -- when subtracting that interval,
we either split the expanded extent into two or chop off a piece of it.
However the point of LIST_SNAPS_FLAG_WHOLE_OBJECT is to report a single
extent covering the entire object.
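The effect of subtracting an interval from an expanded extent can be illustrated with plain interval arithmetic. This is a toy model with half-open `(start, end)` byte ranges and made-up sizes; `subtract` is a hypothetical helper, not librbd code.

```python
def subtract(extent, hole):
    """Remove the half-open interval `hole` from `extent`; return the pieces left."""
    start, end = extent
    h_start, h_end = hole
    pieces = []
    if h_start > start:
        pieces.append((start, min(end, h_start)))
    if h_end < end:
        pieces.append((max(start, h_end), end))
    return pieces


whole_object = (0, 4096)  # extent expanded to cover the entire object

split = subtract(whole_object, (1024, 2048))    # interval in the middle
chopped = subtract(whole_object, (3072, 4096))  # interval at the tail
```

In both cases the result is no longer a single extent covering the entire object, which is what the whole-object flag is meant to report.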
Ilya Dryomov [Sun, 18 Feb 2024 10:46:15 +0000 (11:46 +0100)]
librados/snap_set_diff: ignore truncates above size at start
Because currently calc_snap_set_diff() only ever appends to the running
diff, an excessive (either too large or completely bogus) zero extent
is reported in cases where an object is first expanded (with a snapshot
taken at that point) and then truncated but still above the size of the
object as of the starting snapshot.