mgr/prometheus: Fix regression with OSD/host details/overview dashboards
Fix issues with PromQL expressions and vector matching on the
`ceph_disk_occupation` metric.
As it turns out, `ceph_disk_occupation` cannot simply be used as
expected: there are edge cases for users who run several OSDs on a
single disk. This leads to issues that cannot be solved by PromQL alone
(many-to-many PromQL errors). In these rare cases the data simply has a
different shape than the queries expect.
I have not found a PromQL-only solution to this issue. What we basically
need is the following:
1. Match on the labels `host` and `instance` to get one or more OSD names
from a metadata metric (`ceph_disk_occupation`), to let a user know
which OSDs belong to which disk.
2. Match on the `ceph_daemon` label of the `ceph_disk_occupation` metric,
in which case the value of `ceph_daemon` must not refer to more than
a single OSD. This is the exact opposite of requirement 1 (see the
illustration below).
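For illustration (a hedged sketch, not taken from the dashboards; the
metric names `node_disk_read_bytes_total` and `ceph_osd_stat_bytes` are
used as examples only): before this change, both kinds of joins had to
go through the same `ceph_disk_occupation` metric, and no single shape
of its `ceph_daemon` label satisfies both.

    # Requirement 1: label a disk with the OSD name(s) for display.
    # Needs one series per disk, so ceph_daemon may be "osd.1+osd.2".
    node_disk_read_bytes_total * on(device, instance)
      group_left(ceph_daemon) ceph_disk_occupation

    # Requirement 2: join per-OSD metrics by daemon name.
    # Needs ceph_daemon to name exactly one OSD, e.g. "osd.1".
    ceph_osd_stat_bytes * on(ceph_daemon)
      group_left(device) ceph_disk_occupation

With several OSDs on one disk, whichever form `ceph_daemon` takes breaks
one of the two queries (either the join does not match at all, or it
fails with a many-to-many error).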
As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements with a single metric, the
intention of this commit is to extend the metric by providing a second,
similar metric that satisfies one of the requirements. This enables the
queries to differentiate between a vector matching operation whose
result is shown to the user as a string (where `ceph_daemon` may be
`osd.1` or `osd.1+osd.2`) and a vector matching operation that requires
a single `ceph_daemon` value in the matching condition.
Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (and only when more than one OSD
runs on a single disk). This means that only the `ceph_disk_occupation`
metadata metric needs to be extended and provided as two metrics.
`ceph_disk_occupation` is supposed to be used for matching on the
`ceph_daemon` label value:
foo * on(ceph_daemon) group_left ceph_disk_occupation
`ceph_disk_occupation_human` is supposed to be used wherever the
resulting data is displayed for human consumption (graphs, alert
messages, etc.):
foo * on(device,instance)
group_left(ceph_daemon) ceph_disk_occupation_human
Fixes: https://tracker.ceph.com/issues/52974
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
(cherry picked from commit 18d3a71618a5e3bc3cbd0bce017fb7b9c18c2ca0)
Conflicts:
monitoring/grafana/dashboards/host-details.json
monitoring/grafana/dashboards/hosts-overview.json
monitoring/grafana/dashboards/jsonnet/grafana_dashboards.jsonnet
monitoring/grafana/dashboards/osd-device-details.json
monitoring/grafana/dashboards/tests/features/hosts_overview.feature
src/pybind/mgr/prometheus/module.py
- Octopus does not generate Grafana dashboards using jsonnet, hence
  grafana_dashboards.jsonnet was removed.
- Octopus does not support the feature-based dashboard tests, hence
  hosts_overview.feature was removed.
- Features implemented in prometheus/module.py that were never
  backported to Octopus were removed.
- The `tox.ini` file was adapted to include the mgr/prometheus tests
  introduced by this backport.
- Added `cherrypy` to src/pybind/mgr/requirements.txt to fix the
  Prometheus unit tests.