mgr/prometheus: Fix regression with OSD/host details/overview dashboards

Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.

As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, because there are edge cases for users who run several OSDs
on a single disk.  This leads to issues that cannot be solved by PromQL
alone (many-to-many PromQL errors).  In these rare cases the data
simply differs from what we expected.

I have not found a PromQL-only solution to this issue. What we basically
need is the following:

1. Match on labels `host` and `instance` to get one or more OSD names
   from a metadata metric (`ceph_disk_occupation`) to let a user know
   about which OSDs belong to which disk.

2. Match on the label `ceph_daemon` of the `ceph_disk_occupation` metric,
   in which case the value of `ceph_daemon` must not refer to more than
   a single OSD. This is the exact opposite of requirement 1.

As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, the intention
of this commit is to extend the metric by providing a similar metric
that satisfies one of the requirements. This enables the queries to
differentiate between a vector matching operation that shows a string to
the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and one that matches a vector by requiring a single
`ceph_daemon` in the matching condition.
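
The relationship between the two metrics can be sketched in Python. This
is only an illustration, not the code in `module.py`: the record layout,
the sample values and the helper name `to_human` are assumptions made
for this example. It collapses one record per OSD into one record per
(`device`, `instance`) pair and joins the OSD names with `+`.

```python
from collections import defaultdict

# Hypothetical per-OSD records: one entry per OSD, so `ceph_daemon`
# always names exactly one OSD (requirement 2).
per_osd = [
    {"ceph_daemon": "osd.1", "device": "sdb", "instance": "node1"},
    {"ceph_daemon": "osd.2", "device": "sdb", "instance": "node1"},
    {"ceph_daemon": "osd.3", "device": "sdc", "instance": "node1"},
]


def to_human(records):
    """Collapse per-OSD records into one record per (device, instance).

    OSDs sharing a disk are joined with '+', giving a display string
    such as 'osd.1+osd.2' (requirement 1).
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["device"], rec["instance"])].append(rec["ceph_daemon"])
    return [
        {
            # Plain string sort, enough for this illustration.
            "ceph_daemon": "+".join(sorted(daemons)),
            "device": device,
            "instance": instance,
        }
        for (device, instance), daemons in sorted(grouped.items())
    ]


human = to_human(per_osd)
for rec in human:
    print(rec)
```

In this sketch the first list plays the role of `ceph_disk_occupation`
and the second the role of `ceph_disk_occupation_human`.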

Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk).  This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.

`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.

    foo * on(ceph_daemon) group_left ceph_disk_occupation

`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).

    foo * on(device,instance)
    group_left(ceph_daemon) ceph_disk_occupation_human
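
Why each metric fits its query can be checked with a small Python sketch.
With `group_left`, PromQL requires the right-hand ("one") side to be
unique on the labels listed in `on(...)`. Otherwise the query fails with
a many-to-many matching error. The sample series and the helper
`is_unique_on` below are hypothetical and only model that uniqueness
rule. They are not part of Ceph or Prometheus.

```python
from collections import Counter

# Hypothetical label sets for the two metadata metrics, with two OSDs
# sharing the disk `sdb` on `node1`.
occupation = [
    {"ceph_daemon": "osd.1", "device": "sdb", "instance": "node1"},
    {"ceph_daemon": "osd.2", "device": "sdb", "instance": "node1"},
]
occupation_human = [
    {"ceph_daemon": "osd.1+osd.2", "device": "sdb", "instance": "node1"},
]


def is_unique_on(series, labels):
    """Return True if no two series share the same values for `labels`.

    This models the condition the "one" side of a group_left match
    has to satisfy.
    """
    counts = Counter(tuple(s[name] for name in labels) for s in series)
    return all(count == 1 for count in counts.values())


# Matching on ceph_daemon works with the per-OSD metric ...
print(is_unique_on(occupation, ["ceph_daemon"]))
# ... but matching the same metric on (device, instance) would be ambiguous.
print(is_unique_on(occupation, ["device", "instance"]))
# The human-readable metric is unique on (device, instance).
print(is_unique_on(occupation_human, ["device", "instance"]))
```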

Fixes: https://tracker.ceph.com/issues/52974
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>

author	Patrick Seidensal <pseidensal@suse.com>
	Mon, 25 Oct 2021 13:00:14 +0000 (15:00 +0200)
committer	Patrick Seidensal <patrick@nawracay.de>
	Thu, 13 Jan 2022 12:27:55 +0000 (13:27 +0100)
commit	18d3a71618a5e3bc3cbd0bce017fb7b9c18c2ca0
tree	5e3d53e4cdded29d0a858be2967bf715e769a993
parent	154d3525b19135a929851c0b027da19abda20ebe

doc/mgr/prometheus.rst
monitoring/grafana/dashboards/host-details.json
monitoring/grafana/dashboards/hosts-overview.json
monitoring/grafana/dashboards/jsonnet/grafana_dashboards.jsonnet
monitoring/grafana/dashboards/osd-device-details.json
monitoring/grafana/dashboards/tests/features/host-details.feature	[new file with mode: 0644]
monitoring/grafana/dashboards/tests/features/hosts_overview.feature
monitoring/grafana/dashboards/tests/features/osd-device-details.feature	[new file with mode: 0644]
src/pybind/mgr/prometheus/module.py
src/pybind/mgr/prometheus/test_module.py	[new file with mode: 0644]