Fix issues with PromQL expressions and vector matching on the
`ceph_disk_occupation` metric.
As it turns out, `ceph_disk_occupation` cannot simply be used as
expected. There are edge cases for users who run several OSDs on a
single disk, which leads to issues that cannot be solved by PromQL
alone (many-to-many PromQL errors): in those rare cases the data simply
differs from what the queries expect.
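To illustrate the edge case (a sketch with hypothetical label values, not
taken from a real cluster): with two OSDs on one disk, the metadata metric
exposes two series that differ only in `ceph_daemon`:
    ceph_disk_occupation{ceph_daemon="osd.1", device="sdd", instance="node1"} 1
    ceph_disk_occupation{ceph_daemon="osd.2", device="sdd", instance="node1"} 1
A vector match on `device` and `instance` that expects a single metadata
series per disk, such as
    node_disk_bytes_written * on(device, instance) ceph_disk_occupation
may then be rejected by Prometheus with a many-to-many matching error,
because the metadata side is no longer unique for the match group.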
I have not found a PromQL-only solution to this issue. What we basically
need is the following:
1. Match on the `host` and `instance` labels to get one or more OSD
names from a metadata metric (`ceph_disk_occupation`), so that a user
can see which OSDs belong to which disk.
2. Match on the `ceph_daemon` label of the `ceph_disk_occupation`
metric, in which case the value of `ceph_daemon` must not refer to more
than a single OSD. The exact opposite of requirement 1.
As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements with one metric, the intention of
this commit is to add a second, similar metric so that each requirement
is satisfied by one of the two. This enables queries to differentiate
between a vector match whose result is displayed to the user (where
`ceph_daemon` may be `osd.1` or `osd.1+osd.2`) and a vector match that
requires a single `ceph_daemon` value in its matching condition.
Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (and only when more than one OSD
runs on a single disk). This means that only the `ceph_disk_occupation`
metadata metric needs to be extended and provided as two metrics.
`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.
foo * on(ceph_daemon) group_left ceph_disk_occupation
`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).
foo * on(device,instance)
group_left(ceph_daemon) ceph_disk_occupation_human
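For the edge case sketched above, the two metrics would roughly be exposed
as follows (hypothetical label values; the exact label set may differ):
    ceph_disk_occupation{ceph_daemon="osd.1", device="sdd", instance="node1"} 1
    ceph_disk_occupation{ceph_daemon="osd.2", device="sdd", instance="node1"} 1
    ceph_disk_occupation_human{ceph_daemon="osd.1+osd.2", device="sdd", instance="node1"} 1
This way, a match on (device, instance) against `ceph_disk_occupation_human`
finds exactly one metadata series per disk, while a match on (ceph_daemon)
against `ceph_disk_occupation` still refers to exactly one OSD per series.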
Fixes: https://tracker.ceph.com/issues/52974
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
(cherry picked from commit 18d3a71618a5e3bc3cbd0bce017fb7b9c18c2ca0)
Conflicts:
monitoring/grafana/dashboards/host-details.json
monitoring/grafana/dashboards/hosts-overview.json
monitoring/grafana/dashboards/jsonnet/grafana_dashboards.jsonnet
monitoring/grafana/dashboards/osd-device-details.json
monitoring/grafana/dashboards/tests/features/hosts_overview.feature
src/pybind/mgr/prometheus/module.py
- Octopus does not generate Grafana dashboards using jsonnet, hence
grafana_dashboards.jsonnet was removed.
- Octopus does not support the Gherkin-based dashboard tests, hence
hosts_overview.feature was removed.
- Features implemented in prometheus/module.py that were never
backported to Octopus were removed.
- `tox.ini` file adapted to include mgr/prometheus tests introduced by
the backport.
- Add `cherrypy` to src/pybind/mgr/requirements.txt to fix Prometheus
unit testing.
::
- ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}
+ ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
To use this to get disk statistics by OSD ID, use either the ``and`` operator or
the ``*`` operator in your prometheus query. All metadata metrics (like ``
-ceph_disk_occupation`` have the value 1 so they act neutral with ``*``. Using ``*``
+ceph_disk_occupation_human`` have the value 1 so they act neutral with ``*``. Using ``*``
allows to use ``group_left`` and ``group_right`` grouping modifiers, so that
the resulting metric has additional labels from one side of the query.
::
- rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}
+ rate(node_disk_bytes_written[30s]) and
+ on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Out of the box the above query will not return any metrics since the ``instance`` labels of
-both metrics don't match. The ``instance`` label of ``ceph_disk_occupation``
+both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
will be the currently active MGR node.
- The following two section outline two approaches to remedy this.
+The following two section outline two approaches to remedy this.
+
+.. note::
+
+ If you need to group on the `ceph_daemon` label instead of `device` and
+ `instance` labels, using `ceph_disk_occupation_human` may not work reliably.
+ It is advised that you use `ceph_disk_occupation` instead.
+
+ The difference is that `ceph_disk_occupation_human` may group several OSDs
+ into the value of a single `ceph_daemon` label in cases where multiple OSDs
+ share a disk.
Use label_replace
=================
::
- label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}
+ label_replace(
+ rate(node_disk_bytes_written[30s]),
+ "exported_instance",
+ "$1",
+ "instance",
+ "(.*):.*"
+ ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Configuring Prometheus server
=============================
"steppedLine": false,
"targets": [
{
- "expr": "label_replace(\n (\n irate(node_disk_writes_completed{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or\n irate(node_disk_writes_completed_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])\n ),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n)\n* on(instance, device, ceph_daemon) group_left\n label_replace(\n label_replace(\n ceph_disk_occupation,\n \"device\",\n \"$1\",\n \"device\",\n \"/dev/(.*)\"\n ),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n )",
+ "expr": "label_replace(\n (\n irate(node_disk_writes_completed{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or\n irate(node_disk_writes_completed_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])\n ),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n)\n* on(instance, device) group_left(ceph_daemon)\n label_replace(\n label_replace(\n ceph_disk_occupation_human,\n \"device\",\n \"$1\",\n \"device\",\n \"/dev/(.*)\"\n ),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n )",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{device}}({{ceph_daemon}}) writes",
"textEditor": true
},
{
- "expr": "label_replace(\n (irate(node_disk_reads_completed{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or irate(node_disk_reads_completed_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n)\n* on(instance, device, ceph_daemon) group_left\n label_replace(\n label_replace(\n ceph_disk_occupation,\n \"device\",\n \"$1\",\n \"device\",\n \"/dev/(.*)\"\n ),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n )",
+ "expr": "label_replace(\n (irate(node_disk_reads_completed{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or irate(node_disk_reads_completed_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n)\n* on(instance, device) group_left(ceph_daemon)\n label_replace(\n label_replace(\n ceph_disk_occupation_human,\n \"device\",\n \"$1\",\n \"device\",\n \"/dev/(.*)\"\n ),\n \"instance\",\n \"$1\",\n \"instance\",\n \"([^:.]*).*\"\n )",
"format": "time_series",
"hide": false,
"intervalFactor": 1,
"steppedLine": false,
"targets": [
{
- "expr": "label_replace((irate(node_disk_bytes_written{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or irate(node_disk_written_bytes_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") * on(instance, device, ceph_daemon) group_left label_replace(label_replace(ceph_disk_occupation, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace((irate(node_disk_bytes_written{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or irate(node_disk_written_bytes_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") * on(instance, device) group_left(ceph_daemon) label_replace(label_replace(ceph_disk_occupation_human, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{device}}({{ceph_daemon}}) write",
"refId": "B"
},
{
- "expr": "label_replace((irate(node_disk_bytes_read{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or irate(node_disk_read_bytes_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") * on(instance, device, ceph_daemon) group_left label_replace(label_replace(ceph_disk_occupation, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace((irate(node_disk_bytes_read{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) or irate(node_disk_read_bytes_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m])), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") * on(instance, device) group_left(ceph_daemon) label_replace(label_replace(ceph_disk_occupation_human, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{device}}({{ceph_daemon}}) read",
"steppedLine": false,
"targets": [
{
- "expr": "max by(instance,device) (label_replace((irate(node_disk_write_time_seconds_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) ) / clamp_min(irate(node_disk_writes_completed_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]), 0.001) or (irate(node_disk_read_time_seconds_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) ) / clamp_min(irate(node_disk_reads_completed_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]), 0.001), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")) * on(instance, device, ceph_daemon) group_left label_replace(label_replace(ceph_disk_occupation{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "max by(instance,device) (label_replace((irate(node_disk_write_time_seconds_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) ) / clamp_min(irate(node_disk_writes_completed_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]), 0.001) or (irate(node_disk_read_time_seconds_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) ) / clamp_min(irate(node_disk_reads_completed_total{ instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]), 0.001), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")) * on(instance, device) group_left(ceph_daemon) label_replace(label_replace(ceph_disk_occupation_human{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"hide": false,
"intervalFactor": 1,
"steppedLine": false,
"targets": [
{
- "expr": "label_replace(((irate(node_disk_io_time_ms{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) / 10 ) or irate(node_disk_io_time_seconds_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) * 100), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") * on(instance, device, ceph_daemon) group_left label_replace(label_replace(ceph_disk_occupation{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace(((irate(node_disk_io_time_ms{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) / 10 ) or irate(node_disk_io_time_seconds_total{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}[5m]) * 100), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") * on(instance, device) group_left(ceph_daemon) label_replace(label_replace(ceph_disk_occupation_human{instance=~\"($ceph_hosts)([\\\\.:].*)?\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"hide": false,
"intervalFactor": 1,
"tableColumn": "",
"targets": [
{
- "expr" : "avg (\n label_replace((irate(node_disk_io_time_ms[5m]) / 10 ) or\n (irate(node_disk_io_time_seconds_total[5m]) * 100), \"instance\", \"$1\", \"instance\", \"([^.:]*).*\"\n ) *\n on(instance, device, ceph_daemon) label_replace(label_replace(ceph_disk_occupation{instance=~\"($osd_hosts).*\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^.:]*).*\")\n)",
+ "expr": "avg (\n label_replace((irate(node_disk_io_time_ms[5m]) / 10 ) or\n (irate(node_disk_io_time_seconds_total[5m]) * 100), \"instance\", \"$1\", \"instance\", \"([^.:]*).*\"\n ) *\n on(instance, device) group_left(ceph_daemon) label_replace(label_replace(ceph_disk_occupation_human{instance=~\"($osd_hosts).*\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^.:]*).*\")\n)",
"format": "time_series",
"instant": true,
"intervalFactor": 1,
"steppedLine": false,
"targets": [
{
- "expr": "(label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\"))",
+ "expr": "(label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\"))",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{instance}}/{{device}} Reads",
"refId": "A"
},
{
- "expr": "(label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\"))",
+ "expr": "(label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\"))",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{instance}}/{{device}} Writes",
"steppedLine": false,
"targets": [
{
- "expr": "label_replace(irate(node_disk_writes_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace(irate(node_disk_writes_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{device}} on {{instance}} Writes",
"refId": "A"
},
{
- "expr": "label_replace(irate(node_disk_reads_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace(irate(node_disk_reads_completed_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{device}} on {{instance}} Reads",
"steppedLine": false,
"targets": [
{
- "expr": "label_replace(irate(node_disk_read_bytes_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace(irate(node_disk_read_bytes_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{instance}} {{device}} Reads",
"refId": "A"
},
{
- "expr": "label_replace(irate(node_disk_written_bytes_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace(irate(node_disk_written_bytes_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{instance}} {{device}} Writes",
"steppedLine": false,
"targets": [
{
- "expr": "label_replace(irate(node_disk_io_time_seconds_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
+ "expr": "label_replace(irate(node_disk_io_time_seconds_total[1m]), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~\"$osd\"}, \"device\", \"$1\", \"device\", \"/dev/(.*)\"), \"instance\", \"$1\", \"instance\", \"([^:.]*).*\")",
"format": "time_series",
"intervalFactor": 1,
"legendFormat": "{{device}} on {{instance}}",
if v < StrictVersion("9.0.0"):
from cherrypy.wsgiserver.ssl_builtin import BuiltinSSLAdapter as builtin_ssl
else:
- from cheroot.ssl.builtin import BuiltinSSLAdapter as builtin_ssl
+ from cheroot.ssl.builtin import BuiltinSSLAdapter as builtin_ssl # type: ignore
builtin_ssl.wrap = new_wrap(builtin_ssl.wrap)
from rbd import RBD
from collections import namedtuple
try:
- from typing import DefaultDict, Optional, Dict, Any, Set
+ from typing import DefaultDict, Optional, Dict, Any, Set, Tuple, Union, List, Callable
+ LabelValues = Tuple[str, ...]
+ Number = Union[int, float]
+ MetricValue = Dict[LabelValues, Number]
except ImportError:
pass
)
return expfmt
+ def group_by(
+ self,
+ keys: List[str],
+ joins: Dict[str, Callable[[List[str]], str]],
+ name: Optional[str] = None,
+ ) -> "Metric":
+ """
+ Groups data by label names.
+
+ Label names that are not passed are removed from the resulting metric,
+ but by providing a join function, labels of metrics can be grouped.
+
+ The purpose of this method is to provide a version of a metric that can
+ be used in matching where otherwise multiple results would be returned.
+
+ As grouping is possible in Prometheus, the only additional value of this
+ method is the possibility to join labels when grouping. For that reason,
+ passing joins is required. Please use PromQL expressions in all other
+ cases.
+
+ >>> m = Metric('type', 'name', '', labels=('label1', 'id'))
+ >>> m.value = {
+ ... ('foo', 'x'): 1,
+ ... ('foo', 'y'): 1,
+ ... }
+ >>> m.group_by(['label1'], {'id': lambda ids: ','.join(ids)}).value
+ {('foo', 'x,y'): 1}
+
+ The functionality of group by could roughly be compared with Prometheus'
+
+ group (ceph_disk_occupation) by (device, instance)
+
+ with the exception that labels which aren't used as a grouping condition
+ are not all discarded: their values can be joined and the labels are
+ thereby preserved.
+
+ This function takes the value of the first entry of a found group to be
+ used for the resulting value of the grouping operation.
+
+ >>> m = Metric('type', 'name', '', labels=('label1', 'id'))
+ >>> m.value = {
+ ... ('foo', 'x'): 555,
+ ... ('foo', 'y'): 10,
+ ... }
+ >>> m.group_by(['label1'], {'id': lambda ids: ','.join(ids)}).value
+ {('foo', 'x,y'): 555}
+ """
+ assert self.labelnames, "cannot match keys without label names"
+ for key in keys:
+ assert key in self.labelnames, "unknown key: {}".format(key)
+ assert joins, "joins must not be empty"
+ assert all(callable(c) for c in joins.values()), "joins must be callable"
+
+ # group
+ grouped = defaultdict(list) # type: Dict[LabelValues, List[Tuple[Dict[str, str], Number]]]
+ for label_values, metric_value in self.value.items():
+ labels = dict(zip(self.labelnames, label_values))
+ if not all(k in labels for k in keys):
+ continue
+ group_key = tuple(labels[k] for k in keys)
+ grouped[group_key].append((labels, metric_value))
+
+ # as there is nothing specified on how to join labels that are not equal
+ # and Prometheus `group` aggregation functions similarly, we simply drop
+ # those labels.
+ labelnames = tuple(
+ label for label in self.labelnames if label in keys or label in joins
+ )
+ superfluous_labelnames = [
+ label for label in self.labelnames if label not in labelnames
+ ]
+
+ # iterate and convert groups with more than one member into a single
+ # entry
+ values = {} # type: MetricValue
+ for group in grouped.values():
+ labels, metric_value = group[0]
+
+ for label in superfluous_labelnames:
+ del labels[label]
+
+ if len(group) > 1:
+ for key, fn in joins.items():
+ labels[key] = fn(list(labels[key] for labels, _ in group))
+
+ values[tuple(labels.values())] = metric_value
+
+ new_metric = Metric(self.mtype, name if name else self.name, self.desc, labelnames)
+ new_metric.value = values
+
+ return new_metric
+
class MetricCollectionThread(threading.Thread):
def __init__(self, module):
DISK_OCCUPATION
)
+ metrics['disk_occupation_human'] = Metric(
+ 'untyped',
+ 'disk_occupation_human',
+ 'Associate Ceph daemon with disk used for displaying to humans,'
+ ' not for joining tables (vector matching)',
+ DISK_OCCUPATION, # label names are automatically decimated on grouping
+ )
+
metrics['pool_metadata'] = Metric(
'untyped',
'pool_metadata',
self.log.info("Missing dev node metadata for osd {0}, skipping "
"occupation record for this osd".format(id_))
+ if 'disk_occupation' in self.metrics:
+ try:
+ self.metrics['disk_occupation_human'] = \
+ self.metrics['disk_occupation'].group_by(
+ ['device', 'instance'],
+ {'ceph_daemon': lambda daemons: ', '.join(daemons)},
+ name='disk_occupation_human',
+ )
+ except Exception as e:
+ self.log.error(e)
+
for pool in osd_map['pools']:
self.metrics['pool_metadata'].set(
1, (pool['pool'], pool['pool_name']))
--- /dev/null
+from typing import Dict
+from unittest import TestCase
+
+from prometheus.module import Metric, LabelValues, Number
+
+
+class MetricGroupTest(TestCase):
+ def setUp(self):
+ self.DISK_OCCUPATION = (
+ "ceph_daemon",
+ "device",
+ "db_device",
+ "wal_device",
+ "instance",
+ )
+ self.metric_value: Dict[LabelValues, Number] = {
+ ("osd.0", "/dev/dm-0", "", "", "node1"): 1,
+ ("osd.1", "/dev/dm-0", "", "", "node3"): 1,
+ ("osd.2", "/dev/dm-0", "", "", "node2"): 1,
+ ("osd.3", "/dev/dm-1", "", "", "node1"): 1,
+ ("osd.4", "/dev/dm-1", "", "", "node3"): 1,
+ ("osd.5", "/dev/dm-1", "", "", "node2"): 1,
+ ("osd.6", "/dev/dm-1", "", "", "node2"): 1,
+ }
+
+ def test_metric_group_by(self):
+ m = Metric("untyped", "disk_occupation", "", self.DISK_OCCUPATION)
+ m.value = self.metric_value
+ grouped_metric = m.group_by(
+ ["device", "instance"],
+ {"ceph_daemon": lambda xs: "+".join(xs)},
+ name="disk_occupation_display",
+ )
+ self.assertEqual(
+ grouped_metric.value,
+ {
+ ("osd.0", "/dev/dm-0", "node1"): 1,
+ ("osd.1", "/dev/dm-0", "node3"): 1,
+ ("osd.2", "/dev/dm-0", "node2"): 1,
+ ("osd.3", "/dev/dm-1", "node1"): 1,
+ ("osd.4", "/dev/dm-1", "node3"): 1,
+ ("osd.5+osd.6", "/dev/dm-1", "node2"): 1,
+ },
+ )
+ self.maxDiff = None
+ self.assertEqual(
+ grouped_metric.str_expfmt(),
+ """
+# HELP ceph_disk_occupation_display
+# TYPE ceph_disk_occupation_display untyped
+ceph_disk_occupation_display{ceph_daemon="osd.0",device="/dev/dm-0",instance="node1"} 1.0
+ceph_disk_occupation_display{ceph_daemon="osd.1",device="/dev/dm-0",instance="node3"} 1.0
+ceph_disk_occupation_display{ceph_daemon="osd.2",device="/dev/dm-0",instance="node2"} 1.0
+ceph_disk_occupation_display{ceph_daemon="osd.3",device="/dev/dm-1",instance="node1"} 1.0
+ceph_disk_occupation_display{ceph_daemon="osd.4",device="/dev/dm-1",instance="node3"} 1.0
+ceph_disk_occupation_display{ceph_daemon="osd.5+osd.6",device="/dev/dm-1",instance="node2"} 1.0""", # noqa: W291
+ )
+ self.assertEqual(
+ grouped_metric.labelnames, ("ceph_daemon", "device", "instance")
+ )
+
+ def test_metric_group_by__no_value(self):
+ m = Metric("metric_type", "name", "desc", labels=('foo', 'bar'))
+ grouped = m.group_by(['foo'], {'bar': lambda bars: ', '.join(bars)})
+ self.assertEqual(grouped.value, {})
+ self.assertEqual(grouped.str_expfmt(),
+ '\n# HELP ceph_name desc\n# TYPE ceph_name metric_type')
+
+ def test_metric_group_by__no_labels(self):
+ m = Metric("metric_type", "name", "desc", labels=None)
+ with self.assertRaises(AssertionError) as cm:
+ m.group_by([], {})
+ self.assertEqual(str(cm.exception), "cannot match keys without label names")
+
+ def test_metric_group_by__key_not_in_labels(self):
+ m = Metric("metric_type", "name", "desc", labels=("foo", "bar"))
+ m.value = self.metric_value
+ with self.assertRaises(AssertionError) as cm:
+ m.group_by(["baz"], {})
+ self.assertEqual(str(cm.exception), "unknown key: baz")
+
+ def test_metric_group_by__empty_joins(self):
+ m = Metric("", "", "", ("foo", "bar"))
+ with self.assertRaises(AssertionError) as cm:
+ m.group_by(["foo"], joins={})
+ self.assertEqual(str(cm.exception), "joins must not be empty")
+
+ def test_metric_group_by__joins_not_callable(self):
+ m = Metric("", "", "", ("foo", "bar"))
+ m.value = self.metric_value
+ with self.assertRaises(AssertionError) as cm:
+ m.group_by(["foo"], {"bar": "not callable str"})
+ self.assertEqual(str(cm.exception), "joins must be callable")
+cherrypy
pytest-cov==2.7.1
mock; python_version <= '3.3'
ipaddress; python_version < '3.3'
cephadm/ \
orchestrator/ \
pg_autoscaler/ \
- progress/}
+ progress/ \
+ prometheus/ \
+ }
[testenv:mypy]
basepython = python3