From: Patrick Seidensal
Date: Wed, 11 Nov 2020 17:55:30 +0000 (+0100)
Subject: mgr/dashboard: prometheus alerting: add some leeway for package drops and errors...
X-Git-Tag: v17.1.0~2960^2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=refs%2Fpull%2F38030%2Fhead;p=ceph.git

mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%)

Fixes: https://tracker.ceph.com/issues/48201
Signed-off-by: Patrick Seidensal
---

diff --git a/monitoring/prometheus/alerts/ceph_default_alerts.yml b/monitoring/prometheus/alerts/ceph_default_alerts.yml
index b14eb15460cc..f7e8ce1188e0 100644
--- a/monitoring/prometheus/alerts/ceph_default_alerts.yml
+++ b/monitoring/prometheus/alerts/ceph_default_alerts.yml
@@ -175,30 +175,48 @@ groups:
           description: >
             Root volume (OSD and MON store) is dangerously full: {{ $value | humanize }}% free.
 
-      # alert on nic packet errors and drops rates > 1 packet/s
+      # alert on nic packet errors and drops rates > 1% packets/s
       - alert: network packets dropped
-        expr: irate(node_network_receive_drop_total{device!="lo"}[5m]) + irate(node_network_transmit_drop_total{device!="lo"}[5m]) > 1
+        expr: |
+          (
+            increase(node_network_receive_drop_total{device!="lo"}[1m]) +
+            increase(node_network_transmit_drop_total{device!="lo"}[1m])
+          ) / (
+            increase(node_network_receive_packets_total{device!="lo"}[1m]) +
+            increase(node_network_transmit_packets_total{device!="lo"}[1m])
+          ) >= 0.0001 or (
+            increase(node_network_receive_drop_total{device!="lo"}[1m]) +
+            increase(node_network_transmit_drop_total{device!="lo"}[1m])
+          ) >= 10
         labels:
           severity: warning
           type: ceph_default
           oid: 1.3.6.1.4.1.50495.15.1.2.8.2
         annotations:
           description: >
-            Node {{ $labels.instance }} experiences packet drop > 1
-            packet/s on interface {{ $labels.device }}.
+            Node {{ $labels.instance }} experiences packet drop > 0.01% or >
+            10 packets/s on interface {{ $labels.device }}.
 
       - alert: network packet errors
         expr: |
-          irate(node_network_receive_errs_total{device!="lo"}[5m]) +
-          irate(node_network_transmit_errs_total{device!="lo"}[5m]) > 1
+          (
+            increase(node_network_receive_errs_total{device!="lo"}[1m]) +
+            increase(node_network_transmit_errs_total{device!="lo"}[1m])
+          ) / (
+            increase(node_network_receive_packets_total{device!="lo"}[1m]) +
+            increase(node_network_transmit_packets_total{device!="lo"}[1m])
+          ) >= 0.0001 or (
+            increase(node_network_receive_errs_total{device!="lo"}[1m]) +
+            increase(node_network_transmit_errs_total{device!="lo"}[1m])
+          ) >= 10
         labels:
           severity: warning
           type: ceph_default
           oid: 1.3.6.1.4.1.50495.15.1.2.8.3
         annotations:
           description: >
-            Node {{ $labels.instance }} experiences packet errors > 1
-            packet/s on interface {{ $labels.device }}.
+            Node {{ $labels.instance }} experiences packet errors > 0.01% or
+            > 10 packets/s on interface {{ $labels.device }}.
 
       - alert: storage filling up
         expr: |