From: Ernesto Puerta <37327689+epuertat@users.noreply.github.com>
Date: Thu, 11 Nov 2021 16:36:30 +0000 (+0100)
Subject: Merge pull request #43464 from rsommer/wip-prometheus-standby-behaviour
X-Git-Tag: v17.1.0~457
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=45eb9dd328c53c252f7c2a17cf96e3c50739f19f;p=ceph.git

Merge pull request #43464 from rsommer/wip-prometheus-standby-behaviour

mgr/prometheus: Make prometheus standby behaviour configurable

Reviewed-by: Ernesto Puerta
Reviewed-by: Pere Diaz Bou
---

45eb9dd328c53c252f7c2a17cf96e3c50739f19f
diff --cc doc/mgr/prometheus.rst
index 733c4bfdb4f3,a8774ff33234..ef138477608c
--- a/doc/mgr/prometheus.rst
+++ b/doc/mgr/prometheus.rst
@@@ -96,45 -98,26 +98,63 @@@ If you are confident that you don't req
 
      ceph config set mgr mgr/prometheus/cache false
 
+ If you are running the prometheus module behind some kind of reverse proxy or
+ load balancer, you can simplify discovering the active instance by switching
+ to ``error``-mode::
+ 
+     ceph config set mgr mgr/prometheus/standby_behaviour error
+ 
+ If set, the prometheus module will respond with an HTTP error when ``/`` is
+ requested from the standby instance. The default error code is 500, but you
+ can configure the HTTP response code with::
+ 
+     ceph config set mgr mgr/prometheus/standby_error_status_code 503
+ 
+ Valid error codes are between 400 and 599.
+ 
+ To switch back to the default behaviour, simply set the config key to
+ ``default``::
+ 
+     ceph config set mgr mgr/prometheus/standby_behaviour default
+ 
 .. _prometheus-rbd-io-statistics:
 
+Ceph Health Checks
+------------------
+
+The mgr/prometheus module also tracks and maintains a history of Ceph health
+checks, exposing them to the Prometheus server as discrete metrics. This
+allows Prometheus alert rules to be configured for specific health check
+events.
+
+The metrics take the following form::
+
+    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
+    # TYPE ceph_health_detail gauge
+    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
+    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
+    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0
+
+The health check history is made available through the following commands::
+
+    healthcheck history ls [--format {plain|json|json-pretty}]
+    healthcheck history clear
+
+The ``ls`` command provides an overview of the health checks that the cluster
+has encountered, either over its lifetime or since the last ``clear`` command
+was issued. For example::
+
+    [ceph: root@c8-node1 /]# ceph healthcheck history ls
+    Healthcheck Name    First Seen (UTC)     Last seen (UTC)      Count  Active
+    OSDMAP_FLAGS        2021/09/16 03:17:47  2021/09/16 22:07:40  2      No
+    OSD_DOWN            2021/09/17 00:11:59  2021/09/17 00:11:59  1      Yes
+    PG_DEGRADED         2021/09/17 00:11:59  2021/09/17 00:11:59  1      Yes
+    3 health check(s) listed
+
 RBD IO statistics
 -----------------
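As an aside (not part of the patch itself): since the added documentation shows the exact exposition format of ``ceph_health_detail``, a consumer can extract the currently active health checks from a scrape payload with a few lines of parsing. The sketch below is illustrative only; the helper names (``parse_health_detail``) are made up for this example and the sample text is copied from the documented output above.

```python
import re

# Matches one ceph_health_detail sample in the exposition format shown above,
# capturing the health check name, its severity label, and the gauge value.
METRIC_RE = re.compile(
    r'^ceph_health_detail\{name="(?P<name>[^"]+)",severity="(?P<severity>[^"]+)"\}\s+(?P<value>[\d.]+)$'
)

def parse_health_detail(exposition_text):
    """Map each health check name to (severity, active), skipping # comments."""
    checks = {}
    for line in exposition_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            checks[m.group("name")] = (m.group("severity"),
                                       float(m.group("value")) == 1.0)
    return checks

# Sample payload taken verbatim from the documentation above.
sample = """\
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
"""

active = [name for name, (_, is_active) in parse_health_detail(sample).items()
          if is_active]
print(active)  # ['OSD_DOWN']
```

A Prometheus alert rule comparing ``ceph_health_detail == 1`` per ``name`` label achieves the same thing server-side; the snippet is only meant to make the metric shape concrete.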