author	Anmol Babu <Anmol.Babu@ibm.com>
	Mon, 8 Sep 2025 03:40:27 +0000 (09:10 +0530)
committer	Anmol Babu <Anmol.Babu@ibm.com>
	Sat, 11 Oct 2025 01:11:04 +0000 (06:41 +0530)
commit	d4b9aa6113eb63a859a7adfa27fba94d428b9ad0
tree	d9a449f51c3f47be689ce87f32f19919c2e1bfe9
parent	f96567578976c2c84a31d366a04c28fb95ceb0d9

Increase metric priorities to CRITICAL for metrics used in dashboard

During scale testing, we observed that the volume of metrics was very
large on large clusters. We analyzed the metrics used in the dashboards
against the complete list of collected metrics and made the following
observations:
1. Only 17 metrics were used in the dashboards (Grafana and management UI).
2. Around 245 metrics in total were collected in the Prometheus stack.

A large number of metrics incurs:
1. Greater CPU and memory demand for marshaling and unmarshaling
2. Greater storage volume
3. Increased per-scrape network consumption

We intend to bump all the metrics used in the Ceph monitoring dashboards
to prio_level CRITICAL and also raise the default ceph-exporter prio_level
to CRITICAL, so that Prometheus ends up with only the required metrics.
This is Part 1 of the effort to ask the metric implementation teams to
revisit the metric priorities.
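
The filtering idea described above can be sketched as follows. This is a
minimal illustration, not the ceph-exporter implementation: the `Metric`
struct, the `exported_metrics` helper, and the numeric priority values are
all assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical priority levels for illustration only; the numeric values
// are assumptions and are not taken from this commit.
enum Priority : int {
  PRIO_DEBUGONLY = 0,
  PRIO_UNINTERESTING = 2,
  PRIO_USEFUL = 5,
  PRIO_INTERESTING = 8,
  PRIO_CRITICAL = 10,
};

// Hypothetical stand-in for a registered perf counter.
struct Metric {
  std::string name;
  int prio;
};

// Return the names of the metrics whose priority is at or above `limit`.
// With limit == PRIO_CRITICAL, only metrics bumped to CRITICAL survive.
std::vector<std::string> exported_metrics(const std::vector<Metric>& all,
                                          int limit) {
  std::vector<std::string> out;
  for (const auto& m : all) {
    if (m.prio >= limit) {
      out.push_back(m.name);
    }
  }
  return out;
}
```

Under these assumptions, a lower `limit` lets more metrics through, at the
cost of more data per scrape.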

If a customer needs other metrics, they can lower the ceph-exporter prio
level and restart the ceph-exporter, after carefully evaluating the
storage, CPU, and networking costs.

Signed-off-by: Anmol Babu <Anmol.Babu@ibm.com>

src/os/bluestore/BlueFS.cc
src/os/bluestore/BlueStore.cc
src/osd/osd_perf_counters.cc
src/rgw/rgw_perf_counters.cc