Paul Cuzner [Fri, 28 Jul 2017 08:02:45 +0000 (20:02 +1200)]
osd: add % used to each OSD
Percent used was previously available only under the physical disk, not
within the osd metric tree. With percent_used now included under the osd,
queries can reference it directly by osd_id.
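
As a rough illustration of the idea above, the sketch below derives a percent-used figure for an OSD mount point and files it under the OSD's id. The function names, the nested dict layout and the use of os.statvfs are assumptions made for this example, not the collector's actual code.

```python
import os


def percent_used(mount_path):
    """Return the used capacity of a mounted filesystem as a percentage."""
    stats = os.statvfs(mount_path)
    total = stats.f_blocks * stats.f_frsize
    if total == 0:
        # Pseudo filesystems can report zero blocks; avoid dividing by zero.
        return 0.0
    free = stats.f_bfree * stats.f_frsize
    return round(100.0 * (total - free) / total, 2)


def osd_stats(osd_id, mount_path):
    """Place percent_used under the osd id (hypothetical layout)."""
    return {"osd": {str(osd_id): {"percent_used": percent_used(mount_path)}}}
```

With the value stored under the osd id, a query can select it by OSD without first mapping the OSD to its physical disk.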
Paul Cuzner [Fri, 28 Jul 2017 02:28:19 +0000 (14:28 +1200)]
osd: support device-mapper (dmcrypt) OSDs/journals
dmcrypt osd/journals make use of /dev/mapper devices, so this change
supports the device-mapper naming for the device links. In addition,
all disks (osd/jrnl metrics) have two additional metrics, "osd_type" and
"encrypted", to help understand the status of the OSDs within the cluster.
Paul Cuzner [Mon, 24 Jul 2017 02:13:09 +0000 (14:13 +1200)]
alert-status dashboard : Enable default alerts
dashUpdater has been updated to automatically set up a cephmetrics
notifications channel (if it's not already there), and the alert-status
dashboard is loaded, which references the cephmetrics channel.
The ansible templates have been updated to reflect the introduction of the
alert-status dashboard.
Paul Cuzner [Fri, 21 Jul 2017 22:25:17 +0000 (10:25 +1200)]
osd: fix determination of osd type
The presence of the type file was being relied upon across versions.
However, not all versions provide this file (10.2.2 did, 10.2.7 didn't!), so
this fix looks for the type file and uses it if it's there. If not, it
looks for the presence of the journal link to determine whether the osd
is filestore. It is assumed that bluestore will 'always' use the type
file.
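
The lookup order described above could be sketched roughly as follows. The function name, the "unknown" fallback and the assumption that the type file and journal link sit directly in the OSD's data directory are illustrative only.

```python
import os


def get_osd_type(osd_path):
    """Guess the OSD backend from the contents of its data directory."""
    type_file = os.path.join(osd_path, "type")
    if os.path.exists(type_file):
        # Preferred source: the type file names the backend directly.
        with open(type_file) as f:
            return f.read().strip()
    # No type file: treat the presence of a journal link as filestore.
    # lexists() also returns True for a symlink whose target is missing.
    if os.path.lexists(os.path.join(osd_path, "journal")):
        return "filestore"
    return "unknown"
```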
Paul Cuzner [Fri, 7 Jul 2017 04:01:50 +0000 (16:01 +1200)]
dashUpdater: remove $domain from dashboards, if domain is not configured
For environments that don't use dns, collectd will not provide a FQDN
on the metric name. In these circumstances, the dashboards are empty.
This fix looks for the domain setting, and if it's not supplied the
$domain reference in all queries is removed before the dashboard is loaded
into grafana.
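
A minimal sketch of the rewrite described above, assuming the dashboard is held as a JSON-compatible dict and that queries carry the domain as a literal ".$domain" token. Both the token form and the function name are assumptions; the real dashboards may reference the variable differently.

```python
import json


def strip_domain(dashboard, domain=None, token=".$domain"):
    """Remove the domain variable from every query when no domain is set."""
    if domain:
        # A domain is configured, so the dashboard is loaded unchanged.
        return dashboard
    # Serialising to text lets one replace cover all panels and targets.
    text = json.dumps(dashboard)
    return json.loads(text.replace(token, ""))
```

For example, a target of "collectd.$host.$domain.cpu" would become "collectd.$host.cpu" when no domain is configured.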
Paul Cuzner [Thu, 6 Jul 2017 23:31:48 +0000 (11:31 +1200)]
osd: add support for osd related stats, and support journal devices
OSD daemons are now asked for perf data, so latencies within ceph can be
loaded into graphite. In addition, the journal device is detected. If it's
not collocated on the osd device, additional disk metrics under a journal
subtree are created within graphite.
Paul Cuzner [Thu, 6 Jul 2017 23:29:06 +0000 (11:29 +1200)]
common: changes to the Disk class
Two main things:
1. Disk instances are now initialized here, instead of by the caller,
simplifying the code in the osd class
2. get_real_dev function added to convert the device name of an OSD to the
name we'll use as a metric. This now provides initial support for nvme
and intelcas based OSDs
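
To illustrate the kind of conversion item 2 describes, here is a hedged sketch that strips the partition suffix from a device name. Only the name get_real_dev comes from the commit message; the regular expressions and behaviour are assumptions, and intelcas naming is deliberately left untouched because its format is not given here.

```python
import os
import re

# nvme partitions use a "p<N>" suffix, e.g. nvme0n1p1 -> nvme0n1
_NVME = re.compile(r"^(nvme\d+n\d+)(?:p\d+)?$")
# sdX/vdX style names simply append the partition number, e.g. sda1 -> sda
_SIMPLE = re.compile(r"^([a-z]+)\d*$")


def get_real_dev(dev_name):
    """Return the parent device name to use in the metric path."""
    name = os.path.basename(dev_name)
    for pattern in (_NVME, _SIMPLE):
        match = pattern.match(name)
        if match:
            return match.group(1)
    # Unrecognised naming schemes are returned unchanged.
    return name
```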
Paul Cuzner [Fri, 30 Jun 2017 02:05:33 +0000 (14:05 +1200)]
at-a-glance: multiple fixes to mon/osd/growth and forecast panels
- MON/OSD panel queries updated to address the interpolation problem where
  floats were shown. OSD panel also now shows total OSDs
- Templating update for the disk_full_threshold (2->80)
- Growth/Forecast panel queries updated to account for data coming from
  multiple mons
- Health panel updated to show as RED when the cluster is in an ERROR state
Paul Cuzner [Thu, 29 Jun 2017 04:50:30 +0000 (16:50 +1200)]
at-a-glance: pg status pie chart changes
A degraded state is now shown based on the difference between pg_active and
pg_active_clean. This intermediate metric has been added to the pie
chart, so it now shows active+clean, degraded and peering.
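
The arithmetic behind the degraded slice can be sketched as below. In the dashboard itself this would be a Graphite query (Graphite's diffSeries function computes such a difference); the Python only mirrors the calculation. The function name and the pg_peering input are assumptions for illustration.

```python
def pg_pie_slices(pg_active, pg_active_clean, pg_peering):
    """Derive the three pie chart slices from the raw pg counters."""
    # Active PGs that are not also clean are reported as degraded.
    degraded = max(pg_active - pg_active_clean, 0)
    return {
        "active+clean": pg_active_clean,
        "degraded": degraded,
        "peering": pg_peering,
    }
```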
Paul Cuzner [Thu, 29 Jun 2017 03:15:58 +0000 (15:15 +1200)]
network-usage: dashboard updated to track enX interface stats
graphite doesn't support blacklisting in queries, so interface names that
we're interested in have to be whitelisted. This fix now tracks enX, ethX and
bondX interface names.
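
The whitelist approach can be illustrated outside of Graphite with a small filter. The pattern below is an assumption about what enX, ethX and bondX cover (for example enp0s3, eth0 and bond0) and is not taken from the dashboard's query.

```python
import re

# Whitelist: names starting with en, eth or bond, followed by an identifier.
IFACE_WHITELIST = re.compile(r"^(?:en|eth|bond)[a-z0-9]+$")


def tracked_interfaces(names):
    """Keep only the interface names the dashboard is interested in."""
    return [name for name in names if IFACE_WHITELIST.match(name)]
```

Given ["lo", "eth0", "enp0s3", "bond0", "virbr0"], only eth0, enp0s3 and bond0 are kept.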
Paul Cuzner [Thu, 29 Jun 2017 03:12:54 +0000 (15:12 +1200)]
ceph-rados: display fixes to several charts
Health history was a mess with the light theme. The chart now uses threshold lines
(amber and red), and plots health against those lines. In addition, small fixes were made
to the capacity chart (it was stacking values!) and the monitor status table.
Paul Cuzner [Thu, 29 Jun 2017 03:09:01 +0000 (15:09 +1200)]
backend-storage: cosmetic changes to heatmap and graphs
Heatmap 'spectrum' was not showing well in the light theme - this
change makes it more readable. In addition, 'info' has been added to explain
the heatmap and what it represents.
Paul Cuzner [Thu, 29 Jun 2017 03:04:17 +0000 (15:04 +1200)]
at-a-glance : query and cosmetic changes
Multiple changes as follows:
- dashboard links (top) hover-over was misaligned; fixed
- forecast value if negative now shows N/A
- forecast and growth queries updated
- disks near full set to 0, if there aren't any issues (instead of no value)
- added descriptions on various panels
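
Two of the display rules in the list above (a negative forecast shown as N/A, and disks near full defaulting to 0) can be sketched as plain functions. In the dashboard these are handled by panel settings and queries; the function names and the "days" unit are assumptions for illustration.

```python
def format_forecast(days):
    """Show N/A when the forecast is missing or negative."""
    if days is None or days < 0:
        return "N/A"
    return "{:.0f} days".format(days)


def disks_near_full(count):
    """Report 0 rather than 'no value' when no disks are near full."""
    return 0 if count is None else count
```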
Paul Cuzner [Thu, 29 Jun 2017 02:53:43 +0000 (14:53 +1200)]
status-panel: Update the bkgnd color to support the light theme
By default the panel just uses 'green', which with the light theme is too
dark, making the text on the panel difficult to read. This ansible step just
updates the color in the css to make the text more readable