Paul Cuzner [Tue, 8 Aug 2017 21:48:55 +0000 (09:48 +1200)]
iscsi.py: fix to defer the import of rtslib_fb
The goal of the parent module, cephmetrics, is to be generic across the
different ceph roles. By deferring the import of rtslib to the instantiation
of the first (and only!) ISCSIGateway object, cephmetrics can import this
iscsi module without a problem regardless of the runtime environment.
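A minimal sketch of the deferred-import pattern described above. This is not the code from iscsi.py; the module_name parameter is a hypothetical addition so the pattern can be exercised on hosts where rtslib_fb is not installed.

```python
import importlib


class ISCSIGateway(object):
    """Collector whose backing library is only needed once it is created."""

    def __init__(self, module_name="rtslib_fb"):
        # Import at instantiation time rather than at module import time,
        # so importing this module never fails on hosts without the library.
        # module_name is a hypothetical parameter used for illustration.
        self.lib = importlib.import_module(module_name)
```

With this shape, an ImportError can only surface when an ISCSIGateway is actually created, which should only happen on nodes that have been detected as gateways.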
Paul Cuzner [Mon, 7 Aug 2017 23:43:44 +0000 (11:43 +1200)]
iscsi-overview: minor fixes and rename of the iscsi_gateways variable
The client configuration panel was not excluding null entries, so when
rbd images were unmasked from clients and not reused, they would still
show up in the table.
In addition, the templating variable iscsi_gateway was renamed to
iscsi_gateways, aligning it with the naming of osd_servers and rgw_servers.
Paul Cuzner [Mon, 7 Aug 2017 23:41:50 +0000 (11:41 +1200)]
dashboard.yml: updated to show the variable needed by the iscsi dashboard
The iscsi_gateways templating variable is used to generate the graphs,
so iSCSI-based deployments need to define this variable to ensure the
queries work correctly in grafana.
Paul Cuzner [Mon, 7 Aug 2017 00:20:10 +0000 (12:20 +1200)]
cephmetrics.py: updated to detect and collect stats for iSCSI gateways
The probe method now looks for the sysfs kernel entries that denote an
iscsi gateway is running on the node. When this dir is found an instance
of the iscsi collector (ISCSIGateway) is created and polled during
every read callback.
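The probe step could look roughly like the sketch below. The commit does not name the sysfs directory, so marker_dir is left as a parameter, and collector_factory is a hypothetical stand-in for whatever constructs the ISCSIGateway collector.

```python
import os


def probe(marker_dir, collector_factory):
    """Return a collector if marker_dir exists, otherwise None.

    marker_dir: directory whose presence denotes a running iSCSI gateway
                (the real path is not given in the commit message).
    collector_factory: hypothetical callable that builds the collector.
    """
    if os.path.isdir(marker_dir):
        return collector_factory()
    return None
```

The read callback would then poll the returned collector only when it is not None.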
Paul Cuzner [Thu, 3 Aug 2017 04:53:57 +0000 (16:53 +1200)]
osd-information: minor fixes for larger environments
In a 600+ OSD environment the charts were based on averageSeries, which
was taking a long time. This has now been changed so that the comparison
chart only shows current values for a given OSD.
Paul Cuzner [Wed, 2 Aug 2017 02:20:41 +0000 (14:20 +1200)]
at-a-glance: link updates to reflect dashboard name changes
In addition to the link updates:
- scrub state reflects both scrub and deep-scrub status
- the scrub panel no longer turns to "warn" when scrub is active; it's
a natural feature of the cluster and not a problem! However, it will
turn red if scrub or deep-scrub is disabled
- disk util panel changed to a graph and switched to average and 95%ile
(95%ile on a busy cluster just shows too much variance)
- OSDs panel now links to the ceph-osd-information dashboard
Paul Cuzner [Wed, 2 Aug 2017 02:13:00 +0000 (14:13 +1200)]
ceph-osd-information : osd dashboard now provides summary and performance data
Summary row shows:
- osd count
- osd up count
- osds down
- disk size summary (pie chart showing what sizes of disk are in the cluster)
- table of osd to disk size
- OSD encryption summary (how many of my OSDs are encrypted?)
- OSD type status (how many OSDs are filestore vs bluestore)
The panel includes an OSD id, which is used as a filter for the filestore
performance row.
The performance row now shows average OSD performance for a single OSD or
all OSDs. This can then be used for side-by-side comparison with OSD
performance across the cluster at the 95%ile.
Paul Cuzner [Wed, 2 Aug 2017 02:08:24 +0000 (14:08 +1200)]
osd-node-detail : multiple changes to better show OSD level metrics
Changes include:
- support for osd and disk name to provide filters to the graphs
- add osd overview row containing
  - raw capacity panel
  - host.disk -> disk size table
  - host.disk -> osd id
- shortcut links added to other overview type dashboards
Paul Cuzner [Fri, 28 Jul 2017 08:02:45 +0000 (20:02 +1200)]
osd: add % used to each OSD
Percent used was not available within the osd metric tree, only under the
physical disk. With its inclusion under the osd, percent_used can be
referenced directly by osd_id, which makes queries easier.
Paul Cuzner [Fri, 28 Jul 2017 02:28:19 +0000 (14:28 +1200)]
osd: support device-mapper (dmcrypt) osds/journals
dmcrypt osds/journals make use of /dev/mapper devices, so this change
supports the device-mapper naming for the device links. In addition,
all disks (osd/jrnl metrics) have additional metrics, "osd_type" and
"encrypted", to help understand the status of the OSDs within the cluster.
Paul Cuzner [Mon, 24 Jul 2017 02:13:09 +0000 (14:13 +1200)]
alert-status dashboard : Enable default alerts
dashUpdater has been updated to automatically set up a cephmetrics
notifications channel (if it's not already there), and the alert-status
dashboard is loaded, which references the cephmetrics channel.
The ansible templates have been updated to reflect the introduction of the
alert-status dashboard.
Paul Cuzner [Fri, 21 Jul 2017 22:25:17 +0000 (10:25 +1200)]
osd: fix determination of osd type
The presence of the type file was being relied upon across versions.
However, not all versions show this file (10.2.2 did, 10.2.7 didn't!), so
this fix looks for type and uses it if it's there; if not, it looks
for the presence of the journal link to determine if the osd
is filestore. It is assumed that bluestore will 'always' use the type
file.
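The fallback logic described above can be sketched as follows. The function name and the "unknown" return value are assumptions for illustration; only the "type" file and the "journal" link come from the commit message.

```python
import os


def get_osd_type(osd_path):
    """Guess the OSD backend from the contents of its data directory."""
    type_file = os.path.join(osd_path, "type")
    if os.path.isfile(type_file):
        # Preferred source: the type file, when this ceph version writes it.
        with open(type_file) as f:
            return f.read().strip()
    # Fallback: a journal entry implies filestore. lexists() is used so a
    # dangling symlink still counts as present.
    if os.path.lexists(os.path.join(osd_path, "journal")):
        return "filestore"
    # Assumption: report "unknown" rather than guess when neither exists.
    return "unknown"
```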
Paul Cuzner [Fri, 7 Jul 2017 04:01:50 +0000 (16:01 +1200)]
dashUpdater: remove $domain from dashboards, if domain is not configured
For environments that don't use dns, collectd will not provide a FQDN
on the metric name. In these circumstances, the dashboards are empty.
This fix looks for the domain setting, and if it's not supplied the
$domain reference in all queries is removed before the dashboard is loaded
into grafana.
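As an illustration of the rewrite step, the sketch below strips the variable from a dashboard's text before loading. It assumes the reference appears in queries as a ".$domain" path segment; the actual dashUpdater code and query layout may differ, and strip_domain is a hypothetical name.

```python
def strip_domain(dashboard_text, domain=None):
    """Remove the $domain segment from queries when no domain is set."""
    if domain:
        # A domain is configured, so the dashboards can be used unchanged.
        return dashboard_text
    # Assumption: $domain always appears as a dot-prefixed path segment.
    return dashboard_text.replace(".$domain", "")
```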
Paul Cuzner [Thu, 6 Jul 2017 23:31:48 +0000 (11:31 +1200)]
osd: add support for osd related stats, and support journal devices
OSD daemons are now asked for perf data, so latencies within ceph can be
loaded to graphite. In addition the journal device is detected. If it's
not collocated on the osd device, additional disk metrics under a journal
subtree are created within graphite
Paul Cuzner [Thu, 6 Jul 2017 23:29:06 +0000 (11:29 +1200)]
common: changes to the Disk class
Two main things:
1. Disk instances are now initialized here instead of in the caller,
simplifying code in the osd class
2. get_real_dev function added to convert a device name of an OSD to the
name we'll use as a metric. This now provides initial support for nvme
and intelcas based osds
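To show the kind of mapping a function like get_real_dev performs, here is a hedged sketch that reduces a partition name to its parent device. It is not the project's implementation: metric_dev_name is a hypothetical name, the naming rules are assumptions, and intelcas devices are not handled.

```python
import os
import re

# Assumed nvme naming: nvme<ctrl>n<namespace>, optionally followed by p<part>.
_NVME = re.compile(r"^(nvme\d+n\d+)(?:p\d+)?$")


def metric_dev_name(dev):
    """Map a device or partition path to the device name used for metrics."""
    name = os.path.basename(dev)
    match = _NVME.match(name)
    if match:
        # Keep the namespace digits and drop only the partition suffix.
        return match.group(1)
    # Assumed sdX-style naming: trailing digits are the partition number.
    return name.rstrip("0123456789")
```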
Paul Cuzner [Fri, 30 Jun 2017 02:05:33 +0000 (14:05 +1200)]
at-a-glance: multiple fixes to mon/osd/growth and forecast panels
MON/OSD panel queries updated to address the interpolation
problem where floats were shown. OSD panel also now shows
total OSDs
Templating update for the disk_full_threshold (2->80)
Growth/Forecast panel queries updated to account for data coming
from multiple mons
Health Panel updated to show as RED when the cluster is in an
ERROR state
Paul Cuzner [Thu, 29 Jun 2017 04:50:30 +0000 (16:50 +1200)]
at-a-glance: pg status pie chart changes
A degraded state is now shown based on the difference between pg_active
and pg_active_clean. This intermediate metric has been added to the pie
chart, so it shows active+clean, degraded and peering.
Paul Cuzner [Thu, 29 Jun 2017 03:15:58 +0000 (15:15 +1200)]
network-usage: dashboard updated to track enX interface stats
graphite doesn't support blacklisting in queries, so interface names that
we're interested in have to be whitelisted. This fix now tracks enX, ethX and
bondX interface names.
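The whitelist idea can be illustrated outside graphite with shell-style patterns. The pattern strings below are assumptions derived from the enX, ethX and bondX names above, not the dashboard's actual query, and tracked_interfaces is a hypothetical helper.

```python
from fnmatch import fnmatch

# Assumed glob forms of the whitelisted interface name families.
WHITELIST = ("en*", "eth*", "bond*")


def tracked_interfaces(names):
    """Keep only interface names that match one of the whitelist patterns."""
    return [n for n in names if any(fnmatch(n, p) for p in WHITELIST)]
```

Because nothing can be excluded by pattern, any new interface naming scheme has to be added to the whitelist explicitly.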