Paul Cuzner [Wed, 23 Aug 2017 02:56:37 +0000 (14:56 +1200)]
mon: account for null dict from _admin_socket
The _admin_socket method can return an empty dict if the
socket is not there (i.e. ceph-mon is down). By checking for the
empty dict, the collector can remain active while ceph-mon is
stopped and restarted during normal maintenance processes on a
host.
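A minimal sketch of this kind of guard. The names admin_socket_stub and
get_mon_stats, and the layout of the returned dict, are hypothetical
stand-ins and not the collector's actual code; only the idea of treating
an empty dict as "ceph-mon is down" comes from the commit message.

```python
def admin_socket_stub(socket_present):
    """Hypothetical stand-in for _admin_socket: returns {} when the socket is missing."""
    return {"mon": {"num_sessions": 3}} if socket_present else {}


def get_mon_stats(admin_socket_response):
    """Return mon stats, or an empty dict if the admin socket gave nothing back."""
    if not admin_socket_response:
        # ceph-mon is stopped or restarting: skip this cycle instead of raising
        return {}
    return admin_socket_response.get("mon", {})
```

Returning an empty result rather than raising lets the caller keep polling
until the daemon comes back.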
Paul Cuzner [Wed, 23 Aug 2017 02:54:15 +0000 (14:54 +1200)]
iscsi: trigger stats only when iscsi is active
Look for the iscsi dir in sysfs to determine when to
send the iscsi stats. If the iscsi base dir is not there,
the collector will just send the version of gwcli.
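A rough sketch of the check, under assumptions: the commit does not name
the directory, so ISCSI_BASE_DIR below is a guess at where the kernel
target entries live, and collect_iscsi and its dict keys are hypothetical.

```python
import os

# Assumed location of the iscsi target entries; the commit does not name the path.
ISCSI_BASE_DIR = "/sys/kernel/config/target/iscsi"


def collect_iscsi(base_dir=ISCSI_BASE_DIR, gwcli_version="unknown"):
    """Always report the gwcli version; flag gateway stats only when iscsi is active."""
    stats = {"gwcli_version": gwcli_version}
    if os.path.isdir(base_dir):
        # Placeholder for the real per-gateway stats gathering
        stats["iscsi_active"] = True
    return stats
```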
Paul Cuzner [Mon, 21 Aug 2017 04:57:16 +0000 (16:57 +1200)]
dashboard query update to filter out old OSDs
Old OSDs will still exist in the TSDB, and could show as out or down.
The update uses transformNull to pick out osds with null values and
filter them out of the results shown.
Paul Cuzner [Mon, 21 Aug 2017 01:36:01 +0000 (13:36 +1200)]
osd-information: fixes for null entries and time windows used on pie-charts
OSDs that fail result in nulls in the data series, so the queries were updated to
account for these gaps. In addition, a time window of 2 mins is used to
restrict the observations that have to be grabbed from graphite for the pie charts.
Paul Cuzner [Tue, 8 Aug 2017 21:48:55 +0000 (09:48 +1200)]
iscsi.py: fix to defer the import of rtslib_fb
the goal of the parent module cephmetrics is to be generic across the
different ceph roles. By deferring the import of rtslib to the instantiation
of the first (and only!) ISCSIGateway object cephmetrics can import this
iscsi module without a problem regardless of the runtime environment.
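The deferred import pattern might look roughly like this. The module_name
parameter is a hypothetical addition so the sketch can run on hosts
without rtslib_fb; it is not part of the real class.

```python
import importlib


class ISCSIGateway(object):
    """Sketch: the dependency is imported on instantiation, not at module load."""

    def __init__(self, module_name="rtslib_fb"):
        # Deferred import: hosts without rtslib_fb can still import this module,
        # and only fail if they actually try to create a gateway object.
        self.rtslib = importlib.import_module(module_name)
```

Importing the enclosing module never touches rtslib_fb, so an ImportError
can only surface on nodes that actually instantiate the gateway.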
Paul Cuzner [Mon, 7 Aug 2017 23:43:44 +0000 (11:43 +1200)]
iscsi-overview: minor fixes and rename of the iscsi_gateways variable
The client configuration panel was not excluding null entries, so when
rbds were unmasked from clients and not reused, they would still show up
in the table.
In addition, the templating variable iscsi_gateway was renamed to iscsi_gateways,
aligning with the naming of osd_servers and rgw_servers.
Paul Cuzner [Mon, 7 Aug 2017 23:41:50 +0000 (11:41 +1200)]
dashboard.yml : updated to show the variable needed by the iscsi dashboard
the iscsi_gateways templating variable is used to generate the graphs,
so for iscsi based deployments this variable will need to be defined to
ensure the queries work correctly in grafana
Paul Cuzner [Mon, 7 Aug 2017 00:20:10 +0000 (12:20 +1200)]
cephmetrics.py: updated to detect and collect stats for iSCSI gateways
The probe method now looks for the sysfs kernel entries that denote an
iscsi gateway is running on the node. When this dir is found an instance
of the iscsi collector (ISCSIGateway) is created and polled during
every read callback.
Paul Cuzner [Thu, 3 Aug 2017 04:53:57 +0000 (16:53 +1200)]
osd-information: minor fixes for larger environments
In a 600+ OSD environment the charts were based on averageSeries, which
was taking a long time. This has now been changed so that the comparison
chart only shows current values for a given OSD.
Paul Cuzner [Wed, 2 Aug 2017 02:20:41 +0000 (14:20 +1200)]
at-a-glance: link updates to reflect dashboard name changes
In addition to the link updates:
- scrub state reflects both scrub and deep-scrub status
- the scrub panel no longer turns to "warn" when scrub is active - it's
a natural feature of the cluster and not a problem! However, it will
turn red, if scrub or deep-scrub is disabled
- disk util panel changed to a graph and switched to average and 95%ile
(95%ile on a busy cluster just shows too much variance)
- OSDs panel now links to the ceph-osd-information dashboard
Paul Cuzner [Wed, 2 Aug 2017 02:13:00 +0000 (14:13 +1200)]
ceph-osd-information : osd dashboard now provides summary and performance data
Summary row shows:
- osd count
- osd up count
- osd's down
- disk size summary (pie chart showing what sizes of disk are in the cluster)
- table of osd to disk size
- OSD encryption summary (how many of my OSDs are encrypted?)
- OSD type status (how many OSDs are filestore vs bluestore)
Panel includes an OSD id which is used as a filter for the filestore
performance row
The performance row now shows average OSD performance for a single OSD or
all OSDs. This can then be used for side-by-side comparison with OSD
performance across the cluster at the 95%ile.
Paul Cuzner [Wed, 2 Aug 2017 02:08:24 +0000 (14:08 +1200)]
osd-node-detail : multiple changes to better show OSD level metrics
Changes include:
- support for osd and disk name to provide filters to the graphs
- add osd overview row containing
  - raw capacity panel
  - host.disk -> disk size table
  - host.disk -> osd id
- shortcut links added to other overview type dashboards
Paul Cuzner [Fri, 28 Jul 2017 08:02:45 +0000 (20:02 +1200)]
osd: add % used to each OSD
Percent used was not available within the osd metric tree, only under the
physical disk. With its inclusion under the osd, queries can reference
percent_used directly by osd_id.
Paul Cuzner [Fri, 28 Jul 2017 02:28:19 +0000 (14:28 +1200)]
osd: support device-mapper (dmcrypt) osd's/journals
dmcrypt osd/journals make use of /dev/mapper devices, so this change
supports the device mapper naming for the device links. In addition,
all disks (osd/jrnl metrics) have additional metrics; "osd_type" and
"encrypted" to help understand the status of the OSDs within the cluster.
Paul Cuzner [Mon, 24 Jul 2017 02:13:09 +0000 (14:13 +1200)]
alert-status dashboard : Enable default alerts
dashUpdater has been updated to automatically set up a cephmetrics
notifications channel (if it's not already there), and the alert-status
dashboard is loaded, which references the cephmetrics channel.
The ansible templates have been updated to reflect the introduction of the
alert-status dashboard.
Paul Cuzner [Fri, 21 Jul 2017 22:25:17 +0000 (10:25 +1200)]
osd: fix determination of osd type
The presence of the type file was being relied upon across versions.
However, not all versions show this file (10.2.2 did, 10.2.7 didn't!), so
this fix looks for type and uses it if it's there. If not, it will
look for the presence of the journal link to determine if the osd
is filestore. It is assumed that bluestore will 'always' use the type
file.
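The fallback logic can be sketched as below. The function name get_osd_type,
the file names "type" and "journal" relative to the OSD directory, and the
"unknown" return value are assumptions for illustration; the commit only
states the order of the checks.

```python
import os


def get_osd_type(osd_path):
    """Return the objectstore type for the OSD directory at osd_path."""
    type_file = os.path.join(osd_path, "type")
    if os.path.exists(type_file):
        # Preferred source: the type file, when this ceph version writes one
        with open(type_file) as f:
            return f.read().strip()
    if os.path.lexists(os.path.join(osd_path, "journal")):
        # No type file but a journal link is present: treat as filestore
        return "filestore"
    # Assumption: neither marker present, so the type cannot be determined
    return "unknown"
```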
Paul Cuzner [Fri, 7 Jul 2017 04:01:50 +0000 (16:01 +1200)]
dashUpdater: remove $domain from dashboards, if domain is not configured
For environments that don't use dns, collectd will not provide a FQDN
on the metric name. In these circumstances, the dashboards are empty.
This fix looks for the domain setting, and if it's not supplied the
$domain reference in all queries is removed before the dashboard is loaded
into grafana.
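A sketch of the substitution, assuming the reference appears as a
".$domain" component in the metric path of each query. The function name
strip_domain and the sample query are hypothetical; the real dashboards
may embed the variable differently.

```python
def strip_domain(dashboard_json, domain=None):
    """Drop '.$domain' from every query when no domain is configured."""
    if domain:
        # A domain is configured: leave the templating variable in place
        return dashboard_json
    # Assumption: the variable always follows the host component as '.$domain'
    return dashboard_json.replace(".$domain", "")
```

Working on the serialized dashboard text keeps the change to a single pass
over all panels, at the cost of relying on the exact ".$domain" spelling.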
Paul Cuzner [Thu, 6 Jul 2017 23:31:48 +0000 (11:31 +1200)]
osd: add support for osd related stats, and support journal devices
OSD daemons are now asked for perf data, so latencies within ceph can be
loaded to graphite. In addition the journal device is detected. If it's
not collocated on the osd device, additional disk metrics under a journal
subtree are created within graphite
Paul Cuzner [Thu, 6 Jul 2017 23:29:06 +0000 (11:29 +1200)]
common: changes to the Disk class
Two main things;
1. Disk instances are now initialized here, instead of with the caller
devices, simplifying code in the osd class
2. get_real_dev function added to convert a device name of an OSD to the
name we'll use as a metric. This now provides initial support for nvme
and intelcas based OSDs
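As an illustration of the kind of conversion involved, the sketch below maps
a partition name to its parent disk. The commit does not give the real
rules, so real_dev_name is a hypothetical stand-in for get_real_dev, the
regular expressions are assumptions, and intelcas naming is not covered.

```python
import re


def real_dev_name(dev_name):
    """Map a partition name to its parent disk name (illustrative rules only)."""
    name = dev_name.replace("/dev/", "")
    # NVMe partitions look like nvme0n1p1; the disk is nvme0n1
    match = re.match(r"^(nvme\d+n\d+)(p\d+)?$", name)
    if match:
        return match.group(1)
    # Classic SCSI/SATA partitions look like sdb1; the disk is sdb
    match = re.match(r"^([a-z]+)\d*$", name)
    if match:
        return match.group(1)
    # Anything else is passed through unchanged
    return name
```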