Boris Ranto [Fri, 6 Oct 2017 10:00:25 +0000 (12:00 +0200)]
ansible: Fix merge_vars.yml
Currently, we override the variables in merge_vars. However, if we run
the script several times (e.g. a host is a grafana and a collectd node)
then vars[item] is defined but it may not be a mapping. In this case we
redefine the value to an empty string, breaking the actual values of the
variables.
For example, devel_mode or use_epel gets redefined to an empty string
this way, making it false for a grafana server node that is also a
collectd node.
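The failure mode above can be sketched in Python. This is only an illustration of the intended behaviour, not the Ansible logic in merge_vars.yml: the function name, the argument names and the rule that existing mapping entries win are all assumptions.

```python
def merge_vars(current, defaults):
    """Merge defaults into current without clobbering values set earlier.

    Hypothetical helper: illustrates the behaviour described in the commit,
    not the actual tasks in merge_vars.yml.
    """
    merged = dict(current)
    for key, default in defaults.items():
        if key not in merged:
            # First run for this key: take the default as-is.
            merged[key] = default
        elif isinstance(merged[key], dict) and isinstance(default, dict):
            # Both sides are mappings: combine them, existing entries win
            # (an assumption made for this sketch).
            merged[key] = {**default, **merged[key]}
        # Otherwise the key already holds a non-mapping value (for example
        # a boolean such as devel_mode) and is left untouched rather than
        # being reset to an empty string.
    return merged


# A host that is both a grafana and a collectd node runs the merge twice.
host_vars = merge_vars({}, {"devel_mode": True, "use_epel": True})
host_vars = merge_vars(host_vars, {"devel_mode": False, "use_epel": False})
```

After the second pass, the values set by the first pass are still in place instead of being replaced by a falsy value.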
Zack Cerza [Thu, 5 Oct 2017 19:43:57 +0000 (13:43 -0600)]
Use --fake-initial for migrations if necessary
Migrations won't work correctly if the db already exists. This manifests
in an error like:
django.db.utils.OperationalError: table "django_content_type" already exists
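A minimal sketch of the "if necessary" decision, assuming the database is a single file whose presence signals that the tables already exist. The helper name and the file-existence check are hypothetical; only the migrate command and its --fake-initial option come from Django.

```python
import os


def migrate_args(db_path):
    """Build the manage.py arguments for running migrations.

    Hypothetical helper: db_path and the existence check are assumptions
    made for this sketch.
    """
    args = ["migrate"]
    if os.path.exists(db_path):
        # The db already exists, so let Django mark initial migrations as
        # applied when their tables are already present.
        args.append("--fake-initial")
    return args
```

The resulting list could then be appended to a manage.py invocation.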
Zack Cerza [Wed, 4 Oct 2017 21:23:41 +0000 (15:23 -0600)]
Add a note about waiting to collect data
Users might initially be confused that immediately after deployment, the
dashboard looks broken. This is because it doesn't yet have the data it
needs to function.
Boris Ranto [Mon, 2 Oct 2017 09:29:01 +0000 (11:29 +0200)]
ansible: Do not enable rhsm repos
We are shipping as a regular product and as such, we cannot enable
additional repos via rhsm. The customers might not even have these repos
installed (especially the storage console repo might not be available to
them).
If we need any of the packages from these repos, we need to cross-ship
them in our product (as we already do downstream).
Paul Cuzner [Tue, 12 Sep 2017 23:57:41 +0000 (11:57 +1200)]
iscsi-overview: multiple panel fixes
Values were shown incorrectly in environments where the iscsi config had
been dropped and recreated. This update addresses issues in the following
panels: path summary, unused LUNs, defined capacity. In addition, the
client charts only show entries for clients with i/o or load > 0.
Paul Cuzner [Wed, 23 Aug 2017 21:14:44 +0000 (09:14 +1200)]
mon: simplify the admin_socket read logic
The initial commit placed logic in each area that called the admin
socket. This patch separates the admin socket call out to a separate
method, so it gets checked in one place.
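The refactoring can be sketched as follows. The class layout, the injected run_command callable and the two example commands are assumptions for illustration; the actual collector code may differ.

```python
import os


class Mon:
    """Sketch of a collector that funnels all socket access through one method."""

    def __init__(self, socket_path, run_command):
        self.socket_path = socket_path
        # Hypothetical callable that sends a command to the admin socket
        # and returns the decoded response as a dict.
        self._run_command = run_command

    def _admin_socket(self, cmd):
        # The only place where the socket is checked and queried.
        if not os.path.exists(self.socket_path):
            return {}
        return self._run_command(self.socket_path, cmd)

    def get_perf(self):
        return self._admin_socket("perf dump")

    def get_status(self):
        return self._admin_socket("mon_status")
```

Callers no longer repeat the existence check; they only handle whatever _admin_socket returns.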
Paul Cuzner [Wed, 23 Aug 2017 02:58:34 +0000 (14:58 +1200)]
rgw: look for the admin_socket on each call
The admin_socket name for rgw is not fixed, unlike mon/osds. Therefore
to account for svc restarts and name changes the socket name is
determined at each get_stats cycle. If the socket isn't there, the
collector just passes back the version of radosgw to the caller and
will send stats again once a socket is detected on the host.
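A sketch of the per-cycle lookup, under stated assumptions: the glob pattern, the function names and the read_socket callable are hypothetical and only illustrate re-resolving the socket name on every call.

```python
import glob
import os


def find_rgw_socket(run_dir, pattern="*rgw*.asok"):
    """Return the first matching socket path, or None if there is none.

    The pattern is an assumption for this sketch, not the collector's own.
    """
    matches = sorted(glob.glob(os.path.join(run_dir, pattern)))
    return matches[0] if matches else None


def get_stats(run_dir, version, read_socket):
    """Collect rgw stats for one cycle; always report the radosgw version."""
    stats = {"version": version}
    # Looked up on every cycle, so restarts and renames are picked up.
    socket_path = find_rgw_socket(run_dir)
    if socket_path is not None:
        stats.update(read_socket(socket_path))
    return stats
```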
Paul Cuzner [Wed, 23 Aug 2017 02:56:37 +0000 (14:56 +1200)]
mon: account for null dict from _admin_socket
The _admin_socket method could return an empty dict if the
socket is not there (i.e. ceph-mon is down). By checking for the
empty dict, the collector can remain active while ceph-mon is
stopped and restarted during normal maintenance processes on a
host.
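On the caller's side, the check might look like the sketch below. The class name, the injected admin_socket callable and the skipped_cycles counter are hypothetical, added only to make the behaviour visible.

```python
class MonCollector:
    """Sketch of a collector that survives a missing admin socket."""

    def __init__(self, admin_socket):
        # Hypothetical callable returning a dict, empty when the socket
        # is not there (ceph-mon is down).
        self._admin_socket = admin_socket
        self.skipped_cycles = 0

    def get_stats(self):
        raw = self._admin_socket()
        if not raw:
            # Nothing to report this cycle; stay alive and try again
            # on the next one instead of raising.
            self.skipped_cycles += 1
            return {}
        return dict(raw)
```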
Paul Cuzner [Wed, 23 Aug 2017 02:54:15 +0000 (14:54 +1200)]
iscsi: trigger stats only when iscsi is active
Look for the iscsi dir in sysfs to determine when to
send the iscsi stats. If the iscsi base dir is not there,
the collector will just send the version of gwcli.
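A minimal sketch of this gate, assuming the directory path is supplied by the caller. The function name, the gwcli_version key and the collect callable are hypothetical.

```python
import os


def get_iscsi_stats(iscsi_dir, gwcli_version, collect):
    """Return iscsi stats only when the iscsi base dir exists.

    iscsi_dir is whatever sysfs path marks an active gateway; its value
    is left to the caller in this sketch.
    """
    stats = {"gwcli_version": gwcli_version}
    if os.path.isdir(iscsi_dir):
        # iscsi is active on this host, so gather the full stats.
        stats.update(collect())
    return stats
```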
Paul Cuzner [Mon, 21 Aug 2017 04:57:16 +0000 (16:57 +1200)]
dashboard query update to filter out old OSDs
Old OSDs will still exist in the TSDB, and could show as out or down.
The update uses transformNull to pick out osds with null values and
filter them out of the results shown.
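The effect of the query change can be emulated in Python. This is not the Graphite query itself: the marker value, the function names and the rule that an OSD counts as old when its latest datapoint is null are assumptions for illustration.

```python
def transform_null(values, default=-1):
    """Replace null datapoints with a default, in the spirit of transformNull."""
    return [default if value is None else value for value in values]


def drop_old_osds(series_by_osd, marker=-1):
    """Keep only the OSDs whose most recent datapoint is not null."""
    kept = {}
    for osd, values in series_by_osd.items():
        if not values:
            continue
        if transform_null(values, marker)[-1] != marker:
            kept[osd] = values
    return kept
```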
Paul Cuzner [Mon, 21 Aug 2017 01:36:01 +0000 (13:36 +1200)]
osd-information: fixes for null entries and time windows used on pie-charts
OSDs that fail result in nulls in the data series, so the queries were
updated to account for these gaps. In addition, a time window of 2 mins is
used to restrict the obs that have to be grabbed from graphite for the pie charts.
Paul Cuzner [Tue, 8 Aug 2017 21:48:55 +0000 (09:48 +1200)]
iscsi.py: fix to defer the import of rtslib_fb
The goal of the parent module cephmetrics is to be generic across the
different ceph roles. By deferring the import of rtslib to the instantiation
of the first (and only!) ISCSIGateway object, cephmetrics can import this
iscsi module without a problem regardless of the runtime environment.
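The deferred import pattern can be sketched as below. The module_name parameter is an addition for this sketch so that the behaviour can be exercised without rtslib_fb installed; a plain import statement inside __init__ has the same effect.

```python
import importlib


class ISCSIGateway:
    """Sketch of a collector that imports its dependency on first use."""

    def __init__(self, module_name="rtslib_fb"):
        # The import happens at instantiation, not when this module is
        # imported, so hosts without the library can still load the module.
        self._rtslib = importlib.import_module(module_name)
```

Defining the class never touches the dependency; only creating an instance does, and that fails with ImportError on hosts where the library is missing.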
Paul Cuzner [Mon, 7 Aug 2017 23:43:44 +0000 (11:43 +1200)]
iscsi-overview: minor fixes and rename of the iscsi_gateways variable
The client configuration panel was not excluding null entries, so when
rbds were unmasked from clients and not reused, they would still show up
in the table.
In addition, the templating variable iscsi_gateway was renamed to iscsi_gateways,
aligning with the naming of osd_servers and rgw_servers.
Paul Cuzner [Mon, 7 Aug 2017 23:41:50 +0000 (11:41 +1200)]
dashboard.yml : updated to show the variable needed by the iscsi dashboard
The iscsi_gateways templating variable is used to generate the graphs,
so for iscsi based deployments this variable will need to be defined to
ensure the queries work correctly in grafana.
Paul Cuzner [Mon, 7 Aug 2017 00:20:10 +0000 (12:20 +1200)]
cephmetrics.py: updated to detect and collect stats for iSCSI gateways
The probe method now looks for the sysfs kernel entries that denote an
iscsi gateway is running on the node. When this dir is found an instance
of the iscsi collector (ISCSIGateway) is created and polled during
every read callback.
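The probe and poll flow can be sketched as follows. The stand-in ISCSIGateway class, the Collector class and its attribute names are hypothetical; only the sequence of probing for the directory, creating the collector and polling it in the read callback comes from the commit.

```python
import os


class ISCSIGateway:
    """Stand-in for the real iscsi collector."""

    def get_stats(self):
        return {"role": "iscsi"}


class Collector:
    """Sketch of the parent object that probes for roles and polls them."""

    def __init__(self, iscsi_dir):
        # Path of the sysfs entry that denotes a running gateway; supplied
        # by the caller in this sketch.
        self.iscsi_dir = iscsi_dir
        self.collectors = []

    def probe(self):
        if os.path.isdir(self.iscsi_dir):
            self.collectors.append(ISCSIGateway())

    def read_callback(self):
        # Every registered collector is polled on each read.
        return [collector.get_stats() for collector in self.collectors]
```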
Paul Cuzner [Thu, 3 Aug 2017 04:53:57 +0000 (16:53 +1200)]
osd-information: minor fixes for larger environments
In a 600+ OSD environment the charts were based on averageSeries, which
was taking a long time. This has now been changed, so the comparison
chart only shows current values for a given OSD.
Paul Cuzner [Wed, 2 Aug 2017 02:20:41 +0000 (14:20 +1200)]
at-a-glance: link updates to reflect dashboard name changes
In addition to the link updates:
- scrub state reflects both scrub and deep-scrub status
- the scrub panel no longer turns to "warn" when scrub is active - it's
a natural feature of the cluster and not a problem! However, it will
turn red if scrub or deep-scrub is disabled
- disk util panel changed to a graph and switched to average and 95%ile
(95%ile on a busy cluster just shows too much variance)
- OSDs panel now links to the ceph-osd-information dashboard
Paul Cuzner [Wed, 2 Aug 2017 02:13:00 +0000 (14:13 +1200)]
ceph-osd-information : osd dashboard now provides summary and performance data
Summary row shows:
- osd count
- osd up count
- osds down
- disk size summary (pie chart showing what sizes of disk are in the cluster)
- table of osd to disk size
- OSD encryption summary (how many of my OSDs are encrypted?)
- OSD type status (how many OSDs are filestore vs bluestore)
Panel includes an OSD id which is used as a filter for the filestore
performance row
The performance row now shows average OSD performance for a single OSD or
all OSDs. This can then be used for side-by-side comparison with OSD
performance across the cluster at the 95%ile.