mgr/prometheus: improve Prometheus module cache

author Patrick Seidensal <pseidensal@suse.com>

Tue, 9 Jun 2020 10:35:00 +0000 (12:35 +0200)

committer Laura Paduano <lpaduano@suse.com>

Fri, 3 Jul 2020 12:18:59 +0000 (14:18 +0200)
author Patrick Seidensal <pseidensal@suse.com>
Tue, 9 Jun 2020 10:35:00 +0000 (12:35 +0200)
committer Laura Paduano <lpaduano@suse.com>
Fri, 3 Jul 2020 12:18:59 +0000 (14:18 +0200)
diff --git a/doc/mgr/prometheus.rst b/doc/mgr/prometheus.rst

index 8dbc44f5292ac858bcdd293cacadba26a55cf172..87296be391471de29032959bfabe51b88c9abdbb 100644 (file)
--- a/doc/mgr/prometheus.rst
+++ b/doc/mgr/prometheus.rst
@@ -25,11 +25,65 @@ The *prometheus* module is enabled with::
  Configuration
  -------------
  
-By default the module will accept HTTP requests on port ``9283`` on all
-IPv4 and IPv6 addresses on the host.  The port and listen address are both
+.. note::
+
+    The Prometheus manager module needs to be restarted for configuration changes to be applied.
+
+By default the module will accept HTTP requests on port ``9283`` on all IPv4
+and IPv6 addresses on the host.  The port and listen address are both
  configurable with ``ceph config-key set``, with keys
-``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.
-This port is registered with Prometheus's `registry <https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
+``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.  This port
+is registered with Prometheus's `registry
+<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
+
+::
+
+    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
+    ceph config set mgr mgr/prometheus/server_port 9283
+
+.. warning::
+
+    The ``scrape_interval`` of this module should always be set to match
+    Prometheus' scrape interval to work properly and not cause any issues.
+    
+The Prometheus manager module is, by default, configured with a scrape interval
+of 15 seconds.  The scrape interval in the module is used for caching purposes
+and to determine when a cache is stale.
+
+It is not recommended to use a scrape interval below 10 seconds.  It is
+recommended to use 15 seconds as scrape interval, though, in some cases it
+might be useful to increase the scrape interval.
+
+To set a different scrape interval in the Prometheus module, set
+``scrape_interval`` to the desired value::
+
+    ceph config set mgr mgr/prometheus/scrape_interval 20
+
+On large clusters (>1000 OSDs), the time to fetch the metrics may become
+significant.  Without the cache, the Prometheus manager module could,
+especially in conjunction with multiple Prometheus instances, overload the
+manager and lead to unresponsive or crashing Ceph manager instances.  Hence,
+the cache is enabled by default and cannot be disabled.  This means that there
+is a possibility that the cache becomes stale.  The cache is considered stale
+when the time to fetch the metrics from Ceph exceeds the configured
+``scrape_interval``.
+
+If that is the case, **a warning will be logged** and the module will either
+
+* respond with a 503 HTTP status code (service unavailable) or,
+* it will return the content of the cache, even though it might be stale.
+
+This behavior can be configured. By default, it will return a 503 HTTP status
+code (service unavailable). You can set other options using the ``ceph config
+set`` commands.
+
+To tell the module to respond with possibly stale data, set it to ``return``::
+
+    ceph config set mgr mgr/prometheus/stale_cache_strategy return
+
+To tell the module to respond with "service unavailable", set it to ``fail``::
+
+    ceph config set mgr mgr/prometheus/stale_cache_strategy fail
  
  .. _prometheus-rbd-io-statistics:
  
@@ -62,7 +116,7 @@ Statistic names and labels
  ==========================
  
  The names of the stats are exactly as Ceph names them, with
-illegal characters ``.``, ``-`` and ``::`` translated to ``_``, 
+illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
  and ``ceph_`` prefixed to all names.
  
  
@@ -75,7 +129,7 @@ rocksdb stats.
  
  
  The *cluster* statistics (i.e. those global to the Ceph cluster)
-have labels appropriate to what they report on.  For example, 
+have labels appropriate to what they report on.  For example,
  metrics relating to pools have a ``pool_id`` label.
  
  
@@ -109,7 +163,7 @@ Correlating drive statistics with node_exporter
  The prometheus output from Ceph is designed to be used in conjunction
  with the generic host monitoring from the Prometheus node_exporter.
  
-To enable correlation of Ceph OSD statistics with node_exporter's 
+To enable correlation of Ceph OSD statistics with node_exporter's
  drive statistics, special series are output like this:
  
  ::
diff --git a/src/pybind/mgr/prometheus/module.py b/src/pybind/mgr/prometheus/module.py

index 3a11a362c9abeb1a3a374e669d898d47422dd7d8..54c1c265a6f6547ec37125f94f770c6cb735b0ee 100644 (file)
--- a/src/pybind/mgr/prometheus/module.py
+++ b/src/pybind/mgr/prometheus/module.py
@@ -12,6 +12,11 @@ from mgr_module import MgrModule, MgrStandbyModule, CommandResult, PG_STATES
  from mgr_util import get_default_addr
  from rbd import RBD
  
+try:
+    from typing import Optional
+except:
+    pass
+
  # Defaults for the Prometheus HTTP server.  Can also set in config-key
  # see https://github.com/prometheus/prometheus/wiki/Default-port-allocations
  # for Prometheus exporter port registry
@@ -42,12 +47,7 @@ os._exit = os_exit_noop
  # to access things in class Module from subclass Root.  Because
  # it's a dict, the writer doesn't need to declare 'global' for access
  
-_global_instance = {'plugin': None}
-
-
-def global_instance():
-    assert _global_instance['plugin'] is not None
-    return _global_instance['plugin']
+_global_instance = None  # type: Optional[Module]
  
  
  def health_status_to_number(status):
@@ -176,6 +176,44 @@ class Metric(object):
          return expfmt
  
  
+class MetricCollectionThread(threading.Thread):
+    def __init__(self):
+        super(MetricCollectionThread, self).__init__(target=self.collect)
+
+    @staticmethod
+    def collect():
+        inst = _global_instance
+        inst.log.info('starting metric collection thread')
+        while True:
+            if inst.have_mon_connection():
+                start_time = time.time()
+                data = inst.collect()
+                duration = time.time() - start_time
+                
+                sleep_time = inst.scrape_interval - duration
+                if sleep_time < 0:
+                    inst.log.warning(
+                        'Collecting data took more time than configured scrape interval. '
+                        'This possibly results in stale data. Please check the '
+                        '`stale_cache_strategy` configuration option. '
+                        'Collecting data took {:.2f} seconds but scrape interval is configured '
+                        'to be {:.0f} seconds.'.format(
+                            duration,
+                            inst.scrape_interval,
+                        )
+                    )
+                    sleep_time = 0
+
+                with inst.collect_lock:
+                    inst.collect_cache = data
+                    inst.collect_time = duration
+
+                time.sleep(sleep_time)
+            else:
+                inst.log.error('No MON connection')
+                time.sleep(inst.scrape_interval)
+
+
  class Module(MgrModule):
      COMMANDS = [
          {
@@ -189,17 +227,22 @@ class Module(MgrModule):
          {'name': 'server_addr'},
          {'name': 'server_port'},
          {'name': 'scrape_interval'},
+        {'name': 'stale_cache_strategy'},
          {'name': 'rbd_stats_pools'},
          {'name': 'rbd_stats_pools_refresh_interval', 'type': 'int', 'default': 300},
      ]
  
+    STALE_CACHE_FAIL = 'fail'
+    STALE_CACHE_RETURN = 'return'
+
      def __init__(self, *args, **kwargs):
          super(Module, self).__init__(*args, **kwargs)
          self.metrics = self._setup_static_metrics()
          self.shutdown_event = threading.Event()
-        self.collect_lock = threading.RLock()
+        self.collect_lock = threading.Lock()
          self.collect_time = 0
-        self.collect_timeout = 5.0
+        self.scrape_interval = 15.0
+        self.stale_cache_strategy = self.STALE_CACHE_FAIL
          self.collect_cache = None
          self.rbd_stats = {
              'pools': {},
@@ -219,7 +262,9 @@ class Module(MgrModule):
                                   'desc': 'RBD image reads latency (msec)'},
              },
          }
-        _global_instance['plugin'] = self
+        global _global_instance
+        _global_instance = self
+        MetricCollectionThread().start()
  
      def _setup_static_metrics(self):
          metrics = {}
@@ -986,7 +1031,7 @@ class Module(MgrModule):
                  # TODO use get_config_prefix or get_config here once
                  # https://github.com/ceph/ceph/pull/20458 is merged
                  result = CommandResult("")
-                global_instance().send_command(
+                _global_instance.send_command(
                      result, "mon", '',
                      json.dumps({
                          "prefix": "config-key get",
@@ -995,7 +1040,7 @@ class Module(MgrModule):
                      "")
                  r, outb, outs = result.wait()
                  if r != 0:
-                    global_instance().log.error("Failed to retrieve port for mgr {}: {}".format(id_, outs))
+                    _global_instance.log.error("Failed to retrieve port for mgr {}: {}".format(id_, outs))
                      targets.append('{}:{}'.format(hostname, DEFAULT_PORT))
                  else:
                      port = json.loads(outb)
@@ -1042,33 +1087,54 @@ class Module(MgrModule):
  
              @cherrypy.expose
              def metrics(self):
-                instance = global_instance()
                  # Lock the function execution
-                try:
-                    instance.collect_lock.acquire()
-                    return self._metrics(instance)
-                finally:
-                    instance.collect_lock.release()
+                with _global_instance.collect_lock:
+                    return self._metrics(_global_instance)
  
              @staticmethod
              def _metrics(instance):
-                # Return cached data if available and collected before the cache times out
-                if instance.collect_cache and time.time() - instance.collect_time < instance.collect_timeout:
-                    cherrypy.response.headers['Content-Type'] = 'text/plain'
-                    return instance.collect_cache
+                # Return cached data if available
+                if not instance.collect_cache:
+                    raise cherrypy.HTTPError(503, 'No cached data available yet')
  
-                if instance.have_mon_connection():
-                    instance.collect_cache = None
-                    instance.collect_time = time.time()
-                    instance.collect_cache = instance.collect()
+                def respond():
                      cherrypy.response.headers['Content-Type'] = 'text/plain'
                      return instance.collect_cache
-                else:
-                    raise cherrypy.HTTPError(503, 'No MON connection')
+
+                if instance.collect_time < instance.scrape_interval:
+                    # Respond if cache isn't stale
+                    return respond()
+
+                if instance.stale_cache_strategy == instance.STALE_CACHE_RETURN:
+                    # Respond even if cache is stale
+                    instance.log.info(
+                        'Gathering data took {:.2f} seconds, metrics are stale for {:.2f} seconds, '
+                        'returning metrics from stale cache.'.format(
+                            instance.collect_time,
+                            instance.collect_time - instance.scrape_interval
+                        )
+                    )
+                    return respond()
+
+                if instance.stale_cache_strategy == instance.STALE_CACHE_FAIL:
+                    # Fail if cache is stale
+                    msg = (
+                        'Gathering data took {:.2f} seconds, metrics are stale for {:.2f} seconds, '
+                        'returning "service unavailable".'.format(
+                            instance.collect_time,
+                            instance.collect_time - instance.scrape_interval,
+                        )
+                    )
+                    instance.log.error(msg)
+                    raise cherrypy.HTTPError(503, msg)
  
          # Make the cache timeout for collecting configurable
-        self.collect_timeout = float(self.get_localized_module_option(
-            'scrape_interval', 5.0))
+        self.scrape_interval = float(self.get_localized_module_option('scrape_interval', 15.0))
+
+        self.stale_cache_strategy = self.get_localized_module_option('stale_cache_strategy', 'log')
+        if self.stale_cache_strategy not in [self.STALE_CACHE_FAIL,
+                                             self.STALE_CACHE_RETURN]:
+            self.stale_cache_strategy = self.STALE_CACHE_FAIL
  
          server_addr = self.get_localized_module_option(
              'server_addr', get_default_addr())
author	Patrick Seidensal <pseidensal@suse.com>
	Tue, 9 Jun 2020 10:35:00 +0000 (12:35 +0200)
committer	Laura Paduano <lpaduano@suse.com>
	Fri, 3 Jul 2020 12:18:59 +0000 (14:18 +0200)
doc/mgr/prometheus.rst		patch \| blob \| history
src/pybind/mgr/prometheus/module.py		patch \| blob \| history