Configuration
-------------
-By default the module will accept HTTP requests on port ``9283`` on all
-IPv4 and IPv6 addresses on the host. The port and listen address are both
+.. note::
+
+ The Prometheus manager module needs to be restarted for configuration changes to be applied.
+
+By default the module will accept HTTP requests on port ``9283`` on all IPv4
+and IPv6 addresses on the host. The port and listen address are both
configurable with ``ceph config-key set``, with keys
-``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.
-This port is registered with Prometheus's `registry <https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
+``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``. This port
+is registered with Prometheus's `registry
+<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
+
+::
+
+ ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
+ ceph config set mgr mgr/prometheus/server_port 9283
+
+.. warning::
+
+ The ``scrape_interval`` of this module should always be set to match the
+ scrape interval configured in Prometheus; otherwise Prometheus may
+ repeatedly receive the same cached data instead of fresh metrics.
+
+The Prometheus manager module is, by default, configured with a scrape interval
+of 15 seconds. The scrape interval in the module is used for caching purposes
+and to determine when a cache is stale.
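+
+For reference, the Prometheus side of the interval is configured in
+``prometheus.yml``. A minimal scrape job matching the default 15 second
+interval might look like the following sketch, where ``ceph-mgr-host:9283``
+is only a placeholder for the active manager's address::
+
+ global:
+   scrape_interval: 15s
+
+ scrape_configs:
+   - job_name: 'ceph'
+     static_configs:
+       - targets: ['ceph-mgr-host:9283']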
+
+A scrape interval below 10 seconds is not recommended. The recommended scrape
+interval is 15 seconds, though in some cases it might be useful to increase
+it.
+
+To set a different scrape interval in the Prometheus module, set
+``scrape_interval`` to the desired value::
+
+ ceph config set mgr mgr/prometheus/scrape_interval 20
+
+On large clusters (>1000 OSDs), the time to fetch the metrics may become
+significant. Without the cache, the Prometheus manager module could,
+especially in conjunction with multiple Prometheus instances, overload the
+manager and lead to unresponsive or crashing Ceph manager instances. Hence,
+the cache is enabled by default and cannot be disabled. As a result, the
+cache can become stale. The cache is considered stale
+when the time to fetch the metrics from Ceph exceeds the configured
+``scrape_interval``.
+
+If that is the case, **a warning will be logged** and the module will either
+
+* respond with a 503 HTTP status code (service unavailable), or
+* return the content of the cache, even though it might be stale.
+
+This behavior is configurable. By default, the module responds with a 503
+HTTP status code (service unavailable). The strategy can be changed with the
+``ceph config set`` command.
+
+To tell the module to respond with possibly stale data, set it to ``return``::
+
+ ceph config set mgr mgr/prometheus/stale_cache_strategy return
+
+To tell the module to respond with "service unavailable", set it to ``fail``::
+
+ ceph config set mgr mgr/prometheus/stale_cache_strategy fail
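+
+To check which strategy is currently configured (assuming a Ceph release in
+which ``ceph config get`` can read manager module options)::
+
+ ceph config get mgr mgr/prometheus/stale_cache_strategy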
.. _prometheus-rbd-io-statistics:
==========================
The names of the stats are exactly as Ceph names them, with
-illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
+illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
and ``ceph_`` prefixed to all names.
The *cluster* statistics (i.e. those global to the Ceph cluster)
-have labels appropriate to what they report on. For example,
+have labels appropriate to what they report on. For example,
metrics relating to pools have a ``pool_id`` label.
The prometheus output from Ceph is designed to be used in conjunction
with the generic host monitoring from the Prometheus node_exporter.
-To enable correlation of Ceph OSD statistics with node_exporter's
+To enable correlation of Ceph OSD statistics with node_exporter's
drive statistics, special series are output like this:
::
from mgr_util import get_default_addr
from rbd import RBD
+try:
+ from typing import Optional
+except ImportError:
+ pass
+
# Defaults for the Prometheus HTTP server. Can also set in config-key
# see https://github.com/prometheus/prometheus/wiki/Default-port-allocations
# for Prometheus exporter port registry
# to access things in class Module from subclass Root. Because
# it's a dict, the writer doesn't need to declare 'global' for access
-_global_instance = {'plugin': None}
-
-
-def global_instance():
- assert _global_instance['plugin'] is not None
- return _global_instance['plugin']
+_global_instance = None # type: Optional[Module]
def health_status_to_number(status):
return expfmt
+class MetricCollectionThread(threading.Thread):
+ def __init__(self):
+ super(MetricCollectionThread, self).__init__(target=self.collect)
+
+ @staticmethod
+ def collect():
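+ # Background loop: collect metrics from the manager, store them in the
+ # module-level cache, then wait until the next scrape interval.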
+ inst = _global_instance
+ inst.log.info('starting metric collection thread')
+ while True:
+ if inst.have_mon_connection():
+ start_time = time.time()
+ data = inst.collect()
+ duration = time.time() - start_time
+
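+ # Sleep only for whatever is left of the scrape interval; if collection
+ # took longer than the interval, warn and start the next run immediately.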
+ sleep_time = inst.scrape_interval - duration
+ if sleep_time < 0:
+ inst.log.warning(
+ 'Collecting data took more time than configured scrape interval. '
+ 'This possibly results in stale data. Please check the '
+ '`stale_cache_strategy` configuration option. '
+ 'Collecting data took {:.2f} seconds but scrape interval is configured '
+ 'to be {:.0f} seconds.'.format(
+ duration,
+ inst.scrape_interval,
+ )
+ )
+ sleep_time = 0
+
+ with inst.collect_lock:
+ inst.collect_cache = data
+ inst.collect_time = duration
+
+ time.sleep(sleep_time)
+ else:
+ inst.log.error('No MON connection')
+ time.sleep(inst.scrape_interval)
+
+
class Module(MgrModule):
COMMANDS = [
{
{'name': 'server_addr'},
{'name': 'server_port'},
{'name': 'scrape_interval'},
+ {'name': 'stale_cache_strategy'},
{'name': 'rbd_stats_pools'},
{'name': 'rbd_stats_pools_refresh_interval', 'type': 'int', 'default': 300},
]
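+ # Accepted values for the 'stale_cache_strategy' module option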
+ STALE_CACHE_FAIL = 'fail'
+ STALE_CACHE_RETURN = 'return'
+
def __init__(self, *args, **kwargs):
super(Module, self).__init__(*args, **kwargs)
self.metrics = self._setup_static_metrics()
self.shutdown_event = threading.Event()
- self.collect_lock = threading.RLock()
+ self.collect_lock = threading.Lock()
self.collect_time = 0
- self.collect_timeout = 5.0
+ self.scrape_interval = 15.0
+ self.stale_cache_strategy = self.STALE_CACHE_FAIL
self.collect_cache = None
self.rbd_stats = {
'pools': {},
'desc': 'RBD image reads latency (msec)'},
},
}
- _global_instance['plugin'] = self
+ global _global_instance
+ _global_instance = self
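+ # Start the background thread that keeps the metrics cache up to date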
+ MetricCollectionThread().start()
def _setup_static_metrics(self):
metrics = {}
# TODO use get_config_prefix or get_config here once
# https://github.com/ceph/ceph/pull/20458 is merged
result = CommandResult("")
- global_instance().send_command(
+ _global_instance.send_command(
result, "mon", '',
json.dumps({
"prefix": "config-key get",
"")
r, outb, outs = result.wait()
if r != 0:
- global_instance().log.error("Failed to retrieve port for mgr {}: {}".format(id_, outs))
+ _global_instance.log.error("Failed to retrieve port for mgr {}: {}".format(id_, outs))
targets.append('{}:{}'.format(hostname, DEFAULT_PORT))
else:
port = json.loads(outb)
@cherrypy.expose
def metrics(self):
- instance = global_instance()
# Lock the function execution
- try:
- instance.collect_lock.acquire()
- return self._metrics(instance)
- finally:
- instance.collect_lock.release()
+ with _global_instance.collect_lock:
+ return self._metrics(_global_instance)
@staticmethod
def _metrics(instance):
- # Return cached data if available and collected before the cache times out
- if instance.collect_cache and time.time() - instance.collect_time < instance.collect_timeout:
- cherrypy.response.headers['Content-Type'] = 'text/plain'
- return instance.collect_cache
+ # Return cached data if available
+ if not instance.collect_cache:
+ raise cherrypy.HTTPError(503, 'No cached data available yet')
- if instance.have_mon_connection():
- instance.collect_cache = None
- instance.collect_time = time.time()
- instance.collect_cache = instance.collect()
+ def respond():
cherrypy.response.headers['Content-Type'] = 'text/plain'
return instance.collect_cache
- else:
- raise cherrypy.HTTPError(503, 'No MON connection')
+
+ if instance.collect_time < instance.scrape_interval:
+ # Respond if cache isn't stale
+ return respond()
+
+ if instance.stale_cache_strategy == instance.STALE_CACHE_RETURN:
+ # Respond even if cache is stale
+ instance.log.info(
+ 'Gathering data took {:.2f} seconds, metrics are stale for {:.2f} seconds, '
+ 'returning metrics from stale cache.'.format(
+ instance.collect_time,
+ instance.collect_time - instance.scrape_interval
+ )
+ )
+ return respond()
+
+ if instance.stale_cache_strategy == instance.STALE_CACHE_FAIL:
+ # Fail if cache is stale
+ msg = (
+ 'Gathering data took {:.2f} seconds, metrics are stale for {:.2f} seconds, '
+ 'returning "service unavailable".'.format(
+ instance.collect_time,
+ instance.collect_time - instance.scrape_interval,
+ )
+ )
+ instance.log.error(msg)
+ raise cherrypy.HTTPError(503, msg)
# Make the cache timeout for collecting configurable
- self.collect_timeout = float(self.get_localized_module_option(
- 'scrape_interval', 5.0))
+ self.scrape_interval = float(self.get_localized_module_option('scrape_interval', 15.0))
+
+ self.stale_cache_strategy = self.get_localized_module_option('stale_cache_strategy', 'log')
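+ # Any other value (including the 'log' default) falls back to failing
+ # with a 503 when the cache is stale.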
+ if self.stale_cache_strategy not in [self.STALE_CACHE_FAIL,
+ self.STALE_CACHE_RETURN]:
+ self.stale_cache_strategy = self.STALE_CACHE_FAIL
server_addr = self.get_localized_module_option(
'server_addr', get_default_addr())