A design limitation of prometheus-client's multiprocessing mode is that
each process creates files to store its own metrics; the exporter then
has to read every file, even those written by processes that have since
died. As files accumulate, scrape latency grows, reaching multiple
seconds once the file count gets into the thousands; eventually
Prometheus's fetches fail, leaving gaps in our data.

We can work around this by restarting the exporter at a regular
interval; 24h seems like a fine place to start.
Signed-off-by: Zack Cerza <zack@redhat.com>
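The failure mode can be sketched with a stand-in for the collector's
directory scan. This is illustrative only, not teuthology or
prometheus-client code: the real collection lives in
`prometheus_client.multiprocess`, which names each metric file after the
PID that wrote it (e.g. `counter_1234.db`) and reads the whole directory
on every scrape, so files left behind by dead PIDs still cost time.

```python
import os
import tempfile

def files_to_read(multiproc_dir):
    """Every metric file a scrape would have to open (hypothetical
    helper standing in for the multiprocess collector's scan)."""
    return [
        os.path.join(multiproc_dir, name)
        for name in os.listdir(multiproc_dir)
        if name.endswith(".db")
    ]

# Simulate five worker processes that each left one counter file behind.
tmp = tempfile.mkdtemp()
for pid in range(1000, 1005):
    open(os.path.join(tmp, f"counter_{pid}.db"), "w").close()

print(len(files_to_read(tmp)))  # dead PIDs' files are scanned too
```

Because nothing removes a file when its process exits, the scan cost is
monotonically increasing, which is what the periodic restart resets.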
return file_mtime > start_time
-def restart():
+def restart(log=log):
log.info('Restarting...')
args = sys.argv[:]
args.insert(0, sys.executable)
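The fragment above builds an argv for re-exec'ing the interpreter; the
elided remainder presumably hands it to `os.execv`. A minimal sketch of
that pattern, with the argv construction factored into a hypothetical
helper so it can be exercised without replacing the running process:

```python
import os
import sys

def build_restart_args():
    """Reconstruct an argv suitable for re-exec'ing this interpreter.

    Hypothetical helper mirroring the two lines in restart() above:
    argv[0] becomes the interpreter path, followed by the original argv.
    """
    args = sys.argv[:]
    args.insert(0, sys.executable)
    return args

def restart():
    # os.execv replaces the current process image in place: the PID is
    # kept, file-handle state is reset, and nothing after this call runs.
    os.execv(sys.executable, build_restart_args())
```

Re-exec'ing rather than spawning a child keeps supervisors (systemd,
shells waiting on the PID) attached to the same process.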
JobProcesses(),
Nodes(),
]
+ self._created_time = time.perf_counter()
def start(self):
start_http_server(self.port, registry=registry)
while True:
try:
before = time.perf_counter()
+ if before - self._created_time > 24 * 60 * 60:
+ self.restart()
try:
self.update()
except Exception:
log.info("Stopping.")
raise SystemExit
+ def restart(self):
+ # Use the dispatcher's restart function - note that by using this here,
+ # it restarts the exporter, *not* the dispatcher.
+ return teuthology.dispatcher.restart(log=log)
+
class TeuthologyMetric:
def __init__(self):
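The elapsed-time check in start() above compares a `time.perf_counter()`
reading against the timestamp captured in `_created_time`. A minimal,
testable sketch of that check, with the condition factored into a
hypothetical helper (the 24h constant is the one from the diff):

```python
import time

RESTART_INTERVAL = 24 * 60 * 60  # seconds, as in start() above

def due_for_restart(created, now=None, interval=RESTART_INTERVAL):
    """True once the process has been alive longer than `interval`.

    `created` and `now` are time.perf_counter() readings; perf_counter
    is monotonic, so the difference is a safe elapsed-time measure even
    if the wall clock is adjusted.
    """
    if now is None:
        now = time.perf_counter()
    return now - created > interval

print(due_for_restart(0, now=1000))          # well under 24h
print(due_for_restart(0, now=25 * 60 * 60))  # past 24h
```

Checking once per loop iteration means the restart actually happens up
to one update interval after the 24h mark, which is fine for this
purpose.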