git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/commit
author Oguzhan Ozmen <oozmen@bloomberg.net>
Tue, 28 Apr 2026 00:09:16 +0000 (00:09 +0000)
committer Oguzhan Ozmen <oozmen@bloomberg.net>
Wed, 29 Apr 2026 13:37:25 +0000 (13:37 +0000)
commit 824514e49a4acc505b23a4ce18921f5ad0fa46d9
tree f54253683e7f0752b3cf9f3bd1b3c3615a54ea0e
parent eb42801b6f0c34d06419ad4e44631eaefdb209d1
rgw/multisite: fix uninitialized LatencyMonitor average and use exponentially weighted moving average

LatencyMonitor::total was declared without an initializer. Since
std::chrono::duration's default constructor leaves the value indeterminate,
the very first add_latency() call adds a real sample to garbage, producing a
huge average that immediately triggers the "OSD cluster is overloaded" warning
within seconds of RGW startup, before any actual slow ops occur.
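
To illustrate the initialization issue, here is a minimal sketch of a
total/count accumulator whose duration member is value-initialized. The
struct name LatencySum and its layout are hypothetical and are not the code
in rgw_cr_rados.h; only the `{}` initializer on the duration member is the
point.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for illustration; not the actual Ceph class.
struct LatencySum {
  // Without the braces, default-initialization would leave the stored tick
  // count indeterminate. Value-initialization starts it at zero.
  std::chrono::nanoseconds total{};
  std::int64_t count = 0;

  void add_latency(std::chrono::nanoseconds sample) {
    total += sample;
    ++count;
  }

  // Lifetime average; returns zero before the first sample arrives.
  std::chrono::nanoseconds avg_latency() const {
    return count == 0 ? std::chrono::nanoseconds::zero() : total / count;
  }
};
```

With the member zeroed, the first add_latency() call yields an average equal
to that first sample rather than sample plus garbage.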

Additionally, the old implementation used a naive lifetime average
(total/count), which slowed recovery from a transient slow-ops
episode. Once poisoned, the average stayed high for a long time,
keeping sync concurrency throttled to 1.

So, also replace the naive lifetime average in LatencyMonitor with an
exponentially weighted moving average (alpha=0.15). With the weighted average,
a past spike's influence decays faster over a series of normal lock operations,
allowing concurrency to recover without an RGW restart.
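
The decay behaviour can be sketched as follows. This is an illustrative
approximation, not the code from this commit: the class name EwmaLatency,
the double-based storage, and seeding the average with the first sample are
all assumptions. Only alpha=0.15 comes from the description above. The update
rule is avg = alpha * sample + (1 - alpha) * avg.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical sketch of an exponentially weighted moving average of
// latencies; not the actual LatencyMonitor from rgw_cr_rados.h.
class EwmaLatency {
  double alpha_;          // weight given to the newest sample
  double avg_ns_ = 0.0;   // current average in nanoseconds
  bool seeded_ = false;   // assumption: first sample seeds the average

public:
  explicit EwmaLatency(double alpha = 0.15) : alpha_(alpha) {
    assert(alpha > 0.0 && alpha <= 1.0);
  }

  void add_latency(std::chrono::nanoseconds sample) {
    const double x = static_cast<double>(sample.count());
    if (!seeded_) {
      avg_ns_ = x;
      seeded_ = true;
    } else {
      // Each older sample's weight shrinks by (1 - alpha) per new sample.
      avg_ns_ = alpha_ * x + (1.0 - alpha_) * avg_ns_;
    }
  }

  std::chrono::nanoseconds avg_latency() const {
    return std::chrono::nanoseconds(
        static_cast<std::chrono::nanoseconds::rep>(avg_ns_));
  }
};
```

As a worked example under these assumptions: one 100 ms sample, one 10 s
spike, then 30 more 100 ms samples leave this sketch at roughly 111 ms,
while a lifetime average over the same 32 samples is 13.1 s / 32, or about
409 ms.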

Fixes: https://tracker.ceph.com/issues/76308
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
src/rgw/driver/rados/rgw_cr_rados.h