git.apps.os.sepia.ceph.com Git - ceph.git/commit
author Kamoltat Sirivadhna <ksirivad@redhat.com>
Sun, 13 Jul 2025 20:26:49 +0000 (20:26 +0000)
committer Kamoltat Sirivadhna <ksirivad@redhat.com>
Fri, 18 Jul 2025 05:13:04 +0000 (05:13 +0000)
commit bcee73003b86011a9f60361befcdf2b4f66dc127
tree 51e6fe4477d75c0ad4b1c39a3e0e1992889c9cec
parent b4e39b239a1406f06eab9026c8a9101efe164e62
src/mon/HealthMonitor: Add mon_netsplit_grace_period to suppress transient MON_NETSPLIT warnings

When a monitor is elected leader and begins evaluating connectivity,
it may detect temporary disconnections between monitors that have not
yet fully reconnected to each other, particularly after events like
monitor restarts, SIGSTOP/SIGCONT (as used in mon_thrash), or brief
network blips.

This can result in false-positive MON_NETSPLIT health warnings that
disappear within seconds as the cluster topology stabilizes.

This commit introduces a configurable option:
  - mon_netsplit_grace_period (default: 9 seconds)

When the leader observes a netsplit between two monitors or locations,
it will wait for the grace period before raising a health warning.
If the split resolves within this window, no warning is emitted.
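
The grace-period behavior described above can be sketched as follows.
This is a minimal illustration only, not the code in
src/mon/HealthMonitor.cc: the class NetsplitGraceTracker, its observe()
method, and the use of a std::map keyed by monitor pair are all
hypothetical names and choices made for this example. It assumes the
leader re-evaluates each pair periodically and passes in the current
time.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch; names and structure are assumptions, not Ceph code.
using Clock = std::chrono::steady_clock;
using Pair = std::pair<std::string, std::string>;  // two monitors or locations

class NetsplitGraceTracker {
public:
  explicit NetsplitGraceTracker(std::chrono::seconds grace) : grace_(grace) {}

  // Record one observation for a pair. Returns true only when the split
  // has persisted for at least the grace period, i.e. when a warning
  // should be raised.
  bool observe(const Pair& p, bool split, Clock::time_point now) {
    if (!split) {
      // The split resolved: forget it so no warning is emitted.
      first_seen_.erase(p);
      return false;
    }
    // emplace() keeps the existing timestamp if the pair is already tracked.
    auto it = first_seen_.emplace(p, now).first;
    return now - it->second >= grace_;
  }

private:
  std::chrono::seconds grace_;
  std::map<Pair, Clock::time_point> first_seen_;  // when each split was first seen
};
```

In this sketch, a split that clears before the grace period elapses
resets its timer, so a later split between the same pair starts a new
window instead of inheriting the old one.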

This reduces test flakiness and alert fatigue while preserving the
accuracy of persistent MON_NETSPLIT detection.

Fixes: https://tracker.ceph.com/issues/71344
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
src/common/options/mon.yaml.in
src/mon/HealthMonitor.cc
src/mon/HealthMonitor.h