git.apps.os.sepia.ceph.com Git - ceph-ci.git/commit
author: Kamoltat Sirivadhna <ksirivad@redhat.com>
        Sun, 13 Jul 2025 20:26:49 +0000 (20:26 +0000)
committer: Kamoltat Sirivadhna <ksirivad@redhat.com>
        Tue, 7 Oct 2025 19:53:00 +0000 (19:53 +0000)
commit: b297babca37ab73c46cf26f5ede80fec9e9dd04c
tree: d96e8261919fe95426e0459b1b5548d44ec849c6
parent: e5b9e3479d63fb73d9b11934301d61501be19573
src/mon/HealthMonitor: Add mon_netsplit_grace_period to suppress transient MON_NETSPLIT warnings

When a monitor is elected leader and begins evaluating connectivity,
it may detect temporary disconnections between monitors that have not
yet fully reconnected to each other, particularly after events such as
monitor restarts, SIGSTOP/SIGCONT (as used in mon_thrash), or brief
network blips.

This can produce false-positive MON_NETSPLIT health warnings that
disappear within seconds as the cluster topology stabilizes.

This commit introduces a configurable option:
  - mon_netsplit_grace_period (default: 9 seconds)

When the leader observes a netsplit between two monitors or locations,
it will wait for the grace period before raising a health warning.
If the split resolves within this window, no warning is emitted.
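
The grace-period behavior described above can be sketched roughly as
follows. This is a minimal illustration, not the code in
src/mon/HealthMonitor.cc: the class NetsplitGraceTracker and its methods
observe_split() and observe_healed() are hypothetical names, and the
choice to warn once the elapsed time is greater than or equal to the
grace period is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not the actual HealthMonitor code): remembers when a
// split between two monitors or locations was first observed.
class NetsplitGraceTracker {
public:
  using Clock = std::chrono::steady_clock;
  using Pair = std::pair<std::string, std::string>;

  explicit NetsplitGraceTracker(std::chrono::seconds grace) : grace_(grace) {
    assert(grace.count() >= 0);  // a negative grace period makes no sense
  }

  // Record that the split is currently observed. Returns true only once the
  // split has persisted for at least the grace period (assumed comparison).
  bool observe_split(const Pair& p, Clock::time_point now) {
    // emplace() keeps the existing entry, so the earliest timestamp survives.
    auto it = first_seen_.emplace(normalize(p), now).first;
    return now - it->second >= grace_;
  }

  // The split resolved: forget it so a later split starts a fresh window.
  void observe_healed(const Pair& p) { first_seen_.erase(normalize(p)); }

private:
  // Treat (a, b) and (b, a) as the same split.
  static Pair normalize(Pair p) {
    if (p.second < p.first) std::swap(p.first, p.second);
    return p;
  }

  std::chrono::seconds grace_;
  std::map<Pair, Clock::time_point> first_seen_;
};
```

With a tracker built from the default of 9 seconds, a split first seen at
time t yields no warning at t or t+5s, yields a warning from t+9s onward,
and yields nothing if observe_healed() is called before the window expires.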

This reduces test flakiness and alert fatigue while preserving the
accuracy of persistent MON_NETSPLIT detection.

Fixes: https://tracker.ceph.com/issues/71344
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Conflicts:
  src/common/options/mon.yaml.in - trivial fix

src/common/options/mon.yaml.in
src/mon/HealthMonitor.cc
src/mon/HealthMonitor.h