git-server-git.apps.pok.os.sepia.ceph.com Git - ceph-ci.git/commit
HealthMonitor: Add topology-aware netsplit detection and warning
author Kamoltat Sirivadhna <ksirivad@redhat.com>
Thu, 15 Aug 2024 20:25:43 +0000 (20:25 +0000)
committer Kamoltat Sirivadhna <ksirivad@redhat.com>
Tue, 22 Apr 2025 21:31:50 +0000 (21:31 +0000)
commit 96f15cee41c47eaf02e5da25d55a5b94df8a35b7
tree 03d7fdbf1518d63706b4e70b5d302241f8bebadc
parent 7c994bfa05f329be88c6648141b9bfde83e81d64

Problem:
Currently, Ceph cannot detect and report network partitions (netsplits)
between monitors in different topology locations in a consolidated way.
While stretch mode can handle partitions through monitor elections,
users lack visibility into the topology-level view of network
disconnections, making troubleshooting difficult.

Solution:
This implementation adds a hierarchical netsplit detection mechanism that:
- Uses a DirectedGraph structure for netsplit detection
- Maps monitor disconnections to relevant CRUSH topology levels
- Aggregates individual disconnections into location-level reports when appropriate
- Detects complete location-level netsplits, where no monitor in one location
  can communicate with any monitor in the other
- Reports specific topology locations experiencing complete communication failures
- Falls back to individual monitor-level reporting for partial disconnections
- Handles monitors with missing location data gracefully
- Leverages HealthMonitor::check_for_mon_down to receive the set of down monitors,
  avoiding false netsplit reports for monitors already known to be down
- Excludes down monitors from location-based analysis, so netsplit reporting stays
  accurate at both the individual and topology levels
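
The aggregation rules listed above can be illustrated with a minimal,
self-contained sketch. This is not the code in HealthMonitor.cc or
ConnectionTracker.cc: the names detect_netsplits, ordered and MonPair are
hypothetical, an empty location string stands in for missing location data,
and a disconnection reported in either direction is assumed to count as a
disconnect.

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration: a pair of monitor names.
using MonPair = std::pair<std::string, std::string>;

// Hypothetical helper: store each pair in sorted order so that
// (a, d) and (d, a) compare equal.
static MonPair ordered(const std::string& x, const std::string& y) {
  return x < y ? MonPair(x, y) : MonPair(y, x);
}

// Sketch only. mon_location maps monitor name -> location ("" if missing).
// disconnected holds (reporter, unreachable peer) pairs.
std::vector<std::string> detect_netsplits(
    const std::map<std::string, std::string>& mon_location,
    const std::set<MonPair>& disconnected,
    const std::set<std::string>& down_mons) {
  // 1. Ignore every pair that involves a down or unknown monitor.
  std::set<MonPair> live_pairs;
  for (const auto& p : disconnected) {
    if (p.first == p.second) continue;
    if (down_mons.count(p.first) || down_mons.count(p.second)) continue;
    if (!mon_location.count(p.first) || !mon_location.count(p.second)) continue;
    live_pairs.insert(ordered(p.first, p.second));
  }

  // 2. Group live monitors by location; monitors without a location are
  //    left out of the location-level analysis.
  std::map<std::string, std::vector<std::string>> by_loc;
  for (const auto& entry : mon_location) {
    if (!down_mons.count(entry.first) && !entry.second.empty())
      by_loc[entry.second].push_back(entry.first);
  }

  // 3. A location pair is a complete netsplit only if every cross-location
  //    monitor pair is disconnected.
  std::vector<std::string> warnings;
  std::set<MonPair> covered;
  for (auto i = by_loc.begin(); i != by_loc.end(); ++i) {
    for (auto j = std::next(i); j != by_loc.end(); ++j) {
      bool complete = true;
      for (const auto& a : i->second)
        for (const auto& b : j->second)
          if (!live_pairs.count(ordered(a, b))) complete = false;
      if (!complete) continue;
      warnings.push_back("Netsplit detected between " + i->first +
                         " and " + j->first);
      for (const auto& a : i->second)
        for (const auto& b : j->second)
          covered.insert(ordered(a, b));
    }
  }

  // 4. Anything not covered by a location-level report falls back to an
  //    individual monitor-level report.
  for (const auto& p : live_pairs) {
    if (!covered.count(p))
      warnings.push_back("Netsplit detected between mon." + p.first +
                         " and mon." + p.second);
  }
  return warnings;
}
```

In this sketch the nested loops over monitor pairs are what give the
quadratic cost in the number of monitors; the real implementation may
structure the work differently.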

The implementation produces user-friendly health warnings:
1. For complete location netsplits: "Netsplit detected between dc1 and dc2"
2. For individual monitor disconnections: "Netsplit detected between mon.a and mon.d"

Performance considerations:
- Time complexity: O(m²) where m is the number of monitors
- Space complexity: O(m²) for connection tracking
- Practical impact is minimal as monitor count is typically small (3-7)
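
As a rough check on the bounds above, assuming each monitor keeps one
connection record per peer (an assumption about the bookkeeping, not a detail
stated in this commit), the number of directed monitor pairs for the typical
cluster sizes is:

\[
m(m-1) = O(m^2), \qquad m = 3 \Rightarrow 6, \qquad m = 7 \Rightarrow 42
\]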

Fixes: https://tracker.ceph.com/issues/67371
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Conflicts:
    src/mon/Elector.cc - Trivial Fix

src/mon/ConnectionTracker.cc
src/mon/ConnectionTracker.h
src/mon/Elector.cc
src/mon/Elector.h
src/mon/HealthMonitor.cc
src/mon/HealthMonitor.h