git-server-git.apps.pok.os.sepia.ceph.com Git - ceph-ci.git/commit
HealthMonitor: Add topology-aware netsplit detection and warning
author Kamoltat Sirivadhna <ksirivad@redhat.com>
Thu, 15 Aug 2024 20:25:43 +0000 (20:25 +0000)
committer Kamoltat Sirivadhna <ksirivad@redhat.com>
Thu, 19 Jun 2025 16:45:20 +0000 (16:45 +0000)
commit a7477bef6484134887c19371ca3968090794fcc5
tree 1f3341cb7550211b5a8524cceed99882cb6a36bd
parent 56be771df3e2f0f93424ab0f24bce526bbd06195

Problem:
Currently, Ceph cannot detect and report network partitions (netsplits)
between monitors in different topology locations in a consolidated way.
While stretch mode can handle partitions through monitor elections,
users lack visibility into the topology-level view of network
disconnections, making troubleshooting difficult.

Solution:
This implementation adds a hierarchical netsplit detection mechanism that:
- Uses a DirectedGraph structure for netsplit detection
- Maps monitor disconnections to relevant CRUSH topology levels
- Aggregates individual disconnections into location-level reports when appropriate
- Detects a complete location-level netsplit when no monitor in one location
  can communicate with any monitor in the other
- Reports specific topology locations experiencing complete communication failures
- Falls back to individual monitor-level reporting for partial disconnections
- Handles monitors with missing location data gracefully
- Leverages HealthMonitor::check_for_mon_down to receive the set of down monitors,
  avoiding false netsplit reports for monitors already known to be down
- Excludes down monitors from location-based analysis, so that netsplit reporting
  stays accurate at both the individual and topology levels

The implementation produces user-friendly health warnings:
1. For complete location netsplits: "Netsplit detected between dc1 and dc2"
2. For individual monitor disconnections: "Netsplit detected between mon.a and mon.d"
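
The detection steps and the two warning forms above can be sketched as
follows. This is an illustrative Python sketch, not the C++ code in
src/mon: the function name detect_netsplits, its arguments, and the rule
that a pair counts as split when either direction reports a lost
connection are all assumptions made for the example.

```python
from itertools import combinations


def detect_netsplits(locations, unreachable, down):
    """Return health messages for a set of reported disconnections.

    locations:   dict of monitor name -> location name (None if missing)
    unreachable: set of directed (src, dst) edges, src cannot reach dst
    down:        set of monitors already known to be down
    All three shapes are hypothetical, chosen for this sketch.
    """
    # Down monitors are excluded up front so they never produce a netsplit.
    live = {m for m in locations if m not in down}

    # Assumption: a pair is split if either direction reports a failure.
    split = {frozenset((a, b)) for a, b in unreachable
             if a in live and b in live and a != b}

    # Group live monitors by location; monitors without a location are
    # left ungrouped and can only appear in monitor-level reports.
    by_loc = {}
    for mon in sorted(live):
        loc = locations[mon]
        if loc is not None:
            by_loc.setdefault(loc, []).append(mon)

    messages = []
    covered = set()
    for loc_a, loc_b in combinations(sorted(by_loc), 2):
        pairs = {frozenset((a, b))
                 for a in by_loc[loc_a] for b in by_loc[loc_b]}
        # Complete netsplit: every cross-location pair is disconnected.
        if pairs <= split:
            messages.append(f"Netsplit detected between {loc_a} and {loc_b}")
            covered |= pairs

    # Whatever is left is a partial disconnection, reported per monitor.
    for a, b in sorted(tuple(sorted(p)) for p in split - covered):
        messages.append(f"Netsplit detected between mon.{a} and mon.{b}")
    return messages
```

With monitors a and b in dc1 and c and d in dc2, reporting all four
cross-location pairs as unreachable yields the single dc1/dc2 warning,
while reporting only (a, d) yields the monitor-level warning.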

Performance considerations:
- Time complexity: O(m²) where m is the number of monitors
- Space complexity: O(m²) for connection tracking
- Practical impact is minimal as monitor count is typically small (3-7)
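
To put the quadratic bound in concrete terms, the short sketch below
counts the ordered monitor pairs that could need tracking. It assumes
one entry per ordered pair of distinct monitors; the helper name
directed_pairs is hypothetical and the real data structure may store
connections differently.

```python
def directed_pairs(m: int) -> int:
    """Number of ordered pairs of distinct monitors among m monitors."""
    return m * (m - 1)


for m in (3, 5, 7):
    # Prints 6, 20 and 42 pairs for 3, 5 and 7 monitors respectively.
    print(m, directed_pairs(m))
```

Even at 7 monitors the worst case is 42 ordered pairs.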

Fixes: https://tracker.ceph.com/issues/67371
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Conflicts:
src/mon/Elector.cc - Trivial Fix
(cherry picked from commit 96f15cee41c47eaf02e5da25d55a5b94df8a35b7)

Conflicts:
src/mon/HealthMonitor.cc - Trivial fix
src/mon/ConnectionTracker.cc
src/mon/ConnectionTracker.h
src/mon/Elector.cc
src/mon/Elector.h
src/mon/HealthMonitor.cc
src/mon/HealthMonitor.h