From 1fa1103edee33a8a4abe9226194d3dc608c014d0 Mon Sep 17 00:00:00 2001
From: xie xingguo
Date: Wed, 26 Jun 2019 14:24:08 +0800
Subject: [PATCH] osd/OSD: auto mark heartbeat sessions as stale and tear them down

The primary benefit is that the OSD doesn't need to keep a flood of
blocked heartbeat messages around in memory.

This prevents OSDs from accumulating heartbeat messages due to a broken
switch and then exhausting the whole node's memory:

Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.137077] Out of memory: Kill process 1471476 (ceph-osd) score 47 or sacrifice child
Jun 11 04:19:26 host-192-168-9-12 kernel: [409881.146054] Killed process 1471476 (ceph-osd) total-vm:4822548kB, anon-rss:3097860kB, file-rss:2556kB, shmem-rss:0kB

Fixes: http://tracker.ceph.com/issues/40586
Signed-off-by: xie xingguo
(cherry picked from commit 6cc90f363b8096d2d5fad30e57426d0cea9e3478)

Conflicts:
	src/osd/OSD.cc (no boot_finisher.stop() and no lock_guard)
	src/osd/OSD.h (trivial)

Fixed get_val() call in reset_heartbeat_peers()
---
 src/common/options.cc |  7 +++++++
 src/osd/OSD.cc        | 31 ++++++++++++++++++++-----------
 src/osd/OSD.h         | 10 +++++++++-
 3 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/src/common/options.cc b/src/common/options.cc
index e52b5d533642..42c9f73fbddb 100644
--- a/src/common/options.cc
+++ b/src/common/options.cc
@@ -2941,6 +2941,13 @@ std::vector