git.apps.os.sepia.ceph.com Git - ceph-ci.git/commit

author	xie xingguo <xie.xingguo@zte.com.cn>
	Fri, 16 Nov 2018 06:56:59 +0000 (14:56 +0800)
committer	xie xingguo <xie.xingguo@zte.com.cn>
	Wed, 28 Nov 2018 09:03:58 +0000 (17:03 +0800)
commit	114c65fc0b04971fb93093c1e3fce6a71781351a
tree	92a29580523032dd6c8392c21e99558c0d55c12d
parent	8d8e8a359c66b5767be6a4a2327c5f7097885464

osd: fix heartbeat brain-split behaviour

Another issue similar to 8d8e8a359c66b5767be6a4a2327c5f7097885464.
To reproduce, construct a cluster with 3 hosts, each containing a single osd:
- cut off osd.1's cluster network and wait for osd.1 to be marked down
- cut off the cluster network of both osd.2 & osd.3

It is then possible to end up with __two__ down osds (e.g., both osd.1 & osd.2),
and restoring the cluster network of osd.1 and osd.2 afterwards won't change anything.

The root cause is that by default we always require at least 1/3 of the heartbeat
connections to all current __up__ osds to be active before bringing a previously
dead (unhealthy) osd back to life. However, the __up__ set may itself be the
minority partition that has been cut off from the rest of the cluster entirely,
which causes the split-brain behaviour demonstrated above.
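
To make the failure mode concrete, here is a minimal sketch of the 1/3 rule
applied to the reproduction above. It is an illustration only, not the OSD
code: the function name, the set-based model of heartbeat peers, and the
`min_ratio` parameter are assumptions made for this example.

```python
def has_enough_healthy_peers(reachable, up_set, min_ratio=1 / 3):
    """Return True if enough of the currently up osds are reachable.

    Hypothetical model: `reachable` is the set of osds this osd can
    heartbeat with, `up_set` is the set of osds the map considers up.
    """
    healthy = len(reachable & up_set)
    return healthy >= len(up_set) * min_ratio


# State after the reproduction steps: osd.1 and osd.2 are marked down,
# so the up set contains only osd.3, which is still cut off.
up_set = {"osd.3"}

# osd.1 and osd.2 have their network restored and can reach each other,
# but neither can reach osd.3.
osd1_ok = has_enough_healthy_peers({"osd.2"}, up_set)
osd2_ok = has_enough_healthy_peers({"osd.1"}, up_set)

# In a healthy cluster the same rule passes easily.
normal_ok = has_enough_healthy_peers({"osd.2", "osd.3"}, {"osd.2", "osd.3"})
```

Under this model both `osd1_ok` and `osd2_ok` are false, because the only osd
counted as up is the one nobody can reach, so neither restored osd is ever
considered healthy again.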

The simplest fix is to try to re-activate an unhealthy osd whenever it is
still safe to do so. Also keep in mind that frequent up-to-down
transitions will kill off the osd process entirely, which is why the
```osd_markdown_log``` related check is needed here.
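
One way such a "still safe" check could look is sketched below. This is a
hedged illustration, not the code from this patch: the function name, the
`period` window, and the `max_recent` limit are assumptions chosen for the
example, and the log is modelled as a plain deque of mark-down timestamps.

```python
from collections import deque


def may_reactivate(markdown_log, now, period=600.0, max_recent=1):
    """Decide whether an unhealthy osd may try to come back up.

    Hypothetical model: `markdown_log` holds the times (in seconds) at
    which this osd was marked down, oldest first. Entries older than
    `period` are dropped, and re-activation is allowed only if few
    recent mark-downs remain.
    """
    while markdown_log and markdown_log[0] + period < now:
        markdown_log.popleft()
    return len(markdown_log) <= max_recent


# One recent mark-down: re-activating is considered safe.
first_time = may_reactivate(deque([0.0, 1000.0]), now=1100.0)

# Three mark-downs inside the window: the osd is flapping, so hold back.
flapping = may_reactivate(deque([900.0, 1000.0, 1050.0]), now=1100.0)
```

The point of the window is to separate an osd that was marked down once from
one that is flapping, since repeated up-to-down transitions would otherwise
end with the process being killed.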

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>