From: root
Date: Mon, 30 Jul 2018 01:29:48 +0000 (-0400)
Subject: msg: ceph_abort() when there are enough accepter errors in msg server
X-Git-Tag: v14.0.1~280^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=00e0ab407b2e9659d9121be1217e95c8117c411e;p=ceph-ci.git

msg: ceph_abort() when there are enough accepter errors in msg server

In some extreme cases (we have hit one in our production cluster), when
the Accepter thread breaks out, new clients cannot connect to the OSD.
Because the existing heartbeat connections are already established, other
OSDs cannot detect the failure and notify the monitor to mark the failed
OSD down.

With this patch, when there are abnormal communication errors, we simply
call ceph_abort() so that the OSD goes down quickly and other OSDs can
notify the monitor to mark the failed OSD down.

Signed-off-by: penglaiyxy@gmail.com
---

diff --git a/src/common/legacy_config_opts.h b/src/common/legacy_config_opts.h
index 6a38549b33b..40ef425ba65 100644
--- a/src/common/legacy_config_opts.h
+++ b/src/common/legacy_config_opts.h
@@ -175,6 +175,10 @@ OPTION(ms_async_rdma_dscp, OPT_INT) // in RoCE, this means DSCP
 OPTION(ms_async_rdma_cm, OPT_BOOL)
 OPTION(ms_async_rdma_type, OPT_STR)
 
+// when there are enough accept failures, indicating there are unrecoverable failures,
+// just do ceph_abort() . Here we make it configurable.
+OPTION(ms_max_accept_failures, OPT_INT)
+
 OPTION(ms_dpdk_port_id, OPT_INT)
 SAFE_OPTION(ms_dpdk_coremask, OPT_STR) // it is modified in unittest so that use SAFE_OPTION to declare
 OPTION(ms_dpdk_memory_channel, OPT_STR)
diff --git a/src/common/options.cc b/src/common/options.cc
index 79e88c5f5fa..1d91bd2188e 100644
--- a/src/common/options.cc
+++ b/src/common/options.cc
@@ -1100,6 +1100,11 @@ std::vector