From 9db84be4b76c566cb4b5bd17662cffd9d2f0bd88 Mon Sep 17 00:00:00 2001
From: John Wilkins
Date: Tue, 4 Sep 2012 16:33:47 -0700
Subject: [PATCH] doc: Added monitor failure recovery. Will be re-factored again soon.

Signed-off-by: John Wilkins
---
 doc/cluster-ops/troubleshooting-mon.rst | 31 +++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 doc/cluster-ops/troubleshooting-mon.rst

diff --git a/doc/cluster-ops/troubleshooting-mon.rst b/doc/cluster-ops/troubleshooting-mon.rst
new file mode 100644
index 0000000000000..e2b834acffbf2
--- /dev/null
+++ b/doc/cluster-ops/troubleshooting-mon.rst
@@ -0,0 +1,31 @@
+==================================
+ Recovering from Monitor Failures
+==================================
+
+In production clusters, we recommend running the cluster with a minimum
+of three monitors. The failure of a single monitor should not take down
+the entire monitor cluster, provided a majority of the monitors remain
+available. If the majority of nodes are available, the remaining nodes
+will be able to form a quorum.
+
+When you check your cluster's health, you may notice that a monitor
+has failed. For example::
+
+    ceph health
+    HEALTH_WARN 1 mons down, quorum 0,2
+
+For additional detail, you may check the cluster status::
+
+    ceph status
+    HEALTH_WARN 1 mons down, quorum 0,2
+    mon.b (rank 1) addr 192.168.106.220:6790/0 is down (out of quorum)
+
+In most cases, you can simply restart the affected node.
+For example::
+
+    service ceph -a restart {failed-mon}
+
+If there are not enough monitors to form a quorum, the ``ceph``
+command will block trying to reach the cluster. In this situation,
+you need to get enough ``ceph-mon`` daemons running to form a quorum
+before doing anything else with the cluster.
\ No newline at end of file
-- 
2.39.5