doc: Added monitor failure recovery. Will be re-factored again soon.

author John Wilkins <john.wilkins@inktank.com>

Tue, 4 Sep 2012 23:33:47 +0000 (16:33 -0700)

committer John Wilkins <john.wilkins@inktank.com>

Tue, 4 Sep 2012 23:33:47 +0000 (16:33 -0700)
author John Wilkins <john.wilkins@inktank.com>
Tue, 4 Sep 2012 23:33:47 +0000 (16:33 -0700)
committer John Wilkins <john.wilkins@inktank.com>
Tue, 4 Sep 2012 23:33:47 +0000 (16:33 -0700)
diff --git a/doc/cluster-ops/troubleshooting-mon.rst b/doc/cluster-ops/troubleshooting-mon.rst

new file mode 100644 (file)

index 0000000..e2b834a
--- /dev/null
+++ b/doc/cluster-ops/troubleshooting-mon.rst
@@ -0,0 +1,31 @@
+==================================
+ Recovering from Monitor Failures
+==================================
+
+In production clusters, we recommend running the cluster with a minimum
+of three monitors. The failure of a single monitor should not take down
+the entire monitor cluster, provided a majority of the monitors remain
+available. If the majority of nodes are available, the remaining nodes
+will be able to form a quorum.
+
+When you check your cluster's health, you may notice that a monitor
+has failed. For example:: 
+
+       ceph health
+       HEALTH_WARN 1 mons down, quorum 0,2
+
+For additional detail, you may check the cluster status::
+
+       ceph status
+       HEALTH_WARN 1 mons down, quorum 0,2
+       mon.b (rank 1) addr 192.168.106.220:6790/0 is down (out of quorum)
+
+In most cases, you can simply restart the affected node. 
+For example:: 
+
+       service ceph -a restart {failed-mon}
+
+If there are not enough monitors to form a quorum, the ``ceph``
+command will block trying to reach the cluster.  In this situation,
+you need to get enough ``ceph-mon`` daemons running to form a quorum
+before doing anything else with the cluster.
+\ No newline at end of file
author	John Wilkins <john.wilkins@inktank.com>
	Tue, 4 Sep 2012 23:33:47 +0000 (16:33 -0700)
committer	John Wilkins <john.wilkins@inktank.com>
	Tue, 4 Sep 2012 23:33:47 +0000 (16:33 -0700)