From 9da49667b8020e97cb9eb502bbf501707685efa3 Mon Sep 17 00:00:00 2001
From: John Wilkins
Date: Tue, 4 Sep 2012 11:37:13 -0700
Subject: [PATCH] doc: Created a more robust doc for monitoring a cluster.

Signed-off-by: John Wilkins
---
 doc/cluster-ops/monitoring.rst | 207 +++++++++++++++++++++++++++++++++
 1 file changed, 207 insertions(+)
 create mode 100644 doc/cluster-ops/monitoring.rst

diff --git a/doc/cluster-ops/monitoring.rst b/doc/cluster-ops/monitoring.rst
new file mode 100644
index 0000000000000..a414f23e9f33b
--- /dev/null
+++ b/doc/cluster-ops/monitoring.rst
@@ -0,0 +1,207 @@
======================
 Monitoring a Cluster
======================

Once you have a running cluster, you may use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.

Interactive Mode
================

To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments. For example::

    ceph
    ceph> health
    ceph> status
    ceph> quorum_status
    ceph> mon_status


Checking Cluster Health
=======================

After you start your cluster, and before you start reading and/or writing data,
check your cluster's health with the following::

    ceph health

If you specified non-default locations for your configuration or keyring,
you may specify their locations::

    ceph -c /path/to/conf -k /path/to/keyring health

Upon starting the Ceph cluster, you will likely encounter a health warning such
as ``HEALTH_WARN XXX num placement groups stale``. Wait a few moments and check
it again. When your cluster is ready, ``ceph health`` should return a message
such as ``HEALTH_OK``. At that point, it is okay to begin using the cluster.

Watching a Cluster
==================

To watch the cluster's ongoing events, open a new terminal. Then, enter::

    ceph -w

Ceph will print each version of the placement group map and its status. For
example, a tiny Ceph cluster consisting of one monitor, one metadata server and
two OSDs may print the following::

    health HEALTH_OK
    monmap e1: 1 mons at {a=192.168.0.1:6789/0}, election epoch 0, quorum 0 a
    osdmap e13: 2 osds: 2 up, 2 in
    pgmap v9713: 384 pgs: 384 active+clean; 8730 bytes data, 22948 MB used, 264 GB / 302 GB avail
    mdsmap e4: 1/1/1 up {0=a=up:active}

    2012-08-01 11:33:53.831268 mon.0 [INF] pgmap v9712: 384 pgs: 384 active+clean; 8730 bytes data, 22948 MB used, 264 GB / 302 GB avail
    2012-08-01 11:35:31.904650 mon.0 [INF] pgmap v9713: 384 pgs: 384 active+clean; 8730 bytes data, 22948 MB used, 264 GB / 302 GB avail
    2012-08-01 11:35:53.903189 mon.0 [INF] pgmap v9714: 384 pgs: 384 active+clean; 8730 bytes data, 22948 MB used, 264 GB / 302 GB avail
    2012-08-01 11:37:31.865809 mon.0 [INF] pgmap v9715: 384 pgs: 384 active+clean; 8730 bytes data, 22948 MB used, 264 GB / 302 GB avail


Checking a Cluster's Status
===========================

To check a cluster's status, execute the following::

    ceph status

Or::

    ceph -s

In interactive mode, type ``status`` and press **Enter**. ::

    ceph> status

Ceph will print the cluster status. For example, a tiny Ceph cluster consisting
of one monitor, one metadata server and two OSDs may print the following::

    health HEALTH_OK
    monmap e1: 1 mons at {a=192.168.0.1:6789/0}, election epoch 0, quorum 0 a
    osdmap e13: 2 osds: 2 up, 2 in
    pgmap v9754: 384 pgs: 384 active+clean; 8730 bytes data, 22948 MB used, 264 GB / 302 GB avail
    mdsmap e4: 1/1/1 up {0=a=up:active}

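If you check cluster health from a script rather than interactively (for
example, to wait for a newly started cluster to settle before writing data),
you can key off the ``HEALTH_OK`` string shown above. The following is only a
minimal sketch, not a supported part of the ``ceph`` tool; the 10-second
polling interval is an arbitrary choice, and it assumes ``ceph`` can find your
configuration and keyring (add ``-c``/``-k`` options as shown earlier if it
cannot)::

    #!/bin/sh
    # A rough health-wait loop (not part of the ``ceph`` tool): poll the cluster
    # until ``ceph health`` reports HEALTH_OK, then exit. The 10-second interval
    # and the prefix match on HEALTH_OK are arbitrary choices for this sketch.
    status=$(ceph health)
    until echo "${status}" | grep -q '^HEALTH_OK'; do
        echo "$(date): cluster not healthy yet: ${status}"
        sleep 10
        status=$(ceph health)
    done
    echo "$(date): cluster reports HEALTH_OK"

If the loop runs for a long time, interrupt it and inspect the warning with
``ceph health`` or ``ceph -w`` as described above.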

Checking OSD Status
===================

An OSD's status is either in the cluster (``in``) or out of the cluster
(``out``), and it is either up and running (``up``) or down and not running
(``down``). If an OSD is ``up``, it may be either ``in`` the cluster (you can
read and write data) or ``out`` of the cluster. If it is ``down``, it should
also be ``out``. If an OSD is ``down`` and ``in``, there is a problem.

.. ditaa:: +----------------+        +----------------+
           |                |        |                |
           |   OSD #n In    |        |   OSD #n Up    |
           |                |        |                |
           +----------------+        +----------------+
                   ^                         ^
                   |                         |
                   |                         |
                   v                         v
           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n Out   |        |   OSD #n Down  |
           |                |        |                |
           +----------------+        +----------------+

You can check OSDs to ensure they are ``up`` and ``in`` by executing::

    ceph osd stat

Or::

    ceph osd dump

You can also view OSDs according to their position in the CRUSH map. ::

    ceph osd tree

Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up,
and their weight. ::

    # id    weight  type name       up/down reweight
    -1      3       pool default
    -3      3               rack mainrack
    -2      3                       host osd-host
    0       1                               osd.0   up      1
    1       1                               osd.1   up      1
    2       1                               osd.2   up      1


Checking Monitor Status
=======================

If your cluster has multiple monitors (likely), you should check the monitor
quorum status after you start the cluster and before reading and/or writing
data. A quorum must be present when multiple monitors are running. You should
also check monitor status periodically to ensure that the monitors are running.

To display the monitor map, execute the following::

    ceph mon stat

Or::

    ceph mon dump

To check the quorum status for the monitor cluster, execute the following::

    ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of
three monitors may return the following:

.. code-block:: javascript

    { "election_epoch": 10,
      "quorum": [
            0,
            1,
            2],
      "monmap": { "epoch": 1,
          "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
          "modified": "2011-12-12 13:28:27.505520",
          "created": "2011-12-12 13:28:27.505520",
          "mons": [
                { "rank": 0,
                  "name": "a",
                  "addr": "127.0.0.1:6789\/0"},
                { "rank": 1,
                  "name": "b",
                  "addr": "127.0.0.1:6790\/0"},
                { "rank": 2,
                  "name": "c",
                  "addr": "127.0.0.1:6791\/0"}
               ]
        }
    }

Checking MDS Status
===================

Metadata servers provide metadata services for Ceph FS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your
metadata servers are ``up`` and ``active``, execute the following::

    ceph mds stat

To display details of the metadata cluster, execute the following::

    ceph mds dump


Checking Placement Group States
===============================

Placement groups map objects to OSDs. When you monitor your placement groups,
you will want them to be ``active`` and ``clean``. For a discussion of other
placement group states, see `Placement Group States`_.

.. _Placement Group States: ../pg-states
-- 
2.39.5