doc/rados: add page for health checks and update monitoring.rst

author John Spray <john.spray@redhat.com>

Tue, 25 Jul 2017 14:13:02 +0000 (15:13 +0100)

committer John Spray <john.spray@redhat.com>

Tue, 25 Jul 2017 14:13:02 +0000 (15:13 +0100)
diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst

new file mode 100644 (file)

index 0000000..4dced94
--- /dev/null
+++ b/doc/rados/operations/health-checks.rst
@@ -0,0 +1,181 @@
+
+=============
+Health checks
+=============
+
+Overview
+========
+
+There is a finite set of possible health messages that a Ceph cluster can
+raise -- these are defined as *health checks* which have unique identifiers.
+
+The identifier is a terse pseudo-human-readable (i.e. like a variable name)
+string.  It is intended to enable tools (such as UIs) to make sense of
+health checks, and present them in a way that reflects their meaning.
+
+This page lists the health checks that are raised by the monitor and manager
+daemons.  In addition to these, you may also see health checks that originate
+from MDS daemons (see :doc:`/cephfs/health-messages`), and health checks
+that are defined by ceph-mgr Python modules.
+
+Definitions
+===========
+
+
+OSDs
+----
+
+OSD_DOWN
+________
+
+One or more OSDs are marked down.
+
+OSD_<crush type>_DOWN 
+_____________________
+
+(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN)
+
+All the OSDs within a particular CRUSH subtree are marked down, for example
+all OSDs on a host.
+
+OSD_ORPHAN
+__________
+
+
+OSD_OUT_OF_ORDER_FULL
+_____________________
+
+
+OSD_FULL
+________
+
+
+OSD_BACKFILLFULL
+________________
+
+
+OSD_NEARFULL
+____________
+
+
+OSDMAP_FLAGS
+____________
+
+
+OSD_FLAGS
+_________
+
+
+OLD_CRUSH_TUNABLES
+__________________
+
+
+OLD_CRUSH_STRAW_CALC_VERSION
+____________________________
+
+
+CACHE_POOL_NO_HIT_SET
+_____________________
+
+
+OSD_NO_SORTBITWISE
+__________________
+
+
+POOL_FULL
+_________
+
+
+Data health (pools & placement groups)
+--------------------------------------
+
+PG_AVAILABILITY
+_______________
+
+
+PG_DEGRADED
+___________
+
+
+PG_DEGRADED_FULL
+________________
+
+
+PG_DAMAGED
+__________
+
+OSD_SCRUB_ERRORS
+________________
+
+
+CACHE_POOL_NEAR_FULL
+____________________
+
+
+TOO_FEW_PGS
+___________
+
+
+TOO_MANY_PGS
+____________
+
+
+SMALLER_PGP_NUM
+_______________
+
+
+MANY_OBJECTS_PER_PG
+___________________
+
+
+POOL_FULL
+_________
+
+
+POOL_NEAR_FULL
+______________
+
+
+OBJECT_MISPLACED
+________________
+
+
+OBJECT_UNFOUND
+______________
+
+
+REQUEST_SLOW
+____________
+
+
+REQUEST_STUCK
+_____________
+
+
+PG_NOT_SCRUBBED
+_______________
+
+
+PG_NOT_DEEP_SCRUBBED
+____________________
+
+
+CephFS
+------
+
+FS_WITH_FAILED_MDS
+__________________
+
+
+FS_DEGRADED
+___________
+
+
+MDS_INSUFFICIENT_STANDBY
+________________________
+
+
+MDS_DAMAGED
+___________
+
+
diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst

index ad1484bcaf0639c917ca61dda996ee7cc7bd0de8..c291440b78a6f6bca35ea28d7636da1864a917ef 100644 (file)
--- a/doc/rados/operations/monitoring.rst
+++ b/doc/rados/operations/monitoring.rst
@@ -6,8 +6,11 @@ Once you have a running cluster, you may use the ``ceph`` tool to monitor your
  cluster. Monitoring a cluster typically involves checking OSD status, monitor 
  status, placement group status and metadata server status.
  
-Interactive Mode
-================
+Using the command line
+======================
+
+Interactive mode
+----------------
  
  To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
  with no arguments.  For example:: 
@@ -17,72 +20,58 @@ with no arguments.  For example::
         ceph> status
         ceph> quorum_status
         ceph> mon_status
-       
  
-Checking Cluster Health
-=======================
-
-After you start your cluster, and before you start reading and/or
-writing data, check your cluster's health first. You can check on the 
-health of your Ceph cluster with the following::
-
-       ceph health
+Non-default paths
+-----------------
  
  If you specified non-default locations for your configuration or keyring,
  you may specify their locations::
  
     ceph -c /path/to/conf -k /path/to/keyring health
  
-Upon starting the Ceph cluster, you will likely encounter a health
-warning such as ``HEALTH_WARN XXX num placement groups stale``. Wait a few moments and check
-it again. When your cluster is ready, ``ceph health`` should return a message
-such as ``HEALTH_OK``. At that point, it is okay to begin using the cluster.
+Checking a Cluster's Status
+===========================
+
+After you start your cluster, and before you start reading and/or
+writing data, check your cluster's status first.
  
-Watching a Cluster
-==================
+To check a cluster's status, execute the following:: 
  
-To watch the cluster's ongoing events, open a new terminal. Then, enter:: 
+       ceph status
+       
+Or:: 
  
-       ceph -w
+       ceph -s
+
+In interactive mode, type ``status`` and press **Enter**. ::
+
+       ceph> status
+
+Ceph will print the cluster status. For example, a tiny Ceph demonstration
+cluster with one of each service may print the following:
+
+::
+
+  cluster:
+    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
+    health: HEALTH_OK
+   
+  services:
+    mon: 1 daemons, quorum a
+    mgr: x(active)
+    mds: 1/1/1 up {0=a=up:active}
+    osd: 1 osds: 1 up, 1 in
+  
+  data:
+    pools:   2 pools, 16 pgs
+    objects: 21 objects, 2246 bytes
+    usage:   546 GB used, 384 GB / 931 GB avail
+    pgs:     16 active+clean
  
-Ceph will print each event.  For example, a tiny Ceph cluster consisting of 
-one monitor, and two OSDs may print the following:: 
-
-    cluster b370a29d-9287-4ca3-ab57-3d824f65e339
-     health HEALTH_OK
-     monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
-     osdmap e63: 2 osds: 2 up, 2 in
-      pgmap v41338: 952 pgs, 20 pools, 17130 MB data, 2199 objects
-            115 GB used, 167 GB / 297 GB avail
-                 952 active+clean
-
-    2014-06-02 15:45:21.655871 osd.0 [INF] 17.71 deep-scrub ok
-    2014-06-02 15:45:47.880608 osd.1 [INF] 1.0 scrub ok
-    2014-06-02 15:45:48.865375 osd.1 [INF] 1.3 scrub ok
-    2014-06-02 15:45:50.866479 osd.1 [INF] 1.4 scrub ok
-    2014-06-02 15:45:01.345821 mon.0 [INF] pgmap v41339: 952 pgs: 952 active+clean; 17130 MB data, 115 GB used, 167 GB / 297 GB avail
-    2014-06-02 15:45:05.718640 mon.0 [INF] pgmap v41340: 952 pgs: 1 active+clean+scrubbing+deep, 951 active+clean; 17130 MB data, 115 GB used, 167 GB / 297 GB avail
-    2014-06-02 15:45:53.997726 osd.1 [INF] 1.5 scrub ok
-    2014-06-02 15:45:06.734270 mon.0 [INF] pgmap v41341: 952 pgs: 1 active+clean+scrubbing+deep, 951 active+clean; 17130 MB data, 115 GB used, 167 GB / 297 GB avail
-    2014-06-02 15:45:15.722456 mon.0 [INF] pgmap v41342: 952 pgs: 952 active+clean; 17130 MB data, 115 GB used, 167 GB / 297 GB avail
-    2014-06-02 15:46:06.836430 osd.0 [INF] 17.75 deep-scrub ok
-    2014-06-02 15:45:55.720929 mon.0 [INF] pgmap v41343: 952 pgs: 1 active+clean+scrubbing+deep, 951 active+clean; 17130 MB data, 115 GB used, 167 GB / 297 GB avail
-
-
-The output provides:
-
-- Cluster ID
-- Cluster health status
-- The monitor map epoch and the status of the monitor quorum
-- The OSD map epoch and the status of OSDs 
-- The placement group map version
-- The number of placement groups and pools
-- The *notional* amount of data stored and the number of objects stored; and,
-- The total amount of data stored.
  
  .. topic:: How Ceph Calculates Data Usage
  
-   The ``used`` value reflects the *actual* amount of raw storage used. The 
+   The ``usage`` value reflects the *actual* amount of raw storage used. The 
     ``xxx GB / xxx GB`` value means the amount available (the lesser number)
     of the overall storage capacity of the cluster. The notional number reflects 
     the size of the stored data before it is replicated, cloned or snapshotted.
@@ -91,6 +80,96 @@ The output provides:
     storage capacity for cloning and snapshotting.
  
  
+Watching a Cluster
+==================
+
+In addition to local logging by each daemon, Ceph clusters maintain
+a *cluster log* that records high-level events about the whole system.
+This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
+default), but can also be monitored via the command line.
+
+To follow the cluster log, use the following command:
+
+:: 
+
+       ceph -w
+
+Ceph will print the status of the system, followed by each log message as it
+is emitted.  For example:
+
+:: 
+
+  cluster:
+    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
+    health: HEALTH_OK
+  
+  services:
+    mon: 1 daemons, quorum a
+    mgr: x(active)
+    mds: 1/1/1 up {0=a=up:active}
+    osd: 1 osds: 1 up, 1 in
+  
+  data:
+    pools:   2 pools, 16 pgs
+    objects: 21 objects, 2246 bytes
+    usage:   546 GB used, 384 GB / 931 GB avail
+    pgs:     16 active+clean
+  
+  
+  2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
+  2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
+  2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
+
+
+In addition to using ``ceph -w`` to print log lines as they are emitted,
+use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
+log.
+
+Monitoring Health Checks
+========================
+
+Ceph continuously runs various *health checks* against its own status.  When
+a health check fails, this is reflected in the output of ``ceph status`` (or
+``ceph health``).  In addition, messages are sent to the cluster log to
+indicate when a check fails, and when the cluster recovers.
+
+For example, when an OSD goes down, the ``health`` section of the status
+output may be updated as follows:
+
+::
+
+    health: HEALTH_WARN
+            1 osds down
+            Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
+
+At this time, cluster log messages are also emitted to record the failure of the 
+health checks:
+
+::
+
+    2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
+    2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
+
+When the OSD comes back online, the cluster log records the cluster's return
+to a healthy state:
+
+::
+
+    2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
+    2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
+    2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
+
+
+Detecting configuration issues
+==============================
+
+In addition to the health checks that Ceph continuously runs on its
+own status, there are some configuration issues that may only be detected
+by an external tool.
+
+Use the `ceph-medic`_ tool to run these additional checks on your Ceph
+cluster's configuration.
+
  Checking a Cluster's Usage Stats
  ================================
  
@@ -138,33 +217,6 @@ on the number of replicas, clones and snapshots.
     mon_osd_full_ratio.
  
  
-Checking a Cluster's Status
-===========================
-
-To check a cluster's status, execute the following:: 
-
-       ceph status
-       
-Or:: 
-
-       ceph -s
-
-In interactive mode, type ``status`` and press **Enter**. ::
-
-       ceph> status
-
-Ceph will print the cluster status. For example, a tiny Ceph  cluster consisting
-of one monitor, and two OSDs may print the following::
-
-    cluster b370a29d-9287-4ca3-ab57-3d824f65e339
-     health HEALTH_OK
-     monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
-     osdmap e63: 2 osds: 2 up, 2 in
-      pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
-            115 GB used, 167 GB / 297 GB avail
-                   1 active+clean+scrubbing+deep
-                 951 active+clean
-
  
  Checking OSD Status
  ===================
@@ -296,3 +348,4 @@ directly to the host in question ).
  
  .. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config
  .. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
+.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
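The cluster log lines shown under "Monitoring Health Checks" carry the health check identifier in a regular position, which is what lets tools key on it. The following is a minimal sketch, not part of this commit, of how a script might pull the identifier out of such lines. The function name ``parse_health_event`` and the regular expressions are hypothetical; they are inferred only from the sample log lines in this patch, and other Ceph releases may format these messages differently.

```python
import re

# Patterns inferred from the sample cluster log lines in this patch
# (an assumption, not a documented Ceph log format).
RAISED = re.compile(r"Health check (failed|update): (.*) \(([A-Z0-9_]+)\)$")
CLEARED = re.compile(r"Health check cleared: ([A-Z0-9_]+) \(was: (.*)\)$")


def parse_health_event(line):
    """Return a dict with 'event', 'code' and 'summary' keys for a health
    check log line, or None if the line is not a health check message."""
    line = line.rstrip()
    match = RAISED.search(line)
    if match:
        return {
            "event": match.group(1),    # "failed" or "update"
            "code": match.group(3),     # e.g. OSD_DOWN
            "summary": match.group(2),  # human-readable message
        }
    match = CLEARED.search(line)
    if match:
        return {
            "event": "cleared",
            "code": match.group(1),
            "summary": match.group(2),
        }
    return None


if __name__ == "__main__":
    sample = (
        "2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : "
        "cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)"
    )
    print(parse_health_event(sample))
```

Matching on the identifier (``OSD_DOWN``, ``PG_DEGRADED``) rather than on the summary text keeps such a script stable if the wording of the message changes.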