doc: Copied contents of rgw troubleshooting over to the new ops section.

author John Wilkins <john.wilkins@inktank.com>

Wed, 19 Sep 2012 23:25:11 +0000 (16:25 -0700)

committer John Wilkins <john.wilkins@inktank.com>

Wed, 19 Sep 2012 23:25:11 +0000 (16:25 -0700)
diff --git a/doc/radosgw/troubleshooting.rst b/doc/radosgw/troubleshooting.rst

new file mode 100644 (file)

index 0000000..19221a7
--- /dev/null
+++ b/doc/radosgw/troubleshooting.rst
@@ -0,0 +1,106 @@
+=================
+ Troubleshooting
+=================
+
+
+HTTP Request Errors
+===================
+
+Start by examining the web server's own access and error logs to
+identify what is going wrong.  A 500 error usually indicates a
+problem communicating with the ``radosgw`` daemon.  Ensure that the
+daemon is running, that its socket path is configured, and that the
+web server is looking for the socket in the proper
+location.
+
+
+Crashed ``radosgw`` process
+===========================
+
+If the ``radosgw`` process dies, you will normally see a 500 error
+from the web server (Apache, nginx, etc.).  In that situation,
+restarting ``radosgw`` will restore service.
+
+To diagnose the cause of the crash, check the log in ``/var/log/ceph``
+and/or the core file (if one was generated).
+
+
+Blocked ``radosgw`` Requests
+============================
+
+If some (or all) ``radosgw`` requests appear to be blocked, you can
+get some insight into the internal state of the ``radosgw`` daemon via
+its admin socket.  By default, the socket resides in
+``/var/run/ceph``, and the daemon can be queried with::
+
+ ceph --admin-daemon /var/run/ceph/client.rgw help
+ 
+ help                list available commands
+ objecter_requests   show in-progress osd requests
+ perfcounters_dump   dump perfcounters value
+ perfcounters_schema dump perfcounters schema
+ version             get protocol version
+
+The ``objecter_requests`` command is of particular interest::
+
+ ceph --admin-daemon /var/run/ceph/client.rgw objecter_requests
+ ...
+
+It dumps information about requests currently in progress with the
+RADOS cluster, which lets you identify any requests that are blocked
+by an unresponsive ``ceph-osd``.  For example, you might see::
+
+  { "ops": [
+        { "tid": 1858,
+          "pg": "2.d2041a48",
+          "osd": 1,
+          "last_sent": "2012-03-08 14:56:37.949872",
+          "attempts": 1,
+          "object_id": "fatty_25647_object1857",
+          "object_locator": "@2",
+          "snapid": "head",
+          "snap_context": "0=[]",
+          "mtime": "2012-03-08 14:56:37.949813",
+          "osd_ops": [
+                "write 0~4096"]},
+        { "tid": 1873,
+          "pg": "2.695e9f8e",
+          "osd": 1,
+          "last_sent": "2012-03-08 14:56:37.970615",
+          "attempts": 1,
+          "object_id": "fatty_25647_object1872",
+          "object_locator": "@2",
+          "snapid": "head",
+          "snap_context": "0=[]",
+          "mtime": "2012-03-08 14:56:37.970555",
+          "osd_ops": [
+                "write 0~4096"]}],
+  "linger_ops": [],
+  "pool_ops": [],
+  "pool_stat_ops": [],
+  "statfs_ops": []}
+
+In this dump, two requests are in progress.  The ``last_sent`` field is
+the time the RADOS request was sent.  If that time is well in the past,
+the OSD is probably not responding.  For example, for request 1858, you
+can find the OSDs that serve its placement group with::
+
+ ceph pg map 2.d2041a48
+ 
+ osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]
+
+This tells us to look at ``osd.1``, the primary copy for this PG::
+
+ ceph --admin-daemon /var/run/ceph/osd.1.asok dump_ops_in_flight
+ { "num_ops": 651,
+  "ops": [
+        { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
+          "received_at": "1331247573.344650",
+          "age": "25.606449",
+          "flag_point": "waiting for sub ops",
+          "client_info": { "client": "client.4124",
+              "tid": 1858}},
+ ...
+
+The ``flag_point`` field indicates that the OSD is currently waiting
+for replicas to respond, in this case ``osd.0``.
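
Note on the ``last_sent`` check described in the patch: the comparison can be scripted rather than done by eye. The following is a minimal sketch, not part of the commit. It assumes the ``objecter_requests`` output is valid JSON in the shape shown above and that ``last_sent`` is in the same timezone as the clock you compare against. The function name ``stale_requests`` and the 30-second threshold are hypothetical choices for illustration.

```python
import json
from datetime import datetime, timedelta

# Trimmed copy of the objecter_requests dump shown in the patch.
SAMPLE_DUMP = """
{ "ops": [
    { "tid": 1858, "pg": "2.d2041a48", "osd": 1,
      "last_sent": "2012-03-08 14:56:37.949872" },
    { "tid": 1873, "pg": "2.695e9f8e", "osd": 1,
      "last_sent": "2012-03-08 14:56:37.970615" }],
  "linger_ops": [] }
"""


def stale_requests(dump_text, now, max_age=timedelta(seconds=30)):
    """Return (tid, osd, pg, age) for each in-progress op older than max_age.

    max_age is an arbitrary illustrative threshold, not a Ceph default.
    """
    dump = json.loads(dump_text)
    stale = []
    for op in dump.get("ops", []):
        # last_sent has no timezone marker; it is parsed as a naive
        # timestamp and assumed comparable with `now`.
        sent = datetime.strptime(op["last_sent"], "%Y-%m-%d %H:%M:%S.%f")
        age = now - sent
        if age > max_age:
            stale.append((op["tid"], op["osd"], op["pg"], age))
    return stale


# Pretend "now" is about 52 seconds after the requests were sent.
now = datetime(2012, 3, 8, 14, 57, 30)
for tid, osd, pg, age in stale_requests(SAMPLE_DUMP, now):
    print(f"tid {tid} on osd.{osd} (pg {pg}) "
          f"sent {age.total_seconds():.1f}s ago")
```

Each reported ``pg`` value can then be passed to ``ceph pg map`` as described in the patch. In practice the dump text would come from the admin socket command rather than an inline string.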