+++ /dev/null
-==============================
- Troubleshooting OSDs and PGs
-==============================
-
-Before troubleshooting your OSDs, check your monitors and network first. If
-you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
-a health status, the return of a status means that the monitors have a quorum.
-If you don't have a monitor quorum or if there are errors with the monitor
-status, address the monitor issues first. Check your networks to ensure they
-are running properly, because networks may have a significant impact on OSD
-operation and performance.
-
-
-The Ceph Community
-==================
-
-The Ceph community is an excellent source of information and help. For
-operational issues with Ceph releases we recommend you `subscribe to the
-ceph-users email list`_. When you no longer want to receive emails, you can
-`unsubscribe from the ceph-users email list`_.
-
-If you have read through this guide and you have contacted ``ceph-users``,
-but you haven't resolved your issue, you may contact `Inktank`_ for support.
-
-You may also `subscribe to the ceph-devel email list`_. You should do so if
-your issue is:
-
-- Likely related to a bug
-- Related to a development release package
-- Related to a development testing package
-- Related to your own builds
-
-If you no longer want to receive emails from the ``ceph-devel`` email list, you
-may `unsubscribe from the ceph-devel email list`_.
-
-.. tip:: The Ceph community is growing rapidly, and community members can help
- you if you provide them with detailed information about your problem. See
- `Obtaining Data About OSDs`_ before you post questions to ensure that
- community members have sufficient data to help you.
-
-
-Obtaining Data About OSDs
-=========================
-
-A good first step in troubleshooting your OSDs is to obtain information in
-addition to the information you collected while `monitoring your OSDs`_
-(e.g., ``ceph osd tree``).
-
-
-Ceph Logs
----------
-
-If you haven't changed the default path, you can find Ceph log files at
-``/var/log/ceph``::
-
- ls /var/log/ceph
-
-If you don't get enough log detail, you can change your logging level. See
-`Ceph Logging and Debugging`_ and `Logging and Debugging Config Reference`_ in
-the Ceph Configuration documentation for details. Also, see `Debugging and
-Logging`_ in the Ceph Operations documentation to ensure that Ceph performs
-adequately under high logging volume.
-
-
-Admin Socket
-------------
-
-Use the admin socket tool to retrieve runtime information. For details, list
-the sockets for your Ceph processes::
-
- ls /var/run/ceph
-
-Then, execute the following, replacing ``{socket-name}`` with an actual
-socket name to show the list of available options::
-
- ceph --admin-daemon /var/run/ceph/{socket-name} help
-
-The admin socket, among other things, allows you to:
-
-- List your configuration at runtime
-- Dump historic operations
-- Dump the operation priority queue state
-- Dump operations in flight
-- Dump perfcounters
-
-
-Display Freespace
------------------
-
-Filesystem issues may arise. To display your filesystem's free space, execute
-``df``. ::
-
- df -h
-
-Execute ``df --help`` for additional usage.
-
-
-I/O Statistics
---------------
-
-Use `iostat`_ to identify I/O-related issues. ::
-
- iostat -x
-
-
-Diagnostic Messages
--------------------
-
-To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
-or ``tail``. For example::
-
- dmesg | grep scsi
-
-
-Stopping w/out Rebalancing
-==========================
-
-Periodically, you may need to perform maintenance on a subset of your cluster,
-or resolve a problem that affects a failure domain (e.g., a rack). If you do not
-want CRUSH to automatically rebalance the cluster as you stop OSDs for
-maintenance, set the cluster to ``noout`` first::
-
- ceph osd set noout
-
-Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
-failure domain that requires maintenance work. ::
-
- ceph osd stop osd.{num}
-
-.. note:: Placement groups within the OSDs you stop will become ``degraded``
- while you are addressing issues with within the failure domain.
-
-Once you have completed your maintenance, restart the OSDs. ::
-
- ceph osd start osd.{num}
-
-Finally, you must unset the cluster from ``noout``. ::
-
- ceph osd unset noout
-
-
-
-.. _osd-not-running:
-
-OSD Not Running
-===============
-
-Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
-allow it to rejoin the cluster and recover.
-
-An OSD Won't Start
-------------------
-
-If you start your cluster and an OSD won't start, check the following:
-
-- **Configuration File:** If you were not able to get OSDs running from
- a new installation, check your configuration file to ensure it conforms
- (e.g., ``host`` not ``hostname``, etc.).
-
-- **Check Paths:** Check the paths in your configuration, and the actual
- paths themselves for data and journals. If you separate the OSD data from
- the journal data and there are errors in your configuration file or in the
- actual mounts, you may have trouble starting OSDs. If you want to store the
- journal on a block device, you should partition your journal disk and assign
- one partition per OSD.
-
-- **Kernel Version:** Identify the kernel version and distribution you
- are using. Ceph uses some third party tools by default, which may be
- buggy or may conflict with certain distributions and/or kernel
- versions (e.g., Google perftools). Check the `OS recommendations`_
- to ensure you have addressed any issues related to your kernel.
-
-- **Segment Fault:** If there is a segment fault, turn your logging up
- (if it isn't already), and try again. If it segment faults again,
- contact the ceph-devel email list and provide your Ceph configuration
- file, your monitor output and the contents of your log file(s).
-
-If you cannot resolve the issue and the email list isn't helpful, you may
-contact `Inktank`_ for support.
-
-
-An OSD Failed
--------------
-
-When a ``ceph-osd`` process dies, the monitor will learn about the failure
-from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
-command::
-
- ceph health
- HEALTH_WARN 1/3 in osds are down
-
-Specifically, you will get a warning whenever there are ``ceph-osd``
-processes that are marked ``in`` and ``down``. You can identify which
-``ceph-osds`` are ``down`` with::
-
- ceph health detail
- HEALTH_WARN 1/3 in osds are down
- osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
-
-If there is a disk
-failure or other fault preventing ``ceph-osd`` from functioning or
-restarting, an error message should be present in its log file in
-``/var/log/ceph``.
-
-If the daemon stopped because of a heartbeat failure, the underlying
-kernel file system may be unresponsive. Check ``dmesg`` output for disk
-or other kernel errors.
-
-If the problem is a software error (failed assertion or other
-unexpected error), it should be reported to the `ceph-devel`_ email list.
-
-
-No Free Drive Space
--------------------
-
-Ceph prevents you from writing to a full OSD so that you don't lose data.
-In an operational cluster, you should receive a warning when your cluster
-is getting near its full ratio. The ``mon osd full ratio`` defaults to
-``0.95``, or 95% of capacity before it stops clients from writing data.
-The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
-when it generates a health warning.
-
-Full cluster issues usually arise when testing how Ceph handles an OSD
-failure on a small cluster. When one node has a high percentage of the
-cluster's data, the cluster can easily eclipse its nearfull and full ratio
-immediately. If you are testing how Ceph reacts to OSD failures on a small
-cluster, you should leave ample free disk space and consider temporarily
-lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
-
-Full ``ceph-osds`` will be reported by ``ceph health``::
-
- ceph health
- HEALTH_WARN 1 nearfull osds
- osd.2 is near full at 85%
-
-Or::
-
- ceph health
- HEALTH_ERR 1 nearfull osds, 1 full osds
- osd.2 is near full at 85%
- osd.3 is full at 97%
-
-The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
-the cluster to redistribute data to the newly available storage.
-
-If you cannot start an OSD because it is full, you may delete some data by deleting
-some placement group directories in the full OSD.
-
-.. important:: If you choose to delete a placement group directory on a full OSD,
- **DO NOT** delete the same placement group directory on another full OSD, or
- **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
- at least one OSD.
-
-
-OSDs are Slow/Unresponsive
-==========================
-
-A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
-have eliminated other troubleshooting possibilities before delving into OSD
-performance issues. For example, ensure that your network(s) is working properly
-and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
-
-.. tip:: Newer versions of Ceph provide better recovery handling by preventing
- recovering OSDs from using up system resources so that ``up`` and ``in``
- OSDs aren't available or are otherwise slow.
-
-
-Networking Issues
------------------
-
-Ceph is a distributed storage system, so it depends upon networks to peer with
-OSDs, replicate objects, recover from faults and check heartbeats. Networking
-issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
-details.
-
-Ensure that Ceph processes and Ceph-dependent processes are connected and/or
-listening. ::
-
- netstat -a | grep ceph
- netstat -l | grep ceph
- sudo netstat -p | grep ceph
-
-Check network statistics. ::
-
- netstat -s
-
-
-Drive Configuration
--------------------
-
-A storage drive should only support one OSD. Sequential read and sequential
-write throughput can bottleneck if other processes share the drive, including
-journals, operating systems, monitors, other OSDs and non-Ceph processes.
-
-Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
-option to accelerate the response time--particularly when using the ``ext4`` or
-XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal
-simultaneously.
-
-.. note:: Partitioning a drive does not change its total throughput or
- sequential read/write limits. Running a journal in a separate partition
- may help, but you should prefer a separate physical drive.
-
-
-Bad Sectors / Fragmented Disk
------------------------------
-
-Check your disks for bad sectors and fragmentation. This can cause total throughput
-to drop substantially.
-
-
-Co-resident Monitors/OSDs
--------------------------
-
-Monitors are generally light-weight processes, but they do lots of ``fsync()``,
-which can interfere with other workloads, particularly if monitors run on the
-same drive as your OSDs. Additionally, if you run monitors on the same host as
-the OSDs, you may incur performance issues related to:
-
-- Running an older kernel (pre-3.0)
-- Running Argonaut with an old ``glibc``
-- Running a kernel with no syncfs(2) syscall.
-
-In these cases, multiple OSDs running on the same host can drag each other down
-by doing lots of commits. That often leads to the bursty writes.
-
-
-Co-resident Processes
----------------------
-
-Spinning up co-resident processes such as a cloud-based solution, virtual
-machines and other applications that write data to Ceph while operating on the
-same hardware as OSDs can introduce significant OSD latency. Generally, we
-recommend optimizing a host for use with Ceph and using other hosts for other
-processes. The practice of separating Ceph operations from other applications
-may help improve performance and may streamline troubleshooting and maintenance.
-
-
-Logging Levels
---------------
-
-If you turned logging levels up to track an issue and then forgot to turn
-logging levels back down, the OSD may be putting a lot of logs onto the disk. If
-you intend to keep logging levels high, you may consider mounting a drive to the
-default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
-
-
-Recovery Throttling
--------------------
-
-Depending upon your configuration, Ceph may reduce recovery rates to maintain
-performance or it may increase recovery rates to the point that recovery
-impacts OSD performance. Check to see if the OSD is recovering.
-
-
-Kernel Version
---------------
-
-Check the kernel version you are running. Older kernels may not receive
-new backports that Ceph depends upon for better performance.
-
-
-Kernel Issues with SyncFS
--------------------------
-
-Try running one OSD per host to see if performance improves. Old kernels
-might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
-
-
-Filesystem Issues
------------------
-
-Currently, we recommend deploying clusters with XFS or ext4. The btrfs
-filesystem has many attractive features, but bugs in the filesystem may
-lead to performance issues.
-
-
-Insufficient RAM
-----------------
-
-We recommend 1GB of RAM per OSD daemon. You may notice that during normal
-operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
-Unused RAM makes it tempting to use the excess RAM for co-resident applications,
-VMs and so forth. However, when OSDs go into recovery mode, their memory
-utilization spikes. If there is no RAM available, the OSD performance will slow
-considerably.
-
-
-Old Requests or Slow Requests
------------------------------
-
-If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
-complaining about requests that are taking too long. The warning threshold
-defaults to 30 seconds, and is configurable via the ``osd op complaint time``
-option. When this happens, the cluster log will receive messages.
-
-Legacy versions of Ceph complain about 'old requests`::
-
- osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
-
-New versions of Ceph complain about 'slow requests`::
-
- {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
- {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
-
-
-Possible causes include:
-
-- A bad drive (check ``dmesg`` output)
-- A bug in the kernel file system bug (check ``dmesg`` output)
-- An overloaded cluster (check system load, iostat, etc.)
-- A bug in the ``ceph-osd`` daemon.
-
-Possible solutions
-
-- Remove VMs Cloud Solutions from Ceph Hosts
-- Upgrade Kernel
-- Upgrade Ceph
-- Restart OSDs
-
-
-
-Flapping OSDs
-=============
-
-We recommend using both a public (front-end) network and a cluster (back-end)
-network so that you can better meet the capacity requirements of object replication. Another
-advantage is that you can run a cluster network such that it isn't connected to
-the internet, thereby preventing some denial of service attacks. When OSDs peer
-and check heartbeats, they use the cluster (back-end) network when it's available.
-See `Monitor/OSD Interaction`_ for details.
-
-However, if the cluster (back-end) network fails or develops significant latency
-while the public (front-end) network operates optimally, OSDs currently do not
-handle this situation well. What happens is that OSDs mark each other ``down``
-on the monitor, while marking themselves ``up``. We call this scenario 'flapping`.
-
-If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and then
-``up`` again), you can force the monitors to stop the flapping with::
-
- ceph osd set noup # prevent osds from getting marked up
- ceph osd set nodown # prevent osds from getting marked down
-
-These flags are recorded in the osdmap structure::
-
- ceph osd dump | grep flags
- flags no-up,no-down
-
-You can clear the flags with::
-
- ceph osd unset noup
- ceph osd unset nodown
-
-Two other flags are supported, ``noin`` and ``noout``, which prevent
-booting OSDs from being marked ``in`` (allocated data) or down
-ceph-osds from eventually being marked ``out`` (regardless of what the
-current value for ``mon osd down out interval`` is).
-
-.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
- sense that once the flags are cleared, the action they were blocking
- should occur shortly after. The ``noin`` flag, on the other hand,
- prevents OSDs from being marked ``in`` on boot, and any daemons that
- started while the flag was set will remain that way.
-
-
-
-Troubleshooting PG Errors
-=========================
-
-
-Placement Groups Never Get Clean
---------------------------------
-
-There are a few cases where Ceph placement groups never get clean:
-
-#. **One OSD:** If you deviate from the quick start and use only one OSD, you
- will likely run into problems. OSDs report other OSDs to the monitor, and
- also interact with other OSDs when replicating data. If you have only one
- OSD, a second OSD cannot check its heartbeat. Also, if you remove an OSD
- and have only one OSD remaining, you may encounter problems. An secondary
- or tertiary OSD expects another OSD to tell it which placement groups it
- should have. The lack of another OSD prevents this from occurring. So a
- placement group can remain stuck “stale” forever.
-
-#. **Pool Size = 1**: If you have only one copy of an object, no other OSD will
- tell the OSD which objects it should have. For each placement group mapped
- to the remaining OSD (see ``ceph pg dump``), you can force the OSD to notice
- the placement groups it needs by running::
-
- ceph pg force_create_pg <pgid>
-
-As a general rule, you should run your cluster with more than one OSD and a
-pool size greater than 1 object replica.
-
-
-Stuck Placement Groups
-----------------------
-
-It is normal for placement groups to enter states like "degraded" or "peering"
-following a failure. Normally these states indicate the normal progression
-through the failure recovery process. However, if a placement group stays in one
-of these states for a long time this may be an indication of a larger problem.
-For this reason, the monitor will warn when placement groups get "stuck" in a
-non-optimal state. Specifically, we check for:
-
-* ``inactive`` - The placement group has not been ``active`` for too long
- (i.e., it hasn't been able to service read/write requests).
-
-* ``unclean`` - The placement group has not been ``clean`` for too long
- (i.e., it hasn't been able to completely recover from a previous failure).
-
-* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
- indicating that all nodes storing this placement group may be ``down``.
-
-You can explicitly list stuck placement groups with one of::
-
- ceph pg dump_stuck stale
- ceph pg dump_stuck inactive
- ceph pg dump_stuck unclean
-
-For stuck ``stale`` placement groups, it is normally a matter of getting the
-right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
-groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
-stuck ``unclean`` placement groups, there is usually something preventing
-recovery from completing, like unfound objects (see
-:ref:`failures-osd-unfound`);
-
-
-
-.. _failures-osd-peering:
-
-Placement Group Down - Peering Failure
---------------------------------------
-
-In certain cases, the ``ceph-osd`` `Peering` process can run into
-problems, preventing a PG from becoming active and usable. For
-example, ``ceph health`` might report::
-
- ceph health detail
- HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
- ...
- pg 0.5 is down+peering
- pg 1.4 is down+peering
- ...
- osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
-
-We can query the cluster to determine exactly why the PG is marked ``down`` with::
-
- ceph pg 0.5 query
-
-.. code-block:: javascript
-
- { "state": "down+peering",
- ...
- "recovery_state": [
- { "name": "Started\/Primary\/Peering\/GetInfo",
- "enter_time": "2012-03-06 14:40:16.169679",
- "requested_info_from": []},
- { "name": "Started\/Primary\/Peering",
- "enter_time": "2012-03-06 14:40:16.169659",
- "probing_osds": [
- 0,
- 1],
- "blocked": "peering is blocked due to down osds",
- "down_osds_we_would_probe": [
- 1],
- "peering_blocked_by": [
- { "osd": 1,
- "current_lost_at": 0,
- "comment": "starting or marking this osd lost may let us proceed"}]},
- { "name": "Started",
- "enter_time": "2012-03-06 14:40:16.169513"}
- ]
- }
-
-The ``recovery_state`` section tells us that peering is blocked due to
-down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd``
-and things will recover.
-
-Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
-failure), we can tell the cluster that it is ``lost`` and to cope as
-best it can.
-
-.. important:: This is dangerous in that the cluster cannot
- guarantee that the other copies of the data are consistent
- and up to date.
-
-To instruct Ceph to continue anyway::
-
- ceph osd lost 1
-
-Recovery will proceed.
-
-
-.. _failures-osd-unfound:
-
-Unfound Objects
----------------
-
-Under certain combinations of failures Ceph may complain about
-``unfound`` objects::
-
- ceph health detail
- HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
- pg 2.4 is active+degraded, 78 unfound
-
-This means that the storage cluster knows that some objects (or newer
-copies of existing objects) exist, but it hasn't found copies of them.
-One example of how this might come about for a PG whose data is on ceph-osds
-1 and 2:
-
-* 1 goes down
-* 2 handles some writes, alone
-* 1 comes up
-* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
-* Before the new objects are copied, 2 goes down.
-
-Now 1 knows that these object exist, but there is no live ``ceph-osd`` who
-has a copy. In this case, IO to those objects will block, and the
-cluster will hope that the failed node comes back soon; this is
-assumed to be preferable to returning an IO error to the user.
-
-First, you can identify which objects are unfound with::
-
- ceph pg 2.4 list_missing [starting offset, in json]
-
-.. code-block:: javascript
-
- { "offset": { "oid": "",
- "key": "",
- "snapid": 0,
- "hash": 0,
- "max": 0},
- "num_missing": 0,
- "num_unfound": 0,
- "objects": [
- { "oid": "object 1",
- "key": "",
- "hash": 0,
- "max": 0 },
- ...
- ],
- "more": 0}
-
-If there are too many objects to list in a single result, the ``more``
-field will be true and you can query for more. (Eventually the
-command line tool will hide this from you, but not yet.)
-
-Second, you can identify which OSDs have been probed or might contain
-data::
-
- ceph pg 2.4 query
-
-.. code-block:: javascript
-
- "recovery_state": [
- { "name": "Started\/Primary\/Active",
- "enter_time": "2012-03-06 15:15:46.713212",
- "might_have_unfound": [
- { "osd": 1,
- "status": "osd is down"}]},
-
-In this case, for example, the cluster knows that ``osd.1`` might have
-data, but it is ``down``. The full range of possible states include::
-
- * already probed
- * querying
- * osd is down
- * not queried (yet)
-
-Sometimes it simply takes some time for the cluster to query possible
-locations.
-
-It is possible that there are other locations where the object can
-exist that are not listed. For example, if a ceph-osd is stopped and
-taken out of the cluster, the cluster fully recovers, and due to some
-future set of failures ends up with an unfound object, it won't
-consider the long-departed ceph-osd as a potential location to
-consider. (This scenario, however, is unlikely.)
-
-If all possible locations have been queried and objects are still
-lost, you may have to give up on the lost objects. This, again, is
-possible given unusual combinations of failures that allow the cluster
-to learn about writes that were performed before the writes themselves
-are recovered. To mark the "unfound" objects as "lost"::
-
- ceph pg 2.5 mark_unfound_lost revert
-
-This the final argument specifies how the cluster should deal with
-lost objects. Currently the only supported option is "revert", which
-will either roll back to a previous version of the object or (if it
-was a new object) forget about it entirely. Use this with caution, as
-it may confuse applications that expected the object to exist.
-
-
-Homeless Placement Groups
--------------------------
-
-It is possible for all OSDs that had copies of a given placement groups to fail.
-If that's the case, that subset of the object store is unavailable, and the
-monitor will receive no status updates for those placement groups. To detect
-this situation, the monitor marks any placement group whose primary OSD has
-failed as ``stale``. For example::
-
- ceph health
- HEALTH_WARN 24 pgs stale; 3/300 in osds are down
-
-You can identify which placement groups are ``stale``, and what the last OSDs to
-store them were, with::
-
- ceph health detail
- HEALTH_WARN 24 pgs stale; 3/300 in osds are down
- ...
- pg 2.5 is stuck stale+active+remapped, last acting [2,0]
- ...
- osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
- osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
- osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
-
-If we want to get placement group 2.5 back online, for example, this tells us that
-it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
-daemons will allow the cluster to recover that placement group (and, presumably,
-many others).
-
-
-
-.. _iostat: http://en.wikipedia.org/wiki/Iostat
-.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
-.. _Logging and Debugging Config Reference: ../../configuration/log-and-debug-ref
-.. _Debugging and Logging: ../debug
-.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
-.. _monitoring your OSDs: ../monitoring-osd-pg
-.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
-.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
-.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users
-.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users
-.. _Inktank: http://inktank.com
-.. _OS recommendations: ../../../install/os-recommendations
-.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file
--- /dev/null
+==============================
+ Troubleshooting OSDs and PGs
+==============================
+
+Before troubleshooting your OSDs, check your monitors and network first. If
+you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
+a health status, it means that the monitors have a quorum.
+If you don't have a monitor quorum or if there are errors with the monitor
+status, address the monitor issues first. Check your networks to ensure they
+are running properly, because networks may have a significant impact on OSD
+operation and performance.
+
+
+
+Obtaining Data About OSDs
+=========================
+
+A good first step in troubleshooting your OSDs is to obtain information in
+addition to the information you collected while `monitoring your OSDs`_
+(e.g., ``ceph osd tree``).
+
+
+Ceph Logs
+---------
+
+If you haven't changed the default path, you can find Ceph log files at
+``/var/log/ceph``::
+
+ ls /var/log/ceph
+
+If you don't get enough log detail, you can change your logging level. See
+`Logging and Debugging`_ for details on adjusting logging levels and on
+ensuring that Ceph performs adequately under high logging volume.
+
+
+Admin Socket
+------------
+
+Use the admin socket tool to retrieve runtime information. For details, list
+the sockets for your Ceph processes::
+
+ ls /var/run/ceph
+
+Then, execute the following, replacing ``{socket-name}`` with an actual
+socket name to show the list of available options::
+
+ ceph --admin-daemon /var/run/ceph/{socket-name} help
+
+The admin socket, among other things, allows you to:
+
+- List your configuration at runtime
+- Dump historic operations
+- Dump the operation priority queue state
+- Dump operations in flight
+- Dump perfcounters
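+
+For example, on many releases you can dump an OSD's runtime configuration and
+performance counters as shown below. The socket name ``ceph-osd.0.asok`` is
+only an example; substitute a name from ``ls /var/run/ceph``, and run ``help``
+first, since the available command names vary by release. ::
+
+    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
+    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump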
+
+
+Display Freespace
+-----------------
+
+Filesystem issues may arise. To display your filesystem's free space, execute
+``df``. ::
+
+ df -h
+
+Execute ``df --help`` for additional usage.
+
+
+I/O Statistics
+--------------
+
+Use `iostat`_ to identify I/O-related issues. ::
+
+ iostat -x
+
+
+Diagnostic Messages
+-------------------
+
+To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
+or ``tail``. For example::
+
+ dmesg | grep scsi
+
+
+Stopping w/out Rebalancing
+==========================
+
+Periodically, you may need to perform maintenance on a subset of your cluster,
+or resolve a problem that affects a failure domain (e.g., a rack). If you do not
+want CRUSH to automatically rebalance the cluster as you stop OSDs for
+maintenance, set the cluster to ``noout`` first::
+
+ ceph osd set noout
+
+Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
+failure domain that requires maintenance work. ::
+
+ ceph osd stop osd.{num}
+
+.. note:: Placement groups within the OSDs you stop will become ``degraded``
+   while you are addressing issues within the failure domain.
+
+Once you have completed your maintenance, restart the OSDs. ::
+
+ ceph osd start osd.{num}
+
+Finally, you must unset the cluster from ``noout``. ::
+
+ ceph osd unset noout
+
+
+
+.. _osd-not-running:
+
+OSD Not Running
+===============
+
+Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
+allow it to rejoin the cluster and recover.
+
+An OSD Won't Start
+------------------
+
+If you start your cluster and an OSD won't start, check the following:
+
+- **Configuration File:** If you were not able to get OSDs running from
+  a new installation, check your configuration file to ensure it conforms to
+  the expected syntax and option names (e.g., ``host`` not ``hostname``, etc.).
+
+- **Check Paths:** Check the paths in your configuration, and the actual
+ paths themselves for data and journals. If you separate the OSD data from
+ the journal data and there are errors in your configuration file or in the
+ actual mounts, you may have trouble starting OSDs. If you want to store the
+ journal on a block device, you should partition your journal disk and assign
+ one partition per OSD.
+
+- **Kernel Version:** Identify the kernel version and distribution you
+ are using. Ceph uses some third party tools by default, which may be
+ buggy or may conflict with certain distributions and/or kernel
+ versions (e.g., Google perftools). Check the `OS recommendations`_
+ to ensure you have addressed any issues related to your kernel.
+
+- **Segmentation Fault:** If there is a segmentation fault, turn your logging
+  up (if it isn't already), and try again. If it segfaults again, contact the
+  ceph-devel email list and provide your Ceph configuration file, your monitor
+  output and the contents of your log file(s).
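+
+To illustrate the **Check Paths** item above, here is a minimal, hypothetical
+``ceph.conf`` snippet. The host, path and device names are examples only and
+must match your actual mounts and partitions. ::
+
+    [osd.0]
+        host = node1
+        # OSD data at the default location
+        osd data = /var/lib/ceph/osd/ceph-0
+        # journal on a dedicated partition of a separate disk (example device)
+        osd journal = /dev/sdc1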
+
+If you cannot resolve the issue and the email list isn't helpful, you may
+contact `Inktank`_ for support.
+
+
+An OSD Failed
+-------------
+
+When a ``ceph-osd`` process dies, the monitor will learn about the failure
+from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
+command::
+
+ ceph health
+ HEALTH_WARN 1/3 in osds are down
+
+Specifically, you will get a warning whenever there are ``ceph-osd``
+processes that are marked ``in`` and ``down``. You can identify which
+``ceph-osds`` are ``down`` with::
+
+ ceph health detail
+ HEALTH_WARN 1/3 in osds are down
+ osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+If there is a disk
+failure or other fault preventing ``ceph-osd`` from functioning or
+restarting, an error message should be present in its log file in
+``/var/log/ceph``.
+
+If the daemon stopped because of a heartbeat failure, the underlying
+kernel file system may be unresponsive. Check ``dmesg`` output for disk
+or other kernel errors.
+
+If the problem is a software error (failed assertion or other
+unexpected error), it should be reported to the `ceph-devel`_ email list.
+
+
+No Free Drive Space
+-------------------
+
+Ceph prevents you from writing to a full OSD so that you don't lose data.
+In an operational cluster, you should receive a warning when your cluster
+is getting near its full ratio. The ``mon osd full ratio`` defaults to
+``0.95``: at 95% of capacity, Ceph stops clients from writing data.
+The ``mon osd nearfull ratio`` defaults to ``0.85``: at 85% of capacity,
+Ceph generates a health warning.
+
+Full cluster issues usually arise when testing how Ceph handles an OSD
+failure on a small cluster. When one node has a high percentage of the
+cluster's data, the cluster can easily exceed its nearfull and full ratios
+immediately. If you are testing how Ceph reacts to OSD failures on a small
+cluster, you should leave ample free disk space and consider temporarily
+lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
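+
+For example, to exercise the full-cluster behavior without actually filling
+your disks, you might temporarily lower the thresholds in the ``[mon]``
+section of ``ceph.conf``; the values below are only illustrative. ::
+
+    [mon]
+        mon osd full ratio = .80
+        mon osd nearfull ratio = .70
+
+Remember to restore the defaults once your testing is complete.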
+
+Full ``ceph-osds`` will be reported by ``ceph health``::
+
+ ceph health
+ HEALTH_WARN 1 nearfull osds
+ osd.2 is near full at 85%
+
+Or::
+
+ ceph health
+ HEALTH_ERR 1 nearfull osds, 1 full osds
+ osd.2 is near full at 85%
+ osd.3 is full at 97%
+
+The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
+the cluster to redistribute data to the newly available storage.
+
+If you cannot start an OSD because it is full, you may free up some space by
+deleting a few placement group directories in the full OSD.
+
+.. important:: If you choose to delete a placement group directory on a full OSD,
+ **DO NOT** delete the same placement group directory on another full OSD, or
+ **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
+ at least one OSD.
+
+See `Monitor Config Reference`_ for additional details.
+
+
+OSDs are Slow/Unresponsive
+==========================
+
+A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+have eliminated other troubleshooting possibilities before delving into OSD
+performance issues. For example, ensure that your networks are working properly
+and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
+
+.. tip:: Newer versions of Ceph provide better recovery handling by preventing
+   recovering OSDs from using up so many system resources that ``up`` and
+   ``in`` OSDs become unavailable or otherwise slow.
+
+
+Networking Issues
+-----------------
+
+Ceph is a distributed storage system, so it depends upon networks for OSD
+peering, object replication, fault recovery and heartbeat checks. Networking
+issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
+details.
+
+Ensure that Ceph processes and Ceph-dependent processes are connected and/or
+listening. ::
+
+ netstat -a | grep ceph
+ netstat -l | grep ceph
+ sudo netstat -p | grep ceph
+
+Check network statistics. ::
+
+ netstat -s
+
+
+Drive Configuration
+-------------------
+
+A storage drive should only support one OSD. Sequential read and sequential
+write throughput can bottleneck if other processes share the drive, including
+journals, operating systems, monitors, other OSDs and non-Ceph processes.
+
+Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
+option to accelerate the response time--particularly when using the ``ext4`` or
+XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal
+simultaneously.
+
+.. note:: Partitioning a drive does not change its total throughput or
+ sequential read/write limits. Running a journal in a separate partition
+ may help, but you should prefer a separate physical drive.
+
+
+Bad Sectors / Fragmented Disk
+-----------------------------
+
+Check your disks for bad sectors and fragmentation. This can cause total throughput
+to drop substantially.
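+
+A quick check with ``smartmontools``, assuming the drive is ``/dev/sda``
+(substitute your own device)::
+
+    sudo smartctl -H /dev/sda    # overall SMART health assessment
+    sudo smartctl -A /dev/sda    # attributes; watch reallocated/pending sectors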
+
+
+Co-resident Monitors/OSDs
+-------------------------
+
+Monitors are generally light-weight processes, but they do lots of ``fsync()``,
+which can interfere with other workloads, particularly if monitors run on the
+same drive as your OSDs. Additionally, if you run monitors on the same host as
+the OSDs, you may incur performance issues related to:
+
+- Running an older kernel (pre-3.0)
+- Running Argonaut with an old ``glibc``
+- Running a kernel with no ``syncfs(2)`` syscall
+
+In these cases, multiple OSDs running on the same host can drag each other down
+by doing lots of commits. That often leads to bursty writes.
+
+
+Co-resident Processes
+---------------------
+
+Spinning up co-resident processes such as cloud-based solutions, virtual
+machines and other applications that write data to Ceph while operating on the
+same hardware as OSDs can introduce significant OSD latency. Generally, we
+recommend optimizing a host for use with Ceph and using other hosts for other
+processes. The practice of separating Ceph operations from other applications
+may help improve performance and may streamline troubleshooting and maintenance.
+
+
+Logging Levels
+--------------
+
+If you turned logging levels up to track an issue and then forgot to turn
+logging levels back down, the OSD may be putting a lot of logs onto the disk. If
+you intend to keep logging levels high, you may consider mounting a drive to the
+default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
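+
+For example, to turn commonly raised debug subsystems back down at runtime
+(a sketch; adjust the subsystems and levels to the ones you actually raised,
+and repeat for each OSD)::
+
+    ceph tell osd.0 injectargs '--debug-osd 0/5 --debug-ms 0/5'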
+
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+performance or it may increase recovery rates to the point that recovery
+impacts OSD performance. Check to see if the OSD is recovering.
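+
+A sketch of how you might check for, and then throttle, recovery; the option
+values are illustrative and the available options vary by release. ::
+
+    ceph -s    # look for PGs in recovering or backfilling states
+    ceph -w    # watch recovery progress in the cluster log
+    ceph tell osd.0 injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'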
+
+
+Kernel Version
+--------------
+
+Check the kernel version you are running. Older kernels may not receive
+new backports that Ceph depends upon for better performance.
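+
+A quick way to record both; ``lsb_release`` may not be installed on every
+distribution. ::
+
+    uname -r         # kernel release
+    lsb_release -sd  # distribution description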
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+Try running one OSD per host to see if performance improves. Old kernels
+might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+
+
+Filesystem Issues
+-----------------
+
+Currently, we recommend deploying clusters with XFS or ext4. The btrfs
+filesystem has many attractive features, but bugs in the filesystem may
+lead to performance issues.
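+
+To confirm which filesystem backs each OSD data directory, assuming the
+default data path::
+
+    df -T /var/lib/ceph/osd/*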
+
+
+Insufficient RAM
+----------------
+
+We recommend 1GB of RAM per OSD daemon. You may notice that during normal
+operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
+Unused RAM makes it tempting to use the excess RAM for co-resident applications,
+VMs and so forth. However, when OSDs go into recovery mode, their memory
+utilization spikes. If there is no RAM available, the OSD performance will slow
+considerably.
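+
+To check how much memory each OSD daemon is actually using, and how much the
+host has left::
+
+    free -m
+    ps -C ceph-osd -o pid,rss,vsz,args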
+
+
+Old Requests or Slow Requests
+-----------------------------
+
+If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
+complaining about requests that are taking too long. The warning threshold
+defaults to 30 seconds, and is configurable via the ``osd op complaint time``
+option. When this happens, the cluster log will receive messages.
+
+Legacy versions of Ceph complain about ``old requests``::
+
+ osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about ``slow requests``::
+
+ {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+ {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A bug in the kernel file system (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon.
+
+Possible solutions include:
+
+- Remove VMs and cloud solutions from Ceph hosts
+- Upgrade your kernel
+- Upgrade Ceph
+- Restart OSDs
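+
+Whatever the cause, you can see which operations are implicated by searching
+the logs (a sketch assuming the default log location)::
+
+    grep 'slow request' /var/log/ceph/*.log
+    grep 'old request' /var/log/ceph/*.log    # legacy releases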
+
+
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it isn't connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+'flapping'.
+
+If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+ ceph osd set noup # prevent osds from getting marked up
+ ceph osd set nodown # prevent osds from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+ ceph osd dump | grep flags
+ flags no-up,no-down
+
+You can clear the flags with::
+
+ ceph osd unset noup
+ ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``. The ``noin`` flag
+prevents booting OSDs from being marked ``in`` (allocated data), and the
+``noout`` flag prevents ``down`` ceph-osds from eventually being marked
+``out`` (regardless of the current value of ``mon osd down out interval``).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+ sense that once the flags are cleared, the action they were blocking
+ should occur shortly after. The ``noin`` flag, on the other hand,
+ prevents OSDs from being marked ``in`` on boot, and any daemons that
+ started while the flag was set will remain that way.
+
+
+
+Troubleshooting PG Errors
+=========================
+
+
+Placement Groups Never Get Clean
+--------------------------------
+
+There are a few cases where Ceph placement groups never get clean:
+
+#. **One OSD:** If you deviate from the quick start and use only one OSD, you
+   will likely run into problems. OSDs report other OSDs to the monitor, and
+   also interact with other OSDs when replicating data. If you have only one
+   OSD, a second OSD cannot check its heartbeat. Also, if you remove an OSD
+   and have only one OSD remaining, you may encounter problems. A secondary
+   or tertiary OSD expects another OSD to tell it which placement groups it
+   should have. The lack of another OSD prevents this from occurring. So a
+   placement group can remain stuck ``stale`` forever.
+
+#. **Pool Size = 1**: If you have only one copy of an object, no other OSD will
+ tell the OSD which objects it should have. For each placement group mapped
+ to the remaining OSD (see ``ceph pg dump``), you can force the OSD to notice
+ the placement groups it needs by running::
+
+ ceph pg force_create_pg <pgid>
+
+As a general rule, you should run your cluster with more than one OSD and a
+pool size greater than 1 (i.e., more than one object replica).
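+
+For example, to check the replica count of each pool and raise the size of a
+pool named ``data`` (the pool name and size are examples; the exact ``ceph osd
+dump`` output format varies by release)::
+
+    ceph osd dump | grep 'rep size'
+    ceph osd pool set data size 2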
+
+
+Stuck Placement Groups
+----------------------
+
+It is normal for placement groups to enter states like "degraded" or "peering"
+following a failure. These states indicate the normal progression
+through the failure recovery process. However, if a placement group stays in one
+of these states for a long time this may be an indication of a larger problem.
+For this reason, the monitor will warn when placement groups get "stuck" in a
+non-optimal state. Specifically, we check for:
+
+* ``inactive`` - The placement group has not been ``active`` for too long
+ (i.e., it hasn't been able to service read/write requests).
+
+* ``unclean`` - The placement group has not been ``clean`` for too long
+ (i.e., it hasn't been able to completely recover from a previous failure).
+
+* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
+ indicating that all nodes storing this placement group may be ``down``.
+
+You can explicitly list stuck placement groups with one of::
+
+ ceph pg dump_stuck stale
+ ceph pg dump_stuck inactive
+ ceph pg dump_stuck unclean
+
+For stuck ``stale`` placement groups, it is normally a matter of getting the
+right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
+groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
+stuck ``unclean`` placement groups, there is usually something preventing
+recovery from completing, like unfound objects (see
+:ref:`failures-osd-unfound`).
+
+
+
+.. _failures-osd-peering:
+
+Placement Group Down - Peering Failure
+--------------------------------------
+
+In certain cases, the ``ceph-osd`` `Peering` process can run into
+problems, preventing a PG from becoming active and usable. For
+example, ``ceph health`` might report::
+
+ ceph health detail
+ HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+ ...
+ pg 0.5 is down+peering
+ pg 1.4 is down+peering
+ ...
+ osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+
+We can query the cluster to determine exactly why the PG is marked ``down`` with::
+
+ ceph pg 0.5 query
+
+.. code-block:: javascript
+
+ { "state": "down+peering",
+ ...
+ "recovery_state": [
+ { "name": "Started\/Primary\/Peering\/GetInfo",
+ "enter_time": "2012-03-06 14:40:16.169679",
+ "requested_info_from": []},
+ { "name": "Started\/Primary\/Peering",
+ "enter_time": "2012-03-06 14:40:16.169659",
+ "probing_osds": [
+ 0,
+ 1],
+ "blocked": "peering is blocked due to down osds",
+ "down_osds_we_would_probe": [
+ 1],
+ "peering_blocked_by": [
+ { "osd": 1,
+ "current_lost_at": 0,
+ "comment": "starting or marking this osd lost may let us proceed"}]},
+ { "name": "Started",
+ "enter_time": "2012-03-06 14:40:16.169513"}
+ ]
+ }
+
+The ``recovery_state`` section tells us that peering is blocked due to
+down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd``
+and things will recover.
+
+Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
+failure), we can tell the cluster that it is ``lost`` and to cope as
+best it can.
+
+.. important:: This is dangerous in that the cluster cannot
+ guarantee that the other copies of the data are consistent
+ and up to date.
+
+To instruct Ceph to continue anyway::
+
+ ceph osd lost 1
+
+Recovery will proceed.
+
+
+.. _failures-osd-unfound:
+
+Unfound Objects
+---------------
+
+Under certain combinations of failures Ceph may complain about
+``unfound`` objects::
+
+ ceph health detail
+ HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
+ pg 2.4 is active+degraded, 78 unfound
+
+This means that the storage cluster knows that some objects (or newer
+copies of existing objects) exist, but it hasn't found copies of them.
+Here is an example of how this might come about for a PG whose data is on
+ceph-osds 1 and 2:
+
+* 1 goes down
+* 2 handles some writes, alone
+* 1 comes up
+* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
+* Before the new objects are copied, 2 goes down.
+
+Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
+has a copy. In this case, IO to those objects will block, and the
+cluster will hope that the failed node comes back soon; this is
+assumed to be preferable to returning an IO error to the user.
+
+First, you can identify which objects are unfound with::
+
+ ceph pg 2.4 list_missing [starting offset, in json]
+
+.. code-block:: javascript
+
+ { "offset": { "oid": "",
+ "key": "",
+ "snapid": 0,
+ "hash": 0,
+ "max": 0},
+ "num_missing": 0,
+ "num_unfound": 0,
+ "objects": [
+ { "oid": "object 1",
+ "key": "",
+ "hash": 0,
+ "max": 0 },
+ ...
+ ],
+ "more": 0}
+
+If there are too many objects to list in a single result, the ``more``
+field will be true and you can query for more. (Eventually the
+command line tool will hide this from you, but not yet.)
+
+Second, you can identify which OSDs have been probed or might contain
+data::
+
+ ceph pg 2.4 query
+
+.. code-block:: javascript
+
+ "recovery_state": [
+ { "name": "Started\/Primary\/Active",
+ "enter_time": "2012-03-06 15:15:46.713212",
+ "might_have_unfound": [
+ { "osd": 1,
+ "status": "osd is down"}]},
+
+In this case, for example, the cluster knows that ``osd.1`` might have
+data, but it is ``down``. The full range of possible states includes::
+
+ * already probed
+ * querying
+ * osd is down
+ * not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object can
+exist that are not listed. For example, if a ceph-osd is stopped and
+taken out of the cluster, the cluster fully recovers, and due to some
+future set of failures ends up with an unfound object, it won't
+consider the long-departed ceph-osd as a potential location.
+(This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This, again, is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered. To mark the "unfound" objects as "lost"::
+
+    ceph pg 2.4 mark_unfound_lost revert
+
+The final argument specifies how the cluster should deal with
+lost objects. Currently the only supported option is "revert", which
+will either roll back to a previous version of the object or (if it
+was a new object) forget about it entirely. Use this with caution, as
+it may confuse applications that expected the object to exist.
+
+
+Homeless Placement Groups
+-------------------------
+
+It is possible for all OSDs that had copies of a given placement group to fail.
+If that's the case, that subset of the object store is unavailable, and the
+monitor will receive no status updates for those placement groups. To detect
+this situation, the monitor marks any placement group whose primary OSD has
+failed as ``stale``. For example::
+
+ ceph health
+ HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+You can identify which placement groups are ``stale``, and what the last OSDs to
+store them were, with::
+
+ ceph health detail
+ HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+ ...
+ pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+ ...
+ osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+ osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+ osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+If we want to get placement group 2.5 back online, for example, this tells us that
+it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
+daemons will allow the cluster to recover that placement group (and, presumably,
+many others).
+
+
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging: ../log-and-debug
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
+.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users
+.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users
+.. _Inktank: http://inktank.com
+.. _OS recommendations: ../../../install/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file