From: John Wilkins Date: Thu, 18 Apr 2013 01:27:40 +0000 (-0700) Subject: doc: Moved troubleshooting OSD to troubleshooting section. X-Git-Tag: v0.61~167 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=594580c900f565a9c7012add564482d4a968fd2b;p=ceph.git doc: Moved troubleshooting OSD to troubleshooting section. Signed-off-by: John Wilkins --- diff --git a/doc/rados/operations/troubleshooting-osd.rst b/doc/rados/operations/troubleshooting-osd.rst deleted file mode 100644 index d167960c397e..000000000000 --- a/doc/rados/operations/troubleshooting-osd.rst +++ /dev/null @@ -1,739 +0,0 @@ -============================== - Troubleshooting OSDs and PGs -============================== - -Before troubleshooting your OSDs, check your monitors and network first. If -you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns -a health status, the return of a status means that the monitors have a quorum. -If you don't have a monitor quorum or if there are errors with the monitor -status, address the monitor issues first. Check your networks to ensure they -are running properly, because networks may have a significant impact on OSD -operation and performance. - - -The Ceph Community -================== - -The Ceph community is an excellent source of information and help. For -operational issues with Ceph releases we recommend you `subscribe to the -ceph-users email list`_. When you no longer want to receive emails, you can -`unsubscribe from the ceph-users email list`_. - -If you have read through this guide and you have contacted ``ceph-users``, -but you haven't resolved your issue, you may contact `Inktank`_ for support. - -You may also `subscribe to the ceph-devel email list`_. You should do so if -your issue is: - -- Likely related to a bug -- Related to a development release package -- Related to a development testing package -- Related to your own builds - -If you no longer want to receive emails from the ``ceph-devel`` email list, you -may `unsubscribe from the ceph-devel email list`_. - -.. tip:: The Ceph community is growing rapidly, and community members can help - you if you provide them with detailed information about your problem. See - `Obtaining Data About OSDs`_ before you post questions to ensure that - community members have sufficient data to help you. - - -Obtaining Data About OSDs -========================= - -A good first step in troubleshooting your OSDs is to obtain information in -addition to the information you collected while `monitoring your OSDs`_ -(e.g., ``ceph osd tree``). - - -Ceph Logs ---------- - -If you haven't changed the default path, you can find Ceph log files at -``/var/log/ceph``:: - - ls /var/log/ceph - -If you don't get enough log detail, you can change your logging level. See -`Ceph Logging and Debugging`_ and `Logging and Debugging Config Reference`_ in -the Ceph Configuration documentation for details. Also, see `Debugging and -Logging`_ in the Ceph Operations documentation to ensure that Ceph performs -adequately under high logging volume. - - -Admin Socket ------------- - -Use the admin socket tool to retrieve runtime information. 
For details, list -the sockets for your Ceph processes:: - - ls /var/run/ceph - -Then, execute the following, replacing ``{socket-name}`` with an actual -socket name to show the list of available options:: - - ceph --admin-daemon /var/run/ceph/{socket-name} help - -The admin socket, among other things, allows you to: - -- List your configuration at runtime -- Dump historic operations -- Dump the operation priority queue state -- Dump operations in flight -- Dump perfcounters - - -Display Freespace ------------------ - -Filesystem issues may arise. To display your filesystem's free space, execute -``df``. :: - - df -h - -Execute ``df --help`` for additional usage. - - -I/O Statistics --------------- - -Use `iostat`_ to identify I/O-related issues. :: - - iostat -x - - -Diagnostic Messages -------------------- - -To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` -or ``tail``. For example:: - - dmesg | grep scsi - - -Stopping w/out Rebalancing -========================== - -Periodically, you may need to perform maintenance on a subset of your cluster, -or resolve a problem that affects a failure domain (e.g., a rack). If you do not -want CRUSH to automatically rebalance the cluster as you stop OSDs for -maintenance, set the cluster to ``noout`` first:: - - ceph osd set noout - -Once the cluster is set to ``noout``, you can begin stopping the OSDs within the -failure domain that requires maintenance work. :: - - ceph osd stop osd.{num} - -.. note:: Placement groups within the OSDs you stop will become ``degraded`` - while you are addressing issues with within the failure domain. - -Once you have completed your maintenance, restart the OSDs. :: - - ceph osd start osd.{num} - -Finally, you must unset the cluster from ``noout``. :: - - ceph osd unset noout - - - -.. _osd-not-running: - -OSD Not Running -=============== - -Under normal circumstances, simply restarting the ``ceph-osd`` daemon will -allow it to rejoin the cluster and recover. - -An OSD Won't Start ------------------- - -If you start your cluster and an OSD won't start, check the following: - -- **Configuration File:** If you were not able to get OSDs running from - a new installation, check your configuration file to ensure it conforms - (e.g., ``host`` not ``hostname``, etc.). - -- **Check Paths:** Check the paths in your configuration, and the actual - paths themselves for data and journals. If you separate the OSD data from - the journal data and there are errors in your configuration file or in the - actual mounts, you may have trouble starting OSDs. If you want to store the - journal on a block device, you should partition your journal disk and assign - one partition per OSD. - -- **Kernel Version:** Identify the kernel version and distribution you - are using. Ceph uses some third party tools by default, which may be - buggy or may conflict with certain distributions and/or kernel - versions (e.g., Google perftools). Check the `OS recommendations`_ - to ensure you have addressed any issues related to your kernel. - -- **Segment Fault:** If there is a segment fault, turn your logging up - (if it isn't already), and try again. If it segment faults again, - contact the ceph-devel email list and provide your Ceph configuration - file, your monitor output and the contents of your log file(s). - -If you cannot resolve the issue and the email list isn't helpful, you may -contact `Inktank`_ for support. 
- - -An OSD Failed -------------- - -When a ``ceph-osd`` process dies, the monitor will learn about the failure -from surviving ``ceph-osd`` daemons and report it via the ``ceph health`` -command:: - - ceph health - HEALTH_WARN 1/3 in osds are down - -Specifically, you will get a warning whenever there are ``ceph-osd`` -processes that are marked ``in`` and ``down``. You can identify which -``ceph-osds`` are ``down`` with:: - - ceph health detail - HEALTH_WARN 1/3 in osds are down - osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 - -If there is a disk -failure or other fault preventing ``ceph-osd`` from functioning or -restarting, an error message should be present in its log file in -``/var/log/ceph``. - -If the daemon stopped because of a heartbeat failure, the underlying -kernel file system may be unresponsive. Check ``dmesg`` output for disk -or other kernel errors. - -If the problem is a software error (failed assertion or other -unexpected error), it should be reported to the `ceph-devel`_ email list. - - -No Free Drive Space -------------------- - -Ceph prevents you from writing to a full OSD so that you don't lose data. -In an operational cluster, you should receive a warning when your cluster -is getting near its full ratio. The ``mon osd full ratio`` defaults to -``0.95``, or 95% of capacity before it stops clients from writing data. -The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity -when it generates a health warning. - -Full cluster issues usually arise when testing how Ceph handles an OSD -failure on a small cluster. When one node has a high percentage of the -cluster's data, the cluster can easily eclipse its nearfull and full ratio -immediately. If you are testing how Ceph reacts to OSD failures on a small -cluster, you should leave ample free disk space and consider temporarily -lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``. - -Full ``ceph-osds`` will be reported by ``ceph health``:: - - ceph health - HEALTH_WARN 1 nearfull osds - osd.2 is near full at 85% - -Or:: - - ceph health - HEALTH_ERR 1 nearfull osds, 1 full osds - osd.2 is near full at 85% - osd.3 is full at 97% - -The best way to deal with a full cluster is to add new ``ceph-osds``, allowing -the cluster to redistribute data to the newly available storage. - -If you cannot start an OSD because it is full, you may delete some data by deleting -some placement group directories in the full OSD. - -.. important:: If you choose to delete a placement group directory on a full OSD, - **DO NOT** delete the same placement group directory on another full OSD, or - **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on - at least one OSD. - - -OSDs are Slow/Unresponsive -========================== - -A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you -have eliminated other troubleshooting possibilities before delving into OSD -performance issues. For example, ensure that your network(s) is working properly -and your OSDs are running. Check to see if OSDs are throttling recovery traffic. - -.. tip:: Newer versions of Ceph provide better recovery handling by preventing - recovering OSDs from using up system resources so that ``up`` and ``in`` - OSDs aren't available or are otherwise slow. - - -Networking Issues ------------------ - -Ceph is a distributed storage system, so it depends upon networks to peer with -OSDs, replicate objects, recover from faults and check heartbeats. 
Networking -issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for -details. - -Ensure that Ceph processes and Ceph-dependent processes are connected and/or -listening. :: - - netstat -a | grep ceph - netstat -l | grep ceph - sudo netstat -p | grep ceph - -Check network statistics. :: - - netstat -s - - -Drive Configuration -------------------- - -A storage drive should only support one OSD. Sequential read and sequential -write throughput can bottleneck if other processes share the drive, including -journals, operating systems, monitors, other OSDs and non-Ceph processes. - -Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive -option to accelerate the response time--particularly when using the ``ext4`` or -XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal -simultaneously. - -.. note:: Partitioning a drive does not change its total throughput or - sequential read/write limits. Running a journal in a separate partition - may help, but you should prefer a separate physical drive. - - -Bad Sectors / Fragmented Disk ------------------------------ - -Check your disks for bad sectors and fragmentation. This can cause total throughput -to drop substantially. - - -Co-resident Monitors/OSDs -------------------------- - -Monitors are generally light-weight processes, but they do lots of ``fsync()``, -which can interfere with other workloads, particularly if monitors run on the -same drive as your OSDs. Additionally, if you run monitors on the same host as -the OSDs, you may incur performance issues related to: - -- Running an older kernel (pre-3.0) -- Running Argonaut with an old ``glibc`` -- Running a kernel with no syncfs(2) syscall. - -In these cases, multiple OSDs running on the same host can drag each other down -by doing lots of commits. That often leads to the bursty writes. - - -Co-resident Processes ---------------------- - -Spinning up co-resident processes such as a cloud-based solution, virtual -machines and other applications that write data to Ceph while operating on the -same hardware as OSDs can introduce significant OSD latency. Generally, we -recommend optimizing a host for use with Ceph and using other hosts for other -processes. The practice of separating Ceph operations from other applications -may help improve performance and may streamline troubleshooting and maintenance. - - -Logging Levels --------------- - -If you turned logging levels up to track an issue and then forgot to turn -logging levels back down, the OSD may be putting a lot of logs onto the disk. If -you intend to keep logging levels high, you may consider mounting a drive to the -default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). - - -Recovery Throttling -------------------- - -Depending upon your configuration, Ceph may reduce recovery rates to maintain -performance or it may increase recovery rates to the point that recovery -impacts OSD performance. Check to see if the OSD is recovering. - - -Kernel Version --------------- - -Check the kernel version you are running. Older kernels may not receive -new backports that Ceph depends upon for better performance. - - -Kernel Issues with SyncFS -------------------------- - -Try running one OSD per host to see if performance improves. Old kernels -might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. - - -Filesystem Issues ------------------ - -Currently, we recommend deploying clusters with XFS or ext4. 
The btrfs -filesystem has many attractive features, but bugs in the filesystem may -lead to performance issues. - - -Insufficient RAM ----------------- - -We recommend 1GB of RAM per OSD daemon. You may notice that during normal -operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). -Unused RAM makes it tempting to use the excess RAM for co-resident applications, -VMs and so forth. However, when OSDs go into recovery mode, their memory -utilization spikes. If there is no RAM available, the OSD performance will slow -considerably. - - -Old Requests or Slow Requests ------------------------------ - -If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages -complaining about requests that are taking too long. The warning threshold -defaults to 30 seconds, and is configurable via the ``osd op complaint time`` -option. When this happens, the cluster log will receive messages. - -Legacy versions of Ceph complain about 'old requests`:: - - osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops - -New versions of Ceph complain about 'slow requests`:: - - {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs - {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] - - -Possible causes include: - -- A bad drive (check ``dmesg`` output) -- A bug in the kernel file system bug (check ``dmesg`` output) -- An overloaded cluster (check system load, iostat, etc.) -- A bug in the ``ceph-osd`` daemon. - -Possible solutions - -- Remove VMs Cloud Solutions from Ceph Hosts -- Upgrade Kernel -- Upgrade Ceph -- Restart OSDs - - - -Flapping OSDs -============= - -We recommend using both a public (front-end) network and a cluster (back-end) -network so that you can better meet the capacity requirements of object replication. Another -advantage is that you can run a cluster network such that it isn't connected to -the internet, thereby preventing some denial of service attacks. When OSDs peer -and check heartbeats, they use the cluster (back-end) network when it's available. -See `Monitor/OSD Interaction`_ for details. - -However, if the cluster (back-end) network fails or develops significant latency -while the public (front-end) network operates optimally, OSDs currently do not -handle this situation well. What happens is that OSDs mark each other ``down`` -on the monitor, while marking themselves ``up``. We call this scenario 'flapping`. - -If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and then -``up`` again), you can force the monitors to stop the flapping with:: - - ceph osd set noup # prevent osds from getting marked up - ceph osd set nodown # prevent osds from getting marked down - -These flags are recorded in the osdmap structure:: - - ceph osd dump | grep flags - flags no-up,no-down - -You can clear the flags with:: - - ceph osd unset noup - ceph osd unset nodown - -Two other flags are supported, ``noin`` and ``noout``, which prevent -booting OSDs from being marked ``in`` (allocated data) or down -ceph-osds from eventually being marked ``out`` (regardless of what the -current value for ``mon osd down out interval`` is). - -.. 
note:: ``noup``, ``noout``, and ``nodown`` are temporary in the - sense that once the flags are cleared, the action they were blocking - should occur shortly after. The ``noin`` flag, on the other hand, - prevents OSDs from being marked ``in`` on boot, and any daemons that - started while the flag was set will remain that way. - - - -Troubleshooting PG Errors -========================= - - -Placement Groups Never Get Clean --------------------------------- - -There are a few cases where Ceph placement groups never get clean: - -#. **One OSD:** If you deviate from the quick start and use only one OSD, you - will likely run into problems. OSDs report other OSDs to the monitor, and - also interact with other OSDs when replicating data. If you have only one - OSD, a second OSD cannot check its heartbeat. Also, if you remove an OSD - and have only one OSD remaining, you may encounter problems. An secondary - or tertiary OSD expects another OSD to tell it which placement groups it - should have. The lack of another OSD prevents this from occurring. So a - placement group can remain stuck “stale” forever. - -#. **Pool Size = 1**: If you have only one copy of an object, no other OSD will - tell the OSD which objects it should have. For each placement group mapped - to the remaining OSD (see ``ceph pg dump``), you can force the OSD to notice - the placement groups it needs by running:: - - ceph pg force_create_pg - -As a general rule, you should run your cluster with more than one OSD and a -pool size greater than 1 object replica. - - -Stuck Placement Groups ----------------------- - -It is normal for placement groups to enter states like "degraded" or "peering" -following a failure. Normally these states indicate the normal progression -through the failure recovery process. However, if a placement group stays in one -of these states for a long time this may be an indication of a larger problem. -For this reason, the monitor will warn when placement groups get "stuck" in a -non-optimal state. Specifically, we check for: - -* ``inactive`` - The placement group has not been ``active`` for too long - (i.e., it hasn't been able to service read/write requests). - -* ``unclean`` - The placement group has not been ``clean`` for too long - (i.e., it hasn't been able to completely recover from a previous failure). - -* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, - indicating that all nodes storing this placement group may be ``down``. - -You can explicitly list stuck placement groups with one of:: - - ceph pg dump_stuck stale - ceph pg dump_stuck inactive - ceph pg dump_stuck unclean - -For stuck ``stale`` placement groups, it is normally a matter of getting the -right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement -groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For -stuck ``unclean`` placement groups, there is usually something preventing -recovery from completing, like unfound objects (see -:ref:`failures-osd-unfound`); - - - -.. _failures-osd-peering: - -Placement Group Down - Peering Failure --------------------------------------- - -In certain cases, the ``ceph-osd`` `Peering` process can run into -problems, preventing a PG from becoming active and usable. For -example, ``ceph health`` might report:: - - ceph health detail - HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down - ... 
- pg 0.5 is down+peering - pg 1.4 is down+peering - ... - osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 - -We can query the cluster to determine exactly why the PG is marked ``down`` with:: - - ceph pg 0.5 query - -.. code-block:: javascript - - { "state": "down+peering", - ... - "recovery_state": [ - { "name": "Started\/Primary\/Peering\/GetInfo", - "enter_time": "2012-03-06 14:40:16.169679", - "requested_info_from": []}, - { "name": "Started\/Primary\/Peering", - "enter_time": "2012-03-06 14:40:16.169659", - "probing_osds": [ - 0, - 1], - "blocked": "peering is blocked due to down osds", - "down_osds_we_would_probe": [ - 1], - "peering_blocked_by": [ - { "osd": 1, - "current_lost_at": 0, - "comment": "starting or marking this osd lost may let us proceed"}]}, - { "name": "Started", - "enter_time": "2012-03-06 14:40:16.169513"} - ] - } - -The ``recovery_state`` section tells us that peering is blocked due to -down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` -and things will recover. - -Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk -failure), we can tell the cluster that it is ``lost`` and to cope as -best it can. - -.. important:: This is dangerous in that the cluster cannot - guarantee that the other copies of the data are consistent - and up to date. - -To instruct Ceph to continue anyway:: - - ceph osd lost 1 - -Recovery will proceed. - - -.. _failures-osd-unfound: - -Unfound Objects ---------------- - -Under certain combinations of failures Ceph may complain about -``unfound`` objects:: - - ceph health detail - HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) - pg 2.4 is active+degraded, 78 unfound - -This means that the storage cluster knows that some objects (or newer -copies of existing objects) exist, but it hasn't found copies of them. -One example of how this might come about for a PG whose data is on ceph-osds -1 and 2: - -* 1 goes down -* 2 handles some writes, alone -* 1 comes up -* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. -* Before the new objects are copied, 2 goes down. - -Now 1 knows that these object exist, but there is no live ``ceph-osd`` who -has a copy. In this case, IO to those objects will block, and the -cluster will hope that the failed node comes back soon; this is -assumed to be preferable to returning an IO error to the user. - -First, you can identify which objects are unfound with:: - - ceph pg 2.4 list_missing [starting offset, in json] - -.. code-block:: javascript - - { "offset": { "oid": "", - "key": "", - "snapid": 0, - "hash": 0, - "max": 0}, - "num_missing": 0, - "num_unfound": 0, - "objects": [ - { "oid": "object 1", - "key": "", - "hash": 0, - "max": 0 }, - ... - ], - "more": 0} - -If there are too many objects to list in a single result, the ``more`` -field will be true and you can query for more. (Eventually the -command line tool will hide this from you, but not yet.) - -Second, you can identify which OSDs have been probed or might contain -data:: - - ceph pg 2.4 query - -.. code-block:: javascript - - "recovery_state": [ - { "name": "Started\/Primary\/Active", - "enter_time": "2012-03-06 15:15:46.713212", - "might_have_unfound": [ - { "osd": 1, - "status": "osd is down"}]}, - -In this case, for example, the cluster knows that ``osd.1`` might have -data, but it is ``down``. 
The full range of possible states include:: - - * already probed - * querying - * osd is down - * not queried (yet) - -Sometimes it simply takes some time for the cluster to query possible -locations. - -It is possible that there are other locations where the object can -exist that are not listed. For example, if a ceph-osd is stopped and -taken out of the cluster, the cluster fully recovers, and due to some -future set of failures ends up with an unfound object, it won't -consider the long-departed ceph-osd as a potential location to -consider. (This scenario, however, is unlikely.) - -If all possible locations have been queried and objects are still -lost, you may have to give up on the lost objects. This, again, is -possible given unusual combinations of failures that allow the cluster -to learn about writes that were performed before the writes themselves -are recovered. To mark the "unfound" objects as "lost":: - - ceph pg 2.5 mark_unfound_lost revert - -This the final argument specifies how the cluster should deal with -lost objects. Currently the only supported option is "revert", which -will either roll back to a previous version of the object or (if it -was a new object) forget about it entirely. Use this with caution, as -it may confuse applications that expected the object to exist. - - -Homeless Placement Groups -------------------------- - -It is possible for all OSDs that had copies of a given placement groups to fail. -If that's the case, that subset of the object store is unavailable, and the -monitor will receive no status updates for those placement groups. To detect -this situation, the monitor marks any placement group whose primary OSD has -failed as ``stale``. For example:: - - ceph health - HEALTH_WARN 24 pgs stale; 3/300 in osds are down - -You can identify which placement groups are ``stale``, and what the last OSDs to -store them were, with:: - - ceph health detail - HEALTH_WARN 24 pgs stale; 3/300 in osds are down - ... - pg 2.5 is stuck stale+active+remapped, last acting [2,0] - ... - osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 - osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 - osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 - -If we want to get placement group 2.5 back online, for example, this tells us that -it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` -daemons will allow the cluster to recover that placement group (and, presumably, -many others). - - - -.. _iostat: http://en.wikipedia.org/wiki/Iostat -.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging -.. _Logging and Debugging Config Reference: ../../configuration/log-and-debug-ref -.. _Debugging and Logging: ../debug -.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction -.. _monitoring your OSDs: ../monitoring-osd-pg -.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel -.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel -.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users -.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users -.. _Inktank: http://inktank.com -.. _OS recommendations: ../../../install/os-recommendations -.. 
_ceph-devel: ceph-devel@vger.kernel.org \ No newline at end of file diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst new file mode 100644 index 000000000000..aae3423a4c71 --- /dev/null +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -0,0 +1,714 @@ +============================== + Troubleshooting OSDs and PGs +============================== + +Before troubleshooting your OSDs, check your monitors and network first. If +you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns +a health status, the return of a status means that the monitors have a quorum. +If you don't have a monitor quorum or if there are errors with the monitor +status, address the monitor issues first. Check your networks to ensure they +are running properly, because networks may have a significant impact on OSD +operation and performance. + + + +Obtaining Data About OSDs +========================= + +A good first step in troubleshooting your OSDs is to obtain information in +addition to the information you collected while `monitoring your OSDs`_ +(e.g., ``ceph osd tree``). + + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't get enough log detail, you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph processes:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{socket-name}`` with an actual +socket name to show the list of available options:: + + ceph --admin-daemon /var/run/ceph/{socket-name} help + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + + +Display Freespace +----------------- + +Filesystem issues may arise. To display your filesystem's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +Once the cluster is set to ``noout``, you can begin stopping the OSDs within the +failure domain that requires maintenance work. :: + + ceph osd stop osd.{num} + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues with within the failure domain. + +Once you have completed your maintenance, restart the OSDs. :: + + ceph osd start osd.{num} + +Finally, you must unset the cluster from ``noout``. :: + + ceph osd unset noout + + + +.. 
_osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and journals. If you separate the OSD data from + the journal data and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. If you want to store the + journal on a block device, you should partition your journal disk and assign + one partition per OSD. + +- **Kernel Version:** Identify the kernel version and distribution you + are using. Ceph uses some third party tools by default, which may be + buggy or may conflict with certain distributions and/or kernel + versions (e.g., Google perftools). Check the `OS recommendations`_ + to ensure you have addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, turn your logging up + (if it isn't already), and try again. If it segment faults again, + contact the ceph-devel email list and provide your Ceph configuration + file, your monitor output and the contents of your log file(s). + +If you cannot resolve the issue and the email list isn't helpful, you may +contact `Inktank`_ for support. + + +An OSD Failed +------------- + +When a ``ceph-osd`` process dies, the monitor will learn about the failure +from surviving ``ceph-osd`` daemons and report it via the ``ceph health`` +command:: + + ceph health + HEALTH_WARN 1/3 in osds are down + +Specifically, you will get a warning whenever there are ``ceph-osd`` +processes that are marked ``in`` and ``down``. You can identify which +``ceph-osds`` are ``down`` with:: + + ceph health detail + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +If there is a disk +failure or other fault preventing ``ceph-osd`` from functioning or +restarting, an error message should be present in its log file in +``/var/log/ceph``. + +If the daemon stopped because of a heartbeat failure, the underlying +kernel file system may be unresponsive. Check ``dmesg`` output for disk +or other kernel errors. + +If the problem is a software error (failed assertion or other +unexpected error), it should be reported to the `ceph-devel`_ email list. + + +No Free Drive Space +------------------- + +Ceph prevents you from writing to a full OSD so that you don't lose data. +In an operational cluster, you should receive a warning when your cluster +is getting near its full ratio. The ``mon osd full ratio`` defaults to +``0.95``, or 95% of capacity before it stops clients from writing data. +The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity +when it generates a health warning. + +Full cluster issues usually arise when testing how Ceph handles an OSD +failure on a small cluster. When one node has a high percentage of the +cluster's data, the cluster can easily eclipse its nearfull and full ratio +immediately. 
If you are testing how Ceph reacts to OSD failures on a small +cluster, you should leave ample free disk space and consider temporarily +lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``. + +Full ``ceph-osds`` will be reported by ``ceph health``:: + + ceph health + HEALTH_WARN 1 nearfull osds + osd.2 is near full at 85% + +Or:: + + ceph health + HEALTH_ERR 1 nearfull osds, 1 full osds + osd.2 is near full at 85% + osd.3 is full at 97% + +The best way to deal with a full cluster is to add new ``ceph-osds``, allowing +the cluster to redistribute data to the newly available storage. + +If you cannot start an OSD because it is full, you may delete some data by deleting +some placement group directories in the full OSD. + +.. important:: If you choose to delete a placement group directory on a full OSD, + **DO NOT** delete the same placement group directory on another full OSD, or + **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on + at least one OSD. + +See `Monitor Config Reference`_ for additional details. + + +OSDs are Slow/Unresponsive +========================== + +A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you +have eliminated other troubleshooting possibilities before delving into OSD +performance issues. For example, ensure that your network(s) is working properly +and your OSDs are running. Check to see if OSDs are throttling recovery traffic. + +.. tip:: Newer versions of Ceph provide better recovery handling by preventing + recovering OSDs from using up system resources so that ``up`` and ``in`` + OSDs aren't available or are otherwise slow. + + +Networking Issues +----------------- + +Ceph is a distributed storage system, so it depends upon networks to peer with +OSDs, replicate objects, recover from faults and check heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + + +Drive Configuration +------------------- + +A storage drive should only support one OSD. Sequential read and sequential +write throughput can bottleneck if other processes share the drive, including +journals, operating systems, monitors, other OSDs and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive +option to accelerate the response time--particularly when using the ``ext4`` or +XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal +simultaneously. + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your disks for bad sectors and fragmentation. This can cause total throughput +to drop substantially. + + +Co-resident Monitors/OSDs +------------------------- + +Monitors are generally light-weight processes, but they do lots of ``fsync()``, +which can interfere with other workloads, particularly if monitors run on the +same drive as your OSDs. 
Additionally, if you run monitors on the same host as +the OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running Argonaut with an old ``glibc`` +- Running a kernel with no syncfs(2) syscall. + +In these cases, multiple OSDs running on the same host can drag each other down +by doing lots of commits. That often leads to the bursty writes. + + +Co-resident Processes +--------------------- + +Spinning up co-resident processes such as a cloud-based solution, virtual +machines and other applications that write data to Ceph while operating on the +same hardware as OSDs can introduce significant OSD latency. Generally, we +recommend optimizing a host for use with Ceph and using other hosts for other +processes. The practice of separating Ceph operations from other applications +may help improve performance and may streamline troubleshooting and maintenance. + + +Logging Levels +-------------- + +If you turned logging levels up to track an issue and then forgot to turn +logging levels back down, the OSD may be putting a lot of logs onto the disk. If +you intend to keep logging levels high, you may consider mounting a drive to the +default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). + + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +performance or it may increase recovery rates to the point that recovery +impacts OSD performance. Check to see if the OSD is recovering. + + +Kernel Version +-------------- + +Check the kernel version you are running. Older kernels may not receive +new backports that Ceph depends upon for better performance. + + +Kernel Issues with SyncFS +------------------------- + +Try running one OSD per host to see if performance improves. Old kernels +might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. + + +Filesystem Issues +----------------- + +Currently, we recommend deploying clusters with XFS or ext4. The btrfs +filesystem has many attractive features, but bugs in the filesystem may +lead to performance issues. + + +Insufficient RAM +---------------- + +We recommend 1GB of RAM per OSD daemon. You may notice that during normal +operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). +Unused RAM makes it tempting to use the excess RAM for co-resident applications, +VMs and so forth. However, when OSDs go into recovery mode, their memory +utilization spikes. If there is no RAM available, the OSD performance will slow +considerably. + + +Old Requests or Slow Requests +----------------------------- + +If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages +complaining about requests that are taking too long. The warning threshold +defaults to 30 seconds, and is configurable via the ``osd op complaint time`` +option. When this happens, the cluster log will receive messages. 
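+
+For example, to see what a suspect OSD is actually working on, you can dump
+its operations in flight through the admin socket. This is only a sketch: it
+assumes the default socket path and an OSD named ``osd.0``, so adjust both
+for your cluster. ::
+
+   # list the requests osd.0 is currently processing, with their ages
+   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight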
+
+Legacy versions of Ceph complain about 'old requests'::
+
+   osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about 'slow requests'::
+
+   {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+   {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A bug in the kernel file system (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon.
+
+Possible solutions include:
+
+- Remove VMs or cloud solutions from Ceph hosts
+- Upgrade your kernel
+- Upgrade Ceph
+- Restart OSDs
+
+
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it isn't connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+'flapping'.
+
+If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+   ceph osd set noup      # prevent osds from getting marked up
+   ceph osd set nodown    # prevent osds from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+   ceph osd dump | grep flags
+   flags no-up,no-down
+
+You can clear the flags with::
+
+   ceph osd unset noup
+   ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``, which prevent
+booting OSDs from being marked ``in`` (allocated data) and prevent down
+``ceph-osd`` daemons from eventually being marked ``out`` (regardless of
+the current value for ``mon osd down out interval``).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+   sense that once the flags are cleared, the action they were blocking
+   should occur shortly after. The ``noin`` flag, on the other hand,
+   prevents OSDs from being marked ``in`` on boot, and any daemons that
+   started while the flag was set will remain that way.
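+
+Because flapping usually points at the cluster (back-end) network, it can
+help to verify that network directly between two OSD hosts before or after
+setting the flags above. The following is only a sketch: it assumes a
+hypothetical cluster-network interface ``eth1`` and a peer OSD host reachable
+at the hypothetical cluster-network address ``10.0.1.12``; substitute the
+interface and addresses used in your deployment. ::
+
+   # confirm basic reachability over the cluster network interface
+   ping -I eth1 -c 5 10.0.1.12
+
+   # look for retransmissions, drops or errors that suggest a flaky back-end link
+   netstat -s | grep -i retrans
+   ip -s link show eth1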
+
+
+Troubleshooting PG Errors
+=========================
+
+
+Placement Groups Never Get Clean
+--------------------------------
+
+There are a few cases where Ceph placement groups never get clean:
+
+#. **One OSD:** If you deviate from the quick start and use only one OSD, you
+   will likely run into problems. OSDs report other OSDs to the monitor, and
+   also interact with other OSDs when replicating data. If you have only one
+   OSD, a second OSD cannot check its heartbeat. Also, if you remove an OSD
+   and have only one OSD remaining, you may encounter problems. A secondary
+   or tertiary OSD expects another OSD to tell it which placement groups it
+   should have. The lack of another OSD prevents this from occurring. So a
+   placement group can remain stuck "stale" forever.
+
+#. **Pool Size = 1**: If you have only one copy of an object, no other OSD will
+   tell the OSD which objects it should have. For each placement group mapped
+   to the remaining OSD (see ``ceph pg dump``), you can force the OSD to notice
+   the placement groups it needs by running::
+
+      ceph pg force_create_pg {pg-id}
+
+As a general rule, you should run your cluster with more than one OSD and a
+pool size greater than 1 object replica.
+
+
+Stuck Placement Groups
+----------------------
+
+It is normal for placement groups to enter states like "degraded" or "peering"
+following a failure. These states reflect the normal progression through the
+failure recovery process. However, if a placement group stays in one of these
+states for a long time, this may be an indication of a larger problem. For
+this reason, the monitor will warn when placement groups get "stuck" in a
+non-optimal state. Specifically, we check for:
+
+* ``inactive`` - The placement group has not been ``active`` for too long
+  (i.e., it hasn't been able to service read/write requests).
+
+* ``unclean`` - The placement group has not been ``clean`` for too long
+  (i.e., it hasn't been able to completely recover from a previous failure).
+
+* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
+  indicating that all nodes storing this placement group may be ``down``.
+
+You can explicitly list stuck placement groups with one of::
+
+   ceph pg dump_stuck stale
+   ceph pg dump_stuck inactive
+   ceph pg dump_stuck unclean
+
+For stuck ``stale`` placement groups, it is normally a matter of getting the
+right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
+groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
+stuck ``unclean`` placement groups, there is usually something preventing
+recovery from completing, like unfound objects (see
+:ref:`failures-osd-unfound`).
+
+
+
+.. _failures-osd-peering:
+
+Placement Group Down - Peering Failure
+--------------------------------------
+
+In certain cases, the ``ceph-osd`` `Peering` process can run into
+problems, preventing a PG from becoming active and usable. For
+example, ``ceph health`` might report::
+
+   ceph health detail
+   HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+   ...
+   pg 0.5 is down+peering
+   pg 1.4 is down+peering
+   ...
+   osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+
+We can query the cluster to determine exactly why the PG is marked ``down`` with::
+
+   ceph pg 0.5 query
+
+.. code-block:: javascript
+
+ { "state": "down+peering",
+   ...
+ "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to +down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` +and things will recover. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk +failure), we can tell the cluster that it is ``lost`` and to cope as +best it can. + +.. important:: This is dangerous in that the cluster cannot + guarantee that the other copies of the data are consistent + and up to date. + +To instruct Ceph to continue anyway:: + + ceph osd lost 1 + +Recovery will proceed. + + +.. _failures-osd-unfound: + +Unfound Objects +--------------- + +Under certain combinations of failures Ceph may complain about +``unfound`` objects:: + + ceph health detail + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer +copies of existing objects) exist, but it hasn't found copies of them. +One example of how this might come about for a PG whose data is on ceph-osds +1 and 2: + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +Now 1 knows that these object exist, but there is no live ``ceph-osd`` who +has a copy. In this case, IO to those objects will block, and the +cluster will hope that the failed node comes back soon; this is +assumed to be preferable to returning an IO error to the user. + +First, you can identify which objects are unfound with:: + + ceph pg 2.4 list_missing [starting offset, in json] + +.. code-block:: javascript + + { "offset": { "oid": "", + "key": "", + "snapid": 0, + "hash": 0, + "max": 0}, + "num_missing": 0, + "num_unfound": 0, + "objects": [ + { "oid": "object 1", + "key": "", + "hash": 0, + "max": 0 }, + ... + ], + "more": 0} + +If there are too many objects to list in a single result, the ``more`` +field will be true and you can query for more. (Eventually the +command line tool will hide this from you, but not yet.) + +Second, you can identify which OSDs have been probed or might contain +data:: + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, for example, the cluster knows that ``osd.1`` might have +data, but it is ``down``. The full range of possible states include:: + + * already probed + * querying + * osd is down + * not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object can +exist that are not listed. 
For example, if a ceph-osd is stopped and
+taken out of the cluster, the cluster fully recovers, and due to some
+future set of failures ends up with an unfound object, it won't
+consider the long-departed ceph-osd as a potential location.
+(This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This, again, is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered. To mark the "unfound" objects as "lost"::
+
+   ceph pg 2.5 mark_unfound_lost revert
+
+The final argument specifies how the cluster should deal with
+lost objects. Currently the only supported option is "revert", which
+will either roll back to a previous version of the object or (if it
+was a new object) forget about it entirely. Use this with caution, as
+it may confuse applications that expected the object to exist.
+
+
+Homeless Placement Groups
+-------------------------
+
+It is possible for all OSDs that had copies of a given placement group to fail.
+If that's the case, that subset of the object store is unavailable, and the
+monitor will receive no status updates for those placement groups. To detect
+this situation, the monitor marks any placement group whose primary OSD has
+failed as ``stale``. For example::
+
+   ceph health
+   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+You can identify which placement groups are ``stale``, and what the last OSDs to
+store them were, with::
+
+   ceph health detail
+   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+   ...
+   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+   ...
+   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+For example, if we want to get placement group 2.5 back online, this output
+tells us that it was last managed by ``osd.0`` and ``osd.2``. Restarting those
+``ceph-osd`` daemons will allow the cluster to recover that placement group
+(and, presumably, many others).
+
+
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging: ../log-and-debug
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
+.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users
+.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users
+.. _Inktank: http://inktank.com
+.. _OS recommendations: ../../../install/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file