From d2a3e19388ff12320a4dafbbce094cb1f3a7ac3c Mon Sep 17 00:00:00 2001 From: Zac Dover Date: Thu, 12 Oct 2023 18:33:58 +1000 Subject: [PATCH] doc/rados: Edit troubleshooting-osd (2 of x) Edit doc/rados/troubleshooting/troubleshooting.rst (2 of x). Follows https://github.com/ceph/ceph/pull/53936. Co-authored-by: Anthony D'Atri Signed-off-by: Zac Dover (cherry picked from commit 37e7099267996a3075b4902a10a19d94fc738c08) --- .../troubleshooting/troubleshooting-osd.rst | 334 +++++++++++------- 1 file changed, 203 insertions(+), 131 deletions(-) diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst index 63df09298cd5f..86c9ec57290eb 100644 --- a/doc/rados/troubleshooting/troubleshooting-osd.rst +++ b/doc/rados/troubleshooting/troubleshooting-osd.rst @@ -267,206 +267,278 @@ If the cluster has started but an OSD isn't starting, check the following: An OSD Failed ------------- -When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report -to the mons that it appears down, which will in turn surface the new status -via the ``ceph health`` command:: +When an OSD fails, this means that a ``ceph-osd`` process is unresponsive or +has died and that the corresponding OSD has been marked ``down``. Surviving +``ceph-osd`` daemons will report to the monitors that the OSD appears to be +down, and a new status will be visible in the output of the ``ceph health`` +command, as in the following example: - ceph health - HEALTH_WARN 1/3 in osds are down +.. prompt:: bash + + ceph health + +:: + + HEALTH_WARN 1/3 in osds are down -Specifically, you will get a warning whenever there are OSDs marked ``in`` -and ``down``. You can identify which are ``down`` with:: +This health alert is raised whenever there are one or more OSDs marked ``in`` +and ``down``. To see which OSDs are ``down``, add ``detail`` to the command as in +the following example: - ceph health detail - HEALTH_WARN 1/3 in osds are down - osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 +.. prompt:: bash -or :: + ceph health detail - ceph osd tree down +:: + + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +Alternatively, run the following command: + +.. prompt:: bash -If there is a drive -failure or other fault preventing ``ceph-osd`` from functioning or -restarting, an error message should be present in its log file under -``/var/log/ceph``. + ceph osd tree down -If the daemon stopped because of a heartbeat failure or ``suicide timeout``, -the underlying drive or filesystem may be unresponsive. Check ``dmesg`` -and `syslog` output for drive or other kernel errors. You may need to -specify something like ``dmesg -T`` to get timestamps, otherwise it's -easy to mistake old errors for new. +If there is a drive failure or another fault that is preventing a given +``ceph-osd`` daemon from functioning or restarting, then there should be an +error message present in its log file under ``/var/log/ceph``. + +If the ``ceph-osd`` daemon stopped because of a heartbeat failure or a +``suicide timeout`` error, then the underlying drive or filesystem might be +unresponsive. Check ``dmesg`` output and `syslog` output for drive errors or +kernel errors. It might be necessary to specify certain flags (for example, +``dmesg -T`` to see human-readable timestamps) in order to avoid mistaking old +errors for new errors. 
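+
+For example, to check the kernel log with human-readable timestamps and to
+review the log of a single OSD daemon, commands of the following form can be
+used. Here ``osd.0`` is only an example OSD ID, the ``grep`` pattern is only an
+example filter, and the ``ceph-osd@0`` unit name assumes a package-based
+deployment managed by ``systemd`` (cephadm-based deployments use different unit
+names):
+
+.. prompt:: bash
+
+   dmesg -T | grep -iE 'error|ata|scsi|blk'
+   journalctl -u ceph-osd@0 --since "1 hour ago"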
+
+If an entire host's OSDs are ``down``, check to see if there is a network
+error or a hardware issue with the host.
+
+If the OSD problem is the result of a software error (for example, a failed
+assertion or another unexpected error), search for reports of the issue in the
+`bug tracker <https://tracker.ceph.com>`_, the `dev mailing list archives
+<https://lists.ceph.io/hyperkitty/list/dev@ceph.io/>`_, and the
+`ceph-users mailing list archives
+<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/>`_. If there is no
+clear fix or existing bug, then :ref:`report the problem to the ceph-devel
+email list `.
-If the problem is a software error (failed assertion or other
-unexpected error), search the archives and tracker as above, and
-report it to the `ceph-devel`_ email list if there's no clear fix or
-existing bug.

.. _no-free-drive-space:

No Free Drive Space
-------------------

-Ceph prevents you from writing to a full OSD so that you don't lose data.
-In an operational cluster, you should receive a warning when your cluster's OSDs
-and pools approach the full ratio. The ``mon osd full ratio`` defaults to
-``0.95``, or 95% of capacity before it stops clients from writing data.
-The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90 % of
-capacity above which backfills will not start. The
-OSD nearfull ratio defaults to ``0.85``, or 85% of capacity
-when it generates a health warning.
+If an OSD is full, Ceph prevents data loss by ensuring that no new data is
+written to the OSD. In a properly running cluster, health checks are raised
+when the cluster's OSDs and pools approach certain "fullness" ratios. The
+``mon_osd_full_ratio`` threshold defaults to ``0.95`` (or 95% of capacity):
+this is the point above which clients are prevented from writing data. The
+``mon_osd_backfillfull_ratio`` threshold defaults to ``0.90`` (or 90% of
+capacity): this is the point above which backfills will not start. The
+``mon_osd_nearfull_ratio`` threshold defaults to ``0.85`` (or 85% of capacity):
+this is the point at which the cluster raises the ``OSD_NEARFULL`` health check.
+
+OSDs within a cluster will vary in how much data is allocated to them by Ceph.
+To check "fullness" by displaying data utilization for every OSD, run the
+following command:
-Note that individual OSDs within a cluster will vary in how much data Ceph
-allocates to them. This utilization can be displayed for each OSD with ::
+
+.. prompt:: bash
+
+   ceph osd df
-    ceph osd df

+To check "fullness" by displaying a cluster’s overall data usage and data
+distribution among pools, run the following command:
-Overall cluster / pool fullness can be checked with ::
+
+.. prompt:: bash
+
+   ceph df
-    ceph df

-Pay close attention to the **most full** OSDs, not the percentage of raw space
-used as reported by ``ceph df``. It only takes one outlier OSD filling up to
-fail writes to its pool. The space available to each pool as reported by
-``ceph df`` considers the ratio settings relative to the *most full* OSD that
-is part of a given pool. The distribution can be flattened by progressively
-moving data from overfull or to underfull OSDs using the ``reweight-by-utilization``
-command. With Ceph releases beginning with later revisions of Luminous one can also
-exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically
-and rather effectively.
+When examining the output of the ``ceph df`` command, pay special attention to
+the **most full** OSDs, as opposed to the percentage of raw space used. If a
+single outlier OSD becomes full, all writes to this OSD's pool might fail as a
+result.
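+
+To see the "fullness" ratios that are currently in effect, the OSD map can be
+inspected. The following is a minimal example; the ``grep`` pattern simply
+matches the ``full_ratio``, ``backfillfull_ratio``, and ``nearfull_ratio``
+fields that recent releases print near the top of the dump:
+
+.. prompt:: bash
+
+   ceph osd dump | grep full_ratio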
+
+When ``ceph df`` reports the space available to a pool, it considers the ratio
+settings relative to the *most full* OSD that is part of the pool. To flatten
+the distribution, two approaches are available: (1) using the
+``reweight-by-utilization`` command to progressively move data from excessively
+full OSDs to insufficiently full OSDs, and (2) in later revisions of Luminous
+and subsequent releases, using the ``ceph-mgr`` ``balancer`` module to perform
+the same task automatically.

-The ratios can be adjusted:
+To adjust the "fullness" ratios, run a command or commands of the following
+form:
+
+.. prompt:: bash
+
+   ceph osd set-nearfull-ratio <float[0.0-1.0]>
+   ceph osd set-full-ratio <float[0.0-1.0]>
+   ceph osd set-backfillfull-ratio <float[0.0-1.0]>
+
+Sometimes full cluster issues arise because an OSD has failed. This can happen
+either because of a test or because the cluster is small, very full, or
+unbalanced. When an OSD or node holds an excessive percentage of the cluster's
+data, component failures or natural growth can result in the ``nearfull`` and
+``full`` ratios being exceeded. When testing Ceph's resilience to OSD failures
+on a small cluster, it is advised to leave ample free disk space and to
+consider temporarily lowering the OSD ``full ratio``, OSD ``backfillfull
+ratio``, and OSD ``nearfull ratio``.
+
+The "fullness" status of OSDs is visible in the output of the ``ceph health``
+command, as in the following example:
+
+.. prompt:: bash
+
+   ceph health

::

-   ceph osd set-nearfull-ratio <float[0.0-1.0]>
-   ceph osd set-full-ratio <float[0.0-1.0]>
-   ceph osd set-backfillfull-ratio <float[0.0-1.0]>
+   HEALTH_WARN 1 nearfull osd(s)

-Full cluster issues can arise when an OSD fails either as a test or organically
-within small and/or very full or unbalanced cluster. When an OSD or node
-holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full``
-ratios may be exceeded as a result of component failures or even natural growth.
-If you are testing how Ceph reacts to OSD failures on a small
-cluster, you should leave ample free disk space and consider temporarily
-lowering the OSD ``full ratio``, OSD ``backfillfull ratio`` and
-OSD ``nearfull ratio``

+For details, add ``detail`` to the command, as in the following example:

-Full ``ceph-osds`` will be reported by ``ceph health``::
+.. prompt:: bash

-    ceph health
-    HEALTH_WARN 1 nearfull osd(s)
+   ceph health detail

-Or::
::

-    ceph health detail
-    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
-    osd.3 is full at 97%
-    osd.4 is backfill full at 91%
-    osd.2 is near full at 87%
+   HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
+   osd.3 is full at 97%
+   osd.4 is backfill full at 91%
+   osd.2 is near full at 87%

-The best way to deal with a full cluster is to add capacity via new OSDs, enabling
-the cluster to redistribute data to newly available storage.
+To address full cluster issues, it is recommended to add capacity by adding
+OSDs. Adding new OSDs allows the cluster to redistribute data to newly
+available storage.

Search for ``rados bench`` orphans that are wasting space.
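+
+Leftover objects from old ``rados bench`` runs are usually easy to spot because
+``rados bench`` writes objects whose names begin with ``benchmark_data``. The
+following sketch lists and then removes such objects; ``mypool`` is only a
+placeholder pool name, and any matching objects should be confirmed to be
+benchmark leftovers before they are deleted:
+
+.. prompt:: bash
+
+   rados -p mypool ls | grep benchmark_data
+   rados -p mypool cleanup --prefix benchmark_data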

-If you cannot start a legacy Filestore OSD because it is full, you may reclaim
-some space deleting a few placement group directories in the full OSD.
+If a legacy Filestore OSD cannot be started because it is full, it is possible
+to reclaim space by deleting a small number of placement group directories in
+the full OSD.

-.. important:: If you choose to delete a placement group directory on a full OSD,
-   **DO NOT** delete the same placement group directory on another full OSD, or
-   **YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on
-   at least one OSD. This is a rare and extreme intervention, and is not to be
-   undertaken lightly.
+.. important:: If you choose to delete a placement group directory on a full
+   OSD, **DO NOT** delete the same placement group directory on another full
+   OSD. **OTHERWISE YOU WILL LOSE DATA**. You **MUST** maintain at least one
+   copy of your data on at least one OSD. Deleting placement group directories
+   is a rare and extreme intervention. It is not to be undertaken lightly.
+
+See `Monitor Config Reference`_ for more information.

-See `Monitor Config Reference`_ for additional details.

OSDs are Slow/Unresponsive
==========================

-A common issue involves slow or unresponsive OSDs. Ensure that you
-have eliminated other troubleshooting possibilities before delving into OSD
-performance issues. For example, ensure that your network(s) is working properly
-and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
+OSDs are sometimes slow or unresponsive. When troubleshooting this common
+problem, it is advised to eliminate other possibilities before investigating
+OSD performance issues. For example, be sure to confirm that your network(s)
+are working properly, to verify that your OSDs are running, and to check
+whether OSDs are throttling recovery traffic.
+
+.. tip:: In pre-Luminous releases of Ceph, ``up`` and ``in`` OSDs were
+   sometimes not available or were otherwise slow because recovering OSDs were
+   consuming system resources. Newer releases provide better recovery handling
+   by preventing this phenomenon.

-.. tip:: Newer versions of Ceph provide better recovery handling by preventing
-   recovering OSDs from using up system resources so that ``up`` and ``in``
-   OSDs are not available or are otherwise slow.

Networking Issues
-----------------

-Ceph is a distributed storage system, so it relies upon networks for OSD peering
-and replication, recovery from faults, and periodic heartbeats. Networking
-issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
-details.
+As a distributed storage system, Ceph relies upon networks for OSD peering and
+replication, recovery from faults, and periodic heartbeats. Networking issues
+can cause OSD latency and flapping OSDs. For more information, see `Flapping
+OSDs`_.

-Ensure that Ceph processes and Ceph-dependent processes are connected and/or
-listening. ::
+To make sure that Ceph processes and Ceph-dependent processes are connected and
+listening, run the following commands:

-    netstat -a | grep ceph
-    netstat -l | grep ceph
-    sudo netstat -p | grep ceph
+.. prompt:: bash
+
+   netstat -a | grep ceph
+   netstat -l | grep ceph
+   sudo netstat -p | grep ceph

-Check network statistics. ::
+To check network statistics, run the following command:

-    netstat -s
+.. prompt:: bash
+
+   netstat -s
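+
+Basic connectivity and MTU checks between OSD hosts can also be informative. In
+the following sketch, ``osd-host-2`` is only a placeholder hostname, the
+payload size assumes a 9000-byte MTU ("jumbo frames"), and the ``iperf3`` test
+additionally assumes that an ``iperf3 -s`` server is running on the remote
+host:
+
+.. prompt:: bash
+
+   ping -c 4 osd-host-2
+   ping -c 4 -M do -s 8972 osd-host-2
+   iperf3 -c osd-host-2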

Drive Configuration
-------------------

-A SAS or SATA storage drive should only house one OSD; NVMe drives readily
-handle two or more. Read and write throughput can bottleneck if other processes
-share the drive, including journals / metadata, operating systems, Ceph monitors,
-`syslog` logs, other OSDs, and non-Ceph processes.
+A SAS or SATA storage drive should house only one OSD, but an NVMe drive can
+easily house two or more. However, it is possible for read and write throughput
+to bottleneck if other processes share the drive. Such processes include
+journals / metadata, operating systems, Ceph monitors, ``syslog`` logs, other
+OSDs, and non-Ceph processes.

-Ceph acknowledges writes *after* journaling, so fast SSDs are an
-attractive option to accelerate the response time--particularly when
-using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs.
-By contrast, the ``Btrfs``
-file system can write and journal simultaneously. (Note, however, that
-we recommend against using ``Btrfs`` for production deployments.)
+Because Ceph acknowledges writes *after* journaling, fast SSDs are an
+attractive option for accelerating response time -- particularly when using the
+``XFS`` or ``ext4`` filesystems for legacy FileStore OSDs. By contrast, the
+``Btrfs`` file system can write and journal simultaneously. (However, use of
+``Btrfs`` is not recommended for production deployments.)

.. note:: Partitioning a drive does not change its total throughput or
-   sequential read/write limits. Running a journal in a separate partition
-   may help, but you should prefer a separate physical drive.
+   sequential read/write limits. Throughput might be improved somewhat by
+   running a journal in a separate partition, but it is better still to run
+   such a journal on a separate physical drive.
+
+.. warning:: Reef does not support FileStore. Releases after Reef do not
+   support FileStore. Any information that mentions FileStore is pertinent only
+   to the Quincy release of Ceph and to releases prior to Quincy.
+

Bad Sectors / Fragmented Disk
-----------------------------

-Check your drives for bad blocks, fragmentation, and other errors that can cause
-performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog``
-logs, and ``smartctl`` (from the ``smartmontools`` package).
+Check your drives for bad blocks, fragmentation, and other errors that can
+cause significantly degraded performance. Tools that are useful in checking for
+drive errors include ``dmesg``, ``syslog`` logs, and ``smartctl`` (found in the
+``smartmontools`` package).
+
+.. note:: ``smartmontools`` 7.0 and later provides NVMe stat passthrough and
+   JSON output.
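+
+For a quick health summary of a drive, ``smartctl`` can be run directly against
+the device. The device paths below (``/dev/sda`` for a SATA/SAS device and
+``/dev/nvme0`` for an NVMe device) are only examples; substitute the devices
+that back your OSDs:
+
+.. prompt:: bash
+
+   sudo smartctl -a /dev/sda
+   sudo smartctl -a --json /dev/nvme0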

Co-resident Monitors/OSDs
-------------------------

-Monitors are relatively lightweight processes, but they issue lots of
-``fsync()`` calls,
-which can interfere with other workloads, particularly if monitors run on the
-same drive as an OSD. Additionally, if you run monitors on the same host as
-OSDs, you may incur performance issues related to:
-
-- Running an older kernel (pre-3.0)
-- Running a kernel with no ``syncfs(2)`` syscall.
+Although monitors are relatively lightweight processes, performance issues can
+result when monitors are run on the same host machine as an OSD. Monitors issue
+many ``fsync()`` calls and this can interfere with other workloads. The danger
+of performance issues is especially acute when the monitors are co-resident on
+the same storage drive as an OSD. In addition, if the monitors are running an
+older kernel (pre-3.0) or a kernel with no ``syncfs(2)`` syscall, then multiple
+OSDs running on the same host might make so many commits as to undermine each
+other's performance. This problem sometimes results in what are known as
+"bursty writes".

-In these cases, multiple OSDs running on the same host can drag each other down
-by doing lots of commits. That often leads to the bursty writes.

Co-resident Processes
---------------------

-Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual
-machines and other applications that write data to Ceph while operating on the
-same hardware as OSDs can introduce significant OSD latency. Generally, we
-recommend optimizing hosts for use with Ceph and using other hosts for other
-processes. The practice of separating Ceph operations from other applications
-may help improve performance and may streamline troubleshooting and maintenance.
+Significant OSD latency can result from processes that write data to Ceph (for
+example, cloud-based solutions and virtual machines) while operating on the
+same hardware as OSDs. For this reason, making such processes co-resident with
+OSDs is not generally recommended. Instead, the recommended practice is to
+optimize certain hosts for use with Ceph and use other hosts for other
+processes. This practice of separating Ceph operations from other applications
+might help improve performance and might also streamline troubleshooting and
+maintenance.
+
+Running co-resident processes on the same hardware is sometimes called
+"convergence". When using Ceph, engage in convergence only after careful
+consideration and only if you have the expertise to manage it.
+

Logging Levels
--------------

-If you turned logging levels up to track an issue and then forgot to turn
-logging levels back down, the OSD may be putting a lot of logs onto the disk. If
-you intend to keep logging levels high, you may consider mounting a drive to the
-default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
+Performance issues can result from high logging levels. Operators sometimes
+raise logging levels in order to track an issue and then forget to lower them
+afterwards. In such a situation, OSDs might consume valuable system resources in
+order to write needlessly verbose logs onto the disk. Operators who do intend to
+keep logging levels high are advised to consider mounting a dedicated drive at
+the default logging path (for example, ``/var/log/ceph/$cluster-$name.log``).

Recovery Throttling
-------------------
-- 
2.39.5