operation and performance. Look for dropped packets on the host side
and CRC errors on the switch side.
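For example, on the host side the interface counters will show drops and
errors; ``eth0`` below is only a placeholder for your public or cluster
network interface::

    # per-interface RX/TX, error, and drop counters
    ip -s link show eth0
    # NIC driver statistics, where the driver exposes them
    sudo ethtool -S eth0 | grep -i -e err -e drop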
-
-
Obtaining Data About OSDs
=========================
- Dump operations in flight
- Dump perfcounters
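Both are available from the OSD's admin socket on the host where the daemon
runs; ``osd.0`` below is only a placeholder::

    # operations currently in flight on this OSD
    ceph daemon osd.0 dump_ops_in_flight
    # internal performance counters
    ceph daemon osd.0 perf dump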
-
Display Freespace
-----------------
Execute ``df --help`` for additional usage.
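For example, to show free space for all mounted filesystems in
human-readable units::

    df -h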
-
I/O Statistics
--------------
iostat -x
-
Diagnostic Messages
-------------------
dmesg | grep scsi
-
Stopping w/out Rebalancing
==========================
ceph osd set-group noout prod-ceph-data1701
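If you would rather set the flag for the whole cluster instead of a single
failure domain, the cluster-wide form (unset again at the end of this
procedure) is::

    ceph osd set noout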
-Once the flag is set you can begin stopping the OSDs within the
-failure domain that requires maintenance work. ::
+Once the flag is set you can stop the OSDs and any other colocated Ceph
+services within the failure domain that requires maintenance work. ::
- stop ceph-osd id={num}
+ systemctl stop ceph\*.service ceph\*.target
.. note:: Placement groups within the OSDs you stop will become ``degraded``
   while you are addressing issues within the failure domain.
-Once you have completed your maintenance, restart the OSDs. ::
+Once you have completed your maintenance, restart the OSDs and any other
+daemons. If you rebooted the host as part of the maintenance, these should
+come back on their own without intervention. ::
- start ceph-osd id={num}
+ sudo systemctl start ceph.target
-Finally, you must unset the cluster from ``noout``. ::
+Finally, you must unset the cluster-wide ``noout`` flag::
ceph osd unset noout
ceph osd unset-group noout prod-ceph-data1701
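If you want to confirm that no flags remain set, the cluster status and the
OSD map both report them; for example::

    ceph -s
    ceph osd dump | grep flags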
-
+Note that most Linux distributions that Ceph supports today employ ``systemd``
+for service management. For other or older operating systems you may need
+to issue equivalent ``service`` or ``start``/``stop`` commands.
.. _osd-not-running:
may activate connection tracking anyway, so a "set and forget" strategy for
the tunables is advised. On modern systems this will not consume appreciable
resources.
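The tunables in question are typically the netfilter connection-tracking
limits; a minimal sketch, with an illustrative value that you should size for
your own environment::

    # current ceiling on tracked connections
    sysctl net.netfilter.nf_conntrack_max
    # persist a larger ceiling (value shown is only an example)
    echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/90-conntrack.conf
    sudo sysctl -p /etc/sysctl.d/90-conntrack.conf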
-
- **Kernel Version:** Identify the kernel version and distribution you
are using. Ceph uses some third party tools by default, which may be
release being run, ``ceph.conf`` (with secrets XXX'd out),
your monitor status output and excerpts from your log file(s).
-
An OSD Failed
-------------
ceph osd tree down
-
If there is a drive
failure or other fault preventing ``ceph-osd`` from functioning or
restarting, an error message should be present in its log file under
report it to the `ceph-devel`_ email list if there's no clear fix or
existing bug.
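For example, assuming the default log location and that ``osd.7`` is the
failed daemon (substitute the real ID)::

    # why did the daemon exit or fail to start?
    sudo systemctl status ceph-osd@7
    # look for a stack trace or I/O errors near the end of the log
    sudo tail -n 200 /var/log/ceph/ceph-osd.7.log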
-
No Free Drive Space
-------------------
See `Monitor Config Reference`_ for additional details.
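To see how much space remains and how close OSDs are to the configured
ratios, for example::

    # usage summarized per pool and per OSD
    ceph df
    ceph osd df
    # nearfull/backfillfull/full ratios recorded in the OSD map
    ceph osd dump | grep -i ratio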
-
OSDs are Slow/Unresponsive
==========================
-A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+A common issue involves slow or unresponsive OSDs. Ensure that you
have eliminated other troubleshooting possibilities before delving into OSD
performance issues. For example, ensure that your networks are working properly
and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
recovering OSDs from using up system resources so that ``up`` and ``in``
OSDs are not available or are otherwise slow.
-
Networking Issues
-----------------
netstat -s
-
Drive Configuration
-------------------
sequential read/write limits. Running a journal in a separate partition
may help, but you should prefer a separate physical drive.
-
Bad Sectors / Fragmented Disk
-----------------------------
performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog``
logs, and ``smartctl`` (from the ``smartmontools`` package).
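For example, assuming the suspect device is ``/dev/sdc``::

    # kernel I/O errors with human-readable timestamps
    dmesg -T | grep -i error
    # SMART health, attributes, and the drive's error log
    sudo smartctl -a /dev/sdc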
-
Co-resident Monitors/OSDs
-------------------------
In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.
-
Co-resident Processes
---------------------
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.
-
Logging Levels
--------------
you intend to keep logging levels high, you may consider mounting a drive to the
default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
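Debug levels can also be inspected and lowered at runtime without restarting
the daemon; the subsystem and values below are illustrative::

    # current level for the osd subsystem on one daemon
    ceph tell osd.0 config get debug_osd
    # reduce it at runtime
    ceph tell osd.0 config set debug_osd 1/5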
-
Recovery Throttling
-------------------
performance or it may increase recovery rates to the point that recovery
impacts OSD performance. Check to see if the OSD is recovering.
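For example, cluster status shows active recovery and backfill, and the usual
throttle is the backfill limit (option name assumed for recent releases)::

    # look for recovery/backfill activity and rates
    ceph -s
    # current throttle on concurrent backfills per OSD
    ceph config get osd osd_max_backfills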
-
Kernel Version
--------------
Check the kernel version you are running. Older kernels may not receive
new backports that Ceph depends upon for better performance.
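For example, to see the running kernel and the distribution release::

    uname -r
    cat /etc/os-release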
-
Kernel Issues with SyncFS
-------------------------
Try running one OSD per host to see if performance improves. Old kernels
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
-
Filesystem Issues
-----------------
.. _Filesystem Recommendations: ../configuration/filesystem-recommendations
-
Insufficient RAM
----------------
there is insufficient RAM available, OSD performance will slow considerably
and the daemons may even crash or be killed by the Linux ``OOM Killer``.
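For example, to check host memory pressure and the per-daemon memory target
(BlueStore OSDs; option name assumed)::

    free -h
    ceph config get osd osd_memory_target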
-
Blocked Requests or Slow Requests
---------------------------------
{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
{date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
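Slow requests are also surfaced in cluster health, and each OSD keeps a record
of its recent slowest operations; for example, assuming ``osd.0``::

    ceph health detail
    # recently completed ops with per-phase timestamps
    ceph daemon osd.0 dump_historic_ops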
-
Possible causes include:
- A failing drive (check ``dmesg`` output)