doc/rados/troubleshooting: description of upgrade procedure - stopping/starting OSDs

author Anthony D'Atri <anthony.datri@gmail.com>

Thu, 19 Nov 2020 19:48:50 +0000 (11:48 -0800)

committer Anthony D'Atri <anthony.datri@gmail.com>

Fri, 20 Nov 2020 07:39:42 +0000 (23:39 -0800)
diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst

index 9347210eca9dcfb3c5d6cde1586ba71a2fc041ad..0264896e1c382f89c5d9ac0d8145f48207cf79b7 100644 (file)
--- a/doc/rados/troubleshooting/troubleshooting-osd.rst
+++ b/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -12,8 +12,6 @@ are running properly, because networks may have a significant impact on OSD
  operation and performance. Look for dropped packets on the host side
  and CRC errors on the switch side.
  
-
-
  Obtaining Data About OSDs
  =========================
  
@@ -60,7 +58,6 @@ The admin socket, among other things, allows you to:
  - Dump operations in flight
  - Dump perfcounters
  
-
  Display Freespace
  -----------------
  
@@ -71,7 +68,6 @@ Filesystem issues may arise. To display your file system's free space, execute
  
  Execute ``df --help`` for additional usage.
  
-
  I/O Statistics
  --------------
  
@@ -79,7 +75,6 @@ Use `iostat`_ to identify I/O-related issues. ::
  
         iostat -x
  
-
  Diagnostic Messages
  -------------------
  
@@ -88,7 +83,6 @@ or ``tail``.  For example::
  
         dmesg | grep scsi
  
-
  Stopping w/out Rebalancing
  ==========================
  
@@ -110,24 +104,28 @@ Or an entire CRUSH bucket at a time.  Say you're going to take down
  
         ceph osd set-group noout prod-ceph-data1701
  
-Once the flag is set you can begin stopping the OSDs within the
-failure domain that requires maintenance work. ::
+Once the flag is set you can stop the OSDs and any other colocated Ceph
+services within the failure domain that requires maintenance work. ::
  
-       stop ceph-osd id={num}
+       systemctl stop ceph\*.service ceph\*.target
  
  .. note:: Placement groups within the OSDs you stop will become ``degraded``
     while you are addressing issues with within the failure domain.
  
-Once you have completed your maintenance, restart the OSDs. ::
+Once you have completed your maintenance, restart the OSDs and any other
+daemons.  If you rebooted the host as part of the maintenance, these should
+come back on their own without intervention. ::
  
-       start ceph-osd id={num}
+       sudo systemctl start ceph.target
  
-Finally, you must unset the cluster from ``noout``. ::
+Finally, you must unset the cluster-wide ``noout`` flag::
  
         ceph osd unset noout
         ceph osd unset-group noout prod-ceph-data1701
  
-
+Note that most Linux distributions that Ceph supports today employ ``systemd``
+for service management.  For other or older operating systems you may need
+to issue equivalent ``service`` or ``start``/``stop`` commands.
  
  .. _osd-not-running:
  
@@ -184,7 +182,6 @@ If you start your cluster and an OSD won't start, check the following:
    may activate connection tracking anyway, so a "set and forget" strategy for
    the tunables is advised.  On modern systems this will not consume appreciable
    resources.
-  
  
  - **Kernel Version:** Identify the kernel version and distribution you
    are using. Ceph uses some third party tools by default, which may be
@@ -202,7 +199,6 @@ If you start your cluster and an OSD won't start, check the following:
    release being run, ``ceph.conf`` (with secrets XXX'd out),
    your monitor status output and excerpts from your log file(s).
  
-
  An OSD Failed
  -------------
  
@@ -224,7 +220,6 @@ or ::
  
         ceph osd tree down
  
-
  If there is a drive
  failure or other fault preventing ``ceph-osd`` from functioning or
  restarting, an error message should be present in its log file under
@@ -241,7 +236,6 @@ unexpected error), search the archives and tracker as above, and
  report it to the `ceph-devel`_ email list if there's no clear fix or
  existing bug.
  
-
  No Free Drive Space
  -------------------
  
@@ -317,11 +311,10 @@ some space deleting a few placement group directories in the full OSD.
  
  See `Monitor Config Reference`_ for additional details.
  
-
  OSDs are Slow/Unresponsive
  ==========================
  
-A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+A common issue involves slow or unresponsive OSDs. Ensure that you
  have eliminated other troubleshooting possibilities before delving into OSD
  performance issues. For example, ensure that your network(s) is working properly
  and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
@@ -330,7 +323,6 @@ and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
     recovering OSDs from using up system resources so that ``up`` and ``in``
     OSDs are not available or are otherwise slow.
  
-
  Networking Issues
  -----------------
  
@@ -350,7 +342,6 @@ Check network statistics. ::
  
         netstat -s
  
-
  Drive Configuration
  -------------------
  
@@ -370,7 +361,6 @@ we recommend against using ``Btrfs`` for production deployments.)
     sequential read/write limits. Running a journal in a separate partition
     may help, but you should prefer a separate physical drive.
  
-
  Bad Sectors / Fragmented Disk
  -----------------------------
  
@@ -378,7 +368,6 @@ Check your drives for bad blocks, fragmentation, and other errors that can cause
  performance to drop substantially.  Invaluable tools include ``dmesg``, ``syslog``
  logs, and ``smartctl`` (from the ``smartmontools`` package).
  
-
  Co-resident Monitors/OSDs
  -------------------------
  
@@ -394,7 +383,6 @@ OSDs, you may incur performance issues related to:
  In these cases, multiple OSDs running on the same host can drag each other down
  by doing lots of commits. That often leads to the bursty writes.
  
-
  Co-resident Processes
  ---------------------
  
@@ -405,7 +393,6 @@ recommend optimizing hosts for use with Ceph and using other hosts for other
  processes. The practice of separating Ceph operations from other applications
  may help improve performance and may streamline troubleshooting and maintenance.
  
-
  Logging Levels
  --------------
  
@@ -414,7 +401,6 @@ logging levels back down, the OSD may be putting a lot of logs onto the disk. If
  you intend to keep logging levels high, you may consider mounting a drive to the
  default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
  
-
  Recovery Throttling
  -------------------
  
@@ -422,21 +408,18 @@ Depending upon your configuration, Ceph may reduce recovery rates to maintain
  performance or it may increase recovery rates to the point that recovery
  impacts OSD performance. Check to see if the OSD is recovering.
  
-
  Kernel Version
  --------------
  
  Check the kernel version you are running. Older kernels may not receive
  new backports that Ceph depends upon for better performance.
  
-
  Kernel Issues with SyncFS
  -------------------------
  
  Try running one OSD per host to see if performance improves. Old kernels
  might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
  
-
  Filesystem Issues
  -----------------
  
@@ -454,7 +437,6 @@ For more information, see `Filesystem Recommendations`_.
  
  .. _Filesystem Recommendations: ../configuration/filesystem-recommendations
  
-
  Insufficient RAM
  ----------------
  
@@ -467,7 +449,6 @@ when OSDs experience recovery their memory utilization spikes. If
  there is insufficient RAM available, OSD performance will slow considerably
  and the daemons may even crash or be killed by the Linux ``OOM Killer``.
  
-
  Blocked Requests or Slow Requests
  ---------------------------------
  
@@ -485,7 +466,6 @@ New versions of Ceph complain about ``slow requests``::
         {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
         {date} {osd.num}  [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
  
-
  Possible causes include:
  
  - A failing drive (check ``dmesg`` output)