doc/rados: edit troubleshooting-mon.rst (4 of x)

author Zac Dover <zac.dover@proton.me>

Mon, 9 Oct 2023 08:55:07 +0000 (18:55 +1000)

committer Zac Dover <zac.dover@proton.me>

Mon, 9 Oct 2023 19:17:44 +0000 (05:17 +1000)
diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst

index bf1a9be3d7e9af58d642b34ddfb2a547ffba2ea8..48ce50813df14e2eb790af319a33ce4a907ae17e 100644 (file)
--- a/doc/rados/troubleshooting/troubleshooting-mon.rst
+++ b/doc/rados/troubleshooting/troubleshooting-mon.rst
@@ -468,12 +468,16 @@ monitor is fully synchronized, it will be able to serve clients.
  Recovery using OSDs
  -------------------
  
-But what if all monitors fail at the same time? Since users are encouraged to
-deploy at least three (and preferably five) monitors in a Ceph cluster, the chance of simultaneous
-failure is rare. But unplanned power-downs in a data center with improperly
-configured disk/fs settings could fail the underlying file system, and hence
-kill all the monitors. In this case, we can recover the monitor store with the
-information stored in OSDs.
+Even if all monitors fail at the same time, it is possible to recover the
+monitor store by using information stored in OSDs. You are encouraged to deploy
+at least three (and preferably five) monitors in a Ceph cluster. In such a
+deployment, complete monitor failure is unlikely. However, unplanned power loss
+in a data center whose disk settings or filesystem settings are improperly
+configured could cause the underlying filesystem to fail, and this could kill
+all of the monitors. In such a case, data in the OSDs can be used to recover
+the monitors. The following script can be used to recover the monitor store
+in this way:
+
  
  .. code-block:: bash
  
@@ -524,124 +528,142 @@ information stored in OSDs.
    mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
    chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
  
-The steps above
+This script performs the following steps:
+
+#. Collects the map from each OSD host.
+#. Rebuilds the store.
+#. Fills the entities in the keyring file with appropriate capabilities.
+#. Replaces the corrupted store on ``mon.foo`` with the recovered copy.
  
-#. collect the map from all OSD hosts,
-#. then rebuild the store,
-#. fill the entities in keyring file with appropriate caps
-#. replace the corrupted store on ``mon.foo`` with the recovered copy.
  
  Known limitations
  ~~~~~~~~~~~~~~~~~
  
-Following information are not recoverable using the steps above:
+The above recovery tool is unable to recover the following information:
  
-- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command
-  are recovered from the OSD's copy. And the ``client.admin`` keyring is imported
-  using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing
-  in the recovered monitor store. You might need to re-add them manually.
+- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph
+  auth add`` command are recovered from the OSD's copy, and the
+  ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However,
+  the MDS keyrings and all other keyrings will be missing in the recovered
+  monitor store. You might need to manually re-add them.
  
-- **creating pools**: If any RADOS pools were in the process of being creating, that state is lost.  The recovery tool assumes that all pools have been created.  If there are PGs that are stuck in the 'unknown' after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command.  Note that this will create an *empty* PG, so only do this if you know the pool is empty.
-
-- **MDS Maps**: the MDS maps are lost.
+- **Creating pools**: If any RADOS pools were in the process of being created,
+  that state is lost. The recovery tool operates on the assumption that all
+  pools have already been created. If there are PGs that are stuck in the
+  'unknown' state after the recovery for a partially created pool, you can
+  force creation of the *empty* PG by running the ``ceph osd force-create-pg``
+  command. Note that this will create an *empty* PG, so take this action only
+  if you know the pool is empty.
  
+- **MDS Maps**: The MDS maps are lost.
  
  
  Everything Failed! Now What?
-=============================
+============================
  
  Reaching out for help
-----------------------
+---------------------
  
-You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
-and on ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
-sure you have grabbed your logs and have them ready if someone asks: the faster
-the interaction and lower the latency in response, the better chances everyone's
-time is optimized.
+You can find help on IRC in #ceph and #ceph-devel on OFTC (server
+irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
+sure that you have prepared your logs and that you have them ready upon
+request.
+
+See https://ceph.io/en/community/connect/ for current (as of October 2023)
+information on getting in contact with the upstream Ceph community.
  
  
  Preparing your logs
----------------------
+-------------------
+
+The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``.
+However, if they are not there, you can find their current location by running
+the following command:
  
-Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
-may want them. However, your logs may not have the necessary information. If
-you don't find your monitor logs at their default location, you can check
-where they should be by running::
+.. prompt:: bash
  
-  ceph-conf --name mon.FOO --show-config-value log_file
+   ceph-conf --name mon.FOO --show-config-value log_file
  
-The amount of information in the logs are subject to the debug levels being
-enforced by your configuration files. If you have not enforced a specific
-debug level then Ceph is using the default levels and your logs may not
-contain important information to track down you issue.
-A first step in getting relevant information into your logs will be to raise
-debug levels. In this case we will be interested in the information from the
-monitor.
-Similarly to what happens on other components, different parts of the monitor
-will output their debug information on different subsystems.
+The amount of information in the logs is determined by the debug levels in the
+cluster's configuration files. If Ceph is using the default debug levels, then
+your logs might be missing important information that would help the upstream
+Ceph community address your issue.
  
-You will have to raise the debug levels of those subsystems more closely
-related to your issue. This may not be an easy task for someone unfamiliar
-with troubleshooting Ceph. For most situations, setting the following options
-on your monitors will be enough to pinpoint a potential source of the issue::
+To make sure your monitor logs contain relevant information, you can raise
+debug levels. Here we are interested in information from the monitors.  As with
+other components, the monitors have different parts that output their debug
+information on different subsystems.
+
+If you are an experienced Ceph troubleshooter, we recommend raising the debug
+levels of the most relevant subsystems. Of course, this approach might not be
+easy for beginners. In most cases, however, setting the following debug levels
+on your monitors yields enough information to address the issue::
  
        debug mon = 10
        debug ms = 1
  
-If we find that these debug levels are not enough, there's a chance we may
-ask you to raise them or even define other debug subsystems to obtain infos
-from -- but at least we started off with some useful information, instead
-of a massively empty log without much to go on with.
+Sometimes these debug levels do not yield enough information. In such cases,
+members of the upstream Ceph community might ask you to make additional changes
+to these or to other debug levels. In any case, it is better for us to receive
+at least some useful information than to receive an empty log.
+
  
  Do I need to restart a monitor to adjust debug levels?
  ------------------------------------------------------
  
-No. You may do it in one of two ways:
+No, restarting a monitor is not necessary. Debug levels may be adjusted by
+using one of two methods, depending on whether or not there is a quorum:
  
-You have quorum
+There is a quorum
  
-  Either inject the debug option into the monitor you want to debug::
+  Either inject the debug option into the specific monitor that needs to
+  be debugged::
  
          ceph tell mon.FOO config set debug_mon 10/10
  
-  or into all monitors at once::
+  Or inject it into all monitors at once::
  
          ceph tell mon.* config set debug_mon 10/10
  
-No quorum
  
-  Use the monitor's admin socket and directly adjust the configuration
-  options::
+There is no quorum
+
+  Use the admin socket of the specific monitor that needs to be debugged
+  and directly adjust the monitor's configuration options::
  
        ceph daemon mon.FOO config set debug_mon 10/10
  
  
-Going back to default values is as easy as rerunning the above commands
-using the debug level ``1/10`` instead.  You can check your current
-values using the admin socket and the following commands::
+To return the debug levels to their default values, run the above commands
+using the debug level ``1/10`` rather than ``10/10``. To check a monitor's
+current values, use the admin socket and run either of the following commands:
  
-      ceph daemon mon.FOO config show
+  .. prompt:: bash
  
-or::
+     ceph daemon mon.FOO config show
+
+or:
+
+  .. prompt:: bash
+
+     ceph daemon mon.FOO config get 'OPTION_NAME'
  
-      ceph daemon mon.FOO config get 'OPTION_NAME'
  
  
-Reproduced the problem with appropriate debug levels. Now what?
-----------------------------------------------------------------
+I reproduced the problem with appropriate debug levels. Now what?
+-----------------------------------------------------------------
  
-Ideally you would send us only the relevant portions of your logs.
-We realise that figuring out the corresponding portion may not be the
-easiest of tasks. Therefore, we won't hold it to you if you provide the
-full log, but common sense should be employed. If your log has hundreds
-of thousands of lines, it may get tricky to go through the whole thing,
-specially if we are not aware at which point, whatever your issue is,
-happened. For instance, when reproducing, keep in mind to write down
-current time and date and to extract the relevant portions of your logs
-based on that.
+We prefer that you send us only the portions of your logs that are relevant to
+your monitor problems. Of course, it might not be easy for you to determine
+which portions are relevant, so we are willing to accept complete and
+unabridged logs. However, we request that you avoid sending logs containing
+hundreds of thousands of lines with no additional clarifying information. One
+common-sense way of making our task easier is to write down the current time
+and date when you are reproducing the problem and then extract portions of your
+logs based on that information.
  
-Finally, you should reach out to us on the mailing lists, on IRC or file
-a new issue on the `tracker`_.
+Finally, reach out to us on the mailing lists, on IRC, or on Slack, or file a
+new issue on the `tracker`_.
  
  .. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
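
The advice in the new text about noting the time of a reproduction and extracting log portions based on it could be followed with something like the sketch below. It is not part of this commit. The function name ``extract_window`` is hypothetical, and the sketch assumes that every log line begins with a sortable timestamp such as ``2023-10-09T08:55:07.123+0000``; logs in another timestamp format, or with multi-line entries, would need a different filter.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Ceph tool): print the lines of a log file whose
# first field sorts between START and END, inclusive, as plain strings.
# Assumption: each line starts with an ISO-8601-style timestamp, so that
# string order matches time order. Appending "" forces string comparison.
# Note: a bound given only to the minute (e.g. 2023-10-09T09:05) sorts before
# every full timestamp within that minute, so such an END excludes that minute.
extract_window() {
    local start="$1" end="$2" logfile="$3"
    awk -v s="$start" -v e="$end" \
        '($1 "") >= (s "") && ($1 "") <= (e "")' "$logfile"
}
```

For example, ``extract_window 2023-10-09T08:50 2023-10-09T09:05 /var/log/ceph/ceph-mon.FOO.log > mon-excerpt.log`` would keep only the lines logged between 08:50 and 09:05 on that day, with ``FOO`` standing in for the monitor ID as elsewhere in the document.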