From 6bac77e9609abbd6cce470ceabed90c4d5fd9986 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Wed, 26 Jul 2017 22:05:35 -0400
Subject: [PATCH] doc/rados/operations/health-checks: osd section

First paragraph: explain what the error means.  Second or later
paragraph: describe steps to fix or mitigate.

Signed-off-by: Sage Weil
---
 doc/rados/operations/health-checks.rst | 136 ++++++++++++++++++++++++-
 1 file changed, 134 insertions(+), 2 deletions(-)

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index 4dced94e29b..e5156fbe3af 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -28,7 +28,14 @@ OSDs
 OSD_DOWN
 ________

-One or more OSDs are marked down
+One or more OSDs are marked down.  The ceph-osd daemon may have been
+stopped, or peer OSDs may be unable to reach the OSD over the network.
+Common causes include a stopped or crashed daemon, a down host, or a
+network outage.
+
+Verify the host is healthy, the daemon is started, and the network is
+functioning.  If the daemon has crashed, the daemon log file
+(``/var/log/ceph/ceph-osd.*``) may contain debugging information.

 OSD_<crush type>_DOWN
 _____________________
@@ -41,50 +48,175 @@ all OSDs on a host.
 OSD_ORPHAN
 __________

+An OSD is referenced in the CRUSH map hierarchy but does not exist.
+
+The OSD can be removed from the CRUSH hierarchy with::
+
+  ceph osd crush rm osd.<id>

 OSD_OUT_OF_ORDER_FULL
 _____________________

+The utilization thresholds for `backfillfull`, `nearfull`, `full`,
+and/or `failsafe_full` are not ascending.  In particular, we expect
+`backfillfull < nearfull`, `nearfull < full`, and `full <
+failsafe_full`.
+
+The thresholds can be adjusted with::
+
+  ceph osd set-backfillfull-ratio <ratio>
+  ceph osd set-nearfull-ratio <ratio>
+  ceph osd set-full-ratio <ratio>
+
 OSD_FULL
 ________

+One or more OSDs has exceeded the `full` threshold and is preventing
+the cluster from servicing writes.
+
+Utilization by pool can be checked with::
+
+  ceph df
+
+The currently defined `full` ratio can be seen with::
+
+  ceph osd dump | grep full_ratio
+
+A short-term workaround to restore write availability is to raise the full
+threshold by a small amount::
+
+  ceph osd set-full-ratio <ratio>
+
+New storage should be added to the cluster by deploying more OSDs or
+existing data should be deleted in order to free up space.
+
 OSD_BACKFILLFULL
 ________________

+One or more OSDs has exceeded the `backfillfull` threshold, which will
+prevent data from rebalancing to this device.  This is an early
+warning that rebalancing may not be able to complete and that the
+cluster is approaching full.
+
+Utilization by pool can be checked with::
+
+  ceph df

 OSD_NEARFULL
 ____________

+One or more OSDs has exceeded the `nearfull` threshold.  This is an early
+warning that the cluster is approaching full.
+
+Utilization by pool can be checked with::
+
+  ceph df

 OSDMAP_FLAGS
 ____________

-
+One or more cluster flags of interest has been set.  These flags include:
+
+* *full* - the cluster is flagged as full and cannot service writes
+* *pauserd*, *pausewr* - paused reads or writes
+* *noup* - OSDs are not allowed to start
+* *nodown* - OSD failure reports are being ignored, such that the
+  monitors will not mark OSDs `down`
+* *noin* - OSDs that were previously marked `out` will not be marked
+  back `in` when they start
+* *noout* - down OSDs will not automatically be marked out after the
+  configured interval
+* *nobackfill*, *norecover*, *norebalance* - recovery or data
+  rebalancing is suspended
+* *noscrub*, *nodeep-scrub* - scrubbing is disabled
+* *notieragent* - cache tiering activity is suspended
+
+With the exception of *full*, these flags can be set or cleared with::
+
+  ceph osd set <flag>
+  ceph osd unset <flag>
+
 OSD_FLAGS
 _________

+One or more OSDs has a per-OSD flag of interest set.  These flags include:
+
+* *noup*: OSD is not allowed to start
+* *nodown*: failure reports for this OSD will be ignored
+* *noin*: if this OSD was previously marked `out` automatically
+  after a failure, it will not be marked in when it starts
+* *noout*: if this OSD is down it will not automatically be marked
+  `out` after the configured interval
+
+Per-OSD flags can be set and cleared with::
+
+  ceph osd add-<flag> <osd-id(s)>
+  ceph osd rm-<flag> <osd-id(s)>
+
+For example, ::
+
+  ceph osd rm-nodown osd.123

 OLD_CRUSH_TUNABLES
 __________________

+The CRUSH map is using very old settings and should be updated.  The
+oldest set of tunables that can be used (i.e., the oldest client version
+that can connect to the cluster) without triggering this health warning
+is determined by the ``mon_crush_min_required_version`` config option.
+See :doc:`/rados/operations/crush-map/#tunables` for more information.

 OLD_CRUSH_STRAW_CALC_VERSION
 ____________________________

+The CRUSH map is using an older, non-optimal method for calculating
+intermediate weight values for ``straw`` buckets.
+
+The CRUSH map should be updated to use the newer method
+(``straw_calc_version=1``).  See
+:doc:`/rados/operations/crush-map/#tunables` for more information.

 CACHE_POOL_NO_HIT_SET
 _____________________

+One or more cache pools is not configured with a *hit set* to track
+utilization, which will prevent the tiering agent from identifying
+cold objects to flush and evict from the cache.
+
+Hit sets can be configured on the cache pool with::

+  ceph osd pool set <poolname> hit_set_type <type>
+  ceph osd pool set <poolname> hit_set_period <period-in-seconds>
+  ceph osd pool set <poolname> hit_set_count <number-of-hitsets>
+  ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate>

 OSD_NO_SORTBITWISE
 __________________

+No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
+been set.
+
+The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
+OSDs can start.  You can safely set the flag with::
+
+  ceph osd set sortbitwise

 POOL_FULL
 _________

+One or more pools has reached its quota and is no longer allowing writes.
+
+Pool quotas and utilization can be seen with::
+
+  ceph df detail
+
+You can either raise the pool quota with::
+
+  ceph osd pool set-quota <poolname> max_objects <num>
+  ceph osd pool set-quota <poolname> max_bytes <bytes>
+
+or delete some existing data to reduce utilization.

 Data health (pools & placement groups)
 ------------------------------
-- 
2.39.5
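
The mitigation steps documented above can be strung together into a short
console session.  What follows is a minimal, hypothetical walk-through of
handling an ``OSD_FULL`` warning, based on the commands shown in this patch
plus ``ceph health detail``; the ratios (0.95 is commonly the default
``full_ratio``, 0.97 a temporary bump) are illustrative values only, not
recommendations::

  # Identify the affected OSDs and check utilization by pool.
  ceph health detail
  ceph df

  # Inspect the currently defined full ratio.
  ceph osd dump | grep full_ratio

  # Short-term workaround: raise the full threshold slightly to restore
  # write availability while capacity is added or data is deleted.
  ceph osd set-full-ratio 0.97

  # After adding OSDs or freeing space, restore the original threshold.
  ceph osd set-full-ratio 0.95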