From d96ebad94b09a9489bc0f44a157de62f78be3a7d Mon Sep 17 00:00:00 2001
From: Zac Dover
Date: Sun, 4 Dec 2022 02:49:02 +1000
Subject: [PATCH] doc/rados: add prompts to health-checks (3 of 5)

Add unselectable prompts to doc/rados/operations/health-checks.rst,
third 300 lines.

https://tracker.ceph.com/issues/57108

Signed-off-by: Zac Dover
(cherry picked from commit 73e1a295258ebc52dff0ac306a153b1adc1a84ec)
---
 doc/rados/operations/health-checks.rst | 136 ++++++++++++++++---------
 1 file changed, 87 insertions(+), 49 deletions(-)

diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index fa9e55f14ccc0..61e02f1f3527a 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -522,19 +522,25 @@ Score is given in [0-1] range.
 [0.7 .. 0.9] considerable, but safe fragmentation
 [0.9 .. 1.0] severe fragmentation, may impact BlueFS ability to get space from BlueStore
 
-If detailed report of free fragments is required do::
+If a detailed report of free fragments is required, run:
 
-  ceph daemon osd.123 bluestore allocator dump block
+.. prompt:: bash $
+
+   ceph daemon osd.123 bluestore allocator dump block
 
 In case when handling OSD process that is not running fragmentation can be
 inspected with `ceph-bluestore-tool`.
-Get fragmentation score::
+To get the fragmentation score, run:
+
+.. prompt:: bash $
 
-  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score
+   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score
 
-And dump detailed free chunks::
+To dump detailed free chunks, run:
 
-  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump
+.. prompt:: bash $
+
+   ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump
 
 BLUESTORE_LEGACY_STATFS
 _______________________
@@ -547,15 +553,19 @@ not available.
 However, if there is a mix of pre-Nautilus and post-Nautilus OSDs, the cluster
 usage statistics reported by ``ceph df`` will not be accurate.
 
-The old OSDs can be updated to use the new usage tracking scheme by stopping each OSD, running a repair operation, and the restarting it. For example, if ``osd.123`` needed to be updated,::
+The old OSDs can be updated to use the new usage-tracking scheme by stopping each OSD, running a repair operation, and then restarting it. For example, if ``osd.123`` needs to be updated:
+
+.. prompt:: bash $
 
-  systemctl stop ceph-osd@123
-  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
-  systemctl start ceph-osd@123
+   systemctl stop ceph-osd@123
+   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+   systemctl start ceph-osd@123
 
-This warning can be disabled with::
+This warning can be disabled with:
 
-  ceph config set global bluestore_warn_on_legacy_statfs false
+.. prompt:: bash $
+
+   ceph config set global bluestore_warn_on_legacy_statfs false
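+
+If many OSDs still use the legacy scheme, the same stop/repair/start cycle
+can be scripted. The following is a minimal sketch only; it assumes that the
+OSDs to convert are ``osd.123`` through ``osd.125`` (hypothetical IDs) and
+that they all run on the host executing the script:
+
+.. code-block:: bash
+
+   #!/usr/bin/env bash
+   # Repair one OSD at a time so that only a single OSD is down at
+   # any moment. Adjust the ID list to match your cluster.
+   for id in 123 124 125; do
+       systemctl stop "ceph-osd@${id}"
+       ceph-bluestore-tool repair --path "/var/lib/ceph/osd/ceph-${id}"
+       systemctl start "ceph-osd@${id}"
+   done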
 
 BLUESTORE_NO_PER_POOL_OMAP
 __________________________
@@ -568,15 +578,19 @@ based on the most recent deep-scrub.
 
 The old OSDs can be updated to track by pool by stopping each OSD,
 running a repair operation, and the restarting it. For example, if
-``osd.123`` needed to be updated,::
+``osd.123`` needed to be updated:
+
+.. prompt:: bash $
+
+   systemctl stop ceph-osd@123
+   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+   systemctl start ceph-osd@123
 
-  systemctl stop ceph-osd@123
-  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
-  systemctl start ceph-osd@123
+This warning can be disabled with:
 
-This warning can be disabled with::
+.. prompt:: bash $
 
-  ceph config set global bluestore_warn_on_no_per_pool_omap false
+   ceph config set global bluestore_warn_on_no_per_pool_omap false
 
 BLUESTORE_NO_PER_PG_OMAP
 __________________________
@@ -587,15 +601,19 @@ Pacific.
 Per-PG omap enables faster PG removal when PGs migrate.
 
 The older OSDs can be updated to track by PG by stopping each OSD,
 running a repair operation, and the restarting it. For example, if
-``osd.123`` needed to be updated,::
+``osd.123`` needed to be updated:
 
-  systemctl stop ceph-osd@123
-  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
-  systemctl start ceph-osd@123
+.. prompt:: bash $
+
+   systemctl stop ceph-osd@123
+   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+   systemctl start ceph-osd@123
 
-This warning can be disabled with::
+This warning can be disabled with:
 
-  ceph config set global bluestore_warn_on_no_per_pg_omap false
+.. prompt:: bash $
+
+   ceph config set global bluestore_warn_on_no_per_pg_omap false
 
 
 BLUESTORE_DISK_SIZE_MISMATCH
@@ -607,13 +625,15 @@ the OSD crashing in the future.
 
 The OSDs in question should be destroyed and reprovisioned. Care should be
 taken to do this one OSD at a time, and in a way that doesn't put any data at
-risk. For example, if osd ``$N`` has the error,::
+risk. For example, if OSD ``$N`` has the error:
 
-  ceph osd out osd.$N
-  while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
-  ceph osd destroy osd.$N
-  ceph-volume lvm zap /path/to/device
-  ceph-volume lvm create --osd-id $N --data /path/to/device
+.. prompt:: bash $
+
+   ceph osd out osd.$N
+   while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
+   ceph osd destroy osd.$N
+   ceph-volume lvm zap /path/to/device
+   ceph-volume lvm create --osd-id $N --data /path/to/device
 
 BLUESTORE_NO_COMPRESSION
 ________________________
@@ -643,13 +663,17 @@ This alert doesn't require immediate response but corresponding host might
 need additional attention, e.g. upgrading to the latest OS/kernel versions
 and H/W resource utilization monitoring.
 
-This warning can be disabled on all OSDs with::
+This warning can be disabled on all OSDs with:
 
-  ceph config set osd bluestore_warn_on_spurious_read_errors false
+.. prompt:: bash $
+
+   ceph config set osd bluestore_warn_on_spurious_read_errors false
 
-Alternatively, it can be disabled on a specific OSD with::
+Alternatively, it can be disabled on a specific OSD with:
 
-  ceph config set osd.123 bluestore_warn_on_spurious_read_errors false
+.. prompt:: bash $
+
+   ceph config set osd.123 bluestore_warn_on_spurious_read_errors false
 
 
 Device health
@@ -669,14 +693,18 @@ hardware from the system.
 Note that the marking out is normally done automatically if
 ``mgr/devicehealth/self_heal`` is enabled based on the
 ``mgr/devicehealth/mark_out_threshold``.
 
-Device health can be checked with::
+Device health can be checked with:
+
+.. prompt:: bash $
 
-  ceph device info <devid>
+   ceph device info <devid>
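+
+For example, to let the mgr mark failing devices ``out`` automatically, the
+``devicehealth`` options mentioned above can be set as follows. This is only
+a sketch; the threshold shown (2419200 seconds, i.e. four weeks of predicted
+remaining life) is illustrative, not a recommendation:
+
+.. prompt:: bash $
+
+   ceph config set mgr mgr/devicehealth/self_heal true
+   ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200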
 
 Device life expectancy is set by a prediction model run by
-the mgr or an by external tool via the command::
+the mgr or by an external tool via the command:
 
-  ceph device set-life-expectancy <devid> <from> <to>
+.. prompt:: bash $
+
+   ceph device set-life-expectancy <devid> <from> <to>
 
 You can change the stored life expectancy manually, but that usually
 doesn't accomplish anything as whatever tool originally set it will
@@ -731,16 +759,20 @@ requests to be serviced.
 Problematic PG states include *peering*, *stale*, *incomplete*, and the
 lack of *active* (if those conditions do not clear quickly).
 
-Detailed information about which PGs are affected is available from::
+Detailed information about which PGs are affected is available from:
 
-  ceph health detail
+.. prompt:: bash $
+
+   ceph health detail
 
 In most cases the root cause is that one or more OSDs is currently
 down; see the discussion for ``OSD_DOWN`` above.
 
-The state of specific problematic PGs can be queried with::
+The state of specific problematic PGs can be queried with:
 
-  ceph tell <pgid> query
+.. prompt:: bash $
+
+   ceph tell <pgid> query
 
 PG_DEGRADED
 ___________
@@ -754,16 +786,20 @@ Specifically, one or more PGs:
 * has the *degraded* or *undersized* flag set, meaning there are not
   enough instances of that placement group in the cluster;
 * has not had the *clean* flag set for some time.
 
-Detailed information about which PGs are affected is available from::
+Detailed information about which PGs are affected is available from:
 
-  ceph health detail
+.. prompt:: bash $
+
+   ceph health detail
 
 In most cases the root cause is that one or more OSDs is currently
 down; see the dicussion for ``OSD_DOWN`` above.
 
-The state of specific problematic PGs can be queried with::
+The state of specific problematic PGs can be queried with:
 
-  ceph tell <pgid> query
+.. prompt:: bash $
+
+   ceph tell <pgid> query
 
 PG_RECOVERY_FULL
 ________________
@@ -832,10 +868,12 @@ can be caused by RGW bucket index objects that do not have automatic
 resharding enabled. Please see :ref:`RGW Dynamic Bucket Index Resharding
 <rgw_dynamic_bucket_index_resharding>` for more information on resharding.
 
-The thresholds can be adjusted with::
+The thresholds can be adjusted with:
 
-  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys>
-  ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes>
+.. prompt:: bash $
+
+   ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys>
+   ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes>
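+
+For example, to raise the key-count threshold to 500000 keys and the size
+threshold to 2 GiB, run the commands below. These values are illustrative
+only, not recommendations (at the time of writing the defaults are 200000
+keys and 1 GiB):
+
+.. prompt:: bash $
+
+   ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 500000
+   ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold 2147483648
 
 CACHE_POOL_NEAR_FULL
 ____________________
-- 
2.39.5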