From 3dfdf939a83ee7737b67cc2ec0c23a13fbecd9e2 Mon Sep 17 00:00:00 2001
From: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Date: Fri, 20 Mar 2026 21:04:20 +0700
Subject: [PATCH] doc/monitoring: Improve index.rst

Improve consistency and formatting. Fix typos, markup errors, missing
words. Fix wrong metric explanation. Change a broken MD link to an RST
link.

Signed-off-by: Ville Ojamo
---
 doc/monitoring/index.rst | 237 +++++++++++++++++----------------------
 1 file changed, 105 insertions(+), 132 deletions(-)

diff --git a/doc/monitoring/index.rst b/doc/monitoring/index.rst
index 38c628ee804b..c6b44b71cd21 100644
--- a/doc/monitoring/index.rst
+++ b/doc/monitoring/index.rst
@@ -1,7 +1,7 @@
 .. _monitoring:
 
 ===================
-Monitoring overview
+Monitoring Overview
 ===================
 
 This document explains the Ceph monitoring
@@ -12,36 +12,36 @@ can leverage Prometheus, Alertmanager, Grafana, and scripting to create customiz
 monitoring tools.
 
 
-Ceph Monitoring stack
+Ceph Monitoring Stack
 =====================
 
 Ceph deploys an integrated monitoring stack as described in the
 :ref:`Monitoring Services ` section of
-the ``cephadm`` documentation. Deployments with external fleetwide monitoring
+the ``cephadm`` documentation. Deployments with external fleet-wide monitoring
 and observability systems using these or other tools may choose to disable the
 stack that Ceph deploys by default.
 
 
-Ceph metrics
+Ceph Metrics
 ============
 
 Many Ceph metrics are gathered from the performance counters exposed by each
-Ceph daemon. These :doc:`../dev/perf_counters` are native Ceph metrics.
+Ceph daemon. These :ref:`Perf Counters` are native Ceph metrics.
 Performance counters are rendered into standard Prometheus metrics by the
 ``ceph_exporter`` daemon. This daemon runs on every Ceph cluster host and
 exposes an endpoint where performance counters exposed by Ceph daemons running
 on that host are presented in the form of Prometheus metrics.
-In addition to the ``ceph_exporter`` the Ceph Manager ``prometheus`` module
+In addition to the ``ceph_exporter`` service, the Ceph Manager ``prometheus`` module
 exposes metrics relating to the Ceph cluster as a whole.
 
 Ceph provides a Prometheus endpoint from which one can obtain the complete list
-of available metrics, or against which admins, Grafana, and Alertmanager can exeute queries.
+of available metrics, or against which admins, Grafana, and Alertmanager can execute queries.
 
 Prometheus (and related systems) accept data queries formatted as PromQL
 expressions. Expansive documentation of PromQL can be
-viewed [here](https://prometheus.io/docs/prometheus/latest/querying/basics/) and
+viewed in the `PromQL documentation`_ and
 several excellent books can be found at the usual sources of digital and print
 books.
 
 We will explore a number of PromQL queries below. Use the following command
@@ -49,7 +49,7 @@ to obtain the Prometheus endpoint for your cluster:
 
 Example:
 
-.. code-block:: bash
+.. prompt:: bash # auto
 
    # ceph orch ps --service_name prometheus
    NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
@@ -61,19 +61,21 @@ interface, which includes a list of targets, an expression browser, and
 metrics related to the Prometheus service itself. The complete list of metrics
 (with descriptions) is available at the URL of the below form:
 
-in:
-``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``
+::
+
+   http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata
 
 The Ceph Dashboard provides a rich set of graphs and other panels that display
 the most important cluster and service metrics. Many of the examples in this
 document are taken from Dashboard graphics or extrapolated from metrics
 exposed by the Ceph Dashboard.
 
-Ceph daemon health metrics
+
+Ceph Daemon Health Metrics
 ==========================
 
-The ``ceph_exporter`` provides a metric named ``ceph_daemon_socket_up`` that
+The ``ceph_exporter`` service provides a metric named ``ceph_daemon_socket_up`` that
 indicates the health status of a Ceph daemon based on its ability to respond
 via the admin socket, where a value of ``1`` means healthy, and ``0`` means
 unhealthy. Although a Ceph daemon might still be "alive" when it
@@ -82,6 +84,7 @@ in its functionality. As such, this metric serves as an excellent means of
 detecting problems in any of the main Ceph daemons.
 
 The ``ceph_daemon_socket_up`` Prometheus metrics also have labels as described below:
+
 * ``ceph_daemon``: Identifier of the Ceph daemon exposing an admin socket on
   the host.
 * ``hostname``: Name of the host where the Ceph daemon is running.
@@ -100,14 +103,15 @@ To identify any Ceph daemons that were not responsive at any point in the last
 
    ceph_daemon_socket_up == 0 or min_over_time(ceph_daemon_socket_up[12h]) == 0
 
-Performance metrics
+Performance Metrics
 ===================
 
-Below we explore a a number of metrics that indicate Ceph cluster performance.
+Below we explore a number of metrics that indicate Ceph cluster performance.
 
 All of these metrics have the following labels:
+
 * ``ceph_daemon``: Identifier of the Ceph daemon from which the metric was harvested
-* ``instance``: The IP address of the exporter instance exposing the metric.
+* ``instance``: The IP address of the exporter instance exposing the metric
 * ``job``: Prometheus scrape job name
 
 Below is an example Prometheus query result showing these labels:
@@ -123,10 +127,10 @@ Example:
 
 .. code-block:: bash
 
-   # Writes (B/s):
+   # Writes (B/s)
    sum(irate(ceph_osd_op_w_in_bytes[1m]))
 
-   # Reads (B/s):
+   # Reads (B/s)
    sum(irate(ceph_osd_op_r_out_bytes[1m]))
 
@@ -137,15 +141,15 @@ Example:
 
 .. code-block:: bash
 
-   # Writes (ops/s):
+   # Writes (ops/s)
    sum(irate(ceph_osd_op_w[1m]))
 
-   # Reads (ops/s):
+   # Reads (ops/s)
    sum(irate(ceph_osd_op_r[1m]))
 
 *Latency:*
 Query ``ceph_osd_op_latency_sum`` to measure the delay before OSD transfers of data
-begins in respose to client requests:
+begin in response to client requests:
 
 Example:
 
@@ -154,7 +158,7 @@ Example:
 
    sum(irate(ceph_osd_op_latency_sum[1m]))
 
-OSD performance
+OSD Performance
 ===============
 
 The cluster performance metrics described above are gathered from OSD metrics.
@@ -178,8 +182,8 @@ Examples:
 
    ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481
 
-Physical storage drive performance:
-===================================
+Physical Storage Drive Performance
+==================================
 
 By combining Prometheus ``node_exporter`` metrics with Ceph cluster metrics we
 can derive performance information for physical storage media backing Ceph OSDs.
@@ -211,7 +215,8 @@ Example:
 
    # Physical drive utilization (%) for osd.0 in the last 5 minutes. Note that this value has limited mean for SSDs
    label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
 
-Pool metrics
+
+Pool Metrics
 ============
 
 Ceph pool metrics have the following labels:
@@ -223,70 +228,65 @@ Ceph pool metrics have the following labels:
 
 Pool-specific metrics include:
 
-  * ``ceph_pool_metadata``: Information about the pool that can be used
-    together with other metrics to provide more information in query resultss
-    and graphs. In addition to the above three common labels this metric
-    provides the following:
-
-    * ``compression_mode``: Compression type enabled for the pool. Values are
-      ``lz4``, ``snappy``, ``zlib``, ``zstd``, and ``none`). Example:
-      ``compression_mode="none"``
-
-    * ``description``: Brief description of the pool data protection strategy
-      including replica number or EC profile. Example:
-      ``description="replica:3"``
-
-    * ``name``: Name of the pool. Example: ``name=".mgr"``
-
-    * ``type``: Data protection strategy, replicated or EC. ``Example:
-      type="replicated"``
-
-  * ``ceph_pool_bytes_used``: Total raw capacity (after replication or EC)
-    consumed by user data and metadata
-
-  * ``ceph_pool_stored``: Total client data stored in the pool (before data
-    protection)
-
-  * ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in
-    the pool
-
-  * ``ceph_pool_compress_bytes_used``: Data compressed in the pool
-
-  * ``ceph_pool_rd``: Client read operations per pool (reads per second)
-
-  * ``ceph_pool_rd_bytes``: Client read operations in bytes per pool
-
-  * ``ceph_pool_wr``: Client write operations per pool (writes per second)
-
-  * ``ceph_pool_wr_bytes``: Client write operation in bytes per pool
-
-
-**Useful queries**:
+* ``ceph_pool_metadata``: Information about the pool that can be used
+  together with other metrics to provide more information in query results
+  and graphs. In addition to the above three common labels, this metric
+  provides the following:
+
+  * ``compression_mode``: Compression type enabled for the pool. Values are
+    ``lz4``, ``snappy``, ``zlib``, ``zstd``, and ``none``). Example:
+    ``compression_mode="none"``
+  * ``description``: Brief description of the pool data protection strategy
+    including replica number or EC profile. Example:
+    ``description="replica:3"``
+  * ``name``: Name of the pool. Example: ``name=".mgr"``
+  * ``type``: Data protection strategy, replicated or EC. Example:
+    ``type="replicated"``
+* ``ceph_pool_bytes_used``: Total raw capacity (after replication or EC)
+  consumed by user data and metadata
+* ``ceph_pool_stored``: Total client data stored in the pool (before data
+  protection)
+* ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in
+  the pool
+* ``ceph_pool_compress_bytes_used``: Data compressed in the pool
+* ``ceph_pool_rd``: Client read operations per pool (reads per second)
+* ``ceph_pool_rd_bytes``: Client read operations in bytes per pool
+* ``ceph_pool_wr``: Client write operations per pool (writes per second)
+* ``ceph_pool_wr_bytes``: Client write operation in bytes per pool
+
+
+Useful Queries
+--------------
 
 .. code-block:: bash
 
-   # Total raw capacity available in the cluster:
+   # Total raw capacity available in the cluster
    sum(ceph_osd_stat_bytes)
 
-   # Total raw capacity consumed in the cluster (including metadata + redundancy):
+   # Total raw capacity consumed in the cluster (including metadata + redundancy)
    sum(ceph_pool_bytes_used)
 
-   # Total client data stored in the cluster:
+   # Total client data stored in the cluster
    sum(ceph_pool_stored)
 
-   # Compression savings:
+   # Compression savings
    sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)
 
    # Client IOPS for a specific pool
-   reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
-   writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
+   # reads:
+   irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
+   # writes:
+   irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
 
    # Client throughput for a specific pool
-   reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
-   writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
+   # reads:
+   irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
+   # writes:
+   irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
+
 
-RGW metrics
-==================
+RGW Metrics
+===========
 
 These metrics have the following labels:
 
@@ -301,58 +301,49 @@ Example:
 
    ceph_rgw_req{instance="192.168.122.7:9283", instance_id="154247", job="ceph"} = 12345
 
-Generic metrics
+Generic Metrics
 ---------------
 
 * ``ceph_rgw_metadata``: Provides generic information about an RGW daemon.
   This can be used together with other metrics to provide contextual
-  information in queries and graphs. In addtion to the three common labels,
+  information in query results and graphs. In addition to the three common labels,
   this metric provides the following:
 
   * ``ceph_daemon``: Name of the RGW daemon instance. Example:
    ``ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq"``
-
  * ``ceph_version``: Version of the RGW daemon. Example: ``ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)"``
-
  * ``hostname``: Name of the host where the daemon runs. Example:
-    ``hostname:"cephtest-node-00.cephlab.com"``
-
+    ``hostname="cephtest-node-00.cephlab.com"``
 * ``ceph_rgw_req``: Number of requests processed by the daemon
-  (``GET``+``PUT``+``DELETE``). Useful for detecting bottlenecks and
+  (``GET`` + ``PUT`` + ``DELETE``). Useful for detecting bottlenecks and
   optimizing load distribution.
-
 * ``ceph_rgw_qlen``: Operations queue length for the daemon. Useful for detecting bottlenecks and optimizing load distribution.
-
 * ``ceph_rgw_failed_req``: Aborted requests. Useful for detecting daemon errors.
 
-GET operation metrics
+GET Operation Metrics
 ---------------------
-* ``ceph_rgw_op_global_get_obj_lat_count``: Number of ``GET`` requests 
+* ``ceph_rgw_op_global_get_obj_lat_count``: Number of ``GET`` requests
 * ``ceph_rgw_op_global_get_obj_lat_sum``: Total latency for ``GET`` requests
-
 * ``ceph_rgw_op_global_get_obj_ops``: Total number of ``GET`` requests
-
 * ``ceph_rgw_op_global_get_obj_bytes``: Total bytes transferred for ``GET`` requests
 
-PUT operation metrics
--------------------------------
-* ``ceph_rgw_op_global_put_obj_lat_count``: Number of get operations
+PUT Operation Metrics
+---------------------
+* ``ceph_rgw_op_global_put_obj_lat_count``: Number of ``PUT`` operations
 * ``ceph_rgw_op_global_put_obj_lat_sum``: Total latency time for ``PUT`` operations
-
 * ``ceph_rgw_op_global_put_obj_ops``: Total number of ``PUT`` operations
-
 * ``ceph_rgw_op_global_get_obj_bytes``: Total bytes transferred in ``PUT`` operations
 
-Additional Useful queries
+Additional Useful Queries
 -------------------------
 
 .. code-block:: bash
@@ -404,18 +395,18 @@ Example:
 
    ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452
 
-Important metrics
+Important Metrics
 -----------------
 
 * ``ceph_mds_metadata``: Provides general information about the MDS daemon.
   It can be used together with other metrics to provide contextual
-  information in queries and graphs. The following extra labels are populated:
+  information in query results and graphs. The following extra labels are populated:
 
-   * ``ceph_version``: MDS daemon version
-   * ``fs_id``: CephFS filesystem ID
-   * ``hostname``: Name of the host where the MDS daemon runs
-   * ``public_addr``: Public address of the host where the MDS daemon runs
-   * ``rank``: Rank of the MDS daemon
+  * ``ceph_version``: MDS daemon version
+  * ``fs_id``: CephFS filesystem ID
+  * ``hostname``: Name of the host where the MDS daemon runs
+  * ``public_addr``: Public address of the host where the MDS daemon runs
+  * ``rank``: Rank of the MDS daemon
 
 Example:
 
@@ -423,41 +414,29 @@ Example:
 
   ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}
 
-
 * ``ceph_mds_request``: Total number of requests for the MDS
-
 * ``ceph_mds_reply_latency_sum``: Reply latency total
-
 * ``ceph_mds_reply_latency_count``: Reply latency count
-
 * ``ceph_mds_server_handle_client_request``: Number of client requests
-
 * ``ceph_mds_sessions_session_count``: Session count
-
 * ``ceph_mds_sessions_total_load``: Total load
-
 * ``ceph_mds_sessions_sessions_open``: Sessions currently open
-
 * ``ceph_mds_sessions_sessions_stale``: Sessions currently stale
-
 * ``ceph_objecter_op_r``: Number of read operations
-
 * ``ceph_objecter_op_w``: Number of write operations
-
 * ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon
-
 * ``ceph_mds_root_rfiles``: Total number of files managed by the daemon
 
-Useful queries:
----------------
+Useful Queries
+--------------
 
 .. code-block:: bash
 
-   # Total MDS read workload:
+   # Total MDS read workload
    sum(rate(ceph_objecter_op_r[1m]))
 
-   # Total MDS daemons workload:
+   # Total MDS write workload
    sum(rate(ceph_objecter_op_w[1m]))
 
    # Read workload for a specific MDS
@@ -473,15 +452,15 @@ Useful queries:
 
    rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata
 
-Block metrics
+Block Metrics
 =============
 
 By default RBD metrics for images are not gathered, as their cardinality
 may be high. This helps ensure the performance of the Manager's
 ``prometheus`` module. To produce metrics for RBD images, configure the
-Manager option ``mgr/prometheus/rbd_stats_pools``. For more information
-see :ref:`prometheus-rbd-io-statistics`
+Manager option :confval:`mgr/prometheus/rbd_stats_pools`. For more information,
+see :ref:`prometheus-rbd-io-statistics`.
 
 These metrics have the following labels:
 
@@ -497,28 +476,20 @@ Example:
 
    ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}
 
-Important metrics
-------------------
+Important Metrics
+-----------------
 
 * ``ceph_rbd_read_bytes``: RBD bytes read
-
 * ``ceph_rbd_write_bytes``: RBD image bytes written
-
 * ``ceph_rbd_read_latency_count``: RBD read operation latency count
-
 * ``ceph_rbd_read_latency_sum``: RBD read operation latency total time
-
 * ``ceph_rbd_read_ops``: RBD read operation count
-
 * ``ceph_rbd_write_ops``: RBD write operation count
-
 * ``ceph_rbd_write_latency_count``: RBD write operation latency count
-
 * ``ceph_rbd_write_latency_sum``: RBD write operation latency total
-
-Useful queries
+Useful Queries
 --------------
 
 .. code-block:: bash
@@ -527,8 +498,10 @@ Useful queries
 
    rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata
 
-Hardware monitoring
+Hardware Monitoring
 ===================
 
-See :ref:`hardware-monitoring`
+See :ref:`hardware-monitoring`.
+
+.. _PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
-- 
2.47.3
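A usage note beyond the patch itself: the PromQL expressions documented above can be run against the cluster's Prometheus endpoint through its standard HTTP API. Below is a minimal sketch; the host and port are illustrative values taken from the examples in the patched document (substitute whatever ``ceph orch ps --service_name prometheus`` reports for your cluster), and ``/api/v1/query`` is the standard Prometheus instant-query path.

```shell
#!/bin/sh
# Build a Prometheus HTTP API query URL for one of the PromQL expressions
# documented above. Host and port are illustrative placeholders from the
# document's examples, not values this patch defines.
PROM="http://cephtest-node-00.cephlab.com:9095"
QUERY='sum(irate(ceph_osd_op_w_in_bytes[1m]))'

# With a live cluster you would run the query like this, letting curl
# percent-encode the brackets and parentheses in the expression:
#   curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY"
# Shown here without the network call:
echo "$PROM/api/v1/query?query=$QUERY"
```

Using ``--data-urlencode`` (rather than pasting the expression into the URL by hand) matters because PromQL range selectors like ``[1m]`` contain characters that must be percent-encoded in a raw URL.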