From 9665b2072a61f1c3cc5de7c3b0ed29c1a8760949 Mon Sep 17 00:00:00 2001
From: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Date: Thu, 8 May 2025 15:53:25 +0530
Subject: [PATCH] doc/rados/operations/monitoring.rst: add docs for availability score

This commit adds docs for how to use the availability score
feature. It also details when a pool is considered available
or unavailable, and how the availability score is calculated.

Fixes: https://tracker.ceph.com/issues/67777

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
---
 doc/rados/operations/monitoring.rst | 36 +++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/doc/rados/operations/monitoring.rst b/doc/rados/operations/monitoring.rst
index 5e35ed27032..6e879defdaa 100644
--- a/doc/rados/operations/monitoring.rst
+++ b/doc/rados/operations/monitoring.rst
@@ -738,3 +738,39 @@ Print active connections and their TCP round trip time and retransmission counte
 
 	248     89      1       mgr.0   863     1677    0
 	3       86      2       mon.0   230     278     0
+
+Tracking Data Availability Score of a Cluster
+=============================================
+
+Ceph internally tracks the data availability of each pool in a cluster.
+To check the data availability score of each pool, run the following
+command:
+
+
+.. prompt:: bash $
+
+   ceph osd pool availability-status
+
+For example, on a cluster with four pools, ``availability-status``
+reports output similar to the following:
+
+::
+
+   POOL           UPTIME  DOWNTIME  NUMFAILURES  MTBF  MTTR  SCORE     AVAILABLE
+   rbd            2m      21s       1            2m    21s   0.888889  1
+   .mgr           86s     0s        0            0s    0s    1         1
+   cephfs.a.meta  77s     0s        0            0s    0s    1         1
+   cephfs.a.data  76s     0s        0            0s    0s    1         1
+
+A pool is considered unavailable if any of its data is potentially
+unreachable or lost: that is, if any placement group (PG) in the pool
+is not in the ``active`` state, or if the pool contains unfound
+objects. In either case, the pool is marked unavailable. Otherwise,
+the pool is considered available. For example, a pool is still marked
+available when an OSD is down, as long as PG replication ensures that
+there is no data loss.
+
+The availability score is derived from the Mean Time Between Failures
+(MTBF) and the Mean Time To Recover (MTTR) of the pool. It is the ratio
+of MTBF to total time, that is, ``MTBF / (MTBF + MTTR)``. The score is
+updated every 5 seconds.
\ No newline at end of file
-- 
2.39.5