doc: update stretch-mode.rst

author Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>

Wed, 22 Apr 2026 17:55:13 +0000 (17:55 +0000)

committer Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>

Mon, 11 May 2026 15:53:37 +0000 (15:53 +0000)
author Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
Wed, 22 Apr 2026 17:55:13 +0000 (17:55 +0000)
committer Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
Mon, 11 May 2026 15:53:37 +0000 (15:53 +0000)
diff --git a/doc/man/8/ceph.rst b/doc/man/8/ceph.rst

index 230f33f0601785d6e8cf070d8cfef70ba53797ba..fc03abcbe99e80fd50b35f74c189c1ea2dc7986a 100644 (file)
--- a/doc/man/8/ceph.rst
+++ b/doc/man/8/ceph.rst
@@ -632,7 +632,11 @@ rules and failure handling on all pools. For a given PG to successfully peer
  and be marked active, ``min_size`` replicas will now need to be active under all
  (currently two) CRUSH buckets of type <dividing_bucket>.
  
-<tiebreaker_mon> is the tiebreaker Monitor to use if a network split happens.
+<tiebreaker_mon> is the tiebreaker mon to use if a network split happens.
+This parameter is optional. If not supplied, the system will automatically
+select a monitor that is not in either data zone. If there are multiple
+monitors outside the data zones, automatic selection will fail and you must
+explicitly specify the tiebreaker monitor.
  
  <dividing_bucket> is the bucket type across which to stretch.
  This will typically be ``datacenter`` or other CRUSH hierarchy bucket type that
@@ -642,7 +646,7 @@ denotes physically or logically distant subdivisions.
  
  Usage::
  
-    ceph mon enable_stretch_mode <tiebreaker_mon> <new_crush_rule> <dividing_bucket>
+       ceph mon enable_stretch_mode {<tiebreaker_mon>} <new_crush_rule> <dividing_bucket>
  
  Subcommand ``remove`` removes Monitor named <name>.
  
diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst

index 1ec6cc370e2726da963c5a1cfdca949df0bc4cd5..f07b46d1e8ef4d068f9116adc01e6cf513e3a21f 100644 (file)
--- a/doc/rados/operations/stretch-mode.rst
+++ b/doc/rados/operations/stretch-mode.rst
@@ -30,13 +30,13 @@ We will here consider two standard configurations: a configuration with two
  data centers (or, in clouds, two availability zones), and a configuration with
  three data centers.
  
-In the two-site configuration, Ceph arranges for each site to hold a copy of
-the data. A third site houses a tiebreaker (arbiter, witness)
+In the two-zone configuration, Ceph arranges for each zone to hold a copy of
+the data. A third zone houses a tiebreaker (arbiter, witness)
  Monitor. This tiebreaker Monitor picks a winner when a network connection
-between sites fails and both data centers remain alive.
+between zones fails and both data centers remain alive.
  
  The tiebreaker monitor can be a VM. It can also have higher network latency
-to the OSD site(s) than OSD site(s) can have to each other.
+to the OSD zone(s) than OSD zone(s) can have to each other.
  
  The standard Ceph configuration is able to survive many network failures or
  data-center failures without compromising data availability. When enough
@@ -133,19 +133,23 @@ converge according to configured replication policy and return to normal operati
  
  Connectivity Monitor Election Strategy
  ---------------------------------------
-When using stretch mode, the Monitor election strategy must be set to ``connectivity``.
+Stretch mode requires the Monitor election strategy to be set to ``connectivity``.
  This strategy tracks network connectivity between Monitors and is
  used to determine which data center should be favored when the cluster
  experiences netsplit.
  
-See `Changing Monitor Elections`_
+**Note:** When you enable stretch mode with ``ceph mon enable_stretch_mode``,
+the cluster will automatically switch the election strategy to ``connectivity``
+if it is not already set. Manual configuration is not required.
+
+See `Changing Monitor Elections`_ for more details on election strategies.
  
  Stretch Peering Rule
  --------------------
  One critical behavior of stretch mode is its ability to prevent a PG from going ``active`` if the acting set
  contains only replicas from a single data center. This safeguard is crucial for mitigating the risk of data
-loss during site failures because if a PG were allowed to go ``active`` with replicas only at a single site,
-writes could be acknowledged despite a lack of redundancy. In the event of a site failure, all data in the
+loss during zone failures because if a PG were allowed to go ``active`` with replicas only at a single zone,
+writes could be acknowledged despite a lack of redundancy. In the event of a zone failure, all data in the
  affected PG would be lost.
  
  Entering Stretch Mode
@@ -158,7 +162,7 @@ with the CRUSH topology.
  
     .. prompt:: bash $
  
-      ceph mon set_location a datacenter=site1
+      ceph mon set_location a datacenter=zone1
  
  #. Generate a CRUSH rule that places two copies in each data center.
     This requires editing the CRUSH map directly:
@@ -170,17 +174,15 @@ with the CRUSH topology.
  
  #. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
     other rule (``id 1``), but you will likely need to use a different, unique rule ID. We
-   have two ``datacenter`` buckets named ``site1`` and ``site2``:
+   have two ``datacenter`` buckets named ``zone1`` and ``zone2``:
  
     ::
  
-      rule stretch_rule {
+      rule stretch_replicated_rule {
               id 1
               type replicated
-             step take site1
-             step chooseleaf firstn 2 type host
-             step emit
-             step take site2
+             step take default
+             step choose firstn 0 type datacenter
               step chooseleaf firstn 2 type host
               step emit
       }
@@ -191,47 +193,39 @@ with the CRUSH topology.
        of the available space from the datacenter, not the available space for
        the pools associated with the CRUSH rule.
     
-      For example, consider a cluster with two CRUSH rules, ``stretch_rule`` and
-      ``stretch_replicated_rule``::
+      For example, consider a cluster with two CRUSH rules, ``stretch_replicated_rule`` and
+      ``stretch_replicated_rule_alt``::
  
-         rule stretch_rule {
+         rule stretch_replicated_rule {
                id 1
                type replicated
-              step take DC1
-              step chooseleaf firstn 2 type host
-              step emit
-              step take DC2
+              step take default
+              step choose firstn 0 type datacenter
                step chooseleaf firstn 2 type host
                step emit
           }
           
-         rule stretch_replicated_rule {
+         rule stretch_replicated_rule_alt {
                   id 2
                   type replicated
-                 step take default
-                 step choose firstn 0 type datacenter
+                 step take zone1
+                 step chooseleaf firstn 2 type host
+                 step emit
+                 step take zone2
                   step chooseleaf firstn 2 type host
                   step emit
           }
  
-      In the above example, ``stretch_rule`` will report an incorrect value for
+      In the above example, ``stretch_replicated_rule_alt`` will report an incorrect value for
        ``MAX AVAIL``. ``stretch_replicated_rule`` will report the correct value.
-      This is because ``stretch_rule`` is defined in such a way that
+      This is because ``stretch_replicated_rule_alt`` is defined in such a way that
        ``PGMap::get_rule_avail`` considers only the available capacity of a single
        ``datacenter``, and not (as would be correct) the total available capacity from
        both ``datacenters``.
        
-      Here is a workaround. Instead of defining the stretch rule as defined in
-      the ``stretch_rule`` above, define it as follows::
-
-         rule stretch_rule {
-           id 2
-           type replicated
-           step take default
-           step choose firstn 0 type datacenter
-           step chooseleaf firstn 2 type host
-           step emit
-         }
+      The recommended approach is to use the ``stretch_replicated_rule`` definition shown
+      above (with ``take default`` and ``choose firstn 0 type datacenter``), which correctly
+      reports ``MAX AVAIL``.
  
        See https://tracker.ceph.com/issues/56650 for more detail on this workaround.
  
@@ -244,27 +238,39 @@ with the CRUSH topology.
        crushtool -c crush.map.txt -o crush2.map.bin
        ceph osd setcrushmap -i crush2.map.bin
  
-#. Run the Monitors in ``connectivity`` mode. See `Changing Monitor Elections`_.
+#. Direct the cluster to enter stretch mode. The cluster will automatically
+   switch to the `connectivity` election strategy if not already configured.
+   
+   When a tiebreaker Monitor is provisioned, it must be assigned to a CRUSH 
+   `datacenter` location that is neither `zone1` nor `zone2`. This data center 
+   should not be predefined in your CRUSH map.
+   
+   An explicit tiebreaker Monitor is optional. If not specified, the cluster will
+   automatically select a Monitor that has been assigned to a `datacenter` (or the
+   specified bucket type) that differs from the main data zones.
+   
+   **Option 1: Automatic tiebreaker selection** (recommended):
+   
+   Let the cluster automatically select the tiebreaker:
  
     .. prompt:: bash $
  
-      ceph mon set election_strategy connectivity
+      ceph mon set_location e datacenter=zone3
+      ceph mon enable_stretch_mode stretch_replicated_rule datacenter
  
-#. Direct the cluster to enter stretch mode. In this example, ``mon.e`` is the
-   tiebreaker Monitor and we are splitting across CRUSH ``datacenters``. The tiebreaker
-   monitor must be assigned a CRUSH ``datacenter`` that is neither ``site1`` nor
-   ``site2``. This data center **should not** be predefined in your CRUSH map. Here 
-   we are placing ``mon.e`` in a virtual data center named ``site3``:
+   **Option 2: Explicit tiebreaker monitor**:
+   
+   Alternatively, you can explicitly specify ``mon.e`` as the tiebreaker Monitor:
  
     .. prompt:: bash $
  
-      ceph mon set_location e datacenter=site3
-      ceph mon enable_stretch_mode e stretch_rule datacenter
+      ceph mon set_location e datacenter=zone3
+      ceph mon enable_stretch_mode e stretch_replicated_rule datacenter
  
  When stretch mode is enabled, PGs will become active only when they peer
  across CRUSH ``datacenter`` (or across whichever CRUSH bucket type was specified),
  assuming both are available. Pools will increase in size from the default ``3`` to
-``4``, and two replicas will be placed at each site. OSDs will be allowed to
+``4``, and two replicas will be placed at each zone. OSDs will be allowed to
  connect to Monitors only if they are in the same data center as the Monitors.
  New Monitors will not be allowed to join the cluster if they do not specify a
  CRUSH location.
@@ -273,7 +279,7 @@ If all OSDs and Monitors in one of the ``datacenter`` become inaccessible at onc
  the cluster in the surviving ``datacenter`` enters  *degraded stretch mode*.
  A health state warning will be
  raised, pools' ``min_size`` will be reduced to ``1``, and the cluster will be
-allowed to go active with the components and data at the single remaining site. Pool ``size``
+allowed to go active with the components and data at the single remaining zone. Pool ``size``
  does not change, so warnings will be raised that the PGs are undersized,
  but a special stretch mode flag will prevent the OSDs from
  creating extra copies in the remaining data center. This means that the data
@@ -285,8 +291,8 @@ only from the ``datacenter`` that was ``up`` throughout the duration of the
  downtime. When all PGs are in a known state, and are neither degraded nor
  undersized / incomplete, the cluster transitions back to regular stretch mode, ends the
  warning, restores pools' ``min_size`` to its original value of ``2``, requires
-PGs at both sites to peer, and no longer requires the site that was up throughout the
-duration of the downtime when peering. This makes failover to the other site
+PGs at both zones to peer, and no longer requires the zone that was up throughout the
+duration of the downtime when peering. This makes failover to the other zone
  possible, if needed.
  
  .. _Changing Monitor elections: ../change-mon-elections
@@ -324,7 +330,7 @@ is in degraded stretch mode or healthy stretch mode.
  
  Limitations of Stretch Mode 
  ===========================
-When using stretch mode, OSDs must be located at exactly two sites. 
+When using stretch mode, OSDs must be located at exactly two zones. 
  
  Two Monitors must be run in each data center, plus a tiebreaker in a third
  (possibly in the cloud) for a total of five Monitors. While in stretch mode, OSDs
@@ -351,7 +357,7 @@ data centers has been restored. This reduces the potential for data loss.
     For example, the following rule specifying the ``ssd`` device class will not work::
  
        rule stretch_replicated_rule {
-                 id 2
+                 id 1
                   type replicated class ssd
                   step take default
                   step choose firstn 0 type datacenter
author	Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
	Wed, 22 Apr 2026 17:55:13 +0000 (17:55 +0000)
committer	Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
	Mon, 11 May 2026 15:53:37 +0000 (15:53 +0000)
doc/man/8/ceph.rst		patch \| blob \| history
doc/rados/operations/stretch-mode.rst		patch \| blob \| history