--- /dev/null
+=====================================================
+Design Document: Ceph Erasure Coded (EC) Stretch Cluster
+=====================================================
+
+.. note::
+
+ **Phased Delivery (R1, R2, …)**
+
+ This document distinguishes between **R1** and **later release** work.
+ Sections and sub-sections are annotated accordingly. R1 delivers:
+
+ - **Zone-local direct reads** (good path only; any failure redirects to the
+ Primary)
+ - **Writes via Primary** (all write IO traverses the inter-zone link; no
+ Zone Primary encoding)
+ - **Recovery from Primary** (Primary reads all data and writes all remote-zone
+ shards; no inter-zone bandwidth optimization)
+
+ The labels R1, R2, etc. refer to **incremental PR delivery milestones** —
+ they do *not* map one-to-one to Ceph named releases. Multiple delivery
+ phases may land within a single Ceph release, or a single phase may span
+ releases depending on development progress.
+
+ The implementation order is detailed in Section 14.
+
+.. important::
+
+ **Scope and Applicability**
+
+ This design is an **extension to Fast EC** (Fast Erasure Coding) only. We make
+ **no attempt to support legacy EC** implementations with this feature. All
+ functionality described in this document requires Fast EC to be enabled.
+
+
+1. Intent & High-Level Summary
+-------------------------------
+
+The primary goal of this feature is to support a Replicated Erasure Coded (EC)
+configuration within a multi-zone Ceph cluster (Stretch Cluster), building upon
+the Fast EC infrastructure.
+
+.. note::
+
+ Throughout this document, a *zone* refers to a group of OSDs within a single
+ CRUSH failure domain — typically a data center, but it could be any CRUSH
+ bucket type (e.g., rack, room). The key characteristic of a zone boundary is
+ that traffic crossing it incurs a significant performance penalty: higher
+ latency and/or lower bandwidth compared to intra-zone communication.
+
+ The term *inter-zone link* refers to the network connection between these
+ failure domains (e.g., the dedicated link between data centers). This is
+ typically the most bandwidth-constrained and latency-sensitive path in the
+ cluster. There is a strong desire to minimize inter-zone link traffic, even
+ to the extent of issuing additional zone-local reads to avoid sending data across
+ this link.
+
+Currently, Ceph Stretch clusters typically rely on Replica to ensure data
+availability across zones (e.g., data centers). While Erasure Coding provides
+storage efficiency, applying a standard EC profile naively across a stretch
+cluster introduces significant operational problems.
+
+Consider a 2-zone stretch cluster using a standard EC profile of ``k=4, m=6``.
+While the high parity count provides sufficient fault tolerance to survive a
+full zone loss, this configuration suffers from three fundamental
+inefficiencies:
+
+1. **Write Bandwidth Waste**: Every write must distribute all ``k+m`` chunks
+ across both zones. The coding (parity) shards are unnecessarily duplicated
+ over the inter-zone link, consuming bandwidth that scales linearly with the
+ number of shards.
+2. **Reads Must Cross Zones**: Serving a read requires retrieving at least ``k``
+ chunks, which in general will span both zones. Every read operation incurs
+ inter-zone link latency.
+3. **Recovery is Inter-Zone Link Bound**: After a zone failure, recovering the
+ lost chunks requires reading surviving chunks across the inter-zone link,
+ placing the entire recovery burden on the most constrained network path.
+
+This feature introduces a hybrid approach to eliminate these problems: creating
+Erasure Coded stripes (defined by ``k + m``) and replicating those stripes
+``zones`` times across a specified topology level (e.g., Data Center). Each zone
+holds a complete, independent copy of the EC stripe, meaning reads and recovery
+can be performed entirely within a single zone. This combines the zone-local storage
+efficiency of EC with the zone-level redundancy of a stretch cluster, while
+minimizing inter-zone link usage to write replication only.
+
+2. Proposed Configuration (CLI) Changes
+---------------------------------------
+
+To support this topology, the EC Pool configuration interface will be expanded
+directly within the pool creation command. We intend to address a number of
+issues with the current CLI design:
+
+* Create-then-set flow. A typical setup procedure of a pool requires multiple CLI
+ commands (e.g. create replica, then set num copies, then set stretched, etc... )
+* Remove the need for an EC profile. Indeed, stretch mode EC pools will be mutually
+ exclusive with the use of a profile.
+
+The current CLI behaviour of accepting either positional or non-positional arguments
+will be maintained, however no positional arguments will be added for the new
+functionality.
+
+Supporting stretch EC pools will require changes to several CLI commands. The design
+document will focus on just the changes that will be made to the pool create CLI
+to give an idea of how the new CLI will work. Similar changes will be made to the
+CLIs that allow modification of a pool. It also expected that changes will be made
+to the CLIs that control stretch mode.
+
+
+
+2.1 ceph osd pool create
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ceph osd pool create command will be extended to become a parameterized command.
+
+For backward compatibility, the positional arguments will be maintained and can be
+specified along side the new paramaters.
+
+2.2.1 Full command syntax
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The full command syntax is listed here. Refer to the later sections for pool-type specific details.
+
+.. code-block:: text
+
+ ceph osd pool create {pool_name}
+ [{pg_num} | --pg_num <pg_num>]
+ [{pgp_num} | --pgp_num <pgp_num>]
+ [{replicated | erasure} | --pool_type <replicated|erasure>]
+ [{expected_num_objects} | --expected_num_objects <expected_num_objects>]
+
+ # CRUSH Placement Options (Mutually Exclusive)
+ [ {crush_rule_name} | --crush_rule_name <crush_rule_name> |
+ [--crush_root <crush_root>] [--zone_failure_domain <zone_failure_domain>] [--osd_failure_domain <osd_failure_domain>] [--crush_device_class <class>] ]
+
+ # Replica-Specific Options
+ [--size <size>]
+
+ # Erasure-Specific Options
+ [--data_shards <num_data_shards>]
+ [--coding_shards <num_coding_shards>]
+ [--stripe_unit <stripe_unit>]
+ [ {erasure_code_profile} | --profile <profile_name> ]
+
+ # Topology and Redundancy
+ [--zones <num_zones>]
+ [--min_size <min_size>]
+
+ # Autoscaling and General Config
+ [--autoscale_mode=<on,off,warn>]
+ [--pg_num_min <pg_num_min>]
+ [--pg_num_max <pg_num_max>]
+ [--bulk]
+ [--target_size_bytes <target_size_bytes>]
+ [--target_size_ratio <target_size_ratio>]
+ [--crimson]
+ [--yes_i_really_mean_it]
+
+ # Legacy Options
+ [--stretch_mode]
+
+2.2.2 Legacy positional syntax
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following syntax, taken from the current docs, will be maintained for backward compatibility:
+
+.. code-block:: text
+
+ ceph osd pool create {pool_name} [{pg_num} [{pgp_num}]] [replicated] \
+ [crush_rule_name] [expected_num_objects]
+
+.. code-block:: text
+
+ ceph osd pool create {pool_name} [{pg_num} [{pgp_num}]] erasure \
+ [erasure_code_profile] [crush_rule_name] [expected_num_objects] [--autoscale_mode=<on,off,warn>]
+
+This syntax can be used with the new parameters which do not directly conflict. For example --pg_num cannot
+be used with its positional counterpart.
+
+2.2.1 Basic Parameters
+^^^^^^^^^^^^^^^^^^^^^^
+
+These are the primary parameters required for standard deployments.
+
+**--pool_type** (or positional equivalent)
+ - *Definition*: The type of pool to create.
+ - *Values*: ``replicated`` or ``erasure``.
+ - *Default Value*: Derived from ``osd_pool_default_type`` if omitted, but highly recommended to specify.
+
+**--size**
+ - *Definition*: The number of OSDs the data is striped over for each object. (Replica only)
+ - *Note*: Parameter will be ignored, rather than rejected for EC pools, for backward compatibility.
+
+**--data_shards** (or **--k**)
+ - *Definition*: Within a zone, the number of OSDs the data is striped over for each object. (EC only)
+
+**--coding_shards** (or **--m**)
+ - *Definition*: Within a zone, the number of OSDs the coding shards are striped over. (EC only)
+
+**--zones**
+ - *Definition*: For a stretched cluster configuration defines the number of zones, each which store a full replica of the pool. (EC or Replica)
+ - *Default Value*: ``1``
+ - *Behavior*: Setting this to >1 creates a stretched pool.
+ A non-stretched pool achieves redundancy accross OSDs. A stretched pool creates redundancy
+ accross ``zones``.
+ - *Pool Size*: The resulting pool ``size`` is ``zones × (k + m)``.
+
+
+2.2.2 Advanced
+^^^^^^^^^^^^^^
+
+These parameters are intended for advanced users and offer finer control over the cluster layout.
+
+**--stripe_unit**
+ - *Definition*: The amount of data in a shard, per stripe. Sometimes referred to as "chunk size". (EC only)
+ - *Default Value*: 16k for Fast EC, 4k for classic EC.
+ - *Purpose*: Where beneficial for a well-known access pattern, the stripe unit can
+ be tuned to any 4k-divisible value.
+
+**--zone_failure_domain**
+ - *Definition*: The CRUSH bucket type over which zone-redundancy is achieved.
+ - *Default Value*: ``datacenter``
+ - *Purpose*: This is the CRUSH bucket type that defines a zone.
+ - *Note*: Mutually exclusive with ``--crush_rule``
+
+**--osd_failure_domain**
+ - *Definition*: The CRUSH bucket type over which OSD-redundancy is achieved.
+ - *Default Value*: ``host``
+ - *Note*: Mutually exclusive with ``--crush_rule``
+
+**--crush_device_class**
+ - *Definition*: Restrict placement to devices of a specific class (e.g., ``ssd`` or ``hdd``), using the CRUSH device class names in the CRUSH map.
+ - *Purpose*: Only required if a cluster has a mixture of different classes of OSD.
+ - *Note*: Mutually exclusive with ``--crush_rule``
+
+**--crush_root**
+ - *Definition*: The root of the CRUSH tree to use. (R2 only)
+ - *Default Value*: The cluster root (``default``).
+ - *Topology and Validation*: Specifying the CRUSH root to use (defaults to ``default``), the CRUSH level for a zone (defaults to ``datacenter``), and the number of zones (defaults to ``1``) is sufficient to define the pool's placement:
+
+ - Example 1: In a cluster with 2 datacenters, specifying ``--zones 2`` will create a stretch pool across the 2 datacenters.
+ - Example 2: In a cluster with 2 datacenters, specifying ``--crush_root DC1`` will create an pool completely contained in DC1.
+ - Validation: If there are N datacenters with the same root and you specify a number of zones M != N, the command will fail because the specified number of zones is different from the number of zones in the CRUSH hierarchy.
+ - Custom Rules: If users want to use a subset of zones (e.g., a special 3-datacenter configuration), they must specify a custom CRUSH rule. A custom CRUSH rule is mutually exclusive with specifying the CRUSH root and/or CRUSH level.
+
+ - *Operational Restrictions*: When there are stretch pools, adding or moving a CRUSH bucket that impacts the number of zones for the pool will require a ``yes-i-really-mean-it`` flag, as this is liable to break things.
+ - *Note*: Mutually exclusive with ``--crush_rule``
+
+**--min_size**
+ - *Definition*: The minimum number of shards required to serve I/O within a zone.
+ - *Format*: Now always specified as a target range: ``1-<num replicas>`` for replica pools,
+ or ``<num data shards>-<num data shards>+<num coding shards>`` for EC pools.
+
+**--crush_rule**
+ - *Definition*: Use this CRUSH rule, instead of an auto-generated rule.
+ - *Purpose*: Create a bespoke CRUSH rule for advanced use cases not covered by the auto rule generation above.
+ - *Note*: Mutually exclusive with ``--crush_root``, ``--osd_failure_domain`` and ``--zone_failure_domain``
+
+**--profile**
+ - *Definition*: The legacy EC Profile to use.
+ - *Note*: Mutually exclusive with ``--zones``: Cannot be used with multi-zone configurations.
+
+.. note::
+
+ The following parameters are existing, standard pool configuration options included here for completeness.
+
+**--pg_num**
+ - *Definition*: The total number of placement groups for the pool.
+
+**--pgp_num**
+ - *Definition*: The total number of placement groups for placement purposes.
+ - *Note*: This should never be modified. Purely here for backward compaitbility.
+
+**--expected_num_objects**
+ - *Definition*: The expected number of objects for this pool, used to pre-split placement groups at pool creation.
+
+**--autoscale_mode**
+ - *Definition*: The auto-scaling mode for placement groups.
+ - *Values*: ``on``, ``off``, or ``warn``.
+
+**--pg_num_min**
+ - *Definition*: The minimum number of placement groups when auto-scaling is active.
+
+**--pg_num_max**
+ - *Definition*: The maximum number of placement groups when auto-scaling is active.
+
+**--bulk**
+ - *Definition*: Flags the pool as a "bulk" pool to pre-allocate more placement groups automatically.
+
+**--target_size_bytes**
+ - *Definition*: The expected total size of the pool in bytes, used to guide PG autoscaling.
+
+**--target_size_ratio**
+ - *Definition*: The expected ratio of the cluster's total capacity this pool will consume, used for PG autoscaling.
+
+**--crimson**
+ - *Definition*: Flags the pool to run on Crimson OSD.
+ - *Note*: Crimson OSD is experimental.
+
+**--yes_i_really_mean_it**
+ - *Definition*: Internal safety override flag. In the context of pool creation, it is specifically used to allow the creation of hidden or system-reserved pools whose names begin with a dot (e.g., ``.rgw.root``).
+
+
+2.2.3 Deprecated
+^^^^^^^^^^^^^^^^
+
+These are used for backward compatibility only.
+
+**--stretch_mode**
+ - *Definition*: Deprecated parameter for Replica pools which configures two zones with half the specified number of replica copies in each zone. Not supported for EC pools.
+ - *Note*: Mutually exclusive with ``--zones`` and erasure pools.
+
+
+2.3 Examples
+^^^^^^^^^^^^
+
+The following are examples of how the new parameterized ``ceph osd pool create`` command simplifies pool creation across different topologies.
+
+**Example 1: Basic Replicated Pool**
+Create a standard 3-way replicated pool containing 128 placement groups:
+
+.. code-block:: bash
+
+ ceph osd pool create my_rep_pool --pool_type replicated --size 3 --pg_num 128
+
+**Example 2: Standard Erasure Coded Pool**
+Create an erasure-coded pool using a ``k=4, m=2`` configuration (yielding a size of 6 shards) with 64 placement groups:
+
+.. code-block:: bash
+
+ ceph osd pool create my_ec_pool --pool_type erasure --data_shards 4 --coding_shards 2 --pg_num 64
+
+**Example 3: Stretched Replicated Pool**
+Create a replicated pool that spans across two datacenters, achieving a total size of 4 (2 replicas in each datacenter):
+
+.. code-block:: bash
+
+ ceph osd pool create stretch_rep --pool_type replicated --size 4 --zones 2 --zone_failure_domain datacenter
+
+**Example 4: Stretched Erasure Coded Pool**
+Create an erasure-coded pool stretched across two racks. Using a ``k=4, m=2`` configuration per zone across 2 zones creates a total pool size of 12 shards (4 data and 2 coding per rack):
+
+.. code-block:: bash
+
+ ceph osd pool create stretch_ec --pool_type erasure --data_shards 4 --coding_shards 2 --zones 2 --zone_failure_domain rack
+
+**Example 5: Single Datacenter Erasure Coded Pool**
+Create an EC pool confined entirely to a specific datacenter using the ``--crush_root`` parameter:
+
+.. code-block:: bash
+
+ ceph osd pool create dc1_ec --pool_type erasure --data_shards 4 --coding_shards 2 --crush_root DC1
+
+**Example 6: Bulk Erasure Coded Pool with Autoscaling Limits**
+Create an EC pool where the system automatically scales the PG count but enforces a minimum boundary, marking it as a bulk pool:
+
+.. code-block:: bash
+
+ ceph osd pool create bulk_ec --pool_type erasure --data_shards 6 --coding_shards 3 --autoscale_mode on --pg_num_min 128 --bulk
+
+
+1. Approaches Considered but Rejected
+-------------------------------------
+
+We evaluated and rejected the following two alternative designs:
+
+3.1 Multi-layered Backends
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This approach involved re-using the existing ``ReplicationBackend`` to manage
+the top-level replication, with EC acting as a secondary layer.
+
+- *Reason for Rejection*: This would require complex, multi-layered peering
+ logic where the replication layer interacts with the EC layer. Ensuring the
+ correctness of these interactions is difficult. Additionally, the front-end
+ and back-end interfaces of the two back ends differ significantly (e.g.,
+ support for synchronous reads), which would require substantial refactoring.
+- *Advantage of Proposed Solution*: Our chosen bespoke approach minimizes
+ changes to the peering state machine and offers better potential for recovery
+ during complex failure scenarios.
+
+3.2 LRC (Locally Repairable Codes) Plugin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This approach involved configuring the existing LRC plugin to provide multi-zone
+EC semantics. While LRC is designed for locality-aware erasure coding, it does
+not address the core problems this design aims to solve:
+
+- *Reason for Rejection*:
+
+ 1. **Writes still cross zones**: LRC distributes all chunks (data, local
+ parity, and global parity) across the full CRUSH topology. Every write
+ operation sends chunks over the inter-zone link, offering no bandwidth
+ savings over standard EC.
+ 2. **Reads still cross zones**: Reading the original data requires ``k`` data
+ chunks, which are spread across all zones. LRC provides no mechanism for a
+ single zone to independently serve reads.
+ 3. **Recovery remains primary-centralized**: In the current Ceph EC
+ infrastructure, all recovery operations are coordinated by the Primary OSD.
+ While LRC defines local repair groups at the coding level, the EC back end
+ would still need to be extended to delegate recovery execution to remote
+ zones — the same complexity required by this proposal.
+
+- *Advantage of Proposed Solution*: The replicated EC stripe approach places a
+ complete ``(k+m)`` set at each zone, which inherently enables zone-local
+ reads, zone-local single-OSD recovery (without any special coding scheme), and
+ limits inter-zone link usage to write replication. It achieves better locality
+ than LRC with less infrastructure complexity.
+
+
+4. Configuration Logic (Preliminary)
+--------------------------------------
+
+The interactions between these parameters imply a hierarchy in the CRUSH
+rule generation:
+
+- **Top Level**: The CRUSH rule selects ``zones`` buckets of type
+ ``zone`` (e.g., select 2 data centers).
+- **Lower Level**: Inside each selected ``zone``, the CRUSH rule
+ selects ``k+m`` OSDs to store the chunks.
+
+This leverages standard CRUSH mechanisms — no changes to CRUSH
+itself are required.
+
+
+5. Multi-Zone & Topology Considerations
+-----------------------------------------
+
+While this architecture is capable of supporting N-zone configurations,
+specific attention is given to the common 2-zone High Availability (HA) use
+case.
+
+- **Reference Diagrams**: For clarity, architectural diagrams and examples
+ within this design documentation will primarily depict a 2-zone HA
+ configuration (``zones=2``, with 2 data centers).
+- **Logical Scalability**: The design is logically N-way capable. The parameter
+ ``zones`` is not limited to 2; the system supports any valid CRUSH topology where
+ ``zones`` failure domains exist.
+- **Testing Strategy**: Testing will follow a phased approach:
+
+ 1. **Unit Tests (Initial)**: The unit test framework will include tests for
+ both 2-zone (``zones=2``) and 3-zone (``zones=3``) configurations, validating the
+ core peering and recovery logic for N-way topologies.
+ 2. **Full 2-Zone Testing (Initial Release)**: 2-zone configurations will be
+ fully tested end-to-end for the initial release, covering all read, write,
+ recovery, and failure scenarios.
+ 3. **Full 3-Zone Testing (Later Release)**: Full integration and real-world
+ testing of 3-zone configurations will be deferred to a subsequent release.
+
+
+5.1 Single-Zone Pools
+~~~~~~~~~~~~~~~~~~~~~
+
+The use case for "single-zone" pools is a Ceph cluster which is split across multiple data centers, but the redundancy is provided by the application. Here, the user can specify a pool which is restricted to a single data center. It should be noted that this means that loss of an inter-zone link will lead to an entire zone being lost (something that would not be the case for two independent clusters).
+
+
+6. Topologies and Terminology
+-------------------------------
+
+To illustrate the relationship between the replication layer, the EC layer, and
+the physical topology, we define the following terms and visualization.
+
+6.1 Logical Diagram (2-Zone HA)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following diagram illustrates a cluster configured with ``r=2``, ``k=2``,
+``m=1`` spanning two Data Centers.
+
+.. ditaa::
+ +-------------------+ +-------------------+
+ | Client A | | Client B |
+ +---------+---------+ +---------+---------+
+ | |
+ v v
+ +---------------------------------------------------+
+ | Logical Replication Layer |
+ | (zones=2, zone_failure_domain=DC) |
+ +--------------------------+------------------------+
+ |
+ +--------------+--------------+
+ | |
+ v v
+ +-----------+---------+ +-----------+-----------+
+ | | | |
+ | | | |
+ | Data Center A | | Data Center B |
+ | (Replica/Zone 1) | | (Replica/Zone 2) |
+ | | | |
+ | | | |
+ +--+-------+-------+--+ +---+-------+-------+---+
+ | | | | | |
+ v v v v v v
+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
+ | OSD | | OSD | | OSD | | OSD | | OSD | | OSD |
+ |Shard| |Shard| |Shard| |Shard| |Shard| |Shard|
+ | 0 | | 1 | | 2 | | 3 | | 4 | | 5 |
+ +-^---+ +-----+ +-----+ +-^---+ +-----+ +-----+
+ | |
+ Primary Zone
+ Primary
+
+6.2 Key Definitions
+~~~~~~~~~~~~~~~~~~~~
+
+**Primary**
+ The primary OSD in the acting set for a Placement Group (PG). This OSD is
+ responsible for coordinating writes and reads for the PG.
+ (Standard Ceph terminology.)
+
+**Shard**
+ The globally unique position of a chunk within the pool's acting set. Shards
+ are numbered ``0`` through ``zones × (k + m) - 1``. In the diagram above, Zone A
+ holds Shards 0, 1, 2 and Zone B holds Shards 3, 4, 5.
+
+**Zone-local**
+ An OSD or shard is zone-local if it shares the same ``zone`` as another
+ OSD or shard.
+
+**Zone Primary**
+ An internal EC convention. The Zone Primary is the first primary-capable
+ shard in the acting set that resides in a given zone.
+
+ These zone primaries will be used in (post-R1) stretch clusters to perform
+ local-to-zone erasure coding. The intent is to minimize inter-zone bandwidth
+ requirements.
+
+**Remote-zone Shard**
+ A relative term referring to a shard in a different zone. For example: "When
+ recovering data in Zone A, a remote-zone shard may be used to read data"
+
+
+7. Read Path & Recovery Strategies
+----------------------------------
+
+This architecture supports multiple read strategies to optimize for locality
+and handle failure scenarios. The strategies are presented in order of
+increasing complexity: starting with the baseline read-from-Primary path,
+then layering direct-read optimizations on top.
+
+7.1 Read from Primary — *R1*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The standard read path involves the client contacting the Primary OSD.
+
+- **Local Priority**: The Primary will prioritize reading from local-to-zone OSDs
+ (within its own ``zone``) to serve the request or recover data
+ (reconstruct the stripe). This minimizes inter-zone link traffic.
+
+7.1.1 Zone-local Recovery
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When the Primary cannot serve a read from a single zone-local shard (e.g., because
+a shard is missing or degraded), it reconstructs the data from the remaining
+zone-local shards within its ``zone``.
+
+7.1.2 Zone-Local Recovery
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the Primary cannot reconstruct data using only zone-local shards, it may issue
+direct reads to remote-zone OSDs. This cross-zone read path serves two purposes:
+
+- **Backfill / Recovery**: When a zone-local OSD is down or being backfilled, the
+ Primary can read the corresponding shard from the remote zone to recover the
+ missing data. It is likely that a zone with insufficient EC chunks to
+ reconstruct locally would be taken offline by the stretch cluster monitor
+ logic, but this fallback ensures availability during transient states.
+- **Medium Error Recovery**: If a zone-local OSD returns a medium error (e.g.,
+ unreadable sector), the Primary can recover the affected data by reading from
+ remote-zone shards and reconstructing locally.
+
+7.2 Direct Reads — Primary Zone — *R1*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This feature extends the existing "EC Direct Reads" capability (documented
+separately) to the stretch cluster topology. A client co-located with the
+Primary can read data directly from a zone-local shard, bypassing the Primary
+entirely on the good path. **No new code is required** — this is the standard
+EC Direct Read path operating over stretch-cluster shard numbering. Any failure
+falls back to the Primary read path (Section 7.1).
+
+.. mermaid::
+
+ sequenceDiagram
+ title Direct Read — Primary Zone
+
+ participant C as Client
+ participant P as Primary
+ participant ZS as Zone-local Shard
+
+ C->>ZS: Direct Read
+ activate ZS
+ ZS->>C: Complete
+ deactivate ZS
+
+ C->>ZS: Direct Read
+ activate ZS
+ ZS->>C: -EAGAIN
+ deactivate ZS
+ C->>P: Full Read
+ activate P
+ P->>ZS: Read
+ activate ZS
+ ZS->>P: Read Done
+ deactivate ZS
+ P->>C: Done
+ deactivate P
+
+- **Failure Handling & Redirection (R1)**:
+
+ Any failure encountered during a direct read — whether a missing shard,
+ ``-EAGAIN`` rejection, or medium error — results in the client redirecting
+ the operation to the **Primary** (regardless of the Primary's location).
+
+7.3 Direct Reads — Zone-Local — *R1*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The client will read from a zone-local shard.
+
+- **Zone-Local Shard Identification**: The client determines which OSDs in the
+ acting set reside within its own zone by comparing OSD CRUSH
+ locations against its own. This identifies the ``k+m`` shards of the local
+ zone. The client then selects a suitable set of data shards for the
+ direct read.
+
+- **Unavailable Zone-local OSDs**: If the zone-local shards are not available (e.g., the
+ zone-local OSDs are down or the client cannot identify any zone-local replica), the
+ client does not attempt a direct read and instead directs the operation to
+ the Primary. In later releases this will be improved to redirect to the
+ local Zone Primary instead.
+
+- **Direct Read Execution**: If a zone-local shard is available, the client sends
+ the read directly to that OSD. As detailed in the ``ec_direct_reads.rst``
+ design, it is generally acceptable for reads to overtake in-flight writes.
+ However, the OSD is aware if it has an "uncommitted" write (a write that
+ could potentially be rolled back) for the requested object. If such a write
+ exists, or if the client's operation requires strict ordering (e.g. the
+ ``rwordered`` flag is set), the OSD rejects the operation with ``-EAGAIN``.
+ Data access errors (e.g., a media error on the underlying storage) also cause
+ the OSD to reject the operation.
+
+- **Failure Handling & Redirection (R1)**:
+
+ An -EAGAIN will cause the client to retry the op. For R1, the op will be
+ redriven to the primary. R2 will redrive the op to the zone primary.
+
+7.4 Read from Zone Primary — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In the R1 release, the primary will handle all failures. In later releases,
+an op which cannot be processed as a direct read will be directed at the
+zone primary instead. The zone primary will attempt to reconstruct the
+read data from the zone-local shards directly.
+
+A conflict due to an uncommitted write will be continue to be handled by
+the primary, as the zone primary must also rejected an op if an
+uncommitted write exists for that object.
+
+- **Zone-local Recovery**: A read directed to a Zone Primary will attempt to serve
+ the request by recovering data using only OSDs within the same data center
+ (the remote zone).
+- **Zone Degradation**: If a zone has insufficient redundancy to reconstruct
+ data locally, that zone should be taken offline to clients rather than serving
+ reads that would require inter-zone link access. (How? - Needs to be covered elsewhere)
+- **``-EAGAIN`` Behavior**: The Zone Primary will return ``-EAGAIN`` only in
+ short-lived transient conditions:
+
+
+7.4.1 Safe Shard Identification & Synchronous Recovery — *Later Release*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, the Zone Primary does not inherently know which shards are
+consistent and safe to read. To manage this, the Peering process will be
+enhanced.
+
+- **Synchronous Recovery Set**: The existing peering Activate message will be
+ extended to include a "Synchronous Recovery Set". This set identifies shards
+ that are currently undergoing recovery and may not yet be consistent.
+
+ .. note::
+
+ The peering state machine has a transition from Active+Clean to Recovery,
+ which implies that recovery can start without a new peering interval.
+ This needs further investigation to ensure the Synchronous Recovery Set
+ is correctly maintained across such transitions.
+
+- **Consistency Guarantee**: Within any single epoch, the Synchronous Recovery
+ Set can only reduce over time (shards are removed as they complete recovery).
+ The set is treated as binary — either the full recovery set is active, or it
+ is empty. This ensures the Zone Primary is *conservatively stale*: it may
+ reject reads it could safely serve, but will never read from an inconsistent
+ shard.
+
+- **Initial Behavior**:
+
+ - Shards listed in the Synchronous Recovery Set will not be used for reads
+ by the Zone Primary.
+ - If the remaining available shards are insufficient to reconstruct the data,
+ the Zone Primary will return ``-EAGAIN``.
+ - Recovery messages will signal the Zone Primary when recovery completes,
+ clearing the set.
+
+- **Subsequent Enhancement**:
+
+ - A mechanism will be added to allow the Zone Primary to request permission
+ to read from a shard within the Synchronous Recovery Set.
+ - This request will trigger the necessary recovery for that specific object
+ (if not already complete) before the read is permitted.
+
+
+8. Write Path & Transaction Handling
+------------------------------------
+
+This section details how write operations are managed, specifically focusing on
+Read-Modify-Write (RMW) sequences and the replication of write transactions to
+remote zones.
+
+8.1 Direct-to-OSD Writes (Primary Encodes All Shards) — *R1*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In R1, all writes are coordinated exclusively by the Primary.
+
+- **Mechanism**: The Primary generates all ``zones × (k + m)`` coded shard writes
+ and sends individual write operations directly to every OSD in the acting
+ set, including those in remote zones.
+- **RMW Reads**: The "read" portion of any Read-Modify-Write cycle is performed
+ locally by the Primary, strictly adhering to the logic defined in Section 7
+ (Read Path & Recovery Strategies).
+- **Inter-zone traffic**: This sends all coded shard data (including parity)
+ over the inter-zone link. While this is not bandwidth-optimal, it requires
+ no Zone Primary involvement in write processing and therefore no new
+ write-path code beyond extending the existing shard fan-out to the larger
+ acting set.
+
+8.2 Replicate Transaction (Preferred Path) — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ This section is deferred to a later release. R1 sends all coded
+ shards directly from the Primary.
+
+.. mermaid::
+
+ sequenceDiagram
+ title Write to Primary
+
+ participant C as Client
+ participant P as Primary
+ participant ZS as Zone-local Shard
+ participant ZP as Zone Primary
+ participant RS as Remote-zone Shard
+
+ C->>P: Submit Op
+ activate P
+
+ P->>RP: Replicate
+ activate ZP
+
+ note over P: Replicate message must contain<br/>data and which shards are to be updated.<br/>When OSDs are recovering, the primary<br/>may need to update remote-zone shards directly.
+
+ P->>ZS: SubRead (cache)
+ activate ZS
+ ZS->>P: SubReadReply
+ deactivate ZS
+
+ P->>ZS: SubWrite
+ activate ZS
+ ZS->>P: SubWriteReply
+ deactivate ZS
+
+ ZP->>RS: SubRead (cache)
+ activate RS
+ RS->>RP: SubReadReply
+ deactivate RS
+
+ ZP->>RS: SubWrite
+ activate RS
+ RS->>RP: SubWriteReply
+ deactivate RS
+
+ ZP->>P: ReplicateDone
+ deactivate ZP
+
+ P->>C: Complete
+ deactivate P
+
+This mechanism is designed to minimize inter-zone link bandwidth usage, which
+is often the most constrained resource in stretch clusters.
+
+- **Mechanism**: The replicate message is an extension of the existing
+ sub-write message. Instead of sending individually coded shard data to every
+ OSD in the remote zone, the Primary sends a copy of the *PGBackend
+ transaction* (i.e., the raw write data and metadata, prior to EC encoding) to
+ the Zone Primary.
+- **Role of Zone Primary**: Upon receipt, the Zone Primary processes the
+ transaction through its local ECTransaction pipeline — largely unmodified
+ code — to generate the coded shards for its zone. For writes smaller than a
+ full stripe, this includes the Zone Primary issuing reads to its zone-local
+ shards so that the parity update can be calculated. It then fans out the shard
+ writes to the zone-local OSDs within its ``zone``. This means coding
+ (parity) data is never sent over the inter-zone link; it is computed
+ independently at each zone.
+- **Completion**: The Zone Primary responds using the existing sub-write-reply
+ message once all zone-local shard writes are durable.
+- **Crash Handling**: If the Zone Primary crashes mid-fan-out, any partial
+ writes on remote-zone shards are rolled back by the existing peering process. No
+ new crash-recovery mechanisms are required.
+- **Benefit**: This reduces inter-zone link traffic to a single transaction
+ message per remote zone, carrying only the raw data — not the ``k+m`` coded
+ chunks.
+
+8.3 Client Writes Direct to Zone Primary — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ This section is deferred to a later release. It requires Replicate
+ Transaction (Section 8.2) as a prerequisite, since the Zone Primary must
+ be able to process transactions and fan out shard writes locally.
+
+.. mermaid::
+
+ sequenceDiagram
+ title Write to Zone Primary
+
+ participant C as Client
+ participant P as Primary
+ participant ZS as Zone-local Shard
+ participant RC as Remote Client
+ participant ZP as Zone Primary
+ participant RS as Remote-zone Shard
+
+ RC->>RP: Write op
+ activate ZP
+
+ ZP->>P: Replicate
+ activate P
+ P->>RP: Permission to write
+
+ note over ZP: There is potential for pre-emptive caching here,<br/>but it adds complexity and mostly does not actually<br/>help performance, as the limiting factor is actually<br/>the Remote-zone write.
+
+ ZP->>RS: SubRead (cache)
+ activate RS
+ RS->>RP: SubReadReply
+ deactivate RS
+
+ ZP->>RS: SubWrite
+ activate RS
+ RS->>RP: SubWriteReply
+ deactivate RS
+
+ P->>ZS: SubRead (cache)
+ activate ZS
+ ZS->>P: SubReadReply
+ deactivate ZS
+
+ P->>ZS: SubWrite
+ activate ZS
+ ZS->>P: SubWriteReply
+ deactivate ZS
+
+ P->>RP: write done
+ ZP->>RC: Complete
+
+ note over ZP: The write complete message is permitted<br/>as soon as all SubWriteReply messages have<br/>been received by the remote. It is shown<br/>here to demonstrate that it is required to<br/>complete processing on the Primary.
+
+ ZP->>P: Remote-zone write complete
+ deactivate P
+ deactivate ZP
+
+ P->>P: PG Idle.
+
+ P->>RP: DummyOp
+ activate P
+ activate ZP
+
+ P->>ZS: DummyOp
+ activate ZS
+ ZS->>P: Done
+ deactivate ZS
+
+ ZP->>P: Done
+ deactivate P
+
+ note over ZP: Message ordering means that remote<br/>primary does not need to wait for<br/>dummy ops to complete
+
+ ZP->>RS: DummyOp
+ activate RS
+ RS->>RP: Done
+ deactivate RS
+ deactivate ZP
+
+This mechanism allows a client to write to its local Zone Primary rather than
+sending data over the inter-zone link to the Primary. The key benefit is that
+the write data crosses the inter-zone link only once (Zone Primary → Primary)
+rather than twice (client → Primary → Zone Primary). Unlike Section 8.2 where
+the Primary initiates the transaction and replicates it outward, here the
+Zone Primary receives the client write directly and coordinates with the
+Primary to maintain global ordering while avoiding redundant data transfer.
+
+- **Mechanism**:
+
+ 1. The client sends the write to its local **Zone Primary**. The Zone
+ Primary stashes a local copy of the write data but does not yet fan out
+ to zone-local shards.
+ 2. The Zone Primary replicates the *PGBackend transaction* (raw write data,
+ prior to EC encoding) to the **Primary**.
+ 3. The Primary processes the transaction as normal through its local
+ ECTransaction pipeline, performing cache reads, encoding, and fanning out
+ shard writes to its zone-local OSDs. When the Primary reaches the step where it
+ would normally replicate data to the originating Zone Primary (as in
+ Section 8.2), it instead sends a **write-permission message** to that
+ Zone Primary, since the Zone Primary already holds the data. Writes to
+ any *other* Zone Primaries (in an ``zones > 2`` configuration) proceed via
+ the normal Replicate Transaction path (Section 8.2).
+ 4. Upon receiving write-permission, the Zone Primary processes the
+ transaction through its local ECTransaction pipeline, generating and
+ fanning out the coded shard writes to the zone-local OSDs within its
+ ``zone``.
+ 5. Once the Primary's own local writes are durable and all other zones have
+ confirmed durability, the Primary sends a **completion message** to the
+ originating Zone Primary. The Zone Primary then responds to the client
+ once the completion message is received and its own local writes are also
+ durable.
+
+- **Write Ordering**: Strict write ordering must be maintained across all zones.
+ Since the Primary remains the single authority for PG log sequencing:
+
+ - The Primary sequences the write as part of its normal processing in step 3.
+ The write-permission message carries the assigned sequence number, ensuring
+ that writes arriving at the Primary directly and writes arriving via a
+ Zone Primary are globally ordered.
+ - Writes from multiple Zone Primaries (in an ``zones > 2`` configuration) are
+ serialized through the Primary's normal sequencing mechanism — no special
+ handling is required beyond the existing PG log ordering.
+ - The Zone Primary does not fan out to zone-local shards until it receives the
+ write-permission message, guaranteeing that zone-local shard writes occur in
+ the correct global order.
+
+- **Benefit**: For workloads where the client is co-located with a remote zone,
+ write data traverses the inter-zone link exactly once (Zone Primary →
+ Primary transaction replication). This halves the inter-zone bandwidth
+ compared to R1 (client → Primary → Zone Primary), and is equivalent to
+ Section 8.2 but with the added advantage that the client experiences local
+ write latency for the initial acknowledgement. Crucially, the data is never
+ sent back to the originating Zone Primary — the Primary only sends
+ lightweight permission and completion messages.
+
+- **Failure Handling**: If the Zone Primary is unavailable, the client falls
+ back to writing directly to the Primary (Section 8.1 or 8.2, depending on
+ what is available). If the Primary is unreachable from the Zone Primary
+ mid-transaction, the Zone Primary returns an error to the client and any
+ stashed data is discarded (no zone-local shard writes have been issued, since
+ write-permission was never received).
+
+- **R3 Enhancement: Forward to Other Zone Primaries (``zones > 2``)**:
+
+ .. note::
+
+ This enhancement is a potential R3 feature and may not be implemented.
+
+ In topologies with more than two zones, the originating Zone Primary could
+ forward the transaction to the other Zone Primaries in parallel with
+ sending it to the Primary. This would improve write latency for those zones
+ by allowing them to receive the data directly from a peer Zone Primary
+ rather than waiting for the Primary to replicate outward. The Primary would
+ still issue write-permission messages to all Zone Primaries to maintain
+ global ordering, but the data transfer to additional zones would already be
+ complete, reducing the critical path.
+
+ *Complexity / Review Note*: There is a known race condition here. The
+ write-permission message originating from the Primary might overtake the
+ data transfer sent by the originating Zone Primary. To handle this, a
+ unique transaction ID will be required to definitively tie these two
+ independent messages together at the receiving Zone Primary.
+
+8.4 Hybrid Approach — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ Requires Replicate Transaction (Section 8.2). Deferred to a later release.
+
+It is acknowledged that complex failure scenarios may require a combination of
+Replicate Transaction, Client-via-Remote-Primary, and Direct-to-OSD approaches.
+
+- **Example**: In a partial failure where one remote zone is healthy and another
+ is degraded, the Primary may use Replicate Transaction for the healthy zone
+ while simultaneously using Direct-to-OSD for the degraded zone within the
+ same transaction lifecycle.
+
+
+9. Recovery Logic
+------------------
+
+All recovery operations are centralized and coordinated by the Primary OSD.
+
+9.1 Primary-Centric Recovery — *R1*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In R1, the Primary performs all recovery without delegating to Zone
+Primaries and without attempting to minimize inter-zone bandwidth.
+
+- **Zone-local Recovery**: The Primary recovers missing shards in its own
+ ``zone`` using available zone-local chunks, exactly as standard EC
+ recovery.
+- **Remote-zone Recovery**: For every remote ``zone``, the Primary reads
+ all shards required to reconstruct any missing data, encodes the missing
+ shards locally, and pushes them directly to the target remote-zone OSDs.
+- **No Zone Primary Coordination**: The Zone Primary plays no role in
+ recovery in R1. All cross-zone data flows from or to the Primary.
+- **Trade-off**: This approach may send more data over the inter-zone link than
+ necessary, but it avoids the complexity of remote-delegated recovery and uses
+ existing recovery infrastructure with minimal modification.
+
+9.2 Cost-Based Recovery Planning — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ This section is deferred to a later release. R1 uses Primary-centric
+ recovery (Section 9.1) for all scenarios.
+
+When the system identifies an individual object (including its clones) requiring
+recovery, the Primary executes a per-object assessment and planning phase. This
+leverages the existing, reactive, object-by-object recovery mechanism rather
+than introducing a broad distributed orchestration layer. The overriding goal
+is to **minimize inter-zone link bandwidth** — every cross-zone transfer is
+expensive, so the Primary must dynamically choose the recovery strategy for
+each object that results in the fewest bytes crossing zone boundaries.
+
+- **Assess Capabilities**: For each ``zone``, the Primary
+ determines how many consistent chunks are available locally and how many are
+ missing.
+- **Cost-Based Plan Selection**: The Primary evaluates the inter-zone bandwidth
+ cost of each viable recovery strategy and selects the cheapest option. The
+ key trade-off is between:
+
+ 1. *Zone Primary performs zone-local recovery with remote-zone reads* — the Zone
+ Primary reads a small number of missing shards from the Primary's zone
+ and reconstructs locally.
+ 2. *Primary reconstructs and pushes* — the Primary reconstructs the full
+ object and pushes the missing shards to the remote zone.
+
+ The optimal choice depends on how many shards each zone is missing.
+
+- **Example**: Consider a ``6+2`` (k=6, m=2) configuration where Zone A has 8
+ good shards and Zone B has only 5 good shards (3 missing). Two options:
+
+ - *Option A (Zone Primary reads remotely)*: The Zone Primary at Zone B
+ needs only **1 remote-zone read** (it has 5 of the 6 required chunks locally
+ and fetches 1 from Zone A) to reconstruct the 3 missing shards locally.
+ **Cost: 1 shard across the inter-zone link.**
+ - *Option B (Primary pushes)*: The Primary reconstructs the 3 missing shards
+ at Zone A and pushes them to Zone B. **Cost: 3 shards across the
+ inter-zone link.**
+
+ Option A is clearly cheaper. A general algorithm should compare the
+ cross-zone transfer cost for each strategy and choose the minimum.
+ Where inter-zone bandwidth consumption is equal, the algorithm should tie-break
+ by optimizing for zone-local read bandwidth or the total number of operations.
+
+ (In the future, the system may adapt these priorities — e.g. explicitly
+ optimizing for read count rather than inter-zone bandwidth on high-bandwidth
+ links to slower rotational media — whether through auto-tuning or operator
+ configuration. However, exploring alternative optimization modes is beyond the
+ scope of this design.)
+
+ .. note::
+
+ The detailed algorithm for optimal recovery planning requires further
+ design work. The general principle is: count the number of cross-zone
+ shard transfers each strategy would require and choose the minimum.
+
+- **Plan Construction**: For each domain, the Primary constructs a "Recovery
+ Plan" message containing specific instructions detailing which OSDs must be
+ used to read the available chunks and which OSDs are the targets to write the
+ recovered chunks. Where cross-zone reads are part of the plan, the message
+ specifies which remote-zone shards to fetch.
+
+9.3 Delegated Remote-zone Recovery — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ Requires Cost-Based Recovery Planning (Section 9.2). Deferred to a later
+ release.
+
+- **Remote-zone Recovery (Plan-Based)**: The Primary sends a per-object "Recovery Plan"
+ to the Zone Primary (or appropriate peers), instructing them to execute the
+ recovery for that object (and clones) locally within their domain using the
+ provided read/write set. If the plan includes cross-zone reads, the Zone Primary
+ fetches the specified shards before reconstructing. Because this ties directly
+ into the existing object recovery lifecycle, any failure of a remote-zone peer
+ during the plan simply drops the recovery op. The Primary detects the failure
+ via standard peering or timeout mechanisms and will naturally re-assess and
+ generate a new plan for the object.
+
+9.4 Fallback Strategies — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ These strategies are part of the Cost-Based Recovery system (Section 9.2)
+ and are deferred to a later release.
+
+9.4.1 Remote-zone Fallback (Push Recovery)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If a remote zone cannot perform zone-local recovery, and the cost-based analysis
+determines that Primary-side reconstruction and push is the cheapest option:
+
+1. The Primary reconstructs the full object locally (using whatever valid shards
+ are available across the cluster).
+2. The Primary pushes the full object data to the Zone Primary.
+3. This push is accompanied by a set of instructions specifying which shards in
+ the remote zone must be written.
+
+9.4.2 Primary Domain Fallback
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the Primary's own domain lacks sufficient chunks to recover locally:
+
+1. The Primary is permitted to read required shards from remote zones to perform
+ the reconstruction.
+2. While this crosses zone boundaries, the cost-based planning ensures it is
+ only chosen when it results in fewer cross-zone transfers than any
+ alternative.
+
+
+10. PG Log Handling — *R1*
+---------------------------
+
+This section describes how PG log entries are managed across replicated EC
+stripes, building on the FastEC design for primary-capable and non-primary
+shards.
+
+10.1 Primary-Capable vs. Non-Primary Shards
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing FastEC design distinguishes between two classes of shard:
+
+- **Primary-capable shards**: Maintain a full copy of the PG log, enabling
+ them to take over as Primary if needed.
+- **Non-primary shards**: Maintain a reduced PG log, omitting entries where no
+ write was performed to that specific shard.
+
+For replicated EC stretch clusters, the non-primary shard selection algorithm
+will be extended. The following shards will be classified as **primary-capable**
+(i.e., they maintain a full PG log):
+
+- The first data shard in each replica (per ``zone``).
+- All coding (parity) shards in each replica.
+
+All remaining data shards will be non-primary shards with reduced logs.
+
+.. note::
+
+ Implementation detail: it needs to be determined whether ``pg_temp`` should
+ be used to arrange all primary-capable shards (local, then remote) ahead of
+ non-primary-capable shards in the acting set ordering.
+
+10.2 Independent Log Generation at Zone Primary — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+ Log generation at the Zone Primary is only needed when Replicate
+ Transaction writes (Section 8.2) are implemented. For R1, where the
+ Primary sends pre-encoded shards directly, PG log entries are generated
+ solely by the Primary and distributed with the sub-write messages as per
+ existing behavior.
+
+For latency optimization, the replicate transaction message (see Section 8.2)
+will be sent to the Zone Primary *before* the cache read has completed on the
+Primary. This means the Zone Primary cannot receive a pre-built log entry from
+the Primary — it must generate its own PG log entry independently from the
+transaction data.
+
+Both the Primary and Zone Primary will produce equivalent log entries from the
+same transaction, but they are generated independently at each zone.
+
+10.3 Log Entry Compatibility & Upgrade Safety — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``osdmap.requires_osd_release`` field dictates the minimum code level that
+OSDs can run, and therefore determines the log entry format that must be used.
+No new versioning mechanism is required — the existing release gating already
+constrains log entry compatibility across mixed-version clusters.
+
+Because the Zone Primary generates its own log entry from the transaction
+data, it must populate the Object-Based Context (OBC) to obtain the old
+version of attributes (including ``object_info_t`` for old size). This is an
+additional reason the Zone Primary must be a **primary-capable shard** — only
+primary-capable shards maintain the full PG log and OBC state needed for
+correct log entry construction.
+
+**Debug Mode for Log Entry Equivalence**
+
+A debug mode will be implemented that compares the log entries generated by the
+Primary and Zone Primary for each transaction. When enabled, the Primary will
+include its generated log entry in the replicate transaction message, and the
+Zone Primary will assert that its independently generated entry is identical.
+This mode will be **enabled by default in teuthology testing** to catch any
+divergence early. It will be available as a runtime configuration option for
+production debugging but disabled by default in production due to the
+additional message overhead.
+
+.. warning::
+
+ If a future release changes how PG log entries are derived from
+ transactions, the debug equivalence mode provides an automated safety net.
+ Any divergence between Primary and Zone Primary log entries will be caught
+ immediately in CI.
+
+
+11. Peering & Stretch Mode — *R1*
+-----------------------------------
+
+This section covers the peering, ``min_size``, and stretch-mode integration
+required for replicated EC pools. The design follows the existing replica
+stretch cluster pattern as closely as possible, with EC-specific adaptations.
+
+.. important::
+
+ **R1 scope is simple two-zone (zones=2).** Three-zone (``zones=3``) details are
+ described for design completeness but will be implemented in a later
+ release.
+
+11.1 Pool Lifecycle
+~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+ **CLI Under Review**: We are reviewing the CLIs for enabling stretch mode and
+ setting stretch mode on a pool. We aim to either get rid of it completely
+ (i.e., infer stretch mode automatically when you create a pool with ``zones > 1``)
+ or just have an enable/disable stretch mode CLI with no arguments. We also
+ will ensure the new CLI works with both stretch-EC and stretch-replica pools for R1.
+
+A replicated EC pool implicitly operates in stretch mode. Creation and configuration will be
+simplified based on the final CLI iteration as noted above.
+
+- **CRUSH rule**: Set to the stretch CRUSH rule when stretch mode is active.
+- **min_size**: Managed by the monitor. Automatically adjusted during stretch
+ mode state transitions (Section 11.3).
+- **Peering**: Uses stretch-aware acting set calculation with parameters
+ adjusted by the monitor during state transitions.
+- **Failure**: Automatic ``min_size`` reduction, degraded/recovery/healthy
+ stretch mode transitions — all managed by OSDMonitor.
+
+**Prerequisites for ``zones > 1`` Pool Creation**
+
+.. list-table::
+ :header-rows: 1
+
+ * - Requirement
+ - Reason
+ * - All OSDs have the ``zones > 1`` feature bit
+ - Ensures all OSDs understand extended acting-set semantics (Section 15)
+ * - Stretch mode must be enabled on the cluster
+ - ``zones > 1`` pools require the stretch mode state machine for
+ ``min_size`` management and zone failover
+ * - ``zone_failure_domain`` must match the stretch mode failure domain
+ - The CRUSH rule must align with the stretch cluster topology
+
+11.2 Minimum PG Size Semantics
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The user defines a ``min_size`` for a single zone, which implicitly specifies
+the number of failures the pool will tolerate. For an EC pool, this is in the
+range ``K`` to ``K+M``, defining a tolerance of ``0`` to ``M`` failures. For a
+replica pool, this is in the range ``1`` to ``size``, defining a tolerance of
+``0`` to ``size - min_size`` failures.
+
+Let the number of tolerated failures derived from this setup be denoted as **F**.
+
+If the number of zones > 1, then this setting is dynamically *interpreted*
+according to the cluster's stretch mode (Healthy, Degraded, Recovery). Both
+EC and Replica pools with multiple zones will interpret ``min_size`` this way,
+rather than actively modifying the ``min_size`` setting whenever a stretch
+mode transition occurs.
+
+11.2.1 Per-Pool Min-Size Interpretations — *R1*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The pool statically tolerates up to ``F`` OSD failures, but that tolerance
+applies to different scopes based on the active stretch mode:
+
+- **Healthy Mode**: There may be up to ``F`` OSD failures across the OSDs in
+ *all zones combined*.
+- **Degraded Mode**: There may be up to ``F`` OSD failures across the OSDs in
+ the *surviving zone(s)*. The failed zone is entirely ignored.
+- **Recovery Mode**: There may be up to ``F`` OSD failures across the OSDs in
+ the *surviving zone(s)*. The recovering zone is entirely ignored.
+
+.. note::
+ A partial failure within a zone may drop a pool below its ``min_size``
+ requirement and cause I/O to stop. Manually removing the rest of the failed
+ zone will cause a transition to **Degraded Stretch Mode**, which might be
+ sufficient to bring the pool back online because the ``min_size`` requirement
+ is now met by the surviving zone. There will be no automation to promote a
+ partial zone failure to a whole zone failure.
+
+11.2.2 Per-Zone Min-Size Interpretations — *Later Release*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A more sophisticated implementation will reinterpret this tolerance strictly on
+a per-zone basis:
+
+- There may be up to ``F`` OSD failures *in each individual zone*.
+- If any single zone experiences more than ``F`` OSD failures, it will
+ automatically trigger a transition into **Degraded Stretch Mode**, effectively
+ treating all OSDs in the zone as failed.
+
+11.3 Stretch Mode State Machine
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When stretch mode is enabled, the state machine behaves identically to the
+replica stretch mode state machine — leveraging the existing OSDMonitor
+infrastructure. As noted above, the explicit ``min_size`` setting is simply
+*interpreted* against this state machine, rather than actively mutated by the
+OSDMonitor upon transitions.
+
+.. mermaid::
+
+ stateDiagram-v2
+ [*] --> Healthy: enable_stretch_mode
+ Healthy --> Degraded: zone failure detected
+ Degraded --> Recovery: failed zone returns
+ Recovery --> Healthy: all PGs clean
+ Healthy --> [*]: disable_stretch_mode
+ Degraded --> Recovery: force_recovery_stretch_mode CLI
+ Recovery --> Healthy: force_healthy_stretch_mode CLI
+
+**Two-Zone Transitions (zones=2) — R1**
+
+.. list-table::
+ :header-rows: 1
+
+ * - Stretch State
+ - min_size
+ - Reasoning
+ * - **Healthy**
+ - ``zones × (K+M) − M``
+ - Full redundancy; tolerate up to M failures
+ * - **Degraded** (one zone down)
+ - ``K``
+ - One zone lost (K+M shards gone). Surviving zone has K+M shards;
+ can tolerate M more losses. ``K+M − M = K``.
+ * - **Recovery** (zone returning)
+ - ``K`` (same as degraded)
+ - Keep reduced min_size until resync is complete
+ * - **Healthy** (resync complete)
+ - ``zones × (K+M) − M``
+ - Full min_size restored
+
+**Concrete example — K=2, M=1, zones=2 (size=6):**
+
+.. list-table::
+ :header-rows: 1
+
+ * - Stretch State
+ - min_size
+ * - Healthy
+ - 5
+ * - Degraded
+ - 2
+ * - Recovery
+ - 2
+ * - Healthy (restored)
+ - 5
+
+**The degraded-mode formula:**
+
+- Healthy min_size: ``zones × (K+M) − M``
+- Degraded min_size: ``(num_zones−1) × (K+M) − M`` = healthy min_size minus
+ ``(K+M)``
+- **Rule: if a zone fails, reduce min_size by K+M. If a zone becomes
+ healthy again, increase min_size by K+M.**
+
+**Three-Zone Transitions (zones=3) — Later Release**
+
+.. list-table::
+ :header-rows: 1
+
+ * - Stretch State
+ - min_size
+ - Example (K=2, M=1)
+ * - Healthy (3 zones)
+ - ``3(K+M) − M``
+ - 8
+ * - One zone down
+ - ``2(K+M) − M``
+ - 5
+ * - Two zones down
+ - ``K``
+ - 2
+
+11.4 OSDMonitor Changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing OSDMonitor stretch mode code contains
+``if (is_replicated()) { ... } else { /* not supported */ }`` patterns in
+several places. These gaps must be filled for EC pools with ``zones > 1``.
+
+**11.4.1 Pool Stretch Set / Unset** (``prepare_command_pool_stretch_set``,
+``prepare_command_pool_stretch_unset``)
+
+``stretch_set`` currently works for any pool type, setting
+``peering_crush_bucket_*``, ``crush_rule``, ``size``, ``min_size``.
+
+For EC pools with ``zones > 1``, add validation:
+
+- Validate ``min_size ∈ [num_zones×(K+M)−M, num_zones×(K+M)]``
+- If ``size`` is provided, validate it matches ``zones × (K+M)``
+
+``stretch_unset`` clears all ``peering_crush_*`` fields. No EC-specific
+changes required.
+
+**11.4.2 Enable/Disable Stretch Mode** (``try_enable_stretch_mode_pools``)
+
+*Currently rejects EC pools.* Change to accept EC pools with ``zones > 1``:
+
+- Set ``peering_crush_bucket_count``, ``peering_crush_bucket_target``,
+ ``peering_crush_bucket_barrier`` (same values as replica)
+- Set ``crush_rule`` to the stretch CRUSH rule
+- Set ``size = r × (k + m)`` (should already be correct from pool creation)
+- Set ``min_size = r × (k + m) − m``
+
+**Pool Creation Gate**: Pool creation with ``zones > 1`` must be rejected if
+stretch mode is not already enabled on the cluster. This is validated in
+``OSDMonitor::prepare_new_pool``.
+
+**11.4.3 Degraded Stretch Mode** (``trigger_degraded_stretch_mode``)
+
+*Currently sets* ``newp.min_size = pgi.second.min_size / 2`` *for replica
+pools.* For EC pools, compute::
+
+ newp.min_size = p.min_size - (k + m)
+
+For a K=2, M=1, zones=2 pool: ``min_size = 5 − 3 = 2``.
+
+Also set ``peering_crush_bucket_count`` and
+``peering_crush_mandatory_member`` as for replicated pools.
+
+**11.4.4 Healthy Stretch Mode** (``trigger_healthy_stretch_mode``)
+
+*Currently reads* ``mon_stretch_pool_min_size`` *config for replica pools.*
+For EC pools, compute from the EC profile::
+
+ newp.min_size = r × (k + m) − m
+
+**11.4.5 Recovery Stretch Mode** (``trigger_recovery_stretch_mode``)
+
+No changes required — does not modify ``min_size``.
+
+11.5 Peering State Changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**11.5.1 Acting Set Calculation**
+
+``calc_replicated_acting_stretch`` is pool-type agnostic at the CRUSH bucket
+level, but for EC pools the acting set has **shard semantics** — position
+``i`` must serve shard ``i``. A new ``calc_ec_acting_stretch`` function (or
+an extension of the existing ``calc_ec_acting``) is required.
+
+Algorithm for each position ``i`` (0 to ``num_zones×(K+M)−1``):
+
+1. Prefer ``up[i]`` if usable
+2. Otherwise prefer ``acting[i]`` if usable
+3. Otherwise search strays for an OSD with shard ``i``
+4. If no usable OSD found, leave as ``CRUSH_ITEM_NONE``
+
+CRUSH ``bucket_max`` constraints apply: no zone may contribute more than
+``size / peering_crush_bucket_target`` OSDs.
+
+.. note::
+
+ This is essentially ``calc_ec_acting`` with the additional CRUSH bucket
+ awareness from the stretch path.
+
+**11.5.2 Async Recovery**
+
+``choose_async_recovery_replicated`` checks ``stretch_set_can_peer()`` before
+removing an OSD from async recovery. An EC stretch variant must respect shard
+identity: removing an OSD must not cause any zone to drop below K shards.
+
+**11.5.3 Stretch Set Validation** (``stretch_set_can_peer``)
+
+For EC pools, additionally verify:
+
+- The acting set includes at least K shards in at least one surviving zone
+- In healthy mode, all zones have ``K+M`` shards
+
+11.6 Relationship to ``peering_crush_bucket_*`` Fields
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing ``pg_pool_t`` fields remain unchanged:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Field
+ - Purpose
+ - EC Stretch Usage
+ * - ``peering_crush_bucket_count``
+ - Min distinct CRUSH buckets in acting set
+ - zones (zones) in healthy; reduced during degraded
+ * - ``peering_crush_bucket_target``
+ - Target CRUSH buckets for ``bucket_max`` calc
+ - zones (zones) in healthy; reduced during degraded
+ * - ``peering_crush_bucket_barrier``
+ - CRUSH type level (e.g., datacenter)
+ - Same as replica — the failure domain
+ * - ``peering_crush_mandatory_member``
+ - CRUSH bucket that must be represented
+ - Set to surviving zone during degraded mode
+
+**Example values for zones=2, K=2, M=1 (size=6):**
+
+.. list-table::
+ :header-rows: 1
+
+ * - Stretch State
+ - bucket_count
+ - bucket_target
+ - bucket_max
+ - mandatory_member
+ * - Healthy
+ - 2
+ - 2
+ - 3
+ - NONE
+ * - Degraded
+ - 1
+ - 1
+ - 6
+ - surviving_site
+ * - Recovery
+ - 1
+ - 1
+ - 6
+ - surviving_site
+ * - Healthy (restored)
+ - 2
+ - 2
+ - 3
+ - NONE
+
+``bucket_max = ceil(size / bucket_target)`` — in healthy mode, ``6 / 2 = 3 =
+K+M``. This naturally prevents more than one copy of each shard per zone.
+
+11.7 Network Partition Handling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing Ceph stretch cluster mechanism handles network partitions as
+follows:
+
+- **MON-Based Zone Election**: The MONs are distributed between zones (plus a
+ tie-break zone). When the inter-zone link is severed, the MONs elect which
+ zone continues to operate.
+- **OSD Heartbeat Discovery**: OSDs heartbeat the MONs and will discover if
+ they are attached to the losing zone. OSDs on the losing zone stop
+ processing IO.
+- **Lease-Based Read Gating**: A lease determines whether OSDs are permitted to
+ process read IOs. The surviving zone must wait for the lease to expire before
+ new writes can be processed, ensuring no stale reads are served by the losing
+ zone during the transition.
+
+This mechanism does not preclude more complex failure scenarios that may also
+need to be treated as zone failures:
+
+- **Asymmetric Network Splits**: OSDs at a remote zone may be able to
+ communicate with a zone-local MON even though the MONs have determined a zone
+ failover has occurred.
+- **Partial Zone Failures**: Loss of a rack or subset of OSDs within a zone may
+ warrant treating the zone as failed (e.g., if the remaining shards fall below
+ the per-zone ``min_size`` threshold defined in Section 11.2).
+
+11.8 Online OSDs in Offline Zones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a zone is marked as offline but individual OSDs within that zone remain
+reachable, they must be removed from the up set that is presented to the
+Objecter and peering state machine. The mechanism ties to the ``min_size``
+logic described in Section 11.2:
+
+- **Removal**: If fewer than ``k`` OSDs from a zone are present in the up set,
+ the remaining OSDs for that zone are also removed. This forces a zone
+ failover and prevents partial-zone IO.
+- **Reintegration**: When enough OSDs return to meet the per-zone threshold,
+ they are added back into the up set. Standard peering will handle resyncing
+ data — no new recovery code is required for reintegration.
+
+
+12. Scrub Behavior — *R1*
+-----------------------------
+
+The only modification required to scrub is how it verifies the CRCs received
+from each shard during deep scrub. The deep scrub process must:
+
+1. **Verify the coding CRC**, where the EC plugin supports it. This is current
+ behavior but must be extended to cover the replicated shards (i.e.,
+ verifying the coding relationship within each zone's stripe independently).
+2. **Verify cross-replica CRC equivalence**: corresponding shards in each
+ replica must have the same CRC. For example, SiteShard 0 at Zone A must
+ match SiteShard 0 at Zone B.
+
+Scrub never reads data to the Primary — it only collects and compares CRCs
+reported by each OSD. This means inter-zone bandwidth is not a significant
+concern, and a primary-centric scrub process continues to make sense for
+replicated EC stretch clusters.
+
+
+13. Migration Path
+-------------------
+
+13.1 Pool Migration — *R1*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Migration from an existing EC pool to a replicated EC stretch cluster pool will
+leverage the **Pool Migration** design, which is being implemented for the
+Umbrella release. This document does not define a separate migration mechanism.
+
+13.2 In-Place Replica Count Modification — *Later Release*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In a later release, users will be able to modify ``zones`` for an existing pool by
+swapping in a new EC profile that is otherwise identical (same ``k``, ``m``,
+plugin, etc.) but specifies a different ``zones`` value. Upon profile change, the
+CRUSH rule will be updated to reflect the new replica count, and the standard
+recovery process will automatically perform all necessary expansion (or
+contraction) to match the new configuration — no manual data migration is
+required.
+
+
+14. Implementation Order
+-------------------------
+
+This section defines the phased implementation plan, broken into single-sprint
+stories.
+
+14.1 R1 — Read-Optimized Stretch Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+R1 delivers a functional EC stretch cluster with zone-local read optimization
+(comparable to what Replica stretch clusters provide today). All writes and
+recovery traverse the inter-zone link via the Primary.
+
+1. **CRUSH Rule Generation for Replicated EC**
+ Implement automatic CRUSH rule creation from the ``zones`` and
+ ``zone_failure_domain`` pool parameters (Section 4). Validate that the acting
+ set places ``k+m`` shards per ``zone``.
+
+2. **EC Profile Extension: ``zones`` and ``zone_failure_domain`` Parameters**
+ Extend the EC profile and pool configuration to accept and validate ``zones``
+ and ``zone_failure_domain``. Wire these into pool creation and ``ceph osd pool``
+ commands (Section 2).
+
+3. **Primary-Capable Shard Selection for Replicated EC**
+ Extend the non-primary shard selection algorithm to designate the first data
+ shard and all parity shards per replica as primary-capable (Section 10.1).
+
+4. **Stretch Mode and Peering for Replicated EC**
+ Broken into the following sub-stories (see Sections 11.1–11.6):
+
+ a. **Pool Stretch Set/Unset for EC** (OSDMonitor): Allow ``osd pool
+ stretch set/unset`` on EC pools with ``zones > 1``. Add EC-specific
+ ``min_size`` range validation (Section 11.4.1).
+ b. **Enable/Disable Stretch Mode for EC** (OSDMonitor): Allow
+ ``mon enable_stretch_mode`` when the cluster has EC pools with
+ ``zones > 1``. Set ``min_size = r × (k+m) − m`` (Section 11.4.2).
+ c. **Stretch Mode Transitions for EC** (OSDMonitor): Implement
+ degraded/recovery/healthy transitions. On zone failure, reduce
+ ``min_size`` by ``k+m``; on recovery, restore it (Sections 11.4.3–5).
+ d. **EC Peering with Stretch Constraints** (PeeringState): Create
+ ``calc_ec_acting_stretch`` to respect both shard identity and CRUSH
+ ``bucket_max`` constraints. Extend async recovery checks
+ (Section 11.5).
+ e. **is_recoverable / is_readable for Stretch EC**: Verify and extend
+ ``ECRecPred`` and ``ECReadPred`` to account for the larger shard set
+ and per-zone constraints (Section 11.5.3).
+
+5. **Primary Write Fan-Out to All Shards (Direct-to-OSD Writes)**
+ Extend the Primary write path to encode and distribute ``zones × (k + m)``
+ shard writes directly to all OSDs in the acting set (Section 8.1).
+
+6. **Direct-to-OSD Reads with Primary Fallback**
+ Extend EC Direct Reads to the stretch topology: clients read locally and
+ redirect all failures to the Primary (Sections 7.2, 7.3).
+
+7a. **Single-OSD Recovery (Within-Zone)**
+ Extend recovery so the Primary can recover a single missing OSD within any
+ replica. The Primary reads shards from the affected replica, reconstructs
+ the missing shard, and pushes it to the replacement OSD (Section 9.1).
+
+7b. **Full-Zone Recovery**
+ Extend recovery to handle full-zone loss: the Primary reads from its local
+ (surviving) replica, encodes the full stripe, and pushes all ``k+m`` shards
+ to the recovering zone's OSDs. Also covers multi-OSD failures within a
+ single zone (Section 9.1).
+
+8. **Scrub CRC Verification for Replicated Shards**
+ Extend deep scrub to verify cross-replica CRC equivalence for corresponding
+ shards (Section 12).
+
+9. **Network Partition and Zone Failover Integration**
+ Validate the existing MON-based zone election and OSD heartbeat mechanisms
+ work correctly with the replicated EC acting set (Section 11.7).
+
+10. **End-to-End 2-Zone Integration Testing**
+ Full integration test suite covering read, write, recovery, scrub, and
+ zone-failover scenarios for a 2-zone (``zones=2``) configuration.
+
+11. **Inter-Zone Statistics**
+ Add perf counters tracking cross-zone bytes, operation counts, latency
+ histograms, and zone-local vs remote-zone-zone read ratios to give operators visibility
+ into inter-zone link utilization (Section 17).
+
+14.2 Later Release — Recovery Bandwidth Optimization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These stories reduce inter-zone link bandwidth consumed during recovery, in
+priority order.
+
+1. **Cost-Based Recovery Planning**
+ Implement the assessment and plan-selection algorithm that compares
+ cross-zone transfer costs for each recovery strategy (Section 9.2).
+
+2. **Recovery Plan Message and Zone Primary Execution**
+ Define the "Recovery Plan" message format and implement Zone Primary
+ execution of delegated recovery within its ``zone``
+ (Section 9.3).
+
+3. **Push Recovery Fallback**
+ Implement the path where the Primary reconstructs the full object and
+ pushes to the Zone Primary with shard-write instructions (Section 9.4.1).
+
+4. **Primary Domain Fallback (Cross-Zone Read for Zone-local Recovery)**
+ Allow the Primary to read shards from remote zones when its own domain
+ lacks sufficient chunks, only when cost-based planning selects it
+ (Section 9.4.2).
+
+14.3 Later Release — Write Bandwidth Optimization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These stories reduce inter-zone link bandwidth consumed during writes, in
+priority order.
+
+1. **Replicate Transaction Message**
+ Implement the replicate transaction message that sends raw write data to
+ the Zone Primary instead of pre-encoded shards. The Zone Primary
+ encodes locally (Section 8.2).
+
+2. **Independent PG Log Generation at Zone Primary**
+ Enable the Zone Primary to generate its own PG log entries from the
+ replicate transaction, including OBC population and the debug equivalence
+ mode (Sections 10.2, 10.3).
+
+3. **Client Writes Direct to Zone Primary**
+ Enable clients to write to their local Zone Primary, which stashes the
+ data, replicates to the Primary, and fans out locally only after receiving
+ write-permission. Includes the write-permission/completion message
+ mechanism to maintain global consistency (Section 8.3).
+
+4. **Hybrid Write Path (Replicate + Direct-to-OSD)**
+ Support mixed write strategies within a single transaction when zones are
+ in different health states (Section 8.4).
+
+14.4 Later Release — Additional Enhancements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These stories improve resilience and flexibility but are lower priority than
+bandwidth optimizations.
+
+1. **Read from Zone Primary**
+ Enable clients to send reads to their local Zone Primary for local
+ reconstruction, including the Synchronous Recovery Set mechanism
+ (Sections 7.4, 7.4.1).
+
+2. **Direct-to-OSD Failure Redirection to Zone Primary**
+ Upgrade direct-read failure handling to redirect to the local Zone
+ Primary instead of the global Primary (Section 7.5).
+
+3. **Per-Zone min_size with Zone Failover**
+ Implement per-zone minimum shard thresholds that automatically trigger
+ zone failover (Section 11.2, later release).
+
+4. **Online OSDs in Offline Zones Handling**
+ Implement removal of residual online OSDs from the up set when their zone
+ is offline (Section 11.8).
+
+5. **In-Place Replica Count Modification**
+ Allow ``zones`` to be changed on an existing pool via EC profile swap
+ (Section 13.2).
+
+6. **3-Zone (``zones=3``) Full Integration Testing**
+ Full integration and real-world testing of 3-zone configurations
+ (Section 5).
+
+
+15. Upgrade & Backward Compatibility
+--------------------------------------
+
+Replicated EC pools with ``zones=1`` are fully backward compatible. An ``zones=1``
+profile produces a standard ``(k+m)`` acting set with no additional replication,
+no new on-disk format, and no new wire messages. Existing OSDs and clients will
+handle these pools without modification, because the resulting behavior is
+identical to a conventional EC pool.
+
+Pools with ``zones > 1`` introduce a larger acting set (``zones × (k + m)`` shards),
+new CRUSH rules, and — in later releases — new inter-OSD messages
+(Replicate Transaction, Read Permissions, etc.). To prevent mixed-version
+clusters from misinterpreting these pools:
+
+- A new **OSD feature bit** will gate the creation of ``zones > 1`` profiles. The
+ monitor will reject pool creation or EC profile changes that set ``zones > 1``
+ unless all OSDs in the cluster advertise this feature bit.
+- This ensures that every OSD in the cluster understands the extended acting
+ set semantics, shard numbering, and any new message types before an ``zones > 1``
+ pool can be instantiated.
+- No data migration is required when upgrading: the feature bit is purely an
+ admission control mechanism. Once all OSDs are upgraded and the bit is
+ present, ``zones > 1`` pools can be created normally.
+
+.. note::
+ **Upgrade Scenarios**: We are still thinking through upgrade scenarios from older
+ releases (e.g. can we infer ``zones = 2`` pools automatically at upgrade time?).
+
+ Open design decisions that will be resolved at implementation:
+
+ What to do with stretched replica pools during upgrade. One option is to leave
+ these as is (with num_zones = 1 and the current interpretation of min-size)
+ and continue to support all the arguments on enable/disable stretch mode. An
+ alternative option would be to try and automatically convert these pools to
+ num_zones = 2 and the new interpretation of min-size). These pools should
+ currently have a custom CRUSH rule which should be maintained.
+
+16. Kernel Changes
+-------------------
+
+The kernel RBD client (``krbd``) implements its own EC direct read path,
+independent of the userspace ``librados`` client. For replicated EC pools, the
+``krbd`` direct read logic must be updated to understand the ``zones × (k + m)``
+shard layout so that it can identify and target zone-local shards within the
+client's ``zone``. Without this change, ``krbd`` may attempt direct
+reads to shards in a remote zone, negating the locality benefit.
+
+The required modification is small — the shard selection logic needs to account
+for the replica stride when choosing which shard to read from — but it is a
+kernel-side change and therefore follows the kernel release cycle independently
+of the Ceph userspace releases.
+
+
+17. Inter-Zone Statistics — *R1*
+----------------------------------
+
+Operators need visibility into inter-zone link utilization to validate that
+the stretch EC design is delivering its locality benefits and to plan capacity.
+The following statistics will be exposed per-pool and per-OSD:
+
+- **Cross-zone bytes sent / received**: Total bytes transferred to/from OSDs in
+ remote zones, broken down by operation type:
+
+ - Write fan-out (shard data sent to remote zone)
+ - Recovery push (shard data sent during recovery)
+ - remote-zone read (shard data read from remote zone for reconstruction or
+ medium-error recovery)
+
+- **Cross-zone operation counts**: Number of cross-zone sub-operations, broken
+ down by type (sub-write, sub-read, recovery push).
+
+- **Cross-zone latency**: Histogram (p50 / p95 / p99) of round-trip latency for
+ cross-zone sub-operations, useful for detecting link degradation.
+
+- **local vs remote zone read ratio**: For direct-read-enabled pools, the fraction of
+ reads served locally versus those that fell back to the Primary across a zone
+ boundary. A high remote-zone ratio indicates a locality problem.
+
+These counters will be exposed through the existing ``ceph perf`` counter
+infrastructure and will be queryable via ``ceph tell osd.N perf dump`` and
+the Ceph Manager dashboard. They should also be available at the pool level
+via ``ceph osd pool stats``.
+
+.. note::
+
+ The inter-zone classification relies on comparing the CRUSH location of the
+ source OSD with the destination OSD's ``zone``. This comparison is
+ already performed during shard fan-out; the statistics layer adds only
+ counter increments on the existing code path.
+
+
+18. Non-Redundant Pool Zone Affinity
+------------------------------------
+
+A stretch cluster topology may additionally host workloads that do not require
+multi-zone redundancy, or where redundancy is handled at a higher application
+layer. To accommodate this, a non-redundant pool can be configured with zone
+affinity.
+
+This is achieved by setting up the pool to use a specific CRUSH root. When creating the pool, you set the ``zones`` to ``1`` (the default) and pass a ``crush_root`` parameter targeting a specific datacenter bucket or zone bucket within your CRUSH hierarchy.
+
+Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~
+
+When configuring a pool with affinity to a specific zone, the system generates a
+CRUSH rule that performs a standard ``take <crush_root>`` operation on the
+designated zone bucket.
+
+For instance, if a cluster has two datacenters defined in CRUSH as ``DC1`` and
+``DC2``, configuring a pool with ``crush_root=DC1`` and ``zones=1`` will prompt the
+system to generate a CRUSH rule that starts with ``take DC1``, ensuring all
+data for that pool resides completely within that datacenter. This allows non-redundant
+applications to leverage zone-local storage without incurring the latency or bandwidth
+costs of crossing the inter-zone link.