--- /dev/null
+ceph_test_rados — Model-Based RADOS Stress Test
+================================================
+
+``ceph_test_rados`` is a model-based integration test that verifies the
+data correctness of the RADOS layer under stress. It maintains an in-memory
+model of expected object data and metadata, and compares it against the
+actual object data returned by RADOS after every read, detecting data
+corruption, snapshot inconsistencies, and attribute mismatches.
+
+.. note::
+
+ This is **not** a performance benchmark. For throughput and latency
+ measurement, use ``rados bench``. ``ceph_test_rados`` is a
+ *correctness verifier*.
+
+How It Works
+------------
+
+1. **Initialization**: Creates ``--objects`` initial objects via write
+ (or append for EC pools).
+2. **Stress loop**: Generates a randomized stream of up to ``--max-ops``
+ operations, each selected by weighted probability from the
+ ``--op`` arguments.
+3. **Verification**: Every read dispatches 3 pipelined reads and
+ compares data, xattrs, and omap entries against the in-memory model.
+4. **Completion**: Prints the error count and per-operation-type
+ statistics to stderr.
+
+Architecture
+~~~~~~~~~~~~
+
+The tool is built from several components:
+
+- ``TestRados.cc`` — CLI parsing, ``main()``, and the
+ ``WeightedTestGenerator`` which selects operations by weight.
+- ``RadosModel.h`` — The ``RadosTestContext`` (in-memory model) and all
+ 26 ``TestOp`` subclasses (``ReadOp``, ``WriteOp``, ``SnapCreateOp``,
+ etc.).
+- ``Object.h`` / ``Object.cc`` — Content generators
+ (``VarLenGenerator``, ``AppendGenerator``) and the ``ObjectDesc``
+ model that tracks layered object contents across snapshots.
+- ``TestOpStat.h`` — Per-operation-type latency statistics collector.
+
+Synopsis
+--------
+
+::
+
+ ceph_test_rados
+ --op <read|write|write_excl|writesame|delete|snap_create|snap_remove|
+ rollback|setattr|rmattr|watch|copy_from|hit_set_list|is_dirty|
+ undirty|cache_flush|cache_try_flush|cache_evict|append|append_excl|
+ set_redirect|unset_redirect|chunk_read|tier_promote|tier_flush|
+ set_chunk|tier_evict> <weight>
+ [--op <operation_type> <weight> ...]
+ [--pool <pool_name>]
+ [--max-ops <op_count>]
+ [--objects <object_count>]
+ [--max-in-flight <max_concurrent>]
+ [--size <max_size_bytes>]
+ [--min-stride-size <bytes>]
+ [--max-stride-size <bytes>]
+ [--max-seconds <seconds>]
+ [--ec-pool]
+ [--no-omap]
+ [--no-sparse]
+ [--pool-snaps]
+ [--balance-reads]
+ [--localize-reads]
+ [--offlen_randomization_ratio <0-100>]
+ [--write-fadvise-dontneed]
+ [--max-attr-len <bytes>]
+ [--set_redirect]
+ [--set_chunk]
+ [--low_tier_pool <pool_name>]
+ [--enable_dedup]
+ [--dedup_chunk_algo <fastcdc|fixcdc>]
+ [--dedup_chunk_size <bytes>]
+ [--timestamps]
+
+At least one ``--op`` with a positive weight is required.
+
+Core Parameters
+---------------
+
+``--pool <name>``
+ Target RADOS pool (must already exist). Default: ``rbd``.
+
+``--max-ops <n>``
+ Maximum number of operations to execute (including initial object
+ writes). Default: ``1000``.
+
+``--objects <n>``
+ Number of distinct objects to create and test against. Must satisfy
+ ``max_in_flight * 2 <= objects``. Default: ``50``.
+
+``--max-in-flight <n>``
+ Maximum concurrent asynchronous operations. Default: ``16``.
+
+``--max-seconds <n>``
+ Wall-clock time limit in seconds. ``0`` means unlimited (run until
+ ``--max-ops`` is exhausted). Default: ``0``.
+
+Object Geometry
+---------------
+
+``--size <n>``
+ Maximum object size in bytes. Actual sizes are randomized within
+ approximately ``[size/2, size]``. Default: ``4000000`` (~3.8 MiB).
+
+``--min-stride-size <n>``
+ Minimum write stride in bytes. Must be < ``--max-stride-size``
+ and <= ``--size``. Default: ``size / 10``.
+
+``--max-stride-size <n>``
+ Maximum write stride in bytes. Must be > ``--min-stride-size``
+ and <= ``--size``. Default: ``size / 5``.
+
+Pool Type and Behavior
+----------------------
+
+``--ec-pool``
+ Indicates that the target is an erasure-coded pool **that does not support overwrites**.
+ **Must appear before any** ``--op`` **arguments.**
+
+ .. note::
+
+ This is largely a legacy parameter. When Ceph originally introduced
+ EC pools, they did not support partial overwrites or sparse reads. Today,
+ if an EC pool supports overwrites (e.g., via BlueStore), you should *not*
+ use this flag, so that ``ceph_test_rados`` can test partial overwrites.
+ In the Teuthology QA suite, setting ``erasure_code_use_overwrites: true``
+ prevents the test runner from passing this flag.
+
+ Using this flag has the following effects:
+
+ 1. Implicitly sets ``--no-sparse``.
+ 2. Initial object creation writes use ``append`` mode instead of ``write``.
+ 3. Overwrite operations (``write``, ``write_excl``, ``writesame``) are
+ disallowed and will cause startup validation to fail.
+
+``--no-omap``
+ Disable omap operations. Automatically set if the pool does not
+ support omap (auto-detected at startup).
+
+``--no-sparse``
+ Disable sparse reads (use full reads only). Automatically set when
+ ``--ec-pool`` is used.
+
+``--pool-snaps``
+ Use pool-level snapshots instead of self-managed snapshots.
+
+Read Routing
+------------
+
+``--balance-reads``
+ Set ``LIBRADOS_OPERATION_BALANCE_READS`` on read operations,
+ allowing reads from any replica.
+
+``--localize-reads``
+ Set ``LIBRADOS_OPERATION_LOCALIZE_READS`` on read operations,
+ preferring the closest replica.
+
+``--offlen_randomization_ratio <n>``
+ Percentage chance (0–100) that a read uses a randomized offset
+ instead of reading from offset 0. Default: ``50``.
+
+Write Behavior
+--------------
+
+``--write-fadvise-dontneed``
+ Set the ``write_fadvise_dontneed`` flag on the pool, advising the
+ OSD backend not to cache written data.
+
+``--max-attr-len <n>``
+ Maximum generated xattr length in bytes. Default: ``20000``.
+
+Manifest and Tiering
+--------------------
+
+``--set_redirect``
+ Enable redirect manifest testing. Requires ``--low_tier_pool``.
+
+``--set_chunk``
+ Enable chunk-based manifest testing. Requires ``--low_tier_pool``.
+
+``--low_tier_pool <name>``
+ Low-tier pool for redirect/chunk/dedup operations. Must be a
+ different pool from ``--pool`` to avoid a known race condition.
+ Required when ``--set_redirect`` or ``--set_chunk`` is set.
+
+Deduplication
+-------------
+
+``--enable_dedup``
+ Enable deduplication testing. Requires ``--dedup_chunk_algo`` and
+ ``--dedup_chunk_size``. Configures the pool with SHA-256
+ fingerprinting and the specified chunking algorithm.
+
+``--dedup_chunk_algo <algorithm>``
+ Chunking algorithm: ``fastcdc`` or ``fixcdc``.
+
+``--dedup_chunk_size <size>``
+ Chunk size for content-defined chunking (e.g., ``131072``).
+
+Output
+------
+
+``--timestamps``
+ Prefix each output line with a coarse timestamp.
+
+Operation Types
+---------------
+
+Operations are specified via ``--op <name> <weight>``. Weights are
+relative: an operation with weight 100 is twice as likely as one with
+weight 50.
+
+.. list-table::
+ :header-rows: 1
+ :widths: 20 10 70
+
+ * - Name
+ - Valid with --ec-pool
+ - Description
+ * - ``read``
+ - Yes
+ - Read and verify object data, xattrs, and omap against the model.
+ * - ``write``
+ - No
+ - Random-offset partial write.
+ * - ``write_excl``
+ - No
+ - Random-offset partial write that asserts the object already exists
+ (``assert_exists()``) as part of the transaction.
+ * - ``writesame``
+ - No
+ - Write same data pattern across an extent.
+ * - ``delete``
+ - Yes
+ - Delete an object.
+ * - ``snap_create``
+ - Yes
+ - Create a snapshot (quiesces in-flight ops first).
+ * - ``snap_remove``
+ - Yes
+ - Remove a snapshot.
+ * - ``rollback``
+ - Yes
+ - Roll back an object to a previous snapshot.
+ * - ``setattr``
+ - Yes
+ - Set random xattrs (and omap if supported).
+ * - ``rmattr``
+ - Yes
+ - Remove random xattrs (and omap if supported).
+ * - ``watch``
+ - Yes
+ - Establish a watch, self-notify, wait for callback.
+ * - ``copy_from``
+ - Yes
+ - Server-side copy between objects in the pool.
+ * - ``hit_set_list``
+ - Yes
+ - List HitSet entries.
+ * - ``is_dirty``
+ - Yes
+ - Check object dirty state (cache tier).
+ * - ``undirty``
+ - Yes
+ - Mark object clean (cache tier).
+ * - ``cache_flush``
+ - Yes
+ - Flush object from cache tier (blocking).
+ * - ``cache_try_flush``
+ - Yes
+ - Try to flush object from cache tier (non-blocking).
+ * - ``cache_evict``
+ - Yes
+ - Evict object from cache tier.
+ * - ``append``
+ - Yes
+ - Append data to an object.
+ * - ``append_excl``
+ - Yes
+ - Append data that asserts the object already exists.
+ * - ``set_redirect``
+ - Yes
+ - Set redirect manifest to low-tier pool.
+ * - ``unset_redirect``
+ - Yes
+ - Remove redirect manifest.
+ * - ``chunk_read``
+ - Yes
+ - Read and verify a chunk from a manifest object.
+ * - ``tier_promote``
+ - Yes
+ - Promote object from lower tier.
+ * - ``tier_flush``
+ - Yes
+ - Flush object to backing tier.
+ * - ``set_chunk``
+ - Yes
+ - Set chunk manifest (requires ``--enable_dedup``).
+ * - ``tier_evict``
+ - Yes
+ - Evict object to backing tier.
+
+Environment Variables
+---------------------
+
+``CEPH_CLIENT_ID``
+ Client ID for the librados connection. If unset, connects as the
+ default client.
+
+Standard Ceph environment variables (``CEPH_CONF``, ``CEPH_KEYRING``,
+etc.) are respected.
+
+Teuthology Integration
+----------------------
+
+The tool is typically invoked via the ``rados`` Teuthology task defined
+in ``qa/tasks/rados.py``. The task creates pools, translates YAML
+configuration into CLI arguments, and manages the process lifecycle.
+
+Example YAML configuration::
+
+ tasks:
+ - rados:
+ clients: [client.0]
+ ops: 400000
+ max_seconds: 600
+ objects: 1024
+ size: 16384
+ op_weights:
+ read: 100
+ write: 100
+ delete: 50
+ snap_create: 50
+ snap_remove: 50
+ rollback: 50
+
+Workload examples are in ``qa/suites/rados/thrash*/workloads/``.
+
+.. note::
+
+ The Teuthology wrapper automatically splits ``write`` and ``append``
+ weights into regular and ``_excl`` halves. This does not happen at
+ the CLI level: specify both variants explicitly when invoking the
+ binary directly.
+
+Examples
+--------
+
+Basic replicated pool test::
+
+ ceph_test_rados \
+ --pool testpool \
+ --max-ops 10000 \
+ --objects 500 \
+ --max-in-flight 16 \
+ --size 4000000 \
+ --op read 100 \
+ --op write 100 \
+ --op delete 10
+
+EC pool (without allow_ec_overwrites) with snapshots::
+
+ ceph_test_rados \
+ --ec-pool \
+ --pool my-ec-pool \
+ --max-ops 4000 \
+ --objects 50 \
+ --pool-snaps \
+ --op read 100 \
+ --op append 100 \
+ --op delete 50 \
+ --op snap_create 50 \
+ --op snap_remove 50 \
+ --op rollback 50
+
+Deduplication test::
+
+ ceph_test_rados \
+ --pool testpool \
+ --low_tier_pool low_tier \
+ --set_chunk \
+ --enable_dedup \
+ --dedup_chunk_algo fastcdc \
+ --dedup_chunk_size 131072 \
+ --max-ops 1500 \
+ --objects 50 \
+ --op read 100 \
+ --op write 50 \
+ --op set_chunk 30 \
+ --op tier_promote 10
+
+Exit Status
+-----------
+
+The tool will immediately panic (via ``ceph_abort()``) and dump core
+if any data verification errors (e.g., mismatching object content,
+corrupt metadata) are detected during reads.
+
+If no bugs are hit and the execution time/op count is exhausted, the
+tool will exit cleanly with status **0**.
+
+Exit status **1** indicates a startup validation failure (such as
+incompatible arguments).
+
+Source Files
+------------
+
+- ``src/test/osd/TestRados.cc`` — CLI parsing and main loop
+- ``src/test/osd/RadosModel.h`` — Test context and operation classes
+- ``src/test/osd/Object.h`` — Content generation and verification model
+- ``src/test/osd/TestOpStat.h`` — Operation statistics
+- ``qa/tasks/rados.py`` — Teuthology task wrapper