From 8267213930d88f7614a9b13a362265fccd9634ba Mon Sep 17 00:00:00 2001
From: Adam Kupczyk <akupczyk@ibm.com>
Date: Mon, 18 Aug 2025 06:28:01 +0000
Subject: [PATCH] doc/dev : cputrace documentation addition

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
(cherry picked from commit 4c1e70e06e16dc9fca19ba6bef4897f061ec7b37)
---
 doc/dev/cputrace.rst | 304 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 304 insertions(+)
 create mode 100644 doc/dev/cputrace.rst

diff --git a/doc/dev/cputrace.rst b/doc/dev/cputrace.rst
new file mode 100644
index 0000000000000..2ca57bf0913b2
--- /dev/null
+++ b/doc/dev/cputrace.rst
@@ -0,0 +1,304 @@
+========
+CpuTrace
+========
+
+CpuTrace is a developer tool that measures the CPU cost of execution.
+It is useful when deciding between algorithms for new code and for
+validating performance enhancements.
+CpuTrace measures CPU instructions, clock cycles, branch mispredictions,
+cache misses and thread reschedules.
+
+Integration into Ceph
+---------------------
+
+To enable CpuTrace, build with the ``WITH_CPUTRACE`` flag:
+
+.. code-block:: bash
+
+  ./do_cmake.sh -DWITH_CPUTRACE=1
+
+Once built with CpuTrace support, you can annotate specific functions
+or code regions using the provided macros and helper classes.
+
+To enable profiling in your code, include the CpuTrace header:
+
+.. code-block:: cpp
+
+   #include "common/cputrace.h"
+
+Then you can mark functions for profiling using the provided helpers.
+
+Raw counter mode
+----------------
+
+CpuTrace is using the Linux ``perf_event_open`` syscall. You can use the tool
+as a simple helper to get access to hardware perf counters.
+
+.. code-block:: cpp
+
+  // I am profiling my code and want to know
+  // how many clock cycles and how many thread switches it takes
+  HW_ctx hw = HW_ctx_empty;
+  HW_init(&hw, HW_PROFILE_SWI|HW_PROFILE_CYC);
+  sample_t start, end;
+  HW_read(&hw, &start);
+  // my code starts
+  // .....
+  // my code ends
+  HW_read(&hw, &end);
+  // task_switches = end.swi - start.swi;
+  // clock_cycles  = end.cyc - start.cyc;
+  HW_clean(&hw);
+
+By inspecting ``task_switches`` and ``clock_cycles`` the developer can learn that
+real clock execution time of 10ms has only 1M clock cycles, but had 2 task switches.
+
+Aggregating samples
+-------------------
+
+A single readout of execution time is usually not enough. We need more samples
+to get a more realistic measurement of actual execution cost.
+
+.. code-block:: cpp
+
+  // a variable to hold my measurement
+  static measurement_t my_code_time;
+  sample_t start, end, elapsed;
+  // hw initialized somewhere else
+  HW_read(&hw, &start);
+  // my code starts
+  // .....
+  // my code ends
+  HW_read(&hw, &end);
+  elapsed = end - start;
+  // add new sample to the whole measurement
+  my_code_time.sample(elapsed);
+
+``measurement_t``
+-----------------
+
+The ``measurement_t`` type aggregates collected samples and counts the number
+of measurements performed.
+
+It produces summary statistics that include:
+
+- **count** : total number of measurements
+- **average** : mean value across all samples
+- **zero / non-zero split** : how many measurements were exactly zero
+  versus greater than zero (only for context switch metrics)
+
+These statistics provide a compact and clear view of performance measurements.
+
+``measurement_t`` can also export results in two formats:
+
+- **Ceph Formatter** (for structured JSON/YAML/XML output):
+
+  .. code-block:: cpp
+
+     ceph::Formatter* jf;
+     m->dump(jf, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
+
+- **String stream** (for plain-text logging):
+
+  .. code-block:: cpp
+
+     std::stringstream ss;
+     m->dump_to_stringstream(ss, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
+     std::cout << ss.str();
+
+This makes it easy to either integrate measurements into Cephâs
+structured output pipeline or dump them as human-readable text for debugging.
+
+RAII samples
+------------
+
+It is usually most convenient to use RAII to collect samples.
+With RAII, measurement begins automatically when the guard object is created
+and ends when it goes out of scope, so no explicit start/stop calls are required.
+
+The hardware context (``HW_ctx``) must be initialized once before creating
+guards. After initialization, the same context can be reused across multiple
+measurements.
+
+``HW_guard`` takes two arguments:
+
+- ``HW_ctx* ctx``
+  Pointer to the initialized hardware context.
+
+- ``measurement_t* m``
+  Pointer to the measurement object where results will be stored.
+
+
+Example:
+
+.. code-block:: cpp
+
+  // variable to hold measurement results
+  static measurement_t my_code_time;
+  {
+    HW_guard guard(&hw, &my_code_time);
+    // code to be measured
+    // ...
+  }
+
+Named measurements
+------------------
+
+Code regions can be measured using a `named guard`.
+Each ``HW_named_guard`` automatically starts measurement at construction and stops when leaving scope.
+
+.. code-block:: cpp
+
+  {
+    HW_named_guard("function", &hw);
+    // my code starts
+    // ...
+    // my code ends
+  }
+
+This example records the execution time of ``function``.
+
+The guard requires a pointer to a previously initialized ``HW_ctx``.
+This context must be created and set up (e.g., during program initialization)
+before guards can be used.
+
+Named guards provide a simple and consistent way to track performance metrics.
+
+To later access the collected measurements for a given name, use:
+
+.. code-block:: cpp
+
+  measurement_t* m = get_named_measurement("function");
+  if (m) {
+    // inspect m->sum_cyc, m->sum_ins.
+    // m->dump_to_stringstream(ss, HW_PROFILE_INS|HW_PROFILE_CYC);
+  }
+
+Admin socket integration
+------------------------
+
+In addition to direct instrumentation in code, CpuTrace can also be controlled
+at runtime via the admin socket interface. This allows developers to start,
+stop, and inspect profiling in running Ceph daemons without rebuilding or
+restarting them.
+
+To profile a function, annotate it with the provided macros:
+
+.. code-block:: cpp
+
+  HWProfileFunctionF(profile, __func__,
+                     HW_PROFILE_CYC  | HW_PROFILE_CMISS |
+                     HW_PROFILE_INS  | HW_PROFILE_BMISS |
+                     HW_PROFILE_SWI);
+
+- ``profile`` is a local variable name for the profiler object and only needs to be unique within the profiling scope.
+- ``__func__`` (or any string you pass as the name) is the unique anchor name for this profiling scope.
+
+Each unique name creates a separate anchor. Reusing the same name in multiple places will trigger an assertion failure.
+
+This macro automatically attaches a profiler to the function scope and
+collects the specified hardware counters each time the function executes.
+
+You can combine any of the available flags:
+
+* ``HW_PROFILE_CYC``   â CPU cycles
+* ``HW_PROFILE_CMISS`` â Cache misses
+* ``HW_PROFILE_BMISS`` â Branch mispredictions
+* ``HW_PROFILE_INS``   â Instructions retired
+* ``HW_PROFILE_SWI``   â Context switches
+
+Available commands:
+
+* ``cputrace start`` â Start profiling with the configured groups/counters
+* ``cputrace stop`` â Stop profiling and freeze results
+* ``cputrace dump`` â Dump all collected metrics (as JSON or plain text)
+* ``cputrace reset`` â Reset all captured data
+
+Profiling counters are cumulative. `cputrace stop` pauses profiling without
+resetting values. `cputrace start` resumes accumulation. Use `cputrace reset`
+to clear all collected metrics.
+
+Example usage from the command line:
+
+.. code-block:: bash
+
+  # Start profiling on OSD.0
+  ceph tell osd.0 cputrace start
+
+  # Stop profiling
+  ceph tell osd.0 cputrace stop
+
+  # Dump results
+  ceph tell osd.0 cputrace dump
+
+  # Reset counters
+  ceph tell osd.0 cputrace reset
+
+These commands can be repeated multiple times: developers typically
+``start`` before a workload, ``stop`` afterwards, and then ``dump`` the results
+to analyze them.
+
+``cputrace dump`` supports optional arguments to filter by logger or counter,
+so only a subset of metrics can be reported when needed.
+
+``cputrace reset`` clears all data, preparing for a fresh round of profiling.
+
+API Reference
+-------------
+
+Enums
+~~~~~
+
+.. code-block:: cpp
+
+  enum cputrace_flags {
+      HW_PROFILE_SWI   = (1ULL << 0), // Context switches
+      HW_PROFILE_CYC   = (1ULL << 1), // CPU cycles
+      HW_PROFILE_CMISS = (1ULL << 2), // Cache misses
+      HW_PROFILE_BMISS = (1ULL << 3), // Branch mispredictions
+      HW_PROFILE_INS   = (1ULL << 4), // Instructions retired
+  };
+
+The bitwise ``|`` operator may be used to combine these flags.
+
+Data structures
+~~~~~~~~~~~~~~~
+
+``sample_t`` â holds a single hardware counter snapshot.
+
+.. code-block:: cpp
+
+  struct sample_t {
+    uint64_t swi;   //context switches
+    uint64_t cyc;   //clock cycles
+    uint64_t cmiss; //cache misses
+    uint64_t bmiss; //branch misses
+    uint64_t ins;   //instructions
+  };
+
+``measurement_t`` â accumulates multiple samples and computes totals/averages and other 
+useful metrics.
+
+.. code-block:: cpp
+
+  struct measurement_t {
+    uint64_t call_count = 0;
+    uint64_t sample_count = 0;
+    uint64_t sum_swi = 0, sum_cyc = 0, sum_cmiss = 0, sum_bmiss = 0, sum_ins = 0;
+    uint64_t non_zero_swi_count = 0;
+    uint64_t zero_swi_count = 0;
+  };
+
+
+``HW_ctx`` â encapsulates perf-event file descriptors for one measurement context.
+
+.. code-block:: cpp
+
+  extern HW_ctx HW_ctx_empty;
+
+Low-level API
+~~~~~~~~~~~~~
+
+- ``void HW_init(HW_ctx* ctx, cputrace_flags flags)`` â initialize perf counters.
+- ``void HW_read(HW_ctx* ctx, sample_t* out)`` â read current counter values.
+- ``void HW_clean(HW_ctx* ctx)`` â release perf counters.
-- 
2.39.5