From 4e582a581606fc7012e8de245fefea423a607ab4 Mon Sep 17 00:00:00 2001 From: Tomeu Vizoso Date: Wed, 11 Mar 2026 08:30:19 +0100 Subject: [PATCH] Temp docs on TIDL+MMA --- TIDL_MMA_REFERENCE.md | 718 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 718 insertions(+) create mode 100644 TIDL_MMA_REFERENCE.md diff --git a/TIDL_MMA_REFERENCE.md b/TIDL_MMA_REFERENCE.md new file mode 100644 index 00000000000..270da12d20e --- /dev/null +++ b/TIDL_MMA_REFERENCE.md @@ -0,0 +1,718 @@ +# How TIDL Uses the C7x MMA for Convolution + +## Overview + +TI's TIDL (TI Deep Learning) framework runs convolutions on the C7x DSP by +programming the MMA (Matrix Multiply Accelerator) hardware directly, bypassing +MMALib. This document describes the hardware, the TIDL data flow, and how it +differs from our current MMALib-based approach. The goal is to provide a +reference for implementing direct MMA programming in Thames. + +## 1. C7x MMA Hardware Architecture + +### 1.1 Variant + +The J722S contains a C7524 DSP core with an **MMA2-256F** coprocessor +(512-bit vector width, 64-byte vectors). 
+ +### 1.2 Matrix Storage Areas + +The MMA has three internal storage areas: + +| Storage | Total Size | Row Width | Rows (int8) | Purpose | +|---------|-----------|-----------|-------------|---------| +| **A** (activation) | 64 B | 64 B | 1 | Input feature vector (one row at a time) | +| **B** (weights) | 4096 B | 64 B | 64 | Weight matrix (double-buffered: 2 banks × 64 rows) | +| **C** (accumulator) | 256 B | 256 B | 1 | Accumulation results (64 × int32 = 256 B, double-buffered) | + +For int8 operation: +- A holds 64 elements per row (one input vector) +- B holds a 64×64 weight matrix (double-buffered, so B0 and B1 each hold 64×64) +- C holds 64 int32 accumulators (double-buffered, C0 and C1) +- One MMA operation computes: **C += A × B** (64 outputs from 64 inputs × 64×64 weights) + +### 1.3 Data Types + +The MMA supports configurable types for each storage area: + +**A matrix (input activations):** + +| Enum | Value | Description | +|------|-------|-------------| +| `ATYPE_UINT8` | 0 | Unsigned 8-bit | +| `ATYPE_INT8` | 4 | Signed 8-bit | +| `ATYPE_UINT16` | 1 | Unsigned 16-bit | +| `ATYPE_INT16` | 5 | Signed 16-bit | +| `ATYPE_UINT32` | 2 | Unsigned 32-bit | +| `ATYPE_INT32` | 6 | Signed 32-bit | +| `ATYPE_F32` | 3 | 32-bit float (MMA2-256F only) | + +**B matrix (weights):** `SIZE8`, `SIZE16`, `SIZE32` — element size only, +signedness is implicit (the multiply uses A's signedness for the product +interpretation). + +**C accumulator (output):** +- `C_ATYPE`: `SA` (signed) or `UA` (unsigned) — accumulator signedness +- `C_BTYPE`: configures element size/type for bias loading — `UINT8`, `INT8`, + `UINT16`, `INT16`, `UINT32`, `INT32`, `UINT64`, `INT64` +- `C_OPERATION0/1`: `MUL` (C = A×B), `MULNEGATE` (C = -A×B), `MULMINUS` + (C = C - A×B), `MULPLUS` (C = C + A×B) + +**X transfer (output conversion):** +- `X_XTYPE`: output element type — `UINT8`, `INT8`, `UINT16`, `INT16`, + `UINT32`, `INT32`, etc. 
+- `X_CTYPE`: accumulator element type read by X FSM — `UINT32`, `INT32`, + `UINT64`, `INT64`, `UINT128`, `INT128`, `F32` + +### 1.4 Post-Processing Pipeline (X Transfer) + +When transferring results out of the C accumulator, the MMA can apply a +hardware post-processing pipeline: + +1. **Scale + Shift** (optional, `X_SCALE_SHIFT_CTRL`): Per-row or per-column + unsigned/signed scale and shift. Loaded via `__HWA_LOAD_2REG` into + `HWA_SCALE0/1` and `HWA_SHIFT0/1` registers. + +2. **Right Shift** (`X_SHIFT`): Fixed right-shift amount (7 bits). When + `X_SCALE_SHIFT_CTRL` is enabled, this field selects per-row/per-col + mode instead. + +3. **Rounding** (`X_RE`): Half-LSB rounding after shift (adds 0.5 before + truncation). + +4. **Saturation** (`X_SAT`): Clamp to output type range. + +5. **Parameterized Saturation** (`X_PSAT`): Clamp to custom + `[SAT_MIN, SAT_MAX]` range (16-bit values, split across multiple + bitfields in the config register). Used for activation clipping. + +6. **ReLU** (`X_ReLU`): Rectified linear unit (clamp negatives to 0). + Applied after PSAT. + +7. **Type Conversion**: Convert from accumulator type (`X_CTYPE`) to output + type (`X_XTYPE`). + +The hardware pipeline order is: +**accumulate → (scale ×) → shift → round → saturate → PSAT → ReLU → output** + +### 1.5 Finite State Machines + +The MMA has three FSMs that sequence operations automatically: + +- **B FSM**: Controls writing to B matrix banks (double-buffering), row + offsets, bank switching periods +- **C FSM**: Controls accumulator read/write banks, offsets, operation + selection (MUL vs MULPLUS), bias loading destinations +- **X FSM**: Controls transfer from C to output, bank switching for reads, + offset sequencing + +The FSM periods (e.g., `B_BSWPER`, `C_CWSWPER`, `X_CSWPER`) are expressed +in units of MMA operations and control when banks switch, offsets reset, and +operations alternate between OP0 and OP1. 
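The X-pipeline ordering in Section 1.4 can be captured in a scalar software model, similar in spirit to TIDL's `MMA_MODELING` mode (Section 3.8). The sketch below is illustrative only — the function name is invented, per-row vs per-column scale selection is not modeled, and exact hardware corner cases should be validated against the real MMA:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative scalar model of the X-transfer post-processing order:
 * scale -> shift -> round -> PSAT -> ReLU -> type conversion.
 * The function name is invented; this is not TI code. */
static uint8_t x_pipeline_model(int32_t acc, uint8_t scale, uint8_t shift,
                                int32_t psat_min, int32_t psat_max, int relu)
{
    int64_t v = (int64_t)acc * scale;       /* per-channel scale */
    if (shift > 0)
        v = ((v >> (shift - 1)) + 1) >> 1;  /* shift with half-LSB rounding (X_RE) */
    if (v < psat_min) v = psat_min;         /* parameterized saturation (X_PSAT) */
    if (v > psat_max) v = psat_max;
    if (relu && v < 0) v = 0;               /* X_ReLU, applied after PSAT */
    return (uint8_t)v;                      /* convert to X_XTYPE (uint8 here) */
}
```

One call per accumulator row reproduces what the hardware applies to all 64 rows during a transfer.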
+ +### 1.6 Streaming Engines (SE) and Streaming Address Generators (SA) + +The C7x has hardware streaming engines for efficient memory access: + +- **SE0, SE1**: Read-only streaming engines — fetch data from memory in + configurable multi-dimensional patterns (up to 6D: `ICNT0`..`ICNT5`, + `DIM1`..`DIM5`). Opened with `__SE_OPEN` using a `__STRM_TEMPLATE` + configuration struct. + +- **SA0, SA1, SA2, SA3**: Streaming Address generators — produce address + sequences for MMA transfer masking. Opened with `__SA_OPEN`. Can + generate multi-dimensional address patterns used with `__HWAXFER_MASK` + to control which C accumulator rows are transferred. + +The SE templates configure: +- `DIMFMT`: Number of dimensions (up to 6D) +- `ELETYPE`: Element type (`__SE_ELETYPE_8BIT`, `__SE_ELETYPE_16BIT`, etc.) +- `VECLEN`: Vector length (`__SE_VECLEN_64BYTES` for 512-bit) +- `ICNT0..ICNT5`: Iteration counts per dimension +- `DIM1..DIM5`: Stride (in bytes) for each dimension +- `CBK0/CBK1`: Circular buffer configuration +- `DECDIM1/2`: Decrementing dimension support + +These are used to stream input activations through SE0 and weight rows +through SE1 (or to load them into the MMA A and B storage directly). + +## 2. MMA Intrinsics API + +The C7x compiler provides intrinsics that map to MMA hardware instructions: + +### 2.1 Configuration + +```c +void __HWAOPEN(__HWA_CONFIG_REG_v1 config, + __HWA_OFFSET_REG offsets, + __MMA_OPEN_FSM fsm_select); +``` +Opens and configures the MMA. `fsm_select` is an `__MMA_OPEN_FSM` enum +value (`__MMA_OPEN_FSM_RESET = 0` resets all FSMs). 
+ +### 2.2 Data Loading + +```c +void __HWALDA(__mma_vec src); // Load one row into A storage +void __HWALDB(__mma_vec src); // Load one row into B storage +void __HWALDC(__mma_vec src); // Load one row into C storage (bias) +void __HWALDAB(__mma_vec a, __mma_vec b); // Load A and B simultaneously +void __HWALDBC(__mma_vec b, __mma_vec c); // Load B and C simultaneously +``` + +For efficient pipelining, `__HWALDAB` loads both an input activation row +(into A) and a weight row (into B) in a single instruction. + +### 2.3 Scale/Shift Loading (MMA2+) + +```c +void __HWA_LOAD_2REG(__mma_vec src1, __mma_vec src2, __MMA_LOAD_2REG dest); +``` + +Loads per-channel scale and shift vectors into MMA registers: +- `__MMA_LOAD_2REG_SCALE_SHIFT_0`: src1 → HWA_SCALE0, src2 → HWA_SHIFT0 +- `__MMA_LOAD_2REG_SCALE_SHIFT_1`: src1 → HWA_SCALE1, src2 → HWA_SHIFT1 + +These are used for per-channel quantization: scale[ch] and shift[ch] values +are packed into vectors and loaded before the compute loop. + +### 2.4 Compute + +```c +void __HWAOP(__MMA_A_SOURCE_SELECT src); // Perform one MMA operation +void __HWAOPXFER(__MMA_A_SOURCE_SELECT src); // Operate + transfer in parallel +``` + +`__MMA_A_LDA` (= 0): Use A vector from most recent `__HWALDA`. +When using the A Register File (ARF), can specify `__MMA_A_ARF_ROW_SA0` +through `__MMA_A_ARF_ROW_SA3` (with optional `ADV` variants that advance +the SA). + +`__HWAOPXFER` performs an MMA operation and a C→output transfer +simultaneously, enabling pipelined execution where one accumulator bank +computes while the other bank's results are being transferred out. + +### 2.5 Transfer (Result Readout) + +```c +void __HWAXFER(__MMA_XFER_SRC src); // Load transfer buffer +__mma_vec __HWARCV(int32_t index); // Read from transfer buffer +``` + +`__HWAXFER(__MMA_XFER_SRC_C)`: Transfer C accumulator contents through the +post-processing pipeline into the transfer buffer. 
+ +With masking (MMA2+): +```c +__HWAXFERB(src, mask); // Transfer with byte-granularity masking +__HWAXFERH(src, mask); // Transfer with halfword-granularity masking +``` + +The mask comes from a Streaming Address generator (SA0–SA3): +- `__MMA_XFER_MASK_PSA0` through `PSA3`: Use SA for write masking +- `__MMA_XFER_MASK_PSA0ADV` through `PSA3ADV`: Use SA + advance + +### 2.6 Status and Cleanup + +```c +void __HWACLOSE(0); // Close MMA +void __HWARESET(); // Reopen with previous config (MMA2+) +void __HWAXFER_XSTATUS_DELAYED(); // Read min/max range statistics +``` + +## 3. TIDL's Convolution Architecture + +### 3.1 High-Level Flow + +TIDL processes convolutions in a tiled, block-based manner: + +``` +for each output-channel group (64 channels at a time): + for each spatial tile: + 1. DMA input tile into L2 scratch buffer + 2. Open MMA with config registers + 3. Load bias into C accumulator (via HWALDC) + 4. Load per-channel scale/shift (via __HWA_LOAD_2REG) + 5. for each input-channel group (64 at a time): + Load 64 weight rows into B (via HWALDB × 64) + for each spatial position in tile: + Load input vector into A (via HWALDA) + Trigger MMA operation (HWAOP or HWAOPXFER) + 6. Transfer C results out (HWAXFER/HWARCV) + 7. 
Store output tile via DMA +``` + +### 3.2 MMA Configuration for uint8 Convolution + +TIDL's config register struct (`configRegisterStruct_i8s_i8s_o8s` in +`tidl_conv2d_mma.h`) uses these critical settings: + +```c +// A config (input activations) +.A_ATYPE = A_CONFIG_ATYPE_UINT8 // *** UNSIGNED uint8 inputs *** + +// B config (weights) +.B_BTYPE = B_CONFIG_SIZE8 // 8-bit weight elements +.B_ORDER = B_CONFIG_ROW // Row-major weight layout +.B_BSWPER = 64 // B bank switch every 64 ops + +// C config (accumulator) +.C_ATYPE = C_CONFIG_ATYPE_SA // SIGNED accumulator (int32) +.C_BTYPE = C_CONFIG_BTYPE_UINT8 // Bias loading type +.C_OPERATION0 = C_CONFIG_MUL // First op: C = A × B +.C_OPERATION1 = C_CONFIG_MULPLUS // Subsequent ops: C += A × B +.C_HWLDDST = C_CONFIG_HWLDDST_X1 // Bias goes to C accumulator ×1 +.C_HWLDTYPE = C_CONFIG_HWLDTYPE_INT8 +.C_OP0PER = 64 // OP0 for first 64 ops (init) +.C_OP1PER = (K-1)*64 // OP1 for remaining ops (accumulate) + +// X config (output transfer) +.X_XTYPE = X_CONFIG_XTYPE_UINT8 // Output type: uint8 +.X_CTYPE = X_CONFIG_CTYPE_UINT32 // Accumulator type: uint32 (read as unsigned!) +.X_SHIFT = OUT_SHIFT // Right-shift amount +.X_ReLU = 0 // ReLU disabled in config (done via PSAT) +.X_SAT = 0 // Standard saturation disabled +.X_RE = 0 // Rounding disabled in this config +.X_CSTART = 1 // X reads from opposite bank to C writes +``` + +**The critical insight: `A_ATYPE = UINT8`.** TIDL keeps input activations +as unsigned uint8, eliminating the need to subtract the input zero point +before the MMA multiply. The zero point correction is absorbed into the bias +instead (see Section 4). + +### 3.3 Operation Sequencing + +The MMA uses a two-operation scheme with FSM-controlled alternation: + +1. **Operation 0 (MUL)**: `C = A × B` — Initializes the accumulator with + the first partial product. Runs for `C_OP0PER = 64` operations (one + full B matrix multiply). + +2. 
**Operation 1 (MULPLUS)**: `C = C + A × B` — Accumulates subsequent + partial products. Runs for `C_OP1PER = (K-1)*64` operations (remaining + K-1 input channel groups). + +After K groups of 64 input channels, the C accumulator contains the full +convolution result for 64 output channels, plus the pre-loaded bias. The +X FSM then transfers the result out with shift + saturation + type conversion. + +### 3.4 Double Buffering + +Both B and C storage are double-buffered: + +- **B double-buffering**: While the MMA computes using B bank 0, new weight + rows are loaded into B bank 1 via `__HWALDB`. `B_BSWPER` controls when + banks alternate. + +- **C double-buffering**: While results from C bank 0 are being transferred + out via `__HWAXFER`, new computations write to C bank 1. `C_CWSWPER` + and `X_CSWPER` must be configured so the X FSM reads from the bank that + the C FSM just finished writing. + +### 3.5 Streaming Engine Configuration + +TIDL configures the streaming engines for efficient memory access: + +- **SE0** (or SE1): Streams input activation vectors from L2 SRAM + - `ELETYPE = __SE_ELETYPE_8BIT` + - `VECLEN = __SE_VECLEN_64BYTES` + - `ICNT0 = 64` (one vector = 64 bytes) + - Higher dimensions iterate over spatial positions and channel groups + - May use circular buffering (`CBK0`) for L2 scratch memory + +- **SA0/SA1/SA2**: Generate address sequences for: + - Write-masking partial output tiles (when output channels aren't a + multiple of 64) + - Controlling which C accumulator rows to transfer + +### 3.6 DSP Kernel Functions (Pre-compiled) + +The actual DSP kernel code is in `tidl_priv_algo.lib` (pre-compiled, no +source available). 
The key functions are: + +| Function | Purpose | +|----------|---------| +| `TIDL_conv2dDspInitNew()` | Allocate scratch buffers, configure SE/SA templates, set up MMA config | +| `TIDL_conv2dDspProcess()` | Execute the tiled convolution loop on the DSP | +| `hwaInit()` | Program MMA config register via `__HWAOPEN` | +| `blockConvS08_ci()` | Inner convolution block: B-panel fill + MMA compute loop | +| `calcMMAConv()` | Outer loop orchestrating tiles, DMA, and MMA blocks | +| `prefillBpanel_ci()` | Pre-load first B panel for double-buffering startup | + +### 3.7 L2 Memory Layout + +TIDL allocates a fixed L2 scratch buffer: + +```c +#define L2_MEM_SIZE (256*1024) // Total: 256 KB +#define INFEAT_L2_MEM_SIZE (128*1024) // Input features: 128 KB (power of 2 for SE circ buf) +#define LEFT_L2_MEM_SIZE (L2_MEM_SIZE - INFEAT_L2_MEM_SIZE) // Weights + bias: 128 KB +``` + +Input features are stored in a circular buffer in L2 SRAM, accessed via the +streaming engine with circular buffer addressing. This allows the SE to wrap +around when reading sliding-window convolution inputs. + +### 3.8 MMA Modeling Mode + +`tidl_conv2d_mma.h` includes an `#ifdef MMA_MODELING` path that implements +the MMA computation in software using C arrays: + +```c +int32_t cPanel[2][64*64]; // C accumulator (2 banks × 64×64 int32) +char64 bPanel[2][64]; // B weight matrix (2 banks × 64 rows) +char64 bPanelT[64]; // Transposed B panel +``` + +Functions like `mmaOP()`, `prefillBpanel()`, `transposeBPanel()`, and +`updateState()` simulate the hardware FSM behavior. This provides a +bit-accurate software model of the MMA for validation. + +## 4. Quantization: TIDL vs Thames/MMALib + +### 4.1 The Core Difference + +Both MMALib and TIDL map **input activations → MMA A** (with `A_ATYPE = +UINT8`) and **weights → MMA B** (with `B_BTYPE = SIZE8`). The MMA hardware +A/B mapping is the same in both cases. 
+ +**MMALib** (our current approach): +- MMALib's `convolveBias_row` API requires weights as **INT8** (signed, + zero point = 0) — this is an API constraint, not an MMA hardware + constraint +- For TFLite models using the **older uint8 quantization** (weights stored + as uint8 with non-zero zero point, e.g., w_zp=133), we must re-quantize: + `w_int8 = clamp(w_uint8 - w_zp, -128, 127)` +- When `|w_uint8 - w_zp| > 127`, the weight overflows and must be rescaled + by 0.5×, introducing rounding error +- The rescaling factor is compensated by doubling the per-channel scale, + but rounding errors from halving the weights accumulate +- For TFLite **full-integer quantization** (weights as int8, zp=0), this + is not an issue — weights are already in the required format + +**TIDL** (direct MMA): +- TIDL configures `C_BTYPE = C_CONFIG_BTYPE_UINT8` — the B matrix (weights) + is interpreted as unsigned uint8 in the C FSM / bias table context +- The A_ATYPE = UINT8 allows raw unsigned input activations without + subtracting the input zero point +- All zero point corrections (both input and output) are absorbed into + the derivedBias term during model import +- For TFLite full-integer quantization models (int8 symmetric weights), + the weights are used as-is +- The MMA hardware handles the mixed-signedness product correctly: + unsigned_A × signed/unsigned_B → signed int32 accumulator +- No weight re-quantization, no rescaling, no rounding error + +### 4.2 TIDL's Quantization Math + +For TFLite asymmetric quantization (`TIDL_QuantStyleAsymNP2_TFL`): + +**Scale and Shift (per output channel):** + +$$\text{scaleRatio}[c] = \frac{S_x \cdot S_w[c]}{S_y}$$ + +where $S_x$, $S_w[c]$, $S_y$ are the input, per-channel weight, and output +scales from TFLite. 
+ +`TIDL_getMMAv2_ScaleAndShift()` converts this floating-point ratio into a +uint8 scale and uint8 shift: + +```c +void TIDL_getMMAv2_ScaleAndShift(float scaleRatio, uint8_t *scale, uint8_t *shift) +{ + int32_t shiftBits = 0; + float newScaleRatio = scaleRatio; + while (1) { + newScaleRatio *= 2; + if (shiftBits >= 40) break; + if (newScaleRatio > 255.0) { newScaleRatio /= 2; break; } + shiftBits++; + } + *shift = shiftBits; + *scale = (uint8_t)(newScaleRatio + 0.5); +} +``` + +This is the same "doubling" algorithm we use in Thames (`compute_quant()`). +The result approximates: $\text{scaleRatio} \approx \text{scale} / 2^{\text{shift}}$ + +**Bias (per output channel) — direct TFLite path:** + +$$\text{derivedBias}[c] = \text{bias}_{\text{TFLite}}[c] + z_y \cdot \frac{S_y}{S_x \cdot S_w[c]} - z_x \cdot \sum_i W_q[c][i]$$ + +where: +- $\text{bias}_{\text{TFLite}}[c]$ is the int32 bias from the TFLite model +- $z_y$ is the output zero point +- $z_x$ is the input zero point +- $W_q[c][i]$ are the quantized int8 weights for output channel $c$ +- $\frac{S_y}{S_x \cdot S_w[c]}$ is `nScale` = $1 / \text{scaleRatio}[c]$ + +The $-z_x \cdot \sum W$ term pre-computes the input zero point correction, +so the MMA can operate on raw unsigned activation values without subtracting +$z_x$ first. The $z_y \cdot \text{nScale}$ term accounts for the output zero +point. 
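A scalar sketch of this bias derivation (the helper name is invented; TIDL computes the equivalent with full float precision at import time, and the rounding of the $z_y$ term below assumes a non-negative output zero point, which holds for uint8 outputs):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar sketch of the derivedBias computation (hypothetical helper;
 * TIDL does the equivalent offline during model import). */
static int32_t derived_bias(int32_t bias_tflite, int32_t zy, int32_t zx,
                            float sx, float sw, float sy,
                            const int8_t *w, int n)
{
    int32_t wsum = 0;
    for (int i = 0; i < n; i++)
        wsum += w[i];                  /* sum of quantized weights W_q[c][i] */

    float nscale = sy / (sx * sw);     /* nScale = 1 / scaleRatio[c] */

    /* +zy*nScale folds the output zero point; -zx*wsum folds the input
     * zero point so the MMA can consume raw uint8 activations.
     * zy >= 0 for uint8 outputs, so round-half-up via +0.5 is safe. */
    return bias_tflite + (int32_t)(zy * nscale + 0.5f) - zx * wsum;
}
```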
+ +**Post-accumulation (hardware):** + +The MMA hardware computes: + +$$\text{acc}[c] = \text{derivedBias}[c] + \sum_i A_{\text{uint8}}[i] \cdot W_{\text{int8}}[c][i]$$ + +Then the X transfer pipeline applies: + +$$\text{out}[c] = \text{clamp}\left(\left\lfloor \frac{\text{acc}[c] \cdot \text{scale}[c]}{2^{\text{shift}[c]}} + 0.5 \right\rfloor, \text{minPSAT}, \text{maxPSAT}\right)$$ + +**PSAT bounds:** For uint8 output with activation clipping: + +```c +minPSAT = round(clipMin / outScale) + outZeroPoint; // typically 0 +maxPSAT = round(clipMax / outScale) + outZeroPoint; // typically 255 +``` + +### 4.3 Weight Treatment + +TFLite full-integer quantized models store weights as **int8** (symmetric, +zero point = 0). TIDL keeps them as-is — no re-quantization. The weights +are loaded into the B matrix storage, and the MMA treats them as 8-bit +values (signedness is determined by how the multiply interprets the product, +controlled by `A_ATYPE` and `C_ATYPE`). + +Since `A_ATYPE = UINT8` (unsigned inputs) and `C_ATYPE = SA` (signed +accumulator), the multiply computes a **mixed-signedness** product: unsigned +activation × signed/unsigned weight → signed 32-bit accumulator. This +matches the TFLite quantization semantics exactly. 
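Putting Sections 4.2–4.3 together, the arithmetic for one output channel can be modeled in plain C. This is an illustrative scalar sketch — the helper name is invented, and the MMA computes 64 such channels per operation:

```c
#include <stdint.h>
#include <assert.h>

/* One output channel of the quantized convolution, modeled in scalar C:
 * mixed-signedness accumulation (uint8 activations x int8 weights) on top
 * of the pre-loaded derivedBias, then the X-pipeline scale/shift/round/PSAT.
 * Helper name is invented; not TI code. */
static uint8_t conv_channel_model(const uint8_t *a, const int8_t *w, int n,
                                  int32_t derived_bias,
                                  uint8_t scale, uint8_t shift,
                                  int32_t psat_min, int32_t psat_max)
{
    int64_t acc = derived_bias;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * w[i];       /* unsigned x signed -> signed */

    int64_t v = acc * scale;               /* per-channel scale */
    if (shift > 0)
        v = ((v >> (shift - 1)) + 1) >> 1; /* round-half-up shift */
    if (v < psat_min) v = psat_min;        /* PSAT clamp = activation clipping */
    if (v > psat_max) v = psat_max;
    return (uint8_t)v;
}
```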
+ +### 4.4 Comparison Table + +| Aspect | MMALib (Thames current) | TIDL (direct MMA) | +|--------|------------------------|-------------------| +| Input type (MMA A) | UINT8 | UINT8 | +| Weight type (MMA B) | INT8 (API constraint) | INT8 or UINT8 (flexible) | +| A_ATYPE | UINT8 | UINT8 | +| C_BTYPE | N/A (MMALib internal) | UINT8 | +| Weight re-quantization | Required for uint8 models | Not needed | +| Weight rescaling | 0.5× for overflows, compensated in scale | Not needed | +| Input zero point | Subtracted from bias term | Subtracted from bias term | +| Output zero point | Added to bias term | Added to bias term | +| Scale/shift | Per-channel uint8/uint8 | Per-channel uint8/uint8 | +| Post-processing | MMALib software (scaleShiftRoundAndReLU) | MMA hardware pipeline | +| Rounding error source | Weight halving + int8 clamping | None from weights | +| Per-channel scale load | MMALib internal | `__HWA_LOAD_2REG` intrinsic | +| Activation | Software ReLU in MMALib | HW PSAT + ReLU in X pipeline | + +### 4.5 Why TIDL Has No Accuracy Problem + +With direct MMA programming, TIDL avoids the accuracy issue we hit on +Model 2 (input_zp=128, weight_zp=133): + +1. TIDL's model import tool handles weight re-quantization offline (not at + runtime), and for TFLite full-integer quantization, weights are already + int8 symmetric (zp=0) → no re-quantization needed +2. If the original model has uint8 weights, TIDL's import can dequantize + to float and re-quantize with full precision during model compilation +3. The full zero point correction (for both input and output) is absorbed + into the derivedBias at import time → exact computation +4. 
The hardware pipeline handles scale/shift/round/saturate atomically + +For Model 2, where MMALib requires runtime rescaling of 19/32 channels +(causing cumulative rounding errors summing to diff=56), TIDL's offline +import would compute derivedBias with full float precision, and the MMA +hardware would produce results matching the TFLite CPU reference (within +the hardware rounding tolerance of shift+round). + +**Note on the MMA hardware mapping**: Both MMALib and TIDL use the same +hardware mapping: **A = input activations** (streamed through SE0 via +`HWALDAB`), **B = weight matrix** (loaded from SE1 via `HWALDAB`). The +`A_ATYPE = UINT8` setting allows raw unsigned input activations. The +difference is that MMALib's API forces weights to be INT8 symmetric at +the API level, while direct MMA programming gives full control over the +`C_BTYPE` field and weight format. + +## 5. What Source Code Is Available + +### 5.1 Available (in `c7x-mma-tidl/ti_dl/algo/`) + +| File | Content | +|------|---------| +| `inc/tidl_conv2d_mma.h` | MMA config register struct, offset register, SE/SA template declarations, MMA modeling code | +| `inc/tidl_conv2d_mma_i.h` | API declarations for DSP init/process functions | +| `src/tidl_conv2d_base.c` | Reference conv2d implementation, bias/scale/shift setup, PSAT computation, dispatch logic | +| `src/tidl_alg_utils.c` | `TIDL_getMMAv2_ScaleAndShift()` — the scale/shift quantization algorithm | +| `inc/tidl_alg_int.h` | `TIDL_roundSatMMA()` — round+shift+saturate reference implementation | + +### 5.2 Available (TI compiler headers) + +| File | Content | +|------|---------| +| `C7524-MMA2_256F/c7x_mma.h` | Complete MMA hardware definition: all enum types, `__HWA_CONFIG_REG_v1` struct, intrinsic declarations, matrix dimension macros | + +### 5.3 NOT Available (pre-compiled in `tidl_priv_algo.lib`) + +| Function | What it does | +|----------|-------------| +| `TIDL_conv2dDspInitNew()` | Full initialization: allocate buffers, configure SE/SA 
templates, set MMA config register fields dynamically based on layer parameters | +| `TIDL_conv2dDspProcess()` | Full execution: tiled loop with DMA, MMA compute, output writeback | +| `hwaInit()` | Program `__HWAOPEN` with runtime-computed config | +| `blockConvS08_ci()` | Inner block: prefill B panels, run MMA ops, transfer results | +| `calcMMAConv()` | Outer tile loop coordinating DMA and MMA | +| `prefillBpanel_ci()` | First B panel load for double-buffer pipeline startup | + +## 6. Implementation Roadmap for Thames + +### 6.1 Phase 1: Basic MMA Convolution (No Post-Processing) + +1. **Emit `__HWAOPEN`**: Encode the config register struct as an immediate + in the kernel binary. Use the TIDL config as a template but compute + FSM periods from the actual layer dimensions: + - `B_BSWPER = MMA_SIZE` (64 for int8) + - `C_OP0PER = MMA_SIZE` + - `C_OP1PER = (K-1) * MMA_SIZE` where K = numInChannels / 64 + - `C_CWSWPER = K * MMA_SIZE` + - etc. + +2. **Weight loading**: Use SE1 to stream weight rows, emit `__HWALDB` ×64 + for each input channel group. + +3. **Input loading**: Use SE0 to stream input activation vectors, emit + `__HWALDA` for each spatial position. + +4. **Compute**: `__HWAOP` for first A×B, `__HWAOPXFER` for pipelined + compute+transfer. + +5. **Result readout**: `__HWARCV` to get output vectors, `VST32B` to store. + +### 6.2 Phase 2: Add Quantization + +1. **Bias loading**: Pre-load derivedBias into C via `__HWALDC`. Bias + layout must match the MMA's C storage format. + +2. **Scale/shift loading**: `__HWA_LOAD_2REG(scale_vec, shift_vec, + __MMA_LOAD_2REG_SCALE_SHIFT_0)` before the compute loop. + +3. **Enable PSAT**: Set `X_PSAT = 1`, encode `SAT_MIN`/`SAT_MAX` in the + config register (16-bit values split across non-contiguous bitfields). + +4. **Enable rounding**: `X_RE = 1` for half-LSB rounding. + +5. **Enable per-channel shift**: `X_SCALE_SHIFT_CTRL = 1`, + `X_SHIFT = __MMA_X_CONFIG_SHIFT_ROW_UNSIGNED` (per-row unsigned shift). 
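The period arithmetic from Phase 1, step 1 can be collected into a small helper. This is a sketch under this section's assumptions — the struct and function are invented for illustration, and `num_in_channels` is assumed to be a multiple of 64:

```c
#include <stdint.h>
#include <assert.h>

#define MMA_SIZE 64  /* rows per B panel for int8 */

/* Hypothetical holder for the FSM periods derived from the layer shape. */
struct mma_periods {
    uint32_t b_bswper;   /* B bank switch period */
    uint32_t c_op0per;   /* ops using OP0 (MUL: C = A*B) */
    uint32_t c_op1per;   /* ops using OP1 (MULPLUS: C += A*B) */
    uint32_t c_cwswper;  /* C write-bank switch period */
};

/* num_in_channels must be a multiple of 64 here; a real emitter must
 * handle remainders with SA-based masking (Section 1.6). */
static struct mma_periods compute_periods(uint32_t num_in_channels)
{
    uint32_t k = num_in_channels / MMA_SIZE;  /* input-channel groups */
    struct mma_periods p;
    p.b_bswper  = MMA_SIZE;           /* switch B banks after each 64-row panel */
    p.c_op0per  = MMA_SIZE;           /* first group initializes C (MUL) */
    p.c_op1per  = (k - 1) * MMA_SIZE; /* remaining groups accumulate (MULPLUS) */
    p.c_cwswper = k * MMA_SIZE;       /* switch C banks once a result is done */
    return p;
}
```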
+ +### 6.3 Phase 3: Tiling and DMA + +1. Implement block scheduling (`thames_sched_operation()`) to find optimal + tile sizes that fit in L2 SRAM. + +2. Add DMA support for copying input tiles to L2 and output tiles to DDR. + +3. Implement double-buffering for both B panels and input tiles. + +### 6.4 Key Assembly Instructions + +The C7x assembler instructions corresponding to the MMA intrinsics: + +| Intrinsic | Assembly | Unit | +|-----------|----------|------| +| `__HWAOPEN` | `HWAOPEN .C2 VBL, VBL, imm` | C2 | +| `__HWALDA` | `HWALDAB .L2 VB, VB` (A-only form) | L2 | +| `__HWALDB` | `HWALDB .L2 VB` | L2 | +| `__HWALDC` | via `HWALDAB/HWALDBC` | L2 | +| `__HWALDAB` | `HWALDAB .L2 VB, VB` | L2 | +| `__HWAOP` | `HWAOPXFER .S1 imm` (op-only form) | S1 | +| `__HWAOPXFER` | `HWAOPXFER .S1 imm` | S1 | +| `__HWAXFER` | `HWAXFER .S2 imm` | S2 | +| `__HWARCV` | `HWARCVS .S2 VB` | S2 | +| `__HWACLOSE` | via control instruction | C2 | +| `__HWA_LOAD_2REG` | via HWAXFER variants | S2 | + +Note: In the reference MMALib disassembly, `HWAOPXFER` on S1 and `HWALDAB` +on L2 always execute **in parallel** (same execute packet). `HWARCVS` on S2 +follows with appropriate pipeline delay. + +### 6.5 Required Changes to Thames + +1. **`thames_coefs.c`**: Remove weight re-quantization (int8→int8 rescaling). + Compute derivedBias using the TIDL formula with zero point absorption. + Weights remain as TFLite int8 (symmetric, zp=0). + +2. **`thames_compiler.c`**: New MMA convolution emitter using `asm_HWAOPEN`, + `asm_HWALDAB`, `asm_HWAOPXFER`, `asm_HWARCVS` from the C7x assembler. + +3. **`thames_ml.c`**: Input data stays uint8 — no need to convert to int8. + Zero point folded into bias. + +4. **New**: MMA config register encoding helper — compute `__HWA_CONFIG_REG` + fields from layer parameters and encode as 512-bit immediate for HWAOPEN. + +## 7. 
Reference: TIDL `TIDL_roundSatMMA()` + +This function models the MMA hardware's post-processing: + +```c +static int64_t TIDL_roundSatMMA(int64_t val, int32_t bits, int32_t min, int32_t max) +{ + int64_t temp; + if (bits > 0) { + temp = (val >> (bits - 1)) + 1; // add rounding bias + val = temp >> 1; // complete the shift + } + val = val < min ? min : val; + val = val > max ? max : val; + return val; +} +``` + +This is: `round(val / 2^bits)` clamped to `[min, max]`. The rounding is +"round half up" (add 0.5 × 2^bits before truncation). + +## 8. Reference: MMA Config Register Layout + +The `__HWA_CONFIG_REG_v1` struct is 512 bits (64 bytes = one vector). +Key sections and their bit positions (little-endian): + +### A Config (bits 0–31 of first 64-bit word) +- `A_ATYPE` [3:0] — 4 bits: element type (UINT8=0, INT8=4, etc.) +- `A_ALUTEN` [8] — 1 bit: lookup table enable +- `A_ARF_CTRL` [16] — 1 bit: A register file enable +- `A_ARF_BASE` [22:17] — 6 bits: circular buffer base +- `A_ARF_SIZE` [30:24] — 7 bits: ARF array size + +### B Config (bits 32–63 of first word + bits 0–63 of second word) +- `B_BSWPER` [63:32] — 32 bits: B bank switch period +- `B_BRSTPER` [7:0] — 8 bits: B offset reset period +- `B_BTYPE` [9:8] — 2 bits: element size (SIZE8=0) +- `B_ORDER` [16] — 1 bit: row/column major +- `B_BSTART` [24] — 1 bit: initial B bank +- `B_BOFFSET` [39:32] — 8 bits: global offset + +### C Config (third + fourth + fifth 64-bit words) +- `C_ATYPE` [0] — 1 bit: signed/unsigned accumulator +- `C_BTYPE` [11:8] — 4 bits: bias element type +- `C_OPERATION0` [17:16] — 2 bits: first operation (MUL/MULPLUS/etc.) 
+- `C_OPERATION1` [25:24] — 2 bits: second operation +- `C_HWLDDST` [34:32] — 3 bits: HWALD destination +- `C_HWLDTYPE` [43:40] — 4 bits: HWALD element type +- `C_OPSTART` [48] — 1 bit: initial operation +- Various periods: `C_OP0PER`, `C_OP1PER`, `C_BSWPER`, `C_CRSWPER`, `C_CWSWPER` (32 bits each) +- Various offsets: `C_CROFFSET`, `C_CWOFFSET`, `C_CLOFFSET` (6-7 bits each) + +### X Config (sixth + seventh 64-bit words) +- `X_ReLU` [0] — 1 bit: ReLU enable +- `X_PSAT` [1] — 1 bit: parameterized saturation enable +- `X_SAT_MIN` — 16 bits split across three fields (5:0, 12:6, 15:13) +- `X_SAT` [8+16] — 1 bit: standard saturation enable +- `X_RE` [16+16] — 1 bit: rounding enable +- `X_SHIFT` [38:32] — 7 bits: right shift amount (or mode selector) +- `X_SCALE_SHIFT_CTRL` [30] — 1 bit: per-row/col scale/shift enable +- `X_XTYPE` [43:40] — 4 bits: output element type +- `X_SAT_MAX` — 16 bits split across three fields +- `X_CTYPE` [50:48] — 3 bits: accumulator element type +- `X_CSWPER` — 32 bits: C read bank switch period +- `X_COFFSET` — 8 bits: C read offset +- `X_CSTART` — 1 bit: initial C bank for X reads + +### Padding / Misc +- `PARITYCTRL` [63:62] — 2 bits: parity control + +The config register is loaded as a single 512-bit vector via the `HWAOPEN` +instruction.
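Building the 512-bit image boils down to OR-ing each field into its bit position. A generic field-insertion helper is sketched below (illustrative only; the actual Thames encoder must use the verified bit positions above, including the split `SAT_MIN`/`SAT_MAX` sub-fields):

```c
#include <stdint.h>
#include <assert.h>

/* Insert `val` (low `bits` bits) at bit `pos` of one 64-bit config word.
 * Generic building block for assembling the register image. */
static uint64_t set_field(uint64_t word, unsigned pos, unsigned bits, uint64_t val)
{
    uint64_t mask = ((bits < 64) ? ((1ULL << bits) - 1ULL) : ~0ULL) << pos;
    return (word & ~mask) | ((val << pos) & mask);
}
```

Fields scattered across non-contiguous sub-ranges (such as `X_SAT_MIN`) would take several `set_field` calls, one per sub-field, before the eight assembled words are combined into the 512-bit vector for `HWAOPEN`.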