From 4e582a581606fc7012e8de245fefea423a607ab4 Mon Sep 17 00:00:00 2001 From: Tomeu Vizoso Date: Wed, 11 Mar 2026 08:30:19 +0100 Subject: [PATCH] Temp docs on TIDL+MMA --- TIDL_MMA_REFERENCE.md | 718 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 718 insertions(+) create mode 100644 TIDL_MMA_REFERENCE.md diff --git a/TIDL_MMA_REFERENCE.md b/TIDL_MMA_REFERENCE.md new file mode 100644 index 00000000000..270da12d20e --- /dev/null +++ b/TIDL_MMA_REFERENCE.md @@ -0,0 +1,718 @@ +# How TIDL Uses the C7x MMA for Convolution + +## Overview + +TI's TIDL (TI Deep Learning) framework runs convolutions on the C7x DSP by +programming the MMA (Matrix Multiply Accelerator) hardware directly, bypassing +MMALib. This document describes the hardware, the TIDL data flow, and how it +differs from our current MMALib-based approach. The goal is to provide a +reference for implementing direct MMA programming in Thames. + +## 1. C7x MMA Hardware Architecture + +### 1.1 Variant + +The J722S contains a C7524 DSP core with an **MMA2-256F** coprocessor +(512-bit vector width, 64-byte vectors). 
+ +### 1.2 Matrix Storage Areas + +The MMA has three internal storage areas: + +| Storage | Total Size | Row Width | Rows (int8) | Purpose | +|---------|-----------|-----------|-------------|---------| +| **A** (activation) | 64 B | 64 B | 1 | Input feature vector (one row at a time) | +| **B** (weights) | 4096 B | 64 B | 64 | Weight matrix (double-buffered: 2 banks × 64 rows) | +| **C** (accumulator) | 256 B | 256 B | 1 | Accumulation results (64 × int32 = 256 B, double-buffered) | + +For int8 operation: +- A holds 64 elements per row (one input vector) +- B holds a 64×64 weight matrix (double-buffered, so B0 and B1 each hold 64×64) +- C holds 64 int32 accumulators (double-buffered, C0 and C1) +- One MMA operation computes: **C += A × B** (64 outputs from 64 inputs × 64×64 weights) + +### 1.3 Data Types + +The MMA supports configurable types for each storage area: + +**A matrix (input activations):** + +| Enum | Value | Description | +|------|-------|-------------| +| `ATYPE_UINT8` | 0 | Unsigned 8-bit | +| `ATYPE_INT8` | 4 | Signed 8-bit | +| `ATYPE_UINT16` | 1 | Unsigned 16-bit | +| `ATYPE_INT16` | 5 | Signed 16-bit | +| `ATYPE_UINT32` | 2 | Unsigned 32-bit | +| `ATYPE_INT32` | 6 | Signed 32-bit | +| `ATYPE_F32` | 3 | 32-bit float (MMA2-256F only) | + +**B matrix (weights):** `SIZE8`, `SIZE16`, `SIZE32` — element size only, +signedness is implicit (the multiply uses A's signedness for the product +interpretation). + +**C accumulator (output):** +- `C_ATYPE`: `SA` (signed) or `UA` (unsigned) — accumulator signedness +- `C_BTYPE`: configures element size/type for bias loading — `UINT8`, `INT8`, + `UINT16`, `INT16`, `UINT32`, `INT32`, `UINT64`, `INT64` +- `C_OPERATION0/1`: `MUL` (C = A×B), `MULNEGATE` (C = -A×B), `MULMINUS` + (C = C - A×B), `MULPLUS` (C = C + A×B) + +**X transfer (output conversion):** +- `X_XTYPE`: output element type — `UINT8`, `INT8`, `UINT16`, `INT16`, + `UINT32`, `INT32`, etc. 
+- `X_CTYPE`: accumulator element type read by X FSM — `UINT32`, `INT32`, + `UINT64`, `INT64`, `UINT128`, `INT128`, `F32` + +### 1.4 Post-Processing Pipeline (X Transfer) + +When transferring results out of the C accumulator, the MMA can apply a +hardware post-processing pipeline: + +1. **Scale + Shift** (optional, `X_SCALE_SHIFT_CTRL`): Per-row or per-column + unsigned/signed scale and shift. Loaded via `__HWA_LOAD_2REG` into + `HWA_SCALE0/1` and `HWA_SHIFT0/1` registers. + +2. **Right Shift** (`X_SHIFT`): Fixed right-shift amount (7 bits). When + `X_SCALE_SHIFT_CTRL` is enabled, this field selects per-row/per-col + mode instead. + +3. **Rounding** (`X_RE`): Half-LSB rounding after shift (adds 0.5 before + truncation). + +4. **Saturation** (`X_SAT`): Clamp to output type range. + +5. **Parameterized Saturation** (`X_PSAT`): Clamp to custom + `[SAT_MIN, SAT_MAX]` range (16-bit values, split across multiple + bitfields in the config register). Used for activation clipping. + +6. **ReLU** (`X_ReLU`): Rectified linear unit (clamp negatives to 0). + Applied after PSAT. + +7. **Type Conversion**: Convert from accumulator type (`X_CTYPE`) to output + type (`X_XTYPE`). + +The hardware pipeline order is: +**accumulate → (scale ×) → shift → round → saturate → PSAT → ReLU → output** + +### 1.5 Finite State Machines + +The MMA has three FSMs that sequence operations automatically: + +- **B FSM**: Controls writing to B matrix banks (double-buffering), row + offsets, bank switching periods +- **C FSM**: Controls accumulator read/write banks, offsets, operation + selection (MUL vs MULPLUS), bias loading destinations +- **X FSM**: Controls transfer from C to output, bank switching for reads, + offset sequencing + +The FSM periods (e.g., `B_BSWPER`, `C_CWSWPER`, `X_CSWPER`) are expressed +in units of MMA operations and control when banks switch, offsets reset, and +operations alternate between OP0 and OP1. 
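The X-pipeline ordering in Section 1.4 can be captured in a scalar software model, similar in spirit to TIDL's `MMA_MODELING` mode (Section 3.8). The sketch below is illustrative only — the function name is invented, per-row vs per-column scale selection is not modeled, and exact hardware corner cases should be validated against the real MMA:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative scalar model of the X-transfer post-processing order:
 * scale -> shift -> round -> PSAT -> ReLU -> type conversion.
 * The function name is invented; this is not TI code. */
static uint8_t x_pipeline_model(int32_t acc, uint8_t scale, uint8_t shift,
                                int32_t psat_min, int32_t psat_max, int relu)
{
    int64_t v = (int64_t)acc * scale;       /* per-channel scale */
    if (shift > 0)
        v = ((v >> (shift - 1)) + 1) >> 1;  /* shift with half-LSB rounding (X_RE) */
    if (v < psat_min) v = psat_min;         /* parameterized saturation (X_PSAT) */
    if (v > psat_max) v = psat_max;
    if (relu && v < 0) v = 0;               /* X_ReLU, applied after PSAT */
    return (uint8_t)v;                      /* convert to X_XTYPE (uint8 here) */
}
```

One call per accumulator row reproduces what the hardware applies to all 64 rows during a transfer.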
+ +### 1.6 Streaming Engines (SE) and Streaming Address Generators (SA) + +The C7x has hardware streaming engines for efficient memory access: + +- **SE0, SE1**: Read-only streaming engines — fetch data from memory in + configurable multi-dimensional patterns (up to 6D: `ICNT0`..`ICNT5`, + `DIM1`..`DIM5`). Opened with `__SE_OPEN` using a `__STRM_TEMPLATE` + configuration struct. + +- **SA0, SA1, SA2, SA3**: Streaming Address generators — produce address + sequences for MMA transfer masking. Opened with `__SA_OPEN`. Can + generate multi-dimensional address patterns used with `__HWAXFER_MASK` + to control which C accumulator rows are transferred. + +The SE templates configure: +- `DIMFMT`: Number of dimensions (up to 6D) +- `ELETYPE`: Element type (`__SE_ELETYPE_8BIT`, `__SE_ELETYPE_16BIT`, etc.) +- `VECLEN`: Vector length (`__SE_VECLEN_64BYTES` for 512-bit) +- `ICNT0..ICNT5`: Iteration counts per dimension +- `DIM1..DIM5`: Stride (in bytes) for each dimension +- `CBK0/CBK1`: Circular buffer configuration +- `DECDIM1/2`: Decrementing dimension support + +These are used to stream input activations through SE0 and weight rows +through SE1 (or to load them into the MMA A and B storage directly). + +## 2. MMA Intrinsics API + +The C7x compiler provides intrinsics that map to MMA hardware instructions: + +### 2.1 Configuration + +```c +void __HWAOPEN(__HWA_CONFIG_REG_v1 config, + __HWA_OFFSET_REG offsets, + __MMA_OPEN_FSM fsm_select); +``` +Opens and configures the MMA. `fsm_select` is an `__MMA_OPEN_FSM` enum +value (`__MMA_OPEN_FSM_RESET = 0` resets all FSMs). 
+ +### 2.2 Data Loading + +```c +void __HWALDA(__mma_vec src); // Load one row into A storage +void __HWALDB(__mma_vec src); // Load one row into B storage +void __HWALDC(__mma_vec src); // Load one row into C storage (bias) +void __HWALDAB(__mma_vec a, __mma_vec b); // Load A and B simultaneously +void __HWALDBC(__mma_vec b, __mma_vec c); // Load B and C simultaneously +``` + +For efficient pipelining, `__HWALDAB` loads both an input activation row +(into A) and a weight row (into B) in a single instruction. + +### 2.3 Scale/Shift Loading (MMA2+) + +```c +void __HWA_LOAD_2REG(__mma_vec src1, __mma_vec src2, __MMA_LOAD_2REG dest); +``` + +Loads per-channel scale and shift vectors into MMA registers: +- `__MMA_LOAD_2REG_SCALE_SHIFT_0`: src1 → HWA_SCALE0, src2 → HWA_SHIFT0 +- `__MMA_LOAD_2REG_SCALE_SHIFT_1`: src1 → HWA_SCALE1, src2 → HWA_SHIFT1 + +These are used for per-channel quantization: scale[ch] and shift[ch] values +are packed into vectors and loaded before the compute loop. + +### 2.4 Compute + +```c +void __HWAOP(__MMA_A_SOURCE_SELECT src); // Perform one MMA operation +void __HWAOPXFER(__MMA_A_SOURCE_SELECT src); // Operate + transfer in parallel +``` + +`__MMA_A_LDA` (= 0): Use A vector from most recent `__HWALDA`. +When using the A Register File (ARF), can specify `__MMA_A_ARF_ROW_SA0` +through `__MMA_A_ARF_ROW_SA3` (with optional `ADV` variants that advance +the SA). + +`__HWAOPXFER` performs an MMA operation and a C→output transfer +simultaneously, enabling pipelined execution where one accumulator bank +computes while the other bank's results are being transferred out. + +### 2.5 Transfer (Result Readout) + +```c +void __HWAXFER(__MMA_XFER_SRC src); // Load transfer buffer +__mma_vec __HWARCV(int32_t index); // Read from transfer buffer +``` + +`__HWAXFER(__MMA_XFER_SRC_C)`: Transfer C accumulator contents through the +post-processing pipeline into the transfer buffer. 
+ +With masking (MMA2+): +```c +__HWAXFERB(src, mask); // Transfer with byte-granularity masking +__HWAXFERH(src, mask); // Transfer with halfword-granularity masking +``` + +The mask comes from a Streaming Address generator (SA0–SA3): +- `__MMA_XFER_MASK_PSA0` through `PSA3`: Use SA for write masking +- `__MMA_XFER_MASK_PSA0ADV` through `PSA3ADV`: Use SA + advance + +### 2.6 Status and Cleanup + +```c +void __HWACLOSE(0); // Close MMA +void __HWARESET(); // Reopen with previous config (MMA2+) +void __HWAXFER_XSTATUS_DELAYED(); // Read min/max range statistics +``` + +## 3. TIDL's Convolution Architecture + +### 3.1 High-Level Flow + +TIDL processes convolutions in a tiled, block-based manner: + +``` +for each output-channel group (64 channels at a time): + for each spatial tile: + 1. DMA input tile into L2 scratch buffer + 2. Open MMA with config registers + 3. Load bias into C accumulator (via HWALDC) + 4. Load per-channel scale/shift (via __HWA_LOAD_2REG) + 5. for each input-channel group (64 at a time): + Load 64 weight rows into B (via HWALDB × 64) + for each spatial position in tile: + Load input vector into A (via HWALDA) + Trigger MMA operation (HWAOP or HWAOPXFER) + 6. Transfer C results out (HWAXFER/HWARCV) + 7. 
Store output tile via DMA +``` + +### 3.2 MMA Configuration for uint8 Convolution + +TIDL's config register struct (`configRegisterStruct_i8s_i8s_o8s` in +`tidl_conv2d_mma.h`) uses these critical settings: + +```c +// A config (input activations) +.A_ATYPE = A_CONFIG_ATYPE_UINT8 // *** UNSIGNED uint8 inputs *** + +// B config (weights) +.B_BTYPE = B_CONFIG_SIZE8 // 8-bit weight elements +.B_ORDER = B_CONFIG_ROW // Row-major weight layout +.B_BSWPER = 64 // B bank switch every 64 ops + +// C config (accumulator) +.C_ATYPE = C_CONFIG_ATYPE_SA // SIGNED accumulator (int32) +.C_BTYPE = C_CONFIG_BTYPE_UINT8 // Bias loading type +.C_OPERATION0 = C_CONFIG_MUL // First op: C = A × B +.C_OPERATION1 = C_CONFIG_MULPLUS // Subsequent ops: C += A × B +.C_HWLDDST = C_CONFIG_HWLDDST_X1 // Bias goes to C accumulator ×1 +.C_HWLDTYPE = C_CONFIG_HWLDTYPE_INT8 +.C_OP0PER = 64 // OP0 for first 64 ops (init) +.C_OP1PER = (K-1)*64 // OP1 for remaining ops (accumulate) + +// X config (output transfer) +.X_XTYPE = X_CONFIG_XTYPE_UINT8 // Output type: uint8 +.X_CTYPE = X_CONFIG_CTYPE_UINT32 // Accumulator type: uint32 (read as unsigned!) +.X_SHIFT = OUT_SHIFT // Right-shift amount +.X_ReLU = 0 // ReLU disabled in config (done via PSAT) +.X_SAT = 0 // Standard saturation disabled +.X_RE = 0 // Rounding disabled in this config +.X_CSTART = 1 // X reads from opposite bank to C writes +``` + +**The critical insight: `A_ATYPE = UINT8`.** TIDL keeps input activations +as unsigned uint8, eliminating the need to subtract the input zero point +before the MMA multiply. The zero point correction is absorbed into the bias +instead (see Section 4). + +### 3.3 Operation Sequencing + +The MMA uses a two-operation scheme with FSM-controlled alternation: + +1. **Operation 0 (MUL)**: `C = A × B` — Initializes the accumulator with + the first partial product. Runs for `C_OP0PER = 64` operations (one + full B matrix multiply). + +2. 
**Operation 1 (MULPLUS)**: `C = C + A × B` — Accumulates subsequent + partial products. Runs for `C_OP1PER = (K-1)*64` operations (remaining + K-1 input channel groups). + +After K groups of 64 input channels, the C accumulator contains the full +convolution result for 64 output channels, plus the pre-loaded bias. The +X FSM then transfers the result out with shift + saturation + type conversion. + +### 3.4 Double Buffering + +Both B and C storage are double-buffered: + +- **B double-buffering**: While the MMA computes using B bank 0, new weight + rows are loaded into B bank 1 via `__HWALDB`. `B_BSWPER` controls when + banks alternate. + +- **C double-buffering**: While results from C bank 0 are being transferred + out via `__HWAXFER`, new computations write to C bank 1. `C_CWSWPER` + and `X_CSWPER` must be configured so the X FSM reads from the bank that + the C FSM just finished writing. + +### 3.5 Streaming Engine Configuration + +TIDL configures the streaming engines for efficient memory access: + +- **SE0** (or SE1): Streams input activation vectors from L2 SRAM + - `ELETYPE = __SE_ELETYPE_8BIT` + - `VECLEN = __SE_VECLEN_64BYTES` + - `ICNT0 = 64` (one vector = 64 bytes) + - Higher dimensions iterate over spatial positions and channel groups + - May use circular buffering (`CBK0`) for L2 scratch memory + +- **SA0/SA1/SA2**: Generate address sequences for: + - Write-masking partial output tiles (when output channels aren't a + multiple of 64) + - Controlling which C accumulator rows to transfer + +### 3.6 DSP Kernel Functions (Pre-compiled) + +The actual DSP kernel code is in `tidl_priv_algo.lib` (pre-compiled, no +source available). 
The key functions are: + +| Function | Purpose | +|----------|---------| +| `TIDL_conv2dDspInitNew()` | Allocate scratch buffers, configure SE/SA templates, set up MMA config | +| `TIDL_conv2dDspProcess()` | Execute the tiled convolution loop on the DSP | +| `hwaInit()` | Program MMA config register via `__HWAOPEN` | +| `blockConvS08_ci()` | Inner convolution block: B-panel fill + MMA compute loop | +| `calcMMAConv()` | Outer loop orchestrating tiles, DMA, and MMA blocks | +| `prefillBpanel_ci()` | Pre-load first B panel for double-buffering startup | + +### 3.7 L2 Memory Layout + +TIDL allocates a fixed L2 scratch buffer: + +```c +#define L2_MEM_SIZE (256*1024) // Total: 256 KB +#define INFEAT_L2_MEM_SIZE (128*1024) // Input features: 128 KB (power of 2 for SE circ buf) +#define LEFT_L2_MEM_SIZE (L2_MEM_SIZE - INFEAT_L2_MEM_SIZE) // Weights + bias: 128 KB +``` + +Input features are stored in a circular buffer in L2 SRAM, accessed via the +streaming engine with circular buffer addressing. This allows the SE to wrap +around when reading sliding-window convolution inputs. + +### 3.8 MMA Modeling Mode + +`tidl_conv2d_mma.h` includes an `#ifdef MMA_MODELING` path that implements +the MMA computation in software using C arrays: + +```c +int32_t cPanel[2][64*64]; // C accumulator (2 banks × 64×64 int32) +char64 bPanel[2][64]; // B weight matrix (2 banks × 64 rows) +char64 bPanelT[64]; // Transposed B panel +``` + +Functions like `mmaOP()`, `prefillBpanel()`, `transposeBPanel()`, and +`updateState()` simulate the hardware FSM behavior. This provides a +bit-accurate software model of the MMA for validation. + +## 4. Quantization: TIDL vs Thames/MMALib + +### 4.1 The Core Difference + +Both MMALib and TIDL map **input activations → MMA A** (with `A_ATYPE = +UINT8`) and **weights → MMA B** (with `B_BTYPE = SIZE8`). The MMA hardware +A/B mapping is the same in both cases. 
+ +**MMALib** (our current approach): +- MMALib's `convolveBias_row` API requires weights as **INT8** (signed, + zero point = 0) — this is an API constraint, not an MMA hardware + constraint +- For TFLite models using the **older uint8 quantization** (weights stored + as uint8 with non-zero zero point, e.g., w_zp=133), we must re-quantize: + `w_int8 = clamp(w_uint8 - w_zp, -128, 127)` +- When `|w_uint8 - w_zp| > 127`, the weight overflows and must be rescaled + by 0.5×, introducing rounding error +- The rescaling factor is compensated by doubling the per-channel scale, + but rounding errors from halving the weights accumulate +- For TFLite **full-integer quantization** (weights as int8, zp=0), this + is not an issue — weights are already in the required format + +**TIDL** (direct MMA): +- TIDL configures `C_BTYPE = C_CONFIG_BTYPE_UINT8` — the B matrix (weights) + is interpreted as unsigned uint8 in the C FSM / bias table context +- The A_ATYPE = UINT8 allows raw unsigned input activations without + subtracting the input zero point +- All zero point corrections (both input and output) are absorbed into + the derivedBias term during model import +- For TFLite full-integer quantization models (int8 symmetric weights), + the weights are used as-is +- The MMA hardware handles the mixed-signedness product correctly: + unsigned_A × signed/unsigned_B → signed int32 accumulator +- No weight re-quantization, no rescaling, no rounding error + +### 4.2 TIDL's Quantization Math + +For TFLite asymmetric quantization (`TIDL_QuantStyleAsymNP2_TFL`): + +**Scale and Shift (per output channel):** + +$$\text{scaleRatio}[c] = \frac{S_x \cdot S_w[c]}{S_y}$$ + +where $S_x$, $S_w[c]$, $S_y$ are the input, per-channel weight, and output +scales from TFLite. 
+ +`TIDL_getMMAv2_ScaleAndShift()` converts this floating-point ratio into a +uint8 scale and uint8 shift: + +```c +void TIDL_getMMAv2_ScaleAndShift(float scaleRatio, uint8_t *scale, uint8_t *shift) +{ + int32_t shiftBits = 0; + float newScaleRatio = scaleRatio; + while (1) { + newScaleRatio *= 2; + if (shiftBits >= 40) break; + if (newScaleRatio > 255.0) { newScaleRatio /= 2; break; } + shiftBits++; + } + *shift = shiftBits; + *scale = (uint8_t)(newScaleRatio + 0.5); +} +``` + +This is the same "doubling" algorithm we use in Thames (`compute_quant()`). +The result approximates: $\text{scaleRatio} \approx \text{scale} / 2^{\text{shift}}$ + +**Bias (per output channel) — direct TFLite path:** + +$$\text{derivedBias}[c] = \text{bias}_{\text{TFLite}}[c] + z_y \cdot \frac{S_y}{S_x \cdot S_w[c]} - z_x \cdot \sum_i W_q[c][i]$$ + +where: +- $\text{bias}_{\text{TFLite}}[c]$ is the int32 bias from the TFLite model +- $z_y$ is the output zero point +- $z_x$ is the input zero point +- $W_q[c][i]$ are the quantized int8 weights for output channel $c$ +- $\frac{S_y}{S_x \cdot S_w[c]}$ is `nScale` = $1 / \text{scaleRatio}[c]$ + +The $-z_x \cdot \sum W$ term pre-computes the input zero point correction, +so the MMA can operate on raw unsigned activation values without subtracting +$z_x$ first. The $z_y \cdot \text{nScale}$ term accounts for the output zero +point. 
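A scalar sketch of this bias derivation (the helper name is invented; TIDL computes the equivalent with full float precision at import time, and the rounding of the $z_y$ term below assumes a non-negative output zero point, which holds for uint8 outputs):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar sketch of the derivedBias computation (hypothetical helper;
 * TIDL does the equivalent offline during model import). */
static int32_t derived_bias(int32_t bias_tflite, int32_t zy, int32_t zx,
                            float sx, float sw, float sy,
                            const int8_t *w, int n)
{
    int32_t wsum = 0;
    for (int i = 0; i < n; i++)
        wsum += w[i];                  /* sum of quantized weights W_q[c][i] */

    float nscale = sy / (sx * sw);     /* nScale = 1 / scaleRatio[c] */

    /* +zy*nScale folds the output zero point; -zx*wsum folds the input
     * zero point so the MMA can consume raw uint8 activations.
     * zy >= 0 for uint8 outputs, so round-half-up via +0.5 is safe. */
    return bias_tflite + (int32_t)(zy * nscale + 0.5f) - zx * wsum;
}
```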
+ +**Post-accumulation (hardware):** + +The MMA hardware computes: + +$$\text{acc}[c] = \text{derivedBias}[c] + \sum_i A_{\text{uint8}}[i] \cdot W_{\text{int8}}[c][i]$$ + +Then the X transfer pipeline applies: + +$$\text{out}[c] = \text{clamp}\left(\left\lfloor \frac{\text{acc}[c] \cdot \text{scale}[c]}{2^{\text{shift}[c]}} + 0.5 \right\rfloor, \text{minPSAT}, \text{maxPSAT}\right)$$ + +**PSAT bounds:** For uint8 output with activation clipping: + +```c +minPSAT = round(clipMin / outScale) + outZeroPoint; // typically 0 +maxPSAT = round(clipMax / outScale) + outZeroPoint; // typically 255 +``` + +### 4.3 Weight Treatment + +TFLite full-integer quantized models store weights as **int8** (symmetric, +zero point = 0). TIDL keeps them as-is — no re-quantization. The weights +are loaded into the B matrix storage, and the MMA treats them as 8-bit +values (signedness is determined by how the multiply interprets the product, +controlled by `A_ATYPE` and `C_ATYPE`). + +Since `A_ATYPE = UINT8` (unsigned inputs) and `C_ATYPE = SA` (signed +accumulator), the multiply computes a **mixed-signedness** product: unsigned +activation × signed/unsigned weight → signed 32-bit accumulator. This +matches the TFLite quantization semantics exactly. 
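Putting Sections 4.2–4.3 together, the arithmetic for one output channel can be modeled in plain C. This is an illustrative scalar sketch — the helper name is invented, and the MMA computes 64 such channels per operation:

```c
#include <stdint.h>
#include <assert.h>

/* One output channel of the quantized convolution, modeled in scalar C:
 * mixed-signedness accumulation (uint8 activations x int8 weights) on top
 * of the pre-loaded derivedBias, then the X-pipeline scale/shift/round/PSAT.
 * Helper name is invented; not TI code. */
static uint8_t conv_channel_model(const uint8_t *a, const int8_t *w, int n,
                                  int32_t derived_bias,
                                  uint8_t scale, uint8_t shift,
                                  int32_t psat_min, int32_t psat_max)
{
    int64_t acc = derived_bias;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * w[i];       /* unsigned x signed -> signed */

    int64_t v = acc * scale;               /* per-channel scale */
    if (shift > 0)
        v = ((v >> (shift - 1)) + 1) >> 1; /* round-half-up shift */
    if (v < psat_min) v = psat_min;        /* PSAT clamp = activation clipping */
    if (v > psat_max) v = psat_max;
    return (uint8_t)v;
}
```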
+ +### 4.4 Comparison Table + +| Aspect | MMALib (Thames current) | TIDL (direct MMA) | +|--------|------------------------|-------------------| +| Input type (MMA A) | UINT8 | UINT8 | +| Weight type (MMA B) | INT8 (API constraint) | INT8 or UINT8 (flexible) | +| A_ATYPE | UINT8 | UINT8 | +| C_BTYPE | N/A (MMALib internal) | UINT8 | +| Weight re-quantization | Required for uint8 models | Not needed | +| Weight rescaling | 0.5× for overflows, compensated in scale | Not needed | +| Input zero point | Subtracted from bias term | Subtracted from bias term | +| Output zero point | Added to bias term | Added to bias term | +| Scale/shift | Per-channel uint8/uint8 | Per-channel uint8/uint8 | +| Post-processing | MMALib software (scaleShiftRoundAndReLU) | MMA hardware pipeline | +| Rounding error source | Weight halving + int8 clamping | None from weights | +| Per-channel scale load | MMALib internal | `__HWA_LOAD_2REG` intrinsic | +| Activation | Software ReLU in MMALib | HW PSAT + ReLU in X pipeline | + +### 4.5 Why TIDL Has No Accuracy Problem + +With direct MMA programming, TIDL avoids the accuracy issue we hit on +Model 2 (input_zp=128, weight_zp=133): + +1. TIDL's model import tool handles weight re-quantization offline (not at + runtime), and for TFLite full-integer quantization, weights are already + int8 symmetric (zp=0) → no re-quantization needed +2. If the original model has uint8 weights, TIDL's import can dequantize + to float and re-quantize with full precision during model compilation +3. The full zero point correction (for both input and output) is absorbed + into the derivedBias at import time → exact computation +4. 
The hardware pipeline handles scale/shift/round/saturate atomically + +For Model 2, where MMALib requires runtime rescaling of 19/32 channels +(causing cumulative rounding errors summing to diff=56), TIDL's offline +import would compute derivedBias with full float precision, and the MMA +hardware would produce results matching the TFLite CPU reference (within +the hardware rounding tolerance of shift+round). + +**Note on the MMA hardware mapping**: Both MMALib and TIDL use the same +hardware mapping: **A = input activations** (streamed through SE0 via +`HWALDAB`), **B = weight matrix** (loaded from SE1 via `HWALDAB`). The +`A_ATYPE = UINT8` setting allows raw unsigned input activations. The +difference is that MMALib's API forces weights to be INT8 symmetric at +the API level, while direct MMA programming gives full control over the +`C_BTYPE` field and weight format. + +## 5. What Source Code Is Available + +### 5.1 Available (in `c7x-mma-tidl/ti_dl/algo/`) + +| File | Content | +|------|---------| +| `inc/tidl_conv2d_mma.h` | MMA config register struct, offset register, SE/SA template declarations, MMA modeling code | +| `inc/tidl_conv2d_mma_i.h` | API declarations for DSP init/process functions | +| `src/tidl_conv2d_base.c` | Reference conv2d implementation, bias/scale/shift setup, PSAT computation, dispatch logic | +| `src/tidl_alg_utils.c` | `TIDL_getMMAv2_ScaleAndShift()` — the scale/shift quantization algorithm | +| `inc/tidl_alg_int.h` | `TIDL_roundSatMMA()` — round+shift+saturate reference implementation | + +### 5.2 Available (TI compiler headers) + +| File | Content | +|------|---------| +| `C7524-MMA2_256F/c7x_mma.h` | Complete MMA hardware definition: all enum types, `__HWA_CONFIG_REG_v1` struct, intrinsic declarations, matrix dimension macros | + +### 5.3 NOT Available (pre-compiled in `tidl_priv_algo.lib`) + +| Function | What it does | +|----------|-------------| +| `TIDL_conv2dDspInitNew()` | Full initialization: allocate buffers, configure SE/SA 
templates, set MMA config register fields dynamically based on layer parameters | +| `TIDL_conv2dDspProcess()` | Full execution: tiled loop with DMA, MMA compute, output writeback | +| `hwaInit()` | Program `__HWAOPEN` with runtime-computed config | +| `blockConvS08_ci()` | Inner block: prefill B panels, run MMA ops, transfer results | +| `calcMMAConv()` | Outer tile loop coordinating DMA and MMA | +| `prefillBpanel_ci()` | First B panel load for double-buffer pipeline startup | + +## 6. Implementation Roadmap for Thames + +### 6.1 Phase 1: Basic MMA Convolution (No Post-Processing) + +1. **Emit `__HWAOPEN`**: Encode the config register struct as an immediate + in the kernel binary. Use the TIDL config as a template but compute + FSM periods from the actual layer dimensions: + - `B_BSWPER = MMA_SIZE` (64 for int8) + - `C_OP0PER = MMA_SIZE` + - `C_OP1PER = (K-1) * MMA_SIZE` where K = numInChannels / 64 + - `C_CWSWPER = K * MMA_SIZE` + - etc. + +2. **Weight loading**: Use SE1 to stream weight rows, emit `__HWALDB` ×64 + for each input channel group. + +3. **Input loading**: Use SE0 to stream input activation vectors, emit + `__HWALDA` for each spatial position. + +4. **Compute**: `__HWAOP` for first A×B, `__HWAOPXFER` for pipelined + compute+transfer. + +5. **Result readout**: `__HWARCV` to get output vectors, `VST32B` to store. + +### 6.2 Phase 2: Add Quantization + +1. **Bias loading**: Pre-load derivedBias into C via `__HWALDC`. Bias + layout must match the MMA's C storage format. + +2. **Scale/shift loading**: `__HWA_LOAD_2REG(scale_vec, shift_vec, + __MMA_LOAD_2REG_SCALE_SHIFT_0)` before the compute loop. + +3. **Enable PSAT**: Set `X_PSAT = 1`, encode `SAT_MIN`/`SAT_MAX` in the + config register (16-bit values split across non-contiguous bitfields). + +4. **Enable rounding**: `X_RE = 1` for half-LSB rounding. + +5. **Enable per-channel shift**: `X_SCALE_SHIFT_CTRL = 1`, + `X_SHIFT = __MMA_X_CONFIG_SHIFT_ROW_UNSIGNED` (per-row unsigned shift). 
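The period arithmetic from Phase 1, step 1 can be collected into a small helper. This is a sketch under this section's assumptions — the struct and function are invented for illustration, and `num_in_channels` is assumed to be a multiple of 64:

```c
#include <stdint.h>
#include <assert.h>

#define MMA_SIZE 64  /* rows per B panel for int8 */

/* Hypothetical holder for the FSM periods derived from the layer shape. */
struct mma_periods {
    uint32_t b_bswper;   /* B bank switch period */
    uint32_t c_op0per;   /* ops using OP0 (MUL: C = A*B) */
    uint32_t c_op1per;   /* ops using OP1 (MULPLUS: C += A*B) */
    uint32_t c_cwswper;  /* C write-bank switch period */
};

/* num_in_channels must be a multiple of 64 here; a real emitter must
 * handle remainders with SA-based masking (Section 1.6). */
static struct mma_periods compute_periods(uint32_t num_in_channels)
{
    uint32_t k = num_in_channels / MMA_SIZE;  /* input-channel groups */
    struct mma_periods p;
    p.b_bswper  = MMA_SIZE;           /* switch B banks after each 64-row panel */
    p.c_op0per  = MMA_SIZE;           /* first group initializes C (MUL) */
    p.c_op1per  = (k - 1) * MMA_SIZE; /* remaining groups accumulate (MULPLUS) */
    p.c_cwswper = k * MMA_SIZE;       /* switch C banks once a result is done */
    return p;
}
```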
+ +### 6.3 Phase 3: Tiling and DMA + +1. Implement block scheduling (`thames_sched_operation()`) to find optimal + tile sizes that fit in L2 SRAM. + +2. Add DMA support for copying input tiles to L2 and output tiles to DDR. + +3. Implement double-buffering for both B panels and input tiles. + +### 6.4 Key Assembly Instructions + +The C7x assembler instructions corresponding to the MMA intrinsics: + +| Intrinsic | Assembly | Unit | +|-----------|----------|------| +| `__HWAOPEN` | `HWAOPEN .C2 VBL, VBL, imm` | C2 | +| `__HWALDA` | `HWALDAB .L2 VB, VB` (A-only form) | L2 | +| `__HWALDB` | `HWALDB .L2 VB` | L2 | +| `__HWALDC` | via `HWALDAB/HWALDBC` | L2 | +| `__HWALDAB` | `HWALDAB .L2 VB, VB` | L2 | +| `__HWAOP` | `HWAOPXFER .S1 imm` (op-only form) | S1 | +| `__HWAOPXFER` | `HWAOPXFER .S1 imm` | S1 | +| `__HWAXFER` | `HWAXFER .S2 imm` | S2 | +| `__HWARCV` | `HWARCVS .S2 VB` | S2 | +| `__HWACLOSE` | via control instruction | C2 | +| `__HWA_LOAD_2REG` | via HWAXFER variants | S2 | + +Note: In the reference MMALib disassembly, `HWAOPXFER` on S1 and `HWALDAB` +on L2 always execute **in parallel** (same execute packet). `HWARCVS` on S2 +follows with appropriate pipeline delay. + +### 6.5 Required Changes to Thames + +1. **`thames_coefs.c`**: Remove weight re-quantization (int8→int8 rescaling). + Compute derivedBias using the TIDL formula with zero point absorption. + Weights remain as TFLite int8 (symmetric, zp=0). + +2. **`thames_compiler.c`**: New MMA convolution emitter using `asm_HWAOPEN`, + `asm_HWALDAB`, `asm_HWAOPXFER`, `asm_HWARCVS` from the C7x assembler. + +3. **`thames_ml.c`**: Input data stays uint8 — no need to convert to int8. + Zero point folded into bias. + +4. **New**: MMA config register encoding helper — compute `__HWA_CONFIG_REG` + fields from layer parameters and encode as 512-bit immediate for HWAOPEN. + +## 7. 
Reference: TIDL `TIDL_roundSatMMA()` + +This function models the MMA hardware's post-processing: + +```c +static int64_t TIDL_roundSatMMA(int64_t val, int32_t bits, int32_t min, int32_t max) +{ + int64_t temp; + if (bits > 0) { + temp = (val >> (bits - 1)) + 1; // add rounding bias + val = temp >> 1; // complete the shift + } + val = val < min ? min : val; + val = val > max ? max : val; + return val; +} +``` + +This is: `round(val / 2^bits)` clamped to `[min, max]`. The rounding is +"round half up" (add 0.5 × 2^bits before truncation). + +## 8. Reference: MMA Config Register Layout + +The `__HWA_CONFIG_REG_v1` struct is 512 bits (64 bytes = one vector). +Key sections and their bit positions (little-endian): + +### A Config (bits 0–31 of first 64-bit word) +- `A_ATYPE` [3:0] — 4 bits: element type (UINT8=0, INT8=4, etc.) +- `A_ALUTEN` [8] — 1 bit: lookup table enable +- `A_ARF_CTRL` [16] — 1 bit: A register file enable +- `A_ARF_BASE` [22:17] — 6 bits: circular buffer base +- `A_ARF_SIZE` [30:24] — 7 bits: ARF array size + +### B Config (bits 32–63 of first word + bits 0–63 of second word) +- `B_BSWPER` [63:32] — 32 bits: B bank switch period +- `B_BRSTPER` [7:0] — 8 bits: B offset reset period +- `B_BTYPE` [9:8] — 2 bits: element size (SIZE8=0) +- `B_ORDER` [16] — 1 bit: row/column major +- `B_BSTART` [24] — 1 bit: initial B bank +- `B_BOFFSET` [39:32] — 8 bits: global offset + +### C Config (third + fourth + fifth 64-bit words) +- `C_ATYPE` [0] — 1 bit: signed/unsigned accumulator +- `C_BTYPE` [11:8] — 4 bits: bias element type +- `C_OPERATION0` [17:16] — 2 bits: first operation (MUL/MULPLUS/etc.) 
+- `C_OPERATION1` [25:24] — 2 bits: second operation +- `C_HWLDDST` [34:32] — 3 bits: HWALD destination +- `C_HWLDTYPE` [43:40] — 4 bits: HWALD element type +- `C_OPSTART` [48] — 1 bit: initial operation +- Various periods: `C_OP0PER`, `C_OP1PER`, `C_BSWPER`, `C_CRSWPER`, `C_CWSWPER` (32 bits each) +- Various offsets: `C_CROFFSET`, `C_CWOFFSET`, `C_CLOFFSET` (6-7 bits each) + +### X Config (sixth + seventh 64-bit words) +- `X_ReLU` [0] — 1 bit: ReLU enable +- `X_PSAT` [1] — 1 bit: parameterized saturation enable +- `X_SAT_MIN` — 16 bits split across three fields (5:0, 12:6, 15:13) +- `X_SAT` [8+16] — 1 bit: standard saturation enable +- `X_RE` [16+16] — 1 bit: rounding enable +- `X_SHIFT` [38:32] — 7 bits: right shift amount (or mode selector) +- `X_SCALE_SHIFT_CTRL` [30] — 1 bit: per-row/col scale/shift enable +- `X_XTYPE` [43:40] — 4 bits: output element type +- `X_SAT_MAX` — 16 bits split across three fields +- `X_CTYPE` [50:48] — 3 bits: accumulator element type +- `X_CSWPER` — 32 bits: C read bank switch period +- `X_COFFSET` — 8 bits: C read offset +- `X_CSTART` — 1 bit: initial C bank for X reads + +### Padding / Misc +- `PARITYCTRL` [63:62] — 2 bits: parity control + +The config register is loaded as a single 512-bit vector via the `HWAOPEN` +instruction.
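Building the 512-bit image boils down to OR-ing each field into its bit position. A generic field-insertion helper is sketched below (illustrative only; the actual Thames encoder must use the verified bit positions above, including the split `SAT_MIN`/`SAT_MAX` sub-fields):

```c
#include <stdint.h>
#include <assert.h>

/* Insert `val` (low `bits` bits) at bit `pos` of one 64-bit config word.
 * Generic building block for assembling the register image. */
static uint64_t set_field(uint64_t word, unsigned pos, unsigned bits, uint64_t val)
{
    uint64_t mask = ((bits < 64) ? ((1ULL << bits) - 1ULL) : ~0ULL) << pos;
    return (word & ~mask) | ((val << pos) & mask);
}
```

Fields scattered across non-contiguous sub-ranges (such as `X_SAT_MIN`) would take several `set_field` calls, one per sub-field, before the eight assembled words are combined into the 512-bit vector for `HWAOPEN`.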