Temp docs on TIDL+MMA

Commit 4e582a5816 (parent f24c783170)
Tomeu Vizoso, 2026-03-11 08:30:19 +01:00

TIDL_MMA_REFERENCE.md (new file, 718 lines)
# How TIDL Uses the C7x MMA for Convolution
## Overview
TI's TIDL (TI Deep Learning) framework runs convolutions on the C7x DSP by
programming the MMA (Matrix Multiply Accelerator) hardware directly, bypassing
MMALib. This document describes the hardware, the TIDL data flow, and how it
differs from our current MMALib-based approach. The goal is to provide a
reference for implementing direct MMA programming in Thames.
## 1. C7x MMA Hardware Architecture
### 1.1 Variant
The J722S contains a C7524 DSP core with an **MMA2-256F** coprocessor
(512-bit vector width, 64-byte vectors).
### 1.2 Matrix Storage Areas
The MMA has three internal storage areas:
| Storage | Size (per bank) | Row Width | Rows (int8) | Purpose |
|---------|-----------------|-----------|-------------|---------|
| **A** (activation) | 64 B | 64 B | 1 | Input feature vector (one row at a time, not banked) |
| **B** (weights) | 4096 B | 64 B | 64 | Weight matrix (double-buffered: 2 banks × 64 rows) |
| **C** (accumulator) | 16384 B | 256 B | 64 | Accumulation results (64 rows × 64 int32, double-buffered: 2 banks) |
For int8 operation:
- A holds 64 elements per row (one input vector)
- B holds a 64×64 weight matrix (double-buffered, so B0 and B1 each hold 64×64)
- C holds 64 int32 accumulators per row (one output-channel group), with 64 rows per bank (double-buffered, C0 and C1)
- One MMA operation computes: **C += A × B** (64 outputs from 64 inputs × 64×64 weights)
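As a reference point, one int8 MMA operation can be modeled in a few lines of plain C. This is a sketch of our own for illustration (`mma_op_model` is our name, not a TI API):

```c
#include <stdint.h>

/* Illustrative software model of one int8 MMA operation:
 * C[col] (+)= sum_row A[row] * B[row][col], producing 64 outputs per op.
 * `accumulate` distinguishes MULPLUS (C += A*B) from MUL (C = A*B). */
static void mma_op_model(const uint8_t a[64], const int8_t b[64][64],
                         int32_t c[64], int accumulate)
{
    for (int col = 0; col < 64; col++) {
        int32_t acc = accumulate ? c[col] : 0;
        for (int row = 0; row < 64; row++)
            acc += (int32_t)a[row] * (int32_t)b[row][col];
        c[col] = acc;
    }
}
```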
### 1.3 Data Types
The MMA supports configurable types for each storage area:
**A matrix (input activations):**
| Enum | Value | Description |
|------|-------|-------------|
| `ATYPE_UINT8` | 0 | Unsigned 8-bit |
| `ATYPE_INT8` | 4 | Signed 8-bit |
| `ATYPE_UINT16` | 1 | Unsigned 16-bit |
| `ATYPE_INT16` | 5 | Signed 16-bit |
| `ATYPE_UINT32` | 2 | Unsigned 32-bit |
| `ATYPE_INT32` | 6 | Signed 32-bit |
| `ATYPE_F32` | 3 | 32-bit float (MMA2-256F only) |
**B matrix (weights):** `SIZE8`, `SIZE16`, `SIZE32` — element size only,
signedness is implicit (the multiply uses A's signedness for the product
interpretation).
**C accumulator (output):**
- `C_ATYPE`: `SA` (signed) or `UA` (unsigned) — accumulator signedness
- `C_BTYPE`: configures element size/type for bias loading — `UINT8`, `INT8`,
`UINT16`, `INT16`, `UINT32`, `INT32`, `UINT64`, `INT64`
- `C_OPERATION0/1`: `MUL` (C = A×B), `MULNEGATE` (C = -A×B), `MULMINUS`
(C = C - A×B), `MULPLUS` (C = C + A×B)
**X transfer (output conversion):**
- `X_XTYPE`: output element type — `UINT8`, `INT8`, `UINT16`, `INT16`,
`UINT32`, `INT32`, etc.
- `X_CTYPE`: accumulator element type read by X FSM — `UINT32`, `INT32`,
`UINT64`, `INT64`, `UINT128`, `INT128`, `F32`
### 1.4 Post-Processing Pipeline (X Transfer)
When transferring results out of the C accumulator, the MMA can apply a
hardware post-processing pipeline:
1. **Scale + Shift** (optional, `X_SCALE_SHIFT_CTRL`): Per-row or per-column
unsigned/signed scale and shift. Loaded via `__HWA_LOAD_2REG` into
`HWA_SCALE0/1` and `HWA_SHIFT0/1` registers.
2. **Right Shift** (`X_SHIFT`): Fixed right-shift amount (7 bits). When
`X_SCALE_SHIFT_CTRL` is enabled, this field selects per-row/per-col
mode instead.
3. **Rounding** (`X_RE`): Half-LSB rounding after shift (adds 0.5 before
truncation).
4. **Saturation** (`X_SAT`): Clamp to output type range.
5. **Parameterized Saturation** (`X_PSAT`): Clamp to custom
`[SAT_MIN, SAT_MAX]` range (16-bit values, split across multiple
bitfields in the config register). Used for activation clipping.
6. **ReLU** (`X_ReLU`): Rectified linear unit (clamp negatives to 0).
Applied after PSAT.
7. **Type Conversion**: Convert from accumulator type (`X_CTYPE`) to output
type (`X_XTYPE`).
The hardware pipeline order is:
**accumulate → (scale ×) → shift → round → saturate → PSAT → ReLU → output**
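The per-element behavior of that order can be sketched in software (an illustrative model of our own, not the exact hardware datapath; rounding follows the half-LSB description in step 3):

```c
#include <stdint.h>

/* Illustrative per-element model of the X-transfer pipeline in the order
 * listed above: scale, shift+round, PSAT, ReLU, type conversion. */
static uint8_t x_pipeline_model(int32_t acc, uint32_t scale, uint32_t shift,
                                int32_t psat_min, int32_t psat_max, int relu)
{
    int64_t v = (int64_t)acc * (int64_t)scale;      /* optional per-channel scale */
    if (shift > 0)
        v = ((v >> (shift - 1)) + 1) >> 1;          /* right shift with X_RE rounding */
    if (v < psat_min) v = psat_min;                 /* X_PSAT: clamp to [SAT_MIN, SAT_MAX] */
    if (v > psat_max) v = psat_max;
    if (relu && v < 0) v = 0;                       /* X_ReLU, applied after PSAT */
    return (uint8_t)v;                              /* convert to X_XTYPE (uint8 here) */
}
```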
### 1.5 Finite State Machines
The MMA has three FSMs that sequence operations automatically:
- **B FSM**: Controls writing to B matrix banks (double-buffering), row
offsets, bank switching periods
- **C FSM**: Controls accumulator read/write banks, offsets, operation
selection (MUL vs MULPLUS), bias loading destinations
- **X FSM**: Controls transfer from C to output, bank switching for reads,
offset sequencing
The FSM periods (e.g., `B_BSWPER`, `C_CWSWPER`, `X_CSWPER`) are expressed
in units of MMA operations and control when banks switch, offsets reset, and
operations alternate between OP0 and OP1.
### 1.6 Streaming Engines (SE) and Streaming Address Generators (SA)
The C7x has hardware streaming engines for efficient memory access:
- **SE0, SE1**: Read-only streaming engines — fetch data from memory in
configurable multi-dimensional patterns (up to 6D: `ICNT0`..`ICNT5`,
`DIM1`..`DIM5`). Opened with `__SE_OPEN` using a `__STRM_TEMPLATE`
configuration struct.
- **SA0, SA1, SA2, SA3**: Streaming Address generators — produce address
sequences for MMA transfer masking. Opened with `__SA_OPEN`. Can
generate multi-dimensional address patterns used with `__HWAXFER_MASK`
to control which C accumulator rows are transferred.
The SE templates configure:
- `DIMFMT`: Number of dimensions (up to 6D)
- `ELETYPE`: Element type (`__SE_ELETYPE_8BIT`, `__SE_ELETYPE_16BIT`, etc.)
- `VECLEN`: Vector length (`__SE_VECLEN_64BYTES` for 512-bit)
- `ICNT0..ICNT5`: Iteration counts per dimension
- `DIM1..DIM5`: Stride (in bytes) for each dimension
- `CBK0/CBK1`: Circular buffer configuration
- `DECDIM1/2`: Decrementing dimension support
These are used to stream input activations through SE0 and weight rows
through SE1 (or to load them into the MMA A and B storage directly).
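The multi-dimensional walk an SE template describes can be illustrated with a small address-generation model (a sketch; `se_pattern_3d` is our own name and covers only three of the six dimensions):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative 3D walk of an SE-style pattern: ICNT0 contiguous elements,
 * with DIM1/DIM2 byte strides for the outer dimensions. This collects the
 * generated addresses; the real SE streams the data itself. */
static size_t se_pattern_3d(const uint8_t *base,
                            int icnt0, int icnt1, int icnt2,
                            int dim1, int dim2,
                            const uint8_t **out, size_t max_out)
{
    size_t n = 0;
    for (int i2 = 0; i2 < icnt2; i2++)
        for (int i1 = 0; i1 < icnt1; i1++)
            for (int i0 = 0; i0 < icnt0 && n < max_out; i0++)
                out[n++] = base + (ptrdiff_t)i2 * dim2
                                + (ptrdiff_t)i1 * dim1 + i0;
    return n;
}
```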
## 2. MMA Intrinsics API
The C7x compiler provides intrinsics that map to MMA hardware instructions:
### 2.1 Configuration
```c
void __HWAOPEN(__HWA_CONFIG_REG_v1 config,
               __HWA_OFFSET_REG offsets,
               __MMA_OPEN_FSM fsm_select);
```
Opens and configures the MMA. `fsm_select` is an `__MMA_OPEN_FSM` enum
value (`__MMA_OPEN_FSM_RESET = 0` resets all FSMs).
### 2.2 Data Loading
```c
void __HWALDA(__mma_vec src); // Load one row into A storage
void __HWALDB(__mma_vec src); // Load one row into B storage
void __HWALDC(__mma_vec src); // Load one row into C storage (bias)
void __HWALDAB(__mma_vec a, __mma_vec b); // Load A and B simultaneously
void __HWALDBC(__mma_vec b, __mma_vec c); // Load B and C simultaneously
```
For efficient pipelining, `__HWALDAB` loads both an input activation row
(into A) and a weight row (into B) in a single instruction.
### 2.3 Scale/Shift Loading (MMA2+)
```c
void __HWA_LOAD_2REG(__mma_vec src1, __mma_vec src2, __MMA_LOAD_2REG dest);
```
Loads per-channel scale and shift vectors into MMA registers:
- `__MMA_LOAD_2REG_SCALE_SHIFT_0`: src1 → HWA_SCALE0, src2 → HWA_SHIFT0
- `__MMA_LOAD_2REG_SCALE_SHIFT_1`: src1 → HWA_SCALE1, src2 → HWA_SHIFT1
These are used for per-channel quantization: scale[ch] and shift[ch] values
are packed into vectors and loaded before the compute loop.
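A minimal sketch of that packing step, assuming one uint8 scale/shift per output channel with channel `c` in byte `c` of the 64-byte vector (an assumed layout for illustration, not a verified one):

```c
#include <stdint.h>

/* Sketch of packing per-output-channel quantization parameters into the
 * two 64-byte vectors passed to __HWA_LOAD_2REG. */
static void pack_scale_shift(const uint8_t *scale, const uint8_t *shift,
                             int num_ch,
                             uint8_t scale_vec[64], uint8_t shift_vec[64])
{
    for (int c = 0; c < 64; c++) {
        scale_vec[c] = (c < num_ch) ? scale[c] : 0;  /* pad unused channels */
        shift_vec[c] = (c < num_ch) ? shift[c] : 0;
    }
}
```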
### 2.4 Compute
```c
void __HWAOP(__MMA_A_SOURCE_SELECT src); // Perform one MMA operation
void __HWAOPXFER(__MMA_A_SOURCE_SELECT src); // Operate + transfer in parallel
```
`__MMA_A_LDA` (= 0): Use A vector from most recent `__HWALDA`.
When using the A Register File (ARF), can specify `__MMA_A_ARF_ROW_SA0`
through `__MMA_A_ARF_ROW_SA3` (with optional `ADV` variants that advance
the SA).
`__HWAOPXFER` performs an MMA operation and a C→output transfer
simultaneously, enabling pipelined execution where one accumulator bank
computes while the other bank's results are being transferred out.
### 2.5 Transfer (Result Readout)
```c
void __HWAXFER(__MMA_XFER_SRC src); // Load transfer buffer
__mma_vec __HWARCV(int32_t index); // Read from transfer buffer
```
`__HWAXFER(__MMA_XFER_SRC_C)`: Transfer C accumulator contents through the
post-processing pipeline into the transfer buffer.
With masking (MMA2+):
```c
__HWAXFERB(src, mask); // Transfer with byte-granularity masking
__HWAXFERH(src, mask); // Transfer with halfword-granularity masking
```
The mask comes from a Streaming Address generator (SA0–SA3):
- `__MMA_XFER_MASK_PSA0` through `PSA3`: Use SA for write masking
- `__MMA_XFER_MASK_PSA0ADV` through `PSA3ADV`: Use SA + advance
### 2.6 Status and Cleanup
```c
void __HWACLOSE(int32_t arg);        // Close MMA (invoked as __HWACLOSE(0))
void __HWARESET();                   // Reopen with previous config (MMA2+)
void __HWAXFER_XSTATUS_DELAYED();    // Read min/max range statistics
```
## 3. TIDL's Convolution Architecture
### 3.1 High-Level Flow
TIDL processes convolutions in a tiled, block-based manner:
```
for each output-channel group (64 channels at a time):
for each spatial tile:
1. DMA input tile into L2 scratch buffer
2. Open MMA with config registers
3. Load bias into C accumulator (via HWALDC)
4. Load per-channel scale/shift (via __HWA_LOAD_2REG)
5. for each input-channel group (64 at a time):
Load 64 weight rows into B (via HWALDB × 64)
for each spatial position in tile:
Load input vector into A (via HWALDA)
Trigger MMA operation (HWAOP or HWAOPXFER)
6. Transfer C results out (HWAXFER/HWARCV)
7. Store output tile via DMA
```
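A plain-C walk-through of steps 3 and 5 for a single spatial position may help make the data flow concrete. This is a simplification we wrote for illustration: bias preload and all K channel groups are folded into one accumulation loop, ignoring banking and FSM sequencing:

```c
#include <stdint.h>

/* Hypothetical model of one spatial position for one 64-output-channel
 * group: bias preload (step 3) then K input-channel groups (step 5). */
static void conv_block_model(const uint8_t *a,      /* K*64 inputs for this position */
                             const int8_t *b,       /* K panels of 64x64 weights, row-major */
                             const int32_t bias[64],
                             int K, int32_t out[64])
{
    for (int c = 0; c < 64; c++)
        out[c] = bias[c];                           /* step 3: bias preload into C */
    for (int k = 0; k < K; k++)                     /* step 5: input-channel groups */
        for (int col = 0; col < 64; col++)
            for (int row = 0; row < 64; row++)
                out[col] += (int32_t)a[k * 64 + row]
                          * (int32_t)b[(k * 64 + row) * 64 + col];
}
```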
### 3.2 MMA Configuration for uint8 Convolution
TIDL's config register struct (`configRegisterStruct_i8s_i8s_o8s` in
`tidl_conv2d_mma.h`) uses these critical settings:
```c
// A config (input activations)
.A_ATYPE = A_CONFIG_ATYPE_UINT8 // *** UNSIGNED uint8 inputs ***
// B config (weights)
.B_BTYPE = B_CONFIG_SIZE8 // 8-bit weight elements
.B_ORDER = B_CONFIG_ROW // Row-major weight layout
.B_BSWPER = 64 // B bank switch every 64 ops
// C config (accumulator)
.C_ATYPE = C_CONFIG_ATYPE_SA // SIGNED accumulator (int32)
.C_BTYPE = C_CONFIG_BTYPE_UINT8 // Bias loading type
.C_OPERATION0 = C_CONFIG_MUL // First op: C = A × B
.C_OPERATION1 = C_CONFIG_MULPLUS // Subsequent ops: C += A × B
.C_HWLDDST = C_CONFIG_HWLDDST_X1 // Bias goes to C accumulator ×1
.C_HWLDTYPE = C_CONFIG_HWLDTYPE_INT8
.C_OP0PER = 64 // OP0 for first 64 ops (init)
.C_OP1PER = (K-1)*64 // OP1 for remaining ops (accumulate)
// X config (output transfer)
.X_XTYPE = X_CONFIG_XTYPE_UINT8 // Output type: uint8
.X_CTYPE = X_CONFIG_CTYPE_UINT32 // Accumulator type: uint32 (read as unsigned!)
.X_SHIFT = OUT_SHIFT // Right-shift amount
.X_ReLU = 0 // ReLU disabled in config (done via PSAT)
.X_SAT = 0 // Standard saturation disabled
.X_RE = 0 // Rounding disabled in this config
.X_CSTART = 1 // X reads from opposite bank to C writes
```
**The critical insight: `A_ATYPE = UINT8`.** TIDL keeps input activations
in their raw uint8 form, eliminating the need to subtract the input zero
point before the MMA multiply. The zero point correction is absorbed into
the bias instead (see Section 4).
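The folding works because of the identity $\sum_i (a_i - z_x)\,w_i = \sum_i a_i w_i - z_x \sum_i w_i$; a quick numeric check:

```c
#include <stdint.h>

/* Checks the zero-point folding identity:
 * sum((a - zx) * w) == sum(a * w) - zx * sum(w). */
static int zero_point_fold_ok(const uint8_t *a, const int8_t *w,
                              int n, int32_t zx)
{
    int32_t centered = 0, raw = 0, wsum = 0;
    for (int i = 0; i < n; i++) {
        centered += ((int32_t)a[i] - zx) * (int32_t)w[i];
        raw      += (int32_t)a[i] * (int32_t)w[i];
        wsum     += (int32_t)w[i];
    }
    return centered == raw - zx * wsum;
}
```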
### 3.3 Operation Sequencing
The MMA uses a two-operation scheme with FSM-controlled alternation:
1. **Operation 0 (MUL)**: `C = A × B` — Initializes the accumulator with
the first partial product. Runs for `C_OP0PER = 64` operations (one
full B matrix multiply).
2. **Operation 1 (MULPLUS)**: `C = C + A × B` — Accumulates subsequent
partial products. Runs for `C_OP1PER = (K-1)*64` operations (remaining
K-1 input channel groups).
After K groups of 64 input channels, the C accumulator contains the full
convolution result for 64 output channels, plus the pre-loaded bias. The
X FSM then transfers the result out with shift + saturation + type conversion.
### 3.4 Double Buffering
Both B and C storage are double-buffered:
- **B double-buffering**: While the MMA computes using B bank 0, new weight
rows are loaded into B bank 1 via `__HWALDB`. `B_BSWPER` controls when
banks alternate.
- **C double-buffering**: While results from C bank 0 are being transferred
out via `__HWAXFER`, new computations write to C bank 1. `C_CWSWPER`
and `X_CSWPER` must be configured so the X FSM reads from the bank that
the C FSM just finished writing.
### 3.5 Streaming Engine Configuration
TIDL configures the streaming engines for efficient memory access:
- **SE0** (or SE1): Streams input activation vectors from L2 SRAM
- `ELETYPE = __SE_ELETYPE_8BIT`
- `VECLEN = __SE_VECLEN_64BYTES`
- `ICNT0 = 64` (one vector = 64 bytes)
- Higher dimensions iterate over spatial positions and channel groups
- May use circular buffering (`CBK0`) for L2 scratch memory
- **SA0/SA1/SA2**: Generate address sequences for:
- Write-masking partial output tiles (when output channels aren't a
multiple of 64)
- Controlling which C accumulator rows to transfer
### 3.6 DSP Kernel Functions (Pre-compiled)
The actual DSP kernel code is in `tidl_priv_algo.lib` (pre-compiled, no
source available). The key functions are:
| Function | Purpose |
|----------|---------|
| `TIDL_conv2dDspInitNew()` | Allocate scratch buffers, configure SE/SA templates, set up MMA config |
| `TIDL_conv2dDspProcess()` | Execute the tiled convolution loop on the DSP |
| `hwaInit()` | Program MMA config register via `__HWAOPEN` |
| `blockConvS08_ci()` | Inner convolution block: B-panel fill + MMA compute loop |
| `calcMMAConv()` | Outer loop orchestrating tiles, DMA, and MMA blocks |
| `prefillBpanel_ci()` | Pre-load first B panel for double-buffering startup |
### 3.7 L2 Memory Layout
TIDL allocates a fixed L2 scratch buffer:
```c
#define L2_MEM_SIZE (256*1024) // Total: 256 KB
#define INFEAT_L2_MEM_SIZE (128*1024) // Input features: 128 KB (power of 2 for SE circ buf)
#define LEFT_L2_MEM_SIZE (L2_MEM_SIZE - INFEAT_L2_MEM_SIZE) // Weights + bias: 128 KB
```
Input features are stored in a circular buffer in L2 SRAM, accessed via the
streaming engine with circular buffer addressing. This allows the SE to wrap
around when reading sliding-window convolution inputs.
### 3.8 MMA Modeling Mode
`tidl_conv2d_mma.h` includes an `#ifdef MMA_MODELING` path that implements
the MMA computation in software using C arrays:
```c
int32_t cPanel[2][64*64]; // C accumulator (2 banks × 64×64 int32)
char64 bPanel[2][64]; // B weight matrix (2 banks × 64 rows)
char64 bPanelT[64]; // Transposed B panel
```
Functions like `mmaOP()`, `prefillBpanel()`, `transposeBPanel()`, and
`updateState()` simulate the hardware FSM behavior. This provides a
bit-accurate software model of the MMA for validation.
## 4. Quantization: TIDL vs Thames/MMALib
### 4.1 The Core Difference
Both MMALib and TIDL map **input activations → MMA A** (with `A_ATYPE =
UINT8`) and **weights → MMA B** (with `B_BTYPE = SIZE8`). The MMA hardware
A/B mapping is the same in both cases.
**MMALib** (our current approach):
- MMALib's `convolveBias_row` API requires weights as **INT8** (signed,
zero point = 0) — this is an API constraint, not an MMA hardware
constraint
- For TFLite models using the **older uint8 quantization** (weights stored
as uint8 with non-zero zero point, e.g., w_zp=133), we must re-quantize:
`w_int8 = clamp(w_uint8 - w_zp, -128, 127)`
- When `|w_uint8 - w_zp| > 127`, the weight overflows and must be rescaled
by 0.5×, introducing rounding error
- The rescaling factor is compensated by doubling the per-channel scale,
but rounding errors from halving the weights accumulate
- For TFLite **full-integer quantization** (weights as int8, zp=0), this
is not an issue — weights are already in the required format
**TIDL** (direct MMA):
- TIDL sets `C_BTYPE = C_CONFIG_BTYPE_UINT8` — per Section 1.3 this is the
  element type used when loading values into the C accumulator (the bias
  path), not the weight format itself
- The A_ATYPE = UINT8 allows raw unsigned input activations without
subtracting the input zero point
- All zero point corrections (both input and output) are absorbed into
the derivedBias term during model import
- For TFLite full-integer quantization models (int8 symmetric weights),
the weights are used as-is
- The MMA hardware handles the mixed-signedness product correctly:
unsigned_A × signed/unsigned_B → signed int32 accumulator
- No weight re-quantization, no rescaling, no rounding error
### 4.2 TIDL's Quantization Math
For TFLite asymmetric quantization (`TIDL_QuantStyleAsymNP2_TFL`):
**Scale and Shift (per output channel):**
$$\text{scaleRatio}[c] = \frac{S_x \cdot S_w[c]}{S_y}$$
where $S_x$, $S_w[c]$, $S_y$ are the input, per-channel weight, and output
scales from TFLite.
`TIDL_getMMAv2_ScaleAndShift()` converts this floating-point ratio into a
uint8 scale and uint8 shift:
```c
void TIDL_getMMAv2_ScaleAndShift(float scaleRatio, uint8_t *scale, uint8_t *shift)
{
int32_t shiftBits = 0;
float newScaleRatio = scaleRatio;
while (1) {
newScaleRatio *= 2;
if (shiftBits >= 40) break;
if (newScaleRatio > 255.0) { newScaleRatio /= 2; break; }
shiftBits++;
}
*shift = shiftBits;
*scale = (uint8_t)(newScaleRatio + 0.5);
}
```
This is the same "doubling" algorithm we use in Thames (`compute_quant()`).
The result approximates: $\text{scaleRatio} \approx \text{scale} / 2^{\text{shift}}$
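The approximation can be exercised directly with a standalone copy of the quoted routine (renamed here so it is not mistaken for the library symbol), plus a relative-error probe:

```c
#include <stdint.h>

/* Copy of the quoted TIDL routine, plus a relative-error measure for the
 * scale / 2^shift approximation. */
static void getScaleAndShift(float scaleRatio, uint8_t *scale, uint8_t *shift)
{
    int32_t shiftBits = 0;
    float newScaleRatio = scaleRatio;
    while (1) {
        newScaleRatio *= 2;
        if (shiftBits >= 40) break;
        if (newScaleRatio > 255.0f) { newScaleRatio /= 2; break; }
        shiftBits++;
    }
    *shift = (uint8_t)shiftBits;
    *scale = (uint8_t)(newScaleRatio + 0.5f);
}

static double rel_err(float scaleRatio)
{
    uint8_t scale, shift;
    getScaleAndShift(scaleRatio, &scale, &shift);
    double approx = (double)scale / (double)(1ULL << shift);
    double e = approx - (double)scaleRatio;
    return (e < 0 ? -e : e) / (double)scaleRatio;
}
```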
**Bias (per output channel) — direct TFLite path:**
$$\text{derivedBias}[c] = \text{bias}_{\text{TFLite}}[c] + z_y \cdot \frac{S_y}{S_x \cdot S_w[c]} - z_x \cdot \sum_i W_q[c][i]$$
where:
- $\text{bias}_{\text{TFLite}}[c]$ is the int32 bias from the TFLite model
- $z_y$ is the output zero point
- $z_x$ is the input zero point
- $W_q[c][i]$ are the quantized int8 weights for output channel $c$
- $\frac{S_y}{S_x \cdot S_w[c]}$ is `nScale` = $1 / \text{scaleRatio}[c]$
The $-z_x \cdot \sum W$ term pre-computes the input zero point correction,
so the MMA can operate on raw unsigned activation values without subtracting
$z_x$ first. The $z_y \cdot \text{nScale}$ term accounts for the output zero
point.
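The formula can be transcribed directly (a sketch; TIDL's import-time source is not available, so rounding the $z_y \cdot \text{nScale}$ term half-away-from-zero is an assumption, and `derived_bias` is our own name):

```c
#include <stdint.h>

/* Sketch of the per-channel derivedBias formula above. */
static int32_t derived_bias(int32_t tflite_bias, int32_t zy, int32_t zx,
                            const int8_t *w, int n,
                            float sx, float sw, float sy)
{
    int32_t wsum = 0;
    for (int i = 0; i < n; i++)
        wsum += w[i];                       /* sum_i Wq[c][i] */
    float nScale = sy / (sx * sw);          /* 1 / scaleRatio[c] */
    float zyTerm = (float)zy * nScale;      /* output zero-point correction */
    int32_t zyRounded = (int32_t)(zyTerm + (zyTerm >= 0.0f ? 0.5f : -0.5f));
    return tflite_bias + zyRounded - zx * wsum;
}
```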
**Post-accumulation (hardware):**
The MMA hardware computes:
$$\text{acc}[c] = \text{derivedBias}[c] + \sum_i A_{\text{uint8}}[i] \cdot W_{\text{int8}}[c][i]$$
Then the X transfer pipeline applies:
$$\text{out}[c] = \text{clamp}\left(\left\lfloor \frac{\text{acc}[c] \cdot \text{scale}[c]}{2^{\text{shift}[c]}} + 0.5 \right\rfloor, \text{minPSAT}, \text{maxPSAT}\right)$$
**PSAT bounds:** For uint8 output with activation clipping:
```c
minPSAT = round(clipMin / outScale) + outZeroPoint; // typically 0
maxPSAT = round(clipMax / outScale) + outZeroPoint; // typically 255
```
### 4.3 Weight Treatment
TFLite full-integer quantized models store weights as **int8** (symmetric,
zero point = 0). TIDL keeps them as-is — no re-quantization. The weights
are loaded into the B matrix storage, and the MMA treats them as 8-bit
values (signedness is determined by how the multiply interprets the product,
controlled by `A_ATYPE` and `C_ATYPE`).
Since `A_ATYPE = UINT8` (unsigned inputs) and `C_ATYPE = SA` (signed
accumulator), the multiply computes a **mixed-signedness** product: unsigned
activation × signed/unsigned weight → signed 32-bit accumulator. This
matches the TFLite quantization semantics exactly.
### 4.4 Comparison Table
| Aspect | MMALib (Thames current) | TIDL (direct MMA) |
|--------|------------------------|-------------------|
| Input type (MMA A) | UINT8 | UINT8 |
| Weight type (MMA B) | INT8 (API constraint) | INT8 or UINT8 (flexible) |
| A_ATYPE | UINT8 | UINT8 |
| C_BTYPE | N/A (MMALib internal) | UINT8 |
| Weight re-quantization | Required for uint8 models | Not needed |
| Weight rescaling | 0.5× for overflows, compensated in scale | Not needed |
| Input zero point | Subtracted from bias term | Subtracted from bias term |
| Output zero point | Added to bias term | Added to bias term |
| Scale/shift | Per-channel uint8/uint8 | Per-channel uint8/uint8 |
| Post-processing | MMALib software (scaleShiftRoundAndReLU) | MMA hardware pipeline |
| Rounding error source | Weight halving + int8 clamping | None from weights |
| Per-channel scale load | MMALib internal | `__HWA_LOAD_2REG` intrinsic |
| Activation | Software ReLU in MMALib | HW PSAT + ReLU in X pipeline |
### 4.5 Why TIDL Has No Accuracy Problem
With direct MMA programming, TIDL avoids the accuracy issue we hit on
Model 2 (input_zp=128, weight_zp=133):
1. TIDL's model import tool handles weight re-quantization offline (not at
runtime), and for TFLite full-integer quantization, weights are already
int8 symmetric (zp=0) → no re-quantization needed
2. If the original model has uint8 weights, TIDL's import can dequantize
to float and re-quantize with full precision during model compilation
3. The full zero point correction (for both input and output) is absorbed
into the derivedBias at import time → exact computation
4. The hardware pipeline handles scale/shift/round/saturate atomically
For Model 2, where MMALib requires runtime rescaling of 19/32 channels
(causing cumulative rounding errors summing to diff=56), TIDL's offline
import would compute derivedBias with full float precision, and the MMA
hardware would produce results matching the TFLite CPU reference (within
the hardware rounding tolerance of shift+round).
**Note on the MMA hardware mapping**: Both MMALib and TIDL use the same
hardware mapping: **A = input activations** (streamed through SE0 via
`HWALDAB`), **B = weight matrix** (loaded from SE1 via `HWALDAB`). The
`A_ATYPE = UINT8` setting allows raw unsigned input activations. The
difference is that MMALib's API forces weights to be INT8 symmetric at
the API level, while direct MMA programming gives full control over the
`C_BTYPE` field and weight format.
## 5. What Source Code Is Available
### 5.1 Available (in `c7x-mma-tidl/ti_dl/algo/`)
| File | Content |
|------|---------|
| `inc/tidl_conv2d_mma.h` | MMA config register struct, offset register, SE/SA template declarations, MMA modeling code |
| `inc/tidl_conv2d_mma_i.h` | API declarations for DSP init/process functions |
| `src/tidl_conv2d_base.c` | Reference conv2d implementation, bias/scale/shift setup, PSAT computation, dispatch logic |
| `src/tidl_alg_utils.c` | `TIDL_getMMAv2_ScaleAndShift()` — the scale/shift quantization algorithm |
| `inc/tidl_alg_int.h` | `TIDL_roundSatMMA()` — round+shift+saturate reference implementation |
### 5.2 Available (TI compiler headers)
| File | Content |
|------|---------|
| `C7524-MMA2_256F/c7x_mma.h` | Complete MMA hardware definition: all enum types, `__HWA_CONFIG_REG_v1` struct, intrinsic declarations, matrix dimension macros |
### 5.3 NOT Available (pre-compiled in `tidl_priv_algo.lib`)
| Function | What it does |
|----------|-------------|
| `TIDL_conv2dDspInitNew()` | Full initialization: allocate buffers, configure SE/SA templates, set MMA config register fields dynamically based on layer parameters |
| `TIDL_conv2dDspProcess()` | Full execution: tiled loop with DMA, MMA compute, output writeback |
| `hwaInit()` | Program `__HWAOPEN` with runtime-computed config |
| `blockConvS08_ci()` | Inner block: prefill B panels, run MMA ops, transfer results |
| `calcMMAConv()` | Outer tile loop coordinating DMA and MMA |
| `prefillBpanel_ci()` | First B panel load for double-buffer pipeline startup |
## 6. Implementation Roadmap for Thames
### 6.1 Phase 1: Basic MMA Convolution (No Post-Processing)
1. **Emit `__HWAOPEN`**: Encode the config register struct as an immediate
in the kernel binary. Use the TIDL config as a template but compute
FSM periods from the actual layer dimensions:
- `B_BSWPER = MMA_SIZE` (64 for int8)
- `C_OP0PER = MMA_SIZE`
- `C_OP1PER = (K-1) * MMA_SIZE` where K = numInChannels / 64
- `C_CWSWPER = K * MMA_SIZE`
- etc.
2. **Weight loading**: Use SE1 to stream weight rows, emit `__HWALDB` ×64
for each input channel group.
3. **Input loading**: Use SE0 to stream input activation vectors, emit
`__HWALDA` for each spatial position.
4. **Compute**: `__HWAOP` for first A×B, `__HWAOPXFER` for pipelined
compute+transfer.
5. **Result readout**: `__HWARCV` to get output vectors, `VST32B` to store.
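The period computation in step 1 might look like this (a sketch; the field names follow the config-register struct, and a real encoder needs many more fields than shown):

```c
#include <stdint.h>

#define MMA_SIZE 64  /* int8 panel dimension */

/* Sketch of computing the FSM periods from layer dimensions. */
struct fsm_periods {
    uint32_t b_bswper, c_op0per, c_op1per, c_cwswper;
};

static struct fsm_periods compute_fsm_periods(int num_in_channels)
{
    int K = num_in_channels / MMA_SIZE;         /* input-channel groups */
    struct fsm_periods p;
    p.b_bswper  = MMA_SIZE;                     /* switch B banks every 64 ops */
    p.c_op0per  = MMA_SIZE;                     /* OP0 (MUL) for the first group */
    p.c_op1per  = (uint32_t)(K - 1) * MMA_SIZE; /* OP1 (MULPLUS) for the rest */
    p.c_cwswper = (uint32_t)K * MMA_SIZE;       /* C write-bank switch per result */
    return p;
}
```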
### 6.2 Phase 2: Add Quantization
1. **Bias loading**: Pre-load derivedBias into C via `__HWALDC`. Bias
layout must match the MMA's C storage format.
2. **Scale/shift loading**: `__HWA_LOAD_2REG(scale_vec, shift_vec,
__MMA_LOAD_2REG_SCALE_SHIFT_0)` before the compute loop.
3. **Enable PSAT**: Set `X_PSAT = 1`, encode `SAT_MIN`/`SAT_MAX` in the
config register (16-bit values split across non-contiguous bitfields).
4. **Enable rounding**: `X_RE = 1` for half-LSB rounding.
5. **Enable per-channel shift**: `X_SCALE_SHIFT_CTRL = 1`,
`X_SHIFT = __MMA_X_CONFIG_SHIFT_ROW_UNSIGNED` (per-row unsigned shift).
### 6.3 Phase 3: Tiling and DMA
1. Implement block scheduling (`thames_sched_operation()`) to find optimal
tile sizes that fit in L2 SRAM.
2. Add DMA support for copying input tiles to L2 and output tiles to DDR.
3. Implement double-buffering for both B panels and input tiles.
### 6.4 Key Assembly Instructions
The C7x assembler instructions corresponding to the MMA intrinsics:
| Intrinsic | Assembly | Unit |
|-----------|----------|------|
| `__HWAOPEN` | `HWAOPEN .C2 VBL, VBL, imm` | C2 |
| `__HWALDA` | `HWALDAB .L2 VB, VB` (A-only form) | L2 |
| `__HWALDB` | `HWALDB .L2 VB` | L2 |
| `__HWALDC` | via `HWALDAB/HWALDBC` | L2 |
| `__HWALDAB` | `HWALDAB .L2 VB, VB` | L2 |
| `__HWAOP` | `HWAOPXFER .S1 imm` (op-only form) | S1 |
| `__HWAOPXFER` | `HWAOPXFER .S1 imm` | S1 |
| `__HWAXFER` | `HWAXFER .S2 imm` | S2 |
| `__HWARCV` | `HWARCVS .S2 VB` | S2 |
| `__HWACLOSE` | via control instruction | C2 |
| `__HWA_LOAD_2REG` | via HWAXFER variants | S2 |
Note: In the reference MMALib disassembly, `HWAOPXFER` on S1 and `HWALDAB`
on L2 always execute **in parallel** (same execute packet). `HWARCVS` on S2
follows with appropriate pipeline delay.
### 6.5 Required Changes to Thames
1. **`thames_coefs.c`**: Remove weight re-quantization (int8→int8 rescaling).
Compute derivedBias using the TIDL formula with zero point absorption.
Weights remain as TFLite int8 (symmetric, zp=0).
2. **`thames_compiler.c`**: New MMA convolution emitter using `asm_HWAOPEN`,
`asm_HWALDAB`, `asm_HWAOPXFER`, `asm_HWARCVS` from the C7x assembler.
3. **`thames_ml.c`**: Input data stays uint8 — no need to convert to int8.
Zero point folded into bias.
4. **New**: MMA config register encoding helper — compute `__HWA_CONFIG_REG`
fields from layer parameters and encode as 512-bit immediate for HWAOPEN.
## 7. Reference: TIDL `TIDL_roundSatMMA()`
This function models the MMA hardware's post-processing:
```c
static int64_t TIDL_roundSatMMA(int64_t val, int32_t bits, int32_t min, int32_t max)
{
int64_t temp;
if (bits > 0) {
temp = (val >> (bits - 1)) + 1; // add rounding bias
val = temp >> 1; // complete the shift
}
val = val < min ? min : val;
val = val > max ? max : val;
return val;
}
```
This is: `round(val / 2^bits)` clamped to `[min, max]`. The rounding is
"round half up" (add 0.5 × 2^bits before truncation).
## 8. Reference: MMA Config Register Layout
The `__HWA_CONFIG_REG_v1` struct is 512 bits (64 bytes = one vector).
Key sections and their bit positions (little-endian):
### A Config (bits 0–31 of first 64-bit word)
- `A_ATYPE` [3:0] — 4 bits: element type (UINT8=0, INT8=4, etc.)
- `A_ALUTEN` [8] — 1 bit: lookup table enable
- `A_ARF_CTRL` [16] — 1 bit: A register file enable
- `A_ARF_BASE` [22:17] — 6 bits: circular buffer base
- `A_ARF_SIZE` [30:24] — 7 bits: ARF array size
### B Config (bits 32–63 of first word + bits 0–63 of second word)
- `B_BSWPER` [63:32] — 32 bits: B bank switch period
- `B_BRSTPER` [7:0] — 8 bits: B offset reset period
- `B_BTYPE` [9:8] — 2 bits: element size (SIZE8=0)
- `B_ORDER` [16] — 1 bit: row/column major
- `B_BSTART` [24] — 1 bit: initial B bank
- `B_BOFFSET` [39:32] — 8 bits: global offset
### C Config (third + fourth + fifth 64-bit words)
- `C_ATYPE` [0] — 1 bit: signed/unsigned accumulator
- `C_BTYPE` [11:8] — 4 bits: bias element type
- `C_OPERATION0` [17:16] — 2 bits: first operation (MUL/MULPLUS/etc.)
- `C_OPERATION1` [25:24] — 2 bits: second operation
- `C_HWLDDST` [34:32] — 3 bits: HWALD destination
- `C_HWLDTYPE` [43:40] — 4 bits: HWALD element type
- `C_OPSTART` [48] — 1 bit: initial operation
- Various periods: `C_OP0PER`, `C_OP1PER`, `C_BSWPER`, `C_CRSWPER`, `C_CWSWPER` (32 bits each)
- Various offsets: `C_CROFFSET`, `C_CWOFFSET`, `C_CLOFFSET` (6-7 bits each)
### X Config (sixth + seventh 64-bit words)
- `X_ReLU` [0] — 1 bit: ReLU enable
- `X_PSAT` [1] — 1 bit: parameterized saturation enable
- `X_SAT_MIN` — 16 bits split across three fields (5:0, 12:6, 15:13)
- `X_SAT` [8+16] — 1 bit: standard saturation enable
- `X_RE` [16+16] — 1 bit: rounding enable
- `X_SHIFT` [38:32] — 7 bits: right shift amount (or mode selector)
- `X_SCALE_SHIFT_CTRL` [30] — 1 bit: per-row/col scale/shift enable
- `X_XTYPE` [43:40] — 4 bits: output element type
- `X_SAT_MAX` — 16 bits split across three fields
- `X_CTYPE` [50:48] — 3 bits: accumulator element type
- `X_CSWPER` — 32 bits: C read bank switch period
- `X_COFFSET` — 8 bits: C read offset
- `X_CSTART` — 1 bit: initial C bank for X reads
### Padding / Misc
- `PARITYCTRL` [63:62] — 2 bits: parity control
The config register is loaded as a single 512-bit vector via the `HWAOPEN`
instruction.
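For the Phase-1 encoding helper mentioned in Section 6, a generic field packer over the 64-byte config image might look like this (a sketch; `cfg_set_field` is our hypothetical name, and actual field positions must come from `c7x_mma.h`):

```c
#include <stdint.h>
#include <string.h>

/* Pack one field into the 512-bit (64-byte) config image, little-endian
 * bit numbering as in the layout above. */
static void cfg_set_field(uint8_t cfg[64], unsigned bitpos, unsigned width,
                          uint64_t value)
{
    for (unsigned i = 0; i < width; i++) {
        unsigned b = bitpos + i;
        uint8_t mask = (uint8_t)(1u << (b & 7u));
        if ((value >> i) & 1u)
            cfg[b >> 3] |= mask;
        else
            cfg[b >> 3] &= (uint8_t)~mask;
    }
}
```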