mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2026-03-11 11:20:41 +01:00
Temp docs on TIDL+MMA
commit 4e582a5816 (parent f24c783170)
1 changed file with 718 additions and 0 deletions
TIDL_MMA_REFERENCE.md (new file)
# How TIDL Uses the C7x MMA for Convolution
|
||||
|
||||
## Overview
|
||||
|
||||
TI's TIDL (TI Deep Learning) framework runs convolutions on the C7x DSP by
|
||||
programming the MMA (Matrix Multiply Accelerator) hardware directly, bypassing
|
||||
MMALib. This document describes the hardware, the TIDL data flow, and how it
|
||||
differs from our current MMALib-based approach. The goal is to provide a
|
||||
reference for implementing direct MMA programming in Thames.
|
||||
|
||||
## 1. C7x MMA Hardware Architecture
|
||||
|
||||
### 1.1 Variant
|
||||
|
||||
The J722S contains a C7524 DSP core with an **MMA2-256F** coprocessor
|
||||
(512-bit vector width, 64-byte vectors).
|
||||
|
||||
### 1.2 Matrix Storage Areas
|
||||
|
||||
The MMA has three internal storage areas:
|
||||
|
||||
| Storage | Total Size | Row Width | Rows (int8) | Purpose |
|
||||
|---------|-----------|-----------|-------------|---------|
|
||||
| **A** (activation) | 64 B | 64 B | 1 | Input feature vector (one row at a time) |
|
||||
| **B** (weights) | 4096 B per bank | 64 B | 64 | Weight matrix (double-buffered: 2 banks × 64 rows) |
|
||||
| **C** (accumulator) | 256 B | 256 B | 1 | Accumulation results (64 × int32 = 256 B, double-buffered) |
|
||||
|
||||
For int8 operation:
|
||||
- A holds 64 elements per row (one input vector)
|
||||
- B holds a 64×64 weight matrix (double-buffered, so B0 and B1 each hold 64×64)
|
||||
- C holds 64 int32 accumulators (double-buffered, C0 and C1)
|
||||
- One MMA operation computes: **C += A × B** (64 outputs from 64 inputs × 64×64 weights)
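
As a rough host-side illustration of the operation described above, the following sketch models a single int8 MMA step in plain C. The function name `mma_op_model` and the flat array layout are assumptions made for illustration; they are not TI APIs. A is treated as unsigned and B as signed, matching the TIDL configuration discussed in Section 3.2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MMA_SIZE 64  /* elements per row for 8-bit data, per the table above */

/* Hypothetical host-side model of one MMA operation:
 *   c[j] += sum_i a[i] * b[i * MMA_SIZE + j]
 * a: one row of A storage (64 activations)
 * b: one B bank, stored row-major (64 rows of 64 weights)
 * c: one row of 64 int32 accumulators */
static void mma_op_model(const uint8_t *a, const int8_t *b, int32_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < MMA_SIZE; j++) {
        int32_t sum = 0;
        for (size_t i = 0; i < MMA_SIZE; i++)
            sum += (int32_t)a[i] * (int32_t)b[i * MMA_SIZE + j];
        c[j] += sum;  /* MULPLUS-style accumulate */
    }
}
```

Each call corresponds to one `C += A × B` step: 64 multiply-accumulates for each of the 64 outputs.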
|
||||
|
||||
### 1.3 Data Types
|
||||
|
||||
The MMA supports configurable types for each storage area:
|
||||
|
||||
**A matrix (input activations):**
|
||||
|
||||
| Enum | Value | Description |
|
||||
|------|-------|-------------|
|
||||
| `ATYPE_UINT8` | 0 | Unsigned 8-bit |
|
||||
| `ATYPE_INT8` | 4 | Signed 8-bit |
|
||||
| `ATYPE_UINT16` | 1 | Unsigned 16-bit |
|
||||
| `ATYPE_INT16` | 5 | Signed 16-bit |
|
||||
| `ATYPE_UINT32` | 2 | Unsigned 32-bit |
|
||||
| `ATYPE_INT32` | 6 | Signed 32-bit |
|
||||
| `ATYPE_F32` | 3 | 32-bit float (MMA2-256F only) |
|
||||
|
||||
**B matrix (weights):** `SIZE8`, `SIZE16`, `SIZE32` — element size only,
|
||||
signedness is implicit (the multiply uses A's signedness for the product
|
||||
interpretation).
|
||||
|
||||
**C accumulator (output):**
|
||||
- `C_ATYPE`: `SA` (signed) or `UA` (unsigned) — accumulator signedness
|
||||
- `C_BTYPE`: configures element size/type for bias loading — `UINT8`, `INT8`,
|
||||
`UINT16`, `INT16`, `UINT32`, `INT32`, `UINT64`, `INT64`
|
||||
- `C_OPERATION0/1`: `MUL` (C = A×B), `MULNEGATE` (C = -A×B), `MULMINUS`
|
||||
(C = C - A×B), `MULPLUS` (C = C + A×B)
|
||||
|
||||
**X transfer (output conversion):**
|
||||
- `X_XTYPE`: output element type — `UINT8`, `INT8`, `UINT16`, `INT16`,
|
||||
`UINT32`, `INT32`, etc.
|
||||
- `X_CTYPE`: accumulator element type read by X FSM — `UINT32`, `INT32`,
|
||||
`UINT64`, `INT64`, `UINT128`, `INT128`, `F32`
|
||||
|
||||
### 1.4 Post-Processing Pipeline (X Transfer)
|
||||
|
||||
When transferring results out of the C accumulator, the MMA can apply a
|
||||
hardware post-processing pipeline:
|
||||
|
||||
1. **Scale + Shift** (optional, `X_SCALE_SHIFT_CTRL`): Per-row or per-column
|
||||
unsigned/signed scale and shift. Loaded via `__HWA_LOAD_2REG` into
|
||||
`HWA_SCALE0/1` and `HWA_SHIFT0/1` registers.
|
||||
|
||||
2. **Right Shift** (`X_SHIFT`): Fixed right-shift amount (7 bits). When
|
||||
`X_SCALE_SHIFT_CTRL` is enabled, this field selects per-row/per-col
|
||||
mode instead.
|
||||
|
||||
3. **Rounding** (`X_RE`): Half-LSB rounding of the shifted result (adds half
an output LSB before the shifted-out bits are truncated).
|
||||
|
||||
4. **Saturation** (`X_SAT`): Clamp to output type range.
|
||||
|
||||
5. **Parameterized Saturation** (`X_PSAT`): Clamp to custom
|
||||
`[SAT_MIN, SAT_MAX]` range (16-bit values, split across multiple
|
||||
bitfields in the config register). Used for activation clipping.
|
||||
|
||||
6. **ReLU** (`X_ReLU`): Rectified linear unit (clamp negatives to 0).
|
||||
Applied after PSAT.
|
||||
|
||||
7. **Type Conversion**: Convert from accumulator type (`X_CTYPE`) to output
|
||||
type (`X_XTYPE`).
|
||||
|
||||
The hardware pipeline order is:
|
||||
**accumulate → (scale ×) → shift → round → saturate → PSAT → ReLU → output**
|
||||
|
||||
### 1.5 Finite State Machines
|
||||
|
||||
The MMA has three FSMs that sequence operations automatically:
|
||||
|
||||
- **B FSM**: Controls writing to B matrix banks (double-buffering), row
|
||||
offsets, bank switching periods
|
||||
- **C FSM**: Controls accumulator read/write banks, offsets, operation
|
||||
selection (MUL vs MULPLUS), bias loading destinations
|
||||
- **X FSM**: Controls transfer from C to output, bank switching for reads,
|
||||
offset sequencing
|
||||
|
||||
The FSM periods (e.g., `B_BSWPER`, `C_CWSWPER`, `X_CSWPER`) are expressed
|
||||
in units of MMA operations and control when banks switch, offsets reset, and
|
||||
operations alternate between OP0 and OP1.
|
||||
|
||||
### 1.6 Streaming Engines (SE) and Streaming Address Generators (SA)
|
||||
|
||||
The C7x has hardware streaming engines for efficient memory access:
|
||||
|
||||
- **SE0, SE1**: Read-only streaming engines — fetch data from memory in
|
||||
configurable multi-dimensional patterns (up to 6D: `ICNT0`..`ICNT5`,
|
||||
`DIM1`..`DIM5`). Opened with `__SE_OPEN` using a `__STRM_TEMPLATE`
|
||||
configuration struct.
|
||||
|
||||
- **SA0, SA1, SA2, SA3**: Streaming Address generators — produce address
|
||||
sequences for MMA transfer masking. Opened with `__SA_OPEN`. Can
|
||||
generate multi-dimensional address patterns used with `__HWAXFER_MASK`
|
||||
to control which C accumulator rows are transferred.
|
||||
|
||||
The SE templates configure:
|
||||
- `DIMFMT`: Number of dimensions (up to 6D)
|
||||
- `ELETYPE`: Element type (`__SE_ELETYPE_8BIT`, `__SE_ELETYPE_16BIT`, etc.)
|
||||
- `VECLEN`: Vector length (`__SE_VECLEN_64BYTES` for 512-bit)
|
||||
- `ICNT0..ICNT5`: Iteration counts per dimension
|
||||
- `DIM1..DIM5`: Stride (in bytes) for each dimension
|
||||
- `CBK0/CBK1`: Circular buffer configuration
|
||||
- `DECDIM1/2`: Decrementing dimension support
|
||||
|
||||
These are used to stream input activations through SE0 and weight rows
|
||||
through SE1 (or to load them into the MMA A and B storage directly).
|
||||
|
||||
## 2. MMA Intrinsics API
|
||||
|
||||
The C7x compiler provides intrinsics that map to MMA hardware instructions:
|
||||
|
||||
### 2.1 Configuration
|
||||
|
||||
```c
|
||||
void __HWAOPEN(__HWA_CONFIG_REG_v1 config,
|
||||
__HWA_OFFSET_REG offsets,
|
||||
__MMA_OPEN_FSM fsm_select);
|
||||
```
|
||||
Opens and configures the MMA. `fsm_select` is an `__MMA_OPEN_FSM` enum
|
||||
value (`__MMA_OPEN_FSM_RESET = 0` resets all FSMs).
|
||||
|
||||
### 2.2 Data Loading
|
||||
|
||||
```c
|
||||
void __HWALDA(__mma_vec src); // Load one row into A storage
|
||||
void __HWALDB(__mma_vec src); // Load one row into B storage
|
||||
void __HWALDC(__mma_vec src); // Load one row into C storage (bias)
|
||||
void __HWALDAB(__mma_vec a, __mma_vec b); // Load A and B simultaneously
|
||||
void __HWALDBC(__mma_vec b, __mma_vec c); // Load B and C simultaneously
|
||||
```
|
||||
|
||||
For efficient pipelining, `__HWALDAB` loads both an input activation row
|
||||
(into A) and a weight row (into B) in a single instruction.
|
||||
|
||||
### 2.3 Scale/Shift Loading (MMA2+)
|
||||
|
||||
```c
|
||||
void __HWA_LOAD_2REG(__mma_vec src1, __mma_vec src2, __MMA_LOAD_2REG dest);
|
||||
```
|
||||
|
||||
Loads per-channel scale and shift vectors into MMA registers:
|
||||
- `__MMA_LOAD_2REG_SCALE_SHIFT_0`: src1 → HWA_SCALE0, src2 → HWA_SHIFT0
|
||||
- `__MMA_LOAD_2REG_SCALE_SHIFT_1`: src1 → HWA_SCALE1, src2 → HWA_SHIFT1
|
||||
|
||||
These are used for per-channel quantization: scale[ch] and shift[ch] values
|
||||
are packed into vectors and loaded before the compute loop.
|
||||
|
||||
### 2.4 Compute
|
||||
|
||||
```c
|
||||
void __HWAOP(__MMA_A_SOURCE_SELECT src); // Perform one MMA operation
|
||||
void __HWAOPXFER(__MMA_A_SOURCE_SELECT src); // Operate + transfer in parallel
|
||||
```
|
||||
|
||||
`__MMA_A_LDA` (= 0): Use A vector from most recent `__HWALDA`.
|
||||
When using the A Register File (ARF), can specify `__MMA_A_ARF_ROW_SA0`
|
||||
through `__MMA_A_ARF_ROW_SA3` (with optional `ADV` variants that advance
|
||||
the SA).
|
||||
|
||||
`__HWAOPXFER` performs an MMA operation and a C→output transfer
|
||||
simultaneously, enabling pipelined execution where one accumulator bank
|
||||
computes while the other bank's results are being transferred out.
|
||||
|
||||
### 2.5 Transfer (Result Readout)
|
||||
|
||||
```c
|
||||
void __HWAXFER(__MMA_XFER_SRC src); // Load transfer buffer
|
||||
__mma_vec __HWARCV(int32_t index); // Read from transfer buffer
|
||||
```
|
||||
|
||||
`__HWAXFER(__MMA_XFER_SRC_C)`: Transfer C accumulator contents through the
|
||||
post-processing pipeline into the transfer buffer.
|
||||
|
||||
With masking (MMA2+):
|
||||
```c
|
||||
__HWAXFERB(src, mask); // Transfer with byte-granularity masking
|
||||
__HWAXFERH(src, mask); // Transfer with halfword-granularity masking
|
||||
```
|
||||
|
||||
The mask comes from a Streaming Address generator (SA0–SA3):
|
||||
- `__MMA_XFER_MASK_PSA0` through `PSA3`: Use SA for write masking
|
||||
- `__MMA_XFER_MASK_PSA0ADV` through `PSA3ADV`: Use SA + advance
|
||||
|
||||
### 2.6 Status and Cleanup
|
||||
|
||||
```c
__HWACLOSE(0);                      // Close MMA
void __HWARESET();                  // Reopen with previous config (MMA2+)
void __HWAXFER_XSTATUS_DELAYED();   // Read min/max range statistics
```
|
||||
|
||||
## 3. TIDL's Convolution Architecture
|
||||
|
||||
### 3.1 High-Level Flow
|
||||
|
||||
TIDL processes convolutions in a tiled, block-based manner:
|
||||
|
||||
```
|
||||
for each output-channel group (64 channels at a time):
|
||||
for each spatial tile:
|
||||
1. DMA input tile into L2 scratch buffer
|
||||
2. Open MMA with config registers
|
||||
3. Load bias into C accumulator (via HWALDC)
|
||||
4. Load per-channel scale/shift (via __HWA_LOAD_2REG)
|
||||
5. for each input-channel group (64 at a time):
|
||||
Load 64 weight rows into B (via HWALDB × 64)
|
||||
for each spatial position in tile:
|
||||
Load input vector into A (via HWALDA)
|
||||
Trigger MMA operation (HWAOP or HWAOPXFER)
|
||||
6. Transfer C results out (HWAXFER/HWARCV)
|
||||
7. Store output tile via DMA
|
||||
```
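
The loop nest above can be checked against a plain host-side reference. The sketch below follows the same blocking (64 output channels, 64 input channels) for a 1×1 convolution and produces raw int32 accumulators. The function name `conv1x1_blocked` and the data layouts are assumptions for illustration; spatial tiling, DMA and the X transfer step are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLK 64  /* channel group size used in the flow above */

/* Hypothetical host-side reference for a 1x1 convolution.
 * in:      [positions][in_ch]  uint8 activations
 * w:       [in_ch][out_ch]     int8 weights
 * bias:    [out_ch]            int32 bias
 * acc_out: [positions][out_ch] raw int32 accumulators */
static void conv1x1_blocked(const uint8_t *in, const int8_t *w, const int32_t *bias,
                            int32_t *acc_out, size_t positions,
                            size_t in_ch, size_t out_ch)
{
    assert(in != NULL && w != NULL && bias != NULL && acc_out != NULL);
    for (size_t oc0 = 0; oc0 < out_ch; oc0 += BLK) {       /* output-channel group */
        size_t oc_end = oc0 + BLK < out_ch ? oc0 + BLK : out_ch;
        for (size_t p = 0; p < positions; p++)              /* step 3: bias into C */
            for (size_t oc = oc0; oc < oc_end; oc++)
                acc_out[p * out_ch + oc] = bias[oc];
        for (size_t ic0 = 0; ic0 < in_ch; ic0 += BLK) {     /* step 5: input-channel group */
            size_t ic_end = ic0 + BLK < in_ch ? ic0 + BLK : in_ch;
            for (size_t p = 0; p < positions; p++)          /* one A row per position */
                for (size_t oc = oc0; oc < oc_end; oc++) {
                    int32_t sum = 0;
                    for (size_t ic = ic0; ic < ic_end; ic++)
                        sum += (int32_t)in[p * in_ch + ic] * w[ic * out_ch + oc];
                    acc_out[p * out_ch + oc] += sum;
                }
        }
    }
}
```

A reference like this is mainly useful for comparing accumulator values before scale and shift are applied.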
|
||||
|
||||
### 3.2 MMA Configuration for uint8 Convolution
|
||||
|
||||
TIDL's config register struct (`configRegisterStruct_i8s_i8s_o8s` in
|
||||
`tidl_conv2d_mma.h`) uses these critical settings:
|
||||
|
||||
```c
// A config (input activations)
.A_ATYPE      = A_CONFIG_ATYPE_UINT8,     // *** UNSIGNED uint8 inputs ***

// B config (weights)
.B_BTYPE      = B_CONFIG_SIZE8,           // 8-bit weight elements
.B_ORDER      = B_CONFIG_ROW,             // Row-major weight layout
.B_BSWPER     = 64,                       // B bank switch every 64 ops

// C config (accumulator)
.C_ATYPE      = C_CONFIG_ATYPE_SA,        // SIGNED accumulator (int32)
.C_BTYPE      = C_CONFIG_BTYPE_UINT8,     // Bias loading type
.C_OPERATION0 = C_CONFIG_MUL,             // First op: C = A × B
.C_OPERATION1 = C_CONFIG_MULPLUS,         // Subsequent ops: C += A × B
.C_HWLDDST    = C_CONFIG_HWLDDST_X1,      // Bias goes to C accumulator ×1
.C_HWLDTYPE   = C_CONFIG_HWLDTYPE_INT8,
.C_OP0PER     = 64,                       // OP0 for first 64 ops (init)
.C_OP1PER     = (K-1)*64,                 // OP1 for remaining ops (accumulate)

// X config (output transfer)
.X_XTYPE      = X_CONFIG_XTYPE_UINT8,     // Output type: uint8
.X_CTYPE      = X_CONFIG_CTYPE_UINT32,    // Accumulator type: uint32 (read as unsigned!)
.X_SHIFT      = OUT_SHIFT,                // Right-shift amount
.X_ReLU       = 0,                        // ReLU disabled in config (done via PSAT)
.X_SAT        = 0,                        // Standard saturation disabled
.X_RE         = 0,                        // Rounding disabled in this config
.X_CSTART     = 1,                        // X reads from opposite bank to C writes
```
|
||||
|
||||
**The critical insight: `A_ATYPE = UINT8`.** TIDL keeps input activations
|
||||
as unsigned uint8, eliminating the need to subtract the input zero point
|
||||
before the MMA multiply. The zero point correction is absorbed into the bias
|
||||
instead (see Section 4).
|
||||
|
||||
### 3.3 Operation Sequencing
|
||||
|
||||
The MMA uses a two-operation scheme with FSM-controlled alternation:
|
||||
|
||||
1. **Operation 0 (MUL)**: `C = A × B` — Initializes the accumulator with
|
||||
the first partial product. Runs for `C_OP0PER = 64` operations (one
|
||||
full B matrix multiply).
|
||||
|
||||
2. **Operation 1 (MULPLUS)**: `C = C + A × B` — Accumulates subsequent
|
||||
partial products. Runs for `C_OP1PER = (K-1)*64` operations (remaining
|
||||
K-1 input channel groups).
|
||||
|
||||
After K groups of 64 input channels, the C accumulator contains the full
|
||||
convolution result for 64 output channels, plus the pre-loaded bias. The
|
||||
X FSM then transfers the result out with shift + saturation + type conversion.
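
The OP0/OP1 alternation can be pictured with a small counter model. This is a sketch only: `c_fsm_operation` is a hypothetical helper, and it assumes the FSM starts in OP0 and that the pattern repeats with period `C_OP0PER + C_OP1PER`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: returns 0 if the op_index-th MMA operation uses
 * C_OPERATION0 (MUL) and 1 if it uses C_OPERATION1 (MULPLUS). */
static int c_fsm_operation(uint64_t op_index, uint32_t op0per, uint32_t op1per)
{
    assert(op0per > 0);
    uint64_t phase = op_index % ((uint64_t)op0per + op1per);
    return phase < op0per ? 0 : 1;
}
```

With K = 3 (`op0per = 64`, `op1per = 128`), operations 0 to 63 initialize the accumulator, 64 to 191 accumulate, and operation 192 starts the next output block.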
|
||||
|
||||
### 3.4 Double Buffering
|
||||
|
||||
Both B and C storage are double-buffered:
|
||||
|
||||
- **B double-buffering**: While the MMA computes using B bank 0, new weight
|
||||
rows are loaded into B bank 1 via `__HWALDB`. `B_BSWPER` controls when
|
||||
banks alternate.
|
||||
|
||||
- **C double-buffering**: While results from C bank 0 are being transferred
|
||||
out via `__HWAXFER`, new computations write to C bank 1. `C_CWSWPER`
|
||||
and `X_CSWPER` must be configured so the X FSM reads from the bank that
|
||||
the C FSM just finished writing.
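
A minimal way to reason about bank switching is to treat each FSM as a counter that toggles its bank every `swper` operations. The helper `bank_for_op` below is hypothetical and deliberately simplified: it ignores offsets and reset periods and only illustrates how a switch period and a start bank interact.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: bank (0 or 1) in use at the op_index-th operation,
 * given a switch period in MMA operations and an initial bank. */
static unsigned bank_for_op(uint64_t op_index, uint32_t swper, unsigned start_bank)
{
    assert(swper > 0 && start_bank <= 1);
    return (unsigned)((start_bank + op_index / swper) & 1u);
}
```

With equal periods, a writer starting at bank 0 and a reader starting at bank 1 (as with `X_CSTART = 1`) are always on opposite banks in this model.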
|
||||
|
||||
### 3.5 Streaming Engine Configuration
|
||||
|
||||
TIDL configures the streaming engines for efficient memory access:
|
||||
|
||||
- **SE0** (or SE1): Streams input activation vectors from L2 SRAM
|
||||
- `ELETYPE = __SE_ELETYPE_8BIT`
|
||||
- `VECLEN = __SE_VECLEN_64BYTES`
|
||||
- `ICNT0 = 64` (one vector = 64 bytes)
|
||||
- Higher dimensions iterate over spatial positions and channel groups
|
||||
- May use circular buffering (`CBK0`) for L2 scratch memory
|
||||
|
||||
- **SA0/SA1/SA2**: Generate address sequences for:
|
||||
- Write-masking partial output tiles (when output channels aren't a
|
||||
multiple of 64)
|
||||
- Controlling which C accumulator rows to transfer
|
||||
|
||||
### 3.6 DSP Kernel Functions (Pre-compiled)
|
||||
|
||||
The actual DSP kernel code is in `tidl_priv_algo.lib` (pre-compiled, no
|
||||
source available). The key functions are:
|
||||
|
||||
| Function | Purpose |
|
||||
|----------|---------|
|
||||
| `TIDL_conv2dDspInitNew()` | Allocate scratch buffers, configure SE/SA templates, set up MMA config |
|
||||
| `TIDL_conv2dDspProcess()` | Execute the tiled convolution loop on the DSP |
|
||||
| `hwaInit()` | Program MMA config register via `__HWAOPEN` |
|
||||
| `blockConvS08_ci()` | Inner convolution block: B-panel fill + MMA compute loop |
|
||||
| `calcMMAConv()` | Outer loop orchestrating tiles, DMA, and MMA blocks |
|
||||
| `prefillBpanel_ci()` | Pre-load first B panel for double-buffering startup |
|
||||
|
||||
### 3.7 L2 Memory Layout
|
||||
|
||||
TIDL allocates a fixed L2 scratch buffer:
|
||||
|
||||
```c
|
||||
#define L2_MEM_SIZE (256*1024) // Total: 256 KB
|
||||
#define INFEAT_L2_MEM_SIZE (128*1024) // Input features: 128 KB (power of 2 for SE circ buf)
|
||||
#define LEFT_L2_MEM_SIZE (L2_MEM_SIZE - INFEAT_L2_MEM_SIZE) // Weights + bias: 128 KB
|
||||
```
|
||||
|
||||
Input features are stored in a circular buffer in L2 SRAM, accessed via the
|
||||
streaming engine with circular buffer addressing. This allows the SE to wrap
|
||||
around when reading sliding-window convolution inputs.
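
The power-of-two size is what makes the wrap cheap: a linear offset into the input-feature region can be reduced with a mask. The helper `circ_wrap` below is a hypothetical host-side illustration of that arithmetic, not the SE's actual `CBK0` encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of circular addressing within a power-of-two buffer. */
static uint32_t circ_wrap(uint32_t linear_offset, uint32_t buf_size)
{
    /* buf_size must be a non-zero power of two for the mask to be valid */
    assert(buf_size != 0 && (buf_size & (buf_size - 1u)) == 0);
    return linear_offset & (buf_size - 1u);
}
```

For the 128 KB input-feature buffer, an offset of `128*1024 + 64` wraps back to byte 64.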
|
||||
|
||||
### 3.8 MMA Modeling Mode
|
||||
|
||||
`tidl_conv2d_mma.h` includes an `#ifdef MMA_MODELING` path that implements
|
||||
the MMA computation in software using C arrays:
|
||||
|
||||
```c
|
||||
int32_t cPanel[2][64*64]; // C accumulator (2 banks × 64×64 int32)
|
||||
char64 bPanel[2][64]; // B weight matrix (2 banks × 64 rows)
|
||||
char64 bPanelT[64]; // Transposed B panel
|
||||
```
|
||||
|
||||
Functions like `mmaOP()`, `prefillBpanel()`, `transposeBPanel()`, and
|
||||
`updateState()` simulate the hardware FSM behavior. This provides a
|
||||
bit-accurate software model of the MMA for validation.
|
||||
|
||||
## 4. Quantization: TIDL vs Thames/MMALib
|
||||
|
||||
### 4.1 The Core Difference
|
||||
|
||||
Both MMALib and TIDL map **input activations → MMA A** (with `A_ATYPE =
|
||||
UINT8`) and **weights → MMA B** (with `B_BTYPE = SIZE8`). The MMA hardware
|
||||
A/B mapping is the same in both cases.
|
||||
|
||||
**MMALib** (our current approach):
|
||||
- MMALib's `convolveBias_row` API requires weights as **INT8** (signed,
|
||||
zero point = 0) — this is an API constraint, not an MMA hardware
|
||||
constraint
|
||||
- For TFLite models using the **older uint8 quantization** (weights stored
|
||||
as uint8 with non-zero zero point, e.g., w_zp=133), we must re-quantize:
|
||||
`w_int8 = clamp(w_uint8 - w_zp, -128, 127)`
|
||||
- When `w_uint8 - w_zp` falls outside [-128, 127], the weight overflows int8
and the channel's weights must be rescaled by 0.5×, introducing rounding error
|
||||
- The rescaling factor is compensated by doubling the per-channel scale,
|
||||
but rounding errors from halving the weights accumulate
|
||||
- For TFLite **full-integer quantization** (weights as int8, zp=0), this
|
||||
is not an issue — weights are already in the required format
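
The re-quantization steps listed above can be sketched for a single output channel as follows. This is an illustrative assumption of how such a step might look, not the Thames source: the function name `requant_channel_u8_to_i8` is hypothetical, and the rounding mode for halved weights (round half away from zero) is a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: converts one channel of uint8 weights to int8.
 * Returns the factor (1 or 2) by which the channel's scale must be
 * multiplied to compensate for any halving of the weights. */
static int requant_channel_u8_to_i8(const uint8_t *w_u8, size_t n, int w_zp,
                                    int8_t *w_i8)
{
    assert(w_u8 != NULL && w_i8 != NULL);
    int halve = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)w_u8[i] - w_zp;
        if (d < -128 || d > 127)
            halve = 1;  /* at least one weight does not fit in int8 */
    }
    for (size_t i = 0; i < n; i++) {
        int d = (int)w_u8[i] - w_zp;
        if (halve) {
            d = d >= 0 ? (d + 1) / 2 : (d - 1) / 2;  /* assumed rounding mode */
            if (d > 127)
                d = 127;  /* +255 would round to 128 */
        }
        w_i8[i] = (int8_t)d;
    }
    return halve ? 2 : 1;
}
```

With `w_zp = 133`, a raw weight of 0 maps to -133, which forces the channel to be halved; odd differences such as -133 then lose half an LSB, which is the rounding error described above.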
|
||||
|
||||
**TIDL** (direct MMA):
|
||||
- TIDL configures `C_BTYPE = C_CONFIG_BTYPE_UINT8` — the B matrix (weights)
|
||||
is interpreted as unsigned uint8 in the C FSM / bias table context
|
||||
- The A_ATYPE = UINT8 allows raw unsigned input activations without
|
||||
subtracting the input zero point
|
||||
- All zero point corrections (both input and output) are absorbed into
|
||||
the derivedBias term during model import
|
||||
- For TFLite full-integer quantization models (int8 symmetric weights),
|
||||
the weights are used as-is
|
||||
- The MMA hardware handles the mixed-signedness product correctly:
|
||||
unsigned_A × signed/unsigned_B → signed int32 accumulator
|
||||
- No weight re-quantization, no rescaling, no rounding error
|
||||
|
||||
### 4.2 TIDL's Quantization Math
|
||||
|
||||
For TFLite asymmetric quantization (`TIDL_QuantStyleAsymNP2_TFL`):
|
||||
|
||||
**Scale and Shift (per output channel):**
|
||||
|
||||
$$\text{scaleRatio}[c] = \frac{S_x \cdot S_w[c]}{S_y}$$
|
||||
|
||||
where $S_x$, $S_w[c]$, $S_y$ are the input, per-channel weight, and output
|
||||
scales from TFLite.
|
||||
|
||||
`TIDL_getMMAv2_ScaleAndShift()` converts this floating-point ratio into a
|
||||
uint8 scale and uint8 shift:
|
||||
|
||||
```c
void TIDL_getMMAv2_ScaleAndShift(float scaleRatio, uint8_t *scale, uint8_t *shift)
{
    int32_t shiftBits = 0;
    float newScaleRatio = scaleRatio;
    while (1) {
        newScaleRatio *= 2;
        if (shiftBits >= 40) break;
        if (newScaleRatio > 255.0) { newScaleRatio /= 2; break; }
        shiftBits++;
    }
    *shift = shiftBits;
    *scale = (uint8_t)(newScaleRatio + 0.5);
}
```
|
||||
|
||||
This is the same "doubling" algorithm we use in Thames (`compute_quant()`).
|
||||
The result approximates: $\text{scaleRatio} \approx \text{scale} / 2^{\text{shift}}$
|
||||
|
||||
**Bias (per output channel) — direct TFLite path:**
|
||||
|
||||
$$\text{derivedBias}[c] = \text{bias}_{\text{TFLite}}[c] + z_y \cdot \frac{S_y}{S_x \cdot S_w[c]} - z_x \cdot \sum_i W_q[c][i]$$
|
||||
|
||||
where:
|
||||
- $\text{bias}_{\text{TFLite}}[c]$ is the int32 bias from the TFLite model
|
||||
- $z_y$ is the output zero point
|
||||
- $z_x$ is the input zero point
|
||||
- $W_q[c][i]$ are the quantized int8 weights for output channel $c$
|
||||
- $\frac{S_y}{S_x \cdot S_w[c]}$ is `nScale` = $1 / \text{scaleRatio}[c]$
|
||||
|
||||
The $-z_x \cdot \sum W$ term pre-computes the input zero point correction,
|
||||
so the MMA can operate on raw unsigned activation values without subtracting
|
||||
$z_x$ first. The $z_y \cdot \text{nScale}$ term accounts for the output zero
|
||||
point.
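
The bias formula can be written out as a short host-side helper. This is a sketch under assumptions: `derived_bias` is a hypothetical name, and rounding the $z_y \cdot \text{nScale}$ term to the nearest integer is a choice made here, not something confirmed by the TIDL sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bias + z_y * nScale - z_x * sum(W) for one output channel.
 * n_scale is S_y / (S_x * S_w[c]). */
static int32_t derived_bias(int32_t bias, int32_t z_x, int32_t z_y, double n_scale,
                            const int8_t *w, size_t n)
{
    assert(w != NULL);
    int32_t w_sum = 0;
    for (size_t i = 0; i < n; i++)
        w_sum += w[i];
    double zy_term = (double)z_y * n_scale;
    /* round to nearest, half away from zero (assumed rounding mode) */
    int32_t zy_rounded = (int32_t)(zy_term >= 0.0 ? zy_term + 0.5 : zy_term - 0.5);
    return bias + zy_rounded - z_x * w_sum;
}
```

With weights `{1, -2, 3}`, bias 100 and `z_x = 128`, the derived bias is -156 when `z_y = 0`. Adding the raw product sum for activations `{130, 128, 126}` gives 96, the same value as `bias + Σ (a - z_x)·w`.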
|
||||
|
||||
**Post-accumulation (hardware):**
|
||||
|
||||
The MMA hardware computes:
|
||||
|
||||
$$\text{acc}[c] = \text{derivedBias}[c] + \sum_i A_{\text{uint8}}[i] \cdot W_{\text{int8}}[c][i]$$
|
||||
|
||||
Then the X transfer pipeline applies:
|
||||
|
||||
$$\text{out}[c] = \text{clamp}\left(\left\lfloor \frac{\text{acc}[c] \cdot \text{scale}[c]}{2^{\text{shift}[c]}} + 0.5 \right\rfloor, \text{minPSAT}, \text{maxPSAT}\right)$$
|
||||
|
||||
**PSAT bounds:** For uint8 output with activation clipping:
|
||||
|
||||
```c
|
||||
minPSAT = round(clipMin / outScale) + outZeroPoint; // typically 0
|
||||
maxPSAT = round(clipMax / outScale) + outZeroPoint; // typically 255
|
||||
```
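
As a worked instance of the bound computation, the helper below evaluates one PSAT bound from a float clip value. The name `psat_bound` is hypothetical, and `round` is modeled as round half away from zero, which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round(clip / out_scale) + out_zero_point. */
static int32_t psat_bound(float clip, float out_scale, int32_t out_zero_point)
{
    assert(out_scale > 0.0f);
    float q = clip / out_scale;
    int32_t r = (int32_t)(q >= 0.0f ? q + 0.5f : q - 0.5f);
    return r + out_zero_point;
}
```

For a clip range of [0, 6] with `outScale = 0.5` and `outZeroPoint = 10`, the bounds come out as 10 and 22.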
|
||||
|
||||
### 4.3 Weight Treatment
|
||||
|
||||
TFLite full-integer quantized models store weights as **int8** (symmetric,
|
||||
zero point = 0). TIDL keeps them as-is — no re-quantization. The weights
|
||||
are loaded into the B matrix storage, and the MMA treats them as 8-bit
|
||||
values (signedness is determined by how the multiply interprets the product,
|
||||
controlled by `A_ATYPE` and `C_ATYPE`).
|
||||
|
||||
Since `A_ATYPE = UINT8` (unsigned inputs) and `C_ATYPE = SA` (signed
|
||||
accumulator), the multiply computes a **mixed-signedness** product: unsigned
|
||||
activation × signed/unsigned weight → signed 32-bit accumulator. This
|
||||
matches the TFLite quantization semantics exactly.
|
||||
|
||||
### 4.4 Comparison Table
|
||||
|
||||
| Aspect | MMALib (Thames current) | TIDL (direct MMA) |
|
||||
|--------|------------------------|-------------------|
|
||||
| Input type (MMA A) | UINT8 | UINT8 |
|
||||
| Weight type (MMA B) | INT8 (API constraint) | INT8 or UINT8 (flexible) |
|
||||
| A_ATYPE | UINT8 | UINT8 |
|
||||
| C_BTYPE | N/A (MMALib internal) | UINT8 |
|
||||
| Weight re-quantization | Required for uint8 models | Not needed |
|
||||
| Weight rescaling | 0.5× for overflows, compensated in scale | Not needed |
|
||||
| Input zero point | Subtracted from bias term | Subtracted from bias term |
|
||||
| Output zero point | Added to bias term | Added to bias term |
|
||||
| Scale/shift | Per-channel uint8/uint8 | Per-channel uint8/uint8 |
|
||||
| Post-processing | MMALib software (scaleShiftRoundAndReLU) | MMA hardware pipeline |
|
||||
| Rounding error source | Weight halving + int8 clamping | None from weights |
|
||||
| Per-channel scale load | MMALib internal | `__HWA_LOAD_2REG` intrinsic |
|
||||
| Activation | Software ReLU in MMALib | HW PSAT + ReLU in X pipeline |
|
||||
|
||||
### 4.5 Why TIDL Has No Accuracy Problem
|
||||
|
||||
With direct MMA programming, TIDL avoids the accuracy issue we hit on
|
||||
Model 2 (input_zp=128, weight_zp=133):
|
||||
|
||||
1. TIDL's model import tool handles weight re-quantization offline (not at
|
||||
runtime), and for TFLite full-integer quantization, weights are already
|
||||
int8 symmetric (zp=0) → no re-quantization needed
|
||||
2. If the original model has uint8 weights, TIDL's import can dequantize
|
||||
to float and re-quantize with full precision during model compilation
|
||||
3. The full zero point correction (for both input and output) is absorbed
|
||||
into the derivedBias at import time → exact computation
|
||||
4. The hardware pipeline handles scale/shift/round/saturate atomically
|
||||
|
||||
For Model 2, where MMALib requires runtime rescaling of 19/32 channels
|
||||
(causing cumulative rounding errors summing to diff=56), TIDL's offline
|
||||
import would compute derivedBias with full float precision, and the MMA
|
||||
hardware would produce results matching the TFLite CPU reference (within
|
||||
the hardware rounding tolerance of shift+round).
|
||||
|
||||
**Note on the MMA hardware mapping**: Both MMALib and TIDL use the same
|
||||
hardware mapping: **A = input activations** (streamed through SE0 via
|
||||
`HWALDAB`), **B = weight matrix** (loaded from SE1 via `HWALDAB`). The
|
||||
`A_ATYPE = UINT8` setting allows raw unsigned input activations. The
|
||||
difference is that MMALib's API forces weights to be INT8 symmetric at
|
||||
the API level, while direct MMA programming gives full control over the
|
||||
`C_BTYPE` field and weight format.
|
||||
|
||||
## 5. What Source Code Is Available
|
||||
|
||||
### 5.1 Available (in `c7x-mma-tidl/ti_dl/algo/`)
|
||||
|
||||
| File | Content |
|
||||
|------|---------|
|
||||
| `inc/tidl_conv2d_mma.h` | MMA config register struct, offset register, SE/SA template declarations, MMA modeling code |
|
||||
| `inc/tidl_conv2d_mma_i.h` | API declarations for DSP init/process functions |
|
||||
| `src/tidl_conv2d_base.c` | Reference conv2d implementation, bias/scale/shift setup, PSAT computation, dispatch logic |
|
||||
| `src/tidl_alg_utils.c` | `TIDL_getMMAv2_ScaleAndShift()` — the scale/shift quantization algorithm |
|
||||
| `inc/tidl_alg_int.h` | `TIDL_roundSatMMA()` — round+shift+saturate reference implementation |
|
||||
|
||||
### 5.2 Available (TI compiler headers)
|
||||
|
||||
| File | Content |
|
||||
|------|---------|
|
||||
| `C7524-MMA2_256F/c7x_mma.h` | Complete MMA hardware definition: all enum types, `__HWA_CONFIG_REG_v1` struct, intrinsic declarations, matrix dimension macros |
|
||||
|
||||
### 5.3 NOT Available (pre-compiled in `tidl_priv_algo.lib`)
|
||||
|
||||
| Function | What it does |
|
||||
|----------|-------------|
|
||||
| `TIDL_conv2dDspInitNew()` | Full initialization: allocate buffers, configure SE/SA templates, set MMA config register fields dynamically based on layer parameters |
|
||||
| `TIDL_conv2dDspProcess()` | Full execution: tiled loop with DMA, MMA compute, output writeback |
|
||||
| `hwaInit()` | Program `__HWAOPEN` with runtime-computed config |
|
||||
| `blockConvS08_ci()` | Inner block: prefill B panels, run MMA ops, transfer results |
|
||||
| `calcMMAConv()` | Outer tile loop coordinating DMA and MMA |
|
||||
| `prefillBpanel_ci()` | First B panel load for double-buffer pipeline startup |
|
||||
|
||||
## 6. Implementation Roadmap for Thames
|
||||
|
||||
### 6.1 Phase 1: Basic MMA Convolution (No Post-Processing)
|
||||
|
||||
1. **Emit `__HWAOPEN`**: Store the config register struct as a 512-bit constant
in the kernel binary and pass it to `HWAOPEN` as a vector operand. Use the
TIDL config as a template but compute FSM periods from the actual layer
dimensions:
|
||||
- `B_BSWPER = MMA_SIZE` (64 for int8)
|
||||
- `C_OP0PER = MMA_SIZE`
|
||||
- `C_OP1PER = (K-1) * MMA_SIZE` where K = ceil(numInChannels / 64)
|
||||
- `C_CWSWPER = K * MMA_SIZE`
|
||||
- etc.
|
||||
|
||||
2. **Weight loading**: Use SE1 to stream weight rows, emit `__HWALDB` ×64
|
||||
for each input channel group.
|
||||
|
||||
3. **Input loading**: Use SE0 to stream input activation vectors, emit
|
||||
`__HWALDA` for each spatial position.
|
||||
|
||||
4. **Compute**: `__HWAOP` for first A×B, `__HWAOPXFER` for pipelined
|
||||
compute+transfer.
|
||||
|
||||
5. **Result readout**: `__HWARCV` to get output vectors, `VST32B` to store.
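
The FSM periods listed in step 1 could be derived from the layer's input channel count along these lines. The struct `mma_periods` and function `compute_periods` are hypothetical, and rounding K up for channel counts that are not a multiple of 64 is an assumption; the remaining period fields are not covered.

```c
#include <assert.h>
#include <stdint.h>

#define MMA_SIZE 64u  /* 64 for int8, as stated above */

/* Hypothetical container for the periods named in step 1. */
struct mma_periods {
    uint32_t b_bswper;
    uint32_t c_op0per;
    uint32_t c_op1per;
    uint32_t c_cwswper;
};

static struct mma_periods compute_periods(uint32_t num_in_channels)
{
    assert(num_in_channels > 0);
    /* K = number of 64-channel input groups, rounded up (assumption) */
    uint32_t k = (num_in_channels + MMA_SIZE - 1u) / MMA_SIZE;
    struct mma_periods p;
    p.b_bswper  = MMA_SIZE;
    p.c_op0per  = MMA_SIZE;
    p.c_op1per  = (k - 1u) * MMA_SIZE;
    p.c_cwswper = k * MMA_SIZE;
    return p;
}
```

For 192 input channels this gives K = 3, so `C_OP1PER = 128` and `C_CWSWPER = 192`.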
|
||||
|
||||
### 6.2 Phase 2: Add Quantization
|
||||
|
||||
1. **Bias loading**: Pre-load derivedBias into C via `__HWALDC`. Bias
|
||||
layout must match the MMA's C storage format.
|
||||
|
||||
2. **Scale/shift loading**: `__HWA_LOAD_2REG(scale_vec, shift_vec,
|
||||
__MMA_LOAD_2REG_SCALE_SHIFT_0)` before the compute loop.
|
||||
|
||||
3. **Enable PSAT**: Set `X_PSAT = 1`, encode `SAT_MIN`/`SAT_MAX` in the
|
||||
config register (16-bit values split across non-contiguous bitfields).
|
||||
|
||||
4. **Enable rounding**: `X_RE = 1` for half-LSB rounding.
|
||||
|
||||
5. **Enable per-channel shift**: `X_SCALE_SHIFT_CTRL = 1`,
|
||||
`X_SHIFT = __MMA_X_CONFIG_SHIFT_ROW_UNSIGNED` (per-row unsigned shift).
|
||||
|
||||
### 6.3 Phase 3: Tiling and DMA
|
||||
|
||||
1. Implement block scheduling (`thames_sched_operation()`) to find optimal
|
||||
tile sizes that fit in L2 SRAM.
|
||||
|
||||
2. Add DMA support for copying input tiles to L2 and output tiles to DDR.
|
||||
|
||||
3. Implement double-buffering for both B panels and input tiles.
|
||||
|
||||
### 6.4 Key Assembly Instructions
|
||||
|
||||
The C7x assembler instructions corresponding to the MMA intrinsics:
|
||||
|
||||
| Intrinsic | Assembly | Unit |
|
||||
|-----------|----------|------|
|
||||
| `__HWAOPEN` | `HWAOPEN .C2 VBL, VBL, imm` | C2 |
|
||||
| `__HWALDA` | `HWALDAB .L2 VB, VB` (A-only form) | L2 |
|
||||
| `__HWALDB` | `HWALDB .L2 VB` | L2 |
|
||||
| `__HWALDC` | via `HWALDAB/HWALDBC` | L2 |
|
||||
| `__HWALDAB` | `HWALDAB .L2 VB, VB` | L2 |
|
||||
| `__HWAOP` | `HWAOPXFER .S1 imm` (op-only form) | S1 |
|
||||
| `__HWAOPXFER` | `HWAOPXFER .S1 imm` | S1 |
|
||||
| `__HWAXFER` | `HWAXFER .S2 imm` | S2 |
|
||||
| `__HWARCV` | `HWARCVS .S2 VB` | S2 |
|
||||
| `__HWACLOSE` | via control instruction | C2 |
|
||||
| `__HWA_LOAD_2REG` | via HWAXFER variants | S2 |
|
||||
|
||||
Note: In the reference MMALib disassembly, `HWAOPXFER` on S1 and `HWALDAB`
|
||||
on L2 always execute **in parallel** (same execute packet). `HWARCVS` on S2
|
||||
follows with appropriate pipeline delay.
|
||||
|
||||
### 6.5 Required Changes to Thames
|
||||
|
||||
1. **`thames_coefs.c`**: Remove weight re-quantization (uint8→int8 rescaling).
|
||||
Compute derivedBias using the TIDL formula with zero point absorption.
|
||||
Weights remain as TFLite int8 (symmetric, zp=0).
|
||||
|
||||
2. **`thames_compiler.c`**: New MMA convolution emitter using `asm_HWAOPEN`,
|
||||
`asm_HWALDAB`, `asm_HWAOPXFER`, `asm_HWARCVS` from the C7x assembler.
|
||||
|
||||
3. **`thames_ml.c`**: Input data stays uint8 — no need to convert to int8.
|
||||
Zero point folded into bias.
|
||||
|
||||
4. **New**: MMA config register encoding helper: compute `__HWA_CONFIG_REG_v1`
fields from layer parameters and encode them as the 512-bit vector operand for HWAOPEN.
|
||||
|
||||
## 7. Reference: TIDL `TIDL_roundSatMMA()`
|
||||
|
||||
This function models the MMA hardware's post-processing:
|
||||
|
||||
```c
static int64_t TIDL_roundSatMMA(int64_t val, int32_t bits, int32_t min, int32_t max)
{
    int64_t temp;
    if (bits > 0) {
        temp = (val >> (bits - 1)) + 1;  // add rounding bias
        val = temp >> 1;                 // complete the shift
    }
    val = val < min ? min : val;
    val = val > max ? max : val;
    return val;
}
```
|
||||
|
||||
This is: `round(val / 2^bits)` clamped to `[min, max]`. The rounding is
|
||||
"round half up" (add 0.5 × 2^bits before truncation).
|
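
The two-step shift in `TIDL_roundSatMMA()` is equivalent to adding half of the divisor and shifting once, which may be easier to recognize. The sketch below places both forms side by side; the function names are hypothetical, and both rely on arithmetic right shift of negative values, which is implementation-defined in C but holds on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Two-step form, as in the reference function (without clamping). */
static int64_t round_shift_two_step(int64_t val, int32_t bits)
{
    assert(bits > 0 && bits < 63);
    return ((val >> (bits - 1)) + 1) >> 1;
}

/* One-step form: add 2^(bits-1), then shift by bits. */
static int64_t round_shift_one_step(int64_t val, int32_t bits)
{
    assert(bits > 0 && bits < 63);
    return (val + ((int64_t)1 << (bits - 1))) >> bits;
}
```

Both give 3 for `(5, 1)` and -2 for `(-5, 1)`: ties round toward positive infinity, not away from zero.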
||||
|
||||
## 8. Reference: MMA Config Register Layout
|
||||
|
||||
The `__HWA_CONFIG_REG_v1` struct is 512 bits (64 bytes = one vector).
|
||||
Key sections and their bit positions (little-endian):
|
||||
|
||||
### A Config (bits 0–31 of first 64-bit word)
|
||||
- `A_ATYPE` [3:0] — 4 bits: element type (UINT8=0, INT8=4, etc.)
|
||||
- `A_ALUTEN` [8] — 1 bit: lookup table enable
|
||||
- `A_ARF_CTRL` [16] — 1 bit: A register file enable
|
||||
- `A_ARF_BASE` [22:17] — 6 bits: circular buffer base
|
||||
- `A_ARF_SIZE` [30:24] — 7 bits: ARF array size
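
The A config bit positions listed above can be packed with a generic field helper. This is a sketch for the encoding helper proposed in Section 6.5: `put_field` and `encode_a_config` are hypothetical names, and only the five fields listed here are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Inserts `value` into `word` at bit position `lsb` with the given width. */
static uint32_t put_field(uint32_t word, uint32_t value, unsigned lsb, unsigned width)
{
    assert(width > 0 && width < 32 && lsb + width <= 32);
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (word & ~mask) | ((value << lsb) & mask);
}

/* Hypothetical encoder for the low 32 bits of the first config word. */
static uint32_t encode_a_config(uint32_t atype, uint32_t aluten, uint32_t arf_ctrl,
                                uint32_t arf_base, uint32_t arf_size)
{
    uint32_t w = 0;
    w = put_field(w, atype,    0,  4);  /* A_ATYPE    [3:0]   */
    w = put_field(w, aluten,   8,  1);  /* A_ALUTEN   [8]     */
    w = put_field(w, arf_ctrl, 16, 1);  /* A_ARF_CTRL [16]    */
    w = put_field(w, arf_base, 17, 6);  /* A_ARF_BASE [22:17] */
    w = put_field(w, arf_size, 24, 7);  /* A_ARF_SIZE [30:24] */
    return w;
}
```

For instance, `A_ATYPE = 4` (INT8) with all other fields zero encodes as `0x00000004`.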
|
||||
|
||||
### B Config (bits 32–63 of first word + bits 0–63 of second word)
|
||||
- `B_BSWPER` [63:32] — 32 bits: B bank switch period
|
||||
- `B_BRSTPER` [7:0] — 8 bits: B offset reset period
|
||||
- `B_BTYPE` [9:8] — 2 bits: element size (SIZE8=0)
|
||||
- `B_ORDER` [16] — 1 bit: row/column major
|
||||
- `B_BSTART` [24] — 1 bit: initial B bank
|
||||
- `B_BOFFSET` [39:32] — 8 bits: global offset
|
||||
|
||||
### C Config (third + fourth + fifth 64-bit words)
|
||||
- `C_ATYPE` [0] — 1 bit: signed/unsigned accumulator
|
||||
- `C_BTYPE` [11:8] — 4 bits: bias element type
|
||||
- `C_OPERATION0` [17:16] — 2 bits: first operation (MUL/MULPLUS/etc.)
|
||||
- `C_OPERATION1` [25:24] — 2 bits: second operation
|
||||
- `C_HWLDDST` [34:32] — 3 bits: HWALD destination
|
||||
- `C_HWLDTYPE` [43:40] — 4 bits: HWALD element type
|
||||
- `C_OPSTART` [48] — 1 bit: initial operation
|
||||
- Various periods: `C_OP0PER`, `C_OP1PER`, `C_BSWPER`, `C_CRSWPER`, `C_CWSWPER` (32 bits each)
|
||||
- Various offsets: `C_CROFFSET`, `C_CWOFFSET`, `C_CLOFFSET` (6-7 bits each)
|
||||
|
||||
### X Config (sixth + seventh 64-bit words)
|
||||
- `X_ReLU` [0] — 1 bit: ReLU enable
|
||||
- `X_PSAT` [1] — 1 bit: parameterized saturation enable
|
||||
- `X_SAT_MIN` — 16 bits split across three fields (5:0, 12:6, 15:13)
|
||||
- `X_SAT` [8+16] — 1 bit: standard saturation enable
|
||||
- `X_RE` [16+16] — 1 bit: rounding enable
|
||||
- `X_SHIFT` [38:32] — 7 bits: right shift amount (or mode selector)
|
||||
- `X_SCALE_SHIFT_CTRL` [30] — 1 bit: per-row/col scale/shift enable
|
||||
- `X_XTYPE` [43:40] — 4 bits: output element type
|
||||
- `X_SAT_MAX` — 16 bits split across three fields
|
||||
- `X_CTYPE` [50:48] — 3 bits: accumulator element type
|
||||
- `X_CSWPER` — 32 bits: C read bank switch period
|
||||
- `X_COFFSET` — 8 bits: C read offset
|
||||
- `X_CSTART` — 1 bit: initial C bank for X reads
|
||||
|
||||
### Padding / Misc
|
||||
- `PARITYCTRL` [63:62] — 2 bits: parity control
|
||||
|
||||
The config register is loaded as a single 512-bit vector via the `HWAOPEN`
|
||||
instruction.
|
||||