diff --git a/src/panfrost/compiler/bifrost/valhall/ISA.xml b/src/panfrost/compiler/bifrost/valhall/ISA.xml
index 9c412ebf469..6fc6e0d12de 100644
--- a/src/panfrost/compiler/bifrost/valhall/ISA.xml
+++ b/src/panfrost/compiler/bifrost/valhall/ISA.xml
@@ -787,6 +787,13 @@
+
+
+
+
+
+
+
Do nothing. Useful at the start of a block for waiting on slots required
by the first actual instruction of the block, to reconcile dependencies
@@ -798,6 +805,12 @@
+
+
+
+
+
+
Branches to a specified relative offset if its source is nonzero (default)
or if its source is zero (if `.eq` is set). The offset is 27 bits and
@@ -823,6 +836,12 @@
+
+
+
+
+
+
Evaluates the given condition, and if it passes, discards the current
fragment and terminates the thread. Only valid in a **fragment** shader.
@@ -836,6 +855,12 @@
+
+
+
+
+
+
Jump to an indirectly specified (absolute or relative) address. Used to
jump to blend shaders at the end of a fragment shader.
@@ -851,6 +876,13 @@
+
+
+
+
+
+
+
General-purpose barrier. Must use slot #7. Must be paired with a
`.wait` flow on the instruction.
@@ -863,11 +895,21 @@
+
+
+
+
+
+
+
+
+
+
Evaluates the given condition and outputs either the true source or the
@@ -885,21 +927,41 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Evaluates the given condition and outputs either the true source or the
@@ -921,6 +983,13 @@
+
+
+
+
+
+
+
@@ -936,6 +1005,13 @@
+
+
+
+
+
+
+
Fetches a given flat varying from a hardware buffer
@@ -949,6 +1025,13 @@
+
+
+
+
+
+
+
Fetches a given flat varying from a hardware buffer
@@ -964,11 +1047,27 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
@@ -988,11 +1087,25 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
@@ -1010,6 +1123,12 @@
+
+
+
+
+
+
Interpolates a given varying from a software buffer
@@ -1026,6 +1145,13 @@
+
+
+
+
+
+
+
Interpolates a given varying from a software buffer
@@ -1043,6 +1169,13 @@
+
+
+
+
+
+
+
Fetches a given varying from a software buffer
@@ -1056,6 +1189,13 @@
+
+
+
+
+
+
+
Fetches a given varying from a software buffer
@@ -1071,6 +1211,12 @@
+
+
+
+
+
+
Load `vecsize` components from the attribute descriptor at entry `index`
of resource table `table` at index (vertex ID, instance ID), converting
@@ -1092,6 +1238,13 @@
+
+
+
+
+
+
+
Load `vecsize` components from the attribute descriptor at the specified
location at index (vertex ID, instance ID), converting
@@ -1113,6 +1266,13 @@
+
+
+
+
+
+
+
Load the 64-bit global clock, either a cycle counter or the system clock.
@@ -1124,6 +1284,12 @@
+
+
+
+
+
+
Load `vecsize` components from the texture descriptor at entry `index`
of resource table `table`, converting
@@ -1145,6 +1311,13 @@
+
+
+
+
+
+
+
Load `vecsize` components from the texture descriptor at the specified
location at index, converting
@@ -1165,6 +1338,12 @@
+
+
+
+
+
+
Load the effective address of an attribute specified with the
given immediate index. Returns three staging registers: the low/high
@@ -1184,6 +1363,13 @@
+
+
+
+
+
+
+
Load the effective address of an attribute specified with the
given index. Returns three staging registers: the low/high
@@ -1203,6 +1389,12 @@
+
+
+
+
+
+
Load the effective address of a texel from the image specified with the
given immediate index. Returns three staging registers: the low/high
@@ -1227,6 +1419,13 @@
+
+
+
+
+
+
+
Load the effective address of a texel from the image specified with the
given index. Returns three staging registers: the low/high
@@ -1251,6 +1450,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
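A minimal sketch of the "bits 25...31 are all-ones" test quoted above, assuming the mode descriptor is held as a single 32-bit value. The helper name is hypothetical; what happens in either branch is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when bits 25..31 (7 bits) of the
 * 32-bit mode descriptor are all ones. */
static bool mode_bits_25_31_all_ones(uint32_t mode)
{
    return ((mode >> 25) & 0x7Fu) == 0x7Fu;
}
```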
@@ -1272,6 +1478,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1293,6 +1506,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1314,6 +1534,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1335,6 +1562,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1356,6 +1590,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1377,6 +1618,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1398,6 +1646,13 @@
+
+
+
+
+
+
+
Loads a buffer descriptor. If bits 25...31 of the mode descriptor are
all-ones, load from the buffer descriptors in the table indexed by the
@@ -1419,6 +1674,11 @@
+
+
+
+
+
Load effective address of a buffer with an offset added.
@@ -1433,6 +1693,12 @@
+
+
+
+
+
+
Load effective address of a buffer with an immediate offset added.
@@ -1449,6 +1715,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1465,6 +1740,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1481,6 +1765,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1497,6 +1790,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1513,6 +1815,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1529,6 +1840,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1545,6 +1865,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1561,6 +1890,15 @@
+
+
+
+
+
+
+
+
+
Loads from main memory
@@ -1580,48 +1918,120 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
@@ -1634,6 +2044,11 @@
+
+
+
+
+
Load effective address of a simple buffer with an offset added.
@@ -1648,6 +2063,13 @@
+
+
+
+
+
+
+
Load from memory with data conversion. The address to load from is given in
the first source, which must be a 64-bit register (a pair of 32-bit
@@ -1668,6 +2090,13 @@
+
+
+
+
+
+
+
Store to memory with data conversion. The address to store to is given in
the first source, which must be a 64-bit register (a pair of 32-bit
@@ -1690,6 +2119,12 @@
+
+
+
+
+
+
Loads a given render target, specified in the pixel indices descriptor, at
a given location and sample, and converts to the format specified in the
@@ -1710,6 +2145,13 @@
+
+
+
+
+
+
+
Store to given render target, specified in the pixel indices descriptor, at
a given location and sample, and convert to the format specified in the
@@ -1729,6 +2171,12 @@
+
+
+
+
+
+
Blends a given render target. This loads the API-specified blend state for
the render target from the first source. Blend descriptors are available
@@ -1768,6 +2216,13 @@
+
+
+
+
+
+
+
Does alpha-to-coverage testing, updating the sample coverage mask. ATEST
does not do an implicit discard. It should be executed before the first
@@ -1784,6 +2239,13 @@
+
+
+
+
+
+
+
Programmatically writes out depth, stencil, or both, depending on which
modifiers are set. Used to implement gl_FragDepth and gl_FragStencil.
@@ -1818,6 +2280,11 @@
+
+
+
+
+
@@ -1833,6 +2300,11 @@
+
+
+
+
+
@@ -1849,6 +2321,11 @@
+
+
+
+
+
@@ -1863,6 +2340,11 @@
+
+
+
+
+
@@ -1883,12 +2365,22 @@
+
+
+
+
+
+
+
+
+
+
Value to convert
@@ -1939,6 +2431,11 @@
+
+
+
+
+
Converts up with the specified round mode.
Value to convert
@@ -1954,6 +2451,11 @@
+
+
+
+
+
@@ -1969,6 +2471,11 @@
+
+
+
+
+
@@ -1992,6 +2499,11 @@
+
+
+
+
+
@@ -2006,6 +2518,11 @@
+
+
+
+
+
@@ -2029,6 +2546,11 @@
+
+
+
+
+
@@ -2048,6 +2570,11 @@
+
+
+
+
+
Canonical register-to-register move.
@@ -2057,6 +2584,11 @@
+
+
+
+
+
Used as a primitive for various bitwise operations.
@@ -2068,6 +2600,11 @@
+
+
+
+
+
Used as a primitive for various bitwise operations.
@@ -2079,6 +2616,11 @@
+
+
+
+
+
Used as a primitive for various bitwise operations.
@@ -2090,6 +2632,11 @@
+
+
+
+
+
64-bit abs may be constructed in 4 instructions (5 clocks) by checking the
sign with `ICMP.s32.lt.m1 hi, 0` and negating based on the result with
@@ -2103,6 +2650,11 @@
+
+
+
+
+
@@ -2120,6 +2672,11 @@
+
+
+
+
+
Only available as 32-bit. Smaller bitsizes require explicit conversions.
64-bit popcount may be constructed in 3 clocks by separate 32-bit
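A C model of building a 64-bit popcount from two 32-bit popcounts whose results are then added, which is how this sketch reads the truncated sentence above. `popcount32` stands in for the 32-bit instruction; both function names are hypothetical, and the clock count is not modeled.

```c
#include <stdint.h>

/* Portable stand-in for a 32-bit population count. */
static uint32_t popcount32(uint32_t x)
{
    uint32_t n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* 64-bit popcount as the sum of the two 32-bit halves' counts. */
static uint32_t popcount64_from_halves(uint32_t lo, uint32_t hi)
{
    return popcount32(lo) + popcount32(hi);
}
```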
@@ -2134,6 +2691,11 @@
+
+
+
+
+
Only available as 32-bit. Other bitsizes may be derived with swizzles.
@@ -2166,6 +2728,11 @@
+
+
+
+
+
Returns the mask of lanes ever active within the warp (subgroup), such
that the source is nonzero. The number of work-items in a subgroup is
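A scalar sketch of the ballot-style mask described above. It assumes one bit per lane, with lane `i` mapped to bit `i`, and a subgroup of at most 32 lanes. The arrays standing in for per-lane state and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit i is set when lane i is active and its source is nonzero.
 * Assumes n <= 32. */
static uint32_t ballot_mask(const uint32_t src[], const bool active[], int n)
{
    uint32_t mask = 0;
    for (int i = 0; i < n; i++)
        if (active[i] && src[i] != 0)
            mask |= 1u << i;
    return mask;
}
```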
@@ -2187,12 +2754,22 @@
+
+
+
+
+
+
+
+
+
+
Flush special float values. The ftz modifier flushes subnormal values to
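A bit-level sketch of flush-to-zero on a binary32 value. It assumes the flushed result keeps the input's sign (a signed zero), which the truncated text does not confirm. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Replace a subnormal binary32 value with a zero of the same sign. */
static float flush_subnormal_f32(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    /* Exponent field zero with a nonzero mantissa means subnormal. */
    if ((bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0)
        bits &= 0x80000000u; /* keep only the sign bit */
    memcpy(&v, &bits, sizeof v);
    return v;
}
```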
@@ -2212,6 +2789,11 @@
+
+
+
+
+
@@ -2225,6 +2807,11 @@
+
+
+
+
+
@@ -2251,60 +2838,110 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Performs a given special function. The floating-point reciprocal (`FRCP`)
@@ -2323,24 +2960,44 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Performs a given special function. The trigonometric tables
@@ -2356,12 +3013,22 @@
+
+
+
+
+
+
+
+
+
+
$A + B$
@@ -2377,12 +3044,22 @@
+
+
+
+
+
+
+
+
+
+
$\min \{ A, B \}$
@@ -2396,12 +3073,22 @@
+
+
+
+
+
+
+
+
+
+
$\max \{ A, B \}$
@@ -2433,12 +3120,23 @@
+
+
+
+
+
+
+
+
+
+
+
Computes $A \cdot 2^B$ by adding B to the exponent of A. Used to calculate
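A bit-level sketch of computing $A \cdot 2^B$ by adding B to the biased exponent field of a binary32 value. It is valid only when A is normal and the result stays normal. Zero, subnormals, infinities, NaN, and exponent overflow or underflow are not modeled, and the hardware's handling of those cases is not assumed. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* A * 2^B for normal A whose result stays in the normal range. */
static float scale_by_exponent_add(float a, int32_t b)
{
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    uint32_t exp = (bits >> 23) & 0xFFu;        /* biased exponent */
    exp = (uint32_t)((int32_t)exp + b);         /* add B to the exponent */
    bits = (bits & ~(0xFFu << 23)) | ((exp & 0xFFu) << 23);
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}
```

For example, 3.0 with B = 4 gives 48.0. The mantissa and sign are untouched, so no rounding occurs in the normal range.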
@@ -2457,6 +3155,11 @@
+
+
+
+
+
Calculates the base-2 exponent of an argument specified as an 8:24
fixed-point. The original argument is passed as well for correct handling
@@ -2472,6 +3175,11 @@
+
+
+
+
+
Performs a floating-point addition specialized for logarithm computation.
@@ -2485,6 +3193,12 @@
+
+
+
+
+
+
Used for `atan2()` implementation. Destination is two 16-bit
values (int and float) for the first form, and a single 32-bit float when
@@ -2507,12 +3221,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -2526,12 +3250,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -2545,12 +3279,24 @@
+
+
+
+
+
+
+
+
+
+
+
+
A
B
@@ -2562,6 +3308,11 @@
+
+
+
+
+
Calculates $A | (B \ll 16)$. Used to implement `(ushort2)(A, B)`
A
B
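A C sketch of $A | (B \ll 16)$ for two 16-bit inputs. A lands in the low half and B in the high half, matching the `(ushort2)(A, B)` component order. The function name is hypothetical.

```c
#include <stdint.h>

/* Pack two 16-bit values: A in bits 0..15, B in bits 16..31. */
static uint32_t pack_v2i16(uint16_t a, uint16_t b)
{
    return (uint32_t)a | ((uint32_t)b << 16);
}
```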
@@ -2573,12 +3324,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -2592,12 +3353,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -2611,12 +3382,24 @@
+
+
+
+
+
+
+
+
+
+
+
+
$A - B$ with optional saturation
A
@@ -2637,6 +3420,12 @@
+
+
+
+
+
+
@@ -2655,12 +3444,24 @@
+
+
+
+
+
+
+
+
+
+
+
+
A
@@ -2673,42 +3474,78 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
$A \cdot B$ with optional saturation. Note the multipliers can only handle up to
@@ -2775,6 +3612,11 @@
+
+
+
+
+
Selects the value of A in the subgroup lane given by B. This implements
subgroup broadcasts. It may be used as a primitive for screen space
@@ -2792,11 +3634,21 @@
+
+
+
+
+
+
+
+
+
+
$A \cdot B + C$
@@ -2812,24 +3664,48 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Left shifts its first source by a specified amount and bitwise ANDs it with the
@@ -2847,24 +3723,48 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Right shifts its first source by a specified amount and bitwise ANDs it with the
@@ -2885,24 +3785,48 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Left shifts its first source by a specified amount and bitwise ORs it with the
@@ -2920,24 +3844,48 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Right shifts its first source by a specified amount and bitwise ORs it with the
@@ -2958,24 +3906,48 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Left shifts its first source by a specified amount and bitwise XORs it with the
@@ -2993,24 +3965,48 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Right shifts its first source by a specified amount and bitwise XORs it with the
@@ -3029,6 +4025,12 @@
+
+
+
+
+
+
Mux between A and B based on the provided mask. The condition specified
as the `mux` modifier is evaluated on the mask. If true, `A` is chosen,
@@ -3046,6 +4048,12 @@
+
+
+
+
+
+
Mux between A and B based on the provided mask. The condition specified
as the `mux` modifier is evaluated on the mask. If true, `A` is chosen,
@@ -3063,6 +4071,12 @@
+
+
+
+
+
+
Mux between A and B based on the provided mask. The condition specified
as the `mux` modifier is evaluated on the mask. If true, `A` is chosen,
@@ -3081,6 +4095,12 @@
+
+
+
+
+
+
During a cube map transform, select the S coordinate given a selected face.
Z coordinate as 32-bit floating point
X coordinate as 32-bit floating point
@@ -3092,6 +4112,12 @@
+
+
+
+
+
+
During a cube map transform, select the T coordinate given a selected face.
Y coordinate as 32-bit floating point
Z coordinate as 32-bit floating point
@@ -3102,6 +4128,11 @@
+
+
+
+
+
Calculates $A | (B \ll 8) | (CD \ll 16)$ for 8-bit A and B and 16-bit CD.
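A C sketch of $A | (B \ll 8) | (CD \ll 16)$ with 8-bit A and B and 16-bit CD, so that the bytes land at offsets 0, 1, and 2–3. The function name is hypothetical.

```c
#include <stdint.h>

/* Byte 0 = A, byte 1 = B, bytes 2..3 = CD. */
static uint32_t pack_8_8_16(uint8_t a, uint8_t b, uint16_t cd)
{
    return (uint32_t)a | ((uint32_t)b << 8) | ((uint32_t)cd << 16);
}
```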
@@ -3120,6 +4151,11 @@
+
+
+
+
+
Select the maximum absolute value of its arguments.
X coordinate as 32-bit floating point
Y coordinate as 32-bit floating point
@@ -3130,6 +4166,11 @@
+
+
+
+
+
Select the cube face index corresponding to the arguments.
X coordinate as 32-bit floating point
Y coordinate as 32-bit floating point
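A reference sketch of the cube-map steps listed above: the maximum absolute coordinate, the face selected by that major axis, and the un-normalized S and T taken from the OpenGL cube-map face table. Face numbering follows the OpenGL order (+X, -X, +Y, -Y, +Z, -Z). Whether the hardware uses the same face encoding or tie-breaking is an assumption, as are all names here. The table is consistent with the source lists above: S reads Z or X, and T reads Y or Z.

```c
/* OpenGL face order; hardware encoding is an assumption. */
enum cube_face { POS_X, NEG_X, POS_Y, NEG_Y, POS_Z, NEG_Z };

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Largest magnitude among x, y, z (the major-axis length). */
static float cube_max_abs(float x, float y, float z)
{
    float m = absf(x);
    if (absf(y) > m) m = absf(y);
    if (absf(z) > m) m = absf(z);
    return m;
}

/* Face of the major axis; ties favour Z, then Y (assumed). */
static enum cube_face cube_face_index(float x, float y, float z)
{
    float ax = absf(x), ay = absf(y), az = absf(z);
    if (az >= ax && az >= ay)
        return z < 0.0f ? NEG_Z : POS_Z;
    if (ay >= ax)
        return y < 0.0f ? NEG_Y : POS_Y;
    return x < 0.0f ? NEG_X : POS_X;
}

/* Un-normalized S and T (sc, tc) per the OpenGL face table. */
static void cube_select_st(enum cube_face f, float x, float y, float z,
                           float *s, float *t)
{
    switch (f) {
    case POS_X: *s = -z; *t = -y; break;
    case NEG_X: *s =  z; *t = -y; break;
    case POS_Y: *s =  x; *t =  z; break;
    case NEG_Y: *s =  x; *t = -z; break;
    case POS_Z: *s =  x; *t = -y; break;
    case NEG_Z: *s = -x; *t = -y; break;
    }
}
```

In OpenGL, the final texture coordinates are `(sc / ma + 1) / 2` and `(tc / ma + 1) / 2`, where `ma` is the value returned by `cube_max_abs`.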
@@ -3153,12 +4194,24 @@
+
+
+
+
+
+
+
+
+
+
+
+
A
B
@@ -3179,12 +4232,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3212,12 +4275,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3246,12 +4319,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3272,12 +4355,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3298,12 +4391,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3331,12 +4434,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3371,12 +4484,22 @@
+
+
+
+
+
+
+
+
+
+
@@ -3389,6 +4512,10 @@
+
+
+
+
Adds an arbitrary 32-bit immediate embedded within the instruction stream.
If no modifiers are required, this is preferred to `IADD.i32` with a
@@ -3405,6 +4532,10 @@
+
+
+
+
Adds an arbitrary pair of 16-bit immediates embedded within the
instruction stream. If no modifiers are required, this is preferred to
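A lane-wise sketch of adding a pair of 16-bit immediates to a packed 32-bit value. Each 16-bit half is assumed to wrap independently, so no carry crosses from the low half into the high half. The function name is hypothetical.

```c
#include <stdint.h>

/* Add packed 16-bit lanes independently (wrapping per lane). */
static uint32_t add_v2i16(uint32_t a, uint32_t imm)
{
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (imm & 0xFFFFu));
    uint16_t hi = (uint16_t)((a >> 16) + (imm >> 16));
    return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

For example, `0x0001FFFF + 0x00010001` gives `0x00020000`: the low lane wraps to zero without disturbing the high lane.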
@@ -3436,6 +4567,10 @@
+
+
+
+
Adds an arbitrary 32-bit immediate embedded within the instruction stream.
If no modifiers are required, this is preferred to `FADD.f32` with a
@@ -3450,6 +4585,10 @@
+
+
+
+
Adds an arbitrary pair of 16-bit immediates embedded within the
instruction stream. If no modifiers are required, this is preferred to
@@ -3466,6 +4605,13 @@
+
+
+
+
+
+
+
@@ -3481,6 +4627,13 @@
+
+
+
+
+
+
+
@@ -3496,6 +4649,13 @@
+
+
+
+
+
+
+
@@ -3510,6 +4670,13 @@
+
+
+
+
+
+
+
@@ -3524,6 +4691,13 @@
+
+
+
+
+
+
+
@@ -3544,6 +4718,13 @@
+
+
+
+
+
+
+
@@ -3563,6 +4744,12 @@
+
+
+
+
+
+
Unfiltered texture instruction.
@@ -3589,6 +4776,11 @@
+
+
+
+
+
Ordinary texturing instruction using a sampler.
@@ -3617,6 +4809,11 @@
+
+
+
+
+
Texture gather instruction.
@@ -3646,6 +4843,12 @@
+
+
+
+
+
+
Texture sample with explicit gradient.
@@ -3672,6 +4875,11 @@
+
+
+
+
+
Pair of texture instructions.
@@ -3697,6 +4905,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX, using both V and T units.
@@ -3721,6 +4937,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX, using both V and T units.
@@ -3746,6 +4970,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX, using both V and T units.
@@ -3771,6 +5003,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX_DUAL, using both V and T units.
@@ -3795,6 +5035,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_IMM_F32.v2.f32 followed by TEX, using both V and T units.
@@ -3819,6 +5067,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_IMM_F32.v2.f32 followed by TEX, using both V and T units.
@@ -3844,6 +5100,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_IMM_F32.v2.f32 followed by TEX, using both V and T units.
@@ -3869,6 +5133,14 @@
+
+
+
+
+
+
+
+
Only works for FP32 varyings. Performance characteristics are similar
to LD_VAR_IMM_F32.v2.f32 followed by TEX_DUAL, using both V and T units.
@@ -3893,6 +5165,11 @@
+
+
+
+
+
First calculates $A \cdot B + C$ and then biases the exponent by D. Used in
special transcendental function sequences. It should not be used for
@@ -3911,6 +5188,11 @@
+
+
+
+
+
First calculates $A \cdot B + C$ and then biases the exponent by D. If $A
= 0$ or $B = 0$, the multiply $A \cdot B$ is treated as zero even if an
@@ -3930,6 +5212,11 @@
+
+
+
+
+
First calculates $A \cdot B + C$ and then biases the exponent by D. If $A
= 0$ or $B = 0$, the multiply is treated as $A$ even if an
@@ -3949,6 +5236,11 @@
+
+
+
+
+
First calculates $A \cdot B + C$ and then biases the exponent by D,
interpreted as a 16-bit value. Used in special transcendental function