Add a new optimization pass that identifies sequences of scalar dot
product operations and combines them into DPAS (Dot Product Accumulate
Systolic) matrix multiplication instructions for XeHP+ EUs that have a
systolic array pipeline (AKA XMX engine).
This is possible because a matrix multiplication as performed by DPAS
can be expressed like:
E^i_k = D^i_k + Sum_j A^i_j B^j_k
I.e. each scalar component of a matrix multiplication is just a
(possibly large) dot product. This pass identifies such chains of
sdot_4x8_iadd dot products in the program and bins them according to
the A and B arguments used. Sets of dot products with consecutive
components are transformed into a matrix product for each densely
occupied interval of indices within each bin, as long as there is an
efficient way to transpose one of the arguments in the register file.
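As an illustration of the identity above, here is a minimal Python sketch (not the NIR pass itself). The helper names pack4, unpack4 and matmul_acc_via_sdot are hypothetical, the lane order used for packing is an assumption, and 32-bit accumulator wraparound is not modelled. It accumulates each component of E by chaining 4-element signed 8-bit dot products:

```python
def pack4(vals):
    """Pack four signed 8-bit integers into one 32-bit word (lane 0 lowest)."""
    assert len(vals) == 4
    word = 0
    for lane, v in enumerate(vals):
        assert -128 <= v <= 127
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack4(word):
    """Inverse of pack4: recover four signed 8-bit lanes from a 32-bit word."""
    lanes = []
    for lane in range(4):
        b = (word >> (8 * lane)) & 0xFF
        lanes.append(b - 256 if b >= 128 else b)
    return lanes


def sdot_4x8_iadd(a, b, acc):
    """Model of a signed 4x8-bit dot product with integer accumulate.
    Wraparound of the 32-bit accumulator is deliberately not modelled."""
    return acc + sum(x * y for x, y in zip(unpack4(a), unpack4(b)))


def matmul_acc_via_sdot(A, B, D):
    """E[i][k] = D[i][k] + sum_j A[i][j] * B[j][k], computed as one chain of
    sdot_4x8_iadd operations per component. The inner dimension must be a
    multiple of 4."""
    rows, inner, cols = len(A), len(B), len(B[0])
    assert inner % 4 == 0
    E = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(cols):
            acc = D[i][k]
            for j in range(0, inner, 4):
                a = pack4(A[i][j:j + 4])
                b = pack4([B[j + t][k] for t in range(4)])
                acc = sdot_4x8_iadd(a, b, acc)
            E[i][k] = acc
    return E
```

Every chain in the inner loop shares its A operand with the other components of row i and its B operand with the other components of column k, which is the property the binning step relies on.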
This enables programs to opportunistically take advantage of the
systolic array pipeline for linear arithmetic, which has massively
greater throughput than the regular FPUs (roughly a factor of 4x the
throughput for the specific instructions replaced currently), without
the application having to be updated in order to take advantage of it
through a matrix multiplication API like KHR_cooperative_matrix.
The immediate motivation for this is getting the open source driver to
accelerate the matrix multiplications used for inference by the XeSS
ML-driven upscaling library, since the Mesa driver was previously
limited to the generic HLSL path that doesn't take advantage of the
XMX pipeline. Alternative AI-driven upscaling libraries can be
supported in theory, though this hasn't been pursued yet, and there are
some assumptions in the optimization pass that might get in the way
currently:
- Currently only the sdot_4x8_iadd intrinsic is supported for no
particular reason other than it being the intrinsic generated by
the XeSS library in its multivendor path. It would be
straightforward to add support for additional types supported by
the systolic pipeline.
- Currently one of the arguments of the dot products is restricted to
be an SSBO load because that's what we encounter in the XeSS
library, but any other kind of memory load intrinsic could be
supported easily.
- Also accidental is the current limitation to run on Xe2+
hardware. Getting it to work on XeHP (e.g. DG2) should be possible
apart from some minor differences, so it will probably be a
future area for improvement.
- The limitation of the shader subgroup size to 16 done at the end of
the optimization pass is less accidental, because on all Intel Xe
platforms released so far the DPAS instruction is limited to run at
a fixed execution width (8 on XeHP and 16 on Xe2-3), so the backend
would need a way to expose variable-width DPAS intrinsics e.g. by
lowering them using SIMD splitting. I have some code to try to
achieve that, but the naive approach of SIMD splitting DPAS
instructions appears to hurt more cases than it helps so I don't
have a ready solution to lift this restriction yet.
Evaluating the impact of this on the performance of XeSS kernels using
our internal microbenchmarks shows a performance improvement for XeSS
inference between 26% and 44% depending on the quality preset and
resolution, with a geomean improvement of 35% across the rendering
modes tested.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41814>
Multiple DPAS instructions executed on the same functional unit are
guaranteed to read their source operands in program order, so no
scoreboard synchronization is required between a DPAS read and another
DPAS read of the same register.
In order to achieve that, track the pipeline (DPAS vs. other) of each
out-of-order dependency via a new field on the dependency struct along
with the token ID of the out-of-order dependency. When a read
dependency for a DPAS instruction is encountered whose producer is
also a DPAS unit, strip the SRC synchronization flag so that no
redundant wait is emitted. The DST synchronization flag is preserved
since write-after-read hazards still require ordering.
This reduces the number of scoreboard stalls emitted within chains of
DPAS instructions that have overlapping sources (common in matrix
multiplication kernels), improving occupancy of the systolic pipeline.
It avoids performance regressions in XeSS kernels in combination with
the following vectorization optimization, and could also be helpful in
theory with other workloads that utilize the systolic pipeline via
KHR_cooperative_matrix.
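The rule described above can be modelled with the following minimal Python sketch. All names in it (Pipe, Sync, Dependency, prune_for_consumer) are hypothetical stand-ins for illustration and do not correspond to the actual scoreboard data structures:

```python
from dataclasses import dataclass
from enum import Enum, Flag


class Pipe(Enum):
    DPAS = "dpas"
    OTHER = "other"


class Sync(Flag):
    NONE = 0
    SRC = 1   # wait until the producer has read its sources
    DST = 2   # wait until the producer has written its destination


@dataclass(frozen=True)
class Dependency:
    token: int   # token ID of the out-of-order producer
    pipe: Pipe   # pipeline of the producer (the newly tracked field)
    sync: Sync   # synchronization the consumer would otherwise wait on


def prune_for_consumer(dep, consumer_pipe):
    """Drop the SRC wait when producer and consumer are both DPAS, since
    source reads happen in program order there. DST is always kept."""
    if consumer_pipe is Pipe.DPAS and dep.pipe is Pipe.DPAS:
        return Dependency(dep.token, dep.pipe, dep.sync & ~Sync.SRC)
    return dep
```

A dependency whose flags become empty after pruning needs no wait at all, which is where the reduction in stalls comes from in this model.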
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41814>
Previously when the first register allocation attempt failed,
brw_allocate_registers() would iterate over scheduling modes in the
fixed order specified by pre_modes[], assuming that the first
successful mode would be the most performant. However that wasn't
ever a very reliable guarantee, and it becomes less so on Xe3+ where a
lower-register-pressure schedule can have higher thread parallelism.
But actually that's a bit of a silly situation since the pre_modes[]
loop that runs before the first brw_assign_regs() attempt already
iterates over multiple scheduling heuristics in order to choose which
one to try first, so it has a static analysis model of the relative
performance of the different heuristics which we can use in order to
properly sort the pre_modes[] list and make a more informed decision
about the iteration order at little extra cost.
This seems to be helpful even before Xe3 in cases where
BRW_SCHEDULE_PRE_(NON_)LIFO outperforms BRW_SCHEDULE_PRE(_LATENCY), in
particular when the critical path heuristic used by
BRW_SCHEDULE_PRE_(NON_)LIFO does a better job at minimizing the
latency of the program than the mostly backward-looking heuristic of
BRW_SCHEDULE_PRE(_LATENCY).
That is apparently the case in several shaders from the XeSS library,
where the BRW_SCHEDULE_PRE heuristic hoists most of the memory loads
of the shader aggressively to the top creating a bottleneck instead of
interleaving the messages more effectively with the arithmetic along
the critical path of the program. This patch avoids performance
regressions with the subsequent DPAS vectorization patch as a result
of this inversion of performance between the PRE and PRE_NON_LIFO
scheduling heuristics.
Note that this doesn't necessarily run the scheduler more times, it
just changes the order in which the different scheduling modes are
attempted. No significant difference in the compile-time of shader-db
or fossil-db has been observed.
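The change in iteration order can be sketched as follows. This is a simplified Python model, not the brw_allocate_registers() code: the function name and its callable parameters are hypothetical, and the cost estimates stand in for whatever the static analysis model reports.

```python
def allocate_with_sorted_modes(modes, estimate_cycles, try_allocate):
    """Try scheduling modes from best to worst estimated performance.

    modes           -- iterable of fallback scheduling modes
    estimate_cycles -- callable mode -> estimated cycle count (lower is better)
    try_allocate    -- callable mode -> True if register allocation succeeds
    Returns the first mode that allocates, or None if all of them fail.
    """
    # sorted() is stable, so modes with equal estimates keep their original
    # relative order, preserving the old fixed priority as a tie-break.
    for mode in sorted(modes, key=estimate_cycles):
        if try_allocate(mode):
            return mode
    return None
```

The number of allocation attempts in the worst case is unchanged; only the order differs, which matches the observation above that compile time is not significantly affected.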
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41814>
Add a new nir_divergence_uniform_local_invocation_id_z divergence
option that allows the Z component of the local invocation ID to be
treated as uniform across the subgroup, for cases where the driver
knows that as a result of the hardware's subgroup walk order the Z
component is guaranteed to remain constant across a subgroup.
On Intel hardware for the walk order currently in use all invocations
within a single subgroup are guaranteed to share the same
local_invocation_id.z value when the product of the X and Y workgroup
dimensions is a multiple of the SIMD width (32 at most).
This allows the subsequent vectorization optimization to have an
effect for many dot products in XeSS kernels whose two arguments
currently both appear divergent, even though one of them only appears
divergent due to its dependency on local_invocation_id.z, which is
actually subgroup-uniform for these kernels.
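The uniformity condition can be checked by brute force with a small Python sketch. It assumes a linear walk order in which X varies fastest, then Y, then Z, with consecutive invocations packed into subgroups. That ordering is an assumption of the sketch rather than a statement about every walk order the hardware supports, and the function name is hypothetical.

```python
def subgroup_z_is_uniform(size_x, size_y, size_z, simd_width):
    """Return True if every subgroup sees a single local_invocation_id.z.

    Assumes invocations are assigned to subgroups in linear order with X
    varying fastest, then Y, then Z (an assumption of this sketch).
    """
    total = size_x * size_y * size_z
    for first in range(0, total, simd_width):
        lanes = range(first, min(first + simd_width, total))
        # With this ordering, z is the linear index divided by the X*Y slice.
        z_values = {index // (size_x * size_y) for index in lanes}
        if len(z_values) > 1:
            return False
    return True
```

For example, an 8x4x3 workgroup at SIMD16 or SIMD32 passes the check because 8 * 4 = 32 is a multiple of both widths, while a 4x4x2 workgroup at SIMD32 fails because a single subgroup spans two Z slices.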
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41814>
The sysval is affected by VRS.
More subgroup sysvals might have to be added here.
Cc: mesa-stable
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42234>
chromium/skia (stupidly) hits this path when drawing transparent svgs,
and it's definitely a bug in the browser engine, but no human can possibly
comprehend how any of that works
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42222>
The LOAD and LD_PKA instructions have i8, i16, and i24 forms that can,
in theory, operate on partial registers. However, there are issues with
races between ALU and message instructions on partial registers. We
could probably come up with a complex model for this but for now it's
easiest to just force whole registers for message destinations.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42226>
Xe is unstable on 6.18+, so we need to revert to the previous stable
kernel if we want to have pre-merge jobs on ADL.
Cc: mesa-stable
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42041>
The Flip-hatch devices are getting retired in the Collabora lab.
We can also drop a few skips that were only needed for CML.
Cc: mesa-stable
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42041>
Swap the pre-merge CML Venus-on-ANV job with the nightly ADL one, making
the ADL job pre-merge. Also move the nightly Android CTS job to ADL.
The ADL runners are available after disabling anv-adl-vk, and this keeps
Venus-on-ANV coverage in Cuttlefish Android VMs.
Using 4 parallel runners allows the pre-merge VK CTS test suite to run
with a lower fraction. Update the xfails to match, since the new fraction
covers a different subset of tests.
Cc: mesa-stable
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42041>
We already have VK CTS coverage on TGL and RPL, with the latter running
the full test suite pre-merge.
Having three gfx12 VK CTS jobs is redundant, so disable anv-adl-vk. The
ADL runners will be reused for a different job.
Cc: mesa-stable
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42041>
tu_pack_float32_for_color doesn't correctly handle that format, and
the value is actually quantized correctly without it.
Fixes: 38a10950e3 ("tu: Match SW color clear value packing with HW")
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42215>
The 64-bit mask was truncated, and then when the low half was 0, the base was -1.
By accident, u_bit_consecutive64(-1, 65) is the original mask, so we uploaded a
single garbage value.
Fixes: 7f6262bb85 ("radv: allow holes in inline push constants")
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42182>
Newer kernels just print hex chip-id rather than unsigned "ipv4" style.
Update parsing to handle this. See kernel commit cc53487e01fc
("drm/msm/adreno: Change chip_id format").
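A parser that tolerates both styles could look like the following Python sketch. The function name is hypothetical, and the exact strings printed by old and new kernels are assumptions here: the dotted form is taken to be four decimal fields of one byte each, most significant first, and the new form a plain hexadecimal number with an optional 0x prefix.

```python
def parse_chip_id(text):
    """Parse a chip-id given either as dotted decimal bytes ("a.b.c.d")
    or as a hexadecimal number, returning it as an integer."""
    text = text.strip()
    if "." in text:
        # Assumed legacy style: four decimal byte fields, MSB first.
        parts = [int(p, 10) for p in text.split(".")]
        if len(parts) != 4 or any(not 0 <= p <= 0xFF for p in parts):
            raise ValueError(f"unexpected dotted chip-id: {text!r}")
        value = 0
        for p in parts:
            value = (value << 8) | p
        return value
    # Assumed new style: hexadecimal, with or without a "0x" prefix.
    return int(text, 16)
```

Keying on the presence of a dot keeps the two formats unambiguous, since a bare run of digits would otherwise be readable as either decimal or hex.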
Signed-off-by: Rob Clark <rob.clark@oss.qualcomm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42193>
Previously this helper function would not capture the xshm opcode from
the server's shm reply and drisw_glx requires the value to work
properly.
Fixes: 5f4eccf1 ("glx: Check that xshm can be attached")
Reviewed-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40926>
NEON always flushes subnormals to zero; previously lp_test_arit
special-cased vector paths to suppress the resulting failures.
The proper fix mirrors x86: set FPSCR/FPCR FZ so VFP also flushes,
keeping scalar and vector paths consistent with the C reference.
util_fpstate_{get,set,set_denorms_to_zero} now read/write FPSCR
(ARMv7) or FPCR (AArch64) via inline asm. flush_denorm_to_zero
in lp_test_arit flushes subnormal inputs on ARM/AArch64 to match.
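What flushing a subnormal input means can be shown with a short Python model of the float32 case. This is only an illustration of the arithmetic, not the C helper in lp_test_arit: it inspects the IEEE 754 single-precision bit pattern and replaces subnormals with a zero of the same sign.

```python
import math
import struct


def flush_denorm_to_zero(x):
    """Return x rounded to float32, with float32 subnormals flushed to a
    zero of the same sign."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    # A subnormal has an all-zero exponent field and a non-zero mantissa.
    if exponent == 0 and mantissa != 0:
        return math.copysign(0.0, x)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

The smallest normal float32, 2^-126, passes through unchanged, while 2^-127 is flushed to zero.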
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42178>
For the BIR-compiler, 64-bit values were not properly tracked in the
spill logic and PHIs were always assumed to be 32 bits. This could
create issues where only one half of the value was reloaded or spills
would overlap each other, leading to garbage values. This patch fixes
these issues by keeping track of how many words each value needs. Also,
it adds a constraint for SHADD sources where it splits and collects them
right before the SHADD instruction itself to make it easier for RA to
handle the register pairs.
Fixes: 4542982062 ("pan/compiler: Use SHADDX instruction for i64 add")
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42080>
On v14+, multiview is not lowered to per-view output stores. Rename
"multiview" to "per_view_outputs" to make it clear that this logic only
applies when the shader uses nir_intrinsic_store_per_view_output.
Reviewed-by: Olivia Lee <olivia.lee@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42049>
Replace PAN_MAX_MULTIVIEW_VIEW_COUNT with a helper taking the GPU
architecture, so both the compiler and PanVK can query the right limit.
Also raise the maximum multiview view count to 16 on v14+, up from 8 on
older generations.
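The shape of such a helper is roughly the following, shown as a Python sketch with a hypothetical name. The real helper lives in the C code base and may take a different argument type.

```python
def max_multiview_view_count(arch):
    """Maximum multiview view count for a given GPU architecture version:
    16 on v14 and later, 8 on older generations."""
    return 16 if arch >= 14 else 8
```

Keying the limit on the architecture lets the compiler and the Vulkan driver agree on it without sharing a compile-time constant.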
Reviewed-by: Olivia Lee <olivia.lee@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42049>
On v14+, the view mask moved from PRIMITIVE_FLAGS to PRIMITIVE_FLAGS_2.
The multiview vertex shader unrolling no longer needs to be handled in
software. The GPU now runs one shader invocation per view, where each
writes a single view and the view index is passed through a preload.
Fixes: 4258888f4d ("pan/genxml: Add v14 definition")
Reviewed-by: Olivia Lee <olivia.lee@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42049>
On v14+, the GPU runs one vertex shader invocation per view, where each
writes a single view and the view index is passed through
BI_PRELOAD_VIEW_ID.
Reviewed-by: Olivia Lee <olivia.lee@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42049>
Move the ACORN random number generator from src/nouveau/compiler/acorn/
to src/compiler/rust/acorn/ so it can be shared between different
driver hardware test infrastructures.
Signed-off-by: Christian Gmeiner <cgmeiner@igalia.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42165>
The ISA.xml for Valhall did not exactly match ARSHIFT, as it was based
on RSHIFT. However, we could still generate ARSHIFT_OR, so in certain
trace dumps the output would be empty.
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42040>
There is no hardware restriction that limits the current size; it was
selected manually.
Increase it to 256 as this aligns more with other hardware, and this is
the minimum requirement for Vulkan 1.4.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42212>
Currently we shift the viewport as an implementation of FRONT_AND_BACK
culling mode.
However, as culling should only take effect on triangles, this shift
should only be applied when the active rasterizing primitive is
triangles.
Check the primitive topology before applying the viewport shift.
This fixes the new Vulkan CTS test
`dEQP-VK.glsl.builtin_var.frontfacing.add_ubo_load.{point,line}_list.front_and_back`
introduced in CTS 1.4.6.0.
Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com>
Reviewed-by: Frank Binns <frank.binns@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42164>
Now that we have a unified layout for timestamp, we can implement
timestamp writes on DMA and Compute sub channels.
This also exposes timestamps on non-graphics queues.
Signed-off-by: Mary Guillemard <mary@mary.zone>
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42208>