This fixes the compute blitter with compression in the general case, and then
flips the switch since the compute blitter is faster / less buggy than the
traditional path.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30633>
I don't know what Apple calls these, so we're using the name "explicit
coordinates".
AGX has instructions for loads/stores between registers <---> tilebuffer --->
storage images. Usually these are used in the fragment shader and end-of-tile
shader to implement colour attachments, with implicitly specified coordinates
based on the shader stage. However, they can also be used in compute shaders
with explicitly specified coordinates ("imageblocks" in Apple parlance). Model
this in NIR.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30633>
for preambles and for peephole selection.
total instructions in shared programs: 2159359 -> 2114124 (-2.09%)
instructions in affected programs: 359763 -> 314528 (-12.57%)
helped: 814
HURT: 6
Instructions are helped.
total alu in shared programs: 1685059 -> 1670200 (-0.88%)
alu in affected programs: 217210 -> 202351 (-6.84%)
helped: 589
HURT: 45
Alu are helped.
total fscib in shared programs: 1681202 -> 1666324 (-0.88%)
fscib in affected programs: 217477 -> 202599 (-6.84%)
helped: 590
HURT: 45
Fscib are helped.
total ic in shared programs: 460856 -> 455502 (-1.16%)
ic in affected programs: 41350 -> 35996 (-12.95%)
helped: 174
HURT: 8
Ic are helped.
total bytes in shared programs: 14302484 -> 14053982 (-1.74%)
bytes in affected programs: 2380614 -> 2132112 (-10.44%)
helped: 814
HURT: 7
Bytes are helped.
total regs in shared programs: 662302 -> 656517 (-0.87%)
regs in affected programs: 26979 -> 21194 (-21.44%)
helped: 432
HURT: 9
Regs are helped.
total uniforms in shared programs: 1651909 -> 1687077 (2.13%)
uniforms in affected programs: 95383 -> 130551 (36.87%)
helped: 17
HURT: 783
Uniforms are HURT.
total threads in shared programs: 20324608 -> 20326592 (<.01%)
threads in affected programs: 16192 -> 18176 (12.25%)
helped: 17
HURT: 3
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30633>
Incompatible changes:
- Make VM layout more flexible to allow for SVM with rusticl
(eventually, hopefully)
Compatible changes:
- Expose soft fault state to userspace as a flag
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30633>
honeykrisp wants to do this explicitly so we don't need prologs for TES. the gl
driver uses TES prologs implicitly for the same effect, but that's ...
suboptimal.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30382>
the calculation of workgroup reductions was wrong, giving nondeterministic
results when prefix summing >= 1024 items. fixes misrendering in
terraintessellation on honeykrisp.
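For illustration only (this is not the driver code, and the details of the
actual bug are not reproduced here), a CPU-side sketch of a prefix sum that
crosses chunk boundaries. CHUNK, scan_chunk and prefix_sum are hypothetical
names; CHUNK stands in for one workgroup's worth of items. The point is that
each chunk must add the reduction of every earlier chunk:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk size standing in for one workgroup's worth of items. */
#define CHUNK 1024

/* Inclusive prefix sum of one chunk; returns the chunk's reduction (total). */
static uint32_t
scan_chunk(uint32_t *data, size_t count)
{
   uint32_t sum = 0;
   for (size_t i = 0; i < count; ++i) {
      sum += data[i];
      data[i] = sum;
   }
   return sum;
}

/* In-place inclusive prefix sum over n items, one chunk at a time. */
static void
prefix_sum(uint32_t *data, size_t n)
{
   assert(data != NULL || n == 0);

   uint32_t carry = 0; /* reduction of all chunks processed so far */
   for (size_t base = 0; base < n; base += CHUNK) {
      size_t count = (n - base < CHUNK) ? (n - base) : CHUNK;
      uint32_t total = scan_chunk(data + base, count);

      /* Add the reduction of all earlier chunks to this chunk. */
      for (size_t i = 0; i < count; ++i)
         data[base + i] += carry;

      carry += total;
   }
}
```

With 2500 ones as input, the item at index 1024 (the first item of the second
chunk) only comes out as 1025 if the carry from the first chunk is right.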
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30382>
Instead of having a hardcoded list of endian-independent format aliases
in the header, generate them from the format definitions.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29649>
This was an obnoxious bit of cheating we had in the gl4.6 driver that I added
literally the morning I passed gl4.6 cts, just to fix my last gl4.6 cts test.
It had an expiration date.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30051>
Add OpenCL kernels implementing the tessellation algorithm on device. This is an
OpenCL C port of the D3D11 reference tessellator, originally written by
Microsoft in C++. There are significant differences compared to the CPU based
reference implementation:
* significant simplifications and cleanup. The reference code did a lot of
  things in weird ways that would be inefficient on the GPU. I did a *lot* of
  work here to get good AGX assembly generated for the tessellation kernels ...
  the first attempts were quite bad! Notably, everything is carefully written
  to ensure that all private memory access is optimized out in NIR; the
  resulting kernels do not use scratch and do not spill on G13.
* prefix sum variants. To implement geom+tess efficiently, we need to first
  calculate the count of indices generated by the tessellator, then prefix sum
  that, then tessellate using the prefix sum results, writing into 1 large
  index buffer for a single indirect draw. This isn't too bad: we already have
  most of the logic, and the guts of the prefix sum kernel are shared with
  geometry shaders.
* VDM generation variant. To implement tess alone, it's fastest to generate a
  hardware Index List word for each patch, adding an appropriate 32-bit index
  bias to the dynamically allocated U16 index buffers. Then from the CPU, we
  have the illusion of a single draw to Stream Link with Return to. This
  requires packing hardware control words from the tessellator kernel.
  Fortunately, we have GenXML available so we just use agx_pack like we would
  in the driver.
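As a rough CPU-side sketch of the count / prefix sum / tessellate flow in the
prefix sum variant (not the actual kernels): every name below is hypothetical,
and emit_patch_indices writes placeholder values where the real tessellator
would write indices. The exclusive prefix sum of the per-patch counts gives
each patch its offset into the one shared index buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the tessellator: writes `count` placeholder
 * index values for the given patch. */
static void
emit_patch_indices(uint32_t patch, uint32_t count, uint32_t *out)
{
   for (uint32_t i = 0; i < count; ++i)
      out[i] = patch * 1000 + i;
}

/* Exclusive prefix sum: offsets[p] = counts[0] + ... + counts[p - 1].
 * Returns the total index count, i.e. the size of the shared buffer. */
static uint32_t
exclusive_scan(const uint32_t *counts, uint32_t *offsets, size_t n)
{
   uint32_t sum = 0;
   for (size_t p = 0; p < n; ++p) {
      offsets[p] = sum;
      sum += counts[p];
   }
   return sum;
}

/* Second pass: each patch writes at its own offset in the shared buffer,
 * so the result can be consumed by a single draw. */
static void
tessellate_all(const uint32_t *counts, const uint32_t *offsets, size_t n,
               uint32_t *index_buffer, uint32_t capacity)
{
   for (size_t p = 0; p < n; ++p) {
      assert(offsets[p] + counts[p] <= capacity);
      emit_patch_indices((uint32_t)p, counts[p], index_buffer + offsets[p]);
   }
}
```

For counts {3, 0, 2}, the offsets are {0, 3, 3} and the shared buffer holds 5
indices. A patch that generates nothing takes no space.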
Along the way, we pick up indirect tess support (this follows on naturally),
which gets rid of the other bit of tessellation-related cheating. Implementing
this requires reworking our internal agx_launch data structures, but that has
the nice side effect of speeding up GS invocations too (by fixing the workgroup
size).
Don't get me wrong. tessellator.cl is the single most unhinged file of my
career, featuring GenXML-based pack macros fed by dynamic memory allocation fed
by the inscrutable tessellation algorithm.
But it works *really* well.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30051>