It's like the layer, it has to be exported via the pos and also
as a varying if the fragment shader reads it.
Fixes dEQP-VK.draw.shader_viewport_index.fragment_shader_*
Cc: <mesa-stable@lists.freedesktop.org>
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4564>
When NGG is used, vertex and tess eval shaders are executed on the
hardware NGG geometry stage. There is a series of steps they
must perform:
* Request GS space using GS_ALLOC_REQ
* Export the primitive
* Finally, export the normal VS outputs
In this commit, two modes are implemented:
* "late" which matches what the RADV LLVM backend currently does
* "early" which is an optimized version as seen in radeonsi
Vulkan doesn't allow the shader to write the edge flags, so we can
currently always use the "early" mode.
Exporting the primitive ID is also supported by having the GS threads
write that into LDS and reading them from LDS in the ES threads.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3576>
ngg_vertex_gs and ngg_tess_eval_gs work very similarly to
vertex_vs and tess_eval_vs, but they run on the HW NGG GS stage.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3576>
The implementations here just clone i2b32 and i2b1. This means that
b2b32 doesn't technically generate true NIR 0/-1 booleans but it should
be fine as it's only ever generated for shared variable writes which
will always be consumed by something which will then run it through an
i2b again.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4338>
Clear the workgroup size for all supported shader stages.
Also, unify the workgroup size calculation accross various places.
As a result, insert_waitcnt can use the proper workgroup size
which means that some waits can be dropped from tessellation
shaders. Also, in cases where the previous calculation was wrong,
we now insert s_barrier instructions.
Totals from affected shaders (GFX10):
Code Size: 340116 -> 338484 (-0.48 %) bytes
Fixes: a8d15ab6da
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4165>
When TCS has an equal number of input and output, it means that the
number of VS and TCS invocations (LS and HS) are the same; and that
the HS invocations operate on the same vertices as the LS.
When this is the case, this commit removes the else-if between
the merged VS and TCS halves, making it possible to schedule
and optimize the code accross the two halves.
Totals:
SGPRS: 5577367 -> 5581735 (0.08 %)
VGPRS: 3958592 -> 3960752 (0.05 %)
Code Size: 254867144 -> 254838244 (-0.01 %) bytes
Max Waves: 1053887 -> 1053747 (-0.01 %)
Totals from affected shaders:
SGPRS: 29032 -> 33400 (15.05 %)
VGPRS: 35664 -> 37824 (6.06 %)
Code Size: 1979028 -> 1950128 (-1.46 %) bytes
Max Waves: 7310 -> 7170 (-1.92 %)
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4165>
There is no need to check whether they are written using indirect
indices, because all tess factors should be written to VMEM only
at the end of the shader.
No pipeline db changes.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4165>
Apparently needed for Stoney Ridge, Kabini and Mullins APUs.
gfx702 also has 16-bank LDS and https://llvm.org/docs/AMDGPUUsage.html
lists some dGPUs under there. Those GPUs seem to be Hawaii actually
(gfx701) and we don't seem to have gotten any interpolation related bugs
reported with them so far.
The late kill flag was tested by running pipeline-db with
ACO_DEBUG=validatera while setting late kill for SMEM buffer loads,
emit_vop2_instruction() and texture instructions. I also tested with
just setting the flag for v_interp_p1_f32.
As far as I know, the only other thing we have to consider for 16-bank LDS
is something to do with 16-bit interpolation. We don't do that yet.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3914>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3914>
Tessellation control shaders can have per-vertex inputs,
and both per-vertex and per-patch outputs. TCS can not only store,
but also load their outputs.
The TCS outputs are stored in RING_HS_TESS_OFFCHIP in VMEM, which
is where the TES reads them from. Additionally, the are also stored
in LDS to make sure they can be loaded fast when read by the TCS.
Tessellation factors are always just stored in LDS.
At the end of the shader, the first shader invocation reads these
from LDS and writes them to RING_HS_TESS_FACTOR in VMEM, and
additionally to RING_HS_TESS_OFFCHIP when they are read by
the Tessellation Evaluation Shader.
This implementation matches the memory layouts used by radv_nir_to_llvm.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3964>
v5: rebase on float_controls changes
v7: rebase after shader args MR and load/store vectorizer MR
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2421>
GS is the same on GFX6, but GFX6 isn't fully supported yet.
v4: fix regclass
v7: rebase after shader args MR
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2421>
v2: implement GFX10
v3: rebase
v7: rebase after shader args MR
v8: fix gs_vtx_offset usage on GFX9/GFX10
v8: use unreachable() instead of printing intrinsic
v8: rename output_state to ge_output_state
v8: fix formatting around nir_foreach_variable()
v8: rename some helpers in the scheduler
v8: rename p_memory_barrier_all to p_memory_barrier_common
v8: fix assertion comparing ctx.stage against vertex_geometry_gs
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2421>
Previously, it would only work when the ballot size was set to the
lane mask. This patch makes is possible to set the ballot size
to either 32-bit or 64-bit for wave32 mode.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Currently all usages of exec and vcc are hardcoded to use s2 regclass.
This commit makes it possible to use s1 in wave32 mode and
s2 in wave64 mode.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
ACO considers discards jumps and creates edges in the CFG for them but NIR
does neither of these.
This can be fixed instead by keeping track of whether a side of an IF had
a break/discard, but this doesn't solve the issue with discards affecting
loop exit phis. So this reworks phi handling a bit.
Fixes these tests:
dEQP-VK.graphicsfuzz.disc-and-add-in-func-in-loop
dEQP-VK.graphicsfuzz.loop-call-discard
dEQP-VK.graphicsfuzz.complex-nested-loops-and-call
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Moved to RADV. No pipeline-db changes.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
The former was always true and hence dead code. We will want to
explicitly declare the ring offset register with ACO, but we also want
to declare the scratch offset too, and we can't try to disable it since
ACO also supports spilling and the determination of whether spilling has
to happen occurs well after setting up registers. So replace
supports_spill with something that will actually be used for ACO.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>