To be shared with radeonsi.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35931>
This hasn't been reproducible because RADV and GLSL always lower
non-constant slot and vertex indexing of GS inputs, but we'll stop
lowering it.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36018>
Similar to nir_lower_alu_width(), the callback can return the
desired number of components for a phi, or 0 for no lowering.
The previous behavior of nir_lower_phis_to_scalar() with lower_all=true
can be elicited via nir_lower_all_phis_to_scalar() while the previous
behavior with lower_all=false now corresponds to nir_lower_phis_to_scalar()
with NULL callback.
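
The callback contract can be sketched roughly as follows. This is a
self-contained model, not the NIR API: phi_info, phi_width_cb,
example_width_cb and num_phis_after_lowering are hypothetical names, and
the NULL-callback case is deliberately not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vector phi; not a NIR type. */
struct phi_info {
   uint8_t num_components;
   uint8_t bit_size;
};

/* Returns the desired number of components per phi, or 0 for no lowering. */
typedef uint8_t (*phi_width_cb)(const struct phi_info *phi, const void *data);

/* Example policy (an assumption): scalarize 64-bit phis, split wider
 * vectors into vec2 pieces, and leave everything else alone.
 */
static uint8_t example_width_cb(const struct phi_info *phi, const void *data)
{
   (void)data;
   if (phi->bit_size == 64)
      return 1;
   if (phi->num_components > 2)
      return 2;
   return 0;
}

/* Number of phis that replace the original one; 1 means "unchanged". */
static unsigned num_phis_after_lowering(const struct phi_info *phi,
                                        phi_width_cb cb, const void *data)
{
   assert(cb != NULL); /* the NULL-callback behavior is not modeled here */

   uint8_t width = cb(phi, data);
   if (width == 0 || width >= phi->num_components)
      return 1;
   return (phi->num_components + width - 1) / width;
}
```

With this policy, a 32-bit vec4 phi would be split into two vec2 phis,
while a 32-bit vec2 phi would be left untouched.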
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35783>
This changes legacy GS outputs to use the same logic as NGG GS.
It enables the same optimizations that NGG has such as forwarding
constant GS output components to the GS copy shader at compile time.
ac_nir_gs_output_info is removed.
GS output info is no longer passed to ac_nir_lower_legacy_gs and
ac_nir_create_gs_copy_shader separately.
ac_nir_lower_legacy_gs now gathers ac_nir_prerast_out, generates GSVS ring
stores, and also generates the GS copy shader with GSVS ring loads.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
This way we won't have to pass output info between the two functions.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
This is a cleanup.
Old gs LDS layout: [es outputs][gs outputs][scratch]
Old nogs LDS layout: [xfb/cull][scratch]
New gs LDS layout: [es outputs][scratch|gs outputs]
New nogs LDS layout: [scratch|xfb/cull]
The LDS scratch is moved to the beginning of the preceding buffer in LDS,
while the addresses in that LDS buffer are offset by the scratch size.
It effectively merges the LDS scratch with the preceding buffer in LDS.
Thanks to that, we no longer need the ngg_scratch ABI and the offset
in a user SGPR.
The lowering passes now return the LDS scratch size, which is used
by the drivers to determine the final LDS size.
The ngg_lds_layout SGPR is now unused without GS in RADV.
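
A minimal sketch of the new address computation, assuming sizes in bytes
and ignoring alignment. The struct and function names are hypothetical and
do not exist in ac_nir:

```c
#include <stdint.h>

/* Hypothetical description of one merged LDS region. */
struct lds_layout {
   uint32_t scratch_base; /* where the scratch begins */
   uint32_t data_base;    /* where the GS outputs or xfb/cull data begin */
   uint32_t total_size;   /* final LDS size */
};

/* preceding_size: size of [es outputs] (0 without GS)
 * scratch_size:   LDS scratch size returned by the lowering pass
 * data_size:      size of [gs outputs] or [xfb/cull]
 */
static struct lds_layout merge_scratch(uint32_t preceding_size,
                                       uint32_t scratch_size,
                                       uint32_t data_size)
{
   struct lds_layout l;

   /* The scratch sits at the beginning of the buffer it is merged with. */
   l.scratch_base = preceding_size;
   /* Addresses in that buffer are offset by the scratch size. */
   l.data_base = l.scratch_base + scratch_size;
   l.total_size = l.data_base + data_size;
   return l;
}
```

Because the scratch location follows from the buffer base alone, no
separate scratch offset has to be passed to the shader in this model.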
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
We incorrectly used it to determine whether the shader should cull, which
luckily had no effect because it wasn't used everywhere.
cull_clipdist_mask should be used instead, which also reflects whether
clip planes are enabled in GL.
clip_cull_dist_mask is renamed to export_clipdist_mask to make it clear.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
This removes LDS space and loads/stores for constant GS & XFB output
components. Constant output components skip LDS stores, and LDS loads
are replaced with the gathered constants.
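
The idea can be sketched as below. This is only an illustration: out_slot
and both helpers are hypothetical names, one dword per component is
assumed, and the real pass operates on NIR rather than on plain arrays.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-slot info gathered before lowering. */
struct out_slot {
   uint8_t written_mask;    /* components written by the shader */
   uint8_t const_mask;      /* written components with a known constant */
   uint32_t const_value[4]; /* valid where const_mask is set */
};

static unsigned bitcount8(uint8_t v)
{
   unsigned n = 0;
   for (; v; v &= (uint8_t)(v - 1))
      n++;
   return n;
}

/* Only non-constant written components occupy LDS. */
static unsigned lds_components(const struct out_slot *s)
{
   return bitcount8(s->written_mask & (uint8_t)~s->const_mask);
}

/* Loading a component: constants are forwarded, the rest is read from the
 * compacted LDS storage of the slot.
 */
static uint32_t load_component(const struct out_slot *s, const uint32_t *lds,
                               unsigned comp)
{
   assert(comp < 4 && (s->written_mask & (1u << comp)));

   if (s->const_mask & (1u << comp))
      return s->const_value[comp];

   uint8_t in_lds = s->written_mask & (uint8_t)~s->const_mask;
   uint8_t below = in_lds & (uint8_t)((1u << comp) - 1);
   return lds[bitcount8(below)];
}
```

For a vec4 output whose x and z components are constant, only y and w
would take LDS space in this model.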
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
This simplifies the code and scalarizes the loads/stores.
Scalar loads/stores will allow forwarding constant output components
from stores to loads easily.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
This simplifies the code and scalarizes the loads/stores.
Scalar loads/stores will allow forwarding constant output components
from stores to loads easily.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35352>
This switches the code to the new slot offsets from ac_nir_prerast_out
instead of using a prefix bitmask over outputs_written.
The LDS layout no longer includes these:
- GS: output components that are not written by GS
- VS/TES+XFB: output components that are not written by XFB
- VS/TES+XFB: slots that are not written by XFB (this could be significant)
This is also a cleanup because it unduplicates the bitcounts.
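
The difference between the two schemes can be illustrated with the sketch
below. It assumes that the old scheme reserved a full vec4 (16 bytes) for
every written slot and that the new offsets pack only written components
(4 bytes each). All names are hypothetical and are not the fields of
ac_nir_prerast_out.

```c
#include <stdint.h>

#define NUM_SLOTS 64

static unsigned bitcount64(uint64_t v)
{
   unsigned n = 0;
   for (; v; v &= v - 1)
      n++;
   return n;
}

/* Old scheme (assumed): count the written slots below "slot" with a prefix
 * bitmask; every written slot takes a full vec4. Requires slot < 64.
 */
static uint32_t prefix_slot_offset(uint64_t outputs_written, unsigned slot)
{
   uint64_t below = outputs_written & ((UINT64_C(1) << slot) - 1);
   return bitcount64(below) * 16;
}

/* New scheme (sketch): precompute a packed offset per slot from the mask
 * of components that are actually stored. Returns the total size in bytes.
 */
static uint32_t build_packed_offsets(const uint8_t comp_mask[NUM_SLOTS],
                                     uint32_t offsets[NUM_SLOTS])
{
   uint32_t size = 0;
   for (unsigned slot = 0; slot < NUM_SLOTS; slot++) {
      offsets[slot] = size;
      size += bitcount64(comp_mask[slot] & 0xf) * 4;
   }
   return size;
}
```

With slots 0, 5 and 7 storing 4, 2 and 1 components, the packed layout
takes 28 bytes in this model, while the prefix-bitmask layout takes 48.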
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
instead of computing it separately. This is better because
ac_nir_lower_ngg_gs knows the final LDS size anyway, and it will be
easier to modify the size calculation this way.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
instead of computing it separately. This is better because
ac_nir_lower_ngg_nogs knows the final LDS size anyway, and it will be
easier to modify the size calculation this way.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
We need to gather outputs before lowering because lowering requires that we
know the LDS vertex stride, so that we can lower output stores to LDS
stores.
The pass will determine the LDS vertex stride, not drivers.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
This will be used to reduce the NGG LDS size for uncompacted GS and XFB
outputs.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
This enables emulating clip planes without ClipVertex via clip distances
(max 8) instead of the fixed-func hw (max 6 planes).
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
Streamout will require prerast info, which is gathered by
lower_ngg_gs_intrinsics.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
This is just code reordering (position exports should be at the end for
performance).
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
It has no effect, but the extra export instructions are unnecessary, and
we can't gather the effective number of position exports from NIR if we
insert incorrect exports.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35351>
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Fixes: b49eab68a8 ("ac/nir: use s_sendmsg(HS_TESSFACTOR) to optimize writing tess factors for gfx11")
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35489>
This moves per-patch output VMEM stores to the end of the shader where they
execute only once. They are skipped if the whole workgroup discards
all patches.
If tcs_vertices_out == 1, per-patch output VMEM stores use the same lanes
as per-vertex output VMEM stores, which are aligned to 4 or 8 lanes to get
cached bandwidth for the stores.
Previously, per-patch outputs were stored to memory for every store_output
intrinsic in TCS.
Additionally, LDS is no longer allocated for per-patch outputs without
indirect indexing that are either written and read only by invocation 0,
or written by all invocations but never read. This reduces LDS usage and
LDS traffic.
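
The LDS allocation rule for per-patch outputs can be written down as a
predicate. This is a sketch with hypothetical types; it assumes that an
output written only by invocation 0 and never read also needs no LDS, and
that every case not listed above still gets LDS.

```c
#include <stdbool.h>

enum patch_output_writers {
   WRITTEN_BY_INVOCATION0_ONLY,
   WRITTEN_BY_ALL_INVOCATIONS,
   WRITTEN_BY_SOME_INVOCATIONS,
};

enum patch_output_readers {
   NEVER_READ,
   READ_BY_INVOCATION0_ONLY,
   READ_BY_ANY_INVOCATION,
};

/* Hypothetical summary of how a TCS accesses one per-patch output. */
struct patch_output_usage {
   enum patch_output_writers writers;
   enum patch_output_readers readers;
   bool indirect_indexing;
};

static bool patch_output_needs_lds(const struct patch_output_usage *u)
{
   /* Indirect indexing always keeps the output in LDS. */
   if (u->indirect_indexing)
      return true;

   /* Written and read only by invocation 0: the value can stay in
    * registers of that invocation.
    */
   if (u->writers == WRITTEN_BY_INVOCATION0_ONLY &&
       u->readers != READ_BY_ANY_INVOCATION)
      return false;

   /* Written by all invocations but never read. */
   if (u->writers == WRITTEN_BY_ALL_INVOCATIONS && u->readers == NEVER_READ)
      return false;

   return true;
}
```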
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34780>
This unifies the duplicated LDS output patch size computation between
hs_output_lds_offset and ac_nir_compute_tess_wg_info.
Adding 4 to the output patch stride minimizes LDS bank conflicts by making
each patch start on a different LDS bank.
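
The effect of the padding can be checked with a small model. The sketch
assumes 32 LDS banks that are 4 bytes wide and a stride expressed in bytes.
These parameters are assumptions for illustration, not values taken from
the code.

```c
#include <stdint.h>

#define LDS_NUM_BANKS  32u /* assumed */
#define LDS_BANK_BYTES 4u  /* assumed */

static unsigned lds_bank(uint32_t byte_addr)
{
   return (byte_addr / LDS_BANK_BYTES) % LDS_NUM_BANKS;
}

/* Counts how many different banks the first num_patches patches start on. */
static unsigned distinct_start_banks(uint32_t patch_stride,
                                     unsigned num_patches)
{
   uint32_t seen = 0;
   unsigned count = 0;

   for (unsigned i = 0; i < num_patches; i++) {
      uint32_t bit = UINT32_C(1) << lds_bank(i * patch_stride);
      if (!(seen & bit)) {
         seen |= bit;
         count++;
      }
   }
   return count;
}
```

In this model, a 256-byte stride makes all patches start on the same bank,
while a 260-byte stride moves each consecutive patch to the next bank.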
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34780>
Checking whether every component is valid in tess_level_has_effect() when
prim_mode is unknown generated too many SALU instructions. Do this instead:
   if (triangles) ...
      subgroup vote for triangles
   else if (quads) ...
      subgroup vote for quads
   else // isolines
      subgroup vote for isolines
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34780>
This rewrites tess level value tracking to use the 2-bit masks, which
means LDS allocation is determined separately for outer and inner levels.
LDS is not allocated for tess levels that are only written by invocation 0
and never read or only read by invocation 0. If the number of output
patch vertices is 1, LDS is also not allocated for tess levels.
Tess level outputs for TES are always written as whole vec4 to get cached
bandwidth.
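
A sketch of how such 2-bit masks could drive the LDS decision. The bit
assignment and all names are hypothetical, and the sketch assumes that LDS
is allocated in every case that the rules above do not exclude.

```c
#include <stdint.h>

/* Hypothetical bit assignment for the 2-bit masks. */
enum {
   TESS_LEVEL_OUTER = 1 << 0,
   TESS_LEVEL_INNER = 1 << 1,
};

/* Hypothetical gathered usage, one bit per tess level output. */
struct tess_level_usage {
   uint8_t written_by_other_invocations; /* written by invocation != 0 */
   uint8_t read_by_other_invocations;    /* read by invocation != 0 */
};

/* Returns the 2-bit mask of tess levels that need LDS. */
static uint8_t tess_levels_needing_lds(const struct tess_level_usage *u,
                                       unsigned tcs_vertices_out)
{
   /* With a single output patch vertex there is only one invocation per
    * patch, so nothing has to be shared through LDS.
    */
   if (tcs_vertices_out == 1)
      return 0;

   /* A level that is written only by invocation 0 and never read by any
    * other invocation can skip LDS. Outer and inner are decided separately.
    */
   return (u->written_by_other_invocations | u->read_by_other_invocations) &
          (TESS_LEVEL_OUTER | TESS_LEVEL_INNER);
}
```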
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34780>