This will be useful for pulling constants in device-bound shaders. A64
addressing allows us to put the constants anywhere in memory.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32895>
This is similar to what we used to do on pre-SNB platforms: the number
of GRF registers used by the shader will be used on Xe3+ to adjust the
trade-off between thread-level parallelism and the size of the GRF
file. Plumb the value through prog_data so the driver can set up the
hardware state accordingly.
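A minimal sketch of the plumbing, with hypothetical names (grf_used,
xe3_registers_per_thread) and made-up thresholds standing in for the
real interface:

   /* Illustrative stand-in for the real brw_stage_prog_data, showing
    * where the new field would live. */
   struct brw_stage_prog_data {
      unsigned grf_used; /* GRF registers used by the shader */
   };

   /* Driver side: translate the GRF count into the per-thread register
    * file size programmed in hardware state (thresholds hypothetical). */
   static unsigned
   xe3_registers_per_thread(const struct brw_stage_prog_data *prog_data)
   {
      if (prog_data->grf_used <= 96)
         return 96;
      else if (prog_data->grf_used <= 128)
         return 128;
      else
         return 256;
   }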
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This is similar in principle to the previous commit "intel/brw/xe3+:
brw_compile_fs() implementation for Xe3+", but applied to compute-like
shader stages. It changes the implementation of brw_compile_cs/task/mesh()
to reduce compile time and to take advantage of wider dispatch modes
more aggressively than the original logic, since as of Xe3 SIMD32
builds succeed without spills in most cases thanks to VRT.
The new "optimistic" SIMD selection logic starts with the SIMD width
that is potentially highest performance and only compiles additional
narrower variants if that fails (typically due to spilling), while the
old "pessimistic" logic did the opposite: It started with the
narrowest SIMD width and compiled additional variants with increasing
register pressure until one of them failed to compile.
In typical non-spilling cases where we formerly compiled SIMD16 and
SIMD32 variants of the same compute shader, this change will halve the
number of backend compilations required to build it.
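The shape of the new selection loop, as a rough sketch with a
hypothetical try_compile_variant() helper standing in for the actual
backend entry points:

   #include <stdbool.h>

   /* Stand-in for one backend compile at a fixed dispatch width;
    * returns false when the build fails (e.g. due to spilling). */
   extern bool try_compile_variant(unsigned simd_width);

   /* Optimistic: start at the widest, potentially fastest width and
    * only fall back to narrower variants on failure, instead of
    * compiling every width from narrowest upwards. */
   static bool
   compile_optimistic(void)
   {
      static const unsigned simd_widths[] = { 32, 16, 8 };
      for (unsigned i = 0; i < 3; i++) {
         if (try_compile_variant(simd_widths[i]))
            return true; /* First success wins; nothing narrower needed. */
      }
      return false;
   }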
v2: Don't do this for now in cases with variable workgroup size; still
compile every possible variant in such cases until the effect on
runtime performance can be measured directly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
According to HSD 14016252163, if a compute shader uses the sample
operation, using Morton walk order and setting the thread group batch
size to 4 is expected to increase sampler cache hit rates by
increasing sample address locality within a subslice.
Rework:
* Caio: "||" => "&&" for type checking in instr_uses_sampler()
* Jordan: Use NIR's foreach macros rather than
  nir_shader_lower_instructions() (see the sketch below)
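Roughly what the reworked check looks like with NIR's iteration
macros; this is a sketch, not the exact upstream code:

   #include "nir.h"

   /* The "&&" is the type-check fix noted above: confirm the
    * instruction is a tex instruction before casting it. */
   static bool
   instr_uses_sampler(nir_instr *instr)
   {
      return instr->type == nir_instr_type_tex &&
             nir_tex_instr_need_sampler(nir_instr_as_tex(instr));
   }

   /* Walk the shader with NIR's foreach macros instead of going
    * through nir_shader_lower_instructions(). */
   static bool
   shader_uses_sampler(nir_shader *shader)
   {
      nir_foreach_function_impl(impl, shader) {
         nir_foreach_block(block, impl) {
            nir_foreach_instr(instr, block) {
               if (instr_uses_sampler(instr))
                  return true;
            }
         }
      }
      return false;
   }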
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32430>
This intrinsic was initially dedicated to mesh/task shaders, but the
mechanism it exposes also exists for compute shaders on Gfx12.5+.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31508>