Here we speedup nir_find_inlinable_uniforms() by making sure we only
check a src is inlinable once.
If we have a bunch of nested if-statements where the conditions keep
building on the alu chains of previous conditions we can end up
with exponential processing times due to repeatedly processing the
same srcs over and over.
A big cause of the exponential grow seems to be instructions like
`ffma %594, %594, %599` or `fmul %600, %600` where each essentially
causes us to process the entire previous part of the chain
twice.
Shaders such as that in issue #14663 took multiple minutes to
compile previously, calling collect_src_uniforms billions of times
and now compile within a second with this change.
Closes: mesa/mesa#14663
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39664>
This eliminates expensive div, mod, rem opcodes with non-constant src1 being
constant src1 hiding behind bcsel.
gcc and LLVM are missing this.
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39560>
Removes the runtime code for this, and means we propergate the
signed zero/inf/nan checks to subexpessions too, not just exact.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39616>
The properly terminated regex automatically detects this case now.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39586>
For mesh/task shaders, the thread payload provides a local invocation
index, but it's always linear so it doesn't give the correct value when
quad derivatives are in use.
The lowering pass where all of this is done correctly for compute
shaders assumes load_local_invocation_index will be lowered in the
backend for mesh/task, calculates the values for the quads correctly but
then avoid replacing the original intrinsic and we remain with the wrong
results.
Add an intel specific intrinsic and always lower the generic one to that
(or whatever else was calculated) to avoid ambiguities and fix the value
for quad derivatives.
Fixes future CTS tests using mesh/task shaders under:
dEQP-VK.spirv_assembly.instruction.compute.compute_shader_derivatives.*
Fixes: d89bfb1ff7 ("intel/brw: Reorganize lowering of LocalID/Index to handle Mesh/Task")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39276>
By default, load/store vectorization uses nir_round_up_components()
to round up loads and possibly writemasked stores to the next valid
NIR vector width. However, some backends may not support load/stores
at all sizes. For example, older Intel supports only power-of-two
vector widths. Newer Intel also supports vec2 and vec3, but not
vec5/6/7. By providing a callback, backends can request promotion
to their next supported memory load/store vector width.
The existing "should we vectorize?" callback should continue to return
false for unsupported vector widths (i.e. beyond the maximum supported).
With this new callback, they do not need to say "no" to vectorization
that would normally produce an unsupported count (e.g. vec5/6/7) but
instead request that the component count be rounded up appropriately.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
This adds a new option, round_up_store_components, which rounds up the
number of components for stores that support writemasking to the next
valid vector size. For example, vec4+vec2 stores would round up from
6 components (which wouldn't be supported) to a full supportable vec8
store, relying on writemasking to ensure the correct pieces are written.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
URB intrinsics are simply memory load/stores to a special memory region,
so it's pretty reasonable to handle these in the memory vectorizer. We
treat emit_vertex_* intrinsics as a barrier for shader outputs.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Acked-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
This makes it easier for NIR passes to distinguish between inputs and
outputs without having to reason about which URB handle source was
passed to the intrinsic. It probably also makes it a bit easier for
humans to read the NIR too.
v2: Don't add memory mode to store intrinsics. It's always output.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
We were initializing to a nir_const_value of undefined (in practice on x86
builds, a pointer value), with .b set to 0. Those values would get dumped
in the annotated shader disassembly at the end of a test where all inputs
where unexpectedly skipped, producing very surprising output.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39369>