NIR is actually pretty good at optimizing UBO, SSBO, and shared memory
access but in order to do so, we actually have to run the optimizations
before we lower it all. Same for I/O. By doing all our lowering in
panvk before we ever run the optimization loop, we risk hampering it
significantly.
Ignoring loop changes (several get unrolled now), fossil-db on Sascha
Willems demos and a few others looks lik
Instrs: 189054 -> 187802 (-0.66%); split: -0.67%, +0.01%
CodeSize: 1756160 -> 1747072 (-0.52%); split: -0.52%, +0.01%
Estimated normalized CVT cycles: 771.367106999997 -> 766.0311719999971 (-0.69%); split: -1.05%, +0.36%
Estimated normalized SFU cycles: 1407.21875 -> 1406.9375 (-0.02%); split: -0.03%, +0.01%
Estimated normalized Load/Store cycles: 17477.0 -> 16917.0 (-3.20%)
Maximum number of threads: 1257 -> 1213 (-3.50%); split: +0.08%, -3.58%
Number of hardware loops: 283 -> 278 (-1.77%)
Totals from 186 (19.81% of 939) affected shaders:
Instrs: 102588 -> 101336 (-1.22%); split: -1.23%, +0.01%
CodeSize: 834432 -> 825344 (-1.09%); split: -1.10%, +0.02%
Estimated normalized CVT cycles: 463.226562 -> 457.890627 (-1.15%); split: -1.74%, +0.59%
Estimated normalized SFU cycles: 1021.84375 -> 1021.5625 (-0.03%); split: -0.05%, +0.02%
Estimated normalized Load/Store cycles: 8425.0 -> 7865.0 (-6.65%)
Maximum number of threads: 334 -> 290 (-13.17%); split: +0.30%, -13.47%
Number of hardware loops: 63 -> 58 (-7.94%)
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayern@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38334>
We need to lower outputs to get rid of output reads and so that we can
fix up layer writes on Bifrost. However, there's really no point in
lowering reads besides moving them to the top. Even then, NIR can
probably copy propagate the copies and we'll end up reading straight
from the input variable anyway.
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayern@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38334>
Neither nir_lower_io() nor nir_lower_indirect_derefs() know what to do
with copy_deref so we need to get rid of those first. Also, there are
some NIR passes which can insert more copy_deref or propagate an
indirect load to the I/O variable so we want to lower those away right
before lowering I/O.
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayern@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38334>
These two passes are a prerequisite for basically anything that
optimizes on variables.
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayern@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38334>
GTK is missing a semaphore between QueueSubmit() and QueuePresent()
causing the WSI submit to be "unordered" and to immediately signal the
semaphores (because it's missing a wait semaphore in QueuePresent()).
The workaround is to disable unordered WSI submits until GTK fixes it
properly.
Cc: "25.3"
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14087
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38351>
Xe2 uses byte offsets rather than OWord offsets. We've been storing the
per-slot offsets in bytes on Xe2 for a while, but kept the global offset
immediate in OWords for some reason, choosing to lower it during logical
send lowering.
This patch makes both offsets (global immediate, per-slot) in the same
units, so they could be added together if necessary without scaling.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38343>
Much clearer, especially since we're dealing with at least four
different kinds of intrinsics. These helpers were introduced years ago,
but probably didn't exist when we first wrote this code.
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38343>
I noticed that our backend was completely ignoring writemasks, despite
them appearing on many of the intrinsics we're implementing.
Rhys Perry pointed out that nir_lower_mem_access_bitsizes is removing
all non-trivial writemasking today, so ssbo/global/shared/scratch/etc.
stores should only ever see all components enabled. Which means what
we're doing is legitimate, if non-obvious. Add an assert to make it
obvious.
Thanks a lot to Rhys for helping me rediscover what made this work.
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38343>
The backend has been fully ignoring all writemasks for a long time,
so it really doesn't make sense to have them on our custom intrinsics.
I'm not sure they even make sense for some of the block intrinsics.
Also, the store_ssbo -> store_ssbo_intel pass was not setting writemask
at all, leaving it at the default value of 0 (aka write nothing, if it
had been respected...)
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38343>
Just compare against the size that was declared.
This is probably overkill. I couldn't figure out what vulkan says wrt
OOB access of shared memory. D3D however (which is very strict about
these things) says that for TGSM writes the entire contents of the TGSM
becomes undefined, for reads the result is undefined. Hence, rather
than masking out such accesses, to avoid the segfaults it would be
enough to just clamp the offsets to valid values.
nir doesn't seem easily able to tell us if an access is guaranteed
in-bound (unlike for ssbo access), so assume always potentially OOB.
v2: fix rusticl - for cl we don't know the shared size at compilation
time, this is only provided at launch_grid() time, the nir shader info
shared_size might be zero. Hence pass through the size via cs jit
context, there already actually was a member in there which looks
like it was intended for that (interestingly enough, the cs jit context
was actually unused, since resources are passed elsewhere nowadays).
Reviewed-by: Brian Paul <brian.paul@broadcom.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38307>
Use tristate for the aligned setting, otherwise it is always
first disabled which contributes to the condition if we set the
new stride active.
v2: set ByteStride in dword units and take secondary cmdbuf
in to account (Lionel)
Cc: mesa-stable
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Tested-by: Nataraj Deshpande <nataraj.deshpande@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38349>
More fallout from strict NIR validation but easy to fix. I hit this when
attempting to CTS changes for parent_instr.
Closes: #14245
Fixes: 2f6b4803ab ("nir/validate: expand IO intrinsic validation with nir_io_semantics")
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38356>
Release builds are noisy about flush_type and scope being used
uninitialized, even though they are always set.
Initialize them to the final else values to make GCC happy.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38357>
In ray tracing dispatch, we have dispatch.threads set to 0 since we
calculate the local_size_x/y/z based on the launch sizes.
This change takes 0 threads into an account and returh the TG size 8 in
such scenarios. Before this change, we were setting TG size to 2.
Fixes: 0c4e1c9efc ("intel/common: Add helper for compute thread group dispatch size")
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38229>
In some cases propagating through a bcsel may be harmful. If the bcsel
uses are unlikely to be eliminated in both branch of an if statement,
propagating through it may result in extra moves for phi instructions
and extended live ranges.
v2: Fix missing parameter in call. Noticed by Rhys. I fixed this on the
test machine, but I must have forgotten to propagate the change back to
my dev machine.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38321>
Rusticl reports `CL_DEVICE_VENDOR_ID` using the `vendor_id` property
defined in Panfrost. The value is not set so a `0` is reported
instead.
Initialise the value to `0x13B5`, which is Arm's PCI vendor ID.
Add the definition in `lib/pan_props.h` so it can be shared with
Gallium Lima, Panfrost and PanVK.
Signed-off-by: Ahmed Hesham <ahmed.hesham@arm.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38283>
With the current stack configuration the rv770 seems to be unable
to go beyond three with the "vs-output-array-float-index-wr-before-gs.shader_test"
test. Anyway, the value four seems to be sufficient for the other tests.
This issue was triggered on rv770, for instance, with:
"piglit/bin/shader_runner tests/spec/glsl-1.50/execution/variable-indexing/gs-output-array-float-index-wr.shader_test -auto -fbo"
"piglit/bin/shader_runner tests/spec/glsl-1.50/execution/variable-indexing/vs-output-array-float-index-wr-before-gs.shader_test -auto -fbo"
Fixes: 713edb5998 ("r600/sfn: handle the IF predicate in the scheduler")
Signed-off-by: Patrick Lerda <patrick9876@free.fr>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38213>