It's a stride of 1 output, which isn't 16. It's 16 * num_threads,
aligned to 256.
tcs_offchip_layout has 5 unused bits, so let's use them.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34780>
Not strictly scaling, but we upcast fo fp32, do the fdiv there and cast
back again.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34053>
no_varying cannot be used to eliminate stores on locations which may
be subsequently read
Fixes: 0058989357 ("nir/lower_io_to_scalar: don't create output stores that have no effect")
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35325>
I think this was unlikely to cause issues, even if the qsort()
implementation is unstable.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35255>
Fixes "left shift of negative value -128" with parallel_rdp/00f93a9497dfbb3b
and UBSan.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35255>
Without this cast, the left shift is promoted to 'int'.
Fixes "left shift of 50432 by 16 places cannot be represented in type 'int'"
with horizon_zero_dawn/001064f580f8e3be and UBSan.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35255>
Kepler cards do not support shared atomic operations directly, but they
have special ldslk and stsul that can implement mutex locks on
addresses. Shared atomics can be lowered into operations in mutexes.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35028>
We still allow mesh shader promote constant output to flat, but
mesh shader like geometry shader may store multi vertices'
varying in a single thread. So mesh shader may store different
constant values to different vertices in a single thread, we
should not promote this case to flat.
I'm not using shader_info.mesh.ms_cross_invocation_output_access
because OpenGL does not require IO to have explicit location, so
when nir_shader_gather_info is called in OpenGL GLSL compiler to
compute ms_cross_invocation_output_access, some implicit output
has -1 location which causes ms_cross_invocation_output_access
unset for it.
Cc: mesa-stable
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13134
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35081>
This functionality is useful with the software fp64
implementation. It allows running the remaining
tests.
Note: the same tests do not generate this indirect
access on cayman which has the hardware fp64
implementation enabled.
This change was tested on cypress, palm and barts.
Here are the tests fixed:
spec/arb_gpu_shader_fp64/execution/gs-isnan-dvec: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-array-copy: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-dmat4: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-dmat4-row-major: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-double-array-const-index: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-double-array-variable-index: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-double-bool-double: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-double-uniform-array-direct-indirect: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-doubles-float-mixed: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-dvec4-uniform-array-direct-indirect: fail pass
spec/arb_gpu_shader_fp64/uniform_buffers/gs-nested-struct: fail pass
Signed-off-by: Patrick Lerda <patrick9876@free.fr>
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34926>
These were initially observed in Hogwarts Legacy while working on
something else entirely. Two compute shaders in that app are helped
for spills and fills. On Skylake, one of the shaders benefits from
this change, and the other is hurt pretty significantly.
About 40 vertex shaders in Shadow of the Tomb Raider were helped for
instructions.
v2: Use ~0xff instead of 0xffffff00 to ensure the patterns will work
properly with all bit sizes. Noticed by Georg.
v3: No, really, fix the various errors to ensure the patterns will work
properly with all bit sizes. Noticed by Georg.
No shader-db changes on any Intel platform.
fossil-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake)
Totals:
Instrs: 210566294 -> 210561118 (-0.00%)
Cycle count: 31582309052 -> 31576352808 (-0.02%); split: -0.02%, +0.00%
Spill count: 519300 -> 519280 (-0.00%)
Fill count: 625181 -> 625161 (-0.00%)
Scratch Memory Size: 36289536 -> 36281344 (-0.02%)
Max live registers: 66068413 -> 66068161 (-0.00%)
Non SSA regs after NIR: 60230773 -> 60230775 (+0.00%)
Totals from 1662 (0.24% of 707082) affected shaders:
Instrs: 635064 -> 629888 (-0.82%)
Cycle count: 36549632 -> 30593388 (-16.30%); split: -16.43%, +0.14%
Spill count: 246 -> 226 (-8.13%)
Fill count: 280 -> 260 (-7.14%)
Scratch Memory Size: 16384 -> 8192 (-50.00%)
Max live registers: 178491 -> 178239 (-0.14%)
Non SSA regs after NIR: 169552 -> 169554 (+0.00%)
Tiger Lake
Totals:
Instrs: 238544730 -> 238539407 (-0.00%)
Cycle count: 23679446097 -> 23673238578 (-0.03%); split: -0.03%, +0.00%
Max live registers: 42494925 -> 42494799 (-0.00%)
Non SSA regs after NIR: 63639071 -> 63639074 (+0.00%)
Totals from 1662 (0.21% of 802704) affected shaders:
Instrs: 626604 -> 621281 (-0.85%)
Cycle count: 26444363 -> 20236844 (-23.47%); split: -23.50%, +0.02%
Max live registers: 95405 -> 95279 (-0.13%)
Non SSA regs after NIR: 181150 -> 181153 (+0.00%)
Ice Lake
Totals:
Instrs: 238855310 -> 238826534 (-0.01%)
Cycle count: 24952257277 -> 24944589398 (-0.03%); split: -0.03%, +0.00%
Spill count: 575510 -> 575117 (-0.07%)
Fill count: 713007 -> 708632 (-0.61%)
Max live registers: 42499556 -> 42499432 (-0.00%)
Non SSA regs after NIR: 64388747 -> 64388750 (+0.00%)
Totals from 1662 (0.21% of 805149) affected shaders:
Instrs: 926887 -> 898111 (-3.10%)
Cycle count: 67025583 -> 59357704 (-11.44%); split: -11.45%, +0.01%
Spill count: 5168 -> 4775 (-7.60%)
Fill count: 32883 -> 28508 (-13.30%)
Max live registers: 95614 -> 95490 (-0.13%)
Non SSA regs after NIR: 181150 -> 181153 (+0.00%)
Skylake
Totals:
Instrs: 161904416 -> 161895239 (-0.01%); split: -0.01%, +0.00%
Cycle count: 20098067714 -> 20090767583 (-0.04%); split: -0.04%, +0.00%
Spill count: 525546 -> 525789 (+0.05%); split: -0.04%, +0.09%
Fill count: 603369 -> 602276 (-0.18%); split: -0.28%, +0.10%
Max live registers: 33895714 -> 33895590 (-0.00%)
Non SSA regs after NIR: 57348729 -> 57348730 (+0.00%)
Totals from 1655 (0.25% of 653734) affected shaders:
Instrs: 769979 -> 760802 (-1.19%); split: -1.83%, +0.64%
Cycle count: 51365416 -> 44065285 (-14.21%); split: -14.22%, +0.01%
Spill count: 4186 -> 4429 (+5.81%); split: -4.90%, +10.70%
Fill count: 16356 -> 15263 (-6.68%); split: -10.50%, +3.82%
Max live registers: 95115 -> 94991 (-0.13%)
Non SSA regs after NIR: 180797 -> 180798 (+0.00%)
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34905>
Some shaders contain back-to-back atomic accesses in SPIR-V with
AcquireRelease semantics. In NIR, we translate these to a release
memory barrier, the atomic, then an acquire memory barrier.
This results in a lot of unnecessary memory barriers in the middle
of the sequence of atomics:
0. Release barrier
1. Atomic
2. Acquire barrier
3. Release barrier
4. Atomic
5. Acquire barrier
6. Release barrier
7. Atomic
8. Acquire barrier
In the absence of loads/stores, and when the atomic destinations are
unused, these barriers in-between atomics shouldn't be required.
This optimization pass would drop them (lines 2-3 and 5-6 above) while
leaving the first and last barriers (0 and 8), so the sequence remains
synchronized against other access elsewhere in the program.
One common example where this occurs is a sequence of min and max
atomics to clamp a certain memory location's value within a range.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33504>
for specifiers like %v4f, we need to store the whole vector. u_printf can
already handle this from OpenCL, we just need to match that here.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34909>