We could add a nir_def_bit_size() helper but we use ->bit_size about 3x
as often as nir_dest_bit_size() today so that's a major Coccinelle
refactor anyway and this doesn't make it much worse. Most of this
commit was generated byt the following semantic patch:
@@
expression D;
@@
<...
-nir_dest_bit_size(D)
+D.ssa.bit_size
...
Some manual fixup was needed, especially in cpp files where Coccinelle
tends to give up the moment it sees any interesting C++.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24674>
sed + ninja clang-format + fix up spacing for common code.
If you are unhappy that I did not manually change the whitespace of your driver,
you need to enable clang-format for it so the formatting would happen
automatically.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24428>
a variable with a component offset may span multiple slots, and this cannot
be inferred from its type alone (e.g., compacted clip+cull distances)
cc: mesa-stable
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24163>
Purely from the backend point of view it's just an additional
parameter to sampler messages.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23882>
Now that we're using load/store_reg intrinsics, the previous checks for
registers aren't what we want. Instead, we need to be looking for a mov
or vec where both the destination and a source are load/store_reg with
matching decl_reg.
Fixes: b8209d69ff ("intel/fs: Add support for new-style registers")
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24310>
Instead of using 4 dwords for each output slot, use only the amount
of memory actually needed by each variable.
There are some complications from this "obvious" idea:
- flat and non-flat variables can't be merged into the same vec4 slot,
because flat inputs mask has vec4 stride
- multi-slot variables can have different layout:
float[N] requires N 1-dword slots, but
i64vec3 requires 1 fully occupied 4-dword slot followed by 2-dword slot
- some output variables occur both in single-channel/component split
and combined variants
- crossing vec4 boundary requires generating more writes, so avoiding them
if possible is beneficial
This patch fixes some issues with arrays in per-vertex and per-primitive data
(func.mesh.ext.outputs.*.indirect_array.q0 in crucible)
and by reduction in single MUE size it allows spawning more threads at
the same time.
Note: this patch doesn't improve vk_meshlet_cadscene performance because
default layout is already optimal enough.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20407>
Only found 3 shaders affected in Red Dead Redemption :
Totals from 3 (0.05% of 5969) affected shaders:
Instrs: 2246 -> 2230 (-0.71%)
Cycles: 156506 -> 148402 (-5.18%); split: -5.23%, +0.05%
This will have a larger effect when we add the
load_ubo_uniform_block_intel intrinsic where we will have larger
blocks (vec8/vec16 vs vec4 only now).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23477>
This helps a lot with accessing surface handles in control flow. Our
resource_intel intrinsic has a non_uniform flag, in which case we
cannot apply this optimization. But in uniform cases, this is just a
massive win. We drop all kind of pipeline stalls due to
find_live_channel. We also reduce register pressure by doing the
surface handle computation in a single GRF (instead of 2 or 4).
There are some regressions in max dispatch width but those I think are
only on SIMD32 and due to the current heuristic disabling it after
throughput comparison with SIMD16. We know this heuristic is not
perfect, it should probably be updated in another change.
Here are some stats (all titles seem to have similar gains) :
PERCENTAGE DELTAS Shaders Instrs Cycles Subgroup size Send messages Spill count Fill count Scratch Memory Size Max live registers Max dispatch width
red_dead_redemption2 5860 -36.80% -5.67% +0.77% +0.06% -81.26% -79.16% -70.62% -8.63% -6.93%
---------------------------------------------------------------------------------------------------------------------------------------------------------------
All affected 4716 -37.29% -5.67% +0.95% +0.07% -81.26% -79.16% -70.62% -9.15% -8.47%
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 5860 -36.80% -5.67% +0.77% +0.06% -81.26% -79.16% -70.62% -8.63% -6.93%
PERCENTAGE DELTAS Shaders Instrs Cycles Subgroup size Send messages Spill count Fill count Scratch Memory Size Max live registers Max dispatch width
rise_of_the_tomb_raider_g2 12010 -37.19% -22.12% +0.01% +0.00% -99.01% -99.14% -98.65% -7.62% -4.96%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
All affected 11732 -37.27% -22.14% +0.01% +0.00% -99.01% -99.14% -98.65% -7.67% -5.11%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 12010 -37.19% -22.12% +0.01% +0.00% -99.01% -99.14% -98.65% -7.62% -4.96%
PERCENTAGE DELTAS Shaders Instrs Cycles Spill count Fill count Scratch Memory Size Max live registers Max dispatch width
total_war_warhammer2 462 -27.45% -12.42% -82.35% -88.46% -66.67% -5.52% -5.62%
-----------------------------------------------------------------------------------------------------------------------------------
All affected 335 -28.31% -12.77% -82.35% -88.46% -66.67% -6.25% -7.24%
-----------------------------------------------------------------------------------------------------------------------------------
Total 462 -27.45% -12.42% -82.35% -88.46% -66.67% -5.52% -5.62%
PERCENTAGE DELTAS Shaders Instrs Cycles Subgroup size Send messages Spill count Fill count Scratch Memory Size Max live registers Max dispatch width
witcher_3_dxvk_g2 1049 -36.94% -57.82% +0.06% +0.01% -98.52% -97.29% -98.10% -7.81% -1.00%
------------------------------------------------------------------------------------------------------------------------------------------------------------
All affected 693 -41.93% -58.45% +0.09% +0.01% -98.52% -97.29% -98.10% -10.25% -1.33%
------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 1049 -36.94% -57.82% +0.06% +0.01% -98.52% -97.29% -98.10% -7.81% -1.00%
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21645>
Using the information coming from surface_index_intel, we can tell
whether we should use the BTI or bindless heap for a particular SSBO
access.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21645>
We need to do 3 things to accomplish this :
1. make all the register access consider the maximal case when
unknown at compile time
2. move the clamping of load_per_vertex_input prior to lowering
nir_intrinsic_load_patch_vertices_in (in the dynamic cases, the
clamping will use the nir_intrinsic_load_patch_vertices_in to
clamp), meaning clamping using derefs rather than lowered
nir_intrinsic_load_per_vertex_input
3. in the known cases, lower nir_intrinsic_load_patch_vertices_in
in NIR (so that the clamped elements still be vectorized to the
smallest number of URB read messages)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22378>
Commit bb8e31b7ed ("anv: avoid hardcoding instruction VA constant in
shaders") had a slight negative impact on shaders (Red Dead Redemption
2 in particular). Dropping a few shaders from SIMD32 to SIMD16.
With this change, it brings back all the dropped SIMD32 shaders.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22872>
This value takes a few instructions to create, involving expanding
V-immediates, adding 8 for SIMD16, and so on. We can mark it UNDEF
so that it's clear that although these are partial writes, we are
actually defining the entire value.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22835>
Comparisons which produce 32-bit boolean results (0 or 0xFFFFFFFF)
but operate on 16-bit types would first generate a CMP instruction
with W or HF types, before expanding it out. This CMP is a partial
write, which leads us to think the register may contain some prior
contents still. When placed in a loop, this causes its live range
to extend beyond its real life time.
Mark the register with UNDEF first so that we know that no prior
contents exist and need to be preserved.
This affects:
flt32, fge32, feq32, fneu32, ilt32, ult32, ige32, uge32, ieq32, ine32
On one of Cyberpunk 2077's most complex compute shaders, this reduces
the maximum live registers from 696 to 537 (22.8%). Together with the
next patch, Cyberpunk's spills and fills are cut by 10.23% and 9.19%,
respectively.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22835>
The instructions manipulation cr0 use the default mask on lane0. So if
for some reason that lane is disabled in some of the dispatchs, we can
end up not executing the instructions.
Fixes flakyness in dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_float_32_to_16.uniform_matrix_float_rtz_frag
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22314>
There's no guarantee that this is a SSA value. Use the helper to handle
both SSA values and register correctly. Otherwise we read trash when we
encounter a register and make bad decisions on types, possibly leading
to our destination being UQ typed when the VGRF is only 32-bit.
Fixes compilation with -Dintel-clc=enabled since 7f6491b76d
(nir: Combine if_uses with instruction uses) but the bug is much older
than that, circa 2017. We were just getting lucky before.
Fixes: 069bf7c907 ("i965/fs: Match destination type to size for ballot")
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22374>
MTL support fp64, but not int64. The fsign(double(x))*FOO optimization
would try to use a 64-bit int xor operation to conditionally toggle
the sign bit off the result.
Since this only affects high bit of the result, we can do a 32-bit
move of the low dword, and a 32-bit xor on the high dword.
Fixes dEQP-VK.spirv_assembly.instruction.compute.float_controls.fp64.input_args.modf_denorm_flush_to_zero
on MTL.
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22259>
Using divergence analysis, figure out when SSBO & shared memory loads
are uniform and carry the data only once in register space.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21853>
Before SLM fence compiler needs to insert SYNC.ALLWR in order to avoid
the SLM data race.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22050>