nir_opt_load_store_vectorize checks for potential address wrapping
when vectorizing two loads ("low" and "high"). It looks for cases where
"low" might have a large address, and "high" has a positive offset
which, when added together, could trigger integer wraparound. The issue
here is that if the large address of "low" was considered out-of-bounds,
adding offset could wrap around to a small address, which might actually
be in-bounds. Thus, when loaded separately, "low" will fail and trigger
robustness out-of-bound-read behavior, but "high" would read correctly.
When vectorized, the entire load would fail. This is explicitly tested
for with 32-bit SSBO addresses in the Vulkan CTS.
However, anv's 64-bit global addresses and VMA handling effectively
prevent this case. Addresses 0-4095 are a reserved page so that if
people try to use 0 as a NULL pointer, it never maps to a valid BO.
That alone guarantees that the above case where "high" gets a small
address would never be in-bounds, so we don't need to check for it.
In fact, we allocate most user allocations out of high addresses,
and have specialized allocation heaps for certain types of GPU data
structures in the lower GB of memory. For a load to wrap around and
successfully land in the right heap, it would have to load gigabytes.
Disabling this allows load vectorization and overfetching in more cases.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
Moving it to intel_shader_enums.h
The plan is to make it visible to OpenCL shaders.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32329>
This will need to be set to true when the GLSL linker lowers IO, which
can later be unlowered by st/mesa, and then drivers can lower it again
without load_interpolated_input. Therefore, it can't be a global
immutable option.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32229>
At the point we were calling this, we hadn't necessarily cleaned up
derefs via nir_lower_vars_to_ssa, nor movs/vecs via copy propagation,
so it wasn't necessarily easy for this pass to see the actual usage of
the destination.
Moving this later allows us to detect f2f32(txf(...)) and avoid
converting it to a 16-bit txf (why convert with ALU instructions
when the sampler could do it for us?).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31750>
This intrinsic was initially dedicated to mesh/task shaders, but the
mechanism it exposes also exists in the compute shaders on Gfx12.5+.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31508>
Useful to insert debug traces a bit later in the lowering process (in
particular after load/store vectorization).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31508>
If you allow an unsupported component count in the callback for loads,
nir_opt_load_store_vectorize will align num_components to the next supported
vector size, essentially overfetching.
This changes all callbacks to reject it. AMD will enable it in a later commit.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29398>
It will be used to allow merging loads with a hole between them.
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29398>
This is going to handle more than atomics shortly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30828>
Without this, the next commit tiggers assertions.
v2: Unconditionally do the lowering after brw_nir_optimize. Suggested by
Caio.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> [v1]
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30251>
We were specifying align_offset = 64 and align_mul = 64, which is
invalid. nir_combined_align() asserts that align_offset < align_mul.
Our intention here is to perform cacheline-aligned (64B-aligned) block
loads, so we should set align_mul = 64 and can leave align_offset = 0.
Fixes: fbafa9cabd ("intel/nir: remove load_global_const_block_intel intrinsic")
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30755>
load_global_constant_uniform_block_intel is equivalent in terms of
loading, then for the predicate we just do a bcsel afterward in places
where that is required.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30659>
Intel GPUs have a saturate destination modifier, and
brw_fs_opt_saturate_propagation tries to replace explicit saturate
operations with this destination modifier. That pass is limited in
several ways. If the source of the explicit saturate is in a different
block or if the source of the explicit saturate is live after the
explicit saturate, brw_fs_opt_saturate_propagation will be unable to
make progress.
This optimization exists to help brw_fs_opt_saturate_propagation make
more progress. It tries to move NIR fsat instructions to the same block
that contains the definition of its source. It does this only in cases
where it will not create additional live values. It also attempts to do
this only in cases where the explicit saturate will ultimiately be
converted to a destination modifier.
v2: Fix metadata_preserve when theres no progress and use
nir_metadata_control_flow when there is progress. All suggested by
Alyssa.
v3: Fix a typo in the file header comment. Noticed by Ken. Don't
require nir_metadata_instr_index. Use nir_def_rewrite_uses_after instead
of open-coding something slightly more specific. Both suggested by Ken.
shader-db:
All Intel platforms had similar results. (Meteor Lake shown)
total instructions in shared programs: 19733645 -> 19733028 (<.01%)
instructions in affected programs: 193300 -> 192683 (-0.32%)
helped: 246
HURT: 1
helped stats (abs) min: 2 max: 48 x̄: 2.51 x̃: 2
helped stats (rel) min: 0.18% max: 0.39% x̄: 0.33% x̃: 0.34%
HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel) min: 0.31% max: 0.31% x̄: 0.31% x̃: 0.31%
95% mean confidence interval for instructions value: -2.87 -2.13
95% mean confidence interval for instructions %-change: -0.34% -0.32%
Instructions are helped.
total cycles in shared programs: 916180971 -> 916264656 (<.01%)
cycles in affected programs: 30197180 -> 30280865 (0.28%)
helped: 194
HURT: 142
helped stats (abs) min: 1 max: 21251 x̄: 872.75 x̃: 19
helped stats (rel) min: <.01% max: 23.17% x̄: 2.59% x̃: 0.23%
HURT stats (abs) min: 1 max: 28058 x̄: 1781.68 x̃: 399
HURT stats (rel) min: <.01% max: 37.21% x̄: 4.85% x̃: 1.63%
95% mean confidence interval for cycles value: -196.84 694.97
95% mean confidence interval for cycles %-change: -0.17% 1.27%
Inconclusive result (value mean confidence interval includes 0).
fossil-db:
Meteor Lake, DG2, and Tiger Lake had similar results. (Meteor Lake shown)
Totals:
Instrs: 151512021 -> 151511351 (-0.00%); split: -0.00%, +0.00%
Cycle count: 17209013596 -> 17209840995 (+0.00%); split: -0.02%, +0.02%
Max live registers: 32013312 -> 32013549 (+0.00%)
Max dispatch width: 5512304 -> 5512136 (-0.00%)
Totals from 774 (0.12% of 630172) affected shaders:
Instrs: 1559285 -> 1558615 (-0.04%); split: -0.05%, +0.01%
Cycle count: 1312656268 -> 1313483667 (+0.06%); split: -0.24%, +0.30%
Max live registers: 82195 -> 82432 (+0.29%)
Max dispatch width: 6664 -> 6496 (-2.52%)
Ice Lake
Totals:
Instrs: 151416791 -> 151416137 (-0.00%); split: -0.00%, +0.00%
Cycle count: 15162468885 -> 15163298824 (+0.01%); split: -0.00%, +0.01%
Max live registers: 32471367 -> 32471603 (+0.00%)
Max dispatch width: 5623752 -> 5623712 (-0.00%)
Totals from 733 (0.12% of 635598) affected shaders:
Instrs: 877965 -> 877311 (-0.07%); split: -0.09%, +0.01%
Cycle count: 190763628 -> 191593567 (+0.44%); split: -0.21%, +0.64%
Max live registers: 72067 -> 72303 (+0.33%)
Max dispatch width: 6216 -> 6176 (-0.64%)
Skylake
Totals:
Instrs: 140794845 -> 140794075 (-0.00%); split: -0.00%, +0.00%
Cycle count: 14665159301 -> 14665320514 (+0.00%); split: -0.00%, +0.01%
Max live registers: 31783341 -> 31783662 (+0.00%); split: -0.00%, +0.00%
Totals from 659 (0.11% of 625670) affected shaders:
Instrs: 829061 -> 828291 (-0.09%); split: -0.09%, +0.00%
Cycle count: 185478478 -> 185639691 (+0.09%); split: -0.33%, +0.41%
Max live registers: 67491 -> 67812 (+0.48%); split: -0.01%, +0.48%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29774>
Gfx8 required us to emulate image load store with untyped messages,
whereas Gfx9 just has typed message support for everything. brw no
longer supports Gfx8, so all of this code is effectively dead.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30576>
on agx (and mali), we predicate atomics on "if (!helper)", so doing so again in
this pass is redundant. and would cause a problem since we'd then have to lower
the "is helper inv?" flag late. so just skip the extra lowering code.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Acked-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30488>
UBO loads with a non-indirect buffer index should be safe to perform
speculatively. With a direct offset, we may sometimes turn them into
push constants, at which point it's just reading a register with no
cost at all. Otherwise, we access them via messages that use surface
state, and automatically perform bounds checking. So we shouldn't have
any issues with reading out of bounds and page faulting, for example.
This allows nir_opt_peephole_sel() to operate on load_ubo intrinsics,
so we can turn simple if's with loads on both sides to bcsels. In some
cases this can collapse a surprising amount of control flow, allowing
other optimizations to work better.
The i965 OpenGL driver used load_uniform intrinsics, which are allowed
in NIR's peephole select pass. But iris uses the Gallium NIR pass that
translates uniforms to loads from UBO 0, so we haven't been able to take
advantage of NIR's peephole select pass there. The backend pass was
still able to handle this to some extent, however.
fossil-db results on Alchemist:
Totals:
Instrs: 150656329 -> 150645307 (-0.01%); split: -0.01%, +0.00%
Cycles: 12635230179 -> 12633696811 (-0.01%); split: -0.02%, +0.00%
Send messages: 7416330 -> 7416261 (-0.00%)
Spill count: 52471 -> 52473 (+0.00%)
Fill count: 100818 -> 100803 (-0.01%); split: -0.02%, +0.00%
Scratch Memory Size: 3197952 -> 3198976 (+0.03%)
Totals from 1848 (0.29% of 630003) affected shaders:
Instrs: 1412300 -> 1401278 (-0.78%); split: -0.80%, +0.02%
Cycles: 1809789567 -> 1808256199 (-0.08%); split: -0.11%, +0.03%
Send messages: 59829 -> 59760 (-0.12%)
Spill count: 3870 -> 3872 (+0.05%)
Fill count: 9693 -> 9678 (-0.15%); split: -0.18%, +0.02%
Scratch Memory Size: 174080 -> 175104 (+0.59%)
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30498>
API requires samplers to return 32-bit even though hardware can handle
16-bit floating point, so we detect that case and make more efficient
use of memory BW. This is helping improve performance of encode and
decode tokens during LLM by at least 5% across multiple platforms.
Thank you Kenneth Graunke for suggesting and guiding me throughout
this implementation.
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30447>
Floating-point offsets work fine in combination with the
floating-point arithmetic we're about to lower these intrinsics into,
and they require less instructions than converting to fixed-point and
then back. No reason to take the precision/range hit nor the extra
instructions.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
v2: Add some comments explaining some of the nuance of the shift
optimizations. Fix a bug in the shift count calculation of the upper
32-bits. Move the @64 from the variable to the opcode. All suggested
by Jordan.
No shader-db changes on any Intel platform.
fossil-db:
Meteor Lake and DG2 had similar results. (Meteor Lake shown)
Totals:
Instrs: 154507026 -> 154506576 (-0.00%)
Cycle count: 17436298868 -> 17436295016 (-0.00%)
Max live registers: 32635309 -> 32635297 (-0.00%)
Totals from 42 (0.01% of 632575) affected shaders:
Instrs: 5616 -> 5166 (-8.01%)
Cycle count: 133680 -> 129828 (-2.88%)
Max live registers: 1158 -> 1146 (-1.04%)
No fossil-db changes on any other Intel platform.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29148>
This is similar to what RADV implements using the NIR_LOOP_PASS
helpers. I have not used those helpers for a couple of reasons:
1. They use the pointer to the optimization function, which doesn't
work if the same function is called multiple times in one invocation
of the loop (fixable)
2. After fixing them, due to Intel's use of sub-expressions, the amount
of code added to wrap the shared macro becomes more than simply
reimplementing them for the Intel compiler
On most workloads the results are a wash, but on compile heavy
workloads like Cyberpunk 2077 and Rise of the Tomb Raider, I saw
fossil-db runtimes fall by 1-2% on my ICL, with no changes to the
compiled shaders. Caio saw closer to 2.5% on TGL.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27510>
This adds optimizations for iadd, fadd, and ixor with reduce,
inclusive scan, and exclusive scan.
NOTE: The fadd and ixor optimizations had no shader-db or fossil-db
changes on any Intel platform.
NOTE 2: This change "fixes" arb_compute_variable_group_size-local-size
and base-local-size.shader_test on DG2 and MTL. This is just changing
the code path taken to not use whatever path was not working properly
before.
This is a subset of the things optimized by ACO. See also
https://gitlab.freedesktop.org/mesa/mesa/-/issues/3731#note_682802. The
min, max, iand, and ior exclusive_scan optimizations are not
implemented.
Broadwell on shader-db is not happy. I have not investigated.
v2: Silence some warnings about discarding const.
v3: Rename mbcnt to count_active_invocations. Add a big comment
explaining the differences between the two paths. Suggested by Rhys.
shader-db:
All Gfx9 and newer platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 20300384 -> 20299545 (<.01%)
instructions in affected programs: 19167 -> 18328 (-4.38%)
helped: 35 / HURT: 0
total cycles in shared programs: 842809750 -> 842766381 (<.01%)
cycles in affected programs: 2160249 -> 2116880 (-2.01%)
helped: 33 / HURT: 2
total spills in shared programs: 4632 -> 4626 (-0.13%)
spills in affected programs: 206 -> 200 (-2.91%)
helped: 3 / HURT: 0
total fills in shared programs: 5594 -> 5581 (-0.23%)
fills in affected programs: 664 -> 651 (-1.96%)
helped: 3 / HURT: 1
fossil-db results:
All Intel platforms had similar results. (Ice Lake shown)
Totals:
Instrs: 165551893 -> 165513303 (-0.02%)
Cycles: 15132539132 -> 15125314947 (-0.05%); split: -0.05%, +0.00%
Spill count: 45258 -> 45204 (-0.12%)
Fill count: 74286 -> 74157 (-0.17%)
Scratch Memory Size: 2467840 -> 2451456 (-0.66%)
Totals from 712 (0.11% of 656120) affected shaders:
Instrs: 598931 -> 560341 (-6.44%)
Cycles: 184650167 -> 177425982 (-3.91%); split: -3.95%, +0.04%
Spill count: 983 -> 929 (-5.49%)
Fill count: 2274 -> 2145 (-5.67%)
Scratch Memory Size: 52224 -> 35840 (-31.37%)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27044>