load_global_constant_uniform_block_intel is equivalent in terms of
loading, then for the predicate we just do a bcsel afterward in places
where that is required.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30659>
Bspec 64643: Structure_TraceRayPayload::Trace Ray Control
Bit field moved from 9-8 to 10-8 on Xe2.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30600>
Bspec 57508: Structure_SIMD16TraceRayMessage:: RayQuery Enable
"When this bit is set in the header, Trace Ray Message behaves like a
Ray Query. This message requires a write-back message indicating
RayQuery for all valid Rays (SIMD lanes) have completed."
If we don't pass the write-back register, somehow it was stepping on
over R0 register and can mess up the scratch space accesses which could
potentially lead to GPU hang. It can be noticed while running it under
simulator trace.
send.rta (16|M0) null r124 r126:1 0x0 0x02000100 {$15} // wr:1+1, rd:0; simd16 trace ray
R0 = 00000001 00000000 00000000 00000001 00000000 00000000 00000001 00000000 00000000 00000001 00000000 00000000 00000001 00000000 00000000 00000001
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30600>
When closest hit shader is called, the BVH object level
brw_nir_rt_load_mem_ray origin/direction is 0. What we should be using
is the ray origin/direction and apply the transform of the current
instance.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 9ba7d459a3 ("intel/rt: Implement the new ray-tracing system values")
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30578>
Intel GPUs have a saturate destination modifier, and
brw_fs_opt_saturate_propagation tries to replace explicit saturate
operations with this destination modifier. That pass is limited in
several ways. If the source of the explicit saturate is in a different
block or if the source of the explicit saturate is live after the
explicit saturate, brw_fs_opt_saturate_propagation will be unable to
make progress.
This optimization exists to help brw_fs_opt_saturate_propagation make
more progress. It tries to move NIR fsat instructions to the same block
that contains the definition of its source. It does this only in cases
where it will not create additional live values. It also attempts to do
this only in cases where the explicit saturate will ultimiately be
converted to a destination modifier.
v2: Fix metadata_preserve when theres no progress and use
nir_metadata_control_flow when there is progress. All suggested by
Alyssa.
v3: Fix a typo in the file header comment. Noticed by Ken. Don't
require nir_metadata_instr_index. Use nir_def_rewrite_uses_after instead
of open-coding something slightly more specific. Both suggested by Ken.
shader-db:
All Intel platforms had similar results. (Meteor Lake shown)
total instructions in shared programs: 19733645 -> 19733028 (<.01%)
instructions in affected programs: 193300 -> 192683 (-0.32%)
helped: 246
HURT: 1
helped stats (abs) min: 2 max: 48 x̄: 2.51 x̃: 2
helped stats (rel) min: 0.18% max: 0.39% x̄: 0.33% x̃: 0.34%
HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel) min: 0.31% max: 0.31% x̄: 0.31% x̃: 0.31%
95% mean confidence interval for instructions value: -2.87 -2.13
95% mean confidence interval for instructions %-change: -0.34% -0.32%
Instructions are helped.
total cycles in shared programs: 916180971 -> 916264656 (<.01%)
cycles in affected programs: 30197180 -> 30280865 (0.28%)
helped: 194
HURT: 142
helped stats (abs) min: 1 max: 21251 x̄: 872.75 x̃: 19
helped stats (rel) min: <.01% max: 23.17% x̄: 2.59% x̃: 0.23%
HURT stats (abs) min: 1 max: 28058 x̄: 1781.68 x̃: 399
HURT stats (rel) min: <.01% max: 37.21% x̄: 4.85% x̃: 1.63%
95% mean confidence interval for cycles value: -196.84 694.97
95% mean confidence interval for cycles %-change: -0.17% 1.27%
Inconclusive result (value mean confidence interval includes 0).
fossil-db:
Meteor Lake, DG2, and Tiger Lake had similar results. (Meteor Lake shown)
Totals:
Instrs: 151512021 -> 151511351 (-0.00%); split: -0.00%, +0.00%
Cycle count: 17209013596 -> 17209840995 (+0.00%); split: -0.02%, +0.02%
Max live registers: 32013312 -> 32013549 (+0.00%)
Max dispatch width: 5512304 -> 5512136 (-0.00%)
Totals from 774 (0.12% of 630172) affected shaders:
Instrs: 1559285 -> 1558615 (-0.04%); split: -0.05%, +0.01%
Cycle count: 1312656268 -> 1313483667 (+0.06%); split: -0.24%, +0.30%
Max live registers: 82195 -> 82432 (+0.29%)
Max dispatch width: 6664 -> 6496 (-2.52%)
Ice Lake
Totals:
Instrs: 151416791 -> 151416137 (-0.00%); split: -0.00%, +0.00%
Cycle count: 15162468885 -> 15163298824 (+0.01%); split: -0.00%, +0.01%
Max live registers: 32471367 -> 32471603 (+0.00%)
Max dispatch width: 5623752 -> 5623712 (-0.00%)
Totals from 733 (0.12% of 635598) affected shaders:
Instrs: 877965 -> 877311 (-0.07%); split: -0.09%, +0.01%
Cycle count: 190763628 -> 191593567 (+0.44%); split: -0.21%, +0.64%
Max live registers: 72067 -> 72303 (+0.33%)
Max dispatch width: 6216 -> 6176 (-0.64%)
Skylake
Totals:
Instrs: 140794845 -> 140794075 (-0.00%); split: -0.00%, +0.00%
Cycle count: 14665159301 -> 14665320514 (+0.00%); split: -0.00%, +0.01%
Max live registers: 31783341 -> 31783662 (+0.00%); split: -0.00%, +0.00%
Totals from 659 (0.11% of 625670) affected shaders:
Instrs: 829061 -> 828291 (-0.09%); split: -0.09%, +0.00%
Cycle count: 185478478 -> 185639691 (+0.09%); split: -0.33%, +0.41%
Max live registers: 67491 -> 67812 (+0.48%); split: -0.01%, +0.48%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29774>
There are two problems.
1. This is not NaN safe. 'add.le.sat dst F, Inf F, -Inf F' has a
different result than 'add dst F, Inf F, -Inf F; cmp.le null, dst F, 0F'.
2. Ignoring the first problem, this only produces the desired flags
for LE and G. All other cases can produce the wrong result.
shader-db:
All Intel platforms had similar results. (Broadwell shown)
total instructions in shared programs: 18282314 -> 18282316 (<.01%)
instructions in affected programs: 78 -> 80 (2.56%)
helped: 0
HURT: 2
total cycles in shared programs: 952924234 -> 952924252 (<.01%)
cycles in affected programs: 584 -> 602 (3.08%)
helped: 0
HURT: 2
Fixes: e6022281f2 ("intel/elk: Rename files to use elk prefix")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29774>
There are two problems.
1. This is not NaN safe. 'add.le.sat dst F, Inf F, -Inf F' has a
different result than 'add dst F, Inf F, -Inf F; cmp.le null, dst F, 0F'.
2. Ignoring the first problem, this only produces the desired flags
for LE and G. All other cases can produce the wrong result.
For example, batman_arkham_city_goty.foz 6a63c4caacaa0dae has the
following code:
mad.ge.f0.0(8) g51<1>F g50<8,8,1>F g46<8,8,1>F g11<1,1,1>F
mov.sat(8) g52<1>F g51<1,1,0>F
...
(+f0.0) sel(8) g54<1>UD g53<8,8,1>UD 0x3f000000UD
Without this commit, the saturate is incorrectly propagated to the MAD.
A similar case exists in witcher_3_dxvk_g2.foz 5b03243be667a275.
There are even worse cases like total_war_warhammer3.dx12vk-g6.foz
78328466761ef7ab and ee920491573860fc. The former has the following
code (and the latter has very similar code):
mad.l.f0.0(16) g95<1>F g93<8,8,1>F g62<8,8,1>F g68<1,1,1>F
...
mov.sat(16) g109<1>F -g95<1,1,0>F
...
(+f0.0) sel(16) g68<1>UD g111<1,1,0>UD g54<1,1,0>UD
(+f0.0) sel(16) g70<1>UD g113<1,1,0>UD g56<1,1,0>UD
(+f0.0) sel(16) g72<1>UD g115<1,1,0>UD g58<1,1,0>UD
Saturate propagation makes a hash of this code:
mad.sat.l.f0.0(16) g106<1>F -g93<8,8,1>F -g62<8,8,1>F g68<1,1,1>F
...
(+f0.0) sel(16) g70<1>UD g110<1,1,0>UD g56<1,1,0>UD
(+f0.0) sel(16) g72<1>UD g112<1,1,0>UD g58<1,1,0>UD
(+f0.0) sel(16) g68<1>UD g108<1,1,0>UD g54<1,1,0>UD
Not only is the saturate incorrectly applied to the MAD, but the MAD
result is negated without changing the conditional modifier to G!
NOTE: Backports of this commit to stable branches may need to be more
like the following commit to elk.
shader-db:
All Intel platforms had similar results. (Meteor Lake shown)
total instructions in shared programs: 19729375 -> 19729377 (<.01%)
instructions in affected programs: 112 -> 114 (1.79%)
helped: 0
HURT: 2
total cycles in shared programs: 916234266 -> 916234288 (<.01%)
cycles in affected programs: 636 -> 658 (3.46%)
helped: 0
HURT: 2
fossil-db:
All Intel platforms had similar results. (Meteor Lake shown)
Totals:
Instrs: 151531594 -> 151531601 (+0.00%)
Cycle count: 17209107419 -> 17209107474 (+0.00%); split: -0.00%, +0.00%
Totals from 6 (0.00% of 630198) affected shaders:
Instrs: 4550 -> 4557 (+0.15%)
Cycle count: 194629 -> 194684 (+0.03%); split: -0.00%, +0.03%
Fixes: 947c828d5c ("i965/fs: Add a saturation propagation optimization pass.")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29774>
This prevents a couple small regressions in the next commit.
The only changes in shader-db or fossil-db were on Skylake. This seems
to eliminate an unused flags write that doesn't exist on other platforms.
With that flag write eliminated, a later CMP can be scheduled better.
I did not investigate this further.
v2: Clean up some unnecessary bits and add some comments to
can_elminate_conditional_mod. Suggested by Ken and Matt.
Skylake
Totals:
Cycle count: 14665454524 -> 14665454444 (-0.00%)
Totals from 10 (0.00% of 625685) affected shaders:
Cycle count: 38630 -> 38550 (-0.21%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29774>
Gfx8 required us to emulate image load store with untyped messages,
whereas Gfx9 just has typed message support for everything. brw no
longer supports Gfx8, so all of this code is effectively dead.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30576>
Instead of emitting a single one at the top, and making reference to it,
emit the virtual instruction as needed and let CSE do its job.
Since load_subgroup_invocation now can appear not at the start of the
shader, use UNDEF in all cases to ensure that the liveness of the
destination doesn't extend to the first partial write done here (it was
being used only for SIMD > 8 before).
Note this option was considered in the past
6132992cdb but at the time dismissed. The
difference now is that the lowering of the virtual instruction happens
earlier than the scheduling.
The motivation for this change is to allow passes other than the NIR
conversion to use this value. The alternative of storing a `brw_reg` in
the shader (instead of NIR state) gets complicated by passes like
compact_vgrfs, that move VGRFs around (and update the instructions).
This and maybe other passes would have to care about the brw_reg.
Fossil-db numbers, TGL
```
*** Shaders only in 'after' results are ignored:
steam-native/shadow_of_the_tomb_raider/c683ea5067ee157d/fs.32/0, steam-native/shadow_of_the_tomb_raider/f4df450c3cef40b4/fs.32/0, steam-native/shadow_of_the_tomb_raider/94b708fb8e3d9597/fs.32/0, steam-native/shadow_of_the_tomb_raider/19d44c328edabd30/fs.32/0, steam-native/shadow_of_the_tomb_raider/8a7dcbd5a74a19bf/fs.32/0, and 366 more
from 4 apps: steam-dxvk/alan_wake, steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-native/shadow_of_the_tomb_raider
*** Shaders only in 'before' results are ignored:
steam-dxvk/octopath_traveler/aaa3d10acb726906/fs.32/0, steam-dxvk/batman_arkham_origins/e6872ae23569c35f/fs.32/0, steam-dxvk/octopath_traveler/fd33a99fa5c271a8/fs.32/0, steam-dxvk/octopath_traveler/9a077cdc16f24520/fs.32/0, steam-dxvk/batman_arkham_city_goty/fac7b438ad52f622/fs.32/0, and 12 more
from 4 apps: steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-dxvk/octopath_traveler, steam-native/shadow_of_the_tomb_raider
Totals:
Instrs: 149752381 -> 149751337 (-0.00%); split: -0.00%, +0.00%
Cycle count: 11553609349 -> 11549970294 (-0.03%); split: -0.06%, +0.03%
Spill count: 42763 -> 42764 (+0.00%); split: -0.01%, +0.01%
Fill count: 75650 -> 75651 (+0.00%); split: -0.00%, +0.01%
Max live registers: 31725096 -> 31671792 (-0.17%)
Max dispatch width: 5546008 -> 5551672 (+0.10%); split: +0.11%, -0.00%
Totals from 52574 (8.34% of 630441) affected shaders:
Instrs: 9535159 -> 9534115 (-0.01%); split: -0.03%, +0.02%
Cycle count: 1006627109 -> 1002988054 (-0.36%); split: -0.65%, +0.29%
Spill count: 11588 -> 11589 (+0.01%); split: -0.03%, +0.03%
Fill count: 21057 -> 21058 (+0.00%); split: -0.01%, +0.02%
Max live registers: 1992493 -> 1939189 (-2.68%)
Max dispatch width: 559696 -> 565360 (+1.01%); split: +1.06%, -0.05%
```
and DG2
```
*** Shaders only in 'after' results are ignored:
steam-native/shadow_of_the_tomb_raider/1f95a9d3db21df85/fs.32/0, steam-native/shadow_of_the_tomb_raider/56b87c4a46613a2a/fs.32/0, steam-native/shadow_of_the_tomb_raider/a74b4137f85dbbd3/fs.32/0, steam-native/shadow_of_the_tomb_raider/e07e38d3f48e8402/fs.32/0, steam-native/shadow_of_the_tomb_raider/206336789c48996c/fs.32/0, and 268 more
from 4 apps: steam-dxvk/alan_wake, steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-native/shadow_of_the_tomb_raider
*** Shaders only in 'before' results are ignored:
steam-native/shadow_of_the_tomb_raider/0420d7c3a2ea99ec/fs.32/0, steam-native/shadow_of_the_tomb_raider/2ff39f8bf7d24abb/fs.32/0, steam-native/shadow_of_the_tomb_raider/92d7be2824bd9659/fs.32/0, steam-native/shadow_of_the_tomb_raider/f09ca6d2ecf18015/fs.32/0, steam-native/shadow_of_the_tomb_raider/490f8ffd59e52949/fs.32/0, and 205 more
from 3 apps: steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-native/shadow_of_the_tomb_raider
Totals:
Instrs: 151597619 -> 151599914 (+0.00%); split: -0.00%, +0.00%
Subgroup size: 7699776 -> 7699784 (+0.00%)
Cycle count: 12738501989 -> 12739841170 (+0.01%); split: -0.01%, +0.02%
Spill count: 61283 -> 61274 (-0.01%)
Fill count: 119886 -> 119849 (-0.03%)
Max live registers: 31810432 -> 31758920 (-0.16%)
Max dispatch width: 5540128 -> 5541136 (+0.02%); split: +0.08%, -0.06%
Totals from 49286 (7.81% of 631231) affected shaders:
Instrs: 8607753 -> 8610048 (+0.03%); split: -0.01%, +0.04%
Subgroup size: 857752 -> 857760 (+0.00%)
Cycle count: 305939495 -> 307278676 (+0.44%); split: -0.28%, +0.72%
Spill count: 6339 -> 6330 (-0.14%)
Fill count: 12571 -> 12534 (-0.29%)
Max live registers: 1788346 -> 1736834 (-2.88%)
Max dispatch width: 510920 -> 511928 (+0.20%); split: +0.85%, -0.66%
```
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30489>
16-bit SIMD8 sampler writeback messages come with a bit of padding in
them, requiring us to emit a LOAD_PAYLOAD to reorganize the data into
the padding-free format expected by NIR. Additionally, we may reduce
the response length on the sampler messages based on which components
of the (always vec4) NIR destination are actually in use. When we do
that, dest_size > read_size, and the trailing components are all empty
BAD_FILE registers, indicating the contents are undefined.
Unfortunately, we can't ignore those trailing components entirely.
In the past, we left them default-initialized, giving us a BAD_FILE
register with UD type (which didn't matter, since all sampler returns
were 32-bit). But with 16-bit, this was confusing the LOAD_PAYLOAD.
For example, writing RGB and skipping A (without sparse) would produce
read_size = 3 and dest_size = 4 and nir_dest[5] containing:
nir_dest[] = <R:hf, G:hf, B:hf, blank-A:ud, blank-sparse:ud>
We'd then call LOAD_PAYLOAD on the first 4 sources, causing it to see
3 HF's and a UD, and try to copy the full 32-bit value at the end,
instead of 16-bits of pad like we intended. This meant it would
overflow the destination register's size, triggering validation errors.
Thanks to Ian Romanick for noticing this, writing a test, and also
coming up with a nearly identical fix.
Fixes: 0116430d39 ("intel/brw: Handle 16-bit sampler return payloads")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11617
References: https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/152
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30529>
on agx (and mali), we predicate atomics on "if (!helper)", so doing so again in
this pass is redundant. and would cause a problem since we'd then have to lower
the "is helper inv?" flag late. so just skip the extra lowering code.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Acked-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30488>
We can achieve most of what brw_fs_opt_predicated_break() does with
simple peepholes at NIR -> BRW conversion time.
For predicated break and continue, we can simply look at an IF ... ENDIF
sequence after emitting it. If there's a single instruction between the
two, and it's a BREAK or CONTINUE, then we can move the predicate from
the IF onto the jump, and delete the IF/ENDIF. Because we haven't built
the CFG at this stage, we only need to remove them from the linked list
of instructions, which is trivial to do.
For the predicated while optimization, we can rely on the fact that we
already did the predicated break optimization, and simply look for a
predicated BREAK just before the WHILE. If so, we move the predicate
onto the WHILE, invert it, and remove the BREAK.
There are a few cases where this approach does a worse job than the old
one: nir_convert_from_ssa may introduce load_reg and store_reg in blocks
containing break, and nir_trivialize_registers may decide it needs to
insert movs into those blocks. So, at NIR -> BRW time, we'll actually
emit some MOVs there, which might have been possible to copy propagate
out after later optimizations.
However, the fossil-db results show that it's still pretty competitive.
For instructions, 1017 shaders were helped (average -1.87 instructions),
while only 62 were hurt (average +2.19 instructions). In affected
shaders, it was -0.08% for instructions.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30498>
UBO loads with a non-indirect buffer index should be safe to perform
speculatively. With a direct offset, we may sometimes turn them into
push constants, at which point it's just reading a register with no
cost at all. Otherwise, we access them via messages that use surface
state, and automatically perform bounds checking. So we shouldn't have
any issues with reading out of bounds and page faulting, for example.
This allows nir_opt_peephole_sel() to operate on load_ubo intrinsics,
so we can turn simple if's with loads on both sides to bcsels. In some
cases this can collapse a surprising amount of control flow, allowing
other optimizations to work better.
The i965 OpenGL driver used load_uniform intrinsics, which are allowed
in NIR's peephole select pass. But iris uses the Gallium NIR pass that
translates uniforms to loads from UBO 0, so we haven't been able to take
advantage of NIR's peephole select pass there. The backend pass was
still able to handle this to some extent, however.
fossil-db results on Alchemist:
Totals:
Instrs: 150656329 -> 150645307 (-0.01%); split: -0.01%, +0.00%
Cycles: 12635230179 -> 12633696811 (-0.01%); split: -0.02%, +0.00%
Send messages: 7416330 -> 7416261 (-0.00%)
Spill count: 52471 -> 52473 (+0.00%)
Fill count: 100818 -> 100803 (-0.01%); split: -0.02%, +0.00%
Scratch Memory Size: 3197952 -> 3198976 (+0.03%)
Totals from 1848 (0.29% of 630003) affected shaders:
Instrs: 1412300 -> 1401278 (-0.78%); split: -0.80%, +0.02%
Cycles: 1809789567 -> 1808256199 (-0.08%); split: -0.11%, +0.03%
Send messages: 59829 -> 59760 (-0.12%)
Spill count: 3870 -> 3872 (+0.05%)
Fill count: 9693 -> 9678 (-0.15%); split: -0.18%, +0.02%
Scratch Memory Size: 174080 -> 175104 (+0.59%)
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30498>
If TraceRay() is called with the TerminateOnFirstHit flag, we need to
terminate the ray on the first confirmed intersection. This is handled
by the lowering of accept_ray_intersection and it's working fine for the
case of multiple instances of the intersection shader being called.
But if the shader calls reportIntersection() more than once, we were
handling them all and accepting the closest one regardless of the flag.
Check for the flag on every confirmed intersection and, if set, accept
it right there. The subsequent lowering will take care of terminating
handling the ray termination if necessary.
Fixes new test dEQP-VK.ray_tracing_pipeline.amber.flags-accept-first
Cc: mesa-stable
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30418>
Our validation code doesn't need to know which bytes are accessed. It
only needs to know which grfs were accessed by an element. This also
helps to easily handle the Xe2 register size change.
Backport-to: 24.2
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28479>
Previously this check would create a mask of the bytes used in the
grf, and then shift the mask. This worked well when there was 32 bytes
in the register because a 64-bit uint64_t could easily detect that
bytes were used in the next regiter. (The next register was the high
32-bits of the `access_mask` variable.)
With Xe2, the register size becomes 64 bytes, meaning this strategy
doesn't work. Instead of a mask, we can just check to see if more than
1 grfs are used during each loop iteration. (Suggested by Ken.) This
will make it easier to extend for Xe2 in a follow on commit.
Verified this with
dEQP-VK.subgroups.arithmetic.compute.subgroupexclusivemul_u64vec4_requiredsubgroupsize
on Xe2, which otherwise would cause the program to fail to validate
because it assumed a grf was 32 bytes.
Backport-to: 24.2
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28479>
We don't actually need to extend g0's live range to the EOT message
generally - most messages that end a shader are headerless. The main
implicit use of g0 is for constructing scratch headers. With the last
two patches, we now consider scratch access that may exist in the IR
and already extend the liveness appropriately.
There is one remaining problem: spilling. The register allocator will
create new scratch messages when spilling a register, which need to
create scratch headers, which need g0. So, every new spill or fill
might extend the live range of g0, which would create new interference,
altering the graph. This can be problematic.
However, when compiling SIMD16 or SIMD32 fragment shaders, we don't
allow spilling anyway. So, why not use allow g0? Also, when trying
various scheduling modes, we first try allocation without spilling.
If it works, great, if not, we try a (hopefully) less aggressive
schedule, and only allow spilling on the lowest-pressure schedule.
So, even for regular SIMD8 shaders, we can potentially gain the use
of g0 on the first few tries at scheduling+allocation.
Once we try to allocate with spilling, we go back to reserving g0
for the entire program, so that we can construct scratch headers at
any point. We could possibly do better here, but this is simple and
reliable with some benefit.
Thanks to Ian Romanick for suggesting I try this approach.
fossil-db on Alchemist shows some more spill/fill improvements:
Totals:
Instrs: 149062395 -> 149053010 (-0.01%); split: -0.01%, +0.00%
Cycles: 12609496913 -> 12611652181 (+0.02%); split: -0.45%, +0.47%
Spill count: 52891 -> 52471 (-0.79%)
Fill count: 101599 -> 100818 (-0.77%)
Scratch Memory Size: 3292160 -> 3197952 (-2.86%)
Totals from 416541 (66.59% of 625484) affected shaders:
Instrs: 124058587 -> 124049202 (-0.01%); split: -0.01%, +0.01%
Cycles: 3567164271 -> 3569319539 (+0.06%); split: -1.61%, +1.67%
Spill count: 420 -> 0 (-inf%)
Fill count: 781 -> 0 (-inf%)
Scratch Memory Size: 94208 -> 0 (-inf%)
Witcher 3 shows a 33% reduction in scratch memory size, for example.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30319>
brw_send_indirect_split_message() implicitly reads g0 to construct the
extended message descriptor for certain send messages when this is set.
Record that liveness explicitly.
Thanks to Francisco Jerez for reminding me about this use of g0.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30319>
The generator code for emitting legacy scratch headers was implicitly
using g0 as a source. But the IR wasn't indicating any usage of g0,
which means the liveness isn't properly tracked at the IR level.
It works because we reserve g0 as permanently live for the whole
program. In order to stop doing that, we need to record it properly.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30319>
API requires samplers to return 32-bit even though hardware can handle
16-bit floating point, so we detect that case and make more efficient
use of memory BW. This is helping improve performance of encode and
decode tokens during LLM by at least 5% across multiple platforms.
Thank you Kenneth Graunke for suggesting and guiding me throughout
this implementation.
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30447>
While extending our backend to handle 16-bit sampler return payloads, we
found that in piglit's arb_texture_view-rendering-formats, the SIMD8 FS
was missing the sampling operation altogether. This was because we were
first emitting the texturing instruction, and then calling
get_nir_def(), which adds an UNDEF instruction when the destination is
smaller than the 32-bit. So the texturing was dead code elimated. Fix
this by calling get_nir_def() earlier.
Thank you to Kenneth Graunke for suggesting and guiding me throughout
this implementation.
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30447>
This option is used for Gfx < 12, elk already set it to true,
so set it in brw and change the drivers to not set it anymore.
Because the dual-compiler support in Iris, the helper function
there had to change to consult the right compiler value instead.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30393>
v2: Add comment and assertion to explain why the shift is
safe. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, many instances of the following are
produced:
src/intel/compiler/brw_fs_nir.cpp:114:30: runtime error: left shift of negative value -1
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, 'ninja test' would fail in several
Intel assembly tests (mul.asm and and.asm) with:
src/intel/compiler/brw_reg.h:703:22: runtime error: left shift of 65532 by 16 places cannot be represented in type 'int'
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, many instances of the following are
produced:
src/intel/compiler/brw_eu_compact.c:2244:50: runtime error: left shift of negative value -306
v2: Add comment and assertion to explain why the shift is
safe. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, many instances of the following are
produced:
src/intel/compiler/brw_compiler.h:1661:44: runtime error: shift exponent 64 is too large for 64-bit type 'long long unsigned int'
I think this is an actual bug. It should check the sentinel value, but
the sentinel value is 64. The shift by 64 is treated as a shift by
0. The varying 0 is explicitly filtered by the rest of the
if-test. How does this work?
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
Due to recent regression, adding INTEL_DEBUG=optimizer is dumping
shader optimization pass details to console rather than to respective
files.
Thank you, Kenneth W Graunke for helping me figure this out.
Fixes: 17b7e49089 ("intel/brw: Move out of fs_visitor and rename print instructions")
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30389>