Logical sends and load_payload can have large VGRFs that cannot be
split. Once all of the lowering passes and optimization passes that
might eliminate any of those instructions have completed, try to split
larger VGRFs one last time.
Register allocation can only handle VGRFs up to a certain size, so this
is the last opportunity to prevent later failures due to VGRFs that are
too large.
Closes: #13239
shader-db:
Lunar Lake, Meteor Lake, DG2, and Tiger Lake had similar results. (Lunar Lake shown)
total instructions in shared programs: 17114494 -> 17114496 (<.01%)
instructions in affected programs: 2790 -> 2792 (0.07%)
helped: 2 / HURT: 4
total cycles in shared programs: 886617364 -> 886315282 (-0.03%)
cycles in affected programs: 4067540 -> 3765458 (-7.43%)
helped: 48 / HURT: 9
Ice Lake and Skylake had similar restuls. (Ice Lake shown)
total instructions in shared programs: 20799801 -> 20799691 (<.01%)
instructions in affected programs: 1210 -> 1100 (-9.09%)
helped: 1 / HURT: 0
total cycles in shared programs: 865495386 -> 865498990 (<.01%)
cycles in affected programs: 60132 -> 63736 (5.99%)
helped: 2 / HURT: 1
total spills in shared programs: 3987 -> 3981 (-0.15%)
spills in affected programs: 24 -> 18 (-25.00%)
helped: 1 / HURT: 0
total fills in shared programs: 3535 -> 3519 (-0.45%)
fills in affected programs: 36 -> 20 (-44.44%)
helped: 1 / HURT: 0
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 208647246 -> 208646499 (-0.00%); split: -0.00%, +0.00%
Cycle count: 31257819536 -> 31263957016 (+0.02%); split: -0.02%, +0.04%
Max live registers: 66160877 -> 66155728 (-0.01%)
Totals from 34703 (4.91% of 707053) affected shaders:
Instrs: 13766639 -> 13765892 (-0.01%); split: -0.02%, +0.01%
Cycle count: 3693572086 -> 3699709566 (+0.17%); split: -0.15%, +0.32%
Max live registers: 4843852 -> 4838703 (-0.11%)
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36202>
In debug builds, the assertion should be preferred as it will highlight
the actual problem. In non-debug builds, it is possible to fail register
allocation more gracefully. If the problem only occurs in, for example,
a SIMD32 version of a shader, the application may even continue to
function.
Closes: #13239
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36202>
Broadcast selects one lane from the source to write to all the lanes
of the destination. This makes it possible for the first half to
overwrite the source used by the second half.
No shader-db changes on any Intel platform.
fossil-db:
Lunar Lake
Totals:
Instrs: 208705405 -> 208705374 (-0.00%); split: -0.00%, +0.00%
Cycle count: 31274597098 -> 31273711544 (-0.00%); split: -0.00%, +0.00%
Totals from 77 (0.01% of 707133) affected shaders:
Instrs: 220177 -> 220146 (-0.01%); split: -0.02%, +0.00%
Cycle count: 461694212 -> 460808658 (-0.19%); split: -0.33%, +0.14%
No fossil-db changes on any other Intel platforms.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35903>
The unordered mode stored in dependencies might be a bitmask and not
only a single mode. In practice, only the "stronger" mode will stick.
Make sure that the code testing for the mode uses "&" instead of "==",
to avoid prevent some valid combinations to happen, e.g.
```
// ...
add(16) g104<1>F g94<1,1,0>F g34<1,1,0>F { align1 1H @7 $7.dst compacted };
```
which without the fix ends up as
```
// ...
sync nop(1) null<0,1,0>UB { align1 WE_all 1N F@7 };
add(16) g104<1>F g94<1,1,0>F g34<1,1,0>F { align1 1H $7.dst compacted };
```
Enables two tests for the scoreboard pass that illustrate this case.
For measuring the effect, re-enabled the sync.nop accounting on total of
instructions and got the following results.
```
Totals:
Instrs: 322041261 -> 321748285 (-0.09%)
Cycle count: 22864587567 -> 22863073741 (-0.01%)
Max dispatch width: 7989040 -> 7989024 (-0.00%); split: +0.00%, -0.00%
Totals from 88212 (9.78% of 902056) affected shaders:
Instrs: 102282050 -> 101989074 (-0.29%)
Cycle count: 12787629859 -> 12786116033 (-0.01%)
Max dispatch width: 525336 -> 525320 (-0.00%); split: +0.01%, -0.01%
```
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36096>
Add support for disabling the VRT (Variable Register Thread) feature.
The strategy here is to force the old BRW_MAX_GRF limit for the
register allocator (locks the upper limit) and make sure
ptl_register_blocks() always return that amount of blocks (locks
the lower limit).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35781>
This needs to be called before intel_nir_opt_peephole_ffma, so I
arbitrarilly decided to call it right before.
All Intel platforms had similar results. (Lunar Lake shown)
total instructions in shared programs: 17120227 -> 17118227 (-0.01%)
instructions in affected programs: 5854 -> 3854 (-34.16%)
helped: 51 / HURT: 0
total cycles in shared programs: 895497762 -> 894733940 (-0.09%)
cycles in affected programs: 4603518 -> 3839696 (-16.59%)
helped: 95 / HURT: 21
LOST: 1
GAINED: 0
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35925>
The conversion from bit value to register file type is already done
by the brw_eu_inst_3src_a1_dst_reg_file in the FFC macro now, so doing it
again produced incorrect results.
Fixes: e7179232 ("intel/brw: Move encoding of Gfx11 3-src inside the inst helpers")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13141
Signed-off-by: Sviatoslav Peleshko <sviatoslav.peleshko@globallogic.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35960>
Similar to nir_lower_alu_width(), the callback can return the
desired number of components for a phi, or 0 for no lowering.
The previous behavior of nir_lower_phis_to_scalar() with lower_all=true
can be elicited via nir_lower_all_phis_to_scalar() while the previous
behavior with lower_all=false now corresponds to nir_lower_phis_to_scalar()
with NULL callback.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35783>
The GLSL compiler always lowers inputs to temps for VS and GS, so exclude
them from driver support because the GLSL compiler will no longer do that
unconditionally. Thus, indirect VS and GS inputs are completely untested
and broken in a lot of drivers.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35945>
We simplify the implementation by assuming the worse case, copying
entire per-vertex regions if necessary.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35103>
The formula uses scalar indices (4bytes), not slots (16bytes).
We also incorrectly passed a scalar (vertex case) & slot (mesh case)
offset in the push constants. Use slots instead so that the value is
smaller and we can pack more stuff into fs_msaa_flags.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 18bbcf9a63 ("intel: introduce new VUE layout for separate compiled shader with mesh")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35103>
load intrinsics don't have range
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 18bbcf9a63 ("intel: introduce new VUE layout for separate compiled shader with mesh")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35103>
Consider an innocuous instruction like:
and(1) v250:UD, g0.3<0,1,0>:UD, 4294967264u NoMask group0
If register allocation decides to spill v250, it will see this
instruction and say, "Oh no! The other components of v250 aren't set, so
I'd better add a fill before that instruction!"
But it gets even worse than that... if register coalesce decided to
merge two of these, the live range gets massively extended because the
writes don't fully initialize the value. This causes the need to spill
these registers in the first place.
Changing that instruction to SIMD16 on Xe2 or SIMD8 on other platforms
alleviates these issues.
shader-db:
Lunar Lake
total instructions in shared programs: 17118324 -> 17113191 (-0.03%)
instructions in affected programs: 93701 -> 88568 (-5.48%)
helped: 42 / HURT: 6
total cycles in shared programs: 895422566 -> 895079488 (-0.04%)
cycles in affected programs: 30111338 -> 29768260 (-1.14%)
helped: 35 / HURT: 40
total spills in shared programs: 3588 -> 3304 (-7.92%)
spills in affected programs: 285 -> 1 (-99.65%)
helped: 10 / HURT: 0
total fills in shared programs: 2218 -> 1663 (-25.02%)
fills in affected programs: 556 -> 1 (-99.82%)
helped: 10 / HURT: 0
Meteor Lake, DG2, Tiger Lake, and Ice Lake had similar results. (Meteor Lake shown)
total instructions in shared programs: 20059218 -> 20053563 (-0.03%)
instructions in affected programs: 96938 -> 91283 (-5.83%)
helped: 43 / HURT: 6
total cycles in shared programs: 884174588 -> 883536475 (-0.07%)
cycles in affected programs: 22105268 -> 21467155 (-2.89%)
helped: 35 / HURT: 27
total spills in shared programs: 5032 -> 4679 (-7.02%)
spills in affected programs: 355 -> 2 (-99.44%)
helped: 12 / HURT: 0
total fills in shared programs: 4782 -> 4113 (-13.99%)
fills in affected programs: 671 -> 2 (-99.70%)
helped: 12 / HURT: 0
Skylake
total instructions in shared programs: 19097658 -> 19097665 (<.01%)
instructions in affected programs: 14202 -> 14209 (0.05%)
helped: 0 / HURT: 5
total cycles in shared programs: 862058109 -> 862058267 (<.01%)
cycles in affected programs: 3450244 -> 3450402 (<.01%)
helped: 7 / HURT: 11
fossil-db:
Lunar Lake
Totals:
Cycle count: 31439652246 -> 31439652272 (+0.00%)
Totals from 2 (0.00% of 707091) affected shaders:
Cycle count: 2602 -> 2628 (+1.00%)
No other Intel platforms had any fossil-db changes.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35721>
`getopt_long()` returns an `int`, not a `char`; putting the value in
a `char` before comparing it to `-1` was making the comparison always
fail, resulting in the invalid codepath taken that then fails with:
option `-' is invalid: ignored
cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34756>
`subprocess.Popen()` returns immediately, and the subprocess might not
have finished by the time `stdout` is read on the next line, spuriously
failing the tests.
`subprocess.check_output()` makes sure the output is available before
returning, solving this issue; it additionally raises an error if the
subprocess failed, giving a better error than a failed diff later in the
script.
cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34756>
e4m3fn: 8bit floating point format with 4bit exponent, 3bit mantissa
and no infinities (finite only)
e5m2: 8bit floating point format with 5bit exponent, 2bit mantissa
and with infinities.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35434>
This will later be encoded by the backend into the
LSC extended descriptor message.
Reworks:
* Sagar: Add nir_intrinsic_ssbo_atomic_swap
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35252>