Commit graph

3341 commits

Author SHA1 Message Date
Mark Janes
4acea392af intel/compiler: drop unused ray-tracing fields from cache hash
The compiler only references `intel_device_info->subslice_masks` for
ray tracing workloads.  Platforms which lack raytracing support can
share a cache even if they differ on this field.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28311>
2024-03-22 00:01:28 +00:00
Kenneth Graunke
9a72116367 intel/brw: Unify DF and Q/UQ lowering for MOV
Using the new unsupported_64bit_type helper.

Fixes: ea423aba1b ("intel/brw: Split out 64-bit lowering from algebraic optimizations")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28328>
2024-03-21 23:25:56 +00:00
Kenneth Graunke
97c7d5113d intel/brw: Use correct execution pipe for lowering SEL on DF
This is a float operation, let's keep it on the float pipe.

Fixes: ea423aba1b ("intel/brw: Split out 64-bit lowering from algebraic optimizations")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28328>
2024-03-21 23:25:56 +00:00
Kenneth Graunke
26d65e96dd intel/brw: Assert that min/max are not happening in 64-bit SEL lowering
These aren't handled, only pure selects.

Fixes: ea423aba1b ("intel/brw: Split out 64-bit lowering from algebraic optimizations")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28328>
2024-03-21 23:25:56 +00:00
Kenneth Graunke
a2c2a7bc00 intel/brw: Fix check for 64-bit SEL lowering types
The 64-bit type lowering for SEL in opt_algebraic had a pre-existing bug
where it only triggered when 64-bit float _and_ integer types were
unsupported.  Meteorlake supports 64-bit float but not integer, so we
need to lower Q/UQ in that case still.

When I moved this to a later pass, opt_peephole_sel started generating
Q/UQ SEL instructions which were failing to be lowered.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10867
Fixes: ea423aba1b ("intel/brw: Split out 64-bit lowering from algebraic optimizations")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28328>
2024-03-21 23:25:56 +00:00
Dylan Baker
75ede9d9bc intel/brw: track last successful pass and leave the loop early
This is similar to what RADV implements using the NIR_LOOP_PASS
helpers. I have not used those helpers for a couple of reasons:

 1. They use the pointer to the optimization function, which doesn't
    work if the same function is called multiple times in one invocation
    of the loop (fixable)
 2. After fixing them, due to Intel's use of sub-expressions, the amount
    of code added to wrap the shared macro becomes more than simply
    reimplementing them for the Intel compiler

On most workloads the results are a wash, but on compile heavy
workloads like Cyberpunk 2077 and Rise of the Tomb Raider, I saw
fossil-db runtimes fall by 1-2% on my ICL, with no changes to the
compiled shaders. Caio saw closer to 2.5% on TGL.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27510>
2024-03-21 23:02:32 +00:00
Caio Oliveira
b2ee98d2db intel/brw: Handle Xe2 in brw_fs_opt_zero_samples
The mlen tracking is in REG_SIZE units, but in Xe2 each GRF has
doubled the size.  The optimization can only elide full GRFs, so
round down the amount of trailing zeros to ensure the optimization
will remove only full GRFs.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28279>
2024-03-21 22:38:54 +00:00
Ian Romanick
cd70e49394 intel/brw: Allow SIMD16 F and HF type conversion moves
On DG2, the lowering generated for these MOV instructions is
**awful**. The original SIMD16 MOV

    { 18}   67: mov(16) vgrf54+0.0:HF, vgrf46+0.0:F NoMask group0

is lowered to SIMD8 MOVs:

    { 18}  118: mov(8) vgrf54+0.0:HF, vgrf46+0.0:F NoMask group0
    { 18}  119: mov(8) vgrf54+0.16:HF, vgrf46+1.0:F NoMask group8

These MOVs violate Gfx12.5 region restrictions, so these are further
lowered:

    { 17}  119: mov(8) vgrf83<2>:HF, vgrf46+0.0:F NoMask group0
    { 19}  120: mov(8) vgrf54+0.0:UW, vgrf83<2>:UW NoMask group0
    { 19}  122: mov(8) vgrf84<2>:HF, vgrf46+1.0:F NoMask group8
    { 19}  123: mov(8) vgrf54+0.16:UW, vgrf84<2>:UW NoMask group8

The shader-db and fossil-db results are nothing to get excited
about. However, the affect on vk_cooperative_matrix_perf is substantial. In one subtest

shader: shaders/shmemfp16.spv

cooperativeMatrixProps = 8x8x16   A = float16_t B = float16_t C = float16_t D = float16_t scope = subgroup
TILE_M=128 TILE_N=128, TILE_K=32 BLayout=0

performance on my DG2 improved by ~60% due to a MASSIVE reduction in spills and fills:

-Native code for unnamed compute shader (null) (src_hash 0x00000000) (sha1 c6a41b1c4e7aa2da327a39a70ed36c822a4b172f)
-SIMD32 shader: 32484 instructions. 1 loops. 1893868 cycles. 737:1820 spills:fills, 442 sends, scheduled with mode none. Promoted 1 constants. Compacted 519744 to 492224 bytes (5%)
-   START B0 (20782 cycles)
+Native code for unnamed compute shader (null) (src_hash 0x00000000) (sha1 621e960daad5b5579b176717f24a315e7ea560a1)
+SIMD32 shader: 23918 instructions. 1 loops. 1089894 cycles. 432:1166 spills:fills, 442 sends, scheduled with mode none. Promoted 1 constants. Compacted 382688 to 353232 bytes (8%)

shader-db:

All Gfx9 and later platforms had similar results. (Meteor Lake shown)
total instructions in shared programs: 19656270 -> 19653981 (-0.01%)
instructions in affected programs: 61810 -> 59521 (-3.70%)
helped: 116 / HURT: 0

total cycles in shared programs: 823368888 -> 823375854 (<.01%)
cycles in affected programs: 1165284 -> 1172250 (0.60%)
helped: 51 / HURT: 57

fossil-db:

DG2 and Meteor Lake had similar results. (Meteor Lake shown)
*** Shaders only in 'before' results are ignored:
fossil-db/steam-dxvk/total_war_warhammer3/2a3ed2ca632a7cb7/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/18b9d4a3b1961616/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/04ac9f3146a6db19/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/f37ebec6aa1b379a/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/255c987feb0d4310/fs.32, and 25
more
from 1 apps: fossil-db/steam-dxvk/total_war_warhammer3

Totals:
Instrs: 160946537 -> 160928389 (-0.01%); split: -0.01%, +0.00%
Cycles: 14125908620 -> 14125873958 (-0.00%); split: -0.00%, +0.00%

Totals from 1002 (0.15% of 652134) affected shaders:
Instrs: 411261 -> 393113 (-4.41%); split: -4.41%, +0.00%
Cycles: 16676735 -> 16642073 (-0.21%); split: -0.48%, +0.27%

Tiger Lake
Totals:
Instrs: 164511816 -> 164497202 (-0.01%); split: -0.01%, +0.00%
Cycles: 13801675722 -> 13801629397 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 7955168 -> 7955152 (-0.00%)
Send messages: 8544494 -> 8544486 (-0.00%)

Totals from 997 (0.15% of 651454) affected shaders:
Instrs: 460820 -> 446206 (-3.17%); split: -3.17%, +0.00%
Cycles: 16265514 -> 16219189 (-0.28%); split: -0.84%, +0.56%
Subgroup size: 17552 -> 17536 (-0.09%)
Send messages: 26045 -> 26037 (-0.03%)

Ice Lake
Totals:
Instrs: 165504747 -> 165489970 (-0.01%); split: -0.01%, +0.00%
Cycles: 15145244554 -> 15145149627 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 8107032 -> 8107016 (-0.00%)
Send messages: 8598680 -> 8598672 (-0.00%)
Spill count: 45427 -> 45423 (-0.01%)
Fill count: 74749 -> 74747 (-0.00%)

Totals from 1125 (0.17% of 656115) affected shaders:
Instrs: 521676 -> 506899 (-2.83%); split: -2.83%, +0.00%
Cycles: 19555434 -> 19460507 (-0.49%); split: -0.59%, +0.10%
Subgroup size: 21616 -> 21600 (-0.07%)
Send messages: 28623 -> 28615 (-0.03%)
Spill count: 603 -> 599 (-0.66%)
Fill count: 1362 -> 1360 (-0.15%)

Skylake
*** Shaders only in 'after' results are ignored:
fossil-db/steam-native/red_dead_redemption2/cef460b80bad8485/fs.16,
fossil-db/steam-native/red_dead_redemption2/cd5fe081e2e5529d/fs.16
from 1 apps: fossil-db/steam-native/red_dead_redemption2

Totals:
Instrs: 141607617 -> 141593776 (-0.01%); split: -0.01%, +0.00%
Cycles: 14257812441 -> 14257661671 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 7743752 -> 7743736 (-0.00%)
Send messages: 7552728 -> 7552720 (-0.00%)
Spill count: 43660 -> 43661 (+0.00%)
Fill count: 71301 -> 71303 (+0.00%)

Totals from 1017 (0.16% of 636964) affected shaders:
Instrs: 392454 -> 378613 (-3.53%); split: -3.53%, +0.00%
Cycles: 16622974 -> 16472204 (-0.91%); split: -1.04%, +0.13%
Subgroup size: 19840 -> 19824 (-0.08%)
Send messages: 23021 -> 23013 (-0.03%)
Spill count: 484 -> 485 (+0.21%)
Fill count: 1155 -> 1157 (+0.17%)

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28281>
2024-03-21 15:12:58 -07:00
Ian Romanick
66dc6e07f5 intel/brw: Fix handling of accumulator register numbers
Folks, there's more than one accumulator. In general, when the
register file is ARF, the upper 4 bits of the register number specify
which ARF, and the lower 4 bits specify which one of that ARF. This
can be further partitioned by the subregister number.

This is already mostly handled correctly for flags register, but lots
of places wanted to check the register number for equality with
BRW_ARF_ACCUMULATOR. If acc1 is ever specified, that won't work.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28281>
2024-03-21 15:12:54 -07:00
Dylan Baker
477943cc9d meson: Allow building intel-clc for the host if it can be run
In what is probably the most common case  cross of compilation, x86_64
-> x86, it should be possible to build intel-clc for the host machine
and run it. Doing so simplifies the build by not needing to be able to
cross compile half of mesa, and should ease developer and distro strain
for building Intel drivers for x86.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28222>
2024-03-21 16:31:35 +00:00
Ian Romanick
3556dbb97f intel/brw/xe2: Correctly disassemble RT write subtypes
The encoding changed when SIMD32 was added.

Part of Wa_14011334914.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
c4325f426c intel/brw/xe2+: Setup PS thread payload registers required for ALU-based pixel interpolation.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
6427f16074 intel/brw/gfx12: Setup PS thread payload registers required for ALU-based pixel interpolation.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Rohan Garg
2df6d208c8 intel/brw: Adjust src1 length bits for xe2+
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Rohan Garg
83f2bdc116 intel/brw: Set the right cache control bits for xe2
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Rohan Garg
adb853ed10 intel/brw: Update written size depending on the LSC message
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Rohan Garg
48376ac3b8 intel/brw: Cleanup send generation
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Rohan Garg
65f66974a5 intel/brw: Use the dimensions supplied in the instruction
Rework:
 * Francisco Jerez: Rebase on 07b9bfacc7 ("intel/compiler: Move
   logical-send lowering to a separate file")

Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
644a0ede1e intel/blorp/xe2+: Don't use replicated-data clears.
They've been removed from the hardware.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
af8b9af700 intel/brw/xe2+: Allow dual-source blending in SIMD16 mode.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
762ec3fd59 intel/brw/xe2+: Allow FS stencil output in SIMD16 dispatch mode.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
efc0601ddf intel/brw/xe2+: Double allowed SIMD width of FB write SEND messages.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
d96bfb160f intel/brw/xe2+: Update encoding of FB write extended descriptor.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
189422de1b intel/brw/xe2+: Update encoding of FB write descriptor message control.
Ref: bspec: 65209, 63908
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Francisco Jerez
7b0fbc22dd intel/brw/xe2: Render target reads have been removed from the hardware.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28306>
2024-03-20 15:46:44 -07:00
Kenneth Graunke
a075b44493 intel/brw: Eliminate top-level FIND_LIVE_CHANNEL & BROADCAST once
brw_fs_opt_eliminate_find_live_channel eliminates FIND_LIVE_CHANNEL
outside of control flow.  None of our optimization passes generate
additional cases of that instruction, so once it's gone, we shouldn't
ever have to run the pass again.  Moving it out of the loop should
save a bit of CPU time.

While we're at it, also clean adjacent BROADCAST instructions that
consume the result of our FIND_LIVE_CHANNEL.  Without this, we have
to perform copy propagation to get the MOV 0 immediate into the
BROADCAST, then algebraic to turn it into a MOV, which enables more
copy propagation...not to mention CSE gets involved.  Since this
FIND_LIVE_CHANNEL + BROADCAST pattern from emit_uniformize() is
really common, and it's trivial to clean up, we can do that.  This
lets the initial copy prop in the loop see MOV instead of BROADCAST.

Zero impact on fossil-db, but less work in the optimization loop.

Together with the previous patches, this cuts compile time in
Borderlands 3 on Alchemist by -1.38539% +/- 0.1632% (n = 24).

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28286>
2024-03-20 01:04:22 -07:00
Kenneth Graunke
5814534de5 intel/brw: Don't consider UNIFORM_PULL_CONSTANT_LOAD a send-from-GRF
It's a logical opcode which is lowered to a send-from-GRF later.  That
lowering code is responsible for ensuring the sources are set up in a
proper SEND payload.

This was preventing copy propagation of surface handles which started
out as scalars, were splatted out to full-SIMD values with NoMask, then
actually consumed as only component 0 (scalar again), because we thought
that scalar values were not allowed.

fossil-db on Alchemist shows improvements in q2rtx but no other titles:

   Totals:
   Instrs: 161310436 -> 161310152 (-0.00%)
   Cycles: 14370605159 -> 14370601066 (-0.00%)

   Totals from 17 (0.00% of 652298) affected shaders:
   Instrs: 16097 -> 15813 (-1.76%)
   Cycles: 185508 -> 181415 (-2.21%)

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28286>
2024-03-20 01:04:22 -07:00
Kenneth Graunke
ea423aba1b intel/brw: Split out 64-bit lowering from algebraic optimizations
We don't necessarily want to split up MOVs for 64-bit addresses into
2x 32-bit MOVs right away, as this makes things like copy propagating
the whole address around harder.  We should do this late, once, while
still doing other algebraic optimizations earlier.

fossil-db results for Alchemist show tiny improvements:

   Totals:
   Instrs: 161310502 -> 161310436 (-0.00%); split: -0.00%, +0.00%
   Cycles: 14370605606 -> 14370605159 (-0.00%); split: -0.00%, +0.00%

   Totals from 33 (0.01% of 652298) affected shaders:
   Instrs: 15053 -> 14987 (-0.44%); split: -0.64%, +0.20%
   Cycles: 196947 -> 196500 (-0.23%); split: -0.25%, +0.02%

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28286>
2024-03-20 01:04:17 -07:00
Kenneth Graunke
d473004576 intel/fs: Avoid generating useless UNDEFs for every SSA def
Emitting UNDEF is only necessary when the instructions we generate to
produce the NIR def are considered partial writes.  By adding a simple
check (adapted from fs_inst::is_partial_write()), we can avoid creating
loads of unnecessary UNDEFs that we have to clean up later.

Our first dead code elimination pass does get rid of them pretty
quickly, but this should save memory and time during our first
split_virtual_grfs and dead_code_elimination passes.

This generates roughly 30% fewer instructions at the beginning.

Improves compilation time of shaders:
- Rise of the Tomb Raider: -3.51563% +/- 0.103951% (n=7)
- Borderlands 3: -3.64422% +/- 0.300951% (n=7).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28169>
2024-03-19 19:32:18 +00:00
Caio Oliveira
b22879e753 intel/brw: Use predicates for quad_vote_any and quad_vote_all when available
Up until Xe2, we can use the predicates ANY4H and ALL4H to achieve the
same result with less instructions.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27279>
2024-03-19 18:41:15 +00:00
Caio Oliveira
857e62e6ac intel/brw: Implement quad_vote_any and quad_vote_all
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27279>
2024-03-19 18:41:15 +00:00
Ian Romanick
671745b616 intel/fs: Don't allow 0 stride on MOV destination
Outside SIMD1 instructions, a destination stride of zero doesn't make
any sense. When such strides exist, they would be fixed by the FS
generator. Currently the only place that intentionally generates such a
stride is setup_barrier_message_payload_gfx125, and this commit changes
that.

The existence of a zero stride that won't really be a zero stride causes
a variety of problems with other optimization passes. Those passes don't
know that 0 actually means 1, and they make incorrect assumptions about
sizes written, etc.

The assertion helped catch many bugs in some other work in progress that
tries to store convergent values in SIMD8 registers regardless of the
dispatch width. That code would accidentally generate destination
strides of zero.

v2: Check stride differently depending on register file. Suggested by
Caio.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28256>
2024-03-19 18:17:59 +00:00
Iván Briano
446f652cde intel/cmat: fix stride calculation in cmat load/store
The stride given in the shader is in number of elements of the of the
type pointed by the given pointer, which may not match the matrix own
element type.
Since we cast the pointer to match the element type, the stride needs to
be ajusted accordingly.

v2:
 - Fix mismatching bit-width in matrix element type and pointer type (Caio)
 - Do the stride calculation in one place

Fixes dEQP-VK.compute.pipeline.cooperative_matrix.khr_*.multicomponent.*

Fixes: 3a35f8b29b ("intel/cmat: Lower cmat_load and cmat_store")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10820

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27903>
2024-03-15 20:34:43 +00:00
Jordan Justen
6922f421f4 intel/compiler: nib_ctrl no longer exists on Xe2+
Ref: cfb34dc695 ("intel/eu/validate: Validate that the ExecSize is a factor of chosen ChanOff")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28191>
2024-03-15 03:01:53 +00:00
Jordan Justen
72d289b8d1 intel/compiler/fs: Restore SIMD32 restriction for ray_queries on Xe2
In 96e0d979a7, the restriction was dropped because we don't compile a
SIMD8 program on Xe2. This change moves it to run_fs() so the
restriction will be added when compiling SIMD16 on Xe2.

Fixes: 96e0d979a7 ("intel/fs: Check fs_visitor instance before using it")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28191>
2024-03-15 03:01:53 +00:00
Marcin Ślusarz
2ad4d5f8dd intel/compiler/xe2: fix decoding of sampler simd mode
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28191>
2024-03-15 03:01:53 +00:00
Lionel Landwerlin
4df58ef503 intel/fs: bump max simd size of some messages for xe2
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28191>
2024-03-15 03:01:53 +00:00
Lionel Landwerlin
b7719a9ed8 intel/fs: remove some unused send helpers
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28152>
2024-03-13 14:37:48 +00:00
Caio Oliveira
e324fbbe68 intel/brw: Fix validation of accumulator register
The `stride` and `offset` attributes are meaningful for the "virtual"
register files (VGRFs, UNIFORMs and ATTRs).  Accumulator is an ARF so
validation should check `hstride` (part of the <V,W,H> triple) and `subnr`
instead.

Fixes: 12d7aaf2b8 ("intel/compiler: add more validation for acc register usage")
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28059>
2024-03-13 03:23:30 +00:00
Caio Oliveira
db8022dc4d intel/brw: Use helper to create accumulator register
This ensure the region triple <V,W,H> is set correctly, in this case the
desired region is a sequential like <8,8,1>.  Without the helper the
sequence we get is <0,1,0> -- which the generator currently partially
adjusts when emitting code, but is not sufficient when doing validation
earlier.

The code generated code is slightly modified.  From crucible test
func.shader.subtractSaturate.uint in the fragment shader for SIMD8, the
diff looks like

```
 mov(8)          acc0<1>UD       g21<8,8,1>UD                    { align1 1Q $0.dst };
-add.sat(8)      g22<1>UD        -acc0<0,1,0>UD  g16<8,8,1>UD    { align1 1Q @1 $0.dst };
+add.sat(8)      g22<1>UD        -acc0<8,8,1>UD  g16<8,8,1>UD    { align1 1Q @1 $0.dst };
```

Note that without the patch generator adjusted the hstride for acc0 used
as destination (see brw_set_dest), but kept the src region as is.  For
the source, it is not clear to me why the <0,1,0> would work correctly
here since it is a scalar, but using <8,8,1> it is correct.

Fixes: 58907568ec ("intel/fs: Add SHADER_OPCODE_[IU]SUB_SAT pseudo-ops")
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28059>
2024-03-13 03:23:30 +00:00
Jordan Justen
f0769f5d8a intel/compiler: Adjust fs_visitor::emit_cs_terminate() for Xe2
Fixes: 97bf3d3b2d ("intel/brw: Replace CS_OPCODE_CS_TERMINATE with SHADER_OPCODE_SEND")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28110>
2024-03-13 00:25:55 +00:00
Kenneth Graunke
97aec40111 intel/brw: Emit better code for read_invocation(x, constant)
For something as basic as read_invocation(x, 0), we were emitting:

   mov(8) vgrf67:D, 0d
   find_live_channel(8) vgrf236:UD, NoMask
   broadcast(8) vgrf237:D, vgrf67:D, vgrf236+0.0<0>:UD NoMask
   broadcast(8) vgrf235+0.0:W, vgrf197+0.0:W, vgrf237+0.0<0>:D NoMask
   mov(8) vgrf234+0.0:W, vgrf235+0.0<0>:W

This is way overcomplicated - if the invocation is a constant, we can
simply emit a single MOV which reads the desired channel index.  Not
only that, but it's difficult to clean up:

1. If this expression appears multiple times, CSE will find all the
   redundant emit_uniformize(invocation) and get rid of the duplicate
   (find_live_channel+broadcast) on future instructions.
2. Copy propagation will put the 0d directly in the first broadcast.
3. Dead code elimination will get rid of the vgrf67 temp holding 0.
4. Algebraic will replace the first broadcast(x, 0) with a MOV.
5. Copy propagation will put the 0d directly in the second broadcast.
6. Dead code elimination will get rid of the vgrf237 temp.
7. Algebraic will replace the second broadcast(x, 0) with a MOV.
8. Copy propagation will finally combine the two MOVs

That's at least 7-8 optimization passes and several loops through the
same passes just to clean up something we can do trivially.

Cuts 25% of the of the optimizer steps in pipeline 22200210259a2c9c
of fossil-db/google-meet-clvk/BgBlur.1f58fdf742c27594.1 (31 to 23).

Shortens compilation time of the google-meet-clvk/Relight pipeline by
-2.87717% +/- 0.509162% (n=150).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28097>
2024-03-12 21:58:27 +00:00
Ian Romanick
e87881f616 intel/brw: Avoid a silly add with zero in assign_curb_setup
No shader-db changes.

fossil-db:

DG2
Totals:
Instrs: 161008251 -> 161004452 (-0.00%)
Cycles: 13894249509 -> 13893050101 (-0.01%); split: -0.01%, +0.00%

Totals from 3804 (0.58% of 652145) affected shaders:
Instrs: 2232984 -> 2229185 (-0.17%)
Cycles: 7124966553 -> 7123767145 (-0.02%); split: -0.02%, +0.00%

No fossil-db changes on any other platform.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Ian Romanick
d9674cbe7d intel/brw: Combine constants for src0 of POW instructions too
I tried this when I was working on MR !7698, and it didn't have much
affect back then. Maybe I've added more stuff to my fossil-db?

Gfx12 platforms (Tiger Lake and DG2) are unaffected because the POW
instruction was removed.

shader-db:

Ice Lake and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 20301933 -> 20301900 (<.01%)
instructions in affected programs: 9077 -> 9044 (-0.36%)
helped: 33 / HURT: 0

total cycles in shared programs: 842797624 -> 842799471 (<.01%)
cycles in affected programs: 1361911 -> 1363758 (0.14%)
helped: 35 / HURT: 111

LOST:   0
GAINED: 9

fossil-db:

Ice Lake and Skylake had similar results. (Ice Lake shown)
Totals:
Instrs: 165510222 -> 165510163 (-0.00%)
Cycles: 15125195835 -> 15125194484 (-0.00%); split: -0.00%, +0.00%
Spill count: 45204 -> 45196 (-0.02%)
Fill count: 74157 -> 74149 (-0.01%)

Totals from 65 (0.01% of 656118) affected shaders:
Instrs: 57426 -> 57367 (-0.10%)
Cycles: 1667918 -> 1666567 (-0.08%); split: -0.11%, +0.03%
Spill count: 137 -> 129 (-5.84%)
Fill count: 515 -> 507 (-1.55%)

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Ian Romanick
e7480f94c1 intel/brw: Combine constants for src0 of integer multiply too
The majority of cases that would have been affected by this actually
had both sources as integer constants. The earlier commit "intel/rt:
Don't directly generate umul_32x16" allowed those to be constant
folded.

v2: Move the a*-1 block to be near the existing a*-1 block.

No shader-db changes on any Intel platform.

fossil-db results:

All Intel platforms had similar results. (Ice Lake shown)
Totals:
Instrs: 165510246 -> 165510222 (-0.00%)
Cycles: 15125198238 -> 15125195835 (-0.00%); split: -0.00%, +0.00%

Totals from 46 (0.01% of 656118) affected shaders:
Instrs: 36010 -> 35986 (-0.07%)
Cycles: 2613658 -> 2611255 (-0.09%); split: -0.17%, +0.07%

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Ian Romanick
dd3bed1d92 intel/brw: Integer multiply w/ DW and W sources is not commutative
The DW source must be first on all platforms since Gfx7. On previous
platforms it's the other way around.

Unsurprisingly, no shader-db or fossil-db changes. This change is
necessary for the next commit.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Ian Romanick
93478c095e intel/compiler: Enforce 64-bit RepCtrl restriction in eu_validate
For some reason, this wasn't always caught in fs_visitor::validate.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Ian Romanick
31f640bc5f intel/brw: Correctly dump subnr for FIXED_GRF in INTEL_DEBUG=optimizer
v2: Also update printing FIXED_GRF as destionation. Suggested by Lionel.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Ian Romanick
f89d9cc53d intel/brw: Silence "statement may fall through" warning
src/intel/compiler/brw_lower_logical_sends.cpp: In member function ‘bool fs_visitor::lower_logical_sends()’:
src/intel/compiler/brw_lower_logical_sends.cpp:3170:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
 3170 |          if (devinfo->has_lsc) {
      |          ^~
src/intel/compiler/brw_lower_logical_sends.cpp:3174:7: note: here
 3174 |       case SHADER_OPCODE_DWORD_SCATTERED_READ_LOGICAL:
      |       ^~~~

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27552>
2024-03-12 21:31:30 +00:00
Alyssa Rosenzweig
a6123a80da nir/opt_shrink_vectors: shrink some intrinsics from start
If the backend supports it, intrinsics with a component() are straightforward to
shrink from the start. Notably helps vectorized I/O.

v2: add an option for this and enable only on grown up backends, because some
backends ignore the component() parameter.

RADV GFX11:
Totals from 921 (1.16% of 79439) affected shaders:
Instrs: 616558 -> 615529 (-0.17%); split: -0.30%, +0.14%
CodeSize: 3099864 -> 3095632 (-0.14%); split: -0.25%, +0.11%
Latency: 2177075 -> 2160966 (-0.74%); split: -0.79%, +0.05%
InvThroughput: 299997 -> 298664 (-0.44%); split: -0.47%, +0.02%
VClause: 16343 -> 16395 (+0.32%); split: -0.01%, +0.32%
SClause: 10715 -> 10714 (-0.01%)
Copies: 24736 -> 24701 (-0.14%); split: -0.37%, +0.23%
PreVGPRs: 30179 -> 30173 (-0.02%)
VALU: 353472 -> 353439 (-0.01%); split: -0.03%, +0.02%
SALU: 40323 -> 40322 (-0.00%)
VMEM: 25353 -> 25352 (-0.00%)

AGX:

total instructions in shared programs: 2038217 -> 2038049 (<.01%)
instructions in affected programs: 10249 -> 10081 (-1.64%)

total alu in shared programs: 1593094 -> 1592939 (<.01%)
alu in affected programs: 7145 -> 6990 (-2.17%)

total fscib in shared programs: 1589254 -> 1589102 (<.01%)
fscib in affected programs: 7217 -> 7065 (-2.11%)

total bytes in shared programs: 13975666 -> 13974722 (<.01%)
bytes in affected programs: 65942 -> 64998 (-1.43%)

total regs in shared programs: 592758 -> 591187 (-0.27%)
regs in affected programs: 6936 -> 5365 (-22.65%)

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev> (v1)
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28004>
2024-03-12 18:17:17 +00:00