34 commits

Author SHA1 Message Date
Caio Oliveira
25384dccc0 intel/brw: Remove 'fs' prefix from passes filenames
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32813>
2025-01-02 18:11:05 +00:00
Ian Romanick
007c92b2ac brw/lower: Adjust source stride on DF is_scalar sources to MAD on Gfx9
This commit used to be "brw/emit: Allow scalar sources to 64-bit
3-source instructions". These instructions were fixed up in
brw_eu_emit. There seems to be some conflict with the <0,1,0> stride and
post-RA scheduling. The only difference between the passing code
generated by this commit and the failing code generated by the older
commit is some post-RA scheduling.

v2: Change the stride of a MAD even if the instruction isn't
lowered. MAD instructions that are already SIMD8 have to follow the same
rules. 🤦

v3: Pull the lowering out to its own pass. Update the comment in
brw_fs_validate. Suggested by Ken.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
2024-12-24 18:09:58 -08:00
Kenneth Graunke
6341b3cd87 brw: Combine convergent texture buffer fetches into fewer loads
Borderlands 3 (both DX11 and DX12 renderers) has a common pattern
across many shaders:

  con 32x4 %510 = (uint32)txf %2 (handle), %1191 (0x10) (coord), %1 (0x0) (lod), 0 (texture)
  con 32x4 %512 = (uint32)txf %2 (handle), %1511 (0x11) (coord), %1 (0x0) (lod), 0 (texture)
  ...
  con 32x4 %550 = (uint32)txf %2 (handle), %1549 (0x25) (coord), %1 (0x0) (lod), 0 (texture)
  con 32x4 %552 = (uint32)txf %2 (handle), %1551 (0x26) (coord), %1 (0x0) (lod), 0 (texture)

A single basic block contains piles of texelFetches from a 1D buffer
texture, with constant coordinates.  In most cases, only the .x channel
of the result is read.  So we have something on the order of 28 sampler
messages, each asking for...a single uint32_t scalar value.  Because our
sampler doesn't have any support for convergent block loads (like the
untyped LSC transpose messages for SSBOs)...this means we were emitting
SIMD8/16 (or SIMD16/32 on Xe2) sampler messages for every single scalar,
replicating what's effectively a SIMD1 value to the entire register.
This is hugely wasteful, both in terms of register pressure, and also in
back-and-forth sending and receiving memory messages.

The good news is we can take advantage of our explicit SIMD model to
handle this more efficiently.  This patch adds a new optimization pass
that detects a series of SHADER_OPCODE_TXF_LOGICAL, in the same basic
block, with constant offsets, from the same texture.  It constructs a
new divergent coordinate where each channel is one of the constants
(i.e., <0x10, 0x11, 0x12, ..., 0x26> in the above example).  It issues a new
NoMask divergent texel fetch which loads N useful channels in one go,
and replaces the rest with expansion MOVs that splat the SIMD1 result
back to the full SIMD width.  (These get copy propagated away.)

We can pick the SIMD size of the load independently of the native shader
width as well.  On Xe2, those 28 convergent loads become a single SIMD32
ld message.  On earlier hardware, we use 2 SIMD16 messages.  Or we can
use a smaller size when there aren't many to combine.
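
As a rough illustration of the packing and width selection described above,
here is a minimal standalone sketch.  It is not the pass itself: the
CombinedFetch struct, the function names, and the assumption that the
smallest usable width is SIMD8 are all hypothetical and chosen only for
this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one combined load: the SIMD width of the
// message and the constant coordinate placed in each useful channel.
struct CombinedFetch {
   unsigned simd_width;
   std::vector<unsigned> coords;
};

// Smallest power-of-two width, starting at 8 (an assumption of this
// sketch) and capped at max_width, that holds n coordinates.
static unsigned pick_width(std::size_t n, unsigned max_width)
{
   unsigned w = 8;
   while (w < n && w < max_width)
      w *= 2;
   return w;
}

// Pack the constant coordinates of convergent fetches (same texture,
// same basic block) into as few divergent loads as possible.
std::vector<CombinedFetch>
combine_fetches(const std::vector<unsigned> &coords, unsigned max_width)
{
   assert(max_width == 16 || max_width == 32);

   std::vector<CombinedFetch> out;
   for (std::size_t i = 0; i < coords.size(); i += max_width) {
      const std::size_t n =
         std::min<std::size_t>(max_width, coords.size() - i);

      CombinedFetch f;
      f.simd_width = pick_width(n, max_width);
      f.coords.assign(coords.begin() + i, coords.begin() + i + n);
      out.push_back(f);
   }
   return out;
}
```

With 28 coordinates, a cap of 32 yields one SIMD32 load and a cap of 16
yields two SIMD16 loads, matching the counts quoted above.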

In fossil-db, this cuts 27% of send messages in affected shaders, 3-6%
of cycles, 2-3% of instructions, and 8-12% of live registers.  On A770,
this improves performance of Borderlands 3 by roughly 2.5-3.5%.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32573>
2024-12-12 00:05:42 +00:00
Ian Romanick
662339a2ff brw/build: Use SIMD8 temporaries in emit_uniformize
The fossil-db results are very different from v1. This is now mostly
helpful on older platforms.

v2: When optimizing BROADCAST or FIND_LIVE_CHANNEL to a simple MOV,
adjust the exec_size to match the size allocated for the destination
register. Fixes EU validation failures in some piglit OpenCL tests
(e.g., atomic_add-global-return.cl).

v3: Use component_size() in emit_uniformize and BROADCAST to properly
account for UQ vs UD destination. This doesn't matter for
emit_uniformize because the type is always UD, but it is technically
more correct.

v4: Update trace checksums. Now amly expects the same checksum as
several other platforms.

v5: Use xbld.dispatch_width() in the builder for when scalar_group()
eventually becomes SIMD1. Suggested by Lionel.

shader-db:

Lunar Lake, Meteor Lake, DG2, and Tiger Lake had similar results. (Lunar Lake shown)
total instructions in shared programs: 18091701 -> 18091586 (<.01%)
instructions in affected programs: 29616 -> 29501 (-0.39%)
helped: 28 / HURT: 18

total cycles in shared programs: 919250494 -> 919123828 (-0.01%)
cycles in affected programs: 12201102 -> 12074436 (-1.04%)
helped: 124 / HURT: 108

LOST:   0
GAINED: 1

Ice Lake and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 20480808 -> 20480624 (<.01%)
instructions in affected programs: 58465 -> 58281 (-0.31%)
helped: 61 / HURT: 20

total cycles in shared programs: 874860168 -> 874960312 (0.01%)
cycles in affected programs: 18240986 -> 18341130 (0.55%)
helped: 113 / HURT: 158

total spills in shared programs: 4557 -> 4555 (-0.04%)
spills in affected programs: 93 -> 91 (-2.15%)
helped: 1 / HURT: 0

total fills in shared programs: 5247 -> 5243 (-0.08%)
fills in affected programs: 224 -> 220 (-1.79%)
helped: 1 / HURT: 0

fossil-db:

Lunar Lake
Totals:
Instrs: 220486064 -> 220486959 (+0.00%); split: -0.00%, +0.00%
Subgroup size: 14102592 -> 14102624 (+0.00%)
Cycle count: 31602733838 -> 31604733270 (+0.01%); split: -0.01%, +0.02%
Max live registers: 65371025 -> 65355084 (-0.02%)

Totals from 12130 (1.73% of 702392) affected shaders:
Instrs: 5162700 -> 5163595 (+0.02%); split: -0.06%, +0.08%
Subgroup size: 388128 -> 388160 (+0.01%)
Cycle count: 751721956 -> 753721388 (+0.27%); split: -0.54%, +0.81%
Max live registers: 1538550 -> 1522609 (-1.04%)

Meteor Lake and DG2 had similar results. (Meteor Lake shown)
Totals:
Instrs: 241601142 -> 241599114 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 9631168 -> 9631216 (+0.00%)
Cycle count: 25101781573 -> 25097909570 (-0.02%); split: -0.03%, +0.01%
Max live registers: 41540611 -> 41514296 (-0.06%)
Max dispatch width: 6993456 -> 7000928 (+0.11%); split: +0.15%, -0.05%

Totals from 16852 (2.11% of 796880) affected shaders:
Instrs: 6303937 -> 6301909 (-0.03%); split: -0.11%, +0.07%
Subgroup size: 323592 -> 323640 (+0.01%)
Cycle count: 625455880 -> 621583877 (-0.62%); split: -1.20%, +0.58%
Max live registers: 1072491 -> 1046176 (-2.45%)
Max dispatch width: 76672 -> 84144 (+9.75%); split: +14.04%, -4.30%

Tiger Lake
Totals:
Instrs: 235190395 -> 235193286 (+0.00%); split: -0.00%, +0.00%
Cycle count: 23130855720 -> 23128936334 (-0.01%); split: -0.02%, +0.01%
Max live registers: 41644106 -> 41620052 (-0.06%)
Max dispatch width: 6959160 -> 6981512 (+0.32%); split: +0.34%, -0.02%

Totals from 15102 (1.90% of 793371) affected shaders:
Instrs: 5771042 -> 5773933 (+0.05%); split: -0.06%, +0.11%
Cycle count: 371062226 -> 369142840 (-0.52%); split: -1.04%, +0.52%
Max live registers: 989858 -> 965804 (-2.43%)
Max dispatch width: 61344 -> 83696 (+36.44%); split: +38.42%, -1.98%

Ice Lake and Skylake had similar results. (Ice Lake shown)
Totals:
Instrs: 236063150 -> 236063242 (+0.00%); split: -0.00%, +0.00%
Cycle count: 24516187174 -> 24516027518 (-0.00%); split: -0.00%, +0.00%
Spill count: 567071 -> 567049 (-0.00%)
Fill count: 701323 -> 701273 (-0.01%)
Max live registers: 41914047 -> 41913281 (-0.00%)
Max dispatch width: 7042608 -> 7042736 (+0.00%); split: +0.00%, -0.00%

Totals from 3904 (0.49% of 798473) affected shaders:
Instrs: 2809690 -> 2809782 (+0.00%); split: -0.02%, +0.03%
Cycle count: 182114259 -> 181954603 (-0.09%); split: -0.34%, +0.25%
Spill count: 1696 -> 1674 (-1.30%)
Fill count: 2523 -> 2473 (-1.98%)
Max live registers: 341695 -> 340929 (-0.22%)
Max dispatch width: 32752 -> 32880 (+0.39%); split: +0.44%, -0.05%

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32097>
2024-12-05 00:15:27 +00:00
Ian Romanick
d2b266187d brw: Use resize_sources several more places
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32097>
2024-12-05 00:15:27 +00:00
Ian Romanick
5dfea87623 brw/opt: Always do both kinds of copy propagation before lower_load_payload
shader-db:

All Intel platforms except Skylake had similar results. (Lunar Lake shown)
total instructions in shared programs: 18092932 -> 18092713 (<.01%)
instructions in affected programs: 139290 -> 139071 (-0.16%)
helped: 103
HURT: 18
helped stats (abs) min: 1 max: 8 x̄: 2.43 x̃: 2
helped stats (rel) min: 0.02% max: 9.09% x̄: 0.73% x̃: 0.29%
HURT stats (abs)   min: 1 max: 5 x̄: 1.72 x̃: 1
HURT stats (rel)   min: 0.02% max: 0.55% x̄: 0.10% x̃: 0.08%
95% mean confidence interval for instructions value: -2.17 -1.45
95% mean confidence interval for instructions %-change: -0.83% -0.38%
Instructions are helped.

total cycles in shared programs: 922792268 -> 921495900 (-0.14%)
cycles in affected programs: 400296984 -> 399000616 (-0.32%)
helped: 765
HURT: 635
helped stats (abs) min: 2 max: 77018 x̄: 6739.33 x̃: 60
helped stats (rel) min: <.01% max: 35.59% x̄: 1.98% x̃: 0.32%
HURT stats (abs)   min: 2 max: 88658 x̄: 6077.51 x̃: 152
HURT stats (rel)   min: <.01% max: 51.33% x̄: 2.75% x̃: 0.63%
95% mean confidence interval for cycles value: -1620.41 -231.54
95% mean confidence interval for cycles %-change: -0.10% 0.44%
Inconclusive result (%-change mean confidence interval includes 0).

LOST:   4
GAINED: 3

Skylake
total instructions in shared programs: 18658324 -> 18579715 (-0.42%)
instructions in affected programs: 2089957 -> 2011348 (-3.76%)
helped: 9842
HURT: 23
helped stats (abs) min: 1 max: 24 x̄: 7.99 x̃: 8
helped stats (rel) min: 0.05% max: 40.00% x̄: 5.37% x̃: 4.52%
HURT stats (abs)   min: 1 max: 5 x̄: 1.57 x̃: 1
HURT stats (rel)   min: 0.02% max: 1.28% x̄: 0.36% x̃: 0.24%
95% mean confidence interval for instructions value: -7.98 -7.95
95% mean confidence interval for instructions %-change: -5.43% -5.29%
Instructions are helped.

total cycles in shared programs: 860031654 -> 860237548 (0.02%)
cycles in affected programs: 449175235 -> 449381129 (0.05%)
helped: 7895
HURT: 4416
helped stats (abs) min: 1 max: 14129 x̄: 113.70 x̃: 22
helped stats (rel) min: <.01% max: 40.95% x̄: 1.31% x̃: 0.56%
HURT stats (abs)   min: 1 max: 33397 x̄: 249.89 x̃: 34
HURT stats (rel)   min: <.01% max: 67.47% x̄: 2.65% x̃: 0.65%
95% mean confidence interval for cycles value: 1.46 31.98
95% mean confidence interval for cycles %-change: 0.02% 0.19%
Cycles are HURT.

LOST:   557
GAINED: 900

fossil-db:

Lunar Lake
Totals:
Instrs: 141933621 -> 141884681 (-0.03%); split: -0.03%, +0.00%
Cycle count: 21990657282 -> 21990200212 (-0.00%); split: -0.14%, +0.14%
Spill count: 69754 -> 69732 (-0.03%); split: -0.05%, +0.02%
Fill count: 128559 -> 128521 (-0.03%); split: -0.05%, +0.02%
Scratch Memory Size: 5934080 -> 5925888 (-0.14%)
Max live registers: 48021653 -> 48051253 (+0.06%); split: -0.00%, +0.06%

Totals from 13510 (2.45% of 551410) affected shaders:
Instrs: 19497180 -> 19448240 (-0.25%); split: -0.25%, +0.00%
Cycle count: 2455370202 -> 2454913132 (-0.02%); split: -1.25%, +1.23%
Spill count: 10975 -> 10953 (-0.20%); split: -0.32%, +0.12%
Fill count: 21709 -> 21671 (-0.18%); split: -0.28%, +0.10%
Scratch Memory Size: 674816 -> 666624 (-1.21%)
Max live registers: 2502653 -> 2532253 (+1.18%); split: -0.01%, +1.19%

Meteor Lake and DG2 had similar results. (Meteor Lake shown)
Totals:
Instrs: 152763523 -> 152772716 (+0.01%); split: -0.00%, +0.01%
Cycle count: 17188701887 -> 17187510768 (-0.01%); split: -0.10%, +0.09%
Spill count: 79280 -> 79279 (-0.00%); split: -0.00%, +0.00%
Fill count: 148809 -> 148803 (-0.00%)
Max live registers: 31879240 -> 31879093 (-0.00%); split: -0.00%, +0.00%
Max dispatch width: 5559984 -> 5559712 (-0.00%); split: +0.00%, -0.01%

Totals from 20524 (3.24% of 633183) affected shaders:
Instrs: 20366964 -> 20376157 (+0.05%); split: -0.01%, +0.05%
Cycle count: 2406162382 -> 2404971263 (-0.05%); split: -0.68%, +0.63%
Spill count: 19935 -> 19934 (-0.01%); split: -0.02%, +0.01%
Fill count: 34487 -> 34481 (-0.02%)
Max live registers: 1745598 -> 1745451 (-0.01%); split: -0.01%, +0.01%
Max dispatch width: 117992 -> 117720 (-0.23%); split: +0.03%, -0.26%

Tiger Lake and Ice Lake had similar results. (Tiger Lake shown)
Totals:
Instrs: 150694108 -> 150683859 (-0.01%); split: -0.01%, +0.00%
Cycle count: 15526754059 -> 15529031079 (+0.01%); split: -0.10%, +0.12%
Max live registers: 31791599 -> 31791441 (-0.00%); split: -0.00%, +0.00%
Max dispatch width: 5569488 -> 5569296 (-0.00%); split: +0.00%, -0.01%

Totals from 15000 (2.37% of 632406) affected shaders:
Instrs: 10965577 -> 10955328 (-0.09%); split: -0.11%, +0.02%
Cycle count: 2025347115 -> 2027624135 (+0.11%); split: -0.80%, +0.91%
Max live registers: 983373 -> 983215 (-0.02%); split: -0.02%, +0.00%
Max dispatch width: 83064 -> 82872 (-0.23%); split: +0.12%, -0.35%

Skylake
Totals:
Instrs: 140588784 -> 140413758 (-0.12%); split: -0.13%, +0.00%
Cycle count: 14724286265 -> 14723402393 (-0.01%); split: -0.04%, +0.04%
Fill count: 100130 -> 100129 (-0.00%)
Max live registers: 31418029 -> 31417146 (-0.00%); split: -0.00%, +0.00%
Max dispatch width: 5513400 -> 5535192 (+0.40%); split: +0.89%, -0.49%

Totals from 39733 (6.35% of 625986) affected shaders:
Instrs: 17240737 -> 17065711 (-1.02%); split: -1.02%, +0.01%
Cycle count: 1994668203 -> 1993784331 (-0.04%); split: -0.31%, +0.27%
Fill count: 44481 -> 44480 (-0.00%)
Max live registers: 2766781 -> 2765898 (-0.03%); split: -0.03%, +0.00%
Max dispatch width: 210600 -> 232392 (+10.35%); split: +23.23%, -12.89%

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32041>
2024-11-08 17:46:45 +00:00
Ian Romanick
be26012f1d brw/opt: Always do copy prop, DCE, and register coalesce after lower_regioning
shader-db:

Lunar Lake
total instructions in shared programs: 18100289 -> 18083853 (-0.09%)
instructions in affected programs: 790048 -> 773612 (-2.08%)
helped: 3058 / HURT: 1

total cycles in shared programs: 921691992 -> 921293816 (-0.04%)
cycles in affected programs: 37210762 -> 36812586 (-1.07%)
helped: 2329 / HURT: 624

LOST:   27
GAINED: 26

Meteor Lake, DG2, Tiger Lake, and Ice Lake had similar results. (Meteor Lake shown)
total instructions in shared programs: 19825635 -> 19821391 (-0.02%)
instructions in affected programs: 138675 -> 134431 (-3.06%)
helped: 877 / HURT: 0

total cycles in shared programs: 907900598 -> 907885713 (<.01%)
cycles in affected programs: 7127161 -> 7112276 (-0.21%)
helped: 318 / HURT: 242

total spills in shared programs: 5790 -> 5758 (-0.55%)
spills in affected programs: 660 -> 628 (-4.85%)
helped: 8 / HURT: 0

total fills in shared programs: 6744 -> 6712 (-0.47%)
fills in affected programs: 708 -> 676 (-4.52%)
helped: 8 / HURT: 0

LOST:   10
GAINED: 0

Skylake
total instructions in shared programs: 18722197 -> 18637637 (-0.45%)
instructions in affected programs: 2757553 -> 2672993 (-3.07%)
helped: 12290 / HURT: 1

total cycles in shared programs: 859716039 -> 859432560 (-0.03%)
cycles in affected programs: 113731837 -> 113448358 (-0.25%)
helped: 9555 / HURT: 2422

LOST:   265
GAINED: 714

fossil-db:

Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Instrs: 142000618 -> 141928331 (-0.05%); split: -0.05%, +0.00%
Subgroup size: 10995136 -> 10995072 (-0.00%)
Cycle count: 21994723230 -> 21990481140 (-0.02%); split: -0.08%, +0.06%
Spill count: 69911 -> 69754 (-0.22%); split: -0.23%, +0.00%
Fill count: 128723 -> 128559 (-0.13%); split: -0.15%, +0.02%
Scratch Memory Size: 5936128 -> 5934080 (-0.03%)
Max live registers: 48006880 -> 48020936 (+0.03%); split: -0.01%, +0.04%

Totals from 17450 (3.16% of 551410) affected shaders:
Instrs: 14984149 -> 14911862 (-0.48%); split: -0.48%, +0.00%
Subgroup size: 365744 -> 365680 (-0.02%)
Cycle count: 2585095128 -> 2580853038 (-0.16%); split: -0.71%, +0.54%
Spill count: 20893 -> 20736 (-0.75%); split: -0.76%, +0.00%
Fill count: 44181 -> 44017 (-0.37%); split: -0.44%, +0.07%
Scratch Memory Size: 995328 -> 993280 (-0.21%)
Max live registers: 2378069 -> 2392125 (+0.59%); split: -0.20%, +0.79%

Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Totals:
Instrs: 150719758 -> 150676269 (-0.03%); split: -0.04%, +0.01%
Subgroup size: 7764560 -> 7764632 (+0.00%)
Cycle count: 15526689814 -> 15525687740 (-0.01%); split: -0.03%, +0.02%
Spill count: 60120 -> 59472 (-1.08%); split: -1.17%, +0.10%
Fill count: 105973 -> 104675 (-1.22%); split: -1.40%, +0.17%
Scratch Memory Size: 2396160 -> 2381824 (-0.60%); split: -0.73%, +0.13%
Max live registers: 31782879 -> 31788857 (+0.02%); split: -0.01%, +0.03%
Max dispatch width: 5569200 -> 5569344 (+0.00%); split: +0.00%, -0.00%

Totals from 10089 (1.60% of 632405) affected shaders:
Instrs: 6389866 -> 6346377 (-0.68%); split: -0.87%, +0.19%
Subgroup size: 102912 -> 102984 (+0.07%)
Cycle count: 681310278 -> 680308204 (-0.15%); split: -0.65%, +0.51%
Spill count: 19571 -> 18923 (-3.31%); split: -3.61%, +0.30%
Fill count: 38229 -> 36931 (-3.40%); split: -3.88%, +0.48%
Scratch Memory Size: 808960 -> 794624 (-1.77%); split: -2.15%, +0.38%
Max live registers: 677473 -> 683451 (+0.88%); split: -0.45%, +1.33%
Max dispatch width: 88672 -> 88816 (+0.16%); split: +0.27%, -0.11%

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32041>
2024-11-08 17:46:45 +00:00
Ian Romanick
04e1783278 brw: Call brw_fs_opt_algebraic less often
No shader-db or fossil-db changes on any Intel platform.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31729>
2024-10-25 23:39:36 +00:00
Caio Oliveira
9537b62759 intel/brw: Add SHADER_OPCODE_REDUCE
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30496>
2024-10-11 06:40:29 +00:00
Caio Oliveira
affa7567c2 intel/brw: Add phases to backend
The general idea is to be able to validate that certain instructions
were lowered and certain restrictions were already handled.  Passes can
now assert their expectations, i.e. if a pass is meant to run after
certain lowerings or not.

The actual phases are an initial stab, and as we reorganize the passes,
we may remove/add phases.

This commit just adds some phase steps; later commits will make use of
them.
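
To illustrate how a pass might assert its expectations, here is a small
sketch.  The phase names, the Shader struct, and the helper functions are
hypothetical; the commit message does not list the real phases.

```cpp
#include <cassert>

// Hypothetical phase names, in the order the backend would reach them.
enum class Phase { AfterNir, AfterOptLoop, AfterLowering, AfterRegAlloc };

struct Shader {
   Phase phase = Phase::AfterNir;
};

// Record that the shader reached a later phase.  Phases only move forward.
void finish_phase(Shader &s, Phase p)
{
   assert(p > s.phase);
   s.phase = p;
}

// True once the shader has reached (or passed) the given phase.
bool reached(const Shader &s, Phase p)
{
   return s.phase >= p;
}

// Example of a pass that must only run once lowering is finished.
bool late_cleanup_pass(Shader &s)
{
   assert(reached(s, Phase::AfterLowering));
   return false; // no progress in this sketch
}
```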

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30496>
2024-10-11 06:40:29 +00:00
Caio Oliveira
2811cb2923 intel: Add statistic for Non SSA registers after NIR to BRW
This is going to be useful while we convert the NIR to BRW to produce
SSA definitions.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30496>
2024-10-11 06:40:29 +00:00
Caio Oliveira
31dfb04fd3 intel/brw: Remove long register file names
The long names were originally meant to map to the HW encoding but
nowadays the actual encoding values depend on gfx version, whether
the instruction is 3src, etc.

Suggested by Ken.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30704>
2024-08-25 22:08:14 +00:00
Kenneth Graunke
c19e5a0a75 intel/brw: Replace predicated break optimization with a simple peephole
We can achieve most of what brw_fs_opt_predicated_break() does with
simple peepholes at NIR -> BRW conversion time.

For predicated break and continue, we can simply look at an IF ... ENDIF
sequence after emitting it.  If there's a single instruction between the
two, and it's a BREAK or CONTINUE, then we can move the predicate from
the IF onto the jump, and delete the IF/ENDIF.  Because we haven't built
the CFG at this stage, we only need to remove them from the linked list
of instructions, which is trivial to do.

For the predicated while optimization, we can rely on the fact that we
already did the predicated break optimization, and simply look for a
predicated BREAK just before the WHILE.  If so, we move the predicate
onto the WHILE, invert it, and remove the BREAK.
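
The two peepholes can be sketched on a plain linked list of instructions.
This is only an illustration under simplified assumptions: the Inst struct
and both function names are hypothetical, and the real instructions carry
far more state than a predicate flag and an inversion bit.

```cpp
#include <cassert>
#include <iterator>
#include <list>

enum class Op { IF, ENDIF, BREAK, CONTINUE, WHILE, MOV };

// Hypothetical, heavily simplified instruction.
struct Inst {
   Op op;
   bool predicated;
   bool inverted;
   Inst(Op o, bool p = false, bool i = false)
      : op(o), predicated(p), inverted(i) {}
};

// Call right after appending an ENDIF.  If the tail of the list is
// IF, BREAK/CONTINUE, ENDIF, move the predicate onto the jump and
// unlink the IF and ENDIF.
bool fold_predicated_jump(std::list<Inst> &insts)
{
   if (insts.size() < 3)
      return false;

   auto endif = std::prev(insts.end());
   auto jump = std::prev(endif);
   auto iff = std::prev(jump);

   if (endif->op != Op::ENDIF || iff->op != Op::IF)
      return false;
   if (jump->op != Op::BREAK && jump->op != Op::CONTINUE)
      return false;

   jump->predicated = iff->predicated;
   jump->inverted = iff->inverted;
   insts.erase(iff);
   insts.erase(endif);
   return true;
}

// Call right after appending a WHILE.  If a predicated BREAK sits just
// before it, move the inverted predicate onto the WHILE and drop the BREAK.
bool fold_predicated_while(std::list<Inst> &insts)
{
   if (insts.size() < 2)
      return false;

   auto loop_end = std::prev(insts.end());
   auto brk = std::prev(loop_end);

   if (loop_end->op != Op::WHILE || brk->op != Op::BREAK || !brk->predicated)
      return false;

   loop_end->predicated = true;
   loop_end->inverted = !brk->inverted;
   insts.erase(brk);
   return true;
}
```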

There are a few cases where this approach does a worse job than the old
one: nir_convert_from_ssa may introduce load_reg and store_reg in blocks
containing break, and nir_trivialize_registers may decide it needs to
insert movs into those blocks.  So, at NIR -> BRW time, we'll actually
emit some MOVs there, which might have been possible to copy propagate
out after later optimizations.

However, the fossil-db results show that it's still pretty competitive.
For instructions, 1017 shaders were helped (average -1.87 instructions),
while only 62 were hurt (average +2.19 instructions).  In affected
shaders, it was -0.08% for instructions.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30498>
2024-08-05 19:17:55 -07:00
Kenneth Graunke
fad63d6483 intel/brw: Delete the brw_fs_opt_dead_control_flow_eliminate() pass
With the select peephole gone, this no longer does much of anything.

No instruction changes in fossil-db on Alchemist.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30498>
2024-08-05 19:17:55 -07:00
Kenneth Graunke
06e8335e11 intel/brw: Delete the brw_fs_opt_peephole_select() pass
Now that we can handle load_ubo in NIR's peephole select pass, the
backend pass isn't really useful anymore.

fossil-db results on Alchemist show almost no impact:

   Totals:
   Instrs: 150646561 -> 150647106 (+0.00%); split: -0.00%, +0.00%
   Cycles: 12633748945 -> 12633760459 (+0.00%)

   Totals from 261 (0.04% of 630008) affected shaders:
   Instrs: 404946 -> 405491 (+0.13%); split: -0.00%, +0.14%
   Cycles: 23947172 -> 23958686 (+0.05%)

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30498>
2024-08-05 19:17:55 -07:00
Caio Oliveira
d00329e821 intel/brw: Replace some fs_reg constructors with functions
Create three helper functions for ATTR, UNIFORM and VGRF creation.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29791>
2024-07-03 02:53:18 +00:00
Sagar Ghuge
99ce8b5a07 intel/compiler: Add indirect mov lowering pass
Indirect addressing (vx1 and vxh) is not supported with the UB/B datatypes
for src0, so we need to change the data type for both dest and src0.

This fixes the following test cases on Xe2+:
 - dEQP-VK.spirv_assembly.instruction.compute.8bit_storage.push_constant_8_to_16*
 - dEQP-VK.spirv_assembly.instruction.compute.8bit_storage.push_constant_8_to_32*

Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29316>
2024-07-01 19:06:31 +00:00
Kenneth Graunke
1e69ec3b8d intel/brw: Add a lower_csel pass and allow building it for all types
We can do CSEL on F, HF, *W, and *D on Gfx11+.  Gfx9 can only do F.

We can lower unsupported types to CMP+CSEL, allowing us to use CSEL
in the IR and not worry about the limitations.
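
The type rule above could be expressed as a simple predicate.  This is a
sketch only: the Type enum and both function names are made up for
illustration, and treating every version below 11 like Gfx9 is an
assumption.

```cpp
#include <cassert>

// Hypothetical subset of register types relevant to CSEL.
enum class Type { F, HF, W, UW, D, UD, DF, Q, UQ };

// True if the hardware can execute CSEL directly on this type.
bool csel_is_native(int gfx_ver, Type t)
{
   assert(gfx_ver >= 9);

   if (t == Type::F)
      return true;          // supported everywhere considered here
   if (gfx_ver < 11)
      return false;         // Gfx9 can only do F

   switch (t) {
   case Type::HF:
   case Type::W:
   case Type::UW:
   case Type::D:
   case Type::UD:
      return true;          // F, HF, *W, and *D on Gfx11+
   default:
      return false;
   }
}

// Anything not native would be rewritten to CMP+CSEL by the lowering pass.
bool csel_needs_lowering(int gfx_ver, Type t)
{
   return !csel_is_native(gfx_ver, t);
}
```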

Rework: (Sagar)
- Update validation pass for CSEL

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29316>
2024-07-01 19:06:31 +00:00
Dylan Baker
35298e84f1 intel/compiler: move predicated_break out of backend loop
This has no impact on the generated shaders, but does have a small
(positive) impact on the amount of time spent in shader compilation.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29126>
2024-06-27 15:20:19 -07:00
Kenneth Graunke
2af84c2d49 intel/brw: Use the defs-based copy propagation along with the old one
The new def-based pass works better in many cases, and should be less
resource intensive.  However, the limited visibility of the defs-based
pass due to many values not being SSA yet makes it unable to fully
replace the old pass.  Try the new one, and if it can't make progress,
then try the old one.  That way, things will mostly be handled by the
new pass, but everything that was being cleaned up still will be.
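
The ordering described here amounts to a short fallback.  The sketch below
stands in for the two passes with callbacks; the function name and the use
of std::function are illustrative assumptions, not the real interfaces.

```cpp
#include <cassert>
#include <functional>

// Try the defs-based pass first; only fall back to the old pass when
// the new one could not make progress.
bool copy_propagate(const std::function<bool()> &defs_based,
                    const std::function<bool()> &legacy)
{
   assert(defs_based && legacy);

   if (defs_based())
      return true;
   return legacy();
}
```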

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
8f09c58ddc intel/brw: Switch to the new defs-based global CSE pass
While the limited visibility due to partial SSA is a downside to the new
pass, it has a huge number of advantages that make it worth switching
over even now.  It's much more efficient, can eliminate redundant memory
loads across blocks, and doesn't generate loads of unnecessary copies
that other passes have to clean up.  This means we also eliminate the
infighting between the old CSE, coalescing, and copy propagation passes.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
84219892ad intel/brw: Make gl_SubgroupInvocation lane index loading SSA
Our code to initialize gl_SubgroupInvocation uses multiple instructions,
some of which are partial writes.  This makes it difficult to analyze
expressions involving gl_SubgroupInvocation, which appear very
frequently in compute shaders.

To make this easier, we add a new virtual opcode which initializes
a full VGRF to the value of gl_SubgroupInvocation.  (We also expand
it to UD for SIMD8 so there are no partial write issues.)  We then
lower it to the original code later on in compilation, after we've
done the bulk of our optimizations.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
545bb8fb6f intel/brw: Replace type_sz and brw_reg_type_to_size with brw_type_size_*
Both of these helpers do the same thing.  We now have brw_type_size_bits
and brw_type_size_bytes and can use whichever makes sense in that place.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28847>
2024-04-25 11:41:48 +00:00
Caio Oliveira
13093ceb3c intel/brw: Move validate out of fs_visitor
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28534>
2024-04-22 13:38:41 -07:00
Caio Oliveira
671d216f39 intel/brw: Remove two duplicated validate calls in optimizer
The OPT macro will call validate() after each pass, so both cases
removed by this patch are just redundant calls.  Will only affect
Debug builds since in Release builds validation is a no-op.
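
For readers unfamiliar with the pattern, a macro of this shape makes an
explicit validate() call after a pass redundant.  This is a simplified
stand-in, not the real OPT macro: the Shader struct, the argument order,
and the check inside validate() are assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical shader state: just enough to have something to validate.
struct Shader {
   int cf_depth = 0;        // open control-flow blocks
   int validate_calls = 0;  // how often validate() ran
};

// Stand-in for the validator; checks one made-up invariant.
void validate(Shader &s)
{
   s.validate_calls++;
   assert(s.cf_depth == 0);
}

// Run a pass, validate the result, and accumulate progress.
#define OPT(progress, pass, shader)             \
   do {                                         \
      bool this_progress = pass(shader);        \
      validate(shader);                         \
      progress = progress || this_progress;     \
   } while (0)

// Trivial passes used to exercise the macro.
bool nop_pass(Shader &) { return false; }
bool busy_pass(Shader &) { return true; }
```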

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28534>
2024-04-22 13:38:41 -07:00
Kenneth Graunke
ba11127944 intel/brw: Fix opt_split_sends() to allow for FIXED_GRF send sources
opt_copy_propagation() can sometimes propagate FIXED_GRF sources into
SHADER_OPCODE_SENDs as the message payload.  For example, GS input
reads, which simply take a URB handle and have the offset in the
descriptor.  For non-VGRFs, there isn't a payload to split, so just
skip past such send messages.

Fixes: 589b03d02f ("intel/fs: Opportunistically split SEND message payloads")
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28067>
2024-03-27 04:52:17 +00:00
Caio Oliveira
b2ee98d2db intel/brw: Handle Xe2 in brw_fs_opt_zero_samples
The mlen tracking is in REG_SIZE units, but in Xe2 each GRF has
doubled the size.  The optimization can only elide full GRFs, so
round down the amount of trailing zeros to ensure the optimization
will remove only full GRFs.
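
The rounding can be sketched as follows.  The function name is
hypothetical, and reg_unit is assumed here to be the number of REG_SIZE
units per physical GRF (1 before Xe2, 2 on Xe2).

```cpp
#include <cassert>

// Given the number of trailing all-zero payload registers, counted in
// REG_SIZE units, return how many of those units can be dropped while
// still removing only whole GRFs.
unsigned elidable_zero_regs(unsigned trailing_zero_regs, unsigned reg_unit)
{
   assert(reg_unit == 1 || reg_unit == 2);
   return (trailing_zero_regs / reg_unit) * reg_unit;
}
```

For example, three trailing zero units on Xe2 round down to two, i.e. one
full GRF; before Xe2 all three can be dropped.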

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28279>
2024-03-21 22:38:54 +00:00
Kenneth Graunke
a075b44493 intel/brw: Eliminate top-level FIND_LIVE_CHANNEL & BROADCAST once
brw_fs_opt_eliminate_find_live_channel eliminates FIND_LIVE_CHANNEL
outside of control flow.  None of our optimization passes generate
additional cases of that instruction, so once it's gone, we shouldn't
ever have to run the pass again.  Moving it out of the loop should
save a bit of CPU time.

While we're at it, also clean adjacent BROADCAST instructions that
consume the result of our FIND_LIVE_CHANNEL.  Without this, we have
to perform copy propagation to get the MOV 0 immediate into the
BROADCAST, then algebraic to turn it into a MOV, which enables more
copy propagation...not to mention CSE gets involved.  Since this
FIND_LIVE_CHANNEL + BROADCAST pattern from emit_uniformize() is
really common, and it's trivial to clean up, we can do that.  This
lets the initial copy prop in the loop see MOV instead of BROADCAST.
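
A simplified picture of the combined cleanup, with hypothetical opcodes and
an Inst struct invented for this sketch (the real pass has additional
conditions that are not modeled here):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { IF, ENDIF, DO, WHILE, FIND_LIVE_CHANNEL, BROADCAST,
                MOV_IMM0, MOV, OTHER };

// Hypothetical instruction: registers are plain integers, -1 means unused.
struct Inst {
   Op op;
   int dst;   // register written
   int src0;  // value source
   int src1;  // for BROADCAST: register holding the channel index
};

// Outside control flow, replace FIND_LIVE_CHANNEL with a MOV of 0, and
// turn an immediately following BROADCAST that consumes it into a MOV.
bool eliminate_top_level_find_live_channel(std::vector<Inst> &insts)
{
   bool progress = false;
   int depth = 0;

   for (std::size_t i = 0; i < insts.size(); i++) {
      Inst &inst = insts[i];

      switch (inst.op) {
      case Op::IF:
      case Op::DO:
         depth++;
         break;
      case Op::ENDIF:
      case Op::WHILE:
         depth--;
         break;
      case Op::FIND_LIVE_CHANNEL:
         if (depth != 0)
            break;
         inst.op = Op::MOV_IMM0;
         progress = true;
         if (i + 1 < insts.size() &&
             insts[i + 1].op == Op::BROADCAST &&
             insts[i + 1].src1 == inst.dst) {
            insts[i + 1].op = Op::MOV;   // reads channel 0 of src0
            insts[i + 1].src1 = -1;
         }
         break;
      default:
         break;
      }
   }
   return progress;
}
```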

Zero impact on fossil-db, but less work in the optimization loop.

Together with the previous patches, this cuts compile time in
Borderlands 3 on Alchemist by -1.38539% +/- 0.1632% (n = 24).

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28286>
2024-03-20 01:04:22 -07:00
Kenneth Graunke
ea423aba1b intel/brw: Split out 64-bit lowering from algebraic optimizations
We don't necessarily want to split up MOVs for 64-bit addresses into
2x 32-bit MOVs right away, as this makes things like copy propagating
the whole address around harder.  We should do this late, once, while
still doing other algebraic optimizations earlier.

fossil-db results for Alchemist show tiny improvements:

   Totals:
   Instrs: 161310502 -> 161310436 (-0.00%); split: -0.00%, +0.00%
   Cycles: 14370605606 -> 14370605159 (-0.00%); split: -0.00%, +0.00%

   Totals from 33 (0.01% of 652298) affected shaders:
   Instrs: 15053 -> 14987 (-0.44%); split: -0.64%, +0.20%
   Cycles: 196947 -> 196500 (-0.23%); split: -0.25%, +0.02%

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28286>
2024-03-20 01:04:17 -07:00
Kenneth Graunke
bb191e3af5 intel/brw: Call constant combining after copy propagation/algebraic
This copy propagation can create MADs with immediates in src1, which
need to be cleaned up by constant combining (which puts them back in
VGRFs).

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27876>
2024-03-05 11:39:26 +00:00
Caio Oliveira
d9552fccf2 intel/brw: Remove extra stage_prog_data field in fs_visitor
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27861>
2024-02-29 19:28:06 +00:00
Caio Oliveira
559d94cd0d intel/brw: Use fs_visitor instead of backend_shader in various passes
And since we are touching them, rename a couple of passes
to follow the same naming convention as the existing ones.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27861>
2024-02-29 19:28:05 +00:00
Caio Oliveira
7ac5696157 intel/brw: Remove Gfx8- code from backend passes
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27691>
2024-02-28 05:45:38 +00:00
Caio Oliveira
f3b7f4726a intel/brw: Move optimize and small optimizations to brw_fs_opt.cpp
Remaining optimizations in brw_fs.cpp will get their own files.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26887>
2024-02-26 20:54:25 +00:00