A negative hole size means the loads overlap. This will be used by drivers
to handle overlapping loads in the callback easily.
Reviewed-by: Mel Henning <drawoc@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32699>
The lowering code does not generate efficient code. It is better to
just not emit the bad thing in the first place. The shaders that I
examined had blocks of NIR like:
con 32 %527 = extract_u8 %456.o, %5 (0x0)
con 32 %528 = extract_u8 %456.o, %35 (0x1)
con 32 %529 = extract_u8 %456.o, %14 (0x2)
con 32 %530 = extract_u8 %456.o, %11 (0x3)
con 32 %531 = u2f32 %527
con 32 %532 = u2f32 %528
con 32 %533 = u2f32 %529
con 32 %534 = u2f32 %530
In some cases the u2f results are multiplied with 1/255. There may be
a slightly more efficient way to do this by doing something like
mov(8) g40<1>UW g12.1<32,8,4>UB
mov(8) g41<1>UW g12.2<32,8,4>UB
mov(8) g42<1>UW g12.3<32,8,4>UB
mov(8) g60<1>F g12<32,8,4>UB
mov(8) g61<1>F g40<1,1,0>UW
mov(8) g62<1>F g41<1,1,0>UW
mov(8) g63<1>F g42<1,1,0>UW
In SIMD16 and SIMD32 that would save temporary register space. It could
save a register in SIMD8 by using g40.8 instead of g42. Making that
happen might be tricky. Maybe we should just add a special NIR opcode
that converts a packed uint32 to a vec4?
v2: Add a bunch of documentation explaining what's going on. Suggested
by Ken.
shader-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
total instructions in shared programs: 18228689 -> 18228720 (<.01%)
instructions in affected programs: 43091 -> 43122 (0.07%)
helped: 0 / HURT: 30
total cycles in shared programs: 932542994 -> 932544290 (<.01%)
cycles in affected programs: 8150758 -> 8152054 (0.02%)
helped: 15 / HURT: 17
fossil-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Instrs: 142890605 -> 142890392 (-0.00%); split: -0.00%, +0.00%
Cycle count: 21655049536 -> 21654693720 (-0.00%); split: -0.00%, +0.00%
Totals from 181 (0.03% of 553251) affected shaders:
Instrs: 188022 -> 187809 (-0.11%); split: -0.12%, +0.01%
Cycle count: 85291658 -> 84935842 (-0.42%); split: -0.47%, +0.05%
Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Totals:
Instrs: 154438050 -> 154436980 (-0.00%)
Cycle count: 15334650326 -> 15334644375 (-0.00%); split: -0.00%, +0.00%
Spill count: 56754 -> 56706 (-0.08%)
Fill count: 95919 -> 95808 (-0.12%)
Scratch Memory Size: 2306048 -> 2304000 (-0.09%)
Max live registers: 32469924 -> 32469899 (-0.00%)
Totals from 112 (0.02% of 642922) affected shaders:
Instrs: 156186 -> 155116 (-0.69%)
Cycle count: 11111478 -> 11105527 (-0.05%); split: -0.62%, +0.56%
Spill count: 1766 -> 1718 (-2.72%)
Fill count: 2815 -> 2704 (-3.94%)
Scratch Memory Size: 78848 -> 76800 (-2.60%)
Max live registers: 11526 -> 11501 (-0.22%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
Between 5 and 10 shaders (depending on the platform) from Blender are
massively helped for spills and fills (e.g., from 45 spills to 0, and
180 fills to 0).
Previously this commit cause a lot of spill and fill damage to
Wolfenstein Youngblood and Red Dead Redemption 2. I believe due to
!32041 and !32097, this is no longer the case. RDR2 is helped, and
Wolfenstein Youngblood has no changes.
However, q2rtx/q2rtx-rt-pipeline is hurt:
Spill count: 126 -> 175 (+38.89%); split: -0.79%, +39.68%
Fill count: 156 -> 235 (+50.64%); split: -1.92%, +52.56%
By the end of this series this damage is fixed, and q2rtx is helped
overall by -0.79% spills and -1.92% fills.
v2: Fix for Xe2.
v3: Just keep using bld for the group(1, 0) call. Suggested by Ken.
v4: Major re-write. Pass bld and xbld to fs_emit_memory_access. The big
fix is changing the way srcs[MEMORY_LOGICAL_ADDRESS] is calculated
(around line 7180). In previous versions of the commit, the address
would be calculated using bld (which is now xbld) even if the address
source was not is_scalar. This could cause the emit_uniformize (later in
the function) to fetch garbage. This also drops the special case
handling of constant offset. Constant propagation and algebraic will
handle this.
v5: Fix a subtle bug that was ultimately caused by the removal of
offset_to_component. The MEMORY_LOGICAL_ADDRESS for
load_shared_uniform_block_intel was being calculated as SIMD16 on LNL,
but the later emit_uniformize would treat it as SIMD32. This caused GPU
hangs in Assassin's Creed Valhalla.
v6: Fix a bug in D16 to D16U32 expansion. Noticed by Ken. Add a comment
explaining bld vs xbld vs ubld in fs_nir_emit_memory_access. Suggested
by Ken.
v7: Revert some of the v6 changes related to D16 to D16U32
expansion. This code was mostly correct. xbld is correct because DATA0
needs to be generated in size of the eventual SEND instruction. Using
offset(nir_src, xbld, c) will cause offset() to correctly added
component(..., 0) if nir_src.is_scalar but xbld is not scalar_group().
v8: nir_intrinsic_load_shared_uniform_block_intel was removed. This
caused reproducible hangs in Assassin's Creed: Valhalla. There are some
other compiler issues related to this game, and we're not yet sure
exactly what the cause of any of it is.
shader-db:
Lunar Lake
total instructions in shared programs: 18058270 -> 18068886 (0.06%)
instructions in affected programs: 5196846 -> 5207462 (0.20%)
helped: 4442 / HURT: 11416
total cycles in shared programs: 921324492 -> 919819398 (-0.16%)
cycles in affected programs: 733274162 -> 731769068 (-0.21%)
helped: 11312 / HURT: 31788
total spills in shared programs: 3633 -> 3585 (-1.32%)
spills in affected programs: 48 -> 0
helped: 5 / HURT: 0
total fills in shared programs: 2277 -> 2198 (-3.47%)
fills in affected programs: 79 -> 0
helped: 5 / HURT: 0
LOST: 123
GAINED: 377
Meteor Lake, DG2, and Tiger Lake had similar results. (Meteor Lake shown)
total instructions in shared programs: 19703458 -> 19699173 (-0.02%)
instructions in affected programs: 5885251 -> 5880966 (-0.07%)
helped: 4545 / HURT: 14971
total cycles in shared programs: 903497253 -> 902054570 (-0.16%)
cycles in affected programs: 691762248 -> 690319565 (-0.21%)
helped: 16412 / HURT: 28080
total spills in shared programs: 4894 -> 4646 (-5.07%)
spills in affected programs: 248 -> 0
helped: 7 / HURT: 0
total fills in shared programs: 6638 -> 5581 (-15.92%)
fills in affected programs: 1057 -> 0
helped: 7 / HURT: 0
LOST: 427
GAINED: 978
Ice Lake and Skylake had similar results. (Ice Lake shonw)
total instructions in shared programs: 20384200 -> 20384889 (<.01%)
instructions in affected programs: 5295084 -> 5295773 (0.01%)
helped: 5309 / HURT: 12564
total cycles in shared programs: 873002832 -> 872515246 (-0.06%)
cycles in affected programs: 463413458 -> 462925872 (-0.11%)
helped: 16079 / HURT: 13339
total spills in shared programs: 4552 -> 4373 (-3.93%)
spills in affected programs: 546 -> 367 (-32.78%)
helped: 11 / HURT: 0
total fills in shared programs: 5298 -> 4657 (-12.10%)
fills in affected programs: 1798 -> 1157 (-35.65%)
helped: 10 / HURT: 0
LOST: 380
GAINED: 925
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 141528822 -> 141728392 (+0.14%); split: -0.21%, +0.35%
Subgroup size: 10968048 -> 10968144 (+0.00%)
Send messages: 6567930 -> 6567909 (-0.00%)
Cycle count: 22165780202 -> 21624534624 (-2.44%); split: -3.09%, +0.65%
Spill count: 69890 -> 66665 (-4.61%); split: -5.06%, +0.44%
Fill count: 128331 -> 120189 (-6.34%); split: -7.44%, +1.09%
Scratch Memory Size: 5829632 -> 5664768 (-2.83%); split: -2.86%, +0.04%
Max live registers: 47928290 -> 47611371 (-0.66%); split: -0.71%, +0.05%
Totals from 364369 (66.18% of 550563) affected shaders:
Instrs: 113448842 -> 113648412 (+0.18%); split: -0.26%, +0.44%
Subgroup size: 7694080 -> 7694176 (+0.00%)
Send messages: 5308287 -> 5308266 (-0.00%)
Cycle count: 21885237842 -> 21343992264 (-2.47%); split: -3.13%, +0.65%
Spill count: 65152 -> 61927 (-4.95%); split: -5.42%, +0.47%
Fill count: 122811 -> 114669 (-6.63%); split: -7.77%, +1.14%
Scratch Memory Size: 5438464 -> 5273600 (-3.03%); split: -3.07%, +0.04%
Max live registers: 34355310 -> 34038391 (-0.92%); split: -1.00%, +0.07%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
No shader-db changes on any Intel platform.
fossil-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Instrs: 141808595 -> 141808815 (+0.00%); split: -0.00%, +0.00%
Cycle count: 22181300418 -> 22185066952 (+0.02%); split: -0.01%, +0.03%
Max live registers: 48052077 -> 48052083 (+0.00%)
Totals from 720 (0.13% of 551446) affected shaders:
Instrs: 116778 -> 116998 (+0.19%); split: -0.01%, +0.20%
Cycle count: 1197931082 -> 1201697616 (+0.31%); split: -0.21%, +0.53%
Max live registers: 56552 -> 56558 (+0.01%)
No fossil-db changes on any other Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
No shader-db changes on any Intel platform.
v2: Fix for Xe2.
v3: Rework the way that we determine that an intrinsic can actually be
convergent. This will now depend on whether or not the important
sources have previously be determined to be convergent. Fixes
intermitent failures in some test cases (including
dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.push_constant_float_16_to_32.scalar_frag).
v4: s/the it/it/ in a comment. Noticed by Ken.
fossil-db:
No fossil-db changes on Lunar Lake.
Meteor Lake and DG2 had similar results. (Meteor Lake shown)
Totals:
Instrs: 152743449 -> 152743161 (-0.00%)
Cycle count: 17399179660 -> 17399193488 (+0.00%)
Totals from 144 (0.02% of 633314) affected shaders:
Instrs: 5936 -> 5648 (-4.85%)
Cycle count: 51616 -> 65444 (+26.79%)
Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Totals:
Instrs: 150646195 -> 150645907 (-0.00%)
Cycle count: 15618427818 -> 15618428942 (+0.00%)
Totals from 144 (0.02% of 632567) affected shaders:
Instrs: 6218 -> 5930 (-4.63%)
Cycle count: 39968 -> 41092 (+2.81%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
This commit used to be "brw/emit: Allow scalar sources to 64-bit
3-source instructions". These instructions were fixed up in
brw_eu_emit. There seems to be some conflict with the <0,1,0> stride an
post-RA scheduling. The only difference between the passing code
generated by this commit and the failing code generated by the older
commit is some post-RA scheduling.
v2: Change the stride of a MAD even if the instruction isn't
lowered. MAD instructions that are already SIMD8 have to follow the same
rules. 🤦
v3: Pull the lowering out to its own pass. Update the comment in
brw_fs_validate. Suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
Sources that are scalars (almost all source) and convergent generally
want <0,1,0> source stride. Sources that are vectors (e.g., texture
coordinates, SSBO write data, etc.) and convergent want no extra strides
applied. In nearly all cases LOAD_PAYLOAD lowering will do the right
thing.
v2: Use VEC in emit_pixel_interpolater_send. Suggested by Ken.
v3: With the elimination of offset_to_component(), offset() may not
convert an is_scalar source to have a zero stride. Explicitly do this
in get_nir_src and prepare_alu_destination_and_sources.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
In SIMD16 and SIMD32, storing convergent values in full 16- or
32-channel registers is wasteful. It wastes register space, and in most
cases on SIMD32, it wastes instructions. Our register allocator is not
clever enough to handle scalar allocations. It's fundamental unit of
allocation is SIMD8. Start treating convergent values as SIMD8.
Add a tracking bit in brw_reg to specify that a register represents a
convergent, scalar value. This has two implications:
1. All channels of the SIMD8 register must contain the same value. In
general, this means that writes to the register must be
force_writemask_all and exec_size = 8;
2. Reads of this register can (and should) use <0,1,0> stride. SIMD8
instructions that have restrictions on source stride can us <8,8,1>.
Values that are vectors (e.g., results of load_uniform or texture
operations) will be stored as multiple SIMD8 hardware registers.
v2: brw_fs_opt_copy_propagation_defs fix from Ken. Fix for Xe2.
v3: Eliminte offset_to_scalar(). Remove mention of vec4 backend in
brw_reg.h. Both suggested by Caio. The offset_to_scalar() change
necessitates some trickery in the fs_builder offset() function, but I
think this is an improvement overall. There is also some rework in
find_value_for_offset to account for the possibility that is_scalar
sources in LOAD_PAYLOAD might be <8;8,1> or <0;1,0>.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
This isn't used now, but future commits will add uses. Doing this as a
separate commit removes a lot of "just typing" churn from commits that
have real changes to review.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
As Asahi, Intel and soon Panfrost requires an offline compiler for their
respective internal shaders, this commit adds generic new options to
workaround meson current limitations around cross-compillation.
Signed-off-by: Mary Guillemard <mary.guillemard@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32719>
Disable mesh autostrip for platforms that need WA 16020916187.
Additionally, zero out the layer and viewport slots when a shading rate
is found through the brw_nir_initialize_mue pass.
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32751>
The SIMD splitting pass does not handle wide force_writemask_all
instructions correctly at the moment. For example, a SIMD32 TXF
on pre-Xe2 would get split to a pair of SIMD16. But it will set
the groups to operate on channels 15:0 and 31:16. That's not what
we want for a NoMask instruction - both should be 15:0, i.e.
bld.group(inst->exec_size, 0).
We could (and perhaps should) fix the SIMD splitting pass to handle
this, but the pass already has subtle complexity in which builders
are used. Or we could alter fs_builder::group(), but that has broader
implications. As a stop-gap, just make opt_combine_covergent_txfs stop
relying on SIMD splitting. It's trivial to do and fixes the issue
without risking other breakage.
Fixes: 6341b3cd87 ("brw: Combine convergent texture buffer fetches into fewer loads")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32714>
According to HSD 14016252163 if compute shader uses the sample
operation, morton walk order and set the thread group batch size to 4 is
expected to increase sampler cache hit rates by increasing sample
address locality within a subslice.
Rework:
* Caio: "||" => "&&" for type checking in instr_uses_sampler()
* Jordan: Use nir's foreach macros rather than
nir_shader_lower_instructions()
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32430>
And the SEND gather variant that uses a scalar register as its only
source.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32236>