We return an immediate for 32-bit constant values, but fall back to
calling get_nir_src() for other values, as 64-bit, and even 8-bit
immediates have odd restrictions. We could probably support 16-bit
here without too many issues, but we leave it be for now.
This makes it usable for case where we'd like to get constants for
32-bit values but where it may be a different bit-size too.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
We were already skipping unread trailing components, but now we skip
them on both ends.
About -3.5% spills on Shadow of the Tomb Raider on Alchemist (mostly a
wash elsewhere, but it will help additional shaders with later patches).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
HDC doesn't support block loads/stores with sub-DWord (<4B) aligned
offsets, and shared local memory has to use the Aligned OWord Block
messages which require OWord (16B) alignment.
Make the validator detect this case and say no. Also make the lowering
code assert that the alignment is valid as a second line of defense.
LSC has no such restrictions.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
Just assert that the array will fit whatever the MAX is for a given
Gfx version.
Fixes: 172c1ab984 ("intel/elk: Add ELK_MAX_MRF_ALL for static allocating arrays")
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32978>
Was causing trouble in some build configurations, we don't really need
them. Unless there's a good reason, defaults to use ralloc for
consistency with the larger codebase.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Antonio Ospite <None>
Reviewed-by: Kenneth Graunke <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32916>
Was causing trouble in some build configurations, we don't really need
them. Use ralloc for consistency.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Antonio Ospite <None>
Reviewed-by: Kenneth Graunke <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32916>
Fix defect reported by Coverity Scan.
Side effect in assertion (ASSERT_SIDE_EFFECT)
assert_side_effect: Argument ++eot_count of assert() has a side effect.
The containing function might work differently in a non-debug build.
Fixes: ebd6738260 ("intel/elk/chv: Implement WaClearArfDependenciesBeforeEot")
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32884>
Marek recently changed hole_size to be signed, rather than unsigned.
A negative hole_size means that the two loads overlap - and thus are
prime candidates to be combined.
My original hole_size handling was:
if hole_size > 4 * (8 - low->num_components) then don't vectorize
For non-overlapping loads, this worked: NIR's largest vector is vec16,
and if low was already a vec16, combining it with anything would exceed
that, so it'd never be considered. That meant low would always be a
vec8 or less, so (8 - low->num_components) was a positive number.
Now that we see overlapping loads, we can see a vec16 low, vec4 high,
and also a negative hole size, giving us fun comparisons like:
-16 > 4 * (8 - 16) => -16 > -32 => true, don't vectorize
Which is absolutely the wrong thing to do, because the high load's data
is entirely included within the former load's data.
The idea here was to make sure the second load would be able to pack at
least one component into the first's V8 result. But even this isn't the
best, because...even if it's simply adjacent, doing one V16 load is more
efficient than requesting two back to back V8 loads.
So, we just simplify down to a static check: if there's an entire V8 of
hole, don't vectorize. This already won't happen because the core pass
has max_hole set to 28 bytes (7 32-bit components), but that could
change based on the needs of other drivers, so let's be defensive.
fossil-db results on Alchemist:
Instrs: 161533978 -> 161295137 (-0.15%); split: -0.20%, +0.05%
Subgroup size: 8092544 -> 8092568 (+0.00%)
Send messages: 7915233 -> 7844503 (-0.89%); split: -0.94%, +0.05%
Cycle count: 16577700697 -> 16702609256 (+0.75%); split: -0.59%, +1.35%
Spill count: 72338 -> 67226 (-7.07%); split: -7.36%, +0.29%
Fill count: 134058 -> 125980 (-6.03%); split: -6.83%, +0.80%
Scratch Memory Size: 4092928 -> 3786752 (-7.48%); split: -7.53%, +0.05%
Max live registers: 33031460 -> 32945994 (-0.26%); split: -0.27%, +0.01%
Max dispatch width: 5778384 -> 5778536 (+0.00%); split: +0.26%, -0.26%
Non SSA regs after NIR: 179809505 -> 152735471 (-15.06%); split: -15.08%, +0.03%
Fixes: c21bc65ba7 ("nir/opt_load_store_vectorize: make hole_size signed to indicate overlapping loads")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32932>
When a SEND instruction is a EOT, the scoreboard lowering will not
allocate a new SBID for it, since nothing needs to wait for it. In
Gfx12 this allowed the SEND to get out-of-order $.dst or $.src
dependencies.
Starting on Xe2+ this is not supported anymore, in favor of supporting
more combined modes.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32712>
The push_constant_loc[] array is always an identity mapping these days,
so it's kind of pointless. Just use the original uniform number and
skip the unnecessary "remap" step. With that gone, and shrinking UBO
ranges gone, assign_constant_locations() is now empty and can be removed
as well.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32841>
Now that we never shrink ranges in the backend, we never lower push
constants to pull constants late in the backend either. get_pull_loc
will never return true, and so all of brw_lower_constant_loads becomes
a noop.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32841>
Back in the bad old days (vec4?) we had a bunch of smarts in the backend
to dead code eliminate unused vector components and re-pack regular
uniforms, so we really couldn't decide how much data we were pushing
until very late in the backend. Nowadays we have none of that - we do
all of our elimination and packing in NIR. anv shrinks ranges to deal
with Vulkan API push constants, and iris treats everything as a UBO and
as of the previous commit will also shrink appropriately.
So we don't need to do this anymore...which will let us simplify quite
a bit of code.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32841>
anv already does this limiting, since it needs to handle non-UBO push
constants as well. iris treats everything as a UBO, but doesn't have
a limiter and was relying on the backend to handle it.
Do this in the NIR pass so that we can eliminate the backend code.
It's not necessary for anv, but handling it here is simple and less
error prone for iris, which calls this in a number of places. We know
we need to limit things to this much; anv can limit more if needed.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32841>
A negative hole size means the loads overlap. This will be used by drivers
to handle overlapping loads in the callback easily.
Reviewed-by: Mel Henning <drawoc@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32699>
The lowering code does not generate efficient code. It is better to
just not emit the bad thing in the first place. The shaders that I
examined had blocks of NIR like:
con 32 %527 = extract_u8 %456.o, %5 (0x0)
con 32 %528 = extract_u8 %456.o, %35 (0x1)
con 32 %529 = extract_u8 %456.o, %14 (0x2)
con 32 %530 = extract_u8 %456.o, %11 (0x3)
con 32 %531 = u2f32 %527
con 32 %532 = u2f32 %528
con 32 %533 = u2f32 %529
con 32 %534 = u2f32 %530
In some cases the u2f results are multiplied with 1/255. There may be
a slightly more efficient way to do this by doing something like
mov(8) g40<1>UW g12.1<32,8,4>UB
mov(8) g41<1>UW g12.2<32,8,4>UB
mov(8) g42<1>UW g12.3<32,8,4>UB
mov(8) g60<1>F g12<32,8,4>UB
mov(8) g61<1>F g40<1,1,0>UW
mov(8) g62<1>F g41<1,1,0>UW
mov(8) g63<1>F g42<1,1,0>UW
In SIMD16 and SIMD32 that would save temporary register space. It could
save a register in SIMD8 by using g40.8 instead of g42. Making that
happen might be tricky. Maybe we should just add a special NIR opcode
that converts a packed uint32 to a vec4?
v2: Add a bunch of documentation explaining what's going on. Suggested
by Ken.
shader-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
total instructions in shared programs: 18228689 -> 18228720 (<.01%)
instructions in affected programs: 43091 -> 43122 (0.07%)
helped: 0 / HURT: 30
total cycles in shared programs: 932542994 -> 932544290 (<.01%)
cycles in affected programs: 8150758 -> 8152054 (0.02%)
helped: 15 / HURT: 17
fossil-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Instrs: 142890605 -> 142890392 (-0.00%); split: -0.00%, +0.00%
Cycle count: 21655049536 -> 21654693720 (-0.00%); split: -0.00%, +0.00%
Totals from 181 (0.03% of 553251) affected shaders:
Instrs: 188022 -> 187809 (-0.11%); split: -0.12%, +0.01%
Cycle count: 85291658 -> 84935842 (-0.42%); split: -0.47%, +0.05%
Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Totals:
Instrs: 154438050 -> 154436980 (-0.00%)
Cycle count: 15334650326 -> 15334644375 (-0.00%); split: -0.00%, +0.00%
Spill count: 56754 -> 56706 (-0.08%)
Fill count: 95919 -> 95808 (-0.12%)
Scratch Memory Size: 2306048 -> 2304000 (-0.09%)
Max live registers: 32469924 -> 32469899 (-0.00%)
Totals from 112 (0.02% of 642922) affected shaders:
Instrs: 156186 -> 155116 (-0.69%)
Cycle count: 11111478 -> 11105527 (-0.05%); split: -0.62%, +0.56%
Spill count: 1766 -> 1718 (-2.72%)
Fill count: 2815 -> 2704 (-3.94%)
Scratch Memory Size: 78848 -> 76800 (-2.60%)
Max live registers: 11526 -> 11501 (-0.22%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
Between 5 and 10 shaders (depending on the platform) from Blender are
massively helped for spills and fills (e.g., from 45 spills to 0, and
180 fills to 0).
Previously this commit cause a lot of spill and fill damage to
Wolfenstein Youngblood and Red Dead Redemption 2. I believe due to
!32041 and !32097, this is no longer the case. RDR2 is helped, and
Wolfenstein Youngblood has no changes.
However, q2rtx/q2rtx-rt-pipeline is hurt:
Spill count: 126 -> 175 (+38.89%); split: -0.79%, +39.68%
Fill count: 156 -> 235 (+50.64%); split: -1.92%, +52.56%
By the end of this series this damage is fixed, and q2rtx is helped
overall by -0.79% spills and -1.92% fills.
v2: Fix for Xe2.
v3: Just keep using bld for the group(1, 0) call. Suggested by Ken.
v4: Major re-write. Pass bld and xbld to fs_emit_memory_access. The big
fix is changing the way srcs[MEMORY_LOGICAL_ADDRESS] is calculated
(around line 7180). In previous versions of the commit, the address
would be calculated using bld (which is now xbld) even if the address
source was not is_scalar. This could cause the emit_uniformize (later in
the function) to fetch garbage. This also drops the special case
handling of constant offset. Constant propagation and algebraic will
handle this.
v5: Fix a subtle bug that was ultimately caused by the removal of
offset_to_component. The MEMORY_LOGICAL_ADDRESS for
load_shared_uniform_block_intel was being calculated as SIMD16 on LNL,
but the later emit_uniformize would treat it as SIMD32. This caused GPU
hangs in Assassin's Creed Valhalla.
v6: Fix a bug in D16 to D16U32 expansion. Noticed by Ken. Add a comment
explaining bld vs xbld vs ubld in fs_nir_emit_memory_access. Suggested
by Ken.
v7: Revert some of the v6 changes related to D16 to D16U32
expansion. This code was mostly correct. xbld is correct because DATA0
needs to be generated in size of the eventual SEND instruction. Using
offset(nir_src, xbld, c) will cause offset() to correctly added
component(..., 0) if nir_src.is_scalar but xbld is not scalar_group().
v8: nir_intrinsic_load_shared_uniform_block_intel was removed. This
caused reproducible hangs in Assassin's Creed: Valhalla. There are some
other compiler issues related to this game, and we're not yet sure
exactly what the cause of any of it is.
shader-db:
Lunar Lake
total instructions in shared programs: 18058270 -> 18068886 (0.06%)
instructions in affected programs: 5196846 -> 5207462 (0.20%)
helped: 4442 / HURT: 11416
total cycles in shared programs: 921324492 -> 919819398 (-0.16%)
cycles in affected programs: 733274162 -> 731769068 (-0.21%)
helped: 11312 / HURT: 31788
total spills in shared programs: 3633 -> 3585 (-1.32%)
spills in affected programs: 48 -> 0
helped: 5 / HURT: 0
total fills in shared programs: 2277 -> 2198 (-3.47%)
fills in affected programs: 79 -> 0
helped: 5 / HURT: 0
LOST: 123
GAINED: 377
Meteor Lake, DG2, and Tiger Lake had similar results. (Meteor Lake shown)
total instructions in shared programs: 19703458 -> 19699173 (-0.02%)
instructions in affected programs: 5885251 -> 5880966 (-0.07%)
helped: 4545 / HURT: 14971
total cycles in shared programs: 903497253 -> 902054570 (-0.16%)
cycles in affected programs: 691762248 -> 690319565 (-0.21%)
helped: 16412 / HURT: 28080
total spills in shared programs: 4894 -> 4646 (-5.07%)
spills in affected programs: 248 -> 0
helped: 7 / HURT: 0
total fills in shared programs: 6638 -> 5581 (-15.92%)
fills in affected programs: 1057 -> 0
helped: 7 / HURT: 0
LOST: 427
GAINED: 978
Ice Lake and Skylake had similar results. (Ice Lake shonw)
total instructions in shared programs: 20384200 -> 20384889 (<.01%)
instructions in affected programs: 5295084 -> 5295773 (0.01%)
helped: 5309 / HURT: 12564
total cycles in shared programs: 873002832 -> 872515246 (-0.06%)
cycles in affected programs: 463413458 -> 462925872 (-0.11%)
helped: 16079 / HURT: 13339
total spills in shared programs: 4552 -> 4373 (-3.93%)
spills in affected programs: 546 -> 367 (-32.78%)
helped: 11 / HURT: 0
total fills in shared programs: 5298 -> 4657 (-12.10%)
fills in affected programs: 1798 -> 1157 (-35.65%)
helped: 10 / HURT: 0
LOST: 380
GAINED: 925
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 141528822 -> 141728392 (+0.14%); split: -0.21%, +0.35%
Subgroup size: 10968048 -> 10968144 (+0.00%)
Send messages: 6567930 -> 6567909 (-0.00%)
Cycle count: 22165780202 -> 21624534624 (-2.44%); split: -3.09%, +0.65%
Spill count: 69890 -> 66665 (-4.61%); split: -5.06%, +0.44%
Fill count: 128331 -> 120189 (-6.34%); split: -7.44%, +1.09%
Scratch Memory Size: 5829632 -> 5664768 (-2.83%); split: -2.86%, +0.04%
Max live registers: 47928290 -> 47611371 (-0.66%); split: -0.71%, +0.05%
Totals from 364369 (66.18% of 550563) affected shaders:
Instrs: 113448842 -> 113648412 (+0.18%); split: -0.26%, +0.44%
Subgroup size: 7694080 -> 7694176 (+0.00%)
Send messages: 5308287 -> 5308266 (-0.00%)
Cycle count: 21885237842 -> 21343992264 (-2.47%); split: -3.13%, +0.65%
Spill count: 65152 -> 61927 (-4.95%); split: -5.42%, +0.47%
Fill count: 122811 -> 114669 (-6.63%); split: -7.77%, +1.14%
Scratch Memory Size: 5438464 -> 5273600 (-3.03%); split: -3.07%, +0.04%
Max live registers: 34355310 -> 34038391 (-0.92%); split: -1.00%, +0.07%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
No shader-db changes on any Intel platform.
fossil-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Instrs: 141808595 -> 141808815 (+0.00%); split: -0.00%, +0.00%
Cycle count: 22181300418 -> 22185066952 (+0.02%); split: -0.01%, +0.03%
Max live registers: 48052077 -> 48052083 (+0.00%)
Totals from 720 (0.13% of 551446) affected shaders:
Instrs: 116778 -> 116998 (+0.19%); split: -0.01%, +0.20%
Cycle count: 1197931082 -> 1201697616 (+0.31%); split: -0.21%, +0.53%
Max live registers: 56552 -> 56558 (+0.01%)
No fossil-db changes on any other Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
No shader-db changes on any Intel platform.
v2: Fix for Xe2.
v3: Rework the way that we determine that an intrinsic can actually be
convergent. This will now depend on whether or not the important
sources have previously be determined to be convergent. Fixes
intermitent failures in some test cases (including
dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.push_constant_float_16_to_32.scalar_frag).
v4: s/the it/it/ in a comment. Noticed by Ken.
fossil-db:
No fossil-db changes on Lunar Lake.
Meteor Lake and DG2 had similar results. (Meteor Lake shown)
Totals:
Instrs: 152743449 -> 152743161 (-0.00%)
Cycle count: 17399179660 -> 17399193488 (+0.00%)
Totals from 144 (0.02% of 633314) affected shaders:
Instrs: 5936 -> 5648 (-4.85%)
Cycle count: 51616 -> 65444 (+26.79%)
Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Totals:
Instrs: 150646195 -> 150645907 (-0.00%)
Cycle count: 15618427818 -> 15618428942 (+0.00%)
Totals from 144 (0.02% of 632567) affected shaders:
Instrs: 6218 -> 5930 (-4.63%)
Cycle count: 39968 -> 41092 (+2.81%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
This commit used to be "brw/emit: Allow scalar sources to 64-bit
3-source instructions". These instructions were fixed up in
brw_eu_emit. There seems to be some conflict with the <0,1,0> stride an
post-RA scheduling. The only difference between the passing code
generated by this commit and the failing code generated by the older
commit is some post-RA scheduling.
v2: Change the stride of a MAD even if the instruction isn't
lowered. MAD instructions that are already SIMD8 have to follow the same
rules. 🤦
v3: Pull the lowering out to its own pass. Update the comment in
brw_fs_validate. Suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>