Our validation code doesn't need to know which bytes are accessed. It
only needs to know which grfs were accessed by an element. This also
helps to easily handle the Xe2 register size change.
Backport-to: 24.2
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28479>
Previously this check would create a mask of the bytes used in the
grf, and then shift the mask. This worked well when there were 32 bytes
in the register because a uint64_t could easily detect that bytes were
used in the next register. (The next register was the high 32 bits of
the `access_mask` variable.)
With Xe2, the register size becomes 64 bytes, meaning this strategy
doesn't work. Instead of a mask, we can just check whether more than
one grf is used during each loop iteration. (Suggested by Ken.) This
will make it easier to extend for Xe2 in a follow-on commit.
Verified this with
dEQP-VK.subgroups.arithmetic.compute.subgroupexclusivemul_u64vec4_requiredsubgroupsize
on Xe2, which otherwise would cause the program to fail to validate
because it assumed a grf was 32 bytes.
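
A minimal sketch of the grf-counting idea, not the actual validator
code: instead of building a byte mask, compute the first and last grf
that a byte range touches and count them. The helper name and its
parameters are hypothetical; the 32 and 64 byte register sizes come
from the description above.

```c
#include <assert.h>

/* Hypothetical helper (not the Mesa validator): number of GRFs touched
 * by a run of num_bytes starting at first_byte, where first_byte is
 * measured from the start of a GRF-aligned base register.
 */
static unsigned
grfs_touched(unsigned first_byte, unsigned num_bytes, unsigned grf_size_bytes)
{
   assert(num_bytes > 0 && grf_size_bytes > 0);

   unsigned first_grf = first_byte / grf_size_bytes;
   unsigned last_grf = (first_byte + num_bytes - 1) / grf_size_bytes;

   /* A result greater than 1 means the access spans a register boundary. */
   return last_grf - first_grf + 1;
}
```

With this shape the register size is just a parameter: an 8-byte access
at byte 28 spans two registers when a grf is 32 bytes, but only one
when it is 64 bytes.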
Backport-to: 24.2
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28479>
We don't actually need to extend g0's live range to the EOT message
generally - most messages that end a shader are headerless. The main
implicit use of g0 is for constructing scratch headers. With the last
two patches, we now consider scratch access that may exist in the IR
and already extend the liveness appropriately.
There is one remaining problem: spilling. The register allocator will
create new scratch messages when spilling a register, which need to
create scratch headers, which need g0. So, every new spill or fill
might extend the live range of g0, which would create new interference,
altering the graph. This can be problematic.
However, when compiling SIMD16 or SIMD32 fragment shaders, we don't
allow spilling anyway. So, why not allow the use of g0? Also, when
trying various scheduling modes, we first try allocation without spilling.
If it works, great, if not, we try a (hopefully) less aggressive
schedule, and only allow spilling on the lowest-pressure schedule.
So, even for regular SIMD8 shaders, we can potentially gain the use
of g0 on the first few tries at scheduling+allocation.
Once we try to allocate with spilling, we go back to reserving g0
for the entire program, so that we can construct scratch headers at
any point. We could possibly do better here, but this is simple and
reliable with some benefit.
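
The policy described above can be sketched roughly as follows. This is
only an illustration under assumptions: the struct, the function, and
the idea that the lowest-pressure schedule is the last mode tried are
all hypothetical stand-ins, not the real allocator interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one scheduling+allocation attempt. */
struct alloc_attempt {
   bool allow_spilling;
   bool reserve_g0;
};

/* Assumption: modes are tried in order and the last one is the
 * lowest-pressure schedule. spilling_permitted is false for cases such
 * as SIMD16/SIMD32 fragment shaders, where spilling is never allowed.
 */
static struct alloc_attempt
plan_attempt(unsigned mode_index, unsigned num_modes, bool spilling_permitted)
{
   assert(mode_index < num_modes);

   struct alloc_attempt attempt;

   /* Spilling is only allowed on the final, lowest-pressure schedule. */
   attempt.allow_spilling = spilling_permitted && mode_index == num_modes - 1;

   /* g0 only needs to stay live for the whole program when new
    * spill/fill messages (and their scratch headers) may appear.
    */
   attempt.reserve_g0 = attempt.allow_spilling;

   return attempt;
}
```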
Thanks to Ian Romanick for suggesting I try this approach.
fossil-db on Alchemist shows some more spill/fill improvements:
Totals:
Instrs: 149062395 -> 149053010 (-0.01%); split: -0.01%, +0.00%
Cycles: 12609496913 -> 12611652181 (+0.02%); split: -0.45%, +0.47%
Spill count: 52891 -> 52471 (-0.79%)
Fill count: 101599 -> 100818 (-0.77%)
Scratch Memory Size: 3292160 -> 3197952 (-2.86%)
Totals from 416541 (66.59% of 625484) affected shaders:
Instrs: 124058587 -> 124049202 (-0.01%); split: -0.01%, +0.01%
Cycles: 3567164271 -> 3569319539 (+0.06%); split: -1.61%, +1.67%
Spill count: 420 -> 0 (-inf%)
Fill count: 781 -> 0 (-inf%)
Scratch Memory Size: 94208 -> 0 (-inf%)
Witcher 3 shows a 33% reduction in scratch memory size, for example.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30319>
brw_send_indirect_split_message() implicitly reads g0 to construct the
extended message descriptor for certain send messages when this is set.
Record that liveness explicitly.
Thanks to Francisco Jerez for reminding me about this use of g0.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30319>
The generator code for emitting legacy scratch headers was implicitly
using g0 as a source. But the IR wasn't indicating any usage of g0,
which means the liveness isn't properly tracked at the IR level.
It works because we reserve g0 as permanently live for the whole
program. In order to stop doing that, we need to record it properly.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30319>
The API requires samplers to return 32-bit values even though the
hardware can handle 16-bit floating point, so we detect that case and
make more efficient use of memory bandwidth. This improves LLM token
encode and decode performance by at least 5% across multiple platforms.
Thank you to Kenneth Graunke for suggesting and guiding me throughout
this implementation.
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30447>
While extending our backend to handle 16-bit sampler return payloads, we
found that in piglit's arb_texture_view-rendering-formats, the SIMD8 FS
was missing the sampling operation altogether. This was because we were
first emitting the texturing instruction, and then calling
get_nir_def(), which adds an UNDEF instruction when the destination is
smaller than 32 bits. So the texturing was dead code eliminated. Fix
this by calling get_nir_def() earlier.
Thank you to Kenneth Graunke for suggesting and guiding me throughout
this implementation.
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30447>
This option is used for Gfx < 12. elk already sets it to true, so set
it in brw as well and change the drivers to not set it anymore.
Because of the dual-compiler support in Iris, the helper function
there had to change to consult the right compiler value instead.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30393>
v2: Add comment and assertion to explain why the shift is
safe. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, many instances of the following are
produced:
src/intel/compiler/brw_fs_nir.cpp:114:30: runtime error: left shift of negative value -1
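
For reference, left-shifting a negative signed value is undefined
behavior in C, which is what the sanitizer reports here. A minimal
sketch of the usual way around it, doing the shift on the unsigned
representation; the helper name is hypothetical and this is not the
code at the line quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift the bit pattern of a signed value left
 * without ever shifting a negative signed operand.
 */
static uint32_t
shl_bits(int32_t value, unsigned shift)
{
   assert(shift < 32);

   /* The conversion to uint32_t is well defined (modulo 2^32), and so
    * is the unsigned shift.
    */
   return (uint32_t)value << shift;
}
```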
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, 'ninja test' would fail in several
Intel assembly tests (mul.asm and and.asm) with:
src/intel/compiler/brw_reg.h:703:22: runtime error: left shift of 65532 by 16 places cannot be represented in type 'int'
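
The value in the report, 65532, is 0xfffc; shifted left by 16 it would
be 0xfffc0000, which does not fit in a signed 32-bit int. A sketch of
the general fix, under the assumption that the expression packs a
16-bit value into the high half of a word (the helper below is
hypothetical, not the code in brw_reg.h):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit values into one 32-bit word
 * using unsigned arithmetic so the shift cannot overflow an int.
 */
static uint32_t
pack_2x16(uint32_t lo, uint32_t hi)
{
   assert(lo <= 0xffff && hi <= 0xffff);

   return (hi << 16) | lo;
}
```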
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, many instances of the following are
produced:
src/intel/compiler/brw_eu_compact.c:2244:50: runtime error: left shift of negative value -306
v2: Add comment and assertion to explain why the shift is
safe. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
When -fsanitize=shift is used, many instances of the following are
produced:
src/intel/compiler/brw_compiler.h:1661:44: runtime error: shift exponent 64 is too large for 64-bit type 'long long unsigned int'
I think this is an actual bug. It should check the sentinel value, but
the sentinel value is 64. The shift by 64 is treated as a shift by
0. The varying 0 is explicitly filtered by the rest of the
if-test. How does this work?
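
Shifting a 64-bit value by 64 is undefined in C; the count is
typically masked by the hardware on x86, which would explain why it
behaved like a shift by 0. A sketch of one way to make the sentinel
check explicit, with a hypothetical name for the sentinel and the
helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the sentinel value of 64 mentioned above. */
#define SLOT_SENTINEL 64u

/* Hypothetical helper: test a bit in a 64-bit mask, treating the
 * sentinel as "not present" instead of shifting by 64.
 */
static bool
slot_is_set(uint64_t mask, unsigned slot)
{
   assert(slot <= SLOT_SENTINEL);

   if (slot == SLOT_SENTINEL)
      return false;

   return (mask & (UINT64_C(1) << slot)) != 0;
}
```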
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30333>
Due to a recent regression, setting INTEL_DEBUG=optimizer dumps
shader optimization pass details to the console rather than to their
respective files.
Thank you to Kenneth W Graunke for helping me figure this out.
Fixes: 17b7e49089 ("intel/brw: Move out of fs_visitor and rename print instructions")
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30389>
The idea here was that pixel shader framebuffer writes used the g0 and
g1 thread payload register values to construct the message header.
However, most messages are headerless and don't use either. There's a
2012-era comment that the simulator at one point had a bug where certain
headerless messages would incorrectly take the values from the g0/g1
register contents rather than using sideband. But, that was likely
fixed eons ago. So we really don't need to do this.
Furthermore, there are many more shader stages these days:
- VS: r1 contains output URB handles
- TCS: r1 contains ICP handles
- TES: r1 contains gl_TessCoord.x (r4 contains output URB handles)
- GS: r1 contains output URB handles
- CS: r1 contains LocalID.X on DG2+ but nothing on older hardware
- Task/Mesh: r1 contains LocalID.X
- BS: r1 contains bindless stack handles
Vertex and geometry aren't likely to benefit here because r1 is needed
for their output messages, which are also what terminate the shader.
TES will definitely benefit because we were making a value pointlessly
live for the whole program. Same for TCS, to a lesser extent. Compute
prior to DG2 was the worst, as g1 literally has no meaningful content,
so there is no point to keeping it live.
fossil-db on Alchemist shows substantial spill/fill improvements:
Totals:
Instrs: 148782351 -> 148741996 (-0.03%); split: -0.03%, +0.01%
Cycles: 12602907531 -> 12605795191 (+0.02%); split: -0.70%, +0.72%
Subgroup size: 7518608 -> 7518632 (+0.00%)
Send messages: 7341727 -> 7341762 (+0.00%)
Spill count: 54633 -> 52575 (-3.77%)
Fill count: 104694 -> 100680 (-3.83%)
Scratch Memory Size: 3375104 -> 3287040 (-2.61%)
Totals from 301172 (48.21% of 624670) affected shaders:
Instrs: 95531927 -> 95491572 (-0.04%); split: -0.05%, +0.01%
Cycles: 9643531593 -> 9646419253 (+0.03%); split: -0.91%, +0.94%
Subgroup size: 4492512 -> 4492536 (+0.00%)
Send messages: 4399737 -> 4399772 (+0.00%)
Spill count: 20034 -> 17976 (-10.27%)
Fill count: 41530 -> 37516 (-9.67%)
Scratch Memory Size: 1522688 -> 1434624 (-5.78%)
Assassin's Creed Odyssey in particular has 20% fewer fills.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30146>
Doxygen documentation says
> If the file name is omitted (i.e. the line after \file is left
> blank) then the documentation block that contains the \file command will
> belong to the file it is located in.
so we can omit the filename itself when using the annotation.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30168>
Gfx 12.5 moved scratch to a surface and SURFTYPE_SCRATCH has this pitch
restriction:
RENDER_SURFACE_STATE::Surface Pitch
For surfaces of type SURFTYPE_SCRATCH, valid range of pitch is:
[63,262143] -> [64B, 256KB]
The pitch of the surface is the scratch size per thread and the surface
should be large enough to accommodate every physical thread.
So add a new field to intel_device_info and set it in
intel_device_info_init_common() so that even offline tools have it set.
Finally, add a check that fails shader compilation if the needed
scratch is larger than supported.
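
A minimal sketch of the limit check, using the pitch range quoted
above (the field holds pitch minus one, so 63..262143 maps to 64B..256KB).
The macro and function names are hypothetical and do not reflect the
actual intel_device_info field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Limits taken from the SURFTYPE_SCRATCH pitch range quoted above. */
#define SCRATCH_PITCH_MIN_BYTES 64u
#define SCRATCH_PITCH_MAX_BYTES (256u * 1024u)

/* Hypothetical check: can this per-thread scratch size be expressed as
 * a scratch surface pitch at all?
 */
static bool
scratch_per_thread_fits(uint32_t scratch_bytes_per_thread)
{
   return scratch_bytes_per_thread <= SCRATCH_PITCH_MAX_BYTES;
}

/* Hypothetical encoder: the surface state field stores pitch - 1. */
static uint32_t
scratch_pitch_field(uint32_t scratch_bytes_per_thread)
{
   assert(scratch_bytes_per_thread >= SCRATCH_PITCH_MIN_BYTES);
   assert(scratch_bytes_per_thread <= SCRATCH_PITCH_MAX_BYTES);

   return scratch_bytes_per_thread - 1;
}
```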
This issue can be reproduced in debug builds when running
dEQP-VK.protected_memory.stack.stacksize_1024 on Gfx 12.5 or newer
platforms.
Ref: BSpec 43862 (r52666)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30271>
This implements an undocumented workaround for a hardware bug that
affects draw calls with a pixel shader that has 0 push constant cycles
when TBIMR is enabled, which has been seen to lead to a hang with
Fallout 3 and Metal Gear Rising Revengeance. This hardware bug has
been reported as HSDES#22020184996, which is still pending a resolution
by the hardware team. However, since this empirically found workaround
has been confirmed to fix the issue reliably and is relatively
harmless, it seems worth checking in already, even though no final W/A
number is available and the W/A json file has not been updated.
To avoid the issue we simply pad the push constant payload to be at
least 1 register. This is enabled via a brw_wm_prog_key since the
driver needs to be in agreement with the compiler on whether the dummy
push constant cycle is present, and it can be avoided in cases where
the driver knows that TBIMR will be disabled (e.g. for BLORP).
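
The padding itself amounts to something like the sketch below. The
function and its flag are hypothetical; the flag stands in for
whatever the brw_wm_prog_key carries, so that driver and compiler
compute the same register count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: number of push constant registers to program.
 * pad_for_tbimr stands in for the prog key setting described above.
 */
static unsigned
push_constant_regs(unsigned needed_regs, bool pad_for_tbimr)
{
   unsigned regs = needed_regs;

   /* Never leave the payload empty when the workaround is enabled. */
   if (pad_for_tbimr && regs == 0)
      regs = 1;

   assert(!pad_for_tbimr || regs >= 1);
   return regs;
}
```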
Related: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10728
Related: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11399
Fixes: 57decad976 ("intel/xehp: Enable TBIMR by default.")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30031>
Instead of having a hardcoded list of endian-independent format aliases
in the header, generate them from the format definitions.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29649>
If you have something like this:
con 32 %66 = @load_reg (%62) (base=0, legacy_fabs=0, legacy_fneg=0)
con 32 %27 = @resource_intel (%22 (0xdeaddead), %66, %67, %17 (0x0)) (desc_set=2, binding=96, resource_intel=0, resource_block_intel=-1)
Just copying the brw_reg in ssa_values[] is not enough for the
load_reg intrinsic. We need to call get_nir_src() to force some logic
to create the register correctly.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: b8209d69ff ("intel/fs: Add support for new-style registers")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30050>
I haven't been able to find this restriction mentioned anywhere in the
hardware documentation, but the simulator has code to reject this case
as invalid, and it doesn't appear to work on hardware anymore.
Having lower_regioning() handle this takes care of the issue so we
don't have to worry about generating it in random places.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11489
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30140>
We were generating odd instructions like:
math inv(8) g93<1>HF g85<8,8,1>HF null<8,8,1>F { align1 1Q @7 $4 };
It's unclear whether the type of the null operand matters, but sometimes
these things don't get ignored properly. Out of caution, retype the
null source to match the actual operand's type. It'll at least look
less surprising in assembly dumps.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30193>