Sources that are scalars (almost all source) and convergent generally
want <0,1,0> source stride. Sources that are vectors (e.g., texture
coordinates, SSBO write data, etc.) and convergent want no extra strides
applied. In nearly all cases LOAD_PAYLOAD lowering will do the right
thing.
v2: Use VEC in emit_pixel_interpolater_send. Suggested by Ken.
v3: With the elimination of offset_to_component(), offset() may not
convert an is_scalar source to have a zero stride. Explicitly do this
in get_nir_src and prepare_alu_destination_and_sources.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29884>
The last argument seems to be used as brw_shader_reloc::delta (from
brw_add_reloc), and we're unconditionally setting it to 0 here, while
the other place where we handle nir_intrinsic_load_reloc_const_intel
seems to be setting the base appropriately.
I found this by inspection while debugging a bug related to this code,
so I'm not aware of any workloads that get improved by this patch.
Related patches:
- ecbec25e84 ("intel/nir: add reloc delta to load_reloc_const_intel intrinsic")
- 99047451c9 ("intel/fs: add plumbing for embedded samplers")
Fixes: ecbec25e84 ("intel/nir: add reloc delta to load_reloc_const_intel intrinsic")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32531>
Just calculate the block size using util_logbase2() - it's simpler.
Also drop the name "oword" as this refers to legacy HDC messages,
rather than the newer LSC "vector size" field.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
This will matter more with overfetching, where we may suggest loading
additional data that we don't actually need for vectorization purposes.
We want to make sure that push ranges have the data we actually need;
any extra padding is irrelevant.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
Moving it to intel_shader_enums.h
The plan is to make it visible to OpenCL shaders.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32329>
We were accidentally doing a signed integer comparison here for ult32,
or a sign-extending shift for ushr.
One notable bit of fallout was that load_global_uniform_block_intel
address calculations broke on platforms that don't have native 64-bit
integer support, as the iadd64 lowering for "do I need to carry?" was
using ult32...and performing the wrong comparison. We spotted this in
Borderlands 3 on Alchemist once we turned on other optimizations.
Thanks to Lionel Landwerlin for helping spot the problem!
Fixes: c7b312ad45 ("brw: factor out source extraction for rematerialization")
Fixes: 339630ab05 ("brw: enable A64 loads source rematerialization")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31995>
Some uses of the old pattern still exist. The use in brw_fs_nir.cpp is
deleted by commits !29884. The use in brw_lower_logical_sends.cpp seems
different, so I decided to keep it.
The next commit wants to use this.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32041>
Add opcodes for VOTE_ALL, VOTE_ANY and VOTE_EQUAL. The first two
are also used for the quad variants. Move their lowering from
NIR conversion to brw_lower_subgroup_ops.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31029>
Include in the helper which already take care of using exec_all() and
taking the first component of the result. Both are expected by
SHADER_OPCODE_BROADCAST.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31029>
This intrinsic was initially dedicated to mesh/task shaders, but the
mechanism it exposes also exists in the compute shaders on Gfx12.5+.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31508>
We were manually allocating 1 REG_SIZE for the barrier payload, which is
only half a register on Xe2. This should eventually get allocated to a
whole register anyway, but it's awkward in the meantime. Also, we were
zero-initializing the header using group(8, 0) which only initialized
half the register. The rest of the fields are Reserved MBZ, so they're
likely unused and unread anyway - but it's better to zero-initialize
them so we don't get random undefined, miserable-to-debug behavior.
Backport-to: 24.2
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31499>
Honestly, I don't know what I was thinking - we are emitting a single
spill/fill message here, but were counting it as 2 spill/fills in SIMD32
shaders. So our eventual shader stat reporting would subtract the
number of spills and fills from send_count, and get a negative number,
wrapping around to just shy of UINT32_MAX. That's way too many sends.
This is especially noticable on Xe2 which often uses SIMD32 shaders.
Backport-to: 24.2
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31499>
The general idea is to be able to validate that certain instructions
were lowered and certain restrictions were already handled. Passes can
now assert their expectations, i.e. if a pass is mean to run after
certain lowerings or not.
The actual phases are a initial stab and as we re-organized the passes,
we may remove/add phases.
This commit just add some phase steps, later commits will make use of
them.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30496>
We can't just always negate the alu instruction's cmod, because negating
it can produce different results when the argument is NaN float. We can
still do that if the condition is == or !=.
Fixes: 0ba9497e ("intel/fs: Improve discard_if code generation")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11800
Signed-off-by: Sviatoslav Peleshko <sviatoslav.peleshko@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31042>
We can generalize the simd8-16bits case by just rounding to a physical
register.
We also take the opportunity to limit the register allocation to a
single physical GRF for the residency data.
Signed-off-by: Lionel Landwerlin <llandwerlin@gmail.com>
Fixes: 0116430d39 ("intel/brw: Handle 16-bit sampler return payloads")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Acked-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31307>
While there are many cases that turn into the *_PULL_CONSTANT_LOAD ops
or push constants, this one piece was emitting surface block loads.
Switch it over to use the new intrinsics to delete a bunch of code.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Acked-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30828>
We introduce a new fs_nir_emit_memory_access() helper that can handle
image, bindless image, SSBO, shared, global, and scratch memory, and
handles loads, stores, atomics, and block loads. It translates each
of these NIR intrinsics into the new MEMORY_*_LOGICAL intrinsics.
As a result, we delete a lot of similar surface access emitter code.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Acked-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30828>
This is going to handle more than atomics shortly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30828>
Literally inside an if-statement (about 26 lines before this hunk)
that checks for !nir_src_is_const(instr->src[1]).
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30251>
When INTEL_DEBUG=ann is also set, the disassembler would annotate the
output with either a string or the string verison of a NIR instruction.
This was done by keeping two pointers (but only using one at a time).
Change the code to print the instruction into a string instead of
keeping it pointer around (peg the string to the shader). That way,
only one pointer is needed for annotations. Because that serialization
is not free, only do that when the environment variable is set.
Since we are here, move the annotation string field to the end, moving
it to the least commonly used cacheline. Further packing might allow
the entire fs_inst to fit in two cachelines.
For release builds, don't even add the debug annotation to the struct.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30822>
The long names were originally meant to map to the HW encoding but
nowadays the actual encoding values depend on gfx version, whether
instruction is 3src, etc.
Suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30704>
In these cases there's a clear bound we can use. In C++ this is a
compiler extension and not compatible with zero initializing a
regular struct -- which will happen in a later change.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30704>
We were currently emitting logical atomic instructions with a packed
destination region for sub-dword LSC atomics, along the lines of:
> untyped_atomic_logical(32) dst<1>:HF, ...
However, these instructions use an LSC data size D16U32, which means
that the 16b data on the return payload is expanded to 32b by the LSC
shared function, so we were lying to the compiler about the location
of the individual channels on the return payload, its execution
masking, etc. This is why the hacks that manually set the
'inst->size_written' of the instruction were required.
In some cases this worked, but any non-trivial manipulation of the
instruction destination by lowering or optimization passes could have
led to corruption, as has been reproduced in deqp-vk during
lower_simd_width() for shaders that use 16-bit atomics in SIMD32
dispatch mode.
Note that LSC sub-dword reads aren't affected by this because they use
raw UD destinations and specify the actual bit size of the operation
datatype as the immediate SURFACE_LOGICAL_SRC_IMM_ARG, which doesn't
work for atomic operations since that immediate specifies the atomic
opcode.
Instead, have the logical operation implement the behavior of 16-bit
destinations correctly instead of silently replacing the 16-bit region
with an inconsistent 32-bit region -- This is done by emitting the MOV
instructions used to pack the data from the UD temporary into the
packed destination from the lower_logical_sends() pass instead of from
the NIR translation pass.
Fixes: 43169dbbe5 ("intel/compiler: Support 16 bit float ops")
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30683>
Make sure to take new GRF size into consideration and adjust the
indirect offset according to new size so that when we do the indirect
load with address register, we load right values.
This helps pass the following tests:
- dEQP-VK.binding_model.descriptor_buffer.mutable_descriptor.*geom*
- dEQP-VK.ray_query.*geometry_shader.*
Backport-to: 24.2
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30679>
load_global_constant_uniform_block_intel is equivalent in terms of
loading, then for the predicate we just do a bcsel afterward in places
where that is required.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30659>
Bspec 57508: Structure_SIMD16TraceRayMessage:: RayQuery Enable
"When this bit is set in the header, Trace Ray Message behaves like a
Ray Query. This message requires a write-back message indicating
RayQuery for all valid Rays (SIMD lanes) have completed."
If we don't pass the write-back register, somehow it was stepping on
over R0 register and can mess up the scratch space accesses which could
potentially lead to GPU hang. It can be noticed while running it under
simulator trace.
send.rta (16|M0) null r124 r126:1 0x0 0x02000100 {$15} // wr:1+1, rd:0; simd16 trace ray
R0 = 00000001 00000000 00000000 00000001 00000000 00000000 00000001 00000000 00000000 00000001 00000000 00000000 00000001 00000000 00000000 00000001
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30600>
Gfx8 required us to emulate image load store with untyped messages,
whereas Gfx9 just has typed message support for everything. brw no
longer supports Gfx8, so all of this code is effectively dead.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30576>
Instead of emitting a single one at the top, and making reference to it,
emit the virtual instruction as needed and let CSE do its job.
Since load_subgroup_invocation now can appear not at the start of the
shader, use UNDEF in all cases to ensure that the liveness of the
destination doesn't extend to the first partial write done here (it was
being used only for SIMD > 8 before).
Note this option was considered in the past
6132992cdb but at the time dismissed. The
difference now is that the lowering of the virtual instruction happens
earlier than the scheduling.
The motivation for this change is to allow passes other than the NIR
conversion to use this value. The alternative of storing a `brw_reg` in
the shader (instead of NIR state) gets complicated by passes like
compact_vgrfs, that move VGRFs around (and update the instructions).
This and maybe other passes would have to care about the brw_reg.
Fossil-db numbers, TGL
```
*** Shaders only in 'after' results are ignored:
steam-native/shadow_of_the_tomb_raider/c683ea5067ee157d/fs.32/0, steam-native/shadow_of_the_tomb_raider/f4df450c3cef40b4/fs.32/0, steam-native/shadow_of_the_tomb_raider/94b708fb8e3d9597/fs.32/0, steam-native/shadow_of_the_tomb_raider/19d44c328edabd30/fs.32/0, steam-native/shadow_of_the_tomb_raider/8a7dcbd5a74a19bf/fs.32/0, and 366 more
from 4 apps: steam-dxvk/alan_wake, steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-native/shadow_of_the_tomb_raider
*** Shaders only in 'before' results are ignored:
steam-dxvk/octopath_traveler/aaa3d10acb726906/fs.32/0, steam-dxvk/batman_arkham_origins/e6872ae23569c35f/fs.32/0, steam-dxvk/octopath_traveler/fd33a99fa5c271a8/fs.32/0, steam-dxvk/octopath_traveler/9a077cdc16f24520/fs.32/0, steam-dxvk/batman_arkham_city_goty/fac7b438ad52f622/fs.32/0, and 12 more
from 4 apps: steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-dxvk/octopath_traveler, steam-native/shadow_of_the_tomb_raider
Totals:
Instrs: 149752381 -> 149751337 (-0.00%); split: -0.00%, +0.00%
Cycle count: 11553609349 -> 11549970294 (-0.03%); split: -0.06%, +0.03%
Spill count: 42763 -> 42764 (+0.00%); split: -0.01%, +0.01%
Fill count: 75650 -> 75651 (+0.00%); split: -0.00%, +0.01%
Max live registers: 31725096 -> 31671792 (-0.17%)
Max dispatch width: 5546008 -> 5551672 (+0.10%); split: +0.11%, -0.00%
Totals from 52574 (8.34% of 630441) affected shaders:
Instrs: 9535159 -> 9534115 (-0.01%); split: -0.03%, +0.02%
Cycle count: 1006627109 -> 1002988054 (-0.36%); split: -0.65%, +0.29%
Spill count: 11588 -> 11589 (+0.01%); split: -0.03%, +0.03%
Fill count: 21057 -> 21058 (+0.00%); split: -0.01%, +0.02%
Max live registers: 1992493 -> 1939189 (-2.68%)
Max dispatch width: 559696 -> 565360 (+1.01%); split: +1.06%, -0.05%
```
and DG2
```
*** Shaders only in 'after' results are ignored:
steam-native/shadow_of_the_tomb_raider/1f95a9d3db21df85/fs.32/0, steam-native/shadow_of_the_tomb_raider/56b87c4a46613a2a/fs.32/0, steam-native/shadow_of_the_tomb_raider/a74b4137f85dbbd3/fs.32/0, steam-native/shadow_of_the_tomb_raider/e07e38d3f48e8402/fs.32/0, steam-native/shadow_of_the_tomb_raider/206336789c48996c/fs.32/0, and 268 more
from 4 apps: steam-dxvk/alan_wake, steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-native/shadow_of_the_tomb_raider
*** Shaders only in 'before' results are ignored:
steam-native/shadow_of_the_tomb_raider/0420d7c3a2ea99ec/fs.32/0, steam-native/shadow_of_the_tomb_raider/2ff39f8bf7d24abb/fs.32/0, steam-native/shadow_of_the_tomb_raider/92d7be2824bd9659/fs.32/0, steam-native/shadow_of_the_tomb_raider/f09ca6d2ecf18015/fs.32/0, steam-native/shadow_of_the_tomb_raider/490f8ffd59e52949/fs.32/0, and 205 more
from 3 apps: steam-dxvk/batman_arkham_city_goty, steam-dxvk/batman_arkham_origins, steam-native/shadow_of_the_tomb_raider
Totals:
Instrs: 151597619 -> 151599914 (+0.00%); split: -0.00%, +0.00%
Subgroup size: 7699776 -> 7699784 (+0.00%)
Cycle count: 12738501989 -> 12739841170 (+0.01%); split: -0.01%, +0.02%
Spill count: 61283 -> 61274 (-0.01%)
Fill count: 119886 -> 119849 (-0.03%)
Max live registers: 31810432 -> 31758920 (-0.16%)
Max dispatch width: 5540128 -> 5541136 (+0.02%); split: +0.08%, -0.06%
Totals from 49286 (7.81% of 631231) affected shaders:
Instrs: 8607753 -> 8610048 (+0.03%); split: -0.01%, +0.04%
Subgroup size: 857752 -> 857760 (+0.00%)
Cycle count: 305939495 -> 307278676 (+0.44%); split: -0.28%, +0.72%
Spill count: 6339 -> 6330 (-0.14%)
Fill count: 12571 -> 12534 (-0.29%)
Max live registers: 1788346 -> 1736834 (-2.88%)
Max dispatch width: 510920 -> 511928 (+0.20%); split: +0.85%, -0.66%
```
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30489>