This enabled SIMD32 in blorp shaders and seems to be give a small FPS
bump when using a DG2 GPU as secondary (requires copies to linear
buffers to exchange with main GPU).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19341>
The first instruction of any kernel should have non-zero emask. This
restriction needs to be obeyed to avoid GPU hangs.
Patch adds a function to insert dummy mov as first instruction
to make sure this requirement is fulfilled.
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20194>
The parameter `nr` is currenlty an `int` but it only gets assigned to an
`unsigned int`. Make it clear in the function signature what's actually
required.
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19423>
GCC 12.2.0 warns:
../src/intel/compiler/brw_fs.cpp: In member function ‘bool fs_visitor::
split_virtual_grfs()’:
../src/intel/compiler/brw_fs.cpp:2199:10: warning: ‘void* memset(void*, int,
size_t)’ specified size between 18446744071562067968 and 18446744073709551615
exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
2199 | memset(vgrf_has_split, 0, num_vars * sizeof(*vgrf_has_split));
`num_vars` is an `int` but gets assigned the value of `this->alloc.count`,
which is an `unsigned int`. Thus, `num_vars` will be negative if
`this->alloc.count` is larger than int max value. Converting that negative
`int` to a `size_t`, which `memset` expects, then blows it up to a huge
positive value.
Simply turning `num_vars` into an `unsigned int` would be enough to fix this
specific problem, but there are many other instances where an `unsigned int`
gets assigned to an `int` for no good reason in this function. Some of which
the compiler warns about now, some of which it doesn't warn about.
This turns all variables in `fs_visitor::split_virtual_grfs`, which should
reasonably be unsigned, into `unsigned int`s. While at it, a few now pointless
casts are removed.
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19423>
The initial implementation is a pretty big hammer. Implement the HW
recommendation to minimize cases in which we need a fence.
This improves by 10FPS on some of the Sascha Willems RT demos.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 6031ad4bf6 ("intel/fs: Add Wa_22013689345")
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19322>
We still update the cs_prog_data, but don't rely on it for this state anymore.
This will allow use the SIMD selector with shaders that don't use cs_prog_data.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19601>
This is a preparation to decouple the storage of what SIMDs
compiled/spilled from the cs_prog_data. This will allow reuse
of SIMD selection code by Bindless Shaders.
And since we have a struct now, move the error array there so
reduce the boilerplate of the users.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19601>
Many Intel platforms can only perform 32x16 bit multiplication. The
straightforward way to implement 32x32 bit multiplications is by
splitting one of the operands into high and low parts called H and L,
repsectively. The full multiplication can be implemented as:
((A * H) << 16) + (A * L)
On Intel platforms, special register accesses can be used to eliminate
the shift operation. This results in three instructions and a temporary
register for most values.
If H or L is 1, then one (or both) of the multiplications will later be
eliminated. On some platforms it may be possible to eliminate the
multiplication when H is 256.
If L is zero (note that H cannot be zero), one of the multiplications
will also be eliminated.
Instead of splitting the operand into high and low parts, it may
possible to factor the operand into two 16-bit factors X and Y. The
original multiplication can be replaced with (A * (X * Y)) = ((A * X) *
Y). This requires two instructions without a temporary register.
I may have gone a bit overboard with optimizing the factorization
routine. It was a fun brainteaser, and I couldn't put it down. :) On my
1.3GHz Ice Lake, a standalone test could chug through 1,000,000 randomly
selected values in about 5.7 seconds. This is about 9x the performance
of the obvious, straightforward implementation that I started with.
v2: Drop an unnecessary return. Rearrange logic slightly and rename
variables in factor_uint32 to better match the names used in the large
comment. Both suggested by Caio. Rearrange logic to avoid possibly
using `a` uninitialized. Noticed by Marcin.
v3: Use DIV_ROUND_UP instead of open coding it. Noticed by Caio.
Tiger Lake, Ice Lake, Haswell, and Ivy Bridge had similar results. (Ice Lake shown)
total instructions in shared programs: 19912558 -> 19912526 (<.01%)
instructions in affected programs: 3432 -> 3400 (-0.93%)
helped: 10 / HURT: 0
total cycles in shared programs: 856413218 -> 856412810 (<.01%)
cycles in affected programs: 122032 -> 121624 (-0.33%)
helped: 9 / HURT: 0
No shader-db changes on any other Intel platforms.
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
Instructions in all programs: 141997227 -> 141996923 (-0.0%)
Instructions helped: 71
Cycles in all programs: 9162524757 -> 9162523886 (-0.0%)
Cycles helped: 63
Cycles hurt: 5
No fossil-db changes on any other Intel platforms.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
The previous bounds checking would cause
mul(8) g121<1>D g120<8,8,1>D 0xec4dD
to be lowered to
mul(8) g121<1>D g120<8,8,1>D 0xec4dUW
mul(8) g41<1>D g120<8,8,1>D 0x0000UW
add(8) g121.1<2>UW g121.1<16,8,2>UW g41<16,8,2>UW
Instead of picking the bounds (and the new type) based on the old type,
pick the new type based on the value only.
This helps a few fossil-db shaders in Witcher 3 and Geekbench5. No
changes on any other Intel platforms.
Tiger Lake
Instructions in all programs: 157581069 -> 157580768 (-0.0%)
Instructions helped: 24
Cycles in all programs: 7566979620 -> 7566977172 (-0.0%)
Cycles helped: 22
Cycles hurt: 4
Ice Lake
Instructions in all programs: 141998965 -> 141998667 (-0.0%)
Instructions helped: 26
Cycles in all programs: 9162568666 -> 9162565297 (-0.0%)
Cycles helped: 24
Cycles hurt: 2
Skylake
No changes.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
This specialized version prints out the liveness count as well as the
maximum liveness count. It was eye opening when seeing the max
liveness jump after lowering of packing instructions which should not
have changed the count.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18657>
When lowering a single instruction with a destination VGRF to 2 or
more, the VGRF is now considered partially written by each generated
instruction and that increases its liveness especially in loops. Thus
potentially increasing the number of spills/fills due to register
allocation.
Putting an UNDEF instruction in front of the lowered instructions
allows the IR to limit the liveness of the VGRF, reducing register
pressure.
This has a pretty dramatic effect on spills/fills for RT shaders. Here
the stats on Q2RTX shaders on DG2 (wipping out any spills/fills due to
register allocation) :
Instructions in all programs: 26150 -> 24955 (-4.6%)
SENDs in all programs: 1148 -> 1148 (+0.0%)
Loops in all programs: 4 -> 4 (+0.0%)
Cycles in all programs: 392179 -> 332787 (-15.1%)
Spills in all programs: 132 -> 116 (-12.1%)
Fills in all programs: 262 -> 154 (-41.2%)
Shader-db results on TGL :
total instructions in shared programs: 21158140 -> 21158377 (<.01%)
instructions in affected programs: 76629 -> 76866 (0.31%)
helped: 18
HURT: 20
helped stats (abs) min: 1 max: 60 x̄: 18.89 x̃: 12
helped stats (rel) min: 0.21% max: 3.61% x̄: 1.02% x̃: 0.77%
HURT stats (abs) min: 1 max: 79 x̄: 28.85 x̃: 18
HURT stats (rel) min: 0.04% max: 2.81% x̄: 1.13% x̃: 0.79%
95% mean confidence interval for instructions value: -4.82 17.30
95% mean confidence interval for instructions %-change: -0.34% 0.57%
Inconclusive result (value mean confidence interval includes 0).
total loops in shared programs: 5753 -> 5753 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 798856834 -> 798870688 (<.01%)
cycles in affected programs: 6208395 -> 6222249 (0.22%)
helped: 22
HURT: 17
helped stats (abs) min: 2 max: 8794 x̄: 1438.18 x̃: 782
helped stats (rel) min: 0.05% max: 2.28% x̄: 0.63% x̃: 0.44%
HURT stats (abs) min: 2 max: 19178 x̄: 2676.12 x̃: 1358
HURT stats (rel) min: 0.04% max: 23.49% x̄: 2.25% x̃: 0.71%
95% mean confidence interval for cycles value: -952.19 1662.65
95% mean confidence interval for cycles %-change: -0.64% 1.90%
Inconclusive result (value mean confidence interval includes 0).
total spills in shared programs: 4078 -> 4066 (-0.29%)
spills in affected programs: 40 -> 28 (-30.00%)
helped: 2
HURT: 0
total fills in shared programs: 2856 -> 2832 (-0.84%)
fills in affected programs: 127 -> 103 (-18.90%)
helped: 2
HURT: 0
total sends in shared programs: 998554 -> 998554 (0.00%)
sends in affected programs: 0 -> 0
helped: 0
HURT: 0
LOST: 0
GAINED: 0
Total CPU time (seconds): 2346.06 -> 2304.80 (-1.76%)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18657>
To clean up compilation warnings about unused variables
when asserts are disabled.
v2: UNUSED -> ASSERTED (Eric Engestrom)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19016>
Since divergence is a lot more likely in RT than compute, it makes
sense to limit ourselves to SIMD8.
The trampoline shader defaults to SIMD16 since this one is uniform.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16970>
There will be situations where we will want to use a local builder
rather than the one associated with NIR->backend translation.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16970>
VS, TCS, TES, and GS threads must end with a URB write message with the
EOT (end of thread) bit set. For VS and TES, we shadow output variables
with temporaries and perform all stores at the end of the shader, giving
us an existing message to do the EOT.
In tessellation control shaders, we don't defer output stores until the
end of the thread like we do for vertex or evaluation shaders. We just
process store_output and store_per_vertex_output intrinsics where they
occur, which may be in control flow. So we can't guarantee that there's
a URB write being at the end of the shader.
Traditionally, we've just emitted a separate URB write to finish TCS
threads, doing a writemasked write to an single patch header DWord.
On Broadwell, we need to set a "TR DS Cache Disable" bit, so this is
a convenient spot to do so. But on other platforms, there's no such
field, and this write is purely wasteful.
Insetad of emitting a separate write, we can just look for an existing
URB write at the end of the program and tag that with EOT, if possible.
We already had code to do this for geometry shaders, so just lift it
into a helper function and reuse it.
No changes in shader-db.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17944>
And run algebraic when either int64 for float64 are not supported so
those don't end up in the generated code.
Cc: mesa-stable
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17396>
Move subgroup_id, that's only used by CS for verx10 < 125, as part of
the payload too -- even though is not, strictly speaking.
Note the thread execution of Task/Mesh is similar enough, so we make
their common struct inherit from cs_thread_payload.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18176>
Move the setup into the FS thread payload constructor. Consolidate
payload setup for that in brw_fs_thread_payload.cpp file.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18176>
We're now able to load up to 8 GRFs in one send.
v2: Switch to use transpose + vector of up to 64 (Thanks Curro!)
v3: Increase parallelism by not reusing the same register for push
constant offset (Curro)
v4: Drop dead ADD() instruction (Curro)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17555>
Make it clearer we are dealing with multiple patches,
works better in constrast with SINGLE_PATCH.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18151>
This backend lowering code has been dead since the removal of i965 -
nothing in the current source tree ever sets the flag.
This is handled by iris_setup_uniforms() and crocus_setup_uniforms().
Variable group size does not appear to be a feature in anv.
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18055>
That way we can look at the SBT entries for debug purposes.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17908>
Bspec 53421 says:
"A URB fence memory is typically performed prior the thread
exit message, so that the next thread dispatch that reads
that URB memory will see it."
Cc: 22.1 <mesa-stable>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16665>
This allows the software scoreboarding pass, scheduler, and so on
to handle the individual instructions and handle them, rather than
trusting in the generator to do scoreboarding correctly when expanding
the virtual instruction to multiple actual instructions.
By using SHADER_OPCODE_READ_SR_REG, we also correctly handle the
software scoreboarding workaround when reading DMask/VMask.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17530>
No shader-db changes on any Intel platform
Fossil-db results:
Tiger Lake
Instructions in all programs: 156926440 -> 156926470 (+0.0%)
Instructions hurt: 15
Cycles in all programs: 7513099349 -> 7513099402 (+0.0%)
Cycles hurt: 15
Ice Lake and Skylake had similar results. (Ice Lake shown)
Cycles in all programs: 9099036492 -> 9099036489 (-0.0%)
Cycles helped: 1
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17605>
The changes to fs_visitor::validate() helped track down a place where I
initially forgot to convert a message to the new sources layout. This
had caused a different validation failure in
dEQP-GLES31.functional.tessellation.tesscoord.triangles_equal_spacing,
but this were not detected until after SENDs were lowered.
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 19951145 -> 19951133 (<.01%)
instructions in affected programs: 2429 -> 2417 (-0.49%)
helped: 8 / HURT: 0
total cycles in shared programs: 858904152 -> 858862331 (<.01%)
cycles in affected programs: 5702652 -> 5660831 (-0.73%)
helped: 2138 / HURT: 1255
Broadwell
total cycles in shared programs: 904869459 -> 904835501 (<.01%)
cycles in affected programs: 7686744 -> 7652786 (-0.44%)
helped: 2861 / HURT: 2050
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
Instructions in all programs: 141442369 -> 141442032 (-0.0%)
Instructions helped: 337
Cycles in all programs: 9099270231 -> 9099036492 (-0.0%)
Cycles helped: 40661
Cycles hurt: 28606
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17605>
[anholt: changed to make all drivers do the right thing by moving the
payload barycentric check into the compiler]
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17381>