This is a preparation to decouple the storage of what SIMDs
compiled/spilled from the cs_prog_data. This will allow reuse
of SIMD selection code by Bindless Shaders.
And since we have a struct now, move the error array there so
reduce the boilerplate of the users.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19601>
Many Intel platforms can only perform 32x16 bit multiplication. The
straightforward way to implement 32x32 bit multiplications is by
splitting one of the operands into high and low parts called H and L,
repsectively. The full multiplication can be implemented as:
((A * H) << 16) + (A * L)
On Intel platforms, special register accesses can be used to eliminate
the shift operation. This results in three instructions and a temporary
register for most values.
If H or L is 1, then one (or both) of the multiplications will later be
eliminated. On some platforms it may be possible to eliminate the
multiplication when H is 256.
If L is zero (note that H cannot be zero), one of the multiplications
will also be eliminated.
Instead of splitting the operand into high and low parts, it may
possible to factor the operand into two 16-bit factors X and Y. The
original multiplication can be replaced with (A * (X * Y)) = ((A * X) *
Y). This requires two instructions without a temporary register.
I may have gone a bit overboard with optimizing the factorization
routine. It was a fun brainteaser, and I couldn't put it down. :) On my
1.3GHz Ice Lake, a standalone test could chug through 1,000,000 randomly
selected values in about 5.7 seconds. This is about 9x the performance
of the obvious, straightforward implementation that I started with.
v2: Drop an unnecessary return. Rearrange logic slightly and rename
variables in factor_uint32 to better match the names used in the large
comment. Both suggested by Caio. Rearrange logic to avoid possibly
using `a` uninitialized. Noticed by Marcin.
v3: Use DIV_ROUND_UP instead of open coding it. Noticed by Caio.
Tiger Lake, Ice Lake, Haswell, and Ivy Bridge had similar results. (Ice Lake shown)
total instructions in shared programs: 19912558 -> 19912526 (<.01%)
instructions in affected programs: 3432 -> 3400 (-0.93%)
helped: 10 / HURT: 0
total cycles in shared programs: 856413218 -> 856412810 (<.01%)
cycles in affected programs: 122032 -> 121624 (-0.33%)
helped: 9 / HURT: 0
No shader-db changes on any other Intel platforms.
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
Instructions in all programs: 141997227 -> 141996923 (-0.0%)
Instructions helped: 71
Cycles in all programs: 9162524757 -> 9162523886 (-0.0%)
Cycles helped: 63
Cycles hurt: 5
No fossil-db changes on any other Intel platforms.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
The previous bounds checking would cause
mul(8) g121<1>D g120<8,8,1>D 0xec4dD
to be lowered to
mul(8) g121<1>D g120<8,8,1>D 0xec4dUW
mul(8) g41<1>D g120<8,8,1>D 0x0000UW
add(8) g121.1<2>UW g121.1<16,8,2>UW g41<16,8,2>UW
Instead of picking the bounds (and the new type) based on the old type,
pick the new type based on the value only.
This helps a few fossil-db shaders in Witcher 3 and Geekbench5. No
changes on any other Intel platforms.
Tiger Lake
Instructions in all programs: 157581069 -> 157580768 (-0.0%)
Instructions helped: 24
Cycles in all programs: 7566979620 -> 7566977172 (-0.0%)
Cycles helped: 22
Cycles hurt: 4
Ice Lake
Instructions in all programs: 141998965 -> 141998667 (-0.0%)
Instructions helped: 26
Cycles in all programs: 9162568666 -> 9162565297 (-0.0%)
Cycles helped: 24
Cycles hurt: 2
Skylake
No changes.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
This specialized version prints out the liveness count as well as the
maximum liveness count. It was eye opening when seeing the max
liveness jump after lowering of packing instructions which should not
have changed the count.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18657>
When lowering a single instruction with a destination VGRF to 2 or
more, the VGRF is now considered partially written by each generated
instruction and that increases its liveness especially in loops. Thus
potentially increasing the number of spills/fills due to register
allocation.
Putting an UNDEF instruction in front of the lowered instructions
allows the IR to limit the liveness of the VGRF, reducing register
pressure.
This has a pretty dramatic effect on spills/fills for RT shaders. Here
the stats on Q2RTX shaders on DG2 (wipping out any spills/fills due to
register allocation) :
Instructions in all programs: 26150 -> 24955 (-4.6%)
SENDs in all programs: 1148 -> 1148 (+0.0%)
Loops in all programs: 4 -> 4 (+0.0%)
Cycles in all programs: 392179 -> 332787 (-15.1%)
Spills in all programs: 132 -> 116 (-12.1%)
Fills in all programs: 262 -> 154 (-41.2%)
Shader-db results on TGL :
total instructions in shared programs: 21158140 -> 21158377 (<.01%)
instructions in affected programs: 76629 -> 76866 (0.31%)
helped: 18
HURT: 20
helped stats (abs) min: 1 max: 60 x̄: 18.89 x̃: 12
helped stats (rel) min: 0.21% max: 3.61% x̄: 1.02% x̃: 0.77%
HURT stats (abs) min: 1 max: 79 x̄: 28.85 x̃: 18
HURT stats (rel) min: 0.04% max: 2.81% x̄: 1.13% x̃: 0.79%
95% mean confidence interval for instructions value: -4.82 17.30
95% mean confidence interval for instructions %-change: -0.34% 0.57%
Inconclusive result (value mean confidence interval includes 0).
total loops in shared programs: 5753 -> 5753 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 798856834 -> 798870688 (<.01%)
cycles in affected programs: 6208395 -> 6222249 (0.22%)
helped: 22
HURT: 17
helped stats (abs) min: 2 max: 8794 x̄: 1438.18 x̃: 782
helped stats (rel) min: 0.05% max: 2.28% x̄: 0.63% x̃: 0.44%
HURT stats (abs) min: 2 max: 19178 x̄: 2676.12 x̃: 1358
HURT stats (rel) min: 0.04% max: 23.49% x̄: 2.25% x̃: 0.71%
95% mean confidence interval for cycles value: -952.19 1662.65
95% mean confidence interval for cycles %-change: -0.64% 1.90%
Inconclusive result (value mean confidence interval includes 0).
total spills in shared programs: 4078 -> 4066 (-0.29%)
spills in affected programs: 40 -> 28 (-30.00%)
helped: 2
HURT: 0
total fills in shared programs: 2856 -> 2832 (-0.84%)
fills in affected programs: 127 -> 103 (-18.90%)
helped: 2
HURT: 0
total sends in shared programs: 998554 -> 998554 (0.00%)
sends in affected programs: 0 -> 0
helped: 0
HURT: 0
LOST: 0
GAINED: 0
Total CPU time (seconds): 2346.06 -> 2304.80 (-1.76%)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18657>
To clean up compilation warnings about unused variables
when asserts are disabled.
v2: UNUSED -> ASSERTED (Eric Engestrom)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19016>
Since divergence is a lot more likely in RT than compute, it makes
sense to limit ourselves to SIMD8.
The trampoline shader defaults to SIMD16 since this one is uniform.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16970>
There will be situations where we will want to use a local builder
rather than the one associated with NIR->backend translation.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16970>
VS, TCS, TES, and GS threads must end with a URB write message with the
EOT (end of thread) bit set. For VS and TES, we shadow output variables
with temporaries and perform all stores at the end of the shader, giving
us an existing message to do the EOT.
In tessellation control shaders, we don't defer output stores until the
end of the thread like we do for vertex or evaluation shaders. We just
process store_output and store_per_vertex_output intrinsics where they
occur, which may be in control flow. So we can't guarantee that there's
a URB write being at the end of the shader.
Traditionally, we've just emitted a separate URB write to finish TCS
threads, doing a writemasked write to an single patch header DWord.
On Broadwell, we need to set a "TR DS Cache Disable" bit, so this is
a convenient spot to do so. But on other platforms, there's no such
field, and this write is purely wasteful.
Insetad of emitting a separate write, we can just look for an existing
URB write at the end of the program and tag that with EOT, if possible.
We already had code to do this for geometry shaders, so just lift it
into a helper function and reuse it.
No changes in shader-db.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17944>
And run algebraic when either int64 for float64 are not supported so
those don't end up in the generated code.
Cc: mesa-stable
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17396>
Move subgroup_id, that's only used by CS for verx10 < 125, as part of
the payload too -- even though is not, strictly speaking.
Note the thread execution of Task/Mesh is similar enough, so we make
their common struct inherit from cs_thread_payload.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18176>
Move the setup into the FS thread payload constructor. Consolidate
payload setup for that in brw_fs_thread_payload.cpp file.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18176>
We're now able to load up to 8 GRFs in one send.
v2: Switch to use transpose + vector of up to 64 (Thanks Curro!)
v3: Increase parallelism by not reusing the same register for push
constant offset (Curro)
v4: Drop dead ADD() instruction (Curro)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17555>
Make it clearer we are dealing with multiple patches,
works better in constrast with SINGLE_PATCH.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18151>
This backend lowering code has been dead since the removal of i965 -
nothing in the current source tree ever sets the flag.
This is handled by iris_setup_uniforms() and crocus_setup_uniforms().
Variable group size does not appear to be a feature in anv.
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18055>
That way we can look at the SBT entries for debug purposes.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17908>
Bspec 53421 says:
"A URB fence memory is typically performed prior the thread
exit message, so that the next thread dispatch that reads
that URB memory will see it."
Cc: 22.1 <mesa-stable>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16665>
This allows the software scoreboarding pass, scheduler, and so on
to handle the individual instructions and handle them, rather than
trusting in the generator to do scoreboarding correctly when expanding
the virtual instruction to multiple actual instructions.
By using SHADER_OPCODE_READ_SR_REG, we also correctly handle the
software scoreboarding workaround when reading DMask/VMask.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17530>
No shader-db changes on any Intel platform
Fossil-db results:
Tiger Lake
Instructions in all programs: 156926440 -> 156926470 (+0.0%)
Instructions hurt: 15
Cycles in all programs: 7513099349 -> 7513099402 (+0.0%)
Cycles hurt: 15
Ice Lake and Skylake had similar results. (Ice Lake shown)
Cycles in all programs: 9099036492 -> 9099036489 (-0.0%)
Cycles helped: 1
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17605>
The changes to fs_visitor::validate() helped track down a place where I
initially forgot to convert a message to the new sources layout. This
had caused a different validation failure in
dEQP-GLES31.functional.tessellation.tesscoord.triangles_equal_spacing,
but this were not detected until after SENDs were lowered.
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 19951145 -> 19951133 (<.01%)
instructions in affected programs: 2429 -> 2417 (-0.49%)
helped: 8 / HURT: 0
total cycles in shared programs: 858904152 -> 858862331 (<.01%)
cycles in affected programs: 5702652 -> 5660831 (-0.73%)
helped: 2138 / HURT: 1255
Broadwell
total cycles in shared programs: 904869459 -> 904835501 (<.01%)
cycles in affected programs: 7686744 -> 7652786 (-0.44%)
helped: 2861 / HURT: 2050
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
Instructions in all programs: 141442369 -> 141442032 (-0.0%)
Instructions helped: 337
Cycles in all programs: 9099270231 -> 9099036492 (-0.0%)
Cycles helped: 40661
Cycles hurt: 28606
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17605>
[anholt: changed to make all drivers do the right thing by moving the
payload barycentric check into the compiler]
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17381>
Now that this information is accurately gathered by spirv_to_nir, we no
longer need the hack. We just need to fix up the way we handle some of
the key bits.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14020>
Thanks to the previous commit, we no longer need this check.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14020>
This lets us drop demote_sample_qualifiers as well as a back-end check
for key->multisample_fbo.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14020>
NIR constructs this information for us as part of nir_gather_info these
days so we can simplify our logic a bit. This will also let us be more
correct once we move uses_sample_shading scraping earlier.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14020>
The lowering is currently fake. It just changes the opcode from the
_LOGICAL version to the non-_LOGICAL version.
v2: Remove some rebase cruft. 's/gfx8_//;s/simd8_/' in
brw_instruction_name. Both suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17379>
With stages dispatching with a mask, we can run into situations where
we don't have the global address in all lanes. The existing code
always assumed we had the addres in at least lane0.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: bb40e999d1 ("intel/nir: use a single intel intrinsic to deal with ray traversal")
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17330>
While we've taken advantage of split-sends in select situations, there
are many other cases (such as sampler messages, framebuffer writes, and
URB writes) that have never received that treatment, and continued to
use monolithic send payloads.
This commit introduces a new optimization pass which detects SEND
messages with a single payload, finds an adjacent LOAD_PAYLOAD that
produces that payload, splits it two, and updates the SEND to use both
of the new smaller payloads.
In places where we manually used split SENDS, we rely on underlying
knowledge of the message to determine a natural split point. For
example, header and data, or address and value.
In this pass, we instead infer a natural split point by looking at the
source registers. Often times, consecutive LOAD_PAYLOAD sources may
already be grouped together in a contiguous block, such as a texture
coordinate. Then, there is another bit of data, such as a LOD, that
may come from elsewhere. We look for the point where the source list
switches VGRFs, and split it there. (If there is a message header, we
choose to split there, as it will naturally come from elsewhere.)
This not only reduces the payload sizes, alleviating register pressure,
but it means that we may be able to eliminate some payload construction
altogether, if we have a contiguous block already and some extra data
being tacked on to one side or the other.
shader-db results for Icelake are:
total instructions in shared programs: 19602513 -> 19369255 (-1.19%)
instructions in affected programs: 6085404 -> 5852146 (-3.83%)
helped: 23650 / HURT: 15
helped stats (abs) min: 1 max: 1344 x̄: 9.87 x̃: 3
helped stats (rel) min: 0.03% max: 35.71% x̄: 3.78% x̃: 2.15%
HURT stats (abs) min: 1 max: 44 x̄: 7.20 x̃: 2
HURT stats (rel) min: 1.04% max: 20.00% x̄: 4.13% x̃: 2.00%
95% mean confidence interval for instructions value: -10.16 -9.55
95% mean confidence interval for instructions %-change: -3.84% -3.72%
Instructions are helped.
total cycles in shared programs: 848180368 -> 842208063 (-0.70%)
cycles in affected programs: 599931746 -> 593959441 (-1.00%)
helped: 22114 / HURT: 13053
helped stats (abs) min: 1 max: 482486 x̄: 580.94 x̃: 22
helped stats (rel) min: <.01% max: 78.92% x̄: 4.76% x̃: 0.75%
HURT stats (abs) min: 1 max: 94022 x̄: 526.67 x̃: 22
HURT stats (rel) min: <.01% max: 188.99% x̄: 4.52% x̃: 0.61%
95% mean confidence interval for cycles value: -222.87 -116.79
95% mean confidence interval for cycles %-change: -1.44% -1.20%
Cycles are helped.
total spills in shared programs: 8387 -> 6569 (-21.68%)
spills in affected programs: 5110 -> 3292 (-35.58%)
helped: 359 / HURT: 3
total fills in shared programs: 11833 -> 8218 (-30.55%)
fills in affected programs: 8635 -> 5020 (-41.86%)
helped: 358 / HURT: 3
LOST: 1 SIMD16 shader, 659 SIMD32 shaders
GAINED: 65 SIMD16 shaders, 959 SIMD32 shaders
Total CPU time (seconds): 1505.48 -> 1474.08 (-2.09%)
Examining these results: the few shaders where spills/fills increased
were already spilling significantly, and were only slightly hurt. The
applications affected were also helped in countless other shaders, and
other shaders stopped spilling altogether or had 50% reductions. Many
SIMD16 shaders were gained, and overall we gain more SIMD32, though many
close to the register pressure line go back and forth.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17018>