Since inline parameter is the last field of the thread payload, the
backend can always assume they may exist. They won't affect the
position of other payload fields and the register allocator will
reuse any unused space.
In Anv, also update EmitInlineParameter for Task/Mesh/CS to reflect
previous changes in inline parameter setup. Remove/Update some stale
comments since we are here.
Finally, remove the prog_key/prog_data bits that tracked whether inline
data or a push address was needed.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41230>
This contains new tests for DGC+multiview which are valid in DX12
but invalid in Vulkan, unless RADV allows support for it. Important
to have coverage for us because it's used for Crimson Desert.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41193>
The intention of the original commit was to make all the shaders report
the same max_dispatch_width. When CS has multiple variants, this was
not happening as expected.
Fixes: 2acc2f18ea ("intel/compiler: report max dispatch width statistic")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41209>
Now instructions still read/write UFLAG, which preserves the information about
lane 0 we need for proper predication etc.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
Merge the empty else optimization, the then-block predication, and the
break-while fusion into a unified "try to predicate each side of an if, peephole
optimizing control flow" optimization. This is simpler and more general.
Totals:
Instrs: 4783809 -> 4775647 (-0.17%)
CodeSize: 70766656 -> 70674064 (-0.13%); split: -0.13%, +0.00%
Totals from 1109 (41.90% of 2647) affected shaders:
Instrs: 4130644 -> 4122482 (-0.20%)
CodeSize: 61180848 -> 61088256 (-0.15%); split: -0.15%, +0.00%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
This is totally broken now that we have a physical CFG for UGPRs. And of course,
UGPRs generally were totally broken without the physical CFG. So I conclude
this code basically never worked. Which is good because it was also basically
always dead too. Just delete it and replace with a clear error message, instead
of pretending it works and either randomly splatting validation or just straight
up miscompiling silently or whatever.
We might need an alternative UGPR->GPR spill path some day but that day is not
today.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
Consider:
u0 = foo()
if (divergent) {
u0 = bar()
r0 = baz(u0)
} else {
r0 = quux(u0)
}
Logically, this is fine, there is no interference between bar() and u0. But
physically, both sides of the if execute so the bar() write to u0 overwrites the
variable the else reads. So this is a miscompile.
The solution is to model the extra edges in the physical control flow graph,
which lives next to the existing logical control flow graph. Liveness for UGPRs
now follows the physical CFG, while liveness for GPRs continues to follow the
logical CFG. That models the interference properly, while still allowing phis to
work as before (since phis writing UGPRs follow uniform bits of control flow
that are necessarily critical edge free for the same reason the logical CFG is).
Because our RA copies shuffled registers back at block ends (following
Colombet), there's no issue with live range splits here (unlike aco which
inserts phis for this case and then needs to worry about critical edges around
those phis).
There might still be an extremely-challenging-to-hit bug here with UGPR spilling
which I need to think more about. It might be fine as-is? Not convinced though.
But this is big enough and strictly less broken than what we have right now and
the full solution will build on this, so here we are.
Fixes artefating in SuperTuxKart and Celestia knows what else.
Totals:
Instrs: 2770938 -> 2771269 (+0.01%); split: -0.00%, +0.02%
CodeSize: 40133712 -> 40138480 (+0.01%); split: -0.01%, +0.02%
Totals from 158 (5.97% of 2647) affected shaders:
Instrs: 514523 -> 514854 (+0.06%); split: -0.02%, +0.09%
CodeSize: 7603040 -> 7607808 (+0.06%); split: -0.03%, +0.09%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
On my Arc A380 (DG2), this more than doubles the performance of Jeff
Bolz's cooperative matrix benchmark. With llama.cpp modified to use
cooperative matrix on DG2, performance is improved by 37%.
Closes: #15311
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Tested-by: Matt Corallo <git@bluematt.me>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41172>
The next commit will cause some very specific phis to not be lowered to
scalar, and that's the reason the callback is used instead of
nir_lower_all_phis_to_scalar.
It's worth noting that the comment in nir_lower_phis_to_scalar.c
specifically calls out Deus Ex as the reason some phis should not be
lowered. At least on current BRW, zero shaders from Deus Ex trace were
affected for spills or fills on any Intel platform.
shader-db:
All Intel platforms had similar results. (Lunar Lake shown)
total instructions in shared programs: 17050005 -> 17051449 (<.01%)
instructions in affected programs: 41032 -> 42476 (3.52%)
helped: 29 / HURT: 159
total cycles in shared programs: 876411976 -> 876433702 (<.01%)
cycles in affected programs: 1455550 -> 1477276 (1.49%)
helped: 40 / HURT: 150
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 916599633 -> 916694854 (+0.01%); split: -0.00%, +0.01%
CodeSize: 14705971792 -> 14708302384 (+0.02%); split: -0.00%, +0.02%
Send messages: 40870114 -> 40870113 (-0.00%)
Cycle count: 102360965889 -> 102364169753 (+0.00%); split: -0.00%, +0.01%
Spill count: 3460669 -> 3460240 (-0.01%)
Fill count: 4988325 -> 4987891 (-0.01%)
Max live registers: 192914542 -> 192918153 (+0.00%); split: -0.00%, +0.00%
Max dispatch width: 48848112 -> 48848128 (+0.00%)
Non SSA regs after NIR: 141633613 -> 141671589 (+0.03%); split: -0.00%, +0.03%
Totals from 5713 (0.28% of 2010434) affected shaders:
Instrs: 5215921 -> 5311142 (+1.83%); split: -0.09%, +1.91%
CodeSize: 88940784 -> 91271376 (+2.62%); split: -0.20%, +2.82%
Send messages: 284751 -> 284750 (-0.00%)
Cycle count: 275671864 -> 278875728 (+1.16%); split: -0.74%, +1.90%
Spill count: 857 -> 428 (-50.06%)
Fill count: 845 -> 411 (-51.36%)
Max live registers: 667776 -> 671387 (+0.54%); split: -0.86%, +1.40%
Max dispatch width: 160416 -> 160432 (+0.01%)
Non SSA regs after NIR: 1127904 -> 1165880 (+3.37%); split: -0.10%, +3.47%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Tested-by: Matt Corallo <git@bluematt.me>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41172>
There are less than 2^16 lanes within a threadgroup, so it is safe to do
all math at 16-bit. This allows us to use 16-bit integer division which is
much faster than 32-bit integer division (in terms of the lowerings).
In a "hello world" kernel with variable wg size, simd32 goes 72 inst -> 57
inst on jay and 82 -> 67 inst on brw.
OTOH it's a loss for non-variable wg size, so do it only there to avoid
unwelcome stats regresions on Vulkan.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41084>
These are a mix of fields whose last used was removed or fields that were
never used, possibly because they remained in a patch while the rest of the
code changed before landing.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41139>
BTD unit will keep accumulating the threads and then eventually dispatch
those active threads once it reaches the counter.
I guess dispatching too fast will not have full occupancy at the BTD
unit, instead we just pick the half of max value for counter.
This patch also add drirc option to dispatch_timeout_counter and tweak
values internally with respect to HW limits. Default value we have right
now is 512 clocks, we can for sure tune it per app.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40733>
Since field is split in between multiple fields, we have to manually
write the values and refer to Bspec 43851 for exact values.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40733>
The new tool has much better image diffing presentation (thanks to
Danilo's work on turnip's private trace CI), better performance, flake
checking within a single run, parallelized downloads along with replays,
system monitoring for replay debug (OOMs especially), and DXVK support
(I've added a few traces, but not most of the collection because I didn't
want to block on stabilizing this job with everything).
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41115>
On Xe, we have this bit reversed. It's called Thread preemption Disable.
On Xe2+ (Bspec 56590), it's called Thread preemption with option
enabled/disabled.
AFAIK, we don't support mid-thread preemption. This patch set values
properly according to bspec.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41120>
This can be used to workaround problem cases with application
controlled subgroup size.
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40813>
../src/intel/vulkan/genX_init_state.c: In function ‘gfx9_CreateSampler’:
../src/intel/vulkan/genX_init_state.c:1507:40: warning: ‘border_color_offset’ may be used uninitialized [-Wmaybe-uninitialized]
1507 | sampler_state.BorderColorPointer = border_color_offset;
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41116>
Format is set to ISL_FORMAT_UNSUPPORTED at anv_get_format_plane at src/intel/vulkan_hasvk/anv_formats.c,
because Ivy Bridge does not support enough 24 and 48-bits formats.
Problem solved by checking format after the call.
Signed-off-by: GKraats <vd.kraats@hccnet.nl>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40237>