The intention of the original commit was to make all the shaders report
the same max_dispatch_width. When CS has multiple variants, this was
not happening as expected.
Fixes: 2acc2f18ea ("intel/compiler: report max dispatch width statistic")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41209>
Now instructions still read/write UFLAG, which preserves the information about
lane 0 we need for proper predication etc.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
Merge the empty else optimization, the then-block predication, and the
break-while fusion into a unified "try to predicate each side of an if, peephole
optimizing control flow" optimization. This is simpler and more general.
Totals:
Instrs: 4783809 -> 4775647 (-0.17%)
CodeSize: 70766656 -> 70674064 (-0.13%); split: -0.13%, +0.00%
Totals from 1109 (41.90% of 2647) affected shaders:
Instrs: 4130644 -> 4122482 (-0.20%)
CodeSize: 61180848 -> 61088256 (-0.15%); split: -0.15%, +0.00%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
This is totally broken now that we have a physical CFG for UGPRs. And of course,
UGPRs generally were totally broken without the physical CFG. So I conclude
this code basically never worked. Which is good because it was also basically
always dead too. Just delete it and replace with a clear error message, instead
of pretending it works and either randomly splatting validation or just straight
up miscompiling silently or whatever.
We might need an alternative UGPR->GPR spill path some day but that day is not
today.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
Consider:
   u0 = foo()
   if (divergent) {
      u0 = bar()
      r0 = baz(u0)
   } else {
      r0 = quux(u0)
   }
Logically, this is fine: the u0 written by bar() does not interfere with the u0
that the else reads. But physically, both sides of the if execute, so the bar()
write to u0 overwrites the variable the else reads. So this is a miscompile.
The solution is to model the extra edges in the physical control flow graph,
which lives next to the existing logical control flow graph. Liveness for UGPRs
now follows the physical CFG, while liveness for GPRs continues to follow the
logical CFG. That models the interference properly, while still allowing phis to
work as before (since phis writing UGPRs follow uniform bits of control flow
that are necessarily critical edge free for the same reason the logical CFG is).
Because our RA copies shuffled registers back at block ends (following
Colombet), there's no issue with live range splits here (unlike aco which
inserts phis for this case and then needs to worry about critical edges around
those phis).
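The interference in the example can be reproduced with a toy liveness pass.
This is a minimal sketch, not the compiler's actual code: the block names, the
dictionary layout, and the helper functions are all hypothetical, and the two
values that share u0 are renamed a (from foo()) and b (from bar()) so they can
be told apart. The only difference between the two runs is the extra
then -> else edge in the physical CFG.

```python
# Each block maps to (defs, upward-exposed uses). Hypothetical toy IR.
BLOCKS = {
    "entry": ({"a"}, set()),   # a = foo()
    "then":  ({"b"}, set()),   # b = bar(); r0 = baz(b)
    "else":  (set(), {"a"}),   # r0 = quux(a)
    "merge": (set(), set()),
}

LOGICAL = {"entry": ["then", "else"], "then": ["merge"], "else": ["merge"]}
# Physically, a divergent if runs the then side and falls into the else side.
PHYSICAL = {**LOGICAL, "then": ["merge", "else"]}


def live_out(block, live_in, succs):
    """Union of the live-in sets of a block's successors."""
    result = set()
    for succ in succs.get(block, []):
        result |= live_in[succ]
    return result


def solve_live_in(blocks, succs):
    """Backward dataflow liveness, iterated to a fixed point."""
    live_in = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, (defs, uses) in blocks.items():
            new_in = uses | (live_out(name, live_in, succs) - defs)
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


def a_live_across_then(succs):
    """True if a must survive the then block, i.e. it interferes with b."""
    return "a" in live_out("then", solve_live_in(BLOCKS, succs), succs)
```

With the logical edges, a is dead across the then block, so b may reuse its
register. With the physical edges, a stays live there and the two interfere.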
There might still be an extremely-challenging-to-hit bug here with UGPR spilling
which I need to think more about. It might be fine as-is? Not convinced though.
But this is big enough and strictly less broken than what we have right now and
the full solution will build on this, so here we are.
Fixes artefacting in SuperTuxKart and Celestia knows what else.
Totals:
Instrs: 2770938 -> 2771269 (+0.01%); split: -0.00%, +0.02%
CodeSize: 40133712 -> 40138480 (+0.01%); split: -0.01%, +0.02%
Totals from 158 (5.97% of 2647) affected shaders:
Instrs: 514523 -> 514854 (+0.06%); split: -0.02%, +0.09%
CodeSize: 7603040 -> 7607808 (+0.06%); split: -0.03%, +0.09%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41215>
On my Arc A380 (DG2), this more than doubles the performance of Jeff
Bolz's cooperative matrix benchmark. With llama.cpp modified to use
cooperative matrix on DG2, performance is improved by 37%.
Closes: #15311
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Tested-by: Matt Corallo <git@bluematt.me>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41172>
The next commit will cause some very specific phis to not be lowered to
scalar, and that's the reason the callback is used instead of
nir_lower_all_phis_to_scalar.
It's worth noting that the comment in nir_lower_phis_to_scalar.c
specifically calls out Deus Ex as the reason some phis should not be
lowered. At least on current BRW, zero shaders from the Deus Ex trace were
affected for spills or fills on any Intel platform.
shader-db:
All Intel platforms had similar results. (Lunar Lake shown)
total instructions in shared programs: 17050005 -> 17051449 (<.01%)
instructions in affected programs: 41032 -> 42476 (3.52%)
helped: 29 / HURT: 159
total cycles in shared programs: 876411976 -> 876433702 (<.01%)
cycles in affected programs: 1455550 -> 1477276 (1.49%)
helped: 40 / HURT: 150
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 916599633 -> 916694854 (+0.01%); split: -0.00%, +0.01%
CodeSize: 14705971792 -> 14708302384 (+0.02%); split: -0.00%, +0.02%
Send messages: 40870114 -> 40870113 (-0.00%)
Cycle count: 102360965889 -> 102364169753 (+0.00%); split: -0.00%, +0.01%
Spill count: 3460669 -> 3460240 (-0.01%)
Fill count: 4988325 -> 4987891 (-0.01%)
Max live registers: 192914542 -> 192918153 (+0.00%); split: -0.00%, +0.00%
Max dispatch width: 48848112 -> 48848128 (+0.00%)
Non SSA regs after NIR: 141633613 -> 141671589 (+0.03%); split: -0.00%, +0.03%
Totals from 5713 (0.28% of 2010434) affected shaders:
Instrs: 5215921 -> 5311142 (+1.83%); split: -0.09%, +1.91%
CodeSize: 88940784 -> 91271376 (+2.62%); split: -0.20%, +2.82%
Send messages: 284751 -> 284750 (-0.00%)
Cycle count: 275671864 -> 278875728 (+1.16%); split: -0.74%, +1.90%
Spill count: 857 -> 428 (-50.06%)
Fill count: 845 -> 411 (-51.36%)
Max live registers: 667776 -> 671387 (+0.54%); split: -0.86%, +1.40%
Max dispatch width: 160416 -> 160432 (+0.01%)
Non SSA regs after NIR: 1127904 -> 1165880 (+3.37%); split: -0.10%, +3.47%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Tested-by: Matt Corallo <git@bluematt.me>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41172>
There are fewer than 2^16 lanes within a threadgroup, so it is safe to do
all math at 16-bit. This allows us to use 16-bit integer division which is
much faster than 32-bit integer division (in terms of the lowerings).
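As an illustration of why the narrower type is safe, the sketch below assumes
the math in question resembles unflattening a lane index into 3D coordinates.
The commit does not spell out the exact computation, and the function names are
hypothetical. Python integers are masked to 16 bits to mimic 16-bit unsigned
arithmetic.

```python
MASK16 = 0xFFFF


def unflatten_u16(index, size_x, size_y):
    """Split a flat lane index into (x, y, z) using only 16-bit unsigned math."""
    index &= MASK16
    row_stride = size_x & MASK16
    slice_stride = (size_x * size_y) & MASK16
    x = (index % row_stride) & MASK16
    y = ((index // row_stride) % size_y) & MASK16
    z = (index // slice_stride) & MASK16
    return x, y, z


def unflatten_wide(index, size_x, size_y):
    """Reference version with no truncation."""
    return index % size_x, (index // size_x) % size_y, index // (size_x * size_y)


def agrees_for_workgroup(size_x, size_y, size_z):
    """Check every lane of a workgroup whose lane count fits in 16 bits."""
    total = size_x * size_y * size_z
    assert total < 1 << 16
    return all(
        unflatten_u16(i, size_x, size_y) == unflatten_wide(i, size_x, size_y)
        for i in range(total)
    )
```

Since every intermediate value is bounded by the lane count, none of the masks
ever discards a set bit, which is the property the narrowing relies on.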
In a "hello world" kernel with variable wg size, simd32 goes 72 inst -> 57
inst on jay and 82 -> 67 inst on brw.
OTOH it's a loss for non-variable wg size, so do it only there to avoid
unwelcome stats regressions on Vulkan.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41084>
These are a mix of fields whose last use was removed or fields that were
never used, possibly because they remained in a patch while the rest of the
code changed before landing.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41139>
This helper might be used as part of another instruction's emission,
which itself might have set the saturate bit in the default
state. This might result in the SYNC being created already
with saturate bit set.
Since SYNC doesn't have saturate, clear that field
instead of sometimes having it set.
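The pattern can be sketched as follows. This is only an illustration under
assumed names: DefaultState, emit, and emit_sync are hypothetical and do not
correspond to the real builder API. The point is that the SYNC helper overrides
the inherited saturate bit rather than trusting whatever the caller left in the
default state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DefaultState:
    """Hypothetical default instruction state inherited by emitted instructions."""
    saturate: bool = False


def emit(opcode, state):
    """Emit an instruction that copies its modifiers from the default state."""
    return {"opcode": opcode, "saturate": state.saturate}


def emit_sync(state):
    """Emit SYNC with saturate always cleared, whatever the caller had set."""
    return emit("SYNC", replace(state, saturate=False))
```

A caller in the middle of emitting a saturating instruction would still get a
SYNC with the saturate field cleared.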
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41005>
DG2 (Bspec 47937) has the same programming note as Xe2+:
"When this bit is set in the header, Trace Ray Message behaves like a
Ray Query. This message requires a write-back message indicating
RayQuery for all valid Rays (SIMD lanes) have completed."
So this patch just passes a write-back destination register when we
have a ray query message.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41039>
StackSizePerRay is the RTDispatchGlobals::AsyncStackSize and
DisableRTGlobalsKnownValues is to interpret how many Max BVH levels we
need to use. It's not relevant to Vulkan, since we have just 2 fixed BVH
levels.
Fixes: cb423ee6 ("anv: Fix Wa_14021821874, Wa_14018813551, Wa_14026600921")
Fixes: c1a44e8d ("anv: force StackIDControl value for Wa_14021821874")
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41012>
This lets us lower away 8<-->2 copies/swaps in a faster, more straightforward
way by (ab)using accumulators. I think as an edge case this plays nicely enough
with my plans to profit from accs for normal fma-heavy code.
SIMD16:
Totals:
Instrs: 2761525 -> 2758108 (-0.12%)
CodeSize: 44222384 -> 44167168 (-0.12%)
Totals from 33 (1.25% of 2647) affected shaders:
Instrs: 422130 -> 418713 (-0.81%)
CodeSize: 6713680 -> 6658464 (-0.82%)
SIMD32:
Totals:
Instrs: 4911601 -> 4691895 (-4.47%)
CodeSize: 79553984 -> 76010880 (-4.45%)
Totals from 947 (35.78% of 2647) affected shaders:
Instrs: 4143501 -> 3923795 (-5.30%)
CodeSize: 67174592 -> 63631488 (-5.27%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41064>
SEL.f32 flushes denorms but SEL.u32 does not. That means changing the type of
the SEL is only justified if we know we're used as a float. This fixes
miscompilation in cases like:
   ieq(1, bcsel(a, fneg(b), c))
Previously we'd be too greedy and form
   (a) SEL.f32 t, -b, c
   cmp.u32 t, 1
But that would inadvertently flush c, which is an integer here. So just set the
type based on what we're used as. There are some regressions due to
is_only_used_as_float not seeing through phis (this could probably be fixed?).
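The failure can be reproduced on raw bit patterns. In the sketch below, the
function names are hypothetical and the float select is modeled, as an
assumption, as a plain select followed by flushing denormals to signed zero.
The integer 1 has the same bits as the smallest positive float32 denormal, so a
float-typed select destroys it.

```python
def flush_denorm_bits(bits):
    """Flush a float32 denormal, given as raw bits, to a zero of the same sign."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:
        return bits & 0x80000000
    return bits


def sel_u32(cond, x, y):
    """Integer-typed select: bits pass through untouched."""
    return x if cond else y


def sel_f32(cond, x, y):
    """Float-typed select, modeled here as flushing denormals in the result."""
    return flush_denorm_bits(x if cond else y)


NEG_ONE_F32 = 0xBF800000  # bits of -1.0f, standing in for fneg(b)
C_INT = 1                 # integer operand whose bits form a float32 denormal
```

With a false condition, sel_u32 returns 1 and the integer compare against 1
succeeds, while sel_f32 returns 0 and the compare fails.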
Totals:
Instrs: 2760796 -> 2761525 (+0.03%); split: -0.06%, +0.08%
CodeSize: 44244128 -> 44222384 (-0.05%); split: -0.13%, +0.08%
Totals from 945 (35.70% of 2647) affected shaders:
Instrs: 1968645 -> 1969374 (+0.04%); split: -0.08%, +0.11%
CodeSize: 31721968 -> 31700224 (-0.07%); split: -0.17%, +0.11%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41064>