jay/lower_scoreboard: use .src annotations

This is less heavy-handed, avoiding unnecessary stalls after SENDs in a
bunch of common cases. The stats (SIMD32) are:

Totals:
CodeSize: 70345392 -> 71674272 (+1.89%)

Totals from 1774 (67.02% of 2647) affected shaders:
CodeSize: 67359248 -> 68688128 (+1.97%)

What's happening here is that we now insert extra SYNC.nop instructions in a
bunch of cases for the .src sync preceding the eventual .dst sync. However,
setting aside the i-cache impact for a moment, this shows the optimization
doing exactly what it should: deferring dst syncs and inserting cheaper src
syncs first. So this should be a win in practice despite the negative
code-size stat.

The most hurt shaders pool up SYNC.nop's at the end of blocks, due to
local-only SWSB and the lack of a SYNC.allwr optimization. The latter is added
later in this MR; a fix for the former is planned.

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41398>
Alyssa Rosenzweig 2026-05-06 09:47:36 -04:00 committed by Marge Bot
parent 130e724d5e
commit 0885ed10f5

@@ -69,10 +69,13 @@ lower_send_local(jay_function *func, jay_block *block)
 		struct gpr_range dst = def_to_gpr(func, I, I->dst);
 		u_foreach_bit(sbid, busy) {
-			if (BITSET_TEST_COUNT(tokens[sbid].reading, dst.base, dst.width) ||
-			    BITSET_TEST_COUNT(tokens[sbid].writing, dst.base, dst.width)) {
+			if (BITSET_TEST_COUNT(tokens[sbid].writing, dst.base, dst.width)) {
 				jay_SYNC_nop(&b, tgl_swsb_sbid(TGL_SBID_DST, sbid));
 				busy &= ~BITFIELD_BIT(sbid);
+			} else if (BITSET_TEST_COUNT(tokens[sbid].reading, dst.base,
+			                             dst.width)) {
+				jay_SYNC_nop(&b, tgl_swsb_sbid(TGL_SBID_SRC, sbid));
+				BITSET_ZERO(tokens[sbid].reading);
 			}
 		}
 	}