intel/brw/xehp+: Adjust performance model weights of LSC atomic ops.

The LSC implements several optimizations for atomic operations on a
memory address that is uniform across all lanes, in which case their
cost is approximately O(1) instead of O(exec_size).  Even cases where
the memory offsets are non-uniform but packed within a cacheline appear
to have a cost that scales less than linearly with the number of lanes.
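
To make that qualitative behavior concrete, here is a minimal toy
model (not Mesa code -- the function name and the constants are made
up purely for illustration) of how the per-message cost of an LSC
atomic might depend on address divergence:

  /* Illustrative sketch only: the constants and names below are
   * hypothetical and are not part of the Mesa performance model. */
  #include <algorithm>
  #include <cstdio>

  static unsigned
  toy_lsc_atomic_cycles(unsigned exec_size, bool uniform_address,
                        unsigned cachelines_touched)
  {
     const unsigned base = 100;    /* hypothetical fixed message overhead */
     const unsigned per_lane = 40; /* hypothetical per-lane serialization cost */

     if (uniform_address)
        return base;   /* ~O(1): a single atomic serves all lanes */

     /* Offsets packed into a few cachelines still coalesce partially, so
      * the cost grows with the number of cachelines touched (assuming 16
      * dword lanes per 64B cacheline) rather than with the full exec_size. */
     return base + per_lane * std::min(exec_size, cachelines_touched * 16);
  }

  int main()
  {
     printf("SIMD32, uniform address:       ~%u\n", toy_lsc_atomic_cycles(32, true, 1));
     printf("SIMD32, packed in 1 cacheline: ~%u\n", toy_lsc_atomic_cycles(32, false, 1));
     printf("SIMD32, fully scattered:       ~%u\n", toy_lsc_atomic_cycles(32, false, 32));
  }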

In order to model this behavior more closely, approximate the back-end
cost of these operations as roughly 1300 cycles instead of the previous
400 * exec_size/8.  This fixes some cases where we were incorrectly
predicting that a SIMD32 shader would be bound by the throughput of LSC
atomic operations, even though the observed per-lane cost of the LSC
operations was significantly lower in SIMD32 mode, so SIMD32 would
actually have given the best performance.
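
For reference, the two approximations compare as follows per message
(a stand-alone sketch using only the numbers quoted above, not Mesa
code):

  /* Compare the previous and the new back-end cost weights for a single
   * LSC atomic message across execution sizes. */
  #include <cstdio>

  int main()
  {
     for (unsigned exec_size = 8; exec_size <= 32; exec_size *= 2) {
        const unsigned old_cost = 400 * exec_size / 8; /* previous: 400 * exec_size/8 */
        const unsigned new_cost = 1300;                /* new: constant per message */
        printf("SIMD%-2u  previous: %4u cycles  new: %4u cycles\n",
               exec_size, old_cost, new_cost);
     }
  }

The previous weight made a SIMD32 message look four times as expensive
as a SIMD8 one (1600 vs. 400 cycles), which is what tipped the model
against SIMD32.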

Clearly this is still a rough approximation, and it might be possible
to obtain a more accurate result by plumbing divergence analysis data
all the way down to codegen.  However, the goal of the performance
analysis pass isn't to provide an exact prediction of the performance
of a shader (that's not really possible in general via static analysis
without solving the halting problem), but to provide a good enough
approximation at a low cost.  The constant approximation seems to be
strictly better in practice than the approximation we were using
before: there appear to be no regressions from this change, and
ShadowTombRaider-trace-dx11-2160p-ultra shows 5.7% better performance
on PTL with a subsequent commit that re-enables the use of the static
analysis-based SIMD32 heuristic on xe3+.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
commit 1272ff5ed1
parent 6eea9659db
Author: Francisco Jerez
Date: 2025-07-16 14:58:14 -07:00 (committed by Marge Bot)

@@ -683,7 +683,7 @@ namespace {
          case LSC_OP_ATOMIC_OR:
          case LSC_OP_ATOMIC_XOR:
             return calculate_desc(info, EU_UNIT_DP_DC, 2, 0, 0,
-                                  30 /* XXX */, 400 /* XXX */,
+                                  1300 /* XXX */, 0 /* XXX */,
                                   10 /* XXX */, 100 /* XXX */, 0, 0,
                                   0, 400 /* XXX */);
          default: