intel/brw/xe3+: Re-enable static analysis-based SIMD32 FS heuristic for the moment.

This disables, for now, the "optimistic" SIMD heuristic that was
implemented for xe3+ and makes it dependent on a debugging option.
Instead, use the static analysis-based codepath that was used on
previous generations and was extended by previous commits in this MR
to model the xe3 trade-off between register use and thread
parallelism.

The reason is that the main assumption of the optimistic SIMD
heuristic didn't hold up in reality: real-world testing on PTL shows
many cases where SIMD32 performs worse than SIMD16, despite the
ability of xe3 hardware to scale the GRF file of a thread on demand.
Unfortunately, that scenario seems to be more pervasive than was hoped
when the optimistic SIMD heuristic was implemented pre-silicon.

In many cases what seems to be going on is that even when the register
file is able to scale with the increased register use of SIMD32, the
thread parallelism of the EU is scaled down by a similar factor.  So,
at the bottom line, SIMD32 (depending on the actual ratio of register
use between the two variants) may not buy us anything, and it
frequently runs into constraints (like SIMD lowering and less
effective scheduling) that lead to worse codegen than SIMD16, easily
tipping the balance in favor of SIMD16.  The extension of the
performance analysis pass done in a previous commit allows the
original SIMD32 heuristic to take this effect into account
quantitatively, and that seems fairly effective at disabling SIMD32
shaders that underperform, judging from the statistically significant
improvement of most Traci test-cases that run on my PTL system
(4 iterations, 5% significance).  No statistically significant
regressions were observed:

Nba2K23-trace-dx11-2160p-ultra:                    10.16% ±0.34%
Superposition-trace-dx11-2160p-extreme:             4.06% ±0.50%
TotalWarWarhammer3-trace-dx11-1080p-high:           3.52% ±0.76%
Payday3-trace-dx11-1440p-ultra:                     2.41% ±0.81%
MetroExodus-trace-dx11-2160p-ultra:                 2.28% ±0.78%
Borderlands3-trace-dx11-2160p-ultra:                1.89% ±0.65%
MountAndBlade2-trace-dx11-1440p-veryhigh:           1.81% ±0.40%
Blackops3-trace-dx11-1080p-high:                    1.66% ±0.29%
HogwartsLegacy-trace-dx12-1080p-ultra:              1.53% ±0.22%
TotalWarPharaoh-trace-dx11-1440p-ultra:             1.44% ±0.31%
Fortnite-trace-dx11-2160p-epix:                     1.44% ±0.27%
Naraka-trace-dx11-1440p-highest:                    1.39% ±0.27%
PubG-trace-dx11-1440p-ultra:                        1.30% ±0.49%
Destiny2-trace-dx11-1440p-highest:                  1.10% ±0.23%
Factorio-trace-1080p-high:                          1.10% ±1.77%
TerminatorResistance-trace-dx11-2160p-ultra:        1.08% ±0.31%
Ghostrunner2-trace-dx11-1440p-ultra:                1.05% ±0.15%
ShadowTombRaider-trace-dx11-2160p-ultra:            0.98% ±0.19%
CitiesSkylines2-trace-dx11-1440p-high:              0.67% ±0.19%
Palworld-trace-dx11-1080p-med:                      0.44% ±0.22%

The downside is that this reverses the large compile-time reduction we
gained from the optimistic SIMD heuristic -- the run-time of both
shader-db and fossil-db jumps back up by nearly 20% with this change.
I'm working on a better compromise based on run-time feedback that
will hopefully let us preserve the compile-time benefit of the
optimistic heuristic without the loss in run-time performance, but in
the meantime the run-time performance gap from SIMD32 seems to be the
more urgent issue to address, since it has an impact on titles across
the board.  Despite the reversal of that compile-time improvement,
xe3 still achieves slightly lower compile time on average than
previous generations as a result of VRT, so this doesn't seem terribly
tragic.

v2: Add bit to brw_get_compiler_config_value() (Lionel).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Francisco Jerez, 2025-07-16 15:28:06 -07:00, committed by Marge Bot
parent a7969b5d42
commit 5bf7bb5cf9
3 changed files with 26 additions and 1 deletion

@@ -1641,7 +1641,7 @@ brw_compile_fs(const struct brw_compiler *compiler,
       }
    }
-   if (devinfo->ver >= 30) {
+   if (compiler->optimistic_simd_heuristic) {
       unsigned max_dispatch_width = reqd_dispatch_width ? reqd_dispatch_width : 32;
       if (max_polygons >= 2 && !key->coarse_pixel) {

@@ -112,6 +112,9 @@ brw_compiler_create(void *mem_ctx, const struct intel_device_info *devinfo)
    compiler->lower_dpas = !devinfo->has_systolic ||
       debug_get_bool_option("INTEL_LOWER_DPAS", false);
+   compiler->optimistic_simd_heuristic =
+      debug_get_bool_option("INTEL_SIMD_OPTIMISTIC", false);
    nir_lower_int64_options int64_options =
       nir_lower_imul64 |
       nir_lower_isign64 |
@@ -244,6 +247,8 @@ brw_get_compiler_config_value(const struct brw_compiler *compiler)
    bits++;
    insert_u64_bit(&config, compiler->lower_dpas);
    bits++;
+   insert_u64_bit(&config, compiler->optimistic_simd_heuristic);
+   bits++;
    enum intel_debug_flag debug_bits[] = {
       DEBUG_NO_DUAL_OBJECT_GS,

@@ -121,6 +121,26 @@ struct brw_compiler {
     */
    bool lower_dpas;
+   /**
+    * This can be set to use an "optimistic" SIMD heuristic that
+    * assumes that the highest SIMD width and polygon count achievable
+    * without spills will give the highest performance, so the
+    * compiler doesn't need to try more than that.
+    *
+    * As of xe3 most programs compile without spills at 32-wide
+    * dispatch so with this option enabled typically only a single
+    * back-end compilation will be done instead of the default
+    * behavior of one compilation per supported dispatch mode. This
+    * can speed up the back-end compilation of fragment shaders by a
+    * 2+ factor, but could also increase compile-time especially on
+    * pre-xe3 platforms in cases with high register pressure.
+    *
+    * Run-time performance of the shaders will be reduced since this
+    * removes the ability to use a static analysis to estimate the
+    * relative performance of the dispatch modes supported.
+    */
+   bool optimistic_simd_heuristic;
    /**
     * Calling the ra_allocate function after each register spill can take
     * several minutes. This option speeds up shader compilation by spilling