intel/brw/xe3+: Re-enable static analysis-based SIMD32 FS heuristic for the moment.

This disables, for now, the "optimistic" SIMD heuristic that was
implemented for xe3+ and makes it dependent on a debugging option.
Instead, use the static analysis-based codepath that was used on
previous generations and was extended by previous commits in this MR
to model the xe3 trade-off between register use and thread
parallelism.

The reason is that the main assumption of the optimistic SIMD
heuristic didn't hold up in reality: real-world testing on PTL shows
many cases where SIMD32 performs worse than SIMD16, despite the
ability of xe3 hardware to scale the GRF file of a thread on demand.
Unfortunately, that scenario seems to be more pervasive than was hoped
when the optimistic SIMD heuristic was implemented pre-silicon.

In many cases what seems to be going on is that even when the register
file is able to scale with the increased register use of SIMD32, the
thread parallelism of the EU is scaled down by a similar factor.  So,
at the bottom line, SIMD32 (depending on the actual ratio of register
use between the two variants) may not buy us anything, and it
frequently runs into constraints (like SIMD lowering and less
effective scheduling) that lead to worse codegen than SIMD16, easily
tipping the balance in favor of SIMD16.  The extension of the
performance analysis pass done in a previous commit allows the
original SIMD32 heuristic to take this effect into account
quantitatively, and that seems fairly effective at disabling SIMD32
shaders that underperform, judging from the statistically significant
improvement of most Traci test-cases that run on my PTL system
(4 iterations, 5% significance).  No statistically significant
regressions were observed:

Nba2K23-trace-dx11-2160p-ultra:                    10.16% ±0.34%
Superposition-trace-dx11-2160p-extreme:             4.06% ±0.50%
TotalWarWarhammer3-trace-dx11-1080p-high:           3.52% ±0.76%
Payday3-trace-dx11-1440p-ultra:                     2.41% ±0.81%
MetroExodus-trace-dx11-2160p-ultra:                 2.28% ±0.78%
Borderlands3-trace-dx11-2160p-ultra:                1.89% ±0.65%
MountAndBlade2-trace-dx11-1440p-veryhigh:           1.81% ±0.40%
Blackops3-trace-dx11-1080p-high:                    1.66% ±0.29%
HogwartsLegacy-trace-dx12-1080p-ultra:              1.53% ±0.22%
TotalWarPharaoh-trace-dx11-1440p-ultra:             1.44% ±0.31%
Fortnite-trace-dx11-2160p-epix:                     1.44% ±0.27%
Naraka-trace-dx11-1440p-highest:                    1.39% ±0.27%
PubG-trace-dx11-1440p-ultra:                        1.30% ±0.49%
Destiny2-trace-dx11-1440p-highest:                  1.10% ±0.23%
Factorio-trace-1080p-high:                          1.10% ±1.77%
TerminatorResistance-trace-dx11-2160p-ultra:        1.08% ±0.31%
Ghostrunner2-trace-dx11-1440p-ultra:                1.05% ±0.15%
ShadowTombRaider-trace-dx11-2160p-ultra:            0.98% ±0.19%
CitiesSkylines2-trace-dx11-1440p-high:              0.67% ±0.19%
Palworld-trace-dx11-1080p-med:                      0.44% ±0.22%

The downside is that this reverses the large compile-time reduction we
gained from the optimistic SIMD heuristic -- the run-time of both
shader-db and fossil-db jumps back up by nearly 20% with this change.
I'm working on a better compromise based on run-time feedback that
will hopefully let us preserve the compile-time benefit of the
optimistic heuristic without the loss in run-time performance, but in
the meantime the run-time performance gap from SIMD32 seems to be the
more urgent issue to address, since it has an impact on titles across
the board.  Despite the reversal of that compile-time improvement,
xe3 still achieves slightly lower compile time on average than
previous generations as a result of VRT, so this doesn't seem terribly
tragic.

v2: Add bit to brw_get_compiler_config_value() (Lionel).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Francisco Jerez, 2025-07-16 15:28:06 -07:00, committed by Marge Bot
parent a7969b5d42
commit 5bf7bb5cf9
3 changed files with 26 additions and 1 deletion

@@ -1641,7 +1641,7 @@ brw_compile_fs(const struct brw_compiler *compiler,
       }
    }
-   if (devinfo->ver >= 30) {
+   if (compiler->optimistic_simd_heuristic) {
       unsigned max_dispatch_width = reqd_dispatch_width ? reqd_dispatch_width : 32;
       if (max_polygons >= 2 && !key->coarse_pixel) {

@@ -112,6 +112,9 @@ brw_compiler_create(void *mem_ctx, const struct intel_device_info *devinfo)
    compiler->lower_dpas = !devinfo->has_systolic ||
       debug_get_bool_option("INTEL_LOWER_DPAS", false);
+   compiler->optimistic_simd_heuristic =
+      debug_get_bool_option("INTEL_SIMD_OPTIMISTIC", false);
    nir_lower_int64_options int64_options =
       nir_lower_imul64 |
       nir_lower_isign64 |
@@ -244,6 +247,8 @@ brw_get_compiler_config_value(const struct brw_compiler *compiler)
    bits++;
    insert_u64_bit(&config, compiler->lower_dpas);
    bits++;
+   insert_u64_bit(&config, compiler->optimistic_simd_heuristic);
+   bits++;
    enum intel_debug_flag debug_bits[] = {
       DEBUG_NO_DUAL_OBJECT_GS,

@@ -121,6 +121,26 @@ struct brw_compiler {
     */
    bool lower_dpas;
+   /**
+    * This can be set to use an "optimistic" SIMD heuristic that
+    * assumes that the highest SIMD width and polygon count achievable
+    * without spills will give the highest performance, so the
+    * compiler doesn't need to try more than that.
+    *
+    * As of xe3 most programs compile without spills at 32-wide
+    * dispatch so with this option enabled typically only a single
+    * back-end compilation will be done instead of the default
+    * behavior of one compilation per supported dispatch mode. This
+    * can speed up the back-end compilation of fragment shaders by a
+    * 2+ factor, but could also increase compile-time especially on
+    * pre-xe3 platforms in cases with high register pressure.
+    *
+    * Run-time performance of the shaders will be reduced since this
+    * removes the ability to use a static analysis to estimate the
+    * relative performance of the dispatch modes supported.
+    */
+   bool optimistic_simd_heuristic;
    /**
     * Calling the ra_allocate function after each register spill can take
     * several minutes. This option speeds up shader compilation by spilling