intel/brw/xe3+: Model trade-off between parallelism and GRF use in performance analysis.

This extends the performance analysis pass used in previous generations so that it can model the performance trade-off introduced by VRT on xe3 hardware.

VRT allows the driver to request a per-thread GRF allocation different from the 128 GRFs that were typical on previous platforms, but this comes at either a thread parallelism cost or benefit depending on the number of GRF register blocks requested. This makes a number of decisions more difficult for the compiler, since certain optimizations potentially trade off run-time within a thread against the total number of threads that can run in parallel. For example, during scheduling, reordering an instruction to avoid a stall improves instruction-level parallelism but can increase GRF use and therefore reduce thread-level parallelism.

This patch provides a simple heuristic tool to account for the combined interaction of register pressure and other single-threaded factors that affect performance. It does so by redefining the pre-existing brw_performance::throughput estimate as the number of invocations per cycle per EU that would be achieved if there were enough threads to reach full load. This is to be considered a heuristic, since the penalty from VRT may be lower than this model predicts at low EU load.

This will be used e.g. to decide whether to use a more aggressive latency-minimizing mode during scheduling or a mode more effective at minimizing register pressure: it makes sense to take the path that leads to the most invocations being serviced per cycle while under load. It also allows us to re-enable the old PS SIMD32 heuristic on xe3+, which with this change is able to identify cases where the combined effect of poorer scheduling and higher GRF use of the SIMD32 variant makes it more favorable to use SIMD16 only (see last patch of the MR for details and numbers).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-12-24 11:00:11 +01:00 · 2025-08-05 00:31:08 -07:00 · 2025-08-05 00:31:08 -07:00 · 6eea9659db
commit 6eea9659db
parent 760437c4c4
2 changed files with 43 additions and 7 deletions
--- a/src/intel/compiler/brw_analysis.h
+++ b/src/intel/compiler/brw_analysis.h
@@ -550,13 +550,20 @@ struct brw_performance {

   /**
    * Estimate of the throughput of the whole program in
-    * invocations-per-cycle units.
+    * invocations-per-cycle-per-EU units.
    *
-    * Note that this might be lower than the ratio between the dispatch
-    * width of the program and its latency estimate in cases where
-    * performance doesn't scale without limits as a function of its thread
-    * parallelism, e.g. due to the existence of a bottleneck in a shared
-    * function.
+    * This gives the expected throughput of a whole EU, rather than
+    * that of a single thread, under the heuristic assumption that
+    * the EU is fully loaded.  This makes it possible to account for
+    * the reduction in parallelism that xe3+ EUs experience with
+    * increasing register use.  Earlier platforms use a fixed factor
+    * as EU thread count instead.
+    *
+    * Note that this number might be lower than expected from the
+    * reciprocal of the latency estimate in cases where performance
+    * doesn't scale without limits as a function of its thread
+    * parallelism, e.g. due to the existence of a bottleneck in a
+    * shared function.
    */
   float throughput;

--- a/src/intel/compiler/brw_analysis_performance.cpp
+++ b/src/intel/compiler/brw_analysis_performance.cpp
@@ -24,6 +24,7 @@
 #include "brw_eu.h"
 #include "brw_shader.h"
 #include "brw_cfg.h"
+#include <algorithm>

 namespace {
   /**
@@ -1000,6 +1001,33 @@ namespace {
      return 1.0 / busy;
   }

+   /**
+    * Calculate the number of threads of this program that can run
+    * concurrently in an EU based on the estimate of register pressure
+    * derived from liveness information (pre-RA) or on the actual
+    * number of GRFs used if available (post-RA).  Platforms prior to
+    * xe3 don't support VRT so we can just return the constant value
+    * from device info.
+    */
+   unsigned
+   calculate_threads_per_eu(const brw_shader *s)
+   {
+      if (s->devinfo->ver >= 30) {
+         unsigned grf_used = s->grf_used;
+
+         if (!grf_used) {
+            const brw_register_pressure &rp = s->regpressure_analysis.require();
+            const unsigned max_regs_live = *std::max_element(rp.regs_live_at_ip,
+               rp.regs_live_at_ip + s->cfg->total_instructions);
+            grf_used = DIV_ROUND_UP(max_regs_live, reg_unit(s->devinfo));
+         }
+
+         return 32 / MAX2(3, ptl_register_blocks(grf_used) + 1);
+      } else {
+         return s->devinfo->num_thread_per_eu;
+      }
+   }
+
   /**
    * Estimate the performance of the specified shader.
    */
@@ -1066,7 +1094,8 @@ namespace {
      }

      p.latency = elapsed;
-      p.throughput = dispatch_width * calculate_thread_throughput(st, elapsed);
+      p.throughput = dispatch_width * calculate_threads_per_eu(s) *
+                     calculate_thread_throughput(st, elapsed);
   }
 }
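
For readers who want to experiment with the model outside of the compiler, the following is a minimal standalone sketch of the trade-off this patch encodes. It is not the Mesa implementation. The names grf_blocks(), threads_per_eu() and eu_throughput() are hypothetical. grf_blocks() stands in for ptl_register_blocks() plus one and assumes GRFs are allocated in blocks of 32. The per-thread throughput is simplified to the reciprocal of the latency, whereas the real pass takes it from calculate_thread_throughput().

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: number of GRF blocks needed by a thread,
// assuming a block size of 32 GRFs (an assumption of this sketch).
static unsigned grf_blocks(unsigned grf_used)
{
   return (grf_used + 31) / 32;
}

// Threads that fit in one EU under the same formula as the patch:
// 32 blocks shared between threads, with a minimum of 3 blocks each.
static unsigned threads_per_eu(unsigned grf_used)
{
   return 32 / std::max(3u, grf_blocks(grf_used));
}

// Estimated invocations per cycle per EU at full load.  The
// single-thread throughput is approximated as 1 / latency here,
// which ignores shared-function bottlenecks.
static float eu_throughput(unsigned dispatch_width, unsigned grf_used,
                           float latency_cycles)
{
   assert(latency_cycles > 0.0f);
   return dispatch_width * threads_per_eu(grf_used) / latency_cycles;
}
```

With made-up inputs, comparing eu_throughput(16, 96, 1000.0f) against eu_throughput(32, 192, 1500.0f) shows how a SIMD32 variant that needs twice the registers and schedules worse can lose to SIMD16 under this model, because halving the thread count cancels the gain from the wider dispatch. These numbers are illustrative only and are not taken from the MR.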