intel/brw/xe3+: Select scheduler heuristic with best trade-off between register pressure and latency.

The current register allocation loop attempts to use a sequence of
pre-RA scheduling heuristics until register allocation is successful.
The sequence of scheduling heuristics is expected to be increasingly
aggressive at reducing the register pressure of the program (at a
performance cost), so that the instruction ordering chosen gives the
lowest latency achievable with the register space available.
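
In rough outline, the pre-existing strategy looks like this (a
simplified sketch reusing helpers visible in the diff below, not the
verbatim source):

   bool allocated = false;
   for (unsigned i = 0; i < ARRAY_SIZE(pre_modes) && !allocated; i++) {
      /* Reschedule with the next (more pressure-reducing) heuristic
       * and keep the first ordering that register-allocates. */
      brw_schedule_instructions_pre_ra(s, sched, pre_modes[i]);
      allocated = brw_assign_regs(s, false, spill_all);
      if (!allocated)
         restore_instruction_order(s, orig_order);
   }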

Unfortunately, that approach doesn't consistently give the best
performance on xe3+: on recent platforms a schedule with higher
latency may actually perform better if its lower register pressure
allows the use of fewer VRT register blocks, which in turn allows the
EU to run more threads in parallel.
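
A toy example with invented numbers (the real VRT block granularity
and per-EU thread limits are hardware-specific): suppose ordering A
has an estimated latency of 100 cycles but its register pressure
needs enough VRT blocks that only 8 threads fit per EU, while
ordering B takes 110 cycles but fits 10 threads.  B's estimated
throughput of 10/110 ≈ 0.091 threads/cycle beats A's 8/100 = 0.080,
so B is the faster choice despite its higher latency.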

This means that on xe3+ the scheduling mode with the highest
performance is fundamentally dependent on the specific scenario (in
particular, on where the program sits on the thread count vs.
register use curve, and on how effective the scheduler heuristics are
at reducing latency for each additional block of GRFs used), so it
isn't possible to construct a fixed sequence of the existing
heuristics guaranteed to be ordered by decreasing performance.  In
order to find the scheduling heuristic with the best performance we
have to run several of them prior to register allocation and do some
arithmetic to account for the effect on parallelism of the register
pressure estimated in each case, as sketched below.
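
Schematically, the selection pass amounts to the following condensed
sketch (see the full hunk below; brw_performance::throughput is
assumed to already fold the pressure-derived thread parallelism into
its estimate):

   float best_perf = 0;
   unsigned best_perf_idx = 0;
   for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
      /* Schedule with each candidate heuristic and keep the ordering
       * with the highest estimated throughput. */
      brw_schedule_instructions_pre_ra(s, sched, pre_modes[i]);
      orders[i] = save_instruction_order(s.cfg);
      const brw_performance &perf = s.performance_analysis.require();
      if (perf.throughput > best_perf) {
         best_perf = perf.throughput;
         best_perf_idx = i;
      }
      restore_instruction_order(s, orig_order);
   }
   restore_instruction_order(s, orders[best_perf_idx]);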

This sounds costly, but it is similar to the approach already taken
by brw_allocate_registers() when it is unable to allocate without
spills and has to decide which scheduling heuristic minimizes the
number of spills.  When that happens on xe3+, the scheduling runs
introduced here don't add to the runs done to find the heuristic with
minimum register pressure: we determine the heuristic with the lowest
pressure and the one with the best performance in the same loop, and
then use one or the other depending on whether register allocation
succeeds without spills.

Significantly improves performance on PTL for the following Traci
test cases (4 iterations, 5% significance):

Nba2K23-trace-dx11-2160p-ultra:                     4.48% ±0.38%
Fortnite-trace-dx11-2160p-epix:                     1.61% ±0.28%
Superposition-trace-dx11-2160p-extreme:             1.37% ±0.26%
PubG-trace-dx11-1440p-ultra:                        1.15% ±0.29%
GtaV-trace-dx11-2160p-ultra:                        0.80% ±0.24%
CitiesSkylines2-trace-dx11-1440p-high:              0.68% ±0.19%
SpaceEngineers-trace-dx11-2160p-high:               0.65% ±0.34%

The compile time of shader-db increases significantly, by 3.7%, after
this commit (15 iterations, 5% significance); the compile time of
fossil-db doesn't change significantly in my setup.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
@@ -1107,7 +1107,9 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
    };
 
    uint32_t best_register_pressure = UINT32_MAX;
-   enum brw_instruction_scheduler_mode best_sched = BRW_SCHEDULE_NONE;
+   float best_perf = 0;
+   unsigned best_press_idx = 0;
+   unsigned best_perf_idx = 0;
 
    brw_opt_compact_virtual_grfs(s);
@@ -1123,11 +1125,62 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
     * prevent dependencies between the different scheduling modes.
     */
    brw_inst **orig_order = save_instruction_order(s.cfg);
-   brw_inst **best_pressure_order = NULL;
+   brw_inst **orders[ARRAY_SIZE(pre_modes)] = {};
 
    void *scheduler_ctx = ralloc_context(NULL);
    brw_instruction_scheduler *sched = brw_prepare_scheduler(s, scheduler_ctx);
 
+   /* Try each scheduling heuristic to choose the one with the best
+    * trade-off between latency and register pressure, which on xe3+
+    * is dependent on the thread parallelism that can be achieved at
+    * the GRF register requirement of each ordering of the program
+    * (note that the register requirement of the program can only be
+    * estimated at this point prior to register allocation).
+    */
+   for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
+      enum brw_instruction_scheduler_mode sched_mode = pre_modes[i];
+
+      /* Only use the PRE heuristic on pre-xe3 platforms during the
+       * first pass, since the trade-off between EU thread count and
+       * GRF use isn't a concern on platforms that don't support VRT.
+       */
+      if (devinfo->ver < 30 && sched_mode != BRW_SCHEDULE_PRE)
+         continue;
+
+      /* These don't appear to provide much benefit on xe3+. */
+      if (devinfo->ver >= 30 && (sched_mode == BRW_SCHEDULE_PRE_LIFO ||
+                                 sched_mode == BRW_SCHEDULE_NONE))
+         continue;
+
+      brw_schedule_instructions_pre_ra(s, sched, sched_mode);
+      s.shader_stats.scheduler_mode = scheduler_mode_name[sched_mode];
+      s.debug_optimizer(nir, s.shader_stats.scheduler_mode, 95, i);
+      orders[i] = save_instruction_order(s.cfg);
+
+      const unsigned press = brw_compute_max_register_pressure(s);
+      if (press < best_register_pressure) {
+         best_register_pressure = press;
+         best_press_idx = i;
+      }
+
+      const brw_performance &perf = s.performance_analysis.require();
+      if (perf.throughput > best_perf) {
+         best_perf = perf.throughput;
+         best_perf_idx = i;
+      }
+
+      if (i + 1 < ARRAY_SIZE(pre_modes)) {
+         /* Reset back to the original order before trying the next mode */
+         restore_instruction_order(s, orig_order);
+      }
+   }
+
+   restore_instruction_order(s, orders[best_perf_idx]);
+   s.shader_stats.scheduler_mode = scheduler_mode_name[pre_modes[best_perf_idx]];
+
+   allocated = brw_assign_regs(s, false, spill_all);
+
+   if (!allocated) {
       /* Try each scheduling heuristic to see if it can successfully register
       * allocate without spilling.  They should be ordered by decreasing
       * performance but increasing likelihood of allocating.
@@ -1135,19 +1188,34 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
       for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
          enum brw_instruction_scheduler_mode sched_mode = pre_modes[i];
 
-         if (devinfo->ver < 30 && sched_mode == BRW_SCHEDULE_PRE_LATENCY)
+         /* The latency-sensitive heuristic is unlikely to be helpful
+          * if we failed to register-allocate.
+          */
+         if (sched_mode == BRW_SCHEDULE_PRE_LATENCY)
             continue;
 
+         /* Already tried to register-allocate this. */
+         if (i == best_perf_idx)
+            continue;
+
+         if (orders[i]) {
+            /* We already scheduled the program with this mode. */
+            restore_instruction_order(s, orders[i]);
+         } else {
+            restore_instruction_order(s, orig_order);
+            brw_schedule_instructions_pre_ra(s, sched, sched_mode);
+            s.shader_stats.scheduler_mode = scheduler_mode_name[sched_mode];
+            s.debug_optimizer(nir, s.shader_stats.scheduler_mode, 95, i);
+            orders[i] = save_instruction_order(s.cfg);
+
+            const unsigned press = brw_compute_max_register_pressure(s);
+            if (press < best_register_pressure) {
+               best_register_pressure = press;
+               best_press_idx = i;
+            }
+         }
 
-         brw_schedule_instructions_pre_ra(s, sched, sched_mode);
-
-         if (0) {
-            brw_assign_regs_trivial(s);
-            allocated = true;
-            break;
-         }
-
          s.shader_stats.scheduler_mode = scheduler_mode_name[sched_mode];
 
          /* We should only spill registers on the last scheduling. */
          assert(!s.spilled_any_registers);
@@ -1155,24 +1223,7 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
          allocated = brw_assign_regs(s, false, spill_all);
          if (allocated)
             break;
-
-         /* Save the maximum register pressure */
-         uint32_t this_pressure = brw_compute_max_register_pressure(s);
-
-         if (0) {
-            fprintf(stderr, "Scheduler mode \"%s\" spilled, max pressure = %u\n",
-                    scheduler_mode_name[sched_mode], this_pressure);
-         }
-
-         if (this_pressure < best_register_pressure) {
-            best_register_pressure = this_pressure;
-            best_sched = sched_mode;
-            delete[] best_pressure_order;
-            best_pressure_order = save_instruction_order(s.cfg);
-         }
-
-         /* Reset back to the original order before trying the next mode */
-         restore_instruction_order(s, orig_order);
       }
 
    ralloc_free(scheduler_ctx);
@@ -1180,16 +1231,17 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
    if (!allocated) {
      if (0) {
          fprintf(stderr, "Spilling - using lowest-pressure mode \"%s\"\n",
-                 scheduler_mode_name[best_sched]);
+                 scheduler_mode_name[pre_modes[best_press_idx]]);
      }
 
-      restore_instruction_order(s, best_pressure_order);
-      s.shader_stats.scheduler_mode = scheduler_mode_name[best_sched];
+      restore_instruction_order(s, orders[best_press_idx]);
+      s.shader_stats.scheduler_mode = scheduler_mode_name[pre_modes[best_press_idx]];
 
      allocated = brw_assign_regs(s, allow_spilling, spill_all);
    }
 
    delete[] orig_order;
-   delete[] best_pressure_order;
+   for (unsigned i = 0; i < ARRAY_SIZE(orders); i++)
+      delete[] orders[i];
 
    if (!allocated) {
      s.fail("Failure to register allocate. Reduce number of "