intel/brw/xe3+: Select scheduler heuristic with best trade-off between register pressure and latency.

The current register allocation loop attempts to use a sequence of
pre-RA scheduling heuristics until register allocation is successful.
The sequence of scheduling heuristics is expected to be increasingly
aggressive at reducing the register pressure of the program (at a
performance cost), so that the instruction ordering chosen gives the
lowest latency achievable with the register space available.
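
In rough outline, the pre-existing strategy looks like this (a
simplified sketch reusing helpers visible in the diff below, not the
verbatim source):

   bool allocated = false;
   for (unsigned i = 0; i < ARRAY_SIZE(pre_modes) && !allocated; i++) {
      /* Reschedule with the next (more pressure-reducing) heuristic
       * and keep the first ordering that register-allocates. */
      brw_schedule_instructions_pre_ra(s, sched, pre_modes[i]);
      allocated = brw_assign_regs(s, false, spill_all);
      if (!allocated)
         restore_instruction_order(s, orig_order);
   }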

Unfortunately, that approach doesn't consistently give the best
performance on xe3+: on recent platforms a schedule with higher
latency may actually perform better if its lower register pressure
allows the use of fewer VRT register blocks, which in turn allows the
EU to run more threads in parallel.
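
A toy example with invented numbers (the real VRT block granularity
and per-EU thread limits are hardware-specific): suppose ordering A
has an estimated latency of 100 cycles but its register pressure
needs enough VRT blocks that only 8 threads fit per EU, while
ordering B takes 110 cycles but fits 10 threads.  B's estimated
throughput of 10/110 ≈ 0.091 threads/cycle beats A's 8/100 = 0.080,
so B is the faster choice despite its higher latency.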

This means that on xe3+ the scheduling mode with the highest
performance is fundamentally dependent on the specific scenario (in
particular, on where the program sits on the thread count vs.
register use curve, and on how effective the scheduler heuristics are
at reducing latency for each additional block of GRFs used), so it
isn't possible to construct a fixed sequence of the existing
heuristics guaranteed to be ordered by decreasing performance.  In
order to find the scheduling heuristic with the best performance we
have to run several of them prior to register allocation and do some
arithmetic to account for the effect on parallelism of the register
pressure estimated in each case, as sketched below.
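
Schematically, the selection pass amounts to the following condensed
sketch (see the full hunk below; brw_performance::throughput is
assumed to already fold the pressure-derived thread parallelism into
its estimate):

   float best_perf = 0;
   unsigned best_perf_idx = 0;
   for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
      /* Schedule with each candidate heuristic and keep the ordering
       * with the highest estimated throughput. */
      brw_schedule_instructions_pre_ra(s, sched, pre_modes[i]);
      orders[i] = save_instruction_order(s.cfg);
      const brw_performance &perf = s.performance_analysis.require();
      if (perf.throughput > best_perf) {
         best_perf = perf.throughput;
         best_perf_idx = i;
      }
      restore_instruction_order(s, orig_order);
   }
   restore_instruction_order(s, orders[best_perf_idx]);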

This sounds costly, but it is similar to the approach already taken
by brw_allocate_registers() when it is unable to allocate without
spills and has to decide which scheduling heuristic minimizes the
number of spills.  When that happens on xe3+, the scheduling runs
introduced here don't add to the runs done to find the heuristic with
minimum register pressure: we determine the heuristic with the lowest
pressure and the one with the best performance in the same loop, and
then use one or the other depending on whether register allocation
succeeds without spills.

Significantly improves performance on PTL for the following Traci
test cases (4 iterations, 5% significance):

Nba2K23-trace-dx11-2160p-ultra:                     4.48% ±0.38%
Fortnite-trace-dx11-2160p-epix:                     1.61% ±0.28%
Superposition-trace-dx11-2160p-extreme:             1.37% ±0.26%
PubG-trace-dx11-1440p-ultra:                        1.15% ±0.29%
GtaV-trace-dx11-2160p-ultra:                        0.80% ±0.24%
CitiesSkylines2-trace-dx11-1440p-high:              0.68% ±0.19%
SpaceEngineers-trace-dx11-2160p-high:               0.65% ±0.34%

The compile time of shader-db increases significantly, by 3.7%, after
this commit (15 iterations, 5% significance); the compile time of
fossil-db doesn't change significantly in my setup.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
@@ -1107,7 +1107,9 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
    };
 
    uint32_t best_register_pressure = UINT32_MAX;
-   enum brw_instruction_scheduler_mode best_sched = BRW_SCHEDULE_NONE;
+   float best_perf = 0;
+   unsigned best_press_idx = 0;
+   unsigned best_perf_idx = 0;
 
    brw_opt_compact_virtual_grfs(s);
@@ -1123,11 +1125,62 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
     * prevent dependencies between the different scheduling modes.
     */
    brw_inst **orig_order = save_instruction_order(s.cfg);
-   brw_inst **best_pressure_order = NULL;
+   brw_inst **orders[ARRAY_SIZE(pre_modes)] = {};
 
    void *scheduler_ctx = ralloc_context(NULL);
    brw_instruction_scheduler *sched = brw_prepare_scheduler(s, scheduler_ctx);
 
+   /* Try each scheduling heuristic to choose the one with the best
+    * trade-off between latency and register pressure, which on xe3+
+    * is dependent on the thread parallelism that can be achieved at
+    * the GRF register requirement of each ordering of the program
+    * (note that the register requirement of the program can only be
+    * estimated at this point prior to register allocation).
+    */
+   for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
+      enum brw_instruction_scheduler_mode sched_mode = pre_modes[i];
+
+      /* Only use the PRE heuristic on pre-xe3 platforms during the
+       * first pass, since the trade-off between EU thread count and
+       * GRF use isn't a concern on platforms that don't support VRT.
+       */
+      if (devinfo->ver < 30 && sched_mode != BRW_SCHEDULE_PRE)
+         continue;
+
+      /* These don't appear to provide much benefit on xe3+. */
+      if (devinfo->ver >= 30 && (sched_mode == BRW_SCHEDULE_PRE_LIFO ||
+                                 sched_mode == BRW_SCHEDULE_NONE))
+         continue;
+
+      brw_schedule_instructions_pre_ra(s, sched, sched_mode);
+      s.shader_stats.scheduler_mode = scheduler_mode_name[sched_mode];
+      s.debug_optimizer(nir, s.shader_stats.scheduler_mode, 95, i);
+      orders[i] = save_instruction_order(s.cfg);
+
+      const unsigned press = brw_compute_max_register_pressure(s);
+      if (press < best_register_pressure) {
+         best_register_pressure = press;
+         best_press_idx = i;
+      }
+
+      const brw_performance &perf = s.performance_analysis.require();
+      if (perf.throughput > best_perf) {
+         best_perf = perf.throughput;
+         best_perf_idx = i;
+      }
+
+      if (i + 1 < ARRAY_SIZE(pre_modes)) {
+         /* Reset back to the original order before trying the next mode */
+         restore_instruction_order(s, orig_order);
+      }
+   }
+
+   restore_instruction_order(s, orders[best_perf_idx]);
+   s.shader_stats.scheduler_mode = scheduler_mode_name[pre_modes[best_perf_idx]];
+
+   allocated = brw_assign_regs(s, false, spill_all);
+
+   if (!allocated) {
       /* Try each scheduling heuristic to see if it can successfully register
       * allocate without spilling.  They should be ordered by decreasing
       * performance but increasing likelihood of allocating.
@@ -1135,19 +1188,34 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
       for (unsigned i = 0; i < ARRAY_SIZE(pre_modes); i++) {
          enum brw_instruction_scheduler_mode sched_mode = pre_modes[i];
 
-         if (devinfo->ver < 30 && sched_mode == BRW_SCHEDULE_PRE_LATENCY)
+         /* The latency-sensitive heuristic is unlikely to be helpful
+          * if we failed to register-allocate.
+          */
+         if (sched_mode == BRW_SCHEDULE_PRE_LATENCY)
             continue;
 
+         /* Already tried to register-allocate this. */
+         if (i == best_perf_idx)
+            continue;
+
+         if (orders[i]) {
+            /* We already scheduled the program with this mode. */
+            restore_instruction_order(s, orders[i]);
+         } else {
+            restore_instruction_order(s, orig_order);
+            brw_schedule_instructions_pre_ra(s, sched, sched_mode);
+            s.shader_stats.scheduler_mode = scheduler_mode_name[sched_mode];
+            s.debug_optimizer(nir, s.shader_stats.scheduler_mode, 95, i);
+            orders[i] = save_instruction_order(s.cfg);
+
+            const unsigned press = brw_compute_max_register_pressure(s);
+            if (press < best_register_pressure) {
+               best_register_pressure = press;
+               best_press_idx = i;
+            }
+         }
 
-         brw_schedule_instructions_pre_ra(s, sched, sched_mode);
-
-         if (0) {
-            brw_assign_regs_trivial(s);
-            allocated = true;
-            break;
-         }
-
          s.shader_stats.scheduler_mode = scheduler_mode_name[sched_mode];
 
          /* We should only spill registers on the last scheduling. */
          assert(!s.spilled_any_registers);
@@ -1155,24 +1223,7 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
          allocated = brw_assign_regs(s, false, spill_all);
          if (allocated)
             break;
-
-         /* Save the maximum register pressure */
-         uint32_t this_pressure = brw_compute_max_register_pressure(s);
-
-         if (0) {
-            fprintf(stderr, "Scheduler mode \"%s\" spilled, max pressure = %u\n",
-                    scheduler_mode_name[sched_mode], this_pressure);
-         }
-
-         if (this_pressure < best_register_pressure) {
-            best_register_pressure = this_pressure;
-            best_sched = sched_mode;
-            delete[] best_pressure_order;
-            best_pressure_order = save_instruction_order(s.cfg);
-         }
-
-         /* Reset back to the original order before trying the next mode */
-         restore_instruction_order(s, orig_order);
       }
 
    ralloc_free(scheduler_ctx);
@@ -1180,16 +1231,17 @@ brw_allocate_registers(brw_shader &s, bool allow_spilling)
    if (!allocated) {
      if (0) {
          fprintf(stderr, "Spilling - using lowest-pressure mode \"%s\"\n",
-                 scheduler_mode_name[best_sched]);
+                 scheduler_mode_name[pre_modes[best_press_idx]]);
      }
 
-      restore_instruction_order(s, best_pressure_order);
-      s.shader_stats.scheduler_mode = scheduler_mode_name[best_sched];
+      restore_instruction_order(s, orders[best_press_idx]);
+      s.shader_stats.scheduler_mode = scheduler_mode_name[pre_modes[best_press_idx]];
 
      allocated = brw_assign_regs(s, allow_spilling, spill_all);
    }
 
    delete[] orig_order;
-   delete[] best_pressure_order;
+   for (unsigned i = 0; i < ARRAY_SIZE(orders); i++)
+      delete[] orders[i];
 
    if (!allocated) {
      s.fail("Failure to register allocate. Reduce number of "