intel/brw/xe3+: Model trade-off between parallelism and GRF use in performance analysis.

This extends the performance analysis pass used in previous
generations so that it can model the performance trade-off introduced
by VRT on xe3 hardware.  VRT allows the driver to request a per-thread
GRF allocation different from the 128 GRFs that were typical in
previous platforms, but this comes at either a thread parallelism cost
or benefit depending on the number of GRF register blocks requested.

This makes a number of decisions more difficult for the compiler since
certain optimizations potentially trade off run-time in a thread
against the total number of threads that can run in parallel
(e.g. consider scheduling and how reordering an instruction to avoid a
stall can increase GRF use and therefore reduce thread-level
parallelism when trying to improve instruction-level parallelism).

This patch provides a simple heuristic tool to account for the
combined interaction of register pressure and other single-threaded
factors that affect performance.  It does so by redefining the
pre-existing brw_performance::throughput estimate as the number of
invocations per cycle per EU that would be achieved if there were
enough threads to reach full load.  This is only a heuristic, since
the penalty from VRT may be lower than the model predicts at low EU
load.

This will be used, for example, to decide whether to use a more
aggressive latency-minimizing mode during scheduling or a mode more
effective at minimizing register pressure (it makes sense to take the
path that will lead to the most invocations being serviced per cycle
while under load).  This also allows us to re-enable the old PS SIMD32
heuristic on xe3+, which with this change is able to identify cases
where the combined effect of poorer scheduling and higher GRF use of
the SIMD32 variant makes it more favorable to use SIMD16 only (see
last patch of the MR for details and numbers).
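
The decision described above can be illustrated with a minimal
standalone sketch.  It is not the code from this MR: the Candidate
struct, estimate_throughput() and pick_best() are hypothetical names,
and the thread counts and cycle estimates used in the usage note are
made up.  It only shows how a full-load, per-EU throughput figure lets
two variants with different latency and GRF use be compared directly.

```cpp
#include <cassert>

// Hypothetical stand-in for one compiled variant of a shader.
// None of these names exist in Mesa; they are for illustration only.
struct Candidate {
   const char *name;
   unsigned dispatch_width;   // invocations handled by one thread
   unsigned threads_per_eu;   // threads resident in an EU at full load
   float cycles_per_thread;   // estimated busy cycles of one thread
};

// Invocations per cycle per EU, assuming the EU is fully loaded.
static float
estimate_throughput(const Candidate &c)
{
   assert(c.cycles_per_thread > 0.0f);
   return c.dispatch_width * c.threads_per_eu / c.cycles_per_thread;
}

// Prefer the variant that services the most invocations per cycle.
static const Candidate &
pick_best(const Candidate &a, const Candidate &b)
{
   return estimate_throughput(a) >= estimate_throughput(b) ? a : b;
}
```

With made-up inputs such as { "latency-minimizing", 16, 6, 900.0f }
and { "pressure-minimizing", 16, 10, 1200.0f }, pick_best() returns
the second variant: it is slower per thread, but the extra resident
threads more than make up for it under this model.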

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Francisco Jerez 2025-08-05 00:31:08 -07:00 committed by Marge Bot
parent 760437c4c4
commit 6eea9659db
2 changed files with 43 additions and 7 deletions

@@ -550,13 +550,20 @@ struct brw_performance {
    /**
     * Estimate of the throughput of the whole program in
-    * invocations-per-cycle units.
+    * invocations-per-cycle-per-EU units.
     *
-    * Note that this might be lower than the ratio between the dispatch
-    * width of the program and its latency estimate in cases where
-    * performance doesn't scale without limits as a function of its thread
-    * parallelism, e.g. due to the existence of a bottleneck in a shared
-    * function.
+    * This gives the expected throughput of a whole EU under the
+    * heuristic assumption that it is fully loaded instead of the
+    * throughput of a single thread, this is in order to be able to
+    * account for the reduction in parallelism that xe3+ EUs
+    * experience with increasing register use. Earlier platforms use
+    * a fixed factor as EU thread count instead.
+    *
+    * Note that this number might be lower than expected from the
+    * reciprocal of the latency estimate in cases where performance
+    * doesn't scale without limits as a function of its thread
+    * parallelism, e.g. due to the existence of a bottleneck in a
+    * shared function.
     */
    float throughput;

@@ -24,6 +24,7 @@
 #include "brw_eu.h"
 #include "brw_shader.h"
 #include "brw_cfg.h"
+#include <algorithm>
 namespace {
    /**
@@ -1000,6 +1001,33 @@ namespace {
       return 1.0 / busy;
    }
+   /**
+    * Calculate the number of threads of this program that can run
+    * concurrently in an EU based on the estimate of register pressure
+    * derived from liveness information (pre-RA) or on the actual
+    * number of GRFs used if available (post-RA). Platforms prior to
+    * xe3 don't support VRT so we can just return the constant value
+    * from device info.
+    */
+   unsigned
+   calculate_threads_per_eu(const brw_shader *s)
+   {
+      if (s->devinfo->ver >= 30) {
+         unsigned grf_used = s->grf_used;
+         if (!grf_used) {
+            const brw_register_pressure &rp = s->regpressure_analysis.require();
+            const unsigned max_regs_live = *std::max_element(rp.regs_live_at_ip,
+               rp.regs_live_at_ip + s->cfg->total_instructions);
+            grf_used = DIV_ROUND_UP(max_regs_live, reg_unit(s->devinfo));
+         }
+         return 32 / MAX2(3, ptl_register_blocks(grf_used) + 1);
+      } else {
+         return s->devinfo->num_thread_per_eu;
+      }
+   }
    /**
     * Estimate the performance of the specified shader.
     */
@@ -1066,7 +1094,8 @@ namespace {
       }
       p.latency = elapsed;
-      p.throughput = dispatch_width * calculate_thread_throughput(st, elapsed);
+      p.throughput = dispatch_width * calculate_threads_per_eu(s) *
+                     calculate_thread_throughput(st, elapsed);
    }
 }
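
To get a feel for the thread-count expression used in
calculate_threads_per_eu(), the following standalone sketch evaluates
it for a few GRF counts.  The helper register_blocks_sketch() is a
hypothetical stand-in for ptl_register_blocks(), whose definition is
not part of this diff: it assumes blocks of 32 GRFs encoded as the
block count minus one, with anything above 192 GRFs encoded as 7.  If
the real helper differs, the numbers below change accordingly.

```cpp
#include <algorithm>
#include <cassert>

// ASSUMPTION: stand-in for ptl_register_blocks(), which is not shown in
// this diff.  Assumes 32-GRF blocks encoded as (block count - 1), with
// any allocation above 192 GRFs encoded as 7.
static unsigned
register_blocks_sketch(unsigned grf_used)
{
   assert(grf_used >= 1 && grf_used <= 256);
   const unsigned n = (grf_used + 31) / 32 - 1;
   return n < 6 ? n : 7;
}

// Same expression as the xe3+ branch of calculate_threads_per_eu():
// 32 / MAX2(3, blocks + 1), using integer division.
static unsigned
threads_per_eu_sketch(unsigned grf_used)
{
   return 32 / std::max(3u, register_blocks_sketch(grf_used) + 1);
}
```

Under that assumption, a shader using up to 96 GRFs gets 10 threads
per EU, 128 GRFs gives 8, 160 gives 6, 192 gives 5, and anything
larger gives 4.  The MAX2(3, ...) clamp is what keeps allocations
below 96 GRFs from yielding more than 10 threads.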