The current implementation does nothing, since it has no side effects,
only a return value. By passing `x` as a reference we can mutate the
value before returning.
Fixes: df37c7ca74 ("brw: fix analysis dirtying with pulled constants")
CID: 1665293
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37263>
This extends the performance analysis pass used in previous
generations to make it more useful to deal with the performance
trade-off encountered on xe3 hardware as a result of VRT. VRT allows
the driver to request a per-thread GRF allocation different from the
128 GRFs that were typical in previous platforms, but this comes at
either a thread parallelism cost or benefit depending on the number of
GRF register blocks requested.
This makes a number of decisions more difficult for the compiler since
certain optimizations potentially trade off run-time in a thread
against the total number of threads that can run in parallel
(e.g. consider scheduling and how reordering an instruction to avoid a
stall can increase GRF use and therefore reduce thread-level
parallelism when trying to improve instruction-level parallelism).
This patch provides a simple heuristic tool to account for the
combined interaction of register pressure and other single-threaded
factors that affect performance. This is expressed with the
redefinition of the pre-existing brw_performance::throughput estimate
as the number of invocations per cycle per EU that would be achieved
if there were enough threads to reach full load (in this sense this is
to be considered a heuristic since the penalty from VRT may be lower
than expected from this model at low EU load).
This will be used e.g. in order to decide whether to use a more
aggressive latency-minimizing mode during scheduling or a mode more
effective at minimizing register pressure (it makes sense to take the
path that will lead to the most invocations being serviced per cycle
while under load). This also allows us to re-enable the old PS SIMD32
heuristic on xe3+, and due to this change it is able to identify cases
where the combined effect of poorer scheduling and higher GRF use of
the SIMD32 variant makes it more favorable to use SIMD16 only (see
last patch of the MR for details and numbers).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
We're already looking at this data to calculate the per-component
vars_from_vgrf[] and vgrf_from_vars[] mappings, so just record the
largest VGRF size while we're here. This will allow passes to size
arrays based on the actual size needed, rather than hardcoding some
fixed size. In many cases, MAX_VGRF_SIZE(devinfo) is larger than
necessary, because e.g. vec5 sparse sampling results aren't used.
Not hardcoding this means we can also temporarily handle very large
VGRFs which we know will be split eventually, without having to
increase the maximum which is ultimately used for RA classes.
Cc: mesa-stable
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34461>
Makes the intention of some comparisons clearer by using the named
helper functions. Add commentary when the straightforward range is not
the one used, e.g. VGRF interference.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34253>
Calculate the IP ranges of the shader as an analysis pass. This will
later replace the existing tracking of start_ip/end_ip as the blocks are
changed (and the defer/adjust scheme to avoid too much churn when that
happen).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34012>