On my desktop system, this gets compilation of (a subset of) 32bits Mesa
from 2.5 minutes down to 15 seconds - and most of what left is linking.
Also show that absolute paths are not required.
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31945>
These have been tested only with x86_64->i686 cross-compilation but I
don't see why they couldn't work with other architectures if/when Linux
distributions support them. For at least x86_64->i686 cross-compilation,
these commands save an enormous amount of time.
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31945>
After optimization happen, if the sources are still in one or two
contigous spans for some reason (e.g. some data read from memory
now being written), it is beneficial to just use regular SEND
and avoid having to set the ARF scalar instruction.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32410>
Add an optimization pass to turn regular SENDs into SEND_GATHERs.
This allows the payload to be "broken" into smaller pieces that
can be further optimized, which _may_ result in
- less register pressure (no need to contiguous space), and
- less instructions (no need to MOV to such space).
For debugging, the INTEL_DEBUG=no-send-gather option skips this
optimization, and reporting how many opportunities were missed.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32410>
Starting in Xe3, there's a variant of SEND that take the
register numbers from the ARF scalar register, and don't
require them to be contiguous. The new opcode added here
represents that kind of SEND.
To make the original sources still reachable, we keep them
around during the IR, just ignoring them at generator time.
This allow software scoreboard to properly reason the
dependencies without trying to decode the contents of ARF
scalar register being used.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32410>
Xe3 adds a new pipe that handles *only* MOVs from immediate into the
scalar register.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Lionel Landwerlin <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32410>
Defaults to true. When set to false Iris and various tools can be
built without ELK support. In both cases this means supporting
only Gfx9+. This option must be true to build Crocus or Hasvk.
This allows skipping re-building ELK when developing for newer platforms
with tools/tests enabled.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11575
Reviewed-by: Daniel Stone <daniels@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33054>
Isolate the BRW/ELK differences in a single place. The way is done now,
we are not reusing the isa_info between calls. For the tools here this
is probably fine, if its someday this gets in the way, we can add an
opaque pointer to store the right data.
This intentionally is not used in Iris, since there the driver need more
detailed view into BRW/ELK and we don't want to create an all
encompassing abstraction for that.
Reviewed-by: Daniel Stone <daniels@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33054>
Since the internal dependency object exists and is already
used in some cases, let's be consistent.
Reviewed-by: Daniel Stone <daniels@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33054>
v2: Use MOV and wrap in conditional during BTD spawn header setup
(Lionel). Remove references to SIMD8 (Tapani).
v3: Update brw_bsr() to specify number of registers per thread, don't
initialize Registers Per Thread on BTD spawn header (Lionel).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This is similar to what we used to do on pre-SNB platforms, the number
of GRF registers used by the shader will be used on Xe3+ to adjust the
trade-off between thread-level parallelism and size of the GRF file.
Plumb the value through prog_data so the driver can set up the
hardware state accordingly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This is similar in principle to the previous commit "intel/brw/xe3+:
brw_compile_fs() implementation for Xe3+." but applied to compute-like
shader stages. It changes the implementation of brw_compile_cs/task/mesh()
to reduce compile time and take advantage of wider dispatch modes more
aggressively than the original logic, since as of Xe3 SIMD32 builds
succeed without spills in most cases thanks to VRT.
The new "optimistic" SIMD selection logic starts with the SIMD width
that is potentially highest performance and only compiles additional
narrower variants if that fails (typically due to spilling), while the
old "pessimistic" logic did the opposite: It started with the
narrowest SIMD width and compiled additional variants with increasing
register pressure until one of them failed to compile.
In typical non-spilling cases where we formerly compiled SIMD16 and
SIMD32 variants of the same compute shader, this change will halve the
number of backend compilations required to build it.
XXX - Possibly don't do this in cases with variable workgroup size
until effect on runtime performance can be measured directly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: Don't do this for now in cases with variable workgroup size, still
compile every possible variant in such cases.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This reworks the implementation of brw_compile_fs() to reduce compile
time and take advantage of wider dispatch modes more aggressively than
the original logic.
The new "optimistic" PS compilation logic starts with the SIMD width
that is potentially highest performance and only compiles additional
narrower variants if that fails (typically due to spilling or hardware
restrictions), while the old "pessimistic" logic did the opposite: It
started with the narrowest SIMD width and compiled additional variants
with increasing register pressure until one of them failed to compile.
The main disadvantage of this is that selectively throwing away some
of the compiled variants based on the static analysis of their
performance behavior will no longer be possible, however this is
expected to be less useful on Xe3+ since the GRF space allocated to a
thread can be scaled up or down, which leads to less dramatic
differences in scheduling between SIMD variants.
In typical non-spilling cases where we formerly compiled SIMD16 and
SIMD32 variants of the same fragment shader, this change will halve
the number of backend compilations required to build a shader. With
multi-polygon PS dispatch enabled (which is disabled by default right
now) this has an even more dramatic effect since the number of
compiler iterations can be reduced down to a fifth in the best case
scenario.
Even though in most cases we will only attempt to return a single
binary from the pixel shader compilation, the hardware allows a pair
of PS kernels to be specified, and we'll still take advantage of this
when the multi-polygon PS kernel has the potential to have worse
performance than the single-polygon shader because only the latter
register-allocates successfully at SIMD32 -- Only in such case
(SIMD2x8 multi-polygon, SIMD32 single-polygon) we'll continue
programming both so the hardware will chose one or the other at
runtime depending on the SIMD fullness and number of polygons it can
buffer at runtime.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This avoids running the optimizer uselessly if compilation of the
current kernel failed due to some hardware (e.g. SIMD-width)
restriction. This isn't only inefficient but it can break assumptions
throughout the compiler which would lead to crashes on Xe3 when this
arises during translation from NIR.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
Xe3+ benefits from packing register allocations tightly in order to
make optimal use of the GRF space. The round-robin heuristic
previously in use often causes the whole GRF space to be used even if
register pressure is substantially lower, which would severely
decrease thread-level parallelism on Xe3+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
Xe3 supports 32 SBID tokens per thread regardless of the number of
register blocks allocated per thread. Take advantage of the increased
number of SBIDs in the scoreboard pass to reduce the frequency of
false dependencies on Xe3+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
These don't seem to be disallowed by recent hardware anymore. Stop
disabling SIMD32 due to hardware restrictions of multisample
rasterization, since it should have better performance, and on Xe3+
there may be no shader variant available other than SIMD32.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>