v2: Use MOV and wrap in conditional during BTD spawn header setup
(Lionel). Remove references to SIMD8 (Tapani).
v3: Update brw_bsr() to specify number of registers per thread, don't
initialize Registers Per Thread on BTD spawn header (Lionel).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This is similar to what we used to do on pre-SNB platforms, the number
of GRF registers used by the shader will be used on Xe3+ to adjust the
trade-off between thread-level parallelism and size of the GRF file.
Plumb the value through prog_data so the driver can set up the
hardware state accordingly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This is similar in principle to the previous commit "intel/brw/xe3+:
brw_compile_fs() implementation for Xe3+." but applied to compute-like
shader stages. It changes the implementation of brw_compile_cs/task/mesh()
to reduce compile time and take advantage of wider dispatch modes more
aggressively than the original logic, since as of Xe3 SIMD32 builds
succeed without spills in most cases thanks to VRT.
The new "optimistic" SIMD selection logic starts with the SIMD width
that is potentially highest performance and only compiles additional
narrower variants if that fails (typically due to spilling), while the
old "pessimistic" logic did the opposite: It started with the
narrowest SIMD width and compiled additional variants with increasing
register pressure until one of them failed to compile.
In typical non-spilling cases where we formerly compiled SIMD16 and
SIMD32 variants of the same compute shader, this change will halve the
number of backend compilations required to build it.
XXX - Possibly don't do this in cases with variable workgroup size
until effect on runtime performance can be measured directly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: Don't do this for now in cases with variable workgroup size, still
compile every possible variant in such cases.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This reworks the implementation of brw_compile_fs() to reduce compile
time and take advantage of wider dispatch modes more aggressively than
the original logic.
The new "optimistic" PS compilation logic starts with the SIMD width
that is potentially highest performance and only compiles additional
narrower variants if that fails (typically due to spilling or hardware
restrictions), while the old "pessimistic" logic did the opposite: It
started with the narrowest SIMD width and compiled additional variants
with increasing register pressure until one of them failed to compile.
The main disadvantage of this is that selectively throwing away some
of the compiled variants based on the static analysis of their
performance behavior will no longer be possible, however this is
expected to be less useful on Xe3+ since the GRF space allocated to a
thread can be scaled up or down, which leads to less dramatic
differences in scheduling between SIMD variants.
In typical non-spilling cases where we formerly compiled SIMD16 and
SIMD32 variants of the same fragment shader, this change will halve
the number of backend compilations required to build a shader. With
multi-polygon PS dispatch enabled (which is disabled by default right
now) this has an even more dramatic effect since the number of
compiler iterations can be reduced down to a fifth in the best case
scenario.
Even though in most cases we will only attempt to return a single
binary from the pixel shader compilation, the hardware allows a pair
of PS kernels to be specified, and we'll still take advantage of this
when the multi-polygon PS kernel has the potential to have worse
performance than the single-polygon shader because only the latter
register-allocates successfully at SIMD32 -- Only in such case
(SIMD2x8 multi-polygon, SIMD32 single-polygon) we'll continue
programming both so the hardware will chose one or the other at
runtime depending on the SIMD fullness and number of polygons it can
buffer at runtime.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This avoids running the optimizer uselessly if compilation of the
current kernel failed due to some hardware (e.g. SIMD-width)
restriction. This isn't only inefficient but it can break assumptions
throughout the compiler which would lead to crashes on Xe3 when this
arises during translation from NIR.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
Xe3+ benefits from packing register allocations tightly in order to
make optimal use of the GRF space. The round-robin heuristic
previously in use often causes the whole GRF space to be used even if
register pressure is substantially lower, which would severely
decrease thread-level parallelism on Xe3+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
Xe3 supports 32 SBID tokens per thread regardless of the number of
register blocks allocated per thread. Take advantage of the increased
number of SBIDs in the scoreboard pass to reduce the frequency of
false dependencies on Xe3+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
These don't seem to be disallowed by recent hardware anymore. Stop
disabling SIMD32 due to hardware restrictions of multisample
rasterization, since it should have better performance, and on Xe3+
there may be no shader variant available other than SIMD32.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
Request a fixed subgroup size for pixel shaders that require it due to
the hardware restrictions of fast clears and repeated data clears.
This requires plumbing the "is_fast_clear" boolean across several
callers since blorp_compile_fs_brw() currently has no information
regarding whether the kernel is intended for a fast clear operation.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
On older hardware the "use_rep_send" compile parameter was being
implicitly used to request the compilation of the SIMD16 variant of
clear pixel shaders that require it due to hardware restrictions.
However starting on Gfx12+ this flag is never set since replicated
data clears are no longer supported, but BLORP still implicitly relies
on the SIMD16 variant being generated even though there's no way for
BLORP to explicitly request it. This doesn't cause much of a problem
right now since brw_compile_fs() typically generates a SIMD16 kernel
unless the SIMD8 kernel spills or SIMD debugging flags are enabled,
but it won't work reliably on Xe3+ since we'll start using SIMD32 more
aggressively.
In order to avoid these issues use the standard required subgroup_size
parameter from shader_info to signal that the SIMD16 variant of the
shader is needed by the caller.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
The 16-wide variant of the trampoline shader doesn't appear to work
and would be inadvertently enabled by this series on Gfx12.5. Set the
required subgroup size to avoid changing current behavior.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
Makes sure the number of registers reserved for the payload matches
the size of the URB read, which prevents the VS shared function from
writing past the end of the register file on Xe3 with VRT enabled.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
The lavacli version 1.5.2 has been released in December 2022.
Use the most recent version 2.2.0, released in October 2023, instead.
Notable changes since 1.5.2:
- Authentication tokens are now stripped from exceptions when HTTP
requests fail. (1.6)
Signed-off-by: Leonard Göhrs <l.goehrs@pengutronix.de>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33266>
One of the DUTs had to be retired in LAVA a while ago, and the pending
times were high on the dashboard. Decrease the parallelism and increase
the fractions to address this.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
We have more hp-x360-14a-cb0001xx-zork DUTs available in LAVA,
use it to offload the overloaded lenovo-TPad-C13-Yoga-zork.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
This job was taking too long. However, with the other jobs
de-duplicated on a618 and a630, we can increase parallelism,
which also allows us to reduce the fraction.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
a630-gl takes just over 7 minutes on 4 DUTs, so we can safely reduce
the parallelism to 3 and stay within the time limit.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
The a618-gl and a618-egl jobs are covered by a630-gl, which also
does egl testing, while a630-piglit is a more comprehensive
equivalent of a618-piglit, so we can de-duplicate these jobs.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
Currently there are only 8 sm8350-hdk DUTs in LAVA, but there were
10 pre-merge jobs scheduled for them.
Add a full nightly job to cover the gaps.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
The panfrost-g52-piglit-gles2:arm64 job was taking 19 minutes on
average, and the pending durations of the meson-g12b-a311d-khadas-vim3
DUTs in LAVA were reaching 5-6 minutes, so we have to make this job
manual instead of pre-merge.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
There are only 4 DUTs available in LAVA, and their pending durations
were reaching 3–4 minutes. To address this, reduce the parallelism
for iris-glk-deqp and adjust the fractions to maintain the 10-minute
time limit.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33082>
This fixes depth-only rendering with mesh shaders,
as well as array derefs in unlinked shaders in general.
Lowering array derefs of vectors is necessary for correctness.
Without this, nir_lower_io will incorrectly add the array index
to the IO intrinsic base instead of to the component offset.
This was previously only done during shader linking, which leaves
some problems with unlinked shaders and depth-only rendering.
Whether these calls can be safely removed from shader linking
will be investigated in a future commit.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12516
Cc: mesa-stable
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33264>
This is a preliminary implementation of detiling for
NV12_16L32 tiled format external images. When we
encounter such an image, decode it into a secondary
buffer which will then be used to actually texture from.
In some cases applications may wish to represent the individual
planes of an NV12 image separately, we support that by allowing
detiling of just an R8 (luma) or R8G8 (chroma) plane.
Acked-by: Boris Brezillon <boris.brezillon@collabora.com>
Acked-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31899>
Taken from commit 3ab334814 of the drm-misc-next kernel tree
Signed-off-by: Eric R. Smith <eric.smith@collabora.com>
Acked-by: Boris Brezillon <boris.brezillon@collabora.com>
Acked-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31899>
PanVK on V10 GPUs has reached production quality, so let's enable
building it by default now.
Acked-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Daniel Stone <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33262>
After commit 44bda7c258 (dri: put shared-glapi into libgallium.*.so,
2024-12-26) the mesa Android build does not have a separate libglapi.so
object anymore in the install dir, so stop pushing it to the Android
device in cuttlefish-runner.sh
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33261>
Add a trailing dot to the remote directoyy in the `adb pull` command to
make sure to recursively pull only the **content** of the directory and
not the directory itself.
This prevents having `results/results/` in the artifacts.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33261>