Commit graph

211810 commits

Author SHA1 Message Date
Rob Clark
250dba1dce freedreno/a6xx: Fallback to original blit in the snorm_copy path
Unlike z/s blits, where we want the fallback to use the re-written blit,
we don't want this in the handle_snorm_copy_blit() path.

Signed-off-by: Rob Clark <rob.clark@oss.qualcomm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37279>
2025-09-11 03:08:54 +00:00
Caio Oliveira
03e9c01f0c brw: Add and use more brw_validate.cpp macros
Add and use more comparison variants (which provide more detailed print
out of the values), remove old references to "fsv" and "scalar", use
assertion names more similar to GoogleTest that we already use
elsewhere.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37267>
2025-09-10 17:44:38 -07:00
Mel Henning
a9ea4630d4 nak: Make BindlessSSA store [SSAValue; 2]
This reduces the size of ir::Src from 40 bytes down to 32 bytes. This
makes the size of ir::Op fall from 272 bytes down to 232 bytes, meaning
we save 40 bytes per instruction.

Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130>
2025-09-10 22:25:13 +00:00
Mel Henning
8ac9b077b1 nak/assign_regs: Make src_ssa_ref return a slice
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130>
2025-09-10 22:25:13 +00:00
Mel Henning
d21a4d9e50 nak: impl HasRegFile for SSARef and &[SSAValue]
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130>
2025-09-10 22:25:13 +00:00
Mel Henning
603d7f9413 nak: Remove Option<> from SSARef::file() return
Nothing actually wants to mix register files in a SSARef so in practice
no callers really handled the None return case. Panic on that case
instead.

Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130>
2025-09-10 22:25:12 +00:00
Dylan Baker
08a3497223 anv: add assertion that tes and tcs data is non-null
It doesn't make any sense to have TCS but not TES (or vice versa), but
coverity doesn't realize that. Add an assertion that they are both
non-null before we start reading them.

Fixes: 50fd669294 ("anv: prep work for separate tessellation shaders")
CID: 1665360
CID: 1665327
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37266>
2025-09-10 18:18:42 +00:00
Dylan Baker
ecfce9f9ad blorp: Fix potential read of uninitialized elk fields in debug paths
The intel_vue_map is only partially initialized before being used. All
used fields are initialized, but in debug paths the uninitialized fields
will also be read. To fix this, initialize the struct to 0. In the brw
path this struct is part of the prog_data, and is rzalloc'd.

CID: 1665308
Reviewed-by: Iván Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37261>
2025-09-10 17:51:34 +00:00
Dylan Baker
6fe4b7344d isl: prevent potential overflow before widen
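
The commit title names the pattern without a body, so here is a minimal sketch of what "overflow before widen" usually means. The product is computed in 32 bits and only then stored in a 64-bit variable, so the high bits are already lost. Python integers do not overflow, so the 32-bit wrap is emulated with a mask. The function names and operand values are hypothetical and are not taken from isl.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mul_then_widen(a: int, b: int) -> int:
    """Buggy order: multiply as unsigned 32-bit (wrapping), then widen."""
    return (a * b) & MASK32


def widen_then_mul(a: int, b: int) -> int:
    """Fixed order: treat the operands as 64-bit before multiplying."""
    return (a * b) & MASK64


# Hypothetical pitch and row count whose product exceeds 2**32.
pitch, rows = 70000, 70000
wrapped = mul_then_widen(pitch, rows)   # high bits lost
correct = widen_then_mul(pitch, rows)   # full product kept
```

In C the fix is typically a cast on one operand before the multiplication, so that the arithmetic itself happens at the wider type.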
Fixes: 73608eb8b7 ("isl: Add support for creating layered surfaces for video encode/decode")
CID: 1665354
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37260>
2025-09-10 17:01:40 +00:00
Dylan Baker
f18aca8689 intel/brw: Fix implementation of |= operator for enum
The current implementation does nothing, since it has no side effects,
only a return value. By passing `x` as a reference we can mutate the
value before returning.
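
A rough analogy of the bug, not the brw code itself. Python has no reference parameters, so a mutable holder object stands in for the C++ reference here. The flag names and the `State` class are hypothetical. The first helper only changes its local copy, so a caller that ignores the return value sees no effect. The second mutates the caller's state, as passing `x` by reference does.

```python
from enum import IntFlag


class Dirty(IntFlag):
    """Hypothetical flag set standing in for the enum in the commit."""
    NONE = 0
    A = 1
    B = 2


class State:
    """Mutable holder; plays the role of the variable passed by reference."""
    def __init__(self) -> None:
        self.flags = Dirty.NONE


def or_assign_by_value(flags: Dirty, bits: Dirty) -> Dirty:
    # Rebinds the local name only; the caller's value is untouched.
    flags = flags | bits
    return flags


def or_assign_in_place(state: State, bits: Dirty) -> Dirty:
    # Mutates the caller's state before returning the new value.
    state.flags = state.flags | bits
    return state.flags


state = State()
or_assign_by_value(state.flags, Dirty.A)   # result discarded: no effect
after_by_value = state.flags
or_assign_in_place(state, Dirty.B)         # state is updated
after_in_place = state.flags
```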

Fixes: df37c7ca74 ("brw: fix analysis dirtying with pulled constants")
CID: 1665293
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37263>
2025-09-10 16:30:19 +00:00
Dylan Baker
70ebc14de9 anv: avoid potential integer overflow in video address calculation
Coverity caught one instance of this, by visual inspection I found
another case.

Fixes: 3fb25cc78a ("anv: Add support for creating layered surfaces for video encode/decode")
CID: 1665326
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37262>
2025-09-10 16:06:37 +00:00
Anna Maniscalco
011ba1842e freedreno/registers: add CP_ALWAYS_ON_CONTEXT
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37237>
2025-09-10 15:10:14 +00:00
Samuel Pitoiset
1da270fb35 radv/amdgpu: add more helpers for managing virtual BOs
All these new helpers will make the SMEM PRT workaround better
organized.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37193>
2025-09-10 14:50:25 +00:00
Samuel Pitoiset
3c4168a3cc radv/amdgpu: return OOM device when BO mapping fails
It's more appropriate than VK_ERROR_UNKNOWN.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37193>
2025-09-10 14:50:24 +00:00
Valentine Burley
fd6d285417 zink/ci: Add a new Minecraft restricted trace
From @zmike, it exposes a very niche corner case bug in zink.

Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37270>
2025-09-10 14:35:23 +00:00
David Rosca
9d9fc1fe72 radeonsi/vcn: Get rid of PIPE_ALIGN_IN_BLOCK_SIZE
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37143>
2025-09-10 13:27:54 +00:00
David Rosca
8eb84f8854 radeonsi/vcn: Fix calculating QP map region dimensions
It needs to be aligned to the block size, otherwise it would skip the last
row/column on resolutions like 1080p.
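
The arithmetic behind this can be sketched as follows. The block size of 64 is an assumption picked for illustration, since the commit message does not state the value. Truncating division drops the partial last row of blocks at 1080 lines, and rounding up keeps it.

```python
def blocks_truncated(size: int, block: int) -> int:
    """Block count without alignment: a partial last block is dropped."""
    return size // block


def blocks_aligned(size: int, block: int) -> int:
    """Block count after aligning size up to a multiple of block."""
    return (size + block - 1) // block


BLOCK = 64  # hypothetical block size, not taken from the commit

rows_before = blocks_truncated(1080, BLOCK)   # covers only 1024 lines
rows_after = blocks_aligned(1080, BLOCK)      # covers all 1080 lines
cols_before = blocks_truncated(1920, BLOCK)   # 1920 is already a multiple
cols_after = blocks_aligned(1920, BLOCK)
```

With this block size only the height is affected at 1080p, because 1920 divides evenly and 1080 does not.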

Cc: mesa-stable
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37143>
2025-09-10 13:27:54 +00:00
Mike Blumenkrantz
5fefb9e795 zink: flag vertex element state for rebind after vstate draws
vstate draws bind their own elements unrelated to the bound
gallium elements, so any draw occurring after a vstate draw must
rebind to ensure the correct ones are bound

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13570

cc: mesa-stable

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37274>
2025-09-10 13:06:07 +00:00
David Rosca
a03a79aa9d pipe: Remove PIPE_VIDEO_CAP_PREFERS/SUPPORTS_INTERLACED
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36632>
2025-09-10 12:33:57 +00:00
David Rosca
6954460899 radeonsi/video: Remove support for interlaced buffers
This is not used anymore with VDPAU removed.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36632>
2025-09-10 12:33:57 +00:00
David Rosca
223d3ec433 gallium/vl: Remove now unused filters
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36632>
2025-09-10 12:33:57 +00:00
David Rosca
4b54277d2e Remove VDPAU
VDPAU only supports X11 and GL interop. There is no Wayland or Vulkan
interop support. The API has limitations that make it impossible to
correctly decode certain streams.
Application support is also very limited, and VAAPI is always a better
choice over VDPAU.

Acked-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Acked-by: Adam Jackson <ajax@redhat.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36632>
2025-09-10 12:33:57 +00:00
David Rosca
e7ea1233b1 mesa: Remove NV_vdpau_interop
Acked-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Acked-by: Adam Jackson <ajax@redhat.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36632>
2025-09-10 12:33:57 +00:00
David Rosca
272bde24a3 ci: Stop building VDPAU driver
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36632>
2025-09-10 12:33:57 +00:00
Mary Guillemard
497005dc18 panvk: Enable SNORM rendering
Blending should work properly these days.

Signed-off-by: Mary Guillemard <mary.guillemard@collabora.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37271>
2025-09-10 12:15:06 +00:00
Mary Guillemard
f707f093ec panvk: Do not clamp blend constants in command buffer
This is wrong for SNORM and this is handled by nir_lower_blend.

Signed-off-by: Mary Guillemard <mary.guillemard@collabora.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37271>
2025-09-10 12:15:06 +00:00
Lionel Landwerlin
1646e7d311 anv: run nir_opt_acquire_release_barriers
In the middle of writing all this new shader object compile code, this
pass got added and I missed adding it to the shader object path.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: d39e443ef8 ("anv: add infrastructure for common vk_pipeline")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37269>
2025-09-10 11:47:05 +00:00
Konstantin Seurer
7c9e945460 radv,vulkan: Avoid a useless barrier in radv_update_bind_pipeline
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36982>
2025-09-10 08:35:50 +00:00
Konstantin Seurer
a35dfab281 radv: Use vk_barrier_compute_w_to_compute_r more
vk_barrier_compute_w_to_compute_r shows up in rgp captures and is less
code.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36982>
2025-09-10 08:35:50 +00:00
Konstantin Seurer
850f339b89 vulkan: Add more detail to encode debug markers
Useful for radv because radv has quite a few different configurations.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36982>
2025-09-10 08:35:50 +00:00
Konstantin Seurer
5c94e20abe vulkan: Use a struct for debug markers
Improves u_trace integration with anv.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36982>
2025-09-10 08:35:50 +00:00
Ella Stanforth
01c7c97ef7 util/tests: Add list iterator tests
Reviewed-by: Rob Clark <rob.clark@oss.qualcomm.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37061>
2025-09-10 07:38:25 +00:00
Ella Stanforth
d943a91b71 util/list: Add iterator debug to more routines.
Reviewed-by: Rob Clark <rob.clark@oss.qualcomm.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37061>
2025-09-10 07:38:24 +00:00
Ella Stanforth
6863223033 util/list: Fix next instruction removal usecase for non safe iterators
Introducing this iterator debug information breaks the usecase of removing
elements in the list other than the current element.

Fixes: 372e83b95f

Reviewed-by: Rob Clark <rob.clark@oss.qualcomm.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37061>
2025-09-10 07:38:24 +00:00
Samuel Pitoiset
c739d836f7 radv: exclude dynamic vertex input stride for the late scissor workaround
RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE doesn't emit any context
registers, so it can be excluded for the late scissor workaround to
avoid re-emitting scissors all the time it's dirty.
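
A minimal sketch of the idea, assuming the workaround is driven by a mask of dirty dynamic states. The bit positions and the `OTHER_CONTEXT_STATE` name are hypothetical. Only `RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE` comes from the commit message. The real radv logic may be structured differently.

```python
# Hypothetical bit assignments, for illustration only.
OTHER_CONTEXT_STATE = 1 << 0
RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE = 1 << 1


def needs_late_scissor_reemit(dirty: int) -> bool:
    """True if any dirty state other than the vertex stride is set.

    The stride state emits no context registers, so it is masked out
    before deciding whether scissors have to be emitted again.
    """
    return bool(dirty & ~RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE)


stride_only = needs_late_scissor_reemit(RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE)
stride_and_other = needs_late_scissor_reemit(
    RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE | OTHER_CONTEXT_STATE
)
```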

This fixes a performance regression noticed with Cyberpunk on Vega10,
but other games are likely affected too. The late scissor workaround is
only applied on Raven/Vega10.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13828
Fixes: d7f401c2bb ("radv: bind the vertex binding strides like a normal dynamic state")
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37252>
2025-09-10 07:09:48 +00:00
abdelhadi
3a41644165 aco, radv: remove line duplicate
Signed-off-by: abdelhadi <abdelhadims@icloud.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37243>
2025-09-10 06:34:43 +00:00
Lionel Landwerlin
33d2c31d7a brw: don't use brw_null_reg() for unused SEND sources
Just avoiding the validation assert.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 47fe9d28e7 ("brw: Enumerate SHADER_OPCODE_SEND sources and standardize how many")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13777
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37112>
2025-09-10 09:08:27 +03:00
Timothy Arceri
11a434f3df glsl: remove now unused NumUniformRemapTable
Acked-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36997>
2025-09-10 05:11:47 +00:00
Timothy Arceri
e052254066 glsl: make use of u_range_remap for uniform remapping
This will allow ubo buffers to have arrays containing millions of
elements without excessive memory use on a remap table. Before this
change using the max sized array on radeonsi would result in 1.3GB
of memory being used for a remap table in a single shader.

There is also a small functional change here, previously if the
shader used more than GL_MAX_UNIFORM_BLOCK_SIZE mesa would ignore
and allow this as the original ARB_uniform_buffer_object spec
stated:

   "If the amount of storage required for a uniform block exceeds
   this limit, a program may fail to link."

However in OpenGL 4.3 the text was clarified and the "may" was
removed so with this change we enforce the max limit.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9953
Acked-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36997>
2025-09-10 05:11:47 +00:00
Timothy Arceri
bf946bccf2 util: add range remap util
This util allows a range of values to be remapped to a single
pointer.
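
To illustrate why this saves memory, here is a sketch of a range-to-value map built on a sorted list and binary search. It is not the u_range_remap implementation. The class name, methods and storage layout are assumptions. The point is that a range of a million locations costs one entry instead of a million table slots.

```python
import bisect


class RangeRemap:
    """Maps inclusive integer ranges [start, end] to a single value."""

    def __init__(self) -> None:
        self._starts: list[int] = []   # sorted range starts, for bisect
        self._entries: list[tuple[int, int, object]] = []

    def insert(self, start: int, end: int, value: object) -> None:
        if start > end:
            raise ValueError("empty range")
        i = bisect.bisect_left(self._starts, start)
        # Reject overlap with the previous and the next stored range.
        if i > 0 and self._entries[i - 1][1] >= start:
            raise ValueError("overlaps previous range")
        if i < len(self._entries) and self._entries[i][0] <= end:
            raise ValueError("overlaps next range")
        self._starts.insert(i, start)
        self._entries.insert(i, (start, end, value))

    def lookup(self, key: int):
        """Return the value whose range contains key, or None."""
        i = bisect.bisect_right(self._starts, key) - 1
        if i >= 0:
            _, end, value = self._entries[i]
            if key <= end:
                return value
        return None


remap = RangeRemap()
remap.insert(0, 999_999, "big_array")        # one entry for 1M locations
remap.insert(1_000_000, 1_000_000, "scalar")
```

Lookups cost a binary search rather than a direct index, which is the trade made for the smaller footprint in this sketch.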

Acked-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36997>
2025-09-10 05:11:47 +00:00
Francisco Jerez
5bf7bb5cf9 intel/brw/xe3+: Re-enable static analysis-based SIMD32 FS heuristic for the moment.
This disables for now the "optimistic" SIMD heuristic that was
implemented for xe3+ and makes it dependent on a debugging option.
Instead, use the static analysis-based codepath that was used in
previous generations and was extended by previous commits in this MR
to model the xe3 trade-off between register use and thread
parallelism.

The reason is that the main assumption of the optimistic SIMD
heuristic didn't hold up with reality: Real-world testing on PTL shows
that there are many cases where SIMD32 shows performance degradation
relative to SIMD16 despite the ability of xe3 hardware to scale the
GRF file of a thread on demand. Unfortunately, that scenario seems to
be more pervasive than hoped when the optimistic SIMD heuristic was
implemented pre-silicon.

In many cases what seems to be going on is that even when the register
file is able to scale with the increased register use of SIMD32, the
thread parallelism of the EU is scaled down by a similar factor, so at
the bottom line SIMD32 (depending on the actual ratio of register use
between both variants) may not buy us anything, and it frequently
encounters constraints (like SIMD lowering and less effective
scheduling) that lead to worse codegen than SIMD16, easily tipping the
balance in favor of SIMD16.  The extension of the performance analysis
pass that was done in a previous commit allows the original SIMD32
heuristic to take this effect into account quantitatively, and that
seems pretty effective at disabling SIMD32 shaders that underperform,
judging from the statistically significant improvement of most Traci
test-cases that run on my PTL system (4 iterations, 5% significance).
No statistically significant regressions were observed:

Nba2K23-trace-dx11-2160p-ultra:                    10.16% ±0.34%
Superposition-trace-dx11-2160p-extreme:             4.06% ±0.50%
TotalWarWarhammer3-trace-dx11-1080p-high:           3.52% ±0.76%
Payday3-trace-dx11-1440p-ultra:                     2.41% ±0.81%
MetroExodus-trace-dx11-2160p-ultra:                 2.28% ±0.78%
Borderlands3-trace-dx11-2160p-ultra:                1.89% ±0.65%
MountAndBlade2-trace-dx11-1440p-veryhigh:           1.81% ±0.40%
Blackops3-trace-dx11-1080p-high:                    1.66% ±0.29%
HogwartsLegacy-trace-dx12-1080p-ultra:              1.53% ±0.22%
TotalWarPharaoh-trace-dx11-1440p-ultra:             1.44% ±0.31%
Fortnite-trace-dx11-2160p-epix:                     1.44% ±0.27%
Naraka-trace-dx11-1440p-highest:                    1.39% ±0.27%
PubG-trace-dx11-1440p-ultra:                        1.30% ±0.49%
Destiny2-trace-dx11-1440p-highest:                  1.10% ±0.23%
Factorio-trace-1080p-high:                          1.10% ±1.77%
TerminatorResistance-trace-dx11-2160p-ultra:        1.08% ±0.31%
Ghostrunner2-trace-dx11-1440p-ultra:                1.05% ±0.15%
ShadowTombRaider-trace-dx11-2160p-ultra:            0.98% ±0.19%
CitiesSkylines2-trace-dx11-1440p-high:              0.67% ±0.19%
Palworld-trace-dx11-1080p-med:                      0.44% ±0.22%

The downside is that this will reverse the large reduction in
compile-time we gained from the optimistic SIMD heuristic -- The
run-time of both shader-db and fossil-db jump back up by nearly 20%
with this change.  I'm working on a better compromise based on
run-time feedback that will hopefully allow us to preserve the
compile-time benefit of the optimistic heuristic without the reduction
in run-time performance, but in the meantime it seems like the
run-time performance gap from SIMD32 is the more urgent issue to
address since it has an impact on titles across the board.  Despite
the reversal of that compile-time improvement xe3 still achieves
slightly lower compile time on the average than previous generations
as a result of VRT, so this doesn't seem terribly tragic.

v2: Add bit to brw_get_compiler_config_value() (Lionel).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:58 +00:00
Francisco Jerez
a7969b5d42 intel/brw: Apply 7e1362e9c0 to pre-xe3 codepath of brw_compile_fs().
This applies the same workaround as 7e1362e9c0 to the pre-xe3
codepath of brw_compile_fs(), since ray queries appear to be
unsupported from SIMD32 fragment shaders.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:58 +00:00
Francisco Jerez
531a34c7dd intel/brw/xe3+: Select scheduler heuristic with best trade-off between register pressure and latency.
The current register allocation loop attempts to use a sequence of
pre-RA scheduling heuristics until register allocation is successful.
The sequence of scheduling heuristics is expected to be increasingly
aggressive at reducing the register pressure of the program (at a
performance cost), so that the instruction ordering chosen gives the
lowest latency achievable with the register space available.

Unfortunately that approach doesn't consistently give the best
performance on xe3+, since on recent platforms a schedule with higher
latency may actually give better performance if its lower register
pressure allows the use of a lower number of VRT register blocks which
allows the EU to run more threads in parallel.

This means that on xe3+ the scheduling mode with highest performance
is fundamentally dependent on the specific scenario (in particular
where in the thread count-register use curve the program is at, and
how effective the scheduler heuristics are at reducing latency for
each additional block of GRFs used), so it isn't possible to construct
a fixed sequence of the existing heuristics guaranteed to be ordered
by decreasing performance.  In order to find the scheduling heuristic
with better performance we have to run multiple of them prior to
register allocation and do some arithmetic to account for the effect
on parallelism of the register pressure estimated in each case, in
order to decide which heuristic will give the best performance.
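
The selection described above can be sketched roughly as follows. This is not the brw code. The thread-count model (a pool of 1024 GRFs per EU, 32-GRF blocks, at most 10 threads) is an illustrative assumption, chosen only so that parallelism starts to fall above 96 GRFs as described elsewhere in this log. The heuristic names and cycle counts are made up.

```python
GRF_BLOCK = 32        # assumed allocation granularity
GRF_POOL = 1024       # assumed GRFs available per EU (illustrative)
MAX_THREADS = 10      # assumed thread cap per EU (illustrative)


def threads_per_eu(grfs: int) -> int:
    """Threads that fit when each one requests `grfs` registers."""
    blocks = max(1, -(-grfs // GRF_BLOCK))   # round up to whole blocks
    return min(MAX_THREADS, GRF_POOL // (blocks * GRF_BLOCK))


def score(cycles: int, grfs: int) -> float:
    """Relative throughput: parallel threads divided by latency."""
    return threads_per_eu(grfs) / cycles


def pick_schedule(candidates):
    """candidates: iterable of (name, estimated_cycles, estimated_grfs)."""
    return max(candidates, key=lambda c: score(c[1], c[2]))


candidates = [
    ("latency-min", 1000, 160),    # faster per thread, 6 threads fit
    ("pressure-min", 1200, 96),    # slower per thread, 10 threads fit
]
best = pick_schedule(candidates)
```

In this made-up case the schedule with 20% higher latency wins, because it lets 10 threads run instead of 6.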

This sounds costly but it is similar to the approach taken by
brw_allocate_registers() when unable to allocate without spills in
order to decide which scheduling heuristic to use in order to minimize
the number of spills.  In cases where that happens on xe3+ the
scheduling runs introduced here don't add to the scheduling runs done
to find the heuristic with minimum register pressure, we attempt to
determine the heuristic with lowest pressure and best performance in
the same loop, and then use one or the other depending on whether
register allocation succeeds without spills.

Significantly improves performance on PTL of the following Traci test
cases (4 iterations, 5% significance):

Nba2K23-trace-dx11-2160p-ultra:                     4.48% ±0.38%
Fortnite-trace-dx11-2160p-epix:                     1.61% ±0.28%
Superposition-trace-dx11-2160p-extreme:             1.37% ±0.26%
PubG-trace-dx11-1440p-ultra:                        1.15% ±0.29%
GtaV-trace-dx11-2160p-ultra:                        0.80% ±0.24%
CitiesSkylines2-trace-dx11-1440p-high:              0.68% ±0.19%
SpaceEngineers-trace-dx11-2160p-high:               0.65% ±0.34%

The compile-time cost of shader-db increases significantly by 3.7%
after this commit (15 iterations, 5% significance), the compile-time
of fossil-db doesn't change significantly in my setup.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
0e802cecba intel/brw: Make sure we don't use stale analysis after inst. order restore in brw_allocate_registers().
Call invalidate_analysis() from restore_instruction_order() to make sure
we don't re-use stale analysis pass results if the user forgets to
call invalidate_analysis() explicitly.
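
The pattern can be sketched as follows: a cached analysis is dropped inside the restore routine itself, so callers cannot forget to do it. The `Shader` class and its fields are hypothetical. Only the two function names come from the commit message.

```python
class Shader:
    """Toy shader with a lazily computed, cached analysis result."""

    def __init__(self, instructions):
        self.instructions = list(instructions)
        self._analysis = None

    def analysis(self):
        # Computed on demand and cached until invalidated.
        if self._analysis is None:
            self._analysis = {
                inst: ip for ip, inst in enumerate(self.instructions)
            }
        return self._analysis

    def invalidate_analysis(self):
        self._analysis = None

    def restore_instruction_order(self, saved):
        self.instructions = list(saved)
        # Invalidate here so stale results can never be re-used.
        self.invalidate_analysis()


shader = Shader(["mul", "add", "send"])
saved = list(shader.instructions)
shader.instructions.reverse()              # stand-in for a scheduling run
ip_scheduled = shader.analysis()["send"]   # cached for the reordered code
shader.restore_instruction_order(saved)
ip_restored = shader.analysis()["send"]    # recomputed for the original order
```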

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
dfc2a89d96 intel/brw: Allow using performance analysis pass pre-register allocation.
Mainly this involves changing 'struct state' so that the dep_ready
array is allocated with a dynamic size based on the number of VGRFs of
the program instead of assuming a fixed XE3_MAX_GRF count of GRF
dependencies.  VGRF register dependencies are then handled by using
one dep_ready entry per VGRF allocation instead of one per hardware
register.

The ability to use the performance analysis pass pre-regalloc will
mostly be useful on xe3+, but this also has the side effect of saving
some memory on xe2 and earlier platforms since we no longer need to
allocate XE3_MAX_GRF dep_ready entries for them.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
3936a43496 intel/brw/xe3+: Tweak render target write timings in performance modeling pass.
Reduce the cycle-count cost estimate used by the performance model for
render target writes on xe3+ in order to match the real-world
observation of shaders with latency lower than the previously
estimated cost of its render target write.

In a shader used by Factorio this would have led us to incorrectly
model the shader as fillrate-bound, even though in reality the shader
is EU-bound and benefits from the higher parallelism of SIMD32, so the
subsequent commit that re-enables the static analysis-based SIMD32
heuristic on PTL would lead to a ~2% regression without this tweak.

There appear to be no other regressions nor other changes from this in
combination with the subsequent commit that enables it to have an
effect, but it is possible that the real cycle count cost of a render
target write still lies below the estimated value, ~400 is just the
upper bound that can be inferred from the behavior of this test case.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
6ccf2a375a intel/brw/xe3+: Adjust weights of discard control flow for non-EU-fused platforms.
Currently on platforms without EU fusion (all platforms other than
gfx12.x) we were using a constant discard_weight = 1.0 regardless of
SIMD width.  This was far from ideal, in particular since it made the
performance analysis pass fully insensitive to the presence of discard
jumps, even though the scheduler is able to move code past a discard
statement so the range of the program under discard control flow can
vary and have a material effect on the relative performance of SIMD16
vs. SIMD32, since the scheduler is typically more constrained in
SIMD32 dispatch mode.

In order to fix this use a discard_weight lower than 1.0 for all
dispatch modes, so that the performance analysis pass accounts for the
presence and range of discard control flow.  In addition use a lower
discard_weight for SIMD16 dispatch like we do on Gfx12.x in order to
account for the higher likelihood of divergent discard in SIMD32 mode.
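
A small sketch of why a weight of 1.0 makes the model blind to discard. The weights 0.5 and 0.75 and the cycle counts are hypothetical, since the commit does not list the tuned values. Code under discard control flow is scaled by the weight, so moving instructions past the discard only changes the estimate when the weight is below 1.0.

```python
def estimated_cycles(blocks, discard_weight: float) -> float:
    """blocks: list of (cycles, under_discard_control_flow)."""
    return sum(
        cycles * (discard_weight if under_discard else 1.0)
        for cycles, under_discard in blocks
    )


# Same 400 cycles of work, split differently around the discard.
mostly_after_discard = [(100, False), (300, True)]
mostly_before_discard = [(300, False), (100, True)]

# With weight 1.0 both layouts look identical to the model.
flat_a = estimated_cycles(mostly_after_discard, 1.0)
flat_b = estimated_cycles(mostly_before_discard, 1.0)

# Hypothetical weights: lower for SIMD16 than for SIMD32.
simd16 = estimated_cycles(mostly_after_discard, 0.5)
simd32 = estimated_cycles(mostly_after_discard, 0.75)
```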

The specific weights were determined iteratively on PTL based on the
final FPS result of several traces that are sensitive to the dispatch
width of one or more fragment shaders that use discard, in order to
ensure that in none of those cases we end up using the
lower-performing dispatch width variant.  This avoids regressions
between 3.7% and 0.8% in Superposition-trace-dx11-2160p-extreme,
BaldursGate3-trace-dx11-1440p-ultra and
MetroExodus-trace-dx11-2160p-ultra after enabling the static
analysis-based SIMD32 heuristic in PTL.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)

v2: Limit to xe3+ for now since performance effect seems to be a wash
    on xe2.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
1272ff5ed1 intel/brw/xehp+: Adjust performance model weights of LSC atomic ops.
The LSC implements several optimizations for atomic operations on
memory addresses that are uniform across all lanes, in which case its
cost is approximately O(1) instead of O(exec_size).  Even cases where
memory offsets are non-uniform but packed in a cacheline appear to
have a cost that is non-linear with the number of lanes.

In order to approximate this behavior more closely, model its
back-end cost as roughly 1300 cycles instead of the previous 400 *
exec_size/8.  This fixes some cases where we were incorrectly
predicting the SIMD32 shader would be bound by the throughput of LSC
atomic operations, even though the observed cost per lane of the LSC
operations was significantly lower in SIMD32 mode so it would have the
best performance.
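
Worked through with the two formulas quoted above, the old estimate charges a fixed 50 cycles per lane at every width. The constant estimate makes the per-lane cost shrink as the width grows. The helper names below are made up. The numbers 400, 8 and 1300 are from the commit message.

```python
NEW_COST = 1300  # constant back-end cost, in cycles


def old_cost(exec_size: int) -> int:
    """Previous estimate: 400 * exec_size / 8 cycles."""
    return 400 * exec_size // 8


def per_lane(cost: float, exec_size: int) -> float:
    return cost / exec_size


old_simd16, old_simd32 = old_cost(16), old_cost(32)
old_lane16 = per_lane(old_simd16, 16)
old_lane32 = per_lane(old_simd32, 32)
new_lane16 = per_lane(NEW_COST, 16)
new_lane32 = per_lane(NEW_COST, 32)
```

Under the old formula SIMD32 pays twice the SIMD16 cost for twice the lanes, so it never gains per lane. Under the constant model the SIMD32 cost per lane is half that of SIMD16.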

Clearly this is still a rough approximation and it might be possible
to obtain a more accurate result by plumbing divergence analysis data
all the way down to codegen, however the goal of the performance
analysis pass isn't to provide an exact prediction of the performance
of a shader (that's not really possible in general via static analysis
without solving the halting problem), but to provide a good enough
approximation at a low cost -- And the constant approximation seems to
be strictly better in practice than the approximation we were using
before. There appear to be no regressions from this change, and
ShadowTombRaider-trace-dx11-2160p-ultra shows 5.7% better performance
on PTL with a subsequent commit that re-enables the use of the static
analysis-based SIMD32 heuristic on xe3+.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:56 +00:00
Francisco Jerez
6eea9659db intel/brw/xe3+: Model trade-off between parallelism and GRF use in performance analysis.
This extends the performance analysis pass used in previous
generations to make it more useful to deal with the performance
trade-off encountered on xe3 hardware as a result of VRT.  VRT allows
the driver to request a per-thread GRF allocation different from the
128 GRFs that were typical in previous platforms, but this comes at
either a thread parallelism cost or benefit depending on the number of
GRF register blocks requested.

This makes a number of decisions more difficult for the compiler since
certain optimizations potentially trade off run-time in a thread
against the total number of threads that can run in parallel
(e.g. consider scheduling and how reordering an instruction to avoid a
stall can increase GRF use and therefore reduce thread-level
parallelism when trying to improve instruction-level parallelism).

This patch provides a simple heuristic tool to account for the
combined interaction of register pressure and other single-threaded
factors that affect performance.  This is expressed with the
redefinition of the pre-existing brw_performance::throughput estimate
as the number of invocations per cycle per EU that would be achieved
if there were enough threads to reach full load (in this sense this is
to be considered a heuristic since the penalty from VRT may be lower
than expected from this model at low EU load).
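
A deliberately simplified sketch of such an estimate, assuming throughput under full load is the number of resident threads times the SIMD width divided by the per-thread cycle count. The real brw_performance model is more detailed. The thread counts and cycle counts below are made up for illustration.

```python
def throughput(simd_width: int, cycles: int, threads: int) -> float:
    """Invocations per cycle per EU at full load (simplified model)."""
    return threads * simd_width / cycles


# Hypothetical shader: the SIMD32 variant needs about twice the GRFs,
# so half as many threads fit, and it also schedules slightly worse.
simd16 = throughput(simd_width=16, cycles=1000, threads=10)
simd32 = throughput(simd_width=32, cycles=1100, threads=5)
prefer_simd32 = simd32 > simd16
```

In this example the doubled width is cancelled by the halved thread count, and the extra 10% latency tips the estimate toward SIMD16.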

This will be used e.g. in order to decide whether to use a more
aggressive latency-minimizing mode during scheduling or a mode more
effective at minimizing register pressure (it makes sense to take the
path that will lead to the most invocations being serviced per cycle
while under load).  This also allows us to re-enable the old PS SIMD32
heuristic on xe3+, and due to this change it is able to identify cases
where the combined effect of poorer scheduling and higher GRF use of
the SIMD32 variant makes it more favorable to use SIMD16 only (see
last patch of the MR for details and numbers).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:56 +00:00
Francisco Jerez
760437c4c4 intel/brw/xe3+: Override P value of GRF register classes to increase thread parallelism.
This causes the graph coloring allocator to use the optimistic
coloring codepath for all nodes whose total Q value exceeds the
threshold of 96 GRFs, in order to do a better job at minimizing the
register requirement of programs even when they are trivially
colorable.  At the threshold of 96 GRFs the number of threads
available per EU starts decreasing as the number of register blocks
requested by the program increases, so decreasing the number of
registers can increase performance.

That showed up in some test cases as a performance inversion from the
enabling of VRT, since the extension of the register set to 256 GRFs
has the side effect of making some non-trivially colorable programs
trivially colorable, which would cause the register allocator to do a
worse job at ordering the (trivial) allocations due to the optimistic
coloring path being skipped, leading to increased register use and
reduced performance.
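
The effect of the override can be sketched with the usual graph-coloring test, where a node counts as trivially colorable when its total Q value is below the P value of its class. The function name and the node's Q value are hypothetical, and the exact comparison at the boundary is an assumption.

```python
def is_trivially_colorable(q_total: int, p: int) -> bool:
    """Simplified P/Q test used to choose the coloring path."""
    return q_total < p


FULL_REGISTER_SET = 256   # P value implied by the 256-GRF register set
OVERRIDDEN_P = 96         # threshold from the commit message

q_total = 120  # hypothetical node
trivial_with_256 = is_trivially_colorable(q_total, FULL_REGISTER_SET)
trivial_with_96 = is_trivially_colorable(q_total, OVERRIDDEN_P)
uses_optimistic_path = not trivial_with_96
```

With P at 256 this node skips the optimistic path. With P overridden to 96 it goes through it, which is the behavior the commit relies on to keep register use low.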

The following Traci test cases improve significantly as a result of
this change (4 iterations, 5% significance):

MetroExodus-trace-dx11-2160p-ultra:                 1.90% ±0.85%
BaldursGate3-trace-dx11-1440p-ultra:                1.47% ±0.38%
Palworld-trace-dx11-1080p-med:                      1.01% ±0.09%
TerminatorResistance-trace-dx11-2160p-ultra:        0.95% ±0.29%
Control-trace-dx11-1440p-high:                      0.87% ±0.50%

Even though lowering the P value threshold is expected to have a cost
in compile time theoretically due to the increased use of the slower
optimistic path of the graph coloring allocator, this doesn't actually
show up in my numbers: my shader-db and fossil-db compile-time numbers
don't show any statistically significant change (13 iterations, 5%
significance).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:55 +00:00