Commit graph

4664 commits

Author SHA1 Message Date
Caio Oliveira
df2b5fb03f brw: Add brw_fb_write_inst
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:04 +00:00
Caio Oliveira
d06c0a370e brw: Add brw_urb_inst
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:04 +00:00
Caio Oliveira
90967e7b16 brw: Add brw_load_payload_inst
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:03 +00:00
Caio Oliveira
388bac06ce brw: Add brw_dpas_inst
Fixed the types in brw_inst::bits so the struct is packed correctly.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:03 +00:00
Caio Oliveira
09a26526cc brw: Add brw_mem_inst
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:02 +00:00
Caio Oliveira
f0f1e63f99 brw: Add brw_tex_inst
Incorporate some "control sources" directly into the instruction.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:02 +00:00
Caio Oliveira
0fcce2722f brw: Add brw_send_inst
Move all the SEND-specific fields from brw_inst into brw_send_inst.
This new instruction kind will contain all variants of SENDs plus the
virtual opcodes that were already relying on those SEND fields.

Use the `as_send()` helper to go from a brw_inst to a brw_send_inst
when applicable.  Some of the code was changed to use the brw_send_inst
type directly.

Until other kinds are added, all the instructions are allocated the same
amount of space as brw_send_inst.  This ensures that all
brw_transform_inst() calls are still valid.  This will change after
a few patches so that BASE instructions can use less memory.
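For context, here is a minimal sketch of how an `as_send()`-style downcast helper can work.  The kind enum, field names, and layout below are assumptions for illustration only, not the actual Mesa definitions.

```cpp
/* Hypothetical sketch of an instruction-kind downcast helper.  The enum
 * values and struct members are illustrative; the real brw_inst and
 * brw_send_inst definitions live in the Mesa tree. */
enum inst_kind { KIND_BASE, KIND_SEND };

struct base_inst_sketch {
   inst_kind kind;
   /* ... fields common to all instruction kinds ... */
};

struct send_inst_sketch : base_inst_sketch {
   unsigned sfid;      /* SEND-only fields moved out of the base kind */
   unsigned mlen;
   unsigned ex_mlen;
};

static inline send_inst_sketch *
as_send(base_inst_sketch *inst)
{
   /* Return the derived type only when the instruction really is a SEND. */
   return inst->kind == KIND_SEND ?
          static_cast<send_inst_sketch *>(inst) : nullptr;
}
```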

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:01 +00:00
Caio Oliveira
b27f6621ae brw: Add initial support for different instruction kinds
Prepare code for supporting subclasses of brw_inst for certain
specialized kinds of instructions.  This will allow

- Move certain fields from brw_inst to the specialized one,
  reducing its size and making it easier to understand what applies
  to which instruction;
- Move certain control sources into the specialized inst type; these
  currently take a full brw_reg to encode small integers.  Reducing
  the overall number of sources we walk over and care about may also
  help the code in general.

Next commits will add the new instruction kinds.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:01 +00:00
Caio Oliveira
339a4e8680 brw: Remove the extra function call when lowering samplers
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:00 +00:00
Caio Oliveira
71c23c6722 brw: Add brw_builder::URB_READ and URB_WRITE helpers
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:00 +00:00
Caio Oliveira
f92116832f brw: Add brw_builder::SEND() helper
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:59 +00:00
Caio Oliveira
e194909b3f brw: Add and use brw_transform_inst()
The new function takes care of changing an instruction's opcode and
sources, which will allow later patches to tweak how allocations are
done in those cases.  Like the instruction allocation, this also takes
a shader (or a builder, from which to get a shader).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:59 +00:00
Caio Oliveira
5d0160a87f brw: Pass brw_shader in fold_instruction
Will be used later for the general instruction transforming
function.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:58 +00:00
Caio Oliveira
8f16cac492 brw: Allow emit instruction with only number of sources
The emit helper will allocate the necessary number of sources but
let the caller fill them in.

Change a couple of places to take advantage of that.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:58 +00:00
Caio Oliveira
3ef86a8d00 brw: Let the builder fill the sources of brw_inst
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:58 +00:00
Caio Oliveira
506fce20f0 brw: Bundle the allocation of brw_inst and its sources
Flatten all the work being done into brw_new_inst() and
brw_clone_inst() and allocate both the instruction and
the sources in one swoop.

For now we still keep a pointer to the array instead of declaring
the array as the last struct member, so the array can still be
grown -- which the compiler relies on in a few places.

This commit removes the constructors for brw_inst; the idea
is that instructions are managed by the brw_shader, so we
always go through it to create new ones.
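As a rough illustration of the scheme described above -- one allocation holding both the instruction and its source array, with the sources still reachable through a pointer so the array can be replaced by a bigger one later -- here is a hedged sketch; the names and layout are assumptions, not the actual brw_new_inst() implementation.

```cpp
#include <cstdlib>

/* Illustrative stand-ins; the real types are brw_inst/brw_reg in Mesa. */
struct reg_sketch { unsigned file, nr; };

struct inst_sketch {
   unsigned opcode;
   unsigned num_srcs;
   reg_sketch *src;   /* a pointer rather than a trailing array member,
                       * so the array can later be swapped for a larger one */
};

/* Allocate the instruction and its sources in a single block, placing
 * the source array immediately after the instruction. */
static inst_sketch *
new_inst_sketch(unsigned opcode, unsigned num_srcs)
{
   void *mem = calloc(1, sizeof(inst_sketch) + num_srcs * sizeof(reg_sketch));
   inst_sketch *inst = static_cast<inst_sketch *>(mem);
   inst->opcode = opcode;
   inst->num_srcs = num_srcs;
   inst->src = reinterpret_cast<reg_sketch *>(inst + 1);
   return inst;
}
```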

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:57 +00:00
Caio Oliveira
c81c8c917f brw: Remove builtin sources from brw_inst
A later patch will add a different mechanism to achieve the same
goal.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:57 +00:00
Caio Oliveira
858162a2fc brw: Allocate brw_inst::src with ralloc
In the few cases where we have to _increase_ the number of sources,
the new code will not attempt to reclaim the memory, i.e. it delays
freeing the old, smaller source array.  For the instructions that may
need this (when turning a SEND into a SEND_GATHER), this is not
expected to happen more than once.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:56 +00:00
Caio Oliveira
29c12bbebf brw: Centralize brw_inst allocation
Add and use brw_new_inst() and brw_clone_inst(), and stop using
stack-allocated brw_insts.  The builder was changed to not use the
temporary ones either.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:56 +00:00
Caio Oliveira
c90ec6d7e7 brw: Use uint16_t for size_written
UINT16_MAX is larger than the maximum number of bytes in the
general register file: 256 GRFs * 16 slots * 4 bytes = 16384.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:55 +00:00
Kenneth Graunke
6281a12822 brw: Remove brw_inst::no_dd_check/no_dd_clear
These dependency hints were primarily useful for the vec4 backend, where
it was common to write subsets of a vec4's components across multiple
instructions.  In the scalar backend, we rarely used them.  They also no
longer exist on Tigerlake and later in favor of software scoreboarding.

Dropping this allows us to clean up the IR a bit.

We still use the hardware hints in the generator in a couple places:

   - Gfx9-12.0 scratch headers
   - Quad swizzles
   - Indirect MOV lowering

In theory we might want them back if we moved that lowering to the IR.
For scratch at least, I suspect it won't have a huge impact, as we're
already incurring the cost of spills/fills.  The others are fairly rare
as well, so it may not be worth keeping.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:55 +00:00
Caio Oliveira
03e9c01f0c brw: Add and use more brw_validate.cpp macros
Add and use more comparison variants (which provide a more detailed
printout of the values), remove old references to "fsv" and "scalar",
and use assertion names closer to the GoogleTest ones we already use
elsewhere.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37267>
2025-09-10 17:44:38 -07:00
Dylan Baker
f18aca8689 intel/brw: Fix implementation of |= operator for enum
The current implementation does nothing, since it has no side effects,
only a return value.  By passing `x` by reference we can mutate the
value before returning it.
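A minimal sketch of the kind of fix described above; the enum name and values here are hypothetical stand-ins, not the actual type from the brw code.

```cpp
/* Hypothetical bitmask enum standing in for the real one in brw. */
enum analysis_dep : unsigned {
   DEP_NONE         = 0,
   DEP_INSTRUCTIONS = 1 << 0,
   DEP_VARIABLES    = 1 << 1,
};

/* Broken form: operates on a local copy, so `x |= y;` is a no-op.
 *
 *   inline analysis_dep operator|=(analysis_dep x, analysis_dep y)
 *   { return (analysis_dep)((unsigned)x | (unsigned)y); }
 */

/* Fixed form: take the left operand by reference so the caller's value
 * is actually updated, matching the built-in |= semantics. */
inline analysis_dep &operator|=(analysis_dep &x, analysis_dep y)
{
   x = static_cast<analysis_dep>(static_cast<unsigned>(x) |
                                 static_cast<unsigned>(y));
   return x;
}
```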

Fixes: df37c7ca74 ("brw: fix analysis dirtying with pulled constants")
CID: 1665293
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37263>
2025-09-10 16:30:19 +00:00
Lionel Landwerlin
33d2c31d7a brw: don't use brw_null_reg() for unused SEND sources
Just avoiding the validation assert.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 47fe9d28e7 ("brw: Enumerate SHADER_OPCODE_SEND sources and standardize how many")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13777
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37112>
2025-09-10 09:08:27 +03:00
Francisco Jerez
5bf7bb5cf9 intel/brw/xe3+: Re-enable static analysis-based SIMD32 FS heuristic for the moment.
This disables, for now, the "optimistic" SIMD heuristic that was
implemented for xe3+ and makes it dependent on a debugging option;
instead, use the static analysis-based codepath that was used on
previous generations and was extended by previous commits in this MR
to model the xe3 trade-off between register use and thread
parallelism.

The reason is that the main assumption of the optimistic SIMD
heuristic didn't hold up in reality: real-world testing on PTL shows
that there are many cases where SIMD32 shows performance degradation
relative to SIMD16 despite the ability of xe3 hardware to scale the
GRF file of a thread on demand; unfortunately, that scenario seems to
be more pervasive than was hoped when the optimistic SIMD heuristic
was implemented pre-silicon.

In many cases what seems to be going on is that even when the register
file is able to scale with the increased register use of SIMD32, the
thread parallelism of the EU is scaled down by a similar factor, so at
the bottom line SIMD32 (depending on the actual ratio of register use
between both variants) may not buy us anything, and it frequently
encounters constraints (like SIMD lowering and less effective
scheduling) that lead to worse codegen than SIMD16, easily tipping the
balance in favor of SIMD16.  The extension of the performance analysis
pass that was done in a previous commit allows the original SIMD32
heuristic to quantitatively take this effect into account, and that
seems pretty effective at disabling SIMD32 shaders that underperform,
judging from the statistically significant improvement of most Traci
test-cases that run on my PTL system (4 iterations, 5% significance);
no statistically significant regressions were observed:

Nba2K23-trace-dx11-2160p-ultra:                    10.16% ±0.34%
Superposition-trace-dx11-2160p-extreme:             4.06% ±0.50%
TotalWarWarhammer3-trace-dx11-1080p-high:           3.52% ±0.76%
Payday3-trace-dx11-1440p-ultra:                     2.41% ±0.81%
MetroExodus-trace-dx11-2160p-ultra:                 2.28% ±0.78%
Borderlands3-trace-dx11-2160p-ultra:                1.89% ±0.65%
MountAndBlade2-trace-dx11-1440p-veryhigh:           1.81% ±0.40%
Blackops3-trace-dx11-1080p-high:                    1.66% ±0.29%
HogwartsLegacy-trace-dx12-1080p-ultra:              1.53% ±0.22%
TotalWarPharaoh-trace-dx11-1440p-ultra:             1.44% ±0.31%
Fortnite-trace-dx11-2160p-epix:                     1.44% ±0.27%
Naraka-trace-dx11-1440p-highest:                    1.39% ±0.27%
PubG-trace-dx11-1440p-ultra:                        1.30% ±0.49%
Destiny2-trace-dx11-1440p-highest:                  1.10% ±0.23%
Factorio-trace-1080p-high:                          1.10% ±1.77%
TerminatorResistance-trace-dx11-2160p-ultra:        1.08% ±0.31%
Ghostrunner2-trace-dx11-1440p-ultra:                1.05% ±0.15%
ShadowTombRaider-trace-dx11-2160p-ultra:            0.98% ±0.19%
CitiesSkylines2-trace-dx11-1440p-high:              0.67% ±0.19%
Palworld-trace-dx11-1080p-med:                      0.44% ±0.22%

The downside is that this will reverse the large reduction in
compile-time we gained from the optimistic SIMD heuristic -- the
run-time of both shader-db and fossil-db jumps back up by nearly 20%
with this change.  I'm working on a better compromise based on
run-time feedback that will hopefully allow us to preserve the
compile-time benefit of the optimistic heuristic without the reduction
in run-time performance, but in the meantime it seems like the
run-time performance gap from SIMD32 is the more urgent issue to
address since it has an impact on titles across the board.  Despite
the reversal of that compile-time improvement, xe3 still achieves
slightly lower compile time on average than previous generations
as a result of VRT, so this doesn't seem terribly tragic.

v2: Add bit to brw_get_compiler_config_value() (Lionel).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:58 +00:00
Francisco Jerez
a7969b5d42 intel/brw: Apply 7e1362e9c0 to pre-xe3 codepath of brw_compile_fs().
This applies the same workaround as 7e1362e9c0 to the pre-xe3
codepath of brw_compile_fs(), since ray queries appear to be
unsupported in SIMD32 fragment shaders.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:58 +00:00
Francisco Jerez
531a34c7dd intel/brw/xe3+: Select scheduler heuristic with best trade-off between register pressure and latency.
The current register allocation loop attempts to use a sequence of
pre-RA scheduling heuristics until register allocation is successful.
The sequence of scheduling heuristics is expected to be increasingly
aggressive at reducing the register pressure of the program (at a
performance cost), so that the instruction ordering chosen gives the
lowest latency achievable with the register space available.

Unfortunately that approach doesn't consistently give the best
performance on xe3+, since on recent platforms a schedule with higher
latency may actually give better performance if its lower register
pressure allows the use of a smaller number of VRT register blocks,
which lets the EU run more threads in parallel.

This means that on xe3+ the scheduling mode with highest performance
is fundamentally dependent on the specific scenario (in particular
where in the thread count-register use curve the program is at, and
how effective the scheduler heuristics are at reducing latency for
each additional block of GRFs used), so it isn't possible to construct
a fixed sequence of the existing heuristics guaranteed to be ordered
by decreasing performance.  In order to find the scheduling heuristic
with the best performance we have to run several of them prior to
register allocation and do some arithmetic to account for the effect
on parallelism of the register pressure estimated in each case, in
order to decide which heuristic will give the best performance.

This sounds costly, but it is similar to the approach taken by
brw_allocate_registers() when it is unable to allocate without spills
and has to decide which scheduling heuristic to use to minimize the
number of spills.  In cases where that happens on xe3+, the
scheduling runs introduced here don't add to the scheduling runs done
to find the heuristic with minimum register pressure; we attempt to
determine the heuristic with lowest pressure and best performance in
the same loop, and then use one or the other depending on whether
register allocation succeeds without spills.

Significantly improves performance on PTL of the following Traci test
cases (4 iterations, 5% significance):

Nba2K23-trace-dx11-2160p-ultra:                     4.48% ±0.38%
Fortnite-trace-dx11-2160p-epix:                     1.61% ±0.28%
Superposition-trace-dx11-2160p-extreme:             1.37% ±0.26%
PubG-trace-dx11-1440p-ultra:                        1.15% ±0.29%
GtaV-trace-dx11-2160p-ultra:                        0.80% ±0.24%
CitiesSkylines2-trace-dx11-1440p-high:              0.68% ±0.19%
SpaceEngineers-trace-dx11-2160p-high:               0.65% ±0.34%

The compile-time cost of shader-db increases significantly by 3.7%
after this commit (15 iterations, 5% significance), the compile-time
of fossil-db doesn't change significantly in my setup.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
0e802cecba intel/brw: Make sure we don't use stale analysis after inst. order restore in brw_allocate_registers().
Call invalidate_analysis() from restore_instruction_order() to make sure
we don't re-use stale analysis pass results if the user forgets to
call invalidate_analysis() explicitly.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
dfc2a89d96 intel/brw: Allow using performance analysis pass pre-register allocation.
Mainly this involves changing 'struct state' so that the dep_ready
array is allocated with a dynamic size based on the number of VGRFs of
the program instead of assuming a fixed XE3_MAX_GRF count of GRF
dependencies.  VGRF register dependencies are then handled by using
one dep_ready entry per VGRF allocation instead of one per hardware
register.

The ability to use the performance analysis pass pre-regalloc will
mostly be useful on xe3+, but this also has the side effect of saving
some memory on xe2 and earlier platforms since we no longer need to
allocate XE3_MAX_GRF dep_ready entries for them.
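A hedged sketch of the sizing change described above; apart from the dep_ready / XE3_MAX_GRF names taken from the commit message, all types and field names below are made up for illustration.

```cpp
#include <cstdlib>

/* Before (sketch): a fixed-size array, one entry per hardware GRF.
 *
 *   struct state { int dep_ready[XE3_MAX_GRF]; ... };
 */

/* After (sketch): dynamically sized, one entry per VGRF allocation, so
 * the pass can run before register allocation and uses less memory when
 * the program has fewer VGRFs than the hardware register count. */
struct state_sketch {
   unsigned num_vgrfs;
   int *dep_ready;
};

static void
init_state_sketch(state_sketch *st, unsigned num_vgrfs)
{
   st->num_vgrfs = num_vgrfs;
   st->dep_ready = static_cast<int *>(calloc(num_vgrfs, sizeof(int)));
}
```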

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
3936a43496 intel/brw/xe3+: Tweak render target write timings in performance modeling pass.
Reduce the cycle-count cost estimate used by the performance model for
render target writes on xe3+ in order to match the real-world
observation of shaders whose latency is lower than the previously
estimated cost of their render target writes.

In a shader used by Factorio this would have led us to incorrectly
model the shader as fillrate-bound, even though in reality the shader
is EU-bound and benefits from the higher parallelism of SIMD32, so the
subsequent commit that re-enables the static analysis-based SIMD32
heuristic on PTL would lead to a ~2% regression without this tweak.

There appear to be no other regressions nor other changes from this in
combination with the subsequent commit that enables it to have an
effect, but it is possible that the real cycle-count cost of a render
target write still lies below the estimated value; ~400 is just the
upper bound that can be inferred from the behavior of this test case.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
6ccf2a375a intel/brw/xe3+: Adjust weights of discard control flow for non-EU-fused platforms.
Until now, on platforms without EU fusion (all platforms other than
gfx12.x) we were using a constant discard_weight = 1.0 regardless of
SIMD width.  This was far from ideal, in particular since it made the
performance analysis pass fully insensitive to the presence of discard
jumps, even though the scheduler is able to move code past a discard
statement so the range of the program under discard control flow can
vary and have a material effect on the relative performance of SIMD16
vs. SIMD32, since the scheduler is typically more constrained in
SIMD32 dispatch mode.

In order to fix this, use a discard_weight lower than 1.0 for all
dispatch modes, so that the performance analysis pass accounts for the
presence and range of discard control flow.  In addition use a lower
discard_weight for SIMD16 dispatch like we do on Gfx12.x in order to
account for the higher likelihood of divergent discard in SIMD32 mode.

The specific weights were determined iteratively on PTL based on the
final FPS result of several traces that are sensitive to the dispatch
width of one or more fragment shaders that use discard, in order to
ensure that in none of those cases we end up using the
lower-performing dispatch width variant.  This avoids regressions
between 3.7% and 0.8% in Superposition-trace-dx11-2160p-extreme,
BaldursGate3-trace-dx11-1440p-ultra and
MetroExodus-trace-dx11-2160p-ultra after enabling the static
analysis-based SIMD32 heuristic in PTL.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)

v2: Limit to xe3+ for now since performance effect seems to be a wash
    on xe2.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez
1272ff5ed1 intel/brw/xehp+: Adjust performance model weights of LSC atomic ops.
The LSC implements several optimizations for atomic operations on
memory addresses that are uniform across all lanes, in which case the
cost is approximately O(1) instead of O(exec_size).  Even cases where
memory offsets are non-uniform but packed in a cacheline appear to
have a cost that is non-linear with the number of lanes.

In order to model this behavior more closely, approximate its
back-end cost as roughly 1300 cycles instead of the previous 400 *
exec_size/8.  This fixes some cases where we were incorrectly
predicting that the SIMD32 shader would be bound by the throughput of
LSC atomic operations, even though the observed cost per lane of the
LSC operations was significantly lower in SIMD32 mode, so it would
have had the best performance.
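For a concrete comparison of the two estimates quoted above (only the 400 * exec_size/8 formula and the ~1300-cycle constant come from the commit message; the rest is plain arithmetic):

```cpp
#include <cstdio>
#include <initializer_list>

int main()
{
   const unsigned old_per_8_lanes = 400;  /* previous per-8-lane cost */
   const unsigned new_constant = 1300;    /* new flat estimate */

   for (unsigned exec_size : {8u, 16u, 32u}) {
      unsigned old_cost = old_per_8_lanes * exec_size / 8;
      std::printf("SIMD%-2u  old: %4u cycles   new: %4u cycles\n",
                  exec_size, old_cost, new_constant);
   }
   /* SIMD8:  400 vs 1300, SIMD16: 800 vs 1300, SIMD32: 1600 vs 1300 --
    * under the new model SIMD32 atomics are no longer over-penalized. */
   return 0;
}
```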

Clearly this is still a rough approximation, and it might be possible
to obtain a more accurate result by plumbing divergence analysis data
all the way down to codegen.  However, the goal of the performance
analysis pass isn't to provide an exact prediction of the performance
of a shader (that's not really possible in general via static analysis
without solving the halting problem), but to provide a good enough
approximation at a low cost -- and the constant approximation seems to
be strictly better in practice than the approximation we were using
before: there appear to be no regressions from this change, and
ShadowTombRaider-trace-dx11-2160p-ultra shows 5.7% better performance
on PTL with a subsequent commit that re-enables the use of the static
analysis-based SIMD32 heuristic on xe3+.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:56 +00:00
Francisco Jerez
6eea9659db intel/brw/xe3+: Model trade-off between parallelism and GRF use in performance analysis.
This extends the performance analysis pass used on previous
generations to make it more useful for dealing with the performance
trade-off encountered on xe3 hardware as a result of VRT.  VRT allows
the driver to request a per-thread GRF allocation different from the
128 GRFs that were typical in previous platforms, but this comes at
either a thread parallelism cost or benefit depending on the number of
GRF register blocks requested.

This makes a number of decisions more difficult for the compiler since
certain optimizations potentially trade off run-time in a thread
against the total number of threads that can run in parallel
(e.g. consider scheduling and how reordering an instruction to avoid a
stall can increase GRF use and therefore reduce thread-level
parallelism when trying to improve instruction-level parallelism).

This patch provides a simple heuristic tool to account for the
combined interaction of register pressure and other single-threaded
factors that affect performance.  This is expressed with the
redefinition of the pre-existing brw_performance::throughput estimate
as the number of invocations per cycle per EU that would be achieved
if there were enough threads to reach full load (in this sense this is
to be considered a heuristic since the penalty from VRT may be lower
than expected from this model at low EU load).

This will be used e.g. in order to decide whether to use a more
aggressive latency-minimizing mode during scheduling or a mode more
effective at minimizing register pressure (it makes sense to take the
path that will lead to the most invocations being serviced per cycle
while under load).  This also allows us to re-enable the old PS SIMD32
heuristic on xe3+, and due to this change it is able to identify cases
where the combined effect of poorer scheduling and higher GRF use of
the SIMD32 variant makes it more favorable to use SIMD16 only (see
last patch of the MR for details and numbers).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:56 +00:00
Francisco Jerez
760437c4c4 intel/brw/xe3+: Override P value of GRF register classes to increase thread parallelism.
This causes the graph coloring allocator to use the optimistic
coloring codepath for all nodes whose total Q value exceeds the
threshold of 96 GRFs, in order to do a better job at minimizing the
register requirement of programs even when they are trivially
colorable.  At the threshold of 96 GRFs the number of threads
available per EU starts decreasing as the number of register blocks
requested by the program increases, so decreasing the number of
registers can increase performance.

That showed up in some test cases as a performance inversion from the
enabling of VRT, since the extension of the register set to 256 GRFs
has the side effect of making some non-trivially colorable programs
trivially colorable, which would cause the register allocator to do a
worse job at ordering the (trivial) allocations due to the optimistic
coloring path being skipped, leading to increased register use and
reduced performance.

The following Traci test cases improve significantly as a result of
this change (4 iterations, 5% significance):

MetroExodus-trace-dx11-2160p-ultra:                 1.90% ±0.85%
BaldursGate3-trace-dx11-1440p-ultra:                1.47% ±0.38%
Palworld-trace-dx11-1080p-med:                      1.01% ±0.09%
TerminatorResistance-trace-dx11-2160p-ultra:        0.95% ±0.29%
Control-trace-dx11-1440p-high:                      0.87% ±0.50%

Even though lowering the P value threshold is theoretically expected
to have a compile-time cost due to the increased use of the slower
optimistic path of the graph coloring allocator, this doesn't actually
show up in my measurements: my shader-db and fossil-db compile-time
numbers don't show any statistically significant change (13 iterations,
5% significance).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:55 +00:00
Francisco Jerez
35ac517780 intel/brw/xe3+: Define BRW_SCHEDULE_PRE_LATENCY scheduling mode.
This defines a new pre-RA scheduling mode similar to BRW_SCHEDULE_PRE
but more aggressive at optimizing for minimum latency rather than
minimum register usage.  The main motivation is that on recent xe3
platforms we use a register allocation heuristic that packs variables
more tightly at the bottom of the register file instead of the
round-robin heuristic we used on previous platforms, since as a result
of VRT there is a parallelism penalty when a program uses more GRF
registers than necessary.  Unfortunately the xe3 tight-packing
heuristic severely constrains the work of the post-RA scheduler due to
the false dependencies introduced during register allocation, so we
can do a better job by making the scheduler aware of instruction
latencies before the register allocator introduces any false
dependencies.

This can lead to higher register pressure, but only when the scheduler
decides it could save cycles by extending a live range.  It makes
sense to preserve the preexisting BRW_SCHEDULE_PRE as a separate mode
since some workloads can still benefit from neglecting latencies
pre-RA due to the aforementioned trade-off between parallelism and
GRF use; a future commit will introduce a more accurate estimate of
the expected relative performance of BRW_SCHEDULE_PRE
vs. BRW_SCHEDULE_PRE_LATENCY taking this trade-off into account.

In theory this could also be helpful on earlier pre-xe3 platforms, but
the benefit should be significantly smaller due to the different RA
heuristic, so it hasn't been tested extensively pre-xe3.

The following Traci tests are improved significantly by this change on
PTL (nearly all tests that run on my system are affected positively):

Ghostrunner2-trace-dx11-1440p-ultra:                7.12% ±0.36%
SpaceEngineers-trace-dx11-2160p-high:               5.77% ±0.43%
HogwartsLegacy-trace-dx12-1080p-ultra:              4.40% ±0.03%
Naraka-trace-dx11-1440p-highest:                    3.06% ±0.43%
MetroExodus-trace-dx11-2160p-ultra:                 2.26% ±0.60%
Fortnite-trace-dx11-2160p-epix:                     2.12% ±0.53%
Nba2K23-trace-dx11-2160p-ultra:                     1.98% ±0.30%
Control-trace-dx11-1440p-high:                      1.93% ±0.36%
GodOfWar-trace-dx11-2160p-ultra:                    1.62% ±0.47%
TotalWarPharaoh-trace-dx11-1440p-ultra:             1.55% ±0.18%
MountAndBlade2-trace-dx11-1440p-veryhigh:           1.51% ±0.37%
Destiny2-trace-dx11-1440p-highest:                  1.44% ±0.34%
GtaV-trace-dx11-2160p-ultra:                        1.26% ±0.27%
ShadowTombRaider-trace-dx11-2160p-ultra:            1.10% ±0.58%
Borderlands3-trace-dx11-2160p-ultra:                0.95% ±0.43%
TerminatorResistance-trace-dx11-2160p-ultra:        0.87% ±0.22%
BaldursGate3-trace-dx11-1440p-ultra:                0.84% ±0.28%
CitiesSkylines2-trace-dx11-1440p-high:              0.82% ±0.22%
PubG-trace-dx11-1440p-ultra:                        0.72% ±0.37%
Palworld-trace-dx11-1080p-med:                      0.71% ±0.26%
Superposition-trace-dx11-2160p-extreme:             0.69% ±0.19%

The compile-time cost of shader-db increases significantly by 1.85%
after this commit (14 iterations, 5% significance), the compile-time
of fossil-db doesn't change significantly in my setup.

v2: Addressed interaction with 81594d0db1,
    since the code that calculates deps, delays and exits is no longer
    mode-independent after this change.  Instead of reverting that
    commit (which is non-trivial and would have a greater compile-time
    hit) simply reconstruct the scheduler object during the transition
    between BRW_SCHEDULE_PRE_LATENCY and any other PRE mode that
    doesn't require instruction latencies.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:55 +00:00
Francisco Jerez
501b1cbc2c intel/brw: Fix behavior of scheduler around flag register writes.
We were treating explicit flag writes and reads as a full scheduler
barrier, which is unnecessary since the tracking we already do handles
explicit flag access correctly, so there is no reason to take a
possibly large performance hit from add_barrier_deps().

Found by inspection while trying to understand the poor scheduling of
some fragment shaders.  Improves performance by a small but
statistically significant amount (4 iterations, 5% significance) for
the following Traci tests in combination with a subsequent commit that
makes the pre-RA scheduler sensitive to instruction latencies:

SpaceEngineers-trace-dx11-2160p-high:               0.66% ±0.30%
MountAndBlade2-trace-dx11-1440p-veryhigh:           0.62% ±0.23%

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:55 +00:00
Francisco Jerez
17b068ed1c intel/brw/xe3+: Handle SENDG in instruction scheduler.
We weren't handling the SHADER_OPCODE_SEND_GATHER instruction in the
instruction scheduler and this was leading to reduced performance in
many programs since SEND instructions have the longest latency and
tend to be among the most critical to schedule efficiently.  Handle
SENDG similarly to SEND, since the timings of both instructions are
mostly bound by the shared function, which doesn't care whether the
message was sent by SEND or SENDG.

Improves performance significantly in the following Traci traces (4
iterations, 5% significance), most of them regressions from SENDG
being enabled:

MetroExodus-trace-dx11-2160p-ultra:                 1.99% ±0.88%
HogwartsLegacy-trace-dx12-1080p-ultra:              1.33% ±0.20%
GtaV-trace-dx11-2160p-ultra:                        1.12% ±0.19%
Borderlands3-trace-dx11-2160p-ultra:                1.00% ±0.58%
TerminatorResistance-trace-dx11-2160p-ultra:        0.98% ±0.27%
Control-trace-dx11-1440p-high:                      0.91% ±0.36%
Naraka-trace-dx11-1440p-highest:                    0.90% ±0.30%
Ghostrunner2-trace-dx11-1440p-ultra:                0.87% ±0.38%
Palworld-trace-dx11-1080p-med:                      0.71% ±0.17%

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:54 +00:00
Caio Oliveira
67fcfed67b brw: Add FILE * parameter to dump_assembly
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37259>
2025-09-09 10:40:42 -07:00
Mel Henning
17876a00af nir: Add a faster lowest common ancestor algorithm
On a fossil from the blender 4.5.0 vulkan backend, this improves compile
times in nak by about 17%. Compile time of other shaders improves by a
more modest 1.2%.

No stat changes on shader-db.
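For readers unfamiliar with the operation being sped up: a lowest common ancestor query on a dominator tree is commonly used when deciding where to sink instructions (e.g. in nir_opt_sink).  The sketch below shows only the classic depth-based LCA walk for background; it is not the faster algorithm this commit introduces, and the node type is made up.

```cpp
/* Generic depth-based LCA on a tree with parent pointers -- shown only
 * as background for what "lowest common ancestor" means here, not the
 * optimized algorithm added by this commit. */
struct tree_node {
   tree_node *parent;   /* e.g. immediate dominator in a dominator tree */
   unsigned depth;      /* root has depth 0 */
};

static tree_node *
lowest_common_ancestor(tree_node *a, tree_node *b)
{
   /* Walk the deeper node up until both are at the same depth... */
   while (a->depth > b->depth) a = a->parent;
   while (b->depth > a->depth) b = b->parent;
   /* ...then walk both up in lock-step until they meet. */
   while (a != b) {
      a = a->parent;
      b = b->parent;
   }
   return a;
}
```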

Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36184>
2025-09-08 23:03:13 +00:00
Caio Oliveira
f37c9c873c brw: Fix printing of blocks in disassembly when BRW is available
When disassembling while the BRW IR is available (which happens in the
generator), there will be pointers to BRW's basic block structures
that are used to print the block numbers and predecessors/successors
in the output.

There are two challenges:

- Because DO and FLOW instructions are not real instructions, they are
  not emitted in the output but would still cause the output to contain
  empty blocks.  Previous code accounted for DO but still had problems.

- DO blocks have special physical links that don't make sense when the
  DO is not emitted at the end, but they would be shown even if that
  block was omitted.

These issues can be seen here (edited to remove non-essential bits):

```
   START B0 (2 cycles)
mov(8)          g126<1>UD       0x3f800000UD
   END B0 ->B1
   START B2 <-B1 <-B4 (0 cycles)
   END B2 ->B3
   START B3 <-B2 (260 cycles)

LABEL1:
mov(8)          g1<1>D          0D
cmp.ge.f0.0(8)  null<1>D        g2<0,1,0>D      10D
sync nop(1)                     null<0,1,0>UB
send(1)         g0UD            g1UD            nullUD
(+f0.0) break(8) JIP:  LABEL0         UIP:  LABEL0
   END B3 ->B1 ->B5 ->B4
   START B4 <-B3 (1000 cycles)
sync nop(1)                     null<0,1,0>UB
mov(8)          g126<1>UD       g0<0,1,0>UD

LABEL0:
while(8)        JIP:  LABEL1
   END B4 ->B2
   START B5 <-B1 <-B3 (20 cycles)
```

For example:
- Block 1 is missing (a skipped DO block)
- Block 2 is empty (it was a FLOW block)
- Block 3 ends with a link to Block 1 (the special links involving DO
  blocks).

A few key changes were made to fix this.  First, skip the DO and FLOW
blocks completely; the use_tail ensures that the instruction group is
reused to avoid empty blocks.  Second, when printing the successors and
predecessors, walk through the skipped blocks.  And finally, don't print
the special blocks.

With the fix, here's the output.  Note the blocks retain their original
BRW IR number.

```
   START B0 (2 cycles)
mov(8)          g127<1>UD       0x3f800000UD
   END B0 ->B3
   START B3 <-B0 <-B4 (260 cycles)

LABEL1:
mov(8)          g1<1>D          0D
cmp.ge.f0.0(8)  null<1>D        g2<0,1,0>D      10D
sync nop(1)                     null<0,1,0>UB
send(1)         g0UD            g1UD            nullUD
(+f0.0) break(8) JIP:  LABEL0         UIP:  LABEL0
   END B3 ->B5 ->B4
   START B4 <-B3 (1000 cycles)
sync nop(1)                     null<0,1,0>UB
mov(8)          g127<1>UD       g0<0,1,0>UD

LABEL0:
while(8)        JIP:  LABEL1
   END B4 ->B3
   START B5 <-B3 (20 cycles)
```

Issue was spotted by Ken.

Fixes: d2c39b1779 ("intel/brw: Always have a (non-DO) block after a DO in the CFG")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36226>
2025-09-06 16:42:05 +00:00
Lionel Landwerlin
a91e0e0d61 brw: add support for separate tessellation shader compilation
Tessellation factors have to be written dynamically (based on the next
shader's primitive topology) and the builtins read using a dynamic
offset (based on the preceding shader's VUE).

Anv is updated to use this new infrastructure for dynamic
patch_control_points.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34872>
2025-09-05 07:46:17 +00:00
Lionel Landwerlin
a18835a9ca anv/brw/iris: move VS VUE computation to backend
Drivers can provide the inputs required for the backend to call the
compute function.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34872>
2025-09-05 07:46:16 +00:00
Lionel Landwerlin
8dee4813b0 brw: add ability to compute VUE map for separate tcs/tes
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34872>
2025-09-05 07:46:16 +00:00
Ian Romanick
1ce90ad5e1 elk: Use nir_opt_sink and more nir_opt_move
I spent a bunch of time playing around with the various enable bits, and
this was the best I could come up with.  Enabling either
nir_move_comparisons or nir_move_load_ubo in nir_opt_sink helped
instruction counts quite a bit, but it also caused a large pile of
added spills and fills.

shader-db:

Broadwell
total instructions in shared programs: 18428980 -> 18427957 (<.01%)
instructions in affected programs: 425245 -> 424222 (-0.24%)
helped: 1522 / HURT: 405

total cycles in shared programs: 954756705 -> 953755695 (-0.10%)
cycles in affected programs: 623470486 -> 622469476 (-0.16%)
helped: 17989 / HURT: 21175

total spills in shared programs: 8349 -> 8356 (0.08%)
spills in affected programs: 285 -> 292 (2.46%)
helped: 7 / HURT: 13

total fills in shared programs: 10426 -> 10192 (-2.24%)
fills in affected programs: 675 -> 441 (-34.67%)
helped: 25 / HURT: 1

LOST:   346
GAINED: 554

Haswell
total instructions in shared programs: 16809730 -> 16801634 (-0.05%)
instructions in affected programs: 772251 -> 764155 (-1.05%)
helped: 3055 / HURT: 840

total cycles in shared programs: 945179935 -> 944315696 (-0.09%)
cycles in affected programs: 549177588 -> 548313349 (-0.16%)
helped: 34143 / HURT: 23605

total spills in shared programs: 7699 -> 7666 (-0.43%)
spills in affected programs: 353 -> 320 (-9.35%)
helped: 10 / HURT: 16

total fills in shared programs: 8184 -> 7671 (-6.27%)
fills in affected programs: 1006 -> 493 (-50.99%)
helped: 30 / HURT: 2

total sends in shared programs: 1016676 -> 1016682 (<.01%)
sends in affected programs: 49 -> 55 (12.24%)
helped: 0 / HURT: 6

LOST:   415
GAINED: 441

Ivy Bridge
total instructions in shared programs: 15764955 -> 15757178 (-0.05%)
instructions in affected programs: 707453 -> 699676 (-1.10%)
helped: 2893 / HURT: 547

total cycles in shared programs: 430017934 -> 429720104 (-0.07%)
cycles in affected programs: 251816726 -> 251518896 (-0.12%)
helped: 33110 / HURT: 22056

total spills in shared programs: 1537 -> 1525 (-0.78%)
spills in affected programs: 18 -> 6 (-66.67%)
helped: 6 / HURT: 0

total fills in shared programs: 926 -> 905 (-2.27%)
fills in affected programs: 24 -> 3 (-87.50%)
helped: 6 / HURT: 0

total sends in shared programs: 816646 -> 816652 (<.01%)
sends in affected programs: 49 -> 55 (12.24%)
helped: 0 / HURT: 6

LOST:   332
GAINED: 417

Sandy Bridge
total instructions in shared programs: 14055229 -> 14045281 (-0.07%)
instructions in affected programs: 1436142 -> 1426194 (-0.69%)
helped: 5858 / HURT: 757

total cycles in shared programs: 772123170 -> 813543451 (5.36%)
cycles in affected programs: 521342483 -> 562762764 (7.94%)
helped: 27928 / HURT: 35923

total spills in shared programs: 1742 -> 1741 (-0.06%)
spills in affected programs: 66 -> 65 (-1.52%)
helped: 1 / HURT: 0

total fills in shared programs: 970 -> 967 (-0.31%)
fills in affected programs: 93 -> 90 (-3.23%)
helped: 1 / HURT: 0

total sends in shared programs: 1239222 -> 1238992 (-0.02%)
sends in affected programs: 6137 -> 5907 (-3.75%)
helped: 342 / HURT: 112

LOST:   244
GAINED: 434

Iron Lake and GM45 had similar results. (Iron Lake shown)
total instructions in shared programs: 8366385 -> 8363954 (-0.03%)
instructions in affected programs: 162761 -> 160330 (-1.49%)
helped: 600 / HURT: 195

total cycles in shared programs: 248992618 -> 252119334 (1.26%)
cycles in affected programs: 50774708 -> 53901424 (6.16%)
helped: 3435 / HURT: 5131

total sends in shared programs: 623693 -> 623681 (<.01%)
sends in affected programs: 351 -> 339 (-3.42%)
helped: 12 / HURT: 0

LOST: 0
GAINED: 6

Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25463>
2025-09-04 15:01:18 -07:00
Ian Romanick
6f30cf71fe brw: Use nir_opt_sink and more nir_opt_move
The shader-db results on most platforms are pretty mixed. However, this
seems to be a decent improvement in fossil-db.

shader-db:

Lunar Lake
total instructions in shared programs: 17019147 -> 17023017 (0.02%)
instructions in affected programs: 1200847 -> 1204717 (0.32%)
helped: 814 / HURT: 2458

total cycles in shared programs: 880532116 -> 880406462 (-0.01%)
cycles in affected programs: 798253846 -> 798128192 (-0.02%)
helped: 30064 / HURT: 33008

total spills in shared programs: 3262 -> 3260 (-0.06%)
spills in affected programs: 66 -> 64 (-3.03%)
helped: 1 / HURT: 2

total fills in shared programs: 1616 -> 1637 (1.30%)
fills in affected programs: 89 -> 110 (23.60%)
helped: 1 / HURT: 2

LOST:   241
GAINED: 356

Meteor Lake, DG2, and Tiger Lake had similar results. (Meteor Lake shown)
total instructions in shared programs: 19859724 -> 19865383 (0.03%)
instructions in affected programs: 2166810 -> 2172469 (0.26%)
helped: 942 / HURT: 3563

total cycles in shared programs: 879095859 -> 878616086 (-0.05%)
cycles in affected programs: 753840990 -> 753361217 (-0.06%)
helped: 33442 / HURT: 35053

total spills in shared programs: 4679 -> 4677 (-0.04%)
spills in affected programs: 80 -> 78 (-2.50%)
helped: 1 / HURT: 2

total fills in shared programs: 4113 -> 4175 (1.51%)
fills in affected programs: 87 -> 149 (71.26%)
helped: 1 / HURT: 2

LOST:   706
GAINED: 563

Ice Lake and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 20610947 -> 20615741 (0.02%)
instructions in affected programs: 2138334 -> 2143128 (0.22%)
helped: 979 / HURT: 3635

total cycles in shared programs: 863103771 -> 862153697 (-0.11%)
cycles in affected programs: 731626072 -> 730675998 (-0.13%)
helped: 34060 / HURT: 34256

total spills in shared programs: 3992 -> 3949 (-1.08%)
spills in affected programs: 504 -> 461 (-8.53%)
helped: 8 / HURT: 6

total fills in shared programs: 3640 -> 3573 (-1.84%)
fills in affected programs: 1505 -> 1438 (-4.45%)
helped: 8 / HURT: 5

LOST:   622
GAINED: 1018

fossil-db:

All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 232649299 -> 232485503 (-0.07%); split: -0.16%, +0.09%
Subgroup size: 15932144 -> 15933056 (+0.01%); split: +0.01%, -0.00%
Loop count: 137431 -> 137430 (-0.00%)
Cycle count: 32619860020 -> 32714539770 (+0.29%); split: -0.80%, +1.09%
Spill count: 540835 -> 519861 (-3.88%); split: -4.79%, +0.91%
Fill count: 700278 -> 663650 (-5.23%); split: -6.46%, +1.23%
Scratch Memory Size: 37258240 -> 35654656 (-4.30%); split: -5.24%, +0.94%
Max live registers: 72561256 -> 71501759 (-1.46%); split: -1.62%, +0.16%
Non SSA regs after NIR: 67682385 -> 67692495 (+0.01%); split: -0.00%, +0.02%

Totals from 617432 (78.20% of 789594) affected shaders:
Instrs: 217754449 -> 217590653 (-0.08%); split: -0.17%, +0.10%
Subgroup size: 12656912 -> 12657824 (+0.01%); split: +0.01%, -0.00%
Loop count: 133283 -> 133282 (-0.00%)
Cycle count: 32367979192 -> 32462658942 (+0.29%); split: -0.81%, +1.10%
Spill count: 540770 -> 519796 (-3.88%); split: -4.79%, +0.91%
Fill count: 700277 -> 663649 (-5.23%); split: -6.46%, +1.23%
Scratch Memory Size: 37182464 -> 35578880 (-4.31%); split: -5.25%, +0.94%
Max live registers: 64912683 -> 63853186 (-1.63%); split: -1.81%, +0.18%
Non SSA regs after NIR: 60158776 -> 60168886 (+0.02%); split: -0.00%, +0.02%

Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25463>
2025-09-04 15:01:18 -07:00
Caio Oliveira
4e253184de brw: Run validation as soon as we have the CFG around
Fixes: affa7567c2 ("intel/brw: Add phases to backend")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37148>
2025-09-03 20:42:05 +00:00
Lionel Landwerlin
23a4aef14a Revert "brw: move texture offset packing to NIR"
This reverts commit 4346210ae6.

Fixes: 4346210ae6 ("brw: move texture offset packing to NIR")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37050>
2025-08-29 06:29:14 +00:00
Ian Romanick
49141ad5f2 brw: Strategically place flags initialization to help cmod prop
v2: Rebase on ac2b072312 ("brw: Add more specific brw_builder
helpers"), and fix a bug that caused the new instruction to possibly be
put in the wrong place.

No shader-db changes on any Intel platform.

fossil-db:

All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 233675305 -> 233641585 (-0.01%)
Cycle count: 32593658094 -> 32591467794 (-0.01%); split: -0.01%, +0.00%

Totals from 33513 (4.25% of 789264) affected shaders:
Instrs: 5200332 -> 5166612 (-0.65%)
Cycle count: 1499831128 -> 1497640828 (-0.15%); split: -0.15%, +0.00%

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35444>
2025-08-28 22:08:20 +00:00
Ian Romanick
3018849535 brw: Don't emit redundant flags initialization for subgroup op lowering
No shader-db changes on any Intel platform.

fossil-db:

All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 233676039 -> 233675305 (-0.00%)
Cycle count: 32594097814 -> 32593658094 (-0.00%); split: -0.00%, +0.00%

Totals from 325 (0.04% of 789264) affected shaders:
Instrs: 104491 -> 103757 (-0.70%)
Cycle count: 1183870034 -> 1183430314 (-0.04%); split: -0.04%, +0.00%

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35444>
2025-08-28 22:08:20 +00:00
Ian Romanick
4a238f461d brw: Do cmod prop again after brw_lower_subgroup_ops
shader-db:

All Intel platforms had similar results. (Lunar Lake shown)
total instructions in shared programs: 17114300 -> 17114294 (<.01%)
instructions in affected programs: 3617 -> 3611 (-0.17%)
helped: 6 / HURT: 0

total cycles in shared programs: 886397556 -> 886397454 (<.01%)
cycles in affected programs: 511400 -> 511298 (-0.02%)
helped: 6 / HURT: 0

fossil-db:

Lunar Lake
Totals:
Instrs: 233683694 -> 233676039 (-0.00%); split: -0.00%, +0.00%
Cycle count: 32602038466 -> 32594097814 (-0.02%); split: -0.03%, +0.01%
Spill count: 540908 -> 540704 (-0.04%)
Fill count: 700935 -> 700258 (-0.10%)

Totals from 2200 (0.28% of 789264) affected shaders:
Instrs: 2062360 -> 2054705 (-0.37%); split: -0.37%, +0.00%
Cycle count: 2506073282 -> 2498132630 (-0.32%); split: -0.41%, +0.09%
Spill count: 14423 -> 14219 (-1.41%)
Fill count: 34219 -> 33542 (-1.98%)

Meteor Lake and DG2 had similar results. (Meteor Lake shown)
Totals:
Instrs: 263545171 -> 263543341 (-0.00%); split: -0.00%, +0.00%
Cycle count: 26480835985 -> 26484748317 (+0.01%); split: -0.01%, +0.03%
Spill count: 554335 -> 554338 (+0.00%)
Fill count: 645486 -> 645498 (+0.00%)

Totals from 610 (0.07% of 903944) affected shaders:
Instrs: 1139871 -> 1138041 (-0.16%); split: -0.17%, +0.01%
Cycle count: 2274612327 -> 2278524659 (+0.17%); split: -0.15%, +0.33%
Spill count: 15153 -> 15156 (+0.02%)
Fill count: 36831 -> 36843 (+0.03%)

Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Totals:
Instrs: 268713723 -> 268712817 (-0.00%); split: -0.00%, +0.00%
Cycle count: 24653238085 -> 24652269669 (-0.00%); split: -0.00%, +0.00%
Fill count: 671369 -> 671361 (-0.00%)

Totals from 666 (0.07% of 899711) affected shaders:
Instrs: 924423 -> 923517 (-0.10%); split: -0.11%, +0.01%
Cycle count: 840380565 -> 839412149 (-0.12%); split: -0.13%, +0.02%
Fill count: 13006 -> 12998 (-0.06%)

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35444>
2025-08-28 22:08:20 +00:00