Commit graph

4222 commits

Author SHA1 Message Date
Sagar Ghuge
edcad250ed intel/compiler: Don't use half float param for sample_b
Looks like some of the tests uses the bias which does not fit into half
float parameter, so it's better to use float param for sample_b.

If we have cube arrays, we anyway combine BIAS and array index properly
so we don't have to worry about the first parameter.

This fixes: GTF-GL46.gtf21.GL3Tests.texture_lod_bias.texture_lod_bias_clamp_m_g_M

Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29533>
2024-06-28 03:33:18 +00:00
Dylan Baker
35298e84f1 intel/compiler: move predicated_break out of backend loop
This has no impact on the generated shaders, but does have a small
(positive) impact on the amount of time spent in shader compilation.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29126>
2024-06-27 15:20:19 -07:00
Jordan Justen
7b3149c99b intel/brw: Retype some regs to BRW_TYPE_UD for Xe2 indirect accesses
Following https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28957,
some Xe2 code paths started triggering asserts.

In the cases fixed by this patch, it was because of the assert added
to brw_type_larger_of() in cf8ed9925f ("intel/brw: Make a helper for
finding the largest of two types"), and then brw_type_larger_of() is
used in 674e89953f. (For example, the assert was triggering when the
SHL types differed between D and UD.)

Fixes: 674e89953f ("intel/brw: Use new builder helpers that allocate a VGRF destination")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29925>
2024-06-27 21:51:07 +00:00
Ian Romanick
531461d576 intel/brw: Test corner case CSE of ADD3 instructions
When the destination of both instructions is NULL and the conditional
modifier matches, operands_match (by way of instructions_match) will
only test the first two operands. This can result in bad CSE
happening.

This is a very, very narrow edge case.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29848>
2024-06-27 18:34:53 +00:00
Kenneth Graunke
7adccbd48d intel/brw: Support CSE of ADD3
This one is a bit more complex in that we need to handle 3-source
commutative opcodes.  But it's also quite useful:

fossil-db results on Alchemist (A770):

    Instrs: 151659750 -> 150164959 (-0.99%); split: -0.99%, +0.01%
    Cycles: 12822686329 -> 12574996669 (-1.93%); split: -2.05%, +0.12%
    Subgroup size: 7589608 -> 7589592 (-0.00%)
    Send messages: 7375047 -> 7375053 (+0.00%); split: -0.00%, +0.00%
    Loop count: 46313 -> 46315 (+0.00%); split: -0.01%, +0.01%
    Spill count: 110184 -> 54670 (-50.38%); split: -50.79%, +0.41%
    Fill count: 213724 -> 104802 (-50.96%); split: -51.43%, +0.47%
    Scratch Memory Size: 9406464 -> 3375104 (-64.12%); split: -64.35%, +0.23%

Our older Shadow of the Tomb Raider fossil is particularly helped with
over a 90% reduction in scratch access (spills, fills, and scratch
size).  However, benchmarking in the actual game shows no change in
performance.  We're thinking the game's shaders have been updated since
our capture.

Ian noted that there was a bug here where we'd accidentally CSE two ADD3
instructions with null destinations and different src[2] that couldn't
be dead code eliminated due to conditional mods.  However, this is only
a bug in the new cse_defs pass so we don't need to nominate this for
stable branches.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29848>
2024-06-27 18:34:53 +00:00
Francisco Jerez
79fa3eba11 intel/fs/xe2+: Add ALU-based implementation of barycentric interpolation at a per-channel sample.
This implements a replacement for the previous implementation of
nir_intrinsic_load_barycentric_at_sample that relied on the Pixel
Interpolator shared function, since it's going to be removed from the
hardware from Xe2 onwards.

This implementation simply looks up the X/Y offsets of each sample
index on the table provided in the PS thread payload by using indirect
addressing, then does the actual interpolation by recursing into
emit_pixel_interpolater_alu_at_offset() introduced in the previous
commit.

Note that even though this is only immediately useful on Xe2+
platforms there's no reason why it shouldn't work on earlier
platforms, as long as we have the sample X/Y offsets available in the
thread payload.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
2024-06-27 00:18:00 +00:00
Francisco Jerez
95eec5a0dd intel/fs/xe2+: Add ALU-based implementation of barycentric interpolation at a per-channel offset.
This implements a replacement for the previous implementation of
nir_intrinsic_load_barycentric_at_offset that relied on the Pixel
Interpolator shared function, since it's going to be removed from the
hardware from Xe2 onwards.

That's okay since we can get all the primitive setup information
needed for interpolation at an arbitrary coordinate: We use the X/Y
offset relative to the "X/Y Start" coordinates from the thread payload
order to evaluate the plane equations also provided in the thread
payload for each barycentric coordinate of each polygon.  The
evaluation of the barycentric plane equations (and the RHW plane
equation for perspective-correct interpolation) uses the accumulator
and MAD/MAC for ALU efficiency, but that means we need to manually
split instructions to fit the width of the accumulator.  The division
and scaling for perspective-correct interpolation is also now done in
the shader if necessary.

Note that even though this is only immediately useful on Xe2+, the
thread payload numbers are filled out for older platforms, and the EU
restrictions of previous Xe platforms are taken into account, mostly
for the purposes of testing and performance evaluation.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
2024-06-27 00:18:00 +00:00
Francisco Jerez
e8007c9325 intel/fs/xe2+: Don't lower barycentric load offsets to fixed-point format on Xe2+.
Floating-point offsets work fine in combination with the
floating-point arithmetic we're about to lower these intrinsics into,
and they require less instructions than converting to fixed-point and
then back.  No reason to take the precision/range hit nor the extra
instructions.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
2024-06-27 00:18:00 +00:00
Francisco Jerez
3d30cc82f9 intel/fs/xe2+: Ask driver for PS payload registers based on barycentric load intrinsics in use.
The ALU-based implementation of the barycentric interpolation
intrinsics introduced by a subsequent commit will require some
primitive setup information not delivered in the PS thread payload
unless explicitly requested:

 - "Source Depth and/or W Attribute Vertex Deltas" if a
   perspective-correct interpolation mode is used -- Note that this is
   already requested for CPS interpolation, we just need to enable it
   in more cases.

 - "Perspective Bary Planes" if a perspective-correct interpolation
   mode is used.

 - "Non-Perspective Bary Planes" if a non-perspective-corrected
   interpolation mode is used.

 - "Sample offsets" if any at_sample interpolation is used so the
   coordinate offsets of the sample can be calculated.

This ALU implementation of barycentric interpolation will only be
needed for *_at_offset and *_at_sample interpolation, since the fixed
function hardware still computes barycentrics for us at the current
sample coordinates, only the cases that previously relied on the Pixel
Interpolator shared function need to be re-implemented with ALU
instructions, since that shared function will no longer exist on Xe2
hardware.

Thanks to Rohan for a bugfix of the uses_sample_offsets calculation,
this patch includes his fix squashed in.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
2024-06-27 00:18:00 +00:00
Ian Romanick
556e78f737 intel/brw/xe2+: Allow vec16 for cooperative matrix
Xe2 will allow a B matrix large enough that it will be stored in a
vec16.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 14:17:47 -07:00
Ian Romanick
b6236dd8f3 intel/brw/xe2+: Adjust DPAS lowering to DP4A to accommodate larger GRF and SIMD16
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 14:17:47 -07:00
Ian Romanick
77ef241577 intel/brw/xe2+: Scale size_written by reg_unit for DPAS
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 14:17:47 -07:00
Ian Romanick
e368b8e01b intel/brw/xe2+: Adjust size_read() for DPAS
v2: Remov "DG2" from a comment because it applies to DG2 and
Xe2. Suggested by Caio.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 14:17:47 -07:00
Ian Romanick
b051602754 intel/brw/xe2+: Catch invalid uses of writes_accumulator earlier
It turns out the problem I was trying to catch in be4fa59a72
("intel/brw: Clear write_accumulator flag when changing the
destination") also came from the DPAS lowering pass itself. Checking for
invalid uses of the feature in fs_validate helped detect the problem.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 14:17:47 -07:00
Ian Romanick
7a773ac53e intel/brw: Major rework of lower_cmat_load_store
The original goal was to get rid of a bunch of the magic constants
sprinkled through the function. Once I did that, I realized that there
was a lot my symmertry between the row-major and column-major paths
possible.

It's +6 lines of code, but about 15 of those lines are comments
explaining things that were not obvious in the original code.

v2: Save duplicated condition in a variable with a meaningful
name. Suggested by Caio.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 14:16:48 -07:00
Ian Romanick
ea6e10c0b2 intel/brw: Temporarily disable result=float16 matrix configs
Even though the hardware does not naively support these configurations,
there are many potential benefits to advertising them. These
configurations can theoretically use half the memory bandwidth for loads
and stores. For large matrices, that can be the limiting in performance.

The current implementation, however, has a number of significant
problems.

The conversion from float16 to float32 is performed in the driver during
conversion from NIR. As a result, many common usage patterns end up
doing back-to-back conversions to and from float16 between matrix
multiplications (when the result of one multiplication is used as the
accumulator for the next).

The float16 version of the matrix waste half the possible register
space. Each float16 value sits alone in a dword. This is done so that
the per-invocation slice of an 8x8 float16 result matrix and an 8x8
float32 result matrix will have the same number of elements. This makes
it possible to do straightforward implementations of all the unary_op
type conversions in NIR.

It would be possible to perform N:M element type conversions in the
backend using specialized NIR intrinsics. However, per #10961, this
would be very, very painful. My hope is that, once a suitable resolution
for that issue can be found, support for these configs can be restored.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
2024-06-25 13:52:12 -07:00
Kenneth Graunke
5cb15a6c67 intel/brw: Make bld.ADD(x, 0) emit no instructions and return x directly
There are a lot of places where we add 0 to an offset.  Avoiding
generating this can save us algebraic + copy_propagation later.

Cuts compile time in Borderlands 3 by -0.590631% +/- 0.170108% (n=25).

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29849>
2024-06-24 19:12:21 -07:00
Kenneth Graunke
068865ce81 intel/brw: Make an alu2 builder helper
Instead of replicating the whole thing in macros, just make an alu2()
function and use that in the wrappers.  It ought to get inlined anyway.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29849>
2024-06-24 19:12:19 -07:00
Kenneth Graunke
c18de3f048 intel/brw: Delay liveness calculations in saturate propagation
Wait and see if we actually have a candidate for saturate propagation
before requesting liveness info.  Saves the calculation in the case
where we have nothing to do.

Cuts compile time in Borderlands 3 by -0.304754% +/- 0.194162% (n=25).

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29849>
2024-06-24 19:12:00 -07:00
Caio Oliveira
b59ea3d63f intel/brw: Print SWSB information when dumping instructions
These were only being shown before as part of disassemble.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29738>
2024-06-23 08:09:56 -07:00
Alyssa Rosenzweig
da752ed7c1 treewide: use nir_def_replace sometimes
Two Coccinelle patches here. Didn't catch nearly as much as I would've liked but
it's a start.

Coccinelle patch:

    @@
    expression intr, repl;
    @@

    -nir_def_rewrite_uses(&intr->def, repl);
    -nir_instr_remove(&intr->instr);
    +nir_def_replace(&intr->def, repl);

Coccinelle patch:

    @@
    identifier intr;
    expression instr, repl;
    @@

    nir_intrinsic_instr *intr = nir_instr_as_intrinsic(instr);
    ...
    -nir_def_rewrite_uses(&intr->def, repl);
    -nir_instr_remove(instr);
    +nir_def_replace(&intr->def, repl);

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Juan A. Suarez Romero <jasuarez@igalia.com> [broadcom]
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com> [lima]
Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com> [etna]
Reviewed-by: Pavel Ondračka <pavel.ondracka@gmail.com> [r300]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29817>
2024-06-21 15:36:56 +00:00
Lionel Landwerlin
339630ab05 brw: enable A64 loads source rematerialization
Allows to avoid Wa_1407528679 on A64 loads

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
f482fc33cf brw: blockify load_global_const_block_intel
This intrinsic is pretty much equivalent to
load_global_constant_uniform_block_intel, it just has a predicate. If
the predicate is always true we can turn into into the other.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
6fe6b9c8fa brw: avoid Wa_1407528679 in uniform cases
When the surface handles are generated with exec_all, we can avoid
emitting the workaround.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
5227b2db73 brw: annotation send instructions with surface handles generated with exec_all
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
b79e85a93f brw: always use new registers for load address increments
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
7f1ca16e3b brw: enable rematerialization of non 32bit uniforms
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
0531f568ac brw: remove some brackets
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
11a634151b brw: remove rematerialization assert
The default case should lead us to the next rematerialization block so
this is useless.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
d42bc0d3fc brw: bound the amount of rematerialized NIR instructions
Some of the instructions we don't need to rematerialize because we
already know they are executed with NoMask so we can use their
destination without reemitting them again.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
4bfb4f35a8 brw: improve rematalization of surface/sampler handles
This change handles patterns like this

con v0 = load_ubo ...
con v1 = add v0, 0x30
con v2 = load_ubo v1, 0x0

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
c7b312ad45 brw: factor out source extraction for rematerialization
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Lionel Landwerlin
8fbbc9c301 brw: add missing break
Not fixing anything because of the default case below.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29663>
2024-06-21 08:29:44 +00:00
Francisco Jerez
c1feccdd90 intel/fs/gfx20+: Fix surface state address on extended descriptors for NIR scratch intrinsics.
The r0.5 thread payload register contains Surface State Offset bits
[27:6] as bits [31:10], so we need to shift the register right by 4 in
order to get the surface state offset expected in ExBSO mode, which is
the only extended descriptor encoding supported by the UGM shared
function for SS addressing on Xe2+.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29543>
2024-06-21 01:49:43 +00:00
Kenneth Graunke
50519598ff intel/brw: Skip discarding the interference graph
We no longer need to reserve registers for constructing spill/fill
messages.  We have split sends and construct message headers in new
temporary registers with a very short lifespan which are simply added
to the existing interference graph as new nodes and allocated via the
normal mechanism.

This means that when we need to spill for the first time, we can avoid
discarding and recomputing the entire interference graph.  We also avoid
needing to recreate all spill candidate information once ra_allocate()
fails, because the graph remains valid, and none of the existing nodes
had any changes to their interference.  The existing spill candidates
remain valid.

This will slightly help improve compile time when needing to spill.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25811>
2024-06-20 09:47:18 +00:00
Kenneth Graunke
29d6264627 intel/brw: Build the scratch header on the fly for pre-LSC systems
Instead of reserving a register to contain the spill header, which
gets marked live for the entire program, we can just emit the ALU
instructions to build it on the fly.  (This is similar to the way
we handle scratch on Alchemist with the newer LSC data port.)

There are a couple of downsides that make this not obviously a win.
First, in order to construct the scratch header on Gfx9-12, we have
to use fields from g0, which will have to remain live anywhere that
scratch access is required.  This could negate the register pressure
benefits of creating the header on the fly.  However, g0 is oft used
in other places anyway, so it may already be there.  Another is that
it's a non-trivial number of ALU instructions to construct the value.
Still, trading lower pressure (so fewer spills, less memory access
and stalls) for more cheap ALU seems like it ought to be a win.

There is another valuable benefit: by not reserving a register, we
eliminate the need to reconstruct the interference graph.  (The next
patch will actually do so.)

shader-db on Icelake shows spills/fills at 54/53 helped, 4/10 hurt,
and an 8% increase in ALU on affected shaders.  Synmark's OglCSDof
(a benchmark that spills) performance remains the same on Alderlake.

fossil-db on Icelake shows a 5.6%/5.1% reduction in spills/fills and a
4% reduction in scratch memory size on affected shaders.  Instruction
counts go up by 11.07%, but cycle estimates only increase by 0.57%.
Assassin's Creed Odyssey and Wolfenstein Youngblood both see 20-30%
reductions in spills/fills, a significant improvement.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25811>
2024-06-20 09:47:18 +00:00
Caio Oliveira
2a9f4618c5 intel/brw: Make component_size() consistent between VGRF and FIXED_GRF
Change so the size rounds up to the next multiple of the horizontal stride like
is done for VGRF.  This was causing an inconsistency in regs_read() -- The original
component_size() calculation for FIXED_GRF excluded any padding at the end but it was
still being discounted by regs_read().

Suggested by Curro.

Fixes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11069
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29736>
2024-06-19 01:33:58 +00:00
Caio Oliveira
8fb70f0746 intel/brw: Add unit tests for scoreboard handling FIXED_GRF with stride
Based on shaders reported in
https://gitlab.freedesktop.org/mesa/mesa/-/issues/11069 and
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29723.  These
currently fail, later patch will enable them.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29736>
2024-06-19 01:33:58 +00:00
Caio Oliveira
f982d2bb79 intel/brw: Fix typo in DPAS emission code
The enums were mixed up.  Code was working because they were being
used only for their numerical values.

Fixes: e666872c75 ("intel/compiler: Initial bits for DPAS instruction")
Acked-by: Iván Briano <ivan.briano@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29762>
2024-06-18 18:25:21 +00:00
Kenneth Graunke
9e750f00c3 intel/brw: Make opt_copy_propagation_defs clean up its own trash
Copy propagation often eliminates all uses of an instruction.  If we
detect that we've done so, we can eliminate the instruction ourselves
rather than leaving it hanging until the next DCE pass.

This saves some CPU time as other passes don't see dead code.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
2af84c2d49 intel/brw: Use the defs-based copy propagation along with the old one
The new def-based pass works better in many cases, and should be less
resource intensive.  However, the limited visibility of the defs-based
pass due to many values not being SSA yet makes it unable to fully
replace the old pass.  Try the new one, and if it can't make progress,
then try the old one.  That way, things will mostly be handled by the
new pass, but everything that was being cleaned up still will be.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
580e1c592d intel/brw: Introduce a new SSA-based copy propagation pass
(Quite a few of the restrictions here are ported from the old pass.)

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
9690bd369d intel/brw: Delete old local common subexpression elimination pass
We no longer use this older pass, so there's no need to keep it.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
8f09c58ddc intel/brw: Switch to the new defs-based global CSE pass
While the limited visibility due to partial SSA is a downside to the new
pass, it has a huge number of advantages that make it worth switching
over even now.  It's much more efficient, can eliminate redundant memory
loads across blocks, and doesn't generate loads of unnecessary copies
that other passes have to clean up.  This means we also eliminate the
infighting between the old CSE, coalescing, and copy propagation passes.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
234c45c929 intel/brw: Write a new global CSE pass that works on defs
This has a number of advantages compared to the pass I wrote years ago:

- It can easily perform either Global CSE or block-local CSE, without
  needing to roll any dataflow analysis, thanks to SSA def analysis.
  This global CSE is able to detect and coalesce memory loads across
  blocks.  Although it may increase spilling a little, the reduction
  in memory loads seems to more than compensate.

- Because SSA guarantees that values are never written more than once,
  the new CSE pass can directly reuse an existing value.  The old pass
  emitted copies at the point where it discovered a value because it
  had no idea whether it'd be mutated later.  This led it to generate
  a ton of trash for copy propagation to clean up later, and also a
  nasty fragility where CSE, register coalescing, and copy propagation
  could all fight one another by generating and cleaning up copies,
  leading to infinite optimization loops unless we were really careful.
  Generating less trash improves our CPU efficiency.

- It uses hash tables like nir_instr_set and nir_opt_cse, instead of
  linearly walking lists and comparing each element.  This is much more
  CPU efficient.

- It doesn't use liveness analysis, which is one of the most expensive
  analysis passes that we have.  Def analysis is cheaper.

In addition to CSE'ing SSA values, we continue to handle flag writes,
as this is a huge source of CSE'able values.  These remain block local.
However, we can simply track the last flag write, rather than creating
entire sets of instruction entries like the old pass.  Much simpler.

The only real downside to this pass is that, because the backend is
currently only partially SSA, it has limited visibility and isn't able
to see all values.  However, the results appear to be good enough that
the new pass can effectively replace the old pass in almost all cases.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
2b30b3bbd4 intel/brw: Print defs in dump_instructions
Like NIR, we print SSA defs as %1, %2, and so on.  The number here is
the VGRF number.  VGRFs that don't correspond to a SSA def remain
printed as vgrf1, vgrf2, and so on.

This makes it much easier to see what values are SSA and which aren't.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Caio Oliveira
08da7edc0e intel/brw: Track the number of uses of each def in def_analysis
Even without a full use list, simply tracking the number of uses will
let us tell "this is the only use of the def" or "we've just replaced
all uses of a def".  It's inexpensive to calculate and will be useful.

(rebased by Kenneth Graunke)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
0d144821f0 intel/brw: Add a new def analysis pass
This introduces a new analysis pass that opportunistically looks for
VGRFs which happen to satisfy the SSA definition properties.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
ad9e414aa9 intel/brw: Skip LOAD_PAYLOADs after every texture instruction if possible
This avoids generating a bunch of trash we have to clean up later.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00
Kenneth Graunke
84219892ad intel/brw: Make gl_SubgroupInvocation lane index loading SSA
Our code to initialize gl_SubgroupInvocation uses multiple instructions
some of which are partial writes.  This makes it difficult to analyze
expressions involving gl_SubgroupInvocation, which appear very
frequently in compute shaders.

To make this easier, we add a new virtual opcode which initializes
a full VGRF to the value of gl_SubgroupInvocation.  (We also expand
it to UD for SIMD8 so there are not partial write issues.)  We then
lower it to the original code later on in compilation, after we've
done the bulk of our optimizations.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
2024-06-18 09:02:25 +00:00