Instead of having a hardcoded list of endian-independent format aliases
in the header, generate them from the format definitions.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29649>
If you have something like this:
con 32 %66 = @load_reg (%62) (base=0, legacy_fabs=0, legacy_fneg=0)
con 32 %27 = @resource_intel (%22 (0xdeaddead), %66, %67, %17 (0x0)) (desc_set=2, binding=96, resource_intel=0, resource_block_intel=-1)
Just copying the brw_reg in ssa_values[] is not enough for the
load_reg intrinsic. We need to call get_nir_src() to force some logic
to create the register correctly.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: b8209d69ff ("intel/fs: Add support for new-style registers")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30050>
I haven't been able to find this restriction mentioned anywhere in the
hardware documentation, but the simulator has code to reject this case
as invalid, and it doesn't appear to work on hardware anymore.
Having lower_regioning() handle this takes care of the issue so we
don't have to worry about generating it in random places.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11489
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30140>
We were generating odd instructions like:
math inv(8) g93<1>HF g85<8,8,1>HF null<8,8,1>F { align1 1Q @7 $4 };
It's unclear whether the type of the null operand matters, but sometimes
these things don't get ignored properly. Out of caution, retype the
null source to match the actual operand's type. It'll at least look
less surprising in assembly dumps.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30193>
Handle '\n' when inside the MSGDESC start condition,
otherwise the lexer would apply its default rule (write
to stdout).
Without that, newlines were "leaking" to the output when
parsing a multi-line "MsgDesc". E.g. given the file
example.asm below
```
send(8) nullUD g126UD nullUD 0x02000000 0x00000000
thread_spawner MsgDesc: mlen 1 ex_mlen 0 rlen 0
{ align1 WE_all 1Q @1 EOT };
```
the assembler would produce one extra newline
```
$ brw_asm -t hex -g tgl example.asm
31 01 03 80 04 00 00 00 0c 7e 00 70 00 00 00 00
```
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30100>
The function handles the Xe2 case where NibCtrl is gone. Also add
error messages for invalid input when assembling for Xe2, e.g. "2N".
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30060>
Commit f695a9fed2 moved the 64-bit float <-> 16-bit float conversion
splitting into a core NIR pass, so the code remaining here is only
needed for 64-bit integer types.
Presumably in an attempt to remove the float handling, it replaced
simple bit_size == 64 checks with this expression:
(full_type & (nir_type_int64 | nir_type_uint64))
I believe that the intended expression was:
(full_type == nir_type_int64 || full_type == nir_type_uint64)
Unfortunately, the former is incorrect. Any integer or unsigned
NIR type would trigger the former expression. For example:
nir_type_uint32 & (nir_type_int64 | nir_type_uint64) => nir_type_uint
This meant that we were splitting e.g. u2f16 on 32-bit unsigned types
into u2f32 and f2f16, when we can easily natively handle that case.
To fix this, we go back to simple bit_size == 64 checks. This pass is
already run after nir_lower_fp16_casts which will split the float case,
so we will never see it here.
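The difference between the two checks can be reproduced in isolation. This is a minimal sketch, assuming the enum values below mirror NIR's nir_alu_type encoding (base type bits ORed with the bit size); the names type_*, SIZE_MASK, buggy_is_64bit_int() and is_64bit() are made up for illustration and are not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same layout as nir_alu_type, i.e. a base type ORed with
 * the bit size. */
enum alu_type {
   type_int    = 2,
   type_uint   = 4,
   type_float  = 128,
   type_uint32 = 32 | type_uint,
   type_int64  = 64 | type_int,
   type_uint64 = 64 | type_uint,
};

/* Assumption: the bits that hold the size (1, 8, 16, 32, 64). */
#define SIZE_MASK 0x79

/* The mask contains the int and uint base bits, so any integer type
 * of any size produces a non-zero result. */
static bool buggy_is_64bit_int(unsigned full_type)
{
   return (full_type & (type_int64 | type_uint64)) != 0;
}

/* Equivalent of a plain bit_size == 64 check. */
static bool is_64bit(unsigned full_type)
{
   return (full_type & SIZE_MASK) == 64;
}
```

With these definitions, buggy_is_64bit_int(type_uint32) is true while is_64bit(type_uint32) is false, which matches the nir_type_uint result shown above.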
fossil-db on Alchemist shows a -1.14% reduction in affected shaders for
google-meet-clvk shaders. In another ChromeOS workload, it improves
performance by around 8% on Meteorlake.
Thanks to Sushma Venkatesh Reddy for finding this performance issue!
Fixes: f695a9fed2 ("intel/compiler: use nir_lower_fp16_casts")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30091>
Indirect addressing (vx1 and vxh) is not supported with the UB/B data types
for src0, so we need to change the data type for both dest and src0.
This fixes the following test cases on Xe2+:
- dEQP-VK.spirv_assembly.instruction.compute.8bit_storage.push_constant_8_to_16*
- dEQP-VK.spirv_assembly.instruction.compute.8bit_storage.push_constant_8_to_32*
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29316>
We can do CSEL on F, HF, *W, and *D on Gfx11+. Gfx9 can only do F.
We can lower unsupported types to CMP+CSEL, allowing us to use CSEL
in the IR and not worry about the limitations.
Rework: (Sagar)
- Update validation pass for CSEL
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29316>
Reorganize the code to make clearer all the lowering cases:
(a) Single invocation workgroup. Index and IDs are all zero.
(b) Local ID provided by hardware.
(c) Local Index provided by the hardware. Depending on the case this
might not be the final local index, e.g. heuristics for tile.
(d) Neither provided by the hardware.
Case (c) is new and supported by Mesh/Task shaders. At the moment
nir_lower_compute_system_values handles lowering of LocalID for
Task/Mesh, but a later patch will flip that on ANV.
This will make the Task/Mesh use the same lowering as Compute shaders.
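For cases (c) and (d), one quantity has to be derived from the other. A minimal sketch of that relationship, assuming a plain linear (X-major) layout with no tiling heuristics; struct uvec3 and both helpers are hypothetical names, not the pass's code:

```c
#include <assert.h>

struct uvec3 { unsigned x, y, z; };

/* Derive a 3D local ID from a linear local index. */
static struct uvec3 local_id_from_index(unsigned index, struct uvec3 size)
{
   assert(size.x > 0 && size.y > 0 && size.z > 0);

   struct uvec3 id;
   id.x = index % size.x;
   id.y = (index / size.x) % size.y;
   id.z = index / (size.x * size.y);
   return id;
}

/* Inverse mapping: linear local index from a 3D local ID. */
static unsigned local_index_from_id(struct uvec3 id, struct uvec3 size)
{
   return id.x + size.x * (id.y + size.y * id.z);
}
```

For a 4x2x2 workgroup, index 13 maps to ID (1, 1, 1) and back to 13.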
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29828>
It looks like some of the tests use a bias that does not fit into a half
float parameter, so it's better to use a float param for sample_b.
If we have cube arrays, we combine BIAS and array index properly anyway,
so we don't have to worry about the first parameter.
This fixes: GTF-GL46.gtf21.GL3Tests.texture_lod_bias.texture_lod_bias_clamp_m_g_M
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29533>
This has no impact on the generated shaders, but does have a small
(positive) impact on the amount of time spent in shader compilation.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29126>
Following https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28957,
some Xe2 code paths started triggering asserts.
In the cases fixed by this patch, it was because of the assert added
to brw_type_larger_of() in cf8ed9925f ("intel/brw: Make a helper for
finding the largest of two types"), and then brw_type_larger_of() is
used in 674e89953f. (For example, the assert was triggering when the
SHL types differed between D and UD.)
Fixes: 674e89953f ("intel/brw: Use new builder helpers that allocate a VGRF destination")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29925>
When the destination of both instructions is NULL and the conditional
modifier matches, operands_match (by way of instructions_match) will
only test the first two operands. This can result in bad CSE
happening.
This is a very, very narrow edge case.
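A toy model of the problem, not the actual operands_match code: struct cse_inst and both helpers are hypothetical and only illustrate why comparing two sources is not enough for a 3-source instruction.

```c
#include <assert.h>
#include <stdbool.h>

#define CSE_MAX_SRCS 3

/* Hypothetical stand-in for an instruction; sources are plain ids. */
struct cse_inst {
   int opcode;
   int num_srcs;
   int src[CSE_MAX_SRCS];
};

/* Models the narrow path described above: only src[0] and src[1]
 * are compared. */
static bool match_first_two(const struct cse_inst *a,
                            const struct cse_inst *b)
{
   return a->opcode == b->opcode &&
          a->src[0] == b->src[0] &&
          a->src[1] == b->src[1];
}

/* Compares every source the instruction actually has. */
static bool match_all(const struct cse_inst *a, const struct cse_inst *b)
{
   if (a->opcode != b->opcode || a->num_srcs != b->num_srcs)
      return false;

   assert(a->num_srcs <= CSE_MAX_SRCS);
   for (int i = 0; i < a->num_srcs; i++) {
      if (a->src[i] != b->src[i])
         return false;
   }
   return true;
}
```

Two instructions that differ only in src[2] are reported as equal by match_first_two() and as different by match_all().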
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29848>
This one is a bit more complex in that we need to handle 3-source
commutative opcodes. But it's also quite useful:
fossil-db results on Alchemist (A770):
Instrs: 151659750 -> 150164959 (-0.99%); split: -0.99%, +0.01%
Cycles: 12822686329 -> 12574996669 (-1.93%); split: -2.05%, +0.12%
Subgroup size: 7589608 -> 7589592 (-0.00%)
Send messages: 7375047 -> 7375053 (+0.00%); split: -0.00%, +0.00%
Loop count: 46313 -> 46315 (+0.00%); split: -0.01%, +0.01%
Spill count: 110184 -> 54670 (-50.38%); split: -50.79%, +0.41%
Fill count: 213724 -> 104802 (-50.96%); split: -51.43%, +0.47%
Scratch Memory Size: 9406464 -> 3375104 (-64.12%); split: -64.35%, +0.23%
Our older Shadow of the Tomb Raider fossil is particularly helped with
over a 90% reduction in scratch access (spills, fills, and scratch
size). However, benchmarking in the actual game shows no change in
performance. We're thinking the game's shaders have been updated since
our capture.
Ian noted that there was a bug here where we'd accidentally CSE two ADD3
instructions with null destinations and different src[2] that couldn't
be dead code eliminated due to conditional mods. However, this is only
a bug in the new cse_defs pass so we don't need to nominate this for
stable branches.
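One way to compare the sources of a 3-source commutative opcode is to ignore their order while still looking at all three. This is a minimal sketch, assuming an opcode such as a three-way add where all sources may be reordered freely; sources are modelled as integer ids, and sort3() and commutative3_match() are hypothetical helpers, not the pass's implementation.

```c
#include <stdbool.h>

/* Sort three values in place with three compare-and-swap steps. */
static void sort3(int v[3])
{
   int t;
   if (v[0] > v[1]) { t = v[0]; v[0] = v[1]; v[1] = t; }
   if (v[1] > v[2]) { t = v[1]; v[1] = v[2]; v[2] = t; }
   if (v[0] > v[1]) { t = v[0]; v[0] = v[1]; v[1] = t; }
}

/* True if both source lists hold the same three ids in any order. */
static bool commutative3_match(const int a[3], const int b[3])
{
   int x[3] = { a[0], a[1], a[2] };
   int y[3] = { b[0], b[1], b[2] };

   sort3(x);
   sort3(y);
   return x[0] == y[0] && x[1] == y[1] && x[2] == y[2];
}
```

Because src[2] takes part in the comparison, two instructions that differ only in their third source never match.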
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29848>
This implements a replacement for the previous implementation of
nir_intrinsic_load_barycentric_at_sample that relied on the Pixel
Interpolator shared function, since it's going to be removed from the
hardware from Xe2 onwards.
This implementation simply looks up the X/Y offsets of each sample
index on the table provided in the PS thread payload by using indirect
addressing, then does the actual interpolation by recursing into
emit_pixel_interpolater_alu_at_offset() introduced in the previous
commit.
Note that even though this is only immediately useful on Xe2+
platforms there's no reason why it shouldn't work on earlier
platforms, as long as we have the sample X/Y offsets available in the
thread payload.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
This implements a replacement for the previous implementation of
nir_intrinsic_load_barycentric_at_offset that relied on the Pixel
Interpolator shared function, since it's going to be removed from the
hardware from Xe2 onwards.
That's okay since we can get all the primitive setup information
needed for interpolation at an arbitrary coordinate: We use the X/Y
offset relative to the "X/Y Start" coordinates from the thread payload
in order to evaluate the plane equations also provided in the thread
payload for each barycentric coordinate of each polygon. The
evaluation of the barycentric plane equations (and the RHW plane
equation for perspective-correct interpolation) uses the accumulator
and MAD/MAC for ALU efficiency, but that means we need to manually
split instructions to fit the width of the accumulator. The division
and scaling for perspective-correct interpolation is also now done in
the shader if necessary.
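The arithmetic can be sketched as follows. This is only an illustration of evaluating a plane equation at an offset and dividing by RHW; it assumes each plane has the form a*dx + b*dy + c and that the perspective planes hold barycentrics pre-divided by W, neither of which is confirmed here. It does not model the accumulator, MAD/MAC use, or instruction splitting, and struct plane and both helpers are hypothetical.

```c
/* Hypothetical plane equation: value(dx, dy) = a*dx + b*dy + c,
 * with (dx, dy) relative to the "X/Y Start" coordinates. */
struct plane {
   float a, b, c;
};

static float eval_plane(struct plane p, float dx, float dy)
{
   return p.a * dx + p.b * dy + p.c;
}

/* Assumption: bary_over_w evaluates to barycentric/W and rhw
 * evaluates to 1/W, so dividing the two recovers the
 * perspective-correct barycentric. */
static float perspective_bary(struct plane bary_over_w, struct plane rhw,
                              float dx, float dy)
{
   return eval_plane(bary_over_w, dx, dy) / eval_plane(rhw, dx, dy);
}
```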
Note that even though this is only immediately useful on Xe2+, the
thread payload numbers are filled out for older platforms, and the EU
restrictions of previous Xe platforms are taken into account, mostly
for the purposes of testing and performance evaluation.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
Floating-point offsets work fine in combination with the
floating-point arithmetic we're about to lower these intrinsics into,
and they require fewer instructions than converting to fixed-point and
then back. No reason to take the precision/range hit nor the extra
instructions.
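To illustrate the precision/range hit, here is a sketch of a fixed-point round trip. The format is an assumption: signed fixed-point with 4 fractional bits clamped to [-8, 7], i.e. offsets in [-0.5, 0.4375]. The helper names are hypothetical.

```c
/* Assumed format: signed, 4 fractional bits, clamped to [-8, 7]. */
static int to_fixed_s0_4(float offset)
{
   /* Round to nearest, halves away from zero. */
   int q = (int)(offset * 16.0f + (offset >= 0.0f ? 0.5f : -0.5f));

   if (q < -8)
      q = -8;
   if (q > 7)
      q = 7;
   return q;
}

static float from_fixed_s0_4(int q)
{
   return (float)q / 16.0f;
}
```

Under these assumptions an offset of 0.3 comes back as 0.3125, and 0.49 is clamped to 0.4375; keeping the offset in floating point avoids both effects and the two conversions.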
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
The ALU-based implementation of the barycentric interpolation
intrinsics introduced by a subsequent commit will require some
primitive setup information not delivered in the PS thread payload
unless explicitly requested:
- "Source Depth and/or W Attribute Vertex Deltas" if a
perspective-correct interpolation mode is used -- Note that this is
already requested for CPS interpolation, we just need to enable it
in more cases.
- "Perspective Bary Planes" if a perspective-correct interpolation
mode is used.
- "Non-Perspective Bary Planes" if a non-perspective-corrected
interpolation mode is used.
- "Sample offsets" if any at_sample interpolation is used so the
coordinate offsets of the sample can be calculated.
This ALU implementation of barycentric interpolation will only be
needed for *_at_offset and *_at_sample interpolation. The fixed
function hardware still computes barycentrics for us at the current
sample coordinates, so only the cases that previously relied on the
Pixel Interpolator shared function need to be re-implemented with ALU
instructions, since that shared function will no longer exist on Xe2
hardware.
Thanks to Rohan for a bugfix of the uses_sample_offsets calculation;
this patch includes his fix squashed in.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29847>
It turns out the problem I was trying to catch in be4fa59a72
("intel/brw: Clear write_accumulator flag when changing the
destination") also came from the DPAS lowering pass itself. Checking for
invalid uses of the feature in fs_validate helped detect the problem.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
The original goal was to get rid of a bunch of the magic constants
sprinkled through the function. Once I did that, I realized that there
was a lot more symmetry possible between the row-major and column-major
paths.
It's +6 lines of code, but about 15 of those lines are comments
explaining things that were not obvious in the original code.
v2: Save duplicated condition in a variable with a meaningful
name. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
Even though the hardware does not natively support these configurations,
there are many potential benefits to advertising them. These
configurations can theoretically use half the memory bandwidth for loads
and stores. For large matrices, that can be the limiting factor in performance.
The current implementation, however, has a number of significant
problems.
The conversion from float16 to float32 is performed in the driver during
conversion from NIR. As a result, many common usage patterns end up
doing back-to-back conversions to and from float16 between matrix
multiplications (when the result of one multiplication is used as the
accumulator for the next).
The float16 version of the matrix wastes half the possible register
space. Each float16 value sits alone in a dword. This is done so that
the per-invocation slice of an 8x8 float16 result matrix and an 8x8
float32 result matrix will have the same number of elements. This makes
it possible to do straightforward implementations of all the unary_op
type conversions in NIR.
It would be possible to perform N:M element type conversions in the
backend using specialized NIR intrinsics. However, per #10961, this
would be very, very painful. My hope is that, once a suitable resolution
for that issue can be found, support for these configs can be restored.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
There are a lot of places where we add 0 to an offset. Avoiding
generating this can save us algebraic + copy_propagation later.
Cuts compile time in Borderlands 3 by -0.590631% +/- 0.170108% (n=25).
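The idea can be shown with a toy builder. This is a hypothetical sketch, not the brw builder API: values are integer ids, and toy_add_imm() only counts the ADD instructions it would have emitted.

```c
#include <assert.h>

/* Hypothetical builder: hands out value ids and counts emitted ADDs. */
struct toy_builder {
   int next_id;
   int num_adds;
};

/* Returns the id holding src + imm. */
static int toy_add_imm(struct toy_builder *b, int src, int imm)
{
   assert(b != 0);

   /* x + 0 == x: reuse the source and emit nothing, so later
    * algebraic and copy propagation passes have nothing to clean up. */
   if (imm == 0)
      return src;

   b->num_adds++;
   return b->next_id++;
}
```

Calling toy_add_imm(&b, src, 0) returns src unchanged and leaves num_adds untouched.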
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29849>
Instead of replicating the whole thing in macros, just make an alu2()
function and use that in the wrappers. It ought to get inlined anyway.
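The shape of the change, sketched with hypothetical types: one shared helper plus thin per-opcode wrappers instead of a macro that stamps out the full body each time. The opcodes, struct alu_inst and the wrapper names are illustrative only.

```c
#include <assert.h>

enum toy_opcode { TOY_ADD, TOY_MUL, TOY_AND };

struct alu_inst {
   enum toy_opcode op;
   int dst, src0, src1;
};

/* Shared implementation for every two-source ALU instruction. */
static inline struct alu_inst
alu2(enum toy_opcode op, int dst, int src0, int src1)
{
   assert(op == TOY_ADD || op == TOY_MUL || op == TOY_AND);

   struct alu_inst inst = { op, dst, src0, src1 };
   return inst;
}

/* Thin wrappers; a compiler will typically inline these. */
static inline struct alu_inst ADD(int dst, int a, int b)
{
   return alu2(TOY_ADD, dst, a, b);
}

static inline struct alu_inst MUL(int dst, int a, int b)
{
   return alu2(TOY_MUL, dst, a, b);
}
```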
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29849>