Make sure that lowering undone in elk_nir_optimize are reapplied.
No shader-db or fossil-db changes on any Intel platform. This is most
likely to impact either Gfx8 on ANV or Gfx7.5 on HASVK. I don't
fossil-db test either of those platforms.
I tried doing a similar thing here as is done in BRW (previous commit),
but that caused a couple Haswell shaders to fall off a performance
cliff:
total spills in shared programs: 8247 -> 8311 (0.78%)
spills in affected programs: 6 -> 70 (1066.67%)
helped: 0 / HURT: 2
total fills in shared programs: 8558 -> 8910 (4.11%)
fills in affected programs: 6 -> 358 (5866.67%)
helped: 0 / HURT: 2
Fixes: 442daeb54a ("nir/opt_algebraic: use fcanonicalize")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39567>
This matches the behavior of nir_op_unpack_half_2x16_split_x. Gfx7
uses a special opcode for this conversion. Fixes numerous assertion
failures in shader-db on Ivy Bridge and Haswell.
I am not sure why this was never encountered previously.
Fixes: 609c46cf23 ("nir/lower_alu_width: emit f2f32 for unpack_half_2x16")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39567>
This was incorrect for OpenCL due to the possibility of variable shared memory
existing despite shared_size == 0. Fortunately the optimization it was trying to
do should be done in NIR via nir_opt_barrier_modes so we can just drop the brw
code and move on with our merry lives. Fixes OpenCL tests on Iris:
non_uniform_work_group non_uniform_3d_barriers
basic async_strided_copy_local_to_global
Cc: mesa-stable
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39795>
Right now all the drivers align push data to GRF (32B pre Xe2, 64B
post Xe2) but the push constant delivery mechanism can actually pack
32B ranges so alignment is not required.
Off course we need the push UBO masking to deal with unaligned pushed
ranges.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Calder Young <cgiacun@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35160>
Will use the load/store_ssbo with nir_resource_intel_internal later in
this series.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35160>
`operands_match` was modifying instruction source operands in-place
(through the `elk_fs_reg *src` pointer member) and relying on a
save/restore pattern to undo the modifications. Work on local copies
instead, which is simpler and avoids mutating shared state in a
comparison function.
Fixes: 47c4b38540 ("i965/fs: Allow CSE to handle MULs with negated arguments.")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39814>
The MUL case in `operands_match` was reading and writing the `.f` union
member unconditionally, even when the register's `.file != IMM`. In that
case `.f` aliases the struct containing `.nr`/`.swizzle`/etc, so the
`fabsf()` call could corrupt the `.nr` by clearing bit 31.
Guard all `.f` accesses with `.file == IMM` checks.
Fixes: 47c4b38540 ("i965/fs: Allow CSE to handle MULs with negated arguments.")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39814>
`operands_match` was modifying instruction source operands in-place
(through the `brw_reg *src` pointer member) and relying on a
save/restore pattern to undo the modifications. Work on local copies
instead, which is simpler and avoids mutating shared state in a
comparison function.
Fixes: 47c4b38540 ("i965/fs: Allow CSE to handle MULs with negated arguments.")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39814>
The MUL case in `operands_match` was reading and writing the `.f` union
member unconditionally, even when the register's `.file != IMM`. In that
case `.f` aliases the struct containing `.nr`/`.swizzle`/etc, so the
`fabsf()` call could corrupt the `.nr` by clearing bit 31.
Guard all `.f` accesses with `.file == IMM` checks.
Fixes: 47c4b38540 ("i965/fs: Allow CSE to handle MULs with negated arguments.")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39814>
The entire array is always initialized to zero and never modified.
Cuts the size of brw_wm_prog_data by 32%.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39791>
This is the shader key for the fragment shader. Nobody even knows
what the windowizer/masker unit is or does anymore. Even on Gen4-6,
"fs" is still clearer. This makes the codebase easier to read.
This is only about 15 years overdue.
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39748>
This is the program data for the fragment shader. Nobody even knows
what the windowizer/masker unit is or does anymore. Even on Gen4-6,
"fs" is still clearer. This makes the codebase easier to read.
This is only about 15 years overdue.
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39748>
This started out as dynamic configuration for MSAA related state, but
has since expanded to cover many dynamic fragment shader options.
We rename it to intel_fs_config, similar to intel_tess_config, to
better indicate its purpose.
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39748>
For VS/TES/GS, we lower all outputs to temporaries and emit copies at the
end of the shader (or for GS, at each EmitVertex() call) from those
temporaries back to real outputs. We use vec8 URB writes without
writemasking, since our output area's contents are undefined anyhow.
This is simpler than what TCS and Mesh do, which allow for output
variables to be read/written at a per-component level at any time,
with the output memory being used for cross-thread communication.
Rather than using the complicated TCS/Mesh handling and relying on
vectorization, we port the emit_urb_writes() approach to NIR. This
also takes care of emitting the VUE header with default values when
fields aren't explicitly written by the shader.
We also handle multiview in the process. It simplifies things, and
also drops another case of non-semantic IO in brw.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39666>
The TES workaround code is still going to be needed even after
we rework URB output handling for VS/TES/GS to use NIR intrinsics.
For VS, we know at least one URB write will have been emitted at
the end of the program, so we can just tag it.
GS already handles EOT via emit_gs_thread_end().
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39666>
This lets us look up things in varying_to_slot[] without having to
special case VIEWPORT, LAYER, and PRIMITIVE_SHADING_RATE. All of them
map to the same slot as PSIZ, slot 0, the VUE header.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39666>
We'll need the VUE map when we convert to using URB intrinsics.
Prepare for that by reordering VUE map setup before IO lowering.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39666>
This brings what ANV reports closer to what Iris reports, and is mostly dropping
redundancies.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39633>
This is for parity with what we do in the current GL shader-db path.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39633>
This fix a brw_ip_ranges shader analysis, were it fails because there is a
different number of instructions than expected after brw_opt_split_virtual_grfs()
optimization.
Reproduced in Piglit test spec@arb_sample_shading@builtin-gl-sample-mask 0:
arb_sample_shading-builtin-gl-sample-mask: ../src/intel/compiler/brw_analysis.h:150: T& brw_analysis<T, C>::require() [with T = brw_ip_ranges; C = brw_shader]: Assertion `p->validate(c)' failed.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39629>
Pick the value from the brw_shader instead of from the prog_data, since
when there are multiple variants, the prog_data one will have the
maximum value. Picking the wrong value also caused compute shaders
that had a single variant to report 0 GRFs since the prog_data was
being filled after the generate_code() call.
Issue spotted by Felix DeGrood.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39601>
Geometry shaders load from separate handles for each vertex, so they
don't incorporate the vertex index in the URB offset like tessellation
shaders do. This means we can have a constant offset (within a vertex's
section) but not have a constant vertex index.
Prior to 41d7debcfe we were emitting non-folded ALU so we thought the
offset was non-constant at this point. Now we can properly detect
constant offsets...but still don't want to use push inputs for
non-constant vertex indices.
Fixes: 41d7debcfe ("brw: Use nir_imul_imm in per-vertex/per-primitive offset calculation")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39603>
This now also removes dead variables created by split_array_vars,
and in the future it is reasonable other optimizations inside the
optimization loop to make temp variables dead.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39596>
This macro will stop the loop early if there's no chance to make further
progress.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39504>
Add a pass tracker struct that can live the whole lifetime
of brw_compile() functions, it will keep track of the debug_archiver
and also store some metadata that allow us to name the passes.
With that, we can also embed the loop tracking in the same struct,
so that is free for any loop to use the "early break" optimization.
There are other brw_nir_* passes that are called in the pre-processing
phase. These are not currently included in the mda yet. Will be
handled when we hook debug_archiver or similar to the runtime/driver.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39504>
For mesh/task shaders, the thread payload provides a local invocation
index, but it's always linear so it doesn't give the correct value when
quad derivatives are in use.
The lowering pass where all of this is done correctly for compute
shaders assumes load_local_invocation_index will be lowered in the
backend for mesh/task, calculates the values for the quads correctly but
then avoid replacing the original intrinsic and we remain with the wrong
results.
Add an intel specific intrinsic and always lower the generic one to that
(or whatever else was calculated) to avoid ambiguities and fix the value
for quad derivatives.
Fixes future CTS tests using mesh/task shaders under:
dEQP-VK.spirv_assembly.instruction.compute.compute_shader_derivatives.*
Fixes: d89bfb1ff7 ("intel/brw: Reorganize lowering of LocalID/Index to handle Mesh/Task")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39276>
This avoids generating some useless math that would need to be cleaned
up later, without complicating things too much.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
This helps cut down URB messages on tessellation and mesh shaders
significantly. fossil-db results on Battlemage:
Instrs: 505172392 -> 505207187 (+0.01%); split: -0.00%, +0.01%
Send messages: 23678197 -> 23656126 (-0.09%); split: -0.09%, +0.00%
Cycle count: 63150470088 -> 63147482640 (-0.00%); split: -0.01%, +0.00%
Spill count: 576554 -> 576616 (+0.01%)
Fill count: 545304 -> 545413 (+0.02%)
Max live registers: 141099192 -> 141150675 (+0.04%); split: -0.00%, +0.04%
Max dispatch width: 39856192 -> 39856208 (+0.00%)
Totals from 4231 (0.27% of 1583648) affected shaders:
Instrs: 1620161 -> 1654956 (+2.15%); split: -0.25%, +2.40%
Send messages: 128652 -> 106581 (-17.16%); split: -17.18%, +0.03%
Cycle count: 24650700 -> 21663252 (-12.12%); split: -12.82%, +0.70%
Spill count: 378 -> 440 (+16.40%)
Fill count: 1308 -> 1417 (+8.33%)
Max live registers: 364676 -> 416159 (+14.12%); split: -0.24%, +14.36%
Max dispatch width: 67952 -> 67968 (+0.02%)
There are several reasons we didn't go with nir_opt_vectorize_io:
1. nir_opt_vectorize_io appears to work on the slot location level.
We want to be able to vectorize based on the URB offsets, especially
for cases like point size, layer, and viewport which have different
VARYING_SLOT_* values but live in the same vec4 in a URB entry.
2. We want vec8 stores, and nir_opt_vectorize_io only seems to vectorize
within a single 32-bit vec4. It does handle 8 components, but that's
only for packing 16-bit values into a 32-bit vec4.
Improves performance of Sascha Willems' tessellation demo by around 4%
on Meteorlake.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
Both the URB Global Offset and Per-Slot Offsets are specified to be
unsigned numbers. The URB Global Offset is only 11 bits, and so is
limited to be between [0, 2047]. While the per-slot offsets are
given as U32 values, it would appear that adding the two offsets
does not handle 32-bit overflow/unsigned wrap correctly.
This pops up in Piglit's TCS variable-indexing tests, which ends up
performing loads from offset (x - 16) and a base of 18, and at an offset
(x) with a base of 2. These should be equivalent, but when x <= 15, the
per-slot offset calculated in the shader is negative (0xfffffff[0-f])
and adding the base of 18 is not wrapping around correctly to [2, 17].
To work around this, avoid using the global offset when the per-slot
offset is present, and just add the two in the shader where unsigned
wrap works correctly.
Tigerlake and later don't seem to have this issue.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
We were checking for 0xf which is fine for vec4, but vec8 gets 0xff.
Either way, nothing is writemasked, so we can skip sending the mask.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>