Commit graph

664 commits

Author SHA1 Message Date
Alejandro Piñeiro
9cbc3ab239 broadcom/compiler: update how we compute return_words_of_texture_data on non-ssa
For the non-ssa case, we were trying to use reg->num_components. But
this is not the same that nir_ssa_def_components_read. It is the
number of components of the destination register. And in the 16bit
case, even if nir_lower_tex packs the outcome, it doesn't update the
number of components, as nir_tex_instr_dest_size would still return
4. And nir validate would check that those values are the same.

So this change focuses on the last part of this comment at
nir_lower_tex:

 * Note that we don't change the destination num_components, because
 * nir_tex_instr_dest_size() will still return 4.  The driver is just
 * expected to not store the other channels, given that nothing at the
 * NIR level will read them.

We just limit how many channels we would use for the f16 case.

It is also worth to note, based on the CTS and different applications
we test, that this is a corner case.

This was detected when we experimented to enable nir_opt_gcm for v3d,
that lead to raise an assertion slightly below with some shaderdb
tests, but technically it could happen without it.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17185>
2022-10-26 12:29:30 +00:00
Alejandro Piñeiro
ec10a37a52 broadcom/compiler: don't call nir_opt_load_store_vectorize on all v3d_optimize_nir calls
For compute shaders, to avoid a crash with that optimization, it requires
doing some optimizations and lowerings before. Example:

  static void
  lower_cs_shared(struct nir_shader *nir)
  {
     NIR_PASS_V(nir, nir_lower_vars_to_explicit_types,
                nir_var_mem_shared, shared_type_info);
     NIR_PASS_V(nir, nir_lower_explicit_io,
                nir_var_mem_shared, nir_address_format_32bit_offset);
  }

In the same way other drivers (like anv) calls
nir_opt_load_store_vectorize as part of their post-process-nir.

So one option would be to move nir_opt_load_store_vectorize outsize
the common v3d_nir_optimize, to a post-process nir method.

To make things simpler, this change calls that optimization only if we
have a v3d_compiler object, that is when each frontend has already
done their lowerings, and call the v3d_compiler to get the final
assembly (so we are already on a kind of post processing nir step).

This avoids dEQP-VK.memory_model.shared.basic_types.3 crashing if we
start to call v3d_optimize_nir on v3dv directly.

Slight shaderdb changes, but not significant.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17185>
2022-10-26 12:29:30 +00:00
Iago Toral Quiroga
b6093ffbe7 v3dv: expose VK_EXT_image_robustness
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18820>
2022-09-27 09:08:29 +00:00
Iago Toral Quiroga
c7e022abfd broadcom/compiler: add a lowering for robust image access
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18820>
2022-09-27 09:08:29 +00:00
Iago Toral Quiroga
adcfd9bc2f broadcom/compiler: rename static helpers involved with robust buffer access
To make it explicit that they involve buffers, since we will be adding
robust image access shortly.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18820>
2022-09-27 09:08:29 +00:00
Iago Toral Quiroga
5e5eaa3f1a broadcom/compiler: rename v3d_nir_lower_robust_buffer_access.c
We are going to add code to handle image robustness shortly, so
better rename this to v3d_nir_lower_robust_access.c

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18820>
2022-09-27 09:08:29 +00:00
Iago Toral Quiroga
4ea916f704 broadcom/compiler: don't apply robust buffer access to shared variables
This feature is only concerned with buffers bound through a descriptor
set. We are still keeping the code for this (disabled by default) since
it may be useful for debugging some scenarios.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18744>
2022-09-23 06:27:54 +00:00
Iago Toral Quiroga
44b02b5cb1 broadcom/compiler: handle shared stores with robust buffer access
For some reason we supported all shared intrinsics but this one.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18744>
2022-09-23 06:27:54 +00:00
Iago Toral Quiroga
b2bce9c98a broadcom/compiler: fix robust buffer access
Our implemention was bogus, it was only putting a cap on the offset
based on the aligned buffer size and this doesn't ensure the access
to the buffer happens within its valid range.

I think the only reason we have been passing the tests is that we
align all buffers sizes to 256B and the tests create buffers with a
size that is smaller than that (like 64B). When get the size of the
buffer from the shader, we get the actual bound range (so 64B in this
case) and by capping to that we don't ensure the access will stay
within that range, but we ensure it will stay within the underlying
memory bound to the buffer (256B), and this is fine by the spec,
however, I think if the actual buffer range was the same as the
underlying allocation we would fail the tests.

A valid behavior for robust buffer access on an out-of-bounds access
is to return any valid bytes within the buffer, so we can just
make that offset 0.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18744>
2022-09-23 06:27:54 +00:00
Iago Toral Quiroga
15cdf5bb48 v3dv: optimize ldunif load into unifa write
If we emit a ldunif to load the ubo/ssbo base address and
then we are immediately moving it to the unifa register we
can have the ldunif write directly to unifa and avoid the mov
in between, which won't be done by copy propagation because that
only works with temp registers.

Also, since we can't read from unifa we must be careful to disallow
reuse of the ldunif result for a future ldunif of the same base address.
We do that by only reusing ldunif results from temp registers.

total instructions in shared programs: 12468943 -> 12455139 (-0.11%)
instructions in affected programs: 1661233 -> 1647429 (-0.83%)
helped: 8307
HURT: 3994

total uniforms in shared programs: 3704532 -> 3704522 (<.01%)
uniforms in affected programs: 339 -> 329 (-2.95%)
helped: 7
HURT: 0

total max-temps in shared programs: 2148158 -> 2148290 (<.01%)
max-temps in affected programs: 9320 -> 9452 (1.42%)
helped: 175
HURT: 295

total spills in shared programs: 2202 -> 2202 (0.00%)
spills in affected programs: 0 -> 0
helped: 0
HURT: 0

total fills in shared programs: 3059 -> 3057 (-0.07%)
fills in affected programs: 27 -> 25 (-7.41%)
helped: 1
HURT: 0

total sfu-stalls in shared programs: 21167 -> 21056 (-0.52%)
sfu-stalls in affected programs: 497 -> 386 (-22.33%)
helped: 209
HURT: 127

total inst-and-stalls in shared programs: 12490110 -> 12476195 (-0.11%)
inst-and-stalls in affected programs: 1662875 -> 1648960 (-0.84%)
helped: 8312
HURT: 3987

total nops in shared programs: 316563 -> 313553 (-0.95%)
nops in affected programs: 24269 -> 21259 (-12.40%)
helped: 2158
HURT: 1006

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18667>
2022-09-20 06:56:28 +00:00
Iago Toral Quiroga
cbc5169ef9 broadcom/compiler: check signal writes to magic regs when updating scoreboard
We have only been checking magic writes from ADD and MUL ports, but signals
can potentially write to magic registers too.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18667>
2022-09-20 06:56:28 +00:00
Eric Engestrom
5bfca00d31 broadcom: fix dependencies in static_library() calls
The first argument is the name of the library, and the second argument
is the list of files; those two got a bit mixed up.

Fixes: 1ae8018a6a ("meson: Add support for the vc4 driver.")
Fixes: 4f3e380fa0 ("meson: Add support for the vc5 driver.")
Signed-off-by: Eric Engestrom <eric@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18593>
2022-09-14 09:38:28 +00:00
Iago Toral Quiroga
ca33c319e5 v3dv: implement VK_KHR_zero_initialize_workgroup_memory
This only requires that we call the relevant lowering pass in NIR.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18312>
2022-08-31 07:33:19 +02:00
Eric Engestrom
e767f54f28 v3d: introduce V3D_DBG() macro to make V3D_DEBUG checks consistent
The main issue was the inconsistent use of `unlikely()`, but the macro
also simplifies the code a little bit.

Signed-off-by: Eric Engestrom <eric@igalia.com>
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18086>
2022-08-24 23:03:57 +00:00
Iago Toral Quiroga
87a9951073 broadcom/compiler: track number of TMU operations in prog data
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17854>
2022-08-15 23:35:16 +00:00
Iago Toral Quiroga
8ecea47f06 broadcom/compiler: simplify code emitted for centroid coordinates
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17909>
2022-08-06 22:34:25 +00:00
Iago Toral Quiroga
20591573f1 broadcom/compiler: use nir_opt_idiv_const
total instructions in shared programs: 12463625 -> 12463571 (<.01%)
instructions in affected programs: 1758 -> 1704 (-3.07%)
helped: 12
HURT: 0

total uniforms in shared programs: 3704589 -> 3704591 (<.01%)
uniforms in affected programs: 17 -> 19 (11.76%)
helped: 0
HURT: 1

total max-temps in shared programs: 2148088 -> 2148138 (<.01%)
max-temps in affected programs: 170 -> 220 (29.41%)
helped: 0
HURT: 10

Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17871>
2022-08-05 09:28:22 +00:00
Iago Toral Quiroga
73e8fc3efb broadcom/compiler: don't use imprecise_32bit_lowering for idiv lowering
This is known to produce bogus results for certain combinations of
operands, so don't use it. See this issue for details:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/6555

With this change, the idiv lowering will produce mul_high instructions,
so we need to instruct the compiler to lower those with the ALU lowering
right after the idiv lowering by adding the lower_mul_high option (we
only need to add this to V3D, since V3DV already had it set). This will
cause injection of uadd_carry instructions, for which we have backend
implementations that produce better code for us than the NIR lowering.

total instructions in shared programs: 12457692 -> 12463625 (0.05%)
instructions in affected programs: 23115 -> 29048 (25.67%)
helped: 0
HURT: 111

total threads in shared programs: 416372 -> 416368 (<.01%)
threads in affected programs: 8 -> 4 (-50.00%)
helped: 0
HURT: 2

total uniforms in shared programs: 3704067 -> 3704589 (0.01%)
uniforms in affected programs: 5804 -> 6326 (8.99%)
helped: 2
HURT: 109

total max-temps in shared programs: 2147845 -> 2148088 (0.01%)
max-temps in affected programs: 2456 -> 2699 (9.89%)
helped: 6
HURT: 91

Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17871>
2022-08-05 09:28:22 +00:00
Alejandro Piñeiro
efc827ceea v3d/v3dv: use NIR_PASS(_
Instead of NIR_PASS_V, when possible.

This was done recently on anv (see commit ce60195ec and MR#17014)

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17609>
2022-07-20 11:35:25 +00:00
Alejandro Piñeiro
0a50330c3d broadcom/compiler: make several passes to return a progress
Two advantages:
   * When using NIR_DEBUG=nir_print_xx, will print outcome only if
     there is a change
   * We can use NIR_PASS(_, ...) instead of NIR_PASS_V, that has
     slightly more validation checks.

This includes:
  * v3d_nir_lower_image_load_store
  * v3d_nir_lower_io
  * v3d_nir_lower_line_smooth
  * v3d_nir_lower_load_store_bitsize
  * v3d_nir_lower_robust_buffer_access
  * v3d_nir_lower_scratch
  * v3d_nir_lower_txf_ms

As we are here we also simplify some of them by using the
nir_shader_instructions_pass helper.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17609>
2022-07-20 11:35:25 +00:00
Alejandro Piñeiro
81ca0b4191 broadcom/compiler: removed unused function
It is not even implemented.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17609>
2022-07-20 11:35:25 +00:00
Alejandro Piñeiro
d8fee4cdaa broadcom/compiler: use NIR_PASS for nir_lower_vars_to_ssa at v3d_optimize_nir
There's no reason to not take into account progress at that point.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17609>
2022-07-20 11:35:24 +00:00
Alejandro Piñeiro
dea0fe8a06 broadcom/compiler: wrap nir_convert_to_lcssa with NIR_PASS_V
So we get it included with the NIR_DEBUG=print_xx debug options.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17609>
2022-07-20 11:35:24 +00:00
Iago Toral Quiroga
90054e9c5d broadcom/compiler: track if a shader uses global intrinsics
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17275>
2022-07-19 09:47:34 +02:00
Iago Toral Quiroga
fa03d9c8be broadcom/compiler: implement 2x32 global intrinsics
Notice we ignore the high 32-bit component of the address because
we know it must be 0.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17275>
2022-07-19 09:47:34 +02:00
Iago Toral Quiroga
871a7536e8 broadcom/compiler: don't over-estimate latency of TMU instructions
Over-estimating latency can cause us to delay the critical paths of
the shader unnecessarily, producing larger QPU programs that take more
time to execute as a result (and it also adds register pressure) so
striking a balance is important. The thread switching model in V3D
is quite effective at hiding latency and usuallly we just need to
hint it to delay TMU instructions a little bit to find the best
compromise for performance.

The new latency numbers have been chosen empirically by testing
V3DV with Sponza and a few UE4 samples.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17451>
2022-07-11 10:34:58 +00:00
Iago Toral Quiroga
f227aa7c98 broadcom/compiler: don't try to hide TMU latency at QPU scheduling
Based on empirical testing with Sponza and a few UE4 samples this is
consistently slightly benefitial for performance.

The most likely reason why this helps is that thrsw is probably
already quite effective at hiding latency and we are already trying
to hide latency at NIR scheduling and also via TMU pipelining, so
piling up on this when scheduling QPU typically ends up providing no
benefit at all for latency and is instead possibly preventing us to
unblock critical paths in the shader that depend on the TMU result,
requiring us to execute more cycles to complete the program.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17451>
2022-07-11 10:34:58 +00:00
Iago Toral Quiroga
152fc4fd28 v3dv: don't lower uadd_carry and usub_borrow
We can produce slightly better code for these in the backend, so
do that. For this we need to:

1. Fix our implementation of uadd_carry (which wasn't used) to return
   an integer instead of a boolean value.
2. Add an implementation of usub_borrow.

Notice these are only used in Vulkan. In GL these instructions are
always unconditionally lowered by the state tracker in GLSL IR so
we never get to see them in the backend.

Shader-db stats from a collection of Vulkan samples:

total instructions in shared programs: 122351 -> 122345 (<.01%)
instructions in affected programs: 196 -> 190 (-3.06%)
helped: 2
HURT: 0

total uniforms in shared programs: 18670 -> 18672 (0.01%)
uniforms in affected programs: 59 -> 61 (3.39%)
helped: 0
HURT: 2

total max-temps in shared programs: 13145 -> 13147 (0.02%)
max-temps in affected programs: 27 -> 29 (7.41%)
helped: 0
HURT: 2

total inst-and-stalls in shared programs: 123052 -> 123046 (<.01%)
inst-and-stalls in affected programs: 197 -> 191 (-3.05%)
helped: 2
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17372>
2022-07-07 09:16:24 +00:00
Iago Toral Quiroga
cfccd93efc broadcom/compiler: don't predicate postponed spills
The postponed spill is predicated using the condition from the
last write, but this is only correct if the register was only
written once in the TMU sequence, or if it is always written with
the same predication.

While we could try to track whether this is the case or not, it
would make the postponed spill path even more complex than it
already is, so let's just avoid predicating these. We are already
discouraging TMU spilling of registers in the middle of TMU
sequences, so this should not be a very common case.

Cc: mesa-stable
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17201>
2022-06-28 05:49:51 +00:00
Iago Toral Quiroga
98420408d0 broadcom/compiler: fix postponed TMU spills with multiple writes
If we are spilling a register that is used in the middle of a TMU
sequence, we postpone the spill until the TMU sequence finishes,
at which point we inject the spill and rewrite the original
instruction to write to the new temp.

However, this doesn't work if the register is written multiple
times during the TMU sequence. In that scenario, we need to ensure
that all writes are rewritten to use the new temp, not just the last
one.

Cc: mesa-stable
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17201>
2022-06-28 05:49:51 +00:00
Iago Toral Quiroga
a97f78eb14 broadcom/compiler: disable flags optimization for loop conditions
This is not safe because it may skip regenerating the flags for the
loop condition in the loop continue block and these flags may be
stomped in the loop body by other conditionals.

Fixes: 9909fe6ba ('broadcom/compiler: Skip bool_to_cond where possible')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17020>
2022-06-14 11:30:33 +00:00
Erik Faye-Lund
873ec432b3 broadcom/compiler: use macro for power-of-two check
This will allow the use of static_assert here instead of our
compiler-specific implementation.

Reviewed-by: Jesse Natalie <jenatali@microsoft.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16670>
2022-06-03 07:14:43 +00:00
Iago Toral Quiroga
b90d7b9b38 broadcom/compiler: don't promote early fragment tests when writing sample mask
If the sample mask is being written it means we want to discard some of the
samples generated so we should not be promoting the fragment shader to
do early tests, since that would not take into account the sample mask
written from the shader.

Fixes:
dEQP-VK.fragment_operations.early_fragment.sample_count_early_fragment_tests_depth_samples_4

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16626>
2022-05-20 13:04:32 +00:00
Iago Toral Quiroga
487c213142 v3d/compiler: add more stats to prog_data
So we can expose them via VK_KHR_pipeline_executable_properties.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16370>
2022-05-09 12:12:35 +00:00
Emma Anholt
536c8ee96d nir/lower_tex: Make the adding a 0 LOD to nir_op_tex in the VS optional.
This controls the whole lowering of "make tex ops with implicit
derivatives on non-implicit-derivative stages be tex ops with an explicit
lod of 0 instead", but it's really hard to describe that in a git commit
summary.

All existing callers get it added except:
- nir_to_tgsi which didn't want it.
- nouveau, which didn't want it (fixes regressions in shadowcube and
  shadow2darray with NIR, since the shading languages don't expose txl of
  those sampler types and thus it's not supported in HW)
- optional lowering passes in mesa/st (lower_rect, YUV lowering, etc)

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16156>
2022-04-28 21:26:08 +00:00
Iago Toral Quiroga
cf4b3cb563 broadcom/compiler: prefer reconstruction over TMU spills when possible
We have been reconstructing/rematerializing uniforms for a while, but we
can do this in more scenarios, namely instructions which result is
immutable along the execution of a shader across all channels.

By doing this we gain the capacity to eliminate TMU spills which not
only are slower, but can also make us drop to a fallback compilation
strategy.

Shader-db results show a small increase in instruction counts caused
by us now being able to choose preferential compiler strategies that
are intended to reduce TMU latency. In some cases, we are now also
able to avoid dropping thread counts:

total instructions in shared programs: 12658092 -> 12659245 (<.01%)
instructions in affected programs: 75812 -> 76965 (1.52%)
helped: 55
HURT: 107

total threads in shared programs: 416286 -> 416412 (0.03%)
threads in affected programs: 126 -> 252 (100.00%)
helped: 63
HURT: 0

total uniforms in shared programs: 3716916 -> 3716396 (-0.01%)
uniforms in affected programs: 19327 -> 18807 (-2.69%)
helped: 94
HURT: 50

total max-temps in shared programs: 2161796 -> 2161578 (-0.01%)
max-temps in affected programs: 3961 -> 3743 (-5.50%)
helped: 80
HURT: 24

total spills in shared programs: 3274 -> 3266 (-0.24%)
spills in affected programs: 98 -> 90 (-8.16%)
helped: 6
HURT: 0

total fills in shared programs: 4657 -> 4642 (-0.32%)
fills in affected programs: 130 -> 115 (-11.54%)
helped: 6
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15710>
2022-04-08 05:37:28 +00:00
Iago Toral Quiroga
597560e27c broadcom/compiler: always enable per-quad on spill operations
This ensures that any channels used for helper invocations are
also spilled/filled correctly.

Alternatively, we could recursively track all temps that get
involved in computing values that are then used in explicit
(dfdx,dfdy) or implicit (texture coordinates for mipmap or
anisotropic filtering, etc) derivatives, and only enable
per-quad on these (or disable spilling of any of these
values).

Fixes:
dEQP-VK.graphicsfuzz.cov-dfdx-dfdy-after-nested-loops

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15705>
2022-04-01 08:53:50 +00:00
Iago Toral Quiroga
ce849032a4 broadcom/compiler: allow ldunifa with indirect uniform loads
We handle uniforms by copying them into the uniform stream to be
consumed with ldunif when they have a constant offset. Otherwise
we fallback to general TMU access, which has more latency.

However, just like we did for UBOs and read-only SSBOs, we can
also try to use the unifa mechanism to handle indirect accesses
in certain cases instead of the TMU fallback.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15575>
2022-03-28 10:44:13 +00:00
Iago Toral Quiroga
ea3223e7a4 v3dv: implement VK_EXT_inline_uniform_block
Inline uniform blocks store their contents in pool memory rather
than a separate buffer, and are intended to provide a way in which
some platforms may provide more efficient access to the uniform
data, similar to push constants but with more flexible size
constraints.

We implement these in a similar way as push constants: for constant
access we copy the data in the uniform stream (using the new
QUNIFORM_UNIFORM_UBO_*) enums to identify the inline buffer from
which we need to copy and for indirect access we fallback to
regular UBO access.

Because at NIR level there is no distinction between inline and
regular UBOs and the compiler isn't aware of Vulkan descriptor
sets, we use the UBO index on UBO load intrinsics to identify
inline UBOs, just like we do for push constants. Particularly,
we reserve indices 1..MAX_INLINE_UNIFORM_BUFFERS for this,
however, unlike push constants, inline buffers are accessed
through descriptor sets, and therefore we need to make sure
they are located in the first slots of the UBO descriptor map.
This means we store them in the first MAX_INLINE_UNIFORM_BUFFERS
slots of the map, with regular UBOs always coming after these
slots.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15575>
2022-03-28 10:44:13 +00:00
Alejandro Piñeiro
e3d905ec39 v3dv/pipeline: use new helper vk_shader_module_to_nir
In addition to use the helper, we also remove some of the lowering we
had at preprocess_nir, as they are called now by the helper.

As we are here we also move the call to nir_lower_sysvals_to_varyings,
that for some reason we were calling it before preprocess_nir.

It is worth to note that with this change we lose the ability to debug
the NIR just after spirv_to_nir using V3D_DEBUG, as now this is done
on vk_spirv_to_nir, and as mentioned that includes several lowerings
now. The workaround to that is to use NIR_DEBUG.

We also needed to change how to check the entrypoint on the broadcom
compiler, checking just if it is an entrypoint, instead of assuming
that the name will be "main".

v2: tweak comment, squash v3dv and compiler change (Iago)

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15449>
2022-03-18 11:05:11 +00:00
Iago Toral Quiroga
49b5431197 broadcom/compiler: remove unused functions
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15302>
2022-03-10 07:25:37 +00:00
Iago Toral Quiroga
44feff93c2 broadcom/compiler: don't always assign r5 if available
Instead, only favor assigning r5 if we have first decided to
assign an accumulator. This helps with assining r5 to short
lived uniforms, favoring accumulator rotation to facilitate
QPU merges.

total instructions in shared programs: 12656164 -> 12628339 (-0.22%)
instructions in affected programs: 5368373 -> 5340548 (-0.52%)
helped: 17420
HURT: 9996

total uniforms in shared programs: 3704776 -> 3704863 (<.01%)
uniforms in affected programs: 12247 -> 12334 (0.71%)
helped: 23
HURT: 78

total max-temps in shared programs: 2153505 -> 2152684 (-0.04%)
max-temps in affected programs: 26468 -> 25647 (-3.10%)
helped: 569
HURT: 328

total fills in shared programs: 4656 -> 4657 (0.02%)
fills in affected programs: 43 -> 44 (2.33%)
helped: 0
HURT: 1

total sfu-stalls in shared programs: 34728 -> 34403 (-0.94%)
sfu-stalls in affected programs: 3411 -> 3086 (-9.53%)
helped: 842
HURT: 534

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
77f58b46d9 broadcom/compiler: add comment on why we don't use r5 with ldunifa
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
5b140428b0 broadcom/compiler: adjust register threshold for 2-thread compiles
We have twice the registers in this case so it makes sense to double
this as well. While this causes slight regressions in shader-db
stats (due to additional register pressure), it helps us hide latency
of memory reads better on 2-thread compiles, where the thread switch
mechanism will be less effective. This shows a ~3% performance
improvement on the UE4 SunTemple demo.

total instructions in shared programs: 12642413 -> 12656164 (0.11%)
instructions in affected programs: 2272652 -> 2286403 (0.61%)
helped: 2924
HURT: 3389

total uniforms in shared programs: 3703861 -> 3704776 (0.02%)
uniforms in affected programs: 213729 -> 214644 (0.43%)
helped: 823
HURT: 1272

total max-temps in shared programs: 2150686 -> 2153505 (0.13%)
max-temps in affected programs: 191332 -> 194151 (1.47%)
helped: 1900
HURT: 1891

total spills in shared programs: 3255 -> 3274 (0.58%)
spills in affected programs: 166 -> 185 (11.45%)
helped: 3
HURT: 6

total fills in shared programs: 4630 -> 4656 (0.56%)
fills in affected programs: 367 -> 393 (7.08%)
helped: 7
HURT: 15

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
a35b47a0b1 broadcom/compiler: add a strategy to disable scheduling of general TMU reads
This can add quite a bit of register pressure so it makes sense to disable it
to prevent us from dropping to 2 threads or increase spills:

total instructions in shared programs: 12672813 -> 12642413 (-0.24%)
instructions in affected programs: 256721 -> 226321 (-11.84%)
helped: 719
HURT: 77

total threads in shared programs: 415534 -> 416322 (0.19%)
threads in affected programs: 788 -> 1576 (100.00%)
helped: 394
HURT: 0

total uniforms in shared programs: 3711370 -> 3703861 (-0.20%)
uniforms in affected programs: 28859 -> 21350 (-26.02%)
helped: 204
HURT: 455

total max-temps in shared programs: 2159439 -> 2150686 (-0.41%)
max-temps in affected programs: 32945 -> 24192 (-26.57%)
helped: 585
HURT: 47

total spills in shared programs: 5966 -> 3255 (-45.44%)
spills in affected programs: 2933 -> 222 (-92.43%)
helped: 192
HURT: 4

total fills in shared programs: 9328 -> 4630 (-50.36%)
fills in affected programs: 5184 -> 486 (-90.62%)
helped: 196
HURT: 0

Compared to the stats before adding scheduling of non-filtered
memory reads we see we that we have now gotten back all that was
lost and then some:

total instructions in shared programs: 12663186 -> 12642413 (-0.16%)
instructions in affected programs: 2051803 -> 2031030 (-1.01%)
helped: 4885
HURT: 3338

total threads in shared programs: 415870 -> 416322 (0.11%)
threads in affected programs: 896 -> 1348 (50.45%)
helped: 300
HURT: 74

total uniforms in shared programs: 3711629 -> 3703861 (-0.21%)
uniforms in affected programs: 158766 -> 150998 (-4.89%)
helped: 1973
HURT: 499

total max-temps in shared programs: 2138857 -> 2150686 (0.55%)
max-temps in affected programs: 177920 -> 189749 (6.65%)
helped: 2666
HURT: 2035

total spills in shared programs: 3860 -> 3255 (-15.67%)
spills in affected programs: 2653 -> 2048 (-22.80%)
helped: 77
HURT: 21

total fills in shared programs: 5573 -> 4630 (-16.92%)
fills in affected programs: 3839 -> 2896 (-24.56%)
helped: 81
HURT: 15

total sfu-stalls in shared programs: 39583 -> 38154 (-3.61%)
sfu-stalls in affected programs: 8993 -> 7564 (-15.89%)
helped: 1808
HURT: 1038

total nops in shared programs: 324894 -> 323685 (-0.37%)
nops in affected programs: 30362 -> 29153 (-3.98%)
helped: 2513
HURT: 2077

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
f783bd0d2a broadcom/compiler: define v3d-specific delays for NIR instructions
We do a few changes over NIR's defaults:

1. Lower delay for texture reads. Empirically, we don't observe any
   benefits with delays over 50 and since this delay value is still
   used by the scheduler in the "favor register pressure" case it is
   benefitial to avoid overestimating it too much.

2. Adjust delay for non-filtered TMU reads to the delay selected for
   texture reads.

3. In our case, UBO reads from dynamically uniform addresses don't
   use the TMU and have a latency of 1 instruction in the best case
   scenario or 4 at worse, so we go with 1 so we don't try to move
   this early.

This helps us get back some of what we lost when updating the
default scheduler configuration to add a delay for non-filtered
memory reads:

total instructions in shared programs: 13126587 -> 12671765 (-3.46%)
instructions in affected programs: 3764097 -> 3309275 (-12.08%)
helped: 14664
HURT: 4244

total threads in shared programs: 407208 -> 415522 (2.04%)
threads in affected programs: 8716 -> 17030 (95.39%)
helped: 4224
HURT: 67

total uniforms in shared programs: 3812698 -> 3711224 (-2.66%)
uniforms in affected programs: 335170 -> 233696 (-30.28%)
helped: 2816
HURT: 3551

total max-temps in shared programs: 2318430 -> 2159345 (-6.86%)
max-temps in affected programs: 539991 -> 380906 (-29.46%)
helped: 13173
HURT: 1440

total spills in shared programs: 49086 -> 5966 (-87.85%)
spills in affected programs: 48306 -> 5186 (-89.26%)
helped: 1655
HURT: 28

total fills in shared programs: 55810 -> 9328 (-83.29%)
fills in affected programs: 54821 -> 8339 (-84.79%)
helped: 1659
HURT: 22

LOST:   0
GAINED: 3

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
e7a4e97076 nir/schedule: use larger delay for non-filtered memory reads
This has been pending for a long time. It is not very consistent to
add a significant delay for textures and not do it for UBOs, etc
The reason we have not been doing this so far is the accumulated effect
on register pressure for V3D as shown by shader-db results below, but
from the point of view of a generic scheduler it makes sense to do this.

Later patches will address V3D specific issues with register pressure
derived from this by letting the driver control its instruction delay
settings.

total instructions in shared programs: 12662138 -> 13126587 (3.67%)
instructions in affected programs: 1813091 -> 2277540 (25.62%)
helped: 2410
HURT: 10499

total threads in shared programs: 415858 -> 407208 (-2.08%)
threads in affected programs: 17348 -> 8698 (-49.86%)
helped: 8
HURT: 4333

total uniforms in shared programs: 3711483 -> 3812698 (2.73%)
uniforms in affected programs: 128012 -> 229227 (79.07%)
helped: 3474
HURT: 2143

total max-temps in shared programs: 2138763 -> 2318430 (8.40%)
max-temps in affected programs: 318780 -> 498447 (56.36%)
helped: 588
HURT: 11997

total spills in shared programs: 3860 -> 49086 (1171.66%)
spills in affected programs: 709 -> 45935 (6378.84%)
helped: 23
HURT: 1595

total fills in shared programs: 5573 -> 55810 (901.44%)
fills in affected programs: 1067 -> 51304 (4708.25%)
helped: 23
HURT: 1595

LOST:   3
GAINED: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:04 +00:00
Iago Toral Quiroga
9ef499b315 broadcom/compiler: stop moving UBO loads before NIR scheduling
This doesn't have any significant impact shader-db stats and would
reduce our capacity to hide latency from the loads, so it is probably
undesirable:

total instructions in shared programs: 12663189 -> 12663186 (<.01%)
instructions in affected programs: 4222 -> 4219 (-0.07%)
helped: 9
HURT: 4

total uniforms in shared programs: 3711624 -> 3711629 (<.01%)
uniforms in affected programs: 186 -> 191 (2.69%)
helped: 0
HURT: 2

total max-temps in shared programs: 2138822 -> 2138857 (<.01%)
max-temps in affected programs: 569 -> 604 (6.15%)
helped: 1
HURT: 9

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15276>
2022-03-09 15:53:03 +00:00
Iago Toral Quiroga
f761f8fd9e broadcom/compiler: simplify node/temp translation during register allocation
Now that we don't sort our nodes we can arrange them so we can
easily translate between nodes and temps without a mapping table,
just applying an offset.

To do this we have a single array of nodes where twe put first the nodes
for accumulators and then the nodes for temps. With this setup we can
ensure that for any given temp T, its node is always T + ACC_COUNT.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga
871b0a7f6a broadcom/compiler: don't sort nodes for register allocation
Nodes are allocated in order to registers so initially sorting
was used to ensure that nodes with smaller life ranges would
be assigned first and therefore be more likely to get
accumulators.

However, since d81a6e5f1d now we don't rely on order to make
decisions about accumulators and instead we make policy decisions
based on actual liveness, so sorting is no longer strictly
relevant to this decision.

Furthermore, we are not re-sorting nodes after each spill either,
since that would probably require that we rebuild the interference
graph after each spill (the graph identifies nodes by their index).

Shader-db results show a significant improvement in instruction
counts, due to more optimal accumulator assignments. The reason for
this is that we use a round-robin policy for choosing the next
accumulator to assign. The idea behind this is preventing nearby
temps to be assigned to the same accumulator so that QPU scheduling
is more flexible, but if we  sort our nodes, we are basically not
assigning temps in program order any more and the round-robin policy
becomes less effective:

total instructions in shared programs: 13000420 -> 12663189 (-2.59%)
instructions in affected programs: 11791267 -> 11454036 (-2.86%)
helped: 62890
HURT: 19987

total threads in shared programs: 415874 -> 415870 (<.01%)
threads in affected programs: 20 -> 16 (-20.00%)
helped: 2
HURT: 4

total uniforms in shared programs: 3711652 -> 3711624 (<.01%)
uniforms in affected programs: 43430 -> 43402 (-0.06%)
helped: 134
HURT: 173

total max-temps in shared programs: 2144876 -> 2138822 (-0.28%)
max-temps in affected programs: 123334 -> 117280 (-4.91%)
helped: 4112
HURT: 1195

total spills in shared programs: 3870 -> 3860 (-0.26%)
spills in affected programs: 1013 -> 1003 (-0.99%)
helped: 14
HURT: 12

total fills in shared programs: 5560 -> 5573 (0.23%)
fills in affected programs: 1765 -> 1778 (0.74%)
helped: 14
HURT: 17

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00