rather than always early killing and then hitting pathological shuffle
situations, only early-kill when we can prove that we won't need to shuffle. it
turns out that's most of the time.
even with this heuristic, we still get hurt bad in shader-db due to extra moves.
but hopefully, the #s here are small enough that we can move on with our lives
and fix this source of known unsoundness.
this is tagged for backport as it's needed to avoid a perf regression with the
previous patch.
combined stats from this commit and the previous commit:
total instrs in shared programs: 2846065 -> 2852257 (0.22%)
instrs in affected programs: 618734 -> 624926 (1.00%)
total alu in shared programs: 2329477 -> 2335534 (0.26%)
alu in affected programs: 508119 -> 514176 (1.19%)
total gprs in shared programs: 894762 -> 901327 (0.73%)
gprs in affected programs: 36946 -> 43511 (17.77%)
Backport-to: 25.1
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34595>
shader-db stats combined with next commit. this is the rip off the bandaid, next
is the optimize. split to enable bisecting.
the code we have to shuffle clobbered killed sources is broken and, after
thinking about that for a Long time, I don't see a reasonable way to fix it. But
if we late-kill sources - or model our calculations as-if we were late-killing
souces - we never have to shuffle onto a killed source and the problem goes away
entirely.
this is similar in spirit to what NAK does. it's not "optimal", but it's sane.
Backport-to: 25.1
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34595>
This hurts us in two ways:
* slightly more spilling (not actually a big problem)
* slightly worse occupancy (the shaders that are "helped" here are from trying
less hard to fit at higher occupancy levels)
However, in exchange we get a LOT more flexibility in the RA.
total instrs in shared programs: 2847015 -> 2846065 (-0.03%)
instrs in affected programs: 84134 -> 83184 (-1.13%)
total alu in shared programs: 2330406 -> 2329477 (-0.04%)
alu in affected programs: 62305 -> 61376 (-1.49%)
total code size in shared programs: 20497326 -> 20491690 (-0.03%)
code size in affected programs: 586664 -> 581028 (-0.96%)
total gprs in shared programs: 894202 -> 894762 (0.06%)
gprs in affected programs: 8900 -> 9460 (6.29%)
total scratch in shared programs: 13292 -> 13304 (0.09%)
scratch in affected programs: 2924 -> 2936 (0.41%)
total threads in shared programs: 27819712 -> 27814272 (-0.02%)
threads in affected programs: 55296 -> 49856 (-9.84%)
total spills in shared programs: 907 -> 914 (0.77%)
spills in affected programs: 419 -> 426 (1.67%)
total fills in shared programs: 857 -> 862 (0.58%)
fills in affected programs: 389 -> 394 (1.29%)
Backport-to: 25.1
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34595>
../src/asahi/compiler/agx_nir_lower_address.c:82:30: runtime error: passing zero to ctz(), which is not a valid argument
#0 0x56175dca5684 in pass ../src/asahi/compiler/agx_nir_lower_address.c:82
#1 0x56175dca2eda in nir_function_intrinsics_pass ../src/compiler/nir/nir_builder.h:164
#2 0x56175dca308c in nir_shader_intrinsics_pass ../src/compiler/nir/nir_builder.h:191
#3 0x56175dca68ae in agx_nir_lower_address ../src/asahi/compiler/agx_nir_lower_address.c:167
#4 0x56175dc885c0 in agx_optimize_nir ../src/asahi/compiler/agx_compile.c:3063
#5 0x56175dc9650d in agx_compile_shader_nir ../src/asahi/compiler/agx_compile.c:3827
#6 0x56175dc52148 in main ../src/asahi/clc/asahi_clc.c:359
#7 0x7fdf1c343249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#8 0x7fdf1c343304 in __libc_start_main_impl ../csu/libc-start.c:360
#9 0x56175dc44760 in _start (/builds/mesa/mesa/_build/src/asahi/clc/asahi_clc+0xbd0760)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33685>
This currently treats coarse and fine derivatives the same, but Qualcomm
needs to know whether just coarse derivatives are used or fine
derivatives/quad ops are also used. Rename this to
needs_coarse_quad_helper_invocations make clear the difference from the
new field, needs_full_quad_helper_invocations.
Reviewed-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Reviewed-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Fixes: 264d8a6766 ("ir3: Set need_full_quad depending on info.fs.require_full_quads")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33862>
this will be needed with vtn_bindgen2
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Jesse Natalie <jenatali@microsoft.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33067>
A negative hole size means the loads overlap. This will be used by drivers
to handle overlapping loads in the callback easily.
Reviewed-by: Mel Henning <drawoc@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32699>
so we can use designated initializers and other fun features.
all affected shaders are in gfxbench:
total instructions in shared programs: 2750549 -> 2750497 (<.01%)
instructions in affected programs: 10832 -> 10780 (-0.48%)
helped: 4
HURT: 2
Inconclusive result (value mean confidence interval includes 0).
total alu in shared programs: 2278478 -> 2278760 (0.01%)
alu in affected programs: 7040 -> 7322 (4.01%)
helped: 2
HURT: 4
Alu are HURT.
total fscib in shared programs: 2276985 -> 2277267 (0.01%)
fscib in affected programs: 7040 -> 7322 (4.01%)
helped: 2
HURT: 4
Fscib are HURT.
total bytes in shared programs: 19922466 -> 19922734 (<.01%)
bytes in affected programs: 71412 -> 71680 (0.38%)
helped: 4
HURT: 2
Inconclusive result (value mean confidence interval includes 0).
total regs in shared programs: 865070 -> 865086 (<.01%)
regs in affected programs: 142 -> 158 (11.27%)
helped: 0
HURT: 2
total uniforms in shared programs: 2120930 -> 2121034 (<.01%)
uniforms in affected programs: 244 -> 348 (42.62%)
helped: 0
HURT: 2
total scratch in shared programs: 11576 -> 11600 (0.21%)
scratch in affected programs: 2744 -> 2768 (0.87%)
helped: 0
HURT: 2
total spills in shared programs: 958 -> 868 (-9.39%)
spills in affected programs: 958 -> 868 (-9.39%)
helped: 6
HURT: 0
total fills in shared programs: 732 -> 626 (-14.48%)
fills in affected programs: 732 -> 626 (-14.48%)
helped: 4
HURT: 2
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Mary Guillemard <mary.guillemard@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32529>
This is a prerequisite for enabling nir_opt_varyings for all gallium
drivers.
nir_lower_io_passes (called by the GLSL linker) only uses NIR options
to lower indirect IO access before lowering IO and calling
nir_opt_varyings.
Most drivers report full support for indirect IO and lower it themselves,
which prevents compaction of lowered indirectly accessed varyings because
nir_opt_varyings doesn't touch indirect varyings.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> (Rb for asahi)
Reviewed-by: Pavel Ondračka <pavel.ondracka@gmail.com> (for r300)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32423>
switch libagx to the precompilation pipeline. see the big comment in the
previous commit for why we're doing this.
while doing so, we move some dispatch stuff. there was so much churn from
precompile that this avoids doing the churn twice. that new header will be used
for DGC down the road.
there's also a small vtn/bindgen patch in here to skip bindgen'ing entrypoints,
as that conflicts with the new dispatch macros. this is the sane behaviour, we
just need to do the full precomp switch across the tree at once.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32339>
can help a lot for scratchy shaders. affected shader is a compute shader in
graphics bench.
total instructions in shared programs: 2751334 -> 2750950 (-0.01%)
instructions in affected programs: 4308 -> 3924 (-8.91%)
helped: 2
HURT: 0
total bytes in shared programs: 21482512 -> 21480592 (<.01%)
bytes in affected programs: 27448 -> 25528 (-7.00%)
helped: 2
HURT: 0
total fills in shared programs: 732 -> 540 (-26.23%)
fills in affected programs: 396 -> 204 (-48.48%)
helped: 2
HURT: 0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32320>
If we know the shader doesn't use global atomics, we don't care if the target
has this quirk or not and we can produce a single binary for all G13/G14
hardware. Model that in the shader key.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32224>
This will need to be set to true when the GLSL linker lowers IO, which
can later be unlowered by st/mesa, and then drivers can lower it again
without load_interpolated_input. Therefore, it can't be a global
immutable option.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32229>