It's always set to a fixed value and not used in many places. Use the
value directly where it's needed.
Suggested-by: Lucas Fryzek <lfryzek@igalia.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37648>
With EXT_shader_object, it became possible to compile shaders
independently and then use them together later, so we cannot rely on the
lack of task shader data to decide that no task shader will be used. The
flag VK_SHADER_CREATE_NO_TASK_SHADER_BIT_EXT exists for that purpose,
but it doesn't really make any difference for us. Always assume that if
the mesh shader is reading the task payload, it's going to be used with
one, as otherwise the application is doing it wrong.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13983
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37648>
Sometimes the compute shader workgroup size requires a larger SIMD width
than the minimum in order to fit in the available threads. In that case
we'll skip the SIMD8 shader, and need to try SIMD16 regardless of how
the register pressure estimate looks.
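A minimal sketch of the rule described above. The helper names and the
thread-count model are assumptions made for illustration, not the actual
brw code: the smallest width that fits the workgroup is always compiled,
and the pressure estimate only gates the widths above it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: smallest SIMD width (8, 16 or 32) for which a
 * workgroup of `group_size` invocations fits in `max_threads` hardware
 * threads. Returns 0 if even SIMD32 does not fit. */
static unsigned
min_required_simd(unsigned group_size, unsigned max_threads)
{
   assert(max_threads > 0);
   for (unsigned simd = 8; simd <= 32; simd *= 2) {
      unsigned threads = (group_size + simd - 1) / simd;
      if (threads <= max_threads)
         return simd;
   }
   return 0;
}

/* Hypothetical policy: widths below the required minimum are skipped,
 * the required minimum is always tried, and larger widths are tried
 * only when the pressure estimate does not rule them out. */
static bool
should_compile_simd(unsigned simd, unsigned required_simd,
                    bool pressure_too_high)
{
   if (simd < required_simd)
      return false;
   if (simd == required_simd)
      return true;
   return !pressure_too_high;
}
```

With a 1024-invocation workgroup and 64 available threads, SIMD8 is
skipped and SIMD16 is tried even if the pressure estimate looks bad.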
Fixes: 3af4e63061 ("brw: Skip compilation of larger SIMDs when pressure is too high")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37649>
I didn't bother switching either iris or elk/hasvk but one could.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37517>
It removes a duplication and will also be used in a future patch
from another file.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37670>
This allows us to skip the entire backend compilation process for
large SIMD widths when register pressure is high enough that we'd
likely decide to prefer a smaller one in the end anyway. The hope
is to make the same decisions as before, but with less CPU overhead.
We are making mostly the same decisions as before:
| API / Platform | Total Shaders | Changed | % Identical |
--------------------------------------------------
| VK / Arc A770 | 905,525 | 1,157 | 99.872% |
| VK / Arc B580 | 788,127 | 53 | 99.993% |
| VK / Panther | 786,333 | 13 | 99.998% |
| GL / Arc A770 | 308,618 | 269 | 99.913% |
| GL / Arc B580 | 264,066 | 13 | 99.995% |
| GL / Panther | 273,212 | 0 | 100.000% |
Improves compile times on my i7-12700K:
| Game | Arc B580 | Arc A770 |
---------------------------------------------------
| Assassins Creed: Odyssey | -13.47% | -10.98% |
| Borderlands 3 (DX12) | -10.05% | -11.31% |
| Dark Souls 3 | -21.06% | -21.08% |
| Oblivion Remastered | -11.10% | -9.82% |
| Phasmophobia | -32.73% | -31.00% |
| Red Dead Redemption 2 | -20.10% | -14.38% |
| Total War: Warhammer III | -10.11% | -14.44% |
| Wolfenstein Youngblood | -15.91% | -13.47% |
| Shadow of the Tomb Raider | -30.23% | -25.86% |
It seems to have nearly no effect on compile times on Xe3 unfortunately,
as only 1,014 shaders in fossil-db even fail SIMD32 compilation in the
first place, and we want to let most of the "might succeed" cases
through to the backend for throughput analysis.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
This tries to calculate an underestimate (lower bound) for the register
pressure at various SIMD widths, by counting live values in the NIR
shader. This fundamentally won't be accurate, but it can give us an
idea of whether it's even worth trying a certain SIMD-width compile.
Doing this at the NIR level means we:
- Can use SSA structure rather than fuzzy liveness intervals
- Can avoid the backend scheduler aggressively trying to hide latency,
presenting an overinflated view of the register pressure
- Have divergence information on-hand, making it easier to "scale up"
- Can skip cloning and optimizing NIR for compute shader SIMD widths
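One simple way to "scale up" a live-value count to a given SIMD width
can be sketched as follows. The function name, the packing of
convergent values, and the register size parameter are assumptions for
illustration only; the real pass may count things differently.

```c
#include <assert.h>

/* Hypothetical estimate: each divergent 32-bit value needs one dword
 * per channel, while convergent values need a single dword each and
 * are assumed to pack together. `reg_dwords` is the number of dwords
 * in one register (for example 8 for a 32-byte register). */
static unsigned
estimate_pressure(unsigned live_divergent, unsigned live_convergent,
                  unsigned simd_width, unsigned reg_dwords)
{
   assert(simd_width > 0 && reg_dwords > 0);

   unsigned regs_per_divergent =
      (simd_width + reg_dwords - 1) / reg_dwords;
   unsigned convergent_regs =
      (live_convergent + reg_dwords - 1) / reg_dwords;

   return live_divergent * regs_per_divergent + convergent_regs;
}
```

Only the divergent part grows with the SIMD width, which is why having
divergence information available makes the estimate cheap to rescale.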
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
We were doing a lot of NIR work repeatedly for each SIMD variant of
compute and mesh shaders. Instead, do it once before cloning, and
just do one final optimization loop and out-of-SSA for each.
fossil-db results on Arc B580:
Totals:
Instrs: 233771096 -> 233794024 (+0.01%); split: -0.01%, +0.02%
Subgroup size: 15922768 -> 15922736 (-0.00%); split: +0.00%, -0.00%
Send messages: 12095619 -> 12098234 (+0.02%); split: -0.00%, +0.02%
Loop count: 137562 -> 137523 (-0.03%)
Cycle count: 32600323744 -> 32667411252 (+0.21%); split: -0.06%, +0.27%
Spill count: 540908 -> 542027 (+0.21%); split: -0.07%, +0.28%
Fill count: 700938 -> 698983 (-0.28%); split: -0.73%, +0.45%
Scratch Memory Size: 37266432 -> 37304320 (+0.10%); split: -0.10%, +0.20%
Max live registers: 72691728 -> 72692987 (+0.00%); split: -0.00%, +0.00%
Non SSA regs after NIR: 67690309 -> 67688352 (-0.00%); split: -0.01%, +0.00%
Totals from 3576 (0.45% of 789301) affected shaders:
Instrs: 6932956 -> 6955884 (+0.33%); split: -0.41%, +0.74%
Subgroup size: 88816 -> 88784 (-0.04%); split: +0.09%, -0.13%
Send messages: 329168 -> 331783 (+0.79%); split: -0.02%, +0.81%
Loop count: 8753 -> 8714 (-0.45%)
Cycle count: 15153678820 -> 15220766328 (+0.44%); split: -0.14%, +0.58%
Spill count: 213751 -> 214870 (+0.52%); split: -0.18%, +0.71%
Fill count: 282616 -> 280661 (-0.69%); split: -1.82%, +1.13%
Scratch Memory Size: 13056000 -> 13093888 (+0.29%); split: -0.27%, +0.56%
Max live registers: 834757 -> 836016 (+0.15%); split: -0.11%, +0.26%
Non SSA regs after NIR: 995033 -> 993076 (-0.20%); split: -0.48%, +0.28%
Looking at a few of the shaders with substantial instruction count
increases, it appears that it is largely due to more loops being
unrolled, which is probably actually a good thing.
The compile time impact of this patch appears to be negligible.
However, doing postprocessing before SIMD cloning allows us to
examine the postprocessed SSA-form NIR for improvements in an
upcoming patch.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
brw_postprocess_nir contains a lot of stuff these days. The first part
does a bunch of lowering and cleanup optimizations in SSA form. The
second part does some post-optimization lowering and the out-of-SSA
conversion.
We may want to do additional work before the post-optimization/post-SSA
phase. Splitting this allows us to insert such tasks in the "middle".
For convenience, brw_postprocess_nir() becomes a wrapper which invokes
both parts, so callers can continue working as they did until they have
a reason to do otherwise.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
This allows us to lower known subgroup size cases earlier, giving us
some earlier optimization opportunities. We would need to know the
actual SIMD width to handle certain cases, but we can just pass 0 here,
which will lead to get_subgroup_size returning 0 - the same as leaving
this unset. We can come back to that later during the per-SIMD-width
postprocessing.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
We were printing the SSA form, then immediately running divergence
analysis. This patch flips those, so we can see con/div in INTEL_DEBUG
output for SSA form, which is really useful.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
float_controls2 may have marked these as needing to preserve NaN or
other values. If so, our newly contracted ffma needs to as well.
Fixes dEQP-VK.spirv_assembly.instruction.compute.float_controls2.*.input_args.mat_det_testedWithout_NotNan*
when nir_opt_algebraic is run after this pass.
Cc: mesa-stable
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
If the (NIR) destination is a register (i.e., not an SSA value), the
destination of the BRW instruction will not be is_scalar. This occurs in
some shaders in Final Fantasy XVI (and
finalfantasytype0_1.rdc.2826e29da3722a83.1.foz).
If the destination is not is_scalar, revert most of this code to the
state previous to f3593df877. This means
- Allocate a SIMD1 register and UNDEF it.
- Emit a SIMD1 MOV_RELOC_IMM to that register.
- Emit an additional MOV to expand the SIMD1 result.
Closes: #12520
Fixes: f3593df877 ("brw/nir: Treat load_reloc_const_intel as convergent")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37384>
src[1]/src0 is signed and SHR on Xe2+ doesn't support signed data
types, so switch this over to ASR, which does support signed data
types.
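For reference, the difference between the two shifts can be sketched in
plain C. These helpers are illustrative only and are not part of the
driver:

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift right: vacated high bits are filled with zeros. */
static uint32_t
shr32(uint32_t value, unsigned count)
{
   assert(count < 32);
   return value >> count;
}

/* Arithmetic shift right: vacated high bits replicate the sign bit.
 * Negative inputs are complemented first so that only non-negative
 * values are ever shifted. */
static int32_t
asr32(int32_t value, unsigned count)
{
   assert(count < 32);
   if (value >= 0)
      return value >> count;
   return ~(~value >> count);
}
```

Shifting -8 right by one gives -4 with asr32(), while shr32() on the
same bit pattern gives 0x7ffffffc.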
Cc: mesa-stable
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37557>
os_get_option() is a wrapper for getenv() that checks properties on
Android. It should be a no-op on other OSes but will allow full use of
env vars on Android.
The environment variable names are automatically renamed by
os_get_option() and the order of precedence thus becomes:
1. getenv (non-Android)
2. debug.mesa.* (Android)
3. vendor.mesa.* (Android)
4. mesa.* (Android, as a fallback for older versions)
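The lookup order above can be sketched like this. Everything here is a
stand-in written for illustration: the function names, the callback
type, and the example property store are hypothetical, and the renaming
of the variable name into a property suffix is taken as an input rather
than reproduced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef const char *(*property_lookup_fn)(const char *key);

/* Stand-in property store for the example; a real build would query
 * Android system properties instead. */
static const char *
example_property(const char *key)
{
   if (strcmp(key, "vendor.mesa.example") == 0)
      return "from-vendor";
   if (strcmp(key, "mesa.example") == 0)
      return "from-fallback";
   return NULL;
}

/* Hypothetical lookup: getenv() off Android, otherwise the three
 * property prefixes in order of precedence. The returned pointer is
 * owned by the environment or the property store. */
static const char *
get_option_sketch(const char *env_name, const char *prop_suffix,
                  bool is_android, property_lookup_fn get_property)
{
   static const char *const prefixes[] = {
      "debug.mesa.", "vendor.mesa.", "mesa.",
   };
   char key[128];

   if (!is_android)
      return getenv(env_name);

   for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
      snprintf(key, sizeof(key), "%s%s", prefixes[i], prop_suffix);
      const char *value = get_property(key);
      if (value != NULL)
         return value;
   }
   return NULL;
}
```

With the example store, a lookup for "example" finds no debug.mesa.*
entry and returns the vendor.mesa.* value before the mesa.* fallback.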
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37587>
This doesn't replace existing support for INTEL_DEBUG=shaders -- so both
`shaders` and `mda` can be used.
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29146>
Instead of dumping multiple files with the optimizer passes, write a single
archive file with all the contents. The actual file is created
by the drivers, so later commits will actually enable the feature in
anv and iris.
This removes the use of INTEL_DEBUG=optimizer (and the corresponding
enum value) in brw. That environment variable is still used by ELK --
which currently doesn't support mda.
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29146>
Only if required. I somehow misunderstood that those would need to be
independent too, not just the vertex slots.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 8dee4813b0 ("brw: add ability to compute VUE map for separate tcs/tes")
Acked-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37251>
We had two sets of surface/sampler sources for no good reason. Just
add a boolean to tell whether they are bindless or not.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37527>
One advantage here of moving a bunch of stuff to NIR is that we can
now have consistent payload types straight from the NIR conversion to
BRW.
This massively simplifies the BRW lowering code and avoids type errors
that are quite common to make in the backend.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37527>
In very large shaders, first_use_ip, last_use_ip, and even (register) nr
can overflow 16 bits. Increase the size of these fields. Some structure
components are rearranged to promote better packing.
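The overflow itself is easy to reproduce in isolation. A minimal
sketch, with helper names invented for the example:

```c
#include <stdint.h>

/* Storing an instruction index in 16 bits silently wraps once the
 * shader has more than 65535 instructions. */
static uint16_t
store_ip16(uint32_t ip)
{
   return (uint16_t)ip;
}

/* A 32-bit field keeps the index intact. */
static uint32_t
store_ip32(uint32_t ip)
{
   return ip;
}
```

An instruction at index 70000 reads back as 4464 from the 16-bit
field, so any live range computed from it is wrong.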
Fixes: 2dad1e3abd ("i965/fs: Add pass to combine immediates.")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37482>
In very large shaders, first_use_ip, last_use_ip, and even (register) nr
can overflow 16 bits. Increase the size of these fields.
used_in_single_block is moved earlier in the structure to promote better
packing.
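Why the field order matters for packing can be sketched with two
hypothetical layouts. These are not the real structures; the field
other_flag is invented for the example, and the sizes quoted assume a
common ABI with 4-byte alignment for uint32_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags interleaved with 32-bit fields: each bool is followed by
 * three bytes of padding (typically 20 bytes in total). */
struct imm_loose {
   uint32_t first_use_ip;
   bool used_in_single_block;
   uint32_t last_use_ip;
   bool other_flag;
   uint32_t nr;
};

/* Flags grouped at the front: they share one padded slot
 * (typically 16 bytes in total). */
struct imm_packed {
   bool used_in_single_block;
   bool other_flag;
   uint32_t first_use_ip;
   uint32_t last_use_ip;
   uint32_t nr;
};
```

Grouping the small members keeps the structure from growing as much
when the 16-bit fields are widened.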
Fixes: 2dad1e3abd ("i965/fs: Add pass to combine immediates.")
Closes: #9489
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Tested-by: Tapani Pälli <tapani.palli@intel.com>
Tested-by: @joostruis
Tested-by: @Snoucher
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37482>
Use the FD20 macro, which accounts for the implicit LSB zero value and
is already used for sources. The new macro needs the entire bit range
of the field (55-51), so remove the adjustments we used to do prior to
encoding and decoding.
Fixes an assertion in vkpeak (https://github.com/nihui/vkpeak) when
running bf16 tests on BMG. The code now also correctly applies the
subreg_nr to the destination. For example, when a mad(32) was split
into two pieces, the generated code did not fill out the upper part of
the register:
```
mad(16) g13<1>BF g10<8,8,1>BF g12<8,8,1>BF g56<1,1,1>F { align1 1H A@5 };
-mad(16) g13<1>BF g10.16<8,8,1>BF g12.16<8,8,1>BF g57<1,1,1>F { align1 2H A@5 };
+mad(16) g13.16<1>BF g10.16<8,8,1>BF g12.16<8,8,1>BF g57<1,1,1>F { align1 2H A@5 };
```
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37236>
This is all dead code since we weren't even setting the cap in iris/crocus!
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37447>
Ideally we'd have no stage switching, but this is just a cleanup for now.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37447>
I see no point, we allocate for every shader stage anyway. This is a bit
simpler.
I'm not a fan of the brw_compiler singleton at all but torching that is not on
today's agenda. Flattening it a little bit very much is.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37447>
The following Vulkan CTS tests that emit massive shaders were
regressing after "intel/brw/xe3+: Select scheduler heuristic with best
trade-off between register pressure and latency.":
dEQP-VK.graphicsfuzz.cov-nested-loops-set-struct-data-verify-in-function
dEQP-VK.graphicsfuzz.cov-dfdx-dfdy-after-nested-loops
The reason is that they have so many nested loops that the performance
analysis utilization estimates overflow the 32-bit floating-point
variables used to calculate them. That causes our throughput estimate
to underflow to zero for those shaders, which breaks the logic
introduced in brw_allocate_registers() to select the scheduling variant
with the highest throughput: none of the scheduling modes tried has
better throughput than the initial "best_perf" value of zero. Instead,
use -INFINITY as the initial value for "best_perf" so we always select
a scheduling mode.
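The failure mode can be sketched with a stripped-down selection loop.
The function below is a hypothetical stand-in for the logic in
brw_allocate_registers(), not a copy of it:

```c
#include <math.h>
#include <stddef.h>

/* Returns the index of the candidate with the highest throughput, or
 * -1 if no candidate is strictly better than `initial_best`. */
static int
pick_best(const float *throughput, size_t count, float initial_best)
{
   int best_index = -1;
   float best_perf = initial_best;

   for (size_t i = 0; i < count; i++) {
      if (throughput[i] > best_perf) {
         best_perf = throughput[i];
         best_index = (int)i;
      }
   }
   return best_index;
}
```

When every throughput estimate has collapsed to zero, starting from
0.0f selects nothing, while starting from -INFINITY selects the first
candidate.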
This should have been caught by CI but oddly the tests above are
showing up as "not run" on my last baseline runs, so this wasn't
flagged as a regression for me.
v2: Use -INFINITY instead of previous approach that used NaN (Ian).
Fixes: 531a34c7dd ("intel/brw/xe3+: Select scheduler heuristic with best trade-off between register pressure and latency.")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13884
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13885
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37322>
Renamed brw_nir_trig_workarounds.py to brw_nir_workarounds.py to reflect
its expanded scope beyond just trigonometric workarounds.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36990>
Added a workaround for a known shader in Horizon Forbidden West that causes
visual corruption on Intel anv driver. The fix clamps fsqrt inputs using
fmax(x, 1e-12) to avoid invalid values. Integrated the workaround via
brw_nir_apply_sqrt_workarounds() and applied it conditionally in the Vulkan
pipeline based on the shader's BLAKE3 hash.
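The input clamp can be sketched on the CPU side as follows. This is a
plain C illustration of the fmax(x, 1e-12) step with an invented helper
name; the actual workaround is expressed as a NIR transformation.

```c
/* Clamp a value before it is passed to a square root so that zero and
 * negative inputs are replaced by a small positive constant. Written
 * as a comparison so that a NaN input also yields the constant. */
static float
clamp_sqrt_input(float x)
{
   const float min_input = 1e-12f;
   return (x > min_input) ? x : min_input;
}
```

Inputs above the threshold pass through unchanged, so well-behaved
shaders see the same square root as before.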
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12555
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36990>
Move from a single ralloc allocation per instruction to contiguous
blocks of allocations. Still use ralloc for those large blocks.
Each ralloc allocation has at least 5 pointers of overhead, which would
be about a third of the current brw_inst, and get worse as we try to
pack brw_inst better.
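The idea can be sketched with a small block allocator. The types and
names are invented for the example, and calloc() stands in for the
ralloc call used for the large blocks:

```c
#include <stddef.h>
#include <stdlib.h>

#define INSTS_PER_BLOCK 64

/* Stand-in for an instruction; the real brw_inst is larger. */
struct example_inst {
   unsigned opcode;
   unsigned dst;
   unsigned src[3];
};

/* One heap allocation holds many instructions, so the allocator's
 * per-allocation overhead is paid once per block. */
struct inst_block {
   struct inst_block *next;
   unsigned used;
   struct example_inst insts[INSTS_PER_BLOCK];
};

struct inst_arena {
   struct inst_block *head;
   unsigned num_blocks;
};

static struct example_inst *
arena_alloc_inst(struct inst_arena *arena)
{
   struct inst_block *block = arena->head;

   /* Start a new block when there is none or the current one is full. */
   if (block == NULL || block->used == INSTS_PER_BLOCK) {
      block = calloc(1, sizeof(*block));
      if (block == NULL)
         return NULL;
      block->next = arena->head;
      arena->head = block;
      arena->num_blocks++;
   }
   return &block->insts[block->used++];
}

static void
arena_free(struct inst_arena *arena)
{
   while (arena->head != NULL) {
      struct inst_block *next = arena->head->next;
      free(arena->head);
      arena->head = next;
   }
   arena->num_blocks = 0;
}
```

Consecutive instructions from the same block are adjacent in memory,
and 65 allocations cost two heap allocations instead of 65.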
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
In Release build, goes from 72 to 64 bytes, and now fits
in a single cacheline.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>