This keeps the directory structure a bit more organized:
- brw specific code
- elk specific code
- common NIR passes that could be used in both places
It also means that you can now 'git grep' in the brw directory without
finding a bunch of elk code, or having to "grep thing b*".
Reviewed-by: Dylan Baker <dylan.c.baker@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37755>
I didn't bother switching either iris or elk/hasvk but one could.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37517>
This allows us to skip the entire backend compilation process for
large SIMD widths when register pressure is high enough that we'd
likely decide to prefer a smaller one in the end anyway. The hope
is to make the same decisions as before, but with less CPU overhead.
We are making mostly the same decisions as before:
| API / Platform | Total Shaders | Changed | % Identical
--------------------------------------------------
| VK / Arc A770 | 905,525 | 1,157 | 99.872% |
| VK / Arc B580 | 788,127 | 53 | 99.993% |
| VK / Panther | 786,333 | 13 | 99.998% |
| GL / Arc A770 | 308,618 | 269 | 99.913% |
| GL / Arc B580 | 264,066 | 13 | 99.995% |
| GL / Panther | 273,212 | 0 | 100.000% |
Improves compile times on my i7-12700K:
| Game | Arc B580 | Arc A770 |
---------------------------------------------------
| Assassins Creed: Odyssey | -13.47% | -10.98% |
| Borderlands 3 (DX12) | -10.05% | -11.31% |
| Dark Souls 3 | -21.06% | -21.08% |
| Oblivion Remastered | -11.10% | -9.82% |
| Phasmophobia | -32.73% | -31.00% |
| Red Dead Redemption 2 | -20.10% | -14.38% |
| Total War: Warhammer III | -10.11% | -14.44% |
| Wolfenstein Youngblood | -15.91% | -13.47% |
| Shadow of the Tomb Raider | -30.23% | -25.86% |
It seems to have nearly no effect on compile times on Xe3 unfortunately,
as only 1,014 shaders in fossil-db even fail SIMD32 compilation in the
first place, and we want to let most of the "might succeed" cases
through to the backend for throughput analysis.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
We were doing a lot of NIR work repeatedly for each SIMD variant of
compute and mesh shaders. Instead, do it once before cloning, and
just do one final optimization loop and out-of-SSA for each.
fossil-db results on Arc B580:
Totals:
Instrs: 233771096 -> 233794024 (+0.01%); split: -0.01%, +0.02%
Subgroup size: 15922768 -> 15922736 (-0.00%); split: +0.00%, -0.00%
Send messages: 12095619 -> 12098234 (+0.02%); split: -0.00%, +0.02%
Loop count: 137562 -> 137523 (-0.03%)
Cycle count: 32600323744 -> 32667411252 (+0.21%); split: -0.06%, +0.27%
Spill count: 540908 -> 542027 (+0.21%); split: -0.07%, +0.28%
Fill count: 700938 -> 698983 (-0.28%); split: -0.73%, +0.45%
Scratch Memory Size: 37266432 -> 37304320 (+0.10%); split: -0.10%, +0.20%
Max live registers: 72691728 -> 72692987 (+0.00%); split: -0.00%, +0.00%
Non SSA regs after NIR: 67690309 -> 67688352 (-0.00%); split: -0.01%, +0.00%
Totals from 3576 (0.45% of 789301) affected shaders:
Instrs: 6932956 -> 6955884 (+0.33%); split: -0.41%, +0.74%
Subgroup size: 88816 -> 88784 (-0.04%); split: +0.09%, -0.13%
Send messages: 329168 -> 331783 (+0.79%); split: -0.02%, +0.81%
Loop count: 8753 -> 8714 (-0.45%)
Cycle count: 15153678820 -> 15220766328 (+0.44%); split: -0.14%, +0.58%
Spill count: 213751 -> 214870 (+0.52%); split: -0.18%, +0.71%
Fill count: 282616 -> 280661 (-0.69%); split: -1.82%, +1.13%
Scratch Memory Size: 13056000 -> 13093888 (+0.29%); split: -0.27%, +0.56%
Max live registers: 834757 -> 836016 (+0.15%); split: -0.11%, +0.26%
Non SSA regs after NIR: 995033 -> 993076 (-0.20%); split: -0.48%, +0.28%
Looking at a few of the shaders with substantial instruction count
increases, it appears that it is largely due to more loops being
unrolled, which is probably actually a good thing.
The compile time impact of this patch appears to be negligable.
However, doing postprocessing before SIMD cloning allows us to
examine the postprocessed SSA-form NIR for improvements in an
upcoming patch.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
Instead of dumping multiple files with the optimizer passes, write a single
archive file with all the contents. The actual file is created
by the drivers, so later commits will actually enable the feature in
anv and iris.
This removes the use of INTEL_DEBUG=optimizer (and the corresponding
enum value) in brw. That environment variable is still used by ELK --
which currently doesn't support mda.
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29146>
The brw_shader::uniforms now is derived from the nir_shader. The
only exception is compute shaders for older Gfx versions, so we
move the adjust logic for that.
The benefit here is untangling the code for compilation variants,
that before needed to keep track of the first that compiled to,
in most cases, copy an integer.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33541>
And unify the initialization code for brw_shader. Avoid passing
brw_compile_params since for a single compilation we might have
multiple shaders (the case for BS stage).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33541>
All remaining uses of that constructor would also use at_end(),
and vice-versa. So just implement that behavior in the constructor
itself.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33815>
This will be useful for pulling constants in device bound shaders. A64
allows us to put the constants anywhere.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32895>
This is similar to what we used to do on pre-SNB platforms, the number
of GRF registers used by the shader will be used on Xe3+ to adjust the
trade-off between thread-level parallelism and size of the GRF file.
Plumb the value through prog_data so the driver can set up the
hardware state accordingly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
This is similar in principle to the previous commit "intel/brw/xe3+:
brw_compile_fs() implementation for Xe3+." but applied to compute-like
shader stages. It changes the implementation of brw_compile_cs/task/mesh()
to reduce compile time and take advantage of wider dispatch modes more
aggressively than the original logic, since as of Xe3 SIMD32 builds
succeed without spills in most cases thanks to VRT.
The new "optimistic" SIMD selection logic starts with the SIMD width
that is potentially highest performance and only compiles additional
narrower variants if that fails (typically due to spilling), while the
old "pessimistic" logic did the opposite: It started with the
narrowest SIMD width and compiled additional variants with increasing
register pressure until one of them failed to compile.
In typical non-spilling cases where we formerly compiled SIMD16 and
SIMD32 variants of the same compute shader, this change will halve the
number of backend compilations required to build it.
XXX - Possibly don't do this in cases with variable workgroup size
until effect on runtime performance can be measured directly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: Don't do this for now in cases with variable workgroup size, still
compile every possible variant in such cases.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32664>
According to HSD 14016252163 if compute shader uses the sample
operation, morton walk order and set the thread group batch size to 4 is
expected to increase sampler cache hit rates by increasing sample
address locality within a subslice.
Rework:
* Caio: "||" => "&&" for type checking in instr_uses_sampler()
* Jordan: Use nir's foreach macros rather than
nir_shader_lower_instructions()
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32430>
This intrinsic was initially dedicated to mesh/task shaders, but the
mechanism it exposes also exists in the compute shaders on Gfx12.5+.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31508>