In 957bbc6ad9 we merged all the per stages allocations of push
constants into a single one. Unfortunately one field remained per
stage.
This fixes the issue by including all the per stage values of the
masked registers for robust buffer access into the push constant data.
v2: Drop unneeded loop (Jason)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 957bbc6ad9 ("anv: simplify push constant emissions")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6505>
This commit fixes performance regressions introduced by e03f965280
in which we started bounds checking our push constants. This added a
LOT of shader code to shaders which use the robustBufferAccess feature
and led to substantial spilling. The checking we just added to the FS
back-end is far more efficient for two reasons:
1. It can be done at a whole register granularity rather than per-
scalar and so we emit one SIMD8 SEL per 32B GRF rather than one
SIMD16 SEL (executed as two SELs) for each component loaded.
2. Because we do it with NoMask instructions, we can do it on whole
pushed GRFs without splatting them out to SIMD8 or SIME16 values.
This means that robust buffer access no longer explodes our register
pressure for no good reason.
As a tiny side-benefit, we're now using can use AND instead of SEL which
means no need for the flag and better scheduling.
Vulkan pipeline database results on ICL:
Instructions in all programs: 293586059 -> 238009118 (-18.9%)
SENDs in all programs: 13568515 -> 13568515 (+0.0%)
Loops in all programs: 149720 -> 149720 (+0.0%)
Cycles in all programs: 88499234498 -> 84348917496 (-4.7%)
Spills in all programs: 1229018 -> 184339 (-85.0%)
Fills in all programs: 1348397 -> 246061 (-81.8%)
This also improves the performance of a few apps:
- Shadow of the Tomb Raider: +4%
- Witcher 3: +3.5%
- UE4 Shooter demo: +2%
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>
This fixes two bugs: First, if the same block index showed up twice, we
only pick the first one. Second, we weren't multiplying by 32. This
didn't show up in tests because RBA testing is garbage. Found while
looking at shaders from the UE4 Shooter demo.
Fixes: e03f9652 "anv: Bounds-check pushed UBOs when..."
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4578>
There was a disconnect between anv_nir_compute_push_layout and the code
which sets up the push_ubo_sizes array. The NIR code we emit checks
relative to the start of the bound UBO range so that, if we end up with
a vector which straddles the start of the push range, we can perform the
bounds check without risking overflow issues. The code which sets up
the push_ubo_sizes, on the other hand, assumed it was relative to the
start of the push range. Somehow, this didn't get get caught by any of
the available tests.
Fixes: e03f965280 "anv: Bounds-check pushed UBOs when ..."
Closes: #2623
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4195>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4195>
As a result of 9baa33cef0 our backend compiler leaves params pretty
much untouched. So in order to avoid storing uninitialized values in
the shader cache blobs, just 0 out this array.
I've considered not even allocating this array which works on gen8+
but the vec4 backend still makes a copy of this array and so it
crashes on memcpy on HSW.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 9baa33cef0 ("anv: Rework push constant handling")
Reported-by: Tapani Pälli <tapani.palli@intel.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Tapani Pälli <tapani.palli@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3516>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3516>
Store push_ranges in ascending order, and only "shift" them to the end
of the array during state packet emission.
We don't need this workaround with the new 3DSTATE_CONSTANT_ALL packet.
So instead of applying the workaround here just for GEN < 12 (which
requires and extra loop through all the ranges to figure out if we
should shift them or not), we simply move the whole logic to the state
emission code. At that point, in a later commit, we are already looping
through all of the ranges anyway to check which packet we will be using,
so we might as well implement the workaround there, where it is going to
be used.
v3: Move gen8+ workaround to the state emission code (Caio).
v4: Add explanation of why we moved the workaroudn (Caio).
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Instead of blindly dirtying descriptors and push constants the moment we
see a pipeline change, check to see if it actually changes the bind
layout or push constant layout. This doubles the runtime performance of
one CPU-limited example running with the Dawn WebGPU implementation when
running on my laptop.
NOTE: This effectively reverts beca63c6c0. While it was a nice
optimization, it was based on prog_data and we can't do that anymore
once we start allowing the same binding table to be used with multiple
different pipelines.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This substantially reworks both the state setup side of push constant
handling and the pipeline compile side. The fundamental change here is
that we're no longer respecting the prog_data::param array and instead
are just instructing the back-end compiler to leave the array alone.
This makes the state setup side substantially simpler because we can now
just memcpy the whole block of push constants and don't have to
upload one DWORD at a time.
This also means that we can compute the full push constant layout
up-front and just trust the back-end compiler to not mess with it.
Maybe one day we'll decide that the back-end compiler can do useful
things there again but for now, this is functionally no different from
what we had before this commit and makes the NIR handling cleaner.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
It turns off that emitting push constants is one of the hottest paths in
the driver and ANY work we do there costs us. By pre-computing things a
bit ahead of time, we shave 5% off the runtime of a CPU-limited example
running with the Dawn WebGPU implementation.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>