This fixes corruption of push constants on Xe2 due to a mismatch in
the uniform layout implemented by the compiler and assumed by the
driver. To fix it we need to align the push constant ranges computed
by the Vulkan driver to a multiple of the GRF size of the platform.
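As a rough illustration of the alignment described above, here is a minimal sketch in C. The struct push_range type, the helper names and the byte-based units are hypothetical and are not taken from the driver; the GRF size is passed in as a parameter and is assumed to be a power of two (32 bytes on older platforms, 64 bytes on Xe2).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a push constant range, in bytes. */
struct push_range {
   uint32_t start;
   uint32_t length;
};

/* Round v up to a multiple of a, where a is a power of two. */
static uint32_t
align_up_u32(uint32_t v, uint32_t a)
{
   assert(a != 0 && (a & (a - 1)) == 0);
   return (v + a - 1) & ~(a - 1);
}

/* Round v down to a multiple of a, where a is a power of two. */
static uint32_t
align_down_u32(uint32_t v, uint32_t a)
{
   assert(a != 0 && (a & (a - 1)) == 0);
   return v & ~(a - 1);
}

/* Widen a range so that both of its ends fall on GRF boundaries.
 * grf_size is the register size in bytes (an assumption of this sketch).
 */
static struct push_range
align_push_range(struct push_range r, uint32_t grf_size)
{
   const uint32_t start = align_down_u32(r.start, grf_size);
   const uint32_t end = align_up_u32(r.start + r.length, grf_size);
   return (struct push_range) { .start = start, .length = end - start };
}
```

With a 64-byte GRF, a range covering bytes 40 to 59 is widened to bytes 0 to 63, so compiler and driver agree on whole registers.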
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29926>
Lowering/layout is pretty much the same as for direct descriptors. The
caveat is that, since the descriptor buffers are not visible from the
binding tables, we can't promote anything to the binding table (except
push descriptors).
The reason for this is that nothing prevents an application from
using both types of descriptors, and because descriptor buffers have
visible addresses + capture replay, we can't merge the 2 types in the
same virtual address space location (limited to 4Gb max, limited to
2Gb with binding tables).
If we had the guarantee that both are not going to be used at the same
time, we could consider a 2Gb VA for descriptor buffers.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22151>
align is a function, and when we want to use it, a variable named align
shadows it. So rename those variables.
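A minimal sketch of the shadowing problem, using a stand-in align() helper and a hypothetical padded_size() caller (neither is copied from the tree):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the align() helper: round value up to a power-of-two
 * alignment.
 */
static uint32_t
align(uint32_t value, uint32_t alignment)
{
   assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
   return (value + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical caller. Had the local below been named "align", it would
 * shadow the function and the call would no longer compile, because the
 * name would refer to an integer rather than a function.
 */
static uint32_t
padded_size(uint32_t size)
{
   const uint32_t alignment = 64;
   return align(size, alignment);
}
```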
Signed-off-by: Yonggang Luo <luoyonggang@gmail.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25997>
We made the unfortunate discovery on a recent platform that the
bindless sampler heap does not function as expected.
Nowhere in the documentation is the size of the heap written down, so
most people assumed it was the maximum size that we can program (4Gb).
The reality is that it's only 64Mb.
Although it appears to work properly over the whole 4Gb range for most
apps, this is only because the HW bounds checking is broken. Instead of
clamping anything beyond 64Mb, it only clamps the last 4Kb of each 64Mb
region.
So this heap is useless for us to make a 4Gb region of both sampler &
surface states...
This change essentially turns off the bindless sampler heap on DG2+.
The only location where we can put SAMPLER_STATE elements is the
dynamic state heap. Unfortunately we cannot align the dynamic state
heap with the bindless surface state heap. So the solution is to
allocate sampler & surface states separately, each from its own heap
in the descriptor pool.
We now have to provide 2 sets of offsets for surfaces & samplers.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25897>
Instead, we replace every use of it with nir_def. Most of this commit
was generated by sed:
sed -i -e 's/dest.ssa/def/g' src/**/*.h src/**/*.c src/**/*.cpp
A few manual fixups were required in lima and the nir_legacy code.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24674>
And split them into UBO and SSBO
v2 (Lionel):
- Get rid of robustness fields in anv_shader_bin
v3 (Lionel):
- Do not pass unused parameters around
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17545>
Now that descriptor sets are located in a 1Gb area, we can avoid
storing the whole address of the descriptor and instead add the base
address of the area to a 32bit offset.
Replayed a bunch of fossils with this; the changes are not really
significant one way or another:
Totals:
Instrs: 9278246 -> 9277148 (-0.01%); split: -0.01%, +0.00%
Cycles: 3547598421 -> 3547579435 (-0.00%); split: -0.00%, +0.00%
Totals from 353 (1.14% of 31021) affected shaders:
Instrs: 581546 -> 580448 (-0.19%); split: -0.23%, +0.04%
Cycles: 25885422 -> 25866436 (-0.07%); split: -0.31%, +0.24%
No difference on send messages or spills/fills.
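The address computation described at the top of this message can be sketched as follows. The function names are hypothetical; the only facts taken from the text are that the descriptor sets live in a 1Gb area and that a 32bit offset is added to the base address of that area.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the descriptor set area, as stated in the message above. */
#define DESC_AREA_SIZE (1ull << 30)

/* Hypothetical: turn a full descriptor address into a 32-bit offset
 * relative to the base of the descriptor area.
 */
static uint32_t
desc_offset(uint64_t area_base, uint64_t address)
{
   assert(address >= area_base && address - area_base < DESC_AREA_SIZE);
   return (uint32_t)(address - area_base);
}

/* Hypothetical: rebuild the full 64-bit address from the base of the
 * area and the stored 32-bit offset.
 */
static uint64_t
desc_address(uint64_t area_base, uint32_t offset)
{
   assert(offset < DESC_AREA_SIZE);
   return area_base + offset;
}
```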
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21645>
We make the compiler assume the worst possible case (it's not great
because we have to burn 32 GRFs of potential input data) and then we
push the actual value through push constants.
This enables VK_EXT_gpl usage on zink, which causes two traces to change
their results. Raven is an imperceptible change, blender has missing
original pngs but looks plausible.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22378>
With independent sets, we're not able to compute immediate values for
the index at which to read anv_push_constants::dynamic_offsets to get
the offset of a dynamic buffer. This is because the pipeline layout
may not have all the descriptor set layouts when we compile the
shader.
To solve that issue, we insert a layer of indirection.
This reworks the dynamic buffer offset storage with a 2D array in
anv_cmd_pipeline_state :
dynamic_offsets[MAX_SETS][MAX_DYN_BUFFERS]
When the pipeline or the dynamic buffer offsets are updated, we
flatten that array into the
anv_push_constants::dynamic_offsets[MAX_DYN_BUFFERS] array.
For shaders compiled with independent sets, the bottom 6 bits of
element X in anv_push_constants::desc_sets[] are used to specify the
base offset into anv_push_constants::dynamic_offsets[] for set X.
The computation in the shader is now something like :
base_dyn_buffer_set_idx = anv_push_constants::desc_sets[set_idx] & 0x3f
dyn_buffer_offset = anv_push_constants::dynamic_offsets[base_dyn_buffer_set_idx + dynamic_buffer_idx]
Faith suggested instead using a different push constant buffer, with
dynamic_offsets prepared for each stage, when using independent sets,
but it feels easier to understand this way. There is also some room
for optimization: if you are accessing set X and you know all the
sets in the range [0, X], you can still avoid the indirection. Separate
push constant allocations per stage also have a CPU cost.
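A CPU-side sketch of the flattening and of the lookup the shader performs. The struct names, the values chosen for MAX_SETS and MAX_DYN_BUFFERS, and the dyn_count field are assumptions made for illustration; only the 6-bit base index in desc_sets[] and the two-step lookup come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SETS 8          /* assumed value for this sketch */
#define MAX_DYN_BUFFERS 16  /* assumed value for this sketch */

/* Hypothetical stand-in for the per-pipeline state with the 2D array. */
struct pipeline_state {
   uint32_t dynamic_offsets[MAX_SETS][MAX_DYN_BUFFERS];
   uint32_t dyn_count[MAX_SETS]; /* dynamic buffers used by each set */
};

/* Hypothetical stand-in for the push constant block. */
struct push_constants {
   uint32_t desc_sets[MAX_SETS]; /* bottom 6 bits: base index */
   uint32_t dynamic_offsets[MAX_DYN_BUFFERS];
};

/* Flatten the 2D array and record each set's base index. */
static void
flatten_dynamic_offsets(struct push_constants *push,
                        const struct pipeline_state *state)
{
   uint32_t base = 0;
   for (uint32_t s = 0; s < MAX_SETS; s++) {
      assert(base + state->dyn_count[s] <= MAX_DYN_BUFFERS);
      push->desc_sets[s] = (push->desc_sets[s] & ~0x3fu) | base;
      memcpy(&push->dynamic_offsets[base], state->dynamic_offsets[s],
             state->dyn_count[s] * sizeof(uint32_t));
      base += state->dyn_count[s];
   }
}

/* The lookup done in the shader, written out on the CPU. */
static uint32_t
load_dynamic_offset(const struct push_constants *push,
                    uint32_t set_idx, uint32_t dynamic_buffer_idx)
{
   const uint32_t base = push->desc_sets[set_idx] & 0x3f;
   return push->dynamic_offsets[base + dynamic_buffer_idx];
}
```

The upper bits of desc_sets[] are preserved by the flattening step, since they are assumed to carry other data.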
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15637>
Bindless shaders don't have binding tables so they have to get at the
descriptor sets via a different mechanism.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8637>
The vec4 back-end can't push UBOs just yet but it soon will be able to.
When it starts pushing UBOs, it will have a lower limit than scalar due
to a crummy register allocator. Mirror that limit in ANV so we don't
run into asserts due to ANV and the back-end making different choices.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10571>
In 957bbc6ad9 we merged all the per-stage allocations of push
constants into a single one. Unfortunately one field remained per
stage.
This fixes the issue by including all the per stage values of the
masked registers for robust buffer access into the push constant data.
v2: Drop unneeded loop (Jason)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 957bbc6ad9 ("anv: simplify push constant emissions")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6505>
This commit fixes performance regressions introduced by e03f965280
in which we started bounds checking our push constants. This added a
LOT of shader code to shaders which use the robustBufferAccess feature
and led to substantial spilling. The checking we just added to the FS
back-end is far more efficient for two reasons:
1. It can be done at whole-register granularity rather than
per-scalar, so we emit one SIMD8 SEL per 32B GRF rather than one
SIMD16 SEL (executed as two SELs) for each component loaded.
2. Because we do it with NoMask instructions, we can do it on whole
pushed GRFs without splatting them out to SIMD8 or SIMD16 values.
This means that robust buffer access no longer explodes our register
pressure for no good reason.
As a tiny side-benefit, we can now use AND instead of SEL, which
means no need for the flag and better scheduling.
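To make the whole-register masking concrete, here is a CPU-side model of the idea in C. It is not the code the back-end emits; the function name and the bitmask encoding of which registers are in bounds are hypothetical. It only shows that ANDing each pushed 32B GRF with all ones or all zeros gives the same result as a per-component select.

```c
#include <assert.h>
#include <stdint.h>

#define GRF_DWORDS 8 /* one 32B GRF holds eight 32-bit values */

/* Zero out every pushed register that is out of bounds. Bit r of
 * in_bounds_mask is set when register r is entirely in bounds
 * (a hypothetical encoding chosen for this sketch).
 */
static void
mask_pushed_grfs(uint32_t *push, uint32_t num_grfs, uint64_t in_bounds_mask)
{
   assert(num_grfs <= 64);
   for (uint32_t r = 0; r < num_grfs; r++) {
      /* All ones keeps the register, zero clears it. A plain AND needs
       * no condition flag, unlike a select.
       */
      const uint32_t keep = ((in_bounds_mask >> r) & 1) ? ~0u : 0u;
      for (uint32_t c = 0; c < GRF_DWORDS; c++)
         push[r * GRF_DWORDS + c] &= keep;
   }
}
```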
Vulkan pipeline database results on ICL:
Instructions in all programs: 293586059 -> 238009118 (-18.9%)
SENDs in all programs: 13568515 -> 13568515 (+0.0%)
Loops in all programs: 149720 -> 149720 (+0.0%)
Cycles in all programs: 88499234498 -> 84348917496 (-4.7%)
Spills in all programs: 1229018 -> 184339 (-85.0%)
Fills in all programs: 1348397 -> 246061 (-81.8%)
This also improves the performance of a few apps:
- Shadow of the Tomb Raider: +4%
- Witcher 3: +3.5%
- UE4 Shooter demo: +2%
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>
This fixes two bugs: First, if the same block index showed up twice, we
only picked the first one. Second, we weren't multiplying by 32. This
didn't show up in tests because RBA testing is garbage. Found while
looking at shaders from the UE4 Shooter demo.
Fixes: e03f9652 "anv: Bounds-check pushed UBOs when..."
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4578>
There was a disconnect between anv_nir_compute_push_layout and the code
which sets up the push_ubo_sizes array. The NIR code we emit checks
relative to the start of the bound UBO range so that, if we end up with
a vector which straddles the start of the push range, we can perform the
bounds check without risking overflow issues. The code which sets up
the push_ubo_sizes, on the other hand, assumed it was relative to the
start of the push range. Somehow, this didn't get caught by any of
the available tests.
Fixes: e03f965280 "anv: Bounds-check pushed UBOs when ..."
Closes: #2623
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4195>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4195>
As a result of 9baa33cef0 our backend compiler leaves params pretty
much untouched. So in order to avoid storing uninitialized values in
the shader cache blobs, just 0 out this array.
I've considered not even allocating this array which works on gen8+
but the vec4 backend still makes a copy of this array and so it
crashes on memcpy on HSW.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 9baa33cef0 ("anv: Rework push constant handling")
Reported-by: Tapani Pälli <tapani.palli@intel.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Tapani Pälli <tapani.palli@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3516>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3516>
Store push_ranges in ascending order, and only "shift" them to the end
of the array during state packet emission.
We don't need this workaround with the new 3DSTATE_CONSTANT_ALL packet.
So instead of applying the workaround here just for GEN < 12 (which
requires an extra loop through all the ranges to figure out if we
should shift them or not), we simply move the whole logic to the state
emission code. At that point, in a later commit, we are already looping
through all of the ranges anyway to check which packet we will be using,
so we might as well implement the workaround there, where it is going to
be used.
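A minimal sketch of what shifting at emission time could look like. The four-slot layout, the struct push_range type and the function name are assumptions for illustration. The ranges are stored in ascending order with the used entries first, and only the copy into the hardware slots moves them to the end.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 4 /* assumed number of constant buffer slots */

/* Hypothetical push range; a length of 0 means the entry is unused. */
struct push_range {
   uint32_t start;
   uint32_t length;
};

/* Copy ranges into slots so that the used ranges land in the highest
 * slots, keeping their relative order.
 */
static void
place_push_ranges(const struct push_range ranges[NUM_SLOTS],
                  struct push_range slots[NUM_SLOTS])
{
   uint32_t used = 0;
   while (used < NUM_SLOTS && ranges[used].length > 0)
      used++;

   /* Precondition: used ranges are packed at the front of the array. */
   for (uint32_t i = used; i < NUM_SLOTS; i++)
      assert(ranges[i].length == 0);

   const uint32_t shift = NUM_SLOTS - used;
   for (uint32_t i = 0; i < NUM_SLOTS; i++) {
      if (i < shift)
         slots[i] = (struct push_range) { 0, 0 };
      else
         slots[i] = ranges[i - shift];
   }
}
```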
v3: Move gen8+ workaround to the state emission code (Caio).
v4: Add explanation of why we moved the workaround (Caio).
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Instead of blindly dirtying descriptors and push constants the moment we
see a pipeline change, check to see if it actually changes the bind
layout or push constant layout. This doubles the runtime performance of
one CPU-limited example running with the Dawn WebGPU implementation when
running on my laptop.
NOTE: This effectively reverts beca63c6c0. While it was a nice
optimization, it was based on prog_data and we can't do that anymore
once we start allowing the same binding table to be used with multiple
different pipelines.
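The check described above can be sketched roughly as follows. The structs and the idea of comparing two 64-bit layout hashes are hypothetical and are not claimed to match how the driver identifies a layout. The point is only that a pipeline bind dirties state when the relevant layout actually differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pipeline: each layout is summarized by a hash. */
struct pipeline {
   uint64_t bind_layout_hash;
   uint64_t push_layout_hash;
};

/* Hypothetical command buffer state. */
struct cmd_state {
   const struct pipeline *pipeline;
   bool descriptors_dirty;
   bool push_constants_dirty;
};

/* Bind a pipeline and dirty only what its layouts actually change. */
static void
bind_pipeline(struct cmd_state *state, const struct pipeline *pipeline)
{
   assert(pipeline != NULL);
   const struct pipeline *old = state->pipeline;

   if (old == NULL || old->bind_layout_hash != pipeline->bind_layout_hash)
      state->descriptors_dirty = true;
   if (old == NULL || old->push_layout_hash != pipeline->push_layout_hash)
      state->push_constants_dirty = true;

   state->pipeline = pipeline;
}
```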
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This substantially reworks both the state setup side of push constant
handling and the pipeline compile side. The fundamental change here is
that we're no longer respecting the prog_data::param array and instead
are just instructing the back-end compiler to leave the array alone.
This makes the state setup side substantially simpler because we can now
just memcpy the whole block of push constants and don't have to
upload one DWORD at a time.
This also means that we can compute the full push constant layout
up-front and just trust the back-end compiler to not mess with it.
Maybe one day we'll decide that the back-end compiler can do useful
things there again but for now, this is functionally no different from
what we had before this commit and makes the NIR handling cleaner.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
It turns out that emitting push constants is one of the hottest paths in
the driver and ANY work we do there costs us. By pre-computing things a
bit ahead of time, we shave 5% off the runtime of a CPU-limited example
running with the Dawn WebGPU implementation.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>