The uber shader relies on a number of sysvals being lowered to push
costants in order to dynamically controlling the behaviour of some
lowering passes.
Add a pass to perform such lowering.
Introduce the logic to implement uber shaders.
The way variants is handled changes significantly: a uber program is
expected to be compiling asynchronously and is used whenever possible.
Specialized variant shaders are compiled asynchronously, though they
might be compiled synchronously if the uber program can't be used.
Each variant is a separate program as that simplifies gpl/obj caching.
A new key is introduced, the st_key, that keeps track of the state of
the features emulated by the uber shader.
This is split in a dynamic part, always sent through push constants, and
a more compact part that is used as a key for caching optimized
variants.
Lower sysvals for `flat_flags` and `pv_last_vert` as push constants.
This allows to change them dynamically without recompiling shaders which
is necessary for uber shaders.
Switched to the new VMA allocator that provides explicit GPU VA
control via util_vma_heap.
This is architectural preparation for ray tracing capture/replay,
which requires the ability to reserve and allocate shaders at specific
VAs. The state pool's free-list design makes VA reservation difficult
to add, while the new chunk allocator is designed for explicit VA
management from the ground up.
Signed-off-by: Michael Cheng <michael.cheng@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38869>
Introduce a VMA-first chunk allocator for shader binaries to eventually
replace the anv_state_pool-based implementation. This allocator works
directly with GPU virtual addresses through util_vma_heap, making the
virtual address space an explicit resource managed by ANV.
No functional change in this commit.
v2(Michael Cheng): Use existing instruction state pool anv_va_range
v3(Lionel): Simplify allocator
Signed-off-by: default avatarMichael Cheng <michael.cheng@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38869>
The shader assembly was only available when not hitting the cache.
Additionally the serialized shader code was also the relocated variant
which meant that it could differ from one run to the next. Instead
serialize the unrelocated code produced by the compiler.
With this change we now decode the copy of the ISA we have on the
host.
NIR dumps are only available for shaders not loaded from the cache
(much like the other drivers).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 8f4c2bd566 ("anv: add runtime shader statistic support")
Acked-by: Michael Cheng <michael.cheng@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38869>
This is potentially nicer for some drivers. AMD drivers will use it.
mesa_frag_result_get_color_index will be used often.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38604>
We used to maximize threads_per_task, but that is ideal when the system
has a single gpu client. When there are multiple gpu clients, we want
smaller threads_per_task such that cores can be more fairly shared among
the clients.
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Tested-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37988>
task_axis selects the dim of the global workgroup, not the dim of the
local workgroup.
v2: fix assert for dEQP-VK.compute.pipeline.basic.empty_workgroup*
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Tested-by: Yiwei Zhang <zzyiwei@chromium.org> (v1)
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com> (v1)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37988>
Set compute_ep_limit to max_tasks_per_core on v12+. It is generally a
good idea to queue as many tasks as possible to better utilize the
cores.
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Tested-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37988>
Since v12, RUN_COMPUTE.ep_limit specifies the size of the compute task
queue. RUN_COMPUTE stalls when there are more tasks in the queue than
the specified ep_limit.
Sensible values are 0 (treated as 4), 4, or 16 (max_tasks_per_core).
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Tested-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37988>
Coverity notices that we read past the end of the array we're pointing
to, which is intentional, we want to copy additional members from the
source struct into the target pointer. As such, cast to a `void *`,
since this will make Coverity happy.
CID: 1649589
Fixes: 314de7af06 ("anv: Initial support for VP9 decoding")
Reviewed-by: Hyunjun Ko <zzoon@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38438>
Instead of manually applying the patch, backport the version that landed
in main, which requires a cmake argument to enable.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38071>