In case of userq, fences are not installed in the kernel KMS handles; they
are handled internally in Mesa. So when unmapping a buffer, Mesa has to pass
the fences to the kernel so that the kernel can wait on them before unmapping
the buffer.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29010>
In case of userq, fences are gathered and passed to the kernel when a bo is
destroyed. Fences are gathered under bo_fence_lock. Currently,
do_winsys_deinit() destroys bo_fence_lock before it destroys bo_cache, so
destroying the cached bos uses an already destroyed lock and crashes.
Fix this by destroying bo_fence_lock later in do_winsys_deinit().
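The ordering problem can be illustrated with a minimal sketch. Everything
named sketch_* below is hypothetical and only mirrors the shape of the fix;
a pthread mutex stands in for whatever lock type the winsys really uses, and
the lock_alive flag exists only so the sketch can report the bad ordering
instead of invoking undefined behavior.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the winsys; not the real amdgpu_winsys. */
struct sketch_winsys {
   pthread_mutex_t bo_fence_lock;
   bool lock_alive;      /* tracks whether the mutex may still be used */
   int cached_bos;       /* number of bos still sitting in the cache */
   int fences_gathered;  /* how many times fences were gathered */
};

static void
sketch_winsys_init(struct sketch_winsys *ws, int cached_bos)
{
   pthread_mutex_init(&ws->bo_fence_lock, NULL);
   ws->lock_alive = true;
   ws->cached_bos = cached_bos;
   ws->fences_gathered = 0;
}

/* Destroying the cache destroys each bo, which gathers fences under the
 * lock. Returns false if the lock was already destroyed. */
static bool
sketch_bo_cache_deinit(struct sketch_winsys *ws)
{
   while (ws->cached_bos > 0) {
      if (!ws->lock_alive)
         return false;
      pthread_mutex_lock(&ws->bo_fence_lock);
      ws->fences_gathered++;
      pthread_mutex_unlock(&ws->bo_fence_lock);
      ws->cached_bos--;
   }
   return true;
}

/* Fixed ordering: tear down the cache first, the lock last. */
static bool
sketch_winsys_deinit(struct sketch_winsys *ws)
{
   bool ok = sketch_bo_cache_deinit(ws);
   pthread_mutex_destroy(&ws->bo_fence_lock);
   ws->lock_alive = false;
   return ok;
}
```

Swapping the two steps in sketch_winsys_deinit() reproduces the reported
failure mode: the cache teardown would find the lock already gone.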
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29010>
With the kernel queue method of job submission, the buffer list for the job
is passed to the amdgpu_cs ioctl, so the kernel can ensure that VM mapping is
completed before submitting the job.
With user queues, the amdgpu_cs ioctl is not called, so the kernel can't
determine automatically when a BO should be prepared for submission. To
achieve this, a timeline syncobj is attached to the gem_va ioctls, which can
then be used as a dependency for future jobs.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29010>
This will make it easier to add the timeline syncobj parameter
for user queues.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29010>
This patch:
- adds a new subquery (AMDGPU_INFO_UQ_FW_AREAS) in AMDGPU_INFO_IOCTL
  to get the size and alignment of shadow and csa objects from the
  kernel. This information is required for a userqueue consumer (like
  MESA/libdrm) to create the userqueue metadata objects properly.
- also adds supporting metadata structures and a high level wrapper
  function (amdgpu_query_uq_metadata_info) to the query, to make it
  easy to use.
The corresponding kernel changes for this UAPI extension can be found
on the amd-gfx mailing list:
https://patchwork.freedesktop.org/patch/621390/?series=139715&rev=2
This patch adds support only for the GFX IP, and the other engines may
be supported in subsequent development.
This patch was reviewed in libdrm library at
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/400
Cc: Marek Olsak <marek.olsak@amd.com>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Arvind Yadav <arvind.yadav@amd.com>
Reviewed-by: Marek Olsak <marek.olsak@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29010>
This patch adds new IOCTL functions to support userqueue
operations such as create, remove, signal and wait.
This patch was reviewed in libdrm library at
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/392
Cc: Deucher, Alexander <alexander.deucher@amd.com>
Cc: Koenig, Christian <christian.koenig@amd.com>
Cc: Sharma, Shashank <shashank.sharma@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29010>
RS and BLT operations can exhibit issues in some cases. To help in debugging
such issues, stall after RS and BLT operations when ETNA_MESA_DEBUG=draw_stall
is enabled. In that case the FE will point right at the faulty RS/BLT
operation, instead of the next stall, which may be many state loads later.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32444>
This moves the tlb job load/store logic to the new helper
v3d_update_job_tlb_load_store. Then an early return is included
so if the rasterizer discard is enabled, no load/stores are
emitted because of the draw call.
This helps in situations where transform feedback is used
and there is only interest in the geometry results. We identified
that some jobs were not rendering at all, but they still paid the
performance cost of doing several loads and stores.
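A minimal sketch of the early return described above. The structs, fields
and the sketch_ prefixed function are hypothetical and do not mirror the
real v3d job or context layout; only the idea of skipping load/store
bookkeeping when rasterizer discard is enabled comes from this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical job state: one bit per color/depth/stencil buffer. */
struct sketch_job {
   uint32_t load;    /* buffers to load from memory into the TLB */
   uint32_t store;   /* buffers to store from the TLB to memory */
   uint32_t cleared; /* buffers cleared in this job, so no load is needed */
};

/* Hypothetical per-draw state. */
struct sketch_draw_state {
   bool rasterizer_discard;
   uint32_t bound_buffers;
};

static void
sketch_update_job_tlb_load_store(struct sketch_job *job,
                                 const struct sketch_draw_state *state)
{
   /* Nothing reaches the TLB when the rasterizer discards everything,
    * so this draw must not add any loads or stores to the job. */
   if (state->rasterizer_discard)
      return;

   job->load |= state->bound_buffers & ~job->cleared;
   job->store |= state->bound_buffers;
}
```

A job made only of discarded draws keeps load and store at zero, so it
carries no TLB traffic at all.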
This generates a huge performance improvement on manhattan benchmarks.
fps_avg helped: gl_gfxbench_manhattan.trace: 8.37 -> 11.54 (37.85%)
fps_avg helped: gl_gfxbench_manhattan31.trace: 6.02 -> 7.51 (24.62%)
total fps_avg in affected (through threshold) runs: 14.39 -> 19.04 (32.32%)
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32351>
RADV is the only driver in Mesa CI to use VKCTS main but it doesn't
recognize 1.4 correctly yet. This will be fixed with a VKCTS uprev.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32432>
GFX6-7 can't support Vulkan 1.4 because indexTypeUint8 isn't supported
in hardware, and emulating features for very old hardware isn't the
option I would personally choose.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32432>
This performs some very basic verifications with the faulty VA we get
from the kernel. This will probably be improved over time.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32403>
Move the tool to summarize a failed pipeline to a generic .marge/hooks
directory. This will allow the fdo-bots repo to handle all marge hooks in
a consistent way across repositories that use this service.
Add a symlink to the bin/ci directory so that the pipeline summary tool
can still be run locally as well.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32413>
Notably, our convergent block loads were already overfetching - we
rounded up to block sizes of 8, 16, 32, or 64 (LSC-only). But we did
so in the backend, rather than NIR.
With recent changes, nir_opt_load_store_vectorize allows holes of up
to 28 bytes (7 components at 4 bytes each). This allows us to detect
cases where we did a convergent block load for 1 component (but loaded
a whole vec8), then another load for the next vec8, and combine them
into a single V16 load. Single component loads aren't the most common,
but convergent loads of a vec2 in one group and a vec3 in another are
quite common, and it makes no sense to do V8+V8 loads instead of V16.
For non-block loads, we allow a max hole of 4 bytes. This allows the
common case of XYZ_ + XYZ_ loads (where the last component is unread)
to combine into a single larger load.
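The two limits above can be written as a small policy function. This is
only a sketch: the function name and its parameters are hypothetical and do
not reproduce the signature of the real vectorizer callback; only the 28
byte and 4 byte limits come from this commit.

```c
#include <stdbool.h>

/* Hypothetical policy: may two loads be combined across a hole of
 * hole_bytes unread bytes? */
static bool
sketch_allow_hole(unsigned hole_bytes, bool is_convergent_block_load)
{
   /* Block loads already overfetch a whole vec8, so up to
    * 7 components * 4 bytes of unread data between two reads is fine. */
   if (is_convergent_block_load)
      return hole_bytes <= 28;

   /* Other loads: allow one unread 32-bit component (XYZ_ + XYZ_). */
   return hole_bytes <= 4;
}
```

For example, sketch_allow_hole(4, false) accepts the XYZ_ + XYZ_ case,
while sketch_allow_hole(28, false) rejects a hole that is only acceptable
for block loads.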
fossil-db results on Lunarlake:
Totals:
Instrs: 146692608 -> 146246432 (-0.30%); split: -0.33%, +0.02%
Subgroup size: 11100528 -> 11100512 (-0.00%)
Send messages: 7003425 -> 6862529 (-2.01%); split: -2.01%, +0.00%
Cycle count: 22396273274 -> 22523048654 (+0.57%); split: -1.08%, +1.64%
Spill count: 67671 -> 67594 (-0.11%); split: -1.59%, +1.48%
Fill count: 128999 -> 130223 (+0.95%); split: -1.73%, +2.68%
Scratch Memory Size: 5986304 -> 6042624 (+0.94%); split: -1.40%, +2.34%
Max live registers: 48898858 -> 48881655 (-0.04%); split: -0.05%, +0.01%
Non SSA regs after NIR: 172397792 -> 167577380 (-2.80%); split: -2.80%, +0.00%
Totals from 451003 (80.87% of 557667) affected shaders:
Instrs: 134111754 -> 133665578 (-0.33%); split: -0.36%, +0.03%
Subgroup size: 9039104 -> 9039088 (-0.00%)
Send messages: 6127775 -> 5986879 (-2.30%); split: -2.30%, +0.00%
Cycle count: 20306336726 -> 20433112106 (+0.62%); split: -1.19%, +1.81%
Spill count: 56230 -> 56153 (-0.14%); split: -1.92%, +1.78%
Fill count: 112920 -> 114144 (+1.08%); split: -1.97%, +3.06%
Scratch Memory Size: 3769344 -> 3825664 (+1.49%); split: -2.23%, +3.72%
Max live registers: 43750259 -> 43733056 (-0.04%); split: -0.05%, +0.01%
Non SSA regs after NIR: 158449343 -> 153628931 (-3.04%); split: -3.04%, +0.00%
In particular, sends get cut by 20.85% for Borderlands 3 DX12, 13.82%
on Cyberpunk 2077, 10.75% on Strange Brigade, and 10.20% on Red Dead
Redemption 2. Yet, spill/fills remain about the same.
fossil-db results on Alchemist are similar though not quite as good.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
nir_opt_load_store_vectorize checks for potential address wrapping
when vectorizing two loads ("low" and "high"). It looks for cases where
"low" might have a large address, and "high" has a positive offset
which, when added together, could trigger integer wraparound. The issue
here is that if the large address of "low" was considered out-of-bounds,
adding offset could wrap around to a small address, which might actually
be in-bounds. Thus, when loaded separately, "low" will fail and trigger
robustness out-of-bound-read behavior, but "high" would read correctly.
When vectorized, the entire load would fail. This is explicitly tested
for with 32-bit SSBO addresses in the Vulkan CTS.
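A worked example of the wraparound with 32-bit addresses. The helper names
and the 64 byte buffer size are made up for illustration; the arithmetic is
plain unsigned 32-bit wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical robustness check: is [addr, addr + size) inside a buffer
 * of buf_size bytes that starts at address 0? */
static bool
sketch_in_bounds(uint32_t addr, uint32_t size, uint32_t buf_size)
{
   return addr < buf_size && size <= buf_size - addr;
}

/* True if adding offset to base wraps around in 32 bits. */
static bool
sketch_offset_wraps(uint32_t base, uint32_t offset)
{
   return (uint32_t)(base + offset) < base;
}

/* Address of the "high" load: base plus offset, modulo 2^32. */
static uint32_t
sketch_high_addr(uint32_t base, uint32_t offset)
{
   return (uint32_t)(base + offset);
}
```

With a 64 byte buffer, "low" at 0xFFFFFFF0 is out of bounds, but "high" at
offset 0x10 wraps to address 0 and is in bounds when loaded on its own. A
single combined load starting at 0xFFFFFFF0 would be out of bounds as a
whole, which is the behavior change the check guards against.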
However, anv's 64-bit global addresses and VMA handling effectively
prevent this case. Addresses 0-4095 are a reserved page so that if
people try to use 0 as a NULL pointer, it never maps to a valid BO.
That alone guarantees that the above case where "high" gets a small
address would never be in-bounds, so we don't need to check for it.
In fact, we allocate most user allocations out of high addresses,
and have specialized allocation heaps for certain types of GPU data
structures in the lower GB of memory. For a load to wrap around and
successfully land in the right heap, it would have to load gigabytes.
Disabling this allows load vectorization and overfetching in more cases.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
The load_*_uniform_block_intel intrinsics always load either 8x or 16x
32-bit components worth of data (so 32 byte increments). This leads to
cases where we load a few components from one vec8, followed by a few
components of an adjacent vec8. We want to combine those into a vec16
load, as that loads a whole cacheline at a time, and requires fewer hoops
to calculate addresses and request memory loads.
So, we allow 7 * 4 = 28 bytes of holes, which handles vec8+vec8 where
only the .x component is read.
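The 28 byte figure follows from simple offset arithmetic. The helper below
is a hypothetical illustration, not the vectorizer's own code: it assumes a
hole is the number of unread bytes between the end of the lower load and
the start of the higher one.

```c
/* Hypothetical helper: unread bytes between the end of the low load and
 * the start of the high load. Assumes high_offset is at or past the end
 * of the low load. */
static unsigned
sketch_hole_bytes(unsigned low_offset, unsigned low_bytes,
                  unsigned high_offset)
{
   return high_offset - (low_offset + low_bytes);
}
```

Reading only .x of the first vec8 uses 4 bytes at offset 0, and the next
vec8 starts at offset 32, so sketch_hole_bytes(0, 4, 32) is 32 - 4 = 28,
which matches 7 components of 4 bytes each.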
Most drivers and intrinsics will not want such large holes. I thought
about adding a per-intrinsic max_hole to the core code, but decided that
since we already have driver callbacks, we can just rely on them to
reject what makes sense to them.
No driver callbacks currently allow holes, so this should not currently
affect any drivers. But any work in progress branches may need to be
updated to reject larger holes.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
Just calculate the block size using util_logbase2() - it's simpler.
Also drop the name "oword" as this refers to legacy HDC messages,
rather than the newer LSC "vector size" field.
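For illustration, a stand-in for the log2 computation. sketch_logbase2()
below is a local helper written for this example and assumed to behave like
a floor(log2(n)) routine; it is not Mesa's util_logbase2() itself. How the
result is placed into the message descriptor is not described here, so the
sketch stops at the logarithm.

```c
/* Local stand-in: floor(log2(n)) for n >= 1, and 0 for n == 0. */
static unsigned
sketch_logbase2(unsigned n)
{
   unsigned result = 0;

   while (n >>= 1)
      result++;

   return result;
}
```

For the block sizes mentioned in this series, 8, 16, 32 and 64 components
give 3, 4, 5 and 6 respectively, so one expression covers every size
without a lookup table or switch.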
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
This will matter more with overfetching, where we may suggest loading
additional data that we don't actually need for vectorization purposes.
We want to make sure that push ranges have the data we actually need;
any extra padding is irrelevant.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
i965 used to upload its own regular GL uniforms and push those in
addition to UBO ranges. st/mesa instead uploads regular uniforms
and presents those to us as UBO 0. So this really isn't a thing
anymore.
nir_intrinsic_load_uniform is still used today but it represents
Vulkan push constants. anv_nir_compute_push_layout already takes
care of ensuring too many ranges aren't present, so it doesn't need
the pass to do so. iris doesn't use this intrinsic at all.
We can also drop the compute shader check, because neither iris nor
anv use UBO push analysis for compute shaders - except for anv's
internal kernels, which already have well specified push layouts.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>