It didn't save states properly. The only correct place to save them is
si_blitter_begin. Unfortunately, we can't skip saving and restoring
those states because we don't know in advance whether the rectangle path
will be used.
Cc: mesa-stable
Reviewed-by: Pierre-Eric
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40634>
This should be faster because 2 triangles are inefficient on the diagonal,
generating helper invocations and potentially extra memory loads from dst
because tiles aren't fully covered.
Reviewed-by: Pierre-Eric
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40633>
The most egregious case was AS updates, in which case radv_copy_memory
would decide to use compute, which overwrites the bound pipeline with
a copy shader. Subsequent dispatches assumed the update pipeline to be
bound, but dispatched another copy shader instead.
There is also a chance of this happening for geometry info copying for
RRA, so add another pass for that.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39985>
The dirty state of stencil ops is not checked when deciding whether to
rebuild the ISP state, although the values are part of the ISP state
(bits 27:16 of the ISPB word).
Add MESA_VK_DYNAMIC_DS_STENCIL_OP to the condition for rebuilding ISP
control registers.
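As a rough illustration of the condition change, here is a minimal
sketch. The bit names and the isp_state_needs_rebuild() helper are
hypothetical and only mimic the shape of the check; which other states
take part in the real condition is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical dirty bits for illustration only; the driver tracks the
 * MESA_VK_DYNAMIC_* states through the common dynamic state framework. */
enum dirty_bit {
   DIRTY_DS_STENCIL_TEST_ENABLE = 1u << 0,
   DIRTY_DS_STENCIL_COMPARE_MASK = 1u << 1,
   DIRTY_DS_STENCIL_OP = 1u << 2, /* the bit that was not checked before */
   DIRTY_VP_VIEWPORTS = 1u << 3,  /* unrelated to the ISP state */
};

/* Rebuild the ISP state if any state that feeds it is dirty. */
static bool
isp_state_needs_rebuild(uint32_t dirty)
{
   const uint32_t isp_mask = DIRTY_DS_STENCIL_TEST_ENABLE |
                             DIRTY_DS_STENCIL_COMPARE_MASK |
                             DIRTY_DS_STENCIL_OP;

   return (dirty & isp_mask) != 0;
}
```

Without DIRTY_DS_STENCIL_OP in the mask, a draw that only changes the
stencil ops would keep using the stale ISP words.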
Fixes GLCTS tests when running on top of Zink:
dEQP-GLES2.functional.fragment_ops.stencil.zero_stencil_fail
Fixes: 88f1fad3f7 ("pvr: Use common pipeline & dynamic state frameworks")
Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40623>
When running GLES2 conformance tests with Zink on the PowerVR driver, I
found that the PowerVR driver shows the same kind of quirk as Apple AGX:
it does not ignore the wrap mode for seamless cubes (see !21978 for
the description of the quirk on AGX).
As GLES2 exposes non-seamless cubes, exposing non-seamless cube support
on PowerVR seems to help a lot with these GLES2 tests. Implementing
full GLES 3 and relying on the workaround for AGX is another option, but
that is still a long way off.
Implementing non-seamless cube seems to be as easy as setting a bit in
the sampler control word, so do it.
Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40660>
Use KGSL_CONTEXT_NO_FAULT_TOLERANCE to push the context into an error state
when a GPU fault is detected. This is useful when dealing with replays of
captures that are producing a GPU fault but might seem to replay just fine
because of the KGSL kernel fault tolerance.
Signed-off-by: Zan Dobersek <zdobersek@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40667>
This allows us to avoid dirtying all of the state for user compute
dispatches when we run a precomp shader.
Signed-off-by: Olivia Lee <olivia.lee@collabora.com>
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Reviewed-by: Mary Guillemard <mary.guillemard@collabora.com>
Reviewed-by: Eric R. Smith <eric.smith@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37970>
The VK_KHR_present_wait extension contains no functionality to announce
(the lack of) support for vkWaitForPresentKHR() on a WSI (or WSI-bound
object) granularity.
On any driver advertising that extension and the headless WSI, the
application will expect vkWaitForPresentKHR() to be usable with the
headless WSI, which leads to an assertion failure in debug Mesa builds
or a crash in release builds.
Create a trivial wait_for_present implementation for the headless WSI,
which just assumes the image is presented immediately when
queue_present is called, so it only checks the WSI present semaphore.
Tested with `dEQP-VK.wsi.headless.present_id_wait.wait.*` on RADV
without any failures.
Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Reviewed-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40347>
Currently the wsi_headless_surface_create_swapchain() function abuses
the corresponding destroy function to perform cleanup when any failure
happens during image creation. This practice seems fragile and
prevents further changes to the swapchain creation procedure.
Implement a proper cleanup sequence that reverses all operations.
As another cleanup codepath above already contains a call to vk_free(),
that call is changed to a goto targeting the corresponding label.
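The reverse-order cleanup pattern can be sketched as below. This is a
generic illustration, not the WSI code: struct chain, chain_create()
and the fail_fences test knob are hypothetical names, and plain
calloc()/free() stand in for the Vulkan allocation helpers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object with two sub-allocations. */
struct chain {
   int *images;
   int *fences;
};

/* Each failure jumps to the label that undoes only what already
 * succeeded, in reverse order of creation. */
static struct chain *
chain_create(size_t count, int fail_fences)
{
   assert(count > 0);

   struct chain *c = calloc(1, sizeof(*c));
   if (!c)
      return NULL;

   c->images = calloc(count, sizeof(*c->images));
   if (!c->images)
      goto fail_chain;

   /* fail_fences simulates an allocation failure for testing. */
   c->fences = fail_fences ? NULL : calloc(count, sizeof(*c->fences));
   if (!c->fences)
      goto fail_images;

   return c;

fail_images:
   free(c->images);
fail_chain:
   free(c);
   return NULL;
}

static void
chain_destroy(struct chain *c)
{
   free(c->fences);
   free(c->images);
   free(c);
}
```

The destroy function is only ever called on a fully constructed object,
so it does not need to cope with partially initialized state.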
Regression tested with `dEQP-VK.wsi.headless.swapchain.simulate_oom.*`
on RADV.
Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Reviewed-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40347>
Just a bit cleaner, and we can unify point size too.
Signed-off-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40677>
This could've saved me a lot of time debugging stack corruption.
Signed-off-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40677>
The state uploader was hardcoded to 4096 bytes, which doesn't fill the
full page on systems with 16KB pages. Use devinfo->page_size instead so
the uploader default matches the actual allocation granularity.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40496>
The variable doesn't store a granularity specific to CLE buffers. It
stores the granularity that the OS imposes on buffer allocations (that
is, the OS page size). Therefore, rename the variable to better reflect
its meaning.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40496>
Previously, each sampler view allocated a dedicated BO for its
TEXTURE_SHADER_STATE packet (~24 bytes), which got rounded up to a
full 4KB page. This wastes memory and inflates the per-job BO handle
count.
Use u_upload_alloc_ref() to sub-allocate texture shader state from the
shared state_uploader, matching the pattern already used by image views.
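To make the saving concrete, here is a minimal sketch of the arithmetic.
It is not u_upload_alloc_ref(): struct sub_alloc and sub_alloc_take()
are hypothetical, and the 32-byte alignment used in the note below is an
assumption picked for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to a multiple of a; a must be a power of two. */
static uint32_t
align_pot(uint32_t v, uint32_t a)
{
   assert(a != 0 && (a & (a - 1)) == 0);
   return (v + a - 1) & ~(a - 1);
}

/* Hypothetical bump sub-allocator over one shared buffer. */
struct sub_alloc {
   uint32_t size;   /* bytes in the shared buffer */
   uint32_t offset; /* next free byte */
};

/* Return the offset of the new block, or UINT32_MAX if it does not fit. */
static uint32_t
sub_alloc_take(struct sub_alloc *sa, uint32_t bytes, uint32_t alignment)
{
   uint32_t start = align_pot(sa->offset, alignment);

   if (start > sa->size || bytes > sa->size - start)
      return UINT32_MAX;

   sa->offset = start + bytes;
   return start;
}
```

A dedicated allocation costs align_pot(24, 4096) = 4096 bytes per
packet, while under the assumed 32-byte alignment one 4096-byte buffer
holds 128 such packets behind a single BO handle.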
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40496>
From the documentation, the state uploader should be used inside the
driver for long-term state inside buffers, while the stream uploader
should be used by Gallium's internals. Considering that the image view
texture shader state can be considered long-lived state data, use
`state_uploader` instead of `uploader` for consistency.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40496>
Searching device->queues by queueIndex and queueFamilyIndex alone can
return the wrong queue: if two queues A and B are created with the same
queueIndex and queueFamilyIndex but different flags, a request for B
can return A, because the vk_foreach_queue loop stops at A as soon as
it finds the requested queueIndex and queueFamilyIndex.
Add a check of the queue flags and return the queue with matching
flags, queueIndex and queueFamilyIndex.
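The lookup can be sketched as follows. This is a self-contained
illustration rather than the Mesa code: struct queue and find_queue()
are hypothetical stand-ins for the real queue object and the
vk_foreach_queue loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device queue. */
struct queue {
   uint32_t family_index;
   uint32_t index;
   uint32_t flags;
};

/* Return the first queue matching family, index and flags, or NULL. */
static const struct queue *
find_queue(const struct queue *queues, size_t count,
           uint32_t family_index, uint32_t index, uint32_t flags)
{
   assert(queues != NULL || count == 0);

   for (size_t i = 0; i < count; i++) {
      const struct queue *q = &queues[i];

      if (q->family_index == family_index && q->index == index &&
          q->flags == flags)
         return q;
   }

   return NULL;
}
```

With two queues that share family and index but differ in flags,
dropping the flags comparison would always return whichever one comes
first in the list.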
Signed-off-by: Julia Zhang <Julia.Zhang@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40669>
Because there is no way to know where the address has been allocated
(GTT or VRAM), the existing entrypoints aren't dropped and the sparse
bit is derived from VK_ADDRESS_COMMAND_FULLY_BOUND_BIT_KHR.
It would be nice to figure out if the CP DMA vs compute heuristic for
GTT BOs on dGPUs could be removed to simplify this implementation.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40386>
This doubles vkoverhead's draw_16vattrib_change_dynamic performance.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40603>
Only use LDS for VGPR spilling if we can use addtid access, to avoid having a VGPR addr.
Limit to single wave workgroups, to avoid needing the wave_id for the offset.
If we have a scratch stack pointer, don't use LDS at all.
Limit LDS spilling to not reduce occupancy further.
Note that in theory, this can still limit occupancy of other shaders running
on the CU at the same time, but that's unlikely and impossible to know at this point.
Removes all scratch usage in emulated FSR4 and parallel_rdp.
Besides that, only a single GoW shader is affected.
Foz-DB Navi31:
Totals from 9 (0.01% of 114641) affected shaders:
Instrs: 68863 -> 68830 (-0.05%); split: -0.07%, +0.02%
CodeSize: 416108 -> 416000 (-0.03%); split: -0.05%, +0.02%
LDS: 2048 -> 45056 (+2100.00%)
Scratch: 261888 -> 220672 (-15.74%)
Latency: 727951 -> 657155 (-9.73%); split: -9.73%, +0.00%
InvThroughput: 418644 -> 383269 (-8.45%)
VClause: 1506 -> 1200 (-20.32%)
Copies: 10651 -> 10624 (-0.25%)
VALU: 48700 -> 48684 (-0.03%)
SALU: 6200 -> 6199 (-0.02%); split: -0.05%, +0.03%
VMEM: 4139 -> 3589 (-13.29%)
VOPD: 580 -> 574 (-1.03%)
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36367>
PyTorch Conv2d without explicit bias produces a NULL bias_tensor
in the Gallium pipe_ml_operation. Guard against NULL dereferences
in two places:
- ethosu_lower.c: pass NULL to fill_coefs when bias_tensor is NULL
- ethosu_coefs.c: treat missing biases as zero
Fixes crashes when running Conv2d models without bias through the
Ethos-U NPU backend.
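The "missing biases are zero" rule amounts to a guarded read. A minimal
sketch, where bias_or_zero() and the int32_t bias type are assumptions
made for the example and not the driver's actual helper:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read the bias for one output channel, treating a
 * missing bias array as all zeros instead of dereferencing NULL. */
static int32_t
bias_or_zero(const int32_t *biases, size_t channel)
{
   return biases ? biases[channel] : 0;
}
```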
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40578>
Add ethosu_ml_subgraph_deserialize() which reconstructs a subgraph
from a serialized byte buffer. Parses the header (cmdstream size,
coefs size, io size, tensors size), restores the tensor array,
cmdstream, and coefficient buffers.
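A defensive header parse might look like the sketch below. The layout
is an assumption made for illustration: four 32-bit sizes in the order
listed above, in host byte order, each counting payload bytes that
follow the header. struct blob_header and parse_header() are
hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical serialized header; the real layout may differ. */
struct blob_header {
   uint32_t cmdstream_size;
   uint32_t coefs_size;
   uint32_t io_size;
   uint32_t tensors_size;
};

/* Copy the header out of the buffer and check that the sizes it
 * declares fit in the bytes that remain after it. */
static bool
parse_header(const uint8_t *data, size_t size, struct blob_header *out)
{
   assert(data != NULL && out != NULL);

   if (size < sizeof(*out))
      return false;

   /* memcpy avoids unaligned reads from the byte buffer. */
   memcpy(out, data, sizeof(*out));

   /* Sum in 64 bits so four 32-bit sizes cannot overflow. */
   uint64_t payload = (uint64_t)out->cmdstream_size + out->coefs_size +
                      out->io_size + out->tensors_size;

   return payload <= size - sizeof(*out);
}
```

Rejecting a blob whose declared sizes exceed the buffer keeps the later
copies of the tensor array, cmdstream and coefficients in bounds.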
DRM buffer object creation is deferred to prepare_for_submission()
which is called lazily on first invoke.
Wire pctx->ml_subgraph_deserialize in ethosu_create_context().
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40578>
Add ml_subgraph_deserialize() to pipe_context for reconstructing
a previously-serialized ML subgraph at runtime. This complements
ml_subgraph_serialize() on pipe_ml_device and allows the runtime
to load pre-compiled subgraphs.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40578>
On LSC platforms the SLM writes are unfenced between workgroups. This
means a workgroup W1 finishing might have uncompleted SLM writes.
Another workgroup W2 dispatched after W1 which gets allocated an
overlapping SLM location might have writes that race with the previous
W1 operations.
The solution to this is to fence all write operations (store & atomics)
of a workgroup before ending the threads. We do this by emitting a
single SLM fence either at the end of the shader or, if there is only a
single unfenced write, at the end of that block.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13924
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40430>
Expose float32 atomic exchange support for buffer, shared, and image
operations on all architectures. The existing axchg instruction is
type-agnostic, so no compiler changes are needed. Image atomics are
already lowered to global atomics via nir_lower_image_atomics_to_global.
Also add R32_FLOAT to the STORAGE_IMAGE_ATOMIC format feature flag so
image atomic operations are accepted for r32f images.
Signed-off-by: Christian Gmeiner <cgmeiner@igalia.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40506>
Previously bifrost_nir_lower_shader_output grouped outputs in separate
if blocks and made a best-effort attempt to group them together. This
also assumed that pan_nir_lower_store_component wrote each output only
once and that nir_lower_io_vars_to_temporaries pulled them out of any
control flow.
Now all of these are handled by the new pan_nir_lower_vs_outputs pass
that handles write masks, control flow, per_view and grouping for IDVS.
This makes the overall dependencies much simpler, ensures that the
stores are grouped in the same ifs and should be more robust.
Signed-off-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40537>