Handle null descriptors by emitting zeroed descriptor state.
When the nullDescriptor feature is enabled, a dedicated null_bo is
allocated. Null image descriptors now pack a TEXTURE_SHADER_STATE whose
base address points to this BO, ensuring that the TMU reads from valid
memory.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40485>
Add `v3d_nir_lower_null_descriptors` NIR pass to bypass operations
if the descriptor size is zero, returning 0 where necessary.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40485>
This also introduces a new tier since the common helper exposes
25% of memory as heap on devices with <=1GiB memory. Previously
50% was being used.
This also fixes `device->heap_used` not using atomic read.
Signed-off-by: Karmjit Mahil <karmjit.mahil@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41242>
Currently, display_fd gets leaked during vulkan loader driver
probing on platforms where there's no v3dv device, as nothing
closes this fd before returning with INCOMPATIBLE_DRIVER. As
the display_fd also holds MASTER, this in turn prevents the
actual driver from becoming master on the display node.
Close the fd before returning to prevent this.
Fixes: bb532a7a ("v3dv: Fix assertion failure for not-found primary_fd during enumeration.")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41058>
Threaded submit relies on DRM syncobj wait ioctls blocking until the
GPU signals completion. Under drm-shim there is no real GPU, so
SYNCOBJ_WAIT returns immediately, creating a race between the submit
thread and vkQueueWaitIdle that leads to use-after-free crashes.
Detect if we are running under drm-shim by checking the DRM version
description, skip enabling threaded submit in that case.
Assisted-by: Cursor Agent (Opus 4.6)
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41779>
So far with drm-shim we were always emulating V3D 4.2.
Now we always emulate V3D 7.1, but we allow selecting 4.2 through an
envvar: `V3D_GPU_ID=(42|71)`
Borrowed from etnaviv.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41779>
When using drm-shim there is no primary node for the driver. This is
fine, and hence we only mark that we don't have primary device.
This fixes using v3dv with drm-shim.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41779>
The TFU stride-0 fill path allocates a 64 KiB staging BO
(V3D_TFU_MAX_DIM * cpp = 16384 * 4), maps it, fills it with the
pattern, and caches it on the command buffer. For non-zero patterns
the per-cmd-buffer cache works well, but WebGPU/Dawn workloads
issue many zero-fills (lazy buffer init) across separate command
buffers, so the cache misses almost every time and each fill pays
for a fresh alloc + mmap + memcpy.
Add a device-wide staging BO held in v3dv_device::meta.tfu_fill_zero,
lazily allocated under meta.mtx and used whenever data == 0. The BO
is read-only after init so it can be shared across queues without
extra synchronization, and it is freed in destroy_device_meta.
Measured on a Dawn/WebGPU zero-fill-heavy workload (RPi5, ~60
meta_fill_buffer calls, ~218 MiB total, all zero-fills):
before: TFU branch total 7.328 ms, avg 115.55 us/call
after: TFU branch total 0.296 ms, avg 4.78 us/call (~24x)
Non-zero patterns continue to use the per-cmd-buffer cache.
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
Adjust eligibility check on imageExtent vs slice dimensions
rather than on the buffer addressing dimensions. The TFU codepath
here always writes/reads the full slice from its origin, so the
required invariant is 'imageExtent == slice'; bufferRowLength and
bufferImageHeight may be larger than imageExtent (the spec permits
this for non-zero values), in which case the TFU reads/writes at the
buffer's row/layer stride but only touches slice->width pixels per
row and slice->height rows per layer, leaving the trailing padding
untouched.
The previous combined check (width == slice->width && height ==
slice->height applied to the buffer dimensions) would reject any
caller that set bufferRowLength or bufferImageHeight larger than the
image (this is common for buffers shared across mip levels or
for alignment requirements like Dawn aligning bufferRowLength to 2
for 1-pixel-wide textures), forcing those copies through the slower
TLB / blit / compute paths.
For compressed formats, keep the strict equality check since
block-level stride semantics are more complex.
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
Generalize copy_buffer_image_tfu with a to_buffer flag selecting which
side is the raster destination, and wire it into v3dv_CmdCopyImageToBuffer2
before the TLB path.
The to_buffer=true direction has the same eligibility constraints as
buffer-to-image, except that V3D 4.2 is unsupported as its TFU cannot
produce raster output, and for image-to-buffer the destination is
always a raster buffer.
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
Drop the direction from the function name in preparation for sharing
this implementation with image-to-buffer copies in the next commit.
Pure rename, no functional change.
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
Replace the TLB-based meta_fill_buffer path on V3D 7.1+ with a TFU
raster-to-raster copy that broadcasts a single staging row across
the output via iis=0 (stride-0 input). This eliminates the per-fill
CL render job and its tile_alloc/TSDA BO overhead, which is
substantial on workloads that issue many small fills (e.g. WebGPU
lazy buffer initialization in Dawn).
The staging BO holding one row of the fill pattern is cached on the
command buffer and reused across fills with the same data value, so
sequences of identical-pattern fills share a single staging BO.
The existing TLB-based fill is kept as a fallback and is also used
when V3D_DEBUG=disable_tfu is set, or on V3D simulator builds where
the stride-0 TFU input mode is not supported and would assert.
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
Move from v3dv_meta_copy.c to a generic v3dv_cmd_buffer_destroy_bo_cb
in the cmd buffer module. This makes it reusable for different callers
that want to attach a v3dv_bo to a command buffer's private_objs list.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
Buffer-to-buffer copies on V3D 7.1+ can be served by the TFU as a
raster-to-raster copy, avoiding the per-copy CL render job and
tile_alloc/TSDA BO overhead of the TLB-based path.
Treat the buffer as a raster texture and chunk the copy into TFU
jobs of up to 16384x16384 pixels. Pick the largest pixel size
(cpp in {4,2,1}) such that src/dst offsets and size are all
cpp-aligned: cpp=4 (R8G8B8A8_UINT) is the expected common case;
cpp=2 (R8G8_UINT) and cpp=1 (R8_UINT) handle Vulkan-permitted
unaligned vkCmdCopyBuffer regions that would otherwise fall back
to the slow TLB path. Skipped when V3D_DEBUG=disable_tfu is set;
emits perf_debug when the cpp=1/2 fallback is taken.
Drop the `if (copy_job)` guard on src_bo cleanup registration in
v3dv_CmdUpdateBuffer: the TFU path queues jobs without returning a
v3dv_job*, so the staging BO must be tracked unconditionally to
avoid leaking once the cmd buffer is submitted.
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41725>
The scalarBlockLayout feature was already exposed via the Vulkan 1.2
feature struct, but Vulkan 1.1 clients (e.g. Dawn) need the EXT to
discover it.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41673>
B10G11R11_UFLOAT_PACK32 maps to V3D_INTERNAL_TYPE_16F on the TLB,
which canonicalizes NaN bit patterns when arbitrary 32 bits are
reinterpreted as that format. The same canonicalization happens in
the blit shader when sampling a B10G11R11 source. Both break the
bit-exactness that vkCmdCopyImage, vkCmdCopyImageToBuffer and
vkCmdCopyBufferToImage require, since the spec defines them as raw
byte copies for any pair of texel-size compatible formats.
Fix it by aliasing the format to R32_UINT whenever B10G11R11 is
involved.
This fixes dEQP-VK.api.copy_and_blit.*b10g11r11*,
dEQP-VK.image.subresource_layout.*b10g11r11* and
dEQP-VK.api.image_clearing.*b10g11r11* failures on V3D 7.1.7 (rpi5)
and V3D 4.2 (rpi4).
Assisted-by: Claude Opus 4.7
Cc: mesa-stable
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41599>
The two BOs come from distjoint allocation nowadays. So they
would never share the BO handle. In case this becomes false
in the future, the BO hanldes needs to be de-duped as happens
with TFU submisions.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41616>
v3d_submit_cpu_ioctl() takes a separate ww_acquire_ctx for the cpu_job's
bo_handles[] and any embedded CSD's bo_handles[]; a BO appearing in both
lists makes the second lock wait on a reservation held by the first
context, deadlocking the ioctl.
We avoid adding a duplicate BO handle when it's already in the cpu_job's
list. This collided when an app suballocates an indirect VkBuffer and a
CSD bind-group VkBuffer out of one VkDeviceMemory.
Fixes: e404ccba5b ("v3dv: use the indirect CSD user extension")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41616>
V3DV hardcoded maxFragmentOutputAttachments to 4, from
V3D 4.x when V3D_MAX_RENDER_TARGETS was 4. On V3D 7.x (RPi5)
V3D_MAX_RENDER_TARGETS is 8.
WebGPU's mandatory maxColorAttachments minimum is 8, and wgpu computes
max_color_attachments as min(maxColorAttachments,
maxFragmentOutputAttachments). With the previous value V3DV capped
WebGPU clients to 4 color attachments on RPi5.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41600>
This extension is part of Vulkan 1.2 core and the feature is already
exposed; we just weren't advertising the extension separately.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41624>
I've pulled in a pile of changes to reduce the overhead (runtime and
memory) when sharding for deqp-runner, along with a bunch of fixes for
KHR_display testing that we recently enabled, plus a few others that
affect our drivers.
The big new set of failures looks like it's from more complete coverage of
blitting between formats.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41243>
v3dv_CmdFillBuffer was passing only the user-supplied dstOffset to
meta_fill_buffer, ignoring the destination VkBuffer's mem_offset.
When several VkBuffers share one VkDeviceMemory at different offsets
(sub-allocation) the fill landed on whichever VkBuffer was
bound at offset 0 of the memory object instead of the requested one.
Fixes: 5ed78d91fe ("v3dv: implement vkCmdFillBuffer")
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41436>
The QPU prefetches the next instruction during shader execution.
If the shader assembly size perfectly aligns with a page boundary,
the prefetching mechanism reads past the compiled boundary,
leading to an MMU error.
This commit insert an explicit NOP instruction at the end of the shader
and increases the qpu_inst_count by one when the instruction count
exactly hits a page boundary. This ensures we don't fall off the end
of the last executable instruction page and into invalid memory.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40983>
Add QPU disassembler tests for V3D 7.1, covering
small immediates in both add and mul slots, as well
as setnnmode_uu paired with v8dot.
Assisted-by: OpenAI Codex
Acked-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41280>
V3D advertises maxComputeWorkGroupInvocations = 256 but ggml-vulkan
in many cases ignores this limit an creates compute pipelines with
over this limit. Although this is a bug in the application we can
take advantage of nir_lower_workgroup_size and make the application
work.
This issue was causing an assertion failure at nir_to_vir.c:
assert(c->local_invocation_index_bits <= 8);
The solution is lowering the oversized workgroups to a 256-invocation
workgroup loop, like radv and radeonsi are doing on GFX7, by running
nir_lower_workgroup_size(256) for this scenario.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41257>
Currently local shared memory is backed by a BO that is read/written
using the TMU.
ggml-vulkan probes the size of maxComputeSharedMemorySize and rejects
V3DV (falling back to CPU) when the value is below what its larger
compute pipelines request, although in the end the shaders ollama
runs don't actually use shared memory.
32 KB is what ggml-vulkan demands; the value can grow further with no
real per-op cost since shared memory currently goes through the TMU
like any other BO.
V3D OpenGL driver also has 32 KB for SharedMemory.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41257>
The combination of nir_opt_if and nir_lower_undef_to_zero running inside
the optimization loop could make it to not converge.
This was exercised by ollama running gemma3 compute shaders.
Removing the pass from the optimization loop results in No changes in
shader-db.
Assisted-by: Claude Opus 4.6
Fixes: cbe24a0e9c ("broadcom/compiler: use nir_lower_undef_to_zero")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41256>
Expose VK_KHR_shader_integer_dot_product 4x8-bit packed dot
products using native HW instructions v8dot and setnnmode.
QPU instruction count for sdot_4x8_iadd compute shader:
Before (scalar decomposition): 18 ALU cycles
After (setnnmode + v8dot): 3 ALU cycles (6x)
We advertise integerDotProduct4x8BitPacked*Accelerated for V3D 7.1+
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
This new VIR optimization pass tracks the current NN signedness
mode per block and removes duplicate setnnmode instructions.
When consecutive dot products use the same signedness mode, the backend
emits one setnnmode per dot product. This pass removes the redundant
ones, keeping only the first.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
As nnmode register is read by v8dot instruction we need to add dependencies
between setnnmode instructions and v8dot via the nnmode register, so they
are scheduled correcty using last_nn_mode virtual register..
Add a last_nn_mode virtual register to the scheduler state and create:
- Write dependencies for all SETNNMODE variants
- Read dependencies for V8DOT.
This follows the same pattern as the existing MULTOP/UMUL24 rtop tracking.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
VIR instructions and nir_to_vir implementation of 4x8-bit dot products
using native HW accelerated ALU instructions.
setnnmode instructions are marked as having side effects.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
Add QPU instruction definitions, metadata, and encoding for V3D 7.1
v8dot product instruction and the setnnmode instruction that allows
defining the signedness (UU/SU/US/SS) of the v8dot operation.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>