I've pulled in a pile of changes to reduce the overhead (runtime and
memory) when sharding for deqp-runner, along with a bunch of fixes for
KHR_display testing that we recently enabled, plus a few others that
affect our drivers.
The big new set of failures looks like it's from more complete coverage of
blitting between formats.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41243>
v3dv_CmdFillBuffer was passing only the user-supplied dstOffset to
meta_fill_buffer, ignoring the destination VkBuffer's mem_offset.
When several VkBuffers share one VkDeviceMemory at different offsets
(sub-allocation) the fill landed on whichever VkBuffer was
bound at offset 0 of the memory object instead of the requested one.
Fixes: 5ed78d91fe ("v3dv: implement vkCmdFillBuffer")
Assisted-by: Claude Opus 4.7
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41436>
The QPU prefetches the next instruction during shader execution.
If the shader assembly size perfectly aligns with a page boundary,
the prefetching mechanism reads past the compiled boundary,
leading to an MMU error.
This commit insert an explicit NOP instruction at the end of the shader
and increases the qpu_inst_count by one when the instruction count
exactly hits a page boundary. This ensures we don't fall off the end
of the last executable instruction page and into invalid memory.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40983>
Add QPU disassembler tests for V3D 7.1, covering
small immediates in both add and mul slots, as well
as setnnmode_uu paired with v8dot.
Assisted-by: OpenAI Codex
Acked-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41280>
V3D advertises maxComputeWorkGroupInvocations = 256 but ggml-vulkan
in many cases ignores this limit an creates compute pipelines with
over this limit. Although this is a bug in the application we can
take advantage of nir_lower_workgroup_size and make the application
work.
This issue was causing an assertion failure at nir_to_vir.c:
assert(c->local_invocation_index_bits <= 8);
The solution is lowering the oversized workgroups to a 256-invocation
workgroup loop, like radv and radeonsi are doing on GFX7, by running
nir_lower_workgroup_size(256) for this scenario.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41257>
Currently local shared memory is backed by a BO that is read/written
using the TMU.
ggml-vulkan probes the size of maxComputeSharedMemorySize and rejects
V3DV (falling back to CPU) when the value is below what its larger
compute pipelines request, although in the end the shaders ollama
runs don't actually use shared memory.
32 KB is what ggml-vulkan demands; the value can grow further with no
real per-op cost since shared memory currently goes through the TMU
like any other BO.
V3D OpenGL driver also has 32 KB for SharedMemory.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41257>
The combination of nir_opt_if and nir_lower_undef_to_zero running inside
the optimization loop could make it to not converge.
This was exercised by ollama running gemma3 compute shaders.
Removing the pass from the optimization loop results in No changes in
shader-db.
Assisted-by: Claude Opus 4.6
Fixes: cbe24a0e9c ("broadcom/compiler: use nir_lower_undef_to_zero")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41256>
Expose VK_KHR_shader_integer_dot_product 4x8-bit packed dot
products using native HW instructions v8dot and setnnmode.
QPU instruction count for sdot_4x8_iadd compute shader:
Before (scalar decomposition): 18 ALU cycles
After (setnnmode + v8dot): 3 ALU cycles (6x)
We advertise integerDotProduct4x8BitPacked*Accelerated for V3D 7.1+
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
This new VIR optimization pass tracks the current NN signedness
mode per block and removes duplicate setnnmode instructions.
When consecutive dot products use the same signedness mode, the backend
emits one setnnmode per dot product. This pass removes the redundant
ones, keeping only the first.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
As nnmode register is read by v8dot instruction we need to add dependencies
between setnnmode instructions and v8dot via the nnmode register, so they
are scheduled correcty using last_nn_mode virtual register..
Add a last_nn_mode virtual register to the scheduler state and create:
- Write dependencies for all SETNNMODE variants
- Read dependencies for V8DOT.
This follows the same pattern as the existing MULTOP/UMUL24 rtop tracking.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
VIR instructions and nir_to_vir implementation of 4x8-bit dot products
using native HW accelerated ALU instructions.
setnnmode instructions are marked as having side effects.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
Add QPU instruction definitions, metadata, and encoding for V3D 7.1
v8dot product instruction and the setnnmode instruction that allows
defining the signedness (UU/SU/US/SS) of the v8dot operation.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41255>
On unconditional branches qpu_set_branch_targets() can fill the delay slots
with a copy of the first instructions of the successor block.
As the qpu validator is sequential it would detect an incorrect hazard
when the MULTOP was copied but the UMUL24 wasn't.
This was identified in debug build when running gfxbench5.aztec_ruins_vk.
Assisted-by: Claude Opus 4.6
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40923>
The validation of branch instructions happening in branch and thrsw
delay slots has been dead code since it was introduced as the check
was after:
if (inst->type != V3D_QPU_INSTR_TYPE_ALU)
return;
Now last_branch_ip is updated and checks in_branch_delay_slots()
are active.
Fixes in_branch_delay_slots, as for branch there are always 3 delay slots.
As scheduler enforces this restrictions shader-db does not show any
regression.
Assisted-by: Claude Opus 4.6
Fixes: 90269ba353 ("broadcom/vc5: Use THRSW to enable multi-threaded shaders.")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40923>
Remove the allocate_tile_state_now parameter from v3dv_job_start_frame().
So v3dv_job_allocate_tile_state() is explicitly called after
job_emit_binning_flush() as we know the value of job->draw_count instead
of using always 0.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
Replace the inline tile_alloc/TSDA sizing in v3dv_job_allocate_tile_state()
with a call to the new v3d_tile_alloc_sizes() helper. This switches from
64B to 128B initial tile alloc blocks (avoiding overflow for simple draws)
and from a flat 512KB headroom to a draw-proportional formula.
Set tile_allocation_initial_block_size and tile_allocation_block_size
in all TILE_BINNING_MODE_CFG emissions and update the
TILE_LIST_INITIAL_BLOCK_SIZE packets to match.
Benchmarked on RPi5 (V3D 7.1) with GfxBench Vulkan Aztec Ruins at
1920x1040. Average tile_alloc BO size dropped 75% (535 KB to 132 KB)
with 20% fewer OOM events (521 to 417) and no FPS regression.
This avoids exhausting GPU memory when multiple blit or fill jobs
are batched in the same command buffer, with a huge reduction of
the memory footprint avoiding the 512 KB of the tile_alloc per batched
job.
Reviewed-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
Add V3D_TILE_ALLOC_INITIAL_BLOCK_SIZE = 128 and
V3D_TILE_ALLOC_OVERFLOW_BLOCK_SIZE = 64 to v3d_limits.h.
Corresponding _ENUM macros provide the 2-bit hardware encoding for the
TILE_BINNING_MODE_CFG packets.
The previous implicit 64B initial blocks were too small: a single draw
call emits ~88 bytes of per-tile BCL state, immediately overflowing
into continuation blocks. 128B initial blocks avoid the first
continuation allocation for simple single-draw passes.
Add v3d_tile_alloc_sizes() to v3d_util with the full tile alloc BO and
TSDA sizing logic. This uses the 128B initial blocks and tile_alloc
becomes proportional to the number of draws and size of the initial
blocks allocation with the cap of the previous fixed allocation. So
jobs with 0 or 1 drawcalls (blits/fills) reduce their headroom
dramatically.
The draw-proportional formula replaces a flat 512 KB continuation pool:
headroom = MIN2((tiles_size * draw_count) / 2, 512 KB)
Benchmarked on RPi5 (V3D 7.1) against GfxBench GL tests and
apitrace replays at 1080p. Tile-alloc memory reduction versus the
flat 512 KB headroom (taking into account 256kb kernel alloc per OOM):
GfxBench (5 benchmarks): -45% to -70% reduction, OOM at or below baseline
Apitrace (19 traces): -4% to -77% reduction on 20/24 traces
No FPS regressions observed on any workload.
Reviewed-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
Instead of loading and parsing the XML spec everytime a CLIF is created,
do it once and cache for further calls.
This also avoids leaking the spec loading.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40747>
Replace printf and nir_print_shaders by proper mesa_logX and
nir_log_shaderX functions, that provides better features (like logging
to a file, setting the logging verbosity, etc) and works better with
Android.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40434>
This will give better flexibility on how and where the dumps will be
done.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40434>
Replace printf and nir_print_shaders by proper mesa_logX and
nir_log_shaderX functions, that provides better features (like logging
to a file, setting the logging verbosity, etc) and works better with
Android.
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40434>
When converting the index buffer from 4-bytes to 2-bytes, we use the
uploader for the job. Since commit b3133e250e we do an uploader alloc
ref, which releases the uploader buffer if there is no enough space,
creating a new one.
The problem happens when we also need this buffer because it is the one
containing the index buffer to convert. This happens, for instance, if
we need to convert the primitives because they are not supported (e.g.,
converting quads to triangles), as this is done
also using the uploader.
The solution is to ensure the uploader's buffer has an extra reference
so when released, it is not destroyed. This can easily achieved by
calling first pipe_buffer_map_range(), which is required to access the
buffer, and it increases the references.
This fixes `spec@!opengl 1.1@longprim`.
Fixes: b3133e250e ("gallium: add pipe_context::resource_release to eliminate buffer refcounting")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40642>
The extension is implemented in shared Vulkan/WSI code and
not driver specific. The underlying kms driver needs to
support HDR metadata signalling on the drm connector, which
vc4 kms does for VideoCore 5 and later since April 2021.
Successfully tested on RaspberryPi 4/400 with a HDR-10
capable HDMI monitor.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40696>
These extensions are implemented in shared Vulkan/WSI code and
not driver specific. A Vulkan driver just needs to support
VK_KHR_timeline_semaphore, which v3dv already supports via
emulated timeline semaphores since April 2022.
Successfully tested on RaspberryPi 4/400.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40696>
We have 4 image intrinsic variants now. This enum is useful for
nir_rewrite_image_intrinsic() and it will be used by other NIR passes.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40709>
The variable doesn't store a granularity specific to CLE buffers. It
stores the granularity that the OS imposes on buffer allocations (that
is, the OS page size). Therefore, rename the variable to best reflect
its meaning.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40496>
When a resolve attachment is created with VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT,
the render pass may use a view format that differs from the image creation
format (e.g. view=R16G16_SINT on an image created as B8G8R8A8_SRGB).
cmd_buffer_emit_resolve() was calling v3dv_CmdResolveImage2() which only
receives images but not the view format. This means that blit_shader()
will use the wrong format, causing miss-renderings.
So instead of using directly v3dv_CmdResolveImage2(), let's have an
intermediate function that receives both images and view formats to do
the resolve.
This fixes dEQP-VK.image.mutable.* failures.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40234>
Split the monolithic v3dv_private.h (~2600 lines) into self-contained
sub-headers so each .c file only includes what it needs:
v3dv_common.h, v3dv_device.h, v3dv_image.h, v3dv_pass.h,
v3dv_query.h, v3dv_pipeline.h, v3dv_descriptor_set.h,
v3dv_cmd_buffer.h, v3dv_version_dispatch.h
As part of this commit we remove v3dv_private.h.
We keep v3dvx_private.h as it is, because the gain would be really
small (a lot of really small sub-headers).
In addition to keep things more tidy, we made a quick performance
check. We measured how many files are re-compiled and the performance
difference when touching one of the headers, compared with keeping
just one monolithic header.
Header touch (incremental) Split Monolithic Speedup
-------------------------- ----- ---------- -------
v3dv_image.h 2369 (24f) 2436 (33f) 1.03x
v3dv_query.h 2357 (20f) 2436 (33f) 1.03x
v3dv_pass.h 2352 (20f) 2436 (33f) 1.04x
v3dv_cmd_buffer.h 2354 (20f) 2436 (33f) 1.03x
v3dv_descriptor_set.h 2436 (33f) 2436 (33f) 1.00x
v3dv_pipeline.h 2437 (33f) 2436 (33f) 1.00x
v3dv_device.h 2418 (31f) 2436 (33f) 1.01x
v3dv_common.h 2419 (33f) 2436 (33f) 1.01x
v3dv_version_dispatch.h 2371 (26f) 2436 (33f) 1.03x
Header touch (incremental) Split Monolithic Speedup
-------------------------- ---------- ---------- -------
v3dv_image.h 2377 (24f) 2443 (33f) 1.03x
v3dv_query.h 2346 (20f) 2443 (33f) 1.04x
v3dv_pass.h 2360 (20f) 2443 (33f) 1.04x
v3dv_cmd_buffer.h 2351 (20f) 2443 (33f) 1.04x
v3dv_descriptor_set.h 2438 (33f) 2443 (33f) 1.00x
v3dv_pipeline.h 2429 (33f) 2443 (33f) 1.01x
v3dv_device.h 2418 (31f) 2443 (33f) 1.01x
v3dv_common.h 2432 (33f) 2443 (33f) 1.00x
v3dv_version_dispatch.h 2373 (26f) 2443 (33f) 1.03x
The bigger gain is on the files recompiled for some headers (going
from 33 down to 20 in some cases). The performance gain is not so
relevant though.
Acked-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40169>
The V3D 7.1 TFU ICFG register restructured the IFORMAT field to 3 bits
(25:23) vs 4 bits on V3D 4.2. The defines were still using the V3D 4.2
encoding (11-15) which overflows the 3-bit field. Fix values to the
correct 3-7 range.
This was working by accident because the overflow bits land in the
SVTWID field, which is not used for the affected tiling formats.
Also rename SAND_128 to SAND since V3D 7.1 has a single SAND input
format; the tile width is now controlled by SVTWID.
Fixes: 146ceadcf4 ("v3dv: add support for TFU jobs in v71")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40540>