We already have a nice script; let's add the only missing option here.
Signed-off-by: Mary Guillemard <mary@mary.zone>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40932>
We only need to deal with the fixed function last vertex stage case,
for prior stages nir_opt_varyings is enough.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40907>
Remove the allocate_tile_state_now parameter from v3dv_job_start_frame().
Instead, call v3dv_job_allocate_tile_state() explicitly after
job_emit_binning_flush(), when the value of job->draw_count is known,
rather than always using 0.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
Replace the inline tile_alloc/TSDA sizing in v3dv_job_allocate_tile_state()
with a call to the new v3d_tile_alloc_sizes() helper. This switches from
64B to 128B initial tile alloc blocks (avoiding overflow for simple draws)
and from a flat 512KB headroom to a draw-proportional formula.
Set tile_allocation_initial_block_size and tile_allocation_block_size
in all TILE_BINNING_MODE_CFG emissions and update the
TILE_LIST_INITIAL_BLOCK_SIZE packets to match.
Benchmarked on RPi5 (V3D 7.1) with GfxBench Vulkan Aztec Ruins at
1920x1040. Average tile_alloc BO size dropped 75% (535 KB to 132 KB)
with 20% fewer OOM events (521 to 417) and no FPS regression.
This avoids exhausting GPU memory when multiple blit or fill jobs
are batched in the same command buffer. Each batched job no longer
carries the flat 512 KB of tile_alloc headroom, which greatly reduces
the memory footprint.
Reviewed-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
Replace the inline tile_alloc/TSDA sizing in alloc_tile_state() with a
call to the new v3d_tile_alloc_sizes() helper. This switches from 64B
to 128B initial tile alloc blocks (avoiding overflow for simple draws)
and from a flat 512KB headroom to a draw-proportional formula.
Set tile_allocation_initial_block_size and tile_allocation_block_size
in all TILE_BINNING_MODE_CFG emissions and update the
TILE_LIST_INITIAL_BLOCK_SIZE packet to match.
Reviewed-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
Add V3D_TILE_ALLOC_INITIAL_BLOCK_SIZE = 128 and
V3D_TILE_ALLOC_OVERFLOW_BLOCK_SIZE = 64 to v3d_limits.h.
Corresponding _ENUM macros provide the 2-bit hardware encoding for the
TILE_BINNING_MODE_CFG packets.
The previous implicit 64B initial blocks were too small: a single draw
call emits ~88 bytes of per-tile BCL state, immediately overflowing
into continuation blocks. 128B initial blocks avoid the first
continuation allocation for simple single-draw passes.
Add v3d_tile_alloc_sizes() to v3d_util with the full tile alloc BO and
TSDA sizing logic. It uses the 128B initial blocks, and the tile_alloc
headroom becomes proportional to the number of draws and the size of
the initial blocks allocation, capped at the previous fixed allocation.
Jobs with 0 or 1 draw calls (blits/fills) therefore reduce their
headroom dramatically.
The draw-proportional formula replaces a flat 512 KB continuation pool:
headroom = MIN2((tiles_size * draw_count) / 2, 512 KB)
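
As a rough illustration of the formula above, here is a minimal C sketch. The function name, the cap macro and the parameter types are hypothetical and are not taken from v3d_tile_alloc_sizes(); only the MIN2 expression itself comes from this message.

```c
#include <stdint.h>

/* Hypothetical names: only the formula is taken from the commit message. */
#define HEADROOM_CAP_BYTES (512u * 1024u)
#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* tiles_size: bytes reserved for the initial per-tile blocks.
 * draw_count: number of draw calls recorded in the job.
 * The product is computed in 64 bits so it cannot wrap before the cap
 * is applied. */
static uint32_t
tile_alloc_headroom(uint32_t tiles_size, uint32_t draw_count)
{
   uint64_t proportional = ((uint64_t)tiles_size * draw_count) / 2;
   return (uint32_t)MIN2(proportional, (uint64_t)HEADROOM_CAP_BYTES);
}
```

With an illustrative tiles_size of 64 KB, a job with no draws gets no headroom, a single draw gets 32 KB, and the 512 KB cap is reached at 16 draws.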
Benchmarked on RPi5 (V3D 7.1) against GfxBench GL tests and
apitrace replays at 1080p. Tile-alloc memory reduction versus the
flat 512 KB headroom (taking into account the 256 KB kernel allocation per OOM):
GfxBench (5 benchmarks): -45% to -70% reduction, OOM at or below baseline
Apitrace (19 traces): -4% to -77% reduction on 20/24 traces
No FPS regressions observed on any workload.
Reviewed-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40554>
For Valhall, use the SHADDX instruction for 64-bit integer addition
instead of lowering it to 32-bit operations. The equivalent 32-bit
instruction sequence costs 3 cycles, while SHADDX takes only 2.
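
For context, a 64-bit add built from 32-bit halves generally looks like the C sketch below. This is a generic illustration of the lowering idea (low add, carry detection, high add with carry), not the exact instruction sequence the compiler emits; the function name is hypothetical.

```c
#include <stdint.h>

/* Illustrative only: 64-bit addition composed from 32-bit operations. */
static uint64_t
add64_lowered(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi)
{
   uint32_t lo = a_lo + b_lo;          /* low half, wraps on overflow */
   uint32_t carry = lo < a_lo;         /* 1 if the low add wrapped */
   uint32_t hi = a_hi + b_hi + carry;  /* high half plus carry */
   return ((uint64_t)hi << 32) | lo;
}
```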
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40841>
va_mark_last currently expects 64-bit registers to always be split in
two. This commit changes it to first check whether a 64-bit register is
split.
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40841>
According to the GL_EXT_multisampled_render_to_texture specification,
copy operations should be allowed when the extension is supported.
Previously, glCopyTexImage* would unconditionally fail with
GL_INVALID_OPERATION when copying from any multisampled framebuffer
(samples > 0), even when using render-to-texture attachments.
Fixes: d7b9da2673 ("mesa/main: fix artifacts with GL_EXT_multisampled_render_to_texture")
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Signed-off-by: Wujian Sun <wujian.sun_1@nxp.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40863>
When round_to_nearest_even is enabled with NEAREST filtering, texture
coordinates near texel boundaries (e.g. 0.9999999404) can be incorrectly
rounded up to the next texel instead of being floor()'d.
According to OpenCL spec section 8.2, for CLK_FILTER_NEAREST:
i = address_mode((int)floor(u))
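
The difference can be sketched in C as follows. The helper names are hypothetical, the address mode step is omitted, and the sketch assumes the rounding is applied directly to the coordinate. The helpers are written out by hand to keep the example dependency-free; floorf() and rintf() from <math.h> would behave the same for these inputs.

```c
/* floor() for values that fit in an int. */
static int
floor_to_int(float u)
{
   int i = (int)u;                     /* truncates toward zero */
   return ((float)i > u) ? i - 1 : i;  /* step down for negative fractions */
}

/* Round to nearest integer, ties to even. */
static int
round_nearest_even_to_int(float u)
{
   int lo = floor_to_int(u);
   float frac = u - (float)lo;
   if (frac > 0.5f)
      return lo + 1;
   if (frac < 0.5f)
      return lo;
   return (lo % 2 == 0) ? lo : lo + 1; /* exact tie: pick the even one */
}
```

For u = 0.9999999404f, floor_to_int() selects texel 0 while round_nearest_even_to_int() selects texel 1, which is the off-by-one described above.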
Backport-to: *
Signed-off-by: Eric Guo <eric.guo@nxp.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40912>
Allocate space to hold 256 wait fences. Since there is only one queue
per IP per process, no app or windowing system is expected to have a
large number of job dependencies / wait fences.
If an app does have more than 256 wait fences, there won't be
corruption issues since the kernel will wait for the extra fences.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40698>
AMD APUs hit this case: they have very little discrete VRAM but a lot
of staging memory, which can be used in addition.
Fixes: 7487ac2046 ("rusticl/device: support query_memory_info to retrieve available memory")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30123>
If a single triangle is selected, the next iteration may be unable to
merge any pair with that triangle. In that case, the HW node with a
single triangle will not have the highest hw_node_index, triggering an
assert.
Fixes: c18a7d0 ("radv: Emit compressed primitive nodes on GFX12")
Reviewed-by: Natalie Vock <natalie.vock@gmx.de>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39655>
Required for SM6.6 in vkd3d-proton and used in a number of UE5 titles.
On the descriptor side, R64 images are R32G32_UINT. To get a
storage_descriptor, the early return for formats that don't support
rendering has to move after the storage_descriptor setup.
Passes vkd3d-proton test:
test_shader_sm66_64bit_atomics
CTS tests:
dEQP-VK.image.atomic_operations.*.r64*
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39932>
Initial data is loaded via bindless_image_load, atomic swap via
bindless_image_atomic_swap.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39932>