The spec says
VUID-VkQueryPoolCreateInfo-queryCount-02763
queryCount must be greater than 0
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32697>
The spec says
After query pool creation, each query is in an uninitialized state and
must be reset before it is used.
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32697>
Format re-interpretation is no longer a problem with texture views. The
clear color address now points to a clear color that is in the expected
format.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31374>
On gfx12+, the pre-amble and post-amble flushes contain the stalls
necessary to ensure the prior operation is complete. Remove the extra
uses of ANV_PIPE_END_OF_PIPE_SYNC_BIT in post-amble flushes. Also do
this for the pre-amble flushes, though that has no impact, as the
flush application function implicitly adds the bit anyway.
For A750, this improves the TWWH3 trace in the performance CI by 0.52%
(n=2).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31600>
On gfx12+, the post-amble flushes contain the stalls necessary to ensure
the prior operation is complete. Remove the extra uses
of iris_emit_end_of_pipe_sync().
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31600>
Each reg may store a list of conflicting regs. This was handled with a
util_dynarray, but each of those holds an extra pointer to the ra_regs
(which serves as its mem_ctx). Since the usage here is very simple,
handle the array growth manually.
The initial size remains the same as before.
The mem_ctx of each ra_reg was being used to identify the case
in which the list wasn't used. Change to use a bool in the
ra_regs struct instead.
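As a rough illustration of the manual-growth scheme (field names are
illustrative, not the actual ra_regs/ra_reg layout):

    #include <stdlib.h>

    /* Illustrative sketch only. */
    struct conflict_list {
       unsigned *regs;    /* indices of conflicting regs */
       unsigned count;
       unsigned cap;
    };

    static void add_conflict(struct conflict_list *c, unsigned reg)
    {
       if (c->count == c->cap) {
          c->cap = c->cap ? c->cap * 2 : 16;  /* initial size chosen for illustration */
          c->regs = realloc(c->regs, c->cap * sizeof(*c->regs));
       }
       c->regs[c->count++] = reg;
    }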
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25744>
For Intel, looking at a few fossils, the majority of nodes
have more than 32 entries in the list. I'd expect other backends
to have similar numbers.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25744>
Each node stores a list of adjacent nodes. This was handled with a
util_dynarray, but each of those holds an extra pointer to the ra_graph
(which serves as its mem_ctx). Since the usage here is very simple,
handle the array growth manually.
For now, keep using the same initial size the dynarray was using.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25744>
Create a parallel array to hold them. In particular, `spill_cost` is
used at a completely different time than the main node data.
Reduces the `struct ra_node` size to 40 bytes.
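Roughly, the layout change looks like this (illustrative names only, not
the real ra structures):

    /* Illustrative sketch only: hot per-node data stays in one array, the
     * rarely-used spill data moves to a parallel array indexed by the same
     * node number.
     */
    struct node     { unsigned reg_class; unsigned adj_count; };
    struct node_aux { float spill_cost; };

    struct graph {
       struct node     *nodes; /* nodes[i] and aux[i] describe node i */
       struct node_aux *aux;
       unsigned count;
    };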
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25744>
Fast-clears require expensive flushes beforehand and afterwards. The
cost of these flushes is reduced in a series of back-to-back fast-clears,
as no extra fast-clear flushes are required between them. If the ratio
of a command buffer's recorded back-to-back fast-clears to independent
fast-clears falls below 1/2, prevent that command buffer from recording
any further fast-clears.
Averaging two runs of our Factorio trace on an A750 shows a +14.37%
improvement in FPS.
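The heuristic boils down to something like this (counter names are
illustrative, not the actual anv fields):

    #include <stdbool.h>

    /* Illustrative sketch only. */
    static bool allow_more_fast_clears(unsigned back_to_back, unsigned independent)
    {
       /* Once back-to-back / independent drops below 1/2, the flush overhead
        * outweighs the benefit, so stop recording fast-clears in this
        * command buffer.
        */
       return back_to_back * 2 >= independent;
    }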
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32984>
RBBM_PRIMCTR registers are used for different pipeline statistics that can
be queried, but current usage was wrong in some cases. Comments in the
register file are updated, and the per-statistic register index mapping is
updated accordingly.
Fixes on a750:
test_query_pipeline_statistics in vkd3d-proton
arb_query_buffer_object failures in piglit (zink)
Signed-off-by: Zan Dobersek <zdobersek@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32900>
If we don't have a surface descriptor, treat this as a hint from the
application that it will export the surface later.
This matches the Intel driver's behavior.
Reviewed-by: Leo Liu <leo.liu@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32970>
The `sad` instruction (used for iadd3) doesn't support the scalar ALU.
This means we might fall back to non-earlypreamble whenever we use it in
the preamble. Prevent this by emitting it as two adds instead.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32943>
If a polygon offset is set via glPolygonOffset, but the functionality
isn't enabled via glEnable(GL_POLYGON_OFFSET_FILL), the offset must not
be taken into account when computing the sample depth. As the Vivante
hardware does not have a separate enable state, the offset units and
scale must both be set to 0 to keep the sample depth unchanged.
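The resulting state selection is roughly (illustrative helper, not the
actual etnaviv code):

    #include <stdbool.h>

    /* Illustrative sketch only: with no separate hardware enable, force the
     * offset to zero when GL_POLYGON_OFFSET_FILL is disabled.
     */
    static void resolve_polygon_offset(bool offset_fill_enabled,
                                       float app_units, float app_scale,
                                       float *hw_units, float *hw_scale)
    {
       *hw_units = offset_fill_enabled ? app_units : 0.0f;
       *hw_scale = offset_fill_enabled ? app_scale : 0.0f;
    }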
Fixes dEQP-GLES2.functional.polygon_offset.default_enable
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32982>
The spec requires that a cl_kernel doesn't hold references to
its bound kernel arguments.
There is a CL CTS test verifying this, but because the arguments were not
used in the test kernel, a reference was never taken. This will change
with SVM and BDA as we need to know all bound memory objects even if they
aren't directly used in kernels.
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: @LingMan
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32961>
Turnip already had a pipeline stat to indicate whether we were using
early-preamble or not, but there was no way to tell whether there was a
preamble at all. Adding a preamble instruction count tells us not only
whether there is a preamble, but also how big it is.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32977>
This limits the address register to simple cases inside a block.
Validation ensures that the address register is only written once and
read once.
Instruction scheduling makes sure that instructions using the address
register in the generator are not scheduled while there is a use of
the register in the IR.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28199>
We want to reuse the brw::nr field as a virtual address register
identifier, so we can't use brw::file=ARF brw::nr=ADDRESS.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28199>
force_gl_names_reuse is changed to an integer:
-1 means default (currently disabled), 0 means disabled, 1 means enabled.
The names reuse initialization is moved to _mesa_alloc_shared_state ->
_mesa_InitHashTable instead of _mesa_HashEnableNameReuse.
It will be enabled by default.
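The tri-state resolves roughly like this (illustrative helper, not the
exact Mesa code):

    #include <stdbool.h>

    /* Illustrative sketch only: -1 = built-in default, 0 = force off,
     * 1 = force on.
     */
    static bool resolve_names_reuse(int force_gl_names_reuse, bool default_on)
    {
       if (force_gl_names_reuse < 0)
          return default_on;
       return force_gl_names_reuse != 0;
    }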
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32715>
Moving output file creation to coincide with swapchain creation ensures
that only the rendering thread will create/destroy the log file.
Previously, non-rendering processes were stomping on the log file.
Reviewed-by: Caleb Callaway <caleb.callaway@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32814>
Rather than emitting FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD to do block
loads that were cacheline aligned, loading entire cachelines at a time,
we now rely on NIR passes to group, CSE, and vectorize things into
appropriately sized blocks. This means that we'll usually still load
a cacheline, but we may load only 32B if we don't actually need anything
from the full 64B. Prior to Xe2, this saves us registers, and it ought
to save us some bandwidth as well, since the response length can be lowered.
The cacheline-aligning hack was the main reason not to simply call
fs_nir_emit_memory_access(), so now we do that instead, porting yet
one more thing to the common memory opcode framework.
We unfortunately still emit the old FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD
opcode for non-block intrinsics. We'd have to clean up 16-bit handling
among other things in order to eliminate this, but we should in the
future.
fossil-db results on Alchemist for this and the previous patch together:
Instrs: 161481888 -> 161297588 (-0.11%); split: -0.12%, +0.01%
Subgroup size: 8102976 -> 8103000 (+0.00%)
Send messages: 7895489 -> 7846178 (-0.62%); split: -0.67%, +0.05%
Cycle count: 16583127302 -> 16703162264 (+0.72%); split: -0.57%, +1.29%
Spill count: 72316 -> 67212 (-7.06%); split: -7.25%, +0.19%
Fill count: 134457 -> 125970 (-6.31%); split: -6.83%, +0.52%
Scratch Memory Size: 4093952 -> 3787776 (-7.48%); split: -7.53%, +0.05%
Max live registers: 33037765 -> 32947425 (-0.27%); split: -0.28%, +0.00%
Max dispatch width: 5780288 -> 5778536 (-0.03%); split: +0.17%, -0.20%
Non SSA regs after NIR: 177862542 -> 178816944 (+0.54%); split: -0.06%, +0.60%
In particular, several titles see incredible reductions in spill/fills:
Shadow of the Tomb Raider: -65.96% / -65.44%
Batman: Arkham City GOTY: -53.49% / -28.57%
Witcher 3: -16.33% / -14.29%
Total War: Warhammer III: -9.60% / -10.14%
Assassins Creed Odyssey: -6.50% / -9.92%
Red Dead Redemption 2: -6.77% / -8.88%
Far Cry: New Dawn: -7.97% / -4.53%
Improves performance in many games on Arc A750:
Cyberpunk 2077: 5.8%
Witcher 3: 4%
Shadow of the Tomb Raider: 3.3%
Assassins Creed: Valhalla: 3%
Spiderman Remastered: 2.75%
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
The hope here is to replace our backend handling for loading whole
cachelines at a time from UBOs with NIR-based handling, which plays
nicely with the NIR load/store vectorizer.
Rounding down offsets to multiples of 64B allows us to globally CSE
UBO loads across basic blocks. This is really useful. However, blindly
rounding down the offset to a multiple of 64B can trigger anti-patterns
where a single unaligned memory load could have fetched all the necessary
data, but rounding it down splits it into two loads.
By moving this to NIR, we gain more control of the interplay between
nir_opt_load_store_vectorize and this rebasing and CSE'ing. The backend
can then simply load between nir_def_{first,last}_component_read() and
trust that our NIR has the loads blockified appropriately.
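The rebasing itself is roughly this computation (illustrative helper,
not the actual pass):

    /* Illustrative sketch only: rebase a constant UBO byte offset onto a
     * 64B (cacheline) boundary so loads with the same base can be CSE'd,
     * and remember which component of that block the original load starts at.
     */
    static void rebase_ubo_offset(unsigned byte_offset, unsigned *base,
                                  unsigned *first_comp)
    {
       *base = byte_offset & ~63u;              /* round down to 64B */
       *first_comp = (byte_offset - *base) / 4; /* assuming 32-bit components */
    }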
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
This will translate to HDC Constant Cache loads or LSC UGM loads.
On LSC, MEMORY_MODE_UNTYPED would be fine, but for HDC we need to
distinguish between the regular and constant cache data ports.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
The NIR vectorizer may produce block loads with unread trailing
components. Upcoming passes may produce unread leading components
as well. With a bit of finesse, we can skip loading those, and only
bother with the ones we actually need. This can sometimes save us on
loads and MOVs.
v2: Skip this for SLM reads on pre-LSC platforms (caught by Lionel).
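The trimming amounts to this (illustrative helper; the real code derives
the range from nir_def_{first,last}_component_read()):

    /* Illustrative sketch only: given the first and last components actually
     * read, load just that range instead of the full vector.
     */
    static void trim_components(unsigned first_read, unsigned last_read,
                                unsigned *skip_comps, unsigned *num_comps)
    {
       *skip_comps = first_read;                 /* unread leading components */
       *num_comps  = last_read - first_read + 1; /* drop unread trailing ones */
    }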
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
If we pass an immediate, just trivially return that immediate.
This preserves the property that if x was an IMM, emit_uniformize(x)
will also be an IMM, without the need for optimizations to eliminate
unnecessary operations. That way, you can call emit_uniformize() on
a value and still check whether it's constant afterwards.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
We return an immediate for 32-bit constant values, but fall back to
calling get_nir_src() for other values, as 64-bit and even 8-bit
immediates have odd restrictions. We could probably support 16-bit
here without too many issues, but we leave it be for now.
This makes it usable for cases where we'd like to get constants for
32-bit values but where the value may be a different bit-size too.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
We were already skipping unread trailing components, but now we skip
them on both ends.
About -3.5% spills on Shadow of the Tomb Raider on Alchemist (mostly a
wash elsewhere, but it will help additional shaders with later patches).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
HDC doesn't support block loads/stores with sub-DWord (<4B) aligned
offsets, and shared local memory has to use the Aligned OWord Block
messages which require OWord (16B) alignment.
Make the validator detect this case and say no. Also make the lowering
code assert that the alignment is valid as a second line of defense.
LSC has no such restrictions.
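The rule the validator enforces is roughly (illustrative helper, not the
actual brw validation code):

    #include <stdbool.h>

    /* Illustrative sketch only: HDC block messages need at least DWord (4B)
     * aligned offsets, and HDC SLM block access needs OWord (16B) alignment;
     * LSC has no such restriction.
     */
    static bool block_alignment_ok(bool has_lsc, bool is_slm, unsigned align)
    {
       if (has_lsc)
          return true;
       const unsigned required = is_slm ? 16 : 4;
       return align >= required && align % required == 0;
    }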
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>