On the last iteration of the loop, `offset` will point to the location
just beyond the last instruction in the program. If the program exactly
fills `p->store` then calling `next_offset()` will read out of bounds.
Instead just let the inner while loop call `next_offset()` one
additional time.
Fixes: a35b9cb625 ("i965: Add annotation data structure and support code.")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12486
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33101>
* GLES3.x is only valid for x <= 2
* The expected error is GLXBadProfileARB, not BadValue
cc: mesa-stable
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33036>
Found on simulation, complaining about SIMD32 shaders enabled when
using MSAA 16x.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30753>
there's nothing NIR specific here and these routines will be useful otherwise.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Jesse Natalie <jenatali@microsoft.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33067>
instead of relying on an implicit value which doesn't make much sense.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Jesse Natalie <jenatali@microsoft.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33067>
Update expectation files for the test
runs with kernel 6.13-rc4.
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Reviewed-by: David Heidelberg <None>
Reviewed-by: Sergi Blanch Torné <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32788>
Move to 6.13-rc4 for all mesa-ci jobs except anv-jsl.
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Reviewed-by: David Heidelberg <None>
Reviewed-by: Sergi Blanch Torné <None>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32788>
The Xe uAPI is designed to use bind queues such that binds without input
dependencies (sync objects) do not block on binds with input
dependencies.
For example:
- Bind A (sparse) is submitted with a list of input dependencies.
- Bind B (immediate) is subsequently submitted without a list of input
dependencies.
If Bind A and Bind B share a single bind queue, Bind B will not be
scheduled until Bind A completes. Using individual bind queues decouples
Bind A and Bind B, allowing Bind B to make immediate progress.
This change creates a separate bind queue for each ANV queue, enabling
support for sparse bindings that may have input dependencies.
v2:
- Bail on bind queue creation failure (Linoel)
- Only create bind queue if VK_QUEUE_SPARSE_BINDING_BIT is set (Jose)
v3:
- Add comment around submit->queue usage (Jose)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32873>
Format re-interpretation is no longer a problem with texture views. The
clear color address now points to a clear color that is in the expected
format.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31374>
On gfx12+, the pre-amble and post-amble flushes contain the stalls
necessary to ensure the prior operation is complete. Remove the extra
uses of ANV_PIPE_END_OF_PIPE_SYNC_BIT in post-amble flushes. Also do
this for the pre-amble flushes, but this doesn't have any impact. The
flush application function will implicitly add the bit.
For A750, this improves the TWWH3 trace in the performance CI by 0.52%
(n=2).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31600>
Fast-clears require expensive flushes beforehand and afterwards. The
cost of flushes are decreased in a series of back-to-back fast-clears as
no extra fast-clear flushes are required in between them. If the ratio
of a command buffer's recorded back-to-back fast clears over independent
fast-clears falls below 1/2, prevent that command buffer from recording
any further fast-clears.
Averaging two runs of our Factorio trace on an A750 shows a +14.37%
improvement in FPS.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32984>
This limits the address register to simple cases inside a block.
Validation ensures that the address register is only written once and
read once.
Instruction scheduling makes sure that instructions using the address
register in the generator are not scheduled while there is an usage of
the register in the IR.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28199>
We want to reuse the brw::nr field as a virtual address register
identifer. So we can't use brw::file=ARF brw::nr=ADDRESS.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28199>
Rather than emitting FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD to do block
loads that were cacheline aligned, loading entire cachelines at a time,
we now rely on NIR passes to group, CSE, and vectorize things into
appropriately sized blocks. This means that we'll usually still load
a cacheline, but we may load only 32B if we don't actually need anything
from the full 64B. Prior to Xe2, this saves us registers, and it ought
to save us some bandwidth as well as the response length can be lowered.
The cacheline-aligning hack was the main reason not to simply call
fs_nir_emit_memory_access(), so now we do that instead, porting yet
one more thing to the common memory opcode framework.
We unfortunately still emit the old FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD
opcode for non-block intrinsics. We'd have to clean up 16-bit handling
among other things in order to eliminate this, but we should in the
future.
fossil-db results on Alchemist for this and the previous patch together:
Instrs: 161481888 -> 161297588 (-0.11%); split: -0.12%, +0.01%
Subgroup size: 8102976 -> 8103000 (+0.00%)
Send messages: 7895489 -> 7846178 (-0.62%); split: -0.67%, +0.05%
Cycle count: 16583127302 -> 16703162264 (+0.72%); split: -0.57%, +1.29%
Spill count: 72316 -> 67212 (-7.06%); split: -7.25%, +0.19%
Fill count: 134457 -> 125970 (-6.31%); split: -6.83%, +0.52%
Scratch Memory Size: 4093952 -> 3787776 (-7.48%); split: -7.53%, +0.05%
Max live registers: 33037765 -> 32947425 (-0.27%); split: -0.28%, +0.00%
Max dispatch width: 5780288 -> 5778536 (-0.03%); split: +0.17%, -0.20%
Non SSA regs after NIR: 177862542 -> 178816944 (+0.54%); split: -0.06%, +0.60%
In particular, several titles see incredible reductions in spill/fills:
Shadow of the Tomb Raider: -65.96% / -65.44%
Batman: Arkham City GOTY: -53.49% / -28.57%
Witcher 3: -16.33% / -14.29%
Total War: Warhammer III: -9.60% / -10.14%
Assassins Creed Odyssey: -6.50% / -9.92%
Red Dead Redemption 2: -6.77% / -8.88%
Far Cry: New Dawn: -7.97% / -4.53%
Improves performance in many games on Arc A750:
Cyberpunk 2077: 5.8%
Witcher 3: 4%
Shadow of the Tomb Raider: 3.3%
Assassins Creed: Valhalla: 3%
Spiderman Remastered: 2.75%
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
The hope here is to replace our backend handling for loading whole
cachelines at a time from UBOs into NIR-based handling, which plays
nicely with the NIR load/store vectorizer.
Rounding down offsets to multiples of 64B allows us to globally CSE
UBO loads across basic blocks. This is really useful. However, blindly
rounding down the offset to a multiple of 64B can trigger anti-patterns
where...a single unaligned memory load could have hit all the necessary
data, but rounding it down split it into two loads.
By moving this to NIR, we gain more control of the interplay between
nir_opt_load_store_vectorize and this rebasing and CSE'ing. The backend
can then simply load between nir_def_{first,last}_component_read() and
trust that our NIR has the loads blockified appropriately.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>
This will translate to HDC Constant Cache loads or LSC UGM loads.
On LSC, MEMORY_MODE_UNTYPED would be fine, but for HDC we need to
distinguish between the regular and constant cache data ports.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32888>