As each anv_device has its own address space it was necessary create
one dummy_aux_bo per anv_device.
Also this workaround requires us to disable the
buffer_length_in_aux_addr optimization, that is done in the physical
device creating because isl_dev of physical device is copied
to isl_dev in anv_device.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29619>
This workaround ask us to set a dummy aux address to all
SURFTYPE_BUFFERs with AuxiliarySurfaceMode == AUX_NONE.
It also says that the same dummy aux address can be reused acrsoss all
buffers.
So here adding dummy_aux_address to isl_device, ANV and Iris will
set a value to when running a in a GPU affected.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29619>
7da5b1caef ("anv: move trtt submissions over to the anv_async_submit")
added a hard dependency on timeline semaphore which is still optional.
And since it gates the sparseBinding feature, we should not use it if
sparseBinding is not enabled.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 7da5b1caef ("anv: move trtt submissions over to the anv_async_submit")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29779>
We no longer need to reserve registers for constructing spill/fill
messages. We have split sends and construct message headers in new
temporary registers with a very short lifespan which are simply added
to the existing interference graph as new nodes and allocated via the
normal mechanism.
This means that when we need to spill for the first time, we can avoid
discarding and recomputing the entire interference graph. We also avoid
needing to recreate all spill candidate information once ra_allocate()
fails, because the graph remains valid, and none of the existing nodes
had any changes to their interference. The existing spill candidates
remain valid.
This will slightly help improve compile time when needing to spill.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25811>
Instead of reserving a register to contain the spill header, which
gets marked live for the entire program, we can just emit the ALU
instructions to build it on the fly. (This is similar to the way
we handle scratch on Alchemist with the newer LSC data port.)
There are a couple of downsides that make this not obviously a win.
First, in order to construct the scratch header on Gfx9-12, we have
to use fields from g0, which will have to remain live anywhere that
scratch access is required. This could negate the register pressure
benefits of creating the header on the fly. However, g0 is oft used
in other places anyway, so it may already be there. Another is that
it's a non-trivial number of ALU instructions to construct the value.
Still, trading lower pressure (so fewer spills, less memory access
and stalls) for more cheap ALU seems like it ought to be a win.
There is another valuable benefit: by not reserving a register, we
eliminate the need to reconstruct the interference graph. (The next
patch will actually do so.)
shader-db on Icelake shows spills/fills at 54/53 helped, 4/10 hurt,
and an 8% increase in ALU on affected shaders. Synmark's OglCSDof
(a benchmark that spills) performance remains the same on Alderlake.
fossil-db on Icelake shows a 5.6%/5.1% reduction in spills/fills and a
4% reduction in scratch memory size on affected shaders. Instruction
counts go up by 11.07%, but cycle estimates only increase by 0.57%.
Assassin's Creed Odyssey and Wolfenstein Youngblood both see 20-30%
reductions in spills/fills, a significant improvement.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25811>
The intention of this block was to set one of the flags that is used
to select a PAT index but this was doing more than that.
It was promoting WB+0 way coherency BOs to WC+1 way coherency possibly
causing regression in platforms without LLC.
anv_device_get_pat_entry() return WC/writecombining if no flags is
set so we don't need this block after all.
Reported-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Fixes: a65e982b44 ("anv: Split ANV_BO_ALLOC_HOST_CACHED_COHERENT into two actual flags")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29769>
Change so the size rounds up to the next multiple of the horizontal stride like
is done for VGRF. This was causing an inconsistency in regs_read() -- The original
component_size() calculation for FIXED_GRF excluded any padding at the end but it was
still being discounted by regs_read().
Suggested by Curro.
Fixes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11069
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29736>
Completely skip the stall & programming if the bindless address has
not changed. Only on Gfx12.5+ since previous generations also program
the binding table pool base address through STATE_BASE_ADDRESS.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29595>
In a following change the predicate registers might be used when
flushing the state.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29595>
The mi-builder already takes care of mi write/read fences, but we have
a few cases in Anv where we also need to fence mi-write ->
shader-read.
We also have one case where a command buffer jump address is modified
by a previous mi write command.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29595>
The enums were mixed up. Code was working because they were being
used only for their numerical values.
Fixes: e666872c75 ("intel/compiler: Initial bits for DPAS instruction")
Acked-by: Iván Briano <ivan.briano@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29762>
Copy propagation often eliminates all uses of an instruction. If we
detect that we've done so, we can eliminate the instruction ourselves
rather than leaving it hanging until the next DCE pass.
This saves some CPU time as other passes don't see dead code.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
The new def-based pass works better in many cases, and should be less
resource intensive. However, the limited visibility of the defs-based
pass due to many values not being SSA yet makes it unable to fully
replace the old pass. Try the new one, and if it can't make progress,
then try the old one. That way, things will mostly be handled by the
new pass, but everything that was being cleaned up still will be.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
While the limited visibility due to partial SSA is a downside to the new
pass, it has a huge number of advantages that make it worth switching
over even now. It's much more efficient, can eliminate redundant memory
loads across blocks, and doesn't generate loads of unnecessary copies
that other passes have to clean up. This means we also eliminate the
infighting between the old CSE, coalescing, and copy propagation passes.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
This has a number of advantages compared to the pass I wrote years ago:
- It can easily perform either Global CSE or block-local CSE, without
needing to roll any dataflow analysis, thanks to SSA def analysis.
This global CSE is able to detect and coalesce memory loads across
blocks. Although it may increase spilling a little, the reduction
in memory loads seems to more than compensate.
- Because SSA guarantees that values are never written more than once,
the new CSE pass can directly reuse an existing value. The old pass
emitted copies at the point where it discovered a value because it
had no idea whether it'd be mutated later. This led it to generate
a ton of trash for copy propagation to clean up later, and also a
nasty fragility where CSE, register coalescing, and copy propagation
could all fight one another by generating and cleaning up copies,
leading to infinite optimization loops unless we were really careful.
Generating less trash improves our CPU efficiency.
- It uses hash tables like nir_instr_set and nir_opt_cse, instead of
linearly walking lists and comparing each element. This is much more
CPU efficient.
- It doesn't use liveness analysis, which is one of the most expensive
analysis passes that we have. Def analysis is cheaper.
In addition to CSE'ing SSA values, we continue to handle flag writes,
as this is a huge source of CSE'able values. These remain block local.
However, we can simply track the last flag write, rather than creating
entire sets of instruction entries like the old pass. Much simpler.
The only real downside to this pass is that, because the backend is
currently only partially SSA, it has limited visibility and isn't able
to see all values. However, the results appear to be good enough that
the new pass can effectively replace the old pass in almost all cases.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
Like NIR, we print SSA defs as %1, %2, and so on. The number here is
the VGRF number. VGRFs that don't correspond to a SSA def remain
printed as vgrf1, vgrf2, and so on.
This makes it much easier to see what values are SSA and which aren't.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
Even without a full use list, simply tracking the number of uses will
let us tell "this is the only use of the def" or "we've just replaced
all uses of a def". It's inexpensive to calculate and will be useful.
(rebased by Kenneth Graunke)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
This introduces a new analysis pass that opportunistically looks for
VGRFs which happen to satisfy the SSA definition properties.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
Our code to initialize gl_SubgroupInvocation uses multiple instructions
some of which are partial writes. This makes it difficult to analyze
expressions involving gl_SubgroupInvocation, which appear very
frequently in compute shaders.
To make this easier, we add a new virtual opcode which initializes
a full VGRF to the value of gl_SubgroupInvocation. (We also expand
it to UD for SIMD8 so there are not partial write issues.) We then
lower it to the original code later on in compilation, after we've
done the bulk of our optimizations.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
This gathers a number of sources into a contiguous vector register,
typically using LOAD_PAYLOAD. However, it uses MOV for a single source.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
These allow avoiding dead-locks in non-compliant applications that
execute barriers under non-uniform control flow. They're not expected
to have any major disadvantage so let's enable them unconditionally.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29562>
See also HSDES#14015504893 regarding the region-based tessellation
redistribution feature which allows fine-tuning the number of regions
per patch. This sets it to the recommended value, since region-based
redistribution is enabled by default.
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29562>
This flag is mostly redundant with uses_discard and was only
introduced to implement demote with LLVM when it didn't have
that intrinsic.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27617>
The semantics of discard differ between GLSL and HLSL and
their various implementations. Subsequently, numerous application
bugs occurred and SPV_EXT_demote_to_helper_invocation was written
in order to clarify the behavior. In NIR, we now have 3 different
intrinsics for 2 things, and while demote and terminate have clear
semantics, discard still doesn't and can mean either of the two.
This patch entirely removes nir_intrinsic_discard and
nir_intrinsic_discard_if and replaces all occurences either with
nir_intrinsic_terminate{_if} or nir_intrinsic_demote{_if} in the
case that the NIR option 'discard_is_demote' is being set.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27617>
Previously, we left it NULL until later in the compile. However, some
builder helpers are starting to check the options and they blow up when
options == NULL.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27617>
The MCS region to ambiguate needs to shift 4KB from its
starting address. The first 4KB is reserved for hardware.
Signed-off-by: Jianxun Zhang <jianxun.zhang@intel.com>
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28919>
The scaledown rectangle of MSAA fast clear on Xe2 is 8 times
in X and 2 in Y dimension of previous platforms.
Absorb refactoring change suggested by
Nanley Chery <nanley.g.chery@intel.com>
Signed-off-by: Jianxun Zhang <jianxun.zhang@intel.com>
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28919>
MOV_INDIRECT picks one lane from the src[0] and moves it to all lanes
in the destination. Even if we split the instruction, src[0] should
remain identical.
Noticed this while trying to use this instruction in SIMD32. All
current use cases are limited to SIMD8 shaders (or SIMD16 on Xe2). Or
maybe in SIMD32 but with a uniform src[0]. That's we think we've never
seen the issue so far.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28036>
Since we don't need to share that data with other fixed functions.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29571>