When lower_simd_width() encounters an instruction that needs a larger
SIMD width (for example, SHADER_OPCODE_TXS_LOGICAL in Gfx4 needs at
least SIMD16), the builder needs to be at least as wide as max_width,
otherwise the group() setup will assert.
It turns out this did not assert before "by accident": the code was
relying on the default fs_visitor builder, which had a dispatch width of
64, a bogus placeholder value that was not expected to be used.
However, when we changed the code to remove that builder (and the bogus
value), we created a new builder in the pass using the shader's
dispatch_width, which works fine except in the case where we want to
"lower" the SIMD above the shader dispatch width. The fix is to also
consider the already calculated max_width when creating the builder.
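As a rough illustration of the idea (the names below are placeholders,
not the actual fs_builder API), the builder width is effectively chosen
as the maximum of the two values so that splitting into max_width groups
can never exceed it:

    #include <assert.h>

    /* Hypothetical sketch: the real code constructs an fs_builder, but the
     * underlying requirement is only that its width covers max_width.
     */
    static unsigned
    builder_width(unsigned dispatch_width, unsigned max_width)
    {
       /* e.g. SIMD8 dispatch, but the instruction requires SIMD16 */
       unsigned width = dispatch_width > max_width ? dispatch_width : max_width;
       assert(width >= max_width);   /* what the group() setup checks */
       return width;
    }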
Fixes: 5b8ec015f2 ("intel/compiler: Don't use fs_visitor::bld in remaining places")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10338
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27782>
We first generate the logical opcodes, and these days fully lower them
to SHADER_OPCODE_SEND. In the past, we lowered to a non-logical variant
and handled that in the generator. More recently, we were just using the
non-logical opcodes as an awkward intermediate step during the
lowering... which isn't really necessary at all.
This patch eliminates them by using the original logical opcodes.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27908>
We're not really doing any fancy texturing here, just emitting a TEX
instruction that writes multiple destination registers. I plan to
remove the non-logical TEX instruction in the next commit, so we swap
these over to use the logical version instead. It should work just as
well for the purposes of the test.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27908>
Unaligned stores are undefined behavior according to the C rules, hence
address sanitizers may complain. Even though this worked fine in
practice (it's almost impossible here for the compiler to do something
"wrong" even if it assumes the store is aligned, given such stores work
just fine on x86), we should follow the rules.
The widely accepted solution for this nowadays (it may be somewhat
surprising that there is no way to express an unaligned assignment
explicitly in C) is to just use memcpy(). The compiler should figure
out, at least with optimizations enabled, that it's just a trivial store
and optimize it back to a single CPU instruction, while still satisfying
asan. (I've verified that even in debug builds the memcpy() is actually
optimized away anyway; I suspect some compiler flag somewhere forces
this behavior.)
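A minimal self-contained example of the pattern (the function name is
made up, not the code being changed):

    #include <stdint.h>
    #include <string.h>

    /* Store a 32-bit value to a possibly misaligned address. A plain
     * "*(uint32_t *)dst = value" is undefined behavior when dst is not
     * suitably aligned; memcpy() is well-defined and compilers turn it
     * back into a single store when they can prove it is trivial.
     */
    static inline void
    store_u32_unaligned(void *dst, uint32_t value)
    {
       memcpy(dst, &value, sizeof(value));
    }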
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10208
Reviewed-by: Jose Fonseca <jose.fonseca@broadcom.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27896>
Some people seem to have systems with more than 8 GPUs installed at
once. 256 is the maximum number of devices returned by libdrm, so using
this seems like a good choice for now.
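For illustration, the usual libdrm enumeration pattern with a fixed
upper bound looks roughly like this (the constant and function names are
made up, not the code being changed):

    #include <xf86drm.h>

    #define MAX_DRM_DEVICES 256   /* upper bound on what libdrm returns */

    static int
    count_gpus(void)
    {
       drmDevicePtr devices[MAX_DRM_DEVICES];
       int count = drmGetDevices2(0, devices, MAX_DRM_DEVICES);

       if (count < 0)
          return count;   /* enumeration failed */

       drmFreeDevices(devices, count);
       return count;
    }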
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27901>
Pick the compressed-no-clear aux state instead of aux-invalid state to
reduce ambiguates in iris. Hopefully this will help debug the issues
seen around MCS-enabling on ACM.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27881>
Xe KMD also checks that the cpu_caching set during BO creation matches
the caching of the PAT index set in the VM unbind.
This went unnoticed until now by luck and the lack of testing on MTL.
So here we always set the PAT index for all VM operations that have a
BO associated.
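Roughly, at the uAPI level this means filling pat_index for every bind
op that references a BO, including unbinds. The sketch below uses the Xe
uAPI field names as the author understands them and is an assumption,
not the actual driver code:

    #include <string.h>
    #include "drm-uapi/xe_drm.h"   /* assumed include path for the Xe uAPI */

    /* Hypothetical helper: the point is only that pat_index is set
     * unconditionally whenever a BO is involved, since the KMD validates
     * it against the BO's cpu_caching.
     */
    static void
    fill_bind_op(struct drm_xe_vm_bind_op *op, uint32_t bo_handle,
                 uint64_t addr, uint64_t range, uint16_t pat_index,
                 uint32_t bind_op)
    {
       memset(op, 0, sizeof(*op));
       op->obj = bo_handle;
       op->addr = addr;
       op->range = range;
       op->op = bind_op;            /* e.g. DRM_XE_VM_BIND_OP_UNMAP_ALL */
       op->pat_index = pat_index;   /* always set, even for unbinds */
    }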
v2: (Jose)
- Move pat_index a little bit up
- Copy commit message from iris patch
Fixes: 19439624d9 ("anv: Use DRM_XE_VM_BIND_OP_UNMAP_ALL to unbind whole bos")
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27893>
Xe KMD also checks that the cpu_caching set during BO creation matches
the caching of the PAT index set in the VM unbind.
This went unnoticed until now by luck and the lack of testing on MTL.
So here we always set the PAT index for all VM operations that have a
BO associated.
Fixes: eb18a92ef9 ("iris: Fill PAT fields in Xe KMD gem_create and vm_bind uAPIs")
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27893>
Thanks to Ken for suggesting this URB refactoring change and pointing
out that the LSC can operate at byte offset granularity.
This should fix the geometry shader test cases where we have more than
32 vertices, since previously we were failing to write the correct
control data bits because of an incorrect write mask.
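A small standalone sketch of the indexing involved (the layout
assumption of control data bits packed into DWords and all names are
illustrative, not the actual compiler code):

    #include <stdint.h>

    /* For vertex 'vert' with 'bits_per_vertex' control data bits, find the
     * DWord holding its bits and the byte offset an LSC write can target
     * directly. With more than 32 vertices the bits spill past DWord 0, so
     * the DWord index (and thus the write mask/offset) must be computed
     * per vertex rather than assumed to be 0.
     */
    static inline void
    control_data_location(uint32_t vert, uint32_t bits_per_vertex,
                          uint32_t *dword_index, uint32_t *bit_in_dword,
                          uint32_t *byte_offset)
    {
       uint32_t bit = vert * bits_per_vertex;
       *dword_index = bit / 32;
       *bit_in_dword = bit % 32;
       *byte_offset = *dword_index * 4;   /* LSC addresses bytes directly */
    }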
Shader-db results for Xe2:
total instructions in shared programs: 153475 -> 153437 (-0.02%)
instructions in affected programs: 1374 -> 1336 (-2.77%)
helped: 11
HURT: 0
helped stats (abs) min: 3 max: 5 x̄: 3.45 x̃: 3
helped stats (rel) min: 1.67% max: 4.92% x̄: 3.23% x̃: 2.70%
95% mean confidence interval for instructions value: -3.92 -2.99
95% mean confidence interval for instructions %-change: -4.10% -2.36%
Instructions are helped.
total loops in shared programs: 140 -> 140 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 16002649 -> 16002329 (<.01%)
cycles in affected programs: 9174 -> 8854 (-3.49%)
helped: 11
HURT: 0
helped stats (abs) min: 22 max: 38 x̄: 29.09 x̃: 32
helped stats (rel) min: 2.62% max: 5.54% x̄: 3.78% x̃: 3.85%
95% mean confidence interval for cycles value: -33.56 -24.62
95% mean confidence interval for cycles %-change: -4.48% -3.08%
Cycles are helped.
total spills in shared programs: 52 -> 52 (0.00%)
spills in affected programs: 0 -> 0
helped: 0
HURT: 0
total fills in shared programs: 94 -> 94 (0.00%)
fills in affected programs: 0 -> 0
helped: 0
HURT: 0
total sends in shared programs: 4240 -> 4240 (0.00%)
sends in affected programs: 0 -> 0
helped: 0
HURT: 0
LOST: 0
GAINED: 0
Rework: (Sagar)
- Adjust offset/indirect offset calculation.
- Add shader-db results
- Always calculate dword index
- Drop changes for indirect writes
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27602>
PKT3_SET_PREDICATION is GFX only, even on GFX6.
This fixes recent
dEQP-VK.conditional_rendering.dispatch.*_compute_queue.
Cc: mesa-stable
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27746>
Fold and/or into braa/brao when profitable. Only do this when the and/or
is not used by any non-branch instructions, as folding would otherwise
increase the total instruction count.
Add an algebraic nir pass that applies the inverse of DeMorgan's laws to
try to bring the and/or directly in front of branches. Again, only do
this when the original inot is only used for branches. This should
always decrease instruction count since the extra inots can be folded
into the branch.
Fold inot into branches by using the inv1/inv2 cat0 fields.
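The nir rewrite relies on nothing more than the DeMorgan identities; a
trivial standalone check of the directions used (purely illustrative,
not the actual nir_algebraic rules):

    #include <assert.h>

    int
    main(void)
    {
       /* Rewriting !a && !b as !(a || b) (and !a || !b as !(a && b)) moves
        * the and/or directly in front of the branch, where the remaining
        * inot can be folded away via the inv1/inv2 fields.
        */
       for (int a = 0; a <= 1; a++) {
          for (int b = 0; b <= 1; b++) {
             assert((!a && !b) == !(a || b));
             assert((!a || !b) == !(a && b));
          }
       }
       return 0;
    }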
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27411>
Instead of creating a cmps.s.ne for every use of a predicate, create
just one and insert it after the instruction whose def is tested. This
reduces the number of compares or, when they are folded into bitwise
operations, the number of those operations.
It also decreases register pressure on GPRs by increasing pressure on
predicate registers. This should be preferred in general since at worst,
the predicate register will be spilled to a GPR again.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27411>
On a6xx+, bitwise operations can directly write to predicate registers.
The result will be 1 iff the result of the non-predicate operation would
be non-zero.
When generating instructions that need a predicate source, ir3 will
insert a cmps.s.ne 0 instruction to guarantee a predicate can be
produced. This patch keeps that in place and adds a pass that tries to
optimize useless comparisons away.
Concretely:
- Look through chains of multiple cmps.s.ne instructions and remove all
but the first.
- If the source of the cmps.s.ne can write directly to predicates,
remove the cmps.s.ne.
In both cases, no instructions are actually removed but clones are made,
and we rely on DCE to remove anything that became unused. Note that it's
fine to always make a clone: even in the case where the original
instruction is also used for non-predicate sources (so it won't be
DCE'd), we replaced a cmps.s.ne with another instruction, so this pass
should never increase instruction count.
Note that this pass replaces the double-comparison folding that was
performed by ir3_cp before.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27411>
In principle, validating the result of predicate RA works exactly the
same as for normal RA. There is one slight issue: spilling is
implemented by cloning the instruction that produced the original def,
which might cause different defs to legitimately reach the same source.
For example:
bool b = ...;
if (...) {
    // b gets spilled and reloaded
} else {
    // b is not spilled
}
use b;
The use of b might see different reaching defs. To solve this, RA will
store a pointer to the original def in the instruction's data field.
Validation then uses the original def.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27411>
This would already be caught by another check but would produce a
message that was difficult to interpret. Better to check for it
explicitly.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27411>
Up to now, ir3 only supported one predicate register (p0.x). However,
since a6xx, four predicate registers are available. This patch adds a
register allocator for predicate registers that allows all of them to be
used. The RA also works for older generations with only one register.
The use of p0.x was hard-coded in many places in ir3. This has been
replaced by a new flag, IR3_REG_PREDICATE, to indicate that an SSA value
should be allocated to a predicate register.
The RA uses the standard liveness analysis available in ir3. Using this,
registers are allocated in a single pass over all blocks. For each block
we keep track of currently live defs in the registers. Predicate
destinations allocate a new register and sources take the register from
their def.
The live defs of a block are initialized with the intersection of the
live-out defs of their predecessors: if all predecessors have the same
live-out def in the same register, it is used as live-in. However, we
only do this for defs that are actually live-in according to the
liveness analysis.
This doesn't work for loops: since predecessors from back edges are
processed after their successors, we don't know their live-out state
yet. We solve this by ignoring such predecessors while calculating the
live-in state. When such a predecessor is later processed, we fix up its
live-out state to match what its successor expects by reloading defs if
necessary.
Spilling is implemented by reloading, or rematerializing, the
instruction that produced the def. Whenever we need a new register while
none are available, we simply free one. If the freed def is later needed
again, we clone the original instruction in front of the new use. We
keep track of the original def the reload is cloned from so that
subsequent uses can reuse the reload.
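A toy model of the register book-keeping (all types and names below are
made up; the real pass works on ir3 instructions and records the
original def so reloads can be shared):

    #define NUM_PRED_REGS 4

    struct toy_def {
       int reg;   /* allocated predicate register, or -1 if spilled */
    };

    /* Which def currently occupies each predicate register. */
    static struct toy_def *live[NUM_PRED_REGS];

    /* Give 'def' a register, evicting an arbitrary live def when the file
     * is full. In the real pass an evicted def that is used again later is
     * rematerialized by cloning its producing instruction in front of the
     * new use.
     */
    static int
    allocate_pred_reg(struct toy_def *def)
    {
       for (int i = 0; i < NUM_PRED_REGS; i++) {
          if (!live[i]) {
             live[i] = def;
             return def->reg = i;
          }
       }

       live[0]->reg = -1;   /* "spill": simply forget the current occupant */
       live[0] = def;
       return def->reg = 0;
    }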
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27411>