If the BO handle is greater than 2x what fits inside the current bitmap
size, then we end up overflowing. Make sure to always reallocate to a
large enough bitmap, not just 2x the previous size.
Found while replaying firefox apitraces with looping (which apparently
leaks a ton of objects, but that might just be apitrace).
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
When a bound RT is not written to, we need to force the pass type to
translucent to ensure that this draw does not cull draws that do write
to that RT.
Fixes Inochi2D regression after c24b753378.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Do not apply "nir_lower_fragcolor" in the common code.
This fix a crash on agxv side when a frag shader have SSBO writes.
This is caused by "nir_lower_frag_color" assuming that every
"store_deref" will have a variable backing the
output.
Signed-off-by: Mary <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We need actual analysis to set it properly, and improperly setting it can cause
random data dependency hazards it turns out. Stop setting it. Fixes some flaky
tests with shuffle code inserted.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
This avoids an entire class of bugs with live range splitting. Fixes with
AGX_MESA_DEBUG=demand:
dEQP-GLES31.functional.separate_shader.random.8
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
All affected shaders are in pubg. Presumably, with the new demand calculation,
RA is hitting a higher target thread count at the expense of a little more live
range splitting.
total instructions in shared programs: 1773295 -> 1773310 (<.01%)
instructions in affected programs: 6058 -> 6073 (0.25%)
helped: 0
HURT: 15
Instructions are HURT.
total bytes in shared programs: 11695360 -> 11695450 (<.01%)
bytes in affected programs: 40496 -> 40586 (0.22%)
helped: 0
HURT: 15
Bytes are HURT.
total halfregs in shared programs: 530844 -> 530724 (-0.02%)
halfregs in affected programs: 1785 -> 1665 (-6.72%)
helped: 15
HURT: 0
Halfregs are helped.
total threads in shared programs: 18909440 -> 18910400 (<.01%)
threads in affected programs: 12480 -> 13440 (7.69%)
helped: 15
HURT: 0
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
The mask is based on the format, which can be at most 32-bits per channel. So if
we have 64-bit loads/stores we're still using a 32-bit format with double the
bits set in the mask. This will fix validation fails with spilling.
No shader-db changes.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We can't calculate after since ssa_to_reg[] gets overwritten during live range
splits. Theoretical issue only, but let's fix it while squashing live range
splitting bugs.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We know logical_end instructions are only at the end of the block (validated),
so by changing how we iterate the pass goes from O(instructions) to O(blocks)
which is strictly better.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We can use extr to swap the low and high halves of a 32-bit register in one
instruction.
No shader-db changes, but it reduces xor's on a deqp I'm looking at. Yes, I'm
procrastinating on debugging deqps, how'd you guess?
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
There's no good reason to allow non-immediate nesting values, and this lets us
use the (smaller) mov_imm instruction without special casing. This matches what
Metal produces, so it seems like a good preference.
total bytes in shared programs: 11720338 -> 11717310 (-0.03%)
bytes in affected programs: 2341580 -> 2338552 (-0.13%)
helped: 1385
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Man, this expression was wrong. First of all, raw is 64-bit so our mask needs to
be too. Second, length is in bytes -- not bits -- so we need to multiply by 8 to
get something sensible. In effect, the old wrong expression would always use the
long encoding for ALU instructions... whoops. This particular bug probably goes
back to the very first version of agx_pack...
Massive improvement in code density. Noticed while comparing assembly with the
blob. It's my Saturday, I can pointless optimize if I want to.
total bytes in shared programs: 12175112 -> 11720338 (-3.74%)
bytes in affected programs: 11963800 -> 11509026 (-3.80%)
helped: 16624
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Now that they're in the right blocks, this is easy. Includes an informal proof
and the implementation itself is built around a finite state machine, which
together meant this code worked on its first try :~)
And hey, it's a pointless little instruction saving optimization I've wanted to
do for a while~
Major note is that this HAS to be done after register allocation, since it
doesn't update the control flow graph and would introduce critical edges
if it tried to actually deleted the else block. The intuitive reason for this is
simple: sometimes RA needs to insert instructions into the else block, even if
it was empty in the original NIR, so we always need an else block even if we can
delete it with this pass after RA.
total instructions in shared programs: 1778390 -> 1776725 (-0.09%)
instructions in affected programs: 268459 -> 266794 (-0.62%)
helped: 1013
HURT: 0
Instructions are helped.
total bytes in shared programs: 12185102 -> 12175112 (-0.08%)
bytes in affected programs: 1927524 -> 1917534 (-0.52%)
helped: 1013
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Rather than duplicating the condition. This matches the blob, so is presumably
the most energy-efficient way of expressing the logic.
No shader-db changes.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
According to Dougall's pseudocode, else_icmp operates as:
if r0l == 0:
r0l = n
elif r0l == 1:
if cc.compare(A[thread], B[thread]):
r0l = 0
else:
r0l = 1
exec_mask[thread] = (r0l == 0)
Notice that the comparison only happens when r0l == 1, that is, for threads that
are about to enter the else block. Threads that just executed the if body are
still active (r0l = 0) and skip the comparison. As such, the sources of
else_icmp are only read in the else block, and hence the whole instruction
should be placed in the else block for correctness with respect to live range
splitting.
shader-db is a wash, but shows some improvements due to correctly modelling the
liveness of the condition variable.
total instructions in shared programs: 1778376 -> 1778390 (<.01%)
instructions in affected programs: 14753 -> 14767 (0.09%)
helped: 35
HURT: 39
Inconclusive result (value mean confidence interval includes 0).
total bytes in shared programs: 12185018 -> 12185102 (<.01%)
bytes in affected programs: 101522 -> 101606 (0.08%)
helped: 35
HURT: 39
Inconclusive result (value mean confidence interval includes 0).
total halfregs in shared programs: 531174 -> 531032 (-0.03%)
halfregs in affected programs: 2320 -> 2178 (-6.12%)
helped: 40
HURT: 1
Halfregs are helped.
total threads in shared programs: 18909184 -> 18909440 (<.01%)
threads in affected programs: 1792 -> 2048 (14.29%)
helped: 2
HURT: 0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
NIR->AGX translation relies on bindless handles being vec2 instructions with a
constant first index. Moving the entire vec2 into the preamble would mess this
up, so tell nir_opt_preamble to never do this. It's still allowed to move the
offset calculation into the preamble, if it thinks that's beneficial.
Fixes dEQP-GLES31.functional.shaders.opaque_type_indexing.sampler.*
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We need to:
* properly null out the dest in DCE.
* not assert out when packing with null dest
Fixes potential reg pressure blow up with atomics that don't use their
destinations, though I don't see shader-db changes.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
We have an SR for it, which can save a bit of math. This came up while working
on the spiller.
total instructions in shared programs: 1778396 -> 1778376 (<.01%)
instructions in affected programs: 3036 -> 3016 (-0.66%)
helped: 10
HURT: 3
Instructions are helped.
total bytes in shared programs: 12185182 -> 12185018 (<.01%)
bytes in affected programs: 38640 -> 38476 (-0.42%)
helped: 18
HURT: 2
Bytes are helped.
total halfregs in shared programs: 531218 -> 531174 (<.01%)
halfregs in affected programs: 471 -> 427 (-9.34%)
helped: 6
HURT: 0
Halfregs are helped.
total threads in shared programs: 18909056 -> 18909184 (<.01%)
threads in affected programs: 1280 -> 1408 (10.00%)
helped: 2
HURT: 0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Consumers might not pass through the modifier information in this case.
Fixes XWayland/mutter using dma-buf v4 feedback (though the fact they
try to use implicit modifiers is likely a bug on their end, and will
decrease performance).
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Don't call lower_mediump_io for no16. This is helpful for debugging and soon
driconf-shaming apps with broken precision qualifiers.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
Apps can have pathological use cases where huge resources are shadowed
repeatedly. An app that alternately writes to a resource and then uses
it to draw can create an unbounded amount of shadow BOs.
To fix this, introduce both a maximum resource size for shadowing, and a
maximum cumulative size that resource may be shadowed before we start
flushing readers. The flush path then clears the counter, as does the
happy path where there are no readers left after flushing writers.
Fixes massive memory bloating in Firefox and probably others.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
BOs can be oversized, as they can come from the BO cache. Make sure to
always use the resource layout size, not the BO size, when we need this
for some reason.
This fixes BO shadowing creating overlarge BOs, and also the attachment
size for submissions (probably doesn't matter, but it's more correct now).
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
These should "just" work, promoting the 8-bit channels to 16-bit registers
internally, allowing us to use our 8-bit stores with 8-bit data vectors packed
in 16-bit registers. All other non-conversion ALU gets lowered by the previous
patch, this is just needed for simple things like nir_op_vec of lowered math
passed to a vectorized store.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
These have no real Vulkan or Gallium dependence and are (as such) useful for
both VK and GL without any real change in level of abstraction. Do the code
motion.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24635>
When NV21 lowering with hardware sampling and shader CSC was added, the
incorrect PIPE_FORMAT_G8_B8R8_UNORM was used. That format is supposed to
represent vulkan NV12 instead.
This commit introduces PIPE_FORMAT_R8_B8G8_UNORM, which correctly describes the
gallium mapping for YUV CSC, with R as Y, instead of G as Y.
Fixes: 26e3be513d ("gallium/st: add support for PIPE_FORMAT_NV21 and PIPE_FORMAT_G8_B8R8_420")
Signed-off-by: Italo Nicola <italonicola@collabora.com>
Reviewed-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24266>
YV12 is the same as DRM_FORMAT_YVU420.
We lower it to PIPE_FORMAT_R8_B8_G8_420, which is equivalent to
PIPE_FORMAT_R8_G8_B8_420 with U/V planes swapped.
This is used for hardware that can sample from YUV but need CSC in shader.
Signed-off-by: Italo Nicola <italonicola@collabora.com>
Reviewed-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24266>
This new format is similar to PIPE_FORMAT_G8_B8_R8_420, but with R as Y, G as U
and B as V. The need for two diferent formats here is because gallium maps the
YUV channels differently from vulkan.
Some hardware, e.g. Mali GPUs, can sample from I420 but need CSC in shader,
this patch implements that.
Signed-off-by: Italo Nicola <italonicola@collabora.com>
Reviewed-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24266>