For implementing image atomics (and multisample image writes), we need
information about the image layout in the shader. It's a lot nicer to determine
the image layouts on the CPU (where we have ail) and stash the results in the
PBE descriptor, where we have a convenient hole to do so, rather than trying to
do all the layout calculations on the GPU on the fly. Add a data structure that
the driver will fill out and the image atomic lowering will consider as part of
the hardware.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Add a simple API so that decode can be used as a shared library by the
Python hypervisor. Note that this is not thread-safe. If we ever want to
use this in other contexts with thread safety, it will need a refactor
(along with the core decode code anyway).
Signed-off-by: Asahi Lina <lina@asahilina.net>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
We want to plug this library into the hypervisor, but there we don't
have all GPU memory already mapped in our address space. Refactor the
GPU mem read function to always allocate local buffers and copy in the
data there.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Needed for some Metal demos that end up creating multiple queues.
This is still definitely broken/not fully correct, but it at least
gets things working for those.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Sooner or later we were going to need divergent codepaths in decode, and
it looks like now is the time. Add a `params` typedef and pass it
through all the decoder callbacks. This is an alias for
drm_asahi_params_global, but use a typedef so we can change that later
without changing dozens of instances.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Often not needed and makes the NIR harder to read.
shader-db is noise.
total instructions in shared programs: 1752712 -> 1752688 (<.01%)
instructions in affected programs: 8338 -> 8314 (-0.29%)
helped: 21
HURT: 8
Inconclusive result (%-change mean confidence interval includes 0).
total bytes in shared programs: 11943572 -> 11943434 (<.01%)
bytes in affected programs: 56716 -> 56578 (-0.24%)
helped: 21
HURT: 8
Inconclusive result (%-change mean confidence interval includes 0).
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
If we have two 16-bit copies to/from adjacent 16-bit registers, we can instead
use a single 32-bit copy from the 32-bit register pair. Since 32-bit integer
arithmetic is (almost) as efficient as 16-bit on AGX, this (almost) doubles
performance of affected parallel copies.
total instructions in shared programs: 1788606 -> 1788301 (-0.02%)
instructions in affected programs: 17057 -> 16752 (-1.79%)
helped: 150
HURT: 0
Instructions are helped.
total bytes in shared programs: 12196492 -> 12194662 (-0.02%)
bytes in affected programs: 122894 -> 121064 (-1.49%)
helped: 150
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Not meaningfully using more registers since this is just about how we assign
registers after fixing the maximum # of registers used (note that thread count
is unaffected).
total instructions in shared programs: 1790901 -> 1788666 (-0.12%)
instructions in affected programs: 230680 -> 228445 (-0.97%)
helped: 681
HURT: 2
Instructions are helped.
total bytes in shared programs: 12210266 -> 12196852 (-0.11%)
bytes in affected programs: 1634100 -> 1620686 (-0.82%)
helped: 682
HURT: 2
Bytes are helped.
total halfregs in shared programs: 532130 -> 532218 (0.02%)
halfregs in affected programs: 848 -> 936 (10.38%)
helped: 3
HURT: 13
Halfregs are HURT.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
All shaders affected for thread count are in pubg... by chance the allocation
before used fewer registers than the calculated register demand (I guess because
we're conservative with our vector handling) and so got lucky and got higher
thread count. That shader is also helped massively for instructions.
The halfreg change doesn't matter -- we're not actually increasing register
demand, we're just being more choosy about our registers.
total instructions in shared programs: 1799738 -> 1790901 (-0.49%)
instructions in affected programs: 306081 -> 297244 (-2.89%)
helped: 889
HURT: 14
Instructions are helped.
total bytes in shared programs: 12263290 -> 12210266 (-0.43%)
bytes in affected programs: 2150966 -> 2097942 (-2.47%)
helped: 889
HURT: 14
Bytes are helped.
total halfregs in shared programs: 531981 -> 532130 (0.03%)
halfregs in affected programs: 1925 -> 2074 (7.74%)
helped: 0
HURT: 26
Halfregs are HURT.
total threads in shared programs: 18885184 -> 18884224 (<.01%)
threads in affected programs: 13440 -> 12480 (-7.14%)
helped: 0
HURT: 15
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Common logic the next few patches will use to try to assign something to the
same register as something else. "If it's already been assigned a register and
that register is free now, use it, otherwise bail."
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
We need to set a particular bit on atomics for them to be coherent across
clusters. Fixes atomics on G13X.
Setting this bit on the single-cluster G13G, on the other hand, wedges the GPU.
So best be careful ;-)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
Fixes KHR-GLES31.core.draw_indirect.advanced-twoPass-transformFeedback-arrays
and KHR-GLES31.core.draw_indirect.advanced-twoPass-transformFeedback-elements
on M1 Ultra (G13D). Let's assume that same bits are required on M1 Pro
and Max.
Signed-off-by: Janne Grunau <j@jannau.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
This lets us force small tiles when they otherwise would not be
necessary, which is useful for decoupling tile size and the logic that
depends on it from things like MSAA and MRT which can trigger small
tiles.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
This requests synchronous TVB growth (instead of split renders). Mostly
for testing at this point.
Only works with newer kernels and the kernel will complain on dmesg for
now.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24258>
calloc'd in the RA, should be freed in the RA. Identified with valgrind.
Fixes: 6b13616cba2 ("agx: Implement vector live range splitting")
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
The GPU ABI requires varyings to be grouped as follows:
- Position
- Smooth shaded fp32
- Flat shaded fp32
- Linear shaded fp32
- Smooth shaded fp16
- Flat shaded fp16
- Linear shaded fp16
- Point size
Use the flat shaded mask info we now have in the vertex shader key to
sort things properly, and pass the counts to the hardware.
FP16 is still TODO.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
We need this information in order to arrange varyings properly, which
means we need shader variants. Add this to the shader key, taking the
value from the FS input info.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
We can't attempt to access the fs union member if this is not a FS.
That worked so far since there wasn't a VS shader key at all, but we're
about to introduce one.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
We need to propagate shading model metadata from the FS to the VS in
order to correctly lay out the uniforms in the right order. This means
we need VS variants depending on this data.
We could use the existing shader info structure, but that applies to
compiled shaders which would introduce a dependency from the VS compile
to the FS compile. This information does not change with FS variants, so
we can introduce an agx_uncompiled_shader_info structure and gather it
early at precompilation time.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
Flat/goraud/linear and 32/16 need to be specified separately. This
change identifies the new fields but should be a functional no-op.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
This especially helps with image stores, where we otherwise insert a bunch of
pointless moves to collect a vector even when we know the format only has a
single channel.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
It's used for all PBE operations, including regular image writes, so use the
more general name. Compare the powervr driver.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
Sometimes it's nice to have boolean flags with ? in the name, allow this by
stripping ? when generating the sanitized C name.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
An offset may be negative, indexing backwards from the array base.
When we right shift an offset by the format shift, we need to use a
signed shift to ensure that the resulting offset is still negative.
Fixes Nautilus faults/pink crashes.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
In 97a1bbeaf26 ("agx: Fix discards"), we made our discard lowering very simple,
since we had just discovered the underlying instruction behaviour and needed a
hotfix for misrendering in the wild. Now that we understand the behaviour, we
can do better. There are two potential performance issues with the lowering in
that commit:
1. It generates extra sample_mask instructions. For a shader that has a single
discard_if at root level, it would generate two instructions
sample_mask foo, 0
sample_mask ~0, ~0
rather than a single
sample_mask ~0, ~foo
2. It runs depth/stencil testing/updates at the end of the shader, even when it
could be run immediately after the discard. This might cause pipeline stalls.
The solution is to insert the "trigger testing" sample_mask instruction as soon
after the "discard" instruction as possible, fusing them if they would be next
to each other. There are two cases:
1. The last discard is executed unconditionally. In this case, we can test
immediately after, unconditionally, and fuse together.
2. The last discard is executed conditionally. In this case, we test in the
first unconditional block after the discard. Example shader:
...
loop {
if .. {
loop {
discard_if <-- discard here
...
}
..
}
...
}
<---- we test here
...
store_output
Together this covers all the usual patterns for single-sampled discard. We could
still do better with multisampling, but whatever.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
When lowering discards, it will be convenient to generate the pattern:
(cond ? 255 : 0) ^ 255
Add rules to optimize that to
(cond ? 0 : 255)
This is not part of the main algebraic optimizer since this lowering happens
late.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23998>
The SSA killer feature is that, under an "optimal" allocator, the number of
registers used (register demand) is *equal* to the number of registers required
(register pressure, the maximum number of variables simultaneously live at any
point in the program). I put "optimal" in scare quotes, because we don't need to
use the exact minimum number of registers as long as we don't sacrifice thread
count or introduce spilling, and using a few extra registers when possible can
help coalesce moves. Details-shmetails.
The problem is that, prior to this commit, our register allocator was not
well-behaved in certain circumstances, and would require an arbitrarily large
number of registers. In particular, since different variables have different
sizes and require contiguous allocation, in large programs the register file may
become fragmented, causing the RA to use arbitrarily many registers despite
having lots of registers free.
The solution is vector live range splitting. First, we calculate the register
pressure (the minimum number of registers that it is theoretically possible to
allocate successfully), and round up to the maximum number of registers we will
actually use (to give some wiggle room to coalesce moves). Then, we will treat
this maximum as a *bound*, requiring that we don't use more registers than
chosen. In the event that register file fragmentation prevents us from finding a
contiguous sequence of registers to allocate a variable, rather than giving up
or using registers we don't have, we shuffle the register file around
(defragmenting it) to make room for the new variable. That lets us use a
few moves to avoid sacrificing thread count or introducing spilling, which is
usually a great choice.
Android GLES3.1 shader-db results are as expected: some noise / small
regressions for instruction count, but a bunch of shaders with improved thread
count. The massive increase in register demand may seem weird, but this is the
RA doing exactly what it's supposed to: using more registers if and only if they
would not hurt thread count. Notice that no programs whatsoever are hurt for
thread count, which is the salient part.
total instructions in shared programs: 1781473 -> 1781574 (<.01%)
instructions in affected programs: 276268 -> 276369 (0.04%)
helped: 1074
HURT: 463
Inconclusive result (value mean confidence interval includes 0).
total bytes in shared programs: 12196640 -> 12201670 (0.04%)
bytes in affected programs: 1987322 -> 1992352 (0.25%)
helped: 1060
HURT: 513
Bytes are HURT.
total halfregs in shared programs: 488755 -> 529651 (8.37%)
halfregs in affected programs: 295651 -> 336547 (13.83%)
helped: 358
HURT: 9737
Halfregs are HURT.
total threads in shared programs: 18875008 -> 18885440 (0.06%)
threads in affected programs: 64576 -> 75008 (16.15%)
helped: 82
HURT: 0
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23832>