I did this for AGX for some Blender shader but apparently it's not doing all
that much for Jay!
SIMD16:
Totals from 0 (0.00% of 2647) affected shaders:
SIMD32:
Totals from 8 (0.30% of 2647) affected shaders:
Instrs: 29566 -> 29255 (-1.05%); split: -1.08%, +0.03%
CodeSize: 432672 -> 427408 (-1.22%); split: -1.24%, +0.02%
Number of spill instructions: 799 -> 658 (-17.65%)
Number of fill instructions: 1010 -> 906 (-10.30%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
Braun-Hack has a complex algorithm to insert spills on demand when pressure
exceeds the limit, with fixups along control flow to ensure we spill along each
execution path. Faith implemented a slightly different version in NAK, which is
what Jay did, with some nonobvious tradeoffs between the two.
But actually... why are we doing this at all?
We can alternatively spill immediately after the definition, which guarantees
(by the usual properties of dominance) that the spill executes before any
reloads. Then we don't need any tricky bookkeeping or control flow fixups.
Beyond the simplification, this has several advantages:
* Lower register pressure throughout more of the program. This doesn't affect
the /amount/ of spilling, but it gives RA more freedom so should reduce
shuffling. This in itself probably justifies doing this.
* Less SSA repair needed around memory definitions, which can reduce memory
traffic due to our naive memory definition handling. (I think if we
implemented Braun-Hack properly with CSSA this would be less of a concern, but
whatever.)
* Less reliance on the "no critical edges" property, which will come in handy for
UGPR spilling; this is a yak shave for that.
* No risk of executing the same spill twice due to divergence (if we spill
inside of a divergent IF). This means this commit is probably a better idea
for GPUs than CPUs in practice.
This also has a few disadvantages, which explain why the paper didn't do this:
* If a value only needs to be spilled/filled in conditional control flow, this
executes extra spills. But, spills are cheaper than fills (they burn bandwidth
but have basically no latency since they're stores), so I'm not super
concerned about this corner case.
* If a value is defined in a loop and needs to be spilled due to a use outside
the loop, it spills N times instead of 1. This is a more compelling reason to do
the paper's thing. But we require the input program to be in LCSSA, so this
shouldn't actually happen for us.
* The spill stalls on the definition value. That's probably not a big deal and
the eventual post-RA scheduler should be able to cope.
Overall I think this is a reasonable direction. We can revisit later but I don't
want to add more complexity to the spiller than absolutely necessary, and it's
about to be necessary to add complexity for UGPRs.
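As a rough illustration of the spill-after-definition idea above, here is a
minimal C sketch on a toy straight-line instruction list. The example_* types
and function are hypothetical and are not the Jay or NAK data structures; the
sketch only shows that placing the spill directly after the definition needs no
control flow bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy IR: each instruction either defines a value or spills one. */
enum example_op { EXAMPLE_OP_DEF, EXAMPLE_OP_SPILL };

struct example_instr {
   enum example_op op;
   unsigned value; /* index of the SSA value defined or spilled */
};

/*
 * Copy `in` to `out`, inserting a spill immediately after the definition of
 * every value flagged in `spilled`. `out` must have room for 2 * n entries.
 * Returns the number of instructions written.
 */
static size_t
example_spill_at_defs(const struct example_instr *in, size_t n,
                      const bool *spilled, struct example_instr *out)
{
   assert(in != NULL && spilled != NULL && out != NULL);

   size_t written = 0;
   for (size_t i = 0; i < n; ++i) {
      out[written++] = in[i];

      /* The definition dominates all uses, so a spill placed here runs
       * before any reload of the value. */
      if (in[i].op == EXAMPLE_OP_DEF && spilled[in[i].value]) {
         out[written].op = EXAMPLE_OP_SPILL;
         out[written].value = in[i].value;
         written++;
      }
   }
   return written;
}
```

Each spilled value gets exactly one spill, regardless of where its reloads end
up, which is the property the advantages listed above rely on.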
SIMD16:
Totals from 17 (0.64% of 2647) affected shaders:
Instrs: 250304 -> 249221 (-0.43%); split: -0.44%, +0.01%
CodeSize: 3476640 -> 3461312 (-0.44%); split: -0.45%, +0.01%
Number of spill instructions: 555 -> 223 (-59.82%)
Number of fill instructions: 551 -> 543 (-1.45%)
SIMD32:
Totals from 420 (15.87% of 2647) affected shaders:
Instrs: 1779193 -> 1698683 (-4.53%); split: -4.53%, +0.01%
CodeSize: 25455456 -> 24198416 (-4.94%); split: -4.95%, +0.01%
Number of spill instructions: 36900 -> 14440 (-60.87%)
Number of fill instructions: 36550 -> 35103 (-3.96%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
This saves instructions on Jay. We probably could teach backend predication to
chew thru the mess, but I don't see a reason not to just do this everywhere.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Acked-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42197>
Right now demote_if only works with 32-bit registers, but in NIR it can
also have 16-bit sources, and we have a couple of bugs with those. The first
is during NIR->BIR translation (the h0 swizzle was not set); the second is in
the discard_b32 to discard_f32 lowering (Bifrost has restrictions).
Signed-off-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42197>
Our get_buffer/image_memory_requirements() pad TRANSFER_SRC resources
with V3D_TFU_READAHEAD_SIZE, so allocating the reported requirements of
a resource of exactly maxMemoryAllocationSize failed with
VK_ERROR_OUT_OF_DEVICE_MEMORY.
Accept up to one extra page over the limit: since the allocation size
is page-aligned, that covers any sub-page padding.
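The relaxed limit check can be sketched roughly as follows. This is only an
illustration: the page size constant and the example_* helpers are
hypothetical names, not the driver's actual code, and the sketch assumes the
maximum allocation size is itself page-aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page size for illustration; a real driver would query it. */
#define EXAMPLE_PAGE_SIZE UINT64_C(4096)

/* Round `size` up to the next multiple of the page size. */
static uint64_t
example_page_align(uint64_t size)
{
   assert(size <= UINT64_MAX - (EXAMPLE_PAGE_SIZE - 1));
   return (size + EXAMPLE_PAGE_SIZE - 1) & ~(EXAMPLE_PAGE_SIZE - 1);
}

/*
 * Accept a request whose page-aligned size is at most one page over the
 * advertised limit, so sub-page padding on a maximum-size resource fits.
 */
static bool
example_allocation_size_ok(uint64_t requested, uint64_t max_allocation_size)
{
   /* Keep the arithmetic below free of overflow. */
   assert(max_allocation_size <= UINT64_MAX - 2 * EXAMPLE_PAGE_SIZE);

   const uint64_t limit = max_allocation_size + EXAMPLE_PAGE_SIZE;
   if (requested > limit)
      return false;

   return example_page_align(requested) <= limit;
}
```

With a page-aligned limit, any padding smaller than a page rounds up to exactly
one extra page, so the padded maximum-size request is accepted while anything
larger is still rejected.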
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42179>
This prevents holding open file descriptors after physical devices
are enumerated. This also prevents potential (and unknown) multithreading
issues with the winsys being shared between more than one logical device.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41824>
The winsys will be moved to logical devices. This creates an
ac_drm_device on demand to make the call faster, because otherwise it's
too slow for a function that can be called every frame.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41824>
Also adds dEQP-VK.reconvergence.subgroup_uniform_control_flow_ballot.compute.nesting4.7.10
to the CI skips because it takes more than 5 minutes to produce the following
result, which hits the timeout:
    Test case 'dEQP-VK.reconvergence.subgroup_uniform_control_flow_ballot.compute.nesting4.7.10'..
    NotSupported (No compatible memory type found at vkMemUtil.cpp:652)
Signed-off-by: Simon Perretta <simon.perretta@imgtec.com>
Acked-by: Frank Binns <frank.binns@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41833>
- Split pseudo-instruction legalization into pre/post RA stages.
- Add vote pseudo-op and lowering.
Signed-off-by: Simon Perretta <simon.perretta@imgtec.com>
Acked-by: Frank Binns <frank.binns@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41833>
Replaces the no-op subgroup implementation with a real one, covering
what's needed for VK_SUBGROUP_FEATURE_BASIC_BIT.
The setl instruction will only execute on the first valid instance within
a slot/thread-group (comprising 32 instances/threads, i.e. our subgroup size),
which enables a subgroupElect() implementation.
Instances within a slot execute in lockstep which allows us to continue
discarding subgroup barriers as per the no-op implementation.
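The election semantics described above can be modelled in software roughly as
follows. This is a sketch of the behaviour only, not of the setl instruction or
the compiler lowering; the example_* names are hypothetical, and it assumes
"first valid instance" means the lowest-numbered valid instance in the slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Subgroup size as stated above: 32 instances per slot. */
#define EXAMPLE_SUBGROUP_SIZE 32u

/*
 * Returns true if `instance` is the elected one, i.e. the lowest-numbered
 * instance whose bit is set in `valid_mask`. Invalid instances are never
 * elected, and an empty mask elects nobody.
 */
static bool
example_subgroup_elect(uint32_t valid_mask, unsigned instance)
{
   assert(instance < EXAMPLE_SUBGROUP_SIZE);

   for (unsigned i = 0; i < EXAMPLE_SUBGROUP_SIZE; ++i) {
      if (valid_mask & (UINT32_C(1) << i))
         return i == instance;
   }
   return false;
}
```

Exactly one instance per slot sees true whenever at least one instance is
valid, which is what subgroupElect() requires.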
Signed-off-by: Simon Perretta <simon.perretta@imgtec.com>
Acked-by: Frank Binns <frank.binns@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41833>
There are 12 TI AM62 Starter Kit boards available in the farm for this testing.
Signed-off-by: Robert Mazur <robert.mazur@imgtec.com>
Reviewed-by: Frank Binns <frank.binns@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42172>
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Fixes: 49fb361c0a ("aco: don't emit workgroup-scope p_barrier for single-wave workgroups")
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42214>
Enable getting and using the optimal number of temps instead of the maximum.
Instead of going straight to the maximum amount and then spilling,
register allocation will now first try to allocate with the optimal
number of temps, then try with the maximum, then spill.
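The fallback order can be sketched as below. The example_* names are
hypothetical, and the allocation attempt is replaced by a trivial stand-in
that succeeds when the demand fits the budget; the real allocator is of course
more involved.

```c
#include <assert.h>
#include <stdbool.h>

enum example_ra_outcome {
   EXAMPLE_RA_FIT_OPTIMAL, /* allocated within the optimal temp count */
   EXAMPLE_RA_FIT_MAXIMUM, /* allocated within the maximum temp count */
   EXAMPLE_RA_NEEDS_SPILL, /* neither budget was enough */
};

/* Hypothetical stand-in for one allocation attempt with a given budget. */
static bool
example_try_allocate(unsigned temps_needed, unsigned temp_budget)
{
   return temps_needed <= temp_budget;
}

/* Try the optimal budget first, then the maximum, and only then spill. */
static enum example_ra_outcome
example_allocate(unsigned temps_needed, unsigned optimal_temps,
                 unsigned max_temps)
{
   assert(optimal_temps <= max_temps);

   if (example_try_allocate(temps_needed, optimal_temps))
      return EXAMPLE_RA_FIT_OPTIMAL;

   if (example_try_allocate(temps_needed, max_temps))
      return EXAMPLE_RA_FIT_MAXIMUM;

   return EXAMPLE_RA_NEEDS_SPILL;
}
```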
Signed-off-by: Radu Costas <radu.costas@imgtec.com>
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42078>
Too many boolean variables tracked what is essentially the same register
allocation state, causing unnecessary complexity. Replace them with an enum
stored in a single struct member of the ra_ctx.
Signed-off-by: Radu Costas <radu.costas@imgtec.com>
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42078>
Adds debug variables to control forced spilling and to disable optimal allocation.
Signed-off-by: Radu Costas <radu.costas@imgtec.com>
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42078>