This is the instruction that the hardware actually supports. Do the rename, use
the more specific accurate model in the IR, and rework the Valhall texturing
code to emit MKVEC.v2i8 instead of MKVEC.v4i8.
Will fix:
dEQP-GLES31.functional.texture.gather.offset_dynamic.implementation_offset.*
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
They are in a different place, but the encoding is otherwise as usual. This will
be required for texture gathers with dynamic offsets.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
Valhall does not have Bifrost's 4-source MKVEC.v4i8. Instead, it has a (somewhat
limtied) 3-source MKVEC.v2i8. The full MKVEC.v4i8 may be lowered to a pair of
MKVEC.v2i8 instructions.
For good code quality on both Bifrost and Valhall, we need to model both
instructions in their full generality.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
This avoids needless variation from Bifrost. While at it, fix the opcode
definition: there are no abs/neg/swizzle modifiers on the signed integer source,
and there's no clamp. However, there are round and infinity modes, like on
Bifrost.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
This will fix:
dEQP-GLES31.functional.shaders.multisample_interpolation.interpolate_at_offset.at_sample_position.default_framebuffer
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
We generate FADD_RSCALE.f32 in our sample variables implementations. Valhall
doesn't have a dedicated FADD_RSCALE.f32 implementation, it should be aliased to
FMA_RSCALE.f32. Handle that alias in isel lowering. This will fix:
dEQP-GLES31.functional.shaders.multisample_interpolation.interpolate_at_offset.*
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
When lowering vars to scratch, we need to be careful with alignment on Valhall,
where packed TLS access must not straddle a 16-byte boundary. Fixes regressions
when enabling indirect access to temps on Valhall.
Fixes: 6761dbf891 ("panfrost: Use packed TLS on Valhall")
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
This was missing the message, breaking UBO-to-push and who-knows-what-else, when
enabling fp16 const buffers.
Fixes: 3dc2095b07 ("pan/bi: Model LD_BUFFER instructions")
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17101>
This pass is super easy to unit test, so we have no excuse not to test
thoroughly. va_mark_last only inserts annotations in a shader without any
annotations, so our test cases are simply annotated shaders. The CASE macro just
has to compare the case against the case with the annotations stripped and added
back with va_mark_last.
In retrospect, I should have used that technique for the flow control insertion
tests too.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
On Valhall, register reads may be marked as "last" [1]. Setting the last flag
promises the hardware that the value of the register is no longer required. This
may enable hardware optimizations. In particular, it may permit the hardware to
avoid register file writes if a write to the marked register is still in the
forwarding buffer. This may improve power efficiency.
In principle, this is trivial: run liveness analysis and mark killed sources,
like we would in an SSA-based register allocator. In practice, there are a few
wrinkles to avoid hazards around staging registers and 64-bit register pairs,
requiring some additional data flow analysis and fix ups. However, nothing here
is particularly "hard", and all the ideas are already in use for the Bifrost
scheduler and the Bifrost/Valhall scoreboard analyses.
[1] In Mesa's compiler, this is called discard for historical reasons.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
This helps "contain the crazy" and avoids special casing BLEND in compiler
passes. The Valhall instruction is roughly the same as its Bifrost counterpart,
as long as we fix up the source order (as we already do for bitwise operations)
everything works out.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
Post-RA liveness relies on the caller updating the live variable with the
results of bi_postra_liveness_ins. It is not automatic, as with regular
liveness. This means ignoring the result of bi_postra_liveness_ins is surely an
error. Mark it as MUST_CHECK to catch that error at compile time.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
Add a unit test for the quirk discovered in the previos commit, because this
will cause flakes (instead of fails) if we get it wrong. Better have a
deterministic fail mode.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
For some unknown reason, waiting for general slots (at least for memory stores)
doesn't work properly on a BARRIER instruction. We need to wait for all general
slots right before issuing the BARRIER in addition to the general wait on the
BARRIER itself. I don't know if this is a hardware bug or some hideous
gate-saving quirk, but I observe the Mali-G78 DDK using the same workaround,
which implies this really is necessary.
Fixes rare flakes in:
dEQP-GLES31.functional.compute.shared_var.work_group_size.float_128_1_1
Note that the flakes from that test are extremely timing dependent. Without this
change, that test is racy but we almost always win the race. Reproducing the
issue reliably requires high system load (e.g. running the CTS in the
background) and simultaneously running that test a large number of times.
Minimal shader-db impact. In particular, no cycle count regressions.
total instructions in shared programs: 2699419 -> 2699458 (<.01%)
instructions in affected programs: 22014 -> 22053 (0.18%)
helped: 2
HURT: 25
helped stats (abs) min: 1.0 max: 1.0 x̄: 1.00 x̃: 1
helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%
HURT stats (abs) min: 1.0 max: 3.0 x̄: 1.64 x̃: 1
HURT stats (rel) min: 0.07% max: 2.82% x̄: 0.69% x̃: 0.49%
95% mean confidence interval for instructions value: 1.01 1.87
95% mean confidence interval for instructions %-change: 0.38% 0.88%
Instructions are HURT.
total cvt in shared programs: 14468.81 -> 14469.42 (<.01%)
cvt in affected programs: 221.33 -> 221.94 (0.28%)
helped: 2
HURT: 25
helped stats (abs) min: 0.015625 max: 0.015625 x̄: 0.02 x̃: 0
helped stats (rel) min: 0.18% max: 0.18% x̄: 0.18% x̃: 0.18%
HURT stats (abs) min: 0.015625 max: 0.046875 x̄: 0.03 x̃: 0
HURT stats (rel) min: 0.10% max: 4.44% x̄: 1.06% x̃: 0.79%
95% mean confidence interval for cvt value: 0.02 0.03
95% mean confidence interval for cvt %-change: 0.57% 1.36%
Cvt are HURT.
total quadwords in shared programs: 1462496 -> 1462528 (<.01%)
quadwords in affected programs: 4632 -> 4664 (0.69%)
helped: 0
HURT: 4
HURT stats (abs) min: 8.0 max: 8.0 x̄: 8.00 x̃: 8
HURT stats (rel) min: 0.35% max: 7.69% x̄: 4.03% x̃: 4.03%
95% mean confidence interval for quadwords value: 8.00 8.00
95% mean confidence interval for quadwords %-change: -2.71% 10.76%
Inconclusive result (%-change mean confidence interval includes 0).
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
Test cases for insert flow are necessarily the reference test cases with the
NOPs stripped out. That means we don't need to duplicate the test bodies.
Deduplicate.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
This definition is a hardware property. It's not specific to the flow control
insertion pass, so move it to common code where other passes can use it.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17091>
Starting with Valhall, the provoking vertex state is specified per-framebuffer
(batch) instead of per-draw. We use the pan_tristate infrastructure to translate
between desktop OpenGL's per-draw semantics to Valhall's per-framebuffer
semantic. This is notably not required for GLES or Vulkan.
If the provoking vertex is unset when the tiler context is generated, it could
be set (incompatibly) later in the batch, and the tiler context's provoking
vertex field would no longer match the framebuffer's. That would violate a
hardware invariant. To ensure that doesn't happen, we make sure to set provoking
vertexes *before* generating the tiler context so it can't change after.
Fixes arb-provoking-vertex-render on Valhall.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17068>
Iterating over a util_sparse_array is very expensive; replace this
with a standard dynarray.
Using the sparse 'nodearray' datastructure instead was tested, but
found to be slower in some cases.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16988>
If we have a combined Z/S image, the image has depth, so we proceed down the
depth path, which does not set clear.s even though there's *also* a stencil
component. Unify the control flow to fix this.
Fixes (among others):
dEQP-VK.api.image_clearing.core.clear_depth_stencil_image.single_layer.d24_unorm_s8_uint_multiple_subresourcerange
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16950>
Rather than generating shaders to clear depth and stencil attachments, run the
rasterizer without a shader and configure the depth/stencil hardware to do the
clear. These settings are known to be efficient on Valhall, presumably the
depth/stencil pipeline on Bifrost is similar enough that it is also the
efficient way there. It's certainly much simpler.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Acked-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16950>
These were removed in an earlier series containing ae77c207e0 ("panvk: Use push
constants for copy shaders"), but the unused variables hung around.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16950>
On Bifrost and newer, blend descriptors are decoupled from render target. That
means we can always use a clear shader reading from blend_descriptor_0 and
specify the desired render target in the sole blend descriptor we pass.
Likewise on Bifrost and newer we don't need blend descriptors when we don't
blend, which is the case for the Z/S clears.
This reduces the number of shaders compiled on startup from 468 to 426.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16950>
The hardware writes one CRC per (effective) tile, the tile size of the CRC
buffer is the same as the configured effective tile size. However, all our CRC
infrastructure assumes 16x16 tiles. In case CRC is used with smaller tiles,
buffer overflows and incorrect rendering are all possible. Don't use CRC at
smaller tile sizes. Note disabling CRC correctly invalidates any bound CRC
buffers.
Fixes: 2e97d7c835 ("panfrost: Transaction elimination support")
Closes: #6332
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16983>
It has a single user -- in a section of code that only runs for MFBD GPUs and
that has already decided whether to use CRCs -- so inlining it simplifies its
definition greatly and may avoid redeciding the CRC setting.
[Note for mesa-stable maintainers: This is not a bug fix but is marked for
backport so the next patch applies cleanly.]
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16983>
info.fs.sidefx considers discard() to be a side effect. That definition is...
dubious at best. It certainly isn't the definition needed for forward pixel
kill. The only reason pixels couldn't be killed by FPK is if the shader has side
effects in the sense of writing to memory. Use that more precise condition so
FPK works more often.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Closes: #5607
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16984>
The only reason for the wrapper was so that we could dummy signal the
semaphore and fence. Now that the WSI code always dos this for us, we
can drop our wrapper.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037>
We've discussed this at length and have agreed that Midgard + Vulkan is DOA, but
have let the code linger. Now it's getting in the way of forward progress for
PanVK... That means it's time to drop the code paths and commit t to not
supporting it.
Midgard is only *barely* Vulkan 1.0 capable, Arm's driver was mainly
experimental. Today, there are no known workloads today for hardware of that
class, given the relatively weak CPU and GPU, Linux, and arm64. Even with a
perfect Vulkan driver, FEX + DXVK on RK3399 won't be performant.
There is a risk here: in the future, 2D workloads (like desktop compositors)
might hard depend on Vulkan. It seems this is bound to happen but about a decade
out. I worry about contributing to hardware obsolescence due to missing Vulkan
drivers, however such a change would obsolete far more than Midgard v5...
There's plenty of GL2 hardware that's still alive and well, for one. It doesn't
look like Utgard will be going anywhere, even then.
For the record: I think depending on Vulkan for 2D workloads is a bad idea. It's
unfortunately on brand for some compositors.
Getting conformant Vulkan 1.0 on Midgard would be a massive amount of work on
top of conformant Bifrost/Valhall PanVK, and the performance would make it
useless for interesting 3D workloads -- especially by 2025 standards.
If there's a retrocomputing urge in the future to build a Midgard + Vulkan
driver, that could happen later. But it would be a lot more work than reverting
this commit. The compiler would need significant work to be appropriate for
anything newer than OpenGL ES 3.0, even dEQP-GLES31 tortures it pretty bad.
Support for non-32bit types is lacklustre. Piles of basic shader features in
Vulkan 1.0 are missing or broken in the Midgard compiler. Even if you got
everything working, basic extensions like subgroup ops are architecturally
impossible to implement.
On the core driver side, we would need support for indirect draws -- on Vulkan,
stalling and doing it on the CPU is a nonoption. In fact, the indirect draw code
is needed for plain indexed draws in Vulkan, meaning Zink + PanVK can be
expected to have terrible performance on anything older than Valhall. (As far as
workloads to justify building a Vulkan driver, Zink/ANGLE are the worst
examples. The existing GL driver works well and is not much work to maintain. If
it were, sticking it in Amber branch would still be less work than trying to
build a competent Vulkan driver for that hardware.)
Where does PanVK fit in? Android, for one. High end Valhall devices might run
FEX + DXVK acceptably. For whatever it's worth, Valhall is the first Mali
hardware that can support Vulkan properly, even Bifrost Vulkan is a slow mess
that you wouldn't want to use for anything if you had another option.
In theory Arm ships Vulkan drivers for this class of hardware. In practice,
Arm's drivers have long sucked on Linux, assuming you could get your hands on a
build. It didn't take much for Panfrost to win the Linux/Mali market.
The highest end Midgard getting wide use with Panfrost is the RK3399 with the
Mali-T860, as in the Pinebook Pro. Even by today's standards, RK3399 is showing
its limits. It seems unlikely that its users in 10 years from now will also be
using Vulkan-required 2030 desktop environment eye candy. Graphically, the
nicest experience on RK3399 is sway or weston, with GLES2 renderers.
Realistically, sway won't go Vulkan-only for a long-time.
Making ourselves crazy trying to support Midgard poorly in PanVK seems like
letting perfect (Vulkan support) be the enemy of good (Vulkan support). In that
light, future developers making core 2D software Vulkan-only (forcing software
rasterization instead of using the hardware OpenGL) are doing a lot more
e-wasting than us simply not providing Midgard Vulkan drivers because we don't
have the resources to do so, and keeping the broken code in-tree will just get
in the way of forward progress for shipping PanVK at all.
There are good reasons, after all, that turnip starts with a6xx.
(If proper Vulkan support only began with Valhall, will we support Bifrost
long term? Unclear. There are some good arguments on both sides here.)
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Acked-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Acked-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16915>
The performance counter layout depends on the number of L2 blocks and the number
of shader cores. It doesn't make a ton of sense to hardcode these into the XML
files. Instead, let's make the coutner offsets in the XML files relative to the
categories (blocks), so we can calculate the offsets of the categories
themselves at runtime based on the computed layout. This fixes performance
counters on Mali-G57 as implemented on MT8192.
There is little code change here, mainly churn from changing the XML definition.
Postprocessing for the XML to make it suitable for Mesa uses Antonio Caggiano's
https://gitlab.freedesktop.org/panfrost/hwc-helper tool.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Antonio Caggiano <antonio.caggiano@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16803>
The number of L2 performance counter blocks equals the number of L2 slices, so
add a query to get this. This information isn't needed by the Mesa driver, so
don't get it in the default device initialization path.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Antonio Caggiano <antonio.caggiano@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16803>
The input is specified in two identical structs, tear that apart.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16916>
Much simpler than creating a UBO and relying on it getting optimized to a push
constant, with possible reordering.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16916>
Otherwise, load_push_constant won't work properly. This could probably be made
to work if we tried hard enough, but we still don't want reordering for internal
(meta) shaders which are layed out deliberately.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16916>
Bifrost supports "fast access uniforms" loaded from a single contiguous buffer.
This maps directly to Vulkan push constants, with some caveats:
* No indirect access. Indirects need to be lowered to a UBO pull.
* Strict alignment requirements. These will be met in practice.
Implement the NIR intrinsic and map it to the native hardware construct.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16916>
NIR has fdiv, and all the NIR backends have to have lower_fdiv set
appropriately already since various passes (format conversions,
tgsi_to_nir, nir_fast_normalize(), etc.) might generate one.
This causes softpipe and llvmpipe to now do actual divides, since
lower_fdiv is not set there. Note that llvmpipe's rcp implementation is a
divide of 1.0 by x, so now we're going to be just doing div(x, y) instead
of mul(x, div(1.0, y)).
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16823>