WMASK is used to lower subgroup ballot/ballot_relaxed operations. Unlike
ordinary ALU instructions, the result of WMASK depends not only on its
explicit source, but also on the implicit subgroup execution state, such
as the active lane mask.
Treating WMASK as a normal CSE-able instruction can collapse subgroup
predicate chains incorrectly. This may remove the producer of a predicate
SSA value while leaving later predicate-combine instructions still
referencing it. The invalid SSA source then reaches the Valhall packer
and fails with errors like:
Invalid type of source 2:
r2 = ICMP_AND.u32.eq.m1 r2^, r0, %23^
Fixed: dEQP-VK.subgroups.vote.graphics.subgroupallequal_ivec2
Signed-of-by: Ryan Zhang <ryan.zhang@nxp.com>
I originally added this mechanism to have the first (SIMD8) compile
note that certain features were in use which would prevent SIMD16/32
from compiling, so we could skip the work of trying those.
But these days, there aren't many cases, and the ones we have are
easily detectable based on the NIR. We can detect it earlier without
even having to do the SIMD8 compile.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41122>
We checked that ver is 11 or 12. It can't be >= 20. This is dead code.
Dual source blending on Xe2 does not have native SIMD32 RT write message
support, but SIMD splitting is currently lowering it to low/high SIMD16
message pairs when using SIMD32 dispatch. I'm not aware of any of the
hardware errata from previous platform still applying.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41122>
The new FRAG_RESULT_DUAL_SRC_BLEND option is easier to work with than
looking for FRAG_RESULT_DATA0 with an index of 1. This also means we
no longer care about the dual source blend index, and can just use the
FRAG_RESULT location. That cascades to meaning we no longer have to
store a tuple in driver_location. And, if we just need location, we
can avoid populating that at all and use nir_io_semantics to get it.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41122>
Detecting dual source blending is currently annoying: you can either
look at info->fs.color_is_dual_source, or FRAG_RESULT_DUAL_SRC_BLEND
being in the info->outputs_written bitfield.
The former is only set if nir_shader_gather_info runs prior to
nir_lower_io lowering it to FRAG_RESULT_DUAL_SRC_BLEND.
The latter is only set if nir_shader_gather_info runs after the
nir_lower_io lowering.
Just make the IO lowering also set the outputs_written flag so if
you're trying to use FRAG_RESULT_DUAL_SRC_BLEND, you can always
check outputs_written without worrying about pass ordering.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41122>
This older code really only exists for Gen8 and elk. On brw platforms,
we moved to handling this in brw_nir_lower_fs_load_output, and don't
need to lower FS outputs before iris_setup_binding_tables.
Call the right compiler's lowering so that things continue working
on elk once we change brw_lower_fs_outputs in the next commit.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41122>
We can just have iris look at its own program key and change the
fragment shader output variable's location/index in the NIR. By
doing this before lowering fragment shader outputs, the rest of
the output lowering does the right thing, and the backend no longer
has to consider hacks for broken OpenGL apps.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41122>
To verify that some GPUs are compatible and that shader binaries can be
shared to avoid precompiling twice for SteamOS.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41346>
Otherwise we'll hit a LLVM assert when handling load_const:
llvm/include/llvm/ADT/APInt.h:121: llvm::APInt::APInt(unsigned int, uint64_t, bool, bool): Assertion `llvm::isIntN(BitWidth, val) && "Value is not an N-bit signed value"' failed.
Cc: mesa-stable
Reviewed-by: Dave Airlie <airlied@redhat.com>
Acked-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41404>
llvmpipe supports real function calls, so we need to call nir_progress on
every function, not just the entry point.
Cc: mesa-stable
Reviewed-by: Dave Airlie <airlied@redhat.com>
Acked-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41404>
The panfrost compiler is now able to handle these on v9+ and we don't
need to lower them ourselves anymore. We only need the lowering on
Bifrost because we don't have the magic LD_PKA there.
Reviewed-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Acked-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41352>
This new pass, pan_nir_lower_image(), will eventually subsume all image
lowering. For now, though, it only lowers image_size and only on
Valhall.
Reviewed-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Lars-Ivar Hesselberg Simonsen <lars-ivar.simonsen@arm.com>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41352>
Some vulkancts tests rely on vkGetImageMemoryRequirements to return the same
exact size after exporting and importing an image. This broke when we started
adding padding to sampled surfaces to manage overfetch, because the texture
usage flag does not get applied to the ISL surface when the image is recreated
using an explicit layout.
Fixes: 8d13628f7 ("isl: Add additional alignment/padding requirements to prevent overfetch")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41376>
This is less heavy handed, avoiding unnecessary stalls after SENDs in a
bunch of common cases. The stats (SIMD32) are:
Totals:
CodeSize: 70345392 -> 71674272 (+1.89%)
Totals from 1774 (67.02% of 2647) affected shaders:
CodeSize: 67359248 -> 68688128 (+1.97%)
What's happening here is we are inserting extra SYNC.nop instructions in a
bunch of cases for the .src preceding the eventual .dst. However, putting aside
the i-cache impact for a moment, this is showing the optimization doing what it
should (deferring dst syncs and inserting cheaper src syncs first). So this
should be positive in reality despite the negative stat impact.
The most hurt shaders are pooling up SYNC.nop's at the end of blocks due to
local-only SWSB and lack of SYNC.allwr optimization. The latter is added later
in this MR. The former is planned.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41398>
update the tracking with what we actually waited on, not what we ideally wanted
to wait on. reduces extra annotations in some cases.
SIMD32:
Totals from 194 (7.33% of 2647) affected shaders:
CodeSize: 14473840 -> 14469088 (-0.03%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41398>
IGC does these optimizations and I think they should be safe given my mental
model. Given a sequence like:
r0 = add.f32 r1, r2
r1 = add.f32 r3, r4
Each ALU pipe is pipelined but in-order. Therefore, the second add cannot
possibly complete before the first add, so it cannot write r1 before the first
add reads r1, so we can elide the write-after-read dependency. That in term
avoids a pipeline bubble between the two instructions. Ditto for
write-after-write.
Similarly if the distance is too great within an in-order pipe since there is a
maximum pipeline length, it's not infinite.
Note that if there was cross-pipe dependencies we do need the annotation since
the pipes themselves are parallel.
SIMD32:
Totals from 58 (2.19% of 2647) affected shaders:
CodeSize: 3316592 -> 3315056 (-0.05%); split: -0.05%, +0.00%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41398>
Greedy post-RA substitution pass, similar to IGC's AccSubstitution pass.
Stats together with the previous commits.
SIMD16:
Totals from 2209 (83.45% of 2647) affected shaders:
Instrs: 2701029 -> 2696350 (-0.17%)
CodeSize: 39166720 -> 40372272 (+3.08%); split: -0.36%, +3.44%
SIMD32:
Totals from 2211 (83.53% of 2647) affected shaders:
Instrs: 4691165 -> 4641188 (-1.07%)
CodeSize: 69365792 -> 69341616 (-0.03%); split: -0.50%, +0.47%
The instruction count reduction is from RA shuffle code getting coalesced via
accumulators. The code size changes are from:
* Fewer moves from the instr count reduction (helped)
* Smaller MADs encoded as MACs (helped)
* Fewer SYNC.nop due to fewer scoreboarding annotations (helped)
* Less compaction due to explicit accumulator operands (hurt)
I expect significant cycle count changes from this but we don't have a cycle
model wired up yet, so reading the assembly will have to do.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41398>
handle cases where attachments are remapped to higher indices than
the renderpass was created with
fixes:
dEQP-VK.renderpasses.dynamic_rendering.primary_cmd_buff.local_read.mapping_*_attachments_to_locs_from_*
cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41407>
According to documents linked in HSD 1209977789, the push constant
allocation for PS stage is not applicable on Gfx12.5+ (removed). The
documents says push constant data is fetched by SBE in URB.
The HW must still parse the command and do nothing with it.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39584>
This will simplify things for PERFCNTR_CONFIG, where we will be getting
GPU timestamps along with the sampled counter values.
Signed-off-by: Rob Clark <rob.clark@oss.qualcomm.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41315>