For instance, to load uniform data with the LSC we usually rely on
transpose messages, which have to execute in SIMD1. Those end up being
considered partial writes, so within loops their live ranges spread
across the whole loop, increasing register pressure.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21867>
Fossil-db results:
All Intel platforms had similar results. (Ice Lake shown)
Cycles in all programs: 9098346105 -> 9098333765 (-0.0%)
Cycles helped: 6
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19042>
Unlike ufind_msb, ifind_msb is only defined in NIR for 32-bit values, so
no @32 annotation is required.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19042>
When enabled, on Gfx12+, we add a sync nop instruction after each
instruction to make sure that the current instruction explicitly
depends on the previous one.
This option gives us a hint when something is missing or broken in the
software scoreboard pass.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21797>
Now that we aren't using them on Gfx8+, we can drop a lot of cruft.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
This mainly lets the software scoreboarding pass correctly mark the
instructions, without needing to resort to fragile manual handling in
the generator.
We can also make small improvements. On Gfx 8LP-12.0, we no longer have
the restrictions about DWord alignment, so we can simply write each half
into its intended location, rather than writing it to the low DWord and
then shifting it in place.
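As a rough illustration of the difference (plain C++ standing in for the
register-level operations, little-endian layout assumed; this is not the
backend code itself):

    #include <cstdint>
    #include <cstring>

    /* New approach: store each converted half directly into its word of
     * the destination DWord. */
    uint32_t pack_halves_direct(uint16_t lo, uint16_t hi)
    {
       uint16_t words[2] = { lo, hi };
       uint32_t result;
       std::memcpy(&result, words, sizeof(result));
       return result;
    }

    /* Old approach: both halves are produced at a DWord-aligned (low word)
     * position and the high one is shifted into place afterwards. */
    uint32_t pack_halves_shift(uint16_t lo, uint16_t hi)
    {
       return (uint32_t)lo | ((uint32_t)hi << 16);
    }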
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
I originally thought that we were intentionally emitting the legacy
opcodes here to make them opaque to the optimizer, so that it wouldn't
eliminate the explicit type conversions, as they're actually required
to do the quantization. But...we don't actually optimize those away
currently anyway. So...go ahead and use the helpers for consistency.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
With the previous patch, we no longer need to special case this, as we
emit a MOV with an HF source, rather than F16TO32 with an UW source,
on all platforms that need scoreboarding.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
This gets us a MOV at the IR level on Gfx8+, which should be more
optimizable than F16TO32. It also removes confusion about which pipe
the instruction will run on.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
We can just use the new builder helpers to get the optimization
advantages of a MOV on Gfx8+ while also getting the necessary F32TO16
on Gfx7.x and yet not worry too hard about it.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
These take care of emitting the F32TO16/F16TO32 instructions on Gfx7.x
but otherwise just emit a type converting MOV on Gfx8+.
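A minimal sketch of what such a helper can look like, assuming the
Intel backend's fs_builder API (the helper name and exact form here are
illustrative, not the actual code):

    #include "brw_fs_builder.h"

    /* Illustrative only: helper name and placement are assumptions. */
    static fs_inst *
    emit_half_to_float(const fs_builder &bld, const fs_reg &dst, const fs_reg &src)
    {
       assert(dst.type == BRW_REGISTER_TYPE_F && src.type == BRW_REGISTER_TYPE_HF);

       /* Gfx8+ can simply use a type-converting MOV with an HF source. */
       if (bld.shader->devinfo->ver >= 8)
          return bld.MOV(dst, src);

       /* Gfx7.x needs the dedicated F16TO32 opcode, which reads the
        * half-float bits from a UW source. */
       return bld.emit(BRW_OPCODE_F16TO32, dst, retype(src, BRW_REGISTER_TYPE_UW));
    }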
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
For converting half-float to float, we currently emit BRW_OPCODE_F16TO32
with a UW source, to match legacy Gfx7 behavior. In the generator, this
becomes a MOV with a HF source on Gfx8+. Unfortunately, this UW source
confuses the scoreboarding pass into thinking it's an integer source,
leading to incorrect SWSB annotations on Alchemist.
We should ultimately fix the IR to stop being so...legacy...here, but
this is the simplest fix for stable branches.
Fixes misrendering in Elden Ring and likely Sekiro: Shadows Die Twice.
Cc: mesa-stable
Tested-by: Chuansheng Liu <chuansheng.liu@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
References: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8018
References: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8375
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21783>
Wa_14017989577 is a clone of Wa_14015360517, which applies to several
platforms beyond INTEL_PLATFORM_DG2_G10.
Update references to Wa_14017989577, and use the generated workaround
helper to ensure application to the proper platforms.
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21744>
Wa_1209978020 is a clone of Wa_18012201914. Update references to
refer to the originating bug, and use generated helpers to ensure it
is applied to future platforms as needed.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21741>
Now that we handle scoped barriers with execution scope during
NIR -> Backend IR translation, this lowering is not needed anymore.
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21634>
For GLSL, we want to optimize code like
memoryBarrierBuffer();
controlBarrier();
into a single scoped_barrier intrinsic for the backend to consume. Now that
backends can get scoped_barriers everywhere, what's left is enabling backends to
combine these barriers together. We already have an Intel-specific pass for
combining memory barriers; it just needs a teensy bit of generalization to allow
combining all sorts of barriers together.
This avoids code quality regression on Asahi when switching to purely scoped
barriers. It's probably useful for other backends too.
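As a greatly simplified sketch of the merge itself (assuming NIR's
intrinsic index accessors; the real pass also has to prove the two
barriers are adjacent and safe to combine, and the wrapper name is
hypothetical):

    #include "nir.h"

    static void
    combine_scoped_barriers(nir_intrinsic_instr *a, nir_intrinsic_instr *b)
    {
       /* Keep the widest execution and memory scopes... */
       nir_intrinsic_set_execution_scope(a, MAX2(nir_intrinsic_execution_scope(a),
                                                 nir_intrinsic_execution_scope(b)));
       nir_intrinsic_set_memory_scope(a, MAX2(nir_intrinsic_memory_scope(a),
                                              nir_intrinsic_memory_scope(b)));
       /* ...and take the union of the memory semantics and modes. */
       nir_intrinsic_set_memory_semantics(a, (nir_memory_semantics)
          (nir_intrinsic_memory_semantics(a) | nir_intrinsic_memory_semantics(b)));
       nir_intrinsic_set_memory_modes(a, (nir_variable_mode)
          (nir_intrinsic_memory_modes(a) | nir_intrinsic_memory_modes(b)));
       nir_instr_remove(&b->instr);
    }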
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21661>
This probably doesn't affect Vulkan or GL because they can't have
anything bigger than a vec4 anyway unless it's a u64vec4 and those have
to be at least 8B aligned. This may affect CL apps if they use
__attribute__((packed)) on something with big vectors, depending on how
LLVM decides to translate that.
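For illustration, the kind of layout that can trigger this (GCC/Clang
vector extension used here as a stand-in for an OpenCL vector type):

    /* The packed attribute drops the vector's natural 32B alignment, so
     * accesses to v may be as little as 1-byte aligned and have to be
     * split into smaller loads/stores. */
    typedef float float8 __attribute__((vector_size(32)));

    struct __attribute__((packed)) payload {
       char tag;
       float8 v;   /* starts at offset 1 instead of a 32B boundary */
    };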
Fixes: f8aa83f0c8 ("intel/nir: Use nir_lower_mem_access_bit_sizes()")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21524>
This ensures that users of libintel_dev.a won't be compiled until
include files are generated, and that they are recompiled when the
header changes.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Mark Janes <markjanes@swizzler.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20825>
Power-of-two SW stack sizes are prone to causing collisions in the
hashing function used by the L3 to map memory addresses to banks,
which can cause stack accesses from most DSSes to bottleneck on a
single L3 bank. Fix it by padding the SW stack stride by a single
cacheline if it was a power of two. This has been reported by Felix
DeGrood to improve Quake2 RTX performance by ~30% on DG2-512 in
combination with other RT patches Lionel Landwerlin has been working
on.
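A minimal sketch of the padding rule (plain C++, not the actual Mesa
code; the 64B cacheline size is an assumption for the example):

    #include <cstdint>

    static uint32_t pad_sw_stack_stride(uint32_t stride)
    {
       /* A power-of-two stride makes consecutive stacks hash to the same
        * L3 bank, so bump it by one cacheline. */
       const bool is_pow2 = stride != 0 && (stride & (stride - 1)) == 0;
       return is_pow2 ? stride + 64 : stride;
    }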
Many thanks to Felix DeGrood for doing much of the legwork and
providing several iterations of Q2RTX performance counter dumps which
eventually prompted me to consider the hash collision theory and
motivated this patch, and for providing additional performance counter
dumps confirming that there is no longer an appreciable imbalance in
traffic across L3 banks after this change.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21461>
Hoping that I didn't miss any, this *should* add assertions
to all functions and passes which explicitly handle 'nir_loop'.
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13962>
Rob added these new helpers a while back, which freedreno and radeonsi
both share. We should use them too. The new helpers use variables and
system value intrinsics, so we can drop the explicit binding table
creation and just use the normal paths.
Because we have to rewrite the system value uploading anyway, we drop
the scrambling of the default tessellation levels on upload, and instead
let the compiler go ahead and remap components like any normal shader.
In theory, this results in more shuffling in the shader. In practice,
we already do MOVs for message setup. In the passthrough shaders I
looked at, this resulted in no extra instructions on Icelake (SIMD8
SINGLE_PATCH) and Tigerlake (8_PATCH). On Haswell, one shader grew by
a single instruction for a pittance of cycles in a stage that isn't a
performance bottleneck anyway. Avoiding remapping wasn't so much of an
optimization as just the way that I originally wrote it. Not worth it.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20809>
This is to avoid out-of-bound register accesses (potentially leading
to hangs) when the dispatch size is smaller than what is reported in
the NIR subgroup_size.
v2: Implement bounding with a mask (since workgroup sizes are powers of 2) (Faith)
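A minimal sketch of the masking idea (names are hypothetical; this is
not the actual backend change):

    #include <cstdint>

    /* Because workgroup/dispatch sizes are powers of two, masking with
     * (size - 1) keeps any index within the registers that were actually
     * allocated for this dispatch. */
    static uint32_t bound_index(uint32_t index, uint32_t dispatch_size)
    {
       return index & (dispatch_size - 1);
    }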
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 530de844ef ("intel,anv,iris,crocus: Drop subgroup size from the shader key")
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21282>
When a mesh shader is spawned on a different slice than the originating
task shader, the input task URB handle can come from a different slice,
so masking this information off would load data from the current slice
instead of the one where the real data is.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21007>
The rounding mode only needs to be set once, because 16-bit floats and
denorm preservation aren't supported on the platforms where vec4 is
used.
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20232>
The size in src[2] is in bytes and needs to cover any possible data
accessed in src[0] by the indirection. That way the register allocator
is aware of what cannot be spilled for the instruction to execute on
valid data.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 70ace2bbcd ("intel/compiler: Implement Task Output and Mesh Input")
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21188>
This hardware bug is the result of a control flow optimization present
in Gfx8-9 meant to prevent the ELSE instruction from disabling all
channels and updating the control flow stack only to have them
re-enabled at the ENDIF instruction executed immediately after it.
Instead, on Gfx8-9 an ELSE instruction that would normally have ended
up with all channels disabled would pop off the last element of the
stack and jump directly to JIP+1 instead of to the ENDIF at JIP,
skipping over the ENDIF instruction. In simple cases this would work
okay (though its actual performance benefit is questionable), but in
cases where a branch instruction within the IF block (e.g. BREAK or
CONTINUE) caused all active channels to jump outside the IF
conditional, the optimization would break the JIP chain of "join"
instructions by skipping the ENDIF, causing the block of instructions
immediately after the ENDIF to execute with all channels disabled
until execution reaches the reconvergence point.
This issue was observed on SKL in the
dEQP-VK.reconvergence.subgroup_uniform_control_flow_elect.compute.nesting4.0.38
test in combination with some Vulkan binding model changes Lionel is
working on. In such cases the execution with all channels disabled
was leading to corruption of an indirect message descriptor, causing a
hang.
Unfortunately no recommended workaround is provided for this hardware
bug. In order to fix the problem we point the JIP of an ELSE
instruction to the instruction immediately before the ENDIF. However,
that alone isn't expected to work due to the restriction that JIP and
UIP must be equal if and only if BranchCtrl is disabled, so this patch
also enables BranchCtrl, which is intended to support join instructions
within the "ELSE" block and in turn disables the optimization described
above. That, in turn, causes us to execute the instruction immediately
*before* the ENDIF with all channels disabled, so in order to avoid
further fallout from executing code with all channels disabled we
insert a NOP before ENDIF instructions that have a matching ELSE
instruction.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20921>