We were currently treating explicit flag writes and reads as a full
scheduler barrier, which is unnecessary since the tracking we already
do handles explicit flag access correctly so there is no reason for
taking a possibly large performance hit from add_barrier_deps().
Found by inspection while trying to understand the poor scheduling of
some fragment shaders. Improves performance by a small but
statistically significant amount (4 iterations, 5% significance) for
the following Traci tests in combination with a subsequent commit that
makes the pre-RA scheduler sensitive to instruction latencies:
SpaceEngineers-trace-dx11-2160p-high: 0.66% ±0.30%
MountAndBlade2-trace-dx11-1440p-veryhigh: 0.62% ±0.23%
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
We weren't handling the SHADER_OPCODE_SEND_GATHER instruction in the
instruction scheduler and this was leading to reduced performance in
many programs since SEND instructions have the longest latency and
tend to be among the most critical to schedule efficiently. Handle
SENDG similarly to SEND since the timings of both instructions are
mostly bound by the shared function which doesn't care if the message
was sent by SEND or SENDG.
Improves performance significantly in the following Traci traces (4
iterations, 5% significance), most of them regressions from SENDG
being enabled:
MetroExodus-trace-dx11-2160p-ultra: 1.99% ±0.88%
HogwartsLegacy-trace-dx12-1080p-ultra: 1.33% ±0.20%
GtaV-trace-dx11-2160p-ultra: 1.12% ±0.19%
Borderlands3-trace-dx11-2160p-ultra: 1.00% ±0.58%
TerminatorResistance-trace-dx11-2160p-ultra: 0.98% ±0.27%
Control-trace-dx11-1440p-high: 0.91% ±0.36%
Naraka-trace-dx11-1440p-highest: 0.90% ±0.30%
Ghostrunner2-trace-dx11-1440p-ultra: 0.87% ±0.38%
Palworld-trace-dx11-1080p-med: 0.71% ±0.17%
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Not required since we've disabled maintenance8 support.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: d39e443ef8 ("anv: add infrastructure for common vk_pipeline")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37242>
refcounting uses atomics, which are a significant source of CPU overhead
in many applications. by adding a method to inform the driver that
the frontend has released ownership of a buffer, all other refcounting
for the buffer can be eliminated
see MR for more details
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36296>
A barrier between two lds/vmem instructions needs to ensure that the
second starts after the first finishes, which means that we can't just
skip workgroup-scope vmem barriers if there is a lds instruction later.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36491>
For code like:
if (cond) {
val = load()
}
use(val)
The "use(val)" now has a similar cost to a use inside the IF.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36491>
* optimize not(compare(a,b)), nir_opt_algebraic does this only if the
comparison result is used only once, but on a vector arch we still get
an advantage when doing this, because it reduces dependencies.
* optimize b2f32(compare(a,b)), this is r600 specific
Signed-off-by: Gert Wollny <gert.wollny@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37205>
This was incorrect (it also lowered int64 reductions/scans), and the only
user can just use the general callback to precisely only lower what it wants.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37164>
This was added with the goal to eventually replace the per
pass subgroup/ballot size options, but that won't work because
some backends don't have a fixed subgroup size across the compilation
process.
It was also mostly added to hack around mesa state tracker behavior,
and we have a better solution there now.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37164>
We skip iterations with ifs.
These can be optimized aways after the subgroup size is known.
Every driver should do that because applications depend on it anyway.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37164>
If the offset is iadd(iadd(iadd(a, 1), b), -1), try_extract_const_addition
will create a dead iadd(a, b) and claim that it didn't modify the shader.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36370>
Commit d2f7b6d5 changed the BLEND_STATE update process so that only
the used render targets will be updated. This mostly works fine, but
in cases when the Dual Source Blending was used previously, we still
must turn it off to avoid the undefined behavior that leads to hangs.
Fixes: d2f7b6d5 ("anv: implement VK_KHR_dynamic_rendering_local_read")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13675
Signed-off-by: Sviatoslav Peleshko <sviatoslav.peleshko@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37246>