The heuristic enables the SIMD32 fragment shader based on whether the
IR performance modeling pass predicts it to have greater throughput
than the SIMD16 and SIMD8 variants of the same shader. It would be
straightforward to do the same thing in order to control whether
SIMD16 dispatch is enabled, but it's pending additional performance
evaluation.
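In rough terms the heuristic is just a throughput comparison between
the compiled variants, along the lines of the sketch below, where
perf_estimate and should_use_simd32() are illustrative names rather
than the actual compiler interfaces:

  #include <math.h>
  #include <stdbool.h>

  /* Illustrative sketch: enable the SIMD32 variant only when the model
   * predicts it to beat the best narrower variant that compiled
   * successfully.
   */
  struct perf_estimate {
     float throughput;   /* Estimated invocations per cycle. */
  };

  static bool
  should_use_simd32(const struct perf_estimate *simd8,
                    const struct perf_estimate *simd16,
                    const struct perf_estimate *simd32)
  {
     float best_narrow = 0.0f;

     if (simd8)
        best_narrow = fmaxf(best_narrow, simd8->throughput);
     if (simd16)
        best_narrow = fmaxf(best_narrow, simd16->throughput);

     return simd32 && simd32->throughput > best_narrow;
  }
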
The INTEL_DEBUG=do32 option is left around in order to force the
SIMD32 shader to be used regardless of the result of the heuristic,
since it's useful as a debugging aid e.g. in order to identify
SIMD32-specific codegen issues which may be masked by the SIMD32
heuristic, or cases where the heuristic is incorrectly disabling
SIMD32 shaders that offer a performance advantage.
Currently this is only enabled on Gen6+, since SIMD32 codegen support
is incomplete on earlier platforms.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This makes brw_compile_fs() look a bit more similar to
brw_compile_cs(). It saves us three v*_shader_stats local variables,
and will save us additional triplicated declarations as we start
tracking IR performance analysis results.
The triplicated cfg pointers are left around because they're set to
NULL to mark specific dispatch modes as disabled (e.g. in order to
enforce hardware restrictions). Doing the same thing with the visitor
pointers would cause memory leaks.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling. It has instruction
performance information that is more comprehensive than the current
scheduling pass's, covers all platforms from Gen4 to Gen11, and works
on both the FS and VEC4 back-ends.
The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts. In addition, this will
allow the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) to be visible in
shader-db statistics.
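To give a sense of what the modeling involves, the pass essentially
boils down to a static walk over the IR accumulating per-instruction
estimates. A grossly simplified sketch, with made-up names and
omitting the bottleneck tracking mentioned below:

  #include <stdint.h>

  /* Illustrative sketch: accumulate issue cycles and the worst
   * dependency latency over the program and derive rough
   * latency/throughput numbers from the totals.
   */
  struct ir_instruction {
     uint32_t issue_cycles;   /* Cycles the EU front-end is busy. */
     uint32_t dep_latency;    /* Cycles until the result is usable. */
  };

  struct perf_estimate {
     uint32_t latency;   /* Cycles for a single thread to finish. */
     float throughput;   /* Upper bound on threads retired per cycle. */
  };

  static struct perf_estimate
  estimate_performance(const struct ir_instruction *insts, unsigned count)
  {
     uint32_t issue = 0, worst_dep = 0;

     for (unsigned i = 0; i < count; i++) {
        issue += insts[i].issue_cycles;
        if (insts[i].dep_latency > worst_dep)
           worst_dep = insts[i].dep_latency;
     }

     struct perf_estimate p;
     /* A single thread pays both the issue cost and the latency of its
      * slowest dependency chain (grossly simplified here).
      */
     p.latency = issue + worst_dep;
     /* With enough threads in flight latency can be hidden, so
      * throughput is bounded by the issue cycles alone.
      */
     p.throughput = issue ? 1.0f / issue : 0.0f;
     return p;
  }
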
But that isn't the end of the story; other potential applications of
this pass (not part of this MR) that I've been playing around with
are:
- Implement a similar SIMD16 heuristic allowing the identification of
inefficient SIMD16 fragment shaders.
- Implement similar SIMD16 and SIMD32 heuristics for the compute
shader stage -- Currently compute shader builds always use the
SIMD16 shader if available and never use the SIMD32 shader unless
strictly necessary, which is suboptimal under certain conditions.
- Hook up to the instruction scheduler in order to improve the
accuracy of its timing information.
- Use it as a heuristic in order to drive the selection of scheduling
modes (Matt was experimenting with that).
- Plug it into the TGL software scoreboard pass in order to implement a
more effective SBID token allocation algorithm, since in general
the optimal token allocation depends on the timings of all
instructions in the program.
- Use its bottleneck detection functionality in order to implement a
heuristic computing a better bound for the number of fragment
shader threads executed in parallel (by adjusting the
MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).
As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12, like
SWSB handling, is already included in this patch, but there were some
IP concerns regarding the TGL timing parameters, since they cannot
currently be obtained from the publicly available documentation and
hardware. The timing parameters for the previous Gen7-11 platforms can
be obtained by anyone by sampling the timestamp register using
e.g. shader_time, though I have some more convenient
instrumentation coming up.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Makes more sense considering SIMD32. Relaxing the assertion in
brw_ir_fs.h will be required in order to avoid assertion failures on
SNB with SIMD32 fragment shaders.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In Gen12 the Poly 0 Info DWORD containing the Viewport Index and
Render Target Index fields was moved from r0.0 to r1.1 in order to
make room for dual-polygon dispatch. The render target message format
was updated to expect that information in the same location, so we
didn't need to make any changes for framebuffer fetch to work with
SIMD8 and SIMD16 dispatch. Unfortunately that won't work with SIMD32,
since the render target message header is assembled from r0 and r2
instead of r1, and the r2 thread payload wasn't updated with an
additional copy of the same information. We need to fix things up
manually instead. This avoids a handful of
EXT_shader_framebuffer_fetch regressions in combination with SIMD32
fragment shaders.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This applies the same work-around I committed as b84fa0b31e
"intel/fs/gen11: Work around dual-source blending hangs in combination
with SIMD32." to Gen12, which was found empirically to suffer from
the same hardware bug. The failure mode seems to be identical.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The Gen12 docs are rather contradictory regarding the dispatch
configurations supported by the fragment shader -- The same table
present in previous generations seems to imply that only one dispatch
mode can be enabled when doing per-sample shading, but a restriction
documented in the 3DSTATE_PS_BODY page implies the opposite: That
SIMD32 can only be used in combination with some other dispatch mode.
The latter seems to match the behavior of real hardware as far as I
could tell from my testing: A bunch of multisample test-cases that do
per-sample shading hang if we only provide a SIMD32 shader.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Section 2.2.2 (Data Conversions For State Query Commands) of the
OpenGL 4.5 spec says:
Following these steps, if a value is so large in magnitude that
it cannot be represented by the returned data type, then the
nearest value representable using that type is returned.
The current code doesn't do the correct thing, because it truncates a
long (potentially a 64-bit value) to an int.
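A minimal sketch of the clamping the spec asks for, as an illustrative
helper rather than the exact util function this change touches:

  #include <limits.h>

  /* Clamp a 64-bit query result to the range of int instead of
   * truncating it, per section 2.2.2 of the GL 4.5 spec.
   */
  static int
  clamp_query_to_int(long long v)
  {
     if (v > INT_MAX)
        return INT_MAX;
     if (v < INT_MIN)
        return INT_MIN;
     return (int)v;
  }
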
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2828
Fixes: 53c36dfcfe ("replace IROUND with util functions")
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4673>
When a resource's backing bo changes, its seqno will be incremented,
which results in a new tex state cache key, with nothing to clean up
the old tex state until the sampler view/state is destroyed. But
in some games, that may never happen, or at least not happen before
we run out of memory.
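Conceptually the fix is to evict the entries still keyed on the old
seqno, roughly along these lines; tex_key and the cache layout are
made up, with Mesa's util hash table standing in for the driver's own
tables:

  #include <stdint.h>
  #include "util/hash_table.h"

  /* Illustrative sketch: when a resource's backing bo is swapped out,
   * walk the tex state cache and drop any entries still keyed on the
   * old seqno so they don't accumulate forever.
   */
  struct tex_key {
     uint32_t rsc_seqno;
     /* ... remaining sampler view/state bits ... */
  };

  static void
  evict_stale_tex_states(struct hash_table *ht, uint32_t old_seqno)
  {
     hash_table_foreach(ht, entry) {
        const struct tex_key *key = entry->key;

        if (key->rsc_seqno == old_seqno)
           _mesa_hash_table_remove(ht, entry);
     }
  }
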
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2830
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4744>
This will matter in the next patch, where we need the original
rsc->seqno.
This means a slight shuffling of where we call rebind_resource() in the
`fd_try_shadow_resource()` path.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4744>
Track how resources are used, i.e. which state they may potentially dirty
if the backing bo is changed/reallocated, to optimize rebind_resource().
This will be more important in a later patch when we hook up eviction of
entries in a6xx tex state cache.
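The shape of the optimization, with invented flag and struct names
standing in for the driver's own dirty-state bits:

  #include <stdint.h>

  /* Illustrative sketch: each resource carries a bitmask of how it is
   * bound, and rebind_resource() only dirties the state that can
   * actually hold a stale reference to the old bo, instead of dirtying
   * everything.
   */
  enum rsc_usage {
     RSC_USED_AS_VERTEX_BUF = 1 << 0,
     RSC_USED_AS_CONST_BUF  = 1 << 1,
     RSC_USED_AS_TEXTURE    = 1 << 2,
     RSC_USED_AS_SSBO       = 1 << 3,
  };

  struct rsc {
     uint32_t usage;   /* Bitmask of rsc_usage, updated at bind time. */
  };

  struct ctx {
     uint32_t dirty;   /* State to re-emit; same bit layout as
                        * rsc_usage for the sake of this sketch. */
  };

  static void
  rebind_resource(struct ctx *ctx, const struct rsc *rsc)
  {
     /* Only the binding points the resource is actually used through
      * can reference the old bo.
      */
     ctx->dirty |= rsc->usage;
  }
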
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4744>
The `DISCARD_WHOLE_RESOURCE` flag is just a hint, and
`rebind_resource()` is a bunch of faffing about (and going to get
worse in a later patch), so let's not bother when the bo is already
idle.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4744>
This commit fixes two problems:
- In some cases SWR does not correctly report to Gallium
which formats are supported.
- Incorrect LLVM instructions are used in vertex fetch in some situations.
Reviewed-by: Krzysztof Raszkowski <krzysztof.raszkowski@intel.com>
Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4788>
We should only rely on overflow detection for indirect draws, where we
have no other option.
This doesn't use quite the worst-possible-case sizes, which in
practice seem to be ~20x larger than what is required, but instead
uses roughly half of that.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4750>
For vsc size calculation, we need to know the # of bins per pipe, or
at least the worst-case # of bins, assuming we don't eliminate an
unused depth/stencil buffer.
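For illustration only -- the actual gmem code derives this from its
own pipe configuration -- the worst case under an even-distribution
assumption would be:

  #include <stdint.h>

  /* Illustrative sketch: assuming bins are spread evenly across the
   * VSC pipes, the worst-case number of bins a single pipe has to
   * cover is the total bin count divided by the pipe count, rounded
   * up.
   */
  static uint32_t
  worst_case_bins_per_pipe(uint32_t nbins_x, uint32_t nbins_y,
                           uint32_t num_pipes)
  {
     uint32_t nbins = nbins_x * nbins_y;

     return (nbins + num_pipes - 1) / num_pipes;
  }
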
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4750>
Example situation:
v1 = {v0.hi, v1.lo}
v0.hi = v1.hi
The 4-byte copy's definition is completely used, but swapping it makes no
sense. We have to split it to generate correct code:
swap(v0.hi, v1.lo)
swap(v0.hi, v1.hi)
Found in dEQP-VK.spirv_assembly.type.vec3.i16.constant_composite_vert
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4772>
When splitting a v6b vector into v1 and v2b components, we should ensure
the v1 definition doesn't start at the upper half.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4772>
We can't remove volatile writes and we can't combine them with other
volatile writes, so all we can do is clear the unused bits.
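The rule, sketched with made-up types rather than the actual NIR
intrinsics:

  #include <stdbool.h>

  /* Illustrative sketch: the only legal optimization on a volatile
   * store is trimming write-mask bits that are provably never read;
   * the store itself must survive and must not be merged with any
   * other store.
   */
  struct store {
     unsigned write_mask;   /* One bit per vector component written. */
     bool is_volatile;
  };

  /* Returns true if the store must be kept. */
  static bool
  trim_store(struct store *st, unsigned used_mask)
  {
     st->write_mask &= used_mask;

     if (st->is_volatile)
        return true;

     /* A non-volatile store that writes nothing useful can go away. */
     return st->write_mask != 0;
  }
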
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4767>
For deref_store, we can still delete invalid stores that write to
statically OOB data. For everything, we need to make sure that we
kill aliases of destinations even if the store is volatile.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4767>
Starting with Gen11, we have two indirect clear colors: An unconverted
float/int version which is used for rendering and a converted pixel
value version which is used for texturing. Because the one used for
texturing is stored as a single pixel of that color, it works no matter
what format is being used. Because it's a simple HW indirect and
doesn't involve copying surface states around, we can use it in the
sampler without having to worry about surface states having out-of-date
clear values. The result is that we can now allow any clear color when
texturing.
This cuts the number of resolves in a RenderDoc trace of Dota2 by 95%
on Gen11+ (you read that right) and improves perf by 3.5%. It improves
perf in a handful of other workloads by < 1%.
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4393>