For terminate operation, jump the invocation without predicating on
the rest of the quad being disabled -- which is what is done for
demote and discard.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7150>
cs_prog_data->slm_size is basically redundant with
prog_data->total_shared, which is the field that we actually use for
controlling the shared local memory size in all drivers. We were
still using it in one place for VK_EXT_pipeline_executable_properties,
but we should just fix that and delete the field.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7152>
This opcode is responsible for setting up the buffer base address and
per-thread scratch space fields of a scratch message header. For the
most part, it's a copy of g0 but some messages need us to zero out g0.2
and the bottom bits of g0.5.
This may actually fix a bug when nir_load/store_scratch is used. The
docs say that the DWORD scattered messages respect the per-thread
scratch size specified in gN.3[3:0] in the message header but we've been
leaving it zero. This may mean that we've been ignoring any scratch
reads/writes from a load/store_scratch intrinsic above the 1KB mark.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
In theory, this fixes a bug where we were dropping the PTSS bound on the
floor. The hardware docs claim that the A32 DWORD and BYTE scattered
read/write messages do a PTSS bounds check. However, in practice, it
seems that the hardware ignores the bounds check so this doesn't
actually matter. I verified this with the following couple of piglit
tests:
https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/399
In practice, this prevents the next commit from making a subtle
behavioral change.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
It's included in declaration of INTEL_DEBUG.
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6732>
Scratch stores are being lowered to the instructions with side-effects,
however they should be enabled in fs helper invocations, since they
are produced from operations which don't imply side-effects.
To fix this - we move the decision of whether the sample mask predication
is enable to the point where logical brw instructions are created.
GLSL example of the issue:
int tmp[1024];
...
do {
// changes to tmp
} while (some_condition(tmp))
If `tmp` is lowered to scrach memory, `some_condition` would be
undefined if scratch write is predicated on sample mask, making
possible for the while loop to become infinite and hang the GPU.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3256
Fixes: 53bfcdeecf
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6056>
GEN:BUG:1604601757 prevents source modifiers for multiplication of
lower precision integers.
lower_mul_dword_inst() splits 32x32 multiplication into 32x16, and
needs to eliminate source modifiers in this case.
Closes: #3329
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
If INTEL_DEBUG=no16 is used, then simd16 will not be attempted. This,
in turn prevents simd32 from running, because we attempt to skip
simd32 when simd16 fails to compile.
This change more accurately recognizes when we attempted simd16, but
simd16 failed.
One easy way to cause an issue is to set both no8 and no16. Before
this change, we would be left with no FS program, even though simd32
could still be generated in some cases.
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5269>
If no16 was specified, and the shader can't run in simd8 due to the
local_size, then we need to generate a simd32 program.
If both no8 and no16 are specified, then we need to generate a simd32
program.
Rework:
* Drop update of `if` that would have changed `do32` to try simd32
even if simd16 spilled registers. (Caio)
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5269>
2 source instruction only support immediate for src1 operand, so no
point in adding optimization condition for src0 oprand.
v2:
- Update commit message and don't remove ADD optimization (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5341>
Gen9 and Cherryview have the ability to mark texture instructions with
the End-of-thread bit under some conditions, which allows the texture
result to be written to the render target directly, rather than
returning to the EU.
In order to handle overlapping primitives correctly, we have to use the
'sendc' instruction which stalls until other threads potentially writing
to the same locations in the render target are retired. Unfortunately,
this stall happens before the texture is sampled (rather than in
parallel with stall), so for some literal edge cases (like the diagonal
edge between two triangles forming a rectangle) there can be a
performance penalty. As a result, it's probably not a good idea to use
this optimization in general.
I had planned to leave it enabled only for BLORP, where we use rectangle
primitives and are typically clearing/blitting an entire render target
without any overlapping primitives, but I noticed that the optimization
wasn't applied in some normal cases anyway. For example, in the piglit
test tests/shaders/glsl-fs-texture2d-bias.shader_test it is applied to
one BLORP-blit shader but not another due to some kind of mishandling of
register types (the destination register type of the texture operation
is UD while the color source of the render target write is F).
Additionally the instruction scheduler assumed that the combined texture
and render target write operation took 0 cycles, leading to cycle
estimates that are wildly inaccurate. Since the optimization was not
implemented for SIMD32 and our decision whether to use the SIMD32
program is made by comparing the estimated performance with that of the
SIMD16 shader, we wrongly threw out a bunch of SIMD32 programs that are
likely profitable.
total cycles in shared programs: 472807891 -> 473784245 (0.21%)
cycles in affected programs: 108277 -> 1084631 (901.72%)
helped: 0
HURT: 1290
total sends in shared programs: 998955 -> 1000245 (0.13%)
sends in affected programs: 1400 -> 2690 (92.14%)
helped: 0
HURT: 1290
LOST: 0
GAINED: 33
This patch shows no performance changes in Intel's Mesa performance CI.
Given the problems, the lack of evidence that the pass improves
performance, and the fact that the hardware feature was removed from
subsequent GPU generations, I think that the pass is not valuable and
should be removed.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5412>
This will make the GL drivers pick the right SIMD variant for a given
group size set during dispatch. The heuristic implemented in
brw_cs_simd_size_for_group_size() is the same as in brw_compile_cs().
The cs_prog_data::simd_size field was removed. The generated SIMD
sizes are marked in a bitmask, which is already used via
brw_cs_simd_size_for_group_size() by the drivers.
When in variable group size, it is OK if larger SIMD shader spill,
since we'd need it for the cases where the smaller one can't hold all
the invocations.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
Move the decision one level up, let brw_compile_*() functions use the
spilling information to decide whether or not a certain width
compilation can spill (passed via run_*() functions).
The min_dispatch_width was used to compare with the dispatch_width and
decide whether "a previous shader is already available, so don't
accept spill".
This is replaced by:
- Not calling run_*() functions if it is know beforehand a smaller width
already spilled -- since the larger width will spill and fail;
- Explicitly passing whether or not a shader is allowed to spill. For
the cases where the smaller width is available and haven't spilled,
the larger width will be compiled but is only useful if it won't
spill.
Moving the decision to this level will be useful later for variable
group size, which is a case where we want all the widths to be allowed
to spill.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
The motivating factor is: this lowering may cause
nir_intrinsic_load_local_group_size intrinsics to be added to the
shader, and by moving this around we make possible for the drivers to
lower that intrinsic by themselves.
Iris will do just that in a later patch for implementing variable
group size.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
Intrinsic to get the SIMD width, which not always the same as subgroup
size. Starting with a small scope (Intel), but we can rename it later
to generalize if this turns out useful for other drivers.
Change brw_nir_lower_cs_intrinsics() to use this intrinsic instead of
a width will be passed as argument. The pass also used to optimized
load_subgroup_id for the case that the workgroup fitted into a single
thread (it will be constant zero). This optimization moved together
with lowering of the SIMD.
This is a preparation for letting the drivers call it before the
brw_compile_cs() step.
No shader-db changes in BDW, SKL, ICL and TGL.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL. In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This is useful in order to identify codegen issues caused by SIMD32.
It doesn't currently have any effect on compute shaders since SIMD32
dispatch is only enabled for CS when it's strictly necessary to do so
in order to support the workgroup size requested for the shader --
That might change in the future though when we hook up the SIMD32
heuristic to CS compilation.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The heuristic enables the SIMD32 fragment shader based on whether the
IR performance modeling pass predicts it to have greater throughput
than the SIMD16 and SIMD8 variants of the same shader. It would be
straightforward to do the same thing in order to control whether
SIMD16 dispatch is enabled, but it's pending additional performance
evaluation.
The INTEL_DEBUG=do32 option is left around in order to force the
SIMD32 shader to be used regardless of the result of the heuristic,
since it's useful as a debugging aid e.g. in order to identify
SIMD32-specific codegen issues which may be masked by the SIMD32
heuristic, or cases where the heuristic is incorrectly disabling
SIMD32 shaders that offer a performance advantage.
Currently this is only enabled on Gen6+, since SIMD32 codegen support
is incomplete on earlier platforms.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This makes brw_compile_fs() look a bit more similar to
brw_compile_cs(). It saves us three v*_shader_stats local variables,
and will save us additional triplicated declarations as we start
tracking IR performance analysis results.
The triplicated cfg pointers are left around because they're set to
NULL to mark specific dispatch modes as disabled (e.g. in order to
enforce hardware restrictions). Doing the same thing with the visitor
pointers would cause data leaks.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Makes more sense considering SIMD32. Relaxing the assertion in
brw_ir_fs.h will be required in order to avoid assertion failures on
SNB with SIMD32 fragment shaders.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In Gen12 the Poly 0 Info DWORD containing the Viewport Index and
Render Target Index fields were moved from r0.0 to r1.1 in order to
make room for dual-polygon dispatch. The render target message format
was updated to expect that information in the same location, so we
didn't need to make any changes for framebuffer fetch to work with
SIMD8 and SIMD16 dispatch. Unfortunately that won't work with SIMD32,
since the render target message header is assembled from r0 and r2
instead of r1, and the r2 thread payload wasn't updated with an
additional copy of the same information. We need to fix things up
manually instead. This avoids a handful of
EXT_shader_framebuffer_fetch regressions in combination with SIMD32
fragment shaders.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The Gen12 docs are rather contradictory regarding the dispatch
configurations supported by the fragment shader -- The same table
present in previous generations seems to imply that only one dispatch
mode can be enabled when doing per-sample shading, but a restriction
documented in the 3DSTATE_PS_BODY page implies the opposite: That
SIMD32 can only be used in combination with some other dispatch mode.
The latter seems to match the behavior of real hardware as I could
tell from my testing: A bunch of multisample test-cases that do
per-sample shading hang if we only provide a SIMD32 shader.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>