Right now the accumulator-clearing move emitted by the generator for
Wa_14010017096 inherits the SWSB field from the previous instruction.
This can lead to redundant synchronization, or possibly more serious
issues if the previous instruction had a TGL_SBID_SET SWSB
synchronization mode. Take the SWSB synchronization information from
the IR.
Fixes: a27542c5dd ("intel/compiler: Clear accumulator register before EOT")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11433>
Instead of depending on the driver to compile each resume shader
separately, we compile them all in one go in the back-end and build an
SBT as part of the shader program. Shader relocs are used to make the
entries in the SBT point point to the correct resume shader.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8637>
Those helpers exist primarily to sort out some of the weirdness around
Gen4-6 dataport access. On Gen5 and earlier, everything was called
"dataport" and, instead of the SFID we have today there was a "target
cache" parameter in the descriptor. There are also some bits that moved
around on various gens depending on read vs. write. Starting with Gen6,
most things which target one of the data cache SFIDs should use
brw_dp_desc() instead.
v2: Drop backward comment (Ken)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7455>
v2: Drop new internal opcodes (Jason)
Simplify code (Jason)
v3: Add Z computation for coarse pixels
v4: Document things a little
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7455>
v2: Use the new inst->ex_desc field (Jason)
v3: Drop CPS LoD compensation from sampler messages (Lionel)
v4: Drop useless uses_rate_shading (Ken)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7455>
It has been removed from the hardware.
[jordan.l.justen@intel.com: Move to brw_postprocess_nir]
v2: Switch to nir_lower_idiv_precise (Rhys).
v3: Fix for interface changes of nir_lower_idiv.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10000>
According to the hardware spec "Vx1 and VxH indirect addressing for
Float, Half-Float, Double-Float and Quad-Word data must not be used."
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10000>
One exception is src/amd/addrlib/, for which -Wimplicit-fallthrough is
explicitly disabled.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10220>
These checks were originally assertions elsewhere either in the existing
code or later in this MR.
Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9027>
On Gen7, we have to split shuffles into two MOVs for 64-bit types so we
can't handle source modifiers. On Gen12.5, we have to use integer types
all the time so we can't use them there either. Fixing that will be a
different commit but it interacts with this one.
Fixes: 90c9f29518 "i965/fs: Add support for nir_intrinsic_shuffle"
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9068>
I meant to do this years ago when I first added SHADER_OPCODE_SEND. At
the time, the only use for the extended descriptor was bindless handles
which were always one thing and never non-constant. However, it doesn't
actually require any extra instructions because we have to OR in ex_mlen
anyway.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8748>
We're about to start using it to implement nir_jump_halt which has
nothing inherently to do with fragment shaders or discards. May as well
name it for the HW instruction it generates.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5071>
The Intel bindless thread dispatch model is very simple. When a compute
shader is to be used for bindless dispatch, it can request a set of
stack IDs. These are allocated per-dual-subslice by the hardware and
recycled automatically when the stack ID is returned. Passed to the
bindless dispatch are a global argument address, a stack ID, and an
address of the BINDLESS_SHADER_RECORD to invoke. When the bindless
shader is dispatched, it is passed its stack ID as well as the global
and local argument pointers. The local argument pointer is the address
of the BINDLESS_SHADER_RECORD plus some offset which is specified as
part of the BINDLESS_SHADER_RECORD.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7356>
This opcode is responsible for setting up the buffer base address and
per-thread scratch space fields of a scratch message header. For the
most part, it's a copy of g0 but some messages need us to zero out g0.2
and the bottom bits of g0.5.
This may actually fix a bug when nir_load/store_scratch is used. The
docs say that the DWORD scattered messages respect the per-thread
scratch size specified in gN.3[3:0] in the message header but we've been
leaving it zero. This may mean that we've been ignoring any scratch
reads/writes from a load/store_scratch intrinsic above the 1KB mark.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
When I copied and pasted the code from MOV_INDIRECT for handling the
dependency controls, I missed a subtle difference between MOV_INDIRECT
and SHUFFLE. Specifically, MOV_INDIRECT gets lowered to a narrow
instruction on Gen7 by the SIMD width lowering whereas SHUFFLE has to
split it in the generator. Therefore, the check safety check for
whether or not we can use dependency control has to be based on the
lowered width rather than the width of the original instruction.
Fixes: a8ac61b0ee "intel/fs: NoMask initialize the address..."
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3593
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6989>
These variables seem to be initialized before being used, so this
patch is not fixing any bug, but leaving them unitialized may become
a bug after some refactoring.
These classes were affected: fs_reg_alloc, fs_visitor, fs_generator,
instruction_scheduler.
Found by Coverity.
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6667>
Shader instructions which use UIP/JIP now get formatted with a label
in addition with immediate value, labels have "LABEL%d" format.
v2: - Consider brw_jump_scale when calculating label's offset
From: "Lonnberg, Toni" <toni.lonnberg@intel.com>
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4245>
The immediate case is pretty uncommon to see but it can happen, in
theory. BROADCAST is typically used to uniformize values and those are
usually 32-bit. However, it does come up in some subgroup ops.
Fixes: 49c21802cb "intel/compiler: Split has_64bit_types into float/int"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6211>
Using HALT to immediately jump to the end of the shader is required to
implement GL_EXT_gpu_shader4 and OpenGL 3.0. However, vanilla OpenGL
1.2 doesn't forbid it and it likely makes something somewhere faster.
We should be consistent and implement the same discard behavior on all
hardware if we can.
The rules for HALT on Gen4-5 are a bit different from Gen6+. On the
older hardware, there is no stack for HALT; instead it's up to software
to save and restore mask registers. However, there's no real saving
needed since we only use HALT to jump to the end of the program where
we're about about to do our FB writes. All we need to do is reset AMask
to DMask, the value it was initialized to at the start of the thread.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5244>
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.
For IvyBridge, every (data cache) fence is accompained by a render
cache fence, that now is explicit in the IR, two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).
Because Begin and End interlock intrinsics are effectively memory
barriers, move its handling alongside the other memory barrier
intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish
if we are going to use a SENDC (for Begin) or regular SEND (for End).
This change is a preparation to allow emitting both SENDs in Gen11+
before we can stall on them.
Shader-db results for IVB (i965):
total instructions in shared programs: 11971190 -> 11971200 (<.01%)
instructions in affected programs: 11482 -> 11492 (0.09%)
helped: 0
HURT: 8
HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1
HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
95% mean confidence interval for instructions value: 0.66 1.84
95% mean confidence interval for instructions %-change: 0.01% 0.27%
Instructions are HURT.
Unlike the previous code, that used the `mov g1 g2` trick to force
both `g1` and `g2` to stall, the scheduling fence will generate `mov
null g1` and `mov null g2`. During review it was decided it was not
worth keeping the special codepath for the small effect will have.
Shader-db results for HSW (i965), BDW and SKL don't have a change
on instruction count, but do report changes in cycles count, showing
SKL results below
total cycles in shared programs: 341738444 -> 341710570 (<.01%)
cycles in affected programs: 7240002 -> 7212128 (-0.38%)
helped: 46
HURT: 5
helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362
HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
95% mean confidence interval for cycles value: -777.71 -315.38
95% mean confidence interval for cycles %-change: -1.42% -0.83%
Cycles are helped.
This seems to be the effect of allocating two registers separatedly
instead of a single one with size 2, which causes different register
allocation, affecting the cycle estimates.
while ICL also has not change on instruction count but report changes
negative changes in cycles
total cycles in shared programs: 352665369 -> 352707484 (0.01%)
cycles in affected programs: 9608288 -> 9650403 (0.44%)
helped: 4
HURT: 104
helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48
HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
95% mean confidence interval for cycles value: 256.67 523.24
95% mean confidence interval for cycles %-change: 0.63% 1.03%
Cycles are HURT.
AFAICT this is the result of the case above.
Shader-db results for TGL have similar cycles result as ICL, but also
affect instructions
total instructions in shared programs: 17690586 -> 17690597 (<.01%)
instructions in affected programs: 64617 -> 64628 (0.02%)
helped: 55
HURT: 32
helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2
HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
95% mean confidence interval for instructions value: -2.03 2.28
95% mean confidence interval for instructions %-change: -0.41% 0.15%
Inconclusive result (value mean confidence interval includes 0).
Now that more is done in the IR, more dependencies are visible and
more SWSB annotations are emitted. Mixed with different register
allocation decisions like above, some shaders will see more `sync
nops` while others able to avoid them.
Most of the new `sync nops` are also redundant and could be dropped,
which will be fixed in a separate change.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
So we can eventually remove the cycle count estimates from the CFG
data structure and consolidate performance information in the
brw::performance object.
It would be cleaner to pass the brw::performance object directly to
the disassembler but that isn't straightforward since the disassembler
is built as a plain C file unlike the rest of the compiler back-end.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL. In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Change brw_memory_fence to return the number of messages emitted, and
use that to update the send_count statistic in code generation.
This will fix the book-keeping for IVB since the memory fences will
result in two SEND messages.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4646>
The current workaround for this hardware bug involved marking the ADD
instruction used to initialize the address register as NoMask on
Gen12, which was based on the assumption that the problem was caused
by a hardware bug affecting the application of the execution mask to
the address register write.
However that doesn't seem to be the case: The address register write
was working correctly, the real problem leading to hangs on TGL is
that the indirect addressing logic is unable to deal with garbage
values in the address register (e.g. misaligned offsets), even for
channels which are currently inactive due to non-uniform control flow.
The current workaround isn't able to avoid that situation in general,
since the result of the NoMask ADD instruction for a dead channel is
calculated based on the corresponding (dead) component of the
indirect_byte_offset source, which would still be undefined in the
likely case that the source was initialized under control flow itself.
This would lead to hangs whenever MOV_INDIRECT was used under
non-uniform control flow in some scenarios like a tessellation shader
from GFXBench5/gl_4 (AKA Car Chase) on TGL. In addition I've managed
to reproduce the same issue on earlier platforms by initializing the
whole address register with garbage before the ADD instruction, so
this seems to be a long-standing issue we have avoided mostly by luck.
This patch fixes the problem and applies the workaround to all
platforms, since even when the hardware is able to deal with garbage
address values without hanging there might be a significant
performance cost from reading random GRF registers due to the useless
extra EU cycles spent fetching registers for dead channels and due to
the potential for unintended serialization with respect to other
random instructions that could be executed in parallel, which may have
had a cost of the order of hundreds of cycles in the worst case
scenario.
Fixes: f93dfb509c "intel/fs: Write the address register with NoMask for MOV_INDIRECT"
Tested-by: Rafael Antognolli <rafael.antognolli@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Passing shader_stats to the fs_generator constructor means that the
SIMD8 shader stats from the visitor (such as the scheduler mode) will be
reported out for the SIMD16/SIMD32 versions as well.
As you can see, we are now passing 'shader_stats' and 'stats' to
generate_code(), which is obviously odd looking. Ian rebased and
committed an old patch of mine which added the shader_stats struct on
July 30 in commit dabb5d4bee (i965/fs: Add a shader_stats struct.) and
shortly after on August 12 Jason added the brw_compile_stats struct in
commit 134607760a (intel/compiler: Fill a compiler statistics struct).
I'd like to combine the two, but I'm not sure how. shader_stats is an
input to generate_code() while brw_compile_stats is an output and is
only used by the Vulkan driver. Leave it as is for now...
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>