This opcode is responsible for setting up the buffer base address and
per-thread scratch space fields of a scratch message header. For the
most part, it's a copy of g0, but some messages need us to zero out
g0.2 and the bottom bits of g0.5.
This may actually fix a bug when nir_load/store_scratch is used. The
docs say that the DWORD scattered messages respect the per-thread
scratch size specified in gN.3[3:0] in the message header but we've been
leaving it zero. This may mean that we've been ignoring any scratch
reads/writes from a load/store_scratch intrinsic above the 1KB mark.
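As a rough illustration of the header setup (the field positions come
from the text above, but the mask widths and encoding here are
assumptions for the sketch, not taken from the docs):

   #include <cstdint>
   #include <cstring>

   void build_scratch_header(const uint32_t g0[8], uint32_t hdr[8],
                             uint32_t per_thread_scratch_encoding)
   {
      std::memcpy(hdr, g0, 8 * sizeof(uint32_t)); /* mostly a copy of g0 */
      hdr[2] = 0;              /* some messages need g0.2 zeroed */
      hdr[5] &= ~0x3ffu;       /* ...and the bottom bits of g0.5 zeroed
                                * (10-bit mask is an assumption) */
      hdr[3] |= per_thread_scratch_encoding & 0xfu;
                               /* gN.3[3:0]: per-thread scratch size,
                                * previously left zero */
   }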
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
Scratch stores are lowered to instructions with side effects, which
are normally predicated on the sample mask and thus disabled in fs
helper invocations. However, scratch stores should stay enabled in
helper invocations, since they are produced from operations which
don't imply side effects.
To fix this, we move the decision of whether to predicate on the
sample mask to the point where the logical brw instructions are
created.
GLSL example of the issue:

   int tmp[1024];
   ...
   do {
      // changes to tmp
   } while (some_condition(tmp))
If `tmp` is lowered to scratch memory, `some_condition` would be
undefined if the scratch writes are predicated on the sample mask,
making it possible for the while loop to become infinite and hang the
GPU.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3256
Fixes: 53bfcdeecf
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6056>
We don't care about full IA coherency since we always have the
opportunity in GL or Vulkan to flush the data cache. Using IA-coherent
mode is likely just making A64 access slower than it needs to be.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4819>
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.
For IvyBridge, every (data cache) fence is accompanied by a render
cache fence, which is now explicit in the IR: two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).
Because the Begin and End interlock intrinsics are effectively memory
barriers, move their handling alongside the other memory barrier
intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish
whether we are going to use a SENDC (for Begin) or a regular SEND
(for End). This change is a preparation for emitting both SENDs in
Gen11+ before we can stall on them.
Shader-db results for IVB (i965):
total instructions in shared programs: 11971190 -> 11971200 (<.01%)
instructions in affected programs: 11482 -> 11492 (0.09%)
helped: 0
HURT: 8
HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1
HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
95% mean confidence interval for instructions value: 0.66 1.84
95% mean confidence interval for instructions %-change: 0.01% 0.27%
Instructions are HURT.
Unlike the previous code, which used the `mov g1 g2` trick to force a
stall on both `g1` and `g2`, the scheduling fence will generate `mov
null g1` and `mov null g2`. During review it was decided it was not
worth keeping the special codepath for the small effect it would have.
Shader-db results for HSW (i965), BDW and SKL show no change in
instruction count but do report changes in cycle counts; SKL results
are shown below:
total cycles in shared programs: 341738444 -> 341710570 (<.01%)
cycles in affected programs: 7240002 -> 7212128 (-0.38%)
helped: 46
HURT: 5
helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362
HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
95% mean confidence interval for cycles value: -777.71 -315.38
95% mean confidence interval for cycles %-change: -1.42% -0.83%
Cycles are helped.
This seems to be the effect of allocating two registers separately
instead of a single one with size 2, which causes different register
allocation, affecting the cycle estimates.
ICL also shows no change in instruction count but does report
negative changes in cycles:
total cycles in shared programs: 352665369 -> 352707484 (0.01%)
cycles in affected programs: 9608288 -> 9650403 (0.44%)
helped: 4
HURT: 104
helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48
HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
95% mean confidence interval for cycles value: 256.67 523.24
95% mean confidence interval for cycles %-change: 0.63% 1.03%
Cycles are HURT.
AFAICT this is the result of the case above.
Shader-db results for TGL show similar cycle results to ICL, but
instructions are also affected:
total instructions in shared programs: 17690586 -> 17690597 (<.01%)
instructions in affected programs: 64617 -> 64628 (0.02%)
helped: 55
HURT: 32
helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2
HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
95% mean confidence interval for instructions value: -2.03 2.28
95% mean confidence interval for instructions %-change: -0.41% 0.15%
Inconclusive result (value mean confidence interval includes 0).
Now that more is done in the IR, more dependencies are visible and
more SWSB annotations are emitted. Mixed with different register
allocation decisions like the above, some shaders will see more `sync
nop`s while others are able to avoid them.
Most of the new `sync nops` are also redundant and could be dropped,
which will be fixed in a separate change.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
This moves the fi_types to a new mesa_private.h and removes the
imports.c file. The vast majority of this patch is just removing
#includes of imports.h and fixing up the recursive includes.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
The interpretation of the fields is different depending on whether
the instruction is a SEND/MATH or not.
This fixes the disassembly output for non-SEND/MATH instructions that
have both in-order and out-of-order dependencies. Their dependencies
were wrongly represented as `@A $B` when the correct form would be
`@A $B.dst`.
Fixes: 6154cdf924 ("intel/eu/gen12: Add auxiliary type to represent SWSB information during codegen.")
Fixes: 83612c0127 ("intel/disasm/gen12: Disassemble software scoreboard information.")
Acked-by: Francisco Jerez <currojerez@riseup.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3660>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3660>
v2: Add a big comment explaining the [IU]SUB_SAT lowering. Suggested by
Caio.
v3: Use get_fpu_lowered_simd_width in get_lowered_simd_width. Suggested
by Ken on IRC.
v4: Fix a typo in a comment. Noticed by Caio.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
Like a SHADER_OPCODE_MEMORY_FENCE, but doesn't generate any assembly
code.
Will be used when the compiler shouldn't reorder certain instructions
but there's no need to generate code for the HW to do it -- as the
ordering will be guaranteed by other means.
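A CPU-side analogy (a GCC/Clang extension, not Mesa code): a barrier
the compiler must respect that emits no machine code, which is
essentially what this opcode is at the IR level.

   static inline void scheduling_fence(void)
   {
      /* Compiler reordering barrier: generates zero assembly, but
       * memory accesses may not be moved across it. */
      __asm__ volatile("" ::: "memory");
   }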
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
Having the IR opcodes locked to their hardware representation is risky
because it causes opcodes as different as BRC and IFF to compare equal
at the IR level (luckily the back-end only ever uses one opcode from
each group, right now), and it prevents us from supporting
instructions that change their hardware representation across
generations, which will become a problem on Gen12+ platforms.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This rewrites the current opcode description tables as a more compact
flat data structure. The purpose is to allow efficient constant-time
look-up by either HW or IR opcode, which will allow us to drop the
hard-coded correspondence between HW and IR opcodes -- See the next
commits for the rationale.
brw_eu.c is now built as C++ source so we can take advantage of
pointers to member in order to make the look-up function work
regardless of the opcode_desc member used as look-up key.
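A condensed sketch of the pointer-to-member trick (types are
simplified and a linear scan stands in for the real constant-time
tables):

   #include <algorithm>
   #include <cstdint>
   #include <vector>

   struct opcode_desc {
      uint16_t ir;      /* IR opcode */
      uint16_t hw;      /* hardware encoding */
      const char *name;
   };

   static const std::vector<opcode_desc> descs = {
      { 1, 0x01, "mov" },
      { 2, 0x40, "add" },
   };

   /* One function handles look-up by either key. */
   static const opcode_desc *
   lookup(uint16_t opcode_desc::*key, uint16_t value)
   {
      auto i = std::find_if(descs.begin(), descs.end(),
                            [&](const opcode_desc &d) {
                               return d.*key == value;
                            });
      return i == descs.end() ? nullptr : &*i;
   }

   /* lookup(&opcode_desc::ir, 2) and lookup(&opcode_desc::hw, 0x40)
    * both find "add". */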
v2: Optimize devinfo struct comparison (Caio)
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Before this commit, we had only the FPRoundingMode decoration (the
per-instruction one), which is applied during SPIR-V handling. In
vtn_alu we find out the rounding mode and generate the code
accordingly, which will later be used to look for the respective
nir_op_f2f16_{rtz,rtne}.
Per-instruction rounding modes get priority because we make them
explicit conversions (with RTZ or RTNE nir opcodes) and they will
override the default execution mode defined with float controls.
However, we need to come back to the mode defined by float controls
after the execution of the FP rounding instruction.
Therefore, the new SHADER_OPCODE_FLOAT_CONTROL_MODE opcode will be
used to set the default rounding mode and denorm treatment for the
whole shader, while the pre-existing SHADER_OPCODE_RND_MODE will be
used as the prioritized rounding mode on a per-instruction basis.
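The priority scheme is analogous to saving and restoring the dynamic
rounding mode on a CPU (an analogy only; this is not how the GPU
mechanism is implemented):

   #include <cfenv>
   /* #pragma STDC FENV_ACCESS ON would be needed for the conversion
    * below to reliably honor the dynamic rounding mode. */

   float convert_rtz(double x)
   {
      const int shader_default = fegetround(); /* float-controls mode */
      fesetround(FE_TOWARDZERO);       /* per-instruction RTZ wins */
      float r = static_cast<float>(x);
      fesetround(shader_default);      /* come back to the default */
      return r;
   }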
v2:
- Fix bug in defining BRW_CR0_FP_MODE_MASK.
v3:
- Update comment (Caio).
v4:
- Split the patch into the helper and the new opcode (this
one) (Caio).
v5:
- Add an explanation on the actual purpose and priority of the newly
introduced opcode in the commit log (Caio).
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Gen11 SLM is not on L3 anymore, so now the hardware has two separate
fences. Add a way to control which fence types to use.
At this time, we don't have enough information in NIR to control the
visibility of the memory being fenced, so for now be conservative and
assume that fences will need a stall. With more information later
we'll be able to reduce those.
Fixes Vulkan CTS tests in ICL:
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.workgroup.guard_local.buffer.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.image.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.image.guard_nonlocal.workgroup.comp
The whole set of supported tests in dEQP-VK.memory_model.* group
should be passing in ICL now.
v2: Pass BTI around instead of having an enum. (Jason)
Emit two SHADER_OPCODE_MEMORY_FENCE instead of one that gets
transformed into two. (Jason)
List tests fixed. (Lionel)
v3: For clarity, split the decision of which fences to emit from the
emission code. (Jason)
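A rough sketch of that v3 split (names are hypothetical; the real
logic lives in the brw fs backend):

   #include <cstdint>

   enum fence_bits {
      FENCE_DATA_CACHE = 1 << 0,
      FENCE_SLM        = 1 << 1, /* separate fence starting with Gen11 */
   };

   /* Decision: which fence types does this barrier need? */
   static uint32_t
   fences_needed(int gen, bool touches_slm, bool touches_global)
   {
      uint32_t f = 0;
      if (touches_global)
         f |= FENCE_DATA_CACHE;
      if (touches_slm)   /* pre-Gen11, SLM is behind the L3 fence */
         f |= (gen >= 11) ? FENCE_SLM : FENCE_DATA_CACHE;
      return f;
   }

   /* Emission then walks the mask separately, emitting one
    * SHADER_OPCODE_MEMORY_FENCE per set bit and conservatively
    * stalling on the results. */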
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: 1) Drop changes for vec4 backend as on Gen11+ we don't support
align16 mode (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
With 8 and 16-bit types, and anywhere we have to use non-trivial
register strides to deal with restrictions, we end up with things
that look like partial writes even though we don't care about any
values in the register except those written by that instruction.
This is particularly important when dealing with loops: liveness sees
is_partial_write, notes that an old value from a previous loop
iteration may still be valid at that point, and extends all purely
partially-written values to cover the entire loop.
This commit adds a new UNDEF instruction which does nothing (the
generator doesn't emit anything) but which does a fake write to the
register. This informs liveness that we don't care about any values
before that point so it won't consider those registers to be falsely
live. We can safely emit UNDEF instructions for all SSA values that
come in from NIR and nearly all temporaries generated by various stages
of the compiler. In particular, we need to insert UNDEF instructions
when we handle region restrictions because the newly allocated registers
are almost guaranteed to be partially written.
No shader-db changes.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110432
Reviewed-by: Matt Turner <mattst88@gmail.com>
We add two new texture sources for bindless surface and sampler handles.
Bindless surface handles are expected to be pre-shifted so that the
20-bit surface state table index is in the top 20 bits of the 32-bit
handle. This lets us avoid any extra shifts in the shader. Bindless
sampler handles are 32-byte aligned byte offsets from general state base
address. We use 32-byte aligned instead of 16-byte aligned to avoid
having to use more indirect messages than needed. It means we can't
tightly pack samplers but that's probably not a big deal.
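The arithmetic implied above, as a sketch (the shift amount follows
from 32 - 20 = 12; the helper names are made up):

   #include <cstdint>

   /* Pre-shift the 20-bit surface state table index into the top 20
    * bits of the 32-bit handle, so the shader needs no extra shift. */
   static inline uint32_t
   bindless_surface_handle(uint32_t sst_index)
   {
      return sst_index << 12;
   }

   /* Sampler handles are 32-byte aligned byte offsets from general
    * state base address. */
   static inline uint32_t
   bindless_sampler_handle(uint32_t offset_from_gsba)
   {
      return offset_from_gsba & ~31u;
   }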
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The scalar back-end uses SHADER_OPCODE_SEND for all surface messages,
so we no longer need the non-logical opcodes there. Prefix them with
VEC4 so it's clear that they're only used by the vec4 back-end.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The unused typed surface read/write support in the vec4 back-end has
been dropped and the fs back-end now uses SHADER_OPCODE_SEND for all
image and buffer ops. There's no reason to keep these opcodes around
anymore.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Since switching to SHADER_OPCODE_SEND for image operations, we no longer
need the non-logical opcode.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
It makes more sense to start at the surface, then move on to the
address, and then the data. Also, this is a really good test of
whether or not we got all the places that use the sources by explicit
integer number.
If you pass a bool in as the value to set, the C standard says that it
gets converted to an int prior to shifting. If you try to set a bool to
bit 31, this lands you in undefined behavior. It's better just to add
the explicit cast and let the compiler delete it for us.
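A minimal demonstration of the hazard described above (the C rules;
C++ before C++14 behaves the same way):

   #include <cstdint>

   uint64_t set_bit_bad(bool v, unsigned bit)
   {
      /* bool promotes to (signed) int before the shift; for bit == 31,
       * 1 << 31 overflows int, which the C standard leaves undefined. */
      return v << bit;
   }

   uint64_t set_bit_good(bool v, unsigned bit)
   {
      /* The explicit cast makes the shift happen in an unsigned type;
       * the compiler deletes the cast since v is already 0 or 1. */
      return (uint64_t)v << bit;
   }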
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
These are broken on a future platform, but it turns out we don't need
to fix them, since they're just type-converting moves with a strided
source. Kill them.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
We have to lower some shadow instructions because they don't exist in
hardware and we have to lower txb+offset+clamp because the message gets
too big and we run into the sampler message length limit of 11 regs.
Acked-by: Ian Romanick <ian.d.romanick@intel.com>
v2: Split changes to the message type field to another patch. Suggested
by Caio.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
There are a number of opcode_desc table entries for many of these
unused opcodes. A symbolic opcode enum will be required in a future
commit in order to keep them in the opcode description tables. The
alternative would be to remove the unused opcodes from the opcode
description tables.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Allows specifying a bitfield based on its upper and lower bounds
instead of a symbolic field definition, similar to what the current
GET_BITS macro is to GET_FIELD.
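A sketch of the bounds-based flavor (macro names and details are
illustrative, not Mesa's exact definitions):

   /* Mask covering bits [high:low] of a 32-bit word. */
   #define FIELD_MASK(high, low) \
      ((0xffffffffu >> (31 - (high))) & ~((1u << (low)) - 1u))

   /* Bounds-based: bit positions are given directly at the use site. */
   #define SET_BITS(value, high, low) \
      (((value) << (low)) & FIELD_MASK((high), (low)))

   /* The symbolic-field variant, by contrast, takes its bounds from a
    * field definition such as #define FOO_HIGH 7 / #define FOO_LOW 4. */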
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Adds support for ARB_fragment_shader_interlock. We achieve the
interlock and fragment ordering by issuing a memory fence via sendc.
Signed-off-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
The only reason it was its own opcode was so that we could detect it
and adjust the source register based on the payload setup. Now that
we're using the ATTR file for FS inputs, there's no point in having a
magic opcode for this.
v2 (Jason Ekstrand):
- Break the bit which removes the CINTERP opcode into its own patch
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
NIR has code to lower these away for us but we can do significantly
better in many cases with register regioning and SIMD4x2.
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>