The callers already have this value, and we would like to make it
follow different rules other than stage that might not be visible to
the helper function, so just pass explicitly.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9779>
v2: Restore the gen == 10 hunk in brw_compile_vs (around line 2940).
This function is also used for scalar VS compiles. Squash in:
intel/vec4: Reindent after removing Gen8+ support
intel/vec4: Silence unused parameter warning in try_immediate_source
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> [v1]
Reviewed-by: Matt Turner <mattst88@gmail.com> [v1]
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> [v1]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6826>
Shader instructions which use UIP/JIP now get formatted with a label
in addition with immediate value, labels have "LABEL%d" format.
v2: - Consider brw_jump_scale when calculating label's offset
From: "Lonnberg, Toni" <toni.lonnberg@intel.com>
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4245>
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.
For IvyBridge, every (data cache) fence is accompained by a render
cache fence, that now is explicit in the IR, two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).
Because Begin and End interlock intrinsics are effectively memory
barriers, move its handling alongside the other memory barrier
intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish
if we are going to use a SENDC (for Begin) or regular SEND (for End).
This change is a preparation to allow emitting both SENDs in Gen11+
before we can stall on them.
Shader-db results for IVB (i965):
total instructions in shared programs: 11971190 -> 11971200 (<.01%)
instructions in affected programs: 11482 -> 11492 (0.09%)
helped: 0
HURT: 8
HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1
HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
95% mean confidence interval for instructions value: 0.66 1.84
95% mean confidence interval for instructions %-change: 0.01% 0.27%
Instructions are HURT.
Unlike the previous code, that used the `mov g1 g2` trick to force
both `g1` and `g2` to stall, the scheduling fence will generate `mov
null g1` and `mov null g2`. During review it was decided it was not
worth keeping the special codepath for the small effect will have.
Shader-db results for HSW (i965), BDW and SKL don't have a change
on instruction count, but do report changes in cycles count, showing
SKL results below
total cycles in shared programs: 341738444 -> 341710570 (<.01%)
cycles in affected programs: 7240002 -> 7212128 (-0.38%)
helped: 46
HURT: 5
helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362
HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
95% mean confidence interval for cycles value: -777.71 -315.38
95% mean confidence interval for cycles %-change: -1.42% -0.83%
Cycles are helped.
This seems to be the effect of allocating two registers separatedly
instead of a single one with size 2, which causes different register
allocation, affecting the cycle estimates.
while ICL also has not change on instruction count but report changes
negative changes in cycles
total cycles in shared programs: 352665369 -> 352707484 (0.01%)
cycles in affected programs: 9608288 -> 9650403 (0.44%)
helped: 4
HURT: 104
helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48
HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
95% mean confidence interval for cycles value: 256.67 523.24
95% mean confidence interval for cycles %-change: 0.63% 1.03%
Cycles are HURT.
AFAICT this is the result of the case above.
Shader-db results for TGL have similar cycles result as ICL, but also
affect instructions
total instructions in shared programs: 17690586 -> 17690597 (<.01%)
instructions in affected programs: 64617 -> 64628 (0.02%)
helped: 55
HURT: 32
helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2
HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
95% mean confidence interval for instructions value: -2.03 2.28
95% mean confidence interval for instructions %-change: -0.41% 0.15%
Inconclusive result (value mean confidence interval includes 0).
Now that more is done in the IR, more dependencies are visible and
more SWSB annotations are emitted. Mixed with different register
allocation decisions like above, some shaders will see more `sync
nops` while others able to avoid them.
Most of the new `sync nops` are also redundant and could be dropped,
which will be fixed in a separate change.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
So we can eventually remove the cycle count estimates from the CFG
data structure and consolidate performance information in the
brw::performance object.
It would be cleaner to pass the brw::performance object directly to
the disassembler but that isn't straightforward since the disassembler
is built as a plain C file unlike the rest of the compiler back-end.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL. In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Mostly this uses util_is_power_of_two_or_zero, which has the same
behavior as _mesa_is_pow_two when the input is zero. In cases where the
value is known to be != 0 ahead of time I used the _nonzero variant as
it may be faster on some platforms.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
Change brw_memory_fence to return the number of messages emitted, and
use that to update the send_count statistic in code generation.
This will fix the book-keeping for IVB since the memory fences will
result in two SEND messages.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4646>
This make shader-db's report.py work on Haswell and earlier platforms.
The problem is that the script would detect the "sends" output for
scalar shaders and expect in in vec4 shaders too. When it didn't find
it, the script would fail with:
Traceback (most recent call last):
File "./report.py", line 351, in <module>
main()
File "./report.py", line 182, in main
before_count = before[p][m]
KeyError: 'sends'
Fixes: f192741ddd ("intel/compiler: Report the number of non-spill/fill SEND messages")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
This commit is all annoying plumbing work which just adds support for a
new brw_compile_stats struct. This struct provides a binary driver
readable form of the same statistics we dump out to stderr when we
INTEL_DEBUG is set with a shader stage.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
When dumping shader's assembly with INTEL_DEBUG=vs,tcs,...
sha1 of the resulting assembly is also printed, having environment
variable INTEL_SHADER_ASM_READ_PATH present driver will try to
load a "%sha1%.bin" file from the path and substitute current
assembly with the one from the file.
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Gen11 SLM is not on L3 anymore, so now the hardware has two separate
fences. Add a way to control which fence types to use.
At this time, we don't have enough information in NIR to control the
visibility of the memory being fenced, so for now be conservative and
assume that fences will need a stall. With more information later
we'll be able to reduce those.
Fixes Vulkan CTS tests in ICL:
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.workgroup.guard_local.buffer.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.image.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.image.guard_nonlocal.workgroup.comp
The whole set of supported tests in dEQP-VK.memory_model.* group
should be passing in ICL now.
v2: Pass BTI around instead of having an enum. (Jason)
Emit two SHADER_OPCODE_MEMORY_FENCE instead of one that gets
transformed into two. (Jason)
List tests fixed. (Lionel)
v3: For clarity, split the decision of which fences to emit from the
emission code. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
We set header_present but then pass it some random garbage. Give it g0
instead. I'm not actually sure this does anything but g0 is the usual
header data and this is what the windows driver does so it seems like a
good idea.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
For the W or UW (signed or unsigned word) source types, the 16-bit value
must be replicated in both the low and high words of the 32-bit
immediate value.
v2: Fix replication in other places as well
V3: fix a few nits (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
libintel_common depends on libintel_compiler, but it contains debug
functionality that is needed by libintel_compiler. Break the circular
dependency by moving gen_debug files to libintel_dev.
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The scalar back-end uses SHADER_OPCODE_SEND for all surface messages so
we no longer need the non-logical opcodes there. Prefix them VEC4 so
it's clear that they're only used by the vec4 back-end.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
For split indirect sends we have to put the EOT parameter in the
extended descriptor as well as the instruction itself so just calling
brw_inst_set_eot is insufficient. Moving the EOT handling handling into
the send_indirect_[split]_message helper lets us handle it properly.
The original idea was that the backend compiler could eliminate
surfaces, so we would have it mark which ones are actually used,
then shrink the binding table accordingly. Unfortunately, it's a
pretty blunt mechanism - it can only prune things from the end,
not the middle - since we decide the layout before we even start
the backend compiler, and only limit the size. It also basically
gives up if it sees indirect array access.
Besides, we do the vast majority of our surface elimination in NIR
anyway, not the backend - and I don't see that trend changing any
time soon. Vulkan abandoned this plan a long time ago, and I don't
use it in Iris, but it's still been kicking around in i965.
I hacked shader-db to print the binding table size in bytes, and
observed no changes with this patch. So, this code appears to do
nothing useful.
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes failure in the new piglit test
tes-patch-input-array-vec2-index-invalid-rd.shader_test.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: mesa-stable@lists.freedesktop.org
The current approach of returning a setup instruction where additional
descriptor fields can be specified is still supported in order to keep
things working, but it will be removed later in this series.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This replaces brw_set_message_descriptor() with the composition of
brw_set_desc() and a new inline helper function that packs the common
message descriptor controls into an integer. The goal is to represent
all message descriptors as a 32-bit integer which is written at once
into the instruction, which is more flexible (SENDS anyone?), robust
(see d2eecf0b0b fixing an issue
ultimately caused by some bits of the extended message descriptor
being left undefined) and future-proof than the current approach of
specifying the individual descriptor fields directly into the
instruction.
This approach also seems more self-documenting, since it will allow
removing calls to functions with way too many arguments like
brw_set_*_message() and brw_send_indirect_message(), and instead
provide a single descriptor argument constructed from an appropriate
combination of brw_*_desc() helpers.
Note that because brw_set_message_descriptor() was (conditionally?)
overriding fields of the instruction which strictly speaking weren't
part of the message descriptor, this involves calling
brw_inst_set_sfid() and brw_inst_set_eot() in some cases in addition
to brw_set_desc().
v2: Use SET_BITS macro instead of left shift (Ken).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Adds suppport for ARB_fragment_shader_interlock. We achieve
the interlock and fragment ordering by issuing a memory fence
via sendc.
Signed-off-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Since all of the fs_generator::generate_foo methods take a fs_inst * as
the first parameter, just remove the name to quiet the compiler.
src/intel/compiler/brw_fs_generator.cpp: In member function ‘void fs_generator::generate_barrier(fs_inst*, brw_reg)’:
src/intel/compiler/brw_fs_generator.cpp:743:41: warning: unused parameter ‘inst’ [-Wunused-parameter]
fs_generator::generate_barrier(fs_inst *inst, struct brw_reg src)
^~~~
src/intel/compiler/brw_fs_generator.cpp: In member function ‘void fs_generator::generate_discard_jump(fs_inst*)’:
src/intel/compiler/brw_fs_generator.cpp:1326:46: warning: unused parameter ‘inst’ [-Wunused-parameter]
fs_generator::generate_discard_jump(fs_inst *inst)
^~~~
src/intel/compiler/brw_fs_generator.cpp: In member function ‘void fs_generator::generate_pack_half_2x16_split(fs_inst*, brw_reg, brw_reg, brw_reg)’:
src/intel/compiler/brw_fs_generator.cpp:1675:54: warning: unused parameter ‘inst’ [-Wunused-parameter]
fs_generator::generate_pack_half_2x16_split(fs_inst *inst,
^~~~
src/intel/compiler/brw_fs_generator.cpp: In member function ‘void fs_generator::generate_shader_time_add(fs_inst*, brw_reg, brw_reg, brw_reg)’:
src/intel/compiler/brw_fs_generator.cpp:1743:49: warning: unused parameter ‘inst’ [-Wunused-parameter]
fs_generator::generate_shader_time_add(fs_inst *inst,
^~~~
src/intel/compiler/brw_vec4_generator.cpp: In function ‘void generate_set_simd4x2_header_gen9(brw_codegen*, brw::vec4_instruction*, brw_reg)’:
src/intel/compiler/brw_vec4_generator.cpp:1412:52: warning: unused parameter ‘inst’ [-Wunused-parameter]
vec4_instruction *inst,
^~~~
src/intel/compiler/brw_vec4_generator.cpp: In function ‘void generate_mov_indirect(brw_codegen*, brw::vec4_instruction*, brw_reg, brw_reg, brw_reg, brw_reg)’:
src/intel/compiler/brw_vec4_generator.cpp:1430:41: warning: unused parameter ‘inst’ [-Wunused-parameter]
vec4_instruction *inst,
^~~~
src/intel/compiler/brw_vec4_generator.cpp:1432:63: warning: unused parameter ‘length’ [-Wunused-parameter]
struct brw_reg indirect, struct brw_reg length)
^~~~~~
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
These days, we're just passing a pointer to a prog_data field, which
we already have access to. We can just use it directly.
(In the past, it was a pointer to a separate value.)
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
This makes sure that the header-present bit of the message descriptor
is in sync with the IR instruction fields, which gives the optimizer
more control to avoid the overhead of setting up a message header when
it's possible to do so.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This allows representing conditional mods and predicates on f1.0-f1.1
at the IR level by adding an extra bit to the flag_subreg
backend_instruction field.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The old code used an array to store each "instruction group" (the new,
better name than the old overloaded "annotation"), and required a
memmove() to shift elements over in the array when we needed to split a
group so that we could add an error message. This was confusing and
difficult to get right, not the least of which was because the array
has a tail sentinel not included in .ann_count.
Instead use a linked list, a data structure made for efficient
insertion.
Acked-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
I'm going to change the call in a later patch and with the difference in
indentation level it wasn't immediately obvious that the calls were
identical.
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>