All of the other brw_*_desc functions take a devinfo parameter, and
each of them at least has an assert that uses it. Keep the parameter
here for consistency, but mark it as unused.
Silences 37 warnings like:
In file included from src/intel/common/gen_disasm.c:27:0:
src/intel/compiler/brw_eu.h: In function ‘brw_pixel_interp_desc’:
src/intel/compiler/brw_eu.h:377:53: warning: unused parameter ‘devinfo’ [-Wunused-parameter]
brw_pixel_interp_desc(const struct gen_device_info *devinfo,
^~~~~~~
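A minimal sketch of the fix; UNUSED is assumed to be the
__attribute__((unused)) wrapper from util/macros.h, and the trailing
parameters are elided here:
#include "util/macros.h"
static inline uint32_t
brw_pixel_interp_desc(UNUSED const struct gen_device_info *devinfo,
                      unsigned msg_type)
{
   /* devinfo is kept for API consistency but deliberately unused;
    * placeholder body, the real helper packs more fields. */
   return msg_type;
}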
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Adds support for INTEL_fragment_shader_ordering. We achieve
the fragment ordering by using the same instruction as for
beginInvocationInterlockARB(): issuing a memory fence via
sendc.
Signed-off-by: Kevin Rogovin <kevin.rogovin@intel.com>
Reviewed-by: Plamena Manolova <plamena.manolova@intel.com>
INTEL_DEBUG=hex prints a 32-bit hex value, and due to CPU endianness
the byte order is reversed. In order to disassemble binary files, print
each byte instead of a 32-bit hex value.
v2: Print blank spaces to vertically align the hex output of compacted
instructions with that of uncompacted instructions.
(Matt Turner)
v3: Fix line wrap at correct length
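A sketch of the byte-wise dump (names illustrative; compacted
instructions are 8 bytes, uncompacted 16):
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
static void
dump_hex(FILE *f, const uint8_t *insn, bool compacted)
{
   const unsigned len = compacted ? 8 : 16;
   /* Printing byte-by-byte in memory order keeps the output
    * independent of CPU endianness. */
   for (unsigned i = 0; i < len; i++)
      fprintf(f, "%02x ", insn[i]);
   /* v2: pad compacted instructions so the columns line up with
    * uncompacted ones. */
   if (compacted)
      fprintf(f, "%*s", 8 * 3, "");
}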
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
We have to be a bit careful with this one because we want it to run in
the optimization loop but only in the first brw_nir_optimize call.
Later calls assume that we've lowered away copy_deref instructions and
we don't want to introduce any more.
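A sketch of the gating, assuming an allow_copies flag threaded into
brw_nir_optimize() and using nir_opt_find_array_copies as a stand-in
for the pass in question; the surrounding loop is illustrative:
bool progress;
do {
   progress = false;
   if (allow_copies) {
      /* Safe only before copy_deref instructions are lowered away,
       * since this pass can introduce new ones. */
      NIR_PASS(progress, nir, nir_opt_find_array_copies);
   }
   /* ... rest of the optimization loop ... */
} while (progress);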
Shader-db results on Kaby Lake:
total instructions in shared programs: 15176942 -> 15176942 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
In spite of the lack of any shader-db improvement, this patch completely
eliminates spilling in the Batman: Arkham City tessellation shaders.
This is because we are now able to detect that the temporary array
created by DXVK for storing TCS inputs is a copy of the input arrays and
use indirect URB reads instead of making a copy of 4.5 KiB of input data
and then indirecting on it with if-ladders.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Shader-db results on Kaby Lake:
total instructions in shared programs: 15177605 -> 15176765 (<.01%)
instructions in affected programs: 4259 -> 3419 (-19.72%)
helped: 1
HURT: 0
total spills in shared programs: 10954 -> 10855 (-0.90%)
spills in affected programs: 295 -> 196 (-33.56%)
helped: 1
HURT: 0
total fills in shared programs: 22222 -> 22117 (-0.47%)
fills in affected programs: 417 -> 312 (-25.18%)
helped: 1
HURT: 0
The helped shader is from the Synmark OglCSDof test. On my Kaby Lake
laptop, the actual framerate of the benchmark didn't appear to improve
beyond the noise.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We call structure splitting once because it is guaranteed to split all
the structures in the entire shader in one go. We call array splitting
in the loop in case future optimizations turn indirects into direct
dereferences and we can split more arrays.
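A sketch of that ordering (the variable mode and surrounding loop are
illustrative):
/* One shot: splits every struct variable in the shader. */
NIR_PASS_V(nir, nir_split_struct_vars, nir_var_function_temp);
do {
   progress = false;
   /* Retried each iteration, in case new direct derefs appeared. */
   NIR_PASS(progress, nir, nir_split_array_vars, nir_var_function_temp);
   /* ... rest of the optimization loop ... */
} while (progress);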
Shader-db results on Kaby Lake:
total instructions in shared programs: 15177605 -> 15177605 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
This is unsurprising because nir_lower_vars_to_ssa already effectively
does structure and array splitting internally. It doesn't actually
split the variables, but its ability to reason about aliasing in the
presence of arrays and structures and pick out scalars or vectors to be
lowered to SSA values is fairly advanced.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
v2: Split changes to the message type field to another patch. Suggested
by Caio.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is necessary for a new Gen9 message type that will be added in the
next patch. There are also Gen8 message types that need the extra bit
(mostly for bindless).
v2: Split off from the next patch. Suggested by Caio.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Commit 4434591bf5 caused substantially more URB messages in
geometry and tessellation shaders. Before we can really enable this
sort of optimization, we either need some way of combining them back
together into vectors or we need to do cross-stage vector element
elimination without splitting everything into scalars.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107510
Fixes: 4434591bf5 "intel/nir: Call nir_lower_io_to_scalar_early"
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Tested-by: Mark Janes <mark.a.janes@intel.com>
Now that all the build scripts are compatible with both Python 2 and 3,
we can flip the switch and tell Meson to use the latter.
Since Meson already depends on Python 3 anyway, this means we don't need
two different Python stacks to build Mesa.
Signed-off-by: Mathieu Bridon <bochecha@daitauha.fr>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
When the SIMD16 Gen4-5 fragment shader payload contains source depth
(g2-3), destination stencil (g4), and destination depth (g5-6), the
single register of stencil makes the destination depth unaligned.
We were generating this instruction in the RT write payload setup:
mov(16) m14<1>F g5<8,8,1>F { align1 compr };
which is illegal: instructions with a source region spanning more than
one register need to be aligned to even registers. This is because the
hardware implicitly does (nr | 1) instead of (nr + 1) when splitting the
compressed instruction into two mov(8)'s.
I believe this would cause the hardware to load g5 twice, replicating
subspan 0-1's destination depth to subspan 2-3. This showed up as 2x2
artifact blocks in both TIS-100 and Reicast.
Normally, we rely on the register allocator to even-align our virtual
GRFs. But we don't control the payload, so we need to lower SIMD widths
to make it work. To fix this, we teach lower_simd_width about the
restriction, and then call it again after lower_load_payload (which is
what generates the offending MOV).
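An illustrative form of the new restriction (not the exact backend
code; names are hypothetical):
static unsigned
payload_lowered_width(unsigned src_grf, unsigned regs_read,
                      unsigned max_width)
{
   /* A compressed instruction whose source spans two GRFs must start
    * on an even GRF: the hardware derives the second half's register
    * as (nr | 1), not (nr + 1). Force SIMD8 halves otherwise. */
   if (regs_read > 1 && (src_grf & 1) != 0)
      return max_width < 8 ? max_width : 8;
   return max_width;
}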
Fixes: 8aee87fe4c (i965: Use SIMD16 instead of SIMD8 on Gen4 when possible.)
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107212
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=13728
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Tested-by: Diego Viola <diego.viola@gmail.com>
During code review, Jason pointed out that:
2b3064c073 "i965, anv: Use INTEL_DEBUG for disk_cache driver flags"
didn't account for the INTEL_SCALAR_* environment variables.
To fix this, let the compiler return the disk_cache driver flags.
Another possible fix would be to pull the INTEL_SCALAR_* into
INTEL_DEBUG bits, but as we are currently using 41 of 64 bits, I
didn't think it was a good use of 4 more of these bits. (5 since
INTEL_PRECISE_TRIG needs to be accounted for as well.)
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of just looking at the number of color attachments, look at
which ones are actually used by the subpass. This lets us potentially
throw away chunks of the fragment shader. In DXVK, for example, all
subpasses have 8 attachments and most are VK_ATTACHMENT_UNUSED so this
is very helpful in that case.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
The NIR nir_lower_io_arrays_to_elements pass attempts to split I/O
variables which are arrays or matrices into a sequence of separate
variables. This can help link-time optimization by allowing us to
remove varyings at a more granular level.
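A usage sketch at link time; producer and consumer are the two shaders
on either side of the interface being shrunk:
nir_lower_io_arrays_to_elements(producer, consumer);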
Shader-db results on Kaby Lake:
total instructions in shared programs: 15177645 -> 15168494 (-0.06%)
instructions in affected programs: 79857 -> 70706 (-11.46%)
helped: 392
HURT: 0
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Otherwise, only the first vec4 of a matrix or other complex type will
get marked as flat and we'll interpolate the others. This was caught by
a dEQP test which started failing because it did an SSO vs. non-SSO
comparison. Previously, we did the interpolation wrong consistently in
both versions. However, with one of Tim Arceri's NIR linking patches, we
started splitting the matrix input into vectors at link time in the
non-SSO version, and it started getting correctly interpolated, which
didn't match the broken SSO version. As of this commit, they both get
correctly interpolated.
Fixes: e61cc87c75 "i965/fs: Add a flat_inputs field to prog_data"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
The hardware doesn't support byte immediates, so similar to setup_imm_df()
for doubles, these helpers work by loading the constant value into a
VGRF.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The pass can create a temporary result for the instruction and then
move from it to the original destination. However, if the original
instruction was predicated, the MOV has to be predicated as well.
Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Until now, we had separate passes for lowering gl_PatchVerticesIn to
a statically known constant (for TES inputs when linked against a TCS),
and a uniform in the other cases. Annoyingly, one had to be run before
nir_lower_system_values, and the other afterward. This simplified the
passes, but made life painful for the callers.
This patch combines both into a single pass. If you give it a non-zero
static count, it uses that. If you give it Mesa state slots, it turns
it back into a built-in uniform. Otherwise, it does nothing.
This also moves the i965 uniform lowering out to shared code.
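A usage sketch matching that behavior (argument names illustrative):
/* A non-zero static_count wins; otherwise non-NULL state tokens select
 * the built-in uniform; otherwise the pass is a no-op. */
nir_lower_patch_vertices(nir, static_count, uniform_state_tokens);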
v2: Make token arrays const.
Reviewed-by: Eric Anholt <eric@anholt.net>
These are lowered by brw_nir_lower_vs_inputs(). If they weren't, we
would have already hit the unreachable() in emit_system_values_block().
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We need rounding modes on other conversions involving floats, and it is
easier to rename f2f16_undef than to rename all the other ones.
v2: rebased on master
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
The original pass only looked for load_uniform intrinsics but there are
a number of other places that could end up loading a push constant. One
obvious omission was images, which always implicitly use a push constant.
Legacy VS clip planes also get pushed into the shader. This fixes some
new Vulkan CTS tests that test random combinations of bindings and, in
particular, test lots of UBOs and images together.
Cc: mesa-stable@lists.freedesktop.org
Cc: Kenneth Graunke <kenneth@whitecape.org>
Explicitly convert to a signed integer. The conversion is valid since it
is the same one (implicitly) used to initialize the loop. Avoids the warning:
../../src/intel/compiler/brw_fs.cpp: In member function ‘bool fs_visitor::lower_simd_width()’:
../../src/intel/compiler/brw_fs.cpp:5761:45: warning: comparison of integer expressions of different signedness: ‘int’ and ‘unsigned int’ [-Wsign-compare]
split_inst.eot = inst->eot && i == n - 1;
~~^~~~~~~~
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
In commit 232ed89802 "i965/fs: Register allocator
shoudn't use grf127 for sends dest" we didn't take into account the case
of SEND instructions that are not send_from_grf. But on Gen7+, although
the backend still uses MRFs internally for sends, they are ultimately
assigned to GRFs.
In the case of unspills, the backend directly assigns the destination as
a source because it is supposed to be available, so we always have a
source-destination overlap. If the register allocator assigns registers
that include grf127, we fail the validation rule that affects Gen8+:
"r127 must not be used for return address when there is a src and dest
overlap in send instruction."
So this patch activates the grf127_send_hack_node for Gen8+, and if any
register is spilled we add interferences to the destinations of the
unspill operations.
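A sketch of the interference (node names illustrative):
/* Once anything has spilled, keep every unspill destination off g127
 * by making it interfere with the node pinned against that register. */
if (spilled_any_registers)
   ra_add_node_interference(g, unspill_dst_node, grf127_send_hack_node);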
We also need to ensure that the opt_bank_conflicts() optimization, which
runs after register allocation, doesn't move things around and cause
grf127 to be used in the situation we were avoiding.
Fixes the piglit test tests/spec/arb_compute_shader/linker/bug-93840.shader_test
and some shader-db crashes caused by the grf127 validation rule.
v2: make sure that opt_bank_conflicts() optimization doesn't change
the use of grf127. (Caio)
Found by Caio Marcelo de Oliveira Filho
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107193
Fixes: 232ed89802 "i965/fs: Register allocator shoudn't use grf127 for sends dest"
Cc: 18.1 <mesa-stable@lists.freedesktop.org>
Cc: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
There are a number of opcode_desc table entries for many of these
unused opcodes. A symbolic opcode enum will be required in a future
commit in order to keep them in the opcode description tables. The
alternative would be to remove the unused opcodes from the opcode
description tables.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This makes the message length available at the IR level, which should
save some guesswork in a future commit.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Constructing a descriptor in-place as part of the immediate of an ALU
instruction is no longer supported.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The return value is not used anymore. This allows simplifying the
code slightly, and in addition it should frustrate anybody's attempts
to continue using the obsolete piecemeal approach to construct a
message descriptor in combination with brw_send_indirect_message().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
All users of brw_send_indirect_surface_message() should be providing a
full descriptor immediate up front by now, so this isn't necessary
anymore.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The current approach of returning a setup instruction where additional
descriptor fields can be specified is still supported in order to keep
things working, but it will be removed later in this series.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This replaces brw_set_message_descriptor() with the composition of
brw_set_desc() and a new inline helper function that packs the common
message descriptor controls into an integer. The goal is to represent
all message descriptors as a 32-bit integer which is written at once
into the instruction, which is more flexible (SENDS, anyone?), more
robust (see d2eecf0b0b, which fixed an issue
ultimately caused by some bits of the extended message descriptor
being left undefined), and more future-proof than the current approach of
specifying the individual descriptor fields directly into the
instruction.
This approach also seems more self-documenting, since it will allow
removing calls to functions with way too many arguments like
brw_set_*_message() and brw_send_indirect_message(), and instead
providing a single descriptor argument constructed from an appropriate
combination of brw_*_desc() helpers.
Note that because brw_set_message_descriptor() was (conditionally?)
overriding fields of the instruction which strictly speaking weren't
part of the message descriptor, this involves calling
brw_inst_set_sfid() and brw_inst_set_eot() in some cases in addition
to brw_set_desc().
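A sketch of the common-controls helper; the bit positions shown follow
the Gen5+ descriptor layout and are illustrative rather than
authoritative:
static inline uint32_t
brw_message_desc(const struct gen_device_info *devinfo,
                 unsigned msg_length, unsigned response_length,
                 bool header_present)
{
   if (devinfo->gen >= 5) {
      /* mlen in desc[28:25], rlen in desc[24:20], header in desc[19]. */
      return SET_BITS(msg_length, 28, 25) |
             SET_BITS(response_length, 24, 20) |
             SET_BITS(header_present, 19, 19);
   } else {
      /* Pre-Gen5 packs the same controls at different offsets. */
      return SET_BITS(msg_length, 23, 20) |
             SET_BITS(response_length, 19, 16);
   }
}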
v2: Use SET_BITS macro instead of left shift (Ken).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Allows specifying a bitfield based on its upper and lower bounds
instead of a symbolic field definition, much as the existing
GET_BITS macro relates to GET_FIELD.
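An illustrative definition in the style of the existing INTEL_MASK and
GET_BITS macros (bounds are inclusive bit positions):
#define SET_BITS(value, high, low) (((value) << (low)) & INTEL_MASK(high, low))
For example, SET_BITS(msg_length, 28, 25) places msg_length in bits
28..25 of the descriptor.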
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This introduces helpers that can be used to specify or extract the
whole descriptor of a SEND message instruction at once. Because the
instruction encoding of these is rather awkward on some generations,
using the generic brw_inst.h macros doesn't seem like an option.
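A usage sketch, assuming the helpers land in brw_inst.h alongside the
other accessors:
/* Write or read the whole 32-bit descriptor in one go; the
 * generation-specific encoding is hidden inside the helpers. */
brw_inst_set_send_desc(devinfo, inst, desc);
desc = brw_inst_send_desc(devinfo, inst);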
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Until now we have assumed that we could skip emitting these barriers
in the general case, based on empirical testing and a few assumptions
detailed in a comment in the driver code. However, recent CTS tests
have shown that we actually need them to produce correct behavior.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>