This greatly simplifies the next patch.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Adds support for INTEL_fragment_shader_ordering. We achieve
the fragment ordering with the same instruction used for
beginInvocationInterlockARB(), that is, by issuing a memory
fence via sendc.
Signed-off-by: Kevin Rogovin <kevin.rogovin@intel.com>
Reviewed-by: Plamena Manolova <plamena.manolova@intel.com>
v2: Split changes to the message type field to another patch. Suggested
by Caio.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The hardware doesn't support byte immediates, so similar to setup_imm_df()
for doubles, these helpers work by loading the constant value into a
VGRF.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
These are lowered by brw_nir_lower_vs_inputs(). If they weren't, we
would have already hit the unreachable() in emit_system_values_block().
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We need rounding modes on other conversions involving floats, and it is
easier to rename f2f16_undef than to rename all the other ones.
v2: rebased on master
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Until now we have assumed that we could skip emitting these barriers
in the general case, based on empirical testing and a few assumptions
detailed in a comment in the driver code. However, recent CTS tests
have shown that we actually need them to produce correct behavior.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
In 6f5abf3146 this code was fixed to calculate the maximum size of
an attribute in a separate pass and then allocate the registers to
that size. However, this wasn’t taking into account ranges that overlap
but don’t have the same starting location. For example:
layout(location = 0, component = 0) out float a[4];
layout(location = 2, component = 1) out float b[4];
Previously, if ‘a’ was processed first then it would allocate a
register of size 4 for location 0 and it wouldn’t allocate another
register for location 2 because it would already be covered by the
range of 0. Then if something tries to write to b[2] it would try to
write past the end of the register allocated for ‘a’ and it would hit
an assert.
This patch changes it to scan for any overlapping ranges that start
within each range to calculate the maximum extent and allocate that
instead.
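The scan described above can be sketched in plain C. This is only a
model, not the driver code: sizes[], MAX_SLOTS and merged_extent() are
hypothetical names, and sizes[loc] is assumed to hold the largest slot
count of any variable starting at loc (0 if none starts there).

```c
#include <assert.h>

#define MAX_SLOTS 32 /* hypothetical upper bound on output locations */

/* Return the number of slots to allocate for the range starting at loc,
 * growing it to cover any range that starts inside it and extends past
 * its current end. */
unsigned merged_extent(const unsigned sizes[MAX_SLOTS], unsigned loc)
{
    assert(loc < MAX_SLOTS);
    unsigned extent = sizes[loc];

    /* extent may grow while we iterate, so newly covered slots are
     * scanned as well. */
    for (unsigned i = 1; i < extent && loc + i < MAX_SLOTS; i++) {
        unsigned end = i + sizes[loc + i];
        if (end > extent)
            extent = end;
    }
    return extent;
}
```

For the example above (a[4] at location 0 and b[4] at location 2), the
model yields an extent of 6 slots for location 0 rather than 4, so a
write to b[2] stays inside the allocation.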
Fixed Piglit’s arb_enhanced_layouts/execution/component-layout/
vs-fs-array-interleave-range.shader_test
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Fixes: 6f5abf3146 "i965: Fix output register sizes when multiple variables
share a slot."
This reworks INTERPOLATE_AT_PER_SLOT_OFFSET to work more like an ALU
operation and less like a send. This is less code overall and, as a
side-effect, it now properly handles execution groups and lowering so
SIMD32 support just falls out.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Current discard handling requires dedicating the second flag register to
discard. However, control-flow in SIMD32 requires both flag registers
so it's incompatible with the current discard handling. Just don't
support SIMD32+discard for now.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
The hardware's control flow logic is 16-wide so we're out of luck
here. We could, in theory, support SIMD32 if we know the control-flow
is uniform but we don't have that information at this point.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Since we had to rewrite the deref walking loop anyway, I took the
opportunity to make it a bit clearer and more efficient. In particular,
in the AoA case, we will now emit one minmax instead of one per array
level.
Acked-by: Rob Clark <robdclark@gmail.com>
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The previous use of shuffle_32bit_load_result_to_64bit_data had a
source/destination overlap in the 64-bit case. A temporary destination
is now used for 64-bit cases so that shuffle_from_32bit_read, which
doesn't handle src/dst overlaps, can be used.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Previously, the shuffle function had a source/destination overlap, which
must be avoided in order to use shuffle_from_32bit_read. We avoid it by
using the destination of the removed MOVs as the shuffle destination.
This change also avoids the internal MOVs done by the previous shuffle
to deal with possible overlaps.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
shuffle_from_32bit_read handles 32-bit reads to a 32-bit destination
in the same way as the previous loop, so now we just call the new
function for all bit sizes, which also simplifies the 64-bit load_input.
v2: Add comment about future 16-bit support (Jason Ekstrand)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This implementation avoids two unneeded MOVs for each 64-bit
component. One was done in the old shuffle to handle src/dst
overlap, which cannot happen here. The other, the MOV removed
here, duplicated work already done by the shuffle.
Copy propagation wasn't able to remove them because shuffle
destination values are defined with partial writes because they
have stride == 2.
v2: Reword commit log summary (Jason Ekstrand)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
do_untyped_vector_read is used by load_ssbo and load_shared.
The previous MOVs are removed because shuffle_from_32bit_read
can store the shuffle results in the expected destination
simply by using the proper offset.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Using shuffle_from_32bit_read instead of the 16-bit shuffle functions
avoids the need to retype. At the same time, the new functions are
ready for 8-bit SSBO reads.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
These new shuffle functions deal with the shuffle/unshuffle operations
needed for read/write operations using 32-bit components when the
read/written components have a different bit-size (8, 16, 64-bits).
Shuffle from 32-bit to 32-bit becomes a simple MOV.
shuffle_src_to_dst takes care of doing a shuffle when source type is
smaller than destination type and an unshuffle when source type is
bigger than destination. So these new read/write functions just need
to call shuffle_src_to_dst assuming that writes use a 32-bit
destination and reads use a 32-bit source.
As shuffle_for_32bit_write/shuffle_from_32bit_read take components
in units of the source/destination types and shuffle_src_to_dst takes
units of the smallest type component, we adjust the components and
first_component parameters.
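A minimal sketch of that unit adjustment, assuming the other side of
the operation is always 32-bit. The struct and the helper name
to_smallest_units() are hypothetical and only model the arithmetic;
they are not the driver's API.

```c
#include <assert.h>

struct span {
    unsigned first_component;
    unsigned components;
};

/* Convert a (first_component, components) pair expressed in units of a
 * type_size-byte type into units of the smaller of that type and the
 * 32-bit type on the other side of the read/write. */
struct span to_smallest_units(unsigned type_size,
                              unsigned first_component,
                              unsigned components)
{
    assert(type_size == 1 || type_size == 2 ||
           type_size == 4 || type_size == 8);

    struct span s = { first_component, components };

    /* Only types wider than 32 bits need scaling: each of their
     * components spans type_size / 4 32-bit components. For 8, 16 and
     * 32-bit types the caller's type is already the smallest one. */
    if (type_size > 4) {
        s.first_component *= type_size / 4;
        s.components *= type_size / 4;
    }
    return s;
}
```

For example, two 64-bit components starting at component 1 become four
32-bit components starting at component 2, while a 16-bit request is
passed through unchanged.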
To enable these new functions, there must be no source/destination
overlap in the case of shuffle_from_32bit_read. That never happens
with shuffle_for_32bit_write, as it allocates a new destination
register, as shuffle_64bit_data_for_32bit_write did.
v2: Reword commit log and add comments to explain why first_component
and components parameters are adjusted. (Jason Ekstrand)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This new function takes care of shuffling/unshuffling components of a
particular bit-size into components with a different bit-size.
If source type size is smaller than destination type size the operation
needed is a component shuffle. The opposite case would be an unshuffle.
Component units are measured in terms of the smaller of the source and
destination types, as we are un/shuffling the smaller components
from/into a bigger one.
The operation allows skipping the first first_component components of
the source.
Shuffle MOVs are retyped using integer types, avoiding problems with
denorms and float types when the source and destination bit sizes differ.
This simplifies users of the shuffle functions that currently deal with
these retypes individually.
There is now a new restriction: source and destination can no longer
overlap when calling this shuffle function. The following patches that
migrate to this new function will each take care of avoiding source
and destination overlaps.
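The shuffle direction (small source components packed into a bigger
destination type) can be modelled on plain arrays. This is only a
sketch under assumptions: registers are modelled as component-major
arrays with `width` channels per component, and shuffle_32_to_64() is
a hypothetical name, not the driver function. It uses integer
operations only, and `restrict` expresses the no-overlap requirement.

```c
#include <stdint.h>

/* Pack 32-bit source components into 64-bit destination components,
 * channel by channel. first_component and components are counted in
 * 32-bit (smaller type) units. Source component i + first_component
 * lands in half (i % 2) of destination component (i / 2). */
void shuffle_32_to_64(uint64_t *restrict dst, const uint32_t *restrict src,
                      unsigned width, unsigned first_component,
                      unsigned components)
{
    for (unsigned i = 0; i < components; i++) {
        unsigned shift = 32 * (i % 2);
        uint64_t mask = (uint64_t)0xffffffffu << shift;

        for (unsigned ch = 0; ch < width; ch++) {
            uint64_t *d = &dst[(i / 2) * width + ch];
            uint64_t v = src[(i + first_component) * width + ch];

            /* Replace only the selected 32-bit half of the element. */
            *d = (*d & ~mask) | (v << shift);
        }
    }
}
```

The unshuffle direction is the mirror image: each half of a bigger
source element is written to its own smaller destination component.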
v2: (Jason Ekstrand)
- Rewrite overlap asserts.
- Manage type_sz(src.type) == type_sz(dst.type) case using MOVs
from source to dest. This also works for 64-bit to 64-bit
operations on Gen7, which doesn't support Q registers.
- Explain that component units are based on the smallest type.
v3: - Fix unshuffle overlap assert (Jason Ekstrand)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Adds support for ARB_fragment_shader_interlock. We achieve
the interlock and fragment ordering by issuing a memory fence
via sendc.
Signed-off-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
The only reason it was its own opcode was so that we could detect it
and adjust the source register based on the payload setup. Now that
we're using the ATTR file for FS inputs, there's no point in having a
magic opcode for this.
v2 (Jason Ekstrand):
- Break the bit which removes the CINTERP opcode into its own patch
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
This replaces the special magic opcodes which implicitly read inputs
with explicit use of the ATTR file.
v2 (Jason Ekstrand):
- Break into multiple patches
- Change the units of the FS ATTR to be in logical scalars
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
v2 (Jason Ekstrand):
- Break the refactor into its own patch
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
This rolls back the revert of this same patch introduced in
commit 7b9c15628a.
It also squashes the following patch to prevent a piglit regression
caused by this change:
intel/compiler: Fix lower_conversions for 8-bit types.
Author: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
For 8-bit types the execution type is word. A byte raw MOV has 16-bit
execution type and 8-bit destination and it shouldn't be considered
a conversion case. So there is no need to change alignment and enter
in lower_conversions for these instructions.
Fixes a regression in the piglit test "glsl-fs-shader-stencil-export"
that is introduced with this patch from the Vulkan shaderInt16 series:
'i965/compiler: handle conversion to smaller type in the lowering
pass for that'. The problem is caused because there is already a case
in the driver that injects Byte instructions like this:
mov(8) g127<1>UB g2<32,8,4>UB
And the aforementioned pass was not accounting for the special
handling of the execution size of Byte instructions. This patch
fixes this.
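A rough model of the check described above, for illustration only. The
instruction struct and enums are hypothetical stand-ins for the
compiler's IR, and treating "raw" as "same type, no saturate, no source
modifiers" is an assumption about the intended definition.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced model of an instruction. */
enum opcode { OP_MOV, OP_ADD };
enum reg_type { TYPE_B, TYPE_UB, TYPE_W, TYPE_UW, TYPE_D, TYPE_UD };

struct inst {
    enum opcode op;
    enum reg_type dst_type;
    enum reg_type src_type;
    bool saturate;
    bool src_negate;
    bool src_abs;
};

/* Size in bytes of a register type in this model. */
unsigned type_bytes(enum reg_type t)
{
    switch (t) {
    case TYPE_B:
    case TYPE_UB:
        return 1;
    case TYPE_W:
    case TYPE_UW:
        return 2;
    default:
        return 4;
    }
}

/* A byte raw MOV copies bytes unchanged: byte-sized destination, source
 * of the very same type (so B <-> UB does not count), and nothing that
 * would alter the value on the way. */
bool is_byte_raw_mov(const struct inst *inst)
{
    return inst->op == OP_MOV &&
           type_bytes(inst->dst_type) == 1 &&
           inst->src_type == inst->dst_type &&
           !inst->saturate &&
           !inst->src_negate &&
           !inst->src_abs;
}
```

In this model the UB to UB MOV quoted above is a raw MOV and is left
alone by the lowering, while a B to UB MOV or a D to D MOV is not.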
v2: (Jason Ekstrand)
- Simplify is_byte_raw_mov, include reference to PRM and not
consider B <-> UB conversions as raw movs.
v3: (Matt Turner)
- Indentation style fixes.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106393
Tested-by: Mark Janes <mark.a.janes@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
These are subject to the general restriction that anything that is converted
to 64-bit needs to be aligned to 64-bit. We had this already in place for
32-bit to 64-bit conversions, so this patch generalizes the implementation
to take effect on any conversion to 64-bit from a source smaller than
64-bit.
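One way to read the alignment requirement, sketched in C: each source
element is placed at the start of its own 64-bit slot, which gives the
source an element stride of 8 / sizeof(source type). The helper name is
hypothetical, and the sketch assumes this is the layout the patch
produces.

```c
#include <assert.h>

/* Stride, in units of the source type, that places every source element
 * at the start of a 64-bit (8-byte) slot. */
unsigned src_stride_for_64bit_conversion(unsigned src_type_size)
{
    assert(src_type_size == 1 || src_type_size == 2 ||
           src_type_size == 4);
    return 8 / src_type_size;
}
```

With this model a 32-bit source uses stride 2, as in the existing
32-bit to 64-bit handling, and a 16-bit source uses stride 4.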
Fixes assembly validation errors in the following CTS tests in BSW:
dEQP-VK.spirv_assembly.instruction.compute.sconvert.int16_to_int64
dEQP-VK.spirv_assembly.instruction.compute.uconvert.uint16_to_uint64
dEQP-VK.spirv_assembly.instruction.compute.sconvert.int16_to_uint64
Tested-by: Mark Janes <mark.a.janes@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
NIR assumes that booleans are always 32-bit, but Intel hardware produces
16-bit booleans for 16-bit comparisons. This means that we need to convert
the 16-bit result to 32-bit.
In the future we want to add an optimization pass to clean this up and
hopefully remove the conversions.
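As an illustration of the conversion, assuming booleans are represented
as 0 for false and all-ones for true at both bit sizes (an assumption of
this sketch), widening is a sign-extending integer conversion. The
function name is hypothetical.

```c
#include <stdint.h>

/* Widen a 16-bit boolean (0 or all-ones) to a 32-bit boolean. Sign
 * extension keeps all-ones as all-ones and 0 as 0. */
int32_t widen_bool16(int16_t b16)
{
    return (int32_t)b16;
}
```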
v2 (Jason): use the type of the source for the temporary and use
brw_reg_type_from_bit_size for the conversion to 32-bit.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The lowering pass was specialized to act on 64-bit to 32-bit conversions only,
but the implementation is valid for other cases.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We need to use 16-bit constants with 16-bit instructions,
otherwise we get the following validation error:
"Destination stride must be equal to the ratio of the sizes of
the execution data type to the destination type"
Because the execution data type is 4B due to the 32-bit integer
constant.
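The rule quoted in the validation error can be worked through with a
small C helper. This is a simplified model: the function name is
hypothetical, and the execution type is approximated here as the widest
source type.

```c
/* Destination stride (in destination elements) required by the rule:
 * size of the execution type divided by size of the destination type.
 * All sizes are in bytes. */
unsigned required_dst_stride(unsigned dst_size,
                             unsigned src0_size, unsigned src1_size)
{
    unsigned exec_size = src0_size > src1_size ? src0_size : src1_size;

    /* The ratio only matters when the execution type is wider than
     * the destination; otherwise a packed destination is fine. */
    return exec_size > dst_size ? exec_size / dst_size : 1;
}
```

A 16-bit instruction with a 32-bit immediate has a 4B execution type,
so a 2B destination would need stride 2 and a packed destination fails
validation. With a 16-bit immediate the required stride is 1.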
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The Vertex Elements are now:
* VE 1: <BaseVertex/firstvertex, BaseInstance, VertexID, InstanceID>
* VE 2: <DrawID, is-indexed-draw, 0, 0>
VE1 is kept as it was before; VE2 additionally contains the new
system value.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
All operations on offset_reg in do_vector_read are done
with the UD type, so copy propagation was not working through
the generated MOVs:
mov(8) vgrf9:UD, vgrf7:D
This change allows removing the MOV generated for reading the
first components for 16-bit and 64-bit ssbo reads with
non-constant offsets.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>