In order to implement the ballot intrinsic, we do a MOV from the flag
register to some GRF. If that GRF is used in a SEL, cmod propagation
helpfully changes it into a MOV from the flag register with a cmod.
This is perfectly valid, but when lower_simd_width comes along, it simply
splits the instruction into two halves which both have conditional
modifiers. This is a problem because the instruction also reads the flag
register through its source, so the halves end up clobbering flag bits
that still need to be read. This commit makes us check whether
flags_written() overlaps with the flag values we read via the instruction
source and, if there is any interference, forces us to emit a copy of the
source.
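The shape of the check is roughly the following (flag_bits_of() is a
made-up placeholder for however the flag bits read through a source are
computed; flags_written() is the existing helper named above):

   static bool
   src_interferes_with_flags(const fs_inst *inst, unsigned i)
   {
      /* If the flag bits written by the instruction (via its cmod)
       * overlap the flag bits read through src[i], the SIMD-lowered
       * halves must read a copy of the source instead. */
      return (inst->flags_written() & flag_bits_of(inst->src[i])) != 0;
   }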
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
Also needed in freedreno/ir3.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
gcc is throwing this warning in my meson build:
../src/intel/compiler/brw_eu_validate.c:50:11: warning: argument 1 null where non-null expected [-Wnonnull]
return memmem(haystack.str, haystack.len,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
needle.str, needle.len) != NULL;
~~~~~~~~~~~~~~~~~~~~~~~
The first check for CONTAINS has a NULL error_msg.str and 0 len. The
glibc implementation will exit without looking at any haystack bytes if
haystack.len < needle.len, so this was safe, but silence the warning
anyway by guarding against implementation variability.
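The guard can take roughly this shape (a sketch against the contains()
helper the warning points at, not the exact patch):

   static bool
   contains(const struct string haystack, const struct string needle)
   {
      /* Bail out before calling memmem() when the haystack is empty, so
       * we never pass a null pointer where non-null is expected. */
      if (haystack.str == NULL || haystack.len == 0)
         return false;

      return memmem(haystack.str, haystack.len,
                    needle.str, needle.len) != NULL;
   }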
Fixes: 122ef3799d ("i965: Only insert error message if not already present")
Reviewed-by: Matt Turner <mattst88@gmail.com>
Align1 mode offers some nice features over align16, like access to more
data types and the ability to use a 16-bit immediate. This patch does
not start using any new features. It just emits ternary instructions in
align1 mode.
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
Put hw_ in the name so that it's clear these are the hardware encodings.
Similar to commit 9fb8323328 ("i965: Rename brw_inst's functions that
access the register type")
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
The instruction word contains SubRegNum[4:2] so it's in units of dwords
(hence the * 4 to get it in terms of bytes). Before this patch, the
subreg would have been wrong for DF arguments.
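In code terms the decode is just (accessor name assumed for
illustration):

   /* The field stores SubRegNum[4:2], i.e. a dword index, so scale by 4
    * to get a byte offset. */
   unsigned subreg_bytes = brw_inst_3src_src0_subreg_nr(devinfo, inst) * 4;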
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
I'm going to call this from brw_inst.h, and I don't want to have to
include all of brw_reg.h.
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
It is already done in NIR.
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
It is already done in NIR.
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Commit a73116ecc6 tried to make add_barrier_deps()
walk to the next barrier, and stop. To accomplish that, it added an
is_barrier flag. Unfortunately, this only works half of the time.
The issue is that add_barrier_deps() walks both backward (to the
previous barrier), and forward (to the next barrier). It also sets
is_barrier. Assuming that we're processing instructions in forward
order, this means that is_barrier will be set for previous instructions,
but not future ones. So the forward walk never sees one, and goes further
than it needs to.
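For reference, the forward half of the walk looks roughly like this
(member and helper names assumed, not the exact code):

   /* Walk forward to the next barrier.  is_barrier has only been set on
    * nodes we already processed, so this check never fires for future
    * instructions and the loop runs all the way to the end. */
   for (schedule_node *next = (schedule_node *)n->next;
        !next->is_tail_sentinel();
        next = (schedule_node *)next->next) {
      add_dep(n, next, 0);
      if (next->is_barrier)
         break;
   }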
dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.23
now compiles its shaders in 3.6 seconds instead of 3.3 minutes.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Pallavi G <pallavi.g@intel.com>
This eliminates a layer of wrapping, and makes a backend_instruction
sufficient. The downside is that it exposes 'eot' to the vec4 backend,
which doesn't need it but can simply ignore it.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Pallavi G <pallavi.g@intel.com>
This is a lot more natural than special casing it all over the place.
We still have to do a bit of special-casing in assign_constant_locations,
but it's not nearly as bad as it was before.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Now that everything is nicely ralloc'd, we can allocate the pull_param
array in assign_constant_locations instead of higher up. We can also
re-allocate the param array so that it's exactly the needed size. This
should save us some memory because we're not allocating the total needed
param space for both push and pull.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Now that we're always growing the param array as-needed, we can
allocate the param array in common code and stop repeating the
allocation everywhere. In order to keep things sane, we ralloc the
[pull_]param array off of the compile context and then steal it back
to a NULL context later. This doesn't get us all the way to where
prog_data::[pull_]param is purely an out parameter of the back-end
compiler but it gets us a lot closer.
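The ownership dance looks roughly like this (a sketch; the context and
element types vary by stage and are assumed here):

   /* Allocate off the compile context so nothing leaks if we bail
    * mid-compile... */
   prog_data->param = rzalloc_array(mem_ctx, const gl_constant_value *,
                                    nr_params);
   /* ...then hand ownership to the caller by stealing the array back to
    * the NULL ralloc context once compilation is done. */
   ralloc_steal(NULL, prog_data->param);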
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of requiring the caller of brw_compile_vs to figure it out, just
grow the param array on-demand.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of making the caller of brw_compile_cs add something to the
param array for thread_local_id_index, just add it on-demand in
brw_nir_intrinsics and grow the array. This is now safe to do because
everyone uses ralloc for prog_data::param.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
It's already only ever called from brw_compile_cs and only handles
compute intrinsics. Let's just make it CS-specific. We can always
make it handle other stages again later if we want.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The Vulkan driver does not support pull constants. It simply limits
things such that we can always push everything. Previously, we were
determining whether or not to push things based on whether or not the
prog_data::pull_param array is non-null. This is rather hackish and
about to stop working.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This burns an extra 10k of memory or so in the case where you don't have
any images. However, if you have several shaders which use images, this
should be much less memory. It also gets rid of a part of prog_data
that really has nothing to do with the compiler.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This moves us away from the array-of-pointers model and onto a model where
each param is represented by a generic uint32_t handle. We reserve 2^16
of these handles for builtins that get generated somewhere inside the
compiler and have well-defined meanings. Generic params have handles
whose meanings are defined by the driver.
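Purely as an illustration of the split (macro names invented for this
sketch, not mesa's actual definitions):

   /* Handles below 2^16 are compiler-generated builtins with fixed,
    * well-defined meanings; anything above that range is a generic,
    * driver-defined param. */
   #define PARAM_BUILTIN_MAX   0xffffu
   #define PARAM_IS_BUILTIN(handle)  ((handle) <= PARAM_BUILTIN_MAX)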
The primary downside to this new approach is that it moves a little bit
of the work that we would normally do at compile time to draw time. On
my laptop this hurts OglBatch6 by no more than 1% and doesn't seem to
have any measurable effect on OglBatch7. So, while this may come back
to bite us, it doesn't look too bad.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
ARB_enhanced_layouts allows multiple output variables to share the same
location - and these variables may not have the same sizes. For
example, consider these output variables:
// consume X/Y/Z components of 6 vectors
layout(location = 0) out vec3 a[6];
// consumes W component of the first vector
layout(location = 0, component = 3) out float b;
Looking at the first declaration, we see that VARYING_SLOT_VAR0 needs 24
components worth of space (vec3 padded out to a vec4, 4 * 6 = 24). But
looking at the second declaration, we would think that VARYING_SLOT_VAR0
needs only 4 components of space (a single float padded out to a vec4).
nir_setup_outputs() only considered the space requirements of the first
declaration it happened to see, so if 'float b' came first, it would
underallocate the output register space, causing brw_fs_validator.cpp
to hit an assertion failure about inst->dst.offset exceeding the register
size.
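Conceptually, the fix is a sizing pre-pass that takes the maximum space
needed by every variable landing on a location, roughly (simplified
sketch; helper names as in the fs backend):

   unsigned vec4s[VARYING_SLOT_TESS_MAX] = { 0 };
   nir_foreach_variable(var, &nir->outputs) {
      const int loc = var->data.driver_location;
      /* Consider every declaration sharing this location, not just the
       * first one we happen to see. */
      vec4s[loc] = MAX2(vec4s[loc], type_size_vec4(var->type));
   }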
Fixes Piglit's tests/spec/arb_enhanced_layouts/execution/component-layout/
vs-to-fs-array-interleave-single-location.shader_test.
Thanks to Tim Arceri for finding this bug and writing a test!
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
KHR-GL45.shader_ballot_tests.ShaderBallotBitmasks has a MOV that hits
this validation path. MOVs don't have a src1 file, but calling
brw_inst_src1_type() was tripping on src1.file being BRW_IMMEDIATE_VALUE
and the hw_type being something invalid for immediates.
To work around this, just pretend src1 is src0 if there isn't a src1.
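Concretely, that amounts to a fallback along these lines (sketch;
variable names assumed):

   /* MOVs have no src1, so decoding its type would read garbage; fall
    * back to src0's type in that case. */
   enum brw_reg_type src1_type =
      num_sources > 1 ? brw_inst_src1_type(devinfo, inst)
                      : brw_inst_src0_type(devinfo, inst);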
Fixes: 2572c2771d (i965: Validate "Special
Requirements for Handling Double Precision Data Types")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102680
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
When computing the total size of the URB for tessellation evaluation
inputs we were not accounting for the fact that a 64-bit input may need
two vec4 slots; instead we always assumed that each input would take a
single vec4 slot, which could lead to computing a smaller read size than
required. Specifically, this is a problem when the last input is a
dvec3/4 such that its XY components are stored in the second half of a
payload register (which can happen
if the offset for the input in the URB is not 64-bit aligned because
there are 32-bit inputs mixed in) and the ZW components in the
first half of the next, as in this case we would fail to account for the
extra slot required for the ZW components.
Fixes (requires another fix in CTS currently in review):
KHR-GL45.enhanced_layouts.varying_locations
KHR-GL45.enhanced_layouts.varying_array_locations
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
I did not implement:
 - CNL's restriction on 64-bit int + align16, because I don't think
   we'll ever use this combination regardless of hardware generation.
 - The restriction on immediate DF -> F conversions, because there's no
   reason to ever generate that, and I don't even know how DF -> F
   conversions are supposed to work in Align16, since (1) the dst stride
   must be 1, but (2) the dst stride would have to be 2 for src and dst
   strides to be aligned.
Some restrictions require something like strides to match between src
and dest. For multi-source instructions, I'd rather encapsulate the
logic for not inserting already-present errors in ERROR_IF than open-code
it in multiple places.
The type suffixes were wrong, and the 16 was missing the 0 prefix.
Fixes: 92f787ff86 ("i965: Add support for disassembling 64-bit integer immediates")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
... without the float -> double conversion. Low power parts have
additional restrictions when it comes to operating on 64-bit types, and
the instruction used to do the conversion violates one of them:
specifically, the restriction that "Source and Destination horizontal
stride must be aligned to the same qword".
Previously we generated a float and then converted, but we can avoid the
conversion by using the same extract-the-sign-bit + or-in-1.0 algorithm,
operating directly on the high four bytes of each double-precision
component in the result.
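The core of the idea in fs-builder terms (a simplified sketch; the
x == 0.0 handling is omitted and 'result' is assumed to be zeroed):

   /* Work on the high dword of each double: bit 31 there is the sign
    * bit, and 0x3ff00000 is the high dword of 1.0. */
   fs_reg hi = subscript(result, BRW_REGISTER_TYPE_UD, 1);
   bld.AND(hi, subscript(op[0], BRW_REGISTER_TYPE_UD, 1),
           brw_imm_ud(0x80000000u));           /* extract the sign bit */
   bld.OR(hi, hi, brw_imm_ud(0x3ff00000u));    /* or in 1.0 */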
In SIMD8 and SIMD16 this cuts one instruction from the implementation,
and more importantly that instruction is the one which violated the
regioning restriction.
Along the way I removed some comments that I did not think helped, and
some code about double comparisons which does not seem to be necessary
today.
This prevents validation failures caught by the new EU validation code
added in later patches.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
64-bit operations on Atom parts have additional restrictions over their
big-core counterparts (validated by later patches).
Specifically, the restriction that "Source and Destination horizontal
stride must be aligned to the same qword" is violated by most shift
operations since NIR uses a 32-bit value as the shift count argument,
and this causes instructions like
shl(8) g19<1>Q g5<4,4,1>Q g23<4,4,1>UD
where src1 has a 32-bit stride, but the dest and src0 have a 64-bit
stride.
This caused ~4 pixels in the ARB_shader_ballot piglit test
fs-readInvocation-uint.shader_test to be incorrect. Unfortunately no
ARB_gpu_shader_int64 test hit this case because they operate on
uniforms, and their scalar regions are an exception to the restriction.
We work around this by effectively unpacking the shift count, so that we
can read it with a 64-bit stride in the shift instruction. Unfortunately
the unpack (a MOV with a dst stride of 2) is a partial write, and cannot
be copy-propagated or CSE'd.
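In fs-builder terms the workaround looks roughly like this (a sketch;
'shift_count' stands for the original 32-bit src1):

   /* The unpack is the MOV with a dst stride of 2 mentioned above; it
    * lets the SHL read the count back with a qword (64-bit) stride. */
   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_UQ);
   bld.MOV(subscript(tmp, BRW_REGISTER_TYPE_UD, 0), shift_count);
   bld.SHL(dst, src0, subscript(tmp, BRW_REGISTER_TYPE_UD, 0));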
Bugzilla: https://bugs.freedesktop.org/101984
A typo caused us to copy src0's reg file to src1 rather than reading
src1's as intended. This caused us to fail to compact instructions like
mov(8) g4<1>D 0D { align1 1Q };
because src1 was set to immediate rather than architecture file. Fixing
this reenables compaction (after the precompact() pass changes the data
types):
mov(8) g4<1>UD 0x00000000UD { align1 1Q compacted };
Fixes: 1cb0a7941b ("i965: Switch to using the logical register types")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We set a similar default value for LOD in the fs backend for TXS/TXL.
Without this we end up generating an invalid MOV with a null src.
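The change amounts to a default along these lines (sketch; variable
names assumed):

   /* TXS/TXL always need an LOD; default it to 0 instead of emitting a
    * MOV from a null register when the shader didn't provide one. */
   if (lod.file == BAD_FILE)
      lod = brw_imm_ud(0u);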
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: "17.2 17.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
In truth gtest is an external dependency that upstream expects you to
"vendor" into your own tree. As such, it makes sense to treat it more
like a dependency than an internal library, and collect its
requirements together in a dependency object.
v2: - include with -isystem instead of setting compiler args (Eric)
Signed-off-by: Dylan Baker <dylanx.c.baker@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>