At the time of commit 7bc6e455e2 (i965: Add support for saturating
immediates.) we thought mixed type saturates would be impossible. We
were only thinking about type converting moves from D to F, for
example. However, type converting moves w/saturate from F to DF are
definitely possible. This change minimally relaxes the restriction to
allow cases that I have been able trigger via piglit tests.
Fixes new piglit tests:
- arb_gpu_shader_fp64/execution/built-in-functions/fs-sign-sat-neg-abs.shader_test
- arb_gpu_shader_fp64/execution/built-in-functions/vs-sign-sat-neg-abs.shader_test
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
There's no reason for us to emit it a pile of times and then have a
whole pass to clean it up. Just emit it once like we really want.
Reviewed-by: Matt Turner <mattst88@gmail.com>
v2 (Jason Ekstrand):
- Disallow gl_SampleId in SIMD32 on gen7
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
And handle 32-wide payload register reads in fetch_payload_reg().
v2 (Jason Ekstrand);
- Fix some whitespace and brace placement
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
The original code manually handled splitting the MOVs to 8-wide to
handle various regioning restrictions. Now that we have a SIMD width
splitting pass that handles these things, we can just emit everything at
the full width and let the SIMD splitting pass handle it. We also now
have a useful "subscript" helper which is designed exactly for the case
where you want to take a W type and read it as a vector of Bs so we may
as well use that too.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
This reworks INTERPOLATE_AT_PER_SLOT_OFFSET to work more like an ALU
operation and less like a send. This is less code over-all and, as a
side-effect, it now properly handles execution groups and lowering so
SIMD32 support just falls out.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Commit 0d905597f fixed an issue with the placement of the zip and unzip
instructions. However, as a side-effect, it reversed the order in which
we were emitting the split instructions so that they went from high
group to low instead of low to high. This is fine for most things like
texture instructions and the like but certain render target writes
really want to be emitted low to high. This commit just switches the
order back around to be low to high.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 0d905597f "intel/fs: Be more explicit about our placement of [un]zip"
Doing instruction header setup in the generator is awful for a number
of reasons. For one, we can't schedule the header setup at all. For
another, it means lots of implied writes which the instruction scheduler
and other passes can't properly read about. The second isn't a huge
problem for FB writes since they always happen at the end. We made a
similar change to sampler handling in ff4726077d.
Reviewed-by: Matt Turner <mattst88@gmail.com>
The FB write opcode on gen4-5 does implied copies from g0 and g1 to the
message payload. With this commit, we start tracking that as part of
the IR by having the FB write read from g0-1.
Reviewed-by: Matt Turner <mattst88@gmail.com>
shuffle_from_32bit_read can manage the shuffle/unshuffle needed
for different 8/16/32/64 bit-sizes at VARYING PULL CONSTANT LOAD.
To get the specific component the first_component parameter is used.
In the case of the previous 16-bit shuffle, the shuffle operation was
generating not needed MOVs where its results where never used. This
behaviour passed unnoticed on SIMD16 because dead_code_eliminate
pass removed the generated instructions but for SIMD8 they cound't be
removed because of being partial writes.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
fs_visitor::set_gs_stream_control_data_bits generates some code like
"control_data_bits | stream_id << ((2 * (vertex_count - 1)) % 32)" as
part of EmitVertex. The first time this (dynamically) occurs in the
shader, control_data_bits is zero. Many times we can determine this
statically and various optimizations will collaborate to make one of the
OR operands literal zero.
Converting the OR to a MOV usually allows it to be copy-propagated away.
However, this does not happen in at least some shaders (in the assembly
output of shaders/closed/UnrealEngine4/EffectsCaveDemo/301.shader_test,
search for shl).
All of the affected shaders are geometry shaders.
Broadwell and Skylake had similar results. (Skylake shown)
total instructions in shared programs: 14375452 -> 14375413 (<.01%)
instructions in affected programs: 6422 -> 6383 (-0.61%)
helped: 39
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 0.14% max: 2.56% x̄: 1.91% x̃: 2.56%
95% mean confidence interval for instructions value: -1.00 -1.00
95% mean confidence interval for instructions %-change: -2.26% -1.57%
Instructions are helped.
total cycles in shared programs: 531981179 -> 531980555 (<.01%)
cycles in affected programs: 27493 -> 26869 (-2.27%)
helped: 39
HURT: 0
helped stats (abs) min: 16 max: 16 x̄: 16.00 x̃: 16
helped stats (rel) min: 0.60% max: 7.92% x̄: 5.94% x̃: 7.92%
95% mean confidence interval for cycles value: -16.00 -16.00
95% mean confidence interval for cycles %-change: -6.98% -4.90%
Cycles are helped.
No changes on earlier platforms.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
When using multiple RT write messages to the same RT such as for
dual-source blending or all RT writes in SIMD32, we have to set the
"Last Render Target Select" bit on all write messages that target the
last RT but only set EOT on the last RT write in the shader.
Special-casing for dual-source blend works today because that is the
only case which requires multiple RT write messages per RT. When we
start doing SIMD32, this will become much more common so we add a
dedicated bit for it.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
This replaces the special magic opcodes which implicitly read inputs
with explicit use of the ATTR file.
v2 (Jason Ekstrand):
- Break into multiple patches
- Change the units of the FS ATTR to be in logical scalars
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
They are send messages and this makes size_read() and mlen agree. For
both of these opcodes, the payload is just a dummy so mlen == 1 and this
should decrease register pressure a bit.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Cc: mesa-stable@lists.freedesktop.org
The new name make the zero-input behavior more obvious. The next
patch adds a new function with different zero-input behavior.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Suggested-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
This method is similar to the existing ::equals methods. Instead of
testing that two src_regs are equal to each other, it tests that one is
the negation of the other.
v2: Simplify various checks based on suggestions from Matt. Use
src_reg::type instead of fixed_hw_reg.type in a check. Also suggested
by Matt.
v3: Rebase on 3 years. Fix some problems with negative_equals with VF
constants. Add fs_reg::negative_equals.
v4: Replace the existing default case with BRW_REGISTER_TYPE_UB,
BRW_REGISTER_TYPE_B, and BRW_REGISTER_TYPE_NF. Suggested by Matt.
Expand the FINISHME comment to better explain why it isn't already
finished.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> [v3]
Reviewed-by: Matt Turner <mattst88@gmail.com>
We only have a cfg != NULL if we went through one of the paths that set
it, but my compiler doesn't figure that out.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 6411defdcd ("intel/cs: Re-run final NIR optimizations for each SIMD size")
OpenCL kernels also have int8/uint8.
v2: remove changes in nir_search as Jason posted a patch for that
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
v2 (idr): Don't allow CSEL with a non-float src2.
v3 (idr): Add CSEL to fs_inst::flags_written. Suggested by Matt.
v4 (idr): Only set BRW_ALIGN_16 on Gen < 10 (suggested by Matt). Don't
reset the access mode afterwards (suggested by Samuel and Matt). Add
support for CSEL not modifying the flags to more places (requested by
Matt).
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> [v3]
Reviewed-by: Matt Turner <mattst88@gmail.com>
NIR has code to lower these away for us but we can do significantly
better in many cases with register regioning and SIMD4x2.
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
These days, we're just passing a pointer to a prog_data field, which
we already have access to. We can just use it directly.
(In the past, it was a pointer to a separate value.)
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
The main motivation is to enable HDC surface opcodes on ICL which no
longer allows the sample mask to be provided in a message header, but
this is enabled all the way back to IVB when possible because it
decreases the instruction count of some shaders using HDC messages
significantly, e.g. one of the SynMark2 CSDof compute shaders
decreases instruction count by about 40% due to the removal of header
setup boilerplate which in turn makes a number of send message
payloads more easily CSE-able. Shader-db results on SKL:
total instructions in shared programs: 15325319 -> 15314384 (-0.07%)
instructions in affected programs: 311532 -> 300597 (-3.51%)
helped: 491
HURT: 1
Shader-db results on BDW where the optimization needs to be disabled
in some cases due to hardware restrictions:
total instructions in shared programs: 15604794 -> 15598028 (-0.04%)
instructions in affected programs: 220863 -> 214097 (-3.06%)
helped: 351
HURT: 0
The FPS of SynMark2 CSDof improves by 5.09% ±0.36% (n=10) on my SKL
laptop with this change. According to Eero this improves performance
of the same test by 9% on BYT and by 7-8% on BXT J4205 and on SKL GT2
desktop.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Tested-By: Eero Tamminen <eero.t.tamminen@intel.com>
This shouldn't cause any functional change at this point, it changes
SHADER_OPCODE_FIND_LIVE_CHANNEL to use the flag register specified at
the IR level instead of the hard-coded f1.0, now that it can be
represented in backend_instruction::flag_subreg. This will be
necessary for scheduling to behave correctly once more things start
making use of f1.0.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This allows representing conditional mods and predicates on f1.0-f1.1
at the IR level by adding an extra bit to the flag_subreg
backend_instruction field.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This gives the scheduler visibility into the headers which should
improve scheduling. More importantly, however, it lets the scheduler
know that the header gets written. As-is, the scheduler thinks that a
texture instruction only reads it's payload and is unaware that it may
write to the first register so it may reorder it with respect to a read
from that register. This is causing issues in a couple of Dota 2 vertex
shaders.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104923
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
This helper used to load 16bit components from 32-bits read now allows
skipping components with the new parameter first_component. The semantics
now skip components until we reach the first_component, and then reads the
number of components passed to the function.
All previous uses of the helper are updated to use 0 as first_component.
This will allow read 16-bit components when the first one is not aligned
32-bit. Enabling more usages of untyped_reads with 16-bit types.
v2: (Jason Ektrand)
Change parameters order to first_component, num_components
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This looks like it should be protected by the assume() about
nr_color_regions, but my compiler warns anyway.
Reviewed-by: Matt Turner <mattst88@gmail.com>
The GPU hang caused by push constants is apparently fixed, so let's
enable them again.
Signed-off-by: Rafael Antognolli <rafael.antognolli@intel.com>
Cc: "18.0" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
18fde36ced changed the way temporary
registers were allocated in lower_integer_multiplication so that we
allocate regs_written(inst) space and keep the stride of the original
destination register. This was to ensure that any MUL which originally
followed the CHV/BXT integer multiply regioning restrictions would
continue to follow those restrictions even after lowering. This works
fine except that I forgot to reset the register file to VGRF so, even
though they were assigned a number from alloc.allocate(), they had the
wrong register file. This caused some GLES 3.0 CTS tests to start
failing on Sandy Bridge due to attempted reads from the MRF:
ES3-CTS.functional.shaders.precision.int.highp_mul_fragment.snbm64
ES3-CTS.functional.shaders.precision.int.mediump_mul_fragment.snbm64
ES3-CTS.functional.shaders.precision.int.lowp_mul_fragment.snbm64
ES3-CTS.functional.shaders.precision.uint.highp_mul_fragment.snbm64
ES3-CTS.functional.shaders.precision.uint.mediump_mul_fragment.snbm64
ES3-CTS.functional.shaders.precision.uint.lowp_mul_fragment.snbm64
This commit remedies this problem by, instead of copying inst->dst and
overwriting nr, just make a new register and set the region to match
inst->dst.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103626
Fixes: 18fde36ced
Cc: "17.3" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
We already had to switch all of the W types to UW to prevent issues
with vector immediates on gen10. We may as well use unsigned types
everywhere.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Gen 10 has a strange hardware bug involving V immediates with W types.
It appears that a mov(8) g2<1>W 0x76543210V will actually result in g2
getting the value {3, 2, 1, 0, 3, 2, 1, 0}. In particular, the bottom
four nibbles are repeated instead of the top four being taken. (A mov
of 0x00003210V yields the same result.) This bug does not appear in any
hardware documentation as far as we can tell and the simulator does not
implement the bug either.
Commit 6132992cdb was mostly a no-op
except that it changed the type of the subgroup invocation from UW to W
and caused us to tickle this bug with basically every compute shader
that uses any sort of invocation ID (which is most of them). This is
also potentially an issue for geometry shader input pulls and SampleID
setup. The easy solution is just to change the few places where we use
a vector integer immediate with a W type to use a UW type.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
Fixes: 6132992cdb