Commit graph

263 commits

Author SHA1 Message Date
Francisco Jerez
1c6853a9be intel/fs: Refactor predication on sample mask into helper function.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a792e11f5c intel/fs/gen7+: Swap sample mask flag register and FIND_LIVE_CHANNEL temporary.
FIND_LIVE_CHANNEL was using f1.0-f1.1 as temporary flag register on
Gen7, instead use f0.0-f0.1.  In order to avoid collision with the
discard sample mask, move the latter to f1.0-f1.1.  This makes room
for keeping track of the sample mask of the second half of SIMD32
programs that use discard.

Note that some MOVs of the sample mask into f1.0 become redundant now
in lower_surface_logical_send() and lower_a64_logical_send().

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>x
2020-02-14 14:31:48 -08:00
Francisco Jerez
083fd96a97 intel/fs: Use helper for discard sample mask flag subregister number.
Use it instead of hard-coding f0.1 for the sample mask of programs
that use discard.  This will make the task easier when we replace f0.1
with another flag register location in order to support discard with
SIMD32 shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a6bc11a789 intel/fs: Make sample_mask_reg() local to brw_fs.cpp and use it in more places.
It's only really useful there.  This will avoid confusion with another
helper with a similar purpose I'm about to introduce that will be
useful in multiple files from the FS back-end.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
57dee58c82 intel/fs: Set src0 alpha present bit in header when provided in message payload.
Currently the "Source0 Alpha Present to RenderTarget" bit of the RT
write message header is derived from brw_wm_prog_data::replicate_alpha.
However the src0_alpha payload is provided anytime it's specified to
the logical message.  This could theoretically lead to an
inconsistency if somebody provided a src0_alpha value while
brw_wm_prog_data::replicate_alpha was false, as I'm planning to do in
a future commit in order to implement a hardware workaround.

Instead calculate the header bit based on whether a src0_alpha value
was provided to the logical message, which guarantees the same
behavior on pre-ICL and ICL+ (the latter used an extended descriptor
bit for this which didn't suffer from the same issue).  Remove the
brw_wm_prog_data::replicate_alpha flag.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a8ac0bd759 intel/fs/gen12: Workaround unwanted SEND execution due to broken NoMask control flow.
This is a less invasive alternative to the workaround documented in
the hardware spec for GEN:BUG:1407528679, which doesn't involve
disabling structured control flow (it's unlikely that switching to
GOTO/JOIN would have actually fixed the problem anyway).

Under some conditions Gen12 hardware can end up executing a BB with
all channels disabled, which will lead to the execution of any NoMask
instructions in it, even though any execution-masked instructions will
be correctly shot down.  This may break assumptions of some NoMask
SEND messages whose descriptor depends on data generated by live
invocations of the shader.

This avoids the problem by predicating certain instructions on an ANY
horizontal predicate that makes sure that their execution is omitted
when all channels of the program are disabled.  The shader-db impact
of this patch seems to be minimal:

total instructions in shared programs: 17169833 -> 17169913 (0.00%)
instructions in affected programs: 30663 -> 30743 (0.26%)
helped: 0
HURT: 42

total cycles in shared programs: 336966176 -> 336968568 (0.00%)
cycles in affected programs: 2367290 -> 2369682 (0.10%)
helped: 0
HURT: 13

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
008f95a043 intel/fs: Add virtual instruction to load mask of live channels into flag register.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
b8b509fb92 intel/fs/gen7: Fix fs_inst::flags_written() for SHADER_OPCODE_FIND_LIVE_CHANNEL.
We need to pass a width of 32 since the opcode bashes the whole f1.0
register on IVB.  This is unlikely to have caused problems since f1.0
is largely unused currently.  That's likely to change soon though,
even on platforms other than Gen7.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Ian Romanick
58907568ec intel/fs: Add SHADER_OPCODE_[IU]SUB_SAT pseudo-ops
v2: Add a big comment explaining the [IU]SUB_SAT lowering.  Suggested by
Caio.

v3: Use get_fpu_lowered_simd_width in get_lowered_simd_width.  Suggested
by Ken on IRC.

v4: Fix a typo in a comment.  Noticed by Caio.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
2020-01-23 00:18:57 +00:00
Ian Romanick
74cd0964d6 intel/fs: Don't lower integer multiplies that don't need lowering
v2: Move the check to fs_visitor::lower_integer_multiplication.
Previously the cases where lowering was skipped, the original
instruction was removed by fs_visitor::lower_integer_multiplication.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
2020-01-23 00:18:57 +00:00
Matt Turner
49c21802cb intel/compiler: Split has_64bit_types into float/int
Gen7 has 64-bit floats but not 64-bit ints.

Acked-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2635>
2020-01-22 00:19:20 +00:00
Caio Marcelo de Oliveira Filho
ff5b74ef32 intel/fs: Add workgroup_size() helper
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
2020-01-21 23:41:35 +00:00
Francisco Jerez
b54b67e067 intel/fs: Switch to standard vector layout for barycentrics at optimization time.
This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.

Main reason is correctness of SIMD32 fragment shaders.  The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32.  This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.

Of course one could have chosen to fix those helpers instead, but
there is another far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for
the PLN instruction, which causes incorrect execution masks to be
applied to the MOVs unzipping barycentric vectors in cases where a
LINTERP instruction occurs under non-uniform control flow.

I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().

This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions.  On ICL:

   total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
   instructions in affected programs: 10538474 -> 10343061 (-1.85%)
   helped: 68775
   HURT: 6

   total spills in shared programs: 8938 -> 8748 (-2.13%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8965 -> 8663 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   146
   GAINED: 43

On SKL:

   total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
   instructions in affected programs: 3876590 -> 3765635 (-2.86%)
   helped: 27492
   HURT: 2

   LOST:   191
   GAINED: 417

On SNB:

   total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
   instructions in affected programs: 5199074 -> 4606107 (-11.41%)
   helped: 29998
   HURT: 0

   LOST:   21
   GAINED: 30

Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled.  On ICL:

   total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
   instructions in affected programs: 6114788 -> 6027719 (-1.42%)
   helped: 42046
   HURT: 6

   total spills in shared programs: 8218 -> 8028 (-2.31%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8953 -> 8651 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   0
   GAINED: 3

On SKL:

   total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
   instructions in affected programs: 168850 -> 167594 (-0.74%)
   helped: 711
   HURT: 2

On SNB:

   total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
   instructions in affected programs: 2702172 -> 2666037 (-1.34%)
   helped: 17818
   HURT: 0

All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:23:12 -08:00
Francisco Jerez
79bd252d6e intel/fs: Introduce barycentric layout lowering pass.
The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering.  Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass.  For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:22:59 -08:00
Francisco Jerez
ab0d1b3b3d intel/fs: Rework fs_inst::is_copy_payload() into multiple classification helpers.
This reworks the current fs_inst::is_copy_payload() method into a
number of classification helpers with well-defined semantics.  This
will be useful later on in order to optimize LOAD_PAYLOAD instructions
more aggressively in cases where we can determine it's safe to do so.

The closest equivalent of the present fs_inst::is_copy_payload()
method is the is_coalescing_payload() helper introduced here.

No functional nor shader-db changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:21:19 -08:00
Francisco Jerez
1873202f44 intel/fs: Generalize fs_reg::is_contiguous() to register files other than VGRF.
No functional nor shader-db changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:20:59 -08:00
Francisco Jerez
d9a57c85cc intel/fs: Try to vectorize header setup in lower_load_payload().
In cases where LOAD_PAYLOAD is provided a pair of contiguous registers
as header sources, try to use a single SIMD16 instruction in order to
initialize them.  This is unlikely to affect the overall cycle count
of the shader, since the compressed instruction has twice the issue
time, except due to the reduced pressure on the instruction cache.

Main motivation is avoiding instruction-count regressions in
combination with the following copy propagation improvements, which
will allow the SIMD16 g0-1 header setup emitted for framebuffer writes
to be copy-propagated into its LOAD_PAYLOAD, leading to the emission
of two SIMD8 MOV instructions instead of a single SIMD16 MOV.

Reverting this commit on top of the copy propagation changes would
lead to the following shader-db regressions on SKL and other
platforms:

 total instructions in shared programs: 14926738 -> 14935415 (0.06%)
 instructions in affected programs: 1892445 -> 1901122 (0.46%)
 helped: 0
 HURT: 8676

Without the following copy propagation changes this doesn't have any
effect on shader-db on Gen7+, because we would typically set up the FB
write header with a separate SIMD16 MOV that isn't currently
copy-propagated into the LOAD_PAYLOAD, so the individual SIMD8 MOVs
result of LOAD_PAYLOAD lowering would get register-coalesced away
under normal circumstances.  However that wasn't the case for MRF
LOAD_PAYLOAD destinations on Gen6 and earlier, because register
coalesce only kicks in for GRFs, leaving a number of redundant SIMD8
MOVs lying around.  On SNB this leads to the following shader-db
improvements:

 total instructions in shared programs: 10770538 -> 10734681 (-0.33%)
 instructions in affected programs: 2700655 -> 2664798 (-1.33%)
 helped: 17791
 HURT: 0

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:20:46 -08:00
Eric Anholt
3d9a3d0be0 i965: Reuse the new core glsl_count_dword_slots().
The only difference I could see was treating interfaces like structs.
Maintain that case.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3297>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3297>
2020-01-14 23:55:00 +00:00
Francisco Jerez
c20dc9b836 intel/fs: Make implied_mrf_writes() an fs_inst method.
This will be convenient in a later commit enabling SIMD32 fragment
shaders, and happens to fix the calculation for MATH instructions
which is currently inaccurate for SIMD-lowered instructions on Gen4-5
platforms (all of them on Gen4 in SIMD16 mode), since it was based on
the shader's dispatch width rather than on the actual execution size
of the instruction.

This causes some shader-db noise on Gen4 due to the more compact
register allocation interacting with the SEND dependency workarounds,
but otherwise no major changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:02:30 -08:00
Francisco Jerez
0a6e46d44d intel/fs/gen11+: Handle ROR/ROL in lower_simd_width().
Prevents invalid code from being emitted for ROR/ROL instructions in
SIMD32 shaders.

The problem can be reproduced with the following tests while forcing
SIMD32 to be used for fragment shaders:

 piglit.shaders.glsl-rotate-left
 piglit.shaders.glsl-rotate-right

However the issue could occur in production already with compute
shaders and a workgroup size large enough to trigger SIMD32 dispatch.

Fixes: 83fdec0f0d "intel/compiler: Enable the emission of ROR/ROL instructions"
Cc: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:00:24 -08:00
Caio Marcelo de Oliveira Filho
2137be22fa intel/fs: Fix lowering of dword multiplication by 16-bit constant
Existing code was ignoring whether the type of the immediate source
was signed or not.  If the source was signed, it would ignore small
negative values but it also would wrongly accept values between
INT16_MAX and UINT16_MAX, causing the atual value to later be
reinterpreted as a negative number (under 16-bits).

Fixes tests/shaders/glsl-mul-const.shader_test in Piglit for platforms
that don't support MUL with 32x32 types, including ICL and TGL.

Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2186
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-12-17 10:45:22 -08:00
Caio Marcelo de Oliveira Filho
c06ba83589 intel/fs: Lower 64-bit MOVs after lower_load_payload()
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3070>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3070>
2019-12-14 21:12:21 +00:00
Tapani Pälli
32ebd4207a intel/compiler: force simd8 when dual src blending on gen8
Patch introduces option to force simd8 and uses it as a workaround for
dual source blending issues seen with skqp (skia testsuite) on gen8.

Fixes following Piglit test on gen8 platforms:
   arb_blend_func_extended-dual-src-blending-issue-1917

Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/1917
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
c: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-12-05 09:42:50 +02:00
Jason Ekstrand
2fca325ea6 Revert "i965/fs: Merge CMP and SEL into CSEL on Gen8+"
This reverts commit 52c7df1643.  The pass,
while clearly useful for some shaders, has at least three bugs that I
was able to find fairly quickly:

 1. It doesn't work for type-converting MOVs because f > 0 is not the
    same as f2i(f) > 0

 2. CSEL is a 3src instruction and only supports one source type; it
    doesn't take this into account and tries to create instructions
    which do a F compare and a D select.  This is especially nasty to
    debug because you don't see that in the dumped assembly because we
    don't properly assert that types are the same in codegen.

 3. While you can handle 2, in theory, by reinterpreting types, you
    can't do that in the presence of source modifiers.  This pass
    doesn't even attempt to detect that.

Those are just the ones I found with the one almost trival shader I was
debugging.  There very likely may be more and.  Best thing to do for now
is just shut it off until someone has the time to figure out how to do
this properly and write tests to ensure it's correct.

Fixes: 3cb085e6d61a "i965/fs: Merge CMP and SEL into CSEL on Gen8+"
Reviewed-by: Brian Paul <brianp@vmware.com>
2019-11-20 20:47:32 +00:00
Jason Ekstrand
d1c4e64a69 intel/compiler: Add a flag to avoid compacting push constants
In vec4, we can just not run the pass.  In fs, things are a bit more
deeply intertwined.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-11-18 18:35:14 +00:00
Italo Nicola
59623f211b intel/compiler: remove old comment
This comment was correct some time ago, but since commit
d3c10ad427, it isn't true anymore.

Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-11-18 10:20:34 -08:00
Jason Ekstrand
53bfcdeecf intel/fs: Implement the new load/store_scratch intrinsics
This commit fills in a number of different pieces:

 1. We add support to brw_nir_lower_mem_access_bit_sizes to handle the
    new intrinsics.  This involves simple plumbing work as well as a
    tiny bit of extra logic to always scalarize scratch intrinsics

 2. Add code to brw_fs_nir.cpp to turn nir_load/store_scratch intrinsics
    into byte/dword scattered read/write messages which use the A32
    stateless model.

 3. Add code to lower_surface_logical_send to handle dword scattered
    messages and the A32 stateless model.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-11-11 17:17:02 +00:00
Jason Ekstrand
1dff48af05 intel/fs: refactor surface header setup
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-11-11 17:17:02 +00:00
Jason Ekstrand
a0999bc049 intel/fs: Add DWord scattered read/write opcodes
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-11-11 17:17:02 +00:00
Sagar Ghuge
f729ecefef intel/compiler: Remove emit_alpha_to_coverage workaround from backend
Remove emit_alpha_to_coverage workaround from backend compiler and start
using ported workaround from NIR.

v2: Copy comment from brw_fs_visitor (Caio Marcelo de Oliveira Filho)

Fixes piglit test on HSW:
- arb_sample_shading-builtin-gl-sample-mask-mrt-alpha-to-coverage-combinations

Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-10-21 11:27:29 -07:00
Jason Ekstrand
c92fb60007 intel/fs/gen12: Implement gl_FrontFacing on gen12+.
The bit moved on gen12 in order to prepare for dual-SIMD8 dispatch.
This implementation isn't an entirely complete as it only works on SIMD8
and SIMD16 and not dual-SIMD8.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-10-11 12:24:16 -07:00
Francisco Jerez
cb6db5bfb3 intel/fs/gen12: Don't support source mods for 32x16 integer multiply.
Due to hardware bug filed as HSDES#1604601757.

v2: Only return if result of fs_inst::can_do_source_mods() is known to
    be false for the case new orthogonal restrictions are implemented
    below in the future. (Caio)

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-10-11 12:24:16 -07:00
Francisco Jerez
265c7c8971 intel/fs/gen12: Introduce software scoreboard lowering pass.
Gen12+ hardware lacks the register scoreboard logic that used to
guarantee data coherency between register reads and writes in previous
generations.  This lowering pass runs after register allocation in
order to make up for it.

It works by performing global dataflow analysis in order to determine
the set of potential dependencies of every instruction in the shader,
and then inserts any required SWSB annotations and additional SYNC
instructions in order to guarantee data coherency.

v2: Drop unnecessary _safe list iteration (Caio).

v3: Temporarily workaround potential WaR hazard between FPU
    instruction and subsequent out-of-order write, pending
    clarification from the hardware team.  Drop redundant tracking of
    implicit access of acc0-1, since the hardware guarantees coherency
    of these (but not the other accumulators...).

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-10-11 12:24:16 -07:00
Francisco Jerez
b2ae65c7d9 intel/fs: Fix constness of implied_mrf_writes() argument.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-10-11 12:24:16 -07:00
Francisco Jerez
f326d9d218 intel/fs: Define is_payload() method of the IR instruction class.
This is required because SEND message payload sources are fetched
asynchronously by the hardware, which can lead to WaR data corruption
on Gen12+ platforms if not handled specially by the compiler to
guarantee proper synchronization.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-10-11 12:24:16 -07:00
Francisco Jerez
a42581fa8f intel/fs: Teach fs_inst::is_send_from_grf() about some missing send-like instructions.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2019-10-11 12:24:16 -07:00
Jason Ekstrand
6c858b9a91 intel/fs: Fix fs_inst::flags_read for ANY/ALL predicates
Without this, we were DCEing flag writes because we didn't think their
results were used because we didn't understand that an ANY32 predicate
actually read all the flags.

Fixes: df1aec763e "i965/fs: Define methods to calculate the flag..."
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-09-27 19:31:43 +00:00
Kenneth Graunke
0e4a75f917 intel/compiler: Record whether any pull constant loads occur
I would like for iris to be able to avoid setting up SURFACE_STATE
for UBOs in the common case where all constants are pushed.

Unfortunately, we don't know up front whether everything will be
pushed: the backend is allowed to demote pushed UBOs to pull loads
fairly late in the process.  This is probably desirable though, as
we'd like the backend to be able to re-pull pushed data to break up
long live ranges in response to register pressure.

Here we simply add a "are there any pull loads at all" boolean to
prog_data, which is a bit crude but at least allows us to skip work
in the common "everything pushed" case.  We could skip more work by
tracking exactly which UBO surfaces are pulled in a bitmask, but I
wanted to avoid bringing back the old mark_surface_used() mechanism.

Finer-grained tracking could allow us to skip a bit more work when
multiple UBOs are in use and /some/ are 100% pushed, but others are
accessed via pulls.  However, I'm not sure how common this is and
it would save at most 4 pull descriptors, so we defer that for now.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-18 15:44:22 -07:00
Kenneth Graunke
f76a724e06 intel/compiler: Set "Null Render Target" ex_desc bit on Gen11
When there are no color regions (i.e. a depth only pass), we can set
the "Null Render Target" bit in the Gen11 RT write extended message
descriptor to indicate that it should behave as if it's writing to a
null render target, without the need for a binding table entry.

This lets drivers avoid setting up that null RT binding table entry,
but more importantly means the HW doesn't actually have to bother
looking up the surface state.

Together with the next patch, this improves performance in Car Chase on
an Icelake 8x8 (locked to 700Mhz) by 0.0445526% +/- 0.0132736% (n=832).

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 14:27:51 -07:00
Samuel Iglesias Gonsálvez
9b07020a4f i965/fs: add support for shader float control to remove_extra_rounding_modes()
The remove_extra_rounding_modes() optimization will remove duplicated
rounding mode changes.

v2:
- Fix bug in the rounding mode change (Alejandro).

v3:
- Fix rounding modes.

v4:
- Updated to renamed shader info member and enum values (Andres).

v5:
- Simplify flags logic operations (Caio).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 23:39:19 +03:00
Jason Ekstrand
acfa2340e6 intel/fs: Handle UNDEF in split_virtual_grfs
When the UNDEF instruction was added, we didn't do anything special in
split_virtual_grfs.  This mean that anything with an UNDEF wasn't
getting split which causes problems for the compiler.  Among other
things, it makes RA harder because things are in bigger chunks.  It also
meant that dvec4s weren't getting split which means that they are larger
than the maximum register size.

Shader-db results on Kaby Lake:

    total instructions in shared programs: 14959202 -> 14960035 (<.01%)
    instructions in affected programs: 96197 -> 97030 (0.87%)
    helped: 140
    HURT: 128
    helped stats (abs) min: 1 max: 17 x̄: 1.62 x̃: 1
    helped stats (rel) min: 0.09% max: 6.15% x̄: 0.65% x̃: 0.45%
    HURT stats (abs)   min: 1 max: 825 x̄: 8.28 x̃: 1
    HURT stats (rel)   min: 0.13% max: 139.83% x̄: 1.70% x̃: 0.50%
    95% mean confidence interval for instructions value: -2.96 9.18
    95% mean confidence interval for instructions %-change: -0.56% 1.51%
    Inconclusive result (value mean confidence interval includes 0).

    total loops in shared programs: 4372 -> 4372 (0.00%)
    loops in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 352646771 -> 352840997 (0.06%)
    cycles in affected programs: 218600800 -> 218795026 (0.09%)
    helped: 21167
    HURT: 21411
    helped stats (abs) min: 1 max: 2924 x̄: 36.89 x̃: 10
    helped stats (rel) min: <.01% max: 41.90% x̄: 2.97% x̃: 0.98%
    HURT stats (abs)   min: 1 max: 26027 x̄: 45.54 x̃: 10
    HURT stats (rel)   min: <.01% max: 324.46% x̄: 3.88% x̃: 1.06%
    95% mean confidence interval for cycles value: 2.87 6.26
    95% mean confidence interval for cycles %-change: 0.40% 0.55%
    Cycles are HURT.

    total spills in shared programs: 8840 -> 8953 (1.28%)
    spills in affected programs: 126 -> 239 (89.68%)
    helped: 1
    HURT: 2

    total fills in shared programs: 21782 -> 21914 (0.61%)
    fills in affected programs: 431 -> 563 (30.63%)
    helped: 1
    HURT: 3

    LOST:   0
    GAINED: 5

Shader-db results on Haswell:

    total instructions in shared programs: 13320918 -> 13320769 (<.01%)
    instructions in affected programs: 40998 -> 40849 (-0.36%)
    helped: 146
    HURT: 56
    helped stats (abs) min: 1 max: 8 x̄: 2.73 x̃: 2
    helped stats (rel) min: 0.16% max: 8.60% x̄: 2.52% x̃: 2.22%
    HURT stats (abs)   min: 2 max: 23 x̄: 4.45 x̃: 4
    HURT stats (rel)   min: 0.21% max: 10.26% x̄: 6.83% x̃: 10.26%
    95% mean confidence interval for instructions value: -1.26 -0.21
    95% mean confidence interval for instructions %-change: -0.62% 0.77%
    Inconclusive result (%-change mean confidence interval includes 0).

    total loops in shared programs: 4373 -> 4373 (0.00%)
    loops in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 374518258 -> 374384193 (-0.04%)
    cycles in affected programs: 231101954 -> 230967889 (-0.06%)
    helped: 21427
    HURT: 19438
    helped stats (abs) min: 1 max: 2035 x̄: 31.09 x̃: 8
    helped stats (rel) min: <.01% max: 40.95% x̄: 2.42% x̃: 0.86%
    HURT stats (abs)   min: 1 max: 20875 x̄: 27.38 x̃: 8
    HURT stats (rel)   min: <.01% max: 59.09% x̄: 2.49% x̃: 0.80%
    95% mean confidence interval for cycles value: -4.49 -2.07
    95% mean confidence interval for cycles %-change: -0.14% -0.04%
    Cycles are helped.

    total spills in shared programs: 23406 -> 23411 (0.02%)
    spills in affected programs: 3 -> 8 (166.67%)
    helped: 0
    HURT: 2

    total fills in shared programs: 34845 -> 34850 (0.01%)
    fills in affected programs: 3 -> 8 (166.67%)
    helped: 0
    HURT: 2

    LOST:   0
    GAINED: 0

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111566
Fixes: f4ef34f207 "intel/fs: Add an UNDEF instruction to avoid..."
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2019-09-13 04:12:24 +00:00
Jason Ekstrand
d15fe8ca82 Revert "intel/fs: Move the scalar-region conversion to the generator."
This reverts commit c0504569ea.  Now that
we're doing interpolation lowering in NIR, we can continue to stride the
FS input registers directly in the brw_fs_nir code like we did before.
This fixes SIMD32 fragment shaders which broke because lower_simd_width
depended on the 0 stride to split PLN instructions correctly.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2019-09-06 03:58:09 +00:00
Jason Ekstrand
47e9743547 intel/fs: Fix FB write inst groups
This commit does two things.  First, it simplifies the way we compute
the FB write group bit.  There's no reason to use a ternary because
inst->group / 16 can only be 0 or 1.  Second, it fixes an order-of-
operations bug where the ternary wasn't selecting between (1 << 11) and
0 but between (1 << 11) and 0 | brw_dp_write_desc(...).

Fixes: 0d9648416 "intel/compiler: Use generic SEND for Gen7+ FB writes"
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-09-06 03:58:09 +00:00
Kenneth Graunke
23f42f8dcf intel/compiler: Use new Gen11 headerless RT writes for MRT cases
Gen11 adds support for specifying the render target index and src0
alpha present bits in the extended message descriptor.  Previously,
we had to use a message header for this, requiring extra instructions
to write the fields, and two registers of extra payload.

Improves performance on my ICL 8x8 frequency locked to 700Mhz, on iris:

   GfxBench5 Manhattan 3.0: 2.13635% +/- 0.159859% (n=5)
   GfxBench5 Aztec Ruins:   1.57173% +/- 0.128749% (n=5)
   Synmark2 OglDeferred:    2.86914% +/- 0.191211% (n=10)

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-08-27 14:20:07 -07:00
Kenneth Graunke
0d96484165 intel/compiler: Use generic SEND for Gen7+ FB writes
This takes care of generate_fb_write/fire_fb_write/brw_fb_WRITE's stuff
earlier in the visitor.  It will also make it easier to generate SENDSC
messages with indirect extended descriptors in a few patches.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-08-27 14:20:07 -07:00
Kenneth Graunke
86a63b1098 intel/compiler: Refactor FB write message control setup into a helper.
This will be used by visitor code to convert directly to SEND in a bit.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-08-27 14:20:07 -07:00
Paulo Zanoni
848d5e444a intel/fs: grab fail_msg from v32 instead of v16 when v32->run_cs fails
Looks like a copy/paste error. This patch prevents a segfault when
running the following on BDW:

    INTEL_DEBUG=no8,no16,do32 ./deqp-vk -n \
        dEQP-VK.subgroups.arithmetic.compute.subgroupmin_dvec4

For the curious, the message we're getting is:

    CS compile failed: Failure to register allocate.  Reduce number
    of live scalar values to avoid this.

Fixes: 864737ce6c ("i965/fs: Build 32-wide compute shader when needed.")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-08-26 14:54:16 -07:00
Jason Ekstrand
f58e0405b6 intel/fs: Drop the gl_program from fs_visitor
It's not used by anything anymore now that so much lowering has been
moved into NIR.  Sadly, we still need on in brw_compile_gs() for
geometry shaders on Sandy Bridge.  Short of a lot of pointless work,
that one's probably not going away.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-08-25 01:02:52 -05:00
Jason Ekstrand
8d3cbd0393 intel/fs: Add SLM size to brw_cs_prog_data
We don't need it for state setup but it's a useful statistic we want to
pass on to developers.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-08-12 22:56:07 +00:00
Jason Ekstrand
134607760a intel/compiler: Fill a compiler statistics struct
This commit is all annoying plumbing work which just adds support for a
new brw_compile_stats struct.  This struct provides a binary driver
readable form of the same statistics we dump out to stderr when we
INTEL_DEBUG is set with a shader stage.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-08-12 22:56:07 +00:00