Commit graph

295 commits

Ian Romanick
de6c0f8487 intel/fs: Implement support for NIR opcodes for INTEL_shader_integer_functions2
v2: Remove smashing type to D for nir_op_irhadd.  Caio noticed it was
odd, and removing it fixes an assertion failure in the crucible
func.shader.averageRounded.int64_t test (because the source should be
W).

v3: Emit BRW_OPCODE_MUL directly for nir_op_umul_32x16 and
nir_op_imul_32x16.  Suggested by Curro.

v4: Smash types of MUL instruction generated for nir_op_umul_32x16 and
nir_op_imul_32x16.  With this change, I get the same assembly now as I
did with v2.

v5: Remove support for pre-Gen7.  The integer multiply path was
incorrect, and, since the extension isn't enabled pre-Gen7, there's no
way to test it.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
2020-01-23 00:18:57 +00:00
Matt Turner
88a0523bd2 intel/compiler: Move Gen4/5 rounding to visitor
Gen4/5's rounding instructions operate differently than later Gens'.
They all return the floor of the input and the "Round-increment"
conditional modifier answers whether the result should be incremented by
1.0 to get the appropriate result for the operation (and thus its
behavior is determined by the round opcode; e.g., RNDZ vs RNDE).

Since this requires a second instruction (a predicated ADD) that
consumes the result of the round instruction, the round instruction
cannot write its result directly to the (write-only) message registers.
Because the ADD was emitted in the generator, the backend thought it
was safe to store the round's result directly to the message register
file.

To avoid this, we move the emission of the ADD instruction to the NIR
translator so that the backend has the information it needs.

I suspect this also fixes the code generated for RNDZ.SAT, but since
Gen4/5 don't support GLSL 1.30 (which adds the trunc() function), I
couldn't write a piglit test to confirm.  My thinking is that if x = -0.5:

      sat(trunc(-0.5)) = 0.0

But on Gen4/5 where sat(trunc(x)) is implemented as

      rndz.r.f0  result, x             // result = floor(x)
                                       // set f0 if increment needed
      (+f0) add  result, result, 1.0   // fixup so result = trunc(x)

then putting saturate on both instructions will give the wrong result.

      floor(-0.5) = -1.0
      sat(floor(-0.5)) = 0.0
      // +1 increment would be needed since floor(-0.5) != trunc(-0.5)
      sat(sat(floor(-0.5)) + 1.0) = 1.0

Fixes: 6f394343b1 ("nir/algebraic: i2f(f2i()) -> trunc()")
Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2355
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3459>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3459>
2020-01-22 23:47:02 +00:00
Caio Marcelo de Oliveira Filho
45164fc8c5 intel/fs: Don't emit control barrier if only one thread is used
When there's only one hardware thread (i.e. the dispatch width is
greater than or equal to the workgroup size), there's no need to use a
barrier to ensure all the invocations reach the same point in the
shader, because they are already running in lock-step.
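
As a rough illustration, the check boils down to the following
standalone sketch (not the Mesa code; the names are made up):

    /* Illustrative only: a control barrier is needed only when the
     * workgroup spans more than one hardware thread.
     */
    static bool
    needs_control_barrier(unsigned workgroup_size, unsigned dispatch_width)
    {
       /* A single HW thread runs its channels in lock-step, so a
        * workgroup that fits in one thread needs no barrier.
        */
       return workgroup_size > dispatch_width;
    }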

Results for SKL running Iris for shader-db tests with compute shaders

    total sends in shared programs: 18361 -> 18339 (-0.12%)
    sends in affected programs: 904 -> 882 (-2.43%)
    helped: 9
    HURT: 0
    helped stats (abs) min: 1 max: 5 x̄: 2.44 x̃: 2
    helped stats (rel) min: 0.84% max: 21.43% x̄: 7.82% x̃: 2.67%
    95% mean confidence interval for sends value: -3.31 -1.58
    95% mean confidence interval for sends %-change: -14.67% -0.97%
    Sends are helped.

Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.

Results for ICL and TGL are similar to SKL.

Results for BDW are similar to SKL, except for a DeusEx shader that
has a workgroup size of 16 but picks SIMD8 on BDW.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
2020-01-21 23:41:35 +00:00
Caio Marcelo de Oliveira Filho
4f431e870c intel/fs: Don't emit fence for shared memory if only one thread is used
When there's only one hardware thread (i.e. the dispatch width is
greater than or equal to the workgroup size), there's no need to
synchronize shared memory access (SLM), since all the requests from a
single thread are already synchronized.  In that case, we just add a
scheduling fence.

To be able to identify that case for all platforms, move the handling
of platforms prior to Gen11 (which don't have a separate SLM fence)
after the optimization.
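
Roughly, the decision order after this change looks like the following
standalone sketch (illustrative names only, not the Mesa code):

    enum fence_kind {
       FENCE_SCHEDULING_ONLY,   /* single HW thread: no real fence */
       FENCE_COMBINED,          /* pre-Gen11: SLM shares the data-cache fence */
       FENCE_SLM                /* Gen11+: separate SLM fence */
    };

    static enum fence_kind
    slm_fence_kind(unsigned workgroup_size, unsigned dispatch_width, int gen)
    {
       /* The single-thread check runs first so it applies to every
        * platform; the pre-Gen11 fallback comes after it.
        */
       if (workgroup_size <= dispatch_width)
          return FENCE_SCHEDULING_ONLY;
       if (gen < 11)
          return FENCE_COMBINED;
       return FENCE_SLM;
    }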

Results for SKL running Iris for shader-db tests with compute shaders

    total sends in shared programs: 18395 -> 18361 (-0.18%)
    sends in affected programs: 938 -> 904 (-3.62%)
    helped: 9
    HURT: 0
    helped stats (abs) min: 1 max: 5 x̄: 3.78 x̃: 4
    helped stats (rel) min: 1.56% max: 26.32% x̄: 10.33% x̃: 2.60%
    95% mean confidence interval for sends value: -4.85 -2.71
    95% mean confidence interval for sends %-change: -19.12% -1.54%
    Sends are helped.

Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.

Results for ICL and TGL are similar to SKL.

Results for BDW are similar to SKL, except for a DeusEx shader that
has a workgroup size of 16 but picks SIMD8 on BDW.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
2020-01-21 23:41:35 +00:00
Francisco Jerez
b54b67e067 intel/fs: Switch to standard vector layout for barycentrics at optimization time.
This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.

The main reason is correctness of SIMD32 fragment shaders.  The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32.  This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.

Of course one could have chosen to fix those helpers instead, but
there is another, far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: the SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for the
PLN instruction.  This causes incorrect execution masks to be applied
to the MOVs that unzip barycentric vectors whenever a LINTERP
instruction occurs under non-uniform control flow.

I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().
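
To illustrate the two layouts involved -- a standalone sketch assuming
the PLN payload interleaves X/Y per SIMD8 half, not the Mesa code:

    /* Convert one barycentric pair from the assumed PLN-style layout
     * (x0-7, y0-7, x8-15, y8-15, ...) to the standard vector layout
     * (all X channels, then all Y channels).  width must be a multiple
     * of 8.
     */
    static void
    pln_to_standard_layout(const float *pln, float *std_xy, unsigned width)
    {
       for (unsigned half = 0; half < width / 8; half++) {
          for (unsigned c = 0; c < 8; c++) {
             std_xy[half * 8 + c]         = pln[half * 16 + c];     /* X */
             std_xy[width + half * 8 + c] = pln[half * 16 + 8 + c]; /* Y */
          }
       }
    }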

This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions.  On ICL:

   total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
   instructions in affected programs: 10538474 -> 10343061 (-1.85%)
   helped: 68775
   HURT: 6

   total spills in shared programs: 8938 -> 8748 (-2.13%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8965 -> 8663 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   146
   GAINED: 43

On SKL:

   total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
   instructions in affected programs: 3876590 -> 3765635 (-2.86%)
   helped: 27492
   HURT: 2

   LOST:   191
   GAINED: 417

On SNB:

   total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
   instructions in affected programs: 5199074 -> 4606107 (-11.41%)
   helped: 29998
   HURT: 0

   LOST:   21
   GAINED: 30

Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled.  On ICL:

   total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
   instructions in affected programs: 6114788 -> 6027719 (-1.42%)
   helped: 42046
   HURT: 6

   total spills in shared programs: 8218 -> 8028 (-2.31%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8953 -> 8651 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   0
   GAINED: 3

On SKL:

   total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
   instructions in affected programs: 168850 -> 167594 (-0.74%)
   helped: 711
   HURT: 2

On SNB:

   total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
   instructions in affected programs: 2702172 -> 2666037 (-1.34%)
   helped: 17818
   HURT: 0

All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:23:12 -08:00
Francisco Jerez
79bd252d6e intel/fs: Introduce barycentric layout lowering pass.
The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering.  Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass.  For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:22:59 -08:00
Kenneth Graunke
0a1c47074b intel/compiler: Fix illegal mutation in get_nir_image_intrinsic_image
get_nir_image_intrinsic_image() was incorrectly mutating the value held
by the register which holds the intrinsic's first source (image index).

If this happened to be the register for an SSA def which is also used
elsewhere in the program, this meant that we would clobber that value
in subsequent uses.
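
The bug class, shown as a generic standalone C++ sketch rather than the
actual get_nir_image_intrinsic_image() code:

    #include <cstdint>

    struct reg { int64_t imm; };

    /* Buggy pattern: mutating a register that may back an SSA value
     * clobbers every other use of that value.
     */
    static reg get_image_bad(reg &src, unsigned binding_table_start)
    {
       src.imm += binding_table_start;   /* modifies the shared value */
       return src;
    }

    /* Fixed pattern: apply the offset to a fresh temporary and leave
     * the original value untouched for subsequent uses.
     */
    static reg get_image_good(const reg &src, unsigned binding_table_start)
    {
       reg tmp = src;
       tmp.imm += binding_table_start;
       return tmp;
    }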

Note that this only affects i965, because neither anv nor iris use the
binding table start sections, so nothing is ever added here.

Fixes KHR-GL46.compute_shader.resources-max on i965 with Eric Anholt's
MR !3240 applied.  That MR reorders SSBOs and ABOs, so that test uses
image 0 and SSBO 0, causing this code to brilliantly add binding table
index 45 to both the image (correct) and the SSBO (bzzt, wrong!).

Fixes: 09f1de97a7 ("anv,i965: Lower away image derefs in the driver")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3404>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3404>
2020-01-15 19:25:35 +00:00
Caio Marcelo de Oliveira Filho
edf6a40cb2 intel/fs: Only use SLM fence in compute shaders
Fixes: b390ff3517 ("intel/fs: Add support for SLM fence in Gen11")
Fixes: e142061399 ("intel/fs: Implement scoped_memory_barrier")

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2020-01-14 10:55:48 -08:00
Jason Ekstrand
d3737002ee nir/lower_atomics_to_ssbo: Also lower barriers
This is more correct for a pass which is supposed to completely lower
away atomic counters.  It also lets us stop supporting atomic counter
barriers in most of the drivers.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
2020-01-13 17:23:47 +00:00
Jason Ekstrand
e40b11bbcb nir: Rename nir_intrinsic_barrier to control_barrier
This is a more explicit name now that we don't want it to be doing any
memory barrier stuff for us.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
2020-01-13 17:23:47 +00:00
Jason Ekstrand
60097cc840 nir: Add a new memory_barrier_tcs_patch intrinsic
Right now, it's implemented as a no-op for everyone.  For most drivers,
it's a switch case in the NIR -> whatever which just breaks.  For ir3,
they already have code to delete tessellation barriers so we just add a
case to also delete memory_barrier_tcs_patch.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
2020-01-13 17:23:47 +00:00
Francisco Jerez
cc0ea482ad intel/fs: Fix nir_intrinsic_load_barycentric_at_sample for SIMD32.
For uniform sample ID, only the first channel of msg_data will be
initialized.  We need to pass only that component to the SEND message
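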
for SIMD lowering to unzip the descriptor source correctly.

Fixes several dozens of conformance test failures with SIMD32 fragment
shaders enabled, including:

dEQP-GLES31.functional.shaders.multisample_interpolation.interpolate_at_sample.dynamic_sample_number.*

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:01:52 -08:00
Tapani Pälli
f6004bac1f intel/compiler: add newline to limit_dispatch_width message
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-12-05 08:13:58 +02:00
Ian Romanick
e51eda99df intel/fs: Disable conditional discard optimization on Gen4 and Gen5
The CMP instruction on Gen4 and Gen5 generates one bit (the LSB) of
valid data and 31 bits of junk.  Results of comparisons that are used as
Boolean values need to have a fixup applied to generate the proper 0/~0
values.
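
As an illustration of that fixup -- a standalone sketch, not the actual
backend code:

    #include <cstdint>

    /* Expand the single valid bit of a Gen4/5 CMP result into a proper
     * 0/~0 Boolean.
     */
    static uint32_t resolve_gen4_bool(uint32_t raw_cmp_result)
    {
       return -(raw_cmp_result & 1u);   /* 0 -> 0x00000000, 1 -> 0xffffffff */
    }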

Calling fs_visitor::nir_emit_alu with need_dest=false prevents the fixup
code from being generated.  This results in a sequence like:

        cmp.l.f0.0(16)  g8<1>F          g14<8,8,1>F     0x0F  /* 0F */
        ...
        cmp.l.f0.0(16)  g4<1>F          g6<8,8,1>F      0x0F  /* 0F */
(+f0.1) or.z.f0.1(16) null<1>UD g4<8,8,1>UD     g8<8,8,1>UD

instead of

        cmp.l.f0.0(16)  g8<1>F          g14<8,8,1>F     0x0F  /* 0F */
        ...
        cmp.l.f0.0(16)  g4<1>F          g6<8,8,1>F      0x0F  /* 0F */
        or(16) g4<1>UD g4<8,8,1>UD     g8<8,8,1>UD
(+f0.1) and.z.f0.1(16) null<1>UD g4<8,8,1>UD     1UD

I examined a couple of the shaders hurt by this change, and ALL of them
would have been affected by this bug. :(

Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/1836
Fixes: 0ba9497e66 ("intel/fs: Improve discard_if code generation")

Iron Lake
total instructions in shared programs: 8122757 -> 8122957 (<.01%)
instructions in affected programs: 8307 -> 8507 (2.41%)
helped: 0
HURT: 100
HURT stats (abs)   min: 2 max: 2 x̄: 2.00 x̃: 2
HURT stats (rel)   min: 0.84% max: 6.67% x̄: 2.81% x̃: 2.76%
95% mean confidence interval for instructions value: 2.00 2.00
95% mean confidence interval for instructions %-change: 2.58% 3.03%
Instructions are HURT.

total cycles in shared programs: 188510100 -> 188510376 (<.01%)
cycles in affected programs: 76018 -> 76294 (0.36%)
helped: 0
HURT: 55
HURT stats (abs)   min: 2 max: 12 x̄: 5.02 x̃: 4
HURT stats (rel)   min: 0.07% max: 3.75% x̄: 0.86% x̃: 0.56%
95% mean confidence interval for cycles value: 4.33 5.71
95% mean confidence interval for cycles %-change: 0.60% 1.12%
Cycles are HURT.

GM45
total instructions in shared programs: 4994403 -> 4994503 (<.01%)
instructions in affected programs: 4212 -> 4312 (2.37%)
helped: 0
HURT: 50
HURT stats (abs)   min: 2 max: 2 x̄: 2.00 x̃: 2
HURT stats (rel)   min: 0.84% max: 6.25% x̄: 2.76% x̃: 2.72%
95% mean confidence interval for instructions value: 2.00 2.00
95% mean confidence interval for instructions %-change: 2.45% 3.07%
Instructions are HURT.

total cycles in shared programs: 128928750 -> 128928982 (<.01%)
cycles in affected programs: 67442 -> 67674 (0.34%)
helped: 0
HURT: 47
HURT stats (abs)   min: 2 max: 12 x̄: 4.94 x̃: 4
HURT stats (rel)   min: 0.09% max: 3.75% x̄: 0.75% x̃: 0.53%
95% mean confidence interval for cycles value: 4.19 5.68
95% mean confidence interval for cycles %-change: 0.50% 1.00%
Cycles are HURT.
2019-11-21 16:40:50 -08:00
Paulo Zanoni
eb6352162d intel/compiler: fix nir_op_{i,u}*32 on ICL
On ICL we have the src1 restriction which is applied through
fix_byte_src() and potentially changes the type of the operands from 8
to 32 bits. When this change happens, we fall into the "else if
(bit_size < 32)" case and miscompute src_type because it takes into
consideration bit_size (8) instead of the adjusted size of temp_op
(32). This results in the shader reading unused memory, giving us
mostly failures, but occasional passes due to whatever was already in
the registers we were reading.
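
Roughly, the idea of the fix in a standalone sketch (the type names are
placeholders, not the real brw ones): derive the source type from the
operand actually being read rather than from the NIR bit size.

    enum reg_type { TYPE_B, TYPE_W, TYPE_D };

    static enum reg_type type_for_bytes(unsigned bytes)
    {
       return bytes == 1 ? TYPE_B : bytes == 2 ? TYPE_W : TYPE_D;
    }

    /* After fix_byte_src() the operand may have been widened from 8 to
     * 32 bits, so the type must come from the operand's actual size.
     */
    static enum reg_type src_type_for_operand(unsigned nir_bit_size,
                                              unsigned operand_bytes)
    {
       (void)nir_bit_size;                   /* what the buggy code used */
       return type_for_bytes(operand_bytes); /* what the fix uses */
    }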

This commit fixes a lot of dEQP subgroup i8vec2 tests on ICL, such as:
    dEQP-VK.subgroups.arithmetic.compute.subgroupadd_i8vec2

This can also be verified by simply changing fix_byte_src() to apply
on all platforms.

Fixes: 5847de6e9a ("intel/compiler: don't use byte operands for src1 on ICL")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-11-13 22:13:52 +00:00
Jason Ekstrand
53bfcdeecf intel/fs: Implement the new load/store_scratch intrinsics
This commit fills in a number of different pieces:

 1. We add support to brw_nir_lower_mem_access_bit_sizes to handle the
    new intrinsics.  This involves simple plumbing work as well as a
    tiny bit of extra logic to always scalarize scratch intrinsics

 2. Add code to brw_fs_nir.cpp to turn nir_load/store_scratch intrinsics
    into byte/dword scattered read/write messages which use the A32
    stateless model.

 3. Add code to lower_surface_logical_send to handle dword scattered
    messages and the A32 stateless model.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-11-11 17:17:02 +00:00
Caio Marcelo de Oliveira Filho
e142061399 intel/fs: Implement scoped_memory_barrier
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-10-24 11:39:56 -07:00
Jason Ekstrand
c92fb60007 intel/fs/gen12: Implement gl_FrontFacing on gen12+.
The bit moved on gen12 in order to prepare for dual-SIMD8 dispatch.
This implementation isn't entirely complete, as it only works on SIMD8
and SIMD16 and not dual-SIMD8.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-10-11 12:24:16 -07:00
Andres Gomez
5e87f48f1d i965/fs: set rounding mode when emitting the flrp instruction
flrp was missed when the rounding mode was added for the other
instructions.

Fixes: ba1e25e1aa ("i965/fs: set rounding mode when emitting fadd, fmul and ffma instructions")
Suggested-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2019-09-24 12:06:59 +03:00
Andres Gomez
6f1468c371 i965/fs: add a comment about how the rounding mode in fmul is set
After
1711bf6cf2 ("intel/fs: Generate better code for fsign multiplied by a value"),
how to set the rounding mode after the fused fmul and fsign
optimization is non-obvious.

Basically, the optimization doesn't really result in a MUL, or any
other operation which would need to have the rounding mode set. Hence,
we set it just before the actual MUL in the treatment of fmul.

Fixes: ba1e25e1aa ("i965/fs: set rounding mode when emitting fadd, fmul and ffma instructions")
Suggested-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2019-09-24 11:24:15 +03:00
Jason Ekstrand
03255da225 intel/fs: Do 8-bit subgroup scan operations in 16 bits
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-09-20 18:02:15 +00:00
Jason Ekstrand
3515c0e9cf intel/fs: Allow UB, B, and HF types in brw_nir_reduction_op_identity
Because byte immediates aren't a thing on GEN hardware, we return a
signed or unsigned word immediate in the byte case.

Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2019-09-20 18:02:15 +00:00
Caio Marcelo de Oliveira Filho
fa080f03d3 intel/fs: Add Fall-through comment
Reviewed-by: Andres Gomez <agomez@igalia.com>
2019-09-19 10:02:16 -07:00
Kenneth Graunke
0e4a75f917 intel/compiler: Record whether any pull constant loads occur
I would like for iris to be able to avoid setting up SURFACE_STATE
for UBOs in the common case where all constants are pushed.

Unfortunately, we don't know up front whether everything will be
pushed: the backend is allowed to demote pushed UBOs to pull loads
fairly late in the process.  This is probably desirable though, as
we'd like the backend to be able to re-pull pushed data to break up
long live ranges in response to register pressure.

Here we simply add a "are there any pull loads at all" boolean to
prog_data, which is a bit crude but at least allows us to skip work
in the common "everything pushed" case.  We could skip more work by
tracking exactly which UBO surfaces are pulled in a bitmask, but I
wanted to avoid bringing back the old mark_surface_used() mechanism.
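
A rough standalone sketch of the mechanism (the field and function
names here are assumptions, not necessarily the real ones):

    /* A single "any pull loads?" flag in the program data lets the
     * driver skip UBO SURFACE_STATE setup when everything was pushed.
     */
    struct prog_data_sketch {
       bool has_ubo_pull;
    };

    /* Compiler side: flag the program whenever a pushed UBO range gets
     * demoted to a pull load.
     */
    static void demote_to_pull_constant(struct prog_data_sketch *pd)
    {
       /* ...emit the pull load... */
       pd->has_ubo_pull = true;
    }

    /* Driver side (e.g. iris): only build UBO surface states when some
     * constants are actually pulled.
     */
    static void upload_ubo_surfaces(const struct prog_data_sketch *pd)
    {
       if (!pd->has_ubo_pull)
          return;   /* everything pushed; no SURFACE_STATE needed */
       /* ...set up SURFACE_STATE for the bound UBOs... */
    }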

Finer-grained tracking could allow us to skip a bit more work when
multiple UBOs are in use and /some/ are 100% pushed, but others are
accessed via pulls.  However, I'm not sure how common this is and
it would save at most 4 pull descriptors, so we defer that for now.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-18 15:44:22 -07:00
Samuel Iglesias Gonsálvez
9bd88d10d8 i965/fs: set rounding mode when emitting nir_op_f2f32 or nir_op_f2f16
v2:
- Consider nir_op_f2f16 case too (Caio).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 23:39:19 +03:00
Samuel Iglesias Gonsálvez
ba1e25e1aa i965/fs: set rounding mode when emitting fadd, fmul and ffma instructions
v2:
- Updated to renamed shader info member (Andres).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 23:39:19 +03:00
Samuel Iglesias Gonsálvez
9da56ffc52 i965/fs: add emit_shader_float_controls_execution_mode() and aux functions
We need this function to emit the code that sets up the control
register with the execution mode defined for the shader.  Therefore, we
emit it as the first instruction.
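
As a hedged sketch of the kind of mapping such a helper performs (the
enum and flag names here are placeholders, not the Mesa ones):

    enum rnd_mode { RND_MODE_RTNE, RND_MODE_RTZ, RND_MODE_UNCONSTRAINED };

    #define FC_RTE_FP32 (1u << 0)   /* placeholder float-controls flags */
    #define FC_RTZ_FP32 (1u << 1)

    static enum rnd_mode rnd_mode_from_execution_mode(unsigned execution_mode)
    {
       if (execution_mode & FC_RTE_FP32)
          return RND_MODE_RTNE;
       if (execution_mode & FC_RTZ_FP32)
          return RND_MODE_RTZ;
       return RND_MODE_UNCONSTRAINED;   /* leave the control register alone */
    }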

v2:
- Fix bug in setting the default mode mask in brw_rnd_mode_from_nir().
- Fix support for rounding modes in brw_rnd_mode_from_nir().

v3:
- Updated to renamed shader info member and enum values (Andres).

v4:
- Add actual emission as first instruction of emit_nir_code (Caio).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 23:39:19 +03:00
Samuel Iglesias Gonsálvez
cdace5b0c6 i965/fs/nir: add nir_op_unpack_half_2x16_split_*_flush_to_zero
The denorm mode is set in the control register; there is no need to do
anything else.

v2:
- Add an assert to make sure that we realize if this assumption is
  broken in the future (Caio).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-09-17 23:39:18 +03:00
Jason Ekstrand
d15fe8ca82 Revert "intel/fs: Move the scalar-region conversion to the generator."
This reverts commit c0504569ea.  Now that
we're doing interpolation lowering in NIR, we can continue to stride the
FS input registers directly in the brw_fs_nir code like we did before.
This fixes SIMD32 fragment shaders which broke because lower_simd_width
depended on the 0 stride to split PLN instructions correctly.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2019-09-06 03:58:09 +00:00
Caio Marcelo de Oliveira Filho
eac8a3b9af anv: Drop unused local variable
Leftover from 021fa28163 ("intel/nir: Add a helper for getting
BRW_AOP from an intrinsic").

Acked-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-08-23 13:25:27 -07:00
Jason Ekstrand
021fa28163 intel/nir: Add a helper for getting BRW_AOP from an intrinsic
So many duplicated switch statements....

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-08-21 17:19:55 +00:00
Jason Ekstrand
951cf94521 nir: Add explicit signs to image min/max intrinsics
This better matches all the other atomic intrinsics such as those for
SSBOs and shared variables where the sign is part of the intrinsic
opcode.  Both generators (GLSL and SPIR-V) know the sign from the type
of the image variable or handle.  In SPIR-V, signed min/max are separate
opcodes from unsigned.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
2019-08-21 17:19:55 +00:00
Jason Ekstrand
c02c3ff612 intel/nir: Add a common nir comparison -> cmod helper
We already had one in the vec4 code; we just had to move it.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-08-03 00:35:48 +00:00
Jason Ekstrand
d03ec807a4 intel/fs: Drop all of the 64-bit varying code
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-31 18:14:09 -05:00
Jason Ekstrand
8fd2f2c276 intel/fs: Implement quad_swap_horizontal with a swizzle on gen7
This fixes dEQP-VK.subgroups.quad.compute.subgroupquadswaphorizontal_*
on all gen7 platforms.

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-30 22:38:19 +00:00
Jason Ekstrand
4bb6e6817e intel: Use a system value for gl_FragCoord
It's kind-of an anomaly that the Intel drivers are still treating
gl_FragCoord as an input.  It also makes zero sense because we have to
special-case it in the back-end.

Because ANV is the only user of nir_lower_wpos_center, we go ahead and
just update it to look for nir_intrinsic_load_frag_coord as part of this
patch.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-07-29 23:30:26 +00:00
Daniel Schürmann
e272fdd508 nir,intel: lower if (cond) demote() to new intrinsic demote_if(cond)
This will effectively enable the optimization in anv.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-07-24 13:02:18 -05:00
Jason Ekstrand
4397eb91c1 intel/compiler: Allow for varying subgroup sizes
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
2019-07-24 12:55:40 -05:00
Jason Ekstrand
7ceec21b76 intel/fs: Use a strided MOV instead of a conversion for load_* destinations
In many cases, the compiler can just copy-prop the strided MOV whereas
the conversion is a bit trickier.  This cuts 5% of the instructions off
of one particular Vulkan CTS test which does lots of load_ssbo.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-17 18:44:35 +00:00
Caio Marcelo de Oliveira Filho
b390ff3517 intel/fs: Add support for SLM fence in Gen11
Gen11 SLM is not on L3 anymore, so now the hardware has two separate
fences.  Add a way to control which fence types to use.

At this time, we don't have enough information in NIR to control the
visibility of the memory being fenced, so for now be conservative and
assume that fences will need a stall.  With more information later
we'll be able to reduce those.
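
Roughly, the "which fences to emit" decision from v3 as a standalone
sketch (illustrative only; the real code emits SHADER_OPCODE_MEMORY_FENCE):

    struct fence_plan {
       bool slm_fence;
       bool data_cache_fence;
       bool stall;
    };

    static struct fence_plan plan_memory_fence(int gen)
    {
       struct fence_plan plan = {};
       plan.data_cache_fence = true;
       plan.slm_fence = gen >= 11;  /* SLM is no longer behind L3 */
       plan.stall = true;           /* conservative until NIR tells us more */
       return plan;
    }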

Fixes Vulkan CTS tests in ICL:

    dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.workgroup.guard_local.buffer.comp
    dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.buffer.guard_nonlocal.workgroup.comp
    dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.image.guard_nonlocal.workgroup.comp
    dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.buffer.guard_nonlocal.workgroup.comp
    dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.image.guard_nonlocal.workgroup.comp

The whole set of supported tests in dEQP-VK.memory_model.* group
should be passing in ICL now.

v2: Pass BTI around instead of having an enum.  (Jason)
    Emit two SHADER_OPCODE_MEMORY_FENCE instead of one that gets
    transformed into two.  (Jason)
    List tests fixed.  (Lionel)

v3: For clarity, split the decision of which fences to emit from the
    emission code.  (Jason)

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2019-07-11 08:29:32 -07:00
Caio Marcelo de Oliveira Filho
45f5db5a84 intel/fs: Implement "demote to helper invocation"
The "demote" intrinsic works like "discard" but don't change the
control flow, allowing derivative operations to work.  This is the
semantics of D3D discard.

The "is_helper_invocation" intrinsic will return true for helper
invocations -- both the ones that started as helpers and the ones that
were demoted.  This is needed to avoid changing the behavior of
gl_HelperInvocation which is an input (so not expected to change
during shader execution).
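
As a hedged illustration of the intended semantics (a standalone model,
not driver code):

    #include <cstdio>

    struct Channel {
       bool started_as_helper = false;
       bool demoted = false;
    };

    /* "demote" keeps the channel executing (so derivatives still work)
     * but masks its stores, exactly like a helper invocation.
     */
    static void demote(Channel &c) { c.demoted = true; }

    /* gl_HelperInvocation is an input: fixed at dispatch, never changes. */
    static bool gl_helper_invocation(const Channel &c)
    { return c.started_as_helper; }

    /* is_helper_invocation(): true for channels that started as helpers
     * or were demoted during execution.
     */
    static bool is_helper_invocation(const Channel &c)
    { return c.started_as_helper || c.demoted; }

    int main()
    {
       Channel c;
       demote(c);
       std::printf("gl_HelperInvocation=%d is_helper_invocation=%d\n",
                   gl_helper_invocation(c), is_helper_invocation(c));
       return 0;   /* prints 0 and 1: only the latter sees the demote */
    }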

v2: Emit the discard jump and comment why it is safe.  (Jason)
    Rework the is_helper_invocation() that was stomping f0.1.  (Jason)

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-07-08 08:57:25 -07:00
Jason Ekstrand
2b79a9e5a5 intel/fs: Implement nir_intrinsic_load_fs_input_interp_deltas
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-02 16:15:25 +00:00
Jason Ekstrand
8e7d066682 intel/fs: Actually implement the load_barycentric intrinsics
If they never get used, dead code elimination should clean them up.
Also, we rework
the at_offset and at_sample intrinsics so they return a proper vec2
instead of returning things in PLN layout.  Fortunately, copy-prop is
pretty good at cleaning this up and it doesn't result in any actual
extra MOVs.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-02 16:15:25 +00:00
Sagar Ghuge
1e92e83856 intel/compiler: Emit ROR and ROL instruction
v2: Reorder patch (Matt Turner)

Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-07-01 10:14:22 -07:00
Lionel Landwerlin
5847de6e9a intel/compiler: don't use byte operands for src1 on ICL
The simulator complains about using byte operands, and we also have
documentation telling us the same.

Note that add operations on bytes seem to work fine on HW (like ADD).
Using dword operands with CMP & SEL fixes the following tests:

   dEQP-VK.spirv_assembly.type.vec*.i8.*

v2: Drop the GLK changes (Matt)
    Add validator tests (Matt)

v3: Drop GLK ref (Matt)
    Don't mix float/integer in MAD (Matt)

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com> (v1)
Reviewed-by: Matt Turner <mattst88@gmail.com>
BSpec: 3017
Cc: <mesa-stable@lists.freedesktop.org>
2019-06-29 12:56:09 +00:00
Ian Romanick
0ba9497e66 intel/fs: Improve discard_if code generation
Previously we would blindly emit a sequence like:

        mov(1)          f0.1<1>UW       g1.14<0,1,0>UW
        ...
        cmp.l.f0(16)    g7<1>F          g5<8,8,1>F      0x41700000F  /* 15F */
(+f0.1) cmp.z.f0.1(16)  null<1>D        g7<8,8,1>D      0D

The first move sets the flags based on the initial execution mask.
Later discard sequences contain a predicated compare that can only
remove more SIMD channels.  Oftentimes the only user of the result from
the first compare is the second compare.  Instead, generate a sequence
like

        mov(1)          f0.1<1>UW       g1.14<0,1,0>UW
        ...
        cmp.l.f0(16)    g7<1>F          g5<8,8,1>F      0x41700000F  /* 15F */
(+f0.1) cmp.ge.f0.1(8)  null<1>F        g5<8,8,1>F      0x41700000F  /* 15F */

If the results stored in g7 and f0.0 are not used, the comparison will
be eliminated.  This removes an instruction and potentially reduces
register pressure.

v2: Major re-write of the commit message (including fixing the assembly
code).  Suggested by Matt.

All Gen8+ platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17224434 -> 17198659 (-0.15%)
instructions in affected programs: 2908125 -> 2882350 (-0.89%)
helped: 18891
HURT: 5
helped stats (abs) min: 1 max: 12 x̄: 1.38 x̃: 1
helped stats (rel) min: 0.03% max: 25.00% x̄: 1.76% x̃: 1.02%
HURT stats (abs)   min: 9 max: 105 x̄: 51.40 x̃: 35
HURT stats (rel)   min: 0.43% max: 4.92% x̄: 2.34% x̃: 1.56%
95% mean confidence interval for instructions value: -1.39 -1.34
95% mean confidence interval for instructions %-change: -1.79% -1.73%
Instructions are helped.

total cycles in shared programs: 361468458 -> 361170679 (-0.08%)
cycles in affected programs: 38470116 -> 38172337 (-0.77%)
helped: 16202
HURT: 1456
helped stats (abs) min: 1 max: 4473 x̄: 26.24 x̃: 18
helped stats (rel) min: <.01% max: 28.44% x̄: 2.90% x̃: 2.18%
HURT stats (abs)   min: 1 max: 5982 x̄: 87.51 x̃: 28
HURT stats (rel)   min: <.01% max: 51.29% x̄: 5.48% x̃: 1.64%
95% mean confidence interval for cycles value: -18.24 -15.49
95% mean confidence interval for cycles %-change: -2.26% -2.14%
Cycles are helped.

total spills in shared programs: 12147 -> 12176 (0.24%)
spills in affected programs: 175 -> 204 (16.57%)
helped: 8
HURT: 5

total fills in shared programs: 25262 -> 25292 (0.12%)
fills in affected programs: 269 -> 299 (11.15%)
helped: 8
HURT: 5

Haswell
total instructions in shared programs: 13530316 -> 13502647 (-0.20%)
instructions in affected programs: 2507824 -> 2480155 (-1.10%)
helped: 18859
HURT: 10
helped stats (abs) min: 1 max: 12 x̄: 1.48 x̃: 1
helped stats (rel) min: 0.03% max: 27.78% x̄: 2.38% x̃: 1.41%
HURT stats (abs)   min: 5 max: 39 x̄: 25.70 x̃: 31
HURT stats (rel)   min: 0.22% max: 1.66% x̄: 1.09% x̃: 1.31%
95% mean confidence interval for instructions value: -1.49 -1.44
95% mean confidence interval for instructions %-change: -2.42% -2.34%
Instructions are helped.

total cycles in shared programs: 377865412 -> 377639034 (-0.06%)
cycles in affected programs: 40169572 -> 39943194 (-0.56%)
helped: 15550
HURT: 1938
helped stats (abs) min: 1 max: 2482 x̄: 25.67 x̃: 18
helped stats (rel) min: <.01% max: 37.77% x̄: 3.00% x̃: 2.25%
HURT stats (abs)   min: 1 max: 4862 x̄: 89.17 x̃: 35
HURT stats (rel)   min: <.01% max: 67.67% x̄: 6.16% x̃: 2.75%
95% mean confidence interval for cycles value: -14.42 -11.47
95% mean confidence interval for cycles %-change: -2.05% -1.91%
Cycles are helped.

total spills in shared programs: 26769 -> 26814 (0.17%)
spills in affected programs: 826 -> 871 (5.45%)
helped: 9
HURT: 10

total fills in shared programs: 38383 -> 38425 (0.11%)
fills in affected programs: 834 -> 876 (5.04%)
helped: 9
HURT: 10

LOST:   5
GAINED: 10

Ivy Bridge
total instructions in shared programs: 12079250 -> 12044139 (-0.29%)
instructions in affected programs: 2409680 -> 2374569 (-1.46%)
helped: 16135
HURT: 0
helped stats (abs) min: 1 max: 23 x̄: 2.18 x̃: 2
helped stats (rel) min: 0.07% max: 37.50% x̄: 2.72% x̃: 1.68%
95% mean confidence interval for instructions value: -2.21 -2.14
95% mean confidence interval for instructions %-change: -2.76% -2.67%
Instructions are helped.

total cycles in shared programs: 180116747 -> 179900405 (-0.12%)
cycles in affected programs: 25439823 -> 25223481 (-0.85%)
helped: 13817
HURT: 1499
helped stats (abs) min: 1 max: 1886 x̄: 26.40 x̃: 18
helped stats (rel) min: <.01% max: 38.84% x̄: 2.57% x̃: 1.97%
HURT stats (abs)   min: 1 max: 3684 x̄: 98.99 x̃: 52
HURT stats (rel)   min: <.01% max: 97.01% x̄: 6.37% x̃: 3.42%
95% mean confidence interval for cycles value: -15.68 -12.57
95% mean confidence interval for cycles %-change: -1.77% -1.63%
Cycles are helped.

LOST:   8
GAINED: 10

Sandy Bridge
total instructions in shared programs: 10878990 -> 10863659 (-0.14%)
instructions in affected programs: 1806702 -> 1791371 (-0.85%)
helped: 13023
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 1.18 x̃: 1
helped stats (rel) min: 0.07% max: 13.79% x̄: 1.65% x̃: 1.10%
95% mean confidence interval for instructions value: -1.18 -1.17
95% mean confidence interval for instructions %-change: -1.68% -1.62%
Instructions are helped.

total cycles in shared programs: 154082878 -> 153862810 (-0.14%)
cycles in affected programs: 20199374 -> 19979306 (-1.09%)
helped: 12048
HURT: 510
helped stats (abs) min: 1 max: 323 x̄: 20.57 x̃: 18
helped stats (rel) min: 0.03% max: 17.78% x̄: 2.05% x̃: 1.52%
HURT stats (abs)   min: 1 max: 448 x̄: 54.39 x̃: 16
HURT stats (rel)   min: 0.02% max: 37.98% x̄: 4.13% x̃: 1.17%
95% mean confidence interval for cycles value: -17.97 -17.08
95% mean confidence interval for cycles %-change: -1.84% -1.75%
Cycles are helped.

LOST:   1
GAINED: 0

Iron Lake
total instructions in shared programs: 8155075 -> 8142729 (-0.15%)
instructions in affected programs: 949495 -> 937149 (-1.30%)
helped: 5810
HURT: 0
helped stats (abs) min: 1 max: 8 x̄: 2.12 x̃: 2
helped stats (rel) min: 0.10% max: 16.67% x̄: 2.53% x̃: 1.85%
95% mean confidence interval for instructions value: -2.14 -2.11
95% mean confidence interval for instructions %-change: -2.59% -2.48%
Instructions are helped.

total cycles in shared programs: 188584610 -> 188549632 (-0.02%)
cycles in affected programs: 17274446 -> 17239468 (-0.20%)
helped: 3881
HURT: 90
helped stats (abs) min: 2 max: 168 x̄: 9.08 x̃: 6
helped stats (rel) min: <.01% max: 23.53% x̄: 0.83% x̃: 0.30%
HURT stats (abs)   min: 2 max: 10 x̄: 2.80 x̃: 2
HURT stats (rel)   min: <.01% max: 0.60% x̄: 0.10% x̃: 0.07%
95% mean confidence interval for cycles value: -9.35 -8.27
95% mean confidence interval for cycles %-change: -0.85% -0.77%
Cycles are helped.

GM45
total instructions in shared programs: 5019308 -> 5013119 (-0.12%)
instructions in affected programs: 489028 -> 482839 (-1.27%)
helped: 2912
HURT: 0
helped stats (abs) min: 1 max: 8 x̄: 2.13 x̃: 2
helped stats (rel) min: 0.10% max: 16.67% x̄: 2.46% x̃: 1.81%
95% mean confidence interval for instructions value: -2.14 -2.11
95% mean confidence interval for instructions %-change: -2.54% -2.39%
Instructions are helped.

total cycles in shared programs: 129002592 -> 128977804 (-0.02%)
cycles in affected programs: 12669152 -> 12644364 (-0.20%)
helped: 2759
HURT: 37
helped stats (abs) min: 2 max: 168 x̄: 9.03 x̃: 4
helped stats (rel) min: <.01% max: 21.43% x̄: 0.75% x̃: 0.31%
HURT stats (abs)   min: 2 max: 10 x̄: 3.62 x̃: 4
HURT stats (rel)   min: <.01% max: 0.41% x̄: 0.10% x̃: 0.04%
95% mean confidence interval for cycles value: -9.53 -8.20
95% mean confidence interval for cycles %-change: -0.79% -0.70%
Cycles are helped.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-06-05 17:04:13 -07:00
Ian Romanick
a288708506 intel/fs: Add need_dest parameter to fs_visitor::nir_emit_alu
This is the same as the need_dest parameter to
prepare_alu_destination_and_sources.  This allows us to not change the
register that is expected to hold a result if an instruction is
re-emitted.  This is particularly a problem if the re-emitted
instruction is a partial write.  A later patch will use this feature.

No shader-db changes on any Intel platform.

v2: Don't do the Boolean resolve when there is no destination.  If the
ALU instruction didn't write a register, there's nothing to resolve.
This replaces an earlier patch "intel/fs: Allocate dummy destination
register when need_dest is false".

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-06-05 17:04:08 -07:00
Jason Ekstrand
f4ef34f207 intel/fs: Add an UNDEF instruction to avoid excess live ranges
With 8 and 16-bit types and anything where we have to use non-trivial
register strides to deal with restrictions, we end up with things that
look like partial writes even though we don't care about any values in
the register except those written by that instruction.  This is
particularly important when dealing with loops: liveness sees
is_partial_write, assumes an old value from a previous loop iteration
may still be valid at that point, and extends all purely partially
written values to the entire loop.

This commit adds a new UNDEF instruction which does nothing (the
generator doesn't emit anything) but which does a fake write to the
register.  This informs liveness that we don't care about any values
before that point so it won't consider those registers to be falsely
live.  We can safely emit UNDEF instructions for all SSA values that
come in from NIR and nearly all temporaries generated by various stages
of the compiler.  In particular, we need to insert UNDEF instructions
when we handle region restrictions because the newly allocated registers
are almost guaranteed to be partially written.
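
As a rough standalone model of the idea (not the compiler code; the
real pass emits UNDEFs for incoming SSA values and for temporaries
allocated while handling region restrictions):

    #include <set>
    #include <vector>

    struct inst { int opcode; int dst; bool partial_write; };
    enum { OP_UNDEF = -1 };

    /* Insert an UNDEF before the first write of a register when that
     * write is partial, so liveness knows no earlier value matters.
     */
    static std::vector<inst> insert_undefs(const std::vector<inst> &prog)
    {
       std::vector<inst> out;
       std::set<int> written;
       for (const inst &i : prog) {
          if (i.partial_write && !written.count(i.dst))
             out.push_back({OP_UNDEF, i.dst, false});   /* fake full write */
          written.insert(i.dst);
          out.push_back(i);
       }
       return out;
    }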

No shader-db changes.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110432
Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-06-04 14:27:30 -05:00
Jason Ekstrand
9e403dc56e intel/fs: Do a stalling MFENCE in endInvocationInterlock()
Fixes: 939312702e "i965: Add ARB_fragment_shader_interlock support"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-05-30 14:00:26 +00:00
Jason Ekstrand
859de4a748 intel/fs,vec4: Use g0 as the header for MFENCE
We set header_present but then pass it some random garbage.  Give it g0
instead.  I'm not actually sure this does anything, but g0 is the
usual header data and it's what the Windows driver does, so it seems
like a good idea.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-05-30 14:00:26 +00:00