Commit graph

1487 commits

Author SHA1 Message Date
Jason Ekstrand
5e1c42d85f nir: Call nir_metadata_preserve on !progress
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5171>
2020-06-11 05:08:12 +00:00
Caio Marcelo de Oliveira Filho
fe214d60bc intel/fs: Add Fall-through comment
Just to clarify the missing break is intentional.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5365>
2020-06-08 15:49:24 +00:00
Boris Brezillon
689acc7398 intel/compiler: Extract control barriers from scoped barriers
Add a lowering pass extracting all control barriers embedded in scoped
barriers into proper control barriers so we can get rid of the logic
inserting control barriers when an SpvOpControlBarrier with WorkGroup
scope is parsed in spirv_to_nir().

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4900>
2020-06-03 07:39:52 +00:00
Boris Brezillon
345b5847b4 nir: Replace the scoped_memory barrier by a scoped_barrier
SPIRV OpControlBarrier can have both a memory and a control barrier
which some hardware can handle with a single instruction. Let's
turn the scoped_memory_barrier into a scoped barrier which can embed
both barrier types. Note that control-only or memory-only barriers can
be supported through this new intrinsic by passing NIR_SCOPE_NONE to the
unused barrier type.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Suggested-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4900>
2020-06-03 07:39:52 +00:00
Timothy Arceri
04dbf709ed nir: add callback to nir_remove_dead_variables()
This allows us to do API specific checks before removing variable
without filling nir_remove_dead_variables() with API specific code.

In the following patches we will use this to support the removal
of dead uniforms in GLSL.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4797>
2020-06-03 02:22:23 +00:00
Dylan Baker
a8e2d79e02 meson: use gnu_symbol_visibility argument
This uses a meson builtin to handle -fvisibility=hidden. This is nice
because we don't need to track which languages are used, if C++ is
suddenly added meson just does the right thing.

Acked-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4740>
2020-06-01 18:59:18 +00:00
Jason Ekstrand
c48f42e178 intel/fs: Emit HALT for discard on Gen4-5
Using HALT to immediately jump to the end of the shader is required to
implement GL_EXT_gpu_shader4 and OpenGL 3.0.  However, vanilla OpenGL
1.2 doesn't forbid it and it likely makes something somewhere faster.
We should be consistent and implement the same discard behavior on all
hardware if we can.

The rules for HALT on Gen4-5 are a bit different from Gen6+.  On the
older hardware, there is no stack for HALT; instead it's up to software
to save and restore mask registers.  However, there's no real saving
needed since we only use HALT to jump to the end of the program where
we're about about to do our FB writes.  All we need to do is reset AMask
to DMask, the value it was initialized to at the start of the thread.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5244>
2020-05-30 06:21:15 +00:00
Jason Ekstrand
94aa7997e4 intel/fs: Fix unused texture coordinate zeroing on Gen4-5
We were inserting the right number of MOVs but, thanks to the way we
advanced msg_end earlier in the function, were often writing the zeros
past the end of where we actually read in the register file.

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5243>
2020-05-30 01:08:50 -05:00
Jason Ekstrand
a7c8811fe4 intel/vec4: Stomp the return type of RESINFO to UINT32
We already do this in the FS back-end; we just weren't doing it in vec4
so RESINFO messages weren't returning the right data.

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5243>
2020-05-30 01:08:50 -05:00
Caio Marcelo de Oliveira Filho
bccf2a25a8 intel: Add helper to calculate GPGPU_WALKER::RightExecutionMask
Suggested by Jason.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
2020-05-27 18:16:31 -07:00
Caio Marcelo de Oliveira Filho
90ec26a800 intel/fs: Generate multiple CS SIMD variants for variable group size
This will make the GL drivers pick the right SIMD variant for a given
group size set during dispatch.  The heuristic implemented in
brw_cs_simd_size_for_group_size() is the same as in brw_compile_cs().

The cs_prog_data::simd_size field was removed.  The generated SIMD
sizes are marked in a bitmask, which is already used via
brw_cs_simd_size_for_group_size() by the drivers.

When in variable group size, it is OK if larger SIMD shader spill,
since we'd need it for the cases where the smaller one can't hold all
the invocations.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
2020-05-27 18:16:31 -07:00
Caio Marcelo de Oliveira Filho
cb26d9c311 intel/fs: Add helper to get prog_offset and simd_size
This indirection will be used by the variable group size case in a
later change.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
2020-05-27 18:16:31 -07:00
Caio Marcelo de Oliveira Filho
5b5e77caa7 intel/fs: Support INTEL_DEBUG=no8,no32 in compute shaders
The "no32" flag will have precedence over "do32", like is done for FS.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
2020-05-27 18:16:31 -07:00
Caio Marcelo de Oliveira Filho
10d0f39beb intel/fs: Remove min_dispatch_width spilling decision from RA
Move the decision one level up, let brw_compile_*() functions use the
spilling information to decide whether or not a certain width
compilation can spill (passed via run_*() functions).

The min_dispatch_width was used to compare with the dispatch_width and
decide whether "a previous shader is already available, so don't
accept spill".

This is replaced by:

- Not calling run_*() functions if it is know beforehand a smaller width
  already spilled -- since the larger width will spill and fail;

- Explicitly passing whether or not a shader is allowed to spill.  For
  the cases where the smaller width is available and haven't spilled,
  the larger width will be compiled but is only useful if it won't
  spill.

Moving the decision to this level will be useful later for variable
group size, which is a case where we want all the widths to be allowed
to spill.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
2020-05-27 18:16:31 -07:00
Danylo Piliaiev
296c04d78c intel/fs: Work around dual-source blending hangs in combination with SIMD16
It was found that dual-source blending hangs with SIMD16 dispatch in some
specific but unknown situation. Which in the wild happen when rgba
anti-aliasing is enabled for fonts.

Cc: <mesa-stable@lists.freedesktop.org>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2183
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5037>
2020-05-27 14:35:13 +03:00
Alyssa Rosenzweig
f59d02a86d intel: Use SATURATE
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5100>
2020-05-26 22:31:31 +00:00
Caio Marcelo de Oliveira Filho
8cc7711924 intel/fs: Remove redundant assert()
This is covered by the two previous similar asserts.  Each time `v` is
assigned this is asserted.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
2020-05-26 20:35:03 +00:00
Caio Marcelo de Oliveira Filho
462bc408fe intel/fs: Early return when can't satisfy explicit group size
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
2020-05-26 20:35:03 +00:00
Caio Marcelo de Oliveira Filho
2a308ee4c7 intel/fs: Remove unused state from brw_nir_lower_cs_intrinsics
After 2663759af0 ("intel/fs: Add and use a new load_simd_width_intel
intrinsic") the local_workgroup_size is not used anymore except for
assertions at the pass' start, so drop it from state struct.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
2020-05-26 20:35:03 +00:00
Caio Marcelo de Oliveira Filho
5e0525e145 intel/fs: Remove unused emission of load_simd_with_intel
The nir_intrinsic_load_simd_width_intel is always lowered by the
brw_nir_lower_simd() pass before the emission happens.  This is likely
a "leftover" from patch rewriting/squashing that happened when this
intrinsic was added.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
2020-05-26 20:35:03 +00:00
Eric Engestrom
444138d6d9 tree-wide: fix deprecated GitLab URLs
They will stop working in the next GitLab release, so let's update them
ASAP to make sure things are propagated to everyone by then.

See:
https://about.gitlab.com/releases/2020/05/06/gitlab-com-13-0-breaking-changes/#removal-of-deprecated-project-paths

Cc: mesa-stable@lists.freedesktop.org
Signed-off-by: Eric Engestrom <eric@engestrom.ch>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5111>
2020-05-23 15:33:50 +00:00
Andrii Simiklit
d1b7462849 i965/vec4: Ignore swizzle of VGRF for use by var_range_end()
Issue description from Matt's commit e7c376ad:
 "var_range_end(v, n) loops over the n components of variable number v and
  finds the maximum value, giving the last use of any component of v.
  Therefore it expects v to correspond to the variable associated with the
  .x channel of the VGRF.

  var_from_reg() however returns the variable for the first channel of the
  VGRF, post-swizzle.

  So, if the last register had a swizzle with y, z, or w in the swizzle
  component, we would read out of bounds. For any other register, we would
  read liveness information from the next register.

  The fix is to convert the src_reg to a dst_reg in order to call the
  dst_reg version of var_from_reg() that doesn't consider the swizzle."

Closes: #3003
Fixes: 48dfb30f ('intel/compiler: Move all live interval analysis results into vec4_live_variables')
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andrii Simiklit <asimiklit.work@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4941>
2020-05-20 20:19:18 +00:00
Caio Marcelo de Oliveira Filho
6a6c36e977 intel/fs: Use writes_memory from shader_info
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4815>
2020-05-18 21:09:17 +00:00
Arcady Goldmints-Orlov
95fd950d35 intel/compiler: fix alignment assert in nir_emit_intrinsic
Fixes: c643979228 (intel/fs: Choose memory message type based on bit size)
Fixes: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_i8vec2

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5000>
2020-05-12 22:14:31 +00:00
Samuel Pitoiset
04718a9cd6 nir: do not vectorize load/store if offset can overflow and robustness enabled
This prevents vectorization for loads/stores that can overflow if
the low offset is negative and the range greater or equal than 0.

The caller can pass the list of variable modes that matter for
robust access.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4881>
2020-05-11 07:25:15 +00:00
Ian Romanick
f46eabf84e nir/algebraic: Split ibfe and ubfe with two constant sources
I also tried splitting ubfe instructions with one or zero constants,
and zero shaders in shader-db were affected.

The "lost" shader is a compute shader that was promoted from SIMD8 to
SIMD16, so is also counted as the gained shader.

v2: Further restrict bfe splitting.  bfe with multiple constants is
better on at least some Radeon GPUs.  Use -x instead of 32-x in shift
counts.

v3: Fix the outer shift count for ibfe lowering.  Add c=0 optimizations
to prevent bad lowering.  Both suggested by Rhys.  Add shift by -32
optimizations.

Tiger Lake
total instructions in shared programs: 17608764 -> 17596316 (-0.07%)
instructions in affected programs: 303765 -> 291317 (-4.10%)
helped: 113
HURT: 46
helped stats (abs) min: 1 max: 458 x̄: 120.67 x̃: 21
helped stats (rel) min: 0.09% max: 11.23% x̄: 3.47% x̃: 1.39%
HURT stats (abs)   min: 1 max: 201 x̄: 25.83 x̃: 6
HURT stats (rel)   min: 0.23% max: 5.18% x̄: 1.53% x̃: 1.11%
95% mean confidence interval for instructions value: -101.13 -55.45
95% mean confidence interval for instructions %-change: -2.61% -1.44%
Instructions are helped.

total cycles in shared programs: 338390770 -> 333530868 (-1.44%)
cycles in affected programs: 79438330 -> 74578428 (-6.12%)
helped: 112
HURT: 64
helped stats (abs) min: 2 max: 268955 x̄: 44261.93 x̃: 1452
helped stats (rel) min: <.01% max: 29.51% x̄: 4.72% x̃: 2.23%
HURT stats (abs)   min: 2 max: 17618 x̄: 1522.41 x̃: 84
HURT stats (rel)   min: <.01% max: 7.34% x̄: 1.35% x̃: 0.34%
95% mean confidence interval for cycles value: -37232.47 -17993.69
95% mean confidence interval for cycles %-change: -3.37% -1.65%
Cycles are helped.

total spills in shared programs: 8944 -> 8138 (-9.01%)
spills in affected programs: 3240 -> 2434 (-24.88%)
helped: 67
HURT: 0

total fills in shared programs: 9373 -> 7842 (-16.33%)
fills in affected programs: 4736 -> 3205 (-32.33%)
helped: 67
HURT: 0

LOST:   1
GAINED: 2

Ice Lake and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 16123288 -> 16116876 (-0.04%)
instructions in affected programs: 241155 -> 234743 (-2.66%)
helped: 126
HURT: 2
helped stats (abs) min: 1 max: 209 x̄: 50.90 x̃: 7
helped stats (rel) min: 0.07% max: 5.94% x̄: 1.76% x̃: 0.65%
HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel)   min: 0.05% max: 0.24% x̄: 0.15% x̃: 0.15%
95% mean confidence interval for instructions value: -61.29 -38.89
95% mean confidence interval for instructions %-change: -2.05% -1.42%
Instructions are helped.

total cycles in shared programs: 335419163 -> 330438819 (-1.48%)
cycles in affected programs: 77515502 -> 72535158 (-6.42%)
helped: 139
HURT: 37
helped stats (abs) min: 2 max: 269140 x̄: 36374.19 x̃: 597
helped stats (rel) min: <.01% max: 28.60% x̄: 3.67% x̃: 1.31%
HURT stats (abs)   min: 4 max: 17618 x̄: 2045.08 x̃: 174
HURT stats (rel)   min: 0.02% max: 8.32% x̄: 2.61% x̃: 0.62%
95% mean confidence interval for cycles value: -37799.30 -18795.51
95% mean confidence interval for cycles %-change: -3.13% -1.57%
Cycles are helped.

total spills in shared programs: 8065 -> 7306 (-9.41%)
spills in affected programs: 3153 -> 2394 (-24.07%)
helped: 67
HURT: 0

total fills in shared programs: 8710 -> 7412 (-14.90%)
fills in affected programs: 4466 -> 3168 (-29.06%)
helped: 67
HURT: 0

LOST:   1
GAINED: 1

Broadwell
total instructions in shared programs: 14970538 -> 14965967 (-0.03%)
instructions in affected programs: 227040 -> 222469 (-2.01%)
helped: 126
HURT: 2
helped stats (abs) min: 1 max: 136 x̄: 36.29 x̃: 8
helped stats (rel) min: 0.07% max: 6.02% x̄: 1.47% x̃: 0.89%
HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel)   min: 0.05% max: 0.24% x̄: 0.14% x̃: 0.14%
95% mean confidence interval for instructions value: -43.05 -28.37
95% mean confidence interval for instructions %-change: -1.69% -1.19%
Instructions are helped.

total cycles in shared programs: 336237662 -> 333035960 (-0.95%)
cycles in affected programs: 72066394 -> 68864692 (-4.44%)
helped: 134
HURT: 42
helped stats (abs) min: 4 max: 122640 x̄: 24344.54 x̃: 1833
helped stats (rel) min: <.01% max: 26.93% x̄: 4.02% x̃: 2.38%
HURT stats (abs)   min: 1 max: 17205 x̄: 1439.69 x̃: 92
HURT stats (rel)   min: <.01% max: 7.12% x̄: 1.34% x̃: 0.62%
95% mean confidence interval for cycles value: -23753.58 -12629.40
95% mean confidence interval for cycles %-change: -3.50% -1.98%
Cycles are helped.

total spills in shared programs: 21122 -> 20204 (-4.35%)
spills in affected programs: 3644 -> 2726 (-25.19%)
helped: 67
HURT: 0

total fills in shared programs: 24879 -> 23460 (-5.70%)
fills in affected programs: 4883 -> 3464 (-29.06%)
helped: 67
HURT: 0

Haswell
total instructions in shared programs: 13148269 -> 13145444 (-0.02%)
instructions in affected programs: 137046 -> 134221 (-2.06%)
helped: 97
HURT: 3
helped stats (abs) min: 1 max: 137 x̄: 30.58 x̃: 3
helped stats (rel) min: 0.14% max: 4.38% x̄: 1.38% x̃: 0.44%
HURT stats (abs)   min: 1 max: 70 x̄: 47.00 x̃: 70
HURT stats (rel)   min: 0.05% max: 5.82% x̄: 3.90% x̃: 5.82%
95% mean confidence interval for instructions value: -37.15 -19.35
95% mean confidence interval for instructions %-change: -1.56% -0.89%
Instructions are helped.

total cycles in shared programs: 321221834 -> 318333159 (-0.90%)
cycles in affected programs: 54932349 -> 52043674 (-5.26%)
helped: 95
HURT: 53
helped stats (abs) min: 4 max: 123390 x̄: 30648.39 x̃: 702
helped stats (rel) min: <.01% max: 28.87% x̄: 4.27% x̃: 2.87%
HURT stats (abs)   min: 4 max: 2357 x̄: 432.49 x̃: 113
HURT stats (rel)   min: <.01% max: 3.44% x̄: 1.03% x̃: 0.54%
95% mean confidence interval for cycles value: -26154.16 -12881.99
95% mean confidence interval for cycles %-change: -3.20% -1.55%
Cycles are helped.

total spills in shared programs: 19878 -> 19293 (-2.94%)
spills in affected programs: 3020 -> 2435 (-19.37%)
helped: 41
HURT: 2

total fills in shared programs: 20918 -> 19875 (-4.99%)
fills in affected programs: 3968 -> 2925 (-26.29%)
helped: 41
HURT: 2

LOST:   0
GAINED: 1

Ivy Bridge
total instructions in shared programs: 11875585 -> 11873641 (-0.02%)
instructions in affected programs: 78065 -> 76121 (-2.49%)
helped: 27
HURT: 0
helped stats (abs) min: 8 max: 134 x̄: 72.00 x̃: 72
helped stats (rel) min: 0.36% max: 4.23% x̄: 2.42% x̃: 2.42%
95% mean confidence interval for instructions value: -83.68 -60.32
95% mean confidence interval for instructions %-change: -2.78% -2.07%
Instructions are helped.

total cycles in shared programs: 178232734 -> 175769085 (-1.38%)
cycles in affected programs: 50018707 -> 47555058 (-4.93%)
helped: 27
HURT: 0
helped stats (abs) min: 82035 max: 99953 x̄: 91246.26 x̃: 92278
helped stats (rel) min: 4.40% max: 5.69% x̄: 4.93% x̃: 4.95%
95% mean confidence interval for cycles value: -93674.20 -88818.32
95% mean confidence interval for cycles %-change: -5.09% -4.78%
Cycles are helped.

total spills in shared programs: 4182 -> 3739 (-10.59%)
spills in affected programs: 1089 -> 646 (-40.68%)
helped: 27
HURT: 0

total fills in shared programs: 5216 -> 4345 (-16.70%)
fills in affected programs: 1874 -> 1003 (-46.48%)
helped: 27
HURT: 0

No changes on any earlier Intel platforms.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4156>
2020-05-07 10:55:50 -07:00
Caio Marcelo de Oliveira Filho
e645bc6939 intel: Let drivers call brw_nir_lower_cs_intrinsics()
The motivating factor is: this lowering may cause
nir_intrinsic_load_local_group_size intrinsics to be added to the
shader, and by moving this around we make possible for the drivers to
lower that intrinsic by themselves.

Iris will do just that in a later patch for implementing variable
group size.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
2020-05-01 12:50:37 -07:00
Caio Marcelo de Oliveira Filho
2663759af0 intel/fs: Add and use a new load_simd_width_intel intrinsic
Intrinsic to get the SIMD width, which not always the same as subgroup
size.  Starting with a small scope (Intel), but we can rename it later
to generalize if this turns out useful for other drivers.

Change brw_nir_lower_cs_intrinsics() to use this intrinsic instead of
a width will be passed as argument.  The pass also used to optimized
load_subgroup_id for the case that the workgroup fitted into a single
thread (it will be constant zero).  This optimization moved together
with lowering of the SIMD.

This is a preparation for letting the drivers call it before the
brw_compile_cs() step.

No shader-db changes in BDW, SKL, ICL and TGL.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
2020-05-01 12:50:37 -07:00
Caio Marcelo de Oliveira Filho
4b000b491a intel/fs: Add an option to lower variable group size in backend
Adding this since Iris will handle variable group size parameters by
itself.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
2020-05-01 12:50:28 -07:00
Caio Marcelo de Oliveira Filho
0edb58a84e intel/fs: Clean up variable group size handling in backend
Just use the information from NIR shader_info.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
2020-05-01 12:50:28 -07:00
D Scott Phillips
65b05ebdda anv,iris: Fix input vertex max for tcs on gen12
gen12 does away with the single patch dispatch mode for tcs, and
increases some limits so that 8_patch mode can always work. Make the
necessary changes so we don't try to fall back to single patch mode.

Fixes KHR-GL46.tessellation_shader.single.max_patch_vertices and others

Fixes: 44754279ac ("intel/fs/gen12: Use TCS 8_PATCH mode.")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4843>
2020-05-01 16:49:11 +00:00
D Scott Phillips
7bd15135a6 intel/fs: Update location of Render Target Array Index for gen12
Render Target Array Index has moved from R0.0[26:16] to
R1.1[26:16] on gen12.

Fixes dEQP-VK.multiview.input_attachments.*

Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4836>
2020-05-01 08:48:22 -07:00
Jason Ekstrand
4985e380dd intel/eu: Use non-coherent mode (BTI=253) for stateless A64 messages
We don't care about full IA coherency since we always have the
opportunity in GL or Vulkan to flush the data cache.  Using IA-coherent
mode is likely just making A64 access slower than it needs to be.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4819>
2020-04-30 14:45:50 +00:00
Francisco Jerez
0842758ec0 intel/ir: Update performance analysis parameters for memory fence codegen changes.
The SFID field of the SHADER_OPCODE_MEMORY_FENCE and
SHADER_OPCODE_INTERLOCK instructions now indicates the target function
of the memory fence.  Account the cycle-count cost to the right shared
unit.

Fixes: f858fa26b4 ("intel/fs,vec4: Pull stall logic for memory fences up into the IR")
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4817>
2020-04-29 23:40:36 +00:00
Jason Ekstrand
e581ddeeee intel/fs: Don't delete coalesced MOVs if they have a cmod
Shader-db results on ICL:

    total instructions in shared programs: 17133088 -> 17133287 (<.01%)
    instructions in affected programs: 61300 -> 61499 (0.32%)
    helped: 0
    HURT: 199

This means it's likely fixing 199 bugs. :-)  All the changed shaders are
in Mad Max.  It's surprisingly difficult to get the back-end compiler to
generate a pattern that hits this we don't tend to emit a lot coalescable
MOVs.  The pattern in Mad Max that's able to hit is fsign(fsat(x)) under
the right conditions.

Closes: #2820
Cc: mesa-stable@lists.freedesktop.org
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4773>
2020-04-29 16:45:51 +00:00
Caio Marcelo de Oliveira Filho
a3cba3c771 intel/fs: Only stall after sending all memory fence messages
In Gen11+, when emitting a fence for both L3 and SLM, the generated
code would look like

    SEND, MOV (for stall), SEND, MOV (for stall)

This commit change that so two SENDs are emitted before the MOVs for
stall.  This is similar to the approach used in Ivy Bridge for the
render fence.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
2020-04-29 07:17:27 +00:00
Caio Marcelo de Oliveira Filho
f858fa26b4 intel/fs,vec4: Pull stall logic for memory fences up into the IR
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.

For IvyBridge, every (data cache) fence is accompained by a render
cache fence, that now is explicit in the IR, two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).

Because Begin and End interlock intrinsics are effectively memory
barriers, move its handling alongside the other memory barrier
intrinsics.  The SHADER_OPCODE_INTERLOCK is still used to distinguish
if we are going to use a SENDC (for Begin) or regular SEND (for End).

This change is a preparation to allow emitting both SENDs in Gen11+
before we can stall on them.

Shader-db results for IVB (i965):

    total instructions in shared programs: 11971190 -> 11971200 (<.01%)
    instructions in affected programs: 11482 -> 11492 (0.09%)
    helped: 0
    HURT: 8
    HURT stats (abs)   min: 1 max: 3 x̄: 1.25 x̃: 1
    HURT stats (rel)   min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
    95% mean confidence interval for instructions value: 0.66 1.84
    95% mean confidence interval for instructions %-change: 0.01% 0.27%
    Instructions are HURT.

  Unlike the previous code, that used the `mov g1 g2` trick to force
  both `g1` and `g2` to stall, the scheduling fence will generate `mov
  null g1` and `mov null g2`.  During review it was decided it was not
  worth keeping the special codepath for the small effect will have.

Shader-db results for HSW (i965), BDW and SKL don't have a change
on instruction count, but do report changes in cycles count, showing
SKL results below

    total cycles in shared programs: 341738444 -> 341710570 (<.01%)
    cycles in affected programs: 7240002 -> 7212128 (-0.38%)
    helped: 46
    HURT: 5
    helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
    helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
    HURT stats (abs)   min: 2 max: 1768 x̄: 646.40 x̃: 362
    HURT stats (rel)   min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
    95% mean confidence interval for cycles value: -777.71 -315.38
    95% mean confidence interval for cycles %-change: -1.42% -0.83%
    Cycles are helped.

  This seems to be the effect of allocating two registers separatedly
  instead of a single one with size 2, which causes different register
  allocation, affecting the cycle estimates.

while ICL also has not change on instruction count but report changes
negative changes in cycles

    total cycles in shared programs: 352665369 -> 352707484 (0.01%)
    cycles in affected programs: 9608288 -> 9650403 (0.44%)
    helped: 4
    HURT: 104
    helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
    helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
    HURT stats (abs)   min: 2 max: 2016 x̄: 408.36 x̃: 48
    HURT stats (rel)   min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
    95% mean confidence interval for cycles value: 256.67 523.24
    95% mean confidence interval for cycles %-change: 0.63% 1.03%
    Cycles are HURT.

  AFAICT this is the result of the case above.

Shader-db results for TGL have similar cycles result as ICL, but also
affect instructions

    total instructions in shared programs: 17690586 -> 17690597 (<.01%)
    instructions in affected programs: 64617 -> 64628 (0.02%)
    helped: 55
    HURT: 32
    helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
    helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
    HURT stats (abs)   min: 1 max: 65 x̄: 7.44 x̃: 2
    HURT stats (rel)   min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
    95% mean confidence interval for instructions value: -2.03 2.28
    95% mean confidence interval for instructions %-change: -0.41% 0.15%
    Inconclusive result (value mean confidence interval includes 0).

  Now that more is done in the IR, more dependencies are visible and
  more SWSB annotations are emitted.  Mixed with different register
  allocation decisions like above, some shaders will see more `sync
  nops` while others able to avoid them.

  Most of the new `sync nops` are also redundant and could be dropped,
  which will be fixed in a separate change.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
2020-04-29 07:17:27 +00:00
Caio Marcelo de Oliveira Filho
0e96b0d6dd intel/fs: Allow FS_OPCODE_SCHEDULING_FENCE stall on registers
It will generate the MOVs (or SYNC_NOP in Gen12+) needed for stall.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
2020-04-29 07:17:27 +00:00
Francisco Jerez
5e2a7e11b4 intel/ir: Remove scheduling-based cycle count estimates.
The cycle count estimation logic part of the scheduler is now
redundant with the shader performance modeling pass, and the estimates
can be consolidated into the brw::performance analysis result object
instead of being part of the CFG, which guarantees that the estimates
cannot be accessed without previously calling the
performance_analysis::require() method, which makes sure that the
right analysis pass is executed at the right time if we don't already
have up-to-date cached results.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
486f3b04a5 intel/ir: Pass block cycle count information explicitly to disassembler.
So we can eventually remove the cycle count estimates from the CFG
data structure and consolidate performance information in the
brw::performance object.

It would be cleaner to pass the brw::performance object directly to
the disassembler but that isn't straightforward since the disassembler
is built as a plain C file unlike the rest of the compiler back-end.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
6579f562c3 intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats.
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL.  In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
65342be3ae intel/fs: Add INTEL_DEBUG=no32 debugging flag.
This is useful in order to identify codegen issues caused by SIMD32.
It doesn't currently have any effect on compute shaders since SIMD32
dispatch is only enabled for CS when it's strictly necessary to do so
in order to support the workgroup size requested for the shader --
That might change in the future though when we hook up the SIMD32
heuristic to CS compilation.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
14f0a5cf64 intel/fs: Implement performance analysis-based SIMD32 heuristic for fragment shaders.
The heuristic enables the SIMD32 fragment shader based on whether the
IR performance modeling pass predicts it to have greater throughput
than the SIMD16 and SIMD8 variants of the same shader.  It would be
straightforward to do the same thing in order to control whether
SIMD16 dispatch is enabled, but it's pending additional performance
evaluation.

The INTEL_DEBUG=do32 option is left around in order to force the
SIMD32 shader to be used regardless of the result of the heuristic,
since it's useful as a debugging aid e.g. in order to identify
SIMD32-specific codegen issues which may be masked by the SIMD32
heuristic, or cases where the heuristic is incorrectly disabling
SIMD32 shaders that offer a performance advantage.

Currently this is only enabled on Gen6+, since SIMD32 codegen support
is incomplete on earlier platforms.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
d6aa0c261f intel/fs: Heap-allocate fs_visitors in brw_compile_fs().
This makes brw_compile_fs() look a bit more similar to
brw_compile_cs().  It saves us three v*_shader_stats local variables,
and will save us additional triplicated declarations as we start
tracking IR performance analysis results.

The triplicated cfg pointers are left around because they're set to
NULL to mark specific dispatch modes as disabled (e.g. in order to
enforce hardware restrictions).  Doing the same thing with the visitor
pointers would cause data leaks.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
188a3659ae intel/ir: Import shader performance analysis pass.
This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling.  It has instruction
performance information more comprehensive than the current scheduling
pass for all platforms between Gen4-11, and works on both the FS and
VEC4 back-end.

The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts.  In addition this will
allow the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) to be visible in
shader-db statistics.

But that isn't the end of the story, other potential applications of
this pass (not part of this MR) I've been playing around with are:

 - Implement a similar SIMD16 heuristic allowing the identification of
   inefficient SIMD16 fragment shaders.

 - Implement similar SIMD16 and SIMD32 heuristics for the compute
   shader stage -- Currently compute shader builds always use the
   SIMD16 shader if available and never use the SIMD32 shader unless
   strictly necessary, which is suboptimal under certain conditions.

 - Hook up to the instruction scheduler in order to improve the
   accuracy of its timing information.

 - Use as heuristic in order to drive the selection of scheduling
   modes (Matt was experimenting with that).

 - Plug to the TGL software scoreboard pass in order to implement a
   more effective SBID token allocation algorithm, since in general
   the optimal token allocation depends on the timings of all
   instructions in the program.

 - Use its bottleneck detection functionality in order to implement a
   heuristic computing a more optimal bound for the number of fragment
   shader threads executed in parallel (by adjusting the
   MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).

As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12 like SWSB
handling is already included in this patch, but there were some IP
concerns regarding the TGL timing parameters since they cannot
currently be obtained with the documentation and hardware which is
publicly available.  The timing parameters for any previous Gen7-11
platforms can be obtained by anyone by sampling the timestamp register
using e.g. shader_time, though I have some more convenient
instrumentation coming up.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:03 -07:00
Francisco Jerez
c8ce1cfc9c intel/vec4: Fix constness of vec4_instruction::reads_flag() and ::writes_flag().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
bda1d72dd9 intel/fs: Replace fs_visitor::bank_conflict_cycles() with stand-alone function.
This will be re-usable by the IR performance analysis pass.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
d2ed740795 intel/fs: Fix constness of argument of fs_instruction_scheduler::is_compressed().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
6310a05f68 intel/fs: Rename half() helpers to quarter(), allow index up to 3.
Makes more sense considering SIMD32.  Relaxing the assertion in
brw_ir_fs.h will be required in order to avoid assertion failures on
SNB with SIMD32 fragment shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
bdad7f429a intel/ir: Add missing initialization of backend_reg::offset during construction.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00