Commit graph

292 commits

Author SHA1 Message Date
Francisco Jerez
6579f562c3 intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats.
These should be more accurate than the current cycle counts, since
among other things they consider the effect of post-scheduling passes
like the software scoreboard on TGL.  In addition it will enable us to
clean up some of the now redundant cycle-count estimation
functionality in the instruction scheduler.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
65342be3ae intel/fs: Add INTEL_DEBUG=no32 debugging flag.
This is useful in order to identify codegen issues caused by SIMD32.
It doesn't currently have any effect on compute shaders since SIMD32
dispatch is only enabled for CS when it's strictly necessary to do so
in order to support the workgroup size requested for the shader --
That might change in the future though when we hook up the SIMD32
heuristic to CS compilation.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
14f0a5cf64 intel/fs: Implement performance analysis-based SIMD32 heuristic for fragment shaders.
The heuristic enables the SIMD32 fragment shader based on whether the
IR performance modeling pass predicts it to have greater throughput
than the SIMD16 and SIMD8 variants of the same shader.  It would be
straightforward to do the same thing in order to control whether
SIMD16 dispatch is enabled, but it's pending additional performance
evaluation.

The INTEL_DEBUG=do32 option is left around in order to force the
SIMD32 shader to be used regardless of the result of the heuristic,
since it's useful as a debugging aid e.g. in order to identify
SIMD32-specific codegen issues which may be masked by the SIMD32
heuristic, or cases where the heuristic is incorrectly disabling
SIMD32 shaders that offer a performance advantage.

Currently this is only enabled on Gen6+, since SIMD32 codegen support
is incomplete on earlier platforms.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
d6aa0c261f intel/fs: Heap-allocate fs_visitors in brw_compile_fs().
This makes brw_compile_fs() look a bit more similar to
brw_compile_cs().  It saves us three v*_shader_stats local variables,
and will save us additional triplicated declarations as we start
tracking IR performance analysis results.

The triplicated cfg pointers are left around because they're set to
NULL to mark specific dispatch modes as disabled (e.g. in order to
enforce hardware restrictions).  Doing the same thing with the visitor
pointers would cause data leaks.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:01:27 -07:00
Francisco Jerez
6310a05f68 intel/fs: Rename half() helpers to quarter(), allow index up to 3.
Makes more sense considering SIMD32.  Relaxing the assertion in
brw_ir_fs.h will be required in order to avoid assertion failures on
SNB with SIMD32 fragment shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
e549e4f6c0 intel/fs/gen12: Fix Render Target Read header setup for new thread payload layout.
In Gen12 the Poly 0 Info DWORD containing the Viewport Index and
Render Target Index fields were moved from r0.0 to r1.1 in order to
make room for dual-polygon dispatch.  The render target message format
was updated to expect that information in the same location, so we
didn't need to make any changes for framebuffer fetch to work with
SIMD8 and SIMD16 dispatch.  Unfortunately that won't work with SIMD32,
since the render target message header is assembled from r0 and r2
instead of r1, and the r2 thread payload wasn't updated with an
additional copy of the same information.  We need to fix things up
manually instead.  This avoids a handful of
EXT_shader_framebuffer_fetch regressions in combination with SIMD32
fragment shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:29 -07:00
Francisco Jerez
d6ae079771 intel/fs/gen12: Fix hangs with per-sample SIMD32 fragment shader dispatch.
The Gen12 docs are rather contradictory regarding the dispatch
configurations supported by the fragment shader -- The same table
present in previous generations seems to imply that only one dispatch
mode can be enabled when doing per-sample shading, but a restriction
documented in the 3DSTATE_PS_BODY page implies the opposite: That
SIMD32 can only be used in combination with some other dispatch mode.

The latter seems to match the behavior of real hardware as I could
tell from my testing: A bunch of multisample test-cases that do
per-sample shading hang if we only provide a SIMD32 shader.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-04-28 23:00:28 -07:00
Dylan Baker
f8e4542bad replace _mesa_logbase2 with util_logbase2
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
2020-04-21 11:09:03 -07:00
Jason Ekstrand
d0d039a4d3 anv: Emit pushed UBO bounds checking code in the back-end compiler
This commit fixes performance regressions introduced by e03f965280
in which we started bounds checking our push constants.  This added a
LOT of shader code to shaders which use the robustBufferAccess feature
and led to substantial spilling.  The checking we just added to the FS
back-end is far more efficient for two reasons:

 1. It can be done at a whole register granularity rather than per-
    scalar and so we emit one SIMD8 SEL per 32B GRF rather than one
    SIMD16 SEL (executed as two SELs) for each component loaded.

 2. Because we do it with NoMask instructions, we can do it on whole
    pushed GRFs without splatting them out to SIMD8 or SIME16 values.
    This means that robust buffer access no longer explodes our register
    pressure for no good reason.

As a tiny side-benefit, we're now using can use AND instead of SEL which
means no need for the flag and better scheduling.

Vulkan pipeline database results on ICL:

    Instructions in all programs: 293586059 -> 238009118 (-18.9%)
    SENDs in all programs: 13568515 -> 13568515 (+0.0%)
    Loops in all programs: 149720 -> 149720 (+0.0%)
    Cycles in all programs: 88499234498 -> 84348917496 (-4.7%)
    Spills in all programs: 1229018 -> 184339 (-85.0%)
    Fills in all programs: 1348397 -> 246061 (-81.8%)

This also improves the performance of a few apps:

 - Shadow of the Tomb Raider: +4%
 - Witcher 3: +3.5%
 - UE4 Shooter demo: +2%

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>
2020-04-17 14:48:06 +00:00
Caio Marcelo de Oliveira Filho
db74ad0696 intel/compiler: Remove cs_prog_data->threads
At this point all drivers are doing this math on their own -- since
most of them need to cover the variable group size case, in which at
compile time the group size (and number of threads) is not defined.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:20 -07:00
Plamena Manolova
c77dc51203 intel/compiler: Add support for variable workgroup size
Add new builtin parameters that are used to keep track of the group
size.  This will be used to implement ARB_compute_variable_group_size.

The compiler will use the maximum group size supported to pick a
suitable SIMD variant.  A later improvement will be to keep all SIMD
variants (like FS) so the driver can select the best one at dispatch
time.

When variable workgroup size is used, the small workgroup optimization
is disabled as it we can't prove at compile time that the barriers
won't be needed.

Extracted from original i965 patch with additional changes by
Caio Marcelo de Oliveira Filho.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:12 -07:00
Caio Marcelo de Oliveira Filho
c54fc0d07b intel/compiler: Replace cs_prog_data->push.total with a helper
The push.total field had three values but only one was directly
used (size).  Replace it with a helper function that explicitly takes
the cs_prog_data and the number of threads -- and use that in the
drivers.

This is a preparation for ARB_compute_variable_group_size where the
number of threads (hence the total size for push constants) is not
defined at compile time (not cs_prog_data->threads).

Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
2020-04-09 19:23:12 -07:00
Caio Marcelo de Oliveira Filho
395de69b1f intel/fs: Allow multiple slots for position
Change brw_compute_vue_map() to also take the number of pos slots.  If
more than one slot is used, the VARYING_SLOT_POS is treated as an
array.

When using Primitive Replication, instead of a single position, the
VUE must contain an array of positions.  Padding might be
necessary (after clip distance) to ensure rest of attributes start
aligned.

v2: Add note about array in the commit message and assert that
    pos_slots >= 1 to make clear 0 is invalid. (Jason)
    Move padding to be after the clip distance.

v3: Apply the correct offset when gathering the sources from outputs.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> [v2]
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2313>
2020-04-07 17:16:09 +00:00
Juan A. Suarez Romero
460de2159e intel/compiler: store the FS inputs in WM prog data
Store the fragment shader inputs in the program data so we can use them
later when required without needing the NIR shader.

Cc: mesa-stable@lists.freedesktop.org
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2010>
2020-04-01 23:36:28 +00:00
Ian Romanick
ba88e95187 intel/fs: Fix NULL destinations on 3-source instructions again after late DCE
We considered moving this down near the call to
insert_gen4_send_dependency_workarounds.  By that point it's too late
for a couple reasons.  One, we're potentially increasing resiter
pressure that may lead to anoter spill.  Two, fixup_3src_null_dest tries
to allocate a VGRF, but the post-register allocation shader uses
physical registers.

Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2621
Fixes: ba2fa1ceaf ("intel/fs: Do cmod prop again after scheduling")
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4155>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4155>
2020-03-12 08:22:43 -07:00
Mathias Fröhlich
630154e77b i965: Move down genX_upload_sbe in profiles.
Avoid looping over all VARYING_SLOT_MAX urb_setup array
entries from genX_upload_sbe. Prepare an array indirection
to the active entries of urb_setup already in the compile
step. On upload only walk the active arrays.

v2: Use uint8_t to store the attribute numbers.
v3: Change loop to build up the array indirection.
v4: Rebase.
v5: Style fix.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/308>
2020-03-10 14:28:36 +00:00
Ian Romanick
ba2fa1ceaf intel/fs: Do cmod prop again after scheduling
Pre-RA scheduling can create more opportunities for CMOD propagation.
This takes advantage of that.

It may be worth doing this again in post-RA scheduling, but there are
additional problems there.

I'm a little torn about the use of the OPT() macro.  On the one hand, it
would be confusing to see dumps from INTEL_DEBUG=optimizer that don't
match the final output.  On the other hand, since register allocation
can fail, the same pass can be run multiple times.  Each time one or
both passes might or might not make progress.  This would also lead to
incongruous, confusing output.

Ice Lake
total instructions in shared programs: 14549808 -> 14548529 (<.01%)
instructions in affected programs: 231985 -> 230706 (-0.55%)
helped: 632
HURT: 0
helped stats (abs) min: 1 max: 32 x̄: 2.02 x̃: 1
helped stats (rel) min: 0.05% max: 2.56% x̄: 0.57% x̃: 0.41%
95% mean confidence interval for instructions value: -2.25 -1.79
95% mean confidence interval for instructions %-change: -0.61% -0.54%
Instructions are helped.

total cycles in shared programs: 203770850 -> 203776599 (<.01%)
cycles in affected programs: 2495653 -> 2501402 (0.23%)
helped: 282
HURT: 197
helped stats (abs) min: 1 max: 242 x̄: 20.37 x̃: 16
helped stats (rel) min: <.01% max: 11.65% x̄: 0.91% x̃: 0.64%
HURT stats (abs)   min: 2 max: 609 x̄: 58.35 x̃: 20
HURT stats (rel)   min: <.01% max: 10.97% x̄: 1.35% x̃: 0.66%
95% mean confidence interval for cycles value: 5.27 18.73
95% mean confidence interval for cycles %-change: -0.16% 0.21%
Inconclusive result (%-change mean confidence interval includes 0).

LOST:   0
GAINED: 2

Skylake
total instructions in shared programs: 13447708 -> 13446594 (<.01%)
instructions in affected programs: 216813 -> 215699 (-0.51%)
helped: 623
HURT: 0
helped stats (abs) min: 1 max: 32 x̄: 1.79 x̃: 1
helped stats (rel) min: 0.06% max: 2.86% x̄: 0.59% x̃: 0.42%
95% mean confidence interval for instructions value: -1.99 -1.59
95% mean confidence interval for instructions %-change: -0.63% -0.55%
Instructions are helped.

total cycles in shared programs: 193759224 -> 193762726 (<.01%)
cycles in affected programs: 2540035 -> 2543537 (0.14%)
helped: 249
HURT: 190
helped stats (abs) min: 2 max: 196 x̄: 16.67 x̃: 14
helped stats (rel) min: <.01% max: 4.71% x̄: 0.66% x̃: 0.62%
HURT stats (abs)   min: 2 max: 614 x̄: 40.27 x̃: 14
HURT stats (rel)   min: 0.02% max: 5.78% x̄: 0.86% x̃: 0.37%
95% mean confidence interval for cycles value: 2.57 13.39
95% mean confidence interval for cycles %-change: -0.11% 0.11%
Inconclusive result (%-change mean confidence interval includes 0).

LOST:   0
GAINED: 1

Broadwell
total instructions in shared programs: 13418631 -> 13417393 (<.01%)
instructions in affected programs: 243192 -> 241954 (-0.51%)
helped: 694
HURT: 0
helped stats (abs) min: 1 max: 31 x̄: 1.78 x̃: 1
helped stats (rel) min: 0.06% max: 2.86% x̄: 0.59% x̃: 0.44%
95% mean confidence interval for instructions value: -1.95 -1.62
95% mean confidence interval for instructions %-change: -0.62% -0.55%
Instructions are helped.

total cycles in shared programs: 200822940 -> 200829128 (<.01%)
cycles in affected programs: 2128651 -> 2134839 (0.29%)
helped: 251
HURT: 226
helped stats (abs) min: 1 max: 200 x̄: 14.32 x̃: 12
helped stats (rel) min: <.01% max: 3.56% x̄: 0.60% x̃: 0.50%
HURT stats (abs)   min: 2 max: 611 x̄: 43.28 x̃: 18
HURT stats (rel)   min: 0.02% max: 7.03% x̄: 0.93% x̃: 0.54%
95% mean confidence interval for cycles value: 7.44 18.50
95% mean confidence interval for cycles %-change: 0.02% 0.23%
Cycles are HURT.

Haswell and Ivy Bridge had similar results. (Haswell shown)
total instructions in shared programs: 11569710 -> 11568829 (<.01%)
instructions in affected programs: 147862 -> 146981 (-0.60%)
helped: 487
HURT: 0
helped stats (abs) min: 1 max: 34 x̄: 1.81 x̃: 1
helped stats (rel) min: 0.12% max: 4.75% x̄: 0.57% x̃: 0.45%
95% mean confidence interval for instructions value: -2.03 -1.59
95% mean confidence interval for instructions %-change: -0.61% -0.54%
Instructions are helped.

total cycles in shared programs: 187079425 -> 187079437 (<.01%)
cycles in affected programs: 1088494 -> 1088506 (<.01%)
helped: 234
HURT: 124
helped stats (abs) min: 2 max: 282 x̄: 22.66 x̃: 16
helped stats (rel) min: 0.03% max: 7.88% x̄: 0.93% x̃: 0.75%
HURT stats (abs)   min: 1 max: 276 x̄: 42.86 x̃: 20
HURT stats (rel)   min: 0.03% max: 6.70% x̄: 0.99% x̃: 0.53%
95% mean confidence interval for cycles value: -5.54 5.61
95% mean confidence interval for cycles %-change: -0.41% -0.11%
Inconclusive result (value mean confidence interval includes 0).

total spills in shared programs: 7746 -> 7740 (-0.08%)
spills in affected programs: 6 -> 0
helped: 1
HURT: 0

total fills in shared programs: 6264 -> 6258 (-0.10%)
fills in affected programs: 6 -> 0
helped: 1
HURT: 0

Sandy Bridge
total instructions in shared programs: 10688576 -> 10688177 (<.01%)
instructions in affected programs: 137875 -> 137476 (-0.29%)
helped: 358
HURT: 0
helped stats (abs) min: 1 max: 9 x̄: 1.11 x̃: 1
helped stats (rel) min: 0.15% max: 1.43% x̄: 0.35% x̃: 0.28%
95% mean confidence interval for instructions value: -1.18 -1.05
95% mean confidence interval for instructions %-change: -0.37% -0.32%
Instructions are helped.

total cycles in shared programs: 153397144 -> 153393046 (<.01%)
cycles in affected programs: 1220713 -> 1216615 (-0.34%)
helped: 255
HURT: 31
helped stats (abs) min: 1 max: 304 x̄: 16.71 x̃: 16
helped stats (rel) min: <.01% max: 6.70% x̄: 0.41% x̃: 0.31%
HURT stats (abs)   min: 1 max: 41 x̄: 5.29 x̃: 3
HURT stats (rel)   min: 0.02% max: 0.65% x̄: 0.16% x̃: 0.11%
95% mean confidence interval for cycles value: -17.44 -11.22
95% mean confidence interval for cycles %-change: -0.40% -0.29%
Cycles are helped.

Iron Lake
total instructions in shared programs: 8106894 -> 8105529 (-0.02%)
instructions in affected programs: 287197 -> 285832 (-0.48%)
helped: 1099
HURT: 0
helped stats (abs) min: 1 max: 10 x̄: 1.24 x̃: 1
helped stats (rel) min: 0.16% max: 4.55% x̄: 0.67% x̃: 0.61%
95% mean confidence interval for instructions value: -1.29 -1.19
95% mean confidence interval for instructions %-change: -0.70% -0.64%
Instructions are helped.

total cycles in shared programs: 188347022 -> 188344266 (<.01%)
cycles in affected programs: 3740632 -> 3737876 (-0.07%)
helped: 758
HURT: 10
helped stats (abs) min: 2 max: 38 x̄: 3.68 x̃: 2
helped stats (rel) min: <.01% max: 1.00% x̄: 0.12% x̃: 0.08%
HURT stats (abs)   min: 2 max: 4 x̄: 3.20 x̃: 4
HURT stats (rel)   min: 0.03% max: 0.07% x̄: 0.06% x̃: 0.07%
95% mean confidence interval for cycles value: -3.82 -3.35
95% mean confidence interval for cycles %-change: -0.13% -0.11%
Cycles are helped.

GM45
total instructions in shared programs: 4985449 -> 4984768 (-0.01%)
instructions in affected programs: 145154 -> 144473 (-0.47%)
helped: 547
HURT: 0
helped stats (abs) min: 1 max: 10 x̄: 1.24 x̃: 1
helped stats (rel) min: 0.16% max: 2.86% x̄: 0.66% x̃: 0.61%
95% mean confidence interval for instructions value: -1.31 -1.18
95% mean confidence interval for instructions %-change: -0.69% -0.62%
Instructions are helped.

total cycles in shared programs: 128835062 -> 128833144 (<.01%)
cycles in affected programs: 2720650 -> 2718732 (-0.07%)
helped: 517
HURT: 1
helped stats (abs) min: 2 max: 38 x̄: 3.71 x̃: 2
helped stats (rel) min: <.01% max: 0.89% x̄: 0.11% x̃: 0.07%
HURT stats (abs)   min: 2 max: 2 x̄: 2.00 x̃: 2
HURT stats (rel)   min: 0.04% max: 0.04% x̄: 0.04% x̃: 0.04%
95% mean confidence interval for cycles value: -4.02 -3.39
95% mean confidence interval for cycles %-change: -0.12% -0.10%
Cycles are helped.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3965>
2020-03-09 16:46:19 -07:00
Matt Turner
bb3e7b0fe3 intel/compiler: Pass shader_stats for each SIMD mode
Passing shader_stats to the fs_generator constructor means that the
SIMD8 shader stats from the visitor (such as the scheduler mode) will be
reported out for the SIMD16/SIMD32 versions as well.

As you can see, we are now passing 'shader_stats' and 'stats' to
generate_code(), which is obviously odd looking. Ian rebased and
committed an old patch of mine which added the shader_stats struct on
July 30 in commit dabb5d4bee (i965/fs: Add a shader_stats struct.) and
shortly after on August 12 Jason added the brw_compile_stats struct in
commit 134607760a (intel/compiler: Fill a compiler statistics struct).

I'd like to combine the two, but I'm not sure how. shader_stats is an
input to generate_code() while brw_compile_stats is an output and is
only used by the Vulkan driver. Leave it as is for now...

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>
2020-03-09 04:44:12 +00:00
Matt Turner
75a33e268e intel/compiler: Mark some methods and parameters const
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>
2020-03-09 04:44:11 +00:00
Francisco Jerez
70349a2252 intel/compiler: Calculate num_instructions in O(1) during register pressure calculation
And mark the variable declaration as const.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:21:13 -08:00
Francisco Jerez
e5e4d016b9 intel/compiler: Move register pressure calculation into IR analysis object
This defines a new BRW_ANALYSIS object which wraps the register
pressure computation code along with its result.  For the rationale
see the previous commits converting the liveness and dominance
analysis passes to the IR analysis framework.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:21:10 -08:00
Francisco Jerez
ea44de6d8c intel/compiler/fs: Switch liveness analysis to IR analysis framework
This involves wrapping fs_live_variables in a BRW_ANALYSIS object and
hooking it up to invalidate_analysis() so it's properly invalidated.
Seems like a lot of churn but it's fairly straightforward.  The
fs_visitor invalidate_ and calculate_live_intervals() methods are no
longer necessary after this change.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:57 -08:00
Francisco Jerez
ba73e606f6 intel/compiler: Move all live interval analysis results into fs_live_variables
This moves the following methods that are currently defined in
fs_visitor (even though they are side products of the liveness
analysis computation) and are already implemented in
brw_fs_live_variables.cpp:

> bool virtual_grf_interferes(int a, int b) const;
> int *virtual_grf_start;
> int *virtual_grf_end;

It makes sense for them to be part of the fs_live_variables object,
because they have the same lifetime as other liveness analysis results
and because this will allow some extra validation to happen wherever
they are accessed in order to make sure that we only ever use
up-to-date liveness analysis results.

This shortens the virtual_grf prefix in order to compensate for the
slightly increased lexical overhead from the live_intervals pointer
dereference.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:43 -08:00
Francisco Jerez
ab6d792986 intel/compiler: Pass detailed dependency classes to invalidate_analysis()
Have fun reading through the whole back-end optimizer to verify
whether I've missed any dependency flags -- Or alternatively, just
trust that any mistake here will trigger an assertion failure during
analysis pass validation if it ever poses a problem for the
consistency of any of the analysis passes managed by the framework.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:39 -08:00
Francisco Jerez
d966a6b4c4 intel/compiler: Introduce backend_shader method to propagate IR changes to analysis passes
The invalidate_analysis() method knows what analysis passes there are
in the back-end and calls their invalidate() method to report changes
in the IR.  For the moment it just calls invalidate_live_intervals()
(which will eventually be fully replaced by this function) if anything
changed.

This makes all optimization passes invalidate DEPENDENCY_EVERYTHING,
which is clearly far from ideal -- The dependency classes passed to
invalidate_analysis() will be refined in a future commit.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>
2020-03-06 10:20:32 -08:00
Jordan Justen
cf12faef61
intel/compiler: Restrict cs_threads to 64
Our current GPGPU_WALKER code only supports up to 64 threads.

On HSW we could use up to 70 and TGL up to 112, but only if the walker
is adjusted so the width does not exceed 64. Work to support this is
in progress.

Previous to this change, we might try to downgrade to SIMD8 if the
SIMD16 shader spilled. Since HSW and TGL have the max number of
threads above 64, we would then try to emit an invalid GPGPU walker
command.

Fixes: 932045061b ("i965/cs: Emit compute shader code and upload programs")
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Tested-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
2020-02-28 14:45:43 -08:00
Danylo Piliaiev
d596795d4d brw_fs: Avoid zero size vla
../src/intel/compiler/brw_fs.cpp:2247:46: runtime error: variable length array bound evaluates to non-positive value 0
    #0 0x7f78f5697678 in fs_visitor::assign_constant_locations() ../src/intel/compiler/brw_fs.cpp:2247
    #1 0x7f78f571d29e in fs_visitor::optimize() ../src/intel/compiler/brw_fs.cpp:7361
    #2 0x7f78f574eb84 in fs_visitor::run_fs(bool, bool) ../src/intel/compiler/brw_fs.cpp:8022
    #3 0x7f78f575641b in brw_compile_fs ../src/intel/compiler/brw_fs.cpp:8408
    #4 0x7f78f255c8e4 in brw_codegen_wm_prog ../src/mesa/drivers/dri/i965/brw_wm.c:123
    #5 0x7f78f2565571 in brw_fs_precompile ../src/mesa/drivers/dri/i965/brw_wm.c:608
    #6 0x7f78f24edd2c in brw_shader_precompile ../src/mesa/drivers/dri/i965/brw_link.cpp:56
    #7 0x7f78f24f3af8 in brw_link_shader ../src/mesa/drivers/dri/i965/brw_link.cpp:381
    #8 0x7f78f39a302a in _mesa_glsl_link_shader ../src/mesa/program/ir_to_mesa.cpp:3119
    #9 0x7f78f3a43826 in create_new_program ../src/mesa/main/ff_fragment_shader.cpp:1133
    #10 0x7f78f3a43d00 in _mesa_get_fixed_func_fragment_program ../src/mesa/main/ff_fragment_shader.cpp:1163
    #11 0x7f78f325ddcd in update_program ../src/mesa/main/state.c:134
    #12 0x7f78f325fe64 in _mesa_update_state_locked ../src/mesa/main/state.c:360
    #13 0x7f78f32600f1 in _mesa_update_state ../src/mesa/main/state.c:394
    #14 0x7f78f2b3e587 in clear ../src/mesa/main/clear.c:169
    #15 0x7f78f2b3e587 in _mesa_Clear ../src/mesa/main/clear.c:242

Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3825>
2020-02-19 12:07:24 +02:00
Francisco Jerez
8d3b86e34a intel/fs/gen7+: Implement discard/demote for SIMD32 programs.
At this point this simply involves fixing the initialization of the
sample mask flag register to take the right dispatch mask from the
thread payload, and fixing sample_mask_reg() to return f1.1 for the
second half of a SIMD32 thread.  This improves Manhattan 3.1
performance by 2.4%±0.31% (N>40) on my ICL with SIMD32 enabled
relative to falling back to SIMD16 for the shaders that use discard.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:49 -08:00
Francisco Jerez
04c7d3d4b1 intel/fs: Return consistent UW types from sample_mask_reg() in fragment shaders.
In SIMD32 programs that don't use discard, the upper 16 bits of the UD
result of sample_mask_reg() don't contain the sample mask of the upper
16 channels as one would expect.  Stop pretending we are returning a
valid 32-bit mask.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:49 -08:00
Francisco Jerez
1c6853a9be intel/fs: Refactor predication on sample mask into helper function.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a792e11f5c intel/fs/gen7+: Swap sample mask flag register and FIND_LIVE_CHANNEL temporary.
FIND_LIVE_CHANNEL was using f1.0-f1.1 as temporary flag register on
Gen7, instead use f0.0-f0.1.  In order to avoid collision with the
discard sample mask, move the latter to f1.0-f1.1.  This makes room
for keeping track of the sample mask of the second half of SIMD32
programs that use discard.

Note that some MOVs of the sample mask into f1.0 become redundant now
in lower_surface_logical_send() and lower_a64_logical_send().

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>x
2020-02-14 14:31:48 -08:00
Francisco Jerez
083fd96a97 intel/fs: Use helper for discard sample mask flag subregister number.
Use it instead of hard-coding f0.1 for the sample mask of programs
that use discard.  This will make the task easier when we replace f0.1
with another flag register location in order to support discard with
SIMD32 shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a6bc11a789 intel/fs: Make sample_mask_reg() local to brw_fs.cpp and use it in more places.
It's only really useful there.  This will avoid confusion with another
helper with a similar purpose I'm about to introduce that will be
useful in multiple files from the FS back-end.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
57dee58c82 intel/fs: Set src0 alpha present bit in header when provided in message payload.
Currently the "Source0 Alpha Present to RenderTarget" bit of the RT
write message header is derived from brw_wm_prog_data::replicate_alpha.
However the src0_alpha payload is provided anytime it's specified to
the logical message.  This could theoretically lead to an
inconsistency if somebody provided a src0_alpha value while
brw_wm_prog_data::replicate_alpha was false, as I'm planning to do in
a future commit in order to implement a hardware workaround.

Instead calculate the header bit based on whether a src0_alpha value
was provided to the logical message, which guarantees the same
behavior on pre-ICL and ICL+ (the latter used an extended descriptor
bit for this which didn't suffer from the same issue).  Remove the
brw_wm_prog_data::replicate_alpha flag.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
a8ac0bd759 intel/fs/gen12: Workaround unwanted SEND execution due to broken NoMask control flow.
This is a less invasive alternative to the workaround documented in
the hardware spec for GEN:BUG:1407528679, which doesn't involve
disabling structured control flow (it's unlikely that switching to
GOTO/JOIN would have actually fixed the problem anyway).

Under some conditions Gen12 hardware can end up executing a BB with
all channels disabled, which will lead to the execution of any NoMask
instructions in it, even though any execution-masked instructions will
be correctly shot down.  This may break assumptions of some NoMask
SEND messages whose descriptor depends on data generated by live
invocations of the shader.

This avoids the problem by predicating certain instructions on an ANY
horizontal predicate that makes sure that their execution is omitted
when all channels of the program are disabled.  The shader-db impact
of this patch seems to be minimal:

total instructions in shared programs: 17169833 -> 17169913 (0.00%)
instructions in affected programs: 30663 -> 30743 (0.26%)
helped: 0
HURT: 42

total cycles in shared programs: 336966176 -> 336968568 (0.00%)
cycles in affected programs: 2367290 -> 2369682 (0.10%)
helped: 0
HURT: 13

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
008f95a043 intel/fs: Add virtual instruction to load mask of live channels into flag register.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Francisco Jerez
b8b509fb92 intel/fs/gen7: Fix fs_inst::flags_written() for SHADER_OPCODE_FIND_LIVE_CHANNEL.
We need to pass a width of 32 since the opcode bashes the whole f1.0
register on IVB.  This is unlikely to have caused problems since f1.0
is largely unused currently.  That's likely to change soon though,
even on platforms other than Gen7.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Cc: 20.0 <mesa-stable@lists.freedesktop.org>
2020-02-14 14:31:48 -08:00
Ian Romanick
58907568ec intel/fs: Add SHADER_OPCODE_[IU]SUB_SAT pseudo-ops
v2: Add a big comment explaining the [IU]SUB_SAT lowering.  Suggested by
Caio.

v3: Use get_fpu_lowered_simd_width in get_lowered_simd_width.  Suggested
by Ken on IRC.

v4: Fix a typo in a comment.  Noticed by Caio.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
2020-01-23 00:18:57 +00:00
Ian Romanick
74cd0964d6 intel/fs: Don't lower integer multiplies that don't need lowering
v2: Move the check to fs_visitor::lower_integer_multiplication.
Previously the cases where lowering was skipped, the original
instruction was removed by fs_visitor::lower_integer_multiplication.

Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
2020-01-23 00:18:57 +00:00
Matt Turner
49c21802cb intel/compiler: Split has_64bit_types into float/int
Gen7 has 64-bit floats but not 64-bit ints.

Acked-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2635>
2020-01-22 00:19:20 +00:00
Caio Marcelo de Oliveira Filho
ff5b74ef32 intel/fs: Add workgroup_size() helper
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
2020-01-21 23:41:35 +00:00
Francisco Jerez
b54b67e067 intel/fs: Switch to standard vector layout for barycentrics at optimization time.
This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.

Main reason is correctness of SIMD32 fragment shaders.  The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32.  This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.

Of course one could have chosen to fix those helpers instead, but
there is another far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for
the PLN instruction, which causes incorrect execution masks to be
applied to the MOVs unzipping barycentric vectors in cases where a
LINTERP instruction occurs under non-uniform control flow.

I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().

This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions.  On ICL:

   total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
   instructions in affected programs: 10538474 -> 10343061 (-1.85%)
   helped: 68775
   HURT: 6

   total spills in shared programs: 8938 -> 8748 (-2.13%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8965 -> 8663 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   146
   GAINED: 43

On SKL:

   total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
   instructions in affected programs: 3876590 -> 3765635 (-2.86%)
   helped: 27492
   HURT: 2

   LOST:   191
   GAINED: 417

On SNB:

   total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
   instructions in affected programs: 5199074 -> 4606107 (-11.41%)
   helped: 29998
   HURT: 0

   LOST:   21
   GAINED: 30

Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled.  On ICL:

   total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
   instructions in affected programs: 6114788 -> 6027719 (-1.42%)
   helped: 42046
   HURT: 6

   total spills in shared programs: 8218 -> 8028 (-2.31%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8953 -> 8651 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   0
   GAINED: 3

On SKL:

   total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
   instructions in affected programs: 168850 -> 167594 (-0.74%)
   helped: 711
   HURT: 2

On SNB:

   total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
   instructions in affected programs: 2702172 -> 2666037 (-1.34%)
   helped: 17818
   HURT: 0

All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:23:12 -08:00
Francisco Jerez
79bd252d6e intel/fs: Introduce barycentric layout lowering pass.
The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering.  Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass.  For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:22:59 -08:00
Francisco Jerez
ab0d1b3b3d intel/fs: Rework fs_inst::is_copy_payload() into multiple classification helpers.
This reworks the current fs_inst::is_copy_payload() method into a
number of classification helpers with well-defined semantics.  This
will be useful later on in order to optimize LOAD_PAYLOAD instructions
more aggressively in cases where we can determine it's safe to do so.

The closest equivalent of the present fs_inst::is_copy_payload()
method is the is_coalescing_payload() helper introduced here.

No functional nor shader-db changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:21:19 -08:00
Francisco Jerez
1873202f44 intel/fs: Generalize fs_reg::is_contiguous() to register files other than VGRF.
No functional nor shader-db changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:20:59 -08:00
Francisco Jerez
d9a57c85cc intel/fs: Try to vectorize header setup in lower_load_payload().
In cases where LOAD_PAYLOAD is provided a pair of contiguous registers
as header sources, try to use a single SIMD16 instruction in order to
initialize them.  This is unlikely to affect the overall cycle count
of the shader, since the compressed instruction has twice the issue
time, except due to the reduced pressure on the instruction cache.

Main motivation is avoiding instruction-count regressions in
combination with the following copy propagation improvements, which
will allow the SIMD16 g0-1 header setup emitted for framebuffer writes
to be copy-propagated into its LOAD_PAYLOAD, leading to the emission
of two SIMD8 MOV instructions instead of a single SIMD16 MOV.

Reverting this commit on top of the copy propagation changes would
lead to the following shader-db regressions on SKL and other
platforms:

 total instructions in shared programs: 14926738 -> 14935415 (0.06%)
 instructions in affected programs: 1892445 -> 1901122 (0.46%)
 helped: 0
 HURT: 8676

Without the following copy propagation changes this doesn't have any
effect on shader-db on Gen7+, because we would typically set up the FB
write header with a separate SIMD16 MOV that isn't currently
copy-propagated into the LOAD_PAYLOAD, so the individual SIMD8 MOVs
result of LOAD_PAYLOAD lowering would get register-coalesced away
under normal circumstances.  However that wasn't the case for MRF
LOAD_PAYLOAD destinations on Gen6 and earlier, because register
coalesce only kicks in for GRFs, leaving a number of redundant SIMD8
MOVs lying around.  On SNB this leads to the following shader-db
improvements:

 total instructions in shared programs: 10770538 -> 10734681 (-0.33%)
 instructions in affected programs: 2700655 -> 2664798 (-1.33%)
 helped: 17791
 HURT: 0

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-17 13:20:46 -08:00
Eric Anholt
3d9a3d0be0 i965: Reuse the new core glsl_count_dword_slots().
The only difference I could see was treating interfaces like structs.
Maintain that case.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3297>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3297>
2020-01-14 23:55:00 +00:00
Francisco Jerez
c20dc9b836 intel/fs: Make implied_mrf_writes() an fs_inst method.
This will be convenient in a later commit enabling SIMD32 fragment
shaders, and happens to fix the calculation for MATH instructions
which is currently inaccurate for SIMD-lowered instructions on Gen4-5
platforms (all of them on Gen4 in SIMD16 mode), since it was based on
the shader's dispatch width rather than on the actual execution size
of the instruction.

This causes some shader-db noise on Gen4 due to the more compact
register allocation interacting with the SEND dependency workarounds,
but otherwise no major changes.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:02:30 -08:00
Francisco Jerez
0a6e46d44d intel/fs/gen11+: Handle ROR/ROL in lower_simd_width().
Prevents invalid code from being emitted for ROR/ROL instructions in
SIMD32 shaders.

The problem can be reproduced with the following tests while forcing
SIMD32 to be used for fragment shaders:

 piglit.shaders.glsl-rotate-left
 piglit.shaders.glsl-rotate-right

However the issue could occur in production already with compute
shaders and a workgroup size large enough to trigger SIMD32 dispatch.

Fixes: 83fdec0f0d "intel/compiler: Enable the emission of ROR/ROL instructions"
Cc: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2020-01-10 11:00:24 -08:00
Caio Marcelo de Oliveira Filho
2137be22fa intel/fs: Fix lowering of dword multiplication by 16-bit constant
Existing code was ignoring whether the type of the immediate source
was signed or not.  If the source was signed, it would ignore small
negative values but it also would wrongly accept values between
INT16_MAX and UINT16_MAX, causing the atual value to later be
reinterpreted as a negative number (under 16-bits).

Fixes tests/shaders/glsl-mul-const.shader_test in Piglit for platforms
that don't support MUL with 32x32 types, including ICL and TGL.

Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2186
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-12-17 10:45:22 -08:00