Commit graph

29264 commits

Author SHA1 Message Date
Eric Anholt
96ffee2d02 vc4: Mark threaded FSes as non-singlethread in the CL. 2016-11-12 19:21:46 -08:00
Eric Anholt
ace0d810e5 vc4: Flag the last thread switch in the program as the last.
We don't allow the last thread switch to be inside control flow, to be
sure that we hit the last state exactly once.  If the last texturing was
in control flow, fall back to single threaded.
2016-11-12 19:21:46 -08:00
Eric Anholt
67f72c5f5d vc4: Add THRSW nodes after each tex sample setup in multithreaded mode.
This is a suboptimal implementation, but Jonas Pfeil found that it was
still a massive performance gain.
2016-11-12 19:21:46 -08:00
Eric Anholt
e3c620e868 vc4: Add some spec citations about texture fifo management. 2016-11-12 18:46:35 -08:00
Eric Anholt
fd2aff858b vc4: Use ra14/rb14 as the spilling registers.
This makes the raddr fixups compatible with FS threading.
2016-11-12 18:46:35 -08:00
Eric Anholt
755037173d vc4: Add support for register allocation for threaded shaders.
We have two major requirements: Make sure that only the bottom half of the
physical reg space is used, and make sure that none of our values are live
in an accumulator across a switch.
2016-11-12 18:46:35 -08:00
Eric Anholt
fdad4d2402 vc4: Split register class setup for physical files from accumulators. 2016-11-12 18:46:35 -08:00
Eric Anholt
8e704dca7f vc4: Use register allocator CLASS_BIT_R0_R3 to clean up CLASS_B.
We have had no reason to separate ability to store in an accumulator from
ability to store in B, but with FS threading, we need to be able to force
values to be stored only in the physical regfiles.
2016-11-12 18:46:35 -08:00
Eric Anholt
1ee503c74d vc4: Add support for QPU scheduling of thread switch instructions.
This is vaguely based off of Jonas Pfeil's thread switch support branch.
2016-11-12 18:46:35 -08:00
Eric Anholt
4f527f1260 vc4: Add a thread switch QIR instruction.
This will eventually be generated at the QIR level, so that
vc4_qir_schedule.c can arrange the separation of tex_strb from tex_result
correctly.  It will also be important so that register allocation set the
register classes appropriately for values that are live across the switch.
2016-11-12 18:46:35 -08:00
Eric Anholt
93cdae44de vc4: Add a bit of QPU validation for threaded shaders.
These are both bugs we've run into along the way writing multithreaded FS
support.
2016-11-12 18:46:35 -08:00
Eric Anholt
977d8b526b vc4: Fix register class handling of DDX/DDY arguments.
I had this exactly backwards, but apparently the piglit tests were all
landing in r0-r3 anyway.

Cc: "13.0" <mesa-stable@lists.freedesktop.org>
2016-11-12 18:46:35 -08:00
Rob Clark
dfc001dccc freedreno/ir3: fixup ralloc fallout
Fixes fallout from acc23b04 ("ralloc: remove memset from ralloc_size").
We were still depending on zero'd allocations in a couple of places.

Signed-off-by: Rob Clark <robdclark@gmail.com>
2016-11-12 08:57:03 -05:00
Laurent Carlier
3ff9f8c532 clover: fix building since llvm r286566
pretty trivial fix
2016-11-11 19:45:22 +00:00
Samuel Pitoiset
561f2208bd nvc0: support MP performance counters on Maxwell
This adds some performance counters/metrics for SM50/SM52.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Tested-by: Pierre Moreau <pierre.morrow@free.fr>
2016-11-10 22:13:49 +01:00
Tim Rowley
b9578b683d gallium: detect avx512 cpu features
v3: fix check for xmm/ymm test
v2: style code, add avx512 to cpu dump

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-11-10 15:03:21 -06:00
Marek Olšák
ce3f453f01 radeonsi: fix r600_texture::tc_compatible_htile
htile_size is now always non-zero if HTILE is allocated.

It seems to have caused no issues.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-10 18:34:55 +01:00
Marek Olšák
ce3189cbe6 radeonsi: accept is_store in image_fetch_rsrc instead of dcc_off
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-10 18:34:55 +01:00
Marek Olšák
f83b2f524a radeonsi: don't rely on tgsi_scan::images_buffers
the instruction knows the target

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-10 18:34:55 +01:00
Marek Olšák
4e00e20074 radeonsi: re-order cases in si_get_shader_param
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-10 18:34:55 +01:00
Marek Olšák
3f6e0063c8 radeonsi: increase MAX_CONTROL_FLOW_DEPTH AKA MaxIfDepth
we don't want to lower deep IFs unconditionally

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-10 18:34:55 +01:00
Nicolai Hähnle
b21912e2e9 radeonsi: fix/silence unused variable warnings in optimized builds
I'm leaving num_out_sgpr around since it's not in a fast path, and besides
the compiler should be able to optimize it away easily. The alternative
with #if/#endif would be extremely ugly.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-11-10 13:18:16 +01:00
Nicolai Hähnle
b46a9c570f gallivm: fix [IU]MUL_HI regression harder
The fix in commit 88f791db75 was insufficient
for radeonsi because the vector case was not handled properly. It seems
piglit only covers the scalar case, unfortunately.

Fixes GL45-CTS.shader_bitfield_operation.[iu]mulExtended.*

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-11-10 13:17:10 +01:00
Ilia Mirkin
828faaef40 swr: correct setting of independentAlphaBlendEnable
This setting is for whether color and alpha have different blend
settings, not for whether blending is enabled on a per-RT basis.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-09 20:11:57 -05:00
Ilia Mirkin
5be635d5e4 swr: [rasterizer] add a .dir-locals.el to support 4-space indents
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-09 20:11:39 -05:00
Ilia Mirkin
36e5d68cad swr: set halfz rasterizer setting
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-09 20:11:10 -05:00
Ilia Mirkin
4b5b87e7ab swr: [rasterizer core] allow an OpenGL driver to specify halfz clipping
With ARB_clip_control, GL may also do 0..1 depth clipping, not just
-1..1. This removes clip's reliance on driver type. DX users will need
to be updated to set the new clipHalfZ flag to get proper clipping
functionality.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-09 20:10:52 -05:00
Ilia Mirkin
4af25e7131 swr: fix support for inverted depth scales
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-09 20:10:44 -05:00
Ilia Mirkin
aed517f985 swr: [rasterizer jitter] fix logic op to work with unorm/snorm
Most logic op usage is probably going to end up with normalized
textures. Scale the floating point values and convert to integer before
performing the logic operations.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-09 20:10:25 -05:00
Eric Anholt
08d51487e3 vc4: Clamp the shadow comparison value.
Fixes piglit glsl-fs-shadow2D-clamp-z.

Cc: <mesa-stable@lists.freedesktop.org>
2016-11-09 15:33:56 -08:00
Eric Anholt
e887341d3f vc4: Don't pair up TLB scoreboard locking instructions early in QPU sched.
Jonas Pfeil noticed that we were putting passthrough tlb_z writes early in
the shader, despite QIR and QPU scheduling both trying to delay scoreboard
locking for as long as possible.

The problem was that when trying to pair up QPU instructions, at some
point the passthrough tlb_z would be the last one available and it would
get paired, even if the other half would open up other instructions to be
scheduled and we could have paired tlb_z with something later in the
program.  Also, since passthrough z is just a mov, it pairs up really
easily.

The proper fix would probably be to flip the order of scheduling
instructions so we went from bottom to top (also relevant for branch delay
slot scheduling).

However, we can do a quick fix here to just not schedule a TLB lock until
there's nothing but TLB left in the program, at a slight instruction cost
(est .61% cycle count in shader-db) but a major fragment shader
parallelism win.

glmark2 results:
  texture:texture-filter=linear: +1.24481% +/- 0.626117% (n=15)
  bump:bump-render=height: 1.24991% +/- 0.154793% (n=136,133 -- screensaver
    outliers removed)
2016-11-09 15:33:56 -08:00
Eric Anholt
695a2e2ffa vc4: Print a reg pressure estimate in our reg allocation failure dump. 2016-11-09 15:33:56 -08:00
Eric Anholt
4d019bd703 vc4: Don't abort when a shader compile fails.
It's much better to just skip the draw call entirely.  Getting this
information out of register allocation will also be useful for
implementing threaded fragment shaders, which will need to retry
non-threaded if RA fails.

Cc: <mesa-stable@lists.freedesktop.org>
2016-11-09 15:33:56 -08:00
Aaron Watry
1492633070 llvmpipe: Fix build after removal of deprecated attribute API v2
Applies on top of v3 of Tom's gallivm change.

v2:
  - Tom Stellard: Use enums instread of strings.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Aaron Watry <awatry@gmail.com>
CC: Tom Stellard <thomas.stellard@amd.com>
CC: Jan Vesely <jan.vesely@rutgers.edu>
2016-11-09 20:13:27 +00:00
Tom Stellard
8bdd52c8f3 gallivm: Fix build after removal of deprecated attribute API v3
v2:
  Fix adding parameter attributes with LLVM < 4.0.

v3:
  Fix typo.
  Fix parameter index.
  Add a gallivm enum for function attributes.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-09 20:13:27 +00:00
Roland Scheidegger
4d5346aaac Revert "draw: use vectorized calculations for fetch"
Trivial. There's some regressions internally, related to overflow
behavior. I'll have to look at it at another time, some interactions
with vsplit/vcache are actually mind-blowing.

This reverts commit 3fa10ffb49.
2016-11-09 05:53:16 +01:00
Ilia Mirkin
f037afb701 swr: disable logic op when the rt format is float or srgb
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-08 19:28:35 -05:00
Ilia Mirkin
e2e40e236f swr: fix AND_INVERTED logic op conversion
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-08 19:28:35 -05:00
Ilia Mirkin
bef4a48d1c swr: add support for EXT_depth_bounds_test
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-08 19:28:35 -05:00
Ilia Mirkin
aa62fa8fb7 swr: [rasterizer core] set depth hottile when depth bounds test enabled
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-11-08 19:28:35 -05:00
Tim Rowley
95ed1c19bf swr: allow alphatest without blend or logicop
We need to compile a blend function when alphatest is enabled.

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>
2016-11-08 14:18:47 -06:00
Marek Olšák
bdd48e47c0 tgsi/scan: turn a huge if-else-if.. chain into a switch statement
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-08 17:56:42 +01:00
Marek Olšák
f864547fa9 tgsi/scan: fix images_buffers regression
The first IF statement disabled the second one.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98599

Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-08 17:56:42 +01:00
Nicolai Hähnle
88f791db75 gallivm: fix [IU]MUL_HI regression
This patch does two things:

1. It separates the host-CPU code generation from the generic code
   generation. This guards against accidently breaking things for
   radeonsi in the future.

2. It makes sure we actually use both arguments and don't just compute
   a square :-p

Fixes a regression introduced by commit 29279f44b3

Cc: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-11-08 16:25:54 +01:00
Roland Scheidegger
3fa10ffb49 draw: use vectorized calculations for fetch
Instead of doing all the math with scalars, use vectors. This means the
overflow math needs to be done manually, albeit that's only really
problematic for the stride/index mul, the rest has been pretty much
moved outside the shader loop (albeit the mul could actually be optimized
away too), where things are still scalar. Because llvm is complete fail
with the zero-extend widening mul, roll our own even...
To eliminate control flow in the main shader loop fetch, provide fake
buffers (so index 0 is always valid to fetch).
Still uses aos fetch though in the end - mostly because some more code
would be needed to handle unaligned fetches in that path, and because for
most formats it won't make a difference anyway (we generate some truly
horrendous code for things like R16G16_something for instance).

Instanced fetch however stays roughly the same as before, except that
no longer the same element is fetched multiple times (I've seen a reduction
of ~3 times in main shader loop size due to apparently llvm not being able
to deduce it's really all the same with a couple instanced elements).

Also, for elts gathering, use vectorized code as well - provide a fake
elt buffer if there's no valid one bound.

The generated shaders are smaller and faster to compile (not entirely sure
about execution speed, but generally unless there's just single vertices
to handle I would expect it to be faster - there's more opportunities
for future improvements by using soa fetch).

No piglit change.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
2016-11-08 03:41:26 +01:00
Roland Scheidegger
29279f44b3 gallivm: introduce 32x32->64bit lp_build_mul_32_lohi function
This is used by shader umul_hi/imul_hi functions (and soon by draw).
It's actually useful separating this out on its own, however the real
reason for doing it is because we're using an optimized sse2 version,
since the code llvm generates is atrocious (since there's no widening
mul in llvm, and it does not recognize the widening mul pattern, so
it generates code for real 64x64->64bit mul, which the cpu can't do
natively, in contrast to 32x32->64bit mul which it could do).

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
2016-11-08 03:41:26 +01:00
Samuel Pitoiset
e32e5d214e nvc0: simplify draw parameters upload for vertex shaders
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
2016-11-07 22:50:17 +01:00
Steven Toth
381edca826 gallium/hud: protect against and initialization race
In the event that multiple threads attempt to install a graph
concurrently, protect the shared list.

Signed-off-by: Steven Toth <stoth@kernellabs.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-07 18:31:52 +01:00
Steven Toth
5a58323064 gallium/hud: close a previously opened handle
We're missing the closedir() to the matching opendir().

Signed-off-by: Steven Toth <stoth@kernellabs.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-07 18:31:52 +01:00
Steven Toth
6ffed08679 gallium/hud: fix a problem where objects are free'd while in use.
Instead of trying to maintain a reference counted list of valid HUD
objects, and freeing them accordingly, creating race conditions
between unanticipated multiple threads, simply accept they're
allocated once and never released until the process terminates.

They're a shared resource between multiple threads, so accept
they're always available for use.

Signed-off-by: Steven Toth <stoth@kernellabs.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-11-07 18:31:52 +01:00