Commit graph

27608 commits

Author SHA1 Message Date
Dave Airlie
0438bc76e2 nouveau: enable GL 4.3 on kepler/fermi
Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-05-28 05:52:13 +10:00
Marek Olšák
43550f25ed radeonsi: always reserve output space for tess factors
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Tested-by: Dave Airlie <airlied@redhat.com>
2016-05-27 21:40:43 +02:00
Bas Nieuwenhuizen
4717d5a2d3 gallium/ddebug: Add passthrough for query_memory_info.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-27 20:00:07 +02:00
Roland Scheidegger
9247570d42 gallivm: eliminate a unnecessary AND with unorm lerps
Instead of doing a add and then mask out the upper bits, we can
simply do a add with a half wide type (this, of course, assumes
the hw can actually do it...), so we'll get the required zero
in the upper bits automatically.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
2016-05-27 19:11:28 +02:00
Roland Scheidegger
17d685c426 gallium/util: use enum pipe_prim_type instead of unsigned some more
There were complaints from a mingw build:
u_draw.h:134:14: error: invalid conversion from ‘uint {aka unsigned int}’
to ‘pipe_prim_type’ [-fpermissive]

Reviewed-by: Brian Paul <brianp@vmware.com>
2016-05-27 19:11:28 +02:00
Brian Paul
2318d2015a svga: remove unneeded casts in get_query_result_vgpu9() calls
Reviewed-by: Charmaine Lee <charmainel@vmware.com>
2016-05-27 10:01:12 -06:00
Brian Paul
9be122e9b0 svga: use MAYBE_UNUSED to silence release-build warnings
Signed-off-by: Brian Paul <brianp@vmware.com>
2016-05-27 10:00:56 -06:00
Ilia Mirkin
df2881381a nvc0/ir: handle a load's reg result not being used for locked variants
For a load locked, we might not use the first result but the second
result is the predicate result of the locking. In that case the load
splitting logic doesn't apply (which is designed for splitting 128-bit
loads). Instead we take the predicate and move it into the first
position (as having a dead result in first def's position upsets all
sorts of things including RA). Update the emitters to deal with this as
well.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Tested-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
2016-05-26 21:23:49 -04:00
Ilia Mirkin
04ecad97ff nvc0/ir: avoid generating illegal instructions for compute constbuf loads
For user-supplied constbufs, fileIndex is 0. In that case, when we
subtract 1, we'll end up loading from constbuf offset -16. This is
illegal, and there are asserts to avoid it. Normally we'd just DCE it,
but no point in generating the instructions if they're not going to be
used.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
2016-05-26 21:23:49 -04:00
Rob Clark
4f98c94be7 gallium/util: fix build break
Missing #include caused build breaks after 21a3fb9cd.

Signed-off-by: Rob Clark <robclark@freedesktop.org>
2016-05-26 20:59:08 -04:00
Brian Paul
1ec45a1948 gallium/util: use enum pipe_prim_type in u_prim.h functions
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:18 -06:00
Brian Paul
7a49b41436 util/indices: move duplicated assignments out of switch cases
Spotted by Roland.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:18 -06:00
Brian Paul
46be65c681 gallium: change pipe_draw_info::mode to be pipe_prim_type
Makes debugging with gdb a little nicer.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:18 -06:00
Brian Paul
a25ae485a6 util/indices,svga: s/unsigned/enum pipe_prim_type/
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:18 -06:00
Brian Paul
21a3fb9cd8 util: s/unsigned/enum pipe_resource_usage/ for buffer usage variables
Reviewed-by: Marek Olšák <marek.olsak@amd.com>

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:18 -06:00
Brian Paul
45078e8890 svga: s/unsigned/enum pipe_resource_usage/ for buffer usage variables
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:18 -06:00
Brian Paul
d21a309c6c svga: s/unsigned/enum pipe_prim_type/ for primitive type variables
Proper enum types were only added recently.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
90afd7b7ef svga: fix test for unfilled triangles fallback
VGPU10 actually supports line-mode triangles.  We failed to make use of
that before.

Reviewed-by: Charmaine Lee <charmainel@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
2c07c40d2f svga: clean up and improve comments in svga_draw_private.h
Reviewed-by: Charmaine Lee <charmainel@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
0f983e1793 util/indices: implement unfilled (tri->line) conversion for adjacency prims
Tested with new piglit gl-3.2-adj-prims test.

v2: re-order trisadj and tristripadj code, per Roland.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
d6c2c7d710 util/indices: implement provoking vertex conversion for adjacency primitives
Tested with new piglit gl-3.2-adj-prims test.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
479d364c39 util/indices: assert that the incoming primitive is a triangle type
The unfilled index translator/generator functions should only be
called when the primitive mode is one of the triangle types.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
26de558072 util/indices: formatting, whitespace fixes in u_unfilled_indices.c
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
24eadb4810 util/indices: improve comments in u_indices.h
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
2016-05-26 17:44:17 -06:00
Brian Paul
5393238765 svga: fix primitive mode (point/line/tri) test for unfilled primitives
The original mode test was valid before we had GS support.

Regression tested with full piglit run.  Though, I don't think we have
any piglit tests that exercise drawing unfilled adjacency primitives.

Reviewed-by: Charmaine Lee <charmainel@vmware.com>
2016-05-26 17:44:17 -06:00
Samuel Pitoiset
e01a482182 nvc0: invalidate textures/samplers between 3D and CP on Fermi
Like constant buffers, samplers and textures are aliased on Fermi and
we need to invalidate the state when switching from 3D to CP and vice
versa.

This fixes rendering issues in the UE4 demos.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
2016-05-26 23:51:22 +02:00
Jason Ekstrand
32210dea8e compiler: Move glsl_to_nir to libglsl.la
Right now libglsl.la depends on libnir.la so putting it in libnir.la
adds a dependency on libglsl.la that goes the wrong direction.

Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
2016-05-26 14:13:38 -07:00
Bas Nieuwenhuizen
43d7305a40 radeonsi: Allow TES distribution between shader engines.
The R_028B50_VGT_TESS_DISTRIBUTION value is copied from
amdgpu-pro. Smaller values in the ACCUM fields seem to
decrease the performance advantage from this patch, higher
values don't seem to matter.

v2: Add distribution mode field enums.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
f91c85b29b radeonsi: Process multiple patches per threadgroup.
Using more than 1 wave per threadgroup does increase performance
generally.  Not using too many patches per threadgroup also
increases performance. Both catalyst and amdgpu-pro seem to
use 40 patches as their maximum, but I haven't really seen
any performance increase from limiting the number of patches
to 40 instead of 64.

Note that the trick where we overlap the input and output LDS
does not work anymore as the insertion of the tess factors
changes the patch stride.

v2: - Add comment about LDS assumptions.
    - Add constant for buffer size.
    - Fix code style.

v3: - Correct limits for not splitting patches between waves.
    - Set max num_patches to 40 as in the proprietary driver.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
fd0a7a382f radeonsi: Add barrier before writing the tess factors.
The factors may be stored to LDs by another invocation than
the invocation for vertex 0.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
fee3160af9 radeonsi: Enable dynamic HS.
This allows running the TES on different CU's than the
TCS which results in performance improvements.

v2: Only write the control word from one invocation.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
26f436132b radeonsi: Remove LDS layout user SGPR's from TES.
They are unused.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
a4e2146a9d radeonsi: Use buffer loads and stores for passing data from TCS to TES.
We always try to use 4-component loads, as LLVM does not combine loads
and they bypass the L1 cache.

We can't use a similar strategy for stores and this is especially
notable with the tess factors, as they are often set with separate
MOV's per component in the TGSI.

We keep storing to LDS and the LDS space, so we can load the outputs
later, either due to the shader, of for wrting the tess factors.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
6217716e8f radeonsi: Store inputs to memory when not using a TCS.
We need to copy the VS outputs to memory. I decided to do this
using a shader key, as the value depends on other shaders.

I also switch the fixed function TCS over to monolithic, as
otherwisze many of the user SGPR's need to be passed to the
epilog, which increases register pressure, or complexity to
avoid that. The main body of the fixed function TCS is not
that interesting to precompile anyway, since we do it on
demand and it is very small.

v2: Use u_bit_scan64.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
7846fa8768 radeonsi: Add offchip buffer address calculation.
Instead of creating a memory area per patch and per vertex, we put
the same attribute of every vertex & patch together. Most loads
and stores access the same attribute across all lanes, only for
different patches and vertices.

For the TCS this results in tightly packed data for 4-component
stores.

For the TES this is not the case as within a patch the loads
often also access the same vertex. However if there are < 4
vertices/patch, this still results in a reduction of the number
of cache lines. In the LDS situation we only do better than worst
case if the data per patch < 64 bytes, which due to the
tessellation factors is pretty much never.

We do not use hardware swizzling for this. It would slightly reduce
the number of executed VALU instructions, but I had issues with
increased wait times that I haven't been able to solve yet.
Furthermore, the tbuffer_store intrinsic does not support both
VGPR offset and an index, so we have a problem storing
indirectly indexed outputs. This can be solved by temporarily
storing arrays in LDS and then copying them, but I don't think
that is worth the effort. The difference in VALU cycles
hardware swizzling gives is about 0.2% of total busy cycles.
That is without handling the array case.

I chose for attributes instead of components as they are often
accessed together, and the software swizzling takes VALU cycles
for calculating offsets.

v2: - Rename functions to get_tcs_tes_buffer_address.
    - multiply by 16 as late as possible.
    - Use  tgsi_full_src_register_from_dst.
    - Remove some bad comments.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
c49e68dc4b radeonsi: Add user SGPR for the layout of the offchip buffer.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
d9a0c54f6f radeonsi: Use correct parameter index for LS_OUT_LAYOUT.
This happens to be in the right position, but that changes
when TCS/TES get new parameters.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
3e7a7a9a65 radeonsi: Add buffer load functions.
v2: - Use llvm.admgcn.buffer.load instrinsics for new LLVM.
    - Code style fixes.

v3: - Code style fix.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
9fdb778702 radeonsi: Define build_tbuffer_store_dwords earlier to support new users.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
5c34562d7c radeonsi: Add offchip tessellation parameters.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Bas Nieuwenhuizen
d27ff7d683 radeonsi: Add buffer for offchip storage between TCS and TES.
The buffer is quite large, but should only be allocated if the
application uses tessellation. Most non-games don't.

v2: - Use the correct register for SI.
    - Add define for block size.

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2016-05-26 22:07:04 +02:00
Rob Clark
6e51fe75a4 tgsi: fix coverity out-of-bounds warning
CID 1271532 (#1 of 1): Out-of-bounds read (OVERRUN)34. overrun-local:
Overrunning array of 2 16-byte elements at element index 2 (byte offset
32) by dereferencing pointer &inst.Dst[i].

Signed-off-by: Rob Clark <robclark@freedesktop.org>
Reviewed-by: Brian Paul <brianp@vmware.com>
2016-05-26 15:17:49 -04:00
Rob Clark
3d66ba971e tgsi: fix out of bounds access
Not sure why coverity calls this an out-of-bounds read vs out-of-bounds
write.

CID 1358920 (#1 of 1): Out-of-bounds read (OVERRUN)9. overrun-local:
Overrunning array r of 3 16-byte elements at element index 3 (byte
offset 48) using index chan (which evaluates to 3).

Signed-off-by: Rob Clark <robclark@freedesktop.org>
Reviewed-by: Brian Paul <brianp@vmware.com>
2016-05-26 15:17:49 -04:00
Samuel Pitoiset
c52e92ec3a nvc0: allow to monitor MP perf counters with compute shaders
To read out MP perf counters we use a compute shader and need to upload
input data like a 64-bits addr used to store the values and a sequence
ID for synchronization. Currently, this input data is uploaded as user
uniforms which means that it's sticked to c0[], but if a compute shader
from a real application is used, monitoring those performance counters
will just overwrite some data and miserably crash.

Instead, sticking the 64-bits addr and the sequence into the driver
constant buffer seems like much better and will allow to monitor
counters with GL 4.3 apps.

Tested on GF119 and GK110, but should not hurt anything on GK104.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
2016-05-26 19:34:57 +02:00
Marek Olšák
8539c9bf31 gallium/radeon: add the kernel version into the renderer string
Example:
Gallium 0.4 on AMD TONGA (DRM 3.2.0 / 4.5.0, LLVM 3.9.0)

My kernel version is pretty long already (4.5.0-amd-01025-g32791c1)
and adding "kernel" into the string would make too it long for glxinfo
to display.

Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
2016-05-26 16:53:46 +02:00
Marek Olšák
53f33619a4 winsys/amdgpu: add back multithreaded command submission
Ported from the initial amdgpu winsys from the private AMD branch.

The thread creates the buffer list, submits IBs, and cleans up
the submission context, which can also destroy buffers.

3-5% reduction in CPU overhead is expected for apps submitting a lot
of IBs per frame. This is most visible with DMA IBs.

v2: use a semaphore instead of a busy loop in amdgpu_ws_queue_cs
    add another amdgpu_cs_sync_flush call into amdgpu_bo_map

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2016-05-26 16:43:45 +02:00
Lars Hamre
c626a86586 gallium/tgsi: use _mesa_roundevenf in micro_rnd
Fixes the following piglit tests (for softpipe):

/spec/glsl-1.30/execution/built-in-functions/...
fs-roundeven-float
fs-roundeven-vec2
fs-roundeven-vec3
fs-roundeven-vec4
vs-roundeven-float
vs-roundeven-vec2
vs-roundeven-vec3
vs-roundeven-vec4

/spec/glsl-1.50/execution/built-in-functions/...
gs-roundeven-float
gs-roundeven-vec2
gs-roundeven-vec3
gs-roundeven-vec4

Signed-off-by: Lars Hamre <chemecse@gmail.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
2016-05-26 07:59:15 -06:00
Ilia Mirkin
f998e5dc6b nvc0: add note about where the viewport mask would go
Not piping this all the way through yet, but no better place to note
this down. This will can be used with NV_viewport_array2.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
2016-05-26 08:46:29 -04:00
Ilia Mirkin
b634936d3b nvc0: enable 32 textures on kepler+
For fermi, this likely will require use of linked tsc mode. However on
bindless architectures, we can have as many as we want. As it stands,
the AUX_TEX_INFO has 32 teture handles reserved.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
2016-05-26 08:46:13 -04:00
Bruce Cherniak
c8835a5924 swr: [rasterizer] Correctly select optimized primitive assembly.
Indexed primitives were always using cut-aware primitive assembly,
whether primitive_restart was enabled or not.  Correctly pass down
primitive_restart and select optimized PA when possible.

Reviewed-by: Tim Rowley <timothy.o.rowley@intel.com>
2016-05-25 18:47:16 -05:00