ALIGN_DIVUP is a driver specific(r600g) macro that duplicates DIV_ROUND_UP functionality.
Replacing it with DIV_ROUND_UP eliminates this problems.
Signed-off-by: Krzysztof A. Sobiecki <sobkas@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
I had the driver all tested for the last series, and in my last build I
noticed that get_swizzled_channel was unused now, and removed
it... apparently without testing to find that I removed the wrong channel
swizzle function.
We routinely have code like:
vec1 ssa_220 = fge ssa_104, ssa_61
vec1 ssa_199 = bcsel ssa_220, ssa_106, ssa_105
and we would compare fge's args and choose between ~0 and 0 to generate
ssa_220, then compare ssa_220 to 0 and choose between bcsel's args.
Instead, try to notice the pattern and compare between fge's args to
select between bcsel's args.
total instructions in shared programs: 88019 -> 87574 (-0.51%)
instructions in affected programs: 9985 -> 9540 (-4.46%)
total estimated cycles in shared programs: 245752 -> 245237 (-0.21%)
estimated cycles in affected programs: 17232 -> 16717 (-2.99%)
We can't use its other features currently (mostly because we don't want
Newton-Raphson on rcps for texture coordinates), but it gets us started.
This eliminates some comparisons with constants in GLB2.7 and ETQW traces
at the QIR level by moving the comparisons into NIR, where they get
constant-folded out.
instructions in affected programs: 165 -> 156 (-5.45%)
total uniforms in shared programs: 32087 -> 32085 (-0.01%)
total estimated cycles in shared programs: 245762 -> 245752 (-0.00%)
estimated cycles in affected programs: 461 -> 451 (-2.17%)
I'm moving away from QIR being SSA (since NIR is doing lots of SSA
optimization for us now) and instead having QIR just be QPU operations
with virtual registers. By making our SELs be composed of two MOVs, we
could potentially coalesce the registers for the MOV's src and dst and
eliminate the MOV.
total instructions in shared programs: 88448 -> 88028 (-0.47%)
instructions in affected programs: 39845 -> 39425 (-1.05%)
total estimated cycles in shared programs: 246306 -> 245762 (-0.22%)
estimated cycles in affected programs: 162887 -> 162343 (-0.33%)
If you want the SF of the value of a register produced from a series of
packing MOVs or conditional MOVs, we can't just SF on the last MOV into
the register.
Fix a 's/unsigned int/unsigned/' consistency case while here.
Found-by: Coccinelle
Signed-off-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Fix silly issue with MSVC case fall-though support to need
a extra 'break;'
Found-by: Coccinelle
Signed-off-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
This patch converts the SSE-optimized lp_rast_triangle_32_3_16()
to VMX/VSX.
I measured the results on POWER8 machine with 32 cores at 3.4GHz and
16GB of RAM.
FPS/Score
Name Before After Delta
------------------------------------------------
openarena 16.35 16.7 2.14%
xonotic 4.707 4.97 5.57%
glmark2 didn't show a significant (more than 1%) difference.
v2: Make sure code is build only on POWER8 LE machine
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
This patch converts the SSE-optimized build_mask_32() and
build_mask_linear_32() to VMX/VSX.
I measured the results on POWER8 machine with 32 cores at 3.4GHz and
16GB of RAM.
FPS/Score
Name Before After Delta
------------------------------------------------
glmark2 (score) 139.8 142.7 2.07%
openarena and xonotic didn't show a significant (more than 1%)
difference.
v2: Make sure code is build only on POWER8 LE machine
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
This patch converts the SSE optimization done in do_triangle_ccw to
VMX/VSX.
I measured the results on POWER8 machine with 32 cores at 3.4GHz and
16GB of RAM.
FPS/Score
Name Before After Delta
------------------------------------------------
glmark2 (score) 136.6 139.8 2.34%
openarena 16.14 16.35 1.30%
xonotic 4.655 4.707 1.11%
v2:
- Convert loads to use aligned loads
- Make sure code is build only on POWER8 LE machine
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
This file provides a portability layer that will make it easier to convert
SSE-based functions to VMX/VSX-based functions.
All the functions implemented in this file are prefixed using "vec_".
Therefore, when converting from SSE-based function, one needs to simply
replace the "_mm_" prefix of the SSE function being called to "vec_".
Having said that, not all functions could be converted as such, due to the
differences between the architectures. So, when doing such
conversion hurt the performance, I preferred to implement a more ad-hoc
solution. For example, converting the _mm_shuffle_epi32 needed to be done
using ad-hoc masks instead of a generic function.
All the functions in this file support both little-endian and big-endian
but currently the file is build only on POWER8 LE machine.
All of the functions are implemented using the Altivec/VMX intrinsics,
except one where I needed to use inline assembly (due to missing
intrinsic).
v2:
- Use vec_vgbbd instead of __builtin_vec_vgbbd
- Add an aligned load function
- Don't use typeof()
- Make file build only on POWER8 LE machine
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
We only want to set the SVGA_NEW_STIPPLE dirty flag when the polygon
stipple state changes. Before, we only set the flag when we were
enabling stipple, but not disabling.
We don't really have to add SVGA_NEW_STIPPLE to the dirty FS state
set since it's a subset of SVGA_NEW_RAST, but let's be explicit.
This doesn't fix any known bugs.
Reviewed-by: Charmaine Lee <charmainel@vmware.com>
and svga_set_sampler_views(). If there's no change, return early
and don't set a SVGA_NEW_x dirty state flag.
Reviewed-by: Charmaine Lee <charmainel@vmware.com>
It allows to call nouveau_vp3_bsp_next multiple times
between one begin/end.
It is required to support st/va.
https://bugs.freedesktop.org/show_bug.cgi?id=89969
Signed-off-by: Julien Isorce <j.isorce@samsung.com>
[imirkin: create strparm_bsp function, simplified w0 calculation]
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
The counter was not set but used by the nouveau driver.
It is required otherwise visual output is garbage.
Signed-off-by: Julien Isorce <j.isorce@samsung.com>
Reviewed-by: Christian Koenig <christian.koenig@amd.com>
Experimentally, 4M causes corruption and slowness, try to ramp it up
with size instead.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Cc: "11.0 11.1" <mesa-stable@lists.freedesktop.org>
H264 doesn't have a bitplane bo. We just need a device reference, so use
the one from the client.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Cc: "11.0 11.1" <mesa-stable@lists.freedesktop.org>
It's the same behavior that we use for later LLVM.
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Found during NIR_TEST_CLONE=1 piglit run. We were using block->index
but forgetting to require it. Causing things to not work with a cloned
shader which didn't preserve block_index.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Immediately convert into NIR and do an initial key-agnostic lowering/
optimization pass. This should let us share most of the per-variant
transformations between each variant, and hopefully minimize the draw-
time variant creation part of the compilation process.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
It will still hit a compile_assert() in emit_tex, which has the
advantage of dumping out the offending shader.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Nested functions are supported as an extension in GNU C, but Clang
don't support them.
This fixes compilation errors when (manually) building compute.c,
or by setting --enable-gallium-tests to the configure script.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75165
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Instead of iterating over all the buffer resources looking for coherent
buffers, we keep track of a context-wide count. This will save some
iterations (and CPU cycles) in 99.99% case because usually coherent
buffers are not so used.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>