This implements support for OpenGL logic operations by emitting code to read
from the TLB if needed and blending the fragment output accordingly. It is
similar to VC4's blend lowering pass, but exclusive to logic operations, since
blending is otherwise supported in hardware.
The pass doesn't handle MSAA targets yet.
Fixes the following piglit tests:
spec/!opengl 1.0/gl-1.0-logicop/*
spec/!opengl 1.1/gl-1.1-xor
spec/!opengl 1.1/gl-1.1-xor-copypixels
It also fixes text cursor rendering in Libreoffice with the GTK+2 theme, which
is rendered via glamor using the XOR logic operation.
v2: fix checks for allowed variable location and maximum render target (Eric)
Reviewed-by: Eric Anholt <eric@anholt.net>
Until now we have always been emitting our scoreboard locks on the last thread
switch to improve parallelism. We did this by emitting our last thread switch
right before our tlb writes at the very end of the program, where we know that
we are outside control flow.
Unfortunately, this strategy is not valid when we have tlb color reads too, as
these will happen before this point in the program and can happen inside
control flow.
To fix this we always emit a thread switch before the first tlb load and if we
see additional thread switches after that point, we change the strategy to lock
on the first thread switch.
v2: change the solution so it is expected to work in more scenarios (Eric).
Reviewed-by: Eric Anholt <eric@anholt.net>
We will be emitting this intrinsic to signal TLB color loads when we implement
OpenGL logic operations, where we need to blend the fragment shader color
output with the existing color in the render target.
Per-sample TLB reads are not supported yet.
v2: fix the offset into the color_reads array (Eric).
Reviewed-by: Eric Anholt <eric@anholt.net>
We need to scale the size of these arrays to consider up to
V3D_MAX_DRAW_BUFFERS render targets and 4 components per color.
v2: we want to store each color component separately, so scale by 4 too.
Reviewed-by: Eric Anholt <eric@anholt.net>
The ldtlbu version will read an implicit uniform with the TLB read
specifier and should be used for the first read in a sequence
of TLB reads (unless the default configuration is valid, in which
case we can use ldtlb). The ldtlb version is used for any subsequent
TLB read in the sequence.
Reviewed-by: Eric Anholt <eric@anholt.net>
We already have code to print these signals but the early return in the code
that checks if any signals are present present was missing the checks for them,
so it would skip printing them unless they were paired with other signals.
Reviewed-by: Eric Anholt <eric@anholt.net>
That is: the five least significant bits provide the values of
'bits' and 'offset' which is the case for all hardware currently
supported by NIR and using the bfm/bfe instructions.
This patch also changes the lowering of bitfield_insert/extract
using shifts to not use bfm and removes the flag 'lower_bfm'.
Tested-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Either all channels executed the 'then' block, in which case all
channels will directly jump to the 'endif' block at the end of the
'then' block, or all channels execute the 'else' block (so no
execution masking is necessary).
Shader-db results:
total instructions in shared programs: 9119238 -> 9117550 (-0.02%)
instructions in affected programs: 401252 -> 399564 (-0.42%)
helped: 855
HURT: 77
total uniforms in shared programs: 3022622 -> 3022605 (<.01%)
uniforms in affected programs: 3566 -> 3549 (-0.48%)
helped: 17
HURT: 0
total max-temps in shared programs: 1327762 -> 1327774 (<.01%)
max-temps in affected programs: 619 -> 631 (1.94%)
helped: 2
HURT: 15
Reviewed-by: Eric Anholt <eric@anholt.net>
We were not accountint for small immediates in the B mux so the scheduler
was interpreting these are regular register file accesses, which could
lead to additional (incorrect) write-read dependencies.
Shader-db changes:
total instructions in shared programs: 9163664 -> 9137263 (-0.29%)
instructions in affected programs: 3931035 -> 3904634 (-0.67%)
helped: 12457
HURT: 2563
total max-temps in shared programs: 1325787 -> 1325597 (-0.01%)
max-temps in affected programs: 5746 -> 5556 (-3.31%)
helped: 186
HURT: 16
helped stats (abs) min: 1 max: 4 x̄: 1.12 x̃: 1
helped stats (rel) min: 1.45% max: 22.22% x̄: 4.42% x̃: 3.28%
HURT stats (abs) min: 1 max: 3 x̄: 1.12 x̃: 1
HURT stats (rel) min: 2.86% max: 10.00% x̄: 5.76% x̃: 5.88%
95% mean confidence interval for max-temps value: -1.04 -0.84
95% mean confidence interval for max-temps %-change: -4.16% -3.07%
Max-temps are helped.
Reviewed-by: Eric Anholt <eric@anholt.net>
Currently, st/mesa is always calling the GLSL IR lower_instructions()
pass with MOD_TO_FLOOR set, so mod operations will be lowered before
ever reaching NIR. This enables the same lowering at the NIR level,
which will let me shut off the GLSL IR path for NIR-based drivers.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Eric Anholt <eric@anholt.net>
The difference between imov and fmov has been a constant source of
confusion in NIR for years. No one really knows why we have two or when
to use one vs. the other. The real reason is that they do different
things in the presence of source and destination modifiers. However,
without modifiers (which many back-ends don't have), they are identical.
Now that we've reworked nir_lower_to_source_mods to leave one abs/neg
instruction in place rather than replacing them with imov or fmov
instructions, we don't need two different instructions at all anymore.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Acked-by: Rob Clark <robdclark@chromium.org>
This can be used by both etnaviv and freedreno/a2xx as they are both vec4
architectures with some instructions being scalar-only.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
I don't know why I thought NIR_PASS always set the progress variable.
Derp.
Fixes: d41cdef2a5 ("nir: Use the flrp lowering pass instead of nir_opt_algebraic")
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Coverity CID: 1444996
Coverity CID: 1444995
Coverity CID: 1444994
Coverity CID: 1444993
Coverity CID: 1444991
Coverity CID: 1444989
I tried to be very careful while updating all the various drivers, but I
don't have any of that hardware for testing. :(
i965 is the only platform that sets always_precise = true, and it is
only set true for fragment shaders. Gen4 and Gen5 both set lower_flrp32
only for vertex shaders. For fragment shaders, nir_op_flrp is lowered
during code generation as a(1-c)+bc. On all other platforms 64-bit
nir_op_flrp and on Gen11 32-bit nir_op_flrp are lowered using the old
nir_opt_algebraic method.
No changes on any other Intel platforms.
v2: Add panfrost changes.
Iron Lake and GM45 had similar results. (Iron Lake shown)
total cycles in shared programs: 188647754 -> 188647748 (<.01%)
cycles in affected programs: 5096 -> 5090 (-0.12%)
helped: 3
HURT: 0
helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%
Reviewed-by: Matt Turner <mattst88@gmail.com>
Driver which do not support native integers should use a lowering
pass to go from integers to floats.
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
One special case, `src/util/xmlpool/.gitignore` is not entirely deleted,
as `xmlpool.pot` still gets generated (eg. by `ninja xmlpool-pot`).
Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
We can't use the QPU functions to detect this until register allocation is
done and we've moved inst->dst into inst->qpu.
Fixes bad TMU sequences from register spilling in
KHR-GLES31.core.compute_shader.shared-max.
Looks like I lost it in a rebase conflict resolution. We'd hit the
unknown intrinsic assertion in
KHR-GLES31.core.compute_shader.shared-struct.
Fixes: 6b1c659825 ("v3d: Add Compute Shader compilation support.")
In what might be my first case of finding a divergence between hardware
and simpenrose for v3d 4.x, it seems that despite what the spec claims,
you actually need specific values in the TYPE field for atomic ops.
Fixes dEQP-GLES31.functional.*.compswap.*
We were failing to set up payload[1] for use by LocalInvocationIndex/ID
and shared variable accesses if gl_WorkGroupID/gl_GlobalInvocationID
wasn't used (possibly because you only have one workgroup). You're always
going to use payload[1], and payload[0] is common enough and we have DCE
in the backend to clean it up if it happens to not be used.
Fixes assertion failures in the CTS since Karol's cleanup when NIR started
noticing that we were reading an invalid component.
Fixes: 5450f1c9fb ("v3d: prefer using nir_src_comp_as_int over nir_src_as_const_value")
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Acked-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Matt Turner <mattst88@gmail.com>
v2: remove & operator in a couple of memsets
add some memsets
v3: fixup lima
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v2)
We can use the same register spilling infrastructure for our loads/stores
of indirect access of temp variables, instead of doing an if ladder.
Cuts 50% of instructions and max-temps from 2 KSP shaders in shader-db.
Also causes several other KSP shaders with large bodies and large loop
counts to not be force-unrolled.
The change was originally motivated by NOLTIS slightly modifying register
pressure in piglit temp mat4 array read/write tests, triggering register
allocation failures.
While waiting for the CSD UABI to get reviewed, I keep having to rebase
the CS patch. Just land the compiler side for now to keep it from
diverging.
For now this covers just GLES 3.1 compute shaders, not CL kernels.
We're using ARB_debug_output for the main shader-db, but I had this env
var left around from the shader-db-2 support (vc4 apitrace-based). Keep
the env var around since it's nice sometimes to get the stats on a shader
you're optimizing without having to do a shader-db run, but drop the old
formatting that's not useful and keeps tricking me when I go to add
another measurement to the shader-db output.
Our exec masking introduces lots of redundant flags updates, and even
without that there will be cases where NIR comparisons on the same sources
for different reasons may generate the same comparison instruction before
the selection.
total instructions in shared programs: 6492930 -> 6460934 (-0.49%)
total uniforms in shared programs: 2117460 -> 2115106 (-0.11%)
total spills in shared programs: 4983 -> 4987 (0.08%)
total fills in shared programs: 6408 -> 6416 (0.12%)
We have a pass to lower global registers to locals and many drivers
dutifully call it. However, no one ever creates a global register ever
so it's all dead code. It's time we bury it.
Acked-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>