In our backend, the successor edges from the blocks only point to where
QPU control flow goes, not where the notional control flow goes from a
"break" or "continue" modifying the execution mask to resume writing to
some channels later. As a result, this attempt at restricting live ranges
ended up missing the live range of a value where a conditional
break/continue was present in a loop before the later def of a variable.
The previous commit ended up fixing the problem that the flag tried to
solve.
Fixes glsl-vs-loop-continue.shader_test and/or
glsl-vs-loop-redundant-condition.shader_test based on register allocation
results.
In the backend, we often have condition codes on writes to variables, such
that there's no screening def anywhere and the previous live ranges
algorithm would conclude that the start of the range extends to the start
of the program. However, we do know that the live range can only extend
as early as you can reach from all blocks writing to the variable.
The motivation was that, while we have a couple of hacks to try to promote
conditional writes up to being a def within the block, the exec_mask one
was broken and needed a replacement.
Based on c3c1aa5aeb ("intel/fs: Restrict live intervals to the subset
possibly reachable from any definition.").
If we have a MOV of a uniform value available to spill, that's one of our
best choices. We can just not spill the value, and emit a new load of the
uniform as the fill. This saves bothering the TMU and the thrsw, and is
the same cost in uniforms (since the spill offset is a uniform anyway).
This doesn't have a huge impact on shader-db, since there aren't a whole
lot of spills and we usually copy-prop the uniforms at the VIR level such
that the only uniform MOVs are from vir_lower_uniforms:
total instructions in shared programs: 6430292 -> 6430279 (<.01%)
total uniforms in shared programs: 2386023 -> 2385787 (<.01%)
total spills in shared programs: 4961 -> 4960 (-0.02%)
total fills in shared programs: 6352 -> 6350 (-0.03%)
However, I'm interested in dropping the uniforms copy-prop in the backend,
since it would be cheaper to not load repeated uniforms if we have the
registers to spare. This also saves many spills on
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20, which is what
motivated a bunch of my recent backend work in the first place:
before: 46 spills, 106 fills, 3062 instructions
after: 0 spills, 0 fills, 2611 instructions
For V3D 3.x, we emitted the ldvpms all at the top so that we didn't need
to do VPM setup when the load_inputs are out of order. For V3D 4.x, we
can reduce register pressure by delaying our loads until they're actually
needed. This also avoids a bunch of silly MOVs in the pre-opt VIR dump.
total instructions in shared programs: 6421415 -> 6419933 (-0.02%)
total uniforms in shared programs: 2393139 -> 2393140 (<.01%)
total threads in shared programs: 153864 -> 153906 (0.03%)
The execute.file check used to be good enough, until I stopped setting up
the execute mask for uniform ifs.
No known tests fixed, noticed while doing a refactor.
Fixes: 0805060573 ("v3d: Handle dynamically uniform IF statements with uniform control flow.")
Now that we don't have the vir_PF() magic, it's obvious that we were doing
the wrong thing for f2b32 by allowing -0.0 to produce true instead of
false.
You were allowed to pass in any old temp so that you could hopefully fold
the PF up into the def of the temp. If we couldn't find one, it
implicitly generated a MOV(nop, reg). However, that PF could have
different behavior depending on whether the def being folded into was a
float or int opcode, which the caller doesn't necessarily control.
Due to the fragility of the function, just switch all callers over to
vir_set_pf(). This also encourages the callers to use a _dest call for
the inst they're putting the PF on, eliminating a bunch of temps in the
pre-optimization VIR.
shader-db says the change is in the noise:
total instructions in shared programs: 6226247 -> 6227184 (0.02%)
instructions in affected programs: 851068 -> 852005 (0.11%)
Both were doing the same thing to try to get a condition to predicate on.
Noticed when I wanted to do this for discard_if as well.
No change in shader-db.
The NIR lowering works fine, though it causes some slight noise due to
what looks like choices about propagating constants up multiply chains
changing.
total instructions in shared programs: 6229671 -> 6229820 (<.01%)
total uniforms in shared programs: 2312171 -> 2312324 (<.01%)
Apparently we need disable-EZ flagged, not just "does Z writes".
Fixes
dEQP-GLES31.functional.image_load_store.early_fragment_tests.no_early_fragment_tests_depth_fbo
on 7278, even though it passed in simulation.
Signed-off-by: Eric Anholt <eric@anholt.net>
Fixes: 051a41d3d5 ("v3d: Add support for the early_fragment_tests flag.")
I had a single function for "does this do float input unpacking" with two
major flaws: It was missing the most common thing to try to copy propagate
a f32 input nunpack to (the VFPACK to an FP16 render target) along with
several other ALU ops, and also would try to propagate an f32 unpack into
a VFMUL which only does f16 unpacks.
instructions in affected programs: 659232 -> 655895 (-0.51%)
uniforms in affected programs: 132613 -> 135336 (2.05%)
and a couple of programs increase their thread counts.
The uniforms hit appears to be a pattern in generated code of doing (-a >=
a) comparisons, which when a is abs(b) can result in the abs instruction
being copy propagated once but not fully DCEed.
If you only bound rt 1+, we'd still emit a write to the rt0 that isn't
present (noticed while debugging an
ext_framebuffer_multisample-alpha-to-coverage-no-draw-buffer-zero
regression in another change).
This was copy-and-paste fail, that oddly showed up in the CTS's
reinterprets of r32f, rgba8, and srgba8 to rgba8i, but not r32ui and r32i
to rgba8i or reinterprets to other signed int formats.
Fixes: 6281f26f06 ("v3d: Add support for shader_image_load_store.")
We get a payload for the ivec3 workgroup and an int local invocation
index, and we use the core lowering to turn into the global invocation id
and the local invocation id ivec3s.
This is only exposed on V3D 4.1+, because we didn't have the TMU write
operations for images on 3.3 (To do GLES 3.1 there, you have to lower it
to SSBO load/stores, which is a problem to solve later).
We've been relying on linking splitting up our varying matrices into
separate vectors, but with SSO that doesn't happen. Supporting matrix
inputs isn't too hard, though.
V3D returns the texels in a different order in the resulting vec4 from
what GLSL wants, so we need to put in a swizzle. Fixes
dEQP-GLES31.functional.texture.gather.basic.2d.rgba8.base_level.level_1
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Even without any clever optimization on the unpack operations, this gives
us a useful value for the channels read field, which we can use to avoid
ldtmu instructions to the no-op register.
instructions in affected programs: 890712 -> 881974 (-0.98%)
We can pull a whole vector in a single indirect load. This saves a bunch
of round-trips to the TMU, instructions for setting up multiple loads,
references to the UBO base in the uniforms, and apparently manages to
reduce register pressure as well.
instructions in affected programs: 3086665 -> 2454967 (-20.47%)
uniforms in affected programs: 919581 -> 721039 (-21.59%)
threads in affected programs: 1710 -> 3420 (100.00%)
spills in affected programs: 596 -> 522 (-12.42%)
fills in affected programs: 680 -> 562 (-17.35%)
Improves 3dmmes performance by 2.29312% +/- 0.139825% (n=5)
In the process of adding support for SSBOs and CS shared vars, I ended up
needing a helper function for doing TMU general ops. This helper can be
that starting point, and saves us a bunch of round-trips to the TMU by
loading a vector at a time.
Before, I had per-stage entryoints with some helpers shared between them.
As I extended for compute shaders and shader-db, it turned out that the
other common code in the middle wanted to be shared too.
Loops will be trickier, since we need some analysis to figure out if the
breaks/continues inside are uniform. Until we get that in NIR, this gets
us some quick wins.
total instructions in shared programs: 6192844 -> 6174162 (-0.30%)
instructions in affected programs: 487781 -> 469099 (-3.83%)
I wanted to reuse the comparison stuff for nir_ifs, but for that I just
want the flags and no destination value. Splitting the conditions from
the destinations ended up cleaning the existing code up, anyway.
We can just look at the MSF flags -- if they're unset, then we're
definitely in a helper invocation. Fixes
dEQP-GLES31.functional.shaders.helper_invocation.* with GLES3.1 enabled.