Classically, global code motion is also a dead code pass. However, in
the initial implementation, the decision was made to place every
instruction and let conventional DCE clean up the dead ones. Because
any uses of a dead instruction are unreachable, we have no late block
and the dead instructions are always scheduled early. The problem is
that, because we place the dead instruction early, it pushes the
placement of any dependencies of the dead instruction earlier than they
may need to be placed. In order prevent dead instructions from
affecting the placement of live ones, we need to delete them.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4636>
Now that the GCM pass is more conservative and only moves instructions
to different blocks when it's advantageous to do so, we can have a
proper notion of what it means to make progress.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4636>
We are about to adjust our instruction block assignment algorithm and we
will want to know the current block that the instruction lives in. In
order to allow for this, we can't overwrite nir_instr::block in the
early scheduling pass.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4636>
Inside nested flow control, nir_lower_returns inserts predicated breaks
in the outer block. However, it would omit doing this if the remainder
of the outer block (after the inner block) was empty. This is not
correct in the case of loops, as execution just wraps back around to the
start of the loop, so this change doesn't skip the predication inside
loops.
Fixes: 79dec93ead (nir: Add return lowering pass)
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2724
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4603>
This also makes spirv_to_nir a bit simpler because the new
nir_vector_insert helper automatically handles a constant component
selector like nir_vector_extract does.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4495>
We make some assumptions in opt_large_constants such as the size_align
function returning the obvious sizes for vectors. Now that we've got
the deref_size lying around, we may as well assert it's consistent with
our assumptions. In particular, we now assert that it really claims
booleans are 32-bit. If anyone's driver ever decides to be clever and
change this, we'll now catch the breakage earlier.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4468>
In f1883cc73d we tried to pass through alignments from load_constant
intrinsics when rewriting them to load_ubo in iris. However, those
intrinsics don't have ALIGN_MUL or ALIGN_OFFSET indices. It's easy
enough to add them. We just call the size/align function on the vector
type at the end of our deref chain and use the alignment returned from
there. It's possible we could do better by walking the whole deref
chain but this should be good enough.
Fixes: f1883cc73d "iris: Set alignments on cbuf0 and constant reads"
Closes: #2739
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4468>
I had missed that LDC actually uses vec4 units for its offset. This
means that we have to create a new instruction, and lower it in
ir3_nir_lower_io_offsets, similar to the existing SSBO instructions.
Unfortunately we can't assume that loads are always vec4-aligned, so we
have to use the alignment information that NIR gives us. Unfortunately,
it's currently woefully inadequate, and will have to be fixed to give us
good codegen in the future.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4568>
nir_cf_{extract,reinsert}() can't stitch a block together
if the block we are extracting ends in a jump but other jumps
nested in further ifs should be fine to move.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4477>
If a nir_variable is tagged with per_view, it must be an array with
size corresponding to the number of views. For slot-tracking, it is
considered to take just the slot for a single element -- drivers will
take care of expanding this appropriately.
This will be used to implement the ability of having per-view position
in a vertex shader in Intel platforms.
Acked-by: Rafael Antognolli <rafael.antognolli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2313>
We've had alignment parameters on these operations for a while but a
bunch of places weren't setting them. That should be resolved now so we
can start validating that they're always set.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4441>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4441>
These were clearly copied and pasted from SSBOs. The shared atomics
don't have an SSBO index so their offset is src0 and data is src1.
Fixes: ce9205c03b "nir: add a load/store vectorization pass"
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4367>
The algorithm we use for resolving parallel copy instructions plays this
little shell game with the values. The reason for this is that it lets
us handle cases where, for instance we have a -> b and b -> a and we
need to use a temporary to do a swap. One result of this algorithm is
that it tends to emit a lot of mov chains which are typcially really bad
for GPUs where a mov is far from free. For instance, it's likely to
turn this:
r16 = ssa_0; r17 = ssa_0; r18 = ssa_0; r15 = ssa_0
into this:
r15 = mov ssa_0
r18 = mov r15
r17 = mov r18
r16 = mov r17
which, if it's the only thing in a block (this is common for phis) is
impossible for a scheduler to fix because of the dependencies and you
end up with significant stalling. If, on the other hand, we only do the
chaining in the actual case where we need to free up a so that it can be
used as a destination, we can emit this:
r15 = mov ssa_0
r18 = mov ssa_0
r17 = mov ssa_0
r16 = mov ssa_0
which is far nicer to the scheduler. On Intel, our copy propagation
pass will undo the chain for us so this has no shader-db impact.
However, for less intelligent back-ends, it's probably a lot better.
Reviewed-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4412>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4412>
If the shader is not a tesselation shader, then writing to the tess
member of the shaderinfo union will overwrite other members and crash.
Closes: #2722
Fixes: f1dd81ae10 ("nir: Collect if shader uses cross-invocation or indirect I/O.")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4408>
If the expression tree that is being replaced has a unary operation at
its root, set the cursor (location where new instructions are inserted)
at the source instruction instead.
This doesn't do much now because there are very few patterns that have a
unary operation as the root. Almost all of the patterns that do have a
unary operation as the root have inot. All of the shaders that are
affected by this commit have expression trees with an inot at the root.
This change prevents some significant, spurious caused by the next
commit. There is further explanation in the large comment added in
the code.
I also considered a couple other options that may still be worth exploring.
1. Add some mark-up to the search pattern to denote where new
instructions should be added. I considered using "@" to denote the
cursor location. For example,
(('fneg', ('fadd@', a, b)), ...)
2. To prevent other kinds of unintended code motion, add the ability to
name expressions in the search pattern so that they can be reused in
the replacement. For example,
(('bcsel', ('ige', ('find_lsb=b', a), 0), ('find_lsb', a), -1), b),
An alternative would be to add some kind of CSE at the time of
inserting the replacements. Create a new instruction, then check to
see if it already exists. That option might be better overall.
Over the years I know Matt has heard me complain, "I added a pattern
that just deleted an instruction, but it added a bunch of spills!" This
was always in large, complex shaders that are very hard to analyze. I
always blamed these cases on the scheduler being dumb. I am now very
suspicious that unintended code motion was the real problem.
All Gen4+ Intel platforms had similar results. (Tiger Lake shown)
total instructions in shared programs: 17611405 -> 17611333 (<.01%)
instructions in affected programs: 18613 -> 18541 (-0.39%)
helped: 41
HURT: 13
helped stats (abs) min: 1 max: 18 x̄: 4.46 x̃: 4
helped stats (rel) min: 0.27% max: 5.68% x̄: 1.29% x̃: 1.34%
HURT stats (abs) min: 1 max: 20 x̄: 8.54 x̃: 7
HURT stats (rel) min: 0.30% max: 4.20% x̄: 2.15% x̃: 2.38%
95% mean confidence interval for instructions value: -3.29 0.63
95% mean confidence interval for instructions %-change: -0.95% 0.02%
Inconclusive result (value mean confidence interval includes 0).
total cycles in shared programs: 338366118 -> 338365223 (<.01%)
cycles in affected programs: 257889 -> 256994 (-0.35%)
helped: 42
HURT: 15
helped stats (abs) min: 2 max: 120 x̄: 39.38 x̃: 34
helped stats (rel) min: 0.04% max: 2.55% x̄: 0.86% x̃: 0.76%
HURT stats (abs) min: 6 max: 204 x̄: 50.60 x̃: 34
HURT stats (rel) min: 0.11% max: 4.75% x̄: 1.12% x̃: 0.56%
95% mean confidence interval for cycles value: -30.39 -1.02
95% mean confidence interval for cycles %-change: -0.66% -0.02%
Cycles are helped.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/1359>
We have the code to do the lowering, we were just missing the
boilerplate bits to make should_lower_int64_alu_instr return true.
Fixes: 62d55f1281 "nir: Wire up int64 lowering functions"
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4365>
In 87839680c0, a very subtle mistake was made with the CFG walking
recursion. Instead of setting the local has_nested_loop variable when
process child loops, has_nested_loop_out was passed directly into the
process_loop_in_block call. This broke nested loop detection heuristics
and caused loop unrolling to run massively out of control. In
particular, it makes the following CTS test compile virtually forever:
dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.struct_mixed_types.uniform_buffer_block_geom
Fixes: 87839680c0 "nir: Fix breakage of foreach_list_typed_safe..."
Closes: #2710
Reviewed-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4380>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4380>
By inserting a b2b1 around the load_ubo, load_input, etc. intrinsics
generated by nir_lower_io, we can ensure that the intrinsic has the
correct destination bit size. Not having the right size can mess up
passes which try to optimize access. In particular, it was causing
brw_nir_analyze_ubo_ranges to ignore load_ubo of booleans which meant
that booleans uniforms weren't getting pushed as push constants. I
don't think this is an actual functional bug anywhere hence no CC to
stable but it may improve perf somewhere.
Shader-db results on ICL with iris:
total instructions in shared programs: 16076707 -> 16075246 (<.01%)
instructions in affected programs: 129034 -> 127573 (-1.13%)
helped: 487
HURT: 0
helped stats (abs) min: 3 max: 3 x̄: 3.00 x̃: 3
helped stats (rel) min: 0.45% max: 3.00% x̄: 1.33% x̃: 1.36%
95% mean confidence interval for instructions value: -3.00 -3.00
95% mean confidence interval for instructions %-change: -1.37% -1.29%
Instructions are helped.
total cycles in shared programs: 338015639 -> 337983311 (<.01%)
cycles in affected programs: 971986 -> 939658 (-3.33%)
helped: 362
HURT: 110
helped stats (abs) min: 1 max: 1664 x̄: 97.37 x̃: 43
helped stats (rel) min: 0.03% max: 36.22% x̄: 5.58% x̃: 2.60%
HURT stats (abs) min: 1 max: 554 x̄: 26.55 x̃: 18
HURT stats (rel) min: 0.03% max: 10.99% x̄: 1.04% x̃: 0.96%
95% mean confidence interval for cycles value: -79.97 -57.01
95% mean confidence interval for cycles %-change: -4.60% -3.47%
Cycles are helped.
total sends in shared programs: 815037 -> 814550 (-0.06%)
sends in affected programs: 5701 -> 5214 (-8.54%)
helped: 487
HURT: 0
LOST: 2
GAINED: 0
The two lost programs were SIMD16 shaders in CS:GO. However, CS:GO was
also one of the most helped programs where it shaves sends off of 134
programs. This seems to reduce GPU core clocks by about 4% on the first
1000 frames of the PTS benchmark.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4338>