First, we need to give the parent_instr field a unique name to be able to
replace with a helper. We have parent_instr fields for both nir_src and
nir_def, so let's rename nir_src::parent_instr in preparation for rework.
This was done with a combination of sed and manual fix-ups.
Then we use semantic patches plus manual fixups:
@@
expression s;
@@
-s->renamed_parent_instr
+nir_src_parent_instr(s)
@@
expression s;
@@
-s.renamed_parent_instr
+nir_src_parent_instr(&s)
@@
expression s;
@@
-s->parent_if
+nir_src_parent_if(s)
@@
expression s;
@@
-s.renamed_parent_if
+nir_src_parent_if(&s)
@@
expression s;
@@
-s->is_if
+nir_src_is_if(s)
@@
expression s;
@@
-s.is_if
+nir_src_is_if(&s)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24671>
If we spill render targets with a layered framebuffer, our spilled targets are
assumed to be 2D Arrays (in general). We need to use arrayed image operations to
load/store from these. The layer is given by the layer as read in the fragemnt
shader. This handles the eMRT portion of layered rendering.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Just add a flag for it. We don't care about the actual # of layers when
calculating the layout, only the boolean fact of being layered or not. The
reason we need this at all is because the eMRT implementation needs to
account for layering and that is only keyed off the tilebuffer layout.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
The hardware needs the layer ID and the viewport index packed together. That
consumes an entire varying slot, if we want those available in the frag shader
we need a separate slot. Add a pass to insert the extra packed write.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
With the exception of the backwards branch for loops, all the control flow we
insert during instruction selection just predicates instructions rather than
actually jumping around. That means, for example, we execute both sides of the
if even for a uniform condition! That's inefficient. The solution is insert
jmp_exec_none instructions after control flow in order to skip unexecuted
regions, which is much faster than predicating them out. However, jmp_exec_none
is costly in itself, so we need to use a heuristic to determine when it's
actually beneficial.
This uses a very simple heuristic for this purpose. However, it is a massive
performance speed-up for Dolphin uber shaders: 39fps -> 67fps at 2x resolution.
Nearly a doubling of performance!
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
jmp_exec_none variant that jumps to the last instruction of the target block,
rather than the beginning. This is convenient for skipping over elses, while
still executing the block-final pop_exec instruction. Similarly for skipping
over loop bodies while still executing the block-final pop_exec, after break
instructions.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Add an optional pointer to a target block for these instructions. This does NOT
act like a logical branch, and does NOT get added to the logical control flow.
It is ignored wholesale until after RA, when physical edges may be inserted by a
pass we add later in this series.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Although this is well-motivated, perf effect seems to be neglible for Dolphin.
It does prevent the scheduler from making things worse by sinking these
instructions though, so as a way to prevent future problems this seems sensible.
The kind of problem this affects (late discard) isn't modelled in shader-db.
Nevertheless, nothing concerning there:
total instructions in shared programs: 1756699 -> 1756722 (<.01%)
instructions in affected programs: 10106 -> 10129 (0.23%)
helped: 21
HURT: 41
Inconclusive result (value mean confidence interval includes 0).
total bytes in shared programs: 11525404 -> 11525452 (<.01%)
bytes in affected programs: 72900 -> 72948 (0.07%)
helped: 27
HURT: 41
Inconclusive result (value mean confidence interval includes 0).
total halfregs in shared programs: 483394 -> 483286 (-0.02%)
halfregs in affected programs: 4945 -> 4837 (-2.18%)
helped: 88
HURT: 78
Inconclusive result (value mean confidence interval includes 0).
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This fixes live range splitting with 3D textureGrad(), which involves vectors
larger than the natural 128-bit maximum and hence requires special handling.
Fixes this assert with a combination of debug flags and new patches:
unsigned int find_best_region_to_evict(struct ra_ctx *, unsigned int,
unsigned int *, unsigned int *):
Assertion `(rctx->bound % size) == 0 && "register file size must be aligned
to the maximum vector size"' failed
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
When rendering to a layered depth/stencil attachment, we specify the layer
stride in pages. That means that depth/stencil targets must be page-aligned to
be rendered to correctly.
If we're merely sampling, not rendering, we do not need the extra alignment. So
we add a flag to handle this case so we keep passing the generated ail tests.
Fixes KHR-GLES31.core.texture_cube_map_array.color_depth_attachments
Similarly, we page-align colour attachments. I don't have a good theoretical
justification for this part, but it seems to be necessary and layered rendering
fails otherwise. Possibly the PBE requires page-aligned layers unconditionally?
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
This lets us save a LOT of instructions at the cost of increased register
pressure. However, on my shader-db, this is still coming out ahead since no
shaders are hurt for thread count/spills, and only 1/10 of the shaders helped
for instruction count are hurt for register pressure. The shaders most hurt
for pressure have very low pressure (7 -> 15 is the worst case) and you need a
certain number of registers to use a 4 source instruction at all. Analyzing the
hurt shaders, nothing concerns me too much ... this isn't as bad as I feared.
So I think at this point it's worth ripping off the bandage, given the massive
potential for instruction count win. This is a big improvement for some of the
shaders I'm working on for my $SECRET_PROJECT.
total instructions in shared programs: 1784943 -> 1775169 (-0.55%)
instructions in affected programs: 644211 -> 634437 (-1.52%)
helped: 3498
HURT: 38
Instructions are helped.
total bytes in shared programs: 11720734 -> 11643224 (-0.66%)
bytes in affected programs: 4370986 -> 4293476 (-1.77%)
helped: 3572
HURT: 36
Bytes are helped.
total halfregs in shared programs: 474094 -> 475165 (0.23%)
halfregs in affected programs: 12821 -> 13892 (8.35%)
helped: 65
HURT: 247
Halfregs are HURT.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
Simple greedy thing that has the potential to inflate register pressure but
reduces instructions. Thanks to the recent loop work that turns if { break }
into while_icmp, this also implicitly handles fusing conditions into loops,
which is what actually prompted this.
Surprisingly, this helps register pressure on my shader-db (no change to thread
count), I guess by eliminating the boolean temps in case where the sources are
used multiple times.
total instructions in shared programs: 1786561 -> 1784943 (-0.09%)
instructions in affected programs: 128557 -> 126939 (-1.26%)
helped: 474
HURT: 13
Instructions are helped.
total bytes in shared programs: 11733236 -> 11720734 (-0.11%)
bytes in affected programs: 976034 -> 963532 (-1.28%)
helped: 521
HURT: 13
Bytes are helped.
total halfregs in shared programs: 474245 -> 474094 (-0.03%)
halfregs in affected programs: 1869 -> 1718 (-8.08%)
helped: 28
HURT: 7
Halfregs are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
Apple doesn't do this, but it should be equivalent and it makes it easier to see
that we can use while_icmp for break_if_icmp in loops that don't use continue
(which Apple does do). So, the effect of this commit is to use while_icmp for
most breaks, which saves an instruction.
total instructions in shared programs: 1764199 -> 1764076 (<.01%)
instructions in affected programs: 24149 -> 24026 (-0.51%)
helped: 78
HURT: 0
Instructions are helped.
total bytes in shared programs: 11609306 -> 11608322 (<.01%)
bytes in affected programs: 164604 -> 163620 (-0.60%)
helped: 78
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
The only role of the while_icmp at the end of a NIR loop is to make continue
jumps work. If, after emitting the loop, we learn that there are no continues,
there is no need to insert a while_icmp since it would be a no-op anyway.
total instructions in shared programs: 1764311 -> 1764199 (<.01%)
instructions in affected programs: 26321 -> 26209 (-0.43%)
helped: 82
HURT: 0
Instructions are helped.
total bytes in shared programs: 11609978 -> 11609306 (<.01%)
bytes in affected programs: 178842 -> 178170 (-0.38%)
helped: 82
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
In general, loops need a push_exec at the start for correctness. However, a
push_exec at the top level (non-nested) is a no-op, so we can omit and save a
few cycles.
total instructions in shared programs: 1764350 -> 1764311 (<.01%)
instructions in affected programs: 7339 -> 7300 (-0.53%)
helped: 36
HURT: 0
Instructions are helped.
total bytes in shared programs: 11610212 -> 11609978 (<.01%)
bytes in affected programs: 48638 -> 48404 (-0.48%)
helped: 36
HURT: 0
Bytes are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
Search for code like
if ... {
break
}
and replace with a break_if pseudo-instruction for optimized handling, since the
break_if lowering is better than the original code.
total instructions in shared programs: 1764596 -> 1764350 (-0.01%)
instructions in affected programs: 24540 -> 24294 (-1.00%)
helped: 78
HURT: 0
Instructions are helped.
total bytes in shared programs: 11611196 -> 11610212 (<.01%)
bytes in affected programs: 166458 -> 165474 (-0.59%)
helped: 78
HURT: 0
Bytes are helped.
shader-db probably understates the benefit here, since this optimizes the body
of loops.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
We use it for two different things. Pseudo-instructions are cheap, split it up
for easier optimization passes. This also fixes the schedule classes.. we can
move the cf_begin around if we want, it's inert.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
They're more trouble than they're worth for us. They were originally lifted
unthinkingly from ACO, where I assume they're necessary for software CF
lowering, but they're just an inconvenient convenience for us. Remove em.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>
If we bound 4 render targets but we only write to 1 of them, the other 3 need
their contents preserved. This requires either properly configuring HSR to
implement colour masking (TODO) or using the big hammer of setting TRANSLUCENT.
This patch picks the latter for now.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25052>