Because a LDS load is 2 separate loads on gfx10+ with wave64, a different wave
can write LDS in between and cause a non uniform result. Use v_readfirst_lane
instead of p_as_uniform because it cannot be copy propagated.
Fixes a OpenCL CTS test with zink+rusticl.
Totals from 136 (0.17% of 78196) affected shaders:
MaxWaves: 3236 -> 3244 (+0.25%)
Instrs: 130069 -> 131221 (+0.89%)
CodeSize: 698048 -> 703436 (+0.77%)
VGPRs: 5464 -> 5440 (-0.44%)
SpillSGPRs: 94 -> 96 (+2.13%)
Latency: 5361017 -> 5363781 (+0.05%); split: -0.00%, +0.05%
InvThroughput: 883010 -> 884100 (+0.12%)
SClause: 3822 -> 3821 (-0.03%); split: -0.05%, +0.03%
Copies: 14220 -> 14314 (+0.66%); split: -0.01%, +0.68%
Branches: 4549 -> 4551 (+0.04%)
PreSGPRs: 4934 -> 4940 (+0.12%)
PreVGPRs: 4666 -> 4655 (-0.24%)
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25973>
The usage after spilling is usually either the same as before or the
maximum.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25559>
This lets us measure optimizations without interference of waitcnt
instructions.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25559>
According to RadeonSI, this is required for preemption, user queues,
and we only have to wait for VS after streamout which should be more
performant.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25284>
First, we need to give the parent_instr field a unique name to be able to
replace with a helper. We have parent_instr fields for both nir_src and
nir_def, so let's rename nir_src::parent_instr in preparation for rework.
This was done with a combination of sed and manual fix-ups.
Then we use semantic patches plus manual fixups:
@@
expression s;
@@
-s->renamed_parent_instr
+nir_src_parent_instr(s)
@@
expression s;
@@
-s.renamed_parent_instr
+nir_src_parent_instr(&s)
@@
expression s;
@@
-s->parent_if
+nir_src_parent_if(s)
@@
expression s;
@@
-s.renamed_parent_if
+nir_src_parent_if(&s)
@@
expression s;
@@
-s->is_if
+nir_src_is_if(s)
@@
expression s;
@@
-s.is_if
+nir_src_is_if(&s)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24671>
Next part don't know whether p_end_with_regs args are loaded from
memory ops or not, need to wait it's done here.
Other memory load needs to be waited too like:
a = load_mem()
b = ...
if (...) {
wait_mem(a)
store_mem(a)
}
p_end_with_regs(b)
"a" still needs to be waited, otherwise next shader part regs may
be overwritten by unfinished memory loads.
Memory stores are waited too. When >=gfx10 and last VGT has no
parameter export, we need to wait all memeory stores done before
pos export (see ac_nir_export_position). So when merged shader
(ES+GS or VS+GS) is partially built, first stage needs to wait
all memory stores done, otherwise second stage don't know if
any memory stores pending before.
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Signe-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24973>
radeonsi may generate empty main shader or an empty exit block
for p_end_with_regs to jump to.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24973>
PS with epilog does not need to fix_exports. And radeonsi use
p_end_with_regs so does not have jump instruction at last.
radeonsi may also have exec restore instruction, so may break
before reach to p_end_with_regs.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24973>
radeonsi need to compact color export for ps epilog while radv does not.
radv will fill empty color slot, so won't affected by this change.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24973>
ps needs to handle wqm:
1. main part may compute with args from prolog in wqm mode, so
prolog need to compute these args in wqm mode too.
2. prolog and main part need to end with exact exec, so next
shader part which inherit previous shader part's exec won't
do valid job for helper threads
1 need p_end_with_regs to operate in wqm mode and itself can't
be exact, otherwise some move instruction added by it won't be
in wqm mode so helper threads' compute result is not passed to
next shader part as args.
2 is done by p_end_wqm added by finish_program automatically
after p_end_with_regs.
Piglit tests can trigger the problem:
1. gl-2.1-polygon-stipple-fs
a. ps prolog call discard_if
b. ps main pass wqm exec to epilog
c. ps epilog export color for discarded pixel
2. fs-fwidth-color.shader_test
a. ps prolog need to pass args computed in wqm mode
b. set p_end_with_regs to exact will end wqm mode before
the move instructions, so helper threads's result is not
passed to next shader part
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24973>
p_end_with_regs just partially end the program, next part need
exec mask to be set correctly. For example p_end_wqm will generate
a exec restore from WQM mode after p_end_with_regs.
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24973>