To avoid dragging gl.h into places it has no business being,
defined tessellation primitive mode to an enum.
This has a lot of fallout all over the place.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14605>
Doing this for ir3 required adding a struct for limits of how much base to
fold in (which NTT wants as well for its case of shared vars), otherwise
the later work to lower to the 1<<9 word limit would emit more
instructions.
The shader-db results are that sometimes the reduction in NIR instruction
count results in the fewer sampler prefetches due to the shader being
estimated to be shorter (dota2, nexuiz):
total instructions in shared programs: 8996651 -> 8996776 (<.01%)
total cat5 in shared programs: 86561 -> 86577 (0.02%)
Reviewed-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14023>
- gen4 - has dp4acc and dp2acc, dp4acc is used to implement
4x8 dot product.
- gen3 - has dp2acc, in OpenCL blob uses dp2acc for dot product
on both get3 and gen4.
- gen2 - unknown, lower everything.
- gen1 - no dp2acc, lower everything. OpenCL blob doesn't advertise
cl_qcom_dot_product8 but still generates code for it.
The assembly is more verbose and uses yet to be documented
mad32.u16 instruction.
Passes:
dEQP-VK.spirv_assembly.instruction.compute.opsdotkhr.*
dEQP-VK.spirv_assembly.instruction.compute.opudotkhr.*
dEQP-VK.spirv_assembly.instruction.compute.opsudotkhr.*
dEQP-VK.spirv_assembly.instruction.compute.opsdotaccsatkhr.*
dEQP-VK.spirv_assembly.instruction.compute.opudotaccsatkhr.*
dEQP-VK.spirv_assembly.instruction.compute.opsudotaccsatkhr.*
Only packed 4x8 unsigned and mixed versions are accelerated.
However in theory we should be able to do better for signed version
than current NIR lowering.
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13986>
* shrm - (src2 >> src1) & src3
* shlm - (src2 << src1) & src3
* shrg - (src2 >> src1) | src3
* shlg - (src2 << src1) | src3
* andg - (src2 & src1) | src3
* dp2acc - dot product of two {i,u}8vec2 packed into
SRC1 and SRC2, added to 32b SRC3
* dp4acc - dot product of two {i,u}8vec4 packed into
SRC1 and SRC2, added to 32b SRC3
* wmm - vec4(x_1, x_2, x_3, x_4) * (y_1 + y_2 + y_3 + y_4), which is
duplicated (1 << (SRC3 / 32)) times starting from DST register
* wmm.accu - same as wmm but result is added to DST registers, however
the first reg in each vec4 result is overwritten instead of
accumulating.
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13986>
This allows the wavesize to be controlled per-shader. This will be used
by VK_EXT_subgroup_size_control, and freedreno will also need it if
legacy ARB_shader_ballot is to be supported (since it forces a wavesize
of 64 or less).
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13960>
If we have a compute shader that has a big workgroup, a barrier, and
a branchstack which limits max_waves - this may result in a situation
when we cannot run concurrently all waves of the workgroup, which
would lead to a hang.
Blob just explodes in such case.
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14110>
If barriers are used, it must be possible for all waves in the workgroup
to execute concurrently. Thus we may have to reduce the registers limit.
Fixes a hang in "Digital Combat Simulator".
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14110>
The blob uses *both* nops and (ss). It turns out that in some rare cases
the hardware does take more than 6 cycles, at least for movmsk, but
adding nops is unnecessary. I believe the extra nops are only there due
to the immaturity of the blob's implementation of subgroup ops, so we
don't have to copy them - just handle shared reg producers the same as
SFU instructions.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14246>
This now covers e.g. cat6 instructions as well, and ss will cover
instructions writing shared regs as well. This is split out from the
previous change to avoid too much churn and shouldn't cause any
functional changes.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14246>
Add new centralized functions which will replace the various places we
hardcode 10 for the number of (ss) nops, add numbers for soft (sy) nops
based on similar computerator experiments with ldc, sam, and ldib (the
most common (sy) producers), and add a "systall" metric which is
analogous to sstall. This also fixes some cases where we'd erroniously
count ldl* as (sy) producers instead of (ss) producers when calculating
sstall.
This only switches over the metric reporting to the new functions, so
there is no behavior change. The following commit will switch over
the rest of the compiler.
While we're at it, remove max_sun as it's never set.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14246>
After some experimentation with computerator, it seems on a618 that
writing a full register and then reading half of it as a half register
requires a delay of 6, the same as the delay for cat5/cat6 sources. The
other direction only has a delay of 5, but just bump it unconditionally
out of an abundance of caution.
Fixes: 890de1a436 ("ir3/delay: Fix full->half and half->full delay")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14246>
If we're allocating a source then we force is_killed to false, not to
true. Fixes a regression in
dEQP-GLES31.functional.synchronization.in_invocation.image_atomic_write_read
later.
Fixes: 0ffcb19b9d ("ir3: Rewrite register allocation")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14246>
Without this, a cloned instruction that takes full regs will trigger an
ir3_validate assert. This can happen, for ex, if an instruction that
writes p0.x and has a relative src gets cloned in ir3_sched.
Fixes an assert in Genshin Impact with a debug build.
Fixes: 9af795d9b9 ("ir3: Make ir3_instruction::address a normal register")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14231>
The variant may include a lowered gl_Clip/CullDistance array. So we have
to use the variant's info (which is not available). However we save off
the clip/cull masks already, so just reuse those.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13891>
We expose the compact array cap, which means that we get compact
clipdist arrays. Indicate this to the lowering pass so that it works for
gl_ClipDistance from fs, among others.
Fixes, among others, on a420,
tests/spec/glsl-1.30/execution/clipping/fs-clip-distance-interpolated.shader_test
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13891>
These macros expand to a mov in an if statement which breaks address
assumption that instruction which produces address and consumes it
are in the same block.
Fixes test:
dEQP-VK.subgroups.ballot_broadcast.framebuffer.subgroupbroadcast_intvertex
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13931>
I was trying to be clever here, skipping ahead to the newly-created
block and processing the remaining instructions after the split in the
same loop. But if the last instruction in a block was lowered, the saved
next instruction would be the head of the block before the split, not
the new block, and we would compare it to the new block so we wouldn't
stop like we were supposed to. Stop being so clever, and just restart
processing with the new block after lowering an instruction.
Because we're wrapping the actual transform in yet another loop, and the
restarting logic is a bit tricky, refactor the actual lowering into a
separate lower_instr function. Otherwise we'd be mixing the two and
indenting the actual logic even more.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13928>
Only for a6xx since we don't know the instructions for global
atomics on previous gens. Per Qualcomm's docs in OpenCL atomics
are only supported since a5xx together with Generic memory space.
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8717>
Separating atomic opcodes makes possible to express a6xx global
atomics which take iova in SRC1. They would be needed by
VK_KHR_buffer_device_address.
The change also makes easier to distiguish atomics in conditions.
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8717>
If the source and destination were within the same full register, like
hr90.x and hr90.y (which both map to r45.x), then we'd perform the
swap/copy with the wrong register. This broke
dEQP-VK.ssbo.phys.layout.random.16bit.scalar.35 once BDA is enabled.
Fixes: 0ffcb19b9d ("ir3: Rewrite register allocation")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13818>
This is required by
dEQP-VK.ssbo.phys.layout.random.all_shared_buffer.47, where we need to
spill a lot of pointers due to NIR CSE being a little too aggressive and
creating a large register pressure across basic blocks, too large to fit
within the boundaries of ldp/stp offsets.
Note that this will be a lot more difficult with support for "real
functions" because the base register will become unknown at compile
time. However this hack gets things working for the time being.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13818>
This would've caught the previous issue earlier. We checked that the
physreg made sense when inserting via ra_file_insert() but not
ra_push_interval() which is used for live-range splitting.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13818>
Analogous to the pre-RA scheduler. Unfortunately this time it's a bit
more involved because we have to correctly handle (rptN), which is
already relevant for swz. This means we need the index of the
destination register that conflicts with the source register, to handle
swz, and we need to expose that part of ir3_delay. But once that's done,
we can delete ir3_delay_calc_postra.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13722>
We have a situation in some skia shaders like:
add.f r0.x, ...
(rpt2)nop
mul.f ..., r0.x
sam (xyzw) r0.x, ...
rcp ..., r0.x
Notice that rcp uses the result of the sam instruction, not the add.f,
but we didn't keep track of which instructions kill the sources in
ir3_delay, so we'd add an extra nop, resulting in a disagreement betwen
ir3_delay and the scheduling graph. Since postsched is correct, fix
ir3_delay. This only results in some very slight shader-db changes but
keeps the next commit from changing things.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13722>
We want to model soft dependencies, but because of how there's only a
single bit to wait on all of them, there may be unnecessary delays
inserted when a (sy)-consumer follows an unrelated (sy)-producer.
Previously there was some code to try to work around this, but we can
just model it directly using the sfu_delay and tex_delay cycle counts
that we have to maintain anyway and delete it.
This also gets rid of the calls to ir3_delay_postra with soft=true which
would be more complicated to handle in the next commit.
There is a functional change here: the idea of preferring less nop's
over critical path length (max_delay) up to 3 nops is kept (and we
delete the TODO which is already sort-of resolved by it), but delays due
to (ss)/(sy) and nops are now treated equally, rather than always
preferring nops over syncs. So if our estimate indicates that scheduling
an (ss) consumer will result in a wait of one cycle and there's another
instruction that will require one nop, we will treat them otherwise
equal and choose based on max_delay instead. This results in more
sstall, but the decrease in nops is much greater.
total nops in shared programs: 376613 -> 345482 (-8.27%)
nops in affected programs: 275483 -> 244352 (-11.30%)
helped: 3226
HURT: 110
helped stats (abs) min: 1 max: 78 x̄: 9.73 x̃: 7
helped stats (rel) min: 0.19% max: 100.00% x̄: 19.48% x̃: 13.68%
HURT stats (abs) min: 1 max: 16 x̄: 2.43 x̃: 2
HURT stats (rel) min: 0.00% max: 150.00% x̄: 13.34% x̃: 4.36%
95% mean confidence interval for nops value: -9.61 -9.06
95% mean confidence interval for nops %-change: -19.01% -17.78%
Nops are helped.
total sstall in shared programs: 126195 -> 133806 (6.03%)
sstall in affected programs: 79440 -> 87051 (9.58%)
helped: 300
HURT: 1922
helped stats (abs) min: 1 max: 15 x̄: 4.72 x̃: 4
helped stats (rel) min: 1.05% max: 100.00% x̄: 17.15% x̃: 14.55%
HURT stats (abs) min: 1 max: 29 x̄: 4.70 x̃: 4
HURT stats (rel) min: 0.00% max: 900.00% x̄: 25.38% x̃: 10.53%
95% mean confidence interval for sstall value: 3.22 3.63
95% mean confidence interval for sstall %-change: 17.50% 21.78%
Sstall are HURT.
total (ss) in shared programs: 35190 -> 35472 (0.80%)
(ss) in affected programs: 6433 -> 6715 (4.38%)
helped: 163
HURT: 401
helped stats (abs) min: 1 max: 2 x̄: 1.06 x̃: 1
helped stats (rel) min: 1.92% max: 33.33% x̄: 11.53% x̃: 10.00%
HURT stats (abs) min: 1 max: 3 x̄: 1.13 x̃: 1
HURT stats (rel) min: 1.56% max: 100.00% x̄: 15.33% x̃: 12.50%
95% mean confidence interval for (ss) value: 0.41 0.59
95% mean confidence interval for (ss) %-change: 6.22% 8.93%
(ss) are HURT.
total (sy) in shared programs: 13476 -> 13521 (0.33%)
(sy) in affected programs: 669 -> 714 (6.73%)
helped: 30
HURT: 78
helped stats (abs) min: 1 max: 2 x̄: 1.13 x̃: 1
helped stats (rel) min: 4.00% max: 50.00% x̄: 21.22% x̃: 21.11%
HURT stats (abs) min: 1 max: 2 x̄: 1.01 x̃: 1
HURT stats (rel) min: 3.45% max: 100.00% x̄: 31.93% x̃: 25.00%
95% mean confidence interval for (sy) value: 0.23 0.60
95% mean confidence interval for (sy) %-change: 11.19% 23.15%
(sy) are HURT.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13722>
The old code walked the instructions between each ready instruction and
each of its parents for every instruction, which can quickly become
accidentally quadratic. Instead we keep track of the current
"instruction pointer" of the to-be-scheduled instruction, and for each
ready instruction calculate an "earliest possible IP" which is the IP
that needs to be reached before we can schedule it. Because this stays
constant as soon as an instruction becomes ready, we never have to
recompute it and each call to ir3_delay_calc_prera() becomes a simple
comparison and subtract. We only need to iterate over the children and
update their earliest_ip when scheduling an instruction, and we already
do that in util_day_prune_head() so it should be cheap.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13722>
The function would return both the 3d and array flags set for 2d array,
and would return just 3d for cubes. Fix the flags so that they are
appropriate for images.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13804>