It is likely that we would need more changes, as this packet changed,
but this is enough to get basic tests running. Any additional support
will be handled with new commits.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
Content, structure and size would depend on the generation. Even if it
is needed at all.
So let's move it to the v3dvx files.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
As the packet CLIPPER_XY scaling, this needs to be computed on 1/64ths
of pixel, instead of 1/256ths of pixels.
As this is the usual values that we get from macros, we add manually a
v42 and v71 macro, and define a new helper (V3DV_X) to get the value
for the current hw version.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
Starting point for v71 version inclusion.
This just adds it as one of the versions to be compiled (on meson),
updates the v3dX/v3dv_X macros, and update the code enough to get it
compiling when building using the two versions. For any packet not
available on v71 we just provide a generic asserted placeholder of
generation not supported.
Any real v71 support will be implemented on following commits.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
So we can use it for ldunif(a) and avoid generating ldunif(a)rf which
can't be paired with conditional instructions.
shader-db (pi5):
total instructions in shared programs: 11357802 -> 11338883 (-0.17%)
instructions in affected programs: 7117889 -> 7098970 (-0.27%)
helped: 24264
HURT: 17574
Instructions are helped.
total uniforms in shared programs: 3857808 -> 3857815 (<.01%)
uniforms in affected programs: 92 -> 99 (7.61%)
helped: 0
HURT: 1
total max-temps in shared programs: 2230904 -> 2230199 (-0.03%)
max-temps in affected programs: 52309 -> 51604 (-1.35%)
helped: 1219
HURT: 725
Max-temps are helped.
total sfu-stalls in shared programs: 15021 -> 15236 (1.43%)
sfu-stalls in affected programs: 6848 -> 7063 (3.14%)
helped: 1866
HURT: 1704
Inconclusive result
total inst-and-stalls in shared programs: 11372823 -> 11354119 (-0.16%)
inst-and-stalls in affected programs: 7149177 -> 7130473 (-0.26%)
helped: 24315
HURT: 17561
Inst-and-stalls are helped.
total nops in shared programs: 273624 -> 273711 (0.03%)
nops in affected programs: 31562 -> 31649 (0.28%)
helped: 1619
HURT: 1854
Inconclusive result (value mean confidence interval includes 0).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In programs with a lot of unused temps, if we don't do this, we may
end up recycling previously used rfs more often, which can be
detrimental to instruction pairing.
total instructions in shared programs: 11464335 -> 11444136 (-0.18%)
instructions in affected programs: 8976743 -> 8956544 (-0.23%)
helped: 33196
HURT: 33778
Inconclusive result
total max-temps in shared programs: 2230150 -> 2229445 (-0.03%)
max-temps in affected programs: 86413 -> 85708 (-0.82%)
helped: 2217
HURT: 1523
Max-temps are helped.
total sfu-stalls in shared programs: 18077 -> 17104 (-5.38%)
sfu-stalls in affected programs: 8669 -> 7696 (-11.22%)
helped: 2657
HURT: 2182
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 11482412 -> 11461240 (-0.18%)
inst-and-stalls in affected programs: 8995697 -> 8974525 (-0.24%)
helped: 33319
HURT: 33708
Inconclusive result
total nops in shared programs: 298140 -> 296185 (-0.66%)
nops in affected programs: 52805 -> 50850 (-3.70%)
helped: 3797
HURT: 2662
Inconclusive result
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
The last 3 instructions can't use specific registers so flag all the
nodes for temps used in the last program instructions and try to
avoid assigning any of these. This may help us avoid injecting nops
for the last thread switch instruction.
Because regisster allocation needs to happen before QPU scheduling
and instruction merging we can't tell exactly what the last 3
instructions will be, so we do this for a few more instructions than
just 3.
We only do this for fragment shaders because other shader stages
always end with VPM store instructions that take an small immediate
and therefore will never allow us to merge the final thread switch
earlier, so limiting allocation for these shaders will never improve
anything and might instead be detrimental.
total instructions in shared programs: 11471389 -> 11464335 (-0.06%)
instructions in affected programs: 582908 -> 575854 (-1.21%)
helped: 4669
HURT: 578
Instructions are helped.
total max-temps in shared programs: 2230497 -> 2230150 (-0.02%)
max-temps in affected programs: 5662 -> 5315 (-6.13%)
helped: 344
HURT: 44
Max-temps are helped.
total sfu-stalls in shared programs: 18068 -> 18077 (0.05%)
sfu-stalls in affected programs: 264 -> 273 (3.41%)
helped: 37
HURT: 48
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 11489457 -> 11482412 (-0.06%)
inst-and-stalls in affected programs: 585180 -> 578135 (-1.20%)
helped: 4659
HURT: 588
Inst-and-stalls are helped.
total nops in shared programs: 301738 -> 298140 (-1.19%)
nops in affected programs: 14680 -> 11082 (-24.51%)
helped: 3252
HURT: 108
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
This commits adds the qpu definitions for several new v71
instructions.
Packing:
* vpack does a 2x32 to 2x16 bit integer pack
* v8pack: Pack 2 x 2x16 bit integers into 4x8 bits
* v10pack packs parts of 2 2x16 bit integer into r10g10b10a2.
* v11fpack packs parts of 2 2x16 bit float into r11g11b10 rounding
to nearest
Conversion to unorm/snorm:
* vftounorm8/vftosnorm8: converts from 2x16-bit floating point
to 2x8 bit unorm/snorm.
* ftounorm16/ftosnorm16: converts floating point to 16-bit
unorm/snorm
* vftounorm10lo: Convert 2x16-bit floating point to 2x10-bit unorm
* vftounorm10hi: Convert 2x16-bit floating point to one 2-bit and one 10-bit unorm
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In V3D 4.x we start at RF3 so that we allocate RF0-2 only if there
aren't any other RFs available. This is useful with small shaders to
ensure that our TLB writes don't use these registers because these are
the last instructions we emit in fragment shaders and the last
instructions in a program can't write to these registers, so if we do,
we need to emit NOPs.
In V3D 7.x the registers affected by this restriction are RF2-3, so we
choose to start at RF4.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In V3D 7.x we don't have accumulators which would not survive a thread
switch, so the only restriction is that ldvary can't be placed in the
second delay slot of a thread switch.
shader-db results for UnrealEngine4 shaders:
total instructions in shared programs: 446458 -> 446401 (-0.01%)
instructions in affected programs: 13492 -> 13435 (-0.42%)
helped: 58
HURT: 3
Instructions are helped.
total nops in shared programs: 19571 -> 19541 (-0.15%)
nops in affected programs: 161 -> 131 (-18.63%)
helped: 30
HURT: 0
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In 4.x it is not allowed to write to the register file in the last 3
instructions, but in 7.x we only have this restriction in the thread
end instruction itself, and only if the write comes from the ALU
ports.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
V3D 7.x added 'mov' opcodes to the ADD alu, so now it is possible to
move these to the ADD alu to facilitate merging them with other MUL
instructions.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
ldvary writes to rf0 implicitly, so we don't want to allocate rf0 to
any temps that are live across ldvary's rf0 live ranges.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
The rf variants need to encode the destination in the cond bits, which
prevents these to be merged with any other instruction that need them.
In 4.x, ldunif(a) write to r5 which is a special register that only
ldunif(a) and ldvary can write so we have a special register class for
it and only allow it for them. Then when we need to choose a register
for a node, if this register is available we always use it.
In 7.x these instructions write to rf0, which can be used by any
instruction, so instead of restricting rf0, we track the temps that
are used as ldunif(a) destinations and use that information to favor
rf0 for them.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In V3D 4.x only a couple of simultaneous accesses where allowed, but
V3D 7.x is a bit more flexible, so rather than trying to check for all
the allowed combinations it is easier to check if we are one of the
disallows.
Shader-db (pi5):
total instructions in shared programs: 11338883 -> 11307386 (-0.28%)
instructions in affected programs: 2727201 -> 2695704 (-1.15%)
helped: 12555
HURT: 289
Instructions are helped.
total max-temps in shared programs: 2230199 -> 2229260 (-0.04%)
max-temps in affected programs: 20508 -> 19569 (-4.58%)
helped: 608
HURT: 4
Max-temps are helped.
total sfu-stalls in shared programs: 15236 -> 15293 (0.37%)
sfu-stalls in affected programs: 148 -> 205 (38.51%)
helped: 38
HURT: 64
Inconclusive result (%-change mean confidence interval includes 0).
total inst-and-stalls in shared programs: 11354119 -> 11322679 (-0.28%)
inst-and-stalls in affected programs: 2732262 -> 2700822 (-1.15%)
helped: 12550
HURT: 304
Inst-and-stalls are helped.
total nops in shared programs: 273711 -> 274095 (0.14%)
nops in affected programs: 9626 -> 10010 (3.99%)
helped: 186
HURT: 397
Nops are HURT.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
As for v71 the payload registers are not the same. Specifically now
rf3 is used as payload register, so this is needed to avoid rf3 being
selected as a instruction dst by the register allocator, overwriting
the payload value that could be still used.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
For v42 and below ldunif/ldvary write both on r5, but with a different
delay, so we need to take that into account when scheduling both.
For v71 the register used is rf0, but the behaviour is the same. So
the scheduling code can be the same, but the comment needs update.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
V3D 4.x has pixel center W in rf0 and V3D 7.x has it in rf3. We already
account for this when we setup the c->payload_w, so use that.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
We are doing this for the ADD alu already and it may be helpful to
identify cases where we have QPU code with pack/unpack modifiers on
MUL opcodes that we then are not packing into the actual QPU
instructions.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In v3d 4.x there were restrictions based on the number of raddrs used
by the combined instructions, but we don't have these restrictions in
v3d 7.x.
It should be noted that while there are no restrictions on the number
of raddrs addressed, a QPU instruction can only address a single small
immediate, so we should be careful about that when we add support for
small immediates.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
ldvary writes rf0 implicitly on the next cycle so they would clash.
This case is not handled correctly by our normal dependency tracking,
which doesn't know anything about delayed writes from instructions
and thinks the rf0 write happens on the same cycle ldvary is emitted.
Fixes (v71):
dEQP-VK.glsl.conversions.matrix_to_matrix.mat2x3_to_mat4x2_fragment
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
In platforms that don't have accumulators and have implicit writes to
the register file we need to be careful and avoid assigning a physical
register to a temp that lives across an implicit write to that same
physical register.
For now, we have the case of implicit writes to rf0 from various
signals, but it should be easy to extend this to include additional
registers if needed.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>
They use the same opcodes, and switch between one and the other based
on raddr.
Note that the rule includes also if small_imm_a/b are used. That is
still not in place so that part is hardcoded. Would be updated later
when small immediates support for v71 gets implemented.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25450>