The intrinsic produces a vec2, so let's honor that and avoid the weird
lowering to scalar and later reconstruction to vec2 when we find
load vulkan descriptor intrinsics.
It fixes tests like this (which require that we expose KHR_spirv_1_4):
dEQP-VK.spirv_assembly.instruction.spirv1p4.opptrequal.null_comparisons_ssbo_equal
that otherwise produce bad code that tries to access a vec2 from the result of
that intrinsic, leading to NIR validation errors.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11257>
Be consistent with other usages in Vulkan and SPIR-V, and the recently
added workgroup_size field.
Acked-by: Emma Anholt <emma@anholt.net>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11190>
We were using the num_components to infer it, but in the end it is
VEC2 for CMPXCHG and 32BIT for anything else.
This doesn't affect any test with the real hw, but fixes an assert
with the last version of the simulator.
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11039>
This was added with VK_KHR_device_group and allows users to specify
a base offset that will be automatically added to gl_WorkGroupID.
Unfortunately, V3D doesn't support this natively, so we need to add
the base to the workgroup id generated by hardware manually. For this,
we inject add instructions that source from a QUNIFORM that will
retrieve the actual dispatch base from the compute job when it is
dispatched.
Since a compute shader can be dispatched with CmdDispatch and/or
CmdDispatchBase, we always need to add these additional add
instructions and use a base of (0,0,0) for regular dispatches.
Since we don't support any version of OpenGL with this dispatch
base functionality we can avoid the extra instructions there.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11037>
In Vulkan we configure our integer RTs to clamp automatically, so with logic
operations we need to be careful and avoid overflows by discarding any bits
that won't fit in the RT component size.
Fixes remaining CTS test failures in:
dEQP-VK.pipeline.logic_op.*
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10801>
This avoids debug builds to assert crash. Components that don't exist
won't be used and will be eventually DCEd, so simply lower them to 0.
Fixes CTS tests like these in debug builds:
dEQP-VK.pipeline.logic_op.r8_uint.clear
dEQP-VK.pipeline.logic_op.r8_uint.and
dEQP-VK.pipeline.logic_op.r8_uint.and_reverse
dEQP-VK.pipeline.logic_op.r8_uint.copy
dEQP-VK.pipeline.logic_op.r8_uint.and_inverted
dEQP-VK.pipeline.logic_op.r8_uint.no_op
dEQP-VK.pipeline.logic_op.r8_uint.xor
dEQP-VK.pipeline.logic_op.r8_uint.or
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10801>
We don't want to have to deal with vector phis in freedreno, because
vectors are always split/unsplit around vectorized instructions anyways,
and the stated reason for not scalarising them (it hurting coalescing)
won't apply to us because we won't be using nir_from_ssa. Add this
option so that we don't have to do the equivalent thing while
translating from NIR.
Reviewed-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10809>
We enabled this in the past to fix some register allocation issues we
faced with geometry shaders but we didn't document why it is safe for
us to do this, which is not immediately obvious.
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10745>
When spilling a register, the number of temps can be increased when
introducing a temporal variable.
Those nodes are not elegible to be spilled, but we need to take care of
no accessing out-of-bounds of the arrays defined with a size equal to
the original number of temps.
Fixes address sanitizer error on
KHR-GLES3.shaders.uniform_block.random.all_shared_buffer.14 (and many
others).
v2 (Iago):
- Add clarification in assertion.
- Use `vir_get_temp` to increase num_temps.
v3 (Iago):
- Update clarification
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10643>
The current policy is to always favor accumulators if possible, however,
this is not always optimal.
Particularly, accumulators play a crucial role in enabling QPU instruction
merges, since these are limited to both the ADD and the ALU instructions
addressing at most 2 physical registers. For 2-src instructions, this means
that to be able to merge we need them to address at least 2 accumulators.
While favoring accumulators does help the case for instruction merges in
general, it is risky to assign accumulators to variables that have
long life spans. Doing so will make the accumulator unavailable for
any other instructions during that life span, and since we only have a few
accumulators, we can quickly run out and losing our capacity to merge
instructions for large parts of the qpu program.
On the other hand, we also want to avoid the extreme case were we keep
allocating physical registers to the point we run out, even if we have
accumulators available, since accumulators have additional restrictions
and may not be suitable for everything.
This change continues the policy of favoring accumulators, but it only
does so if the life span of the temps is short, to ensure that we can
recycle accumulators often across instructions and avoid running out
for sections of the QPU code, unless we are already running out of
physical registers.
total instructions in shared programs: 13654647 -> 13336921 (-2.33%)
instructions in affected programs: 11015919 -> 10698193 (-2.88%)
helped: 39758
HURT: 17325
Instructions are helped.
total threads in shared programs: 412046 -> 412038 (<.01%)
threads in affected programs: 16 -> 8 (-50.00%)
helped: 0
HURT: 4
Threads are HURT.
total uniforms in shared programs: 3745726 -> 3746003 (<.01%)
uniforms in affected programs: 17296 -> 17573 (1.60%)
helped: 76
HURT: 99
Uniforms are HURT.
total max-temps in shared programs: 2364430 -> 2359942 (-0.19%)
max-temps in affected programs: 109117 -> 104629 (-4.11%)
helped: 2893
HURT: 772
Max-temps are helped.
total spills in shared programs: 5727 -> 5746 (0.33%)
spills in affected programs: 221 -> 240 (8.60%)
helped: 1
HURT: 2
total fills in shared programs: 13121 -> 13139 (0.14%)
fills in affected programs: 466 -> 484 (3.86%)
helped: 1
HURT: 2
total sfu-stalls in shared programs: 33432 -> 34491 (3.17%)
sfu-stalls in affected programs: 18219 -> 19278 (5.81%)
helped: 4459
HURT: 5087
Inconclusive result
total inst-and-stalls in shared programs: 13688079 -> 13371412 (-2.31%)
inst-and-stalls in affected programs: 11030017 -> 10713350 (-2.87%)
helped: 39630
HURT: 17429
Inst-and-stalls are helped.
total nops in shared programs: 335753 -> 333708 (-0.61%)
nops in affected programs: 112659 -> 110614 (-1.82%)
helped: 8726
HURT: 7383
Inconclusive result
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10686>
Once we have exhausted compile strategies at 4 threads and we start
enabling lower thread counts, there is no point in starting compiles
with 4 threads for them, we know these will fail, so let's start at
2 in these cases.
This also has another nice implication: if the driver compiles at 4
threads and fails to register allocate, we were allowing it to try
with 2 threads, but this would only retry the register allocation
process and would not really recompile the shader with 2 threads. This
is not optimal, because at 2 threads we have more TMU fifo space for
each thread and we can do more TMU pipelining, so we were missing that
opportunity.
This improves performance in Sponza by ~1.5% and also seems to help
UE4 slightly.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>
Until now, if we can't compile at 4 threads we would lower thread count
with optimizations disabled, however, lowering thread count doubles the
amount of registers available per thread, so that alone is already a big
relief for register pressure so it makes sense to enable optimizations
when we do that, and progressively disable them until we enable spilling
as a last resort.
This can slightly improve performance for some applications. Sponza,
for example, gets a ~1.5% boost. I see several UE4 shaders that also get
compiled to better code at 2 threads with this, but it is more difficult
to assess how much this improves performance in practice due to the large
variance in frame times that we observe with UE4 demos.
Also, if a compiler strategy disables an optimization that did not make
any progress in the previous compile attempt, we would end up re-compiling
the exact same shader code and failing again. This, patch keeps track of
which strategies won't make progress and skips them in that case to save
some CPU time during shader compiles.
Care should be taken to ensure that we try to compile with the default
NIR scheduler at minimum thread count at least once though, so a specific
strategy for this is added, to prevent the scenario where no optimizations
are used and we skip directly to the fallback scheduler if the default
strategy fails at 4 threads.
Similarly, we now also explicitly specify which strategies are allowed to do
TMU spills and make sure we take this into account when deciding to skip
strategies. This prevents the case where no optimizations are used in a
shader and we skip directly to the fallback scheduler after failing
compilation at 2 threads with the default NIR scheduler but without trying
to spill first.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>
The Vulkan driver was already creating and using its own set of options, so
the ones defined in the compiler are only used with GL, which is confusing.
Move them to the GL driver.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>
Right now this is useful for Vulkan onnly, because GL gets loop
unrolling from the GLSL compiler and/or mesa state tracker
NIR front-ends.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10647>
Vertex Shader has a store_out lowering pass that converts gallium driver
locations in offsets inside the VPM.
One of the consequences is that these offsets are consecutives; that is,
if the VS stores VARYING_SLOT_VAR0.xyz and VARYING_SLOT_VAR1.xyzw, there
isn't a hole in the VPM offsets for the un-stored VARYING_SLOT_VAR0.w.
Thus we need to change how the VPM offset is computed in the Geometry
Shader when loading the inputs.
This bug is exposed by !9050.
v2 (Iago):
- Include explanatory comment.
- Use assert.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10129>
`qpu.raddr_b` is an unsigned int, so it is always positive, even after
casting to signed int.
Fixes CID#1438117 "Operands don't affect result
(CONSTANT_EXPRESSION_RESULT)":
"result_independent_of_operands: (int)inst->qpu.raddr_b >= -16 is
always true regardless of the values of its operands. This occurs as
the logical first operand of "&&".
v2:
- Use signed pointers (Iago)
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10131>
The term 'last' may be misleading because the offset represents
the current unifa offset, which is the offset used by the last
load plus 4 bytes, so rename these to use the term 'current'
instead.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>
This implements a NIR pass that groups together constant UBO loads
for the same UBO index in order of increasing offset when the distance
between them is small enough that it enables the "skip unifa write"
optimization.
This may increase register pressure because it can move UBO loads
earlier, so we also add a compiler strategy fallback to disable the
optimization if we need to drop thread count to compile the shader
with this optimization enabled.
total instructions in shared programs: 13557555 -> 13550300 (-0.05%)
instructions in affected programs: 814684 -> 807429 (-0.89%)
helped: 4485
HURT: 2377
Instructions are helped.
total uniforms in shared programs: 3777243 -> 3760990 (-0.43%)
uniforms in affected programs: 112554 -> 96301 (-14.44%)
helped: 7226
HURT: 36
Uniforms are helped.
total max-temps in shared programs: 2318133 -> 2333761 (0.67%)
max-temps in affected programs: 63230 -> 78858 (24.72%)
helped: 23
HURT: 3044
Max-temps are HURT.
total sfu-stalls in shared programs: 32245 -> 32567 (1.00%)
sfu-stalls in affected programs: 389 -> 711 (82.78%)
helped: 139
HURT: 451
Inconclusive result.
total inst-and-stalls in shared programs: 13589800 -> 13582867 (-0.05%)
inst-and-stalls in affected programs: 817738 -> 810805 (-0.85%)
helped: 4478
HURT: 2395
Inst-and-stalls are helped.
total nops in shared programs: 354365 -> 342202 (-3.43%)
nops in affected programs: 31000 -> 18837 (-39.24%)
helped: 4405
HURT: 265
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>
This adds a minimum thread count parameter to each compilation strategy with
the intention to limit the minimum allowed thread count that can be used to
register allocate with that strategy.
For now all strategies allow the minimum thread count supported by the
hardware, but we will be using this infrastructure to impose a more
strict limit in an upcoming optimization.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>
We will be using this distance to setup another optimization in a
follow-up patch.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
x# Please enter the commit message for your changes. Lines starting
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10100>
first_component is an uint, and thus if it takes value 0 we can't know
if it is because writemask has its first bit to 1, or all bits to 0.
As we want to ensure that at least one bit is set, apply the assertion
in writemask.
Fixes CID#1472829 "Macro compares unsigned to 0 (NO_EFFECT)".
v2:
- Restore "first_component <= last_component" assertion (Iago)
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10103>
A break/continue in a loop is typically emitted like this:
if (cond) {
break/continue;
} else {
}
If cond is uniform, we'll emit code for a uniform if statement and
that will emit a branch right before the if to jump directly to the
else (or the block after the else in this case, since the else is
empty) in case cond evaluates to false. This means we end up emitting
two consecutive branch instructions, one before the if and one for the
THEN block right after:
branch(!cond) -> jump to else (or after else) if cond is false
nop
nop
nop
branch -> unconditional jump to break/continue
nop
nop
nop
Instead, if we are in this scenario, we can do better by emitting the
conditional jump directly and avoiding the "jump to else" case:
branch(cond) -> jump to break/continue if cond is true
nop
nop
nop
We need to be careful when emitting the break/continue for the case
where all lanes are disabled to avoid infinite loops: if we have a
break we always want to take the jump, but we don't want to take it
if it is a continue.
total instructions in shared programs: 13563672 -> 13557348 (-0.05%)
instructions in affected programs: 348034 -> 341710 (-1.82%)
helped: 1158
HURT: 10
Instructions are helped.
total uniforms in shared programs: 3779137 -> 3777535 (-0.04%)
uniforms in affected programs: 90583 -> 88981 (-1.77%)
helped: 1169
HURT: 0
Uniforms are helped.
total max-temps in shared programs: 2317670 -> 2317575 (<.01%)
max-temps in affected programs: 1943 -> 1848 (-4.89%)
helped: 85
HURT: 4
Max-temps are helped.
total sfu-stalls in shared programs: 32247 -> 32247 (0.00%)
sfu-stalls in affected programs: 69 -> 69 (0.00%)
helped: 7
HURT: 9
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 13595919 -> 13589595 (-0.05%)
inst-and-stalls in affected programs: 350674 -> 344350 (-1.80%)
helped: 1154
HURT: 11
Inst-and-stalls are helped.
total nops in shared programs: 358202 -> 354325 (-1.08%)
nops in affected programs: 17367 -> 13490 (-22.32%)
helped: 1168
HURT: 1
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9948>
If we have an unconditional branch then we can try to fill up its
delay slots with the initial instructions of its successor block by
copying them into the delay slots and adjusting the branch offset to
skip the copied instructions.
total nops in shared programs: 365640 -> 364471 (-0.32%)
nops in affected programs: 15416 -> 14247 (-7.58%)
helped: 462
HURT: 0
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9918>
For this we do something similar to what we do with thrsw where we try to
move the branch instruction earlier so the previous instructions execute
in the delay slots of the branch.
Generally, we can do this with any instruction except:
- If the instruction reads a uniform: since our branches do as well and
uniforms come from an ordered FIFO stream.
- If the instruction writes flags, since our branch instruction will
probably read them.
- If the instruction is in the delay slots of another thread switch,
branch, or unifa write, which is disallowed.
total instructions in shared programs: 13648140 -> 13613972 (-0.25%)
instructions in affected programs: 2209552 -> 2175384 (-1.55%)
helped: 6765
HURT: 0
Instructions are helped.
total max-temps in shared programs: 2318687 -> 2318436 (-0.01%)
max-temps in affected programs: 5046 -> 4795 (-4.97%)
helped: 152
HURT: 0
Max-temps are helped.
total inst-and-stalls in shared programs: 13680494 -> 13646326 (-0.25%)
inst-and-stalls in affected programs: 2220394 -> 2186226 (-1.54%)
helped: 6765
HURT: 0
Inst-and-stalls are helped.
total nops in shared programs: 399818 -> 365640 (-8.55%)
nops in affected programs: 127311 -> 93133 (-26.85%)
helped: 6765
HURT: 0
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9918>
Do not assign to a variable that won't be used.
Fixes CID#1451708 and CID#1451710 "Unused value (UNUSED_VALUE)".
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9910>
We were using a write dependency to ensure ordering since LDTMUs sequences
are ordered, but by using a write dependency with TMU config we were also
preserving ordering with TMU config writes that are not a sequence
terminator, which is not required and reduces scheduling flexibility.
Instead, use a write dependency to ensure strict ordering of TMU reads,
but only a read depdency with TMU config.
With this change we also need to update CS barriers to also have a write
dependency with TMU reads to ensure that we don't move TMU reads around
CS barriers.
total instructions in shared programs: 13602500 -> 13597851 (-0.03%)
instructions in affected programs: 2681428 -> 2676779 (-0.17%)
helped: 6567
HURT: 4960
Instructions are helped.
total max-temps in shared programs: 2317927 -> 2317914 (<.01%)
max-temps in affected programs: 13861 -> 13848 (-0.09%)
helped: 355
HURT: 300
Inconclusive result (value mean confidence interval includes 0).
total sfu-stalls in shared programs: 32074 -> 32247 (0.54%)
sfu-stalls in affected programs: 848 -> 1021 (20.40%)
helped: 160
HURT: 327
Inconclusive result (%-change mean confidence interval includes 0).
total inst-and-stalls in shared programs: 13634574 -> 13630098 (-0.03%)
inst-and-stalls in affected programs: 2703041 -> 2698565 (-0.17%)
helped: 6558
HURT: 5020
Inst-and-stalls are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9856>
Instead of last TMU write. According to the documentation, the entries
in the output FIFO are pushed with the *final* input write for the
lookup, which is the one terminating the sequence. We flag these
with last_tmu_config.
This will allow us to move all TMU register writes for a lookup except
the last one ahead of the LDTMUs for the previous lookup, possibly
allowing us to pair up these writes the wrtmuc instructions for the
same lookup, turning code like this:
nop ; nop ; wrtmuc (tex[0].p0 | 0x3)
nop ; nop ; wrtmuc (tex[2].p1 | 0x1)
nop ; nop ; ldunif (ubo[2]+0xe0)
fadd r4, rf33, rf51 ; mov unifa, r5 ; ldunif (ubo[2]+0x110)
fmax rf34, 0, r4 ; nop
nop ; mov tmut, rf11
nop ; mov tmus, rf0
into:
nop ; mov tmut, rf11 ; wrtmuc (tex[0].p0 | 0x3)
nop ; nop ; wrtmuc (tex[2].p1 | 0x1)
nop ; nop ; ldunif (ubo[2]+0xe0)
fadd r4, rf33, rf51 ; mov unifa, r5 ; ldunif (ubo[2]+0x110)
fmax rf34, 0, r4 ; nop
nop ; mov tmus, rf0
total instructions in shared programs: 13648140 -> 13602500 (-0.33%)
instructions in affected programs: 3497402 -> 3451762 (-1.30%)
helped: 12044
HURT: 3484
Instructions are helped.
total max-temps in shared programs: 2318687 -> 2317927 (-0.03%)
max-temps in affected programs: 17234 -> 16474 (-4.41%)
helped: 615
HURT: 198
Max-temps are helped.
total sfu-stalls in shared programs: 32354 -> 32074 (-0.87%)
sfu-stalls in affected programs: 1462 -> 1182 (-19.15%)
helped: 461
HURT: 188
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 13680494 -> 13634574 (-0.34%)
inst-and-stalls in affected programs: 3514405 -> 3468485 (-1.31%)
helped: 12062
HURT: 3486
Inst-and-stalls are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9856>
The way we handle thrsw instructions is that we try to merge them
back into previously scheduled instructions to fill up its delay
slots. This is generally safe, because the thrsw won't happen until
after the delay slots, so we are not really changing the execution
order of the instructions and we just need to make sure we don't
violate a few specific restrictions.
If we have not managed to fill up all delay slots after doing this,
then we emit as many NOPs as needed to fill them. This is to ensure
that we don't schedule an instruction that needs to execute after the
thread switch before the thread switch happens. However, doing this
can lead to inefficient code, since some times the instructions we
schedule after a thrsw are indepdent of the thrsw and could be safely
executed in its delay slots.
This change removes the fixed NOP emission after a thrsw to fill
delay slots and instead adds code to ensure that our instruction
scheduling is aware of when it is scheduling instructions in the
delay slots of a previous thrsw to avoid selecting conflicting
instructions.
The only case were we still emit fixed NOPs is for the thread end that
we emit to terminate the program after scheduling all instructions
because we can't end the instruction stream before the thread end
is properly executed.
total instructions in shared programs: 13691004 -> 13648140 (-0.31%)
instructions in affected programs: 4345951 -> 4303087 (-0.99%)
helped: 19645
HURT: 652
Instructions are helped.
total max-temps in shared programs: 2319317 -> 2318687 (-0.03%)
max-temps in affected programs: 10510 -> 9880 (-5.99%)
helped: 532
HURT: 9
Max-temps are helped.
total sfu-stalls in shared programs: 31752 -> 32354 (1.90%)
sfu-stalls in affected programs: 840 -> 1442 (71.67%)
helped: 7
HURT: 467
Sfu-stalls are HURT.
total inst-and-stalls in shared programs: 13722756 -> 13680494 (-0.31%)
inst-and-stalls in affected programs: 4335590 -> 4293328 (-0.97%)
helped: 19453
HURT: 758
Inst-and-stalls are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9825>
Integer add/sub can be implemented as either an add or a mul instruction
but we always emit them as add instructions at VIR level. We can use this
flexibility to improve our QPU scheduling so we can be more effective
at instruction merging by converting these to mul instructions when we
are attempting to merge them with another add instruction.
total instructions in shared programs: 13721549 -> 13691004 (-0.22%)
instructions in affected programs: 3340493 -> 3309948 (-0.91%)
helped: 12805
HURT: 1656
Instructions are helped.
total max-temps in shared programs: 2319528 -> 2319317 (<.01%)
max-temps in affected programs: 5285 -> 5074 (-3.99%)
helped: 195
HURT: 3
Max-temps are helped.
total sfu-stalls in shared programs: 31616 -> 31752 (0.43%)
sfu-stalls in affected programs: 469 -> 605 (29.00%)
helped: 52
HURT: 161
Sfu-stalls are HURT.
total inst-and-stalls in shared programs: 13753165 -> 13722756 (-0.22%)
inst-and-stalls in affected programs: 3340383 -> 3309974 (-0.91%)
helped: 12782
HURT: 1666
Inst-and-stalls are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9769>
This maps the nir shader data.location to its final
data.driver_location. In general we are using the driver location as
index (like vattr_sizes on the same struct), so having this map is
useful if what we have is the data.location, and we don't have
available the original nir shader.
v2: use memset instead of for loop, and nir_foreach_shader_in_variable
instead of nir_foreach_variable_with_modes (Iago)
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9403>
As we plan to try to get directly the compiled variant from the cache,
it would be possible to not have available the nir shaders, so we add
this info on prog data.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9403>