This lets us emit the VPM_WRITEs directly from
nir_intrinsic_store_output() (useful once NIR scheduling is in place so
that we can reduce register pressure), and lets future NIR scheduling
schedule the math to generate them. Even in the meantime, it looks like
this lets NIR DCE some more code and make better decisions.
total instructions in shared programs: 6429246 -> 6412976 (-0.25%)
total threads in shared programs: 153924 -> 153934 (<.01%)
total loops in shared programs: 486 -> 483 (-0.62%)
total uniforms in shared programs: 2385436 -> 2388195 (0.12%)
Acked-by: Ian Romanick <ian.d.romanick@intel.com> (nir)
We can pull a whole vector in a single indirect load. This saves a bunch
of round-trips to the TMU, instructions for setting up multiple loads,
references to the UBO base in the uniforms, and apparently manages to
reduce register pressure as well.
instructions in affected programs: 3086665 -> 2454967 (-20.47%)
uniforms in affected programs: 919581 -> 721039 (-21.59%)
threads in affected programs: 1710 -> 3420 (100.00%)
spills in affected programs: 596 -> 522 (-12.42%)
fills in affected programs: 680 -> 562 (-17.35%)
Improves 3dmmes performance by 2.29312% +/- 0.139825% (n=5)
We were doing this late after nir_lower_io, but we can just reuse the core
code. By doing it at this stage, we won't even set up the VS attributes
as inputs, reducing our VPM size.
This is a pretty straightforward fork of VC4's NIR compiler to VC5. The
condition codes, registers, and I/O have all changed, making the backend
hard to share, though their heritage is still recognizable.
v2: Move to src/broadcom/compiler to match intel's layout, rename more
"vc5" to "v3d", rename QIR to VIR ("V3D IR") to avoid symbol conflicts
with vc4, use new v3d_debug header, add compiler init/free functions,
do texture swizzling in NIR to allow optimization.