mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2025-12-20 20:20:18 +01:00
ir3: lower vectorized NIR instructions
Use the new repeat group builders to lower vectorized NIR instructions. Add NIR pass to vectorize NIR before lowering. Support for repeated instructions is added over a number of different commits. Here's how they all tie together: ir3 is a scalar architecture and as such most instructions cannot be vectorized. However, many instructions support the (rptN) modifier that allows us to mimic vector instructions. Whenever an instruction has the (rptN) modifier set it will execute N more times, incrementing its destination register for each repetition. Additionally, source registers with the (r) flag set will also be incremented. For example: (rpt1)add.f r0.x, (r)r1.x, r2.x is the same as: add.f r0.x, r1.x, r2.x add.f r0.y, r1.y, r2.x The main benefit of using repeated instructions is a reduction in code size. Since every iteration is still executed as a scalar instruction, there's no direct benefit in terms of runtime. The only exception seems to be for 3-source instructions pre-a7xx: if one of the sources is constant (i.e., without the (r) flag), a repeated instruction executes faster than the equivalent expanded sequence. Presumably, this is because the ALU only has 2 register read ports. I have not been able to measure this difference on a7xx though. Support for repeated instructions consists of two parts. First, we need to make sure NIR is (mostly) vectorized when translating to ir3. I have not been able to find a way to keep NIR vectorized all the way and still generate decent code. Therefore, I have taken the approach of vectorizing the (scalarized) NIR right before translating it to ir3. Secondly, ir3 needs to be adapted to ingest vectorized NIR and translate it to repeated instructions. To this end, I have introduced the concept of "repeat groups" to ir3. A repeat group is a group of instructions that were produced from a vectorized NIR operation and linked together. They are, however, still separate scalar instructions until quite late. More concretely: 1. 
Instruction emission: for every vectorized NIR operation, emit separate scalar instructions for its components and link them together in a repeat group. For every instruction builder ir3_X, a new repeat builder ir3_X_rpt has been added to facilitate this. 2. Optimization passes: for now, repeat groups are completely ignored by optimizations. 3. Pre-RA: clean up repeat groups that can never be merged into an actual rptN instruction (e.g., because their instructions are not consecutive anymore). This ensures no useless merge sets will be created in the next step. 4. RA: create merge sets for the sources and defs of the instructions in repeat groups. This way, RA will try to allocate consecutive registers for them. This will not be forced though because we prefer to split-up repeat groups over creating movs to reorder registers. 5. Post-RA: create actual rptN instructions for repeat groups where the allocated registers allow it. The idea for step 2 is that we prefer that any potential optimizations take precedence over creating rptN instructions as the latter will only yield a code size benefit. However, it might be interesting to investigate if we could make some optimizations repeat aware. For example, the scheduler could try to schedule instructions of a repeat group together. Signed-off-by: Job Noorman <jnoorman@igalia.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28341>
This commit is contained in:
parent
4c4366179b
commit
58d18bc7a8
4 changed files with 435 additions and 230 deletions
File diff suppressed because it is too large
Load diff
|
|
@ -121,9 +121,17 @@ ir3_context_init(struct ir3_compiler *compiler, struct ir3_shader *shader,
|
||||||
if ((so->type == MESA_SHADER_FRAGMENT) && compiler->has_fs_tex_prefetch)
|
if ((so->type == MESA_SHADER_FRAGMENT) && compiler->has_fs_tex_prefetch)
|
||||||
NIR_PASS_V(ctx->s, ir3_nir_lower_tex_prefetch);
|
NIR_PASS_V(ctx->s, ir3_nir_lower_tex_prefetch);
|
||||||
|
|
||||||
NIR_PASS(progress, ctx->s, nir_convert_to_lcssa, true, true);
|
bool vectorized = false;
|
||||||
|
NIR_PASS(vectorized, ctx->s, nir_opt_vectorize, ir3_nir_vectorize_filter,
|
||||||
|
NULL);
|
||||||
|
|
||||||
NIR_PASS(progress, ctx->s, nir_lower_phis_to_scalar, true);
|
if (vectorized) {
|
||||||
|
NIR_PASS_V(ctx->s, nir_opt_undef);
|
||||||
|
NIR_PASS_V(ctx->s, nir_copy_prop);
|
||||||
|
NIR_PASS_V(ctx->s, nir_opt_dce);
|
||||||
|
}
|
||||||
|
|
||||||
|
NIR_PASS(progress, ctx->s, nir_convert_to_lcssa, true, true);
|
||||||
|
|
||||||
/* This has to go at the absolute end to make sure that all SSA defs are
|
/* This has to go at the absolute end to make sure that all SSA defs are
|
||||||
* correctly marked.
|
* correctly marked.
|
||||||
|
|
|
||||||
|
|
@ -61,6 +61,9 @@ void ir3_nir_lower_tess_eval(nir_shader *shader, struct ir3_shader_variant *v,
|
||||||
unsigned topology);
|
unsigned topology);
|
||||||
void ir3_nir_lower_gs(nir_shader *shader);
|
void ir3_nir_lower_gs(nir_shader *shader);
|
||||||
|
|
||||||
|
bool ir3_supports_vectorized_nir_op(nir_op op);
|
||||||
|
uint8_t ir3_nir_vectorize_filter(const nir_instr *instr, const void *data);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* 64b related lowering:
|
* 64b related lowering:
|
||||||
*/
|
*/
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,52 @@
|
||||||
|
|
||||||
#include "ir3_nir.h"
|
#include "ir3_nir.h"
|
||||||
|
|
||||||
|
bool
|
||||||
|
ir3_supports_vectorized_nir_op(nir_op op)
|
||||||
|
{
|
||||||
|
switch (op) {
|
||||||
|
/* TODO: emitted as absneg which can often be folded away (e.g., into
|
||||||
|
* (neg)). This seems to often fail when repeated.
|
||||||
|
*/
|
||||||
|
case nir_op_b2b1:
|
||||||
|
|
||||||
|
/* dsx/dsy don't seem to support repeat. */
|
||||||
|
case nir_op_fddx:
|
||||||
|
case nir_op_fddx_coarse:
|
||||||
|
case nir_op_fddx_fine:
|
||||||
|
case nir_op_fddy:
|
||||||
|
case nir_op_fddy_coarse:
|
||||||
|
case nir_op_fddy_fine:
|
||||||
|
|
||||||
|
/* dp2acc/dp4acc don't seem to support repeat. */
|
||||||
|
case nir_op_udot_4x8_uadd:
|
||||||
|
case nir_op_udot_4x8_uadd_sat:
|
||||||
|
case nir_op_sudot_4x8_iadd:
|
||||||
|
case nir_op_sudot_4x8_iadd_sat:
|
||||||
|
|
||||||
|
/* Among SFU instructions, only rcp doesn't seem to support repeat. */
|
||||||
|
case nir_op_frcp:
|
||||||
|
return false;
|
||||||
|
|
||||||
|
default:
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
uint8_t
|
||||||
|
ir3_nir_vectorize_filter(const nir_instr *instr, const void *data)
|
||||||
|
{
|
||||||
|
if (instr->type != nir_instr_type_alu)
|
||||||
|
return 0;
|
||||||
|
|
||||||
|
struct nir_alu_instr *alu = nir_instr_as_alu(instr);
|
||||||
|
|
||||||
|
if (!ir3_supports_vectorized_nir_op(alu->op))
|
||||||
|
return 0;
|
||||||
|
|
||||||
|
return 4;
|
||||||
|
}
|
||||||
|
|
||||||
static void
|
static void
|
||||||
rpt_list_split(struct list_head *list, struct list_head *at)
|
rpt_list_split(struct list_head *list, struct list_head *at)
|
||||||
{
|
{
|
||||||
|
|
|
||||||
Loading…
Add table
Reference in a new issue