brw: Tune vectorizer conditions to allow overfetching with holes

Notably, our convergent block loads were already overfetching - we
rounded up to block sizes of 8, 16, 32, or 64 (LSC-only).  But we did
so in the backend, rather than NIR.

With recent changes, nir_opt_load_store_vectorizer allows holes of up
to 28 bytes (7 components at 4 bytes each).  This allows us to detect
cases where we did a convergent block load for 1 component (but loaded
a whole vec8), then another load for the next vec8, and combine them
into a single V16 load.

Single component loads aren't the most common, but convergent loads of
a vec2 in one group and a vec3 in another are quite common, and it
makes no sense to do V8+V8 loads instead of V16.

For non-block loads, we allow a max hole of 4 bytes.  This allows the
common case of XYZ_ + XYZ_ loads (where the last component is unread)
to combine into a single larger load.

fossil-db results on Lunarlake:

   Totals:
   Instrs: 146692608 -> 146246432 (-0.30%); split: -0.33%, +0.02%
   Subgroup size: 11100528 -> 11100512 (-0.00%)
   Send messages: 7003425 -> 6862529 (-2.01%); split: -2.01%, +0.00%
   Cycle count: 22396273274 -> 22523048654 (+0.57%); split: -1.08%, +1.64%
   Spill count: 67671 -> 67594 (-0.11%); split: -1.59%, +1.48%
   Fill count: 128999 -> 130223 (+0.95%); split: -1.73%, +2.68%
   Scratch Memory Size: 5986304 -> 6042624 (+0.94%); split: -1.40%, +2.34%
   Max live registers: 48898858 -> 48881655 (-0.04%); split: -0.05%, +0.01%
   Non SSA regs after NIR: 172397792 -> 167577380 (-2.80%); split: -2.80%, +0.00%

   Totals from 451003 (80.87% of 557667) affected shaders:
   Instrs: 134111754 -> 133665578 (-0.33%); split: -0.36%, +0.03%
   Subgroup size: 9039104 -> 9039088 (-0.00%)
   Send messages: 6127775 -> 5986879 (-2.30%); split: -2.30%, +0.00%
   Cycle count: 20306336726 -> 20433112106 (+0.62%); split: -1.19%, +1.81%
   Spill count: 56230 -> 56153 (-0.14%); split: -1.92%, +1.78%
   Fill count: 112920 -> 114144 (+1.08%); split: -1.97%, +3.06%
   Scratch Memory Size: 3769344 -> 3825664 (+1.49%); split: -2.23%, +3.72%
   Max live registers: 43750259 -> 43733056 (-0.04%); split: -0.05%, +0.01%
   Non SSA regs after NIR: 158449343 -> 153628931 (-3.04%); split: -3.04%, +0.00%

In particular, sends get cut by 20.85% for Borderlands 3 DX12, 13.82%
on Cyberpunk 2077, 10.75% on Strange Brigade, and 10.20% on Red Dead
Redemption 2.  Yet, spill/fills remain about the same.

fossil-db results on Alchemist are similar though not quite as good.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
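
The two hole limits described in the message above can be sketched in
isolation.  This is a minimal illustration, not the code in brw_nir.c:
the helper names (sketch_block_hole_budget, sketch_hole_allowed,
sketch_self_check) are hypothetical, the clamp to zero for lower loads
of 8 or more components is an assumption of the sketch, and the other
conditions in brw_nir_should_vectorize_mem (bit size, component count)
are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: these helpers are hypothetical and are not part of
 * brw_nir.c.  They restate the two hole limits from the commit message.
 */

/* Hole budget, in bytes, for a convergent block load whose lower load
 * reads `low_components` 32-bit components.  Such a load is rounded up to
 * at least a vec8 anyway, so the unread tail of that vec8 can be skipped.
 */
static unsigned
sketch_block_hole_budget(unsigned low_components)
{
   /* Assumption of this sketch: clamp to zero once the lower load already
    * fills a vec8, which also avoids unsigned wraparound below.
    */
   if (low_components >= 8)
      return 0;

   return 4 * (8 - low_components);
}

/* True if a hole of `hole_size` bytes between two loads fits the budget:
 * the block-load budget above, or 4 bytes for non-block loads.
 */
static bool
sketch_hole_allowed(bool is_block_load, unsigned low_components,
                    unsigned hole_size)
{
   const unsigned budget =
      is_block_load ? sketch_block_hole_budget(low_components) : 4;

   return hole_size <= budget;
}

/* Small self-check covering the cases named in the commit message. */
static void
sketch_self_check(void)
{
   assert(sketch_block_hole_budget(1) == 28);   /* 7 components * 4 bytes */
   assert(sketch_hole_allowed(true, 2, 24));    /* vec2, then next vec8 */
   assert(!sketch_hole_allowed(true, 2, 28));
   assert(sketch_hole_allowed(false, 3, 4));    /* XYZ_ followed by XYZ_ */
   assert(!sketch_hole_allowed(false, 3, 8));
}
```

For example, a convergent vec2 load at byte 0 followed by a vec3 load at
byte 32 leaves a 24-byte hole, which fits the 24-byte budget for a
two-component lower load.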
2026-05-05 11:48:06 +02:00 · 2024-09-09 17:59:50 -07:00 · 2024-09-09 17:59:50 -07:00 · 6fd10a6620
commit 6fd10a6620
parent f88eb48ff2
1 changed files with 7 additions and 4 deletions
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -1421,7 +1421,7 @@ brw_nir_should_vectorize_mem(unsigned align_mul, unsigned align_offset,
    * those back into 32-bit ones anyway and UBO loads aren't split in NIR so
    * we don't want to make a mess for the back-end.
    */
-   if (bit_size > 32 || hole_size || !nir_num_components_valid(num_components))
+   if (bit_size > 32)
      return false;

   if (low->intrinsic == nir_intrinsic_load_ubo_uniform_block_intel ||
@@ -1429,14 +1429,14 @@ brw_nir_should_vectorize_mem(unsigned align_mul, unsigned align_offset,
       low->intrinsic == nir_intrinsic_load_shared_uniform_block_intel ||
       low->intrinsic == nir_intrinsic_load_global_constant_uniform_block_intel) {
      if (num_components > 4) {
-         if (!util_is_power_of_two_nonzero(num_components))
-            return false;
-
         if (bit_size != 32)
            return false;

         if (num_components > 32)
            return false;
+
+         if (hole_size > 4 * (8 - low->num_components))
+            return false;
      }
   } else {
      /* We can handle at most a vec4 right now.  Anything bigger would get
@@ -1444,6 +1444,9 @@ brw_nir_should_vectorize_mem(unsigned align_mul, unsigned align_offset,
       */
      if (num_components > 4)
         return false;
+
+      if (hole_size > 4)
+         return false;
   }