brw: Tune vectorizer conditions to allow overfetching with holes

Notably, our convergent block loads were already overfetching - we
rounded up to block sizes of 8, 16, 32, or 64 (LSC-only).  But we did
so in the backend, rather than in NIR.

With recent changes, nir_opt_load_store_vectorizer allows holes of up
to 28 bytes (7 components at 4 bytes each).  This allows us to detect
cases where we did a convergent block load for 1 component (but loaded
a whole vec8), then another load for the next vec8, and combine them
into a single V16 load.  Single-component loads aren't the most common,
but convergent loads of a vec2 in one group and a vec3 in another are
quite common, and it makes no sense to do V8+V8 loads instead of V16.
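
As a rough illustration of the arithmetic above (not the driver code), the
hole budget for 32-bit block loads can be sketched in C.  The helper names
are hypothetical; only the bound 4 * (8 - low->num_components) comes from
this patch, and the sketch assumes the low load has 1 to 8 components.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: largest hole, in bytes, tolerated after a 32-bit
 * block load of low_components components.  Mirrors the patch's bound
 * 4 * (8 - low->num_components); assumes 1 <= low_components <= 8.
 */
static unsigned
max_block_hole_bytes(unsigned low_components)
{
   assert(low_components >= 1 && low_components <= 8);
   return 4 * (8 - low_components);
}

/* Hypothetical helper: does the gap between two 32-bit block loads fit
 * within that budget?  Offsets are in bytes; overlap is not modelled.
 */
static bool
block_loads_mergeable(unsigned low_offset, unsigned low_components,
                      unsigned high_offset)
{
   unsigned low_end = low_offset + 4 * low_components;

   if (high_offset < low_end)
      return false;

   return high_offset - low_end <= max_block_hole_bytes(low_components);
}

/* Components spanned by the merged load, hole included. */
static unsigned
merged_components(unsigned low_offset, unsigned high_offset,
                  unsigned high_components)
{
   return (high_offset + 4 * high_components - low_offset) / 4;
}
```

With these helpers, block_loads_mergeable(0, 1, 32) holds (a 28-byte hole)
and merged_components(0, 32, 8) is 16, which is the V16 case.  A vec2 at
offset 0 followed by a vec3 at offset 32 leaves a 24-byte hole and spans 11
components, which the block-size rounding described earlier would presumably
bring up to 16.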

For non-block loads, we allow a max hole of 4 bytes.  This allows the
common case of XYZ_ + XYZ_ loads (where the last component is unread)
to combine into a single larger load.
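
A minimal sketch of the hole test for non-block loads, again with
hypothetical helper names.  It models only the `hole_size > 4` rejection
from this patch; the other conditions in brw_nir_should_vectorize_mem (bit
size, component count, alignment) are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical helper: bytes skipped between the end of the low load and
 * the start of the high load.  Assumes high_offset >= low_offset + low_size.
 */
static unsigned
hole_bytes(unsigned low_offset, unsigned low_size, unsigned high_offset)
{
   return high_offset - (low_offset + low_size);
}

/* Mirrors only the hole test added for non-block loads. */
static bool
non_block_hole_ok(unsigned hole_size)
{
   return hole_size <= 4;
}
```

For an XYZ_ layout with 32-bit components, the low load covers 12 bytes and
the next one starts at byte 16, so hole_bytes(0, 12, 16) is 4 and the test
passes.  An XY__ layout leaves an 8-byte hole and is rejected.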

fossil-db results on Lunarlake:

   Totals:
   Instrs: 146692608 -> 146246432 (-0.30%); split: -0.33%, +0.02%
   Subgroup size: 11100528 -> 11100512 (-0.00%)
   Send messages: 7003425 -> 6862529 (-2.01%); split: -2.01%, +0.00%
   Cycle count: 22396273274 -> 22523048654 (+0.57%); split: -1.08%, +1.64%
   Spill count: 67671 -> 67594 (-0.11%); split: -1.59%, +1.48%
   Fill count: 128999 -> 130223 (+0.95%); split: -1.73%, +2.68%
   Scratch Memory Size: 5986304 -> 6042624 (+0.94%); split: -1.40%, +2.34%
   Max live registers: 48898858 -> 48881655 (-0.04%); split: -0.05%, +0.01%
   Non SSA regs after NIR: 172397792 -> 167577380 (-2.80%); split: -2.80%, +0.00%

   Totals from 451003 (80.87% of 557667) affected shaders:
   Instrs: 134111754 -> 133665578 (-0.33%); split: -0.36%, +0.03%
   Subgroup size: 9039104 -> 9039088 (-0.00%)
   Send messages: 6127775 -> 5986879 (-2.30%); split: -2.30%, +0.00%
   Cycle count: 20306336726 -> 20433112106 (+0.62%); split: -1.19%, +1.81%
   Spill count: 56230 -> 56153 (-0.14%); split: -1.92%, +1.78%
   Fill count: 112920 -> 114144 (+1.08%); split: -1.97%, +3.06%
   Scratch Memory Size: 3769344 -> 3825664 (+1.49%); split: -2.23%, +3.72%
   Max live registers: 43750259 -> 43733056 (-0.04%); split: -0.05%, +0.01%
   Non SSA regs after NIR: 158449343 -> 153628931 (-3.04%); split: -3.04%, +0.00%

   In particular, sends get cut by 20.85% for Borderlands 3 DX12, 13.82%
   on Cyberpunk 2077, 10.75% on Strange Brigade, and 10.20% on Red Dead
   Redemption 2.  Yet spills and fills remain about the same.

fossil-db results on Alchemist are similar though not quite as good.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
Kenneth Graunke 2024-09-09 17:59:50 -07:00 committed by Marge Bot
parent f88eb48ff2
commit 6fd10a6620

@@ -1421,7 +1421,7 @@ brw_nir_should_vectorize_mem(unsigned align_mul, unsigned align_offset,
     * those back into 32-bit ones anyway and UBO loads aren't split in NIR so
     * we don't want to make a mess for the back-end.
     */
-   if (bit_size > 32 || hole_size || !nir_num_components_valid(num_components))
+   if (bit_size > 32)
       return false;
    if (low->intrinsic == nir_intrinsic_load_ubo_uniform_block_intel ||
@@ -1429,14 +1429,14 @@ brw_nir_should_vectorize_mem(unsigned align_mul, unsigned align_offset,
        low->intrinsic == nir_intrinsic_load_shared_uniform_block_intel ||
        low->intrinsic == nir_intrinsic_load_global_constant_uniform_block_intel) {
       if (num_components > 4) {
-         if (!util_is_power_of_two_nonzero(num_components))
-            return false;
          if (bit_size != 32)
             return false;
          if (num_components > 32)
             return false;
+         if (hole_size > 4 * (8 - low->num_components))
+            return false;
       }
    } else {
       /* We can handle at most a vec4 right now. Anything bigger would get
@@ -1444,6 +1444,9 @@ brw_nir_should_vectorize_mem(unsigned align_mul, unsigned align_offset,
        */
       if (num_components > 4)
          return false;
+      if (hole_size > 4)
+         return false;
    }