nir: Allow large overfetching holes in the load store vectorizer

The load_*_uniform_block_intel intrinsics always load either 8x or 16x
32-bit components worth of data (so 32 byte increments).  This leads to
cases where we load a few components from one vec8, followed by a few
components of an adjacent vec8.  We want to combine those into a vec16
load, as that loads a whole cacheline at a time, and requires less hoops
to calculate addresses and request memory loads.

So, we allow 7 * 4 = 28 bytes of holes, which handles vec8+vec8 where
only the .x component is read.

Most drivers and intrinsics will not want such large holes.  I thought
about adding a per-intrinsic max_hole to the core code, but decided that
since we already have driver callbacks, we can just rely on them to
reject what makes sense to them.

No driver callbacks currently allow holes, so this should not currently
affect any drivers.  But any work in progress branches may need to be
updated to reject larger holes.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
This commit is contained in:
Kenneth Graunke 2024-11-19 01:30:28 -08:00 committed by Marge Bot
parent 01680a66a9
commit 5712fc48a9
2 changed files with 4 additions and 5 deletions

View file

@ -1326,13 +1326,14 @@ vectorize_sorted_entries(struct vectorize_ctx *ctx, nir_function_impl *impl,
struct entry *second = low->index < high->index ? high : low;
uint64_t diff = high->offset_signed - low->offset_signed;
/* Allow overfetching by 4 bytes, which can be rejected
* by the callback if needed.
/* Allow overfetching by 28 bytes, which can be rejected by the
* callback if needed. Driver callbacks will likely want to
* restrict this to a smaller value, say 4 bytes (or none).
*/
unsigned max_hole =
first->is_store ||
(ctx->options->has_shared2_amd &&
get_variable_mode(first) == nir_var_mem_shared) ? 0 : 4;
get_variable_mode(first) == nir_var_mem_shared) ? 0 : 28;
unsigned low_size = get_bit_size(low) / 8u * low->num_components;
bool separate = diff > max_hole + low_size;

View file

@ -346,8 +346,6 @@ bool nir_load_store_vectorize_test::mem_vectorize_callback(
{
nir_load_store_vectorize_test *test = (nir_load_store_vectorize_test *)data;
assert(hole_size <= 4);
if (hole_size > test->max_hole_size ||
(!test->overfetch && !nir_num_components_valid(num_components)))
return false;