From 4c366ef67bd9e45618bb16826fc2f60e2f949892 Mon Sep 17 00:00:00 2001 From: Paulo Zanoni Date: Tue, 27 Aug 2024 10:27:35 -0700 Subject: [PATCH] anv/trtt: set every entry to NULL when we create an L2 table When we create sparse resources the first thing we do is a NULL bind on them, as the Vulkan spec mandates certain behavior even for unbound sparse resources. We do this with the minimal effort possible: if we can get away with marking an L2 pointer as NULL in the L3 table, we just do it and return, instead of going all the way to creating L1 tables and marking all the final entries as NULL. The strategy we were using had a bug that could lead to previously created NULL entries not being marked as NULL anymore. Let's give an example: (before proceeding, keep in mind that a NULL entry in the L3 and L2 tables has bit 1 set, it does *not* have the value 0) - Create a 64mb buffer that uses an entire L1 table (needs to be properly aligned), which triggers a NULL bind. - Our algorithm will just set the L3 entry (pointing to the L2 table) as NULL. - Create a 64kb buffer that uses the same L2 table (but a different L1 table). - The NULL bind triggered won't do anything as the L2 table is already NULL. - Bind the first buffer to actual memory. This will end up creating the L2 table and the L1 table. The only entry we will set in the L2 table will be the one pointing to the L1 table. All the other values will be 0 (so they won't have neither the NULL or Invalid bits set: access to them will lead to page faults). - Try to use the second buffer, which is still unbound. It was relying on the fact that its L2 table pointer was NULL, but now it's not anymore, so the page walker will fetch the L1 entries in the L2 table and they will all be zero instead of having the NULL bit set. The fix is pretty simple: whenever we create a new L2 table, set every entry to NULL (except the one we're about to set to non-NULL). This preserves behavior for every other NULL resource relying on the L3 entry being set to NULL. We don't need to do this for the L1 table because its entries are different and instead of having bits to signal NULL entries we have a special TR-TT register that we can set that gets compared to check if an entry is NULL, and we conveniently program it to 0: see ANV_TRTT_L1_NULL_TILE_VAL. I am not aware of any real workloads that are triggering this behavior, I found this issue while investigating something else, running a custom sparse program in our pre-silicon environment, and it told us about the page faults. Cc: mesa-stable Reviewed-by: Lionel Landwerlin Signed-off-by: Paulo Zanoni Part-of: --- src/intel/vulkan/anv_sparse.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/src/intel/vulkan/anv_sparse.c b/src/intel/vulkan/anv_sparse.c index 97816c24e8a..5b4275df4f1 100644 --- a/src/intel/vulkan/anv_sparse.c +++ b/src/intel/vulkan/anv_sparse.c @@ -527,6 +527,20 @@ anv_trtt_bind_add(struct anv_device *device, anv_trtt_bind_list_add_entry(l3l2_binds, n_l3l2_binds, trtt->l3_addr + l3_index * sizeof(uint64_t), l2_addr); + + /* We have just created a new L2 table. Other resources may already have + * been pointing to this L2 table relying on the fact that it was marked + * as NULL, so now we need to mark every one of its entries as NULL in + * order to preserve behavior for those entries. + */ + for (int i = 0; i < 512; i++) { + if (i != l2_index) { + trtt->l2_mirror[l3_index * 512 + i] = ANV_TRTT_L3L2_NULL_ENTRY; + anv_trtt_bind_list_add_entry(l3l2_binds, n_l3l2_binds, + l2_addr + i * sizeof(uint64_t), + ANV_TRTT_L3L2_NULL_ENTRY); + } + } } assert(l2_addr != 0 && l2_addr != ANV_TRTT_L3L2_NULL_ENTRY); @@ -697,6 +711,11 @@ anv_sparse_bind_trtt(struct anv_device *device, * numbers are so small that a little overestimation won't hurt. * * We have assertions below to catch estimation errors. + * + * TODO: a bug fix caused us to put that "+ 1024" in the l3l2 capacity so + * now our estimations are super overestimating things in most cases, as + * most cases are still using a total capacity of just 1 or 2. We should + * replace this whole thing with something more efficient. */ int l3l2_binds_capacity = 1; int l1_binds_capacity = 0; @@ -704,7 +723,7 @@ anv_sparse_bind_trtt(struct anv_device *device, assert(sparse_submit->binds[b].size % (64 * 1024) == 0); int pages = sparse_submit->binds[b].size / (64 * 1024); l1_binds_capacity += pages; - l3l2_binds_capacity += (pages / 1024 + 1) * 2; + l3l2_binds_capacity += (pages / 1024 + 1) * 2 + 1024; } /* Turn a series of virtual address maps, into a list of L3/L2/L1 TRTT page