If we're going to have valgrind verify state streams then we need to ensure
that once we choose a pointer into a block we always use that pointer until
the block is freed. I was trying to do this with the "current_map" thing.
However, that breaks down because you have to use the map from the block
pool to get to the stream_block to get at current_map. Instead, this
commit changes things to track the stream_block by pointer instead of by
offset into the block pool.
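Roughly, the shape of the change is to store the block pointer in the stream
rather than a pool offset; a minimal sketch (the struct layout and field names
here are illustrative, not the driver's exact ones):

    #include <stdint.h>

    struct anv_block_pool;
    struct anv_state_stream_block;

    struct example_state_stream {
       struct anv_block_pool *block_pool;

       /* Previously: a uint32_t offset into the block pool, which forced a
        * trip through the pool's map to reach current_map.  Keeping the
        * block pointer itself means we use one pointer for the block's
        * whole lifetime, which is what valgrind needs to track it. */
       struct anv_state_stream_block *block;

       /* Allocation cursor within the current block. */
       uint32_t next;
       uint32_t end;
    };
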
When I first did the valgrindifying for stream allocators, I misunderstood
some things about valgrind's expectations for NOACCESS and UNDEFINED.
First off, valgrind expects things to be marked NOACCESS before you
allocate out of them. Since our blocks came from a pool backed by a
mmapped memfd, they came in as UNDEFINED; we needed to mark them as
NOACCESS. Also, I didn't realize that VALGRIND_MEMPOOL_CHANGE only updated
the mempool allocation state and didn't actually change definedness; we had
to add a VALGRIND_MAKE_MEM_UNDEFINED to get rid of the NOACCESS on the
newly allocated portion.
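For reference, the ordering valgrind wants looks roughly like this (these are
the standard valgrind/memcheck client requests; the function names and
parameters are made up for illustration):

    #include <stddef.h>
    #include <valgrind/valgrind.h>
    #include <valgrind/memcheck.h>

    /* A fresh block from the mmapped memfd pool shows up as UNDEFINED.
     * Mark it NOACCESS so that allocating out of it is what makes bytes
     * addressable, not the mmap itself. */
    static void
    example_block_init(void *block, size_t block_size)
    {
       VALGRIND_MAKE_MEM_NOACCESS(block, block_size);
    }

    /* VALGRIND_MEMPOOL_CHANGE only updates memcheck's idea of the
     * allocation's extent; it does not change definedness, so follow it
     * with MAKE_MEM_UNDEFINED on the newly exposed portion to clear the
     * NOACCESS state. */
    static void
    example_stream_grow(void *pool, void *alloc_start,
                        size_t old_size, size_t new_size)
    {
       VALGRIND_MEMPOOL_CHANGE(pool, alloc_start, alloc_start, new_size);
       VALGRIND_MAKE_MEM_UNDEFINED((char *)alloc_start + old_size,
                                   new_size - old_size);
    }
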
Regular objects are created I915_CACHING_CACHED on LLC platforms and
I915_CACHING_NONE on non-LLC platforms. However, userptr objects are
always created as I915_CACHING_CACHED, which on non-LLC means
snooped. That can be useful but comes with a bit of overhead. Since
we're explicitly clflushing and don't want the overhead, we need to turn
it off.
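Turning snooping back off is a single SET_CACHING ioctl on the userptr BO; a
sketch with error handling elided, where the fd and handle stand in for
whatever the driver already has:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* On non-LLC parts a userptr object defaults to I915_CACHING_CACHED
     * (snooped).  Since we clflush explicitly anyway, drop it back to
     * I915_CACHING_NONE to avoid the snoop overhead. */
    static int
    example_disable_snooping(int fd, uint32_t gem_handle)
    {
       struct drm_i915_gem_caching caching;
       memset(&caching, 0, sizeof(caching));
       caching.handle = gem_handle;
       caching.caching = I915_CACHING_NONE;

       return ioctl(fd, DRM_IOCTL_I915_GEM_SET_CACHING, &caching);
    }
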
We assert that the block offset we got while walking the list of blocks is
actually a multiple of the block size. If something goes wrong and the GPU
decides to stomp on the surface state buffer we can end up getting
corruptions in our list of blocks. This assertion makes such corruptions a
crash with a meaningful message rather than an infinite loop.
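The check itself is just an alignment assert on the offset pulled out of the
list; roughly (names are placeholders):

    #include <assert.h>
    #include <stdint.h>

    /* If the GPU stomps the surface state buffer and corrupts the block
     * list, the offset we read back is almost certainly no longer
     * block-aligned, so crash with a message instead of looping. */
    static inline void
    example_check_block_offset(uint32_t offset, uint32_t block_size)
    {
       assert(offset % block_size == 0);
    }
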
This allows us to allocate from either side of the block pool in a
consistent way. If you use the previous block_pool_alloc function, you
will get offsets from the start of the pool as normal. If you use the new
block_pool_alloc_back function, you will get a negative index that
corresponds to something in the "back" of the pool.
This has the unfortunate side-effect of making it so that we can't have a
block pool bigger than 1GB. However, that's unlikely to happen and, for
the sake of bi-directional block pools, we need negative offsets.
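A sketch of the two entry points, ignoring the atomic-update and growth
details handled elsewhere (the struct fields and names are illustrative):

    #include <stdint.h>

    struct example_block_pool {
       uint32_t block_size;
       /* Offsets are signed and relative to the center of the pool's bo,
        * which is why the pool is limited to 1GB in either direction. */
       int32_t state_next;       /* grows upward: 0, +block_size, ...   */
       int32_t back_state_next;  /* grows downward: -block_size, ...    */
    };

    /* Allocate from the front: returns non-negative offsets as before. */
    static int32_t
    example_block_pool_alloc(struct example_block_pool *pool)
    {
       int32_t offset = pool->state_next;
       pool->state_next += pool->block_size;
       return offset;
    }

    /* Allocate from the back: returns negative offsets into the "back"
     * half of the pool. */
    static int32_t
    example_block_pool_alloc_back(struct example_block_pool *pool)
    {
       pool->back_state_next -= pool->block_size;
       return pool->back_state_next;
    }
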
We don't have any locking issues yet because we use the pool size itself as
a mutex in block_pool_alloc to guarantee that only one thread is resizing
at a time. However, we are about to add support for growing the block pool
at both ends. This introduces two potential races:
1) You could have two block_pool_alloc() calls that both try to grow the
block pool, one from each end.
2) The relocation handling code will now have to think about not only the
bo that we use for the block pool but also the offset from the start of
that bo to the center of the block pool. It's possible that the block
pool growing code could race with the relocation handling code and get
a bo and offset out of sync.
Grabbing the device mutex solves both of these problems. Thanks to (2), we
can't really do anything more granular.
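In code this just means bracketing the grow path with the device's mutex; a
sketch assuming a pthread mutex on the device and a stand-in grow function:

    #include <pthread.h>

    struct example_block_pool;
    void example_block_pool_grow(struct example_block_pool *pool);

    /* Growing either end moves the center of the pool within its bo, and
     * the relocation code has to see the bo and the center offset as a
     * consistent pair, so take the coarse device-level lock around the
     * whole resize. */
    static void
    example_block_pool_grow_locked(pthread_mutex_t *device_mutex,
                                   struct example_block_pool *pool)
    {
       pthread_mutex_lock(device_mutex);
       example_block_pool_grow(pool);
       pthread_mutex_unlock(device_mutex);
    }
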
The anv_block_pool data structure suffered from the exact same race as the
state pool. Namely, that the uniqueness of the blocks handed out depends
on the next_block value increasing monotonically. However, this invariant
did not hold thanks to our block "return" concept.
The previous algorithm had a race because of the way we were using
__sync_fetch_and_add for everything. In particular, the concept of
"returning" over-allocated states in the "next > end" case was completely
bogus. If too many threads were hitting the state pool at the same time,
it was possible to have the following sequence:
A: Get an offset (next == end)
B: Get an offset (next > end)
A: Resize the pool (now next < end by a lot)
C: Get an offset (next < end)
B: Return the over-allocated offset
D: Get an offset
in which case D will get the same offset as C. The solution to this race
is to get rid of the concept of "returning" over-allocated states.
Instead, the thread that gets a new block simply sets the next and end
offsets directly and threads that over-allocate don't return anything and
just futex-wait. Since you can only ever hit the over-allocate case if
someone else hit the "next == end" case and hasn't resized yet, you're
guaranteed that the end value will get updated and the futex won't block
forever.
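A condensed sketch of the resulting algorithm (the union layout and futex
helpers are illustrative, and grow() stands in for the real pool-growing
function):

    #include <stdint.h>
    #include <limits.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* next and end share one 64-bit word so the growing thread can publish
     * both with a single atomic swap.  On little-endian, adding the block
     * size to u64 bumps only next (the 1GB pool limit keeps the add from
     * ever carrying into end). */
    union example_block_state {
       struct {
          uint32_t next;
          uint32_t end;
       };
       uint64_t u64;
    };

    static void
    example_futex_wait(uint32_t *addr, uint32_t expected)
    {
       syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }

    static void
    example_futex_wake(uint32_t *addr)
    {
       syscall(SYS_futex, addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }

    static uint32_t
    example_block_pool_alloc(union example_block_state *state,
                             uint32_t block_size, uint32_t (*grow)(void))
    {
       while (1) {
          union example_block_state old, new;

          old.u64 = __sync_fetch_and_add(&state->u64, block_size);
          if (old.next + block_size <= old.end) {
             /* Common case: the block we claimed fits in the pool. */
             return old.next;
          } else if (old.next <= old.end) {
             /* We claimed the first block past the end, so we own the
              * resize.  Set next and end directly and wake any waiters. */
             new.next = old.next + block_size;
             do {
                new.end = grow();
             } while (new.end < new.next);

             __sync_lock_test_and_set(&state->u64, new.u64);
             example_futex_wake(&state->end);
             return old.next;
          } else {
             /* Over-allocated: someone else owns the resize.  Nothing is
              * "returned"; just wait for end to change and try again. */
             example_futex_wait(&state->end, old.end);
          }
       }
    }
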
We have pools, so we should be using them. Also, I think this will help
keep valgrind from getting confused when our allocations end up interleaved
with system allocations such as those from malloc/free and mmap/munmap.
Jason started the task by creating anv_cmd_buffer.c and anv_cmd_emit.c.
This patch finishes the task by renaming all other files except
gen*_pack.h and glsl_scraper.py.