All block pools are allocated with the same flags. There's no good
reason for this to be configurable.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This makes a number of changes to the current API:
1. Everything is renamed to anv_device_* instead of anv_bo_cache_*
because the BO cache is soon going to be the sole BO allocation path
and not some special case to make import/export work.
2. Drop the cache parameter. It's totally redundant with the device
and just annoying to keep typing.
3. Rework the flags so that they go in the direction that's convenient
for usage in ANV rather than whichever awkward way i915 specified them
for backwards compatibility. This also gives us the opportunity to set
some defaults.
4. Add flags for mapping and coherency (see the sketch below).
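As a rough sketch of the direction this is heading (the names and flag
spellings below are illustrative, not necessarily the final ones):

    #include <stdint.h>

    /* Illustrative declarations only: allocation hangs off the device,
     * the cache parameter is gone, and the flags are ANV-oriented with
     * sane defaults. */
    typedef int32_t VkResult;   /* stand-in for the real Vulkan type */
    struct anv_device;
    struct anv_bo;

    enum anv_bo_alloc_flags {
       ANV_BO_ALLOC_MAPPED   = (1 << 0),  /* CPU-map the BO at allocation */
       ANV_BO_ALLOC_SNOOPED  = (1 << 1),  /* coherent even on non-LLC parts */
       ANV_BO_ALLOC_EXTERNAL = (1 << 2),  /* may be shared with other processes */
    };

    VkResult anv_device_alloc_bo(struct anv_device *device, uint64_t size,
                                 enum anv_bo_alloc_flags flags,
                                 struct anv_bo **bo_out);
    void anv_device_release_bo(struct anv_device *device, struct anv_bo *bo);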
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
The growing algorithms for the softpin case and the userptr version are
almost entirely different. Having this weird join doesn't make the code
more comprehensible. This rework does a few things:
1. Move the comment about 48-bit addresses to anv_device_init where we
actually unset the EXEC_OBJECT_SUPPORTS_48B_ADDRESS flag.
2. Separate the paths in anv_block_pool_expand_range so it's easier to
see what happens in the two different cases.
3. Use the anv_block_pool::bos array for storing all allocated BOs in
both paths rather than using the cleanup list. This lets the cleanups
array be used only for mmaps of the memfd in the userptr case.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Instead of depending on a mutable BO in the state pool for handling
growing state pools, add a concept of "wrapper" BOs which just wrap an
actual BO. This way, the wrapper can exist once for all of time and we
can put it in relocation lists even if the actual BO it references gets
swapped out.
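A minimal sketch of the idea (the field names are illustrative; the real
struct anv_bo has many more fields): the wrapper is itself a BO whose map
field points at the BO it wraps, and the submit path unwraps it.

    #include <stdbool.h>
    #include <stdint.h>

    struct sketch_bo {
       uint32_t gem_handle;
       void *map;          /* for a wrapper, points at the wrapped BO */
       bool is_wrapper;
    };

    /* Follow wrapper BOs until we reach the BO that actually backs the
     * pool. Relocation lists can hold the wrapper forever, even if the
     * BO it references is swapped out when the pool grows. */
    static struct sketch_bo *
    sketch_bo_unwrap(struct sketch_bo *bo)
    {
       while (bo->is_wrapper)
          bo = bo->map;
       return bo;
    }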
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
We're not THAT strapped for space that we can't burn one extra bit for
a boolean. If we're really worried about it, we can always shrink the
flags field to 16 bits because the kernel only uses 7 currently.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
It has exactly one caller, and we're about to change some of the dynamics
in a way that would make this confusing as a separate function.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This lets us do less allocation because the anv_bos are now embedded in
the sparse array, and it also allows lock-free translation from GEM
handle to BO, which will be useful in future commits.
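The lookup then becomes something like the sketch below (assuming
util_sparse_array from src/util with its init/get helpers; the
surrounding cache struct and function names are illustrative):

    #include <stdint.h>
    #include "util/sparse_array.h"

    /* Stand-ins for the real anv_bo / BO cache from anv_private.h. */
    struct sketch_bo {
       uint32_t gem_handle;
       uint64_t size;
    };

    struct sketch_bo_cache {
       struct util_sparse_array bo_map;  /* sketch_bo embedded, indexed by handle */
    };

    static void
    sketch_bo_cache_init(struct sketch_bo_cache *cache)
    {
       util_sparse_array_init(&cache->bo_map, sizeof(struct sketch_bo), 1024);
    }

    static struct sketch_bo *
    sketch_lookup_bo(struct sketch_bo_cache *cache, uint32_t gem_handle)
    {
       /* Elements never move once created, so readers need no lock. */
       return util_sparse_array_get(&cache->bo_map, gem_handle);
    }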
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
From the MEDIA_VFE_STATE docs:
"Starting with this configuration, the Maximum Number of Threads must
be set to (#EU * 8) for GPGPU dispatches.
Although there are only 7 threads per EU in the configuration, the
FFTID is calculated as if there are 8 threads per EU, which in turn
requires a larger amount of Scratch Space to be allocated by the
driver."
It's pretty clear that we need to increase this for scratch address
calculations, because the FFTID is calculated as if there were 8
threads per EU. The quote above seems to indicate that we should
increase the actual thread count programmed in MEDIA_VFE_STATE as well,
but we think the intention is only to bump the scratch space.
Fixes GPU hangs in Bioshock Infinite and Synmark's CSDof on Icelake 8x8.
Fixes: 5ac804bd9a ("intel: Add a preliminary device for Ice Lake")
Reviewed-by: Matt Turner <mattst88@gmail.com>
Move the Weston os_create_anonymous_file code from egl/wayland into
util, add support for Linux memfd and FreeBSD SHM_ANON, and use that
code in anv/aubinator instead of explicit memfd calls, for portability.
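For illustration, the portable fallback chain looks roughly like the
sketch below (simplified; the real util helper also takes a debug name
and handles more error paths):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static int
    sketch_create_anonymous_file(off_t size)
    {
       int fd = -1;

    #if defined(__linux__) && defined(MFD_CLOEXEC)
       fd = memfd_create("sketch-anon", MFD_CLOEXEC);            /* Linux memfd */
    #elif defined(SHM_ANON)
       fd = shm_open(SHM_ANON, O_CREAT | O_RDWR | O_EXCL, 0600); /* FreeBSD */
    #endif
       if (fd < 0) {
          /* Portable fallback: an unlinked temporary file. */
          char tmpl[] = "/tmp/sketch-anon-XXXXXX";
          fd = mkstemp(tmpl);
          if (fd >= 0)
             unlink(tmpl);
       }
       if (fd >= 0 && ftruncate(fd, size) < 0) {
          close(fd);
          return -1;
       }
       return fd;
    }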
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
When using softpin, the first allocation was not calculating the
padding and offset correctly when that first allocation needed to grow
the pool. We were missing initializing state.end right after expanding
the pool for the first time.
This is not a problem for the non-softpin case, since there we don't
use the leftover padding, so the ends re-arrange incrementally.
This fixes running dEQP-VK.ssbo.phys.layout.random.16bit.scalar.13 on
SKL -- the test uses a shader larger than the initial size of the
instruction pool.
Fixes: dfc9ab2ccd "anv/allocator: Add padding information."
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We didn't notice this issue much because the 2 structs share a similar
layout, except for the additional fields...
We ran into that issue in Anv:
==15236== Invalid write of size 8
==15236== at 0x8CF3939C: anv_state_table_expand_range (anv_allocator.c:211)
==15236== by 0x8CF394D5: anv_state_table_grow (anv_allocator.c:264)
==15236== by 0x8CF3967E: anv_state_table_add (anv_allocator.c:312)
==15236== by 0x8CF3B13C: anv_state_pool_alloc_no_vg (anv_allocator.c:1167)
==15236== by 0x8CF3B2B0: anv_state_pool_alloc (anv_allocator.c:1190)
==15236== by 0x8CF60871: alloc_surface_state (anv_image.c:1122)
==15236== by 0x8CF61FF9: anv_CreateImageView (anv_image.c:1519)
==15236== by 0x8BCBD2ED: vkCreateImageView (trampoline.c:1358)
==15236== Address 0x8994ef10 is 0 bytes after a block of size 128 alloc'd
==15236== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15236== by 0x8D2578E6: u_vector_init (u_vector.c:47)
==15236== by 0x8CF3929A: anv_state_table_init (anv_allocator.c:168)
==15236== by 0x8CF3A99A: anv_state_pool_init (anv_allocator.c:921)
==15236== by 0x8CF56517: anv_CreateDevice (anv_device.c:1909)
==15236== by 0x8BCB4FBA: terminator_CreateDevice (loader.c:6073)
==15236== by 0x8DD2CB3D: ??? (in /home/djdeath/.steam/ubuntu12_64/libVkLayer_steam_fossilize.so)
==15236== by 0x8DF4D241: vkCreateDevice (in /home/djdeath/.steam/ubuntu12_64/steamoverlayvulkanlayer.so)
==15236== by 0x8BCB35C6: loader_create_device_chain (loader.c:5449)
==15236== by 0x8BCBC230: vkCreateDevice (trampoline.c:838)
v2: Rename mmap_cleanups to avoid confusion (Caio)
v3: s/fail_mmap_cleanups/fail_cleanups/ (Caio)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110648
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Use anv_gem_munmap to unmap when softpin is in use; this corresponds to
the anv_gem_mmap used in anv_block_pool_expand_range. This fixes
valgrind errors seen for each pool when softpin is in use:
==25581== 262,144 bytes in 1 blocks are definitely lost in loss record 31 of 31
==25581== at 0x50E77E8: anv_gem_mmap (anv_gem.c:96)
==25581== by 0x50EEE2B: anv_block_pool_expand_range (anv_allocator.c:543)
==25581== by 0x50EEB51: anv_block_pool_init (anv_allocator.c:477)
==25581== by 0x50EF7EF: anv_state_pool_init (anv_allocator.c:920)
==25581== by 0x510B8EB: anv_CreateDevice (anv_device.c:2031)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Accessing bo->map and then pool->center_bo_offset without a lock is
racy. One way of avoiding such a race condition is to store bo->map +
center_bo_offset into pool->map at the time the block pool is growing,
which happens under a lock.
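A minimal sketch of the idea, with illustrative names (the real code
computes this inside the existing pool lock when the pool grows):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct sketch_block_pool {
       pthread_mutex_t mutex;
       void *bo_map;              /* CPU map of the backing BO */
       uint32_t center_bo_offset;
       void *map;                 /* bo_map + center_bo_offset, set under lock */
       bool use_softpin;
    };

    static void
    sketch_block_pool_grow(struct sketch_block_pool *pool, void *new_bo_map,
                           uint32_t new_center_bo_offset)
    {
       pthread_mutex_lock(&pool->mutex);
       pool->bo_map = new_bo_map;
       if (!pool->use_softpin) {
          pool->center_bo_offset = new_center_bo_offset;
          /* Readers use pool->map directly and never combine a stale
           * bo_map with a new center_bo_offset (or vice versa). */
          pool->map = (char *)pool->bo_map + new_center_bo_offset;
       }
       pthread_mutex_unlock(&pool->mutex);
    }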
v2: Only set pool->map if not using softpin (Jason).
v3: Move things around and only update center_bo_offset if not using
softpin too (Jason).
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reported-by: Ian Romanick <idr@freedesktop.org>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109442
Fixes: fc3f588320
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
If softpin is supported, create new BOs for the required size and add the
respective BO maps. The other main change of this commit is that
anv_block_pool_map() now returns the map for the BO that the given
offset is part of. So there's no block_pool->map access anymore (when
softpin is used).
v3:
- set fd to -1 on softpin case (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We are not going to use userptr for anv block pool BOs anymore. However,
so far we have been relying on the fact that userptr BOs are snooped on
non-LLC platforms. Let's make sure that the block pool BOs are still
snooped, so that we can also remove the clflush'ing that we do on all
state buffers.
And since we plan to remove the flushes, set the anv_bo_pool BOs to
cached (snooped on non-LLC platforms) too. For LLC platforms, they are
all cached by default, so this becomes a no-op.
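For reference, making a freshly created BO snooped boils down to the
I915_GEM_SET_CACHING ioctl (anv wraps this in its own gem helper; the
raw ioctl is shown here as a sketch):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>   /* mesa uses its drm-uapi copy of this header */

    static int
    sketch_gem_set_caching(int fd, uint32_t gem_handle)
    {
       struct drm_i915_gem_caching caching = {
          .handle = gem_handle,
          .caching = I915_CACHING_CACHED,  /* snooped on non-LLC platforms */
       };
       return ioctl(fd, DRM_IOCTL_I915_GEM_SET_CACHING, &caching);
    }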
v5:
- Add snooping to anv_bo_pool BOs too (Jason).
- Remove anv_gem_set_domain.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
It's possible that we still have some space left in the block pool, but
we try to allocate a state larger than that remaining space. This means
such a state would start somewhere within the range of the old
block_pool, and end after that range, within the range of the new size.
That's fine when we use userptr, since the memory in the block pool is
CPU mapped contiguously. However, by the end of this series, we will
have the block_pool split into different BOs, with different CPU
mapping ranges that are not necessarily contiguous. So we must avoid
the case where a given state spans two different BOs in the block
pool.
This commit solves the issue by detecting that we are growing the
block_pool even though we are not at the end of the range. If that
happens, we don't use the space left at the end of the old size, and
consider it as "padding" that can't be used in the allocation. We update
the size requested from the block pool to take the padding into account,
and return the offset after the padding, which happens to be at the
start of the new address range.
Additionally, we return the amount of padding we used, so the caller
knows that this happens and can return that padding back into a list of
free states, that can be reused later. This way we hopefully don't waste
any space, but also avoid having a state split between two different
BOs.
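In rough pseudo-C, the allocation path does something like the sketch
below (illustrative names, not the exact code):

    #include <stdint.h>

    struct sketch_alloc {
       uint32_t offset;    /* offset actually handed to the caller */
       uint32_t padding;   /* leftover space the caller can return later */
    };

    /* If the requested state would straddle the old end of the pool,
     * turn the leftover space into padding and place the state at the
     * start of the newly grown range. The pool is then asked for
     * size + padding bytes starting at 'next'. */
    static struct sketch_alloc
    sketch_alloc_new(uint32_t old_end, uint32_t next, uint32_t size)
    {
       struct sketch_alloc a = { .offset = next, .padding = 0 };

       if (next < old_end && next + size > old_end) {
          a.padding = old_end - next;
          a.offset = old_end;
       }
       return a;
    }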
v3:
- Calculate offset + padding at anv_block_pool_alloc_new (Jason).
v4:
- Remove extra "leftover".
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This commit tries to rework the code that splits chunks and returns
them back to the state pool, while still keeping the same logic.
The original code would get a chunk larger than we need and split it
into pool->block_size. Then it would return all but the first one, and
would split that first one into alloc_size chunks. Then it would keep
the first one (for the allocation), and return the others back to the
pool.
The new anv_state_pool_return_chunk() function will take a chunk (with
the alloc_size part removed), and a small_size hint. It then splits that
chunk into pool->block_size'd chunks and, if there's some space still
left, splits that into small_size chunks. small_size in this case is
the same size as alloc_size.
The idea is to keep the same logic, but make it in a way we can reuse it
to return other chunks to the pool when we are growing the buffer.
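A simplified sketch of the splitting order (illustrative names; the real
function works on the state pool's free lists):

    #include <stdint.h>

    struct sketch_pool {
       uint32_t block_size;
    };

    /* Illustrative stub: return 'count' states of 'state_size' bytes,
     * starting at 'offset', to the free list for that size. */
    static void
    sketch_return_blocks(struct sketch_pool *pool, uint32_t offset,
                         uint32_t state_size, uint32_t count)
    {
       (void)pool; (void)offset; (void)state_size; (void)count;
    }

    /* Hand back whole pool->block_size blocks first, then split whatever
     * is left (less than one block) into small_size'd states. Nothing
     * returned is ever larger than pool->block_size. */
    static void
    sketch_return_chunk(struct sketch_pool *pool, uint32_t offset,
                        uint32_t size, uint32_t small_size)
    {
       uint32_t nblocks = size / pool->block_size;
       uint32_t rest = size - nblocks * pool->block_size;

       if (nblocks > 0)
          sketch_return_blocks(pool, offset, pool->block_size, nblocks);

       offset += nblocks * pool->block_size;
       if (rest > 0 && small_size > 0)
          sketch_return_blocks(pool, offset, small_size, rest / small_size);
    }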
v2:
- Include Jason's suggestions to the algorithm that returns chunks.
- Update comments.
v3:
- Disallow returning 0 blocks (Jason).
- fix min_size in the loop (Jason).
- remove temporary variables (Jason)
v4:
- return_chunk() should never return blocks larger than
pool->block_size.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
So far we use only one BO (the last one created) in the block pool. When
we switch to not use the userptr API, we will need multiple BOs. So add
code now to store multiple BOs in the block pool.
This has several implications, the main one being that we can't use
pool->map as before. For that reason we update the getter to find which
BO a given offset is part of, and return the respective map.
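The updated getter then walks the BOs, roughly like the sketch below
(simplified, with illustrative names; the real code also keeps the old
single-map path for the non-softpin case):

    #include <stdint.h>

    #define SKETCH_MAX_BOS 16

    struct sketch_bo {
       void *map;
       uint32_t size;
    };

    struct sketch_block_pool {
       struct sketch_bo bos[SKETCH_MAX_BOS];
       uint32_t nbos;
    };

    /* Find which BO backs a given pool offset and return the CPU map at
     * that offset. */
    static void *
    sketch_block_pool_map(struct sketch_block_pool *pool, uint32_t offset)
    {
       uint32_t bo_start = 0;

       for (uint32_t i = 0; i < pool->nbos; i++) {
          struct sketch_bo *bo = &pool->bos[i];
          if (offset < bo_start + bo->size)
             return (char *)bo->map + (offset - bo_start);
          bo_start += bo->size;
       }
       return NULL;   /* offset outside the pool */
    }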
v3:
- Simplify anv_block_pool_map (Jason).
- Use fixed size array for anv_bo's (Jason)
v4:
- Respect the order (item, container) in anv_block_pool_foreach_bo
(Jason).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Change block_pool->bo to be a pointer, and update its usage everywhere.
This makes it simpler to switch it later to a list of BOs.
v3:
- Use a static "bos" field in the struct, instead of malloc'ing it.
This will be later changed to a fixed length array of BOs.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
After switching to using anv_state_table, there are very few places left
still using pool->map directly. We want to avoid that because it won't
always be the right map once we split the pool into multiple BOs.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Use anv_state_pool_return_blocks() to return blocks to the pool, instead
of manually pushing them.
v3:
- return blocks from the end of the chunk (Jason).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The use of anv_state_table_add() combined with anv_state_table_push(),
especially when adding a bunch of states to the table, is very verbose.
So we add this helper that makes things easier to digest.
We also already add the anv_state_table member in this commit, so things
can compile properly, even though it's not used.
v2: assert that the states are always aligned to their size (Jason)
v3: Add "table" member to anv_state_pool in this commit.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We will need anv_block_pool_map to find the map relative to some BO
that is not at the start of the block pool.
v2: just return a pointer instead of a struct (Jason)
v4: Update comment (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Add a structure to hold anv_states. This table will initially be used to
recycle anv_states, instead of relying on a linked list implemented in
GPU memory. Later it could be used so that all anv_states just point to
the content of this struct, instead of making copies of anv_states
everywhere.
One has to call anv_state_table_add(), which returns an index for the
state in the table, then get a pointer to that index, and finally fill
in the rest of the struct.
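Usage ends up looking roughly like the sketch below (a fragment against
the helpers described above; the exact signatures may differ slightly):

    static VkResult
    sketch_alloc_state(struct anv_state_pool *pool, int32_t offset,
                       uint32_t alloc_size, struct anv_state **out)
    {
       uint32_t idx;
       VkResult result = anv_state_table_add(&pool->table, &idx, 1);
       if (result != VK_SUCCESS)
          return result;

       /* The index is stable, so the pointer can be filled in afterwards. */
       struct anv_state *state = anv_state_table_get(&pool->table, idx);
       state->offset = offset;
       state->alloc_size = alloc_size;
       state->map = anv_block_pool_map(&pool->block_pool, offset);

       *out = state;
       return VK_SUCCESS;
    }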
TODO:
1) There's a lot of common code between this table backing store
memory and the anv_block_pool buffer, due to how we grow it. I think
it's possible to refactor this and reuse code in both places.
2) Add unit tests.
v3:
- Rename state table memfd (Jason)
- Return VK_ERROR_OUT_OF_HOST_MEMORY on more places (Jason)
- anv_state_table_grow returns VkResult (Jason)
- Rename variables to be more informative (Jason)
- Return errors on state table grow.
- Rename anv_state_table_push/pop to anv_free_list_push2/pop2
This will be renamed again to remove the trailing "2" later.
v4:
- Remove exit(-1) from anv_state_table (Jason).
- Use uint32_t "next" field in anv_free_entry (Jason).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Replace calls to create hash tables and sets that use
_mesa_hash_pointer/_mesa_key_pointer_equal with the helpers
_mesa_pointer_hash_table_create() and _mesa_pointer_set_create().
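Typical call sites change like this (mem_ctx is whatever ralloc context
the caller already uses):

    #include "util/hash_table.h"
    #include "util/set.h"

    static void
    example(void *mem_ctx)
    {
       /* Before: */
       struct hash_table *ht =
          _mesa_hash_table_create(mem_ctx, _mesa_hash_pointer,
                                  _mesa_key_pointer_equal);
       struct set *s =
          _mesa_set_create(mem_ctx, _mesa_hash_pointer,
                           _mesa_key_pointer_equal);

       /* After: */
       struct hash_table *ht2 = _mesa_pointer_hash_table_create(mem_ctx);
       struct set *s2 = _mesa_pointer_set_create(mem_ctx);

       (void)ht; (void)s; (void)ht2; (void)s2;
    }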
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Eric Engestrom <eric@engestrom.ch>
Not going to matter, but be consistent.
Found by Coverity.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: caf41c78c (anv/allocator: Support softpin in the BO cache)
On Broadwell and above, we have to use different MOCS settings to allow
the kernel to take over and disable caching when needed for external
buffers. On Broadwell, this is especially important because the kernel
can't disable eLLC, so we have to do it in userspace. We very much
don't want to do that for everything, so we need separate MOCS settings
for external and internal BOs.
In order to do this, we add an anv-specific BO flag for "external" and
use that to distinguish between buffers which may be shared with other
processes and/or display and those which are entirely internal. That,
together with an anv_mocs_for_bo helper lets us choose the right MOCS
settings for each BO use.
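The helper itself is essentially a one-liner; a sketch of it is below
(field names assumed from this series):

    static inline uint32_t
    anv_mocs_for_bo(const struct anv_device *device, const struct anv_bo *bo)
    {
       if (bo->flags & ANV_BO_EXTERNAL)
          return device->external_mocs;
       else
          return device->default_mocs;
    }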
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99507
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2 (Jason Ekstrand):
- Break up Scott's mega-patch
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
Co-authored-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
It's safer to set them there because we have the opportunity to properly
handle combining flags if a BO is imported more than once.
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
The state_pools reserve virtual address space of the full
BLOCK_POOL_MEMFD_SIZE, but maintain the current behavior of
growing from the middle.
v2: - rename block_pool::offset to block_pool::start_address (Jason)
- assign state pool start_address statically (Jason)
v3: - remove unnecessary bo_flags tampering for the dynamic pool (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
The new name makes the zero-input behavior more obvious. The next
patch adds a new function with different zero-input behavior.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Suggested-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Ken suggested that we might be underallocating scratch space on HD
400. Allocating scratch space as though there were actually 8 EUs
seems to help with a GPU hang seen on Synmark's CSDof.
Cc: <mesa-stable@lists.freedesktop.org>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Fixes hangs seen due to the lock not being released here.
Signed-off-by: Alex Smith <asmith@feralinteractive.com>
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This will allow setting the flags on any anv_bo created/filled from a
state pool or block pool later.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
While modern pthread mutexes are very fast, they still incur a call to an
external DSO and the overhead of the generality and features of pthread
mutexes. Most mutexes in mesa only need lock/unlock, and the idea here is
that we can inline the atomic operation and make the fast case just two
instructions. Mutexes are subtle and finicky to implement, so we carefully
copy the implementation from Ulrich Drepper's well-written and well-reviewed
paper:
"Futexes Are Tricky"
http://www.akkadia.org/drepper/futex.pdf
We implement "mutex3", which gives us a mutex that has no syscalls on
uncontended lock or unlock. Further, the uncontended case boils down to a
cmpxchg and an untaken branch and the uncontended unlock is just a locked decr
and an untaken branch. We use __builtin_expect() to indicate that contention
is unlikely so that gcc will put the contention code out of the main code
flow.
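For reference, a self-contained sketch of Drepper's "mutex3" on Linux
(illustrative only; the actual simple_mtx_t differs in naming and falls
back to pthread mutexes where futexes or atomics are unavailable):

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    /* 0 = unlocked, 1 = locked, no waiters, 2 = locked, possible waiters */
    typedef struct { int32_t val; } sketch_mtx_t;

    static void futex_wait(int32_t *addr, int32_t val)
    {
       syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
    }

    static void futex_wake(int32_t *addr)
    {
       syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    static void sketch_mtx_lock(sketch_mtx_t *m)
    {
       /* Fast path: 0 -> 1 with a single cmpxchg and an untaken branch. */
       int32_t c = __sync_val_compare_and_swap(&m->val, 0, 1);
       if (__builtin_expect(c != 0, 0)) {
          /* Contended: mark "locked with waiters" and sleep in the kernel. */
          if (c != 2)
             c = __atomic_exchange_n(&m->val, 2, __ATOMIC_ACQUIRE);
          while (c != 0) {
             futex_wait(&m->val, 2);
             c = __atomic_exchange_n(&m->val, 2, __ATOMIC_ACQUIRE);
          }
       }
    }

    static void sketch_mtx_unlock(sketch_mtx_t *m)
    {
       /* Fast path: a locked decrement from 1 to 0 and an untaken branch. */
       if (__builtin_expect(__sync_fetch_and_sub(&m->val, 1) != 1, 0)) {
          __atomic_store_n(&m->val, 0, __ATOMIC_RELEASE);
          futex_wake(&m->val);
       }
    }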
A fast mutex only supports lock/unlock; it can't be recursive or used
with condition variables. We keep the pthread mutex implementation around
for the few places where we use condition variables or recursive locking.
For platforms or compilers where futex and atomics aren't available,
simple_mtx_t falls back to the pthread mutex.
The pthread mutex lock/unlock overhead shows up on benchmarks for CPU-bound
applications. Most CPU-bound cases are helped and some of our internal
bind_buffer_object-heavy benchmarks gain up to 10%.
Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
Signed-off-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>