Now that we have the descriptor buffer mechanism, emulated texture
swizzle can be implemented in a very non-invasive way. Previous
attempts all tried to extend the push constant based image param
mechanism which was gross. This could, in theory, be done much faster
with a magic back-end instruction which does indirect MOVs but Vulkan on
IVB is already so slow this isn't going to matter much.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104355
Cc: "19.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Don't attempt sampling with HiZ if the sampler lacks support for it. On
ICL, the HW docs state that sampling with HiZ is not supported and that
instances of AUX_HIZ in the RENDER_SURFACE_STATE object will be
interpreted as AUX_NONE.
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Our tessellation control shaders can be dispatched in several modes.
- SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
channel corresponding to a different patch vertex. PATCHLIST_N will
launch (N / 8) threads. If N is less than 8, some channels will be
disabled, leaving some untapped hardware capabilities. Conditionals
based on gl_InvocationID are non-uniform, which means that they'll
often have to execute both paths. However, if there are fewer than
8 vertices, all invocations will happen within a single thread, so
barriers can become no-ops, which is nice. We also burn a maximum
of 4 registers for ICP handles, so we can compile without regard for
the value of N. It also works in all cases.
- DUAL_PATCH mode processes up to two patches at a time, where the first
four channels come from patch 1, and the second group of four come
from patch 2. This tries to provide better EU utilization for small
patches (N <= 4). It cannot be used in all cases.
- 8_PATCH mode processes 8 patches at a time, with a thread launched per
vertex in the patch. Each channel corresponds to the same vertex, but
in each of the 8 patches. This utilizes all channels even for small
patches. It also makes conditions on gl_InvocationID uniform, leading
to proper jumps. Barriers, unfortunately, become real. Worse, for
PATCHLIST_N, the thread payload burns N registers for ICP handles.
This can burn up to 32 registers, or 1/4 of our register file, for
URB handles. For Vulkan (and DX), we know the number of vertices at
compile time, so we can limit the amount of waste. In GL, the patch
dimension is dynamic state, so we either would have to waste all 32
(not reasonable) or guess (badly) and recompile. This is unfortunate.
Because we can only spawn 16 thread instances, we can only use this
mode for PATCHLIST_16 and smaller. The rest must use SINGLE_PATCH.
This patch implements the new 8_PATCH TCS mode, but leaves us using
SINGLE_PATCH by default. A new INTEL_DEBUG=tcs8 flag will switch to
using 8_PATCH mode for testing and benchmarking purposes. We may
want to consider using 8_PATCH mode in Vulkan in some cases.
The data I've seen shows that 8_PATCH mode can be more efficient in
some cases, but SINGLE_PATCH mode (the one we use today) is faster
in other cases. Ultimately, the TES matters much more than the TCS
for performance, so the decision may not matter much.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
There's no real work to do here since we already support scalar block
layout which is a direct superset of what this extension allows.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Update various limits in
VkPhysicalDeviceDescriptorIndexingPropertiesEXT that were previously
zero to their values from VkPhysicalDeviceLimits. When using
VK_EXT_descriptor_indexing, the former limits will apply to all the
descriptor layout sets -- not only those using the new feature bits.
For the reference, VK_EXT_descriptor_indexing says
"There are new descriptor set layout and descriptor pool creation
flags that are required to opt in to the update-after-bind
functionality, and there are separate maxPerStage* and
maxDescriptorSet* limits that apply to these descriptor set
layouts which may be much higher than the pre-existing limits. The
old limits only count descriptors in non-updateAfterBind
descriptor set layouts, and the new limits count descriptors in
all descriptor set layouts in the pipeline layout."
Fixes: 6e230d7607 "anv: Implement VK_EXT_descriptor_indexing"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The key reason for that mechanism is gone: all the extra optional data
that could be in the anv_push_constants was moved elsewhere. At this
point, just put anv_push_constants directly in anv_cmd_state (part of
anv_cmd_buffer).
v2: Remove a NULL check we don't need anymore in
anv_cmd_buffer_push_constants(). (Lionel)
Fix size we consider for valid push params. (Lionel)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
We didn't notice this issue much because the 2 struct share a similar
layout, expect for the additional fields...
We run into that issue in Anv :
==15236== Invalid write of size 8
==15236== at 0x8CF3939C: anv_state_table_expand_range (anv_allocator.c:211)
==15236== by 0x8CF394D5: anv_state_table_grow (anv_allocator.c:264)
==15236== by 0x8CF3967E: anv_state_table_add (anv_allocator.c:312)
==15236== by 0x8CF3B13C: anv_state_pool_alloc_no_vg (anv_allocator.c:1167)
==15236== by 0x8CF3B2B0: anv_state_pool_alloc (anv_allocator.c:1190)
==15236== by 0x8CF60871: alloc_surface_state (anv_image.c:1122)
==15236== by 0x8CF61FF9: anv_CreateImageView (anv_image.c:1519)
==15236== by 0x8BCBD2ED: vkCreateImageView (trampoline.c:1358)
==15236== Address 0x8994ef10 is 0 bytes after a block of size 128 alloc'd
==15236== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15236== by 0x8D2578E6: u_vector_init (u_vector.c:47)
==15236== by 0x8CF3929A: anv_state_table_init (anv_allocator.c:168)
==15236== by 0x8CF3A99A: anv_state_pool_init (anv_allocator.c:921)
==15236== by 0x8CF56517: anv_CreateDevice (anv_device.c:1909)
==15236== by 0x8BCB4FBA: terminator_CreateDevice (loader.c:6073)
==15236== by 0x8DD2CB3D: ??? (in /home/djdeath/.steam/ubuntu12_64/libVkLayer_steam_fossilize.so)
==15236== by 0x8DF4D241: vkCreateDevice (in /home/djdeath/.steam/ubuntu12_64/steamoverlayvulkanlayer.so)
==15236== by 0x8BCB35C6: loader_create_device_chain (loader.c:5449)
==15236== by 0x8BCBC230: vkCreateDevice (trampoline.c:838)
v2: Rename mmap_cleanups to avoid confusion (Caio)
v3: s/fail_mmap_cleanups/fail_cleanups/ (Caio)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110648
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Once mem->bo is removed from the cache, it is likely to be freed.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: b80930a6fe ("anv: add support for VK_EXT_memory_budget")
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
We use a mix of MI & PIPE_CONTROL commands to write our queries' data
(results & availability). Those commands' memory write order is not
guaranteed with regard to their order in the command stream, unless CS
stalls are inserted between them. This is problematic for 2 reasons :
1. We copy results from the device using MI commands even though
the values are generated from PIPE_CONTROL, meaning we could
copy unlanded values into the results and then copy the
availability that is inconsistent with the values.
2. We allow the user to poll on the availability values of the
query pool from the CPU. If the availability lands in memory
before the values then we could return invalid values.
This change does 2 things to address this problem :
- We use either PIPE_CONTROL or MI commands to write both
queries values and availability, so that the ordering of the
memory writes guarantees that if availability is visible,
results are also visible.
- For the occlusion & timestamp queries we apply a CS stall
before copying the results on the device, to ensure copying
with MI commands see the correct values of previous
PIPE_CONTROL writes of availability (required by the Vulkan
spec).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reported-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
There are tests in CTS for alpha to coverage without a color attachment
that are failing. This happens because we remove the shader color
outputs when we don't have a valid color attachment for them, but when
alpha to coverage is enabled we still want to preserve the the output
at location 0 since we need the alpha component. In that case we will
also need to create a null render target for RT 0.
v2:
- We already create a null rt when we don't have any, so reuse that
for this case (Jason)
- Simplify the code a bit (Iago)
v3:
- Take alpha to coverage from the key and don't tie this to depth-only
rendering only, we want the same behavior if we have multiple render
targets but the one at location 0 is not used. (Jason).
- Rewrite commit message (Iago)
v4:
- Make sure we take into account the array length of the shader outputs,
which we were no handling correctly either and make sure we also
create null render targets for any invalid array entries too.
v5:
- Simplify removal of unused outputs by using rt_used[] so we don't have
to special case alpha to coverage there too.
Fixes the following CTS tests:
dEQP-VK.pipeline.multisample.alpha_to_coverage_no_color_attachment.*
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Both drivers are feature-complete and should be running more-or-less at
perf at this point. Drop the warning.
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Found while running Talos Principle.
As far as I can tell running a draw call with a pipeline having push
constants without the application having called vkCmdPushConstants
gives undefined push constant values.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Cc: mesa-stable@lists.freedesktop.org
It is an input but it comes in as part of the shader payload and doesn't
count towards the limits.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This enables the remaining capabilities in SPV_EXT_descriptor_indexing.
Fixes: 6e230d7607 "anv: Implement VK_EXT_descriptor_indexing"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
One special case, `src/util/xmlpool/.gitignore` is not entirely deleted,
as `xmlpool.pot` still gets generated (eg. by `ninja xmlpool-pot`).
Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
In commit 0d46e404 ("anv: limit URB reconfigurations when using
blorp") we tried to limit the number of URB reconfiguration by
checking if the last allocation is large enough to fit the blorp
dispatch.
We used the last bound pipeline to compare the allocation. The problem
with this is that the pipeline is bound but its commands might not
have been emitted into the command buffer yet.
Let's just revert commit 0d46e40467
since it didn't seem to yield any performance improvement.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 0d46e404 ("anv: limit URB reconfigurations when using blorp")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110535
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
VK_ANDROID_external_memory_android_hardware_buffer requires this
extension. It is safe to enable it since currently aux usage is
disabled for ahw buffers.
Fixes following dEQP extension dependency test on Android:
dEQP-VK.api.info.device#extensions
Cc: <mesa-stable@lists.freedesktop.org>
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
In 105002bd2d, we fixed a memory leak bug where we weren't properly
destroying descriptor when destroying/resetting a descriptor pool.
However, the only real leak that happened was that we we take a
reference to the descriptor set layout in the descriptor set and we
weren't dropping our reference. Everything else in the descriptor set
is tied to the pool itself and doesn't need to be freed on a per-set
basis. This commit changes the destroy/reset functions to only bother
walking the list of sets to unref the layouts and otherwise we just
assume that the whole-pool destroy/reset takes care of the rest.
Now that we're doing more non-trivial things with descriptor sets such
as allocating things with util_vma_heap, per-set destruction is starting
to show up on perf traces. This takes reset back to where it's supposed
to be as a cheap whole-pool operation.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
In c520f4dec9, we chose to align the sizes of descriptor set buffers to
32 bytes. We have to align the descriptor set buffer to 32B so that
it's valid for using with push constants. We align the size as well so
we don't leave lots of holes with util_vma_heap_alloc. Unfortunately,
we were only aligning it for alloc and not for free so we were still
creating piles of holes when we delete descriptor sets. This causes
terrible perf for the allocator once we've deleted piles of descriptor
sets.
This commit reworks the code so that we align the descriptor set buffer
size to 32B for both alloc and free. The result is that it takes the
new crucible vkResetDescriptorPool from 104.567719 to 2.898354 seconds.
Fixes: c520f4dec9 "anv: Add a concept of a descriptor buffer"
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110497
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Instead of aligning and then taking inline uniforms into account, we
need to take inline uniforms into account and then align to a page.
Otherwise, we may not be aligned to a page and allocation may fail.
Fixes: 43f40dc7cb "anv: Implement VK_EXT_inline_uniform_block"
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
anv_descriptor_pool_free_set is called on the clean-up path of
anv_descriptor_set_create and the set may not have been added to the
pool's list of sets yet. While we're here, we move adding it to that
list into set_create for symmetry.
Fixes: 105002bd2d "anv: destroy descriptor sets when pool gets..."
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
The dri options are optional. When the dri options are not provided
the WSI will not use adaptive sync.
FWIW I think for xf86-video-amdgpu this still requires an X11 config
option, so only people who opt in can get possible regressions from this.
So then the remaining question is: why do this in the WSI?
It has been suggested in another MR that the application sets this.
However, I disagree with that as I don't think we'll ever get a
reasonable set of applications setting it.
The next questions is whether this can be a layer. It definitely
can be as implemented now. However, I think this generally fits
well with the function of the WSI. Furthemore, for e.g. the DISPLAY
WSI this is much harder to do in a layer.
Of course, most of the WSI could almost be a layer, but I think
this still fits best in the WSI.
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Doesn't fix anything but it's not the right function prototype.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 673f33c77d ("anv: Implement CmdBegin/EndQueryIndexed")
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Previously, we were storing the per-binding create info pointer in the
immutable_samplers field temporarily so that we can switch the order in
which we walk the loop. However, now that we have multiple arrays of
structs to walk, it makes more sense to store an index of some sort.
Because we want to leave immutable_samplers as NULL for undefined
bindings, we store index + 1 and then subtract one later.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
I missed this on the first go round. The bindingCount field of
VkDescriptorSetLayoutBindingFlagsCreateInfoEXT is allowed to be zero
which means the flags array is ignored.
Fixes: d6c9bd6e01 "anv: Put binding flags in descriptor set layouts"
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Now that everything is in place to do bindless for all resource types
except input attachments and UBOs, VK_EXT_descriptor_indexing is
"trivial".
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This commit changes anv to put bindless handles and sampler pointers
into the descriptor buffer and use those instead of bindful when we run
out of binding table space. This "spilling" of descriptors allows to to
advertise an almost unbounded number of images and samplers.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Instead of setting it manually, call the helper. When setting
descriptor sets becomes more complicated than just setting some struct
values, this will keep immutable sampler handling correct.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This commit adds a new way for ANV to do SSBO bindings by just passing a
GPU address in through the descriptor buffer and using the A64 messages
to access the GPU address directly. This means that our variable
pointers are now "real" pointers instead of a vec2(BTI, offset) pair.
This carries a few of advantages:
1. It lets us support a virtually unbounded number of SSBO bindings.
2. It lets us implement VK_KHR_shader_atomic_int64 which we couldn't
implement before because those atomic messages are only available
in the bindless A64 form.
3. It's way better than messing around with bindless handles for SSBOs
which is the only other option for VK_EXT_descriptor_indexing.
4. It's more future looking, maybe? At the least, this is what NVIDIA
does (they don't have binding based SSBOs at all). This doesn't a
priori mean it's better, it just means it's probably not terrible.
The big disadvantage, of course, is that we have to start doing our own
bounds checking for robustBufferAccess again have to push in dynamic
offsets.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
In order to avoid the potential overhead of A64 operations on all SSBO
ops, we look for those SSBO ops where we can get to the descriptor set
from the SSBO access operation and lower those to a binding-table
approach. When robustBufferAccess is enabled, this lets the hardware do
the bounds checking for us. It also avoids some potentially expensive
64-bit integer calculations.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is more descriptive and a bit nicer than checking for gen >= 8 &&
use_softpin everywhere.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
If the number of surfaces or samplers exceeds what we can put in a
table, we will want to spill out to bindless. There is no bindless
support yet but this gets us the basic framework that will be used by
later commits.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This commit just sorts the bindings by how often they're used vs the
array size of the binding. This will let us make more nuanced decisions
about what goes in the binding table vs. what to make bindless.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This also fixes a bug where we mis-calculate maximum binding table sizes
and may return true in vkGetDescriptorSetLayoutSupport even for sets too
large to fit in a binding table.
Fixes: ddc4069122 "anv: Implement VK_KHR_maintenance3"
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is really where they belong; not push constants. The one downside
here is that we can't push them anymore for compute shaders. However,
that's a general problem and we should figure out how to push descriptor
sets for compute shaders. This lets us bump MAX_IMAGES to 64 on BDW and
earlier platforms because we no longer have to worry about push constant
overhead limits.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We spend a lot of time in the driver adding things to hash sets to track
residency. The reality is that a properly built Vulkan app uses large
memory objects and sub-allocates from them. In a typical frame, most of
if not all of those allocations are going to be resident for the entire
frame so we're really not saving ourselves much by tracking fine-grained
residency. Just throwing everything in the validation list does make it
a little bit more expensive inside the kernel to walk the list and
ensure that all our VA is in order. However, without relocations, the
overhead of that is pretty small.
If we ever do run into a memory pressure situation where the fine-
grained residency could even potentially help, we would likely be
swapping one page out to make room for another within the draw call and
performance is totally lost at that point. We're better off swapping
out other apps and just letting ours run a whole frame.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
If the last graphics pipeline bound to the command buffer has enough
space in its VS URB entries for Blorp then avoid reconfiguring the URB
partitions.
v2: s/0/MESA_SHADER_VERTEX/ (Caio)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This 3d performance workaround was initially put in the kernel but the
media driver requires different settings so the register has been
whitelisted in i915 [1] and userspace drivers are left initializing it as
they wish.
[1] : https://patchwork.freedesktop.org/series/59494/
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>