As of 2d78e55a8c, nir_intrinsic_load_constant with a constant offset
is constant-folded so we should never end up with any that trigger
brw_nir_analyze_ubo_ranges.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This lets us stop tracking the pipeline layout. It also means less
indirection on a very hot path. As an extra bonus, we can make some of
our data structures smaller. No measurable CPU overhead improvement.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: Use ternary to simplify code (Jason)
v3: Reorder switch cases to follow existing section ordering (Nanley)
Add missing comment in cmd_buffer_end_subpass() about new layout (Nanley)
v4: Fix layout comparison for stencil case (Nanley)
Update a few more comments (Nanley)
Move VK_IMAGE_LAYOUT_STENCIL_ATTACHMENT_OPTIMAL_KHR in color
attachment case for future stencil-CCS support (Nanley)
v5: Missed comments update (Nanley)
Updated relnotes.txt (Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Now that we're no longer compacting binding table entries, the only time
they can possibly change is when we actually switch subpasses.
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Previously, we would read the offset from the BO in anv_reloc_list_add
to generate the presumed offset and then again in the caller to compute
the 64-bit address to write into the buffer. However, if the offset
somehow changed between these two points, the presumed offset would no
longer match the written offset. This is unlikely to actually ever be a
problem in practice because the presumed offset gets recorded first and
so if the written address is wrong then the presumed offset is almost
certainly wrong and the relocation will trigger. However, it's much
safer to simply have anv_reloc_list_add return the 64-bit address.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This shaves around 4-5% off of a CPU-limited example running with the
Dawn WebGPU implementation.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
For gen12 we set the streamout buffers using 4 separate
commands instead of 3DSTATE_SO_BUFFER.
Signed-off-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Prepare this function to be used in iris and to handle new Gen12 behavior.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
v2: Introduce the appropriate pipe controls
Properly deal with changes in metric sets (using execbuf parameter)
Record marker at query end
v3: Fill out PerfCntr1&2
v4: Introduce vkUninitializePerformanceApiINTEL
v5: Use new execbuf extension mechanism
v6: Fix comments in genX_query.c (Rafael)
Use PIPE_CONTROL workarounds (Rafael)
Refactor on the last kernel series update (Lionel)
v7: Only I915_PERF_IOCTL_CONFIG when perf stream is already opened (Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
See "i965/gen9: Optimize slice and subslice load balancing behavior."
for the rationale. According to Jason, improves Aztec Ruins
performance by 2.7%.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> (v1)
v2: Undo CPU performance micro-optimization done in i965 and iris due
to lack of data justifying it on anv. Use
cmd_buffer_apply_pipe_flushes wrapper instead of emitting pipe
control command directly. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We stopped using this when we moved to Jason's mi_builder.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This is the MOCS setting used for the A64 stateless messages which we
sometimes use for SSBO operations.
Fixes: 48ed2a7bb0 "anv: Implement VK_EXT_buffer_device_address"
Fixes: 79fb0d27f3 "anv: Implement SSBOs bindings with GPU addr..."
Reviewed-by: Chad Versace <chadversary@chromium.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This should fix floating-point border color on all gen7 HW. Integer is
still thoroughly busted on gen7 because it doesn't exist on IVB and it's
crazy on HSW.
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Intel hardware didn't get support for sampling from W-tiled (required
for stencil) images until Broadwell so we can't directly sample from
stencil. Instead, if we want to support stencil texturing on gen7
hardware, we have to keep a texture-capable shadow copy around and use
BLORP to update when stencil changes. The one thing this commit does
not implement is self-dependencies with stencil input attachments.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99493
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Modern DXVK requires event support [1], but looks like it only
uses vkCmdSetEvent() + vkGetEventStatus(). So we can just
borrow the relevant code from gen8, leaving CmdWaitEvents still
unimplemented.
[1] 8c3900c533
v2: Also move CmdWaitEvents into genX_cmd_buffer.c (Jason)
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
On CNL+, the clear color struct is composed of RGBA channel values and
fields which are either reserved by the HW or used to control
fast-clears. Currently anv initializes the channel values to zero and
allows the other fields to be undefined.
Satisfy the MBZ field requirements by removing an optimization that
doesn't hold true for CNL+ and pulling in the number of dwords to
initialize from ISL.
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
The key reason for that mechanism is gone: all the extra optional data
that could be in the anv_push_constants was moved elsewhere. At this
point, just put anv_push_constants directly in anv_cmd_state (part of
anv_cmd_buffer).
v2: Remove a NULL check we don't need anymore in
anv_cmd_buffer_push_constants(). (Lionel)
Fix size we consider for valid push params. (Lionel)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This commit changes anv to put bindless handles and sampler pointers
into the descriptor buffer and use those instead of bindful when we run
out of binding table space. This "spilling" of descriptors allows to to
advertise an almost unbounded number of images and samplers.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is really where they belong; not push constants. The one downside
here is that we can't push them anymore for compute shaders. However,
that's a general problem and we should figure out how to push descriptor
sets for compute shaders. This lets us bump MAX_IMAGES to 64 on BDW and
earlier platforms because we no longer have to worry about push constant
overhead limits.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We spend a lot of time in the driver adding things to hash sets to track
residency. The reality is that a properly built Vulkan app uses large
memory objects and sub-allocates from them. In a typical frame, most of
if not all of those allocations are going to be resident for the entire
frame so we're really not saving ourselves much by tracking fine-grained
residency. Just throwing everything in the validation list does make it
a little bit more expensive inside the kernel to walk the list and
ensure that all our VA is in order. However, without relocations, the
overhead of that is pretty small.
If we ever do run into a memory pressure situation where the fine-
grained residency could even potentially help, we would likely be
swapping one page out to make room for another within the draw call and
performance is totally lost at that point. We're better off swapping
out other apps and just letting ours run a whole frame.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Vulkan spec doesn't explicitly forbid zero size transform
feedback buffers.
Having zero size xfb caused SurfaceSize overflow and
triggered assert in debug build.
The only way to have zero size SO_BUFFER is to disable
SO_BUFFER as stated in hardware spec.
From SKL PRM, Vol 2a, "3DSTATE_SO_BUFFER":
"If set, stream output to SO Buffer is enabled,
if 3DSTATE_STREAMOUT::SO Function ENABLE is also enabled.
If clear, the SO Buffer is considered "not bound" and effectively
treated as a zero- length buffer for the purposes of SO output and
overflow detection. If an enabled stream's Stream to Buffer Selects
includes this buffer it is by definition an overflow condition.
That stream will cause no writes to occur,
and only SO_PRIM_STORAGE_NEEDED[<stream>] will increment."
Fixes: 36ee2fd61c "anv: Implement the basic form of VK_EXT_transform_feedback"
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This buffer goes along side the CPU data structure and may contain
pointers, bindless handles, or any other descriptor information.
Currently, all descriptors are size zero and nothing goes in the buffer
but this commit sets up the framework we will need later.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This is what we're actually storing in the descriptor set and consuming
when we bind surface states. This commit renames image_count to
image_param_count a few places and moves the decision to not count image
params on gen9+ into anv_descriptor_set.c when we build the layout.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This commit moves our handling of gl_NumWorkgroups over to work like our
handling of other special bindings in the Vulkan driver. We give it a
magic descriptor set number and teach emit_binding_tables to handle it.
This is better than the bias mechanism we were using because it allows
us to do proper accounting through the bind map mechanism.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Annoyingly, this requires that we implement integer division on the
command streamer. Fortunately, we're only ever dividing by constants so
we can use the mulh+add+shift trick and it's not as bad as it sounds.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Doesn't save us a great deal of lines but at least they get decoded in
aubinators.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
In commit 9a7b319903 ("anv/query: flush render target before
copying results") we tracked all the render target writes to apply a
flushes in the vkCopyQueryResults(). But we can narrow this down to
only when we write a buffer (which is the only input of
vkCopyQueryResults).
v2: Drop newer render target write flags introduce by 1952fd8d2c
("anv: Implement VK_EXT_conditional_rendering for gen 7.5+")
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
Conditional rendering affects next functions:
- vkCmdDraw, vkCmdDrawIndexed, vkCmdDrawIndirect, vkCmdDrawIndexedIndirect
- vkCmdDrawIndirectCountKHR, vkCmdDrawIndexedIndirectCountKHR
- vkCmdDispatch, vkCmdDispatchIndirect, vkCmdDispatchBase
- vkCmdClearAttachments
Value from conditional buffer is cached into designated register,
MI_PREDICATE is emitted every time conditional rendering is enabled
and command requires it.
v2: by Jason Ekstrand
- Use vk_find_struct_const instead of manually looping
- Move draw count loading to prepare function
- Zero the top 32-bits of MI_ALU_REG15
v3: Apply pipeline flush before accessing conditional buffer
(The issue was found by Samuel Iglesias)
v4: - Remove support of Haswell due to possible hardware bug
- Made TMP_REG_PREDICATE and TMP_REG_DRAW_COUNT defines to
define registers in one place.
v5: thanks to Jason Ekstrand and Lionel Landwerlin
- Workaround the fact that MI_PREDICATE_RESULT is not
accessible on Haswell by manually calculating
MI_PREDICATE_RESULT and re-emitting MI_PREDICATE
when necessary.
v6: suggested by Lionel Landwerlin
- Instead of calculating the result of predicate once - re-emit
MI_PREDICATE to make it easier to investigate error states.
v7: suggested by Jason
- Make anv_pipe_invalidate_bits_for_access_flag add CS_STALL
if VK_ACCESS_CONDITIONAL_RENDERING_READ_BIT is set.
v8: suggested by Lionel
- Precompute conditional predicate's result to
support secondary command buffers.
- Make prepare_for_draw_count_predicate more readable.
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>