This fixes a bunch of CTS tests hitting issues when attempting
anv_image_mcs_op with compute.
Fixes: ab9d3528dc ("anv: fix queue check in anv_blorp_execute_on_companion on xe3")
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39581>
Geometry shaders load from separate handles for each vertex, so they
don't incorporate the vertex index in the URB offset like tessellation
shaders do. This means we can have a constant offset (within a vertex's
section) but not have a constant vertex index.
Prior to 41d7debcfe we were emitting non-folded ALU so we thought the
offset was non-constant at this point. Now we can properly detect
constant offsets...but still don't want to use push inputs for
non-constant vertex indices.
Fixes: 41d7debcfe ("brw: Use nir_imul_imm in per-vertex/per-primitive offset calculation")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39603>
This now also removes dead variables created by split_array_vars,
and in the future it is reasonable to expect other optimizations inside
the optimization loop to make temp variables dead.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39596>
Previously the matching logic was designed to match names
like this
```
99993681767ac...32132a.anv.mda.tar/CS/NIR8/046-ssa
```
So up until the first slash of a pattern, a prefix match would be used,
followed by fuzzy matching for the remaining pattern. This doesn't
work well when there are subdirectories in the name, so when we see
```
before/99993681767ac...32132a.anv.mda.tar/CS/NIR8/046-ssa
before/91132154353bd...090919.anv.mda.tar/CS/NIR8/046-ssa
after/91132154353bd...090919.anv.mda.tar/CS/NIR8/046-ssa
```
the first entry can't be matched by `before/9999/first` since the fuzzy
match will kick in for the 9999 and if the second entry has four 9s
(which it does here) there would be multiple choices.
In practice the flexibility of fuzzy matching is not really needed
since we've been using consistent small prefixes (like CS, NIR8, BRW,
etc). The exception is the last part (the object versions, i.e.
"pass names"), where it is sometimes convenient to match by a substring.
The new matching logic is to use prefix match by default, except when
matching the "object version", where substring match is used. In the
example, a possible set of patterns to identify each entry would be
`b/99/ssa`, `b/91/ssa` and `a/91/ssa`.
The patch adds a few tests for `is_match()` to clarify the behavior.
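As a rough illustration of the rule described above (this is not the tool's actual `is_match()` code), the sketch below assumes that names and patterns are split on `/`, that every pattern component except the last must be a prefix of some later name component, in order, and that the last pattern component must be a substring of the name's last component. The name `is_match_sketch` and the way unmatched middle components are skipped are assumptions made for this example.

```python
def is_match_sketch(name, pattern):
    """Hypothetical matcher: prefix match per component, substring on the last."""
    *name_dirs, name_version = name.split("/")
    *pat_dirs, pat_version = pattern.split("/")

    # The "object version" (pass name) is matched by substring.
    if pat_version not in name_version:
        return False

    # Every other pattern component must be a prefix of a name component,
    # consumed left to right (assumption: name components may be skipped).
    pos = 0
    for pat in pat_dirs:
        while pos < len(name_dirs) and not name_dirs[pos].startswith(pat):
            pos += 1
        if pos == len(name_dirs):
            return False
        pos += 1
    return True


# Entry names copied from the example above (elisions kept as-is).
NAMES = [
    "before/99993681767ac...32132a.anv.mda.tar/CS/NIR8/046-ssa",
    "before/91132154353bd...090919.anv.mda.tar/CS/NIR8/046-ssa",
    "after/91132154353bd...090919.anv.mda.tar/CS/NIR8/046-ssa",
]


def matches(pattern):
    return [n for n in NAMES if is_match_sketch(n, pattern)]
```

Under these assumptions, `matches("b/99/ssa")`, `matches("b/91/ssa")` and `matches("a/91/ssa")` each select exactly one of the three entries.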
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39506>
This macro will stop the loop early if there's no chance to make further
progress.
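The idea of stopping early can be sketched roughly as follows. This is not the macro itself: `run_passes` and the toy passes are hypothetical, and the sketch assumes each pass is idempotent and appears only once in the list, so that reaching the last pass that made progress, with only no-ops in between, means another round cannot change anything.

```python
def run_passes(value, passes):
    """Apply `passes` round-robin until nothing changes.

    Returns (final_value, number_of_pass_invocations).
    """
    last_progress = None  # most recent pass that changed the value
    calls = 0
    while True:
        progress = False
        for opt in passes:
            if opt is last_progress:
                # Everything that ran since `opt` last made progress was
                # a no-op, and `opt` is assumed idempotent: stop early.
                return value, calls
            new_value = opt(value)
            calls += 1
            if new_value != value:
                value, last_progress, progress = new_value, opt, True
        if not progress:
            return value, calls


# Toy idempotent "passes" on integers, purely for illustration.
def strip_twos(n):
    while n and n % 2 == 0:
        n //= 2
    return n


def strip_threes(n):
    while n and n % 3 == 0:
        n //= 3
    return n


def noop(n):
    return n


result = run_passes(12, [strip_twos, strip_threes, noop])
```

In this toy run the second round stops as soon as it reaches `strip_threes`, so four pass invocations are made instead of the six a full extra round would need.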
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39504>
Add a pass tracker struct that can live for the whole lifetime of the
brw_compile() functions. It keeps track of the debug_archiver and also
stores some metadata that allows us to name the passes.
With that, we can also embed the loop tracking in the same struct,
so that any loop is free to use the "early break" optimization.
There are other brw_nir_* passes that are called in the pre-processing
phase. These are not yet included in the mda. They will be handled
when we hook debug_archiver or similar to the runtime/driver.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39504>
Compares versions of two objects one by one. Useful to compare two
shader compilations and find the first pass that changed.
This could already be done by using something like
`diff <(mda log ...) <(mda log ...)` but it is useful enough to become
a builtin.
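A minimal sketch of the comparison, assuming each object is available as an ordered list of (pass name, text) pairs; the function name and the data layout are hypothetical and do not reflect how mda stores versions:

```python
from itertools import zip_longest


def first_difference(versions_a, versions_b):
    """Return (index, a, b) for the first pair of versions that differ.

    Returns None when both lists are identical. A missing version on
    either side (lists of different length) counts as a difference.
    """
    for index, (a, b) in enumerate(zip_longest(versions_a, versions_b)):
        if a != b:
            return index, a, b
    return None


# Made-up data standing in for two shader compilations.
before = [("ssa", "x = 1"), ("opt", "y = x + 0"), ("dce", "z")]
after = [("ssa", "x = 1"), ("opt", "y = x"), ("dce", "z")]

diff = first_difference(before, after)
```

Here `diff` points at index 1, the `opt` version, which is the first pass whose output changed between the two lists.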
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39420>
I noticed we disable the prefetch only on Gfx12.5. But surely that
recommendation carries over to later platforms.
It seems other drivers just disable it all the time and only have an
option to force the prefetch, so implement the same thing here.
The Blorp path is left untouched.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39424>
For mesh/task shaders, the thread payload provides a local invocation
index, but it's always linear so it doesn't give the correct value when
quad derivatives are in use.
The lowering pass where all of this is done correctly for compute
shaders assumes load_local_invocation_index will be lowered in the
backend for mesh/task. It calculates the values for the quads correctly
but then avoids replacing the original intrinsic, so we are left with
the wrong results.
Add an Intel-specific intrinsic and always lower the generic one to that
(or whatever else was calculated) to avoid ambiguities and fix the value
for quad derivatives.
Fixes future CTS tests using mesh/task shaders under:
dEQP-VK.spirv_assembly.instruction.compute.compute_shader_derivatives.*
Fixes: d89bfb1ff7 ("intel/brw: Reorganize lowering of LocalID/Index to handle Mesh/Task")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39276>
Allow surface redescription when fast-clearing a layer > 0. This affects
at least five traces in the performance CI, but the CI doesn't report
any performance benefit from this. We already had code to handle unaligned
rows at the bottom of an image. Now that this handles the misalignment at
the top of the image range, we gain some symmetry.
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
On Xe2+, support multi-layer and non-zero-layer CCS fast-clears. To do
this in a simple manner, drop the code which splits multi-layer clears
into fast clears and slow clears. The performance CI reports neither
regressions nor improvements on BMG.
For MCS on all platforms and for CCS on prior platforms, use a new
heuristic. Instead of only allowing fast clears on the first
slice/layer, do the following:
For 3D images, only fast-clear if all slices are cleared. Enables
fast-clearing every slice of 3D textures in:
* Terminator Resistance - 480x270x128.
* Ghostrunner 2 - 320x180x128.
For 2D arrays, match the Xe2+ behavior and allow clearing to any layer.
This is possible because we only allow fast-clearing if the clear color
matches the default value. Enables fast-clearing every layer of 2D array
textures in:
* Assassin's Creed - 128x128, 6-layers.
* Blackops 3 - 1024x1024, 6-layers.
* Borderlands 3 - 128x128, 6-layers.
* Cyberpunk - 1024x1024, 10-layers.
* Unigine Superposition - 4K, 2-layers.
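The layer-range part of the heuristic for MCS and pre-Xe2 CCS can be sketched as below. The function and parameter names are hypothetical, and the sketch covers only the two rules stated above; the other conditions the driver checks before fast-clearing are not modeled.

```python
def layers_allow_fast_clear(is_3d, first_layer, layer_count, total_layers,
                            color_is_default):
    """Hypothetical predicate for the MCS / pre-Xe2 CCS layer heuristic."""
    if is_3d:
        # 3D images: only when every slice is part of the clear.
        return first_layer == 0 and layer_count == total_layers
    # 2D arrays: any layer range, relying on the clear color matching
    # the default value.
    return color_is_default


# A 128-slice 3D texture cleared in full versus only partially.
full_3d = layers_allow_fast_clear(True, 0, 128, 128, True)
partial_3d = layers_allow_fast_clear(True, 0, 64, 128, True)
# Clearing layer 3 of a 6-layer 2D array with the default color.
array_layer = layers_allow_fast_clear(False, 3, 1, 6, True)
```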
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11893
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
A future commit will enable clearing to more than the first layer of 2D
array images. To ensure consistency for the clear color, require the
ANV_FAST_CLEAR_DEFAULT_VALUE for such images if they make use of
ISL_AUX_STATE_CLEAR. Also, use a non-zero default value for some image
formats.
I tested the majority of workloads in the performance CI. This will
cause those which clear to 2D array layers to gain clears on more than
just the first layer. At the moment, we still only support clearing the
first layer, so there should be no change in performance. Affected games
are documented in the code.
Acked-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
Don't return early from anv_layout_to_fast_clear_type() for Xe2+. We'll
need to make more use of the function for some MCS changes in later
commits.
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
Now that hasvk is the driver for supporting HSW and BDW, we no longer
need to convert CCS_D partial resolves to full resolves to avoid an
assert-failure in BLORP.
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
This will make handling fast-clears on multiple layers simpler by saving
us from having to pass more parameters into fast-clear state setting
functions.
It also allows us to set more complex fast-clear state for FCV_CCS_E
without marking the image as compressed.
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
Enables more support for FCV_CCS_E partial resolves if we ever need it.
Also enables support for multiple layers being fast cleared and needing
resolves. Support for that will arrive in several commits.
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
We started allowing non-default clear colors with FCV in commit
cd8e120b97. When rendering to an image with FCV, set the fast-clear
type to ANV_FAST_CLEAR_ANY if the image properties allow such
fast-clears.
Fixes: cd8e120b97 ("anv: Allow more single subresource fast-clears with FCV")
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
From RENDER_SURFACE_STATE::AuxiliarySurfaceQPitch on BDW+:
   This field must be set to an integer multiple of the Surface
   Vertical Alignment
Accomplish this by aligning the height of each MCS layer to the main
surface's vertical alignment. Prevents the following test group from
failing on Xe2 when a future commit enables multi-layer fast-clears in
anv:
dEQP-VK.api.image_clearing.*.
clear_color_attachment.multiple_layers.
*_clamp_input_sample_count_*
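The arithmetic behind the alignment is the usual round-up to a multiple. In the sketch below the helper name and the vertical alignment values are assumptions for illustration; only the height of 11 rows is taken from the 64x11 test case named in this message.

```python
def align_up(value, alignment):
    """Round `value` up to the next multiple of `alignment`."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


# Hypothetical main-surface vertical alignment of 4 rows.
layer_height = align_up(11, 4)
```

With every layer padded this way, the distance between consecutive layers in rows is a multiple of the vertical alignment, which is what the quoted restriction asks for.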
The main test I used to debug this:
dEQP-VK.api.image_clearing.core.
clear_color_attachment.multiple_layers.
a8b8g8r8_unorm_pack32_64x11_clamp_input_sample_count_2
Backport-to: 25.3
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37660>
This avoids generating some useless math that would need to be cleaned
up later, without complicating things too much.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
This helps cut down URB messages on tessellation and mesh shaders
significantly. fossil-db results on Battlemage:
Instrs: 505172392 -> 505207187 (+0.01%); split: -0.00%, +0.01%
Send messages: 23678197 -> 23656126 (-0.09%); split: -0.09%, +0.00%
Cycle count: 63150470088 -> 63147482640 (-0.00%); split: -0.01%, +0.00%
Spill count: 576554 -> 576616 (+0.01%)
Fill count: 545304 -> 545413 (+0.02%)
Max live registers: 141099192 -> 141150675 (+0.04%); split: -0.00%, +0.04%
Max dispatch width: 39856192 -> 39856208 (+0.00%)
Totals from 4231 (0.27% of 1583648) affected shaders:
Instrs: 1620161 -> 1654956 (+2.15%); split: -0.25%, +2.40%
Send messages: 128652 -> 106581 (-17.16%); split: -17.18%, +0.03%
Cycle count: 24650700 -> 21663252 (-12.12%); split: -12.82%, +0.70%
Spill count: 378 -> 440 (+16.40%)
Fill count: 1308 -> 1417 (+8.33%)
Max live registers: 364676 -> 416159 (+14.12%); split: -0.24%, +14.36%
Max dispatch width: 67952 -> 67968 (+0.02%)
There are several reasons we didn't go with nir_opt_vectorize_io:
1. nir_opt_vectorize_io appears to work on the slot location level.
We want to be able to vectorize based on the URB offsets, especially
for cases like point size, layer, and viewport which have different
VARYING_SLOT_* values but live in the same vec4 in a URB entry.
2. We want vec8 stores, and nir_opt_vectorize_io only seems to vectorize
within a single 32-bit vec4. It does handle 8 components, but that's
only for packing 16-bit values into a 32-bit vec4.
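To make the offset-based approach concrete, here is a toy sketch that merges scalar stores at consecutive dword offsets into vectors of up to 8 components. It is not the pass added by this commit: the function name, the (offset, value) representation and the example offsets are all hypothetical.

```python
def vectorize_stores(stores, max_width=8):
    """Group (dword_offset, value) stores into contiguous runs.

    Returns a list of (start_offset, [values]) with at most `max_width`
    values per group.
    """
    groups = []
    for offset, value in sorted(stores):
        if groups:
            start, values = groups[-1]
            if offset == start + len(values) and len(values) < max_width:
                values.append(value)
                continue
        groups.append((offset, [value]))
    return groups


# Three outputs with different slot names but adjacent (made-up) offsets.
header = vectorize_stores([(3, "psiz"), (1, "layer"), (2, "viewport")])
# Ten consecutive dwords split into one vec8 and one vec2.
wide = vectorize_stores([(i, f"v{i}") for i in range(10)])
```

Because grouping keys on the offset rather than on a slot name, the three header outputs end up in a single store even though they come from different varyings.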
Improves performance of Sascha Willems' tessellation demo by around 4%
on Meteorlake.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
Both the URB Global Offset and Per-Slot Offsets are specified to be
unsigned numbers. The URB Global Offset is only 11 bits, and so is
limited to be between [0, 2047]. While the per-slot offsets are
given as U32 values, it would appear that adding the two offsets
does not handle 32-bit overflow/unsigned wrap correctly.
This pops up in Piglit's TCS variable-indexing tests, which end up
performing loads from offset (x - 16) and a base of 18, and at an offset
(x) with a base of 2. These should be equivalent, but when x <= 15, the
per-slot offset calculated in the shader is negative (0xfffffff[0-f])
and adding the base of 18 is not wrapping around correctly to [2, 17].
To work around this, avoid using the global offset when the per-slot
offset is present, and just add the two in the shader where unsigned
wrap works correctly.
Tigerlake and later don't seem to have this issue.
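The equivalence the workaround relies on can be checked with plain 32-bit modular arithmetic. The helper names below are hypothetical, and the sketch models only the shader-side addition; it does not try to reproduce what the affected hardware does with the global offset.

```python
U32_MASK = 0xFFFFFFFF


def u32(value):
    """Truncate to an unsigned 32-bit value, as shader integer math does."""
    return value & U32_MASK


def fold_base_into_per_slot(per_slot_offset, base):
    """Add the base in the shader and send a global offset of 0."""
    return u32(per_slot_offset + base), 0


# Per-slot offset (x - 16) with base 18 versus offset x with base 2.
folded = [fold_base_into_per_slot(u32(x - 16), 18)[0] for x in range(16)]
expected = [u32(x) + 2 for x in range(16)]
```

For x in [0, 15] the per-slot offset wraps to 0xfffffff0..0xffffffff, and adding 18 with unsigned wrap lands back in [2, 17], so `folded` and `expected` agree.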
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
We were checking for 0xf which is fine for vec4, but vec8 gets 0xff.
Either way, nothing is writemasked, so we can skip sending the mask.
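A width-independent form of the check, as a small sketch (the helper name is hypothetical and not taken from the driver):

```python
def is_full_writemask(mask, num_components):
    """True when `mask` enables every one of `num_components` channels."""
    return mask == (1 << num_components) - 1


vec4_full = is_full_writemask(0xF, 4)
vec8_full = is_full_writemask(0xFF, 8)
vec8_partial = is_full_writemask(0xF, 8)
```

Comparing against `(1 << num_components) - 1` covers both the vec4 and the vec8 case, whereas a hard-coded 0xf treats a fully written vec8 as masked.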
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
This makes it easier for NIR passes to distinguish between inputs and
outputs without having to reason about which URB handle source was
passed to the intrinsic. It probably also makes it a bit easier for
humans to read the NIR too.
v2: Don't add memory mode to store intrinsics. It's always output.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39250>
Tested on PTL. Fixes various copy_and_blit tests that use compute and
were exposed to this issue by ab9d3528dc.
Fixes: ab9d3528dc ("anv: fix queue check in anv_blorp_execute_on_companion on xe3")
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39548>
v2: Add additional AUX state transition test-cases for HIZ_CCS (Nanley).
v3: Assume partial resolve is equivalent to full resolve on legacy HiZ
surfaces during isl_aux_state_transition_aux_op() instead of
asserting (Nanley).
v4: Move some tests into different group, add more MCS tests (Nanley).
Acked-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
This appears to be needed to guarantee that a resolved depth surface
has no remaining fast-cleared blocks on DG2 as well as MTL. After
this series this should no longer be hit in practice since we'll be
doing partial resolves in most cases, but it seems sensible to keep
and correct the workaround for our peace of mind to make sure that
full resolves are truly resolving the main surface.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
Issue a partial resolve instead of a full resolve from
transition_depth_buffer() when the final usage requires the
CCS-compressed surface to provide a complete representation of the
image.
This significantly improves performance of applications that
frequently interleave depth rendering and sampling on non-WT surfaces
(e.g. MSAA surfaces). Nba2K23-trace-dx11-2160p-ultra improves
performance by about 260% with this on MTL, and DG2 shows a similar
benefit.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
This updates the isl_aux_state transition helpers to consider partial
resolves for HiZ-CCS surfaces, and as a side effect of the update to
isl_aux_prepare_access() partial resolves should be implicitly enabled
in iris now for platforms that support it.
v2: HiZ partial resolves aren't enough to remove cleared blocks unlike
color partial resolves (Nanley).
v3: Treat ISL_AUX_STATE_CLEAR similar to
ISL_AUX_STATE_COMPRESSED_HIER_DEPTH so we can continue using it
after depth buffer fast clears. Drop flagging partial_resolve ==
true for HiZ usages so we don't do the wrong thing while preparing
access of a surface in ISL_AUX_STATE_CLEAR state.
v4: Assume partial resolve is equivalent to full resolve on legacy HiZ
surfaces during isl_aux_state_transition_aux_op() instead of
asserting (Nanley).
Acked-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
v2: Define additional enum BLORP_OP_HIZ_PARTIAL_RESOLVE to track
partial resolves (Nanley).
v3: Add comment regarding fall back to full resolve on Gfx12.0 (Nanley).
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
As long as the surface is in a state with valid AUX state with
identity contents of the HiZ surface (e.g. in
ISL_AUX_STATE_COMPRESSED_CLEAR, ISL_AUX_STATE_COMPRESSED_NO_CLEAR,
ISL_AUX_STATE_RESOLVED or ISL_AUX_STATE_PASS_THROUGH states) we can
keep compression enabled, which works around hardware bugs on MTL and
DG2, and will be helpful to switch to partial resolves in a future
commit.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
Update anv_layout_to_aux_state() to return the
ISL_AUX_STATE_COMPRESSED_HIER_DEPTH state in cases where we may be
rendering into a HiZ surface in non-WT aux mode, instead of
ISL_AUX_STATE_COMPRESSED_CLEAR.
v2: No need to handle ISL_AUX_STATE_COMPRESSED_HIER_DEPTH in
anv_layout_to_fast_clear_type() since it should never be reached
(Nanley).
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
For transitions to a state that requires the image to be fully defined
by the primary+CCS surface without necessarily requiring a valid
primary we have to perform a resolve if the initial state was
ISL_AUX_STATE_COMPRESSED_HIER_DEPTH, which isn't fully defined by its
primary+CCS surface. This full resolve will be replaced with a more
efficient partial resolve in a future commit, but we have to do this
up front in order to avoid breaking bisectability.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
We can end up in this situation in cases where the application uses a
layout that allows both rendering and sampling from a depth surface,
since in such cases we will attempt to render with HIZ CCS WT usage as
a side effect of using ISL_AUX_USAGE_HIZ_CCS_WT for all layouts that
allow the image to be sampled from.
Disabling fast clears for that case isn't expected to cause a
performance regression: before this series, for HiZ CCS non-WT images
transitioning to such a layout, we would have issued a full resolve and
used ISL_AUX_USAGE_NONE, which also doesn't support fast clears.
Multisample depth images should still get fast clears after this
commit in cases where the rendering and sampling is split into
separate render passes with a layout transition between them that
transitions the image from a W/O layout into a R/W one -- Such
transitions will be handled with a relatively cheap partial resolve in
a subsequent commit.
v2: Add details of additional findings about these hardware issues in
comment.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
v3: Pass aspect bit consistent with layout to
anv_layout_to_aux_usage() instead of defaulting to
VK_IMAGE_ASPECT_DEPTH_BIT.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>
Currently anv_fast_clear_depth_stencil() doesn't know the correct
layout of the depth and stencil images, instead it uses
ANV_IMAGE_LAYOUT_EXPLICIT_AUX to force the base AUX usage of each
plane, which can be inconsistent with the VkImageLayout currently in
use. Plumb the correct depth and stencil layouts.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31139>