In the case of:
halt
// succs: b9
if %618 {
block b3:// preds:
break
// succs: b6
} else {
block b4: // preds: , succs: b5
}
block b5: // preds: b4
32 %556 = iadd %617, %2 (0x1)
opt_constant_if() doesn't work because stitch_blocks() can't join blocks if the
before ends in a jump and the after isn't empty.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24235>
Same cause as for other R8G8 formats - msaa resolve via
blit event causes gpu fault.
Fixes:
dEQP-VK.api.image_clearing.*.clear_color_attachment.*.r8g8_srgb_*
Fixes: 029919f3c8
("tu: allow using resolve engine for SRGB MSAA resolves")
Cc: mesa-stable
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24277>
Both per-primitive and per-vertex space is allocated in MUE in 8 dword
chunks and those 8-dword chunks (granularity of
3DSTATE_SBE_MESH.Per[Primitive|Vertex]URBEntryOutputReadLength)
are passed to fragment shaders as inputs (either non-interpolated
for per-primitive and flat vertex attributes or interpolated
for non-flat vertex attributes).
Some attributes have a special meaning and must be placed in separate
8/16-dword slot called Primitive Header or Vertex Header.
Primitive Header contains 4 such attributes (Cull Primitive,
ViewportIndex, RTAIndex, CPS), leaving 4 dwords (the rest of 8-dword
slot) potentially unused.
Vertex Header is similar - it starts with 3 unused dwords, 1 dword for
Point Size (but if we declare that shader doesn't produce Point Size
then we can reuse it), followed by 4 dwords for Position and optionally
8 dwords for clip distances.
This means we have an interesting optimization problem - we can put
some user attributes into holes in Primitive and Vertex Headers, which
may lead to smaller MUE size and potentially more mesh threads running
in parallel, but we have to be careful to use those holes only when
we need it, otherwise we could force HW to pass too much data to
fragment shader.
Example 1:
Let's assume that Primitive Header is enabled and user defined
12 dwords of per-primitive attributes.
Without packing we would consume 8 + ALIGN(12, 8) = 24 dwords of
MUE space and pass ALIGN(12, 8) = 16 dwords to fragment shader.
With packing, we'll consume 4 + 4 + ALIGN(12 - 4, 8) = 16 dwords of
MUE space and pass ALIGN(4, 8) + ALIGN(12 - 4, 8) = 16 dwords to
fragment shader.
16/16 is better than 24/16, so packing makes sense.
Example 2:
Now let's assume that Primitive Header is enabled and user defined
16 dwords of per-primitive attributes.
Without packing we would consume 8 + ALIGN(16, 8) = 24 dwords of
MUE space and pass ALIGN(16, 16) = 16 dwords to fragment shader.
With packing, we'll consume 4 + 4 + ALIGN(16 - 4, 8) = 24 dwords of
MUE space and pass ALIGN(4, 8) + ALIGN(16 - 4, 8) = 24 dwords to
fragment shader.
24/24 is worse than 24/16, so packing doesn't make sense.
This change doesn't affect vk_meshlet_cadscene in default configuration,
but it speeds it up by up to 25% with "-extraattributes N", where
N is some small value divisible by 2 (by default N == 1) and we
are bound by URB size.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20407>
Instead of using 4 dwords for each output slot, use only the amount
of memory actually needed by each variable.
There are some complications from this "obvious" idea:
- flat and non-flat variables can't be merged into the same vec4 slot,
because flat inputs mask has vec4 stride
- multi-slot variables can have different layout:
float[N] requires N 1-dword slots, but
i64vec3 requires 1 fully occupied 4-dword slot followed by 2-dword slot
- some output variables occur both in single-channel/component split
and combined variants
- crossing vec4 boundary requires generating more writes, so avoiding them
if possible is beneficial
This patch fixes some issues with arrays in per-vertex and per-primitive data
(func.mesh.ext.outputs.*.indirect_array.q0 in crucible)
and by reduction in single MUE size it allows spawning more threads at
the same time.
Note: this patch doesn't improve vk_meshlet_cadscene performance because
default layout is already optimal enough.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20407>
After the previous commit there are so few to reuse that this is no
longer worth doing and actually causes compilation to slow down.
The Blender shader compile time in issue #9326 improves as folows:
21.11 seconds -> 9.90 seconds
The CTS test dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20
improves as follows:
0.92 seconds -> 0.68 seconds
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9326
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24227>
Most of the variables in the hash table will never actually be looked up
for any given block so cloning every possible value just creates a bunch
of unrequired memcpy calls.
Here we change the code to only clone the copies array once it is
actually looked up for the first time.
The Blender shader compile time in issue #9326 improves as folows:
151.09 seconds -> 21.11 seconds
The CTS test dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20
improves as follows:
1.67 seconds -> 0.92 seconds
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24227>
If kill alias results in the hash table entry holding an empty
copies array then remove the hash entry and return the dynamic array
to the unused pool.
This helps avoid hash table size getting out of control in very large
shaders.
151.09 seconds -> 118.60 seconds
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24227>
Here we change things to simply clone the entire hash table. This
is much faster than trying to rebuild it and is needed to avoid
slow compilation of very large shaders.
The Blender shader compile time in issue #9326 improves as folows:
251.29 seconds -> 151.09 seconds
The CTS test dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20
improves as follows:
2.38 seconds -> 1.67 seconds
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24227>
There is no point doing an expensive clone of the copies if the
if-branch is empty.
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24227>
1. merge si_set_es_return_value_for_gs into si_llvm_es_build_end
2. stop return value when mono mode in which case GS use ES input as
input instead of ES output
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24204>
1. merge si_set_ls_return_value_for_tcs into si_llvm_ls_build_end because they
do the same job to return value
2. stop return value when mono mode with different thread count, in which case
TCS use LS input as its input instead of LS output
3. use si_insert_input_ret_float
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24204>
ACO use same args for merged shader stages, but vb desc input sgpr args
is not present when second stage of merged shader. In order to share
same shaders args, move it to last so other args have same index.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24204>
We only need it to merge LS/HS or ES/GS now, prolog and epilog have
been lowered in nir already. So we just need to handle two parts and
they are sure to be first and second stage of a merged shader.
This also remove the needs SGPRs must be before VGPRs, which is required
by following commits to move some SGPRs after VGPRs.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24204>
fopencookie is a glibc feature, so we can't use it on macOS (and
probably other libc's?). It's only used for the hypervisor interface,
though, so we can just make the hypervisor piece glibc-only while
otherwise fixing the wrap.dylib build.
Fixes: ee83453f69 ("asahi: Add a shared library interface for decode")
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24293>
sliceInterleaved may be true for layered images on gfx8. Such a htile
cannot be cleared with radv_clear_htile.
Fixes 24 failures in
dEQP-VK.renderpass2.depth_stencil_resolve.image_2d_16_64_6.* on GFX8.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23959>
For secondary command buffers, vn_cmd_begin_render_pass was only used to
track inherited render pass previously. So we clean it up.
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24103>