Borderlands 3 (both DX11 and DX12 renderers) have a common pattern
across many shaders:
con 32x4 %510 = (uint32)txf %2 (handle), %1191 (0x10) (coord), %1 (0x0) (lod), 0 (texture)
con 32x4 %512 = (uint32)txf %2 (handle), %1511 (0x11) (coord), %1 (0x0) (lod), 0 (texture)
...
con 32x4 %550 = (uint32)txf %2 (handle), %1549 (0x25) (coord), %1 (0x0) (lod), 0 (texture)
con 32x4 %552 = (uint32)txf %2 (handle), %1551 (0x26) (coord), %1 (0x0) (lod), 0 (texture)
A single basic block contains piles of texelFetches from a 1D buffer
texture, with constant coordinates. In most cases, only the .x channel
of the result is read. So we have something on the order of 28 sampler
messages, each asking for...a single uint32_t scalar value. Because our
sampler doesn't have any support for convergent block loads (like the
untyped LSC transpose messages for SSBOs)...this means we were emitting
SIMD8/16 (or SIMD16/32 on Xe2) sampler messages for every single scalar,
replicating what's effectively a SIMD1 value to the entire register.
This is hugely wasteful, both in terms of register pressure, and also in
back-and-forth sending and receiving memory messages.
The good news is we can take advantage of our explicit SIMD model to
handle this more efficiently. This patch adds a new optimization pass
that detects a series of SHADER_OPCODE_TXF_LOGICAL, in the same basic
block, with constant offsets, from the same texture. It constructs a
new divergent coordinate where each channel is one of the constants
(i.e <10, 11, 12, ..., 26> in the above example). It issues a new
NoMask divergent texel fetch which loads N useful channels in one go,
and replaces the rest with expansion MOVs that splat the SIMD1 result
back to the full SIMD width. (These get copy propagated away.)
We can pick the SIMD size of the load independently of the native shader
width as well. On Xe2, those 28 convergent loads become a single SIMD32
ld message. On earlier hardware, we use 2 SIMD16 messages. Or we can
use a smaller size when there aren't many to combine.
In fossil-db, this cuts 27% of send messages in affected shaders, 3-6%
of cycles, 2-3% of instructions, and 8-12% of live registers. On A770,
this improves performance of Borderlands 3 by roughly 2.5-3.5%.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32573>
This updates entry for 14017823839 which fixes issues on BMG with:
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.1
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32550>
Rather than checking hwconfig items when using them, wait until after
devinfo has been fully initialized. This includes having workarounds
implemented.
We can then check if the hwconfig data and final Mesa initialization
agree. If the match fails, we need to investigate if Mesa or the
hwconfig data is wrong.
This code becomes a no-op when not on a release build.
Fixes: a4c5bfd34c ("intel/dev: Use hwconfig for urb min/max entry values")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12141
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32359>
In panvk we pass absolute view indices to the hardware, so we need to do
the conversion from compacted to absolute at some point. Emitting
absolute indices from nir_lower_multiview initially looks like the
simplest option, but nir_lower_io_to_temporaries will emit a write for
every element of array varyings. This results in unnecessary writes to
disabled views.
Signed-off-by: Benjamin Lee <benjamin.lee@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31704>
This is needed for implementing multiview in panvk, where the address
calculation for multiview outputs is not well-represented by lowering to
nir_intrinsic_store_output with a single offset.
The case where a variable is both per-view and per-{vertex,primitive} is
now unsupported. This would come up with drivers implementing
NV_mesh_shader or using nir_lower_multiview on geometry, tessellation,
or mesh shaders. No drivers currently do either of these. There was some
code that attempted to handle the nested per-view case by unwrapping
per-view/arrayed types twice, but it's unclear to what extent this
actually worked.
ANV and Turnip both rely on per-view outputs being assigned a unique
driver location for each view, so I've added on option to configure that
behavior rather than removing it.
Signed-off-by: Benjamin Lee <benjamin.lee@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31704>
This is needed for panvk, where multiview is "all or nothing". When
multiview is enabled, all outputs may be written with separate values
for each view.
The edge case mentioned in the previous `nir_can_lower_multiview` is now
handled because we now handle an arbitrary number of per-view output
vars instead of expecting to find exactly one.
Signed-off-by: Benjamin Lee <benjamin.lee@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31704>
Black Myth Wukong is generating more than 2Gb of shaders in
pre-compiling stage after VK_EXT_shader_image_atomic_int64 extension
enabled. Driver will crash in create shader stages due to dereference
null pointer of kernel map.
Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32548>
The last argument seems to be used as brw_shader_reloc::delta (from
brw_add_reloc), and we're unconditionally setting it to 0 here, while
the other place where we handle nir_intrinsic_load_reloc_const_intel
seems to be setting the base appropriately.
I found this by inspection while debugging a bug related to this code,
so I'm not aware of any workloads that get improved by this patch.
Related patches:
- ecbec25e84 ("intel/nir: add reloc delta to load_reloc_const_intel intrinsic")
- 99047451c9 ("intel/fs: add plumbing for embedded samplers")
Fixes: ecbec25e84 ("intel/nir: add reloc delta to load_reloc_const_intel intrinsic")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32531>
Real Xe KMD actually returns a uint64, so here changing from uint32
to uint64.
Fixes: 04bdbeec31 ("intel/dev/xe: Fix access to eu_per_dss_mask")
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Acked-by: Nanley Chery <nanley.g.chery@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32527>
"constant" is a special keyword in OpenCL C, and we'd like to #define it
suitably in host C23 to facilitate compatiblity between host/device headers.
That means we can't have any identifiers named "global" or "constant".
Fortunately, this is the only 'constant' in any file I'm hitting.
To avoid the clash, don't abbreviate "constant factor", use "constant_factor"
instead. For consistency, "slope factor" then becomes "slope_factor".
The new names are longer but match the Vulkan API exactly.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> [Intel]
Reviewed-by: Mary Guillemard <mary.guillemard@collabora.com> [NVK and panvk]
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> [V3DV]
Reviewed-by: Simon Perretta <simon.perretta@imgtec.com> [IMG]
Acked-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32505>
Clang is capable of tacking what headers it imports, as long as we set
it up to do that. While that isn't important for rusticl, it would be
useful for the various `_clc` tools, as they can then tell Ninja which
headers they read to make rebuilds more reliable.
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Dylan Baker <dylan.c.baker@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32505>
Source 2 games segfault if certain buffers are not able to use the same
memory types as images. CS2 specifically expects this to be the case for
vertex and index buffers (VK_BUFFER_USAGE_2_INDEX_BUFFER_BIT,
VK_BUFFER_USAGE_2_VERTEX_BUFFER_BIT). I have not tested other Source 2
games to see how much the requirement differs for the usage (if at all).
Up until now, we've disabled CCS for the Source 2 engine with the
anv_disable_xe2_ccs driconf option. However, this option is not great
for performance. So, replace this with a new option to allow the same
memory types we use for images on buffers - anv_enable_buffer_comp.
Compression of buffers is generally not good for performance. I
collected the result of unconditionally enabling the feature in the
performance CI on BMG. I used the default configuration to average the
result of two runs of each trace.
The CI reports that 4 game traces would regress between 0.44-1.01% FPS
with buffer compression. However, the CI actually shows it to be
beneficial in three of our game traces:
* Cyberpunk-trace-dx12-1080p-high 106.51%
* Hitman3-trace-dx12-1080p-med 101.59%
* Blackops3-trace-dx11-1080p-high 100.44%
So, enable the option for the two games we already have driconf entries
for, Cyberpunk and Hitman3.
Of course, also enable the option for Source 2 games. Casey Bowman
reports that on BMG, some frame times drop from ~15ms to ~7ms in CS2.
This is in large part due to the removal of HiZ resolves, which is a
consequence of the game now using of HIZ_CCS_WT instead of plain HIZ.
Ref: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11520
Acked-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32519>
DRM_XE_TOPO_EU_PER_DSS and DRM_XE_TOPO_SIMD16_EU_PER_DSS can be any
number of bytes long but it was assuming it was always 4 bytes long.
That was not a issue because Xe KMD return 4 bytes even if only needs
1 or 2 bytes but that is a problem with our HW simulator that was
returning 2 bytes.
Fixes: a24d93aa89 ("intel/dev: Query and compute hardware topology for Xe")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32307>
This is the selector, and it must always be a uniform UD, so there's no
reason to not propagate into it.
No shader-db change on any Intel platform.
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 220507131 -> 220507127 (-0.00%)
Cycle count: 31607052398 -> 31607053364 (+0.00%); split: -0.00%, +0.00%
Totals from 5 (0.00% of 702410) affected shaders:
Instrs: 995 -> 991 (-0.40%)
Cycle count: 86392 -> 87358 (+1.12%); split: -0.07%, +1.19%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32097>
The next two commits modify the destination regioning in a way that,
which still correct, trigger assertion failures if we try to fix the
regioning here.
Broadcast gets lowered in brw_eu_emit. For the purposes of region
restrictions, let's assume that the final code emission will do the
right thing. Doing a bunch of shuffling here is only going to make a
mess of things.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32097>
This commit allows you to dump different regions of memory related to
bvh building. An additional script to decode the memory dump is also
added, and you're able to view the built bvh in 3D view in html. See the
included README.md for usage.
Rework:
- you can now view the actual child_coord in internalNode in html
- change exponent to be int8_t in the interpreter
- fix the actual coordinates using an updated formula
- now you can have 3D view of the bvh
- blockIncr could be 2 and vk_aabb should be first
- Now, if any bvh dump is enabled, we will zero out tlas, to prevent gpu
hang caused by incorrect tlas traversal
- rootNodeOffset is back to the beginning
- Add INTEL_DEBUG=bvh_no_build.
- Fix type of dump_size
- add assertion for a 4B alignment
- when clearing out bvh, only clear out everything after
(header+bvh_offset)
- TODO: instead of dumping on destory, track in the command buffer
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Rework: (Kevin)
- Properly setup bvh_layout
Our bvh resides in contiguous memory and can be divided into two sections:
1. anv_accel_struct_header, tightly followed by
2. actual bvh, which starts with root node, followed by interleaving
leaves or internal nodes.
- Update comments for some fields for BVH and nodes.
- Properly populate the UUIDs in serialization header
- separate header func into completely two paths based on compaction bit
- Encode rt_uuid at second VK_UUID_SIZE.
- Write query result at correct slot
- add assertion for a 4B alignment
- move bvh_layout to anv_bvh
- Use meson option to decide which files to compile
- The alignment of serialization size is not needed
- Change static_assert to STATIC_ASSERT and move them inside functions
Rework (Sagar)
- Use anv_cmd_buffer_update_buffer instead of MI to copy data
Rework (Lionel)
- Remove flush after builds, and add flush in copy before dispatch
- Handle the flushes in CmdWriteAccelerationStructuresPropertiesKHR properly
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Rework: (Kevin)
- Calculate correct number of threads in GPGPU thread group based on
SIMD size.
- Instead of round up, just use the simple division and let the
remainder part handle groupCount < local_size_x.
- Drop indirect_unroll_off and fix the bug that we're not using is_unaligned_size_x
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
This shader gets called and will construct ANV BVH from IR BVH. More
specifically, each invocation will take care of one internal node. The
internal nodes get processed starting from root node all the way to the
bottom leaves.
During processing, we keep track of the destination of
where the internal node should be encoded (tracked in
vk_ir_box.bvh_offset), and where its leaves should be encoded (tracked
in vk_ir_header.dst_node_offset).
The processed bvh is in contiguous memory, which starts with header,
followed by interleaving internal nodes and leaves. The nodes
information are also populated.
Rework: (Sagar)
- Return out of bounds threads early
- Mimic GRL internal node encoding
- Handle node mask
- Fix block_incr_and_start_prim
- Fix shader_index_and_geom_mask for instance node
- Fix instance flag
- Fix block_incr and instance_contribution_and_geom_flags initialized to be zero
- Fix lower_x and upper_x to be properly flipped for invalid child
- For invalid node, clear blockIncr and set startPrim to INVALID
- Calculated things upfront and assign, cutting down more than ~200
instructions
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Rework (Kevin)
- encode the address of anv_instance_leaf after header in order to
handle serialization and deserialization part.
- draw serialized data layout and explanation
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
This commit adds build interface and helper header for ANV BVH.
Rework: (Kevin)
- Use block_size macro to represent bvh node/leaf size
- Rename BVH-related node/leaf size macros for clarity
- Updated comments for some fields for bvh and nodes.
- move bvh_layout to anv_bvh.h
- Draw anv_bvh layout
- rename child_offset to child_block_offset
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Only single g33 as part of r300 ci-tron-based farm.
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Reviewed-by: Eric Engestrom <eric@igalia.com>
Reviewed-by: Martin Roukala (né Peres) <martin.roukala@mupuf.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32376>
We were wanting to check if the destination region spanned multiple
registers. But we were checking against REG_SIZE, when the register
size is actually REG_SIZE * reg_unit(devinfo) now.
This meant that SIMD32 LOAD_PAYLOAD was always getting SIMD-split
on Xe2 platforms, generating a lot of unnecessary mess for compute
shaders.
fossil-db results on Lunar Lake:
Totals:
Instrs: 146178614 -> 143291988 (-1.97%); split: -1.98%, +0.00%
Subgroup size: 11089632 -> 11089376 (-0.00%); split: +0.00%, -0.00%
Cycle count: 22528892444 -> 22507551650 (-0.09%); split: -0.12%, +0.03%
Max live registers: 48834202 -> 48886685 (+0.11%); split: -0.09%, +0.20%
Totals from 134306 (24.10% of 557327) affected shaders:
Instrs: 28806335 -> 25919709 (-10.02%); split: -10.02%, +0.00%
Subgroup size: 4297680 -> 4297424 (-0.01%); split: +0.00%, -0.01%
Cycle count: 956867650 -> 935526856 (-2.23%); split: -2.84%, +0.61%
Max live registers: 13085711 -> 13138194 (+0.40%); split: -0.33%, +0.73%
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32471>