DRM_XE_TOPO_EU_PER_DSS and DRM_XE_TOPO_SIMD16_EU_PER_DSS can be any
number of bytes long, but the code assumed they were always 4 bytes long.
That was not an issue with the Xe KMD, which returns 4 bytes even when
only 1 or 2 bytes are needed, but it is a problem with our HW simulator,
which returns 2 bytes.
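A minimal sketch of reading such a mask without assuming its length.
read_topo_mask_u32 is a hypothetical helper, not the function in the Mesa
source; it assumes the bytes arrive least-significant first and that only
the first four bytes are of interest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a mask of num_bytes bytes into a 32-bit value.
 * Assumes least-significant byte first; bytes past the fourth are ignored. */
static uint32_t
read_topo_mask_u32(const uint8_t *mask, unsigned num_bytes)
{
   assert(mask != NULL);

   uint32_t value = 0;
   for (unsigned i = 0; i < num_bytes && i < 4; i++)
      value |= (uint32_t)mask[i] << (8 * i);

   return value;
}
```

With this shape, a 2-byte reply and a 4-byte reply that carry the same bits
produce the same value, and no byte beyond num_bytes is ever read.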
Fixes: a24d93aa89 ("intel/dev: Query and compute hardware topology for Xe")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32307>
This is the selector, and it must always be a uniform UD, so there's no
reason to not propagate into it.
No shader-db change on any Intel platform.
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 220507131 -> 220507127 (-0.00%)
Cycle count: 31607052398 -> 31607053364 (+0.00%); split: -0.00%, +0.00%
Totals from 5 (0.00% of 702410) affected shaders:
Instrs: 995 -> 991 (-0.40%)
Cycle count: 86392 -> 87358 (+1.12%); split: -0.07%, +1.19%
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32097>
The next two commits modify the destination regioning in a way that,
while still correct, triggers assertion failures if we try to fix the
regioning here.
Broadcast gets lowered in brw_eu_emit. For the purposes of region
restrictions, let's assume that the final code emission will do the
right thing. Doing a bunch of shuffling here is only going to make a
mess of things.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32097>
This commit allows you to dump different regions of memory related to
bvh building. An additional script to decode the memory dump is also
added, and you're able to view the built bvh in 3D view in html. See the
included README.md for usage.
Rework:
- you can now view the actual child_coord in internalNode in html
- change exponent to be int8_t in the interpreter
- fix the actual coordinates using an updated formula
- now you can have 3D view of the bvh
- blockIncr could be 2 and vk_aabb should be first
- Now, if any bvh dump is enabled, we will zero out tlas, to prevent gpu
hang caused by incorrect tlas traversal
- rootNodeOffset is back to the beginning
- Add INTEL_DEBUG=bvh_no_build.
- Fix type of dump_size
- add assertion for a 4B alignment
- when clearing out bvh, only clear out everything after
(header+bvh_offset)
- TODO: instead of dumping on destroy, track in the command buffer
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Rework: (Kevin)
- Properly setup bvh_layout
Our bvh resides in contiguous memory and can be divided into two sections:
1. anv_accel_struct_header, tightly followed by
2. actual bvh, which starts with root node, followed by interleaving
leaves or internal nodes.
- Update comments for some fields for BVH and nodes.
- Properly populate the UUIDs in serialization header
- separate header func into two completely separate paths based on compaction bit
- Encode rt_uuid at second VK_UUID_SIZE.
- Write query result at correct slot
- add assertion for a 4B alignment
- move bvh_layout to anv_bvh
- Use meson option to decide which files to compile
- The alignment of serialization size is not needed
- Change static_assert to STATIC_ASSERT and move them inside functions
Rework (Sagar)
- Use anv_cmd_buffer_update_buffer instead of MI to copy data
Rework (Lionel)
- Remove flush after builds, and add flush in copy before dispatch
- Handle the flushes in CmdWriteAccelerationStructuresPropertiesKHR properly
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Rework: (Kevin)
- Calculate correct number of threads in GPGPU thread group based on
SIMD size.
- Instead of round up, just use the simple division and let the
remainder part handle groupCount < local_size_x.
- Drop indirect_unroll_off and fix the bug that we're not using is_unaligned_size_x
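The arithmetic in the notes above can be sketched as follows. The names
are hypothetical, and rounding the thread count up to a whole number of
SIMD-wide threads is an assumption rather than something taken from the
driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of splitting a 1D item count by the local size. */
struct split_dispatch {
   uint32_t full_groups; /* groups in which every invocation is used */
   uint32_t remainder;   /* leftover invocations, handled separately */
};

/* Assumption: one hardware thread covers simd_size invocations. */
static uint32_t
threads_per_group(uint32_t local_size_x, uint32_t simd_size)
{
   assert(simd_size > 0);
   return (local_size_x + simd_size - 1) / simd_size;
}

/* Plain division plus remainder, with no rounding up of the group count. */
static struct split_dispatch
split_by_local_size(uint32_t item_count, uint32_t local_size_x)
{
   assert(local_size_x > 0);
   struct split_dispatch s = {
      .full_groups = item_count / local_size_x,
      .remainder = item_count % local_size_x,
   };
   return s;
}
```

When item_count is smaller than local_size_x, full_groups is 0 and the
whole count ends up in the remainder part.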
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
This shader gets called and will construct ANV BVH from IR BVH. More
specifically, each invocation will take care of one internal node. The
internal nodes get processed starting from root node all the way to the
bottom leaves.
During processing, we keep track of the destination of
where the internal node should be encoded (tracked in
vk_ir_box.bvh_offset), and where its leaves should be encoded (tracked
in vk_ir_header.dst_node_offset).
The processed bvh is in contiguous memory, which starts with header,
followed by interleaving internal nodes and leaves. The node
information is also populated.
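A rough sketch of that contiguous layout, with the header first and the
node data after it. The struct, the function and the alignment parameter
are hypothetical; the real header and node sizes live in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of where each section starts. */
struct bvh_layout_sketch {
   uint64_t header_offset; /* always 0: the header comes first */
   uint64_t bvh_offset;    /* the root node starts here */
   uint64_t total_size;    /* header plus all nodes and leaves */
};

/* Round v up to a multiple of a, where a is a power of two. */
static uint64_t
align_u64(uint64_t v, uint64_t a)
{
   assert(a != 0 && (a & (a - 1)) == 0);
   return (v + a - 1) & ~(a - 1);
}

/* node_align == 1 means the nodes follow the header with no padding. */
static struct bvh_layout_sketch
make_layout(uint64_t header_size, uint64_t nodes_size, uint64_t node_align)
{
   struct bvh_layout_sketch l;
   l.header_offset = 0;
   l.bvh_offset = align_u64(header_size, node_align);
   l.total_size = l.bvh_offset + nodes_size;
   return l;
}
```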
Rework: (Sagar)
- Return out of bounds threads early
- Mimic GRL internal node encoding
- Handle node mask
- Fix block_incr_and_start_prim
- Fix shader_index_and_geom_mask for instance node
- Fix instance flag
- Fix block_incr and instance_contribution_and_geom_flags initialized to be zero
- Fix lower_x and upper_x to be properly flipped for invalid child
- For invalid node, clear blockIncr and set startPrim to INVALID
- Calculated things upfront and assign, cutting down more than ~200
instructions
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Rework (Kevin)
- encode the address of anv_instance_leaf after header in order to
handle serialization and deserialization part.
- draw serialized data layout and explanation
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
This commit adds build interface and helper header for ANV BVH.
Rework: (Kevin)
- Use block_size macro to represent bvh node/leaf size
- Rename BVH-related node/leaf size macros for clarity
- Updated comments for some fields for bvh and nodes.
- move bvh_layout to anv_bvh.h
- Draw anv_bvh layout
- rename child_offset to child_block_offset
Co-authored-by: Kevin Chuang <kaiwenjon23@gmail.com>
Co-authored-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31588>
Only a single g33, as part of the r300 ci-tron-based farm.
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Reviewed-by: Eric Engestrom <eric@igalia.com>
Reviewed-by: Martin Roukala (né Peres) <martin.roukala@mupuf.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32376>
We wanted to check if the destination region spanned multiple
registers. But we were checking against REG_SIZE, when the register
size is actually REG_SIZE * reg_unit(devinfo) now.
This meant that SIMD32 LOAD_PAYLOAD was always getting SIMD-split
on Xe2 platforms, generating a lot of unnecessary mess for compute
shaders.
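To illustrate the difference, here is a small sketch that counts how many
registers a region covers. It assumes a base REG_SIZE of 32 bytes and a
reg_unit of 2 on Xe2; regs_spanned is a hypothetical helper, not the
check in the compiler.

```c
#include <assert.h>

#define REG_SIZE_SKETCH 32u /* assumed base register size in bytes */

/* Hypothetical helper: number of registers covered by region_bytes when
 * one register holds REG_SIZE_SKETCH * reg_unit bytes. */
static unsigned
regs_spanned(unsigned region_bytes, unsigned reg_unit)
{
   assert(reg_unit > 0);
   const unsigned reg_bytes = REG_SIZE_SKETCH * reg_unit;
   return (region_bytes + reg_bytes - 1) / reg_bytes;
}
```

Under those assumptions, a SIMD32 region of 32-bit channels (128 bytes)
looks like four registers when measured against REG_SIZE alone, but only
two when the register size is REG_SIZE * reg_unit.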
fossil-db results on Lunar Lake:
Totals:
Instrs: 146178614 -> 143291988 (-1.97%); split: -1.98%, +0.00%
Subgroup size: 11089632 -> 11089376 (-0.00%); split: +0.00%, -0.00%
Cycle count: 22528892444 -> 22507551650 (-0.09%); split: -0.12%, +0.03%
Max live registers: 48834202 -> 48886685 (+0.11%); split: -0.09%, +0.20%
Totals from 134306 (24.10% of 557327) affected shaders:
Instrs: 28806335 -> 25919709 (-10.02%); split: -10.02%, +0.00%
Subgroup size: 4297680 -> 4297424 (-0.01%); split: +0.00%, -0.01%
Cycle count: 956867650 -> 935526856 (-2.23%); split: -2.84%, +0.61%
Max live registers: 13085711 -> 13138194 (+0.40%); split: -0.33%, +0.73%
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32471>
Only consider R32 image formats as supporting atomics because we only
expose VK_FORMAT_FEATURE_2_STORAGE_IMAGE_ATOMIC_BIT for those formats.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32192>
I tested this patch with an ACM card. It enables "Halo: The Master Chief
Collection" to use the clear color modifier instead of falling back to the
uncompressed Tile4 modifier.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32192>
* In the ACM PRMs, the programming notes under
RENDER_SURFACE_STATE::MemoryCompressionEnable state that the
DecompressInL3 bit must be set for media compression.
* Unlike TGL, ACM seems to handle format reinterpretation just fine
without using the bit.
Update the assignment accordingly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32192>
This is a prerequisite for enabling nir_opt_varyings for all gallium
drivers.
nir_lower_io_passes (called by the GLSL linker) only uses NIR options
to lower indirect IO access before lowering IO and calling
nir_opt_varyings.
Most drivers report full support for indirect IO and lower it themselves,
which prevents compaction of lowered indirectly accessed varyings because
nir_opt_varyings doesn't touch indirect varyings.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> (Rb for asahi)
Reviewed-by: Pavel Ondračka <pavel.ondracka@gmail.com> (for r300)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32423>
Notably, our convergent block loads were already overfetching - we
rounded up to block sizes of 8, 16, 32, or 64 (LSC-only). But we did
so in the backend, rather than NIR.
With recent changes, nir_opt_load_store_vectorize allows holes of up
to 28 bytes (7 components at 4 bytes each). This allows us to detect
cases where we did a convergent block load for 1 component (but loaded
a whole vec8), then another load for the next vec8, and combine them
into a single V16 load. Single component loads aren't the most common,
but convergent loads of a vec2 in one group and a vec3 in another are
quite common, and it makes no sense to do V8+V8 loads instead of V16.
For non-block loads, we allow a max hole of 4 bytes. This allows the
common case of XYZ_ + XYZ_ loads (where the last component is unread)
to combine into a single larger load.
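The hole limits above can be sketched as a small predicate. This is not
the NIR vectorizer callback; the function names and the byte-offset
parameters are hypothetical and only mirror the numbers in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* 7 components at 4 bytes each for block loads, 4 bytes otherwise. */
static unsigned
max_hole_bytes(bool is_block_load)
{
   return is_block_load ? 7 * 4 : 4;
}

/* low_end is the first byte after the low load, high_start is the first
 * byte of the high load. The gap between them is the hole. */
static bool
hole_allows_merge(unsigned low_end, unsigned high_start, bool is_block_load)
{
   assert(high_start >= low_end);
   return high_start - low_end <= max_hole_bytes(is_block_load);
}
```

For example, a 1-component block load at bytes 0..3 followed by a load at
byte 32 leaves a 28-byte hole and may be merged, while XYZ_ + XYZ_
non-block loads leave a 4-byte hole between bytes 12 and 16.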
fossil-db results on Lunar Lake:
Totals:
Instrs: 146692608 -> 146246432 (-0.30%); split: -0.33%, +0.02%
Subgroup size: 11100528 -> 11100512 (-0.00%)
Send messages: 7003425 -> 6862529 (-2.01%); split: -2.01%, +0.00%
Cycle count: 22396273274 -> 22523048654 (+0.57%); split: -1.08%, +1.64%
Spill count: 67671 -> 67594 (-0.11%); split: -1.59%, +1.48%
Fill count: 128999 -> 130223 (+0.95%); split: -1.73%, +2.68%
Scratch Memory Size: 5986304 -> 6042624 (+0.94%); split: -1.40%, +2.34%
Max live registers: 48898858 -> 48881655 (-0.04%); split: -0.05%, +0.01%
Non SSA regs after NIR: 172397792 -> 167577380 (-2.80%); split: -2.80%, +0.00%
Totals from 451003 (80.87% of 557667) affected shaders:
Instrs: 134111754 -> 133665578 (-0.33%); split: -0.36%, +0.03%
Subgroup size: 9039104 -> 9039088 (-0.00%)
Send messages: 6127775 -> 5986879 (-2.30%); split: -2.30%, +0.00%
Cycle count: 20306336726 -> 20433112106 (+0.62%); split: -1.19%, +1.81%
Spill count: 56230 -> 56153 (-0.14%); split: -1.92%, +1.78%
Fill count: 112920 -> 114144 (+1.08%); split: -1.97%, +3.06%
Scratch Memory Size: 3769344 -> 3825664 (+1.49%); split: -2.23%, +3.72%
Max live registers: 43750259 -> 43733056 (-0.04%); split: -0.05%, +0.01%
Non SSA regs after NIR: 158449343 -> 153628931 (-3.04%); split: -3.04%, +0.00%
In particular, sends get cut by 20.85% for Borderlands 3 DX12, 13.82%
on Cyberpunk 2077, 10.75% on Strange Brigade, and 10.20% on Red Dead
Redemption 2. Yet, spill/fills remain about the same.
fossil-db results on Alchemist are similar though not quite as good.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
nir_opt_load_store_vectorize checks for potential address wrapping
when vectorizing two loads ("low" and "high"). It looks for cases where
"low" might have a large address, and "high" has a positive offset
which, when added together, could trigger integer wraparound. The issue
here is that if the large address of "low" was considered out-of-bounds,
adding offset could wrap around to a small address, which might actually
be in-bounds. Thus, when loaded separately, "low" will fail and trigger
robustness out-of-bound-read behavior, but "high" would read correctly.
When vectorized, the entire load would fail. This is explicitly tested
for with 32-bit SSBO addresses in the Vulkan CTS.
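The wraparound in question is plain unsigned overflow. A minimal sketch
with 32-bit addresses follows; offset_wraps_u32 is a hypothetical helper,
not the check used by the pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when low_addr + offset overflows 32 bits, so that "high"
 * lands at a smaller address than "low". */
static bool
offset_wraps_u32(uint32_t low_addr, uint32_t offset)
{
   return (uint32_t)(low_addr + offset) < low_addr;
}
```

With low_addr = 0xfffffff0 and offset = 0x20, the sum wraps around to
0x10, which is the small-address case described above.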
However, anv's 64-bit global addresses and VMA handling effectively
prevent this case. Addresses 0-4095 are a reserved page so that if
people try to use 0 as a NULL pointer, it never maps to a valid BO.
That alone guarantees that the above case where "high" gets a small
address would never be in-bounds, so we don't need to check for it.
In fact, we allocate most user allocations out of high addresses,
and have specialized allocation heaps for certain types of GPU data
structures in the lower GB of memory. For a load to wrap around and
successfully land in the right heap, it would have to load gigabytes.
Disabling this allows load vectorization and overfetching in more cases.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
Just calculate the block size using util_logbase2() - it's simpler.
Also drop the name "oword" as this refers to legacy HDC messages,
rather than the newer LSC "vector size" field.
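As an illustration of the arithmetic only, a base-2 logarithm turns the
power-of-two block sizes into consecutive integers. floor_log2_u32 below
is a stand-in written for this sketch, not Mesa's util_logbase2(), and
how the result maps onto a hardware field is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a floor(log2(n)) helper; n must be non-zero. */
static unsigned
floor_log2_u32(uint32_t n)
{
   assert(n > 0);

   unsigned log = 0;
   while (n >>= 1)
      log++;

   return log;
}
```

Block sizes of 8, 16, 32 and 64 give 3, 4, 5 and 6 respectively.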
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
This will matter more with overfetching, where we may suggest loading
additional data that we don't actually need for vectorization purposes.
We want to make sure that push ranges have the data we actually need;
any extra padding is irrelevant.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
i965 used to upload its own regular GL uniforms and push those in
addition to UBO ranges. st/mesa instead uploads regular uniforms
and presents those to us as UBO 0. So this really isn't a thing
anymore.
nir_intrinsic_load_uniform is still used today but it represents
Vulkan push constants. anv_nir_compute_push_layout already takes
care of ensuring too many ranges aren't present, so it doesn't need
the pass to do so. iris doesn't use this intrinsic at all.
We can also drop the compute shader check, because neither iris nor
anv use UBO push analysis for compute shaders - except for anv's
internal kernels, which already have well specified push layouts.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32315>
The specification doesn't say which error should be reported, but
piglit expects BadMatch:
/* The GLX_ARB_create_context_robustness spec does not say what error
* code should be generated. However, similar cases (e.g., valid GL
* versions) specify BadMatch. This is also the behavior of NVIDIA's
* closed-source driver.
*/
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32281>