mesa/src/intel/compiler
Kenneth Graunke 6341b3cd87 brw: Combine convergent texture buffer fetches into fewer loads
Borderlands 3 (both DX11 and DX12 renderers) has a common pattern
across many shaders:

  con 32x4 %510 = (uint32)txf %2 (handle), %1191 (0x10) (coord), %1 (0x0) (lod), 0 (texture)
  con 32x4 %512 = (uint32)txf %2 (handle), %1511 (0x11) (coord), %1 (0x0) (lod), 0 (texture)
  ...
  con 32x4 %550 = (uint32)txf %2 (handle), %1549 (0x25) (coord), %1 (0x0) (lod), 0 (texture)
  con 32x4 %552 = (uint32)txf %2 (handle), %1551 (0x26) (coord), %1 (0x0) (lod), 0 (texture)

A single basic block contains piles of texelFetches from a 1D buffer
texture, with constant coordinates.  In most cases, only the .x channel
of the result is read.  So we have something on the order of 28 sampler
messages, each asking for...a single uint32_t scalar value.  Because our
sampler doesn't have any support for convergent block loads (like the
untyped LSC transpose messages for SSBOs)...this means we were emitting
SIMD8/16 (or SIMD16/32 on Xe2) sampler messages for every single scalar,
replicating what's effectively a SIMD1 value to the entire register.
This is hugely wasteful, both in register pressure and in the
back-and-forth of sending and receiving memory messages.

The good news is we can take advantage of our explicit SIMD model to
handle this more efficiently.  This patch adds a new optimization pass
that detects a series of SHADER_OPCODE_TXF_LOGICAL, in the same basic
block, with constant offsets, from the same texture.  It constructs a
new divergent coordinate where each channel is one of the constants
(i.e. <0x10, 0x11, 0x12, ..., 0x26> in the above example).  It issues a new
NoMask divergent texel fetch which loads N useful channels in one go,
and replaces the rest with expansion MOVs that splat the SIMD1 result
back to the full SIMD width.  (These get copy propagated away.)
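
The combining step described above can be sketched outside the compiler.
This is only an illustrative model, not the pass itself: the real pass
operates on SHADER_OPCODE_TXF_LOGICAL instructions in the brw IR, whereas
the names below (WideFetch, CombinedResult, combine_convergent_fetches)
are hypothetical and plain vectors stand in for registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical model of a divergent texel fetch: one coordinate per
// channel in, one loaded value per channel out.
using WideFetch =
    std::function<std::vector<uint32_t>(const std::vector<uint32_t> &)>;

struct CombinedResult {
    // values[i] is the splatted result of the i-th original fetch,
    // with one copy per channel of the shader's dispatch width.
    std::vector<std::vector<uint32_t>> values;
    unsigned messages = 0; // number of wide fetches issued
};

// coords: the constant coordinates of the original convergent fetches,
// all assumed to read the same texture within one basic block.
CombinedResult
combine_convergent_fetches(const std::vector<uint32_t> &coords,
                           unsigned load_width, unsigned dispatch_width,
                           const WideFetch &fetch)
{
    assert(load_width > 0 && dispatch_width > 0);
    CombinedResult out;

    for (std::size_t base = 0; base < coords.size(); base += load_width) {
        // Build the divergent coordinate: channel c gets the c-th constant.
        std::size_t n = std::min<std::size_t>(load_width, coords.size() - base);
        std::vector<uint32_t> lane_coords(coords.begin() + base,
                                          coords.begin() + base + n);

        // One message now loads n useful channels instead of one.
        std::vector<uint32_t> loaded = fetch(lane_coords);
        out.messages++;

        // Expansion step: splat each loaded channel back to full width.
        for (std::size_t c = 0; c < n; c++) {
            out.values.push_back(std::vector<uint32_t>(
                static_cast<std::size_t>(dispatch_width), loaded[c]));
        }
    }
    return out;
}
```

In this model, the fetches at coordinates 0x10 through 0x26 with a
load width of 16 turn into two wide fetches, and every original result
is recovered as a uniform vector.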

We can pick the SIMD size of the load independently of the native shader
width as well.  On Xe2, those 28 convergent loads become a single SIMD32
ld message.  On earlier hardware, we use 2 SIMD16 messages.  Or we can
use a smaller size when there aren't many to combine.
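
One way to picture the width selection is the small sketch below.  It is
a guess at a plausible heuristic, not the policy the pass actually
implements: plan_load_widths is a hypothetical name, and the minimum and
maximum widths (8/16 before Xe2, 16/32 on Xe2) are assumptions taken from
the message sizes mentioned earlier in this commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the SIMD width of each combined load needed to cover `count`
// convergent fetches.  Width limits are assumptions, see the lead-in.
std::vector<unsigned> plan_load_widths(unsigned count, bool is_xe2)
{
    const unsigned min_width = is_xe2 ? 16 : 8;
    const unsigned max_width = is_xe2 ? 32 : 16;
    assert(min_width <= max_width);

    std::vector<unsigned> widths;
    while (count > 0) {
        // Pick the smallest supported width that covers what is left,
        // capped at the largest message the hardware accepts.
        unsigned width = min_width;
        while (width < count && width < max_width)
            width *= 2;

        widths.push_back(width);
        count -= std::min(count, width);
    }
    return widths;
}
```

Under these assumptions, 28 fetches map to one SIMD32 load on Xe2 and to
two SIMD16 loads on earlier hardware, while a handful of fetches fits in
a single load of the minimum width.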

In fossil-db, this cuts 27% of send messages in affected shaders, 3-6%
of cycles, 2-3% of instructions, and 8-12% of live registers.  On A770,
this improves performance of Borderlands 3 by roughly 2.5-3.5%.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32573>
2024-12-12 00:05:42 +00:00
elk intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
tests intel/brw: Remove assembler tests for Gfx8- 2024-02-24 02:10:56 +00:00
brw_asm.c intel/brw: Dump errors when brw_assemble() fails EU validation 2024-12-10 20:23:25 +00:00
brw_asm.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_asm_internal.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_asm_tool.c intel/brw: Split off assembler logic into library 2024-07-12 19:34:23 +00:00
brw_cfg.cpp intel/brw: Add a file parameter to idom_tree::dump() 2024-08-22 22:54:45 +00:00
brw_cfg.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_compile_bs.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_compile_cs.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_compile_fs.cpp brw: move barycentric_mode enum to intel_shader_enums.h 2024-11-26 13:05:30 +00:00
brw_compile_gs.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_compile_mesh.cpp brw: fix task/mesh push constant loading 2024-10-26 18:12:41 +00:00
brw_compile_tcs.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_compile_tes.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_compile_vs.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_compiler.c nir: add option to use compact view indices 2024-12-09 20:31:49 +00:00
brw_compiler.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_debug_recompile.c intel/brw: Simplify @file annotations 2024-07-22 22:48:03 +00:00
brw_def_analysis.cpp intel: Add statistic for Non SSA registers after NIR to BRW 2024-10-11 06:40:29 +00:00
brw_device_sha1_gen_c.py intel/compiler: drop unused ray-tracing fields from cache hash 2024-03-22 00:01:28 +00:00
brw_disasm.c intel/brw_asm: Add BranchCtrl support 2024-11-02 18:01:19 +00:00
brw_disasm.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_disasm_info.cpp intel/brw: Simplify fs_inst annotation 2024-08-28 03:59:50 +00:00
brw_disasm_info.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_disasm_tool.c intel/brw: Remove Gfx8- code from disassembler 2024-02-28 05:45:38 +00:00
brw_eu.c brw,elk: Fix opening flags on dumping shader binaries 2024-08-27 08:26:08 +00:00
brw_eu.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_eu_compact.c intel/compiler: Xe2 and Xe3 use the same compaction tables 2024-10-26 07:39:30 +00:00
brw_eu_defines.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_eu_emit.c brw/emit: Add correct 3-source instruction assertions for each platform 2024-11-08 16:48:57 +00:00
brw_eu_validate.c intel/brw: Fix decoding of cond_modifier and saturate in EU validation 2024-11-22 21:15:46 +00:00
brw_fs.cpp intel/brw: Add is_control_source for the new subgroup ops 2024-12-04 01:19:37 +00:00
brw_fs.h brw: Combine convergent texture buffer fetches into fewer loads 2024-12-12 00:05:42 +00:00
brw_fs_bank_conflicts.cpp intel/brw: Simplify @file annotations 2024-07-22 22:48:03 +00:00
brw_fs_builder.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_fs_cmod_propagation.cpp brw: Fix mov cmod propagation when there's int signedness mismatch 2024-09-09 22:13:08 +00:00
brw_fs_combine_constants.cpp intel/brw: Allow immediates in the BFE instruction on Gfx12+ 2024-10-24 21:31:28 +00:00
brw_fs_copy_propagation.cpp brw/copy: Allow copy prop into src1 of broadcast 2024-12-05 00:15:27 +00:00
brw_fs_cse.cpp brw/cse: Don't eliminate instructions that write flags 2024-11-08 17:46:45 +00:00
brw_fs_dead_code_eliminate.cpp intel/brw: Delete old-style surface and A64 message opcodes 2024-09-12 20:54:36 +00:00
brw_fs_generator.cpp brw: add a NOP in between WHILE instructions on LNL 2024-10-31 23:57:10 +00:00
brw_fs_live_variables.cpp intel/brw: Simplify @file annotations 2024-07-22 22:48:03 +00:00
brw_fs_live_variables.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_fs_lower.cpp intel/brw: Simplify fs_inst annotation 2024-08-28 03:59:50 +00:00
brw_fs_lower_dpas.cpp intel/brw: Replace uses of fs_reg with brw_reg 2024-07-03 02:53:19 +00:00
brw_fs_lower_integer_multiplication.cpp intel/brw: Replace uses of fs_reg with brw_reg 2024-07-03 02:53:19 +00:00
brw_fs_lower_pack.cpp intel/brw: Replace uses of fs_reg with brw_reg 2024-07-03 02:53:19 +00:00
brw_fs_lower_regioning.cpp brw/lower: Don't "fix" regioning of broadcast 2024-12-05 00:15:27 +00:00
brw_fs_lower_simd_width.cpp brw: Allow SIMD32 math instructions on Xe2 2024-12-04 02:42:34 +00:00
brw_fs_nir.cpp brw: don't forget the base when emitting SHADER_OPCODE_MOV_RELOC_IMM 2024-12-09 15:45:49 +00:00
brw_fs_opt.cpp brw: Combine convergent texture buffer fetches into fewer loads 2024-12-12 00:05:42 +00:00
brw_fs_opt_algebraic.cpp brw/build: Use SIMD8 temporaries in emit_uniformize 2024-12-05 00:15:27 +00:00
brw_fs_opt_virtual_grfs.cpp brw: fix virtual register splitting to not go below physical register size 2024-09-18 23:26:34 +00:00
brw_fs_reg_allocate.cpp brw: use transpose unspill messages when possible 2024-12-04 08:59:07 +00:00
brw_fs_register_coalesce.cpp intel/brw: Simplify @file annotations 2024-07-22 22:48:03 +00:00
brw_fs_saturate_propagation.cpp intel/brw: Use def analysis for simple cases of saturate propagation 2024-08-09 14:26:05 -07:00
brw_fs_scoreboard.cpp intel/brw: Allow extra SWSB encodings for Xe2 2024-11-19 04:27:00 +00:00
brw_fs_thread_payload.cpp brw: move barycentric_mode enum to intel_shader_enums.h 2024-11-26 13:05:30 +00:00
brw_fs_validate.cpp intel/brw: Add SHADER_OPCODE_QUAD_SWAP 2024-11-22 00:27:01 +00:00
brw_fs_visitor.cpp intel/brw: Add phases to backend 2024-10-11 06:40:29 +00:00
brw_fs_workaround.cpp intel/brw/gfx9: Implement WaClearArfDependenciesBeforeEot 2024-10-23 15:02:27 +00:00
brw_gram.y intel/brw_asm: Add BranchCtrl support 2024-11-02 18:01:19 +00:00
brw_inst.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_ir.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_ir_allocator.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_ir_analysis.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_ir_fs.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_ir_performance.cpp intel/brw/xe2+: Adjust performance analysis divergence weight due to EU fusion removal. 2024-10-24 22:06:52 +00:00
brw_ir_performance.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_isa_info.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_kernel.c intel-clc: missing printf lowering 2024-08-06 17:55:18 +00:00
brw_kernel.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_lex.l intel/brw_asm: Add BranchCtrl support 2024-11-02 18:01:19 +00:00
brw_lower_logical_sends.cpp brw: rename brw_sometimes to intel_sometimes 2024-11-26 13:05:30 +00:00
brw_lower_subgroup_ops.cpp intel/brw: Add SHADER_OPCODE_QUAD_SWAP 2024-11-22 00:27:01 +00:00
brw_nir.c nir: treat per-view outputs as arrayed IO 2024-12-09 20:31:49 +00:00
brw_nir.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_nir_analyze_ubo_ranges.c brw: Only consider components read for UBO push analysis 2024-12-03 02:02:33 +00:00
brw_nir_lower_alpha_to_coverage.c brw: rename brw_sometimes to intel_sometimes 2024-11-26 13:05:30 +00:00
brw_nir_lower_cooperative_matrix.c intel/brw/xe2+: Allow vec16 for cooperative matrix 2024-06-25 14:17:47 -07:00
brw_nir_lower_cs_intrinsics.c compiler: Allow derivative_group to be used for all stages in shader_info 2024-09-03 20:03:18 +00:00
brw_nir_lower_fsign.py intel/brw: Use range analysis to optimize fsign 2024-05-14 01:28:21 +00:00
brw_nir_lower_intersection_shader.c intel/rt: fix terminateOnFirstHit handling 2024-08-05 21:43:36 +00:00
brw_nir_lower_ray_queries.c intel/rt: fix ray_query stack address calculation 2024-11-08 18:31:52 +00:00
brw_nir_lower_rt_intrinsics.c brw/rt: fix ray_object_(direction|origin) for closest-hit shaders 2024-08-13 10:28:50 +00:00
brw_nir_lower_shader_calls.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
brw_nir_lower_storage_image.c intel/brw: Drop image_{load,store}_raw_intel handling 2024-08-09 07:20:08 +00:00
brw_nir_opt_fsat.c intel/brw: Move fsat instructions closer to the source 2024-08-09 14:26:10 -07:00
brw_nir_rt.c brw/nir: rework inline_data_intel to work with compute 2024-10-17 19:35:59 +00:00
brw_nir_rt.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_nir_rt_builder.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_nir_trig_workarounds.py
brw_opt_txf_combiner.cpp brw: Combine convergent texture buffer fetches into fewer loads 2024-12-12 00:05:42 +00:00
brw_packed_float.c
brw_prim.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_print.cpp intel/brw: Fix SWSB output when printing IR 2024-11-22 21:47:46 +00:00
brw_private.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_reg.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_reg_type.c intel/brw: Rename brw_reg_type_to_hw_type to brw_type_encode 2024-04-25 11:41:48 +00:00
brw_reg_type.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_rt.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
brw_schedule_instructions.cpp intel/brw: Only force g0's liveness to be the whole program if spilling 2024-08-01 16:37:34 -07:00
brw_shader.cpp intel/brw: Delete old-style surface and A64 message opcodes 2024-09-12 20:54:36 +00:00
brw_simd_selection.cpp intel/brw: fix subgroup size of geometry stages for lnl+ 2024-05-14 23:13:37 +00:00
brw_vue_map.c intel/brw: Simplify @file annotations 2024-07-22 22:48:03 +00:00
intel_clc.c clc: Tell clang to track imported dependencies 2024-12-06 13:48:26 -05:00
intel_gfx_ver_enum.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
intel_nir.c intel/compiler: Rename the passes and files related to intel_nir.h 2024-02-16 22:35:05 +00:00
intel_nir.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
intel_nir_blockify_uniform_loads.c Revert in correct commit "fix" 2024-11-26 16:36:06 +02:00
intel_nir_clamp_image_1d_2d_array_sizes.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_clamp_per_vertex_loads.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_lower_conversions.c intel/nir: Don't needlessly split u2f16 for nir_type_uint32 2024-07-11 02:37:05 -07:00
intel_nir_lower_non_uniform_barycentric_at_sample.c nir: change signature of nir_src_is_divergent() 2024-10-24 10:06:17 +00:00
intel_nir_lower_non_uniform_resource_intel.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_lower_printf.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_lower_shading_rate_output.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_lower_sparse.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_lower_texture.c intel/compiler: Pack texture LOD and offset to a single 32-bit value 2024-02-27 00:22:46 +00:00
intel_nir_opt_peephole_ffma.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_opt_peephole_imul32x16.c treewide: use nir_metadata_control_flow 2024-06-17 16:28:14 -04:00
intel_nir_tcs_workarounds.c intel/nir: Set src_type on TCS quads workaround store_output 2024-05-02 13:58:21 -07:00
intel_shader_enums.h intel/compiler: Use #pragma once instead of header guards 2024-12-11 19:47:44 +00:00
meson.build brw: Combine convergent texture buffer fetches into fewer loads 2024-12-12 00:05:42 +00:00
test_eu_compact.cpp intel/brw: Enable EU validation and compaction tests for PTL 2024-12-04 23:03:11 +00:00
test_eu_validate.cpp intel/brw: Enable EU validation and compaction tests for PTL 2024-12-04 23:03:11 +00:00
test_fs_cmod_propagation.cpp brw: Fix mov cmod propagation when there's int signedness mismatch 2024-09-09 22:13:08 +00:00
test_fs_combine_constants.cpp intel/brw: Move calculate_cfg out of fs_visitor 2024-07-25 15:37:13 +00:00
test_fs_copy_propagation.cpp intel/brw: Copy prop from raw integer moves with mismatched types 2024-08-30 03:39:31 +00:00
test_fs_cse.cpp intel/brw: Move calculate_cfg out of fs_visitor 2024-07-25 15:37:13 +00:00
test_fs_saturate_propagation.cpp brw/sat: Convert nearly all tests to use new style builders 2024-10-25 20:31:45 +00:00
test_fs_scoreboard.cpp intel/brw: Allow extra SWSB encodings for Xe2 2024-11-19 04:27:00 +00:00
test_simd_selection.cpp intel: Remove brw_ prefix from process debug function 2024-02-16 22:35:05 +00:00
test_vf_float_conversions.cpp