fdo-mirrors/mesa

mirror of https://gitlab.freedesktop.org/mesa/mesa.git synced 2026-05-18 11:38:06 +02:00

Author	SHA1	Message	Date
Dylan Baker	7b337e214d	anv: remove dead code This code cannot be reached, since we already checked for `!valid_samples` and returned `VK_ERROR_FEATURE_NOT_PRESENT` in that case above, and have not altered `valid_samples` since. Fixes: `d5da6980d3` ("anv/sparse: don't support depth/stencil with sparse") CID: 1662063 Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37341>	2025-09-12 23:20:35 +00:00
Sushma Venkatesh Reddy	5f10c1a8fb	intel/compiler: generalize workaround script name for broader applicability Renamed brw_nir_trig_workarounds.py to brw_nir_workarounds.py to reflect its expanded scope beyond just trigonometric workarounds. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36990>	2025-09-12 22:32:46 +00:00
Sushma Venkatesh Reddy	fe1d84e083	intel/compiler: apply sqrt workaround for Horizon Forbidden West shader Added a workaround for a known shader in Horizon Forbidden West that causes visual corruption on Intel anv driver. The fix clamps fsqrt inputs using fmax(x, 1e-12) to avoid invalid values. Integrated the workaround via brw_nir_apply_sqrt_workarounds() and applied it conditionally in the Vulkan pipeline based on the shader's BLAKE3 hash. Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12555 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36990>	2025-09-12 22:32:46 +00:00
Georg Lehmann	79d02047b8	intel: switch to new subgroup size info Reviewed-by: Iván Briano <ivan.briano@intel.com> Acked-by: Timur Kristóf <timur.kristof@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37258>	2025-09-12 21:05:17 +00:00
Georg Lehmann	95c2a65662	nir: remove unused shader_info param in nir_create_shader Reviewed-by: Marek Olšák <marek.olsak@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37258>	2025-09-12 21:05:17 +00:00
Caio Oliveira	c358842c1d	brw: Don't use individual rallocs for each instruction Move from a single ralloc allocation per instruction to contiguous blocks of allocations. Still use ralloc for those large blocks. Each ralloc allocation has at least 5 pointers of overhead, which would be about a third of the current brw_inst, and get worse as we try to pack brw_inst better. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:05 +00:00
Caio Oliveira	2506540566	brw: Repack brw_inst fields In Release build, goes from 72 to 64 bytes, and now fits in a single cacheline. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:05 +00:00
Caio Oliveira	8ded571ef4	brw: Allocate only brw_inst for BASE instructions Now that all the other kinds were added, all transforms to SEND will come from non-BASE kinds, so we don't need overallocate for BASE instructions. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:05 +00:00
Caio Oliveira	08c0f33874	brw: Add a generic LOGICAL instruction kind This kind of instruction doesn't have a special struct but will still be always allocated so that it can be lowered to SEND. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:05 +00:00
Caio Oliveira	df2b5fb03f	brw: Add brw_fb_write_inst Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:04 +00:00
Caio Oliveira	d06c0a370e	brw: Add brw_urb_inst Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:04 +00:00
Caio Oliveira	90967e7b16	brw: Add brw_load_payload_inst Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:03 +00:00
Caio Oliveira	388bac06ce	brw: Add brw_dpas_inst Fixed the types in brw_inst::bits so the struct is packed correctly. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:03 +00:00
Caio Oliveira	09a26526cc	brw: Add brw_mem_inst Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:02 +00:00
Caio Oliveira	f0f1e63f99	brw: Add brw_tex_inst Incorporate some "control sources" directly into the instruction. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:02 +00:00
Caio Oliveira	0fcce2722f	brw: Add brw_send_inst Move all the SEND specific fields from brw_inst into brw_send_inst. This new instruction kind will contain all variants of SENDs plus the virtual opcodes that were already relying on those SEND fields. Use the `as_send()` helper to go from a brw_inst into the brw_send_inst when applicable. Some of the code was changed to use the brw_send_inst type directly. Until other kinds are added, all the instructions are allocated the same amount of space as brw_send_inst. This ensures that all brw_transform_inst() calls are still valid. This will change after a few patches so that BASE instructions can use less memory. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:01 +00:00
Caio Oliveira	b27f6621ae	brw: Add initial support for different instruction kinds Prepare code for supporting subclasses of brw_inst for certain specialized kinds of instructions. This will allow - Move certain fields from brw_inst to the specialized one, reducing its size and making it easy to understand what applies to which instruction; - Move certain control sources into the specialized inst type, which currently take a full brw_reg to encode small integers. Reducing the overall sources we walk and care also might help the code in general. Next commits will add the new instruction kinds. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:01 +00:00
Caio Oliveira	339a4e8680	brw: Remove the extra function call when lowering samplers Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:00 +00:00
Caio Oliveira	71c23c6722	brw: Add brw_builder::URB_READ and URB_WRITE helpers Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:25:00 +00:00
Caio Oliveira	f92116832f	brw: Add brw_builder::SEND() helper Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:59 +00:00
Caio Oliveira	e194909b3f	brw: Add and use brw_transform_inst() The new function takes care of changing an instruction opcode and sources, which will allow later patches to tweak how allocations are done in those cases. Like the instruction allocation, this also takes a shader (or a builder, for it to get a shader). Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:59 +00:00
Caio Oliveira	5d0160a87f	brw: Pass brw_shader in fold_instruction Will be used later for the general instruction transforming function. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:58 +00:00
Caio Oliveira	8f16cac492	brw: Allow emit instruction with only number of sources The emit will allocate the necessary number of sources but will let the caller fill them in. Change a couple of places to take advantage of that. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:58 +00:00
Caio Oliveira	3ef86a8d00	brw: Let the builder fill the sources of brw_inst Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:58 +00:00
Caio Oliveira	506fce20f0	brw: Bundle the allocation of brw_inst and its sources Flatten all the work being done into brw_new_inst() and brw_clone_inst() and allocate both the instruction and the sources in one swoop. For now we still keep a pointer to the array instead of declaring an array as last element to still allow growing the array -- which is used by the compiler in a few places. This commit removes the constructors for brw_inst, the idea is that the instructions are managed by the brw_shader, so we always go through it for new ones. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:57 +00:00
Caio Oliveira	c81c8c917f	brw: Remove builtin sources from brw_inst A later patch will add a different mechanism to achieve the same goal. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:57 +00:00
Caio Oliveira	858162a2fc	brw: Allocate brw_inst::src with ralloc In the few cases we have to _increase_ the number of sources, the new code will not attempt to recollect the memory, i.e. it delays freeing the old smaller one source array. For the instructions that may need this (when making a SEND into a SEND_GATHER), this is not expected to happen more than once. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:56 +00:00
Caio Oliveira	29c12bbebf	brw: Centralize brw_inst allocation Add and use brw_new_inst() and brw_clone_inst() and do not use stack allocated brw_insts. The builder was changed to not use the temporary ones either. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:56 +00:00
Caio Oliveira	c90ec6d7e7	brw: Use uint16_t for size_written UINT16_MAX is larger than the maximum number of bytes in the general register file: 256 GRFs * 16 slots * 4 bytes = 16384. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:55 +00:00
Kenneth Graunke	6281a12822	brw: Remove brw_inst::no_dd_check/no_dd_clear These dependency hints were primarily useful for the vec4 backend, where it was common to write subsets of a vec4's components across multiple instructions. In the scalar backend, we rarely used them. They also no longer exist on Tigerlake and later in favor of software scoreboarding. Dropping this allows us to clean up the IR a bit. We still use the hardware hints in the generator in a couple places: - Gfx9-12.0 scratch headers - Quad swizzles - Indirect MOV lowering In theory we might want them back if we moved that lowering to the IR. For scratch at least, I suspect it won't have a huge impact, as we're already incurring the cost of spills/fills. The others are fairly rare as well, so it may not be worth keeping. Reviewed-by: Caio Oliveira <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>	2025-09-12 00:24:55 +00:00
Sagar Ghuge	99cd6ffd1f	isl: Respect driconf option for EnableSamplerRoutetoLSC For EnableSamplerRoutetoLSC, we do check driconf option. Buffer state setup is just missing that option so add check for that too. Fixes: `7934b70f` ("isl/iris/anv: provide drirc toggle intel_sampler_route_to_lsc") Cc: mesa-stable Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com> Reviewed-by: Tapani Pälli <tapani.palli@intel.com> Reviewed-by: Nanley Chery <nanley.g.chery@intel.com> Reviewed-by: Caleb Callaway <caleb.callaway@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37190>	2025-09-11 20:31:51 +00:00
Caio Oliveira	03e9c01f0c	brw: Add and use more brw_validate.cpp macros Add and use more comparison variants (which provide more detailed print out of the values), remove old references to "fsv" and "scalar", use assertion names more similar to GoogleTest that we already use elsewhere. Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37267>	2025-09-10 17:44:38 -07:00
Dylan Baker	08a3497223	anv: add assertion that tes and tcs data is non-null It doesn't make any sense to have TCS but not TES (or vice versa), but coverity doesn't realize that. Add an assertion that they are both non-null before we start reading them. Fixes: `50fd669294` ("anv: prep work for separate tessellation shaders") CID: 1665360 CID: 1665327 Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37266>	2025-09-10 18:18:42 +00:00
Dylan Baker	ecfce9f9ad	blorp: Fix potential read of uninitialized elk fields in debug paths The intel_vue_map is only partially initialized before being used. All used fields are initialized, but in debug paths the uninitialized fields will also be read. To fix this initialize the struct to 0. In the brw path this struct is part of the prog_data, and is rzalloc'd. CID: 1665308 Reviewed-by: Iván Briano <ivan.briano@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37261>	2025-09-10 17:51:34 +00:00
Dylan Baker	6fe4b7344d	isl: prevent potential overflow before widen Fixes: `73608eb8b7` ("isl: Add support for creating layered surfaces for video encode/decode") CID: 1665354 Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37260>	2025-09-10 17:01:40 +00:00
Dylan Baker	f18aca8689	intel/brw: Fix implementation of |= operator for enum The current implementation does nothing, since it has no side effects, only a return value. By passing `x` as a reference we can mutate the value before returning. Fixes: `df37c7ca74` ("brw: fix analysis dirtying with pulled constants") CID: 1665293 Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37263>	2025-09-10 16:30:19 +00:00
Dylan Baker	70ebc14de9	anv: avoid potential integer overflow in video address calculation Coverity caught one instance of this, by visual inspection I found another case. Fixes: `3fb25cc78a` ("anv: Add support for creating layered surfaces for video encode/decode") CID: 1665326 Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37262>	2025-09-10 16:06:37 +00:00
Lionel Landwerlin	1646e7d311	anv: run nir_opt_acquire_release_barriers In the middle of writing all this new shader object compile code, this pass got added and I missed adding it to the shader object path. Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Fixes: `d39e443ef8` ("anv: add infrastructure for common vk_pipeline") Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37269>	2025-09-10 11:47:05 +00:00
Konstantin Seurer	850f339b89	vulkan: Add more detail to encode debug markers Useful for radv because radv has quite a few different configurations. Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36982>	2025-09-10 08:35:50 +00:00
Konstantin Seurer	5c94e20abe	vulkan: Use a struct for debug markers Improves u_trace integration with anv. Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36982>	2025-09-10 08:35:50 +00:00
Lionel Landwerlin	33d2c31d7a	brw: don't use brw_null_reg() for unused SEND sources Just avoiding the validation assert. Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Fixes: `47fe9d28e7` ("brw: Enumerate SHADER_OPCODE_SEND sources and standardize how many") Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13777 Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37112>	2025-09-10 09:08:27 +03:00
Francisco Jerez	5bf7bb5cf9	intel/brw/xe3+: Re-enable static analysis-based SIMD32 FS heuristic for the moment. This disables for now the "optimistic" SIMD heuristic that was implemented for xe3+ and makes it dependent on a debugging option, instead use the static analysis-based codepath that was used in previous generations and was extended by previous commits in this MR to model the xe3 trade-off between register use and thread parallelism. The reason is that the main assumption of the optimistic SIMD heuristic didn't hold up with reality: Real-world testing on PTL shows that there are many cases where SIMD32 shows performance degradation relative to SIMD16 despite the ability of xe3 hardware to scale the GRF file of a thread on demand, unfortunately that scenario seems to be more pervasive than hoped when the optimistic SIMD heuristic was implemented pre-silicon. In many cases what seems to be going on is that even when the register file is able to scale with the increased register use of SIMD32, the thread parallelism of the EU is scaled down by a similar factor, so at the bottom line SIMD32 (depending on the actual ratio of register use between both variants) may not buy us anything, and it frequently encounters constraints (like SIMD lowering and less effective scheduling) that lead to worse codegen than SIMD16, easily tipping the balance in favor of SIMD16. 
The extension of the performance analysis pass that was done in a previous commit allows the original SIMD32 heuristic to take into account quantitatively this effect, and that seems pretty effective at disabling SIMD32 shaders that underperform judging from the statistically significant improvement of most Traci test-cases that run on my PTL system (4 iterations, 5% significance), no statistically significant regressions were observed: Nba2K23-trace-dx11-2160p-ultra: 10.16% ±0.34% Superposition-trace-dx11-2160p-extreme: 4.06% ±0.50% TotalWarWarhammer3-trace-dx11-1080p-high: 3.52% ±0.76% Payday3-trace-dx11-1440p-ultra: 2.41% ±0.81% MetroExodus-trace-dx11-2160p-ultra: 2.28% ±0.78% Borderlands3-trace-dx11-2160p-ultra: 1.89% ±0.65% MountAndBlade2-trace-dx11-1440p-veryhigh: 1.81% ±0.40% Blackops3-trace-dx11-1080p-high: 1.66% ±0.29% HogwartsLegacy-trace-dx12-1080p-ultra: 1.53% ±0.22% TotalWarPharaoh-trace-dx11-1440p-ultra: 1.44% ±0.31% Fortnite-trace-dx11-2160p-epix: 1.44% ±0.27% Naraka-trace-dx11-1440p-highest: 1.39% ±0.27% PubG-trace-dx11-1440p-ultra: 1.30% ±0.49% Destiny2-trace-dx11-1440p-highest: 1.10% ±0.23% Factorio-trace-1080p-high: 1.10% ±1.77% TerminatorResistance-trace-dx11-2160p-ultra: 1.08% ±0.31% Ghostrunner2-trace-dx11-1440p-ultra: 1.05% ±0.15% ShadowTombRaider-trace-dx11-2160p-ultra: 0.98% ±0.19% CitiesSkylines2-trace-dx11-1440p-high: 0.67% ±0.19% Palworld-trace-dx11-1080p-med: 0.44% ±0.22% The downside is that this will reverse the large reduction in compile-time we gained from the optimistic SIMD heuristic -- The run-time of both shader-db and fossil-db jump back up by nearly 20% with this change. 
I'm working on a better compromise based on run-time feedback that will hopefully allow us to preserve the compile-time benefit of the optimistic heuristic without the reduction in run-time performance, but in the meantime it seems like the run-time performance gap from SIMD32 is the more urgent issue to address since it has an impact on titles across the board. Despite the reversal of that compile-time improvement xe3 still achieves slightly lower compile time on the average than previous generations as a result of VRT, so this doesn't seem terribly tragic. v2: Add bit to brw_get_compiler_config_value() (Lionel). Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:58 +00:00
Francisco Jerez	a7969b5d42	intel/brw: Apply `7e1362e9c0` to pre-xe3 codepath of brw_compile_fs(). This applies the same workaround as `7e1362e9c0` to the pre-xe3 codepath of brw_compile_fs(), since ray queries appear to be unsupported from SIMD32 fragment shaders. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:58 +00:00
Francisco Jerez	531a34c7dd	intel/brw/xe3+: Select scheduler heuristic with best trade-off between register pressure and latency. The current register allocation loop attempts to use a sequence of pre-RA scheduling heuristics until register allocation is successful. The sequence of scheduling heuristics is expected to be increasingly aggressive at reducing the register pressure of the program (at a performance cost), so that the instruction ordering chosen gives the lowest latency achievable with the register space available. Unfortunately that approach doesn't consistently give the best performance on xe3+, since on recent platforms a schedule with higher latency may actually give better performance if its lower register pressure allows the use of a lower number of VRT register blocks which allows the EU to run more threads in parallel. This means that on xe3+ the scheduling mode with highest performance is fundamentally dependent on the specific scenario (in particular where in the thread count-register use curve the program is at, and how effective the scheduler heuristics are at reducing latency for each additional block of GRFs used), so it isn't possible to construct a fixed sequence of the existing heuristics guaranteed to be ordered by decreasing performance. In order to find the scheduling heuristic with better performance we have to run multiple of them prior to register allocation and do some arithmetic to account for the effect on parallelism of the register pressure estimated in each case, in order to decide which heuristic will give the best performance. This sounds costly but it is similar to the approach taken by brw_allocate_registers() when unable to allocate without spills in order to decide which scheduling heuristic to use in order to minimize the number of spills. 
In cases where that happens on xe3+ the scheduling runs introduced here don't add to the scheduling runs done to find the heuristic with minimum register pressure, we attempt to determine the heuristic with lowest pressure and best performance in the same loop, and then use one or the other depending on whether register allocation succeeds without spills. Significantly improves performance on PTL of the following Traci test cases (4 iterations, 5% significance): Nba2K23-trace-dx11-2160p-ultra: 4.48% ±0.38% Fortnite-trace-dx11-2160p-epix: 1.61% ±0.28% Superposition-trace-dx11-2160p-extreme: 1.37% ±0.26% PubG-trace-dx11-1440p-ultra: 1.15% ±0.29% GtaV-trace-dx11-2160p-ultra: 0.80% ±0.24% CitiesSkylines2-trace-dx11-1440p-high: 0.68% ±0.19% SpaceEngineers-trace-dx11-2160p-high: 0.65% ±0.34% The compile-time cost of shader-db increases significantly by 3.7% after this commit (15 iterations, 5% significance), the compile-time of fossil-db doesn't change significantly in my setup. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:57 +00:00
Francisco Jerez	0e802cecba	intel/brw: Make sure we don't use stale analysis after inst. order restore in brw_allocate_registers(). Do invalidate_analysis() from restore_instruction_order() to make sure we don't re-use stale analysis pass results if the user forgets to call invalidate_analysis() explicitly. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:57 +00:00
Francisco Jerez	dfc2a89d96	intel/brw: Allow using performance analysis pass pre-register allocation. Mainly this involves changing 'struct state' so that the dep_ready array is allocated with a dynamic size based on the number of VGRFs of the program instead of assuming a fixed XE3_MAX_GRF count of GRF dependencies. VGRF register dependencies are then handled by using one dep_ready entry per VGRF allocation instead of one per hardware register. The ability to use the performance analysis pass pre-regalloc will mostly be useful on xe3+, but this also has the side effect of saving some memory on xe2 and earlier platforms since we no longer need to allocate XE3_MAX_GRF dep_ready entries for them. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:57 +00:00
Francisco Jerez	3936a43496	intel/brw/xe3+: Tweak render target write timings in performance modeling pass. Reduce the cycle-count cost estimate used by the performance model for render target writes on xe3+ in order to match the real-world observation of shaders with latency lower than the previously estimated cost of its render target write. In a shader used by Factorio this would have led us to incorrectly model the shader as fillrate-bound, even though in reality the shader is EU-bound and benefits from the higher parallelism of SIMD32, so the subsequent commit that re-enables the static analysis-based SIMD32 heuristic on PTL would lead to a ~2% regression without this tweak. There appear to be no other regressions nor other changes from this in combination with the subsequent commit that enables it to have an effect, but it is possible that the real cycle count cost of a render target write still lies below the estimated value, ~400 is just the upper bound that can be inferred from the behavior of this test case. Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:57 +00:00
Francisco Jerez	6ccf2a375a	intel/brw/xe3+: Adjust weights of discard control flow for non-EU-fused platforms. Currently on platforms without EU fusion (all platforms other than gfx12.x) we were using a constant discard_weight = 1.0 regardless of SIMD width. This was far from ideal, in particular since it made the performance analysis pass fully insensitive to the presence of discard jumps, even though the scheduler is able to move code past a discard statement so the range of the program under discard control flow can vary and have a material effect on the relative performance of SIMD16 vs. SIMD32, since the scheduler is typically more constrained in SIMD32 dispatch mode. In order to fix this use a discard_weight lower than 1.0 for all dispatch modes, so that the performance analysis pass accounts for the presence and range of discard control flow. In addition use a lower discard_weight for SIMD16 dispatch like we do on Gfx12.x in order to account for the higher likelihood of divergent discard in SIMD32 mode. The specific weights were determined iteratively on PTL based on the final FPS result of several traces that are sensitive to the dispatch width of one or more fragment shaders that use discard, in order to ensure that in none of those cases we end up using the lower-performing dispatch width variant. This avoids regressions between 3.7% and 0.8% in Superposition-trace-dx11-2160p-extreme, BaldursGate3-trace-dx11-1440p-ultra and MetroExodus-trace-dx11-2160p-ultra after enabling the static analysis-based SIMD32 heuristic in PTL. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1) v2: Limit to xe3+ for now since performance effect seems to be a wash on xe2. Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:57 +00:00
Francisco Jerez	1272ff5ed1	intel/brw/xehp+: Adjust performance model weights of LSC atomic ops. The LSC implements several optimizations for atomic operations on memory addresses that are uniform across all lanes, in which case its cost is approximately O(1) instead of O(exec_size). Even cases where memory offsets are non-uniform but packed in a cacheline appear to have a cost that is non-linear with the number of lanes. In order to approximate this behavior more closely approximate its back-end cost as roughly 1300 cycles instead of the previous 400 * exec_size/8. This fixes some cases where we were incorrectly predicting the SIMD32 shader would be bound by the throughput of LSC atomic operations, even though the observed cost per lane of the LSC operations was significantly lower in SIMD32 mode so it would have the best performance. Clearly this is still a rough approximation and it might be possible to obtain a more accurate result by plumbing divergence analysis data all the way down to codegen, however the goal of the performance analysis pass isn't to provide an exact prediction of the performance of a shader (that's not really possible in general via static analysis without solving the halting problem), but to provide a good enough approximation at a low cost -- And the constant approximation seems to be strictly better in practice than the approximation we were using before, there appear to be no regressions from this change, and ShadowTombRaider-trace-dx11-2160p-ultra shows 5.7% better performance on PTL with a subsequent commit that re-enables the use of the static analysis-based SIMD32 heuristic on xe3+. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:56 +00:00
Francisco Jerez	6eea9659db	intel/brw/xe3+: Model trade-off between parallelism and GRF use in performance analysis. This extends the performance analysis pass used in previous generations to make it more useful to deal with the performance trade-off encountered on xe3 hardware as a result of VRT. VRT allows the driver to request a per-thread GRF allocation different from the 128 GRFs that were typical in previous platforms, but this comes at either a thread parallelism cost or benefit depending on the number of GRF register blocks requested. This makes a number of decisions more difficult for the compiler since certain optimizations potentially trade off run-time in a thread against the total number of threads that can run in parallel (e.g. consider scheduling and how reordering an instruction to avoid a stall can increase GRF use and therefore reduce thread-level parallelism when trying to improve instruction-level parallelism). This patch provides a simple heuristic tool to account for the combined interaction of register pressure and other single-threaded factors that affect performance. This is expressed with the redefinition of the pre-existing brw_performance::throughput estimate as the number of invocations per cycle per EU that would be achieved if there were enough threads to reach full load (in this sense this is to be considered a heuristic since the penalty from VRT may be lower than expected from this model at low EU load). This will be used e.g. in order to decide whether to use a more aggressive latency-minimizing mode during scheduling or a mode more effective at minimizing register pressure (it makes sense to take the path that will lead to the most invocations being serviced per cycle while under load). 
This also allows us to re-enable the old PS SIMD32 heuristic on xe3+, and due to this change it is able to identify cases where the combined effect of poorer scheduling and higher GRF use of the SIMD32 variant makes it more favorable to use SIMD16 only (see last patch of the MR for details and numbers). Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>	2025-09-10 02:15:56 +00:00
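
Note on f18aca8689 (`|=` operator for enum): the commit message says the old operator had no side effects because `x` was not passed by reference. A minimal sketch of that class of bug follows. The `dirty_flags` enum, its members, and `or_assign_by_value()` are hypothetical names for illustration only and are not taken from the Mesa sources.

```cpp
#include <cstdint>

// Hypothetical flag enum, for illustration only (not the Mesa type).
enum class dirty_flags : uint32_t {
   NONE         = 0,
   INSTRUCTIONS = 1u << 0,
   VARIABLES    = 1u << 1,
};

constexpr dirty_flags
operator|(dirty_flags a, dirty_flags b)
{
   return static_cast<dirty_flags>(static_cast<uint32_t>(a) |
                                   static_cast<uint32_t>(b));
}

// Broken shape: `x` is a copy, so the caller's variable is never updated.
// Only the return value carries the combined flags.
inline dirty_flags
or_assign_by_value(dirty_flags x, dirty_flags y)
{
   x = x | y;
   return x;
}

// Fixed shape: `x` is a reference, so the update is visible to the caller.
inline dirty_flags &
operator|=(dirty_flags &x, dirty_flags y)
{
   x = x | y;
   return x;
}
```

With the by-value form, a statement such as `or_assign_by_value(flags, dirty_flags::VARIABLES);` leaves `flags` unchanged, whereas `flags |= dirty_flags::VARIABLES;` with the by-reference operator updates it in place.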


14620 commits