Flatten all the work being done into brw_new_inst() and
brw_clone_inst(), and allocate both the instruction and its
sources in one go.
For now we keep a pointer to the array instead of declaring the
array as the struct's last member, so the array can still be
grown -- which the compiler does in a few places.
This commit removes the constructors for brw_inst; the idea is
that instructions are managed by the brw_shader, so we always go
through it to create new ones.
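As a rough sketch, the single-allocation approach could look like the
following (hypothetical struct layout and helper name, not the actual
brw_inst/brw_new_inst(); it only assumes Mesa's ralloc helpers):
```
#include "util/ralloc.h"

/* Simplified, hypothetical instruction layout: the source array lives in
 * the same allocation as the instruction, but is reached through a pointer
 * so it can later be replaced by a larger array. */
struct inst_sketch {
   unsigned sources;      /* number of sources */
   struct brw_reg *src;   /* points just past the struct itself */
   /* ...opcode, dst, exec size, etc... */
};

static struct inst_sketch *
new_inst_sketch(void *mem_ctx, unsigned num_sources)
{
   struct inst_sketch *inst = (struct inst_sketch *)
      rzalloc_size(mem_ctx, sizeof(*inst) + num_sources * sizeof(struct brw_reg));
   inst->sources = num_sources;
   inst->src = (struct brw_reg *)(inst + 1);
   return inst;
}
```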
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
In the few cases where we have to _increase_ the number of sources, the
new code will not attempt to reclaim the memory, i.e. it delays freeing
the old, smaller source array. For the instructions that may need this
(when turning a SEND into a SEND_GATHER), this is not expected to
happen more than once.
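The regrow path then amounts to something like the following sketch
(hypothetical helper building on the layout sketched above; the old,
smaller array is simply left for the memory context to release):
```
#include <string.h>
#include "util/ralloc.h"

/* Grow the source array of an inst_sketch (see the previous sketch).  The
 * old array is intentionally not freed here; it is released together with
 * the shader's memory context. */
static void
resize_sources_sketch(struct inst_sketch *inst, void *mem_ctx, unsigned new_count)
{
   if (new_count > inst->sources) {
      struct brw_reg *bigger =
         rzalloc_array(mem_ctx, struct brw_reg, new_count);
      memcpy(bigger, inst->src, inst->sources * sizeof(struct brw_reg));
      inst->src = bigger;
   }
   inst->sources = new_count;
}
```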
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
Add and use brw_new_inst() and brw_clone_inst(), and stop using
stack-allocated brw_insts. The builder was changed to avoid the
temporary ones as well.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
UINT16_MAX is larger than the maximum number of bytes in the
general register file: 256 GRFs * 16 slots * 4 bytes = 16384.
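For reference, the bound can be spelled out as a compile-time check (a
sanity-check sketch, not code from the commit):
```
#include <cstdint>

/* 256 GRFs, each 16 dword slots of 4 bytes, gives 16384 bytes, which
 * comfortably fits in a uint16_t. */
static_assert(256 * 16 * 4 == 16384, "max GRF bytes");
static_assert(256 * 16 * 4 <= UINT16_MAX, "byte offsets fit in uint16_t");
```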
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
These dependency hints were primarily useful for the vec4 backend, where
it was common to write subsets of a vec4's components across multiple
instructions. In the scalar backend, we rarely used them. They also no
longer exist on Tigerlake and later in favor of software scoreboarding.
Dropping this allows us to clean up the IR a bit.
We still use the hardware hints in the generator in a couple places:
- Gfx9-12.0 scratch headers
- Quad swizzles
- Indirect MOV lowering
In theory we might want them back if we moved that lowering to the IR.
For scratch at least, I suspect it won't have a huge impact, as we're
already incurring the cost of spills/fills. The others are fairly rare
as well, so it may not be worth keeping.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
Add and use more comparison variants (which provide more detailed
printouts of the values), remove old references to "fsv" and "scalar",
and use assertion names closer to those of GoogleTest, which we already
use elsewhere.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37267>
The current implementation does nothing, since it has no side effects,
only a return value. By passing `x` as a reference we can mutate the
value before returning.
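A stripped-down illustration of the class of bug (hypothetical function
names, not the actual code touched by the fix):
```
/* Before: `x` is taken by value, so the increment is lost and a call made
 * only for its side effect does nothing. */
static bool bump_by_value(unsigned x)  { x += 1; return true; }

/* After: `x` is taken by reference, so the caller observes the new value. */
static bool bump_by_ref(unsigned &x)   { x += 1; return true; }
```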
Fixes: df37c7ca74 ("brw: fix analysis dirtying with pulled constants")
CID: 1665293
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37263>
This disables, for now, the "optimistic" SIMD heuristic that was
implemented for xe3+ and makes it dependent on a debugging option.
Instead, use the static analysis-based codepath that was used on
previous generations and was extended by earlier commits in this MR
to model the xe3 trade-off between register use and thread
parallelism.
The reason is that the main assumption of the optimistic SIMD
heuristic didn't hold up in reality: real-world testing on PTL shows
many cases where SIMD32 performs worse than SIMD16 despite the
ability of xe3 hardware to scale a thread's GRF file on demand.
Unfortunately that scenario turned out to be more pervasive than was
hoped when the optimistic SIMD heuristic was implemented pre-silicon.
In many cases what seems to be going on is that even when the register
file is able to scale with the increased register use of SIMD32, the
thread parallelism of the EU is scaled down by a similar factor, so on
the bottom line SIMD32 (depending on the actual ratio of register use
between both variants) may not buy us anything. In addition, it
frequently runs into constraints (like SIMD lowering and less effective
scheduling) that lead to worse codegen than SIMD16, easily tipping the
balance in favor of SIMD16. The extension of the performance analysis
pass done in a previous commit allows the original SIMD32 heuristic to
take this effect into account quantitatively, and that seems pretty
effective at disabling SIMD32 shaders that underperform, judging from
the statistically significant improvement of most Traci test cases that
run on my PTL system (4 iterations, 5% significance). No statistically
significant regressions were observed:
Nba2K23-trace-dx11-2160p-ultra: 10.16% ±0.34%
Superposition-trace-dx11-2160p-extreme: 4.06% ±0.50%
TotalWarWarhammer3-trace-dx11-1080p-high: 3.52% ±0.76%
Payday3-trace-dx11-1440p-ultra: 2.41% ±0.81%
MetroExodus-trace-dx11-2160p-ultra: 2.28% ±0.78%
Borderlands3-trace-dx11-2160p-ultra: 1.89% ±0.65%
MountAndBlade2-trace-dx11-1440p-veryhigh: 1.81% ±0.40%
Blackops3-trace-dx11-1080p-high: 1.66% ±0.29%
HogwartsLegacy-trace-dx12-1080p-ultra: 1.53% ±0.22%
TotalWarPharaoh-trace-dx11-1440p-ultra: 1.44% ±0.31%
Fortnite-trace-dx11-2160p-epix: 1.44% ±0.27%
Naraka-trace-dx11-1440p-highest: 1.39% ±0.27%
PubG-trace-dx11-1440p-ultra: 1.30% ±0.49%
Destiny2-trace-dx11-1440p-highest: 1.10% ±0.23%
Factorio-trace-1080p-high: 1.10% ±1.77%
TerminatorResistance-trace-dx11-2160p-ultra: 1.08% ±0.31%
Ghostrunner2-trace-dx11-1440p-ultra: 1.05% ±0.15%
ShadowTombRaider-trace-dx11-2160p-ultra: 0.98% ±0.19%
CitiesSkylines2-trace-dx11-1440p-high: 0.67% ±0.19%
Palworld-trace-dx11-1080p-med: 0.44% ±0.22%
The downside is that this will reverse the large reduction in
compile-time we gained from the optimistic SIMD heuristic -- the
run-time of both shader-db and fossil-db jumps back up by nearly 20%
with this change. I'm working on a better compromise based on
run-time feedback that will hopefully allow us to preserve the
compile-time benefit of the optimistic heuristic without the reduction
in run-time performance, but in the meantime the run-time performance
gap from SIMD32 seems like the more urgent issue to address since it
has an impact on titles across the board. Despite the reversal of
that compile-time improvement, xe3 still achieves slightly lower
compile time on average than previous generations as a result of VRT,
so this doesn't seem terribly tragic.
v2: Add bit to brw_get_compiler_config_value() (Lionel).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
This applies the same workaround as 7e1362e9c0 to the pre-xe3
codepath of brw_compile_fs(), since ray queries appear to be
unsupported from SIMD32 fragment shaders.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
The current register allocation loop attempts to use a sequence of
pre-RA scheduling heuristics until register allocation is successful.
The sequence of scheduling heuristics is expected to be increasingly
aggressive at reducing the register pressure of the program (at a
performance cost), so that the instruction ordering chosen gives the
lowest latency achievable with the register space available.
Unfortunately that approach doesn't consistently give the best
performance on xe3+, since on recent platforms a schedule with higher
latency may actually give better performance if its lower register
pressure allows the use of fewer VRT register blocks, which lets the
EU run more threads in parallel.
This means that on xe3+ the scheduling mode with the highest
performance is fundamentally dependent on the specific scenario (in
particular where the program sits on the thread-count vs. register-use
curve, and how effective the scheduler heuristics are at reducing
latency for each additional block of GRFs used), so it isn't possible
to construct a fixed sequence of the existing heuristics guaranteed to
be ordered by decreasing performance. To find the scheduling heuristic
with the best performance we have to run several of them prior to
register allocation and do some arithmetic to account for the effect
of each one's estimated register pressure on parallelism, in order to
decide which heuristic will perform best.
This sounds costly, but it is similar to the approach already taken by
brw_allocate_registers() when it is unable to allocate without spills
and has to decide which scheduling heuristic minimizes the number of
spills. In cases where that happens on xe3+, the scheduling runs
introduced here don't add to the scheduling runs done to find the
heuristic with minimum register pressure: we determine the heuristic
with the lowest pressure and the one with the best performance in the
same loop, and then use one or the other depending on whether register
allocation succeeds without spills.
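The selection can be pictured roughly like this (a hand-wavy sketch with
made-up names; the real loop lives in brw_allocate_registers() and uses
the backend's own scheduling and analysis types):
```
#include <vector>

/* Hypothetical summary of one pre-RA scheduling attempt. */
struct sched_candidate {
   int mode;                    /* which scheduling heuristic was used */
   float estimated_throughput;  /* from the performance analysis pass,
                                   already accounting for VRT parallelism */
   unsigned grf_pressure;       /* estimated register pressure */
};

/* Pick the best-performing schedule when allocation succeeds without
 * spills, otherwise fall back to the schedule with the lowest register
 * pressure.  Assumes `candidates` is non-empty. */
static int
pick_pre_ra_schedule(const std::vector<sched_candidate> &candidates,
                     bool allocated_without_spills)
{
   const sched_candidate *best_perf = &candidates[0];
   const sched_candidate *least_pressure = &candidates[0];
   for (const sched_candidate &c : candidates) {
      if (c.estimated_throughput > best_perf->estimated_throughput)
         best_perf = &c;
      if (c.grf_pressure < least_pressure->grf_pressure)
         least_pressure = &c;
   }
   return allocated_without_spills ? best_perf->mode : least_pressure->mode;
}
```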
Significantly improves performance on PTL of the following Traci test
cases (4 iterations, 5% significance):
Nba2K23-trace-dx11-2160p-ultra: 4.48% ±0.38%
Fortnite-trace-dx11-2160p-epix: 1.61% ±0.28%
Superposition-trace-dx11-2160p-extreme: 1.37% ±0.26%
PubG-trace-dx11-1440p-ultra: 1.15% ±0.29%
GtaV-trace-dx11-2160p-ultra: 0.80% ±0.24%
CitiesSkylines2-trace-dx11-1440p-high: 0.68% ±0.19%
SpaceEngineers-trace-dx11-2160p-high: 0.65% ±0.34%
The compile-time cost of shader-db increases significantly, by 3.7%,
after this commit (15 iterations, 5% significance), while the
compile-time of fossil-db doesn't change significantly in my setup.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Call invalidate_analysis() from restore_instruction_order() to make sure
we don't reuse stale analysis pass results if the user forgets to
call invalidate_analysis() explicitly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Mainly this involves changing 'struct state' so that the dep_ready
array is allocated with a dynamic size based on the number of VGRFs of
the program instead of assuming a fixed XE3_MAX_GRF count of GRF
dependencies. VGRF register dependencies are then handled by using
one dep_ready entry per VGRF allocation instead of one per hardware
register.
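Schematically, the change looks something like this (hypothetical struct
and field names; the real 'struct state' in the pass has more members):
```
#include "util/ralloc.h"

struct state_sketch {
   unsigned num_vgrfs;
   unsigned *dep_ready;   /* one entry per VGRF allocation, sized dynamically
                           * instead of a fixed XE3_MAX_GRF-entry array */
};

static void
init_state_sketch(struct state_sketch *st, void *mem_ctx, unsigned num_vgrfs)
{
   st->num_vgrfs = num_vgrfs;
   st->dep_ready = rzalloc_array(mem_ctx, unsigned, num_vgrfs);
}
```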
The ability to use the performance analysis pass pre-regalloc will
mostly be useful on xe3+, but this also has the side effect of saving
some memory on xe2 and earlier platforms since we no longer need to
allocate XE3_MAX_GRF dep_ready entries for them.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Reduce the cycle-count cost estimate used by the performance model for
render target writes on xe3+, to match the real-world observation of
shaders whose latency is lower than the previously estimated cost of
their render target write.
In a shader used by Factorio this would have led us to incorrectly
model the shader as fillrate-bound, even though in reality the shader
is EU-bound and benefits from the higher parallelism of SIMD32, so the
subsequent commit that re-enables the static analysis-based SIMD32
heuristic on PTL would lead to a ~2% regression without this tweak.
There appear to be no other regressions or changes from this in
combination with the subsequent commit that allows it to have an
effect, but it is possible that the real cycle-count cost of a render
target write still lies below the estimated value; ~400 is just the
upper bound that can be inferred from the behavior of this test case.
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Until now, on platforms without EU fusion (all platforms other than
gfx12.x) we were using a constant discard_weight = 1.0 regardless of
SIMD width. This was far from ideal, in particular since it made the
performance analysis pass fully insensitive to the presence of discard
jumps, even though the scheduler is able to move code past a discard
statement, so the range of the program under discard control flow can
vary and have a material effect on the relative performance of SIMD16
vs. SIMD32, since the scheduler is typically more constrained in
SIMD32 dispatch mode.
To fix this, use a discard_weight lower than 1.0 for all dispatch
modes, so that the performance analysis pass accounts for the presence
and range of discard control flow. In addition, use a lower
discard_weight for SIMD16 dispatch, like we do on Gfx12.x, in order to
account for the higher likelihood of divergent discard in SIMD32 mode.
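In sketch form (the numeric values below are placeholders for
illustration only; the tuned weights are in the actual commit):
```
/* Weight applied to the cost of code under discard control flow.  Lower
 * weights model a better chance that whole threads terminate early. */
static float
discard_weight_sketch(unsigned dispatch_width)
{
   /* Placeholder values: SIMD32 is more likely to see divergent discards,
    * so its weight stays closer to 1.0 than SIMD16's. */
   return dispatch_width >= 32 ? 0.75f : 0.5f;
}
```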
The specific weights were determined iteratively on PTL based on the
final FPS result of several traces that are sensitive to the dispatch
width of one or more fragment shaders that use discard, in order to
ensure that in none of those cases we end up using the
lower-performing dispatch width variant. This avoids regressions
between 3.7% and 0.8% in Superposition-trace-dx11-2160p-extreme,
BaldursGate3-trace-dx11-1440p-ultra and
MetroExodus-trace-dx11-2160p-ultra after enabling the static
analysis-based SIMD32 heuristic in PTL.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)
v2: Limit to xe3+ for now since performance effect seems to be a wash
on xe2.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
The LSC implements several optimizations for atomic operations on
memory addresses that are uniform across all lanes, in which case
their cost is approximately O(1) instead of O(exec_size). Even cases
where the memory offsets are non-uniform but packed within a cacheline
appear to have a cost that is non-linear in the number of lanes.
To model this behavior more closely, approximate the back-end cost as
roughly 1300 cycles instead of the previous 400 * exec_size/8. This
fixes some cases where we were incorrectly predicting that the SIMD32
shader would be bound by the throughput of LSC atomic operations, even
though the observed cost per lane of the LSC operations was
significantly lower in SIMD32 mode, so it would actually have had the
best performance.
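The cost-model change amounts to something like the following
(hypothetical helper; the real estimate is computed inside the
performance analysis pass):
```
/* Back-end cycle cost estimate for an LSC atomic message. */
static unsigned
lsc_atomic_cost_sketch(unsigned exec_size, bool use_constant_model)
{
   if (use_constant_model)
      return 1300;                 /* roughly O(1), matching LSC behavior */
   else
      return 400 * exec_size / 8;  /* previous estimate, linear in lanes */
}
```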
Clearly this is still a rough approximation, and it might be possible
to obtain a more accurate result by plumbing divergence analysis data
all the way down to codegen. However, the goal of the performance
analysis pass isn't to provide an exact prediction of a shader's
performance (that's not really possible in general via static analysis
without solving the halting problem), but to provide a good enough
approximation at a low cost -- and the constant approximation seems to
be strictly better in practice than the one we were using before:
there appear to be no regressions from this change, and
ShadowTombRaider-trace-dx11-2160p-ultra shows 5.7% better performance
on PTL with a subsequent commit that re-enables the static
analysis-based SIMD32 heuristic on xe3+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
This extends the performance analysis pass used in previous
generations to make it more useful to deal with the performance
trade-off encountered on xe3 hardware as a result of VRT. VRT allows
the driver to request a per-thread GRF allocation different from the
128 GRFs that were typical in previous platforms, but this comes at
either a thread parallelism cost or benefit depending on the number of
GRF register blocks requested.
This makes a number of decisions more difficult for the compiler since
certain optimizations potentially trade off run-time in a thread
against the total number of threads that can run in parallel
(e.g. consider scheduling and how reordering an instruction to avoid a
stall can increase GRF use and therefore reduce thread-level
parallelism when trying to improve instruction-level parallelism).
This patch provides a simple heuristic tool to account for the
combined interaction of register pressure and other single-threaded
factors that affect performance. This is expressed with the
redefinition of the pre-existing brw_performance::throughput estimate
as the number of invocations per cycle per EU that would be achieved
if there were enough threads to reach full load (in this sense this is
to be considered a heuristic since the penalty from VRT may be lower
than expected from this model at low EU load).
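One way to read the redefinition, as a heavily simplified sketch (this
formula is an assumption made for illustration, not the pass's actual
math): scale the per-thread invocation rate by however much thread-level
parallelism the requested GRF block count still allows.
```
/* Hypothetical: invocations per cycle per EU under full load, given how
 * many threads per EU the chosen GRF block count leaves available. */
static float
throughput_sketch(float invocations_per_cycle_single_thread,
                  unsigned threads_per_eu_at_grf_count,
                  unsigned threads_per_eu_baseline)
{
   const float parallelism =
      (float)threads_per_eu_at_grf_count / (float)threads_per_eu_baseline;
   return invocations_per_cycle_single_thread * parallelism;
}
```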
This will be used e.g. in order to decide whether to use a more
aggressive latency-minimizing mode during scheduling or a mode more
effective at minimizing register pressure (it makes sense to take the
path that will lead to the most invocations being serviced per cycle
while under load). This also allows us to re-enable the old PS SIMD32
heuristic on xe3+, and due to this change it is able to identify cases
where the combined effect of poorer scheduling and higher GRF use of
the SIMD32 variant makes it more favorable to use SIMD16 only (see
last patch of the MR for details and numbers).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
This causes the graph coloring allocator to use the optimistic
coloring codepath for all nodes whose total Q value exceeds the
threshold of 96 GRFs, in order to do a better job at minimizing the
register requirement of programs even when they are trivially
colorable. At the threshold of 96 GRFs the number of threads
available per EU starts decreasing as the number of register blocks
requested by the program increases, so decreasing the number of
registers can increase performance.
That showed up in some test cases as a performance inversion from the
enabling of VRT, since the extension of the register set to 256 GRFs
has the side effect of making some non-trivially colorable programs
trivially colorable, which would cause the register allocator to do a
worse job at ordering the (trivial) allocations due to the optimistic
coloring path being skipped, leading to increased register use and
reduced performance.
The following Traci test cases improve significantly as a result of
this change (4 iterations, 5% significance):
MetroExodus-trace-dx11-2160p-ultra: 1.90% ±0.85%
BaldursGate3-trace-dx11-1440p-ultra: 1.47% ±0.38%
Palworld-trace-dx11-1080p-med: 1.01% ±0.09%
TerminatorResistance-trace-dx11-2160p-ultra: 0.95% ±0.29%
Control-trace-dx11-1440p-high: 0.87% ±0.50%
Even though lowering the P value threshold is theoretically expected to
have a compile-time cost due to the increased use of the slower
optimistic path of the graph coloring allocator, this doesn't actually
show up in my numbers: my shader-db and fossil-db compile-time numbers
don't show any statistically significant change (13 iterations, 5%
significance).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
This defines a new pre-RA scheduling mode similar to BRW_SCHEDULE_PRE
but more aggressive at optimizing for minimum latency rather than
minimum register usage. The main motivation is that on recent xe3
platforms we use a register allocation heuristic that packs variables
more tightly at the bottom of the register file instead of the
round-robin heuristic we used on previous platforms, since as a result
of VRT there is a parallelism penalty when a program uses more GRF
registers than necessary. Unfortunately the xe3 tight-packing
heuristic severely constrains the work of the post-RA scheduler due to
the false dependencies introduced during register allocation, so we
can do a better job by making the scheduler aware of instruction
latencies before the register allocator introduces any false
dependencies.
This can lead to higher register pressure, but only when the scheduler
decides it could save cycles by extending a live range. It makes
sense to preserve the pre-existing BRW_SCHEDULE_PRE as a separate mode,
since some workloads can still benefit from neglecting latencies
pre-RA due to the aforementioned trade-off between parallelism and GRF
use; a future commit will introduce a more accurate estimate of the
expected relative performance of BRW_SCHEDULE_PRE
vs. BRW_SCHEDULE_PRE_LATENCY that takes this trade-off into account.
In theory this could also be helpful on earlier pre-xe3 platforms, but
the benefit should be significantly smaller due to the different RA
heuristic so it hasn't been tested extensively pre-xe3.
The following Traci tests are improved significantly by this change on
PTL (nearly all tests that run on my system are affected positively):
Ghostrunner2-trace-dx11-1440p-ultra: 7.12% ±0.36%
SpaceEngineers-trace-dx11-2160p-high: 5.77% ±0.43%
HogwartsLegacy-trace-dx12-1080p-ultra: 4.40% ±0.03%
Naraka-trace-dx11-1440p-highest: 3.06% ±0.43%
MetroExodus-trace-dx11-2160p-ultra: 2.26% ±0.60%
Fortnite-trace-dx11-2160p-epix: 2.12% ±0.53%
Nba2K23-trace-dx11-2160p-ultra: 1.98% ±0.30%
Control-trace-dx11-1440p-high: 1.93% ±0.36%
GodOfWar-trace-dx11-2160p-ultra: 1.62% ±0.47%
TotalWarPharaoh-trace-dx11-1440p-ultra: 1.55% ±0.18%
MountAndBlade2-trace-dx11-1440p-veryhigh: 1.51% ±0.37%
Destiny2-trace-dx11-1440p-highest: 1.44% ±0.34%
GtaV-trace-dx11-2160p-ultra: 1.26% ±0.27%
ShadowTombRaider-trace-dx11-2160p-ultra: 1.10% ±0.58%
Borderlands3-trace-dx11-2160p-ultra: 0.95% ±0.43%
TerminatorResistance-trace-dx11-2160p-ultra: 0.87% ±0.22%
BaldursGate3-trace-dx11-1440p-ultra: 0.84% ±0.28%
CitiesSkylines2-trace-dx11-1440p-high: 0.82% ±0.22%
PubG-trace-dx11-1440p-ultra: 0.72% ±0.37%
Palworld-trace-dx11-1080p-med: 0.71% ±0.26%
Superposition-trace-dx11-2160p-extreme: 0.69% ±0.19%
The compile-time cost of shader-db increases significantly, by 1.85%,
after this commit (14 iterations, 5% significance), while the
compile-time of fossil-db doesn't change significantly in my setup.
v2: Addressed interaction with 81594d0db1,
since the code that calculates deps, delays and exits is no longer
mode-independent after this change. Instead of reverting that
commit (which is non-trivial and would have a greater compile-time
hit) simply reconstruct the scheduler object during the transition
between BRW_SCHEDULE_PRE_LATENCY and any other PRE mode that
doesn't require instruction latencies.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
We were treating explicit flag writes and reads as a full scheduler
barrier, which is unnecessary since the tracking we already do handles
explicit flag access correctly, so there is no reason to take a
possibly large performance hit from add_barrier_deps().
Found by inspection while trying to understand the poor scheduling of
some fragment shaders. Improves performance by a small but
statistically significant amount (4 iterations, 5% significance) for
the following Traci tests in combination with a subsequent commit that
makes the pre-RA scheduler sensitive to instruction latencies:
SpaceEngineers-trace-dx11-2160p-high: 0.66% ±0.30%
MountAndBlade2-trace-dx11-1440p-veryhigh: 0.62% ±0.23%
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
We weren't handling the SHADER_OPCODE_SEND_GATHER instruction in the
instruction scheduler, and this was leading to reduced performance in
many programs, since SEND instructions have the longest latency and
tend to be among the most critical to schedule efficiently. Handle
SENDG similarly to SEND, since the timings of both instructions are
mostly bound by the shared function, which doesn't care whether the
message was sent by SEND or SENDG.
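In the scheduler this boils down to something like the following sketch
(hypothetical helper names; the real latency tables live in the
instruction scheduler):
```
/* Hypothetical helpers standing in for the scheduler's latency tables. */
static unsigned shared_function_latency_sketch(const brw_inst *inst);
static unsigned default_latency_sketch(const brw_inst *inst);

static unsigned
estimate_latency_sketch(const brw_inst *inst)
{
   switch (inst->opcode) {
   case SHADER_OPCODE_SEND:
   case SHADER_OPCODE_SEND_GATHER:
      /* Same timing model: the shared function servicing the message does
       * not care whether it came from SEND or SENDG. */
      return shared_function_latency_sketch(inst);
   default:
      return default_latency_sketch(inst);
   }
}
```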
Improves performance significantly in the following Traci traces (4
iterations, 5% significance), most of which were regressions from
SENDG being enabled:
MetroExodus-trace-dx11-2160p-ultra: 1.99% ±0.88%
HogwartsLegacy-trace-dx12-1080p-ultra: 1.33% ±0.20%
GtaV-trace-dx11-2160p-ultra: 1.12% ±0.19%
Borderlands3-trace-dx11-2160p-ultra: 1.00% ±0.58%
TerminatorResistance-trace-dx11-2160p-ultra: 0.98% ±0.27%
Control-trace-dx11-1440p-high: 0.91% ±0.36%
Naraka-trace-dx11-1440p-highest: 0.90% ±0.30%
Ghostrunner2-trace-dx11-1440p-ultra: 0.87% ±0.38%
Palworld-trace-dx11-1080p-med: 0.71% ±0.17%
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
On a fossil from the blender 4.5.0 vulkan backend, this improves compile
times in nak by about 17%. Compile time of other shaders improves by a
more modest 1.2%.
No stat changes on shader-db.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36184>
When disassembling while the BRW IR is available (which happens in the
generator), there are pointers to BRW's basic block structures that
are used to print the block numbers and predecessors/successors in the
output.
There are two challenges:
- Because DO and FLOW instructions are not real instructions, they are
not emitted in the output but would still cause the output to contain
empty blocks. Previous code accounted for DO but still had problems.
- DO blocks have special physical links that don't make sense when the
DO is not emitted at the end, but they would be shown even if that
block was omitted.
These issues can be seen here (edited to remove non-essential bits):
```
START B0 (2 cycles)
mov(8) g126<1>UD 0x3f800000UD
END B0 ->B1
START B2 <-B1 <-B4 (0 cycles)
END B2 ->B3
START B3 <-B2 (260 cycles)
LABEL1:
mov(8) g1<1>D 0D
cmp.ge.f0.0(8) null<1>D g2<0,1,0>D 10D
sync nop(1) null<0,1,0>UB
send(1) g0UD g1UD nullUD
(+f0.0) break(8) JIP: LABEL0 UIP: LABEL0
END B3 ->B1 ->B5 ->B4
START B4 <-B3 (1000 cycles)
sync nop(1) null<0,1,0>UB
mov(8) g126<1>UD g0<0,1,0>UD
LABEL0:
while(8) JIP: LABEL1
END B4 ->B2
START B5 <-B1 <-B3 (20 cycles)
```
For example:
- Block 1 is missing (a skipped DO block)
- Block 2 is empty (it was a FLOW block)
- Block 3 ends with a link to Block 1 (the special links involving DO
blocks).
A few changes were made to fix this. First, skip the DO and FLOW
blocks completely; the use_tail ensures that the instruction group is
reused to avoid empty blocks. Second, when printing the successors and
predecessors, walk through the skipped blocks. And finally, don't print
the special blocks.
With the fix, here's the output. Note the blocks retain their original
BRW IR number.
```
START B0 (2 cycles)
mov(8) g127<1>UD 0x3f800000UD
END B0 ->B3
START B3 <-B0 <-B4 (260 cycles)
LABEL1:
mov(8) g1<1>D 0D
cmp.ge.f0.0(8) null<1>D g2<0,1,0>D 10D
sync nop(1) null<0,1,0>UB
send(1) g0UD g1UD nullUD
(+f0.0) break(8) JIP: LABEL0 UIP: LABEL0
END B3 ->B5 ->B4
START B4 <-B3 (1000 cycles)
sync nop(1) null<0,1,0>UB
mov(8) g127<1>UD g0<0,1,0>UD
LABEL0:
while(8) JIP: LABEL1
END B4 ->B3
START B5 <-B3 (20 cycles)
```
Issue was spotted by Ken.
Fixes: d2c39b1779 ("intel/brw: Always have a (non-DO) block after a DO in the CFG")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36226>
Tessellation factors have to be written dynamically (based on the next
shader's primitive topology) and the builtins read using a dynamic
offset (based on the preceding shader's VUE).
Anv is updated to use this new infrastructure for dynamic
patch_control_points.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34872>
Drivers can provide the inputs required for the backend to call the
compute function.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34872>
v2: Rebase on ac2b072312 ("brw: Add more specific brw_builder
helpers"), and fix a bug that caused the new instruction to possibly be
put in the wrong place.
No shader-db changes on any Intel platform.
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 233675305 -> 233641585 (-0.01%)
Cycle count: 32593658094 -> 32591467794 (-0.01%); split: -0.01%, +0.00%
Totals from 33513 (4.25% of 789264) affected shaders:
Instrs: 5200332 -> 5166612 (-0.65%)
Cycle count: 1499831128 -> 1497640828 (-0.15%); split: -0.15%, +0.00%
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35444>
Remove the cfg variables and use the shader pointers directly. Reset
the variant pointer if a shader failed or will not be used.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33541>
brw_shader::uniforms is now derived from the nir_shader. The only
exception is compute shaders for older Gfx versions, so the adjustment
logic for that case is moved accordingly.
The benefit here is untangling the code for compilation variants,
which previously needed to keep track of the first variant that
compiled just to, in most cases, copy an integer.
And unify the initialization code for brw_shader. Avoid passing
brw_compile_params since for a single compilation we might have
multiple shaders (the case for BS stage).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33541>
The problem with the current code is that there is a disconnect between:
- the virtual register size allocated
- the dispatch size
- the size_written value
Only the last two are in sync, and this confuses the spiller, which
only looks at the destination register allocation & dispatch size to
figure out how much to spill.
The solution in this change is to make BROADCAST more like
MOV_INDIRECT, so that you can do a BROADCAST(8) that actually reads a
SIMD32 register. We put the size of the register read into src2.
Now the spiller sees correct read/write sizes just looking at the
destination register & dispatch size.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 662339a2ff ("brw/build: Use SIMD8 temporaries in emit_uniformize")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13614
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36564>
Those are for push constants; there is no point in doing that because:
- there are no HW constant offsets in push constants (payload
delivery), it's just register offset calculation
- if we have a dynamic value it's already using MOV_INDIRECT
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: e103afe7be ("brw: run the nir_opt_offsets pass and set the maximum offset size")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36958>
Instead of being encoded as a contiguous 64-bit mask of individual registers,
the robustness information is now encoded as a vector of up to 4 bytes that
represent the limits of each of the pushed UBO ranges in 16-byte units.
Some buggy Direct3D workloads are known to depend on a robustness alignment
as low as 16 bytes to work properly.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36455>