Add optional OA performance counter collection around each execute()
call. Examples:
```
# List all profiles and counters, with descriptions.
$ executor --oa list
# Collect all counters from a profile.
$ executor --oa ComputeBasic file.lua
# Collect a subset of counters from a profile, separated by comma.
$ executor --oa ComputeBasic:GpuTime,AvgGpuCoreFrequency file.lua
# By default use ComputeBasic profile, so counter names only also work.
$ executor --oa GpuTime file.lua
```
The selected counters are printed to stdout after the script finishes,
or written to a file specified by --oa-csv FILENAME.
Assisted-by: Pi coding agent (GPT-5.5)
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41610>
It's not about the memory traffic but updating the Tmax value/distance
so that on next intersection, we would be comparing the updated Tmax
value/distance instead of original distance.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Iván Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41709>
Execution mask gets applied to last thread in the threadgroup to mask
off simd lanes, But with BTD enabled, we are seeing only last 4
components has valid stack ID's and upper 4 components of the register
are zero.
Changing execution mask somehow populates the stack IDs properly.
This is on simulator, before changing the execution mask:
00000000 00000000 00000000 00000000 000F000E 000D000C 000B000A 00090008 00000000 00000000 00000000 00000000 000F000E 000D000C 000B000A 00090008 r1
After changing execution mask:
000F000E 000D000C 000B000A 00090008 00070006 00050004 00030002 00010000 000F000E 000D000C 000B000A 00090008 00070006 00050004 00030002 00010000 r1
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41409>
When updating a register after successfully finding a pair to coalesce,
use the live range of the source register to walk only the instructions
that might use it. Depending on the shader this allows skipping a bunch
of blocks -- and also terminating early.
Below are fossil compilation times in a MTL machine compiling shaders
for a BMG GPU, the big win here was for Cyberpunk 2077.
```
// Differences at 95.0% confidence.
// Rise of the Tomb Raider (n=20)
-0.0095 +/- 0.00706877
-1.90572% +/- 1.40609%
// Alan Wake (n=20)
-0.031 +/- 0.0172806
-0.93599% +/- 0.51952%
// Borderlands 3 (n=15)
-0.353333 +/- 0.118679
-2.44307% +/- 0.80787%
// Oblivion Remastered (n=15)
-0.134 +/- 0.026008
-2.76898% +/- 0.531637%
// Baldur's Gate 3 (n=15)
-0.954286 +/- 0.163625
-2.21713% +/- 0.377562%
// Cyberpunk 2077 (n=20)
-2.8665 +/- 0.228489
-8.08661% +/- 0.621779%
```
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41495>
Instead save to a local variable and use that. In various cases the
compiler is not able to pull it out of the loop, since there are other
not inlined function calls as part of the loop's body, resulting in
repeated unnecessary calls to either size_read() or its pieces that
get inlined.
Below are fossil compilation times in a MTL machine compiling shaders
for a BMG GPU:
```
// Differences at 95.0% confidence.
// Rise of the Tomb Raider (n=20)
-0.017 +/- 0.00724575
-3.45177665% +/- 1.45084%
// Alan Wake (n=20)
-0.153 +/- 0.00960067
-4.99265786% +/- 0.303695%
// Borderlands 3 (n=14)
-0.486428571 +/- 0.15354
-3.51248195% +/- 1.0835%
// Oblivion Remastered (n=14)
-0.143571429 +/- 0.0357991
-3.05749924% +/- 0.747872%
// Baldur's Gate 3 (n=14)
-1.68928571 +/- 0.151598
-4.12128605% +/- 0.364259%
```
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41496>
Compute var_from_reg() once in setup_def_use() and pass the variable
number to setup_one_read() and setup_one_write(). This lets the loops walk
consecutive variable numbers directly instead of mutating a brw_reg offset.
Also: setup_one_write() is only called for VGRFs, so remove the check
for VGRF there.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41496>
Only ARF sources are relevant in this case, so check the file
before calling size_read().
Below are fossil compilation times in a MTL machine compiling shaders
for a BMG GPU:
```
// Differences at 95.0% confidence.
// Rise of the Tomb Raider (n=20)
No difference proven
// Alan Wake (n=20)
-0.0725 +/- 0.0139437
-2.30965276% +/- 0.438787%
// Borderlands 3 (n=14)
-0.248571429 +/- 0.135107
-1.76946153% +/- 0.954171%
// Oblivion Remastered (n=14)
-0.0735714286 +/- 0.0235712
-1.54770849% +/- 0.492117%
// Baldur's Gate 3 (n=14)
-0.832142857 +/- 0.23095
-1.98028217% +/- 0.545648%
```
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41496>
regs_read() itself gets inlined, but size_read() does not. In GCC
release builds this results in three calls to size_read() at each site,
one of them due to how MIN2 is expanded. Use a local variable to store
the result.
Below are fossil compilation times in a MTL machine compiling shaders
for a BMG GPU:
```
// Differences at 95.0% confidence.
// Rise of the Tomb Raider (n=20)
-0.013 +/- 0.00596452
-2.56410256% +/- 1.15623%
// Alan Wake (n=20)
-0.1755 +/- 0.0144896
-5.29491628% +/- 0.425556%
// Borderlands 3 (n=14)
-0.562142857 +/- 0.129678
-3.84765816% +/- 0.870239%
// Oblivion Remastered (n=14)
-0.0821428571 +/- 0.0262485
-1.69867061% +/- 0.537247%
// Baldur's Gate 3 (n=14)
-1.61357143 +/- 0.21693
-3.69788342% +/- 0.486462%
```
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41496>
The instruction may get transformed, modifying the destination before
the loop index gets incremented. So save the original regs_written
value to be used in the loop increment.
While we are here, assert that all the slots in mov[] are filled
at this point in the code.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41496>
On Xe2+ the Wa_1407528679 NoMask workaround is disabled, so
baked_ordered_dependency_mode() should treat all instructions as
exec_all, matching the logic in gather_inst_dependencies() and
emit_inst_dependencies().
Without this, ordered RegDist dependencies from uniform/WE_all
producers (e.g. 'mov s0, imm') are not found during baking and
fall through as separate WE_all SYNC NOPs. Real shaders pile up
dozens of these in front of masked sends.
v2(Caio): Fix existing scalar_register test expectations
Signed-off-by: Michael Cheng <michael.cheng@intel.com>
Fixes: 47a6ef3fef ("brw/scoreboard: Use a predicate helper for the nomask workaround")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41713>
Add two tests verifying that ordered RegDist dependencies from
uniform/WE_all producers are baked into the consumer's SWSB on Xe2+.
Disabled for now since they fail on current main.
Reviewed-by: Michael Cheng <michael.cheng@intel.com>
Assisted-by: Pi coding agent (Opus-4.7)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41713>
Fixes "multiple stores to the same location" assertions in tests like
dEQP-VK.pipeline.monolithic.color_write_enable_maxa.cwe_after_bind.attachments3_more0
In that case, the stores were actually to different locations, but some
constant additions hadn't been folded into the location field yet.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
jay is trying to use the fragment shader dispatch mask for helper
invocation lowering, but it was using load_sample_mask_in for that
(now load_coverage_mask_intel). But this isn't the MSAA coverage
mask, the two are different payload fields.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
We want it to be set to wherever the push constants ended up.
Setting it close to the setup_payload_push() call makes this easier.
We'll also be adding some extra UGPRs for the fragment shader payload
soon, and the partitioning code will just have one big UGPR partition
for payload fields, push constants, and general purpose UGPRs, so it
really won't know how to do this very well without duplicating a bunch
of information.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
The documentation is large and hard to follow due to all the optional
fields and the SIMD16 vs. SIMD32 split for barycentrics. This quick
summary helps clarify what fields exist, which are split for SIMD32
or kept together, and which pairs of registers are involved for splits.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
Constructing the render target store payload is more complex than we can
reasonably handle at the NIR level. The main reason is that samplemask
and stencil are packed 16-bit and 8-bit parameters, respectively, which
are intermixed with other values that are 32-bit. In SIMD32 mode, the
packed sub-32-bit values take up fewer registers than normal values.
Currently we also don't specialize the NIR for each FS dispatch width,
and we can't construct the message descriptor without knowing it.
So, we alter nir_intrinsic_store_render_target_intel to take each of
the expected parameters - colour, depth, stencil, samplemask,
src0_alpha, and discard predicate. We construct the payloads and
descriptors in the backend.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
Implement a simple pre-RA bottom-up list scheduler with the goal of decreasing
register pressure. On Xe2, this significantly reduces spilling.
SSA form allows us to estimate register demand cheaply and accurately, which
theoretically [1] gives this algorithm the two Hippocratic properties:
1. Shaders with low register pressure are unaffected.
2. Register pressure can only be decreased, never increased.
In other words: first, do no harm.
The heuristic itself is very simple: greedily choose instructions that decrease
liveness using a backwards list scheduler. This is far from optimal! But thanks
to the above properties, even a heuristic that picked random instructions would
be a win overall - by construction, we can only ever win.
In other words: this scheduler is your older brother powering off the game
console any time he's about to lose a game, maintaining a 100% win rate.
[1] In reality, neither property is strictly satisfied due to the messy details
of mapping our clean logical model onto Intel's many weird physical register
files. Nevertheless, the algorithm is well-motivated and the empirical results
on Xe2 are excellent.
SIMD16:
Totals:
Instrs: 2754194 -> 2753957 (-0.01%); split: -0.23%, +0.22%
CodeSize: 41094768 -> 41092768 (-0.00%); split: -0.23%, +0.23%
Number of spill instructions: 1724 -> 1129 (-34.51%)
Number of fill instructions: 1912 -> 1119 (-41.47%)
Totals from 168 (6.35% of 2647) affected shaders:
Instrs: 850994 -> 850757 (-0.03%); split: -0.75%, +0.73%
CodeSize: 12825680 -> 12823680 (-0.02%); split: -0.74%, +0.73%
Number of spill instructions: 1724 -> 1129 (-34.51%)
Number of fill instructions: 1912 -> 1119 (-41.47%)
SIMD32:
Totals:
Instrs: 4688858 -> 4557800 (-2.80%); split: -3.53%, +0.74%
CodeSize: 70177200 -> 68214816 (-2.80%); split: -3.53%, +0.74%
Number of spill instructions: 50316 -> 45795 (-8.99%); split: -9.56%, +0.57%
Number of fill instructions: 51526 -> 45075 (-12.52%); split: -13.23%, +0.71%
Totals from 819 (30.94% of 2647) affected shaders:
Instrs: 3810182 -> 3679124 (-3.44%); split: -4.35%, +0.91%
CodeSize: 57044000 -> 55081616 (-3.44%); split: -4.35%, +0.91%
Number of spill instructions: 49264 -> 44743 (-9.18%); split: -9.76%, +0.58%
Number of fill instructions: 50182 -> 43731 (-12.86%); split: -13.58%, +0.73%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
logically it doesn't matter because we'll bail on a later check, but this is
still UB and therefore releases nasal demons.
i am jealous of Faith's Rust compilers. there, I said it.
==107281== Conditional jump or move depends on uninitialised value(s)
==107281== at 0x7069768: propagate_backwards (jay_opt_propagate.c:327)
==107281== by 0x7069768: jay_opt_propagate_backwards (jay_opt_propagate.c:367)
==107281== by 0x7058960: jay_compile (jay_from_nir.c:2677)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
Challenging to hit but fixes
dEQP-GLES3.functional.shaders.swizzle_math_operations.vector_multiply.mediump_ivec4_wzyx_zyxw_fragment
with scheduling changes.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
on top of scheduler changes, compile-time of shaders/blender/1017.shader_test:
Difference at 95.0% confidence
-0.00173202 +/- 0.00116931
-0.791537% +/- 0.532384%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
We generalize the sample_mask_in lowering to handle this too.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41688>
Blender uses atomic operations as part of its virtual shadow mapping
implementation. Virtual shadow mapping page tagging in compute shaders
benefits from divergent atomics fusion, while fragment shaders doing the
atomic raster step in general have worse performance with this
optimization turned on.
Thus, an option is added to only apply divergent atomics fusion to compute
shaders in ANV, and this option is enabled for Blender.
Initial support for divergent atomics fusion optimization in ANV was added
in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40631.
Signed-off-by: Christoph Neuhauser <christoph.neuhauser@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41706>
In shader-db, with `-p skl`, shaders/0ad/12.shader_test does not
compact an instruction because precompact overwrites portions of the
instruction. (Treating the three source instruction as a two source
when accessing instruction fields.)
This instruction could be compacted:
mad(8) g65<1>F g61<4,4,1>F g64<4,4,1>F -g17<4,4,1>F { align16 1Q };
But, since precompact erroneously sets bits, the instruction isn't
compacted.
Fossil testing:
* Tested with 0a3f3fd193 ("brw: drop unused color_outputs_valid
key") reverted, as fossils are currently producing inconsitent
results otherwise.
* Tested skl, icl, dg2, mtl, lnl, bmg and ptl. Only skl had a change.
SKL:
Totals:
CodeSize: 8335219296 -> 8320248992 (-0.18%)
Totals from 359508 (14.42% of 2492689) affected shaders:
CodeSize: 2838254352 -> 2823284048 (-0.53%)
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41588>
Unlike most other things where the MOCS setting combines the MOCS Index
and the protected memory bit, the EXECUTE_INDIRECT_DRAW/DISPATCH
commands take only the MOCS Index, and it's limited to only 4 bits.
Enabling the feature on ARL-H caused some tests to hit an assert when
the MOCS selected ended up out of range.
Rename the field to avoid confusion (and match documentation) and set it
through a helper function that calls the same old function and shifts it
down to fit.
Fixes: d1109f67bb ("iris: Emit EXECUTE_INDIRECT_DRAW when available")
Fixes: d161e3c2e2 ("iris: Emit a EXECUTE_INDIRECT_DISPATCH when available")
Fixes: 580728564e ("anv: Emit a EXECUTE_INDIRECT_DISPATCH when available")
Fixes: 6d4f43f0d6 ("anv: Emit EXECUTE_INDIRECT_DRAW when available")
Fixes: 7a9e82e82f ("genxml/12.5: Add the EXECUTE_INDIRECT_DISPATCH instruction")
Fixes: 4229757309 ("genxml/12.5: Add the EXECUTE_INDIRECT_DRAW instruction")
Signed-off-by: Iván Briano <ivan.briano@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41372>
Unless the tristate is unset, which is not, it will be true when casted
to bool, as the return of this function expects.
Fixes: 2741ddd75a ("anv: fix issues found with indirect data stride")
Signed-off-by: Iván Briano <ivan.briano@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41372>
Stops allocating events in chunks. u_trace_event is allocated using a
linear allocator which has minimal overhead. Buffers for timestamps are
allocated using a custom allocator.
As a sideeffect, it is possible to deduplicate consecutive tracepoints.
Reviewed-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41271>
In case of nir_intrinsic_load_inline_data_intel it was not using base_offset to
create the uniform, instead it was using only the special BRW_INLINE_PARAM_REG
value that later will be replaced by the inline_data fixed register.
So here using base_offset for both intrinsics, adding BRW_INLINE_PARAM_REG if
nir_intrinsic_load_inline_data_intel and then in brw_shader::assign_curb_setup
checking for inst->src[i].nr >= BRW_INLINE_PARAM_REG and adjusting brw_reg by
the remaining of the subtraction with BRW_INLINE_PARAM_REG.
Fixes: 7f19814414 ("brw/nir: handle inline_data_intel more like push_data_intel")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41607>