Thanks to Ken for suggesting this URB refactoring change and pointing
out that the LSC can operate on the byte offset granularity.
This should fix the geometry shader test cases where we have more than
32 vertices since previously we were failing to write the correct
control data bits because of incorrect write mask.
Shader-db results for Xe2:
total instructions in shared programs: 153475 -> 153437 (-0.02%)
instructions in affected programs: 1374 -> 1336 (-2.77%)
helped: 11
HURT: 0
helped stats (abs) min: 3 max: 5 x̄: 3.45 x̃: 3
helped stats (rel) min: 1.67% max: 4.92% x̄: 3.23% x̃: 2.70%
95% mean confidence interval for instructions value: -3.92 -2.99
95% mean confidence interval for instructions %-change: -4.10% -2.36%
Instructions are helped.
total loops in shared programs: 140 -> 140 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 16002649 -> 16002329 (<.01%)
cycles in affected programs: 9174 -> 8854 (-3.49%)
helped: 11
HURT: 0
helped stats (abs) min: 22 max: 38 x̄: 29.09 x̃: 32
helped stats (rel) min: 2.62% max: 5.54% x̄: 3.78% x̃: 3.85%
95% mean confidence interval for cycles value: -33.56 -24.62
95% mean confidence interval for cycles %-change: -4.48% -3.08%
Cycles are helped.
total spills in shared programs: 52 -> 52 (0.00%)
spills in affected programs: 0 -> 0
helped: 0
HURT: 0
total fills in shared programs: 94 -> 94 (0.00%)
fills in affected programs: 0 -> 0
helped: 0
HURT: 0
total sends in shared programs: 4240 -> 4240 (0.00%)
sends in affected programs: 0 -> 0
helped: 0
HURT: 0
LOST: 0
GAINED: 0
Rework: (Sagar)
- Adjust offset/indirect offset calculation.
- Add shader-db results
- Always calculate dword index
- Drop changes for indirect writes
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27602>
This doesn't help very much now. A later commit adds a NIR optimization
pass, tentatively called nir_opt_uniform_subgroup, that converts many
kinds of subgroup operations to things involving
bitCount(ballot(true)). This commit makes a huge difference in the
results of that later commit.
No shader-db changes on any Intel platform.
Fossil-db results:
All Intel platforms had similar results. (Ice Lake shown)
Totals:
Instrs: 165558033 -> 165557519 (-0.00%)
Cycles: 15156188362 -> 15156178922 (-0.00%); split: -0.00%, +0.00%
Totals from 299 (0.05% of 656117) affected shaders:
Instrs: 88293 -> 87779 (-0.58%)
Cycles: 3709498 -> 3700058 (-0.25%); split: -0.28%, +0.03%
v2: Rebase on splitting ELK from BRW. Remove devinfo->ver >= 8 check.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27044>
Since SIMD8 no longer exists, the SIMD modes enums have different names
and different values.
v2 (Francisco Jerez): Rebase on 07b9bfacc7 ("intel/compiler: Move
logical-send lowering to a separate file").
v3: Update brw_disasm.c with SIMD descriptions.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27305>
Note: a future commit will expand the sampler message type to the 6 bits
used on Xe2.
v2 (Francisco Jerez): Rebase on 07b9bfacc7 ("intel/compiler: Move
logical-send lowering to a separate file").
v3: Drop XE2_SAMPLER_MESSAGE_SAMPLE_BIAS_MLOD as it does not actually
exist. This resulted in some bigger changes in brw_disasm.c. Noticed
by Sagar.
v4: Now that XE2_SAMPLER_MESSAGE_SAMPLE_MLODc conflicts with
GFX7_SAMPLER_MESSAGE_SAMPLE_GATHER4_PO_C, the determination of
min_lod_is_first must include devinfo->ver or previous platforms will
break.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27305>
v2: Add brw_ir_performance.cpp and brw_fs_generator.cpp changes. Fix
overlapping register allocation (via has_source_and_destination_hazard). Fix
incorrect destination register file encoding.
v3: Prevent lower_regioning from trying to "fix" DPAS sources.
v4: Add instruction latency information for scheduling and perf
estimates.
v5: Remove all mention of DPASW. Suggested by Curro and Caio. Update
the comment in fs_inst::has_source_and_destination_hazard. Suggested
by Caio.
v6: Add some comments near the src2 calculation in
fs_inst::size_read. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25994>
Once NIR code is lowered and a few optimization passes have run, there
might be flag register interactions between instructions quite far
away from one another.
In the following case :
f0 = and r0, r1
...
fs_interpolate r2, r3
...
if f0
...
endif
If we lower fs_inteporlate while using the f0 register, we completely
garble the value meant for the if block.
To fix this, emit the predication for fs_interpolate in brw_fs_nir.cpp
when doing the NIR translation to the backend IR. This will guarantee
that the flag register interactions are visible to the optimization
passes, avoiding the problem above.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 68027bd38e ("intel/fs: implement dynamic interpolation mode for dynamic persample shaders")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9757
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26306>
This implements the extended 10-bit encoding of the software
scoreboard information used by Xe2 platforms. The new encoding is
different enough that there are few opportunities for sharing code
during translation to machine code, but the high-level tgl_swsb
representation remains roughly the same.
Among other changes the 10-bit SWSB format provides 5 bits worth of
SBID tokens (though they're only usable in large GRF mode) instead of
4 bits, the extended math pipeline is handled as an in-order (RegDist)
pipeline instead of as an out-of-order one, and the dual-argument
encodings support additional combinations of RegDist and SBID
synchronization modes. A new encoding is introduced for preventing
the accumulator hardware scoreboard from being updated, but this is
currently not needed.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25514>
This is what most logical SEND messages do when they take a variable
number of components. 'inst->mlen' is expected to be zero for logical
SEND opcodes, which are expected to behave like plain arithmetic
operations, so certain automated transformations (like SIMD lowering)
can manipulate them without opcode-specific special-casing.
Guessing the number of components from 'inst->mlen' has other
disadvantages, because it requires duplicating the logic that infers
the message payload size in every use of the instruction -- Instead we
can just do the computation once during logical send lowering. In
addition on LNL platform this causes the 'inst->mlen' field of URB
writes to have units inconsistent with every other SEND instruction,
which is likely to lead to confusion and bugs down the road.
Rework:
* Marcin: update emit_urb_indirect_vec4_write
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25195>
Purely from the backend point of view it's just an additional
parameter to sampler messages.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23882>
We can lower FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD into other more
generic sends and drop this internal opcode.
The idea behind this change is to allow bindless surfaces to be used
for UBO pulls and why it's interesting to be able to reuse
setup_surface_descriptors(). But that will come in a later change.
No shader-db changes on TGL & DG2.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20416>
These are handled identically in almost all cases. There is one place
in the legacy surface lowering that was obtaining the bitsize from the
opcode, but the LSC-based lowering uses (type_sz(inst->dst.type) * 8)
for that and works just fine. If we just do that in the legacy lowering
too, then we don't need this plethora of opcodes.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
The only reason for the separate opcode was because of the overlapping
BRW_AOP_* enums, making it impossible to tell whether a particular AOP
was the integer or float operation. Now that we use the lsc_opcode
enums, we can just have the legacy lowering inspect the opcode and
select the right descriptor. No need for a separate opcode.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
In the previous patch we adjusted the scoreboard pass to take into
consideration a new case of unordered operations for TGL. Fix the
decoding as well.
v2: use intel_device_info_is_mtl() (Curro, Jordan)
v3: the part where we export num_sources_from_inst() is now a separate patch
(Curro).
v4: Work around false positive maybe-unitialized warning since Marge
uses -Werror=maybe-uninitialized (Marge).
Reviewed-by: Francisco Jerez <currojerez@riseup.net> (v3)
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20072>
In 4ceaed7839 we made scratch surface state allocations part of the
internal heap (mapped to STATE_BASE_ADDRESS::SurfaceStateBaseAddress)
so that it doesn't uses slots in the application's expected 1M
descriptors (especially with vkd3d-proton).
But all our compiler code relies on BSS
(STATE_BASE_ADDRESS::BindlessSurfaceStateBaseAddress).
The additional issue is that there is only 26bits of surface offset
available in CS instruction (CFE_STATE, 3DSTATE_VS, etc...) for
scratch surfaces. So we need the drivers to put the scratch surfaces
in the first chunk of STATE_BASE_ADDRESS::SurfaceStateBaseAddress
(hence all the driver changes).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 4ceaed7839 ("anv: split internal surface states from descriptors")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7687
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19727>
The changes to fs_visitor::validate() helped track down a place where I
initially forgot to convert a message to the new sources layout. This
had caused a different validation failure in
dEQP-GLES31.functional.tessellation.tesscoord.triangles_equal_spacing,
but this were not detected until after SENDs were lowered.
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 19951145 -> 19951133 (<.01%)
instructions in affected programs: 2429 -> 2417 (-0.49%)
helped: 8 / HURT: 0
total cycles in shared programs: 858904152 -> 858862331 (<.01%)
cycles in affected programs: 5702652 -> 5660831 (-0.73%)
helped: 2138 / HURT: 1255
Broadwell
total cycles in shared programs: 904869459 -> 904835501 (<.01%)
cycles in affected programs: 7686744 -> 7652786 (-0.44%)
helped: 2861 / HURT: 2050
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
Instructions in all programs: 141442369 -> 141442032 (-0.0%)
Instructions helped: 337
Cycles in all programs: 9099270231 -> 9099036492 (-0.0%)
Cycles helped: 40661
Cycles hurt: 28606
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17605>
The lowering is currently fake. It just changes the opcode from the
_LOGICAL version to the non-_LOGICAL version.
v2: Remove some rebase cruft. 's/gfx8_//;s/simd8_/' in
brw_instruction_name. Both suggested by Ken.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17379>
An argument could be made that all stage-specific opcodes for vec4
stages should be prefixed with VEC4_ like the stage-agnostic opcodes.
I'll leave those additional sed jobs for another day.
egrep -lr '(VS|GS|TCS)_OPCODE_URB_WRITE' src |\
while read f; do
sed --in-place 's/\(VS\|GS\|TCS\)_OPCODE_URB_WRITE/VEC4_\1_OPCODE_URB_WRITE/g' $f
done
egrep -lr 'T.S_OPCODE[_A-Z]*URB_OFFSETS' src |\
while read f; do
sed --in-place 's/\(T.S_OPCODE[_A-Z]*URB_OFFSETS\)/VEC4_\1/g' $f
done
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17379>
We haven't exposed this intrinsic as it doesn't directly correspond to
anything in SPIR-V. However, it's used internally by some NIR passes,
namely nir_opt_uniform_atomics().
We reuse most of the infrastructure in brw_find_live_channel, but with
LZD/ADD instead of FBL. A new SHADER_OPCODE_FIND_LAST_LIVE_CHANNEL is
like SHADER_OPCODE_FIND_LIVE_CHANNEL but from the other side.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15484>
In the future we'll want to reuse this intrinsic to deal with ray
queries. Ray queries will use a different global pointer and
programmatically change the control/level arguments of the trace send
instruction.
v2: Comment on barrier after sync trace instruction (Caio)
Generalize lsc helper (Caio)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13719>
We'll want different types of IDs based on topology. Let's make this
more flexible and also move the bit shifting code a layer above where
it's easier to do bitshifting operations, especially if you need to
stash things into temporary registers.
v2: Keep previous comment.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13719>