v2: Require that the constant value be representable as either uint16_t
or int16_t. Suggested by Matt.
v3: Remove redundant patterns. Noticed by Matt.
shader-db:
DG2
total instructions in shared programs: 23103767 -> 23103577 (<.01%)
instructions in affected programs: 51822 -> 51632 (-0.37%)
helped: 98 / HURT: 15
total cycles in shared programs: 842347714 -> 842380017 (<.01%)
cycles in affected programs: 1942595 -> 1974898 (1.66%)
helped: 97 / HURT: 32
Nearly all of the affected shaders (around 9,900) are shaders in
Cyberpunk 2077. It's about an even split between vertex and fragment
shaders. The majority of the remaining affected shaders (3,600) are
from Strange Brigade. This was also a nearly even split between
fragment and vertex.
All but two of the lost shaders are SIMD32 fragment shaders in
Cyberpunk 2077. The other two are SIMD32 fragment shaders in Dota2.
fossil-db:
DG2
Instructions in all programs: 196379107 -> 196248608 (-0.1%)
helped: 13467 / HURT: 1210
Cycles in all programs: 13931355281 -> 13929955971 (-0.0%)
helped: 11801 / HURT: 2922
Lost: 90
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23262>
There should not be any isub at this point due to lowerings that happened
ages before getting to late algebraic.
shader-db:
DG2
total instructions in shared programs: 23103769 -> 23103767 (<.01%)
instructions in affected programs: 65 -> 63 (-3.08%)
helped: 1 / HURT: 0
total cycles in shared programs: 842348074 -> 842347714 (<.01%)
cycles in affected programs: 28572 -> 28212 (-1.26%)
helped: 3 / HURT: 0
One compute shader in Assassin's Creed Odyssey was affected.
fossil-db:
DG2
Instructions in all programs: 196400668 -> 196400676 (+0.0%)
helped: 8 / HURT: 5
Cycles in all programs: 13931740724 -> 13931758003 (+0.0%)
helped: 8 / HURT: 7
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23262>
The DXIL backend would like to distinguish between casts to 16-bit
that must cast, vs those that may. If a shader only ever produces
16-bit types from mediump casts and ALU ops on those values, then
the resulting shader can be annotated with DXIL's min-precision
qualifier, basically telling the driver to use 16-bit precision if
it's faster for them. If it uses concrete 16-bit casts, or loads/
stores to externally-visible memory, then it must use the "native"
16-bit flag, which is not supported on all hardware.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23344>
Otherwise this can cause optimizations to fight resulting in infinite
optimization loops with opt_algebraic, constant_folding, and copy_prop.
Fixes: 368be872 ("nir/algebraic: shrink 64-bit bitwise operations with 0/-1 constant half")
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23192>
To be helpful, the thing inside the fsat has to be used with and without
the fsat. Otherwise it just moves a saturate destination modifier
around. To not be harmful, the fsat has to only be used by the bcsel.
All Broadwell and newer Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 20174475 -> 20174449 (<.01%)
instructions in affected programs: 3913 -> 3887 (-0.66%)
helped: 13 / HURT: 0
total cycles in shared programs: 866844832 -> 866844719 (<.01%)
cycles in affected programs: 46037 -> 45924 (-0.25%)
helped: 10 / HURT: 1
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 161491468 -> 161491372 (-0.0%)
helped: 31 / HURT: 8
Cycles in all programs: 10933090736 -> 10933024716 (-0.0%)
helped: 32 / HURT: 18
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22169>
As noted on [1], lowering patterns of the form
floatS -> floatB -> floatS ==> floatS
cannot require precision since this may cause flush denorming.
[1] 3f779013 ("nir: Add an algebraic optimization for float->double->float")
Fixes: b86305bb ("nir/algebraic: collapse conversion opcodes (many patterns)")
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Signed-off-by: Isabella Basso <isabellabdoamaral@usp.br>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20965>
Conversions on smaller bit sizes should also be collapsed when composed.
This also adds more patterns on the
intS -> intB -> floatB ==> intS -> floatB
lowering so as to deal with any int size C > B instead of a fixed intB.
Closes: #7776
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Signed-off-by: Isabella Basso <isabellabdoamaral@usp.br>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20965>
Some patterns were outside the list of optimizations.
Fixes: b86305bb ("nir/algebraic: collapse conversion opcodes (many patterns)")
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Signed-off-by: Isabella Basso <isabellabdoamaral@usp.br>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20965>
On Intel platforms, the uclz lowering if ufind_msb is either one
instruction better (Gfx7 and newer) or two instructions better (all
older platforms) than the ifind_msb implementations.
On platforms that use lower_find_msb_to_reverse, there should be no
difference.
All Haswell and newer Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 19938662 -> 19938634 (<.01%)
instructions in affected programs: 850 -> 822 (-3.29%)
helped: 2 / HURT: 0
total cycles in shared programs: 858467067 -> 858465538 (<.01%)
cycles in affected programs: 10080 -> 8551 (-15.17%)
helped: 2 / HURT: 0
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19042>
Fossil-db results:
All Intel platforms had similar results. (Ice Lake shown)
Cycles in all programs: 9098346105 -> 9098333765 (-0.0%)
Cycles helped: 6
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19042>
The 31-ufind_msb_rev(x) lowering only produces the correct result for
32-bit sources. ufind_msb_rev can also have 64-bit sources, and most
platforms are expected to lower this to 32-bit instructions with extra
logic operations.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19042>
Unlike ufind_msb, ifind_msb is only defined in NIR for 32-bit values, so
no @32 annotation is required.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19042>
We had all these nice fdot opts to drop individual channels that were 0,
but nothing handling it being entirely 0! Avoids r300g regression when
dropping them from GLSL.
Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21475>
Builds on the work of !15121. This gets to delete even more code
because many drivers shared a lot of code for i2b and f2b.
No shader-db or fossil-db changes on any Intel platform.
v2: Rebase on 1a35acd8d9.
v3: Update a comment in nir_opcodes_c.py. Suggested by Konstantin.
v4: Another rebase. Remove f2b stuff from Midgard.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20509>
VKD3D-Proton DXBC f32 to f16 conversion implements a float conversion using PackHalf2x16.
Because the spec does not specify a rounding mode, it emits a sequence to ensure
D3D-like behaviour for infinity.
When we know the current backend has pack_half_2x16_rtz_split,
we can eliminate the extra sequence.
Fossil DB stats on GFX11:
Totals from 835 (0.62% of 134913) affected shaders:
VGPRs: 49368 -> 49224 (-0.29%)
CodeSize: 5341956 -> 5124564 (-4.07%)
Instrs: 1024062 -> 987041 (-3.62%)
Latency: 6530956 -> 6465120 (-1.01%); split: -1.01%, +0.00%
InvThroughput: 908189 -> 870253 (-4.18%)
VClause: 18704 -> 18702 (-0.01%); split: -0.02%, +0.01%
SClause: 33406 -> 33284 (-0.37%); split: -0.38%, +0.01%
Copies: 67440 -> 65992 (-2.15%); split: -2.15%, +0.00%
Branches: 18498 -> 18465 (-0.18%)
PreSGPRs: 38409 -> 38331 (-0.20%)
PreVGPRs: 44089 -> 43834 (-0.58%)
Note, some fossils are from before this pattern was added to VKD3D-Proton,
so the above may not reflect real-world impact.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15838>
Constant folding always uses RTNE for pack_half_2x16_split, but some
backends implement it with RTZ.
Lowering to RTZ when available ensures that the behaviour will be
consistent between constant folding and the backend.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15838>
As we've done for the AMD one, to prevent any codegen regression from switching
the Midgard lowering.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Italo Nicola <italonicola@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19350>
There are a lot of optimizations in opt_algebraic that match ('ine', a,
0), but there are almost none that match i2b. Instead of adding a huge
pile of additional patterns (including variations that include both ine
and i2b), always lower i2b to a != 0.
At this point in the series, it should be impossible for anything to
generate i2b, so there /should not/ be any changes.
The failing test on d3d12 is a pre-existing bug that is triggered by
this change. I talked to Jesse about it, and, after some analysis, he
suggested just adding it to the list of known failures.
v2: Don't rematerialize i2b instructions in dxil_nir_lower_x2b.
v3: Don't rematerialize i2b instructions in zink_nir_algebraic.py.
v4: Fix zink-on-TGL CI failures by calling nir_opt_algebraic after
nir_lower_doubles makes progress. The latter can generate b2i
instructions, but nir_lower_int64 can't handle them (anymore).
v5: Add back most of the hunk at line 2125 of nir_opt_algebraic.py. I
had accidentally removed the f2b(bf2(x)) optimization.
v6: Just eliminate the i2b instruction.
v7: Remove missed i2b32 in midgard_compile.c. Remove (now unused)
emit_alu_i2orf2_b1 function from sfn_instr_alu.cpp. Previously this
function was still used. 🤷
No shader-db changes on any Intel platform.
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 141165875 -> 141165873 (-0.0%)
Instructions helped: 2
Cycles in all programs: 9098956382 -> 9098956350 (-0.0%)
Cycles helped: 2
The two Vulkan shaders are helped because of the "new" (('b2i32',
('ine', ('ubfe', a, b, 1), 0)), ('ubfe', a, b, 1)) algebraic pattern.
Acked-by: Jesse Natalie <jenatali@microsoft.com> [earlier version]
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev> [earlier version]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
v2: Remove the ineg from the b2i in the ior pattern. Suggested by
Jason.
All Ivy Bridge and newer Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 19914441 -> 19914369 (<.01%)
instructions in affected programs: 63507 -> 63435 (-0.11%)
helped: 24 / HURT: 0
total cycles in shared programs: 853869766 -> 853851470 (<.01%)
cycles in affected programs: 10551542 -> 10533246 (-0.17%)
helped: 24 / HURT: 0
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 141163061 -> 141092683 (-0.0%)
Instructions helped: 14103
Instructions hurt: 55
Cycles in all programs: 9132376195 -> 9133183045 (+0.0%)
Cycles helped: 13775
Cycles hurt: 380
Spills in all programs: 18286 -> 18284 (-0.0%)
Spills helped: 1
Fills in all programs: 30647 -> 30643 (-0.0%)
Fills helped: 1
Gained: 133
Lost: 130
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
This helps because it enables cmod propagation to do more.
The removed patterns involving b2i will be handled by other existing
patterns after the unary operations are removed.
All Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 19914458 -> 19914441 (<.01%)
instructions in affected programs: 5456 -> 5439 (-0.31%)
helped: 17 / HURT: 0
total cycles in shared programs: 855302118 -> 853869766 (-0.17%)
cycles in affected programs: 327354347 -> 325921995 (-0.44%)
helped: 291 / HURT: 81
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 141205979 -> 141205961 (-0.0%)
Instructions helped: 4
Instructions hurt: 3
SENDs in all programs: 7466919 -> 7466913 (-0.0%)
SENDs helped: 1
Cycles in all programs: 9133387327 -> 9133384475 (-0.0%)
Cycles helped: 3
Cycles hurt: 12
In the shader that was helped for sends, it appears that a NIR pass that
moves code out of loops was able to move 3 send operations outside a
loop after this change. I did not investigate further.
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
This prevents ~400 shader-db regresssions and a handful of fossil-db
regressions after i2b is always lowered.
All Ivy Bridge and newer Intel platforms had similar results. (Ice Lake shown)
total cycles in shared programs: 855301494 -> 855302118 (<.01%)
cycles in affected programs: 52787 -> 53411 (1.18%)
helped: 4 / HURT: 5
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 141206055 -> 141205979 (-0.0%)
Instructions helped: 14
Cycles in all programs: 9133376616 -> 9133387327 (+0.0%)
Cycles helped: 13
Cycles hurt: 3
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
No shader-db changes on any Intel platform.
All of the helped shaders were presumably regressed by 4676b3d3dd (nir:
Use nir_test_mask instead of i2b(iand)).
v2: Add some comments explaining why specific replacements are used. In
the umin pattern, only markup the first usage of 'b' in the source
pattern.
Tiger Lake, Ice Lake, and Skylake had similar results. (Ice Lake shown)
Instructions in all programs: 141384970 -> 141200966 (-0.1%)
Instructions helped: 45842
Cycles in all programs: 9133648977 -> 9133282672 (-0.0%)
Cycles helped: 26812
Cycles hurt: 6025
Gained: 23
Lost: 135
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
A loop below already adds all the permutations... including the 1-bit
version that isn't included in this group.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
The exact same pattern appears later (around line 1323).
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
Seems the intention was to check that both flags were not enabled
instead we were checking that the floor flag was both set and not
set so the result would always be false.
Fixes: 3749a6ecd2 ("nir: honor lower_double options for ffloor and ffract")
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19642>
ftrunc@64 also might need lowering on fp64 only, especially now
that it might be introduced by nir_lower_int64.
Fixes: 29da985682
nir/lower_int64: Enable lowering of 64-bit float to 64-bit integer conversions.
Signed-off-by: Gert Wollny <gert.wollny@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19657>
If there is a single use of fmul, and that single use is fadd, it makes
sense to fuse ffma, as we already do. However, if there are multiple
uses, fusing may impede code gen. Consider the source fragment:
a = fmul(x, y)
b = fadd(a, z)
c = fmin(a, t)
d = fmax(b, c)
The fmul has two uses. The current ffma fusing is greedy and will
produce the following "optimized" code.
a = fmul(x, y)
b = ffma(x, y, z)
c = fmin(a, t)
d = fmax(b, c)
Actually, this code is worse! Instead of 1 fmul + 1 fadd, we now have 1
fmul + 1 ffma. In effect, two multiplies (and a fused add) instead of
one multiply and an add. Depending on the ISA, that could impede
scheduling or increase code size. It can also increase register
pressure, extending the live range.
It's tempting to gate on is_used_once, but that would hurt in cases
where we really do fuse everything, e.g.:
a = fmul(x, y)
b = fadd(a, z)
c = fadd(a, t)
For ISAs that fuse ffma, we expect that 2 ffma is faster than 1 fmul + 2
fadd. So what we really want is to fuse ffma iff the fmul will get
deleted. That occurs iff all uses of the fmul are fadd and will
themselves get fused to ffma, leaving fmul to get dead code eliminated.
That's easy to implement with a new NIR search helper, checking that all
uses are fadd.
shader-db results on Mali-G57 [open shader-db + subset of closed]:
total instructions in shared programs: 179491 -> 178991 (-0.28%)
instructions in affected programs: 36862 -> 36362 (-1.36%)
helped: 190
HURT: 27
total cycles in shared programs: 10573.20 -> 10571.75 (-0.01%)
cycles in affected programs: 72.02 -> 70.56 (-2.02%)
helped: 28
HURT: 1
total fma in shared programs: 1590.47 -> 1582.61 (-0.49%)
fma in affected programs: 319.95 -> 312.09 (-2.46%)
helped: 194
HURT: 1
total cvt in shared programs: 812.98 -> 813.03 (<.01%)
cvt in affected programs: 118.53 -> 118.58 (0.04%)
helped: 65
HURT: 81
total quadwords in shared programs: 98968 -> 98840 (-0.13%)
quadwords in affected programs: 2960 -> 2832 (-4.32%)
helped: 20
HURT: 4
total threads in shared programs: 4693 -> 4697 (0.09%)
threads in affected programs: 4 -> 8 (100.00%)
helped: 4
HURT: 0
v2: Update trace checksums for virgl due to numerical differences.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18814>