This catches a number of bugs in the current NIR algebraic optimizations
or opcodes implementations (as fixed in this series, or documented in the
XFAIL tests), and should prevent many future bugs from landing.
This required bumping the test timeout, because s390x is very slow to
emulate in CI.
Closes: #3338
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39076>
This way as a pattern author/editor you can immediately see whether it's
getting test coverage and if there are known issues with the pattern.
This will also give us clear outcomes from testing as we fix failing
patterns.
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39076>
This should be exact, even for all special values:
fsqrt(NaN) -> NaN
fsqrt(-0.0) -> 0.0
fsqrt(-Inf) -> NaN
fsqrt(negative finite) -> NaN
So all of these get saturated to +0.0
All numbers >= 1.0 will have a square root >= 1.0,
which will be saturate to 1.0
Moving the fsat guarantees that it can use an output modifier
for hardware that has those, and shouldn't harm other hardware either.
Foz-DB Navi21:
Totals from 255 (0.31% of 82151) affected shaders:
Instrs: 664906 -> 664194 (-0.11%)
CodeSize: 3623500 -> 3619188 (-0.12%)
Latency: 11336397 -> 11335688 (-0.01%); split: -0.01%, +0.00%
InvThroughput: 2716430 -> 2715726 (-0.03%); split: -0.03%, +0.00%
VALU: 442603 -> 441891 (-0.16%)
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39202>
Without this, nir_algebraic.py was treating "f2i{int_sz}_sat" as the
literal opcode name when it should have been "f2i8_sat" or similar.
Fixes: c49d6e0480 ("nir/algebraic: Elide range clamping of f2u sources")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39031>
The issue is that the current narrowing patterns are not working in a lot
of cases, for example
(('fdot3', ('vec3', a, 0.0, 0.0), b), ('fmul', a, b)),
is missing patterns like this:
32x3 %1 = load_const (0x3f800000, 0x00000000, 0x00000000) = (1.000000, 0.000000, 0.000000)
32x4 %7 = vec4 %6, %2 (0x0), %2 (0x0), %2 (0x0)
32 %19 = fdot3 %1 (1.000000, 0.000000, 0.000000), %7.xyz
or after some later transforms:
32x2 %0 = load_const (0x3f800000, 0x00000000) = (1.000000, 0.000000)
32x2 %6 = vec2 %5, %1 (0x0)
32 %18 = fdot3 %0 (1.000000, 0.000000).xyy, %6.xyy
This patch is heavily based on old branch from Ian Romanick from 2019.
r300 RV530 shader-db:
total instructions in shared programs: 128900 -> 128882 (-0.01%)
instructions in affected programs: 621 -> 603 (-2.90%)
helped: 10
HURT: 1
total cycles in shared programs: 191837 -> 191828 (<.01%)
cycles in affected programs: 799 -> 790 (-1.13%)
helped: 7
HURT: 1
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39068>
In !35844, there was some discussion about allowing 64-bit bcsel that
would be lowered in the driver. One challenge there would be if a 64-bit
bcsel was transformed into integer min or max by an algebraic
optimization. I believe these were the only algebraic patterns that
could create new integer min or max that would not be immediately
constant folded.
There were no shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38033>
df1876f615 ("nir: Mark negative re-distribution on fadd as imprecise")
fixed the fadd case by marking it as imprecise. This commit fixes the
ffma case for the same reason.
However, "imprecise" isn't necessary and nowadays we have "nsz" which is
more accurate here. Use that for both fadd and ffma.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Fixes: 62795475e8 ("nir/algebraic: Distribute source modifiers into instructions")
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37930>
This is more in line with similar opcodes like umul_32x16.
Also change its const expr: the masking based on bit size was
unnecessary as it is only defined for 32 bits. Use simple casts instead.
Signed-off-by: Job Noorman <jnoorman@igalia.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37863>
Intel platforms will soon implement both bfi and bitfield_select. bfi is
more efficient for bitfield_insert.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37186>
The eliminated SENDs are from a single app that has a bunch of
fragment shaders with a sequence like:
con 32 %495 = fmul! %203.i, %1 (0.000000)
con 32 %496 = ffma! %203.j, %1 (0.000000), %495
con 32 %497 = ffma! %203.k, %1 (0.000000), %496
con 32 %498 = ffma! %203.l, %1 (0.000000), %497
con 32 %499 = @load_reloc_const_intel (param_idx=1, base=0)
con 32 %500 = @load_reloc_const_intel (param_idx=0, base=0)
con 32 %501 = f2u32 %498
con 32 %502 = umin %501, %172 (0x4)
con 32 %503 = ishl %502, %172 (0x4)
con 32 %504 = load_const (0x00000040 = 64)
con 32 %505 = umin %503, %504 (0x40)
con 32 %506 = iadd %500, %505
The `f2u` is replaced with 0, and that makes the `ffma` dot-product
sequence be unused. Since it is unused, most of the preceeding block
gets eliminated. A lot of instructions after the `f2u` are also
eliminated by other algebraic optimizations. Most importantly, %203 is
the result of a `load_ubo_uniform_block_intel` that is eliminated.
No shader-db changes on any Intel platform.
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 919895603 -> 919804051 (-0.01%); split: -0.01%, +0.00%
Send messages: 40892036 -> 40887569 (-0.01%)
Cycle count: 99176770712 -> 99174971806 (-0.00%); split: -0.00%, +0.00%
Max live registers: 190030365 -> 190030367 (+0.00%)
Max dispatch width: 47415040 -> 47415024 (-0.00%)
Non SSA regs after NIR: 228872538 -> 228863608 (-0.00%); split: -0.00%, +0.00%
Totals from 2234 (0.11% of 1955134) affected shaders:
Instrs: 1989743 -> 1898191 (-4.60%); split: -4.60%, +0.00%
Send messages: 44179 -> 39712 (-10.11%)
Cycle count: 25416114 -> 23617208 (-7.08%); split: -7.08%, +0.00%
Max live registers: 367357 -> 367359 (+0.00%)
Max dispatch width: 39184 -> 39168 (-0.04%)
Non SSA regs after NIR: 471173 -> 462243 (-1.90%); split: -1.90%, +0.00%
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37186>
If the source -1.0 < x < 0.0, simply removing the ftrun will introduce
undefined behavior. By chance of how at least Intel and NVIDIA GPUs
implement f2u, this has Just Worked.
No shader-db changes on any Intel platform.
fossil-db:
Lunar Lake
Totals:
Instrs: 913264354 -> 913264366 (+0.00%)
Cycle count: 104953995530 -> 104953996854 (+0.00%)
Max live registers: 189266026 -> 189266058 (+0.00%)
Non SSA regs after NIR: 227779417 -> 227779369 (-0.00%)
Totals from 24 (0.00% of 1984794) affected shaders:
Instrs: 4669 -> 4681 (+0.26%)
Cycle count: 50610 -> 51934 (+2.62%)
Max live registers: 1222 -> 1254 (+2.62%)
Non SSA regs after NIR: 1174 -> 1126 (-4.09%)
Meteor Lake, DG2, Tiger Lake, and Ice Lake had similar results. (Meteor Lake shown)
Totals:
Instrs: 1001288026 -> 1001288038 (+0.00%)
Cycle count: 92813392671 -> 92813392791 (+0.00%)
Max live registers: 121935383 -> 121935399 (+0.00%)
Max dispatch width: 19949928 -> 19949912 (-0.00%)
Totals from 2 (0.00% of 2284670) affected shaders:
Instrs: 1380 -> 1392 (+0.87%)
Cycle count: 18940 -> 19060 (+0.63%)
Max live registers: 136 -> 152 (+11.76%)
Max dispatch width: 32 -> 16 (-50.00%)
No fossil-db changes on Skylake.
Suggested-by: Georg Lehmann
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37186>
There are no shader-db changes on ELK platforms because those platforms
don't support 8- or 16-bit integer types.
v2: Restrict patterns generated such that the integer limits are exactly
representable in the specified floating point format. With the exception
of the value 0, this requires that float_sz > int_sz. This had no impact
on shader-db or fossil-db on any Intel platform. Noticed by Georg.
v3: Add a missing is_a_number.
shader-db:
All Intel platforms had similar results. (Lunar Lake shown)
total cycles in shared programs: 889936056 -> 889934082 (<.01%)
cycles in affected programs: 65806 -> 63832 (-3.00%)
helped: 2 / HURT: 0
fossil-db:
Lunar Lake
Totals:
Instrs: 233284796 -> 233282917 (-0.00%); split: -0.00%, +0.00%
Cycle count: 32756399804 -> 32754972188 (-0.00%); split: -0.01%, +0.00%
Spill count: 519861 -> 519813 (-0.01%)
Fill count: 663650 -> 663626 (-0.00%); split: -0.01%, +0.01%
Max live registers: 71738626 -> 71738696 (+0.00%)
Non SSA regs after NIR: 67837902 -> 67837648 (-0.00%)
Totals from 1236 (0.16% of 790723) affected shaders:
Instrs: 2134504 -> 2132625 (-0.09%); split: -0.09%, +0.01%
Cycle count: 604922278 -> 603494662 (-0.24%); split: -0.48%, +0.25%
Spill count: 16509 -> 16461 (-0.29%)
Fill count: 32760 -> 32736 (-0.07%); split: -0.22%, +0.15%
Max live registers: 250112 -> 250182 (+0.03%)
Non SSA regs after NIR: 302368 -> 302114 (-0.08%)
Meteor Lake, DG2, and Tiger Lake had similar results. (Meteor Lake shown)
Totals:
Instrs: 264095370 -> 264094056 (-0.00%); split: -0.00%, +0.00%
Cycle count: 26554146277 -> 26553027268 (-0.00%); split: -0.01%, +0.01%
Spill count: 530603 -> 530615 (+0.00%)
Fill count: 613231 -> 613273 (+0.01%)
Max live registers: 46559041 -> 46559087 (+0.00%)
Totals from 1237 (0.14% of 905547) affected shaders:
Instrs: 2262517 -> 2261203 (-0.06%); split: -0.07%, +0.01%
Cycle count: 518219799 -> 517100790 (-0.22%); split: -0.59%, +0.37%
Spill count: 17518 -> 17530 (+0.07%)
Fill count: 32273 -> 32315 (+0.13%)
Max live registers: 128360 -> 128406 (+0.04%)
Ice Lake and Skylake had similar results. (Ice Lake shown)
Totals:
Instrs: 269849640 -> 269848198 (-0.00%); split: -0.00%, +0.00%
Cycle count: 26718329643 -> 26718289020 (-0.00%); split: -0.00%, +0.00%
Max live registers: 46878430 -> 46878462 (+0.00%)
Totals from 1233 (0.14% of 905427) affected shaders:
Instrs: 2324225 -> 2322783 (-0.06%); split: -0.06%, +0.00%
Cycle count: 531467501 -> 531426878 (-0.01%); split: -0.11%, +0.10%
Max live registers: 130782 -> 130814 (+0.02%)
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37186>
On r600 ternary operations can't use the fabs source modifier, so
converting "fadd(fabs(fmul(a, b), c)" to "ffma(fabs(a), fabs(b), c)"
adds one more instruction in the backend, hence avoid this.
Signed-off-by: Gert Wollny <gert.wollny@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37440>
Add late optimization to fuse f2i32 and fround_even operations into a
single f2i32_rtne instruction when the intermediate fround_even result
is only used once. This eliminates redundant rounding since f2i32_rtne
performs round-to-nearest-even conversion directly.
Signed-off-by: Christian Gmeiner <cgmeiner@igalia.com>
Tested-by: Simon Perretta <simon.perretta@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37426>
In some cases after other passes, `(a == 0.0 ? 0 : b)` can be
turned into `(a != 0.0 ? b : 0)`, so let's match those cases too.
Also matching `min(abs(a), abs(b)) == 0.0 ? 0.0 : a * b`.
Signed-off-by: Karmjit Mahil <karmjit.mahil@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31479>
this chain of fmul is deliberately chosen for floating point precision
reasons, it needs to be exact, or else we might try to reassociate it
and break subnormal handling.
avoids regressing dEQP-VK.glsl.builtin.precision.ldexp_subnormals.*
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36257>
The fmul+fadd -> fma rules in nir_opt_algebraic are marked imprecise,
because they are a contraction. However, they respect signed zero/Inf/NaN rules.
As such, it is legal to do this fusion with shader float controls as long as the
exact bit is not set (mapping to SPIR-V NoContract).
Unfortunately, NIR's imprecise rules do not distinguish between contraction
issues versus float special case issues, forcing nir_search to skip all
imprecise rules when any shader float control modes are used. This notably
affects DXVK, which sets shader float controls to get D3D11 float behaviour and
hence loses FMA fusing.
Therefore, we plumb in the exact bit to express NoContract independent of the
float controls, and weaken the requirement for fma fusion to allowable
contraction. For fma splitting, it's a similar issue, as inexact GLSL fma in
SPIR-V is just a multiply add that we're allowed to contract rather than the
real deal.
Drivers that use their own FMA fusing passes (notably, Intel and AMD) are
unaffected, but DXVK-capable drivers using fuse_ffma should like this. Results
on hk shown:
Totals from 2194 (4.06% of 54019) affected shaders:
MaxWaves: 2174272 -> 2175936 (+0.08%); split: +0.08%, -0.01%
Instrs: 1173283 -> 1131494 (-3.56%); split: -3.57%, +0.01%
CodeSize: 8568168 -> 8381724 (-2.18%); split: -2.18%, +0.01%
Spills: 1094 -> 747 (-31.72%)
Fills: 988 -> 681 (-31.07%)
Scratch: 4444 -> 3820 (-14.04%)
ALU: 953032 -> 913149 (-4.18%); split: -4.19%, +0.01%
FSCIB: 953032 -> 913149 (-4.18%); split: -4.19%, +0.01%
IC: 215398 -> 215274 (-0.06%)
GPRs: 139865 -> 139032 (-0.60%); split: -1.56%, +0.96%
Uniforms: 414886 -> 414466 (-0.10%); split: -0.14%, +0.04%
Preamble instrs: 646398 -> 644017 (-0.37%); split: -0.43%, +0.07%
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35989>