/*
 * Copyright © 2014 Intel Corporation
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
 * IN THE SOFTWARE.
 */

#include "nir_search.h"
#include <inttypes.h>
#include "util/half_float.h"
#include "nir_builder.h"
#include "nir_worklist.h"

/* This should be the same as nir_search_max_comm_ops in nir_algebraic.py. */
#define NIR_SEARCH_MAX_COMM_OPS 8

struct match_state {
   unsigned fp_math_ctrl;
   uint8_t comm_op_direction;
   unsigned variables_seen;

   /* Used for running the automaton on newly-constructed instructions. */
   struct util_dynarray *states;
   const struct per_op_table *pass_op_table;
   const nir_algebraic_table *table;

   nir_alu_src variables[NIR_SEARCH_MAX_VARIABLES];
   const nir_search_state *state;
};

static bool
match_expression(const nir_algebraic_table *table, const nir_search_expression *expr, nir_alu_instr *instr,
                 unsigned num_components, const uint8_t *swizzle,
                 struct match_state *state);

static bool
nir_algebraic_automaton(nir_instr *instr, struct util_dynarray *states,
                        const struct per_op_table *pass_op_table);

static const uint8_t identity_swizzle[NIR_MAX_VEC_COMPONENTS] = {
   0, 1, 2, 3, 4, 5, 6, 7,
   8, 9, 10, 11, 12, 13, 14, 15,
};

static bool
nir_op_matches_search_op(nir_op nop, uint16_t sop)
{
   if (sop <= nir_last_opcode)
      return nop == sop;

#define MATCH_FCONV_CASE(op)           \
   case nir_search_op_##op:            \
      return nop == nir_op_##op##16 || \
             nop == nir_op_##op##32 || \
             nop == nir_op_##op##64;

#define MATCH_ICONV_CASE(op)           \
   case nir_search_op_##op:            \
      return nop == nir_op_##op##8 ||  \
             nop == nir_op_##op##16 || \
             nop == nir_op_##op##32 || \
             nop == nir_op_##op##64;

   switch (sop) {
      MATCH_FCONV_CASE(i2f)
      MATCH_FCONV_CASE(u2f)
      MATCH_FCONV_CASE(f2f)
      MATCH_ICONV_CASE(f2u)
      MATCH_ICONV_CASE(f2i)
      MATCH_ICONV_CASE(u2u)
      MATCH_ICONV_CASE(i2i)
      MATCH_FCONV_CASE(b2f)
      MATCH_ICONV_CASE(b2i)
   default:
      UNREACHABLE("Invalid nir_search_op");
   }

#undef MATCH_FCONV_CASE
#undef MATCH_ICONV_CASE
}

uint16_t
nir_search_op_for_nir_op(nir_op nop)
{
#define MATCH_FCONV_CASE(op) \
   case nir_op_##op##16:     \
   case nir_op_##op##32:     \
   case nir_op_##op##64:     \
      return nir_search_op_##op;

#define MATCH_ICONV_CASE(op) \
   case nir_op_##op##8:      \
   case nir_op_##op##16:     \
   case nir_op_##op##32:     \
   case nir_op_##op##64:     \
      return nir_search_op_##op;

   switch (nop) {
      MATCH_FCONV_CASE(i2f)
      MATCH_FCONV_CASE(u2f)
      MATCH_FCONV_CASE(f2f)
      MATCH_ICONV_CASE(f2u)
      MATCH_ICONV_CASE(f2i)
      MATCH_ICONV_CASE(u2u)
      MATCH_ICONV_CASE(i2i)
      MATCH_FCONV_CASE(b2f)
      MATCH_ICONV_CASE(b2i)
   default:
      return nop;
   }

#undef MATCH_FCONV_CASE
#undef MATCH_ICONV_CASE
}

static nir_op
nir_op_for_search_op(uint16_t sop, unsigned bit_size)
{
   if (sop <= nir_last_opcode)
      return sop;

#define RET_FCONV_CASE(op)                \
   case nir_search_op_##op:               \
      switch (bit_size) {                 \
      case 16:                            \
         return nir_op_##op##16;          \
      case 32:                            \
         return nir_op_##op##32;          \
      case 64:                            \
         return nir_op_##op##64;          \
      default:                            \
         UNREACHABLE("Invalid bit size"); \
      }

#define RET_ICONV_CASE(op)                \
   case nir_search_op_##op:               \
      switch (bit_size) {                 \
      case 8:                             \
         return nir_op_##op##8;           \
      case 16:                            \
         return nir_op_##op##16;          \
      case 32:                            \
         return nir_op_##op##32;          \
      case 64:                            \
         return nir_op_##op##64;          \
      default:                            \
         UNREACHABLE("Invalid bit size"); \
      }

   switch (sop) {
      RET_FCONV_CASE(i2f)
      RET_FCONV_CASE(u2f)
      RET_FCONV_CASE(f2f)
      RET_ICONV_CASE(f2u)
      RET_ICONV_CASE(f2i)
      RET_ICONV_CASE(u2u)
      RET_ICONV_CASE(i2i)
      RET_FCONV_CASE(b2f)
      RET_ICONV_CASE(b2i)
   default:
      UNREACHABLE("Invalid nir_search_op");
   }

#undef RET_FCONV_CASE
#undef RET_ICONV_CASE
}

static bool
match_value(const nir_algebraic_table *table,
            const nir_search_value *value, nir_alu_instr *instr, unsigned src,
            unsigned num_components, const uint8_t *swizzle,
            struct match_state *state)
{
|
2018-07-12 03:40:23 +02:00
|
|
|
uint8_t new_swizzle[NIR_MAX_VEC_COMPONENTS];
|
2014-11-13 21:19:28 -08:00
|
|
|
|
2015-05-08 08:33:01 -07:00
|
|
|
/* If the source is an explicitly sized source, then we need to reset
|
|
|
|
|
* both the number of components and the swizzle.
|
|
|
|
|
*/
|
|
|
|
|
if (nir_op_infos[instr->op].input_sizes[src] != 0) {
|
|
|
|
|
num_components = nir_op_infos[instr->op].input_sizes[src];
|
|
|
|
|
swizzle = identity_swizzle;
|
|
|
|
|
}
|
|
|
|
|
|
2015-09-08 23:52:48 +08:00
|
|
|
for (unsigned i = 0; i < num_components; ++i)
|
2014-11-13 21:19:28 -08:00
|
|
|
new_swizzle[i] = instr->src[src].swizzle[swizzle[i]];
|
|
|
|
|
|
2016-04-25 12:41:44 -07:00
|
|
|
/* If the value has a specific bit size and it doesn't match, bail */
|
nir/algebraic: Rewrite bit-size inference
Before this commit, there were two copies of the algorithm: one in C,
that we would use to figure out what bit-size to give the replacement
expression, and one in Python, that emulated the C one and tried to
prove that the C algorithm would never fail to correctly assign
bit-sizes. That seemed pretty fragile, and likely to fall over if we
make any changes. Furthermore, the C code was really just recomputing
more-or-less the same thing as the Python code every time. Instead, we
can just store the results of the Python algorithm in the C
datastructure, and consult it to compute the bitsize of each value,
moving the "brains" entirely into Python. Since the Python algorithm no
longer has to match C, it's also a lot easier to change it to something
more closely approximating an actual type-inference algorithm. The
algorithm used is based on Hindley-Milner, although deliberately
weakened a little. It's a few more lines than the old one, judging by
the diffstat, but I think it's easier to verify that it's correct while
being as general as possible.
We could split this up into two changes, first making the C code use the
results of the Python code and then rewriting the Python algorithm, but
since the old algorithm never tracked which variable each equivalence
class, it would mean we'd have to add some non-trivial code which would
then get thrown away. I think it's better to see the final state all at
once, although I could also try splitting it up.
v2:
- Replace instances of "== None" and "!= None" with "is None" and
"is not None".
- Rename first_src to first_unsized_src
- Only merge the destination with the first unsized source, since the
sources have already been merged.
- Add a comment explaining what nir_search_value::bit_size now means.
v3:
- Fix one last instance to use "is not" instead of !=
- Don't try to be so clever when choosing which error message to print
based on whether we're in the search or replace expression.
- Fix trailing whitespace.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
2018-11-23 17:34:19 +01:00
|
|
|
if (value->bit_size > 0 &&
|
2016-04-25 12:41:44 -07:00
|
|
|
nir_src_bit_size(instr->src[src].src) != value->bit_size)
|
|
|
|
|
return false;
|
|
|
|
|
|
2014-11-13 21:19:28 -08:00
|
|
|
switch (value->type) {
|
|
|
|
|
case nir_search_value_expression:
|
2025-11-07 21:38:36 +08:00
|
|
|
if (!nir_src_is_alu(instr->src[src].src))
|
2014-11-13 21:19:28 -08:00
|
|
|
return false;
|
|
|
|
|
|
2021-11-30 14:23:39 -08:00
|
|
|
return match_expression(table, nir_search_value_as_expression(value),
|
2025-07-31 09:49:36 -04:00
|
|
|
nir_def_as_alu(instr->src[src].src.ssa),
|
2014-11-13 21:19:28 -08:00
|
|
|
num_components, new_swizzle, state);

   case nir_search_value_variable: {
      nir_search_variable *var = nir_search_value_as_variable(value);
      assert(var->variable < NIR_SEARCH_MAX_VARIABLES);

      if (var->is_constant &&
          !nir_src_is_const(instr->src[src].src))
         return false;

      if (var->cond_index != -1 &&
          !table->variable_cond[var->cond_index](state->state, instr, src,
                                                 num_components, new_swizzle))
         return false;

      if (state->variables_seen & (1 << var->variable)) {
         if (state->variables[var->variable].src.ssa != instr->src[src].src.ssa)
            return false;

         return !memcmp(state->variables[var->variable].swizzle,
                        new_swizzle, num_components);
      } else {
         state->variables_seen |= (1 << var->variable);

         nir_alu_src *dst = &state->variables[var->variable];
         dst->src = instr->src[src].src;

         memcpy(dst->swizzle, new_swizzle, num_components);
         memset(dst->swizzle + num_components, 0,
                NIR_MAX_VEC_COMPONENTS - num_components);

         return true;
      }
   }

   case nir_search_value_constant: {
      nir_search_constant *const_val = nir_search_value_as_constant(value);

      if (!nir_src_is_const(instr->src[src].src))
         return false;

      switch (const_val->type) {
      case nir_type_float: {
         nir_load_const_instr *const load =
            nir_def_as_load_const(instr->src[src].src.ssa);

         /* There are 8-bit and 1-bit integer types, but there are no 8-bit
          * or 1-bit float types. This prevents potential assertion failures
          * in nir_src_comp_as_float.
          */
         if (load->def.bit_size < 16)
            return false;

         for (unsigned i = 0; i < num_components; ++i) {
            double val = nir_src_comp_as_float(instr->src[src].src,
                                               new_swizzle[i]);
            if (val != const_val->data.d)
               return false;

            /* The comparison above does not check the sign bit for 0.0,
             * so do it manually.
             */
            if ((dui(val) == 0) != (dui(const_val->data.d) == 0))
               return false;
         }
         return true;
      }

      case nir_type_int:
      case nir_type_uint:
      case nir_type_bool: {
         unsigned bit_size = nir_src_bit_size(instr->src[src].src);
         uint64_t mask = u_uintN_max(bit_size);

         for (unsigned i = 0; i < num_components; ++i) {
            uint64_t val = nir_src_comp_as_uint(instr->src[src].src,
                                                new_swizzle[i]);
            if ((val & mask) != (const_val->data.u & mask))
               return false;
         }
         return true;
      }

      default:
         UNREACHABLE("Invalid alu source type");
      }
   }

   default:
      UNREACHABLE("Invalid search value type");
   }
}

static bool
match_expression(const nir_algebraic_table *table,
                 const nir_search_expression *expr, nir_alu_instr *instr,
                 unsigned num_components, const uint8_t *swizzle,
                 struct match_state *state)
{
   if (expr->cond_index != -1 &&
       !table->expression_cond[expr->cond_index](instr))
      return false;

   if (!nir_op_matches_search_op(instr->op, expr->opcode))
      return false;

   if (expr->value.bit_size > 0 &&
       instr->def.bit_size != expr->value.bit_size)
      return false;

   if (expr->fp_math_ctrl_exclude & instr->fp_math_ctrl)
      return false;

   state->fp_math_ctrl |= instr->fp_math_ctrl;

   assert(nir_op_infos[instr->op].num_inputs > 0);

   /* If we have an explicitly sized destination, we can only handle the
    * identity swizzle. While dot(vec3(a, b, c).zxy) is a valid expression,
    * we don't have the information right now to propagate that swizzle
    * through. We can only properly propagate swizzles if the instruction
    * is vectorized.
    *
    * The only exception is swizzle, for which we have a special condition,
    * so that we can do pack64_2x32_split(unpack(a).x, unpack(a).y) --> a.
    */
   if (expr->swizzle >= 0) {
      if (num_components != 1 || swizzle[0] != expr->swizzle)
         return false;
   } else {
      if (nir_op_infos[instr->op].output_size != 0) {
         if (memcmp(swizzle, identity_swizzle, num_components))
            return false;
      }
   }

   /* If this is a commutative expression and it's one of the first few,
    * look up its direction for the current search operation. We'll use
    * that bit to possibly flip the sources for the match.
    *
    * A greedy match is not enough here: matching ('iadd', ('imul', a, b), b)
    * against
    *    ssa_2 = imul ssa_0, ssa_1
    *    ssa_3 = iadd ssa_2, ssa_0
    * fails if the imul commits a -> ssa_0, b -> ssa_1 on its first try and
    * cannot be asked to retry flipped. The caller therefore iterates
    * comm_op_direction over the full combinatorial matrix of flipped and
    * unflipped commutative sources, limited to NIR_SEARCH_MAX_COMM_OPS
    * commutative ops per expression.
    */
   unsigned comm_op_flip =
      (expr->comm_expr_idx >= 0 &&
       expr->comm_expr_idx < NIR_SEARCH_MAX_COMM_OPS)
         ? ((state->comm_op_direction >> expr->comm_expr_idx) & 1)
         : 0;

   bool matched = true;
   for (unsigned i = 0; i < nir_op_infos[instr->op].num_inputs; i++) {
      /* If src1 of the search expression is a constant, check that first
       * since it's faster.
       */
      unsigned src_idx = i < 2 ? i ^ expr->src1_is_const : i;

      /* 2src_commutative instructions that have 3 sources are only
       * commutative in the first two sources. Source 2 is always source 2.
       */
      if (!match_value(table, &state->table->values[expr->srcs[src_idx]].value,
                       instr, i < 2 ? src_idx ^ comm_op_flip : src_idx,
                       num_components, swizzle, state)) {
         matched = false;
         break;
      }
   }

   return matched;
}

/* nir_search_value::bit_size encodes three cases, precomputed by the Python
 * bit-size inference in nir_algebraic.py: a positive value is an explicit
 * bit size, a negative value -(i + 1) means "same bit size as matched
 * variable i", and zero means "same bit size as the searched instruction's
 * destination".
 */
static unsigned
replace_bitsize(const nir_search_value *value, unsigned search_bitsize,
                struct match_state *state)
{
   if (value->bit_size > 0)
      return value->bit_size;
   if (value->bit_size < 0)
      return nir_src_bit_size(state->variables[-value->bit_size - 1].src);
   return search_bitsize;
}

static nir_alu_src
construct_value(nir_builder *build,
                const nir_search_value *value,
                unsigned num_components, unsigned search_bitsize,
                struct match_state *state,
                nir_instr *instr)
{
   switch (value->type) {
   case nir_search_value_expression: {
      const nir_search_expression *expr = nir_search_value_as_expression(value);
      unsigned dst_bit_size = replace_bitsize(value, search_bitsize, state);
      nir_op op = nir_op_for_search_op(expr->opcode, dst_bit_size);

      if (nir_op_infos[op].output_size != 0)
         num_components = nir_op_infos[op].output_size;

      nir_alu_instr *alu = nir_alu_instr_create(build->shader, op);
      nir_def_init(&alu->instr, &alu->def, num_components, dst_bit_size);

      /* We have no way of knowing which values in a given search expression
       * map to a particular replacement value. Therefore, if the expression
       * we are replacing has any exact values, the entire replacement should
       * be exact.
       */
      alu->fp_math_ctrl =
         nir_op_valid_fp_math_ctrl(op, state->fp_math_ctrl | expr->fp_math_ctrl_add);

      for (unsigned i = 0; i < nir_op_infos[op].num_inputs; i++) {
         /* If the source is an explicitly sized source, then we need to
          * reset the number of components to match.
          */
         if (nir_op_infos[alu->op].input_sizes[i] != 0)
            num_components = nir_op_infos[alu->op].input_sizes[i];

         alu->src[i] = construct_value(build,
                                       &state->table->values[expr->srcs[i]].value,
                                       num_components, search_bitsize,
                                       state, instr);
      }

      /* Immediately try to constant-fold the expression, in order to allow
       * more expressions to be matched within a single pass.
       */
      nir_def *def = &alu->def;
      nir_def *const_expr = nir_try_constant_fold_alu(build, alu);
      if (const_expr) {
         nir_instr_free(&alu->instr);
         def = const_expr;
      } else {
         nir_builder_instr_insert(build, &alu->instr);
      }

      assert(def->index ==
             util_dynarray_num_elements(state->states, uint16_t));
util/dynarray: infer type in append
Most of the time, we can infer the type to append in
util_dynarray_append using __typeof__, which is standardized in C23 and
supported in MSVC. This patch drops the type argument most of
the time, making util_dynarray a little more ergonomic to use.
This is done in four steps.
First, rename util_dynarray_append -> util_dynarray_append_typed
bash -c "find . -type f -exec sed -i -e 's/util_dynarray_append(/util_dynarray_append_typed(/g' \{} \;"
Then, add a new append that infers the type. This is much more ergonomic
for what you want most of the time.
Next, use type-inferred append as much as possible, via Coccinelle
patch (plus manual fixup):
@@
expression dynarray, element;
type type;
@@
-util_dynarray_append_typed(dynarray, type, element);
+util_dynarray_append(dynarray, element);
Finally, hand fixup cases that Coccinelle missed or incorrectly
translated, of which there were several because we can't use the
untyped append with a literal (since the sizeof won't do what you want).
All four steps are squashed to produce a single patch changing every
util_dynarray_append call site in tree to either drop a type parameter
(if possible) or insert a _typed suffix (if we can't infer). As such,
the final patch is best reviewed by hand even though it was
tool-assisted.
No Long Linguine Meals were involved in the making of this patch.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38038>
2025-10-23 15:36:13 -04:00
|
|
|
util_dynarray_append_typed(state->states, uint16_t, 0);
|
2025-11-07 21:38:36 +08:00
|
|
|
nir_algebraic_automaton(nir_def_instr(def), state->states, state->pass_op_table);
|
2019-05-10 16:57:45 +02:00
|
|
|
|
2015-01-21 11:11:03 -08:00
|
|
|
nir_alu_src val;
|
2025-09-05 10:57:45 +02:00
|
|
|
val.src = nir_src_for_ssa(def);
|
2025-07-14 17:10:07 +02:00
|
|
|
if (expr->swizzle < 0)
|
|
|
|
|
memcpy(val.swizzle, identity_swizzle, sizeof(val.swizzle));
|
|
|
|
|
else
|
|
|
|
|
memset(val.swizzle, expr->swizzle, sizeof(val.swizzle));
|
2014-11-13 21:19:28 -08:00
|
|
|
|
|
|
|
|
return val;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
case nir_search_value_variable: {
|
|
|
|
|
const nir_search_variable *var = nir_search_value_as_variable(value);
|
|
|
|
|
assert(state->variables_seen & (1 << var->variable));
|
|
|
|
|
|
2015-04-21 18:00:21 -07:00
|
|
|
nir_alu_src val = { NIR_SRC_INIT };
|
2023-09-06 14:04:39 +10:00
|
|
|
nir_alu_src_copy(&val, &state->variables[var->variable]);
|
2015-01-22 14:15:27 -08:00
|
|
|
assert(!var->is_constant);
|
|
|
|
|
|
2019-06-20 21:23:53 -04:00
|
|
|
for (unsigned i = 0; i < NIR_MAX_VEC_COMPONENTS; i++)
|
|
|
|
|
val.swizzle[i] = state->variables[var->variable].swizzle[var->swizzle[i]];
|
|
|
|
|
|
2014-11-13 21:19:28 -08:00
|
|
|
return val;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
case nir_search_value_constant: {
|
|
|
|
|
const nir_search_constant *c = nir_search_value_as_constant(value);
|
nir/algebraic: Rewrite bit-size inference
Before this commit, there were two copies of the algorithm: one in C,
that we would use to figure out what bit-size to give the replacement
expression, and one in Python, that emulated the C one and tried to
prove that the C algorithm would never fail to correctly assign
bit-sizes. That seemed pretty fragile, and likely to fall over if we
make any changes. Furthermore, the C code was really just recomputing
more-or-less the same thing as the Python code every time. Instead, we
can just store the results of the Python algorithm in the C
datastructure, and consult it to compute the bitsize of each value,
moving the "brains" entirely into Python. Since the Python algorithm no
longer has to match C, it's also a lot easier to change it to something
more closely approximating an actual type-inference algorithm. The
algorithm used is based on Hindley-Milner, although deliberately
weakened a little. It's a few more lines than the old one, judging by
the diffstat, but I think it's easier to verify that it's correct while
being as general as possible.
We could split this up into two changes, first making the C code use the
results of the Python code and then rewriting the Python algorithm, but
since the old algorithm never tracked which variable each equivalence
class, it would mean we'd have to add some non-trivial code which would
then get thrown away. I think it's better to see the final state all at
once, although I could also try splitting it up.
v2:
- Replace instances of "== None" and "!= None" with "is None" and
"is not None".
- Rename first_src to first_unsized_src
- Only merge the destination with the first unsized source, since the
sources have already been merged.
- Add a comment explaining what nir_search_value::bit_size now means.
v3:
- Fix one last instance to use "is not" instead of !=
- Don't try to be so clever when choosing which error message to print
based on whether we're in the search or replace expression.
- Fix trailing whitespace.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
2018-11-23 17:34:19 +01:00
|
|
|
unsigned bit_size = replace_bitsize(value, search_bitsize, state);
|
2014-11-13 21:19:28 -08:00
|
|
|
|
2023-08-12 16:17:15 -04:00
|
|
|
nir_def *cval;
|
2015-08-14 11:45:30 -07:00
|
|
|
switch (c->type) {
|
2014-11-13 21:19:28 -08:00
|
|
|
case nir_type_float:
|
2018-11-23 17:34:19 +01:00
|
|
|
cval = nir_imm_floatN_t(build, c->data.d, bit_size);
|
2014-11-13 21:19:28 -08:00
|
|
|
break;
|
2015-08-14 11:45:30 -07:00
|
|
|
|
2014-11-13 21:19:28 -08:00
|
|
|
case nir_type_int:
|
2015-05-15 09:14:47 -07:00
|
|
|
case nir_type_uint:
|
2018-11-23 17:34:19 +01:00
|
|
|
cval = nir_imm_intN_t(build, c->data.i, bit_size);
|
2016-04-22 09:45:10 +03:00
|
|
|
break;
|
2015-08-14 11:45:30 -07:00
|
|
|
|
2018-10-18 22:31:08 -05:00
|
|
|
case nir_type_bool:
|
2018-10-18 11:59:40 -05:00
|
|
|
cval = nir_imm_boolN_t(build, c->data.u, bit_size);
|
2014-11-13 21:19:28 -08:00
|
|
|
break;
|
2018-10-18 11:59:40 -05:00
|
|
|
|
2014-11-13 21:19:28 -08:00
|
|
|
default:
|
2025-07-23 09:17:35 +02:00
|
|
|
UNREACHABLE("Invalid alu source type");
|
2014-11-13 21:19:28 -08:00
|
|
|
}
|
|
|
|
|
|
2019-05-10 16:57:45 +02:00
|
|
|
assert(cval->index ==
|
|
|
|
|
util_dynarray_num_elements(state->states, uint16_t));
|
2025-10-23 15:36:13 -04:00
|
|
|
util_dynarray_append_typed(state->states, uint16_t, 0);
|
2025-11-07 21:38:36 +08:00
|
|
|
nir_algebraic_automaton(nir_def_instr(cval), state->states,
|
2019-05-10 16:57:45 +02:00
|
|
|
state->pass_op_table);
|
|
|
|
|
|
2015-01-21 11:11:03 -08:00
|
|
|
nir_alu_src val;
|
2018-10-22 14:08:13 -05:00
|
|
|
val.src = nir_src_for_ssa(cval);
|
2015-01-21 11:11:03 -08:00
|
|
|
memset(val.swizzle, 0, sizeof val.swizzle);
|
2014-11-13 21:19:28 -08:00
|
|
|
|
|
|
|
|
return val;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
default:
|
2025-07-23 09:17:35 +02:00
|
|
|
UNREACHABLE("Invalid search value type");
|
2014-11-13 21:19:28 -08:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
2023-08-08 12:00:35 -05:00
|
|
|
UNUSED static void
|
|
|
|
|
dump_value(const nir_algebraic_table *table, const nir_search_value *val)
|
2019-02-18 17:28:32 +01:00
|
|
|
{
|
|
|
|
|
switch (val->type) {
|
|
|
|
|
case nir_search_value_constant: {
|
|
|
|
|
const nir_search_constant *sconst = nir_search_value_as_constant(val);
|
|
|
|
|
switch (sconst->type) {
|
|
|
|
|
case nir_type_float:
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "%f", sconst->data.d);
|
2019-02-18 17:28:32 +01:00
|
|
|
break;
|
|
|
|
|
case nir_type_int:
|
2023-08-08 12:00:35 -05:00
|
|
|
fprintf(stderr, "%" PRId64, sconst->data.i);
|
2019-02-18 17:28:32 +01:00
|
|
|
break;
|
|
|
|
|
case nir_type_uint:
|
2023-08-08 12:00:35 -05:00
|
|
|
fprintf(stderr, "0x%" PRIx64, sconst->data.u);
|
2019-02-18 17:28:32 +01:00
|
|
|
break;
|
2019-06-24 14:49:17 -07:00
|
|
|
case nir_type_bool:
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "%s", sconst->data.u != 0 ? "True" : "False");
|
2019-06-24 14:49:17 -07:00
|
|
|
break;
|
2019-02-18 17:28:32 +01:00
|
|
|
default:
|
2025-07-23 09:17:35 +02:00
|
|
|
UNREACHABLE("bad const type");
|
2019-02-18 17:28:32 +01:00
|
|
|
}
|
|
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
case nir_search_value_variable: {
|
|
|
|
|
const nir_search_variable *var = nir_search_value_as_variable(val);
|
|
|
|
|
if (var->is_constant)
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "#");
|
|
|
|
|
fprintf(stderr, "%c", var->variable + 'a');
|
2019-02-18 17:28:32 +01:00
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
case nir_search_value_expression: {
|
|
|
|
|
const nir_search_expression *expr = nir_search_value_as_expression(val);
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "(");
|
2026-01-28 15:35:49 +01:00
|
|
|
if (expr->fp_math_ctrl_exclude & nir_fp_exact)
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "~");
|
2019-02-18 17:28:32 +01:00
|
|
|
switch (expr->opcode) {
|
2023-08-08 12:00:35 -05:00
|
|
|
#define CASE(n) \
|
|
|
|
|
case nir_search_op_##n: \
|
|
|
|
|
fprintf(stderr, #n); \
|
|
|
|
|
break;
|
|
|
|
|
CASE(b2f)
|
|
|
|
|
CASE(b2i)
|
|
|
|
|
CASE(i2i)
|
2025-06-26 17:14:23 +01:00
|
|
|
CASE(u2u)
|
|
|
|
|
CASE(f2f)
|
2023-08-08 12:00:35 -05:00
|
|
|
CASE(f2i)
|
2025-06-26 17:14:23 +01:00
|
|
|
CASE(f2u)
|
2023-08-08 12:00:35 -05:00
|
|
|
CASE(i2f)
|
2025-06-26 17:14:23 +01:00
|
|
|
CASE(u2f)
|
2019-02-18 17:28:32 +01:00
|
|
|
#undef CASE
|
|
|
|
|
default:
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "%s", nir_op_infos[expr->opcode].name);
|
2019-02-18 17:28:32 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
unsigned num_srcs = 1;
|
|
|
|
|
if (expr->opcode <= nir_last_opcode)
|
|
|
|
|
num_srcs = nir_op_infos[expr->opcode].num_inputs;
|
|
|
|
|
|
|
|
|
|
for (unsigned i = 0; i < num_srcs; i++) {
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, " ");
|
2021-11-29 15:24:47 -08:00
|
|
|
dump_value(table, &table->values[expr->srcs[i]].value);
|
2019-02-18 17:28:32 +01:00
|
|
|
}
|
|
|
|
|
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, ")");
|
2019-02-18 17:28:32 +01:00
|
|
|
break;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
if (val->bit_size > 0)
|
2019-09-16 14:27:24 -07:00
|
|
|
fprintf(stderr, "@%d", val->bit_size);
|
2019-02-18 17:28:32 +01:00
|
|
|
}
|
|
|
|
|
|
nir: Make algebraic backtrack and reprocess after a replacement.
The algebraic pass was exhibiting O(n^2) behavior in
dEQP-GLES2.functional.uniform_api.random.3 and
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.13 (along with
other code-generated tests, and likely real-world loop-unroll cases).
In the process of using fmul(b2f(x), b2f(y)) -> b2f(iand(x, y)) to
transform:
result = b2f(a == b);
result *= b2f(c == d);
...
result *= b2f(z == w);
->
temp = (a == b)
temp = temp && (c == d)
...
temp = temp && (z == w)
result = b2f(temp);
nir_opt_algebraic, proceeding bottom-to-top, would match and convert
the top-most fmul(b2f(), b2f()) case each time, leaving the new b2f to
be matched by the next fmul down on the next time algebraic got run by
the optimization loop.
Back in 2016 in 7be8d0773229 ("nir: Do opt_algebraic in reverse
order."), Matt changed algebraic to go bottom-to-top so that we would
match the biggest patterns first. This helped his cases, but I
believe introduced this failure mode. Instead of reverting that, now
that we've got the automaton, we can update the automaton's state
recursively and just re-process any instructions whose state has
changed (indicating that they might match new things). There's a
small chance that the state will hash to the same value and miss out
on this round of algebraic, but this seems to be good enough to fix
dEQP.
Effects with NIR_VALIDATE=0 (improvement is better with validation enabled):
Intel shader-db runtime -0.954712% +/- 0.333844% (n=44/46, obvious throttling
outliers removed)
dEQP-GLES2.functional.uniform_api.random.3 runtime
-65.3512% +/- 4.22369% (n=21, was 1.4s)
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.13 runtime
-68.8066% +/- 6.49523% (was 4.8s)
v2: Use two worklists, suggested by @cwabbott, to cut out a bunch of
tricky code. Runtime of uniform_api.random.3 down -0.790299% +/-
0.244213% compared to v1.
v3: Re-add the nir_instr_remove() that I accidentally dropped in v2,
fixing infinite loops.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
2019-10-02 10:59:13 -07:00
|
|
|
static void
|
2020-11-04 13:12:47 +00:00
|
|
|
add_uses_to_worklist(nir_instr *instr,
|
|
|
|
|
nir_instr_worklist *worklist,
|
|
|
|
|
struct util_dynarray *states,
|
|
|
|
|
const struct per_op_table *pass_op_table)
|
2019-10-02 10:59:13 -07:00
|
|
|
{
|
2023-08-15 12:05:54 -05:00
|
|
|
nir_def *def = nir_instr_def(instr);
|
2019-10-02 10:59:13 -07:00
|
|
|
|
|
|
|
|
nir_foreach_use_safe(use_src, def) {
|
2023-08-14 09:58:47 -04:00
|
|
|
if (nir_algebraic_automaton(nir_src_parent_instr(use_src), states, pass_op_table))
|
|
|
|
|
nir_instr_worklist_push_tail(worklist, nir_src_parent_instr(use_src));
|
2019-10-02 10:59:13 -07:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static void
|
|
|
|
|
nir_algebraic_update_automaton(nir_instr *new_instr,
|
|
|
|
|
nir_instr_worklist *algebraic_worklist,
|
|
|
|
|
struct util_dynarray *states,
|
|
|
|
|
const struct per_op_table *pass_op_table)
|
|
|
|
|
{
|
|
|
|
|
|
2025-08-11 12:22:37 -04:00
|
|
|
nir_instr_worklist automaton_worklist;
|
|
|
|
|
nir_instr_worklist_init(&automaton_worklist);
|
2019-10-02 10:59:13 -07:00
|
|
|
|
|
|
|
|
/* Walk through the tree of uses of our new instruction's SSA value,
|
|
|
|
|
* recursively updating the automaton state until it stabilizes.
|
|
|
|
|
*/
|
2025-08-11 12:22:37 -04:00
|
|
|
add_uses_to_worklist(new_instr, &automaton_worklist, states, pass_op_table);
|
2019-10-02 10:59:13 -07:00
|
|
|
|
|
|
|
|
nir_instr *instr;
|
2025-08-11 12:22:37 -04:00
|
|
|
while ((instr = nir_instr_worklist_pop_head(&automaton_worklist))) {
|
2020-11-04 13:12:47 +00:00
|
|
|
nir_instr_worklist_push_tail(algebraic_worklist, instr);
|
2025-08-11 12:22:37 -04:00
|
|
|
add_uses_to_worklist(instr, &automaton_worklist, states, pass_op_table);
|
nir: Make algebraic backtrack and reprocess after a replacement.
The algebraic pass was exhibiting O(n^2) behavior in
dEQP-GLES2.functional.uniform_api.random.3 and
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.13 (along with
other code-generated tests, and likely real-world loop-unroll cases).
In the process of using fmul(b2f(x), b2f(x)) -> b2f(iand(x, y)) to
transform:
result = b2f(a == b);
result *= b2f(c == d);
...
result *= b2f(z == w);
->
temp = (a == b)
temp = temp && (c == d)
...
temp = temp && (z == w)
result = b2f(temp);
nir_opt_algebraic, proceeding bottom-to-top, would match and convert
the top-most fmul(b2f(), b2f()) case each time, leaving the new b2f to
be matched by the next fmul down on the next time algebraic got run by
the optimization loop.
Back in 2016 in 7be8d0773229 ("nir: Do opt_algebraic in reverse
order."), Matt changed algebraic to go bottom-to-top so that we would
match the biggest patterns first. This helped his cases, but I
believe introduced this failure mode. Instead of reverting that, now
that we've got the automaton, we can update the automaton's state
recursively and just re-process any instructions whose state has
changed (indicating that they might match new things). There's a
small chance that the state will hash to the same value and miss out
on this round of algebraic, but this seems to be good enough to fix
dEQP.
Effects with NIR_VALIDATE=0 (improvement is better with validation enabled):
Intel shader-db runtime -0.954712% +/- 0.333844% (n=44/46, obvious throttling
outliers removed)
dEQP-GLES2.functional.uniform_api.random.3 runtime
-65.3512% +/- 4.22369% (n=21, was 1.4s)
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.13 runtime
-68.8066% +/- 6.49523% (was 4.8s)
v2: Use two worklists, suggested by @cwabbott, to cut out a bunch of
tricky code. Runtime of uniform_api.random.3 down -0.790299% +/-
0.244213% compred to v1.
v3: Re-add the nir_instr_remove() that I accidentally dropped in v2,
fixing infinite loops.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
2019-10-02 10:59:13 -07:00
|
|
|
}
|
|
|
|
|
|
2025-08-11 12:22:37 -04:00
|
|
|
nir_instr_worklist_fini(&automaton_worklist);
|
nir: Make algebraic backtrack and reprocess after a replacement.
The algebraic pass was exhibiting O(n^2) behavior in
dEQP-GLES2.functional.uniform_api.random.3 and
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.13 (along with
other code-generated tests, and likely real-world loop-unroll cases).
In the process of using fmul(b2f(x), b2f(x)) -> b2f(iand(x, y)) to
transform:
result = b2f(a == b);
result *= b2f(c == d);
...
result *= b2f(z == w);
->
temp = (a == b)
temp = temp && (c == d)
...
temp = temp && (z == w)
result = b2f(temp);
nir_opt_algebraic, proceeding bottom-to-top, would match and convert
the top-most fmul(b2f(), b2f()) case each time, leaving the new b2f to
be matched by the next fmul down on the next time algebraic got run by
the optimization loop.
Back in 2016 in 7be8d0773229 ("nir: Do opt_algebraic in reverse
order."), Matt changed algebraic to go bottom-to-top so that we would
match the biggest patterns first. This helped his cases, but I
believe introduced this failure mode. Instead of reverting that, now
that we've got the automaton, we can update the automaton's state
recursively and just re-process any instructions whose state has
changed (indicating that they might match new things). There's a
small chance that the state will hash to the same value and miss out
on this round of algebraic, but this seems to be good enough to fix
dEQP.
Effects with NIR_VALIDATE=0 (improvement is better with validation enabled):
Intel shader-db runtime -0.954712% +/- 0.333844% (n=44/46, obvious throttling
outliers removed)
dEQP-GLES2.functional.uniform_api.random.3 runtime
-65.3512% +/- 4.22369% (n=21, was 1.4s)
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.13 runtime
-68.8066% +/- 6.49523% (was 4.8s)
v2: Use two worklists, suggested by @cwabbott, to cut out a bunch of
tricky code. Runtime of uniform_api.random.3 down -0.790299% +/-
0.244213% compred to v1.
v3: Re-add the nir_instr_remove() that I accidentally dropped in v2,
fixing infinite loops.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
2019-10-02 10:59:13 -07:00
|
|
|
}

static bool
nir_replace_instr(nir_builder *build, nir_alu_instr *instr,
                  const nir_search_state *search_state,
                  struct util_dynarray *states,
                  const nir_algebraic_table *table,
                  const nir_search_expression *search,
                  const nir_search_value *replace,
                  nir_instr_worklist *algebraic_worklist,
                  struct exec_list *dead_instrs)
{
   struct match_state state;
   state.fp_math_ctrl = nir_fp_fast_math;
   state.state = search_state;
   state.pass_op_table = table->pass_op_table;
   state.table = table;

   STATIC_ASSERT(sizeof(state.comm_op_direction) * 8 >= NIR_SEARCH_MAX_COMM_OPS);
nir/search: Search for all combinations of commutative ops
Consider the following search expression and NIR sequence:
('iadd', ('imul', a, b), b)
ssa_2 = imul ssa_0, ssa_1
ssa_3 = iadd ssa_2, ssa_0
The current algorithm is greedy and, the moment the imul finds a match,
it commits those variable names and returns success. In the above
example, it maps a -> ssa_0 and b -> ssa_1. When we then try to match
the iadd, it sees that ssa_0 is not b and fails to match. The iadd
match will attempt to flip itself and try again (which won't work) but
it cannot ask the imul to try a flipped match.
This commit instead counts the number of commutative ops in each
expression and assigns an index to each. It then loops
over the full combinatorial matrix of commutative operations. In order
to keep things sane, we limit it to at most 4 commutative operations (16
combinations). There is only one optimization in opt_algebraic that
goes over this limit and it's the bitfieldReverse detection for some UE4
demo.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15310125 -> 15302469 (-0.05%)
instructions in affected programs: 1797123 -> 1789467 (-0.43%)
helped: 6751
HURT: 2264
total cycles in shared programs: 357346617 -> 357202526 (-0.04%)
cycles in affected programs: 15931005 -> 15786914 (-0.90%)
helped: 6024
HURT: 3436
total loops in shared programs: 4360 -> 4360 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total spills in shared programs: 23675 -> 23666 (-0.04%)
spills in affected programs: 235 -> 226 (-3.83%)
helped: 5
HURT: 1
total fills in shared programs: 32040 -> 32032 (-0.02%)
fills in affected programs: 190 -> 182 (-4.21%)
helped: 6
HURT: 2
LOST: 18
GAINED: 5
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

   unsigned comm_expr_combinations =
      1 << MIN2(search->comm_exprs, NIR_SEARCH_MAX_COMM_OPS);
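   /* Worked example (editor's sketch, not part of the original source): for
    * a search pattern with comm_exprs == 2, comm_expr_combinations == 4 and
    * comb below enumerates 0b00, 0b01, 0b10, 0b11.  Bit i of comb tells
    * commutative expression i whether to try its two sources in the original
    * or the swapped order, so every combination of flips gets attempted.
    */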

   bool found = false;
   for (unsigned comb = 0; comb < comm_expr_combinations; comb++) {
      /* The bitfield of directions is just the current iteration. Hooray for
       * binary.
       */
      state.comm_op_direction = comb;
      state.variables_seen = 0;

      if (match_expression(table, search, instr,
                           instr->def.num_components,
                           identity_swizzle, &state)) {
         found = true;
         break;
      }
   }
   if (!found)
      return false;

#if 0
   fprintf(stderr, "matched: ");
   dump_value(table, &search->value);
   fprintf(stderr, " -> ");
   dump_value(table, replace);
   fprintf(stderr, " ssa_%d\n", instr->def.index);
#endif

nir/algebraic: Change the default cursor location when replacing a unary op
If the expression tree that is being replaced has a unary operation at
its root, set the cursor (location where new instructions are inserted)
at the source instruction instead.
This doesn't do much now because there are very few patterns that have a
unary operation as the root. Almost all of the patterns that do have a
unary operation as the root have inot. All of the shaders that are
affected by this commit have expression trees with an inot at the root.
This change prevents some significant, spurious changes caused by the next
commit. There is further explanation in the large comment added in
the code.
I also considered a couple other options that may still be worth exploring.
1. Add some mark-up to the search pattern to denote where new
instructions should be added. I considered using "@" to denote the
cursor location. For example,
(('fneg', ('fadd@', a, b)), ...)
2. To prevent other kinds of unintended code motion, add the ability to
name expressions in the search pattern so that they can be reused in
the replacement. For example,
(('bcsel', ('ige', ('find_lsb=b', a), 0), ('find_lsb', a), -1), b),
An alternative would be to add some kind of CSE at the time of
inserting the replacements. Create a new instruction, then check to
see if it already exists. That option might be better overall.
Over the years I know Matt has heard me complain, "I added a pattern
that just deleted an instruction, but it added a bunch of spills!" This
was always in large, complex shaders that are very hard to analyze. I
always blamed these cases on the scheduler being dumb. I am now very
suspicious that unintended code motion was the real problem.
All Gen4+ Intel platforms had similar results. (Tiger Lake shown)
total instructions in shared programs: 17611405 -> 17611333 (<.01%)
instructions in affected programs: 18613 -> 18541 (-0.39%)
helped: 41
HURT: 13
helped stats (abs) min: 1 max: 18 x̄: 4.46 x̃: 4
helped stats (rel) min: 0.27% max: 5.68% x̄: 1.29% x̃: 1.34%
HURT stats (abs) min: 1 max: 20 x̄: 8.54 x̃: 7
HURT stats (rel) min: 0.30% max: 4.20% x̄: 2.15% x̃: 2.38%
95% mean confidence interval for instructions value: -3.29 0.63
95% mean confidence interval for instructions %-change: -0.95% 0.02%
Inconclusive result (value mean confidence interval includes 0).
total cycles in shared programs: 338366118 -> 338365223 (<.01%)
cycles in affected programs: 257889 -> 256994 (-0.35%)
helped: 42
HURT: 15
helped stats (abs) min: 2 max: 120 x̄: 39.38 x̃: 34
helped stats (rel) min: 0.04% max: 2.55% x̄: 0.86% x̃: 0.76%
HURT stats (abs) min: 6 max: 204 x̄: 50.60 x̃: 34
HURT stats (rel) min: 0.11% max: 4.75% x̄: 1.12% x̃: 0.56%
95% mean confidence interval for cycles value: -30.39 -1.02
95% mean confidence interval for cycles %-change: -0.66% -0.02%
Cycles are helped.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/1359>

   /* If the instruction at the root of the expression tree being replaced is
    * a unary operation, insert the replacement instructions at the location
    * of the source of the unary operation. Otherwise, insert the replacement
    * instructions at the location of the expression tree root.
    *
    * For the unary operation case, this is done to prevent some spurious code
    * motion that can dramatically extend live ranges. Imagine an expression
    * like -(A+B) where the addition and the negation are separated by flow
    * control and thousands of instructions. If this expression is replaced
    * with -A+-B, inserting the new instructions at the site of the negation
    * could extend the live range of A and B dramatically. This could increase
    * register pressure and cause spilling.
    *
    * It may well be that moving instructions around is a good thing, but
    * keeping algebraic optimizations and code motion optimizations separate
    * seems safest.
    */
   nir_alu_instr *const src_instr = nir_src_as_alu(instr->src[0].src);
   if (src_instr != NULL &&
       (instr->op == nir_op_fneg || instr->op == nir_op_fabs ||
        instr->op == nir_op_ineg || instr->op == nir_op_iabs ||
        instr->op == nir_op_inot)) {
      /* Insert new instructions *after*. Otherwise a hypothetical
       * replacement fneg(X) -> fabs(X) would insert the fabs() instruction
       * before X! This can also occur for things like fneg(X.wzyx) -> X.wzyx
       * in vector mode. A move instruction to handle the swizzle will get
       * inserted before X.
       *
       * This manifested in a single OpenGL ES 2.0 CTS vertex shader test on
       * older Intel GPUs that use vector-mode vertex processing.
       */
      build->cursor = nir_after_instr(&src_instr->instr);
   } else {
      build->cursor = nir_before_instr(&instr->instr);
   }

   state.states = states;

   nir_alu_src val = construct_value(build, replace,
                                     instr->def.num_components,
                                     instr->def.bit_size,
nir/algebraic: Rewrite bit-size inference
Before this commit, there were two copies of the algorithm: one in C,
that we would use to figure out what bit-size to give the replacement
expression, and one in Python, that emulated the C one and tried to
prove that the C algorithm would never fail to correctly assign
bit-sizes. That seemed pretty fragile, and likely to fall over if we
make any changes. Furthermore, the C code was really just recomputing
more-or-less the same thing as the Python code every time. Instead, we
can just store the results of the Python algorithm in the C
datastructure, and consult it to compute the bitsize of each value,
moving the "brains" entirely into Python. Since the Python algorithm no
longer has to match C, it's also a lot easier to change it to something
more closely approximating an actual type-inference algorithm. The
algorithm used is based on Hindley-Milner, although deliberately
weakened a little. It's a few more lines than the old one, judging by
the diffstat, but I think it's easier to verify that it's correct while
being as general as possible.
We could split this up into two changes, first making the C code use the
results of the Python code and then rewriting the Python algorithm, but
since the old algorithm never tracked which variable each equivalence
class, it would mean we'd have to add some non-trivial code which would
then get thrown away. I think it's better to see the final state all at
once, although I could also try splitting it up.
v2:
- Replace instances of "== None" and "!= None" with "is None" and
"is not None".
- Rename first_src to first_unsized_src
- Only merge the destination with the first unsized source, since the
sources have already been merged.
- Add a comment explaining what nir_search_value::bit_size now means.
v3:
- Fix one last instance to use "is not" instead of !=
- Don't try to be so clever when choosing which error message to print
based on whether we're in the search or replace expression.
- Fix trailing whitespace.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
2018-11-23 17:34:19 +01:00
                                     &state, &instr->instr);

   /* Note that NIR builder will elide the MOV if it's a no-op, which may
    * allow more work to be done in a single pass through algebraic.
    */
   nir_def *mov = nir_def_rewrite_uses_with_alu_src(build, &instr->def, val,
                                                    instr->def.num_components);

   if (mov) {
util/dynarray: infer type in append
Most of the time, we can infer the type to append in
util_dynarray_append using __typeof__, which is standardized in C23 and
support in Jesse's MSMSVCV. This patch drops the type argument most of
the time, making util_dynarray a little more ergonomic to use.
This is done in four steps.
First, rename util_dynarray_append -> util_dynarray_append_typed
bash -c "find . -type f -exec sed -i -e 's/util_dynarray_append(/util_dynarray_append_typed(/g' \{} \;"
Then, add a new append that infers the type. This is much more ergonomic
for what you want most of the time.
Next, use type-inferred append as much as possible, via Coccinelle
patch (plus manual fixup):
@@
expression dynarray, element;
type type;
@@
-util_dynarray_append_typed(dynarray, type, element);
+util_dynarray_append(dynarray, element);
Finally, hand fixup cases that Coccinelle missed or incorrectly
translated, of which there were several because we can't use the
untyped append with a literal (since the sizeof won't do what you want).
All four steps are squashed to produce a single patch changing every
util_dynarray_append call site in tree to either drop a type parameter
(if possible) or insert a _typed suffix (if we can't infer). As such,
the final patch is best reviewed by hand even though it was
tool-assisted.
No Long Linguine Meals were involved in the making of this patch.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38038>
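/* Editor's sketch of the two forms described above (assumed variables, not
 * from the original source):
 *
 *    uint16_t v = 7;
 *    util_dynarray_append(&arr, v);                  // type inferred via __typeof__
 *    util_dynarray_append_typed(&arr, uint16_t, 7);  // a literal needs the explicit type
 */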
      util_dynarray_append_typed(states, uint16_t, 0);
      nir_algebraic_automaton(nir_def_instr(mov), states, table->pass_op_table);
   }

   /* Recurse through the uses updating the automaton's state. */
   nir_algebraic_update_automaton(nir_def_instr(val.src.ssa), algebraic_worklist,
                                  states, table->pass_op_table);

   /* Nothing uses the instr any more, so drop it out of the program. Note
    * that the instr may be in the worklist still, so we can't free it
    * directly.
    */
   assert(instr->instr.pass_flags == 0);
   instr->instr.pass_flags = 1;

   nir_instr_remove(&instr->instr);
   exec_list_push_tail(dead_instrs, &instr->instr.node);

   return true;
}

static bool
nir_algebraic_automaton(nir_instr *instr, struct util_dynarray *states,
                        const struct per_op_table *pass_op_table)
{
   switch (instr->type) {
   case nir_instr_type_alu: {
      nir_alu_instr *alu = nir_instr_as_alu(instr);
      nir_op op = alu->op;
      uint16_t search_op = nir_search_op_for_nir_op(op);
      const struct per_op_table *tbl = &pass_op_table[search_op];
      if (tbl->num_filtered_states == 0)
|
         return false;

      /* Calculate the index into the transition table. Note the index
       * calculated must match the iteration order of Python's
       * itertools.product(), which was used to emit the transition
       * table.
       */
      unsigned index = 0;
      for (unsigned i = 0; i < nir_op_infos[op].num_inputs; i++) {
         index *= tbl->num_filtered_states;
         if (tbl->filter)
            index += tbl->filter[*util_dynarray_element(states, uint16_t,
                                                        alu->src[i].src.ssa->index)];
      }

      uint16_t *state = util_dynarray_element(states, uint16_t,
                                              alu->def.index);
      if (*state != tbl->table[index]) {
         *state = tbl->table[index];
         return true;
      }
      return false;
   }

   case nir_instr_type_load_const: {
      nir_load_const_instr *load_const = nir_instr_as_load_const(instr);
      uint16_t *state = util_dynarray_element(states, uint16_t,
                                              load_const->def.index);
      if (*state != CONST_STATE) {
         *state = CONST_STATE;
         return true;
      }
      return false;
   }

   default:
      return false;
   }
}

static bool
nir_algebraic_instr(nir_builder *build, nir_instr *instr,
                    const nir_search_state *state,
                    const bool *condition_flags,
                    const nir_algebraic_table *table,
                    struct util_dynarray *states,
                    nir_instr_worklist *worklist,
                    struct exec_list *dead_instrs)
{
   if (instr->type != nir_instr_type_alu)
      return false;

   nir_alu_instr *alu = nir_instr_as_alu(instr);

   int xform_idx = *util_dynarray_element(states, uint16_t,
                                          alu->def.index);
   for (const struct transform *xform = &table->transforms[table->transform_offsets[xform_idx]];
        xform->condition_offset != ~0;
        xform++) {
      if (condition_flags[xform->condition_offset] &&
          nir_replace_instr(build, alu, state, states, table,
                            &table->values[xform->search].expression,
                            &table->values[xform->replace].value, worklist, dead_instrs)) {
         nir_invalidate_fp_analysis_state(state->range_ht);
nir/algebraic: improve is_unsigned_multiple_of_4 and use it more
fossil-db (gfx1201):
Totals from 160 (0.20% of 79839) affected shaders:
MaxWaves: 4008 -> 3952 (-1.40%)
Instrs: 390073 -> 379834 (-2.62%); split: -2.63%, +0.00%
CodeSize: 2126020 -> 2053740 (-3.40%); split: -3.40%, +0.00%
VGPRs: 9492 -> 9612 (+1.26%)
Latency: 6746019 -> 6723893 (-0.33%); split: -0.33%, +0.00%
InvThroughput: 849571 -> 848942 (-0.07%); split: -0.42%, +0.35%
VClause: 11977 -> 11983 (+0.05%); split: -0.20%, +0.25%
SClause: 11828 -> 11824 (-0.03%); split: -0.14%, +0.11%
Copies: 30003 -> 30938 (+3.12%); split: -0.09%, +3.20%
PreSGPRs: 8914 -> 8938 (+0.27%)
PreVGPRs: 7352 -> 7514 (+2.20%); split: -0.04%, +2.24%
VALU: 171829 -> 168829 (-1.75%); split: -1.76%, +0.01%
SALU: 66503 -> 66543 (+0.06%); split: -0.01%, +0.07%
VMEM: 29365 -> 25327 (-13.75%)
VOPD: 864 -> 1013 (+17.25%)
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36760>
2025-08-12 15:08:31 +01:00
         if (state->numlsb_ht->entries)
            _mesa_hash_table_clear(state->numlsb_ht, NULL);
         return true;
      }
   }

   return false;
}

bool
nir_algebraic_impl(nir_function_impl *impl,
                   const bool *condition_flags,
                   const nir_algebraic_table *table)
{
   bool progress = false;

   nir_builder build = nir_builder_create(impl);

   /* Note: it's important here that we're allocating a zeroed array, since
    * state 0 is the default state, which means we don't have to visit
    * anything other than constants and ALU instructions.
    */
   struct util_dynarray states = { 0 };
   if (!util_dynarray_resize(&states, uint16_t, impl->ssa_alloc)) {
treewide: Switch to nir_progress
Via the Coccinelle patch at the end of the commit message, followed by
sed -ie 's/progress = progress | /progress |=/g' $(git grep -l 'progress = prog')
ninja -C ~/mesa/build clang-format
cd ~/mesa/src/compiler/nir && clang-format -i *.c
agxfmt
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
-return prog;
+return nir_progress(prog, impl, metadata);
@@
expression prog_expr, impl, metadata;
@@
-if (prog_expr) {
-nir_metadata_preserve(impl, metadata);
-return true;
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-return false;
-}
+bool progress = prog_expr;
+return nir_progress(progress, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-nir_metadata_preserve(impl, prog ? (metadata) : nir_metadata_all);
-return prog;
+return nir_progress(prog, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-nir_metadata_preserve(impl, prog ? (metadata) : nir_metadata_all);
+nir_progress(prog, impl, metadata);
@@
expression impl, metadata;
@@
-nir_metadata_preserve(impl, metadata);
-return true;
+return nir_progress(true, impl, metadata);
@@
expression impl;
@@
-nir_metadata_preserve(impl, nir_metadata_all);
-return false;
+return nir_no_progress(impl);
@@
identifier other_prog, prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
-other_prog |= prog;
+other_prog = other_prog | nir_progress(prog, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+nir_progress(prog, impl, metadata);
@@
identifier other_prog, prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-other_prog = true;
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+other_prog = other_prog | nir_progress(prog, impl, metadata);
@@
expression prog_expr, impl, metadata;
identifier prog;
@@
-if (prog_expr) {
-nir_metadata_preserve(impl, metadata);
-prog = true;
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+bool impl_progress = prog_expr;
+prog = prog | nir_progress(impl_progress, impl, metadata);
@@
identifier other_prog, prog;
expression impl, metadata;
@@
-if (prog) {
-other_prog = true;
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+other_prog = other_prog | nir_progress(prog, impl, metadata);
@@
expression prog_expr, impl, metadata;
identifier prog;
@@
-if (prog_expr) {
-prog = true;
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+bool impl_progress = prog_expr;
+prog = prog | nir_progress(impl_progress, impl, metadata);
@@
expression prog_expr, impl, metadata;
@@
-if (prog_expr) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+bool impl_progress = prog_expr;
+nir_progress(impl_progress, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-nir_metadata_preserve(impl, metadata);
-prog = true;
+prog = nir_progress(true, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-}
-return prog;
+return nir_progress(prog, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-}
+nir_progress(prog, impl, metadata);
@@
expression impl;
@@
-nir_metadata_preserve(impl, nir_metadata_all);
+nir_no_progress(impl);
@@
expression impl, metadata;
@@
-nir_metadata_preserve(impl, metadata);
+nir_progress(true, impl, metadata);
squashme! sed -ie 's/progress = progress | /progress |=/g' $(git grep -l 'progress = prog')
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33722>
2025-02-24 15:10:33 -05:00
      return nir_no_progress(impl);
   }
   memset(states.data, 0, states.size);

   nir_fp_analysis_state range_ht = nir_create_fp_analysis_state(impl);

   struct hash_table numlsb_ht;
   _mesa_pointer_hash_table_init(&numlsb_ht, NULL);

   nir_search_state state;
   state.range_ht = &range_ht;
   state.numlsb_ht = &numlsb_ht;

   nir_instr_worklist worklist;
   nir_instr_worklist_init(&worklist);

   /* Walk top-to-bottom setting up the automaton state. */
   nir_foreach_block(block, impl) {
      nir_foreach_instr(instr, block) {
         nir_algebraic_automaton(instr, &states, table->pass_op_table);
      }
   }

   /* Put our instrs in the worklist such that we're popping the last instr
    * first. This will encourage us to match the biggest source patterns when
    * possible.
    */
   nir_foreach_block_reverse(block, impl) {
      nir_foreach_instr_reverse(instr, block) {
         instr->pass_flags = 0;
         if (instr->type == nir_instr_type_alu)
            nir_instr_worklist_push_tail(&worklist, instr);
      }
   }

   struct exec_list dead_instrs;
   exec_list_make_empty(&dead_instrs);

   nir_instr *instr;
   while ((instr = nir_instr_worklist_pop_head(&worklist))) {
|
      /* The worklist can have an instr pushed to it multiple times if it was
       * the src of multiple instrs that also got optimized, so make sure that
       * we don't try to re-optimize an instr we already handled.
       */
      if (instr->pass_flags)
         continue;

      progress |= nir_algebraic_instr(&build, instr,
                                      &state, condition_flags,
                                      table, &states, &worklist, &dead_instrs);
   }

   nir_instr_free_list(&dead_instrs);
   nir_instr_worklist_fini(&worklist);
nir/algebraic: improve is_unsigned_multiple_of_4 and use it more
fossil-db (gfx1201):
Totals from 160 (0.20% of 79839) affected shaders:
MaxWaves: 4008 -> 3952 (-1.40%)
Instrs: 390073 -> 379834 (-2.62%); split: -2.63%, +0.00%
CodeSize: 2126020 -> 2053740 (-3.40%); split: -3.40%, +0.00%
VGPRs: 9492 -> 9612 (+1.26%)
Latency: 6746019 -> 6723893 (-0.33%); split: -0.33%, +0.00%
InvThroughput: 849571 -> 848942 (-0.07%); split: -0.42%, +0.35%
VClause: 11977 -> 11983 (+0.05%); split: -0.20%, +0.25%
SClause: 11828 -> 11824 (-0.03%); split: -0.14%, +0.11%
Copies: 30003 -> 30938 (+3.12%); split: -0.09%, +3.20%
PreSGPRs: 8914 -> 8938 (+0.27%)
PreVGPRs: 7352 -> 7514 (+2.20%); split: -0.04%, +2.24%
VALU: 171829 -> 168829 (-1.75%); split: -1.76%, +0.01%
SALU: 66503 -> 66543 (+0.06%); split: -0.01%, +0.07%
VMEM: 29365 -> 25327 (-13.75%)
VOPD: 864 -> 1013 (+17.25%)
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36760>
2025-08-12 15:08:31 +01:00
   _mesa_hash_table_fini(&numlsb_ht, NULL);
   nir_free_fp_analysis_state(&range_ht);
   util_dynarray_fini(&states);
treewide: Switch to nir_progress
Via the Coccinelle patch at the end of the commit message, followed by
sed -ie 's/progress = progress | /progress |=/g' $(git grep -l 'progress = prog')
ninja -C ~/mesa/build clang-format
cd ~/mesa/src/compiler/nir && clang-format -i *.c
agxfmt
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
-return prog;
+return nir_progress(prog, impl, metadata);
@@
expression prog_expr, impl, metadata;
@@
-if (prog_expr) {
-nir_metadata_preserve(impl, metadata);
-return true;
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-return false;
-}
+bool progress = prog_expr;
+return nir_progress(progress, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-nir_metadata_preserve(impl, prog ? (metadata) : nir_metadata_all);
-return prog;
+return nir_progress(prog, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-nir_metadata_preserve(impl, prog ? (metadata) : nir_metadata_all);
+nir_progress(prog, impl, metadata);
@@
expression impl, metadata;
@@
-nir_metadata_preserve(impl, metadata);
-return true;
+return nir_progress(true, impl, metadata);
@@
expression impl;
@@
-nir_metadata_preserve(impl, nir_metadata_all);
-return false;
+return nir_no_progress(impl);
@@
identifier other_prog, prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
-other_prog |= prog;
+other_prog = other_prog | nir_progress(prog, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+nir_progress(prog, impl, metadata);
@@
identifier other_prog, prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-other_prog = true;
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+other_prog = other_prog | nir_progress(prog, impl, metadata);
@@
expression prog_expr, impl, metadata;
identifier prog;
@@
-if (prog_expr) {
-nir_metadata_preserve(impl, metadata);
-prog = true;
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+bool impl_progress = prog_expr;
+prog = prog | nir_progress(impl_progress, impl, metadata);
@@
identifier other_prog, prog;
expression impl, metadata;
@@
-if (prog) {
-other_prog = true;
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+other_prog = other_prog | nir_progress(prog, impl, metadata);
@@
expression prog_expr, impl, metadata;
identifier prog;
@@
-if (prog_expr) {
-prog = true;
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+bool impl_progress = prog_expr;
+prog = prog | nir_progress(impl_progress, impl, metadata);
@@
expression prog_expr, impl, metadata;
@@
-if (prog_expr) {
-nir_metadata_preserve(impl, metadata);
-} else {
-nir_metadata_preserve(impl, nir_metadata_all);
-}
+bool impl_progress = prog_expr;
+nir_progress(impl_progress, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-nir_metadata_preserve(impl, metadata);
-prog = true;
+prog = nir_progress(true, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-}
-return prog;
+return nir_progress(prog, impl, metadata);
@@
identifier prog;
expression impl, metadata;
@@
-if (prog) {
-nir_metadata_preserve(impl, metadata);
-}
+nir_progress(prog, impl, metadata);
@@
expression impl;
@@
-nir_metadata_preserve(impl, nir_metadata_all);
+nir_no_progress(impl);
@@
expression impl, metadata;
@@
-nir_metadata_preserve(impl, metadata);
+nir_progress(true, impl, metadata);
squashme! sed -ie 's/progress = progress | /progress |=/g' $(git grep -l 'progress = prog')
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33722>
2025-02-24 15:10:33 -05:00
   return nir_progress(progress, impl, nir_metadata_control_flow);
}