Found with Jasons new metadata rework (https://gitlab.freedesktop.org/mesa/mesa/merge_requests/950).
Fixes: af355aaa07 "nir: add nir_opt_move_load_ubo() optimization pass"
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Because the core principle of the vars_to_ssa pass is that it globally
(within a function) looks at all of the uses of a never-indirected path
and does a full into-SSA on that path, it can't handle a path which has
any chance of having aliasing. If a function_temp variable has a cast
or anything else which may cause aliasing, we have to assume that all
paths to that variable may alias and ignore the entire variable.
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We're about to change the meaning of get_deref_node returning NULL so we
need a non-NULL value to mean properly undefined.
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This lets passes easily detect derefs which have uses that fall outside
the standard load/store/copy pattern so they can bail appropriately.
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
When we inlined cf_node_has_side_effects into node_is_dead, all the
conditions flipped and we forgot to flip one. Fortunately, it doesn't
matter right now because no one uses this pass on shaders with more than
one function.
Fixes: b50465d197 "nir/dead_cf: Inline cf_node_has_side_effects"
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This mode is used by PhysicalStorageBufferEXT storage class.
Fixes: 8bdf5a008b "nir: Allow derefs to be used as phi sources"
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Silence two unused var warnings. And init elem_size, elem_align to
zero to silence "maybe uninitialized" warnings.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
It is possible and valid for a pointer to be selected based on a
conditional before used, and depending on the mode, those cases will
result in a phi with derefs as sources.
To achieve this, we don't rematerialize derefs that are used by phis.
As a consequence, when converting from SSA to regs, we may have phis
that come from different blocks and are used by phis. We now convert
those to regs too.
Validation was added to ensure only derefs of certain modes can be
used as phi sources. No extra validation is needed for the presence
of cast, any instruction that uses derefs will validate the
deref-chain is complete (ending in a cast or a var).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We scalarize IO to enable further optimizations, such as propagating
constant components across shaders, eliminating dead components, and
so on. This patch attempts to re-vectorize those operations after
the varying optimizations are done.
Intel GPUs are a scalar architecture, but IO operations work on whole
vec4's at a time, so we'd prefer to have a single IO load per vector
rather than 4 scalar IO loads. This re-vectorization can help a lot.
Broadcom GPUs, however, really do want scalar IO. radeonsi may want
this, or may want to leave it to LLVM. So, we make a new flag in the
NIR compiler options struct, and key it off of that, allowing drivers
to pick. (It's a bit awkward because we have per-stage settings, but
this is about IO between two stages...but I expect drivers to globally
prefer one way or the other. We can adjust later if needed.)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The difference between imov and fmov has been a constant source of
confusion in NIR for years. No one really knows why we have two or when
to use one vs. the other. The real reason is that they do different
things in the presence of source and destination modifiers. However,
without modifiers (which many back-ends don't have), they are identical.
Now that we've reworked nir_lower_to_source_mods to leave one abs/neg
instruction in place rather than replacing them with imov or fmov
instructions, we don't need two different instructions at all anymore.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Acked-by: Rob Clark <robdclark@chromium.org>
Unless source modifiers are present, fmov and imov are the same.
There's no good reason for having two helpers.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
It's potentially a tiny bit less efficient but the helpers make it much
easier to sort out the rules for updating source modifiers.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This flag has caused more confusion than good in most cases. You can
validly use imov for floats or fmov for integers because, without source
modifiers, neither modify their input in any way. Using imov for floats
is more reliable so we go that direction.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This just adds some split var splitting tests, it verifies
by counting derefs and local vars.
a basic load from inputs, store to array,
same as before but with a shader temp
struct { float } [4] don't split test
a basic load from inputs, with some out of band loads.
a load/store of only half the array
two level array, load from inputs store to all levels
a basic load from inputs with an indirect store to array.
two level array, indirect store to lvl 0
two level array, indirect store to lvl 1
load from inputs, store to array twice
load from input, store to array, load from array, store to another array.
load and from input and copy deref to array
create wildcard derefs, and do a copy
v2: use array_imm helpers, move derefs out of loops,
rename toplevel/secondlevel, use ints, fix lvl1 don't split test,
rename globabls to shader_temp, add comment, check the derefs type
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
When num_state_slots is 0, don't create the array. This was
triggering the following assert when running vkcube with
NIR_TEST_CLONE=1
vkcube: ../src/compiler/nir/nir_split_per_member_structs.c:66:
split_variable: Assertion `var->state_slots == NULL' failed.
Fixes: 9fbd390dd4 "nir: Add support for cloning shaders"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Returns the nir_const_value * with the representation of the NULL
pointer for each address format.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This is a simple 32-bit address which is not a global address. Gives
us a format that don't use 0 as its null pointer value. We will need
this in anv to represent nir_var_mem_shared addresses.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
An address format representing a purely logical addressing model. In
this model, all deref chains must be complete from the dereference
operation to the variable. Cast derefs are not allowed. These
addresses will be 32-bit scalars but the format is immaterial because
you can always chase the chain. E.g. push constants in anv.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
we validate assert entry just before this, but since that doesn't
stop execution, we need to check entry before the next validation
assert.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
If the SSA def produced by this instruction is only in the block in
which it is defined and is not used by ifs or phis, then we don't have
a reason to convert it to a register in
nir_lower_ssa_defs_to_regs_block().
The special case for derefs is covered by the general case, so can be
removed: at this point all derefs in the block are
materialized (i.e. the whole deref chain is in the block) and derefs
are not used in phis.
v2: Fix wrong check for if_uses. If there's such an use, the def is
not "local_to_block". (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
First, allow the case for negative powers of two. Then ensure that we
use the absolute value of the non-constant value to calculate the
quotient -- this was hinted in the code by the name 'uq'.
This fixes an issue when 'd' is positive and 'n' is negative. The
ishr will propagate the negative sign and we'll use nir_ineg() again,
incorrectly.
v2: First version used only ishr, but that isn't sufficient, since it
never can produce a zero as a result. (Jason)
Allow negative powers of two. (Caio)
Fixes: 74492ebad9 "nir: Add a pass for lowering integer division by constants"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This pass moves instructions around and adds control-flow in the
middle of blocks. We need to use nir_foreach_instr_safe to ensure that
we iterate over instructions correctly anyway.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 3bd5457641 ("nir: Add a lowering pass for non-uniform resource access")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This line is no longer relevant now that booleans are 1-bit, and in fact
causes issues (infinite progress loop between algebraic optimizations
and copy prop) with constant vector masks.
No shader-db changes on Intel platforms (Jason).
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Jason Ekstrand <jason.ekstrand@intel.com>
Obviously missing the instruction insertion into the SSA list.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 3bd5457641 ("nir: Add a lowering pass for non-uniform resource access")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The goal is to avoid having an extra MOV instruction to perform the
saturate. Doing the subtraction first allows the saturate to be applied
to the ADD instruction making the MOV unnecessary. Values generated in
different block and values from non-ALU instructions (e.g., texture
instructions) almost always need the extra MOV.
Multiply instructions are restricted because doing this rearrangement
can interfere with the generation of flrp and ffma instructions.
v2: Now that the final method has been selected, squash three commits
into one.
All Intel platforms has similar results. (Ice Lake shown)
total instructions in shared programs: 17223214 -> 17219386 (-0.02%)
instructions in affected programs: 1524376 -> 1520548 (-0.25%)
helped: 2686
HURT: 26
helped stats (abs) min: 1 max: 32 x̄: 1.44 x̃: 1
helped stats (rel) min: 0.03% max: 16.67% x̄: 0.54% x̃: 0.37%
HURT stats (abs) min: 1 max: 2 x̄: 1.69 x̃: 2
HURT stats (rel) min: 0.33% max: 1.67% x̄: 0.54% x̃: 0.35%
95% mean confidence interval for instructions value: -1.46 -1.36
95% mean confidence interval for instructions %-change: -0.56% -0.50%
Instructions are helped.
total cycles in shared programs: 360811571 -> 360791896 (<.01%)
cycles in affected programs: 103650214 -> 103630539 (-0.02%)
helped: 1557
HURT: 675
helped stats (abs) min: 1 max: 1773 x̄: 41.44 x̃: 16
helped stats (rel) min: <.01% max: 26.77% x̄: 1.37% x̃: 0.64%
HURT stats (abs) min: 1 max: 1513 x̄: 66.44 x̃: 14
HURT stats (rel) min: <.01% max: 46.16% x̄: 2.00% x̃: 0.49%
95% mean confidence interval for cycles value: -14.82 -2.81
95% mean confidence interval for cycles %-change: -0.50% -0.20%
Cycles are helped.
LOST: 2
GAINED: 0
Reviewed-by: Matt Turner <mattst88@gmail.com> [v1]
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
The value-range tracking pass that is coming is not clever enough to
know that the result of the ffma must be non-negative. Making it that
smart will require quite a bit of work. It might be possible to add a
special case that detects that a whole tree of fadd(fmul(fsat(a),
fneg(fsat(a))), 1.0) cannot be negative.
For cases when the comparison is used in the domain guard for a
square-root (see nir/algebraic: Simplify fsqrt domain guard), the
compare may be converted to a fmax. This patch also handles that case.
All of the affected cases are in DiRT: Showdown.
All Gen7+ platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17225365 -> 17225303 (<.01%)
instructions in affected programs: 40051 -> 39989 (-0.15%)
helped: 62
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 0.07% max: 0.66% x̄: 0.27% x̃: 0.26%
95% mean confidence interval for instructions value: -1.00 -1.00
95% mean confidence interval for instructions %-change: -0.31% -0.22%
Instructions are helped.
total cycles in shared programs: 360842788 -> 360842595 (<.01%)
cycles in affected programs: 1818081 -> 1817888 (-0.01%)
helped: 29
HURT: 22
helped stats (abs) min: 1 max: 206 x̄: 20.66 x̃: 14
helped stats (rel) min: <.01% max: 9.55% x̄: 0.87% x̃: 0.42%
HURT stats (abs) min: 1 max: 108 x̄: 18.45 x̃: 7
HURT stats (rel) min: <.01% max: 4.48% x̄: 0.56% x̃: 0.19%
95% mean confidence interval for cycles value: -14.48 6.91
95% mean confidence interval for cycles %-change: -0.71% 0.21%
Inconclusive result (value mean confidence interval includes 0).
No changes on any other Intel platform.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
Without this, adding an algebraic rule like
(('bcsel', ('flt', a, 0.0), 0.0, ...), ...),
will cause assertion failures inside nir_src_comp_as_float in
GTF-GL46.gtf21.GL.lessThan.lessThan_vec3_frag (and related tests) from
the OpenGL CTS and shaders/closed/steam/witcher-2/511.shader_test from
shader-db.
All of these cases have some code that ends up like
('bcsel', ('flt', a, 0.0), 'b@1', ...)
When the 'b@1' is tested, nir_src_comp_as_float fails because there's
no such thing as a 1-bit float.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
This change also enables a later change (nir/algebraic: Replace
1-fsat(a) with fsat(1-a)) to affect more shaders.
Almost all of the affected shaders are in Bioshock Infinite, and all of
those shaders all require GLSL 4.10.
All Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17228584 -> 17228376 (<.01%)
instructions in affected programs: 31438 -> 31230 (-0.66%)
helped: 105
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 1.98 x̃: 1
helped stats (rel) min: 0.08% max: 1.53% x̄: 0.73% x̃: 0.70%
95% mean confidence interval for instructions value: -2.20 -1.76
95% mean confidence interval for instructions %-change: -0.80% -0.67%
Instructions are helped.
total cycles in shared programs: 360936431 -> 360935690 (<.01%)
cycles in affected programs: 420100 -> 419359 (-0.18%)
helped: 71
HURT: 21
helped stats (abs) min: 1 max: 160 x̄: 19.28 x̃: 10
helped stats (rel) min: <.01% max: 9.78% x̄: 0.95% x̃: 0.48%
HURT stats (abs) min: 1 max: 198 x̄: 29.90 x̃: 10
HURT stats (rel) min: 0.05% max: 8.36% x̄: 1.24% x̃: 0.90%
95% mean confidence interval for cycles value: -16.77 0.66
95% mean confidence interval for cycles %-change: -0.85% -0.06%
Inconclusive result (value mean confidence interval includes 0).
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
This doesn't make any real difference now, but future work (not in this
series) will add a LOT of ffma patterns. Having to duplicate all of
them for ffma(a, b, c) and ffma(b, a, c) is just terrible.
No shader-db changes on any Intel platform.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
v2: Instead of handling 3 sources as a special case, generalize with
loops to N sources. Suggested by Jason.
v3: Further generalize by only checking that number of sources is >= 2.
Suggested by Jason.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The meaning of the new name is that the first two sources are
commutative. Since this is only currently applied to two-source
operations, there is no change.
A future change will mark ffma as 2src_commutative.
It is also possible that future work will add 3src_commutative for
opcodes like fmin3.
v2: s/commutative_2src/2src_commutative/g. I had originally considered
this, but I discarded it because I did't want to deal with identifiers
that (should) start with 2. Jason suggested it in review, so we decided
that _2src_commutative would be used in nir_opcodes.py. Also add some
comments documenting what 2src_commutative means. Also suggested by
Jason.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The current SSA def validation we do in nir_validate validates three
things:
1. That each SSA def is only ever used in the function in which it is
defined.
2. That an nir_src exists in an SSA def's use list if and only if it
points to that SSA def.
3. That each nir_src is in the correct use list (uses or if_uses) based
on whether it's an if condition or not.
The way we were doing this before was that we had a hash table which
provided a map from SSA def to a small ssa_def_validate_state data
structure which contained a pointer to the nir_function_impl and two
hash sets, one for each use list. This meant piles of allocation and
creating of little hash sets. It also meant one hash lookup for each
SSA def plus one per use as well as two per src (because we have to look
up the ssa_def_validate_state and then look up the use.) It also
involved a second walk over the instructions as a post-validate step.
This commit changes us to use a single low-collision hash set of SSA
sources for all of this by being a bit more clever. We accomplish the
objectives above as follows:
1. The list is clear when we start validating a function. If the
nir_src references an SSA def which is defined in a different
function, it simply won't be in the set.
2. When validating the SSA defs, we walk the uses and verify that they
have is_ssa set and that the SSA def points to the SSA def we're
validating. This catches the case of a nir_src being in the wrong
list. We then put the nir_src in the set and, when we validate the
nir_src, we assert that it's in the set. This takes care of any
cases where a nir_src isn't in the use list. After checking that
the nir_src is in the set, we remove it from the set and, at the end
of nir_function_impl validation, we assert that the set is empty.
This takes care of any cases where a nir_src is in a use list but
the instruction is no longer in the shader.
3. When we put a nir_src in the set, we set the bottom bit of the
pointer to 1 if it's the condition of an if. This lets us detect
whether or not a nir_src is in the right list.
When running shader-db with an optimized debug build of mesa on my
laptop, I get the following shader-db CPU times:
With NIR_VALIDATE=0 3033.34 seconds
Before this commit 20224.83 seconds
After this commit 6255.50 seconds
Assuming shader-db is a representative sampling of GLSL shaders, this
means that making this change yields an 81% reduction in the time spent
in nir_validate. It still isn't cheap but enabling validation now only
increases compile times by 2x instead of 6.6x.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
All of our hash tables and sets are already using ralloc. There's
really no good reason why we don't just make a ralloc context rather
than try to remember to clean everything up manually.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
The nested fma calls were supposed to implement
x_new = x + x * (1 - x*src),
but instead current code is equivalent to
x_new = x - x * (1 - x*src).
The result is that Newton-Raphson steps don't improve precision at all.
This patch fixes this problem.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110435
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This represents a float vec4 constant color, as passed to glBlendColor.
While the existing 4 shader sysvals are retained to minimize code churn,
a single vectorized intrinsic is required for efficient blending on
vector architectures. (This may also apply to archictectures like
Bifrost where ALU is scalar but load/store is vector; it largely depends
on how blending is implemented per-driver.)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
This can be used by both etnaviv and freedreno/a2xx as they are both vec4
architectures with some instructions being scalar-only.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>