It was double-counting cases where multiple variables were assigned to
the same slot, and not handling the case where the last variable is a
compact variable.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
These are used in Vulkan for clip/cull distances, instead of the GLSL
lowering when the clip/cull arrays are shared.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
It isn't really doing anything Gallium-specific, and it's needed for
handling component packing, overlapping, etc.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
load_fragcoord is already handled in common code for radeonsi, so we
don't need to do anything to handle it. However, there were some passes
creating NIR with the varying, so we switch them over to the sysval. In
the case of nir_lower_input_attachments which is used by both radv and
anv, we add handling for both until intel switches to using a sysval.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Equivalent to the already existing ir_variable is_in_buffer_block and
is_in_shader_storage_block, adding the uniform buffer object one. I'm
using the short forms (ssbo, ubo) to avoid having method names too
long.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
No shader-db change on any Intel platform. No shader-db run-time
difference on a certain 36-core / 72-thread system at 95% confidence
(n=20).
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
There is no reason to mark the fmul in the expression
('fmul', ('fadd', a, b), ('fadd', a, b))
as commutative. If a source of an instruction doesn't match one of the
('fadd', a, b) patterns, it won't match the other either.
This change is enough to make this pattern work:
('~fadd@32', ('fmul', ('fadd', 1.0, ('fneg', a)),
('fadd', 1.0, ('fneg', a))),
('fmul', ('flrp', a, 1.0, a), b))
This pattern has 5 commutative expressions (versus a limit of 4), but
the first fmul does not need to be commutative.
No shader-db change on any Intel platform. No shader-db run-time
difference on a certain 36-core / 72-thread system at 95% confidence
(n=20).
There are more subpatterns that could be marked as non-commutative, but
detecting these is more challenging. For example, this fadd:
('fadd', ('fmul', a, b), ('fmul', a, c))
The first fadd:
('fmul', ('fadd', a, b), ('fadd', a, b))
And this fadd:
('flt', ('fadd', a, b), 0.0)
This last case may be easier to detect. If all sources are variables
and they are the only instances of those variables, then the pattern can
be marked as non-commutative. It's probably not worth the effort now,
but if we end up with some patterns that bump up on the limit again, it
may be worth revisiting.
v2: Update the comment about the explicit "len(self.sources)" check to
be more clear about why it is necessary. Requested by Connor. Many
Python fixes style / idom fixes suggested by Dylan. Add missing (!!!)
opcode check in Expression::__eq__ method. This bug is the reason the
expected number of commutative expressions in the bitfield_reverse
pattern changed from 61 to 45 in the first version of this patch.
v3: Use all() in Expression::__eq__ method. Suggested by Connor.
Revert away from using __eq__ overloads. The "equality" implementation
of Constant and Variable needed for commutativity pruning is weaker than
the one needed for propagating and validating bit sizes. Using actual
equality caused the pruning to fail for my ('fmul', ('fadd', 1, a),
('fadd', 1, a)) case. I changed the name to "equivalent" rather than
the previous "same_as" to further differentiate it from __eq__.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
Search patterns that are expected to have too many (e.g., the giant
bitfield_reverse pattern) can be added to a white list.
This would have saved me a few hours debugging. :(
v2: Implement the expected-failure annotation as a property of the
search-replace pattern instead of as a property of the whole list of
patterns. Suggested by Connor.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
The bfi/bfm behavior change replaced the bfi/bfm usage in
lower_bitfield_insert_to_shifts with actual shifts like the name says,
but it failed to handle the offset=0, bits==32 case in the new
lowering.
v2: Use 31 < bits instead of bits == 32, to get the 31 < (iand bits,
31) -> false optimization.
Fixes regressions in dEQP-GLES31.*bitfield_insert* on freedreno.
Fixes: 165b7f3a44 ("nir: define behavior of nir_op_bfm and nir_op_u/ibfe according to SM5 spec.")
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
The helpers are needed so we can use the syntax `instr(cond)` in the
algebraic rules. Add simple rule for dropping a pair of mul-div of
the same value when wrapping is guaranteed to not happen.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
They indicate the operation does not cause overflow or underflow.
This is motivated by SPIR-V decorations NoSignedWrap and
NoUnsignedWrap.
Change the storage of `exact` to be a single bit, so they pack
together.
v2: Handle no_wrap in nir_instr_set. (Karol)
v3: Use two separate flags, since the NIR SSA values and certain
instructions are typeless, so just no_wrap would be insufficient
to know which one was referred to. (Connor)
v4: Don't use nir_instr_set to propagate the flags, unlike `exact`,
consider the instructions different if the flags have different
values. Fix hashing/comparing. (Jason)
Reviewed-by: Karol Herbst <kherbst@redhat.com> [v1]
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
There doesn't seem to be any reason to keep these opcodes around:
* fnot/fxor are not used at all.
* fand/for are only used in lower_alu_to_scalar, but easily replaced
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
We can vectorize instructions with different constant sources by creating
a new load_const and using that.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
This lets us use the optimization pattern
(('ult', 31, ('iand', b, 31)), False) to remove the
bcsel instruction for code originating in D3D shaders.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
The [iu]bfe and bfm instructions are defined to only use the five
least significant bits.
This optimizes a common pattern from D3D -> SPIR-V translation.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
That is: the five least significant bits provide the values of
'bits' and 'offset' which is the case for all hardware currently
supported by NIR and using the bfm/bfe instructions.
This patch also changes the lowering of bitfield_insert/extract
using shifts to not use bfm and removes the flag 'lower_bfm'.
Tested-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
These optimizations are based on the fact that
'and(a,b) <= umin(a,b)'.
For AMD, this series moves the optimization from LLVM to NIR,
so currently no vkpipeline-db changes here.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
We don't expect the output of a TXS instruction to be wider than a
vec3. Add an assert() to make sure this never happens.
Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Most places in NIR, we treat matrices like arrays. The one annoying
exception to this has been nir_constant where a matrix is a first-class
thing. This commit changes that so a matrix nir_constant is the same as
an array nir_constant. This makes matrix nir_constants a tiny bit more
expensive but shrinks all others by 96B.
Reviewed-by: Karol Herbst <kherbst@redhat.com>
The spec explicitly says that volatile writes can't be removed and
volatile reads do not guarantee that the same value will still be around
after the read, as if there were a barrier after each read/write. Just
ignore them.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
We were completely ignoring these before, except for putting them on
variables. While we're here, don't set access qualifiers when converting
to bindless since glsl_to_nir will already have set a more accurate
qualifier that includes any qualifiers on struct members that are
dereferenced.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
In the next commit, we'll properly handle access qualifiers on struct
members by propagating them to load/store instructions, but these
instructions had no way to specify the qualifier.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
This effectively does the opposite of nir_lower_alus_to_scalar, trying
to combine per-component ALU operations with the same sources but
different swizzles into one larger ALU operation. It uses a similar
model as CSE, where we do a depth-first approach and keep around a hash
set of instructions to be combined, but there are a few major
differences:
1. For now, we only support entirely per-component ALU operations.
2. Since it's not always guaranteed that we'll be able to combine
equivalent instructions, we keep a stack of equivalent instructions
around, trying to combine new instructions with instructions on the
stack.
The pass isn't comprehensive by far; it can't handle operations where
some of the sources are per-component and others aren't, and it can't
handle phi nodes. But it should handle the more common cases, and it
should be reasonably efficient.
[Alyssa: Rebase on latest master, updating with respect to typeless
moves]
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
The V3D driver has an open-coded solution for this, and we need the
same thing for Panfrost, so let's add a generic way to lower TXS(LOD)
into max(TXS(0) >> LOD, 1).
Changes in v2:
* Use == 0 instead of !
* Rework the minification logic as suggested by Jason
* Assign cursor pos at the beginning of the function
* Patch the LOD just after retrieving the old value
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
get_texture_size() will create a txs instruction with ->sampler_dim set
to the original tex->sampler_dim. The condition to call lower_rect()
only checks the value of ->sampler_dim and whether lower_rect is
requested or not. This leads to an infinite loop when calling
nir_lower_tex() with the same options until it returns false.
In order to avoid that, let's move the tex->sampler_dim patching before
get_texture_size() is called. This way the txs instruction will have
->sampler_dim set to GLSL_SAMPLER_DIM_2D and nir_lower_tex() won't try
to lower it on the subsequent passes.
Changes in v2:
* Add Jason R-b
* Add a comment explaining why we patch ->sampler_dim at the beginning
of the lower_rect() func
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
The code considers that projector lowering was done even if it's not
really the case. Change the project_src() prototype to return a bool
encoding whether projector lowering happened or not and update the
progress var accordingly in nir_lower_tex_block().
---
Changes in v2:
* Add Jason R-b
* Drop the part suggesting that nir_lower_rect() could be called in
a do-while(progress) loop.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
This fixes a rebase fail in ea51275e07, and prevents it from happening
again. There's no reason to do this manually.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The rules added in patch 3addd7c are inverted:
It should be:
(al * bh) << 16 + c
instead of:
(ah * bl) << 16 + c
Fixes a number of regressions under
dEQP-GLES31.functional.draw_indirect.compute_interop.large.*
on Freedreno.
Reviewed-by: Rob Clark <robdclark@gmail.com>
Fixes: cd73b6174b "nir/lower_to_source_mods: Stop turning add, sat, and neg into mov"
Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
For umul_low (al * bl), zero is returned if the low 16-bits word of either
source is zero.
for imadsh_mix16 (ah * bl << 16 + c), c is returned if either 'ah' or 'bl'
is zero.
A couple of nir_search_helpers are added:
is_upper_half_zero() returns true if the highest word of all components of
an integer NIR alu src are zero.
is_lower_half_zero() returns true if the lowest word of all components of
an integer nir alu src are zero.
Reviewed-by: Eric Anholt <eric@anholt.net>
'umul_low' is the low 32-bits of unsigned integer multiply. It maps
directly to ir3's MULL_U.
'imadsh_mix16' is multiply add with shift and mix, an ir3 specific
instruction that maps directly to ir3's IMADSH_M16.
Both are necessary for the lowering of integer multiplication on
Freedreno, which will be introduced later in this series.
Reviewed-by: Eric Anholt <eric@anholt.net>
We originally had a single lower_fmod option. In commit 2ab2d2e5, Sam
split 32 and 64-bit lowering into separate flags, with the rationale
that some drivers might want different options there. This left 16-bit
unhandled, so Iago added a lower_fmod16 option in commit ca31df6f.
Now that lower_fmod64 is gone (in favor of nir_lower_doubles and
nir_lower_dmod), we re-combine lower_fmod16 and lower_fmod32 into a
single lower_fmod flag again. I'm not aware of any hardware which
need lowering for one bitsize and not the other.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
nir_lower_doubles offers a wide variety of fp64 lowering, including
lowering fmod@64. The version there also better handles imprecisions
due to lowered frcp@64. Let's consolidate on one version.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Instead, we add a new helper which stomps one nir_shader and replaces it
with another. The new helper effectively just changes which pointer
gets used for the base nir_shader. It should be 99% as good at testing
cloning but without requiring that everything handle having the shader
swapped out from under it constantly.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108957
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Rob Clark <robdclark@chromium.org>
This pattern was noticed in glmark's jellyfish scene.
v2: Add inexact qualifier due to NaN behaviour.
Minimal shader-db changes (slightly helped).
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Elie Tournier <tournier.elie@gmail.com>