This will make the options visible under nir_shader->options, which
will allow the gallium state_tracker to use the driver's preferred
settings during glsl_to_nir.
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
I noticed this while looking at a shader that was affected by Tim's
"more loop unrolling" series.
In review, Tim Arceri asked:
> Why the hurt on Gen6+ is this something that should be in the late
> optimisations pass?
As far as I can tell, it's just because our scheduler is terrible. In
all the fragment shaders that I looked at (some hurt shaders were from
other stages), only one of the SIMD8 or SIMD16 versions would be hurt.
In many of those cases, the other SIMD width is improved (e.g.,
shaders/closed/steam/brutal-legend/3990.shader_test).
Often it looks like the scheduler decides to schedule a SEND that occurs
somewhere early in the shader differently. Once that happens, everything
is different.
I looked at one vertex shader that was hurt (from Goat Simulator). In
that case, both the floor and fract are used. The optimization
eliminates the add, and it should allow better scheduling. In the area
of the FRC and RNDD instructions, the scheduler does the right thing.
However, later in the shader a MAD and an ADD get scheduled
differently, and that makes it slightly worse.
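As a hedged illustration (assuming the optimization in question is the
a - floor(a) -> fract(a) rewrite; the helper below is hypothetical, not
taken from the shader):
#include <math.h>
/* When a shader needs both pieces, the subtract can be dropped while the
 * floor stays. */
static void
split_float(float a, float *int_part, float *frac_part)
{
   *int_part  = floorf(a);     /* RNDD on Intel hardware */
   *frac_part = a - *int_part; /* this FADD can collapse into a single FRC */
}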
In light of this, I tried adding some "is_used_once" mark-up, and that
did not fix all the cycles regressions. It also did a lot more harm
than good on SKL (helped 82 vs. hurt 241).
All Gen6+ platforms had similar results. (Skylake shown)
total instructions in shared programs: 15437001 -> 15435259 (-0.01%)
instructions in affected programs: 213651 -> 211909 (-0.82%)
helped: 988
HURT: 0
helped stats (abs) min: 1 max: 27 x̄: 1.76 x̃: 1
helped stats (rel) min: 0.15% max: 11.54% x̄: 1.14% x̃: 0.59%
95% mean confidence interval for instructions value: -1.89 -1.63
95% mean confidence interval for instructions %-change: -1.23% -1.05%
Instructions are helped.
total cycles in shared programs: 383007378 -> 382997063 (<.01%)
cycles in affected programs: 1650825 -> 1640510 (-0.62%)
helped: 679
HURT: 302
helped stats (abs) min: 1 max: 348 x̄: 23.39 x̃: 14
helped stats (rel) min: 0.04% max: 28.77% x̄: 1.61% x̃: 0.98%
HURT stats (abs) min: 1 max: 250 x̄: 18.43 x̃: 7
HURT stats (rel) min: 0.04% max: 25.86% x̄: 1.41% x̃: 0.53%
95% mean confidence interval for cycles value: -13.05 -7.98
95% mean confidence interval for cycles %-change: -0.86% -0.50%
Cycles are helped.
Iron Lake and GM45 had similar results. (GM45 shown)
total instructions in shared programs: 5043616 -> 5043010 (-0.01%)
instructions in affected programs: 119691 -> 119085 (-0.51%)
helped: 432
HURT: 0
helped stats (abs) min: 1 max: 27 x̄: 1.40 x̃: 1
helped stats (rel) min: 0.10% max: 8.11% x̄: 0.66% x̃: 0.39%
95% mean confidence interval for instructions value: -1.58 -1.23
95% mean confidence interval for instructions %-change: -0.72% -0.59%
Instructions are helped.
total cycles in shared programs: 128139812 -> 128135762 (<.01%)
cycles in affected programs: 3829724 -> 3825674 (-0.11%)
helped: 602
HURT: 0
helped stats (abs) min: 2 max: 486 x̄: 6.73 x̃: 6
helped stats (rel) min: 0.02% max: 4.85% x̄: 0.19% x̃: 0.10%
95% mean confidence interval for cycles value: -8.40 -5.05
95% mean confidence interval for cycles %-change: -0.22% -0.16%
Cycles are helped.
Reviewed-by: Elie Tournier <tournier.elie@gmail.com>
I have not investigated the result of doing this during code
generation. That should be possible, but it would be a bit more
effort.
All Gen6+ platforms had nearly identical results. (Skylake shown)
total cycles in shared programs: 370961508 -> 370961367 (<.01%)
cycles in affected programs: 5174 -> 5033 (-2.73%)
helped: 2
HURT: 0
Iron Lake and GM45 had similar results. (Iron Lake shown)
total instructions in shared programs: 8206587 -> 8206589 (<.01%)
instructions in affected programs: 1325 -> 1327 (0.15%)
helped: 0
HURT: 2
total cycles in shared programs: 187657422 -> 187657428 (<.01%)
cycles in affected programs: 11566 -> 11572 (0.05%)
helped: 0
HURT: 2
This change has almost no effect right now. However, removing this
patch (but leaving the patch "intel/fs: Generate if instructions with
inverted conditions") after adding a patch that removes !(a < b) -> (a
>= b) optimizations (like
https://patchwork.freedesktop.org/patch/264787/) has the following
results on Skylake:
Skylake
total instructions in shared programs: 15071804 -> 15071806 (<.01%)
instructions in affected programs: 640 -> 642 (0.31%)
helped: 0
HURT: 2
total cycles in shared programs: 369914348 -> 369916569 (<.01%)
cycles in affected programs: 27900 -> 30121 (7.96%)
helped: 4
HURT: 15
helped stats (abs) min: 2 max: 112 x̄: 30.00 x̃: 3
helped stats (rel) min: 0.28% max: 12.28% x̄: 3.34% x̃: 0.40%
HURT stats (abs) min: 2 max: 758 x̄: 156.07 x̃: 81
HURT stats (rel) min: 0.20% max: 74.30% x̄: 16.29% x̃: 16.91%
95% mean confidence interval for cycles value: 12.68 221.11
95% mean confidence interval for cycles %-change: 3.09% 21.23%
Cycles are HURT.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Unlike the direct case, indirect array derefs of vectors are handled
like regular derefs, with the exception that we ignore any vector entry
that has SSA values when performing a load. Such SSA values don't help
loading the indirect unless we emit an if-ladder.
Copy_derefs are supported for indirects.
Also enable two tests that now pass.
v2: Remove unnecessary temporaries. Be clearer when identifying the
case where copy_entry doesn't help when we are dealing with an
indirect array_deref (of a vector). (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When looking up an entry to use, always prefer an equal match, as it is
more likely to contain reusable SSA values or derefs to propagate.
This will be necessary when adding entries with array derefs of
vectors, because we don't want the whole-vector entry when an equal
entry (an array deref of that vector) is present.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Both on an actual array and on a vector, and an extra test on a vector
mixing direct and indirect access. The vector tests are disabled and
will be enabled by a later commit.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When a direct array deref is used on a vector type (for loads and
stores), copy_prop_vars is now smart enough to propagate values it knows
about.
Given a 'vec4 v', storing to v[3] will update the copy entry for v,
since it is equivalent to a write to v.w. Loading from v[1] will first
check whether there's a known value for v.y -- and drop the load in that
case.
The copy entries still always refer to the entire vectors, so the
operations happen on the parent deref (the 'vector') and the values
are fixed accordingly.
It might now be the case that certain entries not only have different
SSA defs in each element, but those defs may also come from components
other than the ones they are set to, because stores to individual
elements always come from an SSA definition with a single component.
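A minimal sketch of the bookkeeping described above (hypothetical names
and types, not the actual pass code):
/* One copy entry for a vec4 variable: a known value per channel, or NULL. */
struct entry_sketch {
   void *channel[4];   /* stands in for nir_ssa_def*, indexed x..w */
};
/* store v[3] = x  ->  only the .w slot of the entry changes */
static void
store_channel(struct entry_sketch *e, unsigned index, void *value)
{
   e->channel[index] = value;
}
/* load v[1]  ->  reuse the known .y value if we have one; a null result
 * means the load has to stay */
static void *
try_load_channel(const struct entry_sketch *e, unsigned index)
{
   return e->channel[index];
}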
Tests related to these cases are now enabled.
v2: Instead of asserting on invalid indices, "load" an undef and
remove the store. (Jason)
v3: Merge code path for the cases of is_array_deref_of_vector into the
regular code path. Add a base_index parameter to
value_set_from_value. (code changes by Jason)
v4: Removed the get_entry_for_deref helper, now being used only once.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Also replace uses of 0xf with the appropriate full mask created from
the number of components.
Note that an increase of MAX might make us change how the data is
stored later on, but for now at least we make sure the pass is not
hardcoded.
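For instance, a minimal sketch of the mask computation (hypothetical
helper name):
/* Full write mask for an N-component value; N == 4 gives the old 0xf. */
static unsigned
full_mask(unsigned num_components)
{
   return (1u << num_components) - 1;
}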
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The name reflected this function's role back when the pass also did dead
write elimination. So rename it to reflect what it does now, which is
setting a value using another value, and narrow the argument list.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When emitting a branch in a block, it does not make sense to continue
processing further instructions, as they will not be reachable.
This fixes a nasty case with a loop containing a branch whose then-part
and else-part both exit the loop:
%1 = OpLabel
OpLoopMerge %2 %3 None
OpBranchConditional %false %2 %2
%3 = OpLabel
OpBranch %1
%2 = OpLabel
[...]
We know that block %1 will always branch to block %2, which is the merge
block for the loop, and thus a break is emitted. If we keep processing
further instructions, we will process the branch conditional and emit the
corresponding NIR conditional, which leads to instructions after the
break.
This fixes dEQP-VK.graphicsfuzz.continue-and-merge.
CC: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Some types of params such as some builtins are always padded. We
need to keep track of this so we can restore the list correctly.
Here we also remove a couple of cache entries that are not actually
required as they get rebuilt by the _mesa_add_parameter() calls.
This patch fixes a bunch of arb_texture_multisample and
arb_sample_shading piglit tests for the radeonsi NIR backend.
Fixes: edded12376 ("mesa: rework ParameterList to allow packing")
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This is useful for r600, where the abs source modifier is not supported
for ops with three sources.
v2: Use correct logic to enable lowering to abs source mod (Eric Anholt)
Signed-off-by: Gert Wollny <gw.fossdev@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
The memory layout associated with this format would be:
Byte:      0  1  2  3
Component: V  U  Y  X
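As a hedged illustration (assuming the bytes above describe the memory
order of a little-endian-packed 32-bit word; pack_vuyx is a hypothetical
helper, not code from the patch):
#include <stdint.h>
static uint32_t
pack_vuyx(uint8_t v, uint8_t u, uint8_t y, uint8_t x)
{
   /* byte 0 = V, byte 1 = U, byte 2 = Y, byte 3 = X (padding) */
   return (uint32_t)v | ((uint32_t)u << 8) |
          ((uint32_t)y << 16) | ((uint32_t)x << 24);
}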
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Fixes following valgrind warning:
==27561== Conditional jump or move depends on uninitialised value(s)
==27561== at 0x667856B: value_set_ssa_components (nir_opt_copy_prop_vars.c:78)
==27561== by 0x667A1C4: copy_prop_vars_block (nir_opt_copy_prop_vars.c:797)
Fixes: 62332d139c "nir: Add a local variable-based copy propagation pass"
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The nir_builder swizzling improvement to not emit extra MOVs resulted in
nir_lower_tex() trying to rewrite an SSA def to itself, triggering the
assert on all texturing in v3d. There's no work to be done in this case,
so just stop asserting.
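A minimal sketch of the guard (hypothetical helper, not the exact
nir_lower_tex change):
#include "nir.h"
static void
rewrite_if_different(nir_ssa_def *def, nir_ssa_def *replacement)
{
   if (def == replacement)
      return;   /* no-op swizzle: nothing to rewrite */
   nir_ssa_def_rewrite_uses(def, nir_src_for_ssa(replacement));
}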
Fixes: 743700be1f ("nir/builder: Don't emit no-op swizzles")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This is a common pattern from HLSL->SPIR-V translation and is supported
in HW by all current NIR backends.
vkpipeline-db results on anv (SKL):
total instructions in shared programs: 6403130 -> 6402380 (-0.01%)
instructions in affected programs: 204084 -> 203334 (-0.37%)
helped: 208
HURT: 0
total cycles in shared programs: 1915629582 -> 1918198408 (0.13%)
cycles in affected programs: 1158892682 -> 1161461508 (0.22%)
helped: 107
HURT: 86
shader-db results on i965 (KBL):
total instructions in shared programs: 15284592 -> 15284568 (<.01%)
instructions in affected programs: 81683 -> 81659 (-0.03%)
helped: 24
HURT: 0
total cycles in shared programs: 375013622 -> 375013932 (<.01%)
cycles in affected programs: 40169618 -> 40169928 (<.01%)
helped: 13
HURT: 9
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
SPIR-V shifts are undefined for values >= bitsize, but SM5 shifts
are defined to only use the least significant bits.
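For instance, a hedged sketch of the SM5 behaviour for a 32-bit left
shift (not the actual lowering code):
#include <stdint.h>
static uint32_t
sm5_shl(uint32_t x, uint32_t count)
{
   /* Only the five least significant bits of the count participate, so
    * the result is defined for any count value. */
   return x << (count & 31);
}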
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
apply_implicit_conversion only converts and checks base types, but we
need actual type equality for function returns; otherwise you can return
a vec2 from a function declared as returning a float.
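For example (a hypothetical shader that must now be rejected):
/* Only the base type (float) matches here; the full type (vec2 vs.
 * float) does not, so this return is invalid. */
static const char *bad_glsl =
   "float f() {\n"
   "   return vec2(1.0);\n"
   "}\n";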
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
The nir_swizzle helper is used a bit on its own, but it's also called by
nir_channel and nir_channels, which are used everywhere. It's pretty
quick to check while we're walking the swizzle anyway whether or not
it's an identity swizzle. If it is, we now don't bother emitting the
instruction. Sure, copy-prop will clean it up for us but there's no
sense making more work for the optimizer than we have to.
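A minimal sketch of the check (hypothetical helper, not the nir_builder
code itself):
#include <stdbool.h>
static bool
swizzle_is_identity(const unsigned swiz[], unsigned num_channels,
                    unsigned src_components)
{
   /* An identity swizzle reads every source channel, in order. */
   if (num_channels != src_components)
      return false;
   for (unsigned i = 0; i < num_channels; i++) {
      if (swiz[i] != i)
         return false;
   }
   return true;
}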
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Test using array deref on vectors in loads and stores. These are
marked DISABLED_ as this optimization is currently not done.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Replace find_next_intrinsic(intrinsic, after) with
get_intrinsic(intrinsic, index). This makes it slightly more convenient
to check the resulting loads/stores/copies, since in most tests we
know which one we care about. The cost is to perform more traversals,
but for such tests this is not a problem.
Added the ASSERT_EQ() on count to some tests missing it, so the
indices queried are always expected to find something.
Also, drop two leftover nir_print_shader calls in a test.
v2: Remove redundant assertions. nir_src_comp_as_uint already
assert what we need. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When a copy_entry is SSA, store not only the nir_ssa_def* for each
component, but also the source component they come from. At the
moment this is always a match (i.e. 'component[i] == i'), because all
the operations for a copy_entry happen using definitions with the same
size. This prepares the code for array_derefs of vectors, in which
'component[i] != i'.
Also, extract setting all SSA components into a function of its own.
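A hedged sketch of the resulting layout and helper (field and function
names approximate, not the exact code in the pass):
#include <stdint.h>
/* For an SSA-backed copy_entry, each destination channel remembers both
 * the def that provides it and which component of that def it comes from. */
struct value_sketch {
   void    *def[4];        /* stands in for nir_ssa_def*, one per channel */
   uint8_t  component[4];  /* source channel inside def[i]; today == i    */
};
/* "Set all SSA components" helper mentioned above, in sketch form. */
static void
value_set_all(struct value_sketch *v, void *def, unsigned num_components)
{
   for (unsigned i = 0; i < num_components; i++) {
      v->def[i] = def;
      v->component[i] = (uint8_t)i;
   }
}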
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Disabled by default, to be used during development. Adding those
so I don't rewrite some ad-hoc version of them every time I'm working
with this pass.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
For now these derefs are not handled, so don't let these get into the
copies list -- which would cause wrong propagations. For load_derefs,
do nothing. For store_derefs, invalidate whatever the store is
writing to. For copy_derefs, invalidate whatever the copy is writing
to.
These cases will happen once derefs to SSBOs/UBOs are kept around long
enough to get optimized by copy_prop_vars.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Rather than only lowering if all srcs are scalarizable, we instead
check that at least one src is scalarizable.
We change the undef type to return false, otherwise it will cause
regressions when it is the only scalarizable src.
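A hedged sketch of the new condition (hypothetical helper, not the pass
code itself):
#include <stdbool.h>
static bool
should_lower(const bool src_scalarizable[], unsigned num_srcs)
{
   /* Lower as soon as one src is scalarizable, instead of requiring all
    * of them to be. */
   for (unsigned i = 0; i < num_srcs; i++) {
      if (src_scalarizable[i])
         return true;
   }
   return false;
}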
total instructions in shared programs: 13219105 -> 13024547 (-1.47%)
instructions in affected programs: 1153797 -> 959239 (-16.86%)
helped: 581
HURT: 74
total cycles in shared programs: 333968972 -> 324807922 (-2.74%)
cycles in affected programs: 129809402 -> 120648352 (-7.06%)
helped: 571
HURT: 131
total spills in shared programs: 57947 -> 29130 (-49.73%)
spills in affected programs: 53364 -> 24547 (-54.00%)
helped: 351
HURT: 0
total fills in shared programs: 51310 -> 25468 (-50.36%)
fills in affected programs: 44882 -> 19040 (-57.58%)
helped: 351
HURT: 0
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This reduces the time spent in nir_opt_cse() by almost a half.
The massif tool from valgrind reported no change in peak
memory use with the large dolphin uber shaders I used for
testing.
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Copy+paste error. It was supposed to test cull and not clip.
Fixes: 4e69fba534 "nir: Rewrite lower_clip_cull_distance_arrays..."
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109717
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
In GLSL that info is set as a layout qualifier when redeclaring
gl_FragCoord, so it is somewhat tied to a specific variable. But in
practice it behaves as a global of the shader. In ARB programs it is set
using a global OPTION (defined by ARB_fragment_coord_conventions), and
in SPIR-V using ExecutionModes, which are also not tied specifically to
the builtin.
This patch moves that info from the nir variable and ir variable to the
nir shader and gl_program shader_info respectively, so the mapping is
more similar to SPIR-V and ARB programs than to GLSL.
FWIW, shader_info.fs already had pixel_center_integer, so this change
also removes some redundancy. Also, as struct gl_program also includes
a shader_info, we removed gl_program::OriginUpperLeft and
PixelCenterInteger, as they would be superfluous.
This change was needed because recently spirv_to_nir changed the order
in which execution modes and variables are handled, so the variables
didn't get the correct values. Now the info is set on the shader
itself, and we don't need to go back to the builtin variable to set
it.
Fixes: e68871f6a ("spirv: Handle constants and types before execution
modes")
v2: (Jason)
* glsl_to_nir: get the info before glsl_to_nir, while all the rest
of the info gathering is happening
* prog_to_nir: gather the info on a general info-gathering pass,
not on variable setup.
v3: (Jason)
* Squash with the patch that removes that info from ir variable
* anv: assert that OriginUpperLeft is true. It should already be
set by spirv_to_nir.
* blorp: set origin_upper_left on its core "compile fragment
shader", not just on some specific places (for this we added an
helper on a previous patch).
* prog_to_nir: no need to specifically gather these fragcoord modes
as the full gl_program shader_info is copied.
* spirv_to_nir: assert that we are a fragment shader when handling
these execution modes.
v4: (reported by failing gitlab pipeline #18750)
* state_tracker: update too due to changes on ir.h/gl_program
v5:
* blorp: minor change after the change in the previous patch
* radeonsi: update due to this change.
v6: (Timothy Arceri)
* prog_to_nir: remove extra whitespace
* shader_info: don't use :1 on origin_upper_left
* glsl: program.fs.origin_upper_left/pixel_center_integer can be
moved out of the shader list loop
This makes us properly handle gl_ClipDistance and gl_CullDistance.
Fixes: 19064b8c "nir: Add a pass for gathering transform feedback info"
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
We needed to better handle cases where a chunk of a variable starts at
some non-zero location_frac and rolls over into the next slot but may
not be more than 4 dwords. For example, if gl_CullDistance is an array
of 3 things and has location_frac = 2, it will span across two vec4s but
is not, itself, bigger than a vec4. If you ignore the clip/cull special
case, it's not allowed to happen for anything else because the only
things that can span more than one slot are dvec3 and dvec4, and they're
both bigger than a vec4. The current code uses this attrib_slot thing
where we count attribute slots and iterate over them. However, that
doesn't work in the case above because gl_CullDistance will have an
attrib_slot count of 1 even though it does span two slots. We could fix
this by adjusting attrib_slot but we already have comp_mask and it's
easier to just handle it that way.
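A hedged sketch of the spill-over case (hypothetical helper; bit i of a
mask stands for component i):
static void
component_masks(unsigned location_frac, unsigned num_components,
                unsigned *first_slot_mask, unsigned *second_slot_mask)
{
   /* e.g. a 3-element gl_CullDistance at location_frac 2 gives
    * first_slot_mask == 0xc (zw) and second_slot_mask == 0x1 (x). */
   unsigned mask = ((1u << num_components) - 1) << location_frac;
   *first_slot_mask  = mask & 0xf;
   *second_slot_mask = mask >> 4;
}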
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Instead of going to all the work of combining them into one array, just
make two arrays and use location_frac to colocate them within CLIP0.
Then the back-end can sort things out and stack them on top of each
other. Thanks to ef99f4c8, we also don't need to set compact anymore.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Even in a very basic shader this reduces the time spent in
nir_copy_prop() by ~17%.
No shader-db changes for radeonsi NIR or i965.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Commit 08bfd710a2 (nir/dead_cf: Stop relying on liveness analysis)
introduced a new check that iterates over an SSA def's uses to see if
it's used. But it only checked normal uses, not uses that are part of an
'if' condition. This led to it thinking more nodes were dead than
actually were.
Fixes Piglit's variable-indexing/tcs-output-array-float-index-wr test
(and related tests) with the out-of-tree Iris driver.
Fixes: 08bfd710a2 ("nir/dead_cf: Stop relying on liveness analysis")
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
I'd like to use this in the prog_parameter.c code, so I need to move it
into C, make it non-static, and so on. This probably isn't the ideal
place for it, but I couldn't think of a better one.
Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
The idea here is to reassociate a * (b * c) into (a * c) * b, when
b is a non-constant value, but a and c are constants, allowing them
to be combined.
But nothing was enforcing that 'b' must be non-constant, which meant
that running opt_algebraic in a loop would never terminate if the IR
contained non-folded constant expressions like 256 * 0.5 * 2. Normally,
we call constant folding in such a loop too, but IMO it's better for
nir_opt_algebraic to be robust and not rely on that.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109581
Fixes: 32e266a9a5 ("i965: Compile fp64 funcs only if we do not have 64-bit hardware support")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
This was probably useful when it was first written; however, it looks
to no longer be necessary.
As far as I can tell, these days dce is smart enough to remove useless
instructions from if branches. Once this is done,
nir_opt_peephole_select() will end up removing the empty if.
Removing this support reduces the dolphin uber shader compilation
time spent in nir_opt_dead_cf() by a little over 7x.
No shader-db changes on i965 or radeonsi.
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
In the opt_peel_initial_if optimization, when moving the continue list
to the end of the continue block, before the jump, it can happen that
the continue list itself also ends with a jump.
This would mean that we would have two jump instructions in a row: the
first one from the continue list and the second one from the continue
block.
As inserting an instruction after a jump is not allowed (and it does not
make sense, as it will not be executed), remove the jump from the
continue block and keep the one from the continue list, as it will be
executed first.
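A minimal sketch of the fix (hypothetical helper name, not the exact
pass code):
#include "nir.h"
static void
drop_block_jump_if_list_jumps(nir_block *continue_block,
                              bool continue_list_ends_in_jump)
{
   nir_instr *last = nir_block_last_instr(continue_block);
   /* If the spliced continue list already ends in a jump, the block's
    * own trailing jump would end up after it, which is not allowed and
    * would never execute anyway, so remove it. */
   if (continue_list_ends_in_jump &&
       last != NULL && last->type == nir_instr_type_jump)
      nir_instr_remove(last);
}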
CC: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
opt_split_alu_of_phi moves an ALU instruction to the end of the continue
block. But if the continue block ends with a jump instruction (an
explicit "continue"), then the ALU must be inserted before the jump, as
it is illegal to add instructions after a jump.
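A minimal sketch of the insertion-point choice (hypothetical helper, not
the exact pass code):
#include "nir.h"
static nir_cursor
cursor_for_continue_block(nir_block *block)
{
   nir_instr *last = nir_block_last_instr(block);
   /* Insert before a trailing jump if there is one; otherwise append. */
   if (last != NULL && last->type == nir_instr_type_jump)
      return nir_before_instr(last);
   return nir_after_block(block);
}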
CC: Ian Romanick <ian.d.romanick@intel.com>
Fixes: 0881e90c09 ("nir: Split ALU instructions in loops that read phis")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
The liveness analysis pass is fairly expensive because it has to build
large bit-sets and run a fix-point algorithm on them. Instead of
requiring liveness to detect whether values escape a CF node, just take
advantage of the structured nature of NIR and use block indices instead.
This only requires the block index metadata, which is the fastest
metadata we have to generate.
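A hedged sketch of the escape test (hypothetical helper name; it assumes
nir_metadata_block_index has been computed so that the blocks inside the
CF node form a contiguous index range):
#include "nir.h"
static bool
def_escapes_range(nir_ssa_def *def, unsigned first_block, unsigned last_block)
{
   nir_foreach_use(src, def) {
      unsigned idx = src->parent_instr->block->index;
      if (idx < first_block || idx > last_block)
         return true;
   }
   nir_foreach_if_use(src, def) {
      /* Conservatively treat uses as an 'if' condition as escaping. */
      return true;
   }
   return false;
}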
No shader-db changes on Kaby Lake
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>