Most of the vars tests already had a local helper, so just drop it in
favor of the one in nir_builder. Remaining two tests changed to use
the helper.
The load_store_vectorizer tests were using the specific memory
barriers, but since scoped barriers are also handled, prefer that.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3913>
Three groups of tests, effectively defining what cases the
optimization is allowed or prevented
- Redudant loads (a load generated the value)
- Propagate SSA values (a store generated the value)
- Propagate a var (a copy generated the value)
Change the shader type of the tests to be COMPUTE so
nir_var_mem_shared can also be used. Doesn't affect the semantic of
the copy propagation.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This just adds some split var splitting tests, it verifies
by counting derefs and local vars.
a basic load from inputs, store to array,
same as before but with a shader temp
struct { float } [4] don't split test
a basic load from inputs, with some out of band loads.
a load/store of only half the array
two level array, load from inputs store to all levels
a basic load from inputs with an indirect store to array.
two level array, indirect store to lvl 0
two level array, indirect store to lvl 1
load from inputs, store to array twice
load from input, store to array, load from array, store to another array.
load and from input and copy deref to array
create wildcard derefs, and do a copy
v2: use array_imm helpers, move derefs out of loops,
rename toplevel/secondlevel, use ints, fix lvl1 don't split test,
rename globabls to shader_temp, add comment, check the derefs type
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Other nir_src_as_* functions just take a nir_src. It's not that much
more memory copying and the constness preserving really isn't worth the
cognitive dissonance.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
v2: (all from Jason)
Reuse existing function for the end of the block combinations.
Check the SSA values are coming from the right place in tests.
Document the case when the store to array_deref is reused.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Unlike most of the cases in which we do this by hand, the new helper
properly handles non-32-bit pointers.
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Differently than the direct case, the indirect array derefs of vector
are handled like regular derefs, with the exception that we ignore any
vector entry that has SSA values when performing a load. Such SSA
values don't help loading of the indirect unless we emit an if-ladder.
Copy_derefs are supported for indirects.
Also enable two tests that now pass.
v2: Remove unnecessary temporaries. Be clearer when identifying the
case where copy_entry doesn't help when we are dealing with an
indirect array_deref (of a vector). (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Both on an actual array and on a vector, and an extra test on a vector
mixing direct and indirect access. The vector tests are disabled and
will be enabled by a later commit.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When direct array deref is used on a vector type (for loads and
stores), copy_prop_vars is now smart to propagate values it knows
about.
Given a 'vec4 v', storing to v[3] will update the copy entry for v and
it is equivalent to a write to v.w. Loading from v[1] will try first
to see if there's a known value for v.y -- and drop the load in that
case.
The copy entries still always refer to the entire vectors, so the
operations happen on the parent deref (the 'vector') and the values
are fixed accordingly.
It might be the case now that certain entries have not only different
SSA defs in each element but also those come from different components
than they are set to, because stores to individual elements always
come from a SSA definition with a single component.
Tests related to these cases are now enabled.
v2: Instead of asserting on invalid indices, "load" an undef and
remove the store. (Jason)
v3: Merge code path for the cases of is_array_deref_of_vector into the
regular code path. Add a base_index parameter to
value_set_from_value. (code changes by Jason)
v4: Removed the get_entry_for_deref helper, now being used only once.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Test using array deref on vectors in loads and stores. These are
marked DISABLED_ as this optimization is currently not done.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Replace find_next_intrinsic(intrinsic, after) with
get_intrinsic(intrinsic, index). This makes slightly more convenient
to check the resulting loads/stores/copies, since in most tests we
know which one we care about. The cost is to perform more traversals,
but for such tests this is not a problem.
Added the ASSERT_EQ() on count to some tests missing it, so the
indices queried are always expected to find something.
Also, drop two nir_print_shader leftover calls in a test.
v2: Remove redundant assertions. nir_src_comp_as_uint already
assert what we need. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
the naming is a bit confusing no matter how you look at it. Within SPIR-V
"global" memory is memory accessible from all threads. glsl "global" memory
normally refers to shader thread private memory declared at global scope. As
we already use "shared" for memory shared across all thrads of a work group
the solution where everybody could be happy with is to rename "global" to
"private" and use "global" later for memory usually stored within system
accessible memory (be it VRAM or system RAM if keeping SVM in mind).
glsl "local" memory is memory only accessible within a function, while SPIR-V
"local" memory is memory accessible within the same workgroup.
v2: rename local to function as well
v3: rename vtn_variable_mode_local as well
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Previously, NIR had a single nir_var_uniform mode used for atomic
counters, UBOs, samplers, images, and normal uniforms. This commit
splits this into nir_var_uniform and nir_var_ubo where nir_var_uniform
is still a bit of a catch-all but the nir_var_ubo is specific to UBOs.
While we're at it, we also rename shader_storage to ssbo to follow the
convention.
We need this so that we can distinguish between normal uniforms and UBO
access at the deref level without going all the way back variable and
seeing if it has an interface type.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Use nir_src_comp_as_uint() to read the proper second component, as
nir_src_as_uint() returns the first one.
v2: Use nir_src_comp_as_uint() [Jason]
Fixes: 16870de8a0 ("nir: Use nir_src_is_const and nir_src_as_* in core
code")
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108532
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Vinson Lee <vlee@freedesktop.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Extend the pass to propagate the copies information along the control
flow graph. It performs two walks, first it collects the vars
that were written inside each node. Then it walks applying the copy
propagation using a list of copies previously available. At each node
the list is invalidated according to results from the first walk.
This approach is simpler than a full data-flow analysis, but covers
various cases. If derefs are used for operating on more memory
resources (e.g. SSBOs), the difference from a regular pass is expected
to be more visible -- as the SSA copy propagation pass won't apply to
those.
A full data-flow analysis would handle more scenarios: conditional
breaks in the control flow and merge equivalent effects from multiple
branches (e.g. using a phi node to merge the source for writes to the
same deref). However, as previous commentary in the code stated, its
complexity 'rapidly get out of hand'. The current patch is a good
intermediate step towards more complex analysis.
The 'copies' linked list was modified to use util_dynarray to make it
more convenient to clone it (to handle ifs/loops).
Annotated shader-db results for Skylake:
total instructions in shared programs: 15105796 -> 15105451 (<.01%)
instructions in affected programs: 152293 -> 151948 (-0.23%)
helped: 96
HURT: 17
All the HURTs and many HELPs are one instruction. Looking
at pass by pass outputs, the copy prop kicks in removing a
bunch of loads correctly, which ends up altering what other
other optimizations kick. In those cases the copies would be
propagated after lowering to SSA.
In few HELPs we are actually helping doing more than was
possible previously, e.g. consolidating load_uniforms from
different blocks. Most of those are from
shaders/dolphin/ubershaders/.
total cycles in shared programs: 566048861 -> 565954876 (-0.02%)
cycles in affected programs: 151461830 -> 151367845 (-0.06%)
helped: 2933
HURT: 2950
A lot of noise on both sides.
total loops in shared programs: 4603 -> 4603 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total spills in shared programs: 11085 -> 11073 (-0.11%)
spills in affected programs: 23 -> 11 (-52.17%)
helped: 1
HURT: 0
The shaders/dolphin/ubershaders/12.shader_test was able to
pull a couple of loads from inside if statements and reuse
them.
total fills in shared programs: 23143 -> 23089 (-0.23%)
fills in affected programs: 2718 -> 2664 (-1.99%)
helped: 27
HURT: 0
All from shaders/dolphin/ubershaders/.
LOST: 0
GAINED: 0
The other generations follow the same overall shape. The spills and
fills HURTs are all from the same game.
shader-db results for Broadwell.
total instructions in shared programs: 15402037 -> 15401841 (<.01%)
instructions in affected programs: 144386 -> 144190 (-0.14%)
helped: 86
HURT: 9
total cycles in shared programs: 600912755 -> 600902486 (<.01%)
cycles in affected programs: 185662820 -> 185652551 (<.01%)
helped: 2598
HURT: 3053
total loops in shared programs: 4579 -> 4579 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total spills in shared programs: 80929 -> 80924 (<.01%)
spills in affected programs: 720 -> 715 (-0.69%)
helped: 1
HURT: 5
total fills in shared programs: 93057 -> 93013 (-0.05%)
fills in affected programs: 3398 -> 3354 (-1.29%)
helped: 27
HURT: 5
LOST: 0
GAINED: 2
shader-db results for Haswell:
total instructions in shared programs: 9231975 -> 9230357 (-0.02%)
instructions in affected programs: 44992 -> 43374 (-3.60%)
helped: 27
HURT: 69
total cycles in shared programs: 87760587 -> 87727502 (-0.04%)
cycles in affected programs: 7720673 -> 7687588 (-0.43%)
helped: 1609
HURT: 1416
total loops in shared programs: 1830 -> 1830 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
total spills in shared programs: 1988 -> 1692 (-14.89%)
spills in affected programs: 296 -> 0
helped: 1
HURT: 0
total fills in shared programs: 2103 -> 1668 (-20.68%)
fills in affected programs: 438 -> 3 (-99.32%)
helped: 4
HURT: 0
LOST: 0
GAINED: 1
v2: Remove the DISABLE prefix from tests we now pass.
v3: Add comments about missing write_mask handling. (Caio)
Add unreachable when switching on cf_node type. (Jason)
Properly merge the component information in written map
instead of replacing. (Jason)
Explain how removal from written arrays works. (Jason)
Use mode directly from deref instead of getting the var. (Jason)
v4: Register the local written mode for calls. (Jason)
Prefer cf_node instead of node. (Jason)
Clarify that remove inside iteration only works in backward
iterations. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Also tests for removal of redundant loads, that we currently handle as
part of the copy propagation.
Note some tests involve multiple blocks and are currently DISABLED
because they (expectedly) fail.
v2: Add missing DISABLED prefix to "multi block" tests. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Instead of doing this as part of the existing copy_prop_vars pass.
Separation makes easier to expand the scope of both passes to be more
than per-block. For copy propagation, the information about valid
copies comes from previous instructions; while the dead write removal
depends on information from later instructions ("have any instruction
used this deref before overwrite it?").
Also change the tests to use this pass (instead of copy prop vars).
Note that the disabled tests continue to fail, since the standalone
pass is still per-block.
v2: Remove entries from dynarray instead of marking items as
deleted. Use foreach_reverse. (Caio)
(all from Jason)
Do not cache nir_deref_path. Not worthy for this patch.
Clear unused writes when hitting a call instruction.
Clean up enumeration of modes for barriers.
Move metadata calls to the inner function.
v3: For copies, use the vector length to calculate the mask.
(all from Jason)
Use nir_component_mask_t when applicable.
Rename functions for clarity.
Consider local vars used by a call to be conservative (SPIR-V has
such cases).
Comment and assert the assumption that stores and copies are
always to a deref that ends with a vector or scalar.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Note at the moment the pass called is nir_opt_copy_prop_vars, because
dead write elimination is implemented there.
Also added tests that involve identifying dead writes in multiple
blocks (e.g. the overwrite happens in another block). Those currently
fail as expected, so are marked to be skipped.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Add basic helpers for doing tests on the vars related optimization
passes. The main goal is to lower the barrier to create tests during
development and debugging of the passes. Full coverage is not a
requirement.
v2: Make find_next_intrinsic() skip blocks before 'after'. (Jason)
Move nir_imm_ivec2() to nir_builder.h. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>