Previously when considering whether to rematerialize or spill/fill
ssa_1954, we would go for a spill/fill :
vec4 32 ssa_388 = (float32)txf ssa_387 (texture_handle), ssa_86 (coord), ssa_23 (lod), 0 (texture), 0 (sampler)
...
vec1 32 ssa_1953 = load_const (0xbd23d70a = -0.040000)
vec1 32 ssa_1954 = fadd ssa_388.x, ssa_1953
vec1 32 ssa_1955 = fneg ssa_1954
This is because when looking at ssa_1955 the first time, we would
consider ssa_388 unrematerialiable, and therefore all values built on
top of it would be considered unrematerialiable as well.
The missing piece when considering whether to rematerialize ssa_1954
is that we should look at filled values. Now that ssa_388 has been
spilled/filled, we can rebuild ssa_1955 on top of the filled value and
avoid spilling/filling ssa_1955 at all.
This requires a bit more work though. We can't just look at an
instruction in isolation, we need to go through the ssa chains until
we find values we can rematerialize or not.
In this change we build a list of all ssa values involved in building
a given value, up to the point there we find a filled or a
rematerializable value.
In this particular case, looking at ssa_1955 :
* We can rematerialize ssa_388 from its filled value
* We can rematerialize ssa_1953 trivially
* We can rematerialize ssa_1954 because its 2 inputs are rematerializable
* We can rematerialize ssa_1955 because ssa_1954 is rematerializable
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16556>
Currently we do something like this :
ssa_0 = ...
ssa_1 = ...
* spill ssa_0, ssa_1
call1()
* fill ssa_0, ssa_1
ssa_2 = ...
ssa_3 = ...
* spill ssa_0, ssa_1, ssa_2, ssa_3
call2()
* fill ssa_0, ssa_1, ssa_2, ssa_3
If we assign the same possition to ssa_0 & ssa_1 in the spilling
stack, then on call2(), we know that those values are already present
in memory at the right location and we can avoid respilling them.
The result would be something like this :
ssa_0 = ...
ssa_1 = ...
* spill ssa_0, ssa_1
call1()
* fill ssa_0, ssa_1
ssa_2 = ...
ssa_3 = ...
* spill ssa_2, ssa_3
call2()
* fill ssa_0, ssa_1, ssa_2, ssa_3
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16556>
For a follow up optimization, we would like to track scratch loads.
This isn't possible with global load/store intrinsics. So use a couple
of special intrinsic in the pass and only lower it to global
intrinsics at the end.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16556>
For used by different counter.
Vulkan:
1. VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_PRIMITIVES_BIT,
sum generated primitives of all 4 streams when GS.
2. VK_QUERY_TYPE_PRIMITIVES_GENERATED_EXT, count generated
primitives for all 4 streams when VS/TES/GS.
3. VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT, count generated
and streamout primitives for all 4 streams when VS/TES/GS.
OpenGL:
1. GL_GEOMETRY_SHADER_PRIMITIVES_EMITTED_ARB, sum generated
primitives for all 4 streams when GS.
2. GL_PRIMITIVES_GENERATED, count generated primitives for all 4
streams when VS/TES/GS.
3. GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, count streamout
primitives for all 4 streams when VS/TES/GS.
pipeline_stat_query_enabled_amd is for Vulkan 1 and OpenGL 1.
xfb_query_enabled_amd is for Vulkan 2/3 and OpenGL 2/3.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19015>
Based on si_create_passthrough_tcs() as that seemed the most generic of
the various different backend driver implementations. Uses the
load_tess_level_outer_default and load_tess_level_inner_default
intrinsics to load the gl_TessLevelOuter and gl_TessLevelInner values,
so driver will somehow need to implement those to load the values set
by pipe_context::set_tess_state() or similar.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19259>
nir_opt_preamble will be crucial to optimize out the lowering for array
textures on AGX, which involves this AGX-specific sysval.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Acked-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18813>
For example if src0 is 0x80000000 we should return 1, not 0.
Signed-off-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Fixes: a5747f8ab3 ("nir: add opcodes for *find_msb_rev and lowering")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18951>
Also modify all existing uses to pass a zero to this new src.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com> (nir)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17551>
Previously, we always treated these as coherent, but now let's make
this configurable. Also set all current users to ACCESS_COHERENT.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com> (nir)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17551>
These intrinsics will be used to lower NGG attributes to memory on
GFX11.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19173>
For radeonsi which load this from arg.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19166>
...even for non-pixel interpolation, for consistency. Otherwise backends get
funny intrinsics with interpolateAt:
vec4 32 ssa_4 = intrinsic load_interpolated_input (ssa_3, ssa_2) (base=1, component=0, dest_type=invalid /*0*/, io location=33 slots=1 /*161*/)
We know it'll be a float, but backends shouldn't need to special case this. (Or
maybe interpolated_input shouldn't have a dest_type index. I'd be ok with that
resolution too. But having one and not setting it consistently is wrong.)
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19085>
I stumbled upon this old NIR pass (still in use by intel and broadcom)
and noticed how most of the code was NIR boilerplate that we have
helpers for. Rewrite the pass to use all the helpers.
v2: Fix cube map arrays.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18754>
This failed to take fabs of the first component, implementing an unintended
formula that would return the right results in some common cases but is wrong in
general:
max { x, |y|, |z| }
instead of the intended
max { |x|, |y|, |z| }
Reexpress the implementation to make correctness obvious.
Fixes: 272e927d0e ("nir/spirv: initial handling of OpenCL.std extension opcodes")
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18754>
If there is a single use of fmul, and that single use is fadd, it makes
sense to fuse ffma, as we already do. However, if there are multiple
uses, fusing may impede code gen. Consider the source fragment:
a = fmul(x, y)
b = fadd(a, z)
c = fmin(a, t)
d = fmax(b, c)
The fmul has two uses. The current ffma fusing is greedy and will
produce the following "optimized" code.
a = fmul(x, y)
b = ffma(x, y, z)
c = fmin(a, t)
d = fmax(b, c)
Actually, this code is worse! Instead of 1 fmul + 1 fadd, we now have 1
fmul + 1 ffma. In effect, two multiplies (and a fused add) instead of
one multiply and an add. Depending on the ISA, that could impede
scheduling or increase code size. It can also increase register
pressure, extending the live range.
It's tempting to gate on is_used_once, but that would hurt in cases
where we really do fuse everything, e.g.:
a = fmul(x, y)
b = fadd(a, z)
c = fadd(a, t)
For ISAs that fuse ffma, we expect that 2 ffma is faster than 1 fmul + 2
fadd. So what we really want is to fuse ffma iff the fmul will get
deleted. That occurs iff all uses of the fmul are fadd and will
themselves get fused to ffma, leaving fmul to get dead code eliminated.
That's easy to implement with a new NIR search helper, checking that all
uses are fadd.
shader-db results on Mali-G57 [open shader-db + subset of closed]:
total instructions in shared programs: 179491 -> 178991 (-0.28%)
instructions in affected programs: 36862 -> 36362 (-1.36%)
helped: 190
HURT: 27
total cycles in shared programs: 10573.20 -> 10571.75 (-0.01%)
cycles in affected programs: 72.02 -> 70.56 (-2.02%)
helped: 28
HURT: 1
total fma in shared programs: 1590.47 -> 1582.61 (-0.49%)
fma in affected programs: 319.95 -> 312.09 (-2.46%)
helped: 194
HURT: 1
total cvt in shared programs: 812.98 -> 813.03 (<.01%)
cvt in affected programs: 118.53 -> 118.58 (0.04%)
helped: 65
HURT: 81
total quadwords in shared programs: 98968 -> 98840 (-0.13%)
quadwords in affected programs: 2960 -> 2832 (-4.32%)
helped: 20
HURT: 4
total threads in shared programs: 4693 -> 4697 (0.09%)
threads in affected programs: 4 -> 8 (100.00%)
helped: 4
HURT: 0
v2: Update trace checksums for virgl due to numerical differences.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18814>
validate_ssa_def_dominance() asserts :
validate_assert(state, !BITSET_TEST(state->ssa_defs_found, def->index));
Because the previous validation lefts bits set when it processed the
IR.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18966>
In that case we need to use the sysval. That sysval can be optimized anyway in
the nonvariable case. Fixes test_basic.get_linear_ids on panfrost.
Fixes: 998d84fca5 ("nir/lower_system_values: Support lowering more intrinsics")
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18662>
The new enum is called nir_selection_control_divergent_always_taken,
and it's almost the same as nir_selection_control_flatten.
The main difference between the two is that "flatten" represents
a choice made by the application but "divergent_always_taken" may
be applied by the compiler stack when it thinks this is beneficial.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Reviewed-By: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17921>
If the abs source modifiers is not supported for the current
instruction because it is an instruction with three sources we
may still see a parent mov that has the `abs` modifier. In this
case we must not propagate that abs modifier from that parent
instructions.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7350
Fixes: cd73b6174b
nir/lower_to_source_mods: Stop turning add, sat, and neg into mov
Signed-off-by: Gert Wollny <gert.wollny@collabora.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18902>
We elminated OOB loads while renaming vars to SSA. However, if the OOB
load only appeared after some other passes had constant folded, there may
be no renaming work to do, at which point we'd leave the OOB load deref
around without renaming it or deleting it. For vc4, this was quite a
surprise and caused a regression when we stop eliminating some OOB
accesses at the GLSL level.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18466>
It was swept at the end, but it meant that in shaders with lots of copies
available at the start of lots of if statements, you'd blow up memory
usage.
turnip memory consumption on dEQP-VK.ssbo.layout.random.scalar.75 drops
from 1.4GB to 110MB, and runtime from 19s to 17s.
Fixes: #7361
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18891>
Lowered fp64 ops can blow up the loop bodies while still being
suitable for unrolling.
Allow for using different parameters to unroll loops with
soft fp64.
Reviewed-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18863>
This lets us vectorize gl_TessLevel{Inner,Outer} writes, using a pass
developed for RADV. Not all backends are prepared to handle this, so
we make it optional.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17944>
somehow 64bit lowering creates patterns like
vec1 64 ssa_1 = undefined
ssa_2 = unpack_64_2x32_split_x ssa_1
and then the 64bit value is never otherwise used. for this case, rewriting
the unpack to just be a 32bit undef allows the 64bit undef to be optimized
out, avoiding spec violations
fixes#6945
SoroushIMG <soroush.kashani@imgtec.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18728>
If we know that a select will be eliminated once the loop is
unrolled than we don't need to count the instruction towards the
cost of the loop.
This change helps 2 loops unroll in an xcom enemy unknown shader
that is loaded full of these redundant selects.
Acked-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18587>
Here we move the calculation of the instruction cost of the loop
after we have processed other information such as finding the
induction variables. This is useful because we can use this further
information to find instructions that will be eliminated if the
loop was to unroll and therefore give them a cost of 0.
Acked-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18587>
Changes:
- nir_metadata_preserve(..., nir_metadata_block_index | nir_metadata_dominance)
is called only when pass makes progress
- nir_metadata_preserve(..., nir_metadata_all) is called when pass doesn't
make progress
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12282>
Changes:
- nir_metadata_preserve(..., nir_metadata_block_index | nir_metadata_dominance)
is called only when pass makes progress
- nir_metadata_preserve(..., nir_metadata_all) is called when pass doesn't
make progress
Reviewed-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12282>