This doesn't help on Intel GPUs now because we always take the
"always_precise" path first. It may help on other GPUs, and it does
prevent a bunch of regressions in "intel/compiler: Don't always require
precise lowering of flrp".
Reviewed-by: Matt Turner <mattst88@gmail.com>
There is little effect on Intel GPUs now because we almost always take
the "always_precise" path first. It may help on other GPUs, and it does
prevent a bunch of regressions in "intel/compiler: Don't always require
precise lowering of flrp".
No changes on any other Intel platforms.
GM45 and Iron Lake had similar results. (Iron Lake shown)
total cycles in shared programs: 188852500 -> 188852484 (<.01%)
cycles in affected programs: 14612 -> 14596 (-0.11%)
helped: 4
HURT: 0
helped stats (abs) min: 4 max: 4 x̄: 4.00 x̃: 4
helped stats (rel) min: 0.09% max: 0.13% x̄: 0.11% x̃: 0.11%
95% mean confidence interval for cycles value: -4.00 -4.00
95% mean confidence interval for cycles %-change: -0.13% -0.09%
Cycles are helped.
Reviewed-by: Matt Turner <mattst88@gmail.com>
This doesn't help on Intel GPUs now because we always take the
"always_precise" path first. It may help on other GPUs, and it does
prevent a bunch of regressions in "intel/compiler: Don't always require
precise lowering of flrp".
No changes on any Intel platform. Before a number of large rebases this
helped cycles in a couple shaders on Iron Lake and GM45.
Reviewed-by: Matt Turner <mattst88@gmail.com>
If the magnitudes of #a and #b are such that (b-a) won't lose too much
precision, lower as a+c(b-a).
No changes on any other Intel platforms.
v2: Rebase on 424372e5dd5 ("nir: Use the flrp lowering pass instead of
nir_opt_algebraic")
Iron Lake and GM45 had similar results. (Iron Lake shown)
total instructions in shared programs: 8192503 -> 8192383 (<.01%)
instructions in affected programs: 18417 -> 18297 (-0.65%)
helped: 68
HURT: 0
helped stats (abs) min: 1 max: 18 x̄: 1.76 x̃: 1
helped stats (rel) min: 0.19% max: 7.89% x̄: 1.10% x̃: 0.43%
95% mean confidence interval for instructions value: -2.48 -1.05
95% mean confidence interval for instructions %-change: -1.56% -0.63%
Instructions are helped.
total cycles in shared programs: 188662536 -> 188661956 (<.01%)
cycles in affected programs: 744476 -> 743896 (-0.08%)
helped: 62
HURT: 0
helped stats (abs) min: 4 max: 60 x̄: 9.35 x̃: 6
helped stats (rel) min: 0.02% max: 4.84% x̄: 0.27% x̃: 0.06%
95% mean confidence interval for cycles value: -12.37 -6.34
95% mean confidence interval for cycles %-change: -0.48% -0.06%
Cycles are helped.
Reviewed-by: Matt Turner <mattst88@gmail.com>
I tried to be very careful while updating all the various drivers, but I
don't have any of that hardware for testing. :(
i965 is the only platform that sets always_precise = true, and it is
only set true for fragment shaders. Gen4 and Gen5 both set lower_flrp32
only for vertex shaders. For fragment shaders, nir_op_flrp is lowered
during code generation as a(1-c)+bc. On all other platforms 64-bit
nir_op_flrp and on Gen11 32-bit nir_op_flrp are lowered using the old
nir_opt_algebraic method.
No changes on any other Intel platforms.
v2: Add panfrost changes.
Iron Lake and GM45 had similar results. (Iron Lake shown)
total cycles in shared programs: 188647754 -> 188647748 (<.01%)
cycles in affected programs: 5096 -> 5090 (-0.12%)
helped: 3
HURT: 0
helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%
Reviewed-by: Matt Turner <mattst88@gmail.com>
This pass will soon grow to include some optimizations that are
difficult or impossible to implement correctly within nir_opt_algebraic.
It also include the ability to generate strictly correct code which the
current nir_opt_algebraic lowering lacks (though that could be changed).
v2: Document the parameters to nir_lower_flrp. Rebase on top of
3766334923 ("compiler/nir: add lowering for 16-bit flrp")
Reviewed-by: Matt Turner <mattst88@gmail.com>
At initial nir level all drivers are supporting ints.
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Driver which do not support native integers should use a lowering
pass to go from integers to floats.
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This new pass lowers ints and bools to floats. It allows hardware
that doesn't have native integers (e.g. Mali4x0) use the same
code paths as modern hardware.
It uses newly introduced pass to gather SSA types and should be
used as late as possible.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
In commit a99c360a46 (nir: add pass to lower fb reads), a new
file was added that needs to also be added to the
Makefile.sources list used by the Android and SCons build system.
Cc: Rob Clark <robdclark@chromium.org>
Cc: Emil Velikov <emil.l.velikov@gmail.com>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Alistair Strachan <astrachan@google.com>
Cc: Greg Hartman <ghartman@google.com>
Cc: Tapani Pälli <tapani.palli@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Fixes: a99c360a46 ("nir: add pass to lower fb reads")
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The current AOSP master build system breaks building mesa due to the
following error:
external/mesa3d/src/compiler/Android.glsl.gen.mk:94: error:
writing to readonly directory: "external/mesa3d/src/compiler/glsl/ir.h"
This error is bogus -- nothing "writes" to ir.h -- but the rule is
unnecessary because the generated header that is a dependency of the
non-generated header should be added to LOCAL_GENERATED_SOURCES and this
will track if the dependency needs to be regenerated.
(This change fixes a similar problem affecting nir.h too.)
Cc: Rob Clark <robdclark@chromium.org>
Cc: Emil Velikov <emil.l.velikov@gmail.com>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Alistair Strachan <astrachan@google.com>
Cc: Greg Hartman <ghartman@google.com>
Cc: Tapani Pälli <tapani.palli@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Signed-off-by: Alistair Strachan <astrachan@google.com>
[jstultz: Forward ported and tweaked commit subject]
Signed-off-by: John Stultz <john.stultz@linaro.org>
with that we can simplify code where nir vectors are created
v2: merge both lines in nir_vec
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This new pass (which isn't even compile-tested) attempts to determine
the ALU type of all the SSA values in a function impl. It takes a
greedy approach and assigns intness or floatness to everything it thinks
can possibly contain an int or a float. Some values will be labled as
both int and float and some will be labled as neither and it is up to
the caller to decide what to do with this information. However, for a
"nice" shader where the original source contained no bit-casts and no
implicit bit-casts were introduced by optimizations, there shouldn't be
any overlap in the two sets save for the odd CSEd zero constant.
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Just don't emit the transform array at all if there are no transforms
v2:
- Don't use len(array) > 0 (Dylan)
- Keep using ARRAY_SIZE to make the generated C code easier to read
(Jason).
Somewhere down in the depths of the mingw headers 'interface' is
defined, change it to iface like a similar patch did.
Signed-off-by: Dylan Baker <dylan.c.baker@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
This has a couple of hardcoded vec4 limits in it, change them
to the proper sizing to avoid future issues.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
- make sure compute shader derivatives are exposed for all extensions
- unify duplicated code
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Apparently we never hit this path. Or at least haven't for a rather
long time. But in either case (load_deref or load_frag_coord), we can
just directly use the intrinsic's ssa dest. So stop passing the
nir_variable (which would be NULL in the load_frag_coord case) around
and instead just use &intr->dest.ssa.
(This ofc means we need to setup the cursor to insert *after* the
instruction, which seems to be another bug of the original
implementation.)
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
The extra comma at the end was annoying me.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
nir_opt_algebraic is currently one of the most expensive NIR passes,
because of the many different patterns we've added over the years. Even
though patterns are already sorted by opcode, there are still way too
many patterns for common opcodes like bcsel and fadd, which means that
many patterns are tried but only a few actually match. One way to fix
this is to add a pre-pass over the code that scans it using an automaton
constructed beforehand, similar to the automatons produced by lex and
yacc for parsing source code. This automaton has to walk the SSA graph
and recognize possible pattern matches.
It turns out that the theory to do this is quite mature already, having
been developed for instruction selection as well as other non-compiler
things. I followed the presentation in the dissertation cited in the
code, "Tree algorithms: Two Taxonomies and a Toolkit," trying to keep
the naming similar. To create the automaton, we have to perform
something like the classical NFA to DFA subset construction used by lex,
but it turns out that actually computing the transition table for all
possible states would be way too expensive, with the dissertation
reporting times of almost half an hour for an example of size similar to
nir_opt_algebraic. Instead, we adopt one of the "filter" approaches
explained in the dissertation, which trade much faster table generation
and table size for a few more table lookups per instruction at runtime.
I chose the filter which resulted the fastest table generation time,
with medium table size. Right now, the table generation takes around .5
seconds, despite being implemented in pure Python, which I think is good
enough. Based on the numbers in the dissertation, the other choice might
make table compilation time 25x slower to get 4x smaller table size, but
I don't think that's worth it. As of now, we get the following binary
size before and after this patch:
text data bss dec hex filename
11979455 464720 730864 13175039 c908ff before i965_dri.so
text data bss dec hex filename
12037835 616244 791792 13445871 cd2aef after i965_dri.so
There are a number of places where I've simplified the automaton by
getting rid of details in the LHS patterns rather than complicate things
to deal with them. For example, right now the automaton doesn't
distinguish between constants with different values. This means that it
isn't as precise as it could be, but the decrease in compile time is
still worth it -- these are the compilation time numbers for a shader-db
run with my (admittedly old) database on Intel skylake:
Difference at 95.0% confidence
-42.3485 +/- 1.375
-7.20383% +/- 0.229926%
(Student's t, pooled s = 1.69843)
We can always experiment with making it more precise later.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
I'm not sure what triggered this, but building with
scons platform=windows toolchain=crossmingw machine=x86 build=profile
with MinGW g++ 7.3 or 7.4 causes an internal compiler error.
We can work around it by forcing -O1 optimization.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Neha Bhende <bhenden@vmware.com>
Use a different arrangement of constants to allow more ffma.
A vec4 backend will now use 3 fma for yuv_to_rgb. On freedreno/ir3, it is
down from 10 to 7 alu (4 fma, 3 mul, 3 add to 7 fma). Other backends
shouldn't be hurt.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
The code was handling the Weak variant in some cases, but missing
others, e.g. the get_deref_nir_atomic_op. Add all the missing cases
with the same behavior of the non-Weak SpvOpAtomicCompareExchange.
Note that the Weak variant is basically an alias, as SPIR-V 1.3,
Revision 7 says
"OpAtomicCompareExchangeWeak
Deprecated (use OpAtomicCompareExchange).
Has the same semantics as OpAtomicCompareExchange."
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
One special case, `src/util/xmlpool/.gitignore` is not entirely deleted,
as `xmlpool.pot` still gets generated (eg. by `ninja xmlpool-pot`).
Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
From page 76 (page 80 of the PDF) of the GLSL 4.60 v.5 spec:
" No aliasing in output buffers is allowed: It is a compile-time or
link-time error to specify variables with overlapping transform
feedback offsets."
Currently, this is expected to fail, but it succeeds:
"
...
layout (xfb_offset = 0) out vec2 a;
layout (xfb_offset = 0) out vec4 b;
...
"
Fixes the following piglit test:
tests/spec/arb_enhanced_layouts/compiler/transform-feedback-layout-qualifiers/xfb_offset/invalid-overlap.vert
Fixes the following test:
KHR-GL44.enhanced_layouts.xfb_output_overlapping
v2:
- Use a data structure to track the used components instead of a
nested loop (Ilia).
v3:
- Take the BITSET_WORD array out from the
gl_transform_feedback_buffer struct and make it local to the
validation process (Timothy).
- Do not use a nested scope for the validation (Timothy).
v4:
- Add reference to the fixed piglit test in the commit log.
- Add reference to the fixed VK-GL-CTS test in the commit
log (Tapani).
- Empty initialize the BITSET_WORD pointers array (Tapani).
Cc: Timothy Arceri <tarceri@itsqueeze.com>
Cc: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
On some hardware (e.g. Mali400) the shader needs to apply some
transformations for correct gl_FragCoord handling. The lowering
actions look like the following in pseudocode:
gl_FragCoord.xyz = gl_FragCoord_orig.xyz
gl_FragCoord.w = 1.0 / gl_FragCoord_orig.w
Add this lowering as a nir pass in preparation for using it in the driver.
Signed-off-by: Andreas Baierl <ichgeh@imkreisrum.de>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
fixes following warning with clang:
warning: suggest braces around initialization of subobject
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Used same syntax as elsewhere with Mesa sources, verified result
against MSVC with godbolt.org.
fixes following warning with clang:
warning: suggest braces around initialization of subobject
v2: empty braces -> braces around subobject (Caio, Kristian)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
These have been popping up more and more with the OpenCL work and other
bits causing extra conversions to/from 64-bit.
Reviewed-by: Karol Herbst <kherbst@redhat.com>
This fixes a case where we are expecting 64-bit but generate
32-bit consts and validate gets angry.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
This fixes KHR-GL45.compute_shader.resources-max on radeonsi.
Fixes: 4e1e8f684b "glsl: remember which SSBOs are not read-only and pass it to gallium"
v2: use is_interface_array, protect again assertion failures in u_bit_consecutive
Reviewed-by: Dave Airlie <airlied@redhat.com>
Calculates i,j at specified offset within a pixel. A new load_size_ir3
intrinsic is used in conjunction with fddx/fddy to translate the offset
into primitive space and adjust the i,j from load_barycentric_pixel
accordingly.
Signed-off-by: Rob Clark <robdclark@chromium.org>