It was always fneu but naming it fne causes confusion from time to time. So
lets rename it. Later we also want to add other unordered and fne, this is
a smaller preparation for that.
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6377>
The OpenCL image_width/height/depth functions have variants which can
take an LOD parameter. More importantly, LLVM-SPIRV-Translator always
generates OpImageQuerySizeLod even if the LOD is guaranteed to be zero.
Given that over half the hardware out there has an LOD field for image
size queries (based on a rudimentary scan through their NIR -> whatever
code), we may as well just add the source to the NIR intrinsic. If this
is ever a problem for anyone, the lowering is pretty trivial.
I've also added asserts to everyone's drivers that should alert them if
they ever see an LOD other than zero. This will never happen with GL or
Vulkan so there's no need for panic.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6396>
Just to clarify the missing break is intentional.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5365>
SPIRV OpControlBarrier can have both a memory and a control barrier
which some hardware can handle with a single instruction. Let's
turn the scoped_memory_barrier into a scoped barrier which can embed
both barrier types. Note that control-only or memory-only barriers can
be supported through this new intrinsic by passing NIR_SCOPE_NONE to the
unused barrier type.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Suggested-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4900>
Using HALT to immediately jump to the end of the shader is required to
implement GL_EXT_gpu_shader4 and OpenGL 3.0. However, vanilla OpenGL
1.2 doesn't forbid it and it likely makes something somewhere faster.
We should be consistent and implement the same discard behavior on all
hardware if we can.
The rules for HALT on Gen4-5 are a bit different from Gen6+. On the
older hardware, there is no stack for HALT; instead it's up to software
to save and restore mask registers. However, there's no real saving
needed since we only use HALT to jump to the end of the program where
we're about about to do our FB writes. All we need to do is reset AMask
to DMask, the value it was initialized to at the start of the thread.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5244>
The nir_intrinsic_load_simd_width_intel is always lowered by the
brw_nir_lower_simd() pass before the emission happens. This is likely
a "leftover" from patch rewriting/squashing that happened when this
intrinsic was added.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
Intrinsic to get the SIMD width, which not always the same as subgroup
size. Starting with a small scope (Intel), but we can rename it later
to generalize if this turns out useful for other drivers.
Change brw_nir_lower_cs_intrinsics() to use this intrinsic instead of
a width will be passed as argument. The pass also used to optimized
load_subgroup_id for the case that the workgroup fitted into a single
thread (it will be constant zero). This optimization moved together
with lowering of the SIMD.
This is a preparation for letting the drivers call it before the
brw_compile_cs() step.
No shader-db changes in BDW, SKL, ICL and TGL.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
Adding this since Iris will handle variable group size parameters by
itself.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
Render Target Array Index has moved from R0.0[26:16] to
R1.1[26:16] on gen12.
Fixes dEQP-VK.multiview.input_attachments.*
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4836>
In Gen11+, when emitting a fence for both L3 and SLM, the generated
code would look like
SEND, MOV (for stall), SEND, MOV (for stall)
This commit change that so two SENDs are emitted before the MOVs for
stall. This is similar to the approach used in Ivy Bridge for the
render fence.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.
For IvyBridge, every (data cache) fence is accompained by a render
cache fence, that now is explicit in the IR, two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).
Because Begin and End interlock intrinsics are effectively memory
barriers, move its handling alongside the other memory barrier
intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish
if we are going to use a SENDC (for Begin) or regular SEND (for End).
This change is a preparation to allow emitting both SENDs in Gen11+
before we can stall on them.
Shader-db results for IVB (i965):
total instructions in shared programs: 11971190 -> 11971200 (<.01%)
instructions in affected programs: 11482 -> 11492 (0.09%)
helped: 0
HURT: 8
HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1
HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
95% mean confidence interval for instructions value: 0.66 1.84
95% mean confidence interval for instructions %-change: 0.01% 0.27%
Instructions are HURT.
Unlike the previous code, that used the `mov g1 g2` trick to force
both `g1` and `g2` to stall, the scheduling fence will generate `mov
null g1` and `mov null g2`. During review it was decided it was not
worth keeping the special codepath for the small effect will have.
Shader-db results for HSW (i965), BDW and SKL don't have a change
on instruction count, but do report changes in cycles count, showing
SKL results below
total cycles in shared programs: 341738444 -> 341710570 (<.01%)
cycles in affected programs: 7240002 -> 7212128 (-0.38%)
helped: 46
HURT: 5
helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362
HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
95% mean confidence interval for cycles value: -777.71 -315.38
95% mean confidence interval for cycles %-change: -1.42% -0.83%
Cycles are helped.
This seems to be the effect of allocating two registers separatedly
instead of a single one with size 2, which causes different register
allocation, affecting the cycle estimates.
while ICL also has not change on instruction count but report changes
negative changes in cycles
total cycles in shared programs: 352665369 -> 352707484 (0.01%)
cycles in affected programs: 9608288 -> 9650403 (0.44%)
helped: 4
HURT: 104
helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48
HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
95% mean confidence interval for cycles value: 256.67 523.24
95% mean confidence interval for cycles %-change: 0.63% 1.03%
Cycles are HURT.
AFAICT this is the result of the case above.
Shader-db results for TGL have similar cycles result as ICL, but also
affect instructions
total instructions in shared programs: 17690586 -> 17690597 (<.01%)
instructions in affected programs: 64617 -> 64628 (0.02%)
helped: 55
HURT: 32
helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2
HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
95% mean confidence interval for instructions value: -2.03 2.28
95% mean confidence interval for instructions %-change: -0.41% 0.15%
Inconclusive result (value mean confidence interval includes 0).
Now that more is done in the IR, more dependencies are visible and
more SWSB annotations are emitted. Mixed with different register
allocation decisions like above, some shaders will see more `sync
nops` while others able to avoid them.
Most of the new `sync nops` are also redundant and could be dropped,
which will be fixed in a separate change.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
This should have gone away when removing source modifiers. They won't
be set any longer, so this is simply dead code.
Fixes: b7c47c4f7c ("intel/compiler: Drop nir_lower_to_source_mods() and related handling.")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4691>
I think we're unanimous in wanting to drop nir_lower_to_source_mods.
It's a bit of complexity to handle in the backend, but perhaps more
importantly, would be even more complexity to handle in nir_search.
And, it turns out that since we made other compiler improvements in the
last few years, they no longer appear to buy us anything of value.
Summarizing the results from shader-db from this patch:
- Icelake (scalar mode)
Instruction counts:
- 411 helped, 598 hurt (out of 139,470 shaders)
- 99.2% of shaders remain unaffected. The average increase in
instruction count in hurt programs is 1.78 instructions.
- total instructions in shared programs: 17214951 -> 17215206 (<.01%)
- instructions in affected programs: 1143879 -> 1144134 (0.02%)
Cycles:
- 1042 helped, 1357 hurt
- total cycles in shared programs: 365613294 -> 365882263 (0.07%)
- cycles in affected programs: 138155497 -> 138424466 (0.19%)
- Haswell (both scalar and vector modes)
Instruction counts:
- 73 helped, 1680 hurt (out of 139,470 shaders)
- 98.7% of shaders remain unaffected. The average increase in
instruction count in hurt programs is 1.9 instructions.
- total instructions in shared programs: 14199527 -> 14202262 (0.02%)
- instructions in affected programs: 446499 -> 449234 (0.61%)
Cycles:
- 5253 helped, 5559 hurt
- total cycles in shared programs: 359996545 -> 360038731 (0.01%)
- cycles in affected programs: 155897127 -> 155939313 (0.03%)
Given that ~99% of shader-db remains unaffected, and the affected
programs are hurt by about 1-2 instructions - which are all cheap
ALU instructions - this is unlikely to be measurable in terms of
any real performance impact that would affect users.
So, drop them and simplify the backend, and hopefully enable other
future simplifications in NIR.
Reviewed-by: Eric Anholt <eric@anholt.net> [v1]
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4616>
As of the previous commit, they are never used.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4624>
Add new builtin parameters that are used to keep track of the group
size. This will be used to implement ARB_compute_variable_group_size.
The compiler will use the maximum group size supported to pick a
suitable SIMD variant. A later improvement will be to keep all SIMD
variants (like FS) so the driver can select the best one at dispatch
time.
When variable workgroup size is used, the small workgroup optimization
is disabled as it we can't prove at compile time that the barriers
won't be needed.
Extracted from original i965 patch with additional changes by
Caio Marcelo de Oliveira Filho.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
Thanks to the NIR vectorizing pass, we're about to see alignments that
are higher than the bit size. Previously, we could use either and we
just happened to choose alignment (probably the wrong choice) so it's
harmless to switch to detecting based on bit size. This commit changes
things to take both into account which is more accurate to what the
messages we're using do. We also beef up the asserts and make them more
consistent, more accurate, and more complete.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4367>
This is a query that both Intel and freedreno need to do. We can simplify
it a lot with the new glsl_get_sampler_dim_coordinate_components()
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3728>
The other source of the multiply will be interpreted as a uint32_t in an
XOR instruction. Any source modifiers with either not be interpreted at
all or will be misinterpreted due to the differing types.
If the other operand of the multiplication has a source modifier, just
emit an extra move to resolve the source modifiers.
The negation source modifier problem is difficult to reproduce due to an
algebraic optimization that changes (-a*b) to -(a*b). However, changes
in MR !1359 push the negations back down.
On Gen7+ it might be possible to do slightly better for an abs() source
modifier by using BFI2 as a glorified copysign().
On Gen8+ it might be possible to do slightly better for a neg() source
modifier by emitting (~a ^ b).
There were no shader-db changes on any Intel platform, so I think we can
deal with that problem when it arises.
See also piglit!224.
Fixes: 06d2c11641 ("intel/fs: Add a scale factor to emit_fsign")
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3780>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3780>
At this point this simply involves fixing the initialization of the
sample mask flag register to take the right dispatch mask from the
thread payload, and fixing sample_mask_reg() to return f1.1 for the
second half of a SIMD32 thread. This improves Manhattan 3.1
performance by 2.4%±0.31% (N>40) on my ICL with SIMD32 enabled
relative to falling back to SIMD16 for the shaders that use discard.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
FIND_LIVE_CHANNEL was using f1.0-f1.1 as temporary flag register on
Gen7, instead use f0.0-f0.1. In order to avoid collision with the
discard sample mask, move the latter to f1.0-f1.1. This makes room
for keeping track of the sample mask of the second half of SIMD32
programs that use discard.
Note that some MOVs of the sample mask into f1.0 become redundant now
in lower_surface_logical_send() and lower_a64_logical_send().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>x
Use it instead of hard-coding f0.1 for the sample mask of programs
that use discard. This will make the task easier when we replace f0.1
with another flag register location in order to support discard with
SIMD32 shaders.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
v2: Remove smashing type to D for nir_op_irhadd. Caio noticed it was
odd, and removing it fixes an assertion failure in the crucible
func.shader.averageRounded.int64_t test (because the source should be
W).
v3: Emit BRW_OPCODE_MUL directly for nir_op_umul_32x16 and
nir_op_imul_32x16. Suggested by Curro.
v4: Smash types of MUL instruction generated for nir_op_umul_32x16 and
nir_op_imul_32x16. With this change, I get the same assembly now as I
did with v2.
v5: Remove support for pre-Gen7. The integer multiply path was
incorrect, and, since the extension isn't enabled pre-Gen7, there's no
way to test it.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
Gen4/5's rounding instructions operate differently than later Gens'.
They all return the floor of the input and the "Round-increment"
conditional modifier answers whether the result should be incremented by
1.0 to get the appropriate result for the operation (and thus its
behavior is determined by the round opcode; e.g., RNDZ vs RNDE).
Since this requires a second instruciton (a predicated ADD) that
consumes the result of the round instruction, the round instruction
cannot write its result directly to the (write-only) message registers.
By emitting the ADD in the generator, the backend thinks it's safe to
store the round's result directly to the message register file.
To avoid this, we move the emission of the ADD instruction to the NIR
translator so that the backend has the information it needs.
I suspect this also fixes code generated for RNDZ.SAT but since
Gen4/5 don't support GLSL 1.30 which adds the trunc() function, I
couldn't write a piglit test to confirm. My thinking is that if x=-0.5:
sat(trunc(-0.5)) = 0.0
But on Gen4/5 where sat(trunc(x)) is implemented as
rndz.r.f0 result, x // result = floor(x)
// set f0 if increment needed
(+f0) add result, result, 1.0 // fixup so result = trunc(x)
then putting saturate on both instructions will give the wrong result.
floor(-0.5) = -1.0
sat(floor(-0.5)) = 0.0
// +1 increment would be needed since floor(-0.5) != trunc(-0.5)
sat(sat(floor(-0.5)) + 1.0) = 1.0
Fixes: 6f394343b1 ("nir/algebraic: i2f(f2i()) -> trunc()")
Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2355
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3459>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3459>
When there's only one hardware thread (i.e. the dispatch width greater
or equal to the workgroup size), there's no need to use a barrier to
ensure all the invocations reach the same point in the shader, because
they are already running lock-step.
Results for SKL running Iris for shader-db tests with compute shaders
total sends in shared programs: 18361 -> 18339 (-0.12%)
sends in affected programs: 904 -> 882 (-2.43%)
helped: 9
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 2.44 x̃: 2
helped stats (rel) min: 0.84% max: 21.43% x̄: 7.82% x̃: 2.67%
95% mean confidence interval for sends value: -3.31 -1.58
95% mean confidence interval for sends %-change: -14.67% -0.97%
Sends are helped.
Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.
Results for ICL and TGL are similar to SKL.
Results for BDW are similar to SKL except for DeusEx shader that has a
workgroup size 16 but in BDW picks the SIMD8.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
When there's only one hardware thread (i.e. the dispatch width greater
or equal to the workgroup size), there's no need to synchronize shared
memory access (SLM) since all the requests from a single thread are
already synchronized. In such case, we just add a scheduling fence.
To be able to identify that case for all platforms, move the handling
of platforms prior to Gen11 (which don't have a separate SLM fence)
after the optimization.
Results for SKL running Iris for shader-db tests with compute shaders
total sends in shared programs: 18395 -> 18361 (-0.18%)
sends in affected programs: 938 -> 904 (-3.62%)
helped: 9
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 3.78 x̃: 4
helped stats (rel) min: 1.56% max: 26.32% x̄: 10.33% x̃: 2.60%
95% mean confidence interval for sends value: -4.85 -2.71
95% mean confidence interval for sends %-change: -19.12% -1.54%
Sends are helped.
Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.
Results for ICL and TGL are similar to SKL.
Results for BDW are similar to SKL except for DeusEx shader that has a
workgroup size 16 but in BDW picks the SIMD8.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.
Main reason is correctness of SIMD32 fragment shaders. The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32. This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.
Of course one could have chosen to fix those helpers instead, but
there is another far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for
the PLN instruction, which causes incorrect execution masks to be
applied to the MOVs unzipping barycentric vectors in cases where a
LINTERP instruction occurs under non-uniform control flow.
I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().
This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions. On ICL:
total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
instructions in affected programs: 10538474 -> 10343061 (-1.85%)
helped: 68775
HURT: 6
total spills in shared programs: 8938 -> 8748 (-2.13%)
spills in affected programs: 376 -> 186 (-50.53%)
helped: 9
HURT: 5
total fills in shared programs: 8965 -> 8663 (-3.37%)
fills in affected programs: 965 -> 663 (-31.30%)
helped: 9
HURT: 6
LOST: 146
GAINED: 43
On SKL:
total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
instructions in affected programs: 3876590 -> 3765635 (-2.86%)
helped: 27492
HURT: 2
LOST: 191
GAINED: 417
On SNB:
total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
instructions in affected programs: 5199074 -> 4606107 (-11.41%)
helped: 29998
HURT: 0
LOST: 21
GAINED: 30
Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled. On ICL:
total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
instructions in affected programs: 6114788 -> 6027719 (-1.42%)
helped: 42046
HURT: 6
total spills in shared programs: 8218 -> 8028 (-2.31%)
spills in affected programs: 376 -> 186 (-50.53%)
helped: 9
HURT: 5
total fills in shared programs: 8953 -> 8651 (-3.37%)
fills in affected programs: 965 -> 663 (-31.30%)
helped: 9
HURT: 6
LOST: 0
GAINED: 3
On SKL:
total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
instructions in affected programs: 168850 -> 167594 (-0.74%)
helped: 711
HURT: 2
On SNB:
total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
instructions in affected programs: 2702172 -> 2666037 (-1.34%)
helped: 17818
HURT: 0
All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering. Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass. For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
get_nir_image_intrinsic_image() was incorrectly mutating the value held
by the register which holds the intrinsic's first source (image index).
If this happened to be the register for an SSA def which is also used
elsewhere in the program, this meant that we would clobber that value
in subsequent uses.
Note that this only affects i965, because neither anv nor iris use the
binding table start sections, so nothing is ever added here.
Fixes KHR-GL46.compute_shader.resources-max on i965 with Eric Anholt's
MR !3240 applied. That MR reorders SSBOs and ABOs, so that test uses
image 0 and SSBO 0, causing this code to brilliantly add binding table
index 45 to both the image (correct) and the SSBO (bzzt, wrong!).
Fixes: 09f1de97a7 ("anv,i965: Lower away image derefs in the driver")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3404>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3404>
Fixes: b390ff3517 ("intel/fs: Add support for SLM fence in Gen11")
Fixes: e142061399 ("intel/fs: Implement scoped_memory_barrier")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This is a more explicit name now that we don't want it to be doing any
memory barrier stuff for us.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
Right now, it's implemented as a no-op for everyone. For most drivers,
it's a switch case in the NIR -> whatever which just breaks. For ir3,
they already have code to delete tessellation barriers so we just add a
case to also delete memory_barrier_tcs_patch.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
For uniform sample ID, only the first channel of msg_data will be
initialized. We need to pass that component only to the SEND message
for SIMD lowering to unzip the descriptor source correctly.
Fixes several dozens of conformance test failures with SIMD32 fragment
shaders enabled, including:
dEQP-GLES31.functional.shaders.multisample_interpolation.interpolate_at_sample.dynamic_sample_number.*
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
On ICL we have the src1 restriction which is applied through
fix_byte_src() and potentially changes the type of the operands from 8
to 32 bits. When this change happens, we fall into the "else if
(bit_size < 32)" case and miscompute src_type because it takes into
consideration bit_size (8) instead of the adjusted size of temp_op
(32). This results in the shader reading unused memory, giving us
mostly failures, but occasional passes due to whatever was already in
the registers we were reading.
This commit fixes a lot of dEQP subgroup i8vec2 tests on ICL, such as:
dEQP-VK.subgroups.arithmetic.compute.subgroupadd_i8vec2
This can also be verified by simply changing fix_byte_src() to apply
on all platforms.
Fixes: 5847de6e9a ("intel/compiler: don't use byte operands for src1 on ICL")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
This commit fills in a number of different pieces:
1. We add support to brw_nir_lower_mem_access_bit_sizes to handle the
new intrinsics. This involves simple plumbing work as well as a
tiny bit of extra logic to always scalarize scratch intrinsics
2. Add code to brw_fs_nir.cpp to turn nir_load/store_scratch intrinsics
into byte/dword scattered read/write messages which use the A32
stateless model.
3. Add code to lower_surface_logical_send to handle dword scattered
messages and the A32 stateless model.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The bit moved on gen12 in order to prepare for dual-SIMD8 dispatch.
This implementation isn't an entirely complete as it only works on SIMD8
and SIMD16 and not dual-SIMD8.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>