Adding this since Iris will handle variable group size parameters by
itself.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
Render Target Array Index has moved from R0.0[26:16] to
R1.1[26:16] on gen12.
Fixes dEQP-VK.multiview.input_attachments.*
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4836>
In Gen11+, when emitting a fence for both L3 and SLM, the generated
code would look like
SEND, MOV (for stall), SEND, MOV (for stall)
This commit change that so two SENDs are emitted before the MOVs for
stall. This is similar to the approach used in Ivy Bridge for the
render fence.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
Instead of emitting the stall MOV "inside" the
SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when
creating the IR.
For IvyBridge, every (data cache) fence is accompained by a render
cache fence, that now is explicit in the IR, two
SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).
Because Begin and End interlock intrinsics are effectively memory
barriers, move its handling alongside the other memory barrier
intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish
if we are going to use a SENDC (for Begin) or regular SEND (for End).
This change is a preparation to allow emitting both SENDs in Gen11+
before we can stall on them.
Shader-db results for IVB (i965):
total instructions in shared programs: 11971190 -> 11971200 (<.01%)
instructions in affected programs: 11482 -> 11492 (0.09%)
helped: 0
HURT: 8
HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1
HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10%
95% mean confidence interval for instructions value: 0.66 1.84
95% mean confidence interval for instructions %-change: 0.01% 0.27%
Instructions are HURT.
Unlike the previous code, that used the `mov g1 g2` trick to force
both `g1` and `g2` to stall, the scheduling fence will generate `mov
null g1` and `mov null g2`. During review it was decided it was not
worth keeping the special codepath for the small effect will have.
Shader-db results for HSW (i965), BDW and SKL don't have a change
on instruction count, but do report changes in cycles count, showing
SKL results below
total cycles in shared programs: 341738444 -> 341710570 (<.01%)
cycles in affected programs: 7240002 -> 7212128 (-0.38%)
helped: 46
HURT: 5
helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154
helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95%
HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362
HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08%
95% mean confidence interval for cycles value: -777.71 -315.38
95% mean confidence interval for cycles %-change: -1.42% -0.83%
Cycles are helped.
This seems to be the effect of allocating two registers separatedly
instead of a single one with size 2, which causes different register
allocation, affecting the cycle estimates.
while ICL also has not change on instruction count but report changes
negative changes in cycles
total cycles in shared programs: 352665369 -> 352707484 (0.01%)
cycles in affected programs: 9608288 -> 9650403 (0.44%)
helped: 4
HURT: 104
helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101
helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49%
HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48
HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45%
95% mean confidence interval for cycles value: 256.67 523.24
95% mean confidence interval for cycles %-change: 0.63% 1.03%
Cycles are HURT.
AFAICT this is the result of the case above.
Shader-db results for TGL have similar cycles result as ICL, but also
affect instructions
total instructions in shared programs: 17690586 -> 17690597 (<.01%)
instructions in affected programs: 64617 -> 64628 (0.02%)
helped: 55
HURT: 32
helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3
helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74%
HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2
HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69%
95% mean confidence interval for instructions value: -2.03 2.28
95% mean confidence interval for instructions %-change: -0.41% 0.15%
Inconclusive result (value mean confidence interval includes 0).
Now that more is done in the IR, more dependencies are visible and
more SWSB annotations are emitted. Mixed with different register
allocation decisions like above, some shaders will see more `sync
nops` while others able to avoid them.
Most of the new `sync nops` are also redundant and could be dropped,
which will be fixed in a separate change.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
This should have gone away when removing source modifiers. They won't
be set any longer, so this is simply dead code.
Fixes: b7c47c4f7c ("intel/compiler: Drop nir_lower_to_source_mods() and related handling.")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4691>
I think we're unanimous in wanting to drop nir_lower_to_source_mods.
It's a bit of complexity to handle in the backend, but perhaps more
importantly, would be even more complexity to handle in nir_search.
And, it turns out that since we made other compiler improvements in the
last few years, they no longer appear to buy us anything of value.
Summarizing the results from shader-db from this patch:
- Icelake (scalar mode)
Instruction counts:
- 411 helped, 598 hurt (out of 139,470 shaders)
- 99.2% of shaders remain unaffected. The average increase in
instruction count in hurt programs is 1.78 instructions.
- total instructions in shared programs: 17214951 -> 17215206 (<.01%)
- instructions in affected programs: 1143879 -> 1144134 (0.02%)
Cycles:
- 1042 helped, 1357 hurt
- total cycles in shared programs: 365613294 -> 365882263 (0.07%)
- cycles in affected programs: 138155497 -> 138424466 (0.19%)
- Haswell (both scalar and vector modes)
Instruction counts:
- 73 helped, 1680 hurt (out of 139,470 shaders)
- 98.7% of shaders remain unaffected. The average increase in
instruction count in hurt programs is 1.9 instructions.
- total instructions in shared programs: 14199527 -> 14202262 (0.02%)
- instructions in affected programs: 446499 -> 449234 (0.61%)
Cycles:
- 5253 helped, 5559 hurt
- total cycles in shared programs: 359996545 -> 360038731 (0.01%)
- cycles in affected programs: 155897127 -> 155939313 (0.03%)
Given that ~99% of shader-db remains unaffected, and the affected
programs are hurt by about 1-2 instructions - which are all cheap
ALU instructions - this is unlikely to be measurable in terms of
any real performance impact that would affect users.
So, drop them and simplify the backend, and hopefully enable other
future simplifications in NIR.
Reviewed-by: Eric Anholt <eric@anholt.net> [v1]
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4616>
As of the previous commit, they are never used.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4624>
Add new builtin parameters that are used to keep track of the group
size. This will be used to implement ARB_compute_variable_group_size.
The compiler will use the maximum group size supported to pick a
suitable SIMD variant. A later improvement will be to keep all SIMD
variants (like FS) so the driver can select the best one at dispatch
time.
When variable workgroup size is used, the small workgroup optimization
is disabled as it we can't prove at compile time that the barriers
won't be needed.
Extracted from original i965 patch with additional changes by
Caio Marcelo de Oliveira Filho.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>
Thanks to the NIR vectorizing pass, we're about to see alignments that
are higher than the bit size. Previously, we could use either and we
just happened to choose alignment (probably the wrong choice) so it's
harmless to switch to detecting based on bit size. This commit changes
things to take both into account which is more accurate to what the
messages we're using do. We also beef up the asserts and make them more
consistent, more accurate, and more complete.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4367>
This is a query that both Intel and freedreno need to do. We can simplify
it a lot with the new glsl_get_sampler_dim_coordinate_components()
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3728>
The other source of the multiply will be interpreted as a uint32_t in an
XOR instruction. Any source modifiers with either not be interpreted at
all or will be misinterpreted due to the differing types.
If the other operand of the multiplication has a source modifier, just
emit an extra move to resolve the source modifiers.
The negation source modifier problem is difficult to reproduce due to an
algebraic optimization that changes (-a*b) to -(a*b). However, changes
in MR !1359 push the negations back down.
On Gen7+ it might be possible to do slightly better for an abs() source
modifier by using BFI2 as a glorified copysign().
On Gen8+ it might be possible to do slightly better for a neg() source
modifier by emitting (~a ^ b).
There were no shader-db changes on any Intel platform, so I think we can
deal with that problem when it arises.
See also piglit!224.
Fixes: 06d2c11641 ("intel/fs: Add a scale factor to emit_fsign")
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3780>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3780>
At this point this simply involves fixing the initialization of the
sample mask flag register to take the right dispatch mask from the
thread payload, and fixing sample_mask_reg() to return f1.1 for the
second half of a SIMD32 thread. This improves Manhattan 3.1
performance by 2.4%±0.31% (N>40) on my ICL with SIMD32 enabled
relative to falling back to SIMD16 for the shaders that use discard.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
FIND_LIVE_CHANNEL was using f1.0-f1.1 as temporary flag register on
Gen7, instead use f0.0-f0.1. In order to avoid collision with the
discard sample mask, move the latter to f1.0-f1.1. This makes room
for keeping track of the sample mask of the second half of SIMD32
programs that use discard.
Note that some MOVs of the sample mask into f1.0 become redundant now
in lower_surface_logical_send() and lower_a64_logical_send().
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>x
Use it instead of hard-coding f0.1 for the sample mask of programs
that use discard. This will make the task easier when we replace f0.1
with another flag register location in order to support discard with
SIMD32 shaders.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
v2: Remove smashing type to D for nir_op_irhadd. Caio noticed it was
odd, and removing it fixes an assertion failure in the crucible
func.shader.averageRounded.int64_t test (because the source should be
W).
v3: Emit BRW_OPCODE_MUL directly for nir_op_umul_32x16 and
nir_op_imul_32x16. Suggested by Curro.
v4: Smash types of MUL instruction generated for nir_op_umul_32x16 and
nir_op_imul_32x16. With this change, I get the same assembly now as I
did with v2.
v5: Remove support for pre-Gen7. The integer multiply path was
incorrect, and, since the extension isn't enabled pre-Gen7, there's no
way to test it.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>
Gen4/5's rounding instructions operate differently than later Gens'.
They all return the floor of the input and the "Round-increment"
conditional modifier answers whether the result should be incremented by
1.0 to get the appropriate result for the operation (and thus its
behavior is determined by the round opcode; e.g., RNDZ vs RNDE).
Since this requires a second instruciton (a predicated ADD) that
consumes the result of the round instruction, the round instruction
cannot write its result directly to the (write-only) message registers.
By emitting the ADD in the generator, the backend thinks it's safe to
store the round's result directly to the message register file.
To avoid this, we move the emission of the ADD instruction to the NIR
translator so that the backend has the information it needs.
I suspect this also fixes code generated for RNDZ.SAT but since
Gen4/5 don't support GLSL 1.30 which adds the trunc() function, I
couldn't write a piglit test to confirm. My thinking is that if x=-0.5:
sat(trunc(-0.5)) = 0.0
But on Gen4/5 where sat(trunc(x)) is implemented as
rndz.r.f0 result, x // result = floor(x)
// set f0 if increment needed
(+f0) add result, result, 1.0 // fixup so result = trunc(x)
then putting saturate on both instructions will give the wrong result.
floor(-0.5) = -1.0
sat(floor(-0.5)) = 0.0
// +1 increment would be needed since floor(-0.5) != trunc(-0.5)
sat(sat(floor(-0.5)) + 1.0) = 1.0
Fixes: 6f394343b1 ("nir/algebraic: i2f(f2i()) -> trunc()")
Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2355
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3459>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3459>
When there's only one hardware thread (i.e. the dispatch width greater
or equal to the workgroup size), there's no need to use a barrier to
ensure all the invocations reach the same point in the shader, because
they are already running lock-step.
Results for SKL running Iris for shader-db tests with compute shaders
total sends in shared programs: 18361 -> 18339 (-0.12%)
sends in affected programs: 904 -> 882 (-2.43%)
helped: 9
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 2.44 x̃: 2
helped stats (rel) min: 0.84% max: 21.43% x̄: 7.82% x̃: 2.67%
95% mean confidence interval for sends value: -3.31 -1.58
95% mean confidence interval for sends %-change: -14.67% -0.97%
Sends are helped.
Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.
Results for ICL and TGL are similar to SKL.
Results for BDW are similar to SKL except for DeusEx shader that has a
workgroup size 16 but in BDW picks the SIMD8.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
When there's only one hardware thread (i.e. the dispatch width greater
or equal to the workgroup size), there's no need to synchronize shared
memory access (SLM) since all the requests from a single thread are
already synchronized. In such case, we just add a scheduling fence.
To be able to identify that case for all platforms, move the handling
of platforms prior to Gen11 (which don't have a separate SLM fence)
after the optimization.
Results for SKL running Iris for shader-db tests with compute shaders
total sends in shared programs: 18395 -> 18361 (-0.18%)
sends in affected programs: 938 -> 904 (-3.62%)
helped: 9
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 3.78 x̃: 4
helped stats (rel) min: 1.56% max: 26.32% x̄: 10.33% x̃: 2.60%
95% mean confidence interval for sends value: -4.85 -2.71
95% mean confidence interval for sends %-change: -19.12% -1.54%
Sends are helped.
Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.
Results for ICL and TGL are similar to SKL.
Results for BDW are similar to SKL except for DeusEx shader that has a
workgroup size 16 but in BDW picks the SIMD8.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.
Main reason is correctness of SIMD32 fragment shaders. The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32. This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.
Of course one could have chosen to fix those helpers instead, but
there is another far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for
the PLN instruction, which causes incorrect execution masks to be
applied to the MOVs unzipping barycentric vectors in cases where a
LINTERP instruction occurs under non-uniform control flow.
I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().
This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions. On ICL:
total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
instructions in affected programs: 10538474 -> 10343061 (-1.85%)
helped: 68775
HURT: 6
total spills in shared programs: 8938 -> 8748 (-2.13%)
spills in affected programs: 376 -> 186 (-50.53%)
helped: 9
HURT: 5
total fills in shared programs: 8965 -> 8663 (-3.37%)
fills in affected programs: 965 -> 663 (-31.30%)
helped: 9
HURT: 6
LOST: 146
GAINED: 43
On SKL:
total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
instructions in affected programs: 3876590 -> 3765635 (-2.86%)
helped: 27492
HURT: 2
LOST: 191
GAINED: 417
On SNB:
total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
instructions in affected programs: 5199074 -> 4606107 (-11.41%)
helped: 29998
HURT: 0
LOST: 21
GAINED: 30
Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled. On ICL:
total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
instructions in affected programs: 6114788 -> 6027719 (-1.42%)
helped: 42046
HURT: 6
total spills in shared programs: 8218 -> 8028 (-2.31%)
spills in affected programs: 376 -> 186 (-50.53%)
helped: 9
HURT: 5
total fills in shared programs: 8953 -> 8651 (-3.37%)
fills in affected programs: 965 -> 663 (-31.30%)
helped: 9
HURT: 6
LOST: 0
GAINED: 3
On SKL:
total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
instructions in affected programs: 168850 -> 167594 (-0.74%)
helped: 711
HURT: 2
On SNB:
total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
instructions in affected programs: 2702172 -> 2666037 (-1.34%)
helped: 17818
HURT: 0
All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering. Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass. For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
get_nir_image_intrinsic_image() was incorrectly mutating the value held
by the register which holds the intrinsic's first source (image index).
If this happened to be the register for an SSA def which is also used
elsewhere in the program, this meant that we would clobber that value
in subsequent uses.
Note that this only affects i965, because neither anv nor iris use the
binding table start sections, so nothing is ever added here.
Fixes KHR-GL46.compute_shader.resources-max on i965 with Eric Anholt's
MR !3240 applied. That MR reorders SSBOs and ABOs, so that test uses
image 0 and SSBO 0, causing this code to brilliantly add binding table
index 45 to both the image (correct) and the SSBO (bzzt, wrong!).
Fixes: 09f1de97a7 ("anv,i965: Lower away image derefs in the driver")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3404>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3404>
Fixes: b390ff3517 ("intel/fs: Add support for SLM fence in Gen11")
Fixes: e142061399 ("intel/fs: Implement scoped_memory_barrier")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This is a more explicit name now that we don't want it to be doing any
memory barrier stuff for us.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
Right now, it's implemented as a no-op for everyone. For most drivers,
it's a switch case in the NIR -> whatever which just breaks. For ir3,
they already have code to delete tessellation barriers so we just add a
case to also delete memory_barrier_tcs_patch.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3307>
For uniform sample ID, only the first channel of msg_data will be
initialized. We need to pass that component only to the SEND message
for SIMD lowering to unzip the descriptor source correctly.
Fixes several dozens of conformance test failures with SIMD32 fragment
shaders enabled, including:
dEQP-GLES31.functional.shaders.multisample_interpolation.interpolate_at_sample.dynamic_sample_number.*
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
On ICL we have the src1 restriction which is applied through
fix_byte_src() and potentially changes the type of the operands from 8
to 32 bits. When this change happens, we fall into the "else if
(bit_size < 32)" case and miscompute src_type because it takes into
consideration bit_size (8) instead of the adjusted size of temp_op
(32). This results in the shader reading unused memory, giving us
mostly failures, but occasional passes due to whatever was already in
the registers we were reading.
This commit fixes a lot of dEQP subgroup i8vec2 tests on ICL, such as:
dEQP-VK.subgroups.arithmetic.compute.subgroupadd_i8vec2
This can also be verified by simply changing fix_byte_src() to apply
on all platforms.
Fixes: 5847de6e9a ("intel/compiler: don't use byte operands for src1 on ICL")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
This commit fills in a number of different pieces:
1. We add support to brw_nir_lower_mem_access_bit_sizes to handle the
new intrinsics. This involves simple plumbing work as well as a
tiny bit of extra logic to always scalarize scratch intrinsics
2. Add code to brw_fs_nir.cpp to turn nir_load/store_scratch intrinsics
into byte/dword scattered read/write messages which use the A32
stateless model.
3. Add code to lower_surface_logical_send to handle dword scattered
messages and the A32 stateless model.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The bit moved on gen12 in order to prepare for dual-SIMD8 dispatch.
This implementation isn't an entirely complete as it only works on SIMD8
and SIMD16 and not dual-SIMD8.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
flrp was forgotten when already adding the rounding mode for other
instructions.
Fixes: ba1e25e1aa ("i965/fs: set rounding mode when emitting fadd, fmul and ffma instructions")
Suggested-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
After
1711bf6cf2 ("intel/fs: Generate better code for fsign multiplied by a value"),
the conflicts resolution for setting the rounding mode after the
fused fmul and fsign optimization is non obvious.
Basically, the optimization doesn't really result in a MUL, or any
other operation which would need to have the rounding mode set. Hence,
we set it just before the actual MUL in the treatment of fmul.
Fixes: ba1e25e1aa ("i965/fs: set rounding mode when emitting fadd, fmul and ffma instructions")
Suggested-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Because byte immediates aren't a thing on GEN hardware, we return a
signed or unsigned word immediate in the byte case.
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
I would like for iris to be able to avoid setting up SURFACE_STATE
for UBOs in the common case where all constants are pushed.
Unfortunately, we don't know up front whether everything will be
pushed: the backend is allowed to demote pushed UBOs to pull loads
fairly late in the process. This is probably desirable though, as
we'd like the backend to be able to re-pull pushed data to break up
long live ranges in response to register pressure.
Here we simply add a "are there any pull loads at all" boolean to
prog_data, which is a bit crude but at least allows us to skip work
in the common "everything pushed" case. We could skip more work by
tracking exactly which UBO surfaces are pulled in a bitmask, but I
wanted to avoid bringing back the old mark_surface_used() mechanism.
Finer-grained tracking could allow us to skip a bit more work when
multiple UBOs are in use and /some/ are 100% pushed, but others are
accessed via pulls. However, I'm not sure how common this is and
it would save at most 4 pull descriptors, so we defer that for now.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
v2:
- Consider nir_op_f2f16 case too (Caio).
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
v2:
- Updated to renamed shader info member (Andres).
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We need this function to emit code that setups the control register
later with the defined execution mode for the shader. Therefore, we
emit it as the first instruction.
v2:
- Fix bug in setting the default mode mask in brw_rnd_mode_from_nir().
- Fix support for rounding modes in brw_rnd_mode_from_nir().
v3:
- Updated to renamed shader info member and enum values (Andres).
v4:
- Add actual emission as first instruction of emit_nir_code (Caio).
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The denorm mode is set in the control register, no need to do
something else.
v2:
- Add an assert to make sure that we realize if this assumption is
broken in the future (Caio).
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This reverts commit c0504569ea. Now that
we're doing interpolation lowering in NIR, we can continue to stride the
FS input registers directly in the brw_fs_nir code like we did before.
This fixes SIMD32 fragment shaders which broke because lower_simd_width
depended on the 0 stride to split PLN instructions correctly.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Leftover from 021fa28163 ("xintel/nir: Add a helper for getting
BRW_AOP from an intrinsic").
Acked-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This better matches all the other atomic intrinsics such as those for
SSBOs and shared variables where the sign is part of the intrinsic
opcode. Both generators (GLSL and SPIR-V) know the sign from the type
of the image variable or handle. In SPIR-V, signed min/max are separate
opcodes from unsigned.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>