This will effectively enable the optimization in anv.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This allows us to drop legacy userclip plane handling in both the vec4
and FS backends, and simplifies a few interfaces.
v2 (Jason Ekstrand):
- Move brw_nir_lower_legacy_clipping to brw_nir_uniforms.cpp because
it's i965-specific.
- Handle adding the params in brw_nir_lower_legacy_clipping
- Call brw_nir_lower_legacy_clipping from brw_codegen_vs_prog
Co-authored-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The rules for gl_SubgroupSize in Vulkan require that it be a constant
that can be queried through the API. However, all GL requires is that
it's a uniform. Instead of always claiming in GL that the subgroup
size in the shader is 32, as we have to do for Vulkan, claim 8 for
geometry stages, the maximum for fragment shaders, and the actual size
for compute.
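A minimal sketch of the resulting policy (the helper name is invented
for illustration; only the widths come from the text above):

  static unsigned
  gl_uniform_subgroup_size(gl_shader_stage stage, unsigned cs_simd_width)
  {
     switch (stage) {
     case MESA_SHADER_FRAGMENT:
        return 32;              /* the maximum SIMD width we may pick */
     case MESA_SHADER_COMPUTE:
        return cs_simd_width;   /* the actual dispatch width */
     default:
        return 8;               /* geometry stages */
     }
  }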
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Instead of lowering the subgroup size so early, wait until we have more
information. In particular, we're going to want different subgroup
sizes from different stages depending on the API. We also defer
lowering of subgroup masks because the ge/gt masks require the subgroup
size to generate a subgroup mask.
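To see why the ge/gt masks need the size while lt/le do not, here is
the arithmetic as a hedged C sketch (i = invocation index, n =
subgroup size; both are at most 32 here, so the shifts are
well-defined):

  uint64_t size_mask = (1ull << n) - 1;
  uint64_t lt_mask = (1ull << i) - 1;          /* independent of n */
  uint64_t le_mask = lt_mask | (1ull << i);    /* independent of n */
  uint64_t ge_mask = ~lt_mask & size_mask;     /* requires n */
  uint64_t gt_mask = ge_mask & ~(1ull << i);   /* requires n */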
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
warning: struct 'fs_reg' was previously declared as a class
Fixes: e64be391 ("intel/compiler: generalize the combine constants pass")
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andrii Simiklit <andrii.simiklit@globallogic.com>
Normally, we haven't worried too much about stack sizes as Linux tends
to be fairly friendly towards large stacks. However, when running DXVK
apps under wine, we're suddenly subject to Windows' more stringent stack
limitations and can run out of space more easily. In particular, some
of the shaders in Elite Dangerous: Horizons have quite a few registers
and the arrays in split_virtual_grfs are large enough to blow a 1 MiB
stack, leading to crashes during shader compilation.
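The fix, in spirit, is to move those large per-register tables off the
stack. A hedged sketch, not the actual code (identifiers approximate):

  /* Before: "bool split_points[reg_count];" as a stack array. */
  bool *split_points = (bool *) calloc(reg_count, sizeof(bool));
  /* ... find split points and rewrite the virtual GRFs ... */
  free(split_points);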
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108662
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
In many cases, the compiler can just copy-prop the strided MOV whereas
the conversion is a bit trickier. This cuts 5% of the instructions off
of one particular Vulkan CTS test which does lots of load_ssbo.
Reviewed-by: Matt Turner <mattst88@gmail.com>
This fixes some validation errors generated by certain D->W conversions
but is likely not a full solution. Calculating an actual register
stride is a far more complex problem in general and should probably be
handled by the brw_fs_generator.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Now that the 64-bit lowering passes do a complete lowering in one go, we
don't need to loop anymore. We do, however, have to ensure that int64
lowering happens after double lowering because double lowering can
produce int64 ops.
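The resulting pass order, roughly (a sketch; the exact signatures of
these passes vary between branches, and softfp64 is the optional
software-fp64 helper shader):

  /* fp64 lowering first: it can produce new 64-bit integer ops... */
  NIR_PASS(progress, nir, nir_lower_doubles, softfp64,
           nir->options->lower_doubles_options);
  /* ...which the int64 lowering then picks up. */
  NIR_PASS(progress, nir, nir_lower_int64,
           nir->options->lower_int64_options);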
Reviewed-by: Eric Anholt <eric@anholt.net>
We need this when doing full software 64-bit emulation.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110309
Fixes: cbad201c2b "nir/algebraic: Add missing 64-bit extract_[iu]8..."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
For bindless SSBO access, we have to do 64-bit address calculations. On
ICL and above, we don't have 64-bit integer support so we have to lower
the address calculations to 32-bit arithmetic. If we don't run the
optimization loop before lowering, we won't fold any of the address
chain calculations before lowering 64-bit arithmetic and they aren't
really foldable afterwards. This cuts the size of the generated code in
the compute shader in dEQP-VK.ssbo.phys.layout.random.16bit.scalar.13 by
around 30%.
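The shape of the fix, as a hedged sketch (brw_nir_optimize's real
signature may differ across branches):

  nir = brw_nir_optimize(nir, compiler, is_scalar, false);
  /* Only now lower the 64-bit arithmetic: the address chains have
   * already been constant-folded and copy-propagated above. */
  if (nir_lower_int64(nir, nir->options->lower_int64_options))
     nir = brw_nir_optimize(nir, compiler, is_scalar, false);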
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Commit c8665005 ("intel/compiler: Don't always require precise lowering
of flrp") forgot to remove some comments that no longer applied after
the change.
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
A couple patches later in this series use the flag to avoid a few
thousand shader-db regressions on all vec4 platforms.
I'm not particularly enamored with the name of this flag. However, I
suspect the Intel vec4 backend is the only backend that will benefit
from it. Specifically, the cases where this helps are all cases where
we want to prevent nir_opt_algebraic from rearranging instructions to
create 3-source instructions, such as ffma and flrp, with additional
immediate value or uniform sources.
The earlier commit "intel/vec4: Try to emit a single load for multiple
3-src instruction operands" solves most of the problems caused by
additional immediate values, but the restrictions on register strides
that cause problems for uniforms and shader inputs persist.
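For reference, the open-coded form those patches match follows from

  flrp(-1.0, 1.0, a) = -1.0 * (1 - a) + 1.0 * a = 2.0 * a - 1.0

so an expression like fadd(fmul(2.0, a), -1.0) collapses into a single
flrp, and similarly for flrp(1, -1, a) with the signs swapped.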
Reviewed-by: Matt Turner <mattst88@gmail.com>
If a 3-source instruction uses immediate values 1.0 and -1.0, just load
1.0 into a register. Use the negation source modifier to get -1.0.
This has trivial impact now, but it prevents a few thousand regressions
on vec4 platforms with "nir/algebraic: Recognize open-coded flrp(-1, 1,
a) and flrp(1, -1, a)".
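An illustration in the spirit of the IR dumps elsewhere in this log
(simplified, not real compiler output): instead of materializing both
constants,

  mov(8) vgrf1:F, 1.0F
  mov(8) vgrf2:F, -1.0F

we now emit only the first MOV and let the 3-source instruction read
its other operand as -vgrf1:F via the negation source modifier.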
All Gen6 and Gen7 platforms had similar results. (Haswell shown)
total instructions in shared programs: 13487412 -> 13487406 (<.01%)
instructions in affected programs: 541 -> 535 (-1.11%)
helped: 6
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 0.36% max: 2.08% x̄: 1.65% x̃: 1.80%
95% mean confidence interval for instructions value: -1.00 -1.00
95% mean confidence interval for instructions %-change: -2.33% -0.97%
Instructions are helped.
total cycles in shared programs: 376402564 -> 376402500 (<.01%)
cycles in affected programs: 10348 -> 10284 (-0.62%)
helped: 10
HURT: 1
helped stats (abs) min: 2 max: 26 x̄: 7.00 x̃: 2
helped stats (rel) min: 0.13% max: 2.05% x̄: 0.89% x̃: 0.79%
HURT stats (abs) min: 6 max: 6 x̄: 6.00 x̃: 6
HURT stats (rel) min: 0.29% max: 0.29% x̄: 0.29% x̃: 0.29%
95% mean confidence interval for cycles value: -11.72 0.08
95% mean confidence interval for cycles %-change: -1.20% -0.36%
Inconclusive result (value mean confidence interval includes 0).
No shader-db changes on any other Intel platform.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Gen11 SLM is not on L3 anymore, so now the hardware has two separate
fences. Add a way to control which fence types to use.
At this time, we don't have enough information in NIR to control the
visibility of the memory being fenced, so for now be conservative and
assume that fences will need a stall. With more information later
we'll be able to reduce those stalls.
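A hedged sketch of the v3 shape noted below, splitting the decision
from the emission (variable and BTI names hypothetical):

  /* Decide which fence types this barrier needs. */
  const bool slm_fence = devinfo->gen >= 11 && stage_uses_slm;
  const bool l3_fence  = true;   /* conservative: fence + stall */

  /* Emit one SHADER_OPCODE_MEMORY_FENCE per fence type. */
  if (l3_fence)
     bld.emit(SHADER_OPCODE_MEMORY_FENCE, tmp, brw_imm_ud(bti_l3));
  if (slm_fence)
     bld.emit(SHADER_OPCODE_MEMORY_FENCE, tmp, brw_imm_ud(bti_slm));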
Fixes Vulkan CTS tests in ICL:
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.workgroup.guard_local.buffer.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_local.image.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.buffer.guard_nonlocal.workgroup.comp
dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.workgroup.payload_local.image.guard_nonlocal.workgroup.comp
The whole set of supported tests in the dEQP-VK.memory_model.* group
should now pass on ICL.
v2: Pass BTI around instead of having an enum. (Jason)
Emit two SHADER_OPCODE_MEMORY_FENCE instead of one that gets
transformed into two. (Jason)
List tests fixed. (Lionel)
v3: For clarity, split the decision of which fences to emit from the
emission code. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Right now, all keys have two things in common: a program string ID and a
sampler_prog_key_data. I'd like to add another thing or two and need a
place to put it. This commit adds a new brw_base_prog_key struct which
contains those two common bits.
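Roughly the new layout (a sketch; field names per the description
above):

  struct brw_base_prog_key {
     unsigned program_string_id;
     struct brw_sampler_prog_key_data tex;
  };

  /* Stage keys then embed it as their first member, e.g.: */
  struct brw_wm_prog_key {
     struct brw_base_prog_key base;
     /* ... stage-specific fields ... */
  };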
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Effectively unused since dd7135d55d ("intel/compiler: Use the flrp
lowering pass for all stages on Gen4 and Gen5"). I had intended to
remove this code as part of that series, but I forgot.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Previously, an instruction like
mul(8) vgrf29.xy:F, vgrf25.yxxx:F, [-1F, 1F, 0F, 0F]
would get rewritten as
mul(8) vgrf0.yz:F, vgrf25.yyxx:F, [-1F, 1F, 0F, 0F]
The latter does not produce the correct result. The VF immediate in the
second should be either [-1F, -1F, 1F, 1F] or [0F, -1F, 1F, 0F]. This
commit produces the former.
Fixes: 1ee1d8ab46 ("i965/vec4: Reswizzle sources when necessary.")
Reviewed-by: Matt Turner <mattst88@gmail.com>
The "demote" intrinsic works like "discard" but don't change the
control flow, allowing derivative operations to work. This is the
semantics of D3D discard.
The "is_helper_invocation" intrinsic will return true for helper
invocations -- both the ones that started as helpers and the ones that
were demoted. This is needed to avoid changing the behavior of
gl_HelperInvocation which is an input (so not expected to change
during shader execution).
v2: Emit the discard jump and comment why it is safe. (Jason)
Rework the is_helper_invocation() that was stomping f0.1. (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Pretty much every driver using nir_lower_io_to_temporaries followed by
nir_lower_io is going to want this. In particular, radv and radeonsi in
the next commits.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
If they never get used, dead code elimination should clean them up.
Also, we rework
the at_offset and at_sample intrinsics so they return a proper vec2
instead of returning things in PLN layout. Fortunately, copy-prop is
pretty good at cleaning this up and it doesn't result in any actual
extra MOVs.
Reviewed-by: Matt Turner <mattst88@gmail.com>
v2: 1) Drop changes for vec4 backend as on Gen11+ we don't support
align16 mode (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
The simulator complains about using byte operands, and we also have
documentation telling us not to. Note that add operations on bytes
(like ADD) seem to work fine on HW.
Using dword operands with CMP & SEL fixes the following tests:
dEQP-VK.spirv_assembly.type.vec*.i8.*
v2: Drop the GLK changes (Matt)
Add validator tests (Matt)
v3: Drop GLK ref (Matt)
Don't mix float/integer in MAD (Matt)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com> (v1)
Reviewed-by: Matt Turner <mattst88@gmail.com>
BSpec: 3017
Cc: <mesa-stable@lists.freedesktop.org>
This rewrites the ddy in EXECUTE_4 mode with a loop to make it more
obvious what is going on, and also sets the group each of the 4
instructions is supposed to execute on.
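The shape of the loop, as a hedged sketch (operand setup elided and
helper names approximate):

  for (unsigned g = 0; g < 4; g++) {
     const fs_builder gbld = bld.group(4, g);  /* one 4-wide group each */
     /* row0/row1: the two rows of this group's quads, set up earlier */
     fs_reg neg_row0 = row0;
     neg_row0.negate = true;                   /* compute row1 - row0 */
     gbld.ADD(offset(dst, bld, g), row1, neg_row0);
  }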
Fixes the following CTS tests:
dEQP-VK.glsl.derivate.dfdyfine.dynamic_*
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Co-Authored-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 2134ea3800 ("intel/compiler/fs: Implement ddy without using align16 for Gen11+")
The main motivation for this change is API ergonomics: most operations
on dynarrays are really on elements, not on bytes, so it's weird to have
grow and resize as the odd operations out.
The secondary motivation is memory safety. Users of the old byte-oriented
functions would often multiply a number of elements with the element size,
which could overflow, and checking for overflow is tedious.
With this change, we only need to implement the overflow checks once.
The checks are cheap: since eltsize is a compile-time constant and the
functions should be inlined, they only add a single comparison and an
unlikely branch.
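A hedged sketch of the element-oriented pattern (names approximate;
the real helpers live in src/util/u_dynarray.h, and unlikely() comes
from util/macros.h):

  static inline void *
  util_dynarray_grow_elts(struct util_dynarray *buf,
                          unsigned nelts, size_t eltsize)
  {
     /* eltsize is a compile-time constant at each call site, so this
      * folds down to a single compare plus an unlikely branch. */
     if (unlikely(nelts > (UINT_MAX - buf->size) / eltsize))
        return NULL;   /* over-size request: the operation is a no-op */
     /* the unchecked byte-level core (hypothetical name) */
     return grow_bytes_unchecked(buf, nelts * eltsize);
  }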
v2:
- ensure operations are no-op when allocation fails
- in util_dynarray_clone, call resize_bytes with a compile-time constant element size
v3:
- fix iris, lima, panfrost
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The other sources of the bcsel behave like the sources of an AND or
other logical operation. However, source zero behaves differently.
It is evaluated as a Boolean, so it needs to be resolved.
No shader-db changes, but the tests mentioned in the bug get a couple
instructions added back.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110857
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is the same as the need_dest parameter to
prepare_alu_destination_and_sources. This allows us to not change the
register that is expected to hold a result if an instruction is
re-emitted. This is particularly a problem if the re-emitted
instruction is a partial write. A later patch will use this feature.
No shader-db changes on any Intel platform.
v2: Don't do the Boolean resolve when there is no destination. If the
ALU instruction didn't write a register, there's nothing to resolve.
This replaces an earlier patch "intel/fs: Allocate dummy destination
register when need_dest is false".
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
There were two errors. First, the pass could propagate conditional
modifiers from an instruction that writes one flag register to an
instruction that writes a different flag register. For example,
cmp.nz.f0.0(16) null:F, vgrf6:F, vgrf5:F
cmp.nz.f0.1(16) null:F, vgrf6:F, vgrf5:F
could become
cmp.nz.f0.0(16) null:F, vgrf6:F, vgrf5:F
Second, if an instruction that writes f0.1 has its condition propagated,
the
modified instruction will incorrectly write flag f0.0. For example,
linterp(16) vgrf6:F, g2:F, attr0:F
cmp.z.f0.1(16) null:F, vgrf6:F, vgrf5:F
(-f0.1) discard_jump(16) (null):UD
could become
linterp.z.f0.0(16) vgrf6:F, g2:F, attr0:F
(-f0.1) discard_jump(16) (null):UD
None of these cases will occur currently. The only time we use f0.1 is
for generating discard intrinsics. In all those cases, we generate a
sequence like:
cmp.nz.f0.0(16) vgrf7:F, vgrf6:F, vgrf5:F
(+f0.1) cmp.z(16) null:D, vgrf7:D, 0d
(-f0.1) discard_jump(16) (null):UD
Due to the mixed types and incompatible conditions, this sequence would
never see any cmod propagation. The next patch will change this.
No shader-db changes on any Intel platform.
v2: Fix typo in comment in test case subtract_delete_compare_other_flag.
Noticed by Caio.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tests like this should have been added in 4467040cb6 ("i965/fs:
Propagate conditional modifiers from not instructions").
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
We originally had a single lower_fmod option. In commit 2ab2d2e5, Sam
split 32 and 64-bit lowering into separate flags, with the rationale
that some drivers might want different options there. This left 16-bit
unhandled, so Iago added a lower_fmod16 option in commit ca31df6f.
Now that lower_fmod64 is gone (in favor of nir_lower_doubles and
nir_lower_dmod), we re-combine lower_fmod16 and lower_fmod32 into a
single lower_fmod flag again. I'm not aware of any hardware which
needs lowering for one bitsize and not the other.
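For reference, the lowering guarded by the flag is the standard
identity

  fmod(a, b) = a - b * floor(a / b)

now applied to both 16- and 32-bit fmod when the flag is set.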
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
nir_lower_doubles offers a wide variety of fp64 lowering, including
lowering fmod@64. The version there also better handles imprecisions
due to lowered frcp@64. Let's consolidate on one version.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Since NIR_PASS no longer swaps out the NIR pointer when NIR_TEST_* is
enabled, we can just take a single pointer and not a pointer to pointer.
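A hypothetical before/after of the pattern (pass name invented for
illustration):

  /* Before, helpers took a pointer to pointer so that NIR_TEST_*
   * could swap in the cloned/deserialized shader:
   *    void lower_foo(nir_shader **nir);
   * now a plain pointer suffices: */
  void lower_foo(nir_shader *nir);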
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Now that NIR_TEST_* doesn't swap the shader out from under us, it's
sufficient to just modify the shader rather than having to return it in
case we're testing serialization or cloning.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>