Allowing longer writes reduces the number of send messages needed
to support unaligned 4-component writes.
Note: nothing currently generates 8-component writes, so this change
makes "second_mask" code path in emit_urb_direct_writes and
emit_urb_indirect_writes_mod dead.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20858>
Our hardware requires that we write to URB using full vec4s at aligned
addresses. It gives us an ability to mask-off dwords within vec4 we don't
want to write, but we have to know their positions at compile time.
Let's assume that:
- V represents one dword we want to write
- ? is an unitinitialized value
- "|" is a vec4 boundary.
When we want to write 2-dword value at offset 0 we generate 1 write message:
| V1 V2 ? ? |
with mask:
| 1 1 0 0 |
When we want to write 4-dword value at offset 2 we generate 2 write messages:
| ? ? V1 V2 | V3 V4 ? ? |
with mask:
| 0 0 1 1 | 1 1 0 0 |
However if we don't know the offset within vec4 at *compile time* we
currently generate 4 write messages:
| V1 V1 V1 V1 |
| 0 0 1 0 |
| V2 V2 V2 V2 |
| 0 0 0 1 |
| V3 V3 V3 V3 |
| 1 0 0 0 |
| V4 V4 V4 V4 |
| 0 1 0 0 |
where masks are determined at *run time*.
This is quite wasteful and slow.
However, if we could determine the offset modulo 4 statically at compile time,
we could generate only 1 or 2 write messages (1 if modulo is 0) instead of 4.
This is what this patch does: it analyzes the addressing expression for
modulo 4 value and if it can determine it at compile time, we generate
1 or 2 writes, and if it can't we fallback to the old 4 writes method.
In mesh shader, the value of offset modulo 4 should be known for all outputs,
with an exception of primitive indices.
The modulo value should be known because of MUE layout restrictions, which
require that user per-primitive and per-vertex data start at address aligned
to 8 dwords and we should statically always know the offset from this base.
There can be some cases where the offset from the base is more dynamic
(e.g. indirect array access inside a per-vertex value), so we always do
the analysis.
Primitive indices are an exception, because they form vec3s (for triangles),
which means that the offset will not be easy to analyse.
When U888X index format lands, primitive indices will use only one dword
per triangle, which means that we'll always write them using one message.
Task shaders don't have any predetermined structure of output memory, so
always do the analysis.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20050>
We can lower FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD into other more
generic sends and drop this internal opcode.
The idea behind this change is to allow bindless surfaces to be used
for UBO pulls and why it's interesting to be able to reuse
setup_surface_descriptors(). But that will come in a later change.
No shader-db changes on TGL & DG2.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20416>
The ACP entries created by copy propagation to track the implied
copies of LOAD_PAYLOAD instructions don't model the behavior of
LOAD_PAYLOAD correctly, since (as of 41868bb682) header
moves are implicitly retyped to UD and the destination of non-header
copies implicitly uses the same type as the corresponding source, even
though the ACP entries created for such copies could incorrectly
represent a type conversion, which can lead to mis-optimization of the
program.
According to Marcin, this fixes the func.mesh.ext.workgroup_id.task.q0
crucible test.
Fixes: 41868bb682 ("i965/fs: Rework the fs_visitor LOAD_PAYLOAD instruction")
Reported-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Tested-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18980>
Started showing up when nir_opt_large_constants call was moved in 88756cee8d.
Fixes dEQP-VK.mesh_shader.ext.smoke.monolithic.fullscreen_gradient*
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Fixes: 88756cee8d ("intel/compiler: Run nir_opt_large_constants before scalarizing consts")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20876>
I stumbled on this when I inserted some suboptimal lowering code after all
optimizations. Adding certain subset of optimizations after my lowering code
actually avoided this bug, so I think it's not possible to hit this on upstream.
Let's fix this for the next person generating suboptimal code...
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20857>
I've been running into failures with tests like :
dEQP-VK.robustness.robustness2.bind.notemplate.rgba32i.unroll.nonvolatile.uniform_buffer_dynamic.no_fmt_qual.len_4.samples_1.1d.frag
With the load_global_const_block_intel NIR intrinsic, you can load a
vec8/vec16 with a predicate. The predicate is correctly uniformized to
feed into the SEND instruction's flag register.
The problem is that a series of optimization first remove the
find_live_channel and then changes the broadcast into a simple MOV
instruction, on the assumption that the first channel is always active
if there is not control flow. This is correct.
But after that the cmod optimzation will remove this instruction :
mov.nz.f0.0(16) null:D, vgrf16+0.0<0>:D NoMask
because it seems to be equivalent to :
cmp.g.f0.0(16) vgrf16:D, vgrf12:D, 63d
In this case vgrf16 is the predicate to the load block SEND
instruction. Since the execution mask is different between both, some
of the channels of the SEND instruction end up not being loaded or
loaded with the wrong predication and we end up with incorrect UBO
data.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20852>
NIR atomic operation intrinsics all have destinations. This is just
copy and pasted from other generic intrinsic handling where that may
or may not be the case.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
These are handled identically in almost all cases. There is one place
in the legacy surface lowering that was obtaining the bitsize from the
opcode, but the LSC-based lowering uses (type_sz(inst->dst.type) * 8)
for that and works just fine. If we just do that in the legacy lowering
too, then we don't need this plethora of opcodes.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
These are basically identical save for:
- shared has surface hardcoded to SLM rather than an SSBO index
- shared has to handle adding the 'base' const_index (SSBO have none)
- the NIR source index for data is shifted by one
It's not worth copy and pasting the entire function for this.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
These are now basically identical to their non-float counterparts. The
only thing that differed was the opcode checking to determine which
operands existed. Now that we have a unified opcode enum and a helper
for the number of data operands, we can just use that.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
We already expanded data to 32-bit a few lines earlier, so this is just
redundantly doing it a second time.
Fixes: 43169dbbe5 ("intel/compiler: Support 16 bit float ops")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
The only reason for the separate opcode was because of the overlapping
BRW_AOP_* enums, making it impossible to tell whether a particular AOP
was the integer or float operation. Now that we use the lsc_opcode
enums, we can just have the legacy lowering inspect the opcode and
select the right descriptor. No need for a separate opcode.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
There are a number of places that need to know how many operands an LSC
atomic takes (0 for inc/dec, 1 for most things, 2 for cmpxchg). We can
add a helper for that and eliminate some code (with more to come).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
This gets our logical atomic messages using the lsc_opcode enum rather
than the legacy BRW_AOP_* defines. We have to translate one way or
another, and using the modern set makes sense going forward.
One advantage is that the lsc_opcode encoding has opcodes for both
integer and floating point atomics in the same enum, whereas the legacy
encoding used overlapping values (BRW_AOP_AND == 1 == BRW_AOP_FMAX),
which made it impossible to handle both sensibly in common code.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
There's no need to pass both the intrinsic and an opcode computed from
that same intrinsic. Just do it in the functions themselves.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20604>
Commit c80c0ed943 introduced handling of
SHADER_OPCODE_BROADCAST into inferred_exec_pipe(), but it was already
being handled, drop the redundant handling. Shouldn't lead to any
functional changes.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20543>
The conditional mode field doesn't exist for instructions with a
64-bit immediate, so this would currently print garbage.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20543>
This avoids a violation of the Vulkan memory model that was leading to
intermittent failures of at least 8k test-cases of the Vulkan CTS
(within the group dEQP-VK.memory_model.*) on TGL and DG2 platforms.
In theory the issue may be reproducible on earlier platforms like IVB
and ICL, but the SYNC.ALLWR instruction is not available on those
platforms so a different (likely costlier) fix will be needed.
The issue occurs within the sequence we emit for a NIR memory barrier
with acquire semantics requiring the synchronization of multiple
caches, e.g. in pseudocode for a barrier involving the TGM and UGM
caches on DG2:
x <- load.ugm // Atomic read sequenced-before the barrier
y <- fence.ugm
z <- fence.tgm
wait(y, z)
w <- load.tgm // Read sequenced-after the barrier
In the example we must provide the guarantee that the memory load for
x is completed before the one for w, however this ordering can be
reversed with the intervention of a concurrent thread, since the UGM
fence will block on the prior UGM load and potentially take a long
time, while the TGM fence may complete and invalidate the TGM cache
immediately, so a concurrent thread could pollute the TGM cache with
stale contents for the w location *before* the UGM load has completed,
leading to an inversion of the expected memory ordering.
v2: Apply the workaround regardless of whether the NIR barrier
intrinsic specifies multiple storage classes or a single one,
since an acquire barrier is required to order subsequent requests
relative to previous atomic requests of unknown storage class not
necessarily specified by the memory scope information of the
intrinsic.
Cc: mesa-stable
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20690>
Otherwise build fails:
"../src/intel/compiler/brw_private.h:40:4: note:
‘std::variant’ is only available from C++17 onwards"
Fixes: 6c194ddd18 ("intel/compiler: Prepare SIMD selection helpers to handle different prog_datas")
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20725>
This can't be checked in EU validation because the bits to describe the
base type of the individual sources no longer exist.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20527>
This enabled SIMD32 in blorp shaders and seems to be give a small FPS
bump when using a DG2 GPU as secondary (requires copies to linear
buffers to exchange with main GPU).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19341>
GLSL IR opcodes generated for bitfieldExtract and bitfieldInsert are
lowered by lower_instructions. 4dff3ff005 ("nir/opt_algebraic:
Optimize open coded bfm.") adds an optimization that can rematerialize
nir_op_bfm that was prevented by the GLSL IR lowering.
It appears that every piece of hardware, except older Intel GPUS, that
has real integers (i.e., lower_bitops is not set) also sets
lower_bitfield_extract_to_shifts and lower_bitfield_insert_to_shifts.
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 4dff3ff005 ("nir/opt_algebraic: Optimize open coded bfm.")
Closes: #7874
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20323>
This replaces brw_fs_get_dispatch_enables(), which was added in
b9403b1c47 ("intel: factor out dispatch PS enabling logic"), but this
function will not work well for future changes to 3DSTATE_PS.
So, instead, this moves the related code into a "genX" file which can
directly update 3DSTATE_PS for the given platform.
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20329>
There are a lot of optimizations in opt_algebraic that match ('ine', a,
0), but there are almost none that match i2b. Instead of adding a huge
pile of additional patterns (including variations that include both ine
and i2b), always lower i2b to a != 0.
At this point in the series, it should be impossible for anything to
generate i2b, so there /should not/ be any changes.
The failing test on d3d12 is a pre-existing bug that is triggered by
this change. I talked to Jesse about it, and, after some analysis, he
suggested just adding it to the list of known failures.
v2: Don't rematerialize i2b instructions in dxil_nir_lower_x2b.
v3: Don't rematerialize i2b instructions in zink_nir_algebraic.py.
v4: Fix zink-on-TGL CI failures by calling nir_opt_algebraic after
nir_lower_doubles makes progress. The latter can generate b2i
instructions, but nir_lower_int64 can't handle them (anymore).
v5: Add back most of the hunk at line 2125 of nir_opt_algebraic.py. I
had accidentally removed the f2b(bf2(x)) optimization.
v6: Just eliminate the i2b instruction.
v7: Remove missed i2b32 in midgard_compile.c. Remove (now unused)
emit_alu_i2orf2_b1 function from sfn_instr_alu.cpp. Previously this
function was still used. 🤷
No shader-db changes on any Intel platform.
All Intel platforms had similar results. (Ice Lake shown)
Instructions in all programs: 141165875 -> 141165873 (-0.0%)
Instructions helped: 2
Cycles in all programs: 9098956382 -> 9098956350 (-0.0%)
Cycles helped: 2
The two Vulkan shaders are helped because of the "new" (('b2i32',
('ine', ('ubfe', a, b, 1), 0)), ('ubfe', a, b, 1)) algebraic pattern.
Acked-by: Jesse Natalie <jenatali@microsoft.com> [earlier version]
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Daniel Schürmann <daniel@schuermann.dev> [earlier version]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
In a future commit, nit_type_conversion_op won't be able to handle i2b
(and in a much later commit f2b), so switch many users to the fully
featured function.
No shader-db or fossil-db changes on any Intel platform.
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15121>
Unusual hardware features that require special hanlding usually get a
devinfo field, so do this for MTL's unordered DF types. This will
guarantee that any platform based on MTL (thus inheriting from
MTL_FEATURES) will automatically be handled in these special cases.
v2: s/has_unordered_64bit_float/has_64bit_float_via_math_pipe/ (Curro).
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20072>
In the previous patch we adjusted the scoreboard pass to take into
consideration a new case of unordered operations for TGL. Fix the
decoding as well.
v2: use intel_device_info_is_mtl() (Curro, Jordan)
v3: the part where we export num_sources_from_inst() is now a separate patch
(Curro).
v4: Work around false positive maybe-unitialized warning since Marge
uses -Werror=maybe-uninitialized (Marge).
Reviewed-by: Francisco Jerez <currojerez@riseup.net> (v3)
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20072>
We want to call this from brw_disasm.c, so move it out to brw_eu.c
since it's about to become more of a shared utility function than
something specific to the EU validator.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20072>