It is currently a bitset on top of a uint64_t but there are already
more than 64 values. Change to use BITSET to cover all the
SYSTEM_VALUE_MAX bits.
Cc: mesa-stable
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8585>
There are two problems this commit solves: First, is that the 64x64 MUL
lowering generates a Q MOV which, because of how late it runs in the
compile pipeline, it never gets removed. Second, it generates 32x32
MULs and we have to run it a second time to lower those.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7329>
Instead of making it a fragment-specific thing based on uses_kill, track
whether or not we need one in fs_visitor and emit HALT_TARGET at the end
of emit_nir_code() if needed.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5071>
This means the pass has to walk all the instructions but it was doing
that in a bunch of cases anyway when it didn't have a HALT_TARGET.
However, removing HALT_TARGET frees up the scheduler a bit because
HALT_TARGET is considered a scheduling barrier. The shader-db results
are kind-of a wash but we're about to add HALT_TARGET unconditionally so
we want to be able to get rid of it.
Shader-db results on Ice Lake:
total instructions in shared programs: 19935623 -> 19935623 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 976758472 -> 976766135 (<.01%)
cycles in affected programs: 11097707 -> 11105370 (0.07%)
helped: 1750
HURT: 875
helped stats (abs) min: 1 max: 866 x̄: 26.39 x̃: 4
helped stats (rel) min: <.01% max: 39.24% x̄: 1.25% x̃: 0.46%
HURT stats (abs) min: 1 max: 1678 x̄: 61.54 x̃: 10
HURT stats (rel) min: <.01% max: 65.69% x̄: 1.86% x̃: 0.42%
95% mean confidence interval for cycles value: -2.48 8.32
95% mean confidence interval for cycles %-change: -0.40% -0.03%
Inconclusive result (value mean confidence interval includes 0).
LOST: 62
GAINED: 46
All of the lost/gained programs are SIMD32 fragment shaders.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5071>
We're about to start using it to implement nir_jump_halt which has
nothing inherently to do with fragment shaders or discards. May as well
name it for the HW instruction it generates.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5071>
The Intel bindless thread dispatch model is very simple. When a compute
shader is to be used for bindless dispatch, it can request a set of
stack IDs. These are allocated per-dual-subslice by the hardware and
recycled automatically when the stack ID is returned. Passed to the
bindless dispatch are a global argument address, a stack ID, and an
address of the BINDLESS_SHADER_RECORD to invoke. When the bindless
shader is dispatched, it is passed its stack ID as well as the global
and local argument pointers. The local argument pointer is the address
of the BINDLESS_SHADER_RECORD plus some offset which is specified as
part of the BINDLESS_SHADER_RECORD.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7356>
Icelake's sampler message header introduces a field in m0.3 bit 0
which controls whether the sampler state pointer should be relative
to bindless sampler state base address or dynamic state base address.
g0.3 bit 0 is part of the per-thread scratch space field. On older
hardware, we were able to copy that along because the sampler ignored
bits 4:0. Now, however, we need to mask them out.
Fixes various textureGatherOffsets piglit tests when forcing the FS
to run with 2048 bytes of per-thread scratch space (which is a
per-thread scratch space encoding of 1, meaning bit 0 will be set).
Cc: mesa-stable
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6735>
Motivation is to detect earlier certain bugs that can occur when
missing a check for the stage before using the downcast.
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7540>
For terminate operation, jump the invocation without predicating on
the rest of the quad being disabled -- which is what is done for
demote and discard.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7150>
cs_prog_data->slm_size is basically redundant with
prog_data->total_shared, which is the field that we actually use for
controlling the shared local memory size in all drivers. We were
still using it in one place for VK_EXT_pipeline_executable_properties,
but we should just fix that and delete the field.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7152>
This opcode is responsible for setting up the buffer base address and
per-thread scratch space fields of a scratch message header. For the
most part, it's a copy of g0 but some messages need us to zero out g0.2
and the bottom bits of g0.5.
This may actually fix a bug when nir_load/store_scratch is used. The
docs say that the DWORD scattered messages respect the per-thread
scratch size specified in gN.3[3:0] in the message header but we've been
leaving it zero. This may mean that we've been ignoring any scratch
reads/writes from a load/store_scratch intrinsic above the 1KB mark.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
In theory, this fixes a bug where we were dropping the PTSS bound on the
floor. The hardware docs claim that the A32 DWORD and BYTE scattered
read/write messages do a PTSS bounds check. However, in practice, it
seems that the hardware ignores the bounds check so this doesn't
actually matter. I verified this with the following couple of piglit
tests:
https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/399
In practice, this prevents the next commit from making a subtle
behavioral change.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7084>
It's included in declaration of INTEL_DEBUG.
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6732>
Scratch stores are being lowered to the instructions with side-effects,
however they should be enabled in fs helper invocations, since they
are produced from operations which don't imply side-effects.
To fix this - we move the decision of whether the sample mask predication
is enable to the point where logical brw instructions are created.
GLSL example of the issue:
int tmp[1024];
...
do {
// changes to tmp
} while (some_condition(tmp))
If `tmp` is lowered to scrach memory, `some_condition` would be
undefined if scratch write is predicated on sample mask, making
possible for the while loop to become infinite and hang the GPU.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3256
Fixes: 53bfcdeecf
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6056>
GEN:BUG:1604601757 prevents source modifiers for multiplication of
lower precision integers.
lower_mul_dword_inst() splits 32x32 multiplication into 32x16, and
needs to eliminate source modifiers in this case.
Closes: #3329
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
If INTEL_DEBUG=no16 is used, then simd16 will not be attempted. This,
in turn prevents simd32 from running, because we attempt to skip
simd32 when simd16 fails to compile.
This change more accurately recognizes when we attempted simd16, but
simd16 failed.
One easy way to cause an issue is to set both no8 and no16. Before
this change, we would be left with no FS program, even though simd32
could still be generated in some cases.
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5269>
If no16 was specified, and the shader can't run in simd8 due to the
local_size, then we need to generate a simd32 program.
If both no8 and no16 are specified, then we need to generate a simd32
program.
Rework:
* Drop update of `if` that would have changed `do32` to try simd32
even if simd16 spilled registers. (Caio)
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5269>
2 source instruction only support immediate for src1 operand, so no
point in adding optimization condition for src0 oprand.
v2:
- Update commit message and don't remove ADD optimization (Matt Turner)
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5341>
Gen9 and Cherryview have the ability to mark texture instructions with
the End-of-thread bit under some conditions, which allows the texture
result to be written to the render target directly, rather than
returning to the EU.
In order to handle overlapping primitives correctly, we have to use the
'sendc' instruction which stalls until other threads potentially writing
to the same locations in the render target are retired. Unfortunately,
this stall happens before the texture is sampled (rather than in
parallel with stall), so for some literal edge cases (like the diagonal
edge between two triangles forming a rectangle) there can be a
performance penalty. As a result, it's probably not a good idea to use
this optimization in general.
I had planned to leave it enabled only for BLORP, where we use rectangle
primitives and are typically clearing/blitting an entire render target
without any overlapping primitives, but I noticed that the optimization
wasn't applied in some normal cases anyway. For example, in the piglit
test tests/shaders/glsl-fs-texture2d-bias.shader_test it is applied to
one BLORP-blit shader but not another due to some kind of mishandling of
register types (the destination register type of the texture operation
is UD while the color source of the render target write is F).
Additionally the instruction scheduler assumed that the combined texture
and render target write operation took 0 cycles, leading to cycle
estimates that are wildly inaccurate. Since the optimization was not
implemented for SIMD32 and our decision whether to use the SIMD32
program is made by comparing the estimated performance with that of the
SIMD16 shader, we wrongly threw out a bunch of SIMD32 programs that are
likely profitable.
total cycles in shared programs: 472807891 -> 473784245 (0.21%)
cycles in affected programs: 108277 -> 1084631 (901.72%)
helped: 0
HURT: 1290
total sends in shared programs: 998955 -> 1000245 (0.13%)
sends in affected programs: 1400 -> 2690 (92.14%)
helped: 0
HURT: 1290
LOST: 0
GAINED: 33
This patch shows no performance changes in Intel's Mesa performance CI.
Given the problems, the lack of evidence that the pass improves
performance, and the fact that the hardware feature was removed from
subsequent GPU generations, I think that the pass is not valuable and
should be removed.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5412>
This will make the GL drivers pick the right SIMD variant for a given
group size set during dispatch. The heuristic implemented in
brw_cs_simd_size_for_group_size() is the same as in brw_compile_cs().
The cs_prog_data::simd_size field was removed. The generated SIMD
sizes are marked in a bitmask, which is already used via
brw_cs_simd_size_for_group_size() by the drivers.
When in variable group size, it is OK if larger SIMD shader spill,
since we'd need it for the cases where the smaller one can't hold all
the invocations.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>