load_ubo uses 32-bit loads, as uniform surfaces have a 32-bit
surface format defined. So when reading 16-bit components with the
sampler we need to unshuffle two 16-bit components from each 32-bit
component.
Using the sampler avoids the use of the byte_scattered_read message
that needs one message for each component and is supposed to be
slower.
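As a rough illustration of the unshuffling (a standalone sketch with
made-up names, not the actual lowering code): each 32-bit dword returned
by the sampler packs two 16-bit components, so component i is recovered
from dword i/2 with a shift and a mask.

    #include <cstdint>

    // Illustrative only: extract 16-bit component 'i' from data that was
    // loaded as packed 32-bit dwords.
    static uint16_t
    unshuffle_16bit_component(const uint32_t *dwords, unsigned i)
    {
       uint32_t dword = dwords[i / 2];
       // Even components sit in the low 16 bits, odd ones in the high 16.
       return (i & 1) ? uint16_t(dword >> 16) : uint16_t(dword & 0xffff);
    }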
v2: (Jason Ekstrand)
- Simplify component selection and unshuffling for different bitsizes
- Remove SKL optimization of reading only two 32-bit components when
reading 16-bit types.
Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
v2: Fix alignment style (Topi Pohjolainen)
(Jason Ekstrand)
- Enable a bit_size parameter on scattered messages to support different
bit sizes: byte/word/dword.
- Remove use of brw_send_indirect_scattered_message in favor of
brw_send_indirect_surface_message.
- Move scattered messages to surface messages namespace.
- Assert align1 for scattered messages and assume Gen8+.
- Inline brw_set_dp_byte_scattered_read.
v3: (Jason Ekstrand)
- Use renamed brw_byte_scattered_data_element_from_bit_size method
- Assert scattered read for Gen8+ and Haswell.
- Use a conditional expression in components_read.
- Include comment about params for scattered opcodes.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
v2: (Jason Ekstrand)
- Enable a bit_size parameter on scattered messages to support different
bit sizes: byte/word/dword.
- Remove use of brw_send_indirect_scattered_message in favor of
brw_send_indirect_surface_message.
- Move scattered messages to surface messages namespace.
- Assert align1 for scattered messages and assume Gen8+.
- Inline brw_set_dp_byte_scattered_write.
v3: - Remove leftover newline (Topi Pohjolainen)
- Rename brw_data_size to brw_scattered_data_element and use
defines instead of an enum (Jason Ekstrand)
- Assert scattered write for Gen8+ and Haswell (Jason Ekstrand)
Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Although from the SPIR-V point of view rounding modes are attached to the
operation/destination, on i965 the rounding mode is a piece of state, so
we don't need to explicitly set it if the one we want is already set.
Taking into account that the default mode is RTE, one possible
optimization would be to optimize out the first RTE set in each
block. For that to work, we would need to take block
interrelationships into account. At this point, it is not worth
complicating the optimization for such a small gain.
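A minimal standalone sketch of the idea (hypothetical names, not the
actual pass): track the last rounding mode set in the current block and
only emit SHADER_OPCODE_RND_MODE when it changes, resetting the tracked
state at every block boundary.

    // Illustrative sketch: per-block tracking of the rounding-mode state.
    enum rnd_mode { RND_UNKNOWN, RND_RTE, RND_RTZ };

    struct rnd_mode_tracker {
       rnd_mode current = RND_UNKNOWN;

       // Returns true if a SHADER_OPCODE_RND_MODE actually has to be emitted.
       bool set(rnd_mode wanted)
       {
          if (current == wanted)
             return false;   // already in the requested mode, skip the set
          current = wanted;
          return true;
       }

       // Called at every block boundary: we no longer know the current mode.
       void reset() { current = RND_UNKNOWN; }
    };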
v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
with the rounding mode (Curro)
v3: Reset optimization for every block. (Jason Ekstrand)
Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We're going to want subgroup ID for SPIR-V subgroups eventually anyway.
We really only want to push one and calculate the other from it. It
makes a bit more sense to push the subgroup ID because it's simpler to
calculate and because it's a real API thing. The only advantage to
pushing the base thread ID is to avoid a single SHL in the shader.
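For illustration (hypothetical helper, not driver code): the base thread
ID is just the subgroup ID scaled by the SIMD width, so deriving it in
the shader costs a single SHL because the width is a power of two.

    #include <cstdint>

    // Illustrative only: relation between the pushed subgroup ID and the
    // base thread-local invocation ID.
    static uint32_t
    base_thread_local_id(uint32_t subgroup_id, uint32_t simd_width)
    {
       // simd_width is 8, 16 or 32, so this multiply is a single SHL.
       return subgroup_id * simd_width;
    }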
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
With the advent of SPIR-V subgroup operations, compute shaders will have
to be slightly different depending on the SIMD size at which they
execute. In order to allow us to do dispatch-width specific things in
NIR, we re-run the final NIR stages for each SIMD width.
One side-effect of this change is that we start rallocing fs_visitors
which means we need DECLARE_RALLOC_CXX_OPERATORS.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Previously, brw_nir_lower_intrinsics added the param and then emitted a
load_uniform intrinsic to load it directly. This commit switches things
over to use a specific NIR intrinsic for the thread id. The one thing I
don't like about this approach is that we have to copy thread_local_id
over to the new visitor in import_uniforms.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
This isn't often a problem, but when we're in a compute shader we must
push the thread-local ID, so we decrement the amount of available push
space by 1; it's no longer even, and 64-bit data can, in theory, span
the push/pull boundary. By marking those uniforms contiguous, we ensure that they never get
split in half between push and pull constants.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
The only things that adjust fs_visitor::max_dispatch_width are render
target writes, which don't happen in compute shaders, so the
max_dispatch_width checks are pointless there.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
It's 8 for everything except compute shaders. For compute shaders,
there's no need to duplicate the computation and it's just a possible
source of error.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Before, we were bailing in assign_constant_locations based on the minimum
dispatch size. The more direct thing to do is simply to check for
whether or not we have constant locations and bail if we do. For
nir_setup_uniforms, it's completely safe to do it multiple times because
we just copy a value from the NIR shader.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Some hardware (CHV, BXT) has special restrictions on register regions
when doing integer multiplication. We want to respect those when we
lower to DxW multiplication.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
Before, we were careful to place the zip after the last of the split
instructions but did unzip on-demand. This changes things so that the
unzips go before all of the split instructions and the zip comes
explicitly after all the split instructions. As a side-effect of this
change, we now emit the split instruction from highest SIMD group to
lowest instead of low to high. We could have kept the old behavior, but
it shouldn't matter and this made the code easier.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
This makes it far more explicit where we're inserting the instructions
rather than the magic "before and after" stuff that the emit_[un]zip
helpers did based on block and inst.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
Thanks to the ralloc invariant of "any pointer returned from ralloc can
be used as a context", calling ralloc_size with a size of zero will
cause it to allocate at least a header. If we don't have any push
constants, then NULL is perfectly acceptable (and even preferred).
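A sketch of the preferred pattern (hypothetical helper; ralloc_array is
the real ralloc.h macro): skip the allocation entirely when the count is
zero, so the "no push constants" case is represented by NULL instead of a
zero-sized ralloc that still carries a header.

    #include <cstdint>
    #include "util/ralloc.h"

    // Illustrative only: return NULL rather than ralloc'ing an empty array.
    static uint32_t *
    alloc_push_constants(void *mem_ctx, unsigned count)
    {
       if (count == 0)
          return NULL;   // nothing to push; NULL is preferred over a 0-byte block

       return ralloc_array(mem_ctx, uint32_t, count);
    }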
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
It doesn't actually matter since the only user of push constants, i965,
ralloc_steals it back to NULL but it's more consistent and probably
fixes memory leaks in some error cases.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Cc: mesa-stable@lists.freedesktop.org
The caller can now use brw_stage_prog_data::program_size which is set
by the brw_compile_* functions.
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This will be used by the on disk shader cache.
v2:
* Set in brw_compile_* rather than brw_codegen_*. (Jason)
Signed-off-by: Timothy Arceri <timothy.arceri@collabora.com>
[jordan.l.justen@intel.com: Only add to brw_stage_prog_data]
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This patch uses mem_ctx for the allocation to ensure the param array gets freed
later.
==6164== 48 bytes in 1 blocks are definitely lost in loss record 61 of 193
==6164== at 0x4C2EB6B: malloc (vg_replace_malloc.c:299)
==6164== by 0x12E31C6C: ralloc_size (ralloc.c:121)
==6164== by 0x130189F1: fs_visitor::assign_constant_locations() (brw_fs.cpp:2095)
==6164== by 0x13022D32: fs_visitor::optimize() (brw_fs.cpp:5715)
==6164== by 0x13024D5A: fs_visitor::run_fs(bool, bool) (brw_fs.cpp:6229)
==6164== by 0x1302549A: brw_compile_fs (brw_fs.cpp:6570)
==6164== by 0x130C4B07: blorp_compile_fs (blorp.c:194)
==6164== by 0x130D384B: blorp_params_get_clear_kernel (blorp_clear.c:79)
==6164== by 0x130D3C56: blorp_fast_clear (blorp_clear.c:332)
==6164== by 0x12EFA439: do_single_blorp_clear (brw_blorp.c:1261)
==6164== by 0x12EFC4AF: brw_blorp_clear_color (brw_blorp.c:1326)
==6164== by 0x12EFF72B: brw_clear (brw_clear.c:297)
Fixes: 8d90e28839 ("intel/compiler: Allocate pull_param in assign_constant_locations")
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable@lists.freedesktop.org
Fixes intermittent GPU hangs on Broxton with an Intel internal
test case.
There are plenty of similar fragment shaders in piglit that do
not use any varyings or uniforms. According to the
documentation special timing is needed between pipeline stages.
Apparently we just don't hit that with piglit. Even with the
failing test case one doesn't always get the hang.
Moreover, according to the error states the hang happens
significantly later than the execution of the problematic shader.
There are multiple render cycles (primitive submissions) in between.
I've also seen error states where the ACTHD points outside the
batch. Almost as if the hardware writes somewhere that gets used
later on. That would also explain why piglit doesn't suffer from
this - most tests kick off one render cycle and any corruption
is left unseen.
v2 (Ken): Instead of enabling push constants, enable one of the
inputs (PSIZ).
v3 (Ken, Jason): Use LAYER instead, making the Vulkan driver's
emit_3dstate_sbe() happy.
Cc: "17.3 17.2" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
In order to implement the ballot intrinsic, we do a MOV from the flag
register to some GRF. If that GRF is used in a SEL, cmod propagation
helpfully changes it into a MOV from the flag register with a cmod.
This is perfectly valid, but when lower_simd_width comes along, it simply
splits it into two instructions, both of which have conditional modifiers.
This is a problem since we're reading the flag register. This commit
makes us check whether or not flags_written() overlaps with the flag
values that we are reading via the instruction source and, if we have
any interference, will force us to emit a copy of the source.
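In essence the check is a simple bitmask overlap; a standalone sketch
(made-up helper, not the actual fs_inst code):

    // Illustrative only: the split needs a source copy if the flag bits the
    // instruction writes through its conditional modifier overlap the flag
    // bits it reads as a source.
    static bool
    flags_interfere(unsigned flags_written, unsigned flags_read)
    {
       return (flags_written & flags_read) != 0;
    }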
Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
This eliminates a layer of wrapping, and makes a backend_instruction
sufficient. The downside is that it exposes 'eot' to the vec4 backend,
which it doesn't need, but can basically happily ignore.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Pallavi G <pallavi.g@intel.com>
This is a lot more natural than special-casing it all over the place.
We still have to do a bit of special-casing in assign_constant_locations,
but it's not special-cased quite as badly as it was before.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Now that everything is nicely ralloc'd, we can allocate the pull_param
array in assign_constant_locations instead of higher up. We can also
re-allocate the param array so that it's exactly the needed size. This
should save us some memory because we're not allocating the total needed
param space for both push and pull.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of making the caller of brw_compile_cs add something to the
param array for thread_local_id_index, just add it on-demand in
brw_nir_intrinsics and grow the array. This is now safe to do because
everyone is now using ralloc for prog_data::param.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
It's already only ever called from brw_compile_cs and only handles
compute intrinsics. Let's just make it CS-specific. We can always
make it handle other stages again later if we want.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The Vulkan driver does not support pull constants. It simply limits
things such that we can always push everything. Previously, we were
determining whether or not to push things based on whether or not the
prog_data::pull_param array is non-null. This is rather hackish and
about to stop working.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This moves us away from the array-of-pointers model and onto a model where
each param is represented by a generic uint32_t handle. We reserve 2^16
of these handles for builtins that get generated somewhere inside the
compiler and have well-defined meanings. Generic params have handles
whose meanings are defined by the driver.
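As a rough sketch of the scheme (the constant and helper names here are
made up, not the driver's actual encoding): handles below 2^16 denote
compiler-generated builtins with fixed meanings, and everything above is
a driver-defined generic param.

    #include <cstdint>

    // Illustrative encoding only.
    static const uint32_t PARAM_BUILTIN_COUNT = 1u << 16;

    static bool
    param_is_builtin(uint32_t param)
    {
       return param < PARAM_BUILTIN_COUNT;   // well-defined, compiler-generated
    }

    static uint32_t
    make_generic_param(uint32_t driver_index)
    {
       return PARAM_BUILTIN_COUNT + driver_index;   // meaning defined by the driver
    }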
The primary downside to this new approach is that it moves a little bit
of the work that we would normally do at compile time to draw time. On
my laptop this hurts OglBatch6 by no more than 1% and doesn't seem to
have any measurable effect on OglBatch7. So, while this may come back
to bite us, it doesn't look too bad.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We can start reading the URB at the first offset that contains varyings
that are actually read. We still need to make sure that we
read at least one varying to honor hardware requirements.
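A standalone sketch of the kind of helper this adds (names and exact
behaviour are assumptions): scan the bitmask of slots the shader actually
reads and return the first one, clamping so that at least one slot is
always read.

    #include <cstdint>

    // Illustrative only: first URB slot we need to read, given a bitmask of
    // slots that are actually read by the shader.
    static unsigned
    first_urb_read_slot(uint64_t slots_read, unsigned num_slots)
    {
       for (unsigned i = 0; i < num_slots; i++) {
          if (slots_read & (uint64_t(1) << i))
             return i;
       }
       // Nothing is read: still read the last slot so that the URB read
       // length is at least one, as the hardware requires.
       return num_slots > 0 ? num_slots - 1 : 0;
    }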
This helps alleviate a problem introduced with 99df02ca26 for
separate shader objects: without separate shader objects we assign
locations sequentially; however, since that commit we have changed the
method for SSO so that the VUE slot assigned depends on the number of
builtin slots plus the location assigned to the varying. This fixed
layout is intended to help SSO programs by avoiding on-the-fly recompiles
when swapping out shaders; however, it also means that if a varying uses
a large location number close to the maximum allowed by the SF/FS units
(31), then the offset introduced by the number of builtin slots can push
the location outside the range and trigger an assertion.
This problem is affecting at least the following CTS tests for
enhanced layouts:
KHR-GL45.enhanced_layouts.varying_array_components
KHR-GL45.enhanced_layouts.varying_array_locations
KHR-GL45.enhanced_layouts.varying_components
KHR-GL45.enhanced_layouts.varying_locations
which use SSO and the location layout qualifier to select such
location numbers explicitly.
This change helps these tests because for SSO we always have to include
things such as VARYING_SLOT_CLIP_DIST{0,1} even if the fragment shader is
very unlikely to read them, so by doing this we free builtin slots from
the fixed VUE layout and we prevent the tests from crashing in this scenario.
Of course, this is not a proper fix; we'd still run into problems if someone
tries to use an explicit max location and read gl_ViewportIndex, gl_LayerID or
gl_CullDistance in the FS, but that would be a much less common bug and we
can probably wait to see if anyone actually runs into that situation in a real
world scenario before making the decision that more aggressive changes are
required to support this without reverting 99df02ca26.
v2:
- Add a debug message when we skip clip distances (Ilia)
- We also need to account for this when we compute the URB setup
for the fragment shader stage, so add a compiler util to compute
the first slot that we need to read from the URB instead of
replicating the logic in both places.
v3:
- Make the util more generic so it can account for all unused slots
at the beginning of the URB, that will make it more useful (Ken).
- Drop the debug message, it was not what Ilia was asking for.
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Triggering the push model when 64-bit inputs are involved is not easy due to
the constraints on the maximum number of registers that we allow for this mode;
however, for GS with 'points' primitive type and just a couple of double
varyings we can trigger this and it just doesn't work because the
implementation is not 64-bit aware at all. For now, let's make sure that we
don't attempt this model with 64-bit inputs and we always fall back to pull
model for them.
Also, don't enable the VUE handles in the thread payload on the fly when we
find an input for which we need the pull model; this is not safe: if we need
to resort to the pull model we need to account for that when we set up the
thread payload so we compute the first non-payload register properly. If we
don't do that correctly and we enable it on the fly here, then we will end up
with VUE handles on the first non-payload register, which will probably lead to
GPU hangs. Instead, always enable the VUE handles for the pull model so we
can safely use them when needed. The GS is going to resort to pull model
almost in every situation anyway, so this shouldn't make a significant
difference and it makes things easier and safer.
v2: Always enable the VUE handles for the pull model; this is easier and safer,
and the GS is going to fall back to the pull model almost always anyway (Ken)
v3: Only clamp the URB read length if we are over the maximum reserved for
push inputs as we were doing in the original code (Ken).
v4: No need to clamp the urb read length if invocations > 1
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Right now, OpenGL uses the GLSL lowering for shared variables and anv
uses NIR to lower them. For a long time, we've done this weird thing
where we do the NIR lowering unconditionally and then add the SLM sizes
from the two together. This works because one of them will always be 0
but it's a bit sketchy. Let's just move the NIR-based lowering into
anv_pipeline and get rid of the sketch.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
These vaguely corresponded to the hardware encodings, but that is purely
historical at this point. Reorder them so we stop making things "almost
work" when mixing enums.
The ordering has been chosen so that no enum value is the same as a
compatible hardware encoding.
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
The implementations of the ARB_shader_ballot intrinsics will explicitly
read the flag as a source register.
Reviewed-by: Matt Turner <mattst88@gmail.com>
The implementations of the ARB_shader_group_vote intrinsics will
explicitly write the flag as the destination register.
Reviewed-by: Matt Turner <mattst88@gmail.com>
This actually takes advantage of the newly pushed UBO data, avoiding
pull loads.
Improves performance in GLBenchmark Manhattan 3.1 by:
HSW: ~1%, BDW/SKL/KBL GT2: 3-4%, SKL GT4: 7-8%, APL: 4-5%.
(thanks to Eero Tamminen for these numbers)
shader-db results on Skylake, ignoring programs with spill/fill changes:
total instructions in shared programs: 13963994 -> 13651893 (-2.24%)
instructions in affected programs: 4250328 -> 3938227 (-7.34%)
helped: 28527
HURT: 0
total cycles in shared programs: 179808608 -> 172535170 (-4.05%)
cycles in affected programs: 79720410 -> 72446972 (-9.12%)
helped: 26951
HURT: 1248
LOST: 46
GAINED: 21
Many "Deus Ex: Mankind Divided" shaders which already spilled end up
spilling a lot more (about 240 programs hurt, 9 helped). The cycle
estimator suggests this is still overall a win (-0.23% in cycle counts)
presumably because we trade pull loads for fills.
v2: Drop "PULL" environment variable left in for initial debugging
(caught by Matt).
Reviewed-by: Matt Turner <mattst88@gmail.com>
With UBOs, the answer of "have we decided to push this uniform" gets
a bit more complicated - for one, we have multiple surfaces. This
patch refactors things so we can add the new code in a single place.
Reviewed-by: Matt Turner <mattst88@gmail.com>
This patch starts uploading UBO data via 3DSTATE_CONSTANT_* packets,
and updates the compiler to know that there's extra payload data, so
things continue working. However, it still issues pull loads for all
data. I wanted to separate the two aspects for greater bisectability.
v2: Update for new intel_bufferobj_buffer parameter.
Reviewed-by: Matt Turner <mattst88@gmail.com>
This optimization has been removed on gen10+.
Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
In 5f2fe9302c, is_geminilake was introduced to differentiate
Broxton from Geminilake. Unfortunately, I failed to verify that
is_broxton is used throughout the code base to mean Gen9lp.
Fixes: 5f2fe9302c ("intel: common: add flag to identify platforms by name")
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>