When we have a bindless sampler, we need an instruction header. Even in
SIMD8, this pushes the instruction over the sampler message size maximum
of 11 registers. Instead, we have to lower TXD to TXL.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This commit adds a new way for ANV to do SSBO bindings by just passing a
GPU address in through the descriptor buffer and using the A64 messages
to access the GPU address directly. This means that our variable
pointers are now "real" pointers instead of a vec2(BTI, offset) pair.
This carries a few of advantages:
1. It lets us support a virtually unbounded number of SSBO bindings.
2. It lets us implement VK_KHR_shader_atomic_int64 which we couldn't
implement before because those atomic messages are only available
in the bindless A64 form.
3. It's way better than messing around with bindless handles for SSBOs
which is the only other option for VK_EXT_descriptor_indexing.
4. It's more future looking, maybe? At the least, this is what NVIDIA
does (they don't have binding based SSBOs at all). This doesn't a
priori mean it's better, it just means it's probably not terrible.
The big disadvantage, of course, is that we have to start doing our own
bounds checking for robustBufferAccess again have to push in dynamic
offsets.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
In order to avoid the potential overhead of A64 operations on all SSBO
ops, we look for those SSBO ops where we can get to the descriptor set
from the SSBO access operation and lower those to a binding-table
approach. When robustBufferAccess is enabled, this lets the hardware do
the bounds checking for us. It also avoids some potentially expensive
64-bit integer calculations.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is more descriptive and a bit nicer than checking for gen >= 8 &&
use_softpin everywhere.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We're about to start doing 64-bit pointer calculations in ANV. They
will get applied after brw_preprocess_nir which is where we currently do
64-bit integer arithmetic lowering. Because we're adding 64-bit integer
arithmetic after the initial lowering has happened, we need to lower
again.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
If the number of surfaces or samplers exceeds what we can put in a
table, we will want to spill out to bindless. There is no bindless
support yet but this gets us the basic framework that will be used by
later commits.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This commit just sorts the bindings by how often they're used vs the
array size of the binding. This will let us make more nuanced decisions
about what goes in the binding table vs. what to make bindless.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This also fixes a bug where we mis-calculate maximum binding table sizes
and may return true in vkGetDescriptorSetLayoutSupport even for sets too
large to fit in a binding table.
Fixes: ddc4069122 "anv: Implement VK_KHR_maintenance3"
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This is really where they belong; not push constants. The one downside
here is that we can't push them anymore for compute shaders. However,
that's a general problem and we should figure out how to push descriptor
sets for compute shaders. This lets us bump MAX_IMAGES to 64 on BDW and
earlier platforms because we no longer have to worry about push constant
overhead limits.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We spend a lot of time in the driver adding things to hash sets to track
residency. The reality is that a properly built Vulkan app uses large
memory objects and sub-allocates from them. In a typical frame, most of
if not all of those allocations are going to be resident for the entire
frame so we're really not saving ourselves much by tracking fine-grained
residency. Just throwing everything in the validation list does make it
a little bit more expensive inside the kernel to walk the list and
ensure that all our VA is in order. However, without relocations, the
overhead of that is pretty small.
If we ever do run into a memory pressure situation where the fine-
grained residency could even potentially help, we would likely be
swapping one page out to make room for another within the draw call and
performance is totally lost at that point. We're better off swapping
out other apps and just letting ours run a whole frame.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
If the last graphics pipeline bound to the command buffer has enough
space in its VS URB entries for Blorp then avoid reconfiguring the URB
partitions.
v2: s/0/MESA_SHADER_VERTEX/ (Caio)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
There was an assumption that num_thread_per_eu would be set in the
Gen8 features. Since this is mostly the same of all gen8->11 (except
GEN9_LP that overwrites it) let's just factor it out.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable@lists.freedesktop.org
Acked-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Anuj Phogat anuj.phogat@gmail.com
The current register allocator has a concept of "spill benefit" which is
based on the number of nodes with which a given node interferes. The
idea is that you want to spill stuff with high interference because
those are the most likely registers to help when spilling. However,
this fails to take into account the length of the live range so the
allocator frequently picks "cheap" (not many uses) registers which are
actually very short lived and so spilling them doesn't help with the
pressure situation.
This commit takes into account the length of the live range to make
long-lived registers more likely to get spilled than short-lived ones.
This encourages the spill chooser to choose slightly larger registers
which will affect a larger area of the program and hopefully we have to
spill fewer of them to get the same reduction in over-all register
pressure.
Shader-db results on Kaby Lake:
total spills in shared programs: 23664 -> 12050 (-49.08%)
spills in affected programs: 19243 -> 7629 (-60.35%)
helped: 296
HURT: 8
total fills in shared programs: 32028 -> 25139 (-21.51%)
fills in affected programs: 20378 -> 13489 (-33.81%)
helped: 295
HURT: 16
Of course, most of that is in Deus Ex...
Shader-db results on Kaby Lake (without Deus Ex):
total spills in shared programs: 6479 -> 5834 (-9.96%)
spills in affected programs: 3231 -> 2586 (-19.96%)
helped: 40
HURT: 4
total fills in shared programs: 17165 -> 17099 (-0.38%)
fills in affected programs: 6951 -> 6885 (-0.95%)
helped: 40
HURT: 7
Even without Deus Ex, the spill help is pretty respectable. The worst
hurt shaders were one compute shader in Aztec Ruins and one fragment
shader in KSP that were each hurt by around 13% fill 9% spill.
VkPipeline-db results on Kaby Lake:
total spills in shared programs: 9149 -> 8069 (-11.80%)
spills in affected programs: 5197 -> 4117 (-20.78%)
helped: 27
HURT: 16
total fills in shared programs: 26390 -> 25477 (-3.46%)
fills in affected programs: 12662 -> 11749 (-7.21%)
helped: 24
HURT: 22
The Vulkan results were decidedly more mixed but we don't have nearly as
many apps in that database yet.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Normally fsign generates -1, 0, or +1. The new scale factor, S, causes
fsign to generate -S, 0, or +S.
v2: Rebase on v2 changes in previous commit.
v3: Rebase on 85c35885b3 ("nir: Rework nir_src_as_alu_instr to not take
a pointer").
Reviewed-by: Matt Turner <mattst88@gmail.com> [v2]
Other nir_src_as_* functions just take a nir_src. It's not that much
more memory copying and the constness preserving really isn't worth the
cognitive dissonance.
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This 3d performance workaround was initially put in the kernel but the
media driver requires different settings so the register has been
whitelisted in i915 [1] and userspace drivers are left initializing it as
they wish.
[1] : https://patchwork.freedesktop.org/series/59494/
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
This 3d performance workaround was initially put in the kernel but the
media driver requires different settings so the register has been
whitelisted in i915 [1] and userspace drivers are left initializing it as
they wish.
[1] : https://patchwork.freedesktop.org/series/59494/
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
v2 (Jason):
- Merge shaderFloat16 and shaderInt8 enablement into a single patch.
- Merge extension enable.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
v2:
- Merge Float16 and Int8 capabilities into a single patch (Jason)
- Merged patch that enabled SPIR-V front-end checks for these caps
(except for Int8, which was already merged)
v3:
- Keep capabilities sorted (Jason)
v4:
- SpvCapabilityFloat16 support already added in master (Juan)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
v2:
- Adapted unit tests to make them consistent with the changes done
to the validation of half-float conversions.
v3 (Curro):
- Check all the accummulators
- Constify declarations
- Do not check src1 type in single-source instructions.
- Check for all instructions that read accumulator (either implicitly or
explicitly)
- Check restrictions in src1 too.
- Merge conditional block
- Add invalid test case.
v4 (Curro):
- Assert on 3-src instructions, as they are not validated.
- Get rid of types_are_mixed_float(), as we know instruction is mixed
float at that point.
- Remove conditions from not verified case.
- Fix brackets on conditional.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
v2:
- Add some tests with UB type too (Jason)
v3:
- consider implicit conversions from 2src instructions too (Curro).
v4:
- Do not check src1 type in single-source instructions (Curro).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v2)
v2:
- Consider implicit conversions in 2-src instructions too (Curro)
- For restrictions that involve destination stride requirements
only validate them for Align1, since Align16 always requires
packed data.
- Skip general rule for the dst/execution type size ratio for
mixed float instructions on CHV and SKL+, these have their own
set of rules that we'll be validated separately.
v3 (Curro):
- Do not check src1 type in single-source instructions.
- Check restriction on src1.
- Remove invalid test.
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
The section 'Execution Data Types' of 3D Media GPGPU volume, which
describes execution types, is exactly the same in BDW and SKL+.
Also, this section states that there is a single execution type, so it
makes sense that this is the wider of the two floating point types
involved in mixed float mode, which is what we do for SKL+ and CHV.
v2:
- Make sure we also account for the destination type in mixed mode (Curro).
Acked-by: Francisco Jerez <currojerez@riseup.net>
It is very likely that this optimzation is never useful and we'll probably
just end up removing it, so let's not bother adding more cases to it for
now.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
NIR already has these and correctly considers exact/inexact qualification,
whereas the backend doesn't and can apply the optimizations where it
shouldn't. This happened to be the case in a handful of Tomb Raider shaders,
where NIR would skip the optimizations because of a precise qualification
but the backend would then (incorrectly) apply them anyway.
Besides this, considering that we are not emitting much math in the backend
these days it is unlikely that these optimizations are useful in general. A
shader-db run confirms that MAD and LRP optimizations, for example, were only
being triggered in cases where NIR would skip them due to precise
requirements, so in the near future we might want to remove more of these,
but for now we just remove the ones that are not completely correct.
Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
There are no 8-bit immediates, so assert in that case.
16-bit immediates are replicated in each word of a 32-bit immediate, so
we only need to check the lower 16-bits.
v2:
- Fix is_zero with half-float to consider -0 as well (Jason).
- Fix is_negative_one for word type.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
At the very least we need it to handle HF too, since we are doing
constant propagation for MAD and LRP, which relies on this pass
to promote the immediates to GRF in the end, but ideally
we want it to support even more types so we can take advantage
of it to improve register pressure in some scenarios.
v2 (Jason):
- Support 64-bit types too.
- Check if we need to set the half-float flag if the immediate already
existed.
- Multiply the size of the immediate by the width of the copy
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The hardware only allows a stride of 1 on a Byte destination for raw
byte MOV instructions. This is required even when the destination
is the NULL register.
Rather than making sure that we emit a proper NULL:B destination
every time we need one, just fix it at emission time.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Now that we have the regioning lowering pass we can just put all of these
opcodes together in a single block and we can just assert on the few cases
of conversion instructions that are not supported in hardware and that should
be lowered in brw_nir_lower_conversions.
The only cases what we still handle separately are the conversions from float
to half-float since the rounding variants would need to fallthrough and we
are already doing this for boolean opcodes (since they need to negate), plus
there is also a large comment about these opcodes that we probably want to
keep so it is just easier to keep these separate.
Suggested-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This function is used in two different scenarios that for 32-bit
instructions are the same, but for 16-bit instructions are not.
One scenario is that in which we are working at a SIMD8 register
level and we need to know if a register is fully defined or written.
This is useful, for example, in the context of liveness analysis or
register allocation, where we work with units of registers.
The other scenario is that in which we want to know if an instruction
is writing a full scalar component or just some subset of it. This is
useful, for example, in the context of some optimization passes
like copy propagation.
For 32-bit instructions (or larger), a SIMD8 dispatch will always write
at least a full SIMD8 register (32B) if the write is not partial. The
function is_partial_write() checks this to determine if we have a partial
write. However, when we deal with 16-bit instructions, that logic disables
some optimizations that should be safe. For example, a SIMD8 16-bit MOV will
only update half of a SIMD register, but it is still a complete write of the
variable for a SIMD8 dispatch, so we should not prevent copy propagation in
this scenario because we don't write all 32 bytes in the SIMD register
or because the write starts at offset 16B (wehere we pack components Y or
W of 16-bit vectors).
This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-bit
instructions, which lose a number of optimizations because of this, most
important of which is copy-propagation.
This patch splits is_partial_write() into is_partial_reg_write(), which
represents the current is_partial_write(), useful for things like
liveness analysis, and is_partial_var_write(), which considers
the dispatch size to check if we are writing a full variable (rather
than a full register) to decide if the write is partial or not, which
is what we really want in many optimization passes.
Then the patch goes on and rewrites all uses of is_partial_write() to use
one or the other version. Specifically, we use is_partial_var_write()
in the following places: copy propagation, cmod propagation, common
subexpression elimination, saturate propagation and sel peephole.
Notice that the semantics of is_partial_var_write() exactly match the
current implementation of is_partial_write() for anything that is
32-bit or larger, so no changes are expected for 32-bit instructions.
Tested against ~5000 tests involving 16-bit instructions in CTS produced
the following changes in instruction counts:
Patched | Master | % |
================================================
SIMD8 | 621,900 | 706,721 | -12.00% |
================================================
SIMD16 | 93,252 | 93,252 | 0.00% |
================================================
As expected, the change only affects SIMD8 dispatches.
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
Empirical testing shows that gen8 has a bug where MAD instructions with
a half-float source starting at a non-zero offset fail to execute
properly.
This scenario usually happened in SIMD8 executions, where we used to
pack vector components Y and W in the second half of SIMD registers
(therefore, with a 16B offset). It looks like we are not currently doing
this any more but this would handle the situation properly if we ever
happen to produce code like this again.
v2 (Jason):
- Move this workaround to the lower_regioning pass as an additional case
to has_invalid_src_region()
- Do not apply the workaround if the stride of the source operand is 0,
testing suggests the problem doesn't exist in that case.
v3 (Jason):
- We want offset % REG_SIZE > 0, not just offset > 0
- Use a helper to compute the offset
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com> (v1)
Broadwell has restrictions that apply to Align16 half-float that
make the Align16 implementation of this invalid for this platform.
Use the gen11 path for this instead, which uses Align1 mode.
The restriction is not present in cherryview, gen9 or gen10, where
the Align16 implementation seems to work just fine.
v2:
- Rework the comment in the code, move the PRM citation from the
commit message to the comment in the code (Matt)
- Cherryview isn't affected, only Broadwell (Matt)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
Reviewed-by: Matt Turner <mattst88@gmail.com>
We were assuming 32-bit elements. Also, In SIMD8 we pack 2 vector components
in a single SIMD register, so for example, component Y of a 16-bit vec2
starts is at byte offset 16B. This means that when we compute the offset of
the elements to be differentiated we should not stomp whatever base offset we
have, but instead add to it.
v2
- Use byte_offset() helper (Jason)
- Merge the fix for SIMD8: using byte_offset() fixes that too.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v1)
Reviewed-by: Matt Turner <mattst88@gmail.com>
Source0 and Destination extract the floating-point precision automatically
from the SrcType and DstType instruction fields respectively when they are
set to types :F or :HF. For Source1 and Source2 operands, we use the new
1-bit fields Src1Type and Src2Type, where 0 means normal precision and 1
means half-precision. Since we always use the type of the destination for
all operands when we emit 3-source instructions, we only need set Src1Type
and Src2Type to 1 when we are emitting a half-precision instruction.
v2:
- Set the bit separately for each source based on its type so we can
do mixed floating-point mode in the future (Topi).
v3:
- Use regular citation style for the comment referencing the PRM (Matt).
- Decided not to add asserts in the emission code to check that only
mixed HF/F types are used since such checks would break negative tests
for brw_eu_validate.c (Matt)
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
We are now using these bits, so don't assert that they are not set. In gen8,
if these bits are set compaction is not possible. On gen9 and CHV platforms
set_3src_control_index() checks these bits (and others) against a table to
validate if the particular bit combination is eligible for compaction or not.
v2
- Add more detail in the commit message explaining the situation for SKL+
and CHV (Jason)
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>