spirv_to_nir() returned the nir_function corresponding to the
entrypoint, as a way to identify it. There's now a bool is_entrypoint
in nir_function and also a helper function to get the entry_point from
a nir_shader.
The return type reflects better what the function name suggests. It
also helps drivers avoid the mistake of reusing internal shader
references after running NIR_PASS on it. When using NIR_TEST_CLONE or
NIR_TEST_SERIALIZE, those would be invalidated right in the first pass
executed.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This makes the following packets use actual driver provided sizes rather
than guessing an arbitrary number:
- CC_VIEWPORT
- SF_CLIP_VIEWPORT
- BLEND_STATE
- COLOR_CALC_STATE
- SCISSOR_RECT
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
(Technically this is common code, but it doesn't affect i965 or anv.)
Improves performance of GFXBench5/gl_tess_off on Skylake GT4e at 1080p
by 9.3933% +/- 0.0305157% by eliminating all spilling in the GS.
Improves performance of GFXBench5/gl_4_off (Car Chase) on Skylake GT4e
at 1080p by 0.325208% +/- 0.0842233% (n=18).
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
When using the binding tables to access arrays of YCbCr descriptors we
did not consider the offset of the accessed element. We can't do a
simple multiple because the binding table entries are tightly packed.
For example element 0 of the array could use 2 entries/planes and
element 1 could use 2 entries/planes.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 3bb8768b9d ("anv: toggle on support for VK_EXT_ycbcr_image_arrays")
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
The libmesa_anv_gen* modules require anv_extensions.h, patch makes sure
it gets generated as a dependency before building them.
Signed-off-by: Chenglei Ren <chenglei.ren@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Cc: <mesa-stable@lists.freedesktop.org>
The difference between imov and fmov has been a constant source of
confusion in NIR for years. No one really knows why we have two or when
to use one vs. the other. The real reason is that they do different
things in the presence of source and destination modifiers. However,
without modifiers (which many back-ends don't have), they are identical.
Now that we've reworked nir_lower_to_source_mods to leave one abs/neg
instruction in place rather than replacing them with imov or fmov
instructions, we don't need two different instructions at all anymore.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Acked-by: Rob Clark <robdclark@chromium.org>
A few of our very late passes can end up generating vectors accidentally
so we need to get rid of them. The only known case of this is the ffma
peephole which generates fneg and fabs as vectors. Currently, they're
not a problem because they get turned into fmov which the back-end
compiler knows how to handle as a vector. That's about to change.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This flag has caused more confusion than good in most cases. You can
validly use imov for floats or fmov for integers because, without source
modifiers, neither modify their input in any way. Using imov for floats
is more reliable so we go that direction.
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
If descriptorType is VK_DESCRIPTOR_TYPE_STORAGE_IMAGE
or VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT, the imageView member of each
element of pImageInfo must have been created with the identity swizzle.
Fixes: d2aa65eb
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
On machines with many cores, you can run into that issue :
../mesa-9999/src/vulkan/overlay-layer/overlay.cpp:42:10: fatal error: vk_enum_to_str.h: No such file or directory
v2: Move declare_dependency around (Eric)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reported-by: Jan Ziak
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
When running with NIR_TEST_CLONE=1, the pointer will not be valid, as
the whole shader is going to be recreated every pass. Prefer using
is_entrypoint (to query when looping) and nir_shader_get_entrypoint()
instead.
Fixes the Vulkan Piglit tests
- vulkan/glsl450/frexp-double
- vulkan/glsl450/isinf-double
- vulkan/shaders/fs-multiple-large-local-array
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108957
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Instead of setting the glsl types of the pointers for each resource,
set the nir_address_format, from which we can derive the glsl_type,
and in the future the bit pattern representing a NULL pointer.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
From the Vulkan 1.1.107 spec:
Sample shading is enabled for a graphics pipeline:
- If the interface of the fragment shader entry point of the
graphics pipeline includes an input variable decorated with
SampleId or SamplePosition. In this case minSampleShadingFactor
takes the value 1.0.
- Else if the sampleShadingEnable member of the
VkPipelineMultisampleStateCreateInfo structure specified when
creating the graphics pipeline is set to VK_TRUE. In this case
minSampleShadingFactor takes the value of
VkPipelineMultisampleStateCreateInfo::minSampleShading.
Otherwise, sample shading is considered disabled.
In other words, if sampleShadingEnable is set to VK_FALSE, we should
ignore minSampleShading.
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This was an unintended artifact of my testing of bindless images. We
should be choosing bindless or not dynamically.
Fixes: c0d9926df7 "anv: Use bindless handles for images"
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Now that we have the descriptor buffer mechanism, emulated texture
swizzle can be implemented in a very non-invasive way. Previous
attempts all tried to extend the push constant based image param
mechanism which was gross. This could, in theory, be done much faster
with a magic back-end instruction which does indirect MOVs but Vulkan on
IVB is already so slow this isn't going to matter much.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104355
Cc: "19.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Don't attempt sampling with HiZ if the sampler lacks support for it. On
ICL, the HW docs state that sampling with HiZ is not supported and that
instances of AUX_HIZ in the RENDER_SURFACE_STATE object will be
interpreted as AUX_NONE.
Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
For a block with a contiguous chunk of 32 vars that don't need updating,
this lets us skip 32 vars at a time. Also, by using bitscan, we only
iterate for each set bit rather than testing them all one at a time.
Looking at perf (with -O0 which is unfortunately necessary to get
reasonable back-traces), this seems to cuts about 50-60% of the time
spent in compute_start_end() which is, itself about 4-6% of the
run-time. In the real world, with a release driver build, this cuts
1.34% off a full shader-db run. (I ran shader-db 5 times in each
configuration).
Reviewed-by: Matt Turner <mattst88@gmail.com>
Otherwise, we get an effectively random spill reg because we no longer
have the information from RA to guide us. Also, a completely clean
graph has undefined data in in_stack which is used for choosing the
spill reg so it really is non-deterministic.
Fixes: e99081e76d "intel/fs/ra: Spill without destroying the..."
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In the future I want to expand this to 128-bits, for vec16 support, so
lets just put the code in place to use bitset ranges now.
v2: just declare the bitset to be the max of what we should ever see
and change assert to reflect it.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Our tessellation control shaders can be dispatched in several modes.
- SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
channel corresponding to a different patch vertex. PATCHLIST_N will
launch (N / 8) threads. If N is less than 8, some channels will be
disabled, leaving some untapped hardware capabilities. Conditionals
based on gl_InvocationID are non-uniform, which means that they'll
often have to execute both paths. However, if there are fewer than
8 vertices, all invocations will happen within a single thread, so
barriers can become no-ops, which is nice. We also burn a maximum
of 4 registers for ICP handles, so we can compile without regard for
the value of N. It also works in all cases.
- DUAL_PATCH mode processes up to two patches at a time, where the first
four channels come from patch 1, and the second group of four come
from patch 2. This tries to provide better EU utilization for small
patches (N <= 4). It cannot be used in all cases.
- 8_PATCH mode processes 8 patches at a time, with a thread launched per
vertex in the patch. Each channel corresponds to the same vertex, but
in each of the 8 patches. This utilizes all channels even for small
patches. It also makes conditions on gl_InvocationID uniform, leading
to proper jumps. Barriers, unfortunately, become real. Worse, for
PATCHLIST_N, the thread payload burns N registers for ICP handles.
This can burn up to 32 registers, or 1/4 of our register file, for
URB handles. For Vulkan (and DX), we know the number of vertices at
compile time, so we can limit the amount of waste. In GL, the patch
dimension is dynamic state, so we either would have to waste all 32
(not reasonable) or guess (badly) and recompile. This is unfortunate.
Because we can only spawn 16 thread instances, we can only use this
mode for PATCHLIST_16 and smaller. The rest must use SINGLE_PATCH.
This patch implements the new 8_PATCH TCS mode, but leaves us using
SINGLE_PATCH by default. A new INTEL_DEBUG=tcs8 flag will switch to
using 8_PATCH mode for testing and benchmarking purposes. We may
want to consider using 8_PATCH mode in Vulkan in some cases.
The data I've seen shows that 8_PATCH mode can be more efficient in
some cases, but SINGLE_PATCH mode (the one we use today) is faster
in other cases. Ultimately, the TES matters much more than the TCS
for performance, so the decision may not matter much.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The payload field is actually "instance" (thread number), which is used
to calculate the invocation ID.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When we add 8_PATCH mode, this will get a bit more complex, so we may
as well start by putting it in a helper function.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Instead of re-building the interference graph every time we spill, we
modify it in place so we can avoid recalculating liveness and the whole
O(n^2) interference graph building process. We make a simplifying
assumption in order to do so which is that all spill/fill temporary
registers live for the entire duration of the instruction around which
we're spilling. This isn't quite true because a spill into the source
of an instruction doesn't need to interfere with its destination, for
instance. Not re-calculating liveness also means that we aren't
adjusting spill costs based on the new liveness. The combination of
these things results in a bit of churn in spilling. It takes a large
cut out of the run-time of shader-db on my laptop.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311360 (<.01%)
instructions in affected programs: 77027 -> 77163 (0.18%)
helped: 11
HURT: 18
total cycles in shared programs: 355544739 -> 355830749 (0.08%)
cycles in affected programs: 203273745 -> 203559755 (0.14%)
helped: 234
HURT: 190
total spills in shared programs: 12049 -> 12042 (-0.06%)
spills in affected programs: 2465 -> 2458 (-0.28%)
helped: 9
HURT: 16
total fills in shared programs: 25112 -> 25165 (0.21%)
fills in affected programs: 6819 -> 6872 (0.78%)
helped: 11
HURT: 16
Total CPU time (seconds): 2469.68 -> 2360.22 (-4.43%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This is slightly less convenient in some places but it will make it much
easier when we want to start adding nodes dynamically.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The old code was arranged by the type of interference being added. It
would set up payload registers and then add payload interference for all
VGRFs. It would set up MRFs and add MRF interference for all VGRFs.
This commit re-arranges things to be organized differently. It first
creates and sets up all RA nodes and then groups interference into two
new categories: live range and instruction interference. Once all the
RA nodes have been set up, it walks the list of VGRFs and sets up their
live range interference and then walks the list of instructions and sets
up instruction interference. This new arrangement will be advantageous
for a future patch but, at the moment, it cuts 2% off the run-time of
shader-db on my laptop.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311224 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355544739 -> 355544739 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2523.45 -> 2469.68 (-2.13%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The only use of the MRF hack these days is for spilling and there we
don't need the precise MRF usage information. If we're spilling then we
know pretty well how many MRFs are going to be used. It is possible if
the only things that are spilled have fewer SIMD channels than the
dispatch width of the shader that this may be more MRFs than needed.
That's a risk we're willing to takd.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311224 (<.01%)
instructions in affected programs: 16664 -> 16788 (0.74%)
helped: 1
HURT: 5
total cycles in shared programs: 355543197 -> 355544739 (<.01%)
cycles in affected programs: 731864 -> 733406 (0.21%)
helped: 3
HURT: 6
The hurt shaders are all SIMD32 compute shaders where we reserve enough
space for a 32-wide spill/fill but don't need it.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This accomplishes two things. First, it makes interfaces which are
really private to RA private to RA. Second, it gives us a place to
store some common stuff as we go through the algorithm.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
There's no reason why we need to use the calculated payload_node_count
value which is just first_non_payload_grf aligned up. The grf_used
value will be aligned up to 16 anyway (which is a much bigger alignment)
before being handed off to hardware.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We only have one node per VGRF so this was adding way too much
interference. No idea how we didn't catch this before.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355543197 (0.02%)
cycles in affected programs: 2472492 -> 2547639 (3.04%)
helped: 17
HURT: 20
Fixes: 014edff0d2 "intel/fs: Add interference between SENDS sources"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In the last phase of the schedule and RA loop, the RA call is redundant
if we spill. Immediately afterwards, we're going to see that we
couldn't allocate without spilling and call back into RA and tell it to
go ahead and spill. We've known about it for a while but we've always
brushed over it on the theory that, if you're going to spill, you'll be
calling RA a bunch anyway and what does one extra RA hurt? As it turns
out, it hurts more than you'd expect. Because the RA interference graph
gets sparser with each spill and the RA algorithm is more efficient on
sparser graphs, the RA call that we're duplicating is actually the most
expensive call in the RA-and-spill loop.
There's another extra RA call we do that's a bit harder to see which
this also removes. If we try to compile a shader that isn't the minimum
dispatch width and it fails to allocate without spilling we call fail()
to set an error but then go ahead and do the first spilling RA pass and
only after that's complete do we detect the fail and bail out. By
making minimum dispatch widths part of the spill condition, we side-step
this problem.
Getting rid of these extra spills takes the compile time of a nasty
Aztec Ruins shader from about 28 seconds to about 26 seconds on my
laptop. It also makes shader-db 1.5% faster
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355468050 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2524.31 -> 2486.63 (-1.49%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Import some restrictions from intel_tiling_supports_hiz() and
intel_miptree_supports_hiz().
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
Import some restrictions from intel_miptree_supports_mcs() and don't
assume that the caller knows which device generations are supported.
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
There's no real work to do here since we already support scalar block
layout which is a direct superset of what this extension allows.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
glibc < 2.27 defines OVERFLOW in /usr/include/math.h.
This patch fixes this build error.
In file included from ../include/c99_math.h:37:0,
from ../src/util/u_math.h:44,
from ../src/mesa/main/macros.h:35,
from ../src/intel/compiler/brw_reg.h:47,
from ../src/intel/tools/i965_asm.h:32,
from ../src/intel/tools/i965_gram.y:29:
src/intel/tools/i965_gram.tab.c:562:5: error: expected identifier before numeric constant
OVERFLOW = 412,
^
Fixes: 70308a5a8a ("intel/tools: New i965 instruction assembler tool")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110656
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Acked-by: Eric Engestrom <eric@engestrom.ch>
Update various limits in
VkPhysicalDeviceDescriptorIndexingPropertiesEXT that were previously
zero to their values from VkPhysicalDeviceLimits. When using
VK_EXT_descriptor_indexing, the former limits will apply to all the
descriptor layout sets -- not only those using the new feature bits.
For the reference, VK_EXT_descriptor_indexing says
"There are new descriptor set layout and descriptor pool creation
flags that are required to opt in to the update-after-bind
functionality, and there are separate maxPerStage* and
maxDescriptorSet* limits that apply to these descriptor set
layouts which may be much higher than the pre-existing limits. The
old limits only count descriptors in non-updateAfterBind
descriptor set layouts, and the new limits count descriptors in
all descriptor set layouts in the pipeline layout."
Fixes: 6e230d7607 "anv: Implement VK_EXT_descriptor_indexing"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>