We would like to reuse performance query metrics in other APIs. Let's
make the query code dealing with the processing of raw counters into
human readable values API agnostic.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Mark Janes <mark.a.janes@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The i965 driver has a bunch of code to compare two sets of program keys
and print out the differences. This can be useful for debugging why a
shader needed to be recompiled on the fly due to non-orthogonal state
dependencies. anv doesn't do recompiles, so we didn't need to share
this in the past - but I'd like to use it in iris.
This moves the bulk of the code to the compiler where it can be reused.
To make that possible, we need to decouple it from i965 - we can't get
at the brw program cache directly, nor use brw_context to print things.
Instead, we use compiler->shader_perf_log(), and simply pass in keys.
We put all of this debugging code in brw_debug_recompile.c, and only
export a single function, for simplicity. I also tidied the code a
bit while moving it, now that it all lives in one file.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Both Vulkan and OpenGL might be using glsl_types simultaneously or we
can also have multiple concurrent Vulkan instances using glsl_types.
Patch adds a one time init to track number of users and will release
types only when last user calls _glsl_type_singleton_decref().
This change fixes glsl_type memory leaks we have with anv driver.
v2: reuse hash_mutex, cleanup, apply fix also to radv driver and
rename helper functions (Jason)
v3: move init, destroy to happen on GL context init and destroy
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
If we write to the flag register changing the swizzle would change
what channels are written to the flag register.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110201
Fixes: 4cd1a0be
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: <ian.d.romanick@intel.com>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Acked-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Matt Turner <mattst88@gmail.com>
These were updated in version 1.1.106 of vulkan.h to make more sense
with the extension names. We may as well keep with the times.
Acked-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Pipeline statistics queries should not count BLORP's rectangles.
(23) How do operations like Clear, TexSubImage, etc. affect the
results of the newly introduced queries?
DISCUSSION: Implementations might require "helper" rendering
commands be issued to implement certain operations like Clear,
TexSubImage, etc.
RESOLVED: They don't. Only application submitted rendering
commands should have an effect on the results of the queries.
Piglit's arb_pipeline_statistics_query-vert_adj exposes this bug when
the driver is hacked to always perform glBufferData via a GPU staging
copy (for debugging purposes).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
v2: remove & operator in a couple of memsets
add some memsets
v3: fixup lima
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v2)
While we're here, fix a typo which caused it to actually return a vec4
with the third and fourth components zero.
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
In 628c9ca908 I forgot to apply the same -4Gb of the high address
of the high heap VMA. This was previously computed in the
HIGH_HEAP_MAX_ADDRESS.
Many thanks to James for pointing this out.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reported-by: Xiong, James <james.xiong@intel.com>
Fixes: 628c9ca908 ("anv: store heap address bounds when initializing physical device")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We will never hit a condition where we have src1 and src2 as immediate
operands.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We were always programming it with the Broadwell convention which is too
large by a factor of two on Haswell and just plain wrong on IVB and BYT.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable@lists.freedesktop.org
v2: handle atomics as well
make use of nir_rewrite_image_intrinsic
v3: remove call to nir_remove_dead_derefs
v4: (Timothy Arceri) dont actually call lowering yet
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v3)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
We can then reuse those bounds to initialize the VMA heaps at logical
device creation.
This fixes an issue on EHL which has only 36bits of VMA. We were
incorrectly using the fixed 48bits upper bound to initialize the
logical device heap, resulting in addresses beyong the device's
limits.
v2: Don't confuse heap size (limited by system memory) and VMA size
(limited by number of addressing bits the platform has)
v3: Fix low heap vma_size :( (Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reported-by: James Xiong <james.xiong@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com> (v1)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v2)
libintel_common depends on libintel_compiler, but it contains debug
functionality that is needed by libintel_compiler. Break the circular
dependency by moving gen_debug files to libintel_dev.
Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
If the user didn't provide a pipeline cache and we're using the
default internal pipeline cache, then we shouldn't consider a cache
hit for VK_EXT_pipeline_creation_feedback as the application did not
provide a cache.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 6601e5d6fc ("anv: implement VK_EXT_pipeline_creation_feedback")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
So far ANV was advertising 4 bits for both subTexelPrecisionBits and
mipmapPrecisionBits. But these values were not actually verified.
But it seems the right value is actually 8 bits for both cases.
Unfortunately Intel PRM does not clarify how many bits the hardware use.
For the mipmap case, there is the following reference in PRM Volume 6
(3D Media GPGPU), specifically in LOD Computation Pseudocode:
```
Bias: S4.8
MinLod: U4.8
MaxLod: U4.8
Base: U4.1
MIPCnt: U4
SurfMinLod: U4.8
ResMinLod: U4.8
``
We have other clues, though:
- On one side, dEQP-VK.texture.explicit_lod.* tests fail when using 4
bits, but work when using 8 bits. These tests try to mimic the expected
behaviour as much real as possible, and they use the reported
subTexelPrecisionBits and mipmapPrecisionBits reported to get this.
- On the other side, the equivalent driver for Windows is reporting 8
bits for both elements. Not sure if they got to verify it from the PRM
or from a diffent source.
CC: Jason Ekstrand <jason@jlekstrand.net>
CC: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This will make that step visible in NIR_PRINT=1.
v2: Also use the macro for the cleanup passes.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This was needed when certain intrinsics were lowered to other ones
that were defined by the same pass. After 060817b2 "intel,nir: Move
gl_LocalInvocationID lowering to nir_lower_system_values" we don't
need the loop anymore.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When using quads, instead of mapping the elements to the next 4 local
invocation indices, we map the two next in the "current" row and two
next in the "next row". A side effect is that a thread will execute
the indices in a different order.
We now perform the lowering of both local invocation ID and index
together -- and don't rely anymore on lowering done by
nir_lower_system_values. That is convenient when doing the math for
quads, because we need X and Y to get the right invocation index.
When the pass progresses, fold the constants and clean up to reduce
the noise from the indexing math.
This implements the derivative_group_quadsNV semantics from
NV_compute_shader_derivatives.
v2: Take subgroup_id into account, otherwise only values in the first
subgroup would be used. (Jason)
v3: Calculate invocation index and ID together, to avoid duplicating
some math in the quads case when both index and ID are used. (Jason)
v4: Don't call cleanup passes as part of the lowering, let that to the
call site. (Jason)
Change calculation to use less instructions. (Jason)
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> (v3)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When I implemented opt_if_loop_last_continue() I had restricted
this pass from moving other if-statements inside the branch opposite
the continue. At the time it was causing a bunch of spilling in
shader-db for i965.
However Samuel Pitoiset noticed that making this pass more aggressive
significantly improved the performance of Doom on RADV. Below are
the statistics he gathered.
28717 shaders in 14931 tests
Totals:
SGPRS: 1267317 -> 1267549 (0.02 %)
VGPRS: 896876 -> 895920 (-0.11 %)
Spilled SGPRs: 24701 -> 26367 (6.74 %)
Code Size: 48379452 -> 48507880 (0.27 %) bytes
Max Waves: 241159 -> 241190 (0.01 %)
Totals from affected shaders:
SGPRS: 23584 -> 23816 (0.98 %)
VGPRS: 25908 -> 24952 (-3.69 %)
Spilled SGPRs: 503 -> 2169 (331.21 %)
Code Size: 2471392 -> 2599820 (5.20 %) bytes
Max Waves: 586 -> 617 (5.29 %)
The codesize increases is related to Wolfenstein II it seems largely
due to an increase in phis rather than the existing jumps.
This gives +10% FPS with Doom on my Vega56.
Rhys Perry also benchmarked Doom on his VEGA64:
Before: 72.53 FPS
After: 80.77 FPS
v2: disable pass on non-AMD drivers
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> (v1)
Acked-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Drivers using genxml will start compilation before generated files are
created, so add a dependency to it.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
Cc: mesa-stable@lists.freedesktop.org
This revision allows for images to be :
- created by reusing image parameters from swapchain
- bound to memory from a swapchain
v2: Add color attachment flag
Use same implicit WSI parameters (tiling, samples, usage)
v3: Fix missing break in vk_foreach_struct_const() switch (Lionel)
v4: Fix accessing image aspects before android resolve (Tapani)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
If we increase vector sizing later it would be nice to avoid
tripped over this again.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>