If we have a d24x8 format, there is no stencil. Therefore, we can always
clear these bits too, which means this will be some kind of memset rather
than read-modify-write.
This is good for some 7% increase or so in gears with huge window size -
seems to have a bigger effect if things aren't in caches. Of course, any
real app won't spend nearly as much time comparatively in clearing
depth buffer in the first place, so the speedup will be much lower.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This is untested and probably broken.
We already passed the atan2 CTS tests before implementing this opcode.
Presumably, glslang or something was giving us a plain Atan opcode
instead of Atan2. I don't know why.
Fixes a number of GLES31 CTS failures and hangs on various hardware:
ES31-CTS.texture_gather.plain-gather-depth-2d
ES31-CTS.texture_gather.plain-gather-depth-2darray
ES31-CTS.texture_gather.plain-gather-depth-cube
ES31-CTS.texture_gather.offset-gather-depth-2d
ES31-CTS.texture_gather.offset-gather-depth-2darray
ES31-CTS.layout_binding.sampler2D_layout_binding_texture_ComputeShader
ES31-CTS.layout_binding.sampler2DArray_layout_binding_texture_ComputeShader
ES31-CTS.explicit_uniform_location.uniform-loc-types-samplers
ES31-CTS.compute_shader.resources-texture
Some of them were actually passing by luck on some generations even
though we weren't uploading sampler state tables explicitly for the
compute stage, most likely because they relied on the cached sampler
state left from previous rendering to be close enough.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92589
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93312
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93325
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93407
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93725
Reported-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reviewed-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
This reuses the NEW_SAMPLER_STATE_TABLE state bit (currently only used
on pre-Gen7 hardware) to signal that the sampler state tables have
changed in order to make sure that the GPGPU interface descriptor is
updated.
Reviewed-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
I have a patch that writes shaders as .shader_test files, and it uses
this function to create the headers (i.e. [vertex shader]).
[tess ctrl shader] isn't a valid shader_runner header - it's spelled
out as [tessellation control shader].
There's no real reason to abbreviate it, so spell it out.
v2: Rebase on Rob's patches to move the code.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Even if re-linking fails rendering shouldn't fail as the previous
succesfully linked program will still be available. It also shouldn't
be possible to have an unlinked program as part of the current rendering
state.
This fixes a subtest in:
ES31-CTS.sepshaderobjs.StateInteraction
This change should improve performance on CPU limited benchmarks as noted
in commit d6c6b186cf.
>From Section 7.3 (Program Objects) of the OpenGL 4.5 spec:
"If a program object that is active for any shader stage is re-linked
unsuccessfully, the link status will be set to FALSE, but any existing
executables and associated state will remain part of the current rendering
state until a subsequent call to UseProgram, UseProgramStages, or
BindProgramPipeline removes them from use. If such a program is attached to
any program pipeline object, the existing executables and associated state
will remain part of the program pipeline object until a subsequent call to
UseProgramStages removes them from use. An unsuccessfully linked program may
not be made part of the current rendering state by UseProgram or added to
program pipeline objects by UseProgramStages until it is successfully
re-linked."
"void UseProgram(uint program);
...
An INVALID_OPERATION error is generated if program has not been linked, or
was last linked unsuccessfully. The current rendering state is not modified."
V2: apply the rule to both core and compat.
Cc: Tapani Pälli <tapani.palli@intel.com>
Cc: Brian Paul <brianp@vmware.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
From the ARB_shading_language_420pack spec:
"More than one layout qualifier may appear in a single
declaration. If the same layout-qualifier-name occurs in
multiple layout qualifiers for the same declaration, the
last one overrides the former ones."
The parser was already failing correctly when the extension is
not available but testing for duplicates within a single layout
qualifier was still causing this to fail when available as both
cases share the same function for merging.
Here we add a parameter to differentiate between the two uses
and apply it to the duplicate test.
Acked-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Chris Forbes <chrisf@ijw.co.nz>
In order to only create a single node for each default declaration
we add a new boolean parameter to the in/out merge function to
only create one once we reach the rightmost layout qualifier.
From the ARB_shading_language_420pack spec:
"More than one layout qualifier may appear in a single
declaration. If the same layout-qualifier-name occurs in
multiple layout qualifiers for the same declaration, the
last one overrides the former ones."
Acked-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Chris Forbes <chrisf@ijw.co.nz>
This will allow merging of duplicate layout qualifiers as allowed
by ARB_shading_language_420pack
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Chris Forbes <chrisf@ijw.co.nz>
This is added by ARB_enhanced_layouts although it doesn't fit
into any of the six main changes so we enable this independently.
From the ARB_enhanced_layouts spec:
"More than one layout qualifier may appear in a single
declaration. Additionally, the same layout-qualifier-name
can occur multiple times within a layout qualifier or across
multiple layout qualifiers in the same declaration"
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Chris Forbes <chrisf@ijw.co.nz>
We were diligently setting Wayland buffers as non-busy, but nowhere in
the code did we set them to busy when submitted to the server. This
meant that acquire_next_image would only ever find the same buffer in
a loop, over and over.
Signed-off-by: Daniel Stone <daniels@collabora.com>
In acquire_next_image, we are waiting for a wl_buffer::release to arrive
and release one of the buffers in our swapchain. Most compositors don't
explicitly flush release events, so we may need to perform a roundtrip
instead, to ensure the event arrives.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Apparently, nobody has combined stippling with a fragment shader
containing immediates in almost five years...
Fixes a bug in Kodi with radeonsi reported by Christian König.
Cc: "11.0 11.1" <mesa-stable@lists.freedesktop.org>
Tested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Sanity check of BindVertexBuffer for OpenGL ES in
_mesa_handle_bind_buffer_gen breaks OpenGL ES 2 conformance.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93426
Signed-off-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Print the stream value not the pointer to the expression,
also use the unsigned format specifier.
Cc: 11.1 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
SPIR-V only has one ImageQuerySize opcode that has to work for both
textures and storage images. Therefore, we have to special-case that one a
bit and look at the type of the incoming image handle.
Image intrinsics always take a vec4 coordinate and always return a vec4.
This simplifies the intrinsics a but but also means that they don't
actually match the incomming SPIR-V. In order to compensate for this, we
add swizzling movs for both source and destination to get the right number
of components.
The whole point of inlining sources is to reduce loads. We can end up in
a situation where one value is used a lot of times, and one value is
used only once per instruction. The once-per-instruction one is the one
that should get inlined, but with the previous algorithm, it was given
no preference.
This flips things around to preferring putting less-referenced values
into src1 which increases the likelihood of them being inlined.
While we're at it, adjust the heuristic to not treat 0 as an immediate,
as well as (effectively) check for situations where LIMMs can't be
loaded. All this yields improvements on nvc0:
total instructions in shared programs : 6261157 -> 6255985 (-0.08%)
total gprs used in shared programs : 945082 -> 943417 (-0.18%)
total local used in shared programs : 30372 -> 30288 (-0.28%)
total bytes used in shared programs : 50089256 -> 50047880 (-0.08%)
local gpr inst bytes
helped 21 822 3332 3332
hurt 0 278 565 565
And more importantly avoids generating really bad code with SSBOs, where
we end up checking a lot of different values (usually immediates) against
the length.
On nv50 we get comparable results, and even improve packing (bytes went
down more than instructions):
total instructions in shared programs : 6346564 -> 6341277 (-0.08%)
total gprs used in shared programs : 728719 -> 725131 (-0.49%)
total local used in shared programs : 3552 -> 3552 (0.00%)
total bytes used in shared programs : 43995688 -> 43932928 (-0.14%)
local gpr inst bytes
helped 0 1380 3252 3774
hurt 0 287 1710 1365
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
An issue could still occur if the base level is set, but fixing that
would require a lot more logic.
This fixes the recently-failing texelFetch 3D tests because the mipmaps
were no longer being generated, which in turn caused the copying logic
to be hit, which in turn didn't work because of the broken
width/height/depth.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The Vulkan spec requires all allocations that happen for device creation to
happen with SCOPE_DEVICE. Since meta calls into other things that allocate
memory, the easiest way to do this is with an allocator.
Once we go past half of the "GPR" register file, it seems like we need
to run frag shader with smaller threadsize. (The vertex shader already
runs at TWO_QUADS, which is the minimum.)
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Some a4xx firmware doesn't implement the "PFD" (prefetch-disabled)
version of the CP_INDIRECT_BUFFER packet. So allow for PFD vs PFE per
generation. Switch a3xx and a4xx over to using prefetch-enabled version
(which is also what blob does.. it seems only on a2xx we cannot use
PFE).
Signed-off-by: Rob Clark <robclark@freedesktop.org>
The border color packet is specified as a 64-byte aligned address relative
to dynamic state base address. The way the packing functions are currently
set up, we need to provide it with (offset >> 6) because it just shoves the
bits in where the PRM says they go and isn't really aware that it's an
address.
The __gen_fixed helper properly clamps the value and also handles negative
values correctly. Eventually, we need to make the scripts generate this
and use it for more things.
Instead of emitting the continue before the loop body we emit it
afterwards. Then, once we've finished with the entire function, we run
nir_repair_ssa to add whatever phi nodes are needed.
The efficiency should be approximately the same. We do a little more work
per phi node because we have to sort the predecessors. However, we no
longer have to walk the blocks a second time to pop things off the stack.
The bigger advantage, however, is that we can now re-use the phi placement
and per-block SSA value tracking in other passes.
Right now, we have phi placement code in two places and there are other
places where it would be nice to be able to do this analysis. Instead of
repeating it all over the place, this commit adds a helper for placing all
of the needed phi nodes for a value.