Don't have a separate mechanism for NIR constants to be removed from
the table. If unused, we will compact it away. The use_null_surface
is needed when INTEL_DISABLE_COMPACT_BINDING_TABLE is set.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Change the iris_binding_table to keep track of what surfaces are
actually going to be used, then assign binding table indices just for
those. Reducing unused bytes on those are valuable because we use a
reduced space for those tables in Iris.
The rest of the driver can go from "group indices" (i.e. UBO #2) to
BTI and vice-versa using helper functions. The value
IRIS_SURFACE_NOT_USED is returned to indicate a certain group index is
not used or a certain BTI is not valid.
The environment variable INTEL_DISABLE_COMPACT_BINDING_TABLE can be
set to skip compacting binding table.
v2: (all from Ken)
Use BITFIELD64_MASK helper. Improve comments.
Assert all group is marked as used when we have indirects.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Stop using brw_compiler to lower the final binding table indices for
surface access. This is done by simply not setting the
'prog_data->binding_table.*_start' fields. Then make the driver
perform this lowering.
This is a better place to perfom the binding table assignments, since
the driver has more information and will also later consume those
assignments to upload resources.
This also prepares us for two changes: use ibc without having to
implement binding table logic there; and remove unused entries from
the binding table.
Since the `block` field in brw_ubo_range now refers to the final
binding table index, we need to adjust it before using to index
shs->constbuf.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Shader-db results on Kaby Lake:
total instructions in shared programs: 15306230 -> 15304726 (<.01%)
instructions in affected programs: 4570 -> 3066 (-32.91%)
helped: 16
HURT: 0
total cycles in shared programs: 361703436 -> 361680041 (<.01%)
cycles in affected programs: 129388 -> 105993 (-18.08%)
helped: 16
HURT: 0
LOST: 0
GAINED: 2
The helped programs were in XCom 2, Deus Ex: Mankind Divided, and Kerbal
Space Program
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Felix noticed a crash when using INTEL_DEBUG=bat decoding. It turned
out that we were sometimes placing variable length data near the end
of a buffer, and with the decoder guessing random lengths rather than
having an actual count, it was walking off the end and crashing. So
this does more than improve the decoder output.
Unfortunately, this is a bit more complicated than i965's handling,
because we don't have a single state buffer. Various places upload
data via u_upload_mgr, and so there isn't a central place to record
the size. We don't need to catch every single place, however, since
it's only important to record variable length packets (like viewports
and binding tables).
State data also lives arbitrarily long, rather than being discarded on
every batch like i965, so we don't know when to clear out old entries
either. (We also don't have a callback when an upload buffer is
released.) So, this tracking may space leak over time. That's probably
okay though, as this is only a debugging feature and it's a slow leak.
We may also get lucky and overwrite existing entries as we reuse BOs,
though I find this unlikely to happen.
The fact that the decoder works in terms of offsets from a state base
address is also not ideal, as dynamic state base address and surface
state base address differ for iris. However, because dynamic state
addresses start from the top of a 4GB region, and binding tables start
from addresses [0, 64K), it's highly unlikely that we'll get overlap.
We can always improve this, but for now it's better than what we had.
Our tessellation control shaders can be dispatched in several modes.
- SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
channel corresponding to a different patch vertex. PATCHLIST_N will
launch (N / 8) threads. If N is less than 8, some channels will be
disabled, leaving some untapped hardware capabilities. Conditionals
based on gl_InvocationID are non-uniform, which means that they'll
often have to execute both paths. However, if there are fewer than
8 vertices, all invocations will happen within a single thread, so
barriers can become no-ops, which is nice. We also burn a maximum
of 4 registers for ICP handles, so we can compile without regard for
the value of N. It also works in all cases.
- DUAL_PATCH mode processes up to two patches at a time, where the first
four channels come from patch 1, and the second group of four come
from patch 2. This tries to provide better EU utilization for small
patches (N <= 4). It cannot be used in all cases.
- 8_PATCH mode processes 8 patches at a time, with a thread launched per
vertex in the patch. Each channel corresponds to the same vertex, but
in each of the 8 patches. This utilizes all channels even for small
patches. It also makes conditions on gl_InvocationID uniform, leading
to proper jumps. Barriers, unfortunately, become real. Worse, for
PATCHLIST_N, the thread payload burns N registers for ICP handles.
This can burn up to 32 registers, or 1/4 of our register file, for
URB handles. For Vulkan (and DX), we know the number of vertices at
compile time, so we can limit the amount of waste. In GL, the patch
dimension is dynamic state, so we either would have to waste all 32
(not reasonable) or guess (badly) and recompile. This is unfortunate.
Because we can only spawn 16 thread instances, we can only use this
mode for PATCHLIST_16 and smaller. The rest must use SINGLE_PATCH.
This patch implements the new 8_PATCH TCS mode, but leaves us using
SINGLE_PATCH by default. A new INTEL_DEBUG=tcs8 flag will switch to
using 8_PATCH mode for testing and benchmarking purposes. We may
want to consider using 8_PATCH mode in Vulkan in some cases.
The data I've seen shows that 8_PATCH mode can be more efficient in
some cases, but SINGLE_PATCH mode (the one we use today) is faster
in other cases. Ultimately, the TES matters much more than the TCS
for performance, so the decision may not matter much.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
iris_draw_vbo is divided into two functions to remove unnecessary
operations from the loop. This implementation of ARB_indirect_parameters
takes into account NV_conditional_render by saving MI_PREDICATE_RESULT
at the start of a draw call and restoring it at the end also the result
of NV_conditional_render is taken into account when computing predicates
that limit draw calls for ARB_indirect_parameters in a similar way
to 1952fd8d in ANV.
v2: Optimize indirect draws (suggested by Kenneth Graunke)
v3: (by Kenneth Graunke)
- Fix an issue where indirect draws wouldn't set patch information
before updating the compiled TCS.
- Move some code back to iris_draw_vbo to avoid duplicating it.
- Fix minor indentation issues.
Signed-off-by: Illia Iorin <illia.iorin@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Anuj fixed this in i965 and anv, but the fix never landed in iris.
Fixes tessellation corruption on Icelake. Thanks to Rafael for
bisecting this and tracking it down.
Fixes: d0996d5fab iris: Emit default L3 config for the render pipeline
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
this adds support for imports where the image data begins at an offset
from the start of the buffer, as used in h/x264
fixeskwg/mesa#47
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This makes CompressedTexSubImage from a PBO source do proper GPU
rendering to upload instead of stalling to map the PBO source on
the CPU (then copying it on the CPU).
Thanks Bas Nieuwenhuizen for pointing out that Vulkan includes this
functionality, and to Jason Ekstrand for writing the code I adapted.
Vulkan only supports a single layer, however, and this code tries to
support multiple layers as long as it's miplevel 0.
Improves performance in Sid Meier's Civilization VI:
Average frame time (ms): -3.67423% +/- 1.46201% (n=5)
99th percentile frame time (ms): -5.09910% +/- 3.87874% (n=5)
This is a port of Danylo's eca4a6548d
which fixed the hang on i965. It fixes GPU hangs in his new Piglit
test, arb_blend_func_extended-dual-src-blending-discard-without-src1.
I avoided my own review feedback here, and decided to simply adjust
3DSTATE_PS_BLEND rather than BLEND_STATE_ENTRY[0]. It has never been
clear to me which the hardware uses in every case. However, whacking
the enable in 3DSTATE_PS_BLEND seems to be sufficient to fix the hang,
and that packet is already dynamic, so it's easy to handle. I'd rather
avoid making BLEND_STATE_ENTRY[0] dynamic unless I have to.
I was setting it based off a pipe_rasterizer_state field that appears
to be entirely dead outside of the draw module respecting it.
I should be setting it when the primitive type reaching the SF is
neither points nor lines. This is, unfortunately, rather dirty,
as we have to look at the rasterizer state, the geometry shader state,
the tessellation evaluation shader state, and the primitive type...
Some of the dEQP.functional.transform_feedback tests end up doing
the following sequence of operations:
1. BeginTransformFeedback
2. PauseTransformFeedback
3. Draw
4. ResumeTransformFeedback
At step 1, we'd pack 3DSTATE_SO_BUFFER commands saying to zero the
SO_WRITE_OFFSET registers. At step 2, we disable streamout, so step 3
doesn't bother emitting those commands. Then, step 4 re-packs new
3DSTATE_SO_BUFFER commands with offset = 0xFFFFFFFF, saying to continue
appending at the existing offset. This loads the value from the BO as
the offsets - but we never actually zeroed it.
So, just maintain a flag saying "we actually emitted the commands",
and stomp offset back to zero until we emit some.
OpenGL 4.6 Spec:
"5.3.3 Rules
.......
Note: “Updates” via rendering or transform feedback
are treated consistently with updates via GL commands.
Once EndTransformFeedback has been issued, any subsequent
command in the same context that uses the results of the
transform feedback operation will see the results."
v2: removed a wrong comment
( Kenneth Graunke <kenneth@whitecape.org> )
v3: - flush+dirty depends on buffers usage history
- removed an old hack
( Kenneth Graunke <kenneth@whitecape.org> )
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110404
Signed-off-by: Andrii Simiklit <andrii.simiklit@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Just enable it during init_render_context on Gen10+, and move the
Gen9 state tracking into iris_genx_state so it only exists on Gen9.
Reviewed-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
this is basically just porting the following two commits to gallium:
d8b50e152a5c454661c6resolveskwg/mesa#49
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
this hooks up the iris gallium driver to existing mesa bits which handle
the implementation
resolveskwg/mesa#8
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We need to subtract the starting offset from the final offset before
dividing by the stride. See src/intel/vulkan/genX_cmd_buffer.c:3142.
Not known to fix anything.
Applications frequently call glBufferSubData() to consecutive regions
of a VBO to append new vertex data. If no data exists there yet, we
can promote these to unsynchronized writes, even if the buffer is busy,
since the GPU can't be doing anything useful with undefined content.
This can avoid a bunch of unnecessary blitting on the GPU.
u_threaded_context would do this for us, and in fact prohibits us from
doing so (see TC_TRANSFER_MAP_NO_INFER_UNSYNCHRONIZED). But we haven't
hooked that up yet, and it may be useful to disable u_threaded_context
when debugging...at which point we'd still want this optimization. At
the very least, it would let us measure the benefit of threading
independently from this optimization. And it's not a lot of code.
Removes most stall avoidance blits in "Total War: WARHAMMER."
On my Skylake GT4e at 1920x1080, this appears to improve performance
in games by the following (but I did not do many runs for proper
statistics gathering):
----------------------------------------------
| DiRT Rally | +2% (avg) | + 2% (max) |
| Bioshock Infinite | +3% (avg) | + 9% (max) |
| Shadow of Mordor | +7% (avg) | +20% (max) |
----------------------------------------------
This implements PIPE_CAP_INVALIDATE_BUFFER and invalidate_resource(),
as well as the PIPE_TRANSFER_DISCARD_WHOLE_RESOURCE flag. When either
of these happen, we swap out the backing storage of the buffer for a
new idle BO, allowing us to write to it immediately without stalling
or queueing a blit.
On my Skylake GT4e at 1920x1080, this improves performance in games:
-----------------------------------------------
| DiRT Rally | +25% (avg) | +17% (max) |
| Bioshock Infinite | +22% (avg) | +11% (max) |
| Shadow of Mordor | +27% (avg) | +83% (max) |
-----------------------------------------------
This unifies a bunch of the UBO and SSBO code to use common structures.
Beyond iris_state_ref, pipe_shader_buffer also gives us a buffer size,
which can be useful when filling out the surface state.
Marek recently extended pipe->set_shader_buffers() to take an extra
writable_bitmask parameter, indicating which SSBOs are writable (some
may be bound read-only). We can use this to decide whether to set
EXEC_OBJECT_WRITE when pinning. Avoiding the write flag can save us
some cross-batch flushing if the SSBO is used for reading in both the
render and compute engines.
This 3d performance workaround was initially put in the kernel but the
media driver requires different settings so the register has been
whitelisted in i915 [1] and userspace drivers are left initializing it as
they wish.
[1] : https://patchwork.freedesktop.org/series/59494/
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Pipeline statistics queries should not count BLORP's rectangles.
(23) How do operations like Clear, TexSubImage, etc. affect the
results of the newly introduced queries?
DISCUSSION: Implementations might require "helper" rendering
commands be issued to implement certain operations like Clear,
TexSubImage, etc.
RESOLVED: They don't. Only application submitted rendering
commands should have an effect on the results of the queries.
Piglit's arb_pipeline_statistics_query-vert_adj exposes this bug when
the driver is hacked to always perform glBufferData via a GPU staging
copy (for debugging purposes).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
We were pinning it for compute shaders, and pinning it when restoring
saved buffers, but we never actually pinned it in the original batch
for VS/TCS/TES/GS/FS.
Fixes rendering in GFXBench5's Tessellation demo and a bunch of Piglit
geometry shader tests.
The GL 4.5 spec says:
"If any enabled array’s buffer binding is zero when DrawArrays or
one of the other drawing commands defined in section 10.4 is called,
the result is undefined."
The result is undefined but it should not crash.
Fixes: gl-3.1-vao-broken-attrib
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
iris_upload_border_color is passed a pointer which points to
variable that is introduced in a different scope.
CID: 1444296
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We want to skip some types of aux usages (for instance,
ISL_AUX_USAGE_HIZ when the hardware doesn't support it, or when we have
multisampling) when sampling from the surface.
Instead of checking for those cases while filling the surface state and
leaving it blank, let's have a version of aux.possible_usages for
sampling. This way we can also avoid allocating surface state for the
cases we don't use.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Since we are not using it for the clear color, there's no need to
allocate it.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The swizzle for rendering surfaces is always identity. So when we are
doing the fast clear, we don't have enough information to store the
clear color OR'ed with the Shader Channel Select bits for the dword in
the SURFACE_STATE.
Instead of trying to patch up the SURFACE_STATE correctly later, by
reading the color from the clear color state buffer and then doing all
the operations to store it, let's just re-emit the whole SURFACE_STATE.
That should make things way simpler on gen8, and we can still use the
clear color state buffer for gen9+.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Newer gens can read it directly.
Also properly skip updating the ISL_AUX_USAGE_NONE surface.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
h/w specification requires this bit to be always set.
See Mesa commit 5eb173304b.
Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
From "Alpha Coverage" section of SKL PRM Volume 7:
"If Pixel Shader outputs oMask, AlphaToCoverage is disabled in
hardware, regardless of the state setting for this feature."
From OpenGL spec 4.6, "15.2 Shader Execution":
"The built-in integer array gl_SampleMask can be used to change
the sample coverage for a fragment from within the shader."
From OpenGL spec 4.6, "17.3.1 Alpha To Coverage":
"If SAMPLE_ALPHA_TO_COVERAGE is enabled, a temporary coverage value
is generated where each bit is determined by the alpha value at the
corresponding sample location. The temporary coverage value is then
ANDed with the fragment coverage value to generate a new fragment
coverage value."
Similar wording could be found in Vulkan spec 1.1.100
"25.6. Multisample Coverage"
Thus we need to compute alpha to coverage dithering manually in shader
and replace sample mask store with the bitwise-AND of sample mask and
alpha to coverage dithering.
The following formula is used to compute final sample mask:
m = int(16.0 * clamp(src0_alpha, 0.0, 1.0))
dither_mask = 0x1111 * ((0xfea80 >> (m & ~3)) & 0xf) |
0x0808 * (m & 2) | 0x0100 * (m & 1)
sample_mask = sample_mask & dither_mask
Credits to Francisco Jerez <currojerez@riseup.net> for creating it.
It gives a number of ones proportional to the alpha for 2, 4, 8 or 16
least significant bits of the result.
GEN6 hardware does not have issue with simultaneous usage of sample mask
and alpha to coverage however due to the wrong sending order of oMask
and src0_alpha it is still affected by it.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109743
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Invoking VALGRIND_CHECK_MEM_IS_DEFINED pulls in enough code to convince
gcc to not inline __gen_uint and results in a lot of packing code ending
up out-of-line with lots of stack copying. To ameliorate this, only
insert the check inside the packer if DEBUG is defined and instead
perform the validation checking before submitting the batch to the
kernel. This should give accurate results if --trace-origins=yes is
used, and failing that we can recompile in full debug mode to check on
insertion.
Improve drawoverhead baseline by 25% with a default build with
valgrind-dev installed (with effectively no loss of vg coverage).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
v2: Update tracked clear color when we update the surface state.
v3: Update all aux surface states when updating the clear color.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
If all the restrictions are satisfied, do a fast clear instead of
regular clear.
v2:
- add perf_debug() when we can't fast clear (Ken)
- improve comment: s/miptree/resource/ (Ken)
- use swizzle_color_value from blorp (Ken)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Take the clear depth into account when IRIS_DIRTY_DEPTH_BUFFER is marked
as dirty.
Also update the blorp surface clear color.
v2: Use a single if (zres && zres->aux.bo) (Ken).
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>