Today, we stream the compute shader thread IDs simply because they're
(annoyingly) relative to dynamic state base address. We could upload
them once at compile time, but we'd need a separate non-streaming
uploader for IRIS_MEMZONE_DYNAMIC, and I'm not sure it's worth it.
stream_state pins the buffer for use in the current batch, but also
returns a reference to the pipe_resource. We dropped this reference
on the floor, leaking a reference basically every time we dispatched
a compute shader after switching to a new one.
The reason it returns a reference is so that we can hold on to it and
re-pin it in iris_restore_compute_saved_bos, which we were also failing
to do. So if we actually filled up a batch with repeated dispatches to
the same compute shader, and flushed, then continued dispatching, we
would fail to pin it and likely GPU hang.
We were unconditionally uploading the new data, but then conditionally
using it with MEDIA_CURBE_LOAD. If we're not going to emit the command,
there's no point in uploading the data.
We only use push the compute shader thread IDs, not any actual constant
buffer data. So we should track the compute shader variant changing,
not constbuf changes.
MEDIA_INTERFACE_DESCRIPTOR's Interface Descriptor Data Start Address
field's docs say: "This bit specifies the 64-byte aligned address..."
And we were doing 32. Superfluous thread ID uploading was apparently
saving us from GPU hangs in most cases.
This commit moves the sysvals to a separate, new constant buffer
at the end (before the shader constants). It also allows us to
remove the special handling we had for cbuf0, and enables all
constant buffers to support user-specified resources and user
buffers.
v2: (by Kenneth Graunke)
- Rebase on the previous patch to fix system value uploading.
- Fix disk cache num_cbufs calculation
- Fix passthrough TCS to report num_cbufs = 1 so upload actually occurs
- Change upload_sysvals to assert that num_cbufs > 0 when
num_system_values > 0.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
I neglected to mark cbuf0_needs_upload = false after uploading it.
The obvious fix regressed user clip plane tests, because of a second
bug: we also forgot to mark that they may need re-uploading when
changing shader programs (which may have more or less system values).
Thanks to Timur Kristóf for catching the original issue.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Most vertex data lives in user VBOs in IRIS_MEMZONE_OTHER, which
typically have high bits set to 0xffff. The shader draw parameters were
being uploaded in IRIS_MEMZONE_DYNAMIC, which have high bets set to 0x2.
This was causing a lot of ping-ponging of high bits, leading to
unnecessary VF cache flushing.
Cuts 7.2% of the flushes in the Civizilation VI demo on Kabylake GT2.
This prints a log of every PIPE_CONTROL flush we emit, noting which bits
were set, and also the reason for the flush. That way we can see which
are caused by hardware workarounds, render-to-texture, buffer updates,
and so on. It should make it easier to determine whether we're doing
too many flushes and why.
Don't have a separate mechanism for NIR constants to be removed from
the table. If unused, we will compact it away. The use_null_surface
is needed when INTEL_DISABLE_COMPACT_BINDING_TABLE is set.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Change the iris_binding_table to keep track of what surfaces are
actually going to be used, then assign binding table indices just for
those. Reducing unused bytes on those are valuable because we use a
reduced space for those tables in Iris.
The rest of the driver can go from "group indices" (i.e. UBO #2) to
BTI and vice-versa using helper functions. The value
IRIS_SURFACE_NOT_USED is returned to indicate a certain group index is
not used or a certain BTI is not valid.
The environment variable INTEL_DISABLE_COMPACT_BINDING_TABLE can be
set to skip compacting binding table.
v2: (all from Ken)
Use BITFIELD64_MASK helper. Improve comments.
Assert all group is marked as used when we have indirects.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Stop using brw_compiler to lower the final binding table indices for
surface access. This is done by simply not setting the
'prog_data->binding_table.*_start' fields. Then make the driver
perform this lowering.
This is a better place to perfom the binding table assignments, since
the driver has more information and will also later consume those
assignments to upload resources.
This also prepares us for two changes: use ibc without having to
implement binding table logic there; and remove unused entries from
the binding table.
Since the `block` field in brw_ubo_range now refers to the final
binding table index, we need to adjust it before using to index
shs->constbuf.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Shader-db results on Kaby Lake:
total instructions in shared programs: 15306230 -> 15304726 (<.01%)
instructions in affected programs: 4570 -> 3066 (-32.91%)
helped: 16
HURT: 0
total cycles in shared programs: 361703436 -> 361680041 (<.01%)
cycles in affected programs: 129388 -> 105993 (-18.08%)
helped: 16
HURT: 0
LOST: 0
GAINED: 2
The helped programs were in XCom 2, Deus Ex: Mankind Divided, and Kerbal
Space Program
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Felix noticed a crash when using INTEL_DEBUG=bat decoding. It turned
out that we were sometimes placing variable length data near the end
of a buffer, and with the decoder guessing random lengths rather than
having an actual count, it was walking off the end and crashing. So
this does more than improve the decoder output.
Unfortunately, this is a bit more complicated than i965's handling,
because we don't have a single state buffer. Various places upload
data via u_upload_mgr, and so there isn't a central place to record
the size. We don't need to catch every single place, however, since
it's only important to record variable length packets (like viewports
and binding tables).
State data also lives arbitrarily long, rather than being discarded on
every batch like i965, so we don't know when to clear out old entries
either. (We also don't have a callback when an upload buffer is
released.) So, this tracking may space leak over time. That's probably
okay though, as this is only a debugging feature and it's a slow leak.
We may also get lucky and overwrite existing entries as we reuse BOs,
though I find this unlikely to happen.
The fact that the decoder works in terms of offsets from a state base
address is also not ideal, as dynamic state base address and surface
state base address differ for iris. However, because dynamic state
addresses start from the top of a 4GB region, and binding tables start
from addresses [0, 64K), it's highly unlikely that we'll get overlap.
We can always improve this, but for now it's better than what we had.
Our tessellation control shaders can be dispatched in several modes.
- SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
channel corresponding to a different patch vertex. PATCHLIST_N will
launch (N / 8) threads. If N is less than 8, some channels will be
disabled, leaving some untapped hardware capabilities. Conditionals
based on gl_InvocationID are non-uniform, which means that they'll
often have to execute both paths. However, if there are fewer than
8 vertices, all invocations will happen within a single thread, so
barriers can become no-ops, which is nice. We also burn a maximum
of 4 registers for ICP handles, so we can compile without regard for
the value of N. It also works in all cases.
- DUAL_PATCH mode processes up to two patches at a time, where the first
four channels come from patch 1, and the second group of four come
from patch 2. This tries to provide better EU utilization for small
patches (N <= 4). It cannot be used in all cases.
- 8_PATCH mode processes 8 patches at a time, with a thread launched per
vertex in the patch. Each channel corresponds to the same vertex, but
in each of the 8 patches. This utilizes all channels even for small
patches. It also makes conditions on gl_InvocationID uniform, leading
to proper jumps. Barriers, unfortunately, become real. Worse, for
PATCHLIST_N, the thread payload burns N registers for ICP handles.
This can burn up to 32 registers, or 1/4 of our register file, for
URB handles. For Vulkan (and DX), we know the number of vertices at
compile time, so we can limit the amount of waste. In GL, the patch
dimension is dynamic state, so we either would have to waste all 32
(not reasonable) or guess (badly) and recompile. This is unfortunate.
Because we can only spawn 16 thread instances, we can only use this
mode for PATCHLIST_16 and smaller. The rest must use SINGLE_PATCH.
This patch implements the new 8_PATCH TCS mode, but leaves us using
SINGLE_PATCH by default. A new INTEL_DEBUG=tcs8 flag will switch to
using 8_PATCH mode for testing and benchmarking purposes. We may
want to consider using 8_PATCH mode in Vulkan in some cases.
The data I've seen shows that 8_PATCH mode can be more efficient in
some cases, but SINGLE_PATCH mode (the one we use today) is faster
in other cases. Ultimately, the TES matters much more than the TCS
for performance, so the decision may not matter much.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
iris_draw_vbo is divided into two functions to remove unnecessary
operations from the loop. This implementation of ARB_indirect_parameters
takes into account NV_conditional_render by saving MI_PREDICATE_RESULT
at the start of a draw call and restoring it at the end also the result
of NV_conditional_render is taken into account when computing predicates
that limit draw calls for ARB_indirect_parameters in a similar way
to 1952fd8d in ANV.
v2: Optimize indirect draws (suggested by Kenneth Graunke)
v3: (by Kenneth Graunke)
- Fix an issue where indirect draws wouldn't set patch information
before updating the compiled TCS.
- Move some code back to iris_draw_vbo to avoid duplicating it.
- Fix minor indentation issues.
Signed-off-by: Illia Iorin <illia.iorin@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Anuj fixed this in i965 and anv, but the fix never landed in iris.
Fixes tessellation corruption on Icelake. Thanks to Rafael for
bisecting this and tracking it down.
Fixes: d0996d5fab iris: Emit default L3 config for the render pipeline
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
this adds support for imports where the image data begins at an offset
from the start of the buffer, as used in h/x264
fixeskwg/mesa#47
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This makes CompressedTexSubImage from a PBO source do proper GPU
rendering to upload instead of stalling to map the PBO source on
the CPU (then copying it on the CPU).
Thanks Bas Nieuwenhuizen for pointing out that Vulkan includes this
functionality, and to Jason Ekstrand for writing the code I adapted.
Vulkan only supports a single layer, however, and this code tries to
support multiple layers as long as it's miplevel 0.
Improves performance in Sid Meier's Civilization VI:
Average frame time (ms): -3.67423% +/- 1.46201% (n=5)
99th percentile frame time (ms): -5.09910% +/- 3.87874% (n=5)
This is a port of Danylo's eca4a6548d
which fixed the hang on i965. It fixes GPU hangs in his new Piglit
test, arb_blend_func_extended-dual-src-blending-discard-without-src1.
I avoided my own review feedback here, and decided to simply adjust
3DSTATE_PS_BLEND rather than BLEND_STATE_ENTRY[0]. It has never been
clear to me which the hardware uses in every case. However, whacking
the enable in 3DSTATE_PS_BLEND seems to be sufficient to fix the hang,
and that packet is already dynamic, so it's easy to handle. I'd rather
avoid making BLEND_STATE_ENTRY[0] dynamic unless I have to.
I was setting it based off a pipe_rasterizer_state field that appears
to be entirely dead outside of the draw module respecting it.
I should be setting it when the primitive type reaching the SF is
neither points nor lines. This is, unfortunately, rather dirty,
as we have to look at the rasterizer state, the geometry shader state,
the tessellation evaluation shader state, and the primitive type...
Some of the dEQP.functional.transform_feedback tests end up doing
the following sequence of operations:
1. BeginTransformFeedback
2. PauseTransformFeedback
3. Draw
4. ResumeTransformFeedback
At step 1, we'd pack 3DSTATE_SO_BUFFER commands saying to zero the
SO_WRITE_OFFSET registers. At step 2, we disable streamout, so step 3
doesn't bother emitting those commands. Then, step 4 re-packs new
3DSTATE_SO_BUFFER commands with offset = 0xFFFFFFFF, saying to continue
appending at the existing offset. This loads the value from the BO as
the offsets - but we never actually zeroed it.
So, just maintain a flag saying "we actually emitted the commands",
and stomp offset back to zero until we emit some.
OpenGL 4.6 Spec:
"5.3.3 Rules
.......
Note: “Updates” via rendering or transform feedback
are treated consistently with updates via GL commands.
Once EndTransformFeedback has been issued, any subsequent
command in the same context that uses the results of the
transform feedback operation will see the results."
v2: removed a wrong comment
( Kenneth Graunke <kenneth@whitecape.org> )
v3: - flush+dirty depends on buffers usage history
- removed an old hack
( Kenneth Graunke <kenneth@whitecape.org> )
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110404
Signed-off-by: Andrii Simiklit <andrii.simiklit@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Just enable it during init_render_context on Gen10+, and move the
Gen9 state tracking into iris_genx_state so it only exists on Gen9.
Reviewed-by: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
this is basically just porting the following two commits to gallium:
d8b50e152a5c454661c6resolveskwg/mesa#49
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
this hooks up the iris gallium driver to existing mesa bits which handle
the implementation
resolveskwg/mesa#8
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We need to subtract the starting offset from the final offset before
dividing by the stride. See src/intel/vulkan/genX_cmd_buffer.c:3142.
Not known to fix anything.
Applications frequently call glBufferSubData() to consecutive regions
of a VBO to append new vertex data. If no data exists there yet, we
can promote these to unsynchronized writes, even if the buffer is busy,
since the GPU can't be doing anything useful with undefined content.
This can avoid a bunch of unnecessary blitting on the GPU.
u_threaded_context would do this for us, and in fact prohibits us from
doing so (see TC_TRANSFER_MAP_NO_INFER_UNSYNCHRONIZED). But we haven't
hooked that up yet, and it may be useful to disable u_threaded_context
when debugging...at which point we'd still want this optimization. At
the very least, it would let us measure the benefit of threading
independently from this optimization. And it's not a lot of code.
Removes most stall avoidance blits in "Total War: WARHAMMER."
On my Skylake GT4e at 1920x1080, this appears to improve performance
in games by the following (but I did not do many runs for proper
statistics gathering):
----------------------------------------------
| DiRT Rally | +2% (avg) | + 2% (max) |
| Bioshock Infinite | +3% (avg) | + 9% (max) |
| Shadow of Mordor | +7% (avg) | +20% (max) |
----------------------------------------------
This implements PIPE_CAP_INVALIDATE_BUFFER and invalidate_resource(),
as well as the PIPE_TRANSFER_DISCARD_WHOLE_RESOURCE flag. When either
of these happen, we swap out the backing storage of the buffer for a
new idle BO, allowing us to write to it immediately without stalling
or queueing a blit.
On my Skylake GT4e at 1920x1080, this improves performance in games:
-----------------------------------------------
| DiRT Rally | +25% (avg) | +17% (max) |
| Bioshock Infinite | +22% (avg) | +11% (max) |
| Shadow of Mordor | +27% (avg) | +83% (max) |
-----------------------------------------------
This unifies a bunch of the UBO and SSBO code to use common structures.
Beyond iris_state_ref, pipe_shader_buffer also gives us a buffer size,
which can be useful when filling out the surface state.
Marek recently extended pipe->set_shader_buffers() to take an extra
writable_bitmask parameter, indicating which SSBOs are writable (some
may be bound read-only). We can use this to decide whether to set
EXEC_OBJECT_WRITE when pinning. Avoiding the write flag can save us
some cross-batch flushing if the SSBO is used for reading in both the
render and compute engines.
This 3d performance workaround was initially put in the kernel but the
media driver requires different settings so the register has been
whitelisted in i915 [1] and userspace drivers are left initializing it as
they wish.
[1] : https://patchwork.freedesktop.org/series/59494/
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Pipeline statistics queries should not count BLORP's rectangles.
(23) How do operations like Clear, TexSubImage, etc. affect the
results of the newly introduced queries?
DISCUSSION: Implementations might require "helper" rendering
commands be issued to implement certain operations like Clear,
TexSubImage, etc.
RESOLVED: They don't. Only application submitted rendering
commands should have an effect on the results of the queries.
Piglit's arb_pipeline_statistics_query-vert_adj exposes this bug when
the driver is hacked to always perform glBufferData via a GPU staging
copy (for debugging purposes).
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
We were pinning it for compute shaders, and pinning it when restoring
saved buffers, but we never actually pinned it in the original batch
for VS/TCS/TES/GS/FS.
Fixes rendering in GFXBench5's Tessellation demo and a bunch of Piglit
geometry shader tests.
The GL 4.5 spec says:
"If any enabled array’s buffer binding is zero when DrawArrays or
one of the other drawing commands defined in section 10.4 is called,
the result is undefined."
The result is undefined but it should not crash.
Fixes: gl-3.1-vao-broken-attrib
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
iris_upload_border_color is passed a pointer which points to
variable that is introduced in a different scope.
CID: 1444296
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We want to skip some types of aux usages (for instance,
ISL_AUX_USAGE_HIZ when the hardware doesn't support it, or when we have
multisampling) when sampling from the surface.
Instead of checking for those cases while filling the surface state and
leaving it blank, let's have a version of aux.possible_usages for
sampling. This way we can also avoid allocating surface state for the
cases we don't use.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Since we are not using it for the clear color, there's no need to
allocate it.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The swizzle for rendering surfaces is always identity. So when we are
doing the fast clear, we don't have enough information to store the
clear color OR'ed with the Shader Channel Select bits for the dword in
the SURFACE_STATE.
Instead of trying to patch up the SURFACE_STATE correctly later, by
reading the color from the clear color state buffer and then doing all
the operations to store it, let's just re-emit the whole SURFACE_STATE.
That should make things way simpler on gen8, and we can still use the
clear color state buffer for gen9+.
Fixes: a8b5ea8ef0 "iris: Add function to update clear color in surface state."
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>