Software Vertex Processing allows:
. Less limitations for shaders (more loops, etc)
. Less limitations for ff (more enabled lights, 255
matrices for VertexBlend)
In particular shaders can get more constants.
This patch implements support for this (not using software
rendering, but hardware rendering, as llvmpipe and dx10+ hw
have the same limits...)
This is considered a second class path. Even apps asking for
"Mixed Vertex processing" (ie the ability to switch to swvp
on demand) do not use the feature much. Some just initialize
more constants than the normal limit at the start of the
application, but never use more than the normal limit.
When the apps do not need the software vertex processing
features, they do not seem to turn it on. This means it is
ok if that path is slow.
Thus no care has been made to make the path optimized.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
This path has been disabled for some time because
of some bugs with it. It hasn't been updated to the
new features, and is not faster.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
In mixed vertex processing, the user can enable or disable
software vertex processing. It is on hardware by default.
This feature is not a state, and thus the setting doesn't
need to be recorded by stateblocks.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Buffers with this flag must be usable with both software
and hardware vertex processing. Use Staging for fast cpu access.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
ATIx are "unknown" formats that do not follow block format conventions.
Tests showed that pitch*height bytes are allocated.
apitrace used to depend on this behaviour.
It used to copy more bytes than it has to for the ATI1 block format,
but it didn't crash on Windows.
Increase buffersize for ATI1 to fix this crash.
The same issue was present in WINE but a patch has been sent by me.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
To implement the feature we copy the ps inputs to a temp array.
This is not optimal for performance, but it is the simplest solution.
This is a feature that is very very rarely used.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Gallium nine relies on aliasing to work with this function.
Without this patch, dirty region tracking was incorrect, which
could lead to incorrect textures or vertex buffers.
Fixes several game bugs with nine.
Fixes https://github.com/iXit/Mesa-3D/issues/234
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Cc: "12.0" <mesa-stable@lists.freedesktop.org>
On 32 bits system, application memory is quite limited.
softpipe uses application memory. To help prevent memory
exhaustion, limit reported memory availability to 2GB.
Some gallium nine apps do check reported memory by allocating
resources until memory is full. Gallium nine refuses allocations
when 80% of the reported memory limit is used. This change
helps some apps to start.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
On 32 bits system, application memory is quite limited.
llvmpipe uses application memory. To help prevent memory
exhaustion, limit reported memory availability to 2GB.
Some gallium nine apps do check reported memory by allocating
resources until memory is full. Gallium nine refuses allocations
when 80% of the reported memory limit is used. This change
helps some apps to start.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
On systems with more than 4GB of ram,
os_get_total_physical_memory was triggering an integer
overflow for the linux and haiku path, when on
32 bits.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94561
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Fixes regression introduced by
ecd6fce261
and is more future proof than just clearing the next
field.
Other nine usages did already zero out the templates.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Acked-by: Edward O'Callaghan <funfunctor@folklore1984.net>
When offset != 0, the valid range was wrong because the second
argument of util_range_add() is end, not size.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Normally the value is an immediate, which is moved to some temporary, so
there's no problem. In the case of a non-constant offset (as allowed by
ARB_gpu_shader5), we have to take care to copy it first before using it
to build up the bits.
This fixes a compilation error observed in F1 2015.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
A function with multiple returns would have had multiple preret settings
at the top of the function. While this is unlikely to have caused issues
since we don't use functions in earnest, it could have in some cases
overflowed the call stack, in case a function had a lot of early
returns.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
This can be helpful with R600_DEBUG=preoptir.
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Not sure if it's possible to avoid programming the block size twice (once for
the userdata and once for the dispatch).
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Mostly test code, plus one spot I noticed in r600.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
We will only hit this with multi-planar YUV external images, so we would
probably never hit this code path in the first place. But if we did, it
wouldn't do the right thing so just bail.
Signed-off-by: Rob Clark <robdclark@gmail.com>
We have to be careful to not smash the value they're clearing to, but
other than that we're fine. Avoids quad clears in Processing, which likes
to do glClear(Z|S); glClear(Z).
Improves performance of Processing's QuadRendering demo at 5000 quads by
5.46507% +/- 1.35576% (n=15 before, 32 after)
We were incrementing the count at the end of vc4_start_draw(), except that
that function returns immediately if we've already started drawing on this
batch. It also failed to count the statechanges from the GFXH-515
workaround.
This incidentally allows repeated glClear() to be coalesced, because the
fast clears aren't counted in draw_calls_queued any more. Fixes most of
the extra flushes in Processing, which emits glClear(Z|S); glClear(Z);
glClear(C) during its frame setup.
Improves performance of Processing's QuadRendering demo at 5000 quads by
3.33538% +/- 2.05846% (n=21 before, 15 after)
This slightly reduces instructions on shader-db, but I think it's just
perturbing register allocation -- the allocator should have always
trivially colored these nodes, before. This commit is just to make QIR
code failing more intelligible when register allocation fails.
If a conditional assignment is only conditioned on the exec mask, that's
still screening off the value in the executed channels (and, since we're
not storing to the unexcuted channels, we don't care what's in there).
Fixes a bunch of extra register pressure on Processing's Ribbons demo,
which is failing to allocate.
We would assertion fail in setting up the simulator the second time
around. This at least postpones the assertion failure until we've closed
all of the first set of screens and started opening a new set.
Checking if MAD is supported is definitely wrong, and it's
more likely a typo I introduced few days ago which breaks
NV50 because SHLADD is not supported there.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
When the chipset is forced with NV50_PROG_CHIPSET, we actually
only want to output the binary if NV50_PROG_DEBUG is also
enabled. Otherwise, this pollutes the shader-db output.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Only expose 512 threads/block on Fermi to not be limited by
32 GPRs/thread.
v4: - use 512 threads on Fermi, 1024 on Kepler+
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
When a variable local size is defined as specified by
ARB_compute_variable_group_size, the fixed local size is set to 0
and a SIGFPE occurs when we compute the maximum number of regs.
This allows to use 64 GPRs/thread.
v4: - use 512 threads on Fermi, 1024 on Kepler+
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
v3: - use a new case statement in r600_pipe_common.c
- fix compilation of softpipe...
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
helped some ue4 demos and divinity OS shaders
total instructions in shared programs : 2818674 -> 2818606 (-0.00%)
total gprs used in shared programs : 379273 -> 379273 (0.00%)
total local used in shared programs : 9505 -> 9505 (0.00%)
total bytes used in shared programs : 25837792 -> 25837192 (-0.00%)
local gpr inst bytes
helped 0 0 33 33
hurt 0 0 0 0
Signed-off-by: Karol Herbst <karolherbst@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Pierre Moreau <pierre.morrow@free.fr>
One of NIR's invariants is that control flow lists always start and end
with blocks. There's no good reason why we should return a cf_node from
these functions since we know that it's always a block. Making it a block
lets us remove a bunch of code.
Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>