This should be at least as fast as using fast_idiv_by_const, and has the
advantage that the precomputation is simple enough to be evaluated at
Mesa-compile time for hash tables and sets which have a fixed table of
possible divisors.
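For reference, the precomputation in question is essentially Lemire's fast
unsigned remainder by a constant. A minimal sketch of the trick (the names
below are illustrative, not Mesa's actual helpers; GCC/Clang 128-bit integer
support is assumed):

  #include <assert.h>
  #include <stdint.h>

  /* Precomputed once per entry in the fixed table of possible divisors
   * (table sizes); valid for divisors > 1. */
  #define REMAINDER_MAGIC(divisor) (~0ull / (divisor) + 1)

  /* Computes n % divisor using only multiplies and shifts at runtime. */
  static inline uint32_t
  fast_urem32(uint32_t n, uint32_t divisor, uint64_t magic)
  {
     uint64_t lowbits = magic * n;
     return ((__uint128_t)lowbits * divisor) >> 64;
  }

  int
  main(void)
  {
     const uint32_t size = 103;                    /* hypothetical table size */
     const uint64_t magic = REMAINDER_MAGIC(size); /* compile-time constant */

     for (uint32_t h = 0; h < 1000000; h++)
        assert(fast_urem32(h, size, magic) == h % size);
     return 0;
  }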
Acked-by: Eric Anholt <eric@anholt.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
A significant portion of the time spent in nir_opt_cse for the Dolphin
ubershaders was in resizing the set. When resizing a hash table, we know
in advance that each new element to be inserted will be different from
every other element, so we don't have to compare them, and there will be
no tombstone elements, so we don't have to worry about caching the
first-seen tombstone. We add a specialized add function which skips
these steps entirely, speeding up resizing.
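As a rough sketch of the idea (a toy open-addressing set with linear probing,
not Mesa's actual util/set code), the rehash path only has to find an empty
slot; it never compares keys and never sees tombstones:

  #include <stdint.h>
  #include <stdlib.h>

  struct toy_entry {
     uint32_t hash;
     const void *key;   /* NULL means the slot is empty */
  };

  struct toy_set {
     struct toy_entry *table;
     uint32_t size;
  };

  /* Add during rehash: every incoming key is known to be distinct and
   * there are no tombstones, so just probe for the first empty slot. */
  static void
  toy_set_add_rehash(struct toy_set *set, uint32_t hash, const void *key)
  {
     uint32_t probe = hash % set->size;

     while (set->table[probe].key != NULL) {
        probe++;
        if (probe == set->size)
           probe = 0;
     }
     set->table[probe].hash = hash;
     set->table[probe].key = key;
  }

  static void
  toy_set_rehash(struct toy_set *set, uint32_t new_size)
  {
     struct toy_entry *old_table = set->table;
     uint32_t old_size = set->size;

     /* Allocation-failure handling omitted in this sketch. */
     set->table = calloc(new_size, sizeof(*set->table));
     set->size = new_size;

     for (uint32_t i = 0; i < old_size; i++) {
        if (old_table[i].key != NULL)
           toy_set_add_rehash(set, old_table[i].hash, old_table[i].key);
     }
     free(old_table);
  }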
Compile-time results from my shader-db database:
Difference at 95.0% confidence
-2.29143 +/- 0.845534
-0.529475% +/- 0.194767%
(Student's t, pooled s = 1.08807)
Reviewed-by: Eric Anholt <eric@anholt.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
To keep the set and hash table in sync. Note that some of this had
already been done for hash tables, in particular pulling out the
hash % ht->size computation.
Reviewed-by: Eric Anholt <eric@anholt.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Unfortunately GCC can't do this for us, probably because we call the key
comparison function which GCC can't prove won't modify arbitrary memory.
This is a pretty hot function, so do the optimization manually to be
sure the compiler will get it right.
While we're here, make the computation of the new probe address use a
single conditional subtract instead of a modulo, since we know that it
won't ever get as big as 2 * ht->size before the modulo. Modulos tend to
be pretty expensive operations.
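A sketch of both changes on a toy double-hashed table (illustrative names,
not the real _mesa_hash_table code):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct toy_entry {
     uint32_t hash;
     const void *key;   /* NULL means the slot is empty */
  };

  struct toy_ht {
     struct toy_entry *table;
     uint32_t size;
     bool (*key_equals)(const void *a, const void *b);
  };

  static struct toy_entry *
  toy_ht_search(struct toy_ht *ht, uint32_t hash, const void *key)
  {
     /* Hoist the loads out of the loop by hand: the compiler can't,
      * because for all it knows key_equals() writes to *ht. */
     struct toy_entry *table = ht->table;
     const uint32_t size = ht->size;
     bool (*const key_equals)(const void *, const void *) = ht->key_equals;

     uint32_t probe = hash % size;
     const uint32_t step = 1 + hash % (size - 1);   /* double hashing */

     while (table[probe].key != NULL) {
        if (table[probe].hash == hash && key_equals(table[probe].key, key))
           return &table[probe];

        /* probe < size and step < size, so probe + step < 2 * size and a
         * single conditional subtract replaces the modulo. */
        probe += step;
        if (probe >= size)
           probe -= size;
     }
     return NULL;
  }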
shader-db compile time results for my database:
Difference at 95.0% confidence
-2.24934 +/- 0.69897
-0.516296% +/- 0.159993%
(Student's t, pooled s = 0.983684)
Reviewed-by: Eric Anholt <eric@anholt.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Before this change, we were searching for each instruction twice, once
when checking if it exists and once when figuring out where to insert
it. By using the new function, we can do everything we need to do in one
operation.
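The pattern is roughly the following (a toy linear-probing set, not the real
util/set API): a single probe sequence either finds an existing equal key or
inserts the new one, and the caller can tell which case happened by comparing
the returned entry's key with the one it passed in.

  #include <stdint.h>
  #include <string.h>

  struct toy_entry {
     uint32_t hash;
     const char *key;   /* NULL means the slot is empty */
  };

  struct toy_set {
     struct toy_entry *table;
     uint32_t size;     /* kept below 100% load, so a free slot always exists */
  };

  /* Search and insert in one pass: if an equal key is already present,
   * return its entry; otherwise add the key in the first free slot and
   * return the new entry. */
  static struct toy_entry *
  toy_set_search_or_add(struct toy_set *set, uint32_t hash, const char *key)
  {
     uint32_t probe = hash % set->size;

     for (;;) {
        struct toy_entry *entry = &set->table[probe];

        if (entry->key == NULL) {       /* not found: add it right here */
           entry->hash = hash;
           entry->key = key;
           return entry;
        }
        if (entry->hash == hash && strcmp(entry->key, key) == 0)
           return entry;                /* found an existing equal key */

        probe++;
        if (probe == set->size)
           probe = 0;
     }
  }

For CSE, the caller then checks whether the returned entry holds the
instruction it just passed in; if not, an equivalent instruction already
existed and the new one's uses can be rewritten.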
Compilation time numbers for my shader-db database:
Difference at 95.0% confidence
-4.04706 +/- 0.669508
-0.922142% +/- 0.151948%
(Student's t, pooled s = 0.95824)
Reviewed-by: Eric Anholt <eric@anholt.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Unlike _mesa_set_search_and_add(), it doesn't replace an entry if it's
found, returning it instead. This is useful for nir_instr_set, where
we have to know both the original instruction and its
equivalent.
Reviewed-by: Eric Anholt <eric@anholt.net>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
ncomp is never set for vertex shaders, but a3xx and a4xx still use it.
Fixes: 831f1a05c0 ("freedreno/ir3: rework varying packing")
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Rob Clark <robdclark@chromium.org>
On some architectures, Boolean values used to control conditional
branches or conditional selection must be propagated into a flag. This
generally means that a stored Boolean value must be compared with zero.
Rather than force the generation of extra compares with zero, re-emit
the original comparison instruction. This can save register pressure by
not needing to store the Boolean value.
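At the source level the transformation is roughly the following (a
hypothetical C illustration of the idea; the pass itself operates on NIR):

  #include <stdbool.h>

  /* Before: the stored Boolean has to stay live across the intervening
   * code and then be compared with zero to produce the branch flag. */
  static int
  branch_on_stored_bool(float a, float b)
  {
     bool cond = a < b;
     /* ... lots of intervening code ... */
     return cond ? 1 : 0;
  }

  /* After: the original comparison is re-emitted next to the branch, so
   * the flag comes straight from the compare.  Once every user is
   * rewritten this way the stored Boolean is dead and DCE removes it;
   * only a and b need to stay live. */
  static int
  branch_on_rematerialized_compare(float a, float b)
  {
     /* ... lots of intervening code ... */
     return (a < b) ? 1 : 0;
  }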
There are several possible areas for future improvement to this pass:
1. Be more conservative. If both sources to the comparison instruction
are non-constants, it may be better for register pressure to emit the
extra compare. The current shader-db results on Intel GPUs (next
commit) lead me to believe that this is not currently a problem.
2. Be less conservative. Currently the pass requires that all users of
the comparison match the pattern. The idea is that after the pass is
complete, no instruction will use the resulting Boolean value. The only
uses will be of the flag value. It may be beneficial to relax this
requirement in some cases.
3. Be less conservative. Also try to rematerialize comparisons used for
discard_if intrinsics. After changing the way the Intel compiler
generates code for discard_if (see MR!935), I tried implementing this
already. The changes were pretty small. Instructions were helped in 19
shaders, but, overall, cycles were hurt. A commit "nir: Rematerialize
comparisons for nir_intrinsic_discard_if too" is on my fd.o cgit.
4. Copy the preceding ALU instruction. If the comparison is a
comparison with zero, and it is the only user of a particular ALU
instruction (e.g., (a+b) != 0.0), it may be a further improvement to also
copy the preceding ALU instruction. On Intel GPUs, this may enable
cmod propagation to make additional progress.
v2: Use a much simpler method to get the prev_block for an if-statement.
Suggested by Tim.
Reviewed-by: Matt Turner <mattst88@gmail.com>
And instead, link them as they are added.
Makes things a bit clearer and prepares for future work such as FB reload
jobs.
Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We now bounds check properly in the uniform loading fast path, so
there's no need to disable it by pretending there are other UBO bindings
in use. The way this looks at the variable name was causing problems
when two piglit shaders, one with a name that triggered the hack and one
that didn't, got hashed to the same thing after stripping out the names.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
location is the API-level location, but driver_location is the actual
location the uniform gets passed to the driver. This apparently only
caused failures with builtins, where the location is 0 because it's
represented via the state tokens instead.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
ac expands the store to 32-bit components for us, but we still have to
deal with storing up to 8 components, and when a varying is split across
two vec4 slots we have to calculate the address again for the second
slot, since they aren't adjacent in memory. I didn't do this on the ac
level because we should generate better indexing arithmetic for the lds
store, where slots are contiguous.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Found with Jason's new metadata rework (https://gitlab.freedesktop.org/mesa/mesa/merge_requests/950).
Fixes: af355aaa07 ("nir: add nir_opt_move_load_ubo() optimization pass")
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Because the core principle of the vars_to_ssa pass is that it globally
(within a function) looks at all of the uses of a never-indirected path
and does a full into-SSA on that path, it can't handle a path which has
any chance of having aliasing. If a function_temp variable has a cast
or anything else which may cause aliasing, we have to assume that all
paths to that variable may alias and ignore the entire variable.
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We're about to change the meaning of get_deref_node returning NULL so we
need a non-NULL value to mean properly undefined.
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
This lets passes easily detect derefs which have uses that fall outside
the standard load/store/copy pattern so they can bail appropriately.
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
When we inlined cf_node_has_side_effects into node_is_dead, all the
conditions flipped and we forgot to flip one. Fortunately, it doesn't
matter right now because no one uses this pass on shaders with more than
one function.
Fixes: b50465d197 ("nir/dead_cf: Inline cf_node_has_side_effects")
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
When creating function parameters, we create pointers from SSA
values. This creates NIR casts with stride 0, but we have nowhere
else to get the stride from. Later passes that lower explicit IO
need this stride value to do the right thing.
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Debugging use of unsafe iterators when you should have used the _safe
version sucks. Add some DEBUG build support to catch and assert if
someone does that.
I didn't update the UPPERCASE versions of the iterators. They should
probably be deprecated/removed.
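The check can look something like this (a generic sketch, not Mesa's actual
list.h): in debug builds, verify that the current node is still properly
linked before advancing, which fires if the node was removed mid-iteration
without the _safe variant.

  #include <assert.h>

  struct list_head {
     struct list_head *prev, *next;
  };

  #ifndef NDEBUG
  /* If the current node was unlinked during iteration, its neighbours no
   * longer point back at it and this assertion fires. */
  #define LIST_ITER_CHECK(node) \
     (assert((node)->next->prev == (node) && \
             (node)->prev->next == (node)), (node))
  #else
  #define LIST_ITER_CHECK(node) (node)
  #endif

  #define list_for_each(pos, head) \
     for (pos = (head)->next; LIST_ITER_CHECK(pos) != (head); pos = pos->next)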
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
This mode is used by PhysicalStorageBufferEXT storage class.
Fixes: 8bdf5a008b ("nir: Allow derefs to be used as phi sources")
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cannonlake hardware adds a new resolve type in 3DSTATE_PS called
FAST_CLEAR_0 which does an ambiguate. Now that the hardware can do it
directly, we should use that instead of binding the CCS as a render
target and doing it manually. This was tested with a full Vulkan CTS
run on Cannonlake.
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
No significant code changes are needed to enable the extension;
just update the SWR capabilities and the documentation.
Reviewed-by: Alok Hota <alok.hota@intel.com>
We set header_present but then pass it some random garbage. Give it g0
instead. I'm not actually sure this does anything, but g0 is the usual
header data and it's what the Windows driver does, so it seems like a
good idea.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The driver should already support this without any changes.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Basically, this extension allows applications to use custom
sample locations. It doesn't support variable sample locations
during a subpass. Note that we don't have to upload the user
sample locations because the spec doesn't allow this.
The extension is currently disabled because the driver needs to
support variable sample locations during layout transitions. The
depth decompress needs to know them and that's a bit invasive.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
We only need the lock for:
1. Rummaging through the cache
2. Allocating VMA
We don't need it for alloc_fresh_bo(), which does GEM_CREATE, and also
SET_DOMAIN to allocate the underlying pages. The idea behind calling
SET_DOMAIN was to avoid a lock in the kernel while allocating pages;
now we avoid our own global lock as well.
We do have to re-lock around VMA. Hopefully this shouldn't happen too
much in practice because we'll find a cached BO in the right memzone
and not have to reallocate it.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris pointed out that the order between SET_DOMAIN and SET_TILING
doesn't matter, so we can just do the page allocation when creating
a new BO. Simplifies the flow a bit.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Mathias Fröhlich reported that commit 6244da8e23 crashes.
list_for_each_entry_safe is safe against removing the current entry,
but iris_bo_cache_purge_bucket was potentially removing next entries
too, which broke our saved next pointer.
To fix this, don't bother with the iris_bo_cache_purge_bucket step.
We just detected a single entry where the kernel has purged the BO's
memory, and so it isn't a usable entry for our cache. We're about to
continue the search with the next BO. If that one's purged, we'll
clean it up too. And so on.
We may miss cleaning up purged BOs that are further down the list
after non-purged BOs...but that's probably fine. We still have the
time-based cleaner (cleanup_bo_cache) which will take care of them
eventually, and the kernel's already freed their memory, so it's not
that harmful to have a few kicking around a little longer.
Fixes: 6244da8e23 ("iris: Dig through the cache to find a BO in the right memzone")
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
This saves some util_vma thrash when the first entry in the cache
happens to be in a different memory zone, but one just a tiny bit
further ahead is already in the right zone and instantly reusable.
Hopefully the cost
of a little extra searching won't break the bank - if it does, we
can consider having separate list heads or keeping a separate VMA
cache.
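Conceptually the search is something like this (toy types and a made-up
zone helper purely for illustration; the real code walks iris's bucket
free list and uses its own memory-zone helpers):

  #include <stddef.h>
  #include <stdint.h>

  enum toy_memzone { ZONE_SHADER, ZONE_SURFACE, ZONE_OTHER };

  struct toy_bo {
     uint64_t address;
     struct toy_bo *next;   /* bucket free list */
  };

  /* Hypothetical zone split, purely for illustration. */
  static enum toy_memzone
  toy_memzone_for_address(uint64_t address)
  {
     if (address < (1ull << 32))
        return ZONE_SHADER;
     if (address < (2ull << 32))
        return ZONE_SURFACE;
     return ZONE_OTHER;
  }

  /* Instead of only taking the head of the bucket, walk a little way down
   * the free list looking for a BO whose address already lives in the
   * desired memory zone, so its VMA can be reused as-is. */
  static struct toy_bo *
  find_cached_bo_in_zone(struct toy_bo **head, enum toy_memzone zone)
  {
     for (struct toy_bo **prev = head; *prev != NULL; prev = &(*prev)->next) {
        struct toy_bo *bo = *prev;

        if (toy_memzone_for_address(bo->address) == zone) {
           *prev = bo->next;   /* unlink it from the bucket */
           return bo;
        }
     }
     return NULL;   /* nothing suitable; fall back to the old path */
  }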
Improves OglDrvRes performance by 22%, recovering the regression from
deleting the bucket allocators in 694d1a08d3.
Thanks to Clayton Craft for alerting me to the regression.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
There's enough going on here to warrant a helper. This also simplifies
the control flow and eliminates the last non-error-case goto.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
It is unlikely that we would fail to map a cached BO in order to zero
its contents. When we did, we would free the first BO in the cache and
try again with the second. It's possible that this next BO already had
a map setup, in which case we'd succeed. But if it didn't, we'd likely
fail again in the same manner.
There's not much point in optimizing this case (and frankly, if we're
out of CPU-side VMA we should probably dump the cache entirely)...so
instead, just fall back to allocating a fresh BO from the kernel which
will already be zeroed so we don't have to try and map it.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Both the from-cache and fresh-from-GEM cases were calling SET_TILING.
In the cached case, we would retry the allocation on failure, pitching
one BO from the cache each time. This is silly, because the only time
it should fail is if the tiling or stride parameters are unacceptable,
which has nothing to do with the particular BO in question. So there's
no point in retrying - we should simply fail the allocation.
This patch moves both calls to bo_set_tiling_internal() below the
cache/fresh split, so we have it at a single point in time instead
of two.
To preserve the ordering between SET_TILING and SET_DOMAIN, we move
that below as well. (I am unsure if the order matters.)
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
We mark snooped BOs as non-reusable, so we never return them to the
cache. This means that we'd need to call I915_GEM_SET_CACHING to make
any BO we find in the cache snooped. But then again, any BO we freshly
allocate from the kernel will also be non-snooped, so it has the same
issue. There's really no reason to skip the cache - we may as well use
it to avoid the I915_GEM_CREATE overhead.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
util_vma needs to be protected by a lock. All other callers of
vma_alloc and vma_free appear to be holding a lock already.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
The goto jumped over the mtx_lock, but proceeded to hit the mtx_unlock.
We can simply set the bucket to NULL and it will skip the cache without
goto, and without messing up locking.
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>