Otherwise, nir_intrinsic_align() will assert when called on the intrinsics
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
It shouldn't matter, but the 1 was leftover from when it was handled
together with workgroup_size and num_work_groups.
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
The former was always true and hence dead code. We will want to
explicitly declare the ring offset register with ACO, but we also want
to declare the scratch offset too, and we can't try to disable it since
ACO also supports spilling and the determination of whether spilling has
to happen occurs well after setting up registers. So replace
supports_spill with something that will actually be used for ACO.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Due to how LLVM works we have to make some of the FS inputs become
vectors, and therefore have to split them early so that they don't take
up extra register pressure due to how RA currently works.
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
ac_shader_args will be similar to ac_shader_abi, except for being free
from LLVM-specific concepts and therefore capable of being shared
between LLVM and ACO. This will help us accomplish a few different
things:
- Decouple setting up SGPR and VGPR arguments from translating to LLVM,
so that we can reference these arguments in NIR lowering passes, which
will let us lower e.g. descriptor sets in NIR.
- Stop using radv-specific structures for things like determining the
chip generation in ACO.
In the end, we should replace ac_shader_abi with this structure +
driver-specific lowering passes.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
We'll duplicate this in a header file in the next commit, and then
remove the original enum. Just rename it temporarily so that things
keep building.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Fixes dEQP-VK.compute.builtin_var.local_invocation_index with
RADV_PERFTEST=cswave32.
My initial fix was to lower it but Rhys suggested the shift-right
and it's much better like this.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
They are broken like on GFX6-GFX7. It seems better to disable them
instead of enabling a broken feature.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
In order to prevent a potential malicious pipeline tainting our
secure compile process and interfering with successive pipelines
we want to create a fresh fork for each pipeline compile.
Benchmarking has shown that simply forking on each pipeline
creation doubles the total time it takes to compile a fossilize db
collection. So instead here we fork the process at device creation
so that we have a slim copy of the device and then fork this
otherwise idle and untainted process each time we compile a
pipeline. Forking this slim copy of the device results in only a
20% increase in compile time vs a 100% increase.
Fixes: cff53da3 ("radv: enable secure compile support")
This will be used to create a communication pipe between the user
facing device and a freshly forked (per pipeline compile) slim copy
of that device.
We can't use pipe() here because the fork will not be a direct fork
of the user facing process. Instead we use a previously forked
copy of the process that was forked at device creation in order to
reduce the resources required for the fork and avoid performance
issues.
Fixes: cff53da374 ("radv: enable secure compile support")
In the following commits we want to be able to fork an existing lightweight
fork created at device creation time. In order for the user facing process
to communicate with this new fresh fork we create some members here to hold
FIFO file descriptors and a unique id.
Here we also add a new fork enum that we use to tell the lightweight
process to create a fresh fork.
For more information on why we create a fresh fork see the following
commits.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Fixes: d1b9deee ('aco: improve waitcnt insertion around loops')
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Do this by repeating processing of loops until no progress is made.
Totals from affected shaders:
SGPRS: 162576 -> 162576 (0.00 %)
VGPRS: 145228 -> 145228 (0.00 %)
Spilled SGPRs: 668 -> 668 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 15778640 -> 15771336 (-0.05 %) bytes
LDS: 146 -> 146 (0.00 %) blocks
Max Waves: 6087 -> 6087 (0.00 %)
v2: use block_kind_loop_header/block_kind_loop_exit to repeat at the end
of loops instead of at each continue
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Most of DEQP-VK.subgroups are skipped because 16-bit float aren't
supported but others pass.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
When the scratch ringbuffer settings are changed, the shader unit has
to be idle or we will have shaders using old and new settings.
That combination is not supported on the HW (likely the offset is
ringbuffer idx * WAVESIZE * 1024).
CC: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
We could enable it on GFX10 if LLVM wasn't used as a fallback for
unsupported stages. Note that the CTS only tests it if
VK_KHR_shader_float16_int8 is enabled, even though it's not a
requirement.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
The multiplication reduction is larger than it could be, but it should be
easier to implement this way.
No failures with dEQP-VK.subgroups.*int64* except those caused by LLVM
being used for other stages.
v2: don't call setFixed() for v_add carry-out, since setHint sets physReg
v3: add and use emit_vadd32() helper
v4: use num_opcodes instead of last_opcode
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev> (v3)
Should make 64-bit integer reductions easier to implement.
v4: use num_opcodes instead of last_opcode
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev> (v3)
This extension allows to use subgroup operations with 8 and 16-bits
Untested on GFX6-GFX7, and most of subgroup operations are broken
on GFX10, so don't enable it for now. Not enabled on ACO because
it's still doesn't support 8-bits/16-bits.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
The return type is always the src type (32 or 64 bits).
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This extension adds the device coherent and device uncached memory
types. It's known to be slower than non-device coherent memory but
it might be useful for debugging.
This is only exposed for chips that support L2 uncached.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
For chips that have uncached device memory (ie. MTYPE_UC).
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This actually supports more of the extension than the LLVM backend but we
can't enable it because ACO doesn't work with all stages yet.
With more of it enabled, some CTS tests fail because our 64-bit sqrt
is very imprecise. I can't find any precision requirements for it
anywhere, so I'm thinking it might be a CTS issue.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Fixes: 93c8ebfa ('aco: Initial commit of independent AMD compiler')
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
No pipeline-db changes
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Fixes: 93c8ebfa ('aco: Initial commit of independent AMD compiler')
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
ACO sets this itself and will have to set it differently in the future to
support shaderDenormFlushToZeroFloat64.
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
To not overwrite the resolve if there is pending clear aspects,
same as color resolves.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This option is useless and shouldn't be used at all.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This should remove most of the excess code size that was
introduced by making all booleans per-lane.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>