mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2026-01-23 23:30:22 +01:00
173 lines
No EOL
9.3 KiB
ReStructuredText
173 lines
No EOL
9.3 KiB
ReStructuredText
:orphan:
|
|
|
|
.. _aco-fn-calls:
|
|
|
|
Function call support in RADV/ACO
|
|
=================================
|
|
|
|
ACO supports function calls inside shaders - given a function signature and ABI, shaders can call
|
|
an arbitrary function, even via only a function pointer (i.e. with an unknown function definition).
|
|
|
|
This function call support is useful for implementing ray tracing pipelines (by representing individual RT shaders
|
|
as callable functions), but it also has potential use cases in GPGPU/Compute workloads.
|
|
|
|
This page serves to document the concepts involved in implementing function calls as well as an overview of the
|
|
implementation components.
|
|
|
|
Function call representation
|
|
----------------------------
|
|
|
|
In NIR, function calls are represented by a `nir_call_instr`. The instruction takes a `nir_function` representing
|
|
the function being called, as well as SSA defs for each call parameter.
|
|
NIR can also represent "indirect calls", i.e. calls where the function being called is
|
|
unknown - instead, the instruction takes an SSA def containing a function pointer to the callee. In this case, the
|
|
`nir_function` only serves to provide information about the function signature, i.e. how many and which parameters
|
|
the function takes.
|
|
|
|
Call instructions do not have return values - instead, return values are represented by so-called "return parameters".
|
|
Instead of an SSA value, these parameters are derefs, and the return value is written into the deref when the callee
|
|
returns. Return parameters can double as input parameters, too - the callee can read the previous value of the deref
|
|
before (potentially) overwriting it with a new value.
|
|
|
|
ACO's representation of function calls follows this very closely. Calls are described by the `p_call` pseudo-instruction.
|
|
The operands to this instruction are a function pointer (i.e. the address of the callee), followed by the call
|
|
parameters. Return parameters are handled differently, though: While the initial value of the return parameter is passed
|
|
as an operand, the call instruction produces new definitions that refer to the SSA values of the return parameters after
|
|
the function call returns. There is a special NIR intrinsic ``load_return_param_amd`` that can be used to access these
|
|
new definitions when lowering return parameter derefs to SSA form.
|
|
|
|
.. _div-calls:
|
|
|
|
Divergent calls
|
|
---------------
|
|
|
|
On CPUs, a call instruction will only ever jump to a single address. However, GPUs are SIMT, and the value of a function
|
|
pointer may be divergent, i.e. different threads try calling different functions within the same call instruction. AMD
|
|
hardware executes one instruction for all threads in lockstep, so the multiple callees have to be executed one after
|
|
the other.
|
|
|
|
This is handled by RADV in ``radv_nir_lower_call_abi``. In addition to the (non-divergent) function pointer to jump to,
|
|
``radv_nir_lower_call_abi`` prepends another parameter representing the (potentially divergent) function pointer for all
|
|
lanes. For callable functions, ``radv_nir_lower_call_abi`` wraps the function body in a condition that verifies that the
|
|
current thread's (divergent) pointer matches the (non-divergent) pointer that is currently being executed. This serves
|
|
to "mask off" all threads that wanted to jump to a different function than what is currently executing. At the very end,
|
|
``radv_nir_lower_call_abi`` inserts some code deciding whether to jump to the next callee or to return.
|
|
|
|
.. _stack:
|
|
|
|
Stack
|
|
-----
|
|
|
|
Supporting arbitrary function calls also means supporting recursion, and recursive functions need a stack.
|
|
AMD hardware provides instructions for accessing a per-thread scratch memory area in VRAM, and ACO uses this per-thread
|
|
scratch memory to set up its stack.
|
|
|
|
The stack frame for a function consists of all scratch memory allocated for this function in NIR, as well as space to
|
|
spill VGPRs if that is required. ACO adds a stack pointer as a parameter to every function - this stack pointer is added
|
|
to the offset inside the scratch space for all scratch loads/stores to make sure they don't overwrite stack frames of
|
|
caller functions.
|
|
|
|
ACO's call instructions take two stack-related operands: The current (caller) stack pointer and the caller's stack size.
|
|
When converting the call instruction to hardware instructions, ACO will add the caller stack size to the stack pointer
|
|
for the duration of the call (and subtract it again afterwards). This allows us to re-use the same stack pointer after
|
|
the call.
|
|
|
|
Implicit/System Parameters
|
|
--------------------------
|
|
|
|
In addition to parameters defined by the function signature, both RADV and ACO will insert additional parameters while
|
|
lowering calls. This is an overview of which lowering passes add which parameters.
|
|
|
|
Parameters added by ``radv_nir_lower_call_abi`` (see :ref:`Divergent calls <div-calls>`):
|
|
- "Uniform"/Non-divergent callee pointer
|
|
- Divergent function pointer
|
|
|
|
Parameters added by ACO: (see :ref:`Stack <stack>`)
|
|
- Stack pointer (uniform)
|
|
|
|
ABI Definition
|
|
--------------
|
|
|
|
The ABI (Application Binary Interface) defines specifics about the interaction between the function caller and the
|
|
callee (e.g. assignment of registers to parameters or register preservation). In ACO, the primary purpose of the ABI is
|
|
to define which register ranges are "preserved" (i.e. never overwritten by the callee) or "clobbered" (i.e. potentially
|
|
overwritten by the callee).
|
|
|
|
The caller can use preserved register ranges to store temporaries that are live across a call, and the callee can use
|
|
clobbered register ranges to store its own temporaries. If the callee wants to use registers from a preserved range,
|
|
then it needs to back up the value contained in the preserved register beforehand, and restore it when it's done using
|
|
the preserved register. Similarly, if there are not enough preserved registers for the caller to store all its
|
|
temporaries, the caller will need to spill excess temporaries to the stack.
|
|
|
|
ACO has to cater to different needs when defining ABIs: On one side, ray tracing traversal shaders demand to free up
|
|
the entire register file for the callee (Ray traversal is a really hot loop, so we don't want to spill anything at all).
|
|
Besides some parameters like the invocation ID, these shaders should be able to overwrite almost anything. On the other
|
|
side, RT traversal shaders should not be required to free up the register file when calling any-hit/intersection shaders
|
|
as this would also cause spilling during traversal. GPGPU compute workloads could fall anywhere between these extremes,
|
|
so a middle-ground solution is desirable for these.
|
|
|
|
ACO's way of defining an ABI divides the register file into "blocks" (``struct aco::ABI::RegisterBlock``). Each block
|
|
consists of a fixed number of preserved and clobbered registers, and a boolean determining whether the preserved or
|
|
clobbered registers come first in the block. Preserved and clobbered register ranges are defined by
|
|
repeating these blocks for as long as there are unassigned registers.
|
|
|
|
Some examples of preserved/clobbered register ranges using this approach::
|
|
|
|
For all examples, there are 108 SGPRs and 128 VGPRs to assign.
|
|
|
|
RegisterBlock:
|
|
clobbered_size: {16 sgpr, 16 vgpr}
|
|
preserved_size: {16 sgpr, 16 vgpr}
|
|
clobbered_first: false
|
|
results in:
|
|
v0-v15: preserved
|
|
v16-v31: clobbered
|
|
v32-v47: preserved
|
|
v48-v63: clobbered
|
|
v64-v79: preserved
|
|
v80-v95: clobbered
|
|
v96-v111: preserved
|
|
v112-v127: clobbered
|
|
|
|
s0-s15: preserved
|
|
s16-s31: clobbered
|
|
s32-s47: preserved
|
|
s48-s63: clobbered
|
|
s64-s79: preserved
|
|
s80-s95: clobbered
|
|
s96-s108: preserved
|
|
|
|
RegisterBlock:
|
|
clobbered_size: {128 sgpr, 256 vgpr}
|
|
preserved_size: {80 sgpr, 80 vgpr}
|
|
clobbered_first: false
|
|
results in:
|
|
v0-v79: preserved
|
|
v80-v127: clobbered
|
|
|
|
s0-s79: preserved
|
|
s80-s108: clobbered
|
|
|
|
An alternating preserved-clobbered-preserved pattern is useful for generic compute workloads, because the ratio of
|
|
preserved to clobbered registers is roughly the same, no matter how many registers are used by the shaders.
|
|
|
|
The latter example where the lower part of the register file is preserved and only some registers high up in the
|
|
register file are clobbered is suitable for any-hit/intersection shaders - traversal shader temporaries can live in the
|
|
preserved part low in the register file.
|
|
|
|
This block assignment is optional - if no ``RegisterBlock`` is given, the ABI defines the entire register range as
|
|
clobbered-by-default, although parameters that are not marked as clobbered via ``ACO_NIR_PARAM_ATTRIB_DISCARDABLE``
|
|
will continue being preserved.
|
|
|
|
Parameter Register Assignment
|
|
-----------------------------
|
|
|
|
If a ``RegisterBlock`` defines preserved and clobbered ranges, then parameters are assigned registers from either range
|
|
depending on ``ACO_NIR_PARAM_ATTRIB_DISCARDABLE`` - if parameters are marked as clobbered with this attribute, then they
|
|
are assigned a register in a clobbered range, otherwise they are assigned in a register in a preserved range. The order
|
|
of the parameters in the register file is not necessarily the same order as in the function signature - they may get
|
|
reordered if it's beneficial to fill gaps or for alignment.
|
|
|
|
If there is no ``RegisterBlock``, then registers will be assigned based on alignment only.
|
|
|
|
If there is no more space for a parameter in any of its corresponding register ranges, it will be moved to the stack. |