mesa/docs/drivers/amd/function-calls.rst

:orphan:

.. _aco-fn-calls:

Function call support in RADV/ACO
=================================

ACO supports function calls inside shaders - given a function signature and ABI, shaders can call
an arbitrary function, even via only a function pointer (i.e. with an unknown function definition).

This function call support is useful for implementing ray tracing pipelines (by representing individual RT shaders
as callable functions), but it also has potential use cases in GPGPU/Compute workloads.

This page serves to document the concepts involved in implementing function calls as well as an overview of the
implementation components.

Function call representation
----------------------------

In NIR, function calls are represented by a `nir_call_instr`. The instruction takes a `nir_function` representing
the function being called, as well as SSA defs for each call parameter.
NIR can also represent "indirect calls", i.e. calls where the function being called is
unknown - instead, the instruction takes an SSA def containing a function pointer to the callee. In this case, the
`nir_function` only serves to provide information about the function signature, i.e. how many and which parameters
the function takes.

Call instructions do not have return values - instead, return values are represented by so-called "return parameters".
Instead of an SSA value, these parameters are derefs, and the return value is written into the deref when the callee
returns. Return parameters can double as input parameters, too - the callee can read the previous value of the deref
before (potentially) overwriting it with a new value.

ACO's representation of function calls follows this very closely. Calls are described by the `p_call` pseudo-instruction.
The operands to this instruction are a function pointer (i.e. the address of the callee), followed by the call
parameters. Return parameters are handled differently, though: While the initial value of the return parameter is passed
as an operand, the call instruction produces new definitions that refer to the SSA values of the return parameters after
the function call returns. There is a special NIR intrinsic ``load_return_param_amd`` that can be used to access these
new definitions when lowering return parameter derefs to SSA form.

.. _div-calls:

Divergent calls
---------------

On CPUs, a call instruction will only ever jump to a single address. However, GPUs are SIMT, and the value of a function
pointer may be divergent, i.e. different threads try calling different functions within the same call instruction. AMD
hardware executes one instruction for all threads in lockstep, so the multiple callees have to be executed one after
the other.

This is handled by RADV in ``radv_nir_lower_call_abi``. In addition to the (non-divergent) function pointer to jump to,
``radv_nir_lower_call_abi`` prepends another parameter representing the (potentially divergent) function pointer for all
lanes. For callable functions, ``radv_nir_lower_call_abi`` wraps the function body in a condition that verifies that the
current thread's (divergent) pointer matches the (non-divergent) pointer that is currently being executed. This serves
to "mask off" all threads that wanted to jump to a different function than what is currently executing. At the very end,
``radv_nir_lower_call_abi`` inserts some code deciding whether to jump to the next callee or to return.

.. _stack:

Stack
-----

Supporting arbitrary function calls also means supporting recursion, and recursive functions need a stack.
AMD hardware provides instructions for accessing a per-thread scratch memory area in VRAM, and ACO uses this per-thread
scratch memory to set up its stack.

The stack frame for a function consists of all scratch memory allocated for this function in NIR, as well as space to
spill VGPRs if that is required. ACO adds a stack pointer as a parameter to every function - this stack pointer is added
to the offset inside the scratch space for all scratch loads/stores to make sure they don't overwrite stack frames of
caller functions.

ACO's call instructions take two stack-related operands: The current (caller) stack pointer and the caller's stack size.
When converting the call instruction to hardware instructions, ACO will add the caller stack size to the stack pointer
for the duration of the call (and subtract it again afterwards). This allows us to re-use the same stack pointer after
the call.

Implicit/System Parameters
--------------------------

In addition to parameters defined by the function signature, both RADV and ACO will insert additional parameters while
lowering calls. This is an overview of which lowering passes add which parameters.

Parameters added by ``radv_nir_lower_call_abi`` (see :ref:`Divergent calls <div-calls>`):
- "Uniform"/Non-divergent callee pointer
- Divergent function pointer

Parameters added by ACO: (see :ref:`Stack <stack>`)
- Stack pointer (uniform)

ABI Definition
--------------

The ABI (Application Binary Interface) defines specifics about the interaction between the function caller and the
callee (e.g. assignment of registers to parameters or register preservation). In ACO, the primary purpose of the ABI is
to define which register ranges are "preserved" (i.e. never overwritten by the callee) or "clobbered" (i.e. potentially
overwritten by the callee).

The caller can use preserved register ranges to store temporaries that are live across a call, and the callee can use
clobbered register ranges to store its own temporaries. If the callee wants to use registers from a preserved range,
then it needs to back up the value contained in the preserved register beforehand, and restore it when it's done using
the preserved register. Similarly, if there are not enough preserved registers for the caller to store all its
temporaries, the caller will need to spill excess temporaries to the stack.

ACO has to cater to different needs when defining ABIs: On one side, ray tracing traversal shaders demand to free up
the entire register file for the callee (Ray traversal is a really hot loop, so we don't want to spill anything at all).
Besides some parameters like the invocation ID, these shaders should be able to overwrite almost anything. On the other
side, RT traversal shaders should not be required to free up the register file when calling any-hit/intersection shaders
as this would also cause spilling during traversal. GPGPU compute workloads could fall anywhere between these extremes,
so a middle-ground solution is desirable for these.

ACO's way of defining an ABI divides the register file into "blocks" (``struct aco::ABI::RegisterBlock``). Each block
consists of a fixed number of preserved and clobbered registers, and a boolean determining whether the preserved or
clobbered registers come first in the block. Preserved and clobbered register ranges are defined by
repeating these blocks for as long as there are unassigned registers.

Some examples of preserved/clobbered register ranges using this approach::

   For all examples, there are 108 SGPRs and 128 VGPRs to assign.

   RegisterBlock:
      clobbered_size: {16 sgpr, 16 vgpr}
      preserved_size: {16 sgpr, 16 vgpr}
      clobbered_first: false
   results in:
      v0-v15:    preserved
      v16-v31:   clobbered
      v32-v47:   preserved
      v48-v63:   clobbered
      v64-v79:   preserved
      v80-v95:   clobbered
      v96-v111:  preserved
      v112-v127: clobbered

      s0-s15:   preserved
      s16-s31:  clobbered
      s32-s47:  preserved
      s48-s63:  clobbered
      s64-s79:  preserved
      s80-s95:  clobbered
      s96-s108: preserved

   RegisterBlock:
      clobbered_size: {128 sgpr, 256 vgpr}
      preserved_size: {80 sgpr, 80 vgpr}
      clobbered_first: false
   results in:
      v0-v79:   preserved
      v80-v127: clobbered

      s0-s79:   preserved
      s80-s108: clobbered

An alternating preserved-clobbered-preserved pattern is useful for generic compute workloads, because the ratio of
preserved to clobbered registers is roughly the same, no matter how many registers are used by the shaders.

The latter example where the lower part of the register file is preserved and only some registers high up in the
register file are clobbered is suitable for any-hit/intersection shaders - traversal shader temporaries can live in the
preserved part low in the register file.

This block assignment is optional - if no ``RegisterBlock`` is given, the ABI defines the entire register range as
clobbered-by-default, although parameters that are not marked as clobbered via ``ACO_NIR_PARAM_ATTRIB_DISCARDABLE``
will continue being preserved.

Parameter Register Assignment
-----------------------------

If a ``RegisterBlock`` defines preserved and clobbered ranges, then parameters are assigned registers from either range
depending on ``ACO_NIR_PARAM_ATTRIB_DISCARDABLE`` - if parameters are marked as clobbered with this attribute, then they
are assigned a register in a clobbered range, otherwise they are assigned in a register in a preserved range. The order
of the parameters in the register file is not necessarily the same order as in the function signature - they may get
reordered if it's beneficial to fill gaps or for alignment.

If there is no ``RegisterBlock``, then registers will be assigned based on alignment only.

If there is no more space for a parameter in any of its corresponding register ranges, it will be moved to the stack.