afuc: Rework and significantly expand README.rst

This hasn't been updated since the a5xx days, and we've learned much more since then. I've tried to expand it from a random collection of notes into a more complete guide that explains how to read the firmware and understand the various tricks it uses to make code more compact.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24125>
2026-05-07 07:08:04 +02:00 · 2023-07-12 20:00:07 +02:00 · 2023-07-12 20:00:07 +02:00 · 7220deff90
commit 7220deff90
parent 426708796c
2 changed files with 382 additions and 76 deletions
--- a/src/freedreno/afuc/README.rst
+++ b/src/freedreno/afuc/README.rst
@@ -32,9 +32,9 @@ and purpose of the two microcontrollers remains the same.

 For lack of a better name, this new instruction set is called
 "Adreno Five MicroCode" or "afuc".  (No idea what Qualcomm calls
-it internally.
+it internally).

-With Adreno 6xx, the separate PF and ME are replaced with a single
+With Adreno 6xx, the separate PFP and ME are replaced with a single
 SQE microcontroller using the same instruction set as 5xx.

 .. _afuc-overview:
@@ -42,20 +42,31 @@ SQE microcontroller using the same instruction set as 5xx.
 Instruction Set Overview
 ========================

-32bit instruction set with basic arithmatic ops that can take
-either two source registers or one src and a 16b immediate.
+The afuc instruction set is heavily inspired by MIPS, but not exactly
+compatible.

-32 registers, although some are special purpose:
+Registers
+=========

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``
+Similar to MIPS, there are 32 registers, and some are special purpose. ``$00``
+is the same as ``$zero`` on MIPS: it reads 0 and writes are discarded.

-Branch instructions have a delay slot so the following instruction
-is always executed regardless of whether branch is taken or not.
+Registers are displayed in the current disassembly with a hexadecimal
+numbering, e.g. ``$0a`` is encoded as 10.

+The ABI used when processing packets is that ``$01`` contains the current PM4
+header, registers from ``$02`` up to ``$11`` are temporaries and may be freely
+clobbered by the packet handler, while ``$12`` and above are used to store
+global state like the IB level and next visible draw (for draw skipping).
+
+Unlike in MIPS, there is a special small hardware-managed stack and special
+instructions ``call``/``ret`` which use it. The stack only contains return
+addresses; there is no "stack frame" to spill values to. As a result, ``$sp``,
+``$fp``, and ``$ra`` don't exist as on MIPS. Instead the last 3 registers are
+used to :ref:`read <afuc-read>` from various queues and
+:ref:`write GPU registers <afuc-reg-writes>`. In addition there is a ``$rem``
+register which normally contains the number of words remaining in the packet
+but can also be used as a normal register in combination with the ``(rep)`` prefix.

 .. _afuc-alu:

@@ -79,10 +90,10 @@ The following instructions are available:
 - ``mul8``  - multiply low 8b of two src
 - ``min``   - minimum
 - ``max``   - maximum
- ``comp``  - compare two values
+- ``cmp``   - compare two values

-The ALU instructions can take either two src registers, or a src
-plus 16b immediate as 2nd src, ex::
+Similar to MIPS, the ALU instructions can take either two src registers, or a
+src plus 16b immediate as 2nd src, ex::

  add $dst, $src, 0x1234   ; src2 is immed
  add $dst, $src1, $src2   ; src2 is reg
@@ -92,6 +103,14 @@ The ``not`` instruction only takes a single source::
  not $dst, $src
  not $dst, 0x1234

+One departure from MIPS is that there is a special immediate-form ``mov``
+instruction that can shift the 16-bit immediate by a given amount::
+
+   mov $dst, 0x1234 << 2
+
+This replaces ``lui`` on MIPS (just use a shift of 16) while also allowing the
+quick construction of small bitfields, which comes in handy in various places.
+
 .. _afuc-alu-cmp:

 The ``cmp`` instruction returns:
@@ -133,6 +152,41 @@ due to the bit pattern it returns, for example::

 will branch if ``$02`` is less than or equal to ``$03``.

+Delay slots
+-----------
+
+Branch instructions have a delay slot so the following instruction is always
+executed regardless of whether branch is taken or not. Unlike MIPS, a branch in
+the delay slot is legal as long as the original branch and the branch in its
+delay slot are never both taken. Because jump tables are awkward and slow due
+to the lack of memory caching, this is often exploited to create dense
+sequences of branches to implement switch-case constructs::
+
+   breq $02, 0x1, #foo
+   breq $02, 0x2, #bar
+   breq $02, 0x3, #baz
+   ...
+   nop
+   jump #default
+
+Another common use of a branch in a delay slot is a double-jump (jump to one
+location if a condition is true, and another location if false). In MIPS this
+requires two delay slots::
+
+   beq $t0, 0x1, #foo
+   nop ; beq delay slot
+   b #bar
+   nop ; b delay slot
+
+In afuc this only requires a delay slot for the second branch::
+
+   breq $02, 0x1, #foo
+   brne $02, 0x1, #bar
+   nop
+
+Note that for the second branch we had to use a conditional branch with the
+opposite condition instead of an unconditional branch as in the MIPS example,
+to guarantee that at most one is ever taken.

 .. _afuc-call:

@@ -140,28 +194,49 @@ Call/Return
 ===========

 Simple subroutines can be implemented with ``call``/``ret``.  The
-jump instruction encodes a fixed offset.
+jump instruction encodes a fixed offset from the SQE instruction base.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack.  Definitely seems to be multiple
  levels of fxn call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
  f22.

+.. _afuc-nop:
+
+NOPs
+====
+
+Afuc has a special NOP encoding where the low 24 bits are ignored by the
+processor. On a5xx the high 8 bits are ``00``, on a6xx they are ``01``
+(probably to make sure that 0 is not a legal instruction, increasing the
+chances of halting immediately when something is misconfigured). This is used
+sometimes to create a "payload" that is ignored when executed. For example, the
+first 2 instructions of the firmware typically contain the firmware ID and
+version followed by the packet handling table offset encoded as NOPs. They are
+skipped when executed but they are later read as data by the bootstrap routine.

 .. _afuc-control:

-Config Instructions
-===================
+Control Registers
+=================

-These seem to read/write config state in other parts of CP.  In at
-least some cases I expect these map to CP registers (but possibly
-not directly??)
+Control registers are a special register space that can only be read/written
+directly by CP through ``cread``/``cwrite`` instructions:

 - ``cread $dst, [$off + addr], flags``
 - ``cwrite $src, [$off + addr], flags``

-In cases where no offset is needed, ``$00`` is frequently used as
-the offset.
+Control registers ``0x000`` to ``0x0ff`` are private registers used to control
+the CP, for example to indicate where to read from memory or (normal)
+registers.  ``0x100`` to ``0x17f`` are a private scratch space used by the
+firmware however it wants, for example as an ad-hoc stack to spill registers
+when calling a function or to store the scratch used in ``CP_SCRATCH_TO_*``
+packets.
+
+In cases where no offset is needed, ``$00`` is frequently used as the offset.
+
+A value of 4 for ``flags`` is known to be a pre-increment mode that writes the
+final address ``$off + addr`` to ``$off``; it's not known what other values do.

 For example, the following sequences sets::

@@ -171,7 +246,7 @@ For example, the following sequences sets::
  mov $04, $data   ; IB size in dwords

  ; sanity check # of dwords:
-  breq $04, 0x0, #l23 (#69, 04a2)
+  breq $04, 0x0, #l23

  ; this seems something to do with figuring out whether
  ; we are going from RB->IB1 or IB1->IB2 (ie. so the
@@ -185,15 +260,66 @@ For example, the following sequences sets::
  cwrite $03, [$05 + 0x0b1], 0x8
  cwrite $04, [$05 + 0x0b2], 0x8

+Unlike normal GPU registers, writing control registers seems to always take
+effect immediately; if writing a control register triggers some complex
+operation that the firmware needs to wait for, then it typically uses a
+spinloop with another control register to wait for it to finish.

+Control registers are documented in ``adreno_control_regs.xml``. The
+disassembler will try to recognize an immediate address as a known control
+register and print it, for example this sequence, which is similar to the one
+above but on a6xx::

-.. _afuc-reg-access:
+  and $05, $12, 0x0003
+  shl $05, $05, 0x0002
+  cwrite $0e, [$05 + @IB1_BASE], 0x0
+  cwrite $0b, [$05 + @IB1_BASE+0x1], 0x0
+  cwrite $04, [$05 + @IB1_DWORDS], 0x0

-Register Access
-===============
+.. _afuc-read:

-The special registers ``$addr`` and ``$data`` can be used to write GPU
-registers, for example, to write::
+Reading Memory and Registers
+============================
+
+The CP accesses memory directly with no caching. This means that except for
+very small amounts of data accessed rarely, ``load`` and ``store`` are very
+slow. Instead, ME/PFP and later SQE read memory through various queues. Reading
+registers also uses a queue, likely because burst reading several registers at
+once is faster than reading them one-by-one and reading does not complete
+immediately. Queueing up a read involves writing an (address, length) pair to a
+control register, and data is read from the queue using one of three special
+registers:
+
+- ``$data`` reads the next PM4 packet word. This comes from the RB, IB1, IB2,
+  or SDS (Set Draw State) queue, controlled by ``@IB_LEVEL``. It also
+  decrements ``$rem`` if it isn't already decremented by a rep prefix.
+- ``$memdata`` reads the next word from a memory read buffer (MRB) set up by
+  writing ``@MEM_READ_ADDR``/``@MEM_READ_DWORDS``. It's used by things like
+  ``CP_MEMCPY`` and reading indirect draw parameters in ``CP_DRAW_INDIRECT``.
+- ``$regdata`` reads from a register read buffer (RRB) set up by
+  ``@REG_READ_ADDR``/``@REG_READ_DWORDS``.
+
+RB, IB1, IB2, SDS, and MRB make up the Read-Only Queue or ROQ, in addition to
+the Visibility Stream Decoder (VSD) which is set up via a similar control
+register pair but is read by a fixed-function parser that the CP accesses via a
+few control registers.
+
+.. _afuc-reg-writes:
+
+Writing Registers
+=================
+
+The same special registers, when used as a destination, can be used to
+write GPU registers on ME. Because they have a totally different function when
+used as a destination, they use different names:
+
+- ``$addr`` sets the address and disables ``CP_PROTECT`` address checking.
+- ``$usraddr`` sets the address and checks it against the ``CP_PROTECT`` access
+  table. It's used for addresses specified by the PM4 packet stream instead of
+  internally.
+- ``$data`` writes the register and auto-increments the address.
+
+For example, to write::

  mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
  mov $data, $03                 ; CP_SCRATCH_REG[0x2]
@@ -201,54 +327,88 @@ registers, for example, to write::
  ...

 subsequent writes to ``$data`` will increment the address of the register
-to write, so a sequence of consecutive registers can be written
+to write, so a sequence of consecutive registers can be written. On a5xx ME,
+this will directly write the register; on a6xx SQE this will instead determine
+which cluster(s) the register belongs to and push the write onto the
+appropriate per-cluster queue(s), letting the SQE run ahead of the GPU.

-To read::
+When bit 18 of ``$addr`` is set, the auto-incrementing is disabled. This is
+often used with :ref:`NRT_DATA <afuc-mem-writes>`.
+
+On a5xx ME, ``$regdata`` can also be used to directly read a register::

  mov $addr, CP_SCRATCH_REG[0x2]
-  mov $03, $addr
-  mov $04, $addr
+  mov $03, $regdata
+  mov $04, $regdata
+
+This does not exist on a6xx because register reads are not synchronized against
+writes any more.

 Many registers that are updated frequently have two banks, so they can be
-updated without stalling for previous draw to finish.  These banks are
+updated without stalling for previous draw to finish.  On a5xx, these banks are
 arranged so bit 11 is zero for bank 0 and 1 for bank 1.  The ME fw (at
-least the version I'm looking at) stores this in ``$17``, so to update
-these registers from ME::
+least the version I'm looking at) stores this in ``$17``, so to update these
+registers from ME::

  or $addr, $17, VFD_INDEX_OFFSET
  mov $data, $03
  ...

-Note that PFP doesn't seem to use this approach, instead it does something
-like::
+On a6xx this is handled transparently to the SQE, and the bank to use is stored
+separately in the cluster queue.
+
+Registers can also be written directly, skipping the queue, by writing
+``@REG_WRITE_ADDR``/``@REG_WRITE``. This is used on a6xx for certain frontend
+registers that have their own queues and on a5xx is used by the PFP::

  mov $0c, CP_SCRATCH_REG[0x7]
  mov $02, 0x789a   ; value
-  cwrite $0c, [$00 + 0x010], 0x8
-  cwrite $02, [$00 + 0x011], 0x8
+  cwrite $0c, [$00 + @REG_WRITE_ADDR], 0x8
+  cwrite $02, [$00 + @REG_WRITE], 0x8

 Like with the ``$addr``/``$data`` approach, the destination register address
-increments on each write.
+increments on each write to ``@REG_WRITE``.

-.. _afuc-mem:
+.. _afuc-pipe-regs:

-Memory Access
-=============
+Pipe Registers
+--------------

-There are no load/store instructions, as such.  The microcontrollers
-have only indirect memory access via GPU registers.  There are two
-mechanism possible.
+This is yet another private register space, triggered by writing to the high 8
+bits of ``$addr`` and then writing ``$data`` normally. Some pipe registers like
+``WAIT_MEM_WRITES`` or ``WAIT_GPU_IDLE`` have no data and a write is triggered
+immediately when ``$addr`` is written, for example in ``CP_WAIT_MEM_WRITES``::

-Read/Write via CP_NRT Registers
-------------------------------
+  mov $addr, 0x0084 << 24 ; |WAIT_MEM_WRITES

-This seems to be only used by ME.  If PFP were also using it, they would
-race with each other.  It seems to be primarily used for small reads.
+The pipe register is decoded here by the disassembler in a comment.
+
+The main differences between pipe registers and control registers are:
+
+- They are always write-only.
+- On a6xx they are pipelined together with normal register writes; on a5xx they
+  are written from ME like normal registers.
+- Writing them can take an arbitrary amount of time, so they can be used to
+  wait for some condition without spinning.
+
+In short, they behave more like normal registers but are not expected to be
+read/written by anything other than CP. Over time more and more GPU registers
+not touched by the kernel driver have been converted to pipe registers.
+
+.. _afuc-mem-writes:
+
+Writing Memory
+==============
+
+Writing memory is done by writing GPU registers:

 - ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``
+- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``.

-The address register increments with successive reads or writes.
+The address register increments with successive writes.
+
+On a5xx, this seems to be only used by ME.  If PFP were also using it, they would
+race with each other.  It can also be used for reads, primarily small reads.

 Memory Write example::

@@ -269,36 +429,179 @@ Memory Read example::
  mov $04, $addr
  mov $05, $addr

+On a6xx ``CP_ME_NRT_ADDR`` and ``CP_ME_NRT_DATA`` have been replaced by
+:ref:`pipe registers <afuc-pipe-regs>`. They can only be used for writes, but
+otherwise work similarly.

-Read via Control Instructions
-----------------------------
+Load and Store Instructions
+===========================

-This is used by PFP whenever it needs to read memory.  Also seems to be
-used by ME for streaming reads (larger amounts of data).  The DMA access
-seems to be done by ROQ.
+a6xx adds ``load`` and ``store`` instructions that work similarly to ``cread``
+and ``cwrite``. Because the address is 64 bits but registers are 32 bits, the
+high 32 bits come from the ``@LOAD_STORE_HI``
+:ref:`control register <afuc-control>`. They are mostly used by the context
+switch routine and even then very sparingly, before the memory read/write queue
+state is saved or while it is being restored.

-  TODO might also be possible for write access
+Modifiers
+=========

-  TODO some of the control commands might be synchronizing access
-  between PFP and ME??
+There are two modifiers that enable more compact and efficient implementations
+of common patterns:

-An example from ``CP_DRAW_INDIRECT`` packet handler::
+.. _afuc-rep:

-  mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
-  ; load address of indirect buffer from cmdstream:
-  cwrite $data, [$00 + 0x0b8], 0x8
-  cwrite $data, [$00 + 0x0b9], 0x8
-  ; set # of dwords to read:
-  cwrite $07, [$00 + 0x0ba], 0x8
-  ...
-  ; read parameters from draw-indirect buffer:
-  mov $09, $addr
-  mov $07, $addr
-  cread $12, [$00 + 0x040], 0x8
-  ; the start parameter gets written into MEQ, which ME writes
-  ; to VFD_INDEX_OFFSET register:
-  mov $data, $addr
+Repeat
+------

+``(rep)`` repeats the same instruction ``$rem`` times. More precisely, it
+decrements ``$rem`` after the instruction executes if it wasn't already
+decremented by a read from ``$data`` and re-executes the instruction until
+``$rem`` is 0.  It can be used with ALU instructions and control instructions.
+Usually it is used in conjunction with ``$data`` to read the rest of the packet
+in one instruction, but it can also be used freestanding, for example this
+snippet clears the control register scratch space::
+
+  mov $rem, 0x0080 ; clear 0x80 registers
+  mov $03, 0x00ff ; start at 0xff + 1 = 0x100
+  (rep)cwrite $00, [$03 + 0x001], 0x4
+
+Note the use of pre-increment mode, so that the first execution clears
+``0x100`` and updates ``$03`` to ``0x100``, the second execution clears
+``0x101`` and updates ``$03`` to ``0x101``, and so on.
+
+.. _afuc-xmov:
+
+eXtra Moves
+-----------
+
+``(xmovN)`` is an optimization which lets the firmware read multiple words from
+a queue in the same cycle. Conceptually, it adds "extra" mov instructions to be
+executed after a given ALU instruction, although in practice they are likely
+executed in parallel. ``(xmov1)`` adds up to 1 move, ``(xmov2)`` adds up to 2,
+and ``(xmov3)`` adds up to 3. The actual number of moves added is the minimum
+of the number in the instruction and ``$rem``, so a ``(xmov3)`` instruction
+behaves like a ``(xmov1)`` instruction if ``$rem = 1``. Given an instruction::
+
+  (xmovN) alu $dst, $src1, $src2
+
+or a 1-source instruction::
+
+  (xmovN) alu $dst, $src2
+
+then we compute the number of extra moves ``M = min(N, $rem)``. If ``M = 1``,
+then we add::
+
+  mov $data, $src2
+
+If ``M = 2``, then we add::
+
+  mov $data, $src2
+  mov $data, $src2
+
+Finally, as a special case explained below, if ``M = 3`` then we add::
+
+  mov $data, $src2
+  mov $dst, $src2 ; !!!
+  mov $data, $src2
+
+If ``$dst`` is not one of the "special" registers ``$data``, ``$addr``,
+``$usraddr``, then ``$data`` is replaced by ``$00`` in all destinations, i.e.
+the results of the subsequent moves are discarded.
+
+The purpose of the ``M = 3`` special case is mostly to efficiently implement
+``CP_CONTEXT_REG_BUNCH``. This is the entire implementation of
+``CP_CONTEXT_REG_BUNCH``, which is essentially just one instruction::
+
+  CP_CONTEXT_REG_BUNCH:
+  (rep)(xmov3)mov $usraddr, $data
+  waitin
+  mov $01, $data
+
+If there are 4 or more words remaining in the packet, that is if there are at
+least two more registers to write, then (ignoring the ``(rep)`` for a moment)
+the instruction expands to::
+
+  mov $usraddr, $data
+  mov $data, $data
+  mov $usraddr, $data
+  mov $data, $data
+
+This is likely all executed in a single cycle, allowing us to write 2 registers
+per cycle.
+
+``(xmov1)`` can also be added to ``(rep)mov $data, $data``, which is a common
+pattern to write the rest of the packet to successive registers, to write up to
+2 registers per cycle as well. The firmware does not use ``(xmov3)`` for this
+pattern, however, so 2 registers per cycle is likely a hardware limitation.
+
+Although ``(xmovN)`` is often used in combination with ``(rep)``, it doesn't
+have to be. For example, ``(xmov1)mov $data, $data`` moves the next 2 packet
+words to 2 successive registers.
+
+Packet Table
+============
+
+The core of the microprocessor's job is to parse each packet header and jump to
+its handler. This is done through a ``waitin`` instruction which waits for the
+packet header to become available and then parses the header and jumps to the
+handler using a jump table. However it does *not* actually consume the header.
+Like any branch instruction, it has a delay slot, and by convention this delay
+slot always contains a ``mov $01, $data`` instruction. This consumes the same
+header that ``waitin`` parsed and puts it in ``$01`` so that the packet header
+is available in ``$01`` in the next packet. Thus all packet handlers end with
+this sequence::
+
+  waitin
+  mov $01, $data
+
+The jump table itself is initialized by the SQE in the bootstrap routine at the
+beginning of the firmware. Amongst other tasks, it reads the offset of the jump
+table from the NOP payload at the beginning, then uses a jump table embedded at
+the end of the firmware to set it up by writing the ``@PACKET_TABLE_WRITE``
+control register.  After everything is set up, it does the ``waitin`` sequence
+to start handling the first packet (which should be ``CP_ME_INIT``).
+
+Example Packet
+==============
+
+Let's examine an implementation of ``CP_MEM_WRITE``::
+
+  CP_MEM_WRITE:
+  mov $addr, 0x00a0 << 24 ; |NRT_ADDR
+
+First, we set up the register to write to, which is the ``NRT_ADDR``
+:ref:`pipe register <afuc-pipe-regs>`. It turns out that the low 2 bits of
+``NRT_ADDR`` are a flag which when 1 disables auto-incrementing ``NRT_ADDR``
+when ``NRT_DATA`` is written, but we don't want this behavior so we have to
+make sure they are clear::
+
+  or $02, $data, 0x0003 ; reading $data reads the next PM4 word
+  xor $data, $02, 0x0003 ; writing $data writes the register, which is NRT_ADDR
+
+Writing ``$data`` auto-increments ``$addr``, so now the next write is to
+``0xa1`` or ``NRT_ADDR+1`` (``NRT_ADDR`` is a 64-bit register)::
+
+  mov $data, $data
+
+Now, we have to write ``NRT_DATA``. We want to repeatedly write the same
+register, without having to fight the auto-increment by resetting ``$addr``
+each time, which is where the bit 18 that disables auto-increment comes in
+handy::
+
+  mov $addr, 0xa204 << 16 ; |NRT_DATA
+
+Finally, we have to repeatedly copy the remaining PM4 packet data to the
+``NRT_DATA`` register, which we can do in one instruction with
+:ref:`(rep) <afuc-rep>`. Furthermore we can use :ref:`(xmov1) <afuc-xmov>` to
+squeeze out some more performance::
+
+  (rep)(xmov1)mov $data, $data
+
+At the end is the standard go-to-next-packet sequence::
+
+  waitin
+  mov $01, $data

 A6XX NOTES
 ==========
--- a/src/freedreno/registers/adreno/adreno_control_regs.xml
+++ b/src/freedreno/registers/adreno/adreno_control_regs.xml
@@ -9,6 +9,9 @@ xsi:schemaLocation="http://nouveau.freedesktop.org/ rules-ng.xsd">
 -->

 <domain name="A5XX_CONTROL_REG" width="32">
+	<reg32 name="REG_WRITE_ADDR" offset="0x010"/>
+	<reg32 name="REG_WRITE" offset="0x011"/>
+
 	<reg64 name="IB1_BASE" offset="0x0b0"/>
 	<reg32 name="IB1_DWORDS" offset="0x0b2"/>
 	<reg64 name="IB2_BASE" offset="0x0b4"/>