From 7220deff90c20d3495b92e7363b1360054d1ce24 Mon Sep 17 00:00:00 2001 From: Connor Abbott Date: Wed, 12 Jul 2023 20:00:07 +0200 Subject: [PATCH] afuc: Rework and significantly expand README.rst This hasn't been updated since the a5xx days, and we've learned much more since then. I've tried to expand it from a random collection of notes to a more complete guide to explaining how to read the firmware and understand the various tricks it uses to make code more compact. Part-of: --- src/freedreno/afuc/README.rst | 455 +++++++++++++++--- .../registers/adreno/adreno_control_regs.xml | 3 + 2 files changed, 382 insertions(+), 76 deletions(-) diff --git a/src/freedreno/afuc/README.rst b/src/freedreno/afuc/README.rst index e06c9397d60..c9e70e11f8a 100644 --- a/src/freedreno/afuc/README.rst +++ b/src/freedreno/afuc/README.rst @@ -32,9 +32,9 @@ and purpose of the two microcontrollers remains the same. For lack of a better name, this new instruction set is called "Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls -it internally. +it internally). -With Adreno 6xx, the separate PF and ME are replaced with a single +With Adreno 6xx, the separate PFP and ME are replaced with a single SQE microcontroller using the same instruction set as 5xx. .. _afuc-overview: @@ -42,20 +42,31 @@ SQE microcontroller using the same instruction set as 5xx. Instruction Set Overview ======================== -32bit instruction set with basic arithmatic ops that can take -either two source registers or one src and a 16b immediate. +The afuc instruction set is heavily inspired by MIPS, but not exactly +compatible. -32 registers, although some are special purpose: +Registers +========= -- ``$00`` - always reads zero, otherwise seems to be the PC -- ``$01`` - current PM4 packet header -- ``$1c`` - alias ``$rem``, remaining data in packet -- ``$1d`` - alias ``$addr`` -- ``$1f`` - alias ``$data`` +Similar to MIPS, there are 32 registers, and some are special purpose. 
``$00`` +is the same as ``$zero`` on MIPS, it reads 0 and writes are discarded. -Branch instructions have a delay slot so the following instruction -is always executed regardless of whether branch is taken or not. +Registers are displayed in the current disassembly with a hexadecimal +numbering, e.g. ``$0a`` is encoded as 10. +The ABI used when processing packets is that ``$01`` contains the current PM4 +header, registers from ``$02`` up to ``$11`` are temporaries and may be freely +clobbered by the packet handler, while ``$12`` and above are used to store +global state like the IB level and next visible draw (for draw skipping). + +Unlike in MIPS, there is a special small hardware-managed stack and special +instructions ``call``/``ret`` which use it. The stack only contains return +addresses, there is no "stack frame" to spill values to. As a result, ``$sp``, +``$fp``, and ``$ra`` don't exist as on MIPS. Instead the last 3 registers are +used to :ref:`afuc-read` from various queues and +:ref:`afuc-reg-writes`. In addition there is a ``$rem`` +register which normally contains the number of words remaining in the packet +but can also be used as a normal register in combination with the rep prefix. .. 
_afuc-alu: @@ -79,10 +90,10 @@ The following instructions are available: - ``mul8`` - multiply low 8b of two src - ``min`` - minimum - ``max`` - maximum -- ``comp`` - compare two values +- ``cmp`` - compare two values -The ALU instructions can take either two src registers, or a src -plus 16b immediate as 2nd src, ex:: +Similar to MIPS, the ALU instructions can take either two src registers, or a +src plus 16b immediate as 2nd src, ex:: add $dst, $src, 0x1234 ; src2 is immed add $dst, $src1, $src2 ; src2 is reg @@ -92,6 +103,14 @@ The ``not`` instruction only takes a single source:: not $dst, $src not $dst, 0x1234 +One departure from MIPS is that there is a special immediate-form ``mov`` +instruction that can shift the 16-bit immediate by a given amount:: + + mov $dst, 0x1234 << 2 + +This replaces ``lui`` on MIPS (just use a shift of 16) while also allowing the +quick construction of small bitfields, which comes in handy in various places. + .. _afuc-alu-cmp: The ``cmp`` instruction returns: @@ -133,6 +152,41 @@ due to the bit pattern it returns, for example:: will branch if ``$02`` is less than or equal to ``$03``. +Delay slots +----------- + +Branch instructions have a delay slot so the following instruction is always +executed regardless of whether the branch is taken or not. Unlike MIPS, a branch in +the delay slot is legal as long as the original branch and the branch in its +delay slot are never both taken. Because jump tables are awkward and slow due +to the lack of memory caching, this is often exploited to create dense +sequences of branches to implement switch-case constructs:: + + breq $02, 0x1, #foo + breq $02, 0x2, #bar + breq $02, 0x3, #baz + ... + nop + jump #default + +Another common use of a branch in a delay slot is a double-jump (jump to one +location if a condition is true, and another location if false). 
In MIPS this +requires two delay slots:: + + beq $t0, 0x1, #foo + nop ; beq delay slot + b #bar + nop ; b delay slot + +In afuc this only requires a delay slot for the second branch:: + + breq $02, 0x1, #foo + brne $02, 0x1, #bar + nop + +Note that for the second branch we had to use a conditional branch with the +opposite condition instead of an unconditional branch as in the MIPS example, +to guarantee that at most one is ever taken. .. _afuc-call: @@ -140,28 +194,49 @@ Call/Return =========== Simple subroutines can be implemented with ``call``/``ret``. The -jump instruction encodes a fixed offset. +jump instruction encodes a fixed offset from the SQE instruction base. TODO not sure how many levels deep function calls can be nested. There isn't really a stack. Definitely seems to be multiple levels of fxn call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 -> f22. +.. _afuc-nop: + +NOPs +==== + +Afuc has a special NOP encoding where the low 24 bits are ignored by the +processor. On a5xx the high 8 bits are ``00``, on a6xx they are ``01`` +(probably to make sure that 0 is not a legal instruction, increasing the +chances of halting immediately when something is misconfigured). This is used +sometimes to create a "payload" that is ignored when executed. For example, the +first 2 instructions of the firmware typically contain the firmware ID and +version followed by the packet handling table offset encoded as NOPs. They are +skipped when executed but they are later read as data by the bootstrap routine. .. _afuc-control: -Config Instructions -=================== +Control Registers +================= -These seem to read/write config state in other parts of CP. In at -least some cases I expect these map to CP registers (but possibly -not directly??) 
+Control registers are a special register space that can only be read/written +directly by CP through ``cread``/``cwrite`` instructions:: - ``cread $dst, [$off + addr], flags`` - ``cwrite $src, [$off + addr], flags`` -In cases where no offset is needed, ``$00`` is frequently used as -the offset. +Control registers ``0x000`` to ``0x0ff`` are private registers used to control +the CP, for example to indicate where to read from memory or (normal) +registers. ``0x100`` to ``0x17f`` are a private scratch space used by the +firmware however it wants, for example as an ad-hoc stack to spill registers +when calling a function or to store the scratch used in ``CP_SCRATCH_TO_*`` +packets. + +In cases where no offset is needed, ``$00`` is frequently used as the offset. + +A value of 4 for ``flags`` is known to be a pre-increment mode that writes the +final address ``$off + addr`` to ``$off``, it's not known what other values do. For example, the following sequences sets:: @@ -171,7 +246,7 @@ For example, the following sequences sets:: mov $04, $data ; IB size in dwords ; sanity check # of dwords: - breq $04, 0x0, #l23 (#69, 04a2) + breq $04, 0x0, #l23 ; this seems something to do with figuring out whether ; we are going from RB->IB1 or IB1->IB2 (ie. so the @@ -185,15 +260,66 @@ For example, the following sequences sets:: cwrite $03, [$05 + 0x0b1], 0x8 cwrite $04, [$05 + 0x0b2], 0x8 +Unlike normal GPU registers, writing control registers seems to always take +effect immediately; if writing a control register triggers some complex +operation that the firmware needs to wait for, then it typically uses a +spinloop with another control register to wait for it to finish. +Control registers are documented in ``adreno_control_regs.xml``. The +disassembler will try to recognize an immediate address as a known control +register and print it, for example this sequence similar to the above sequence +but on a6xx:: -.. 
_afuc-reg-access: + and $05, $12, 0x0003 + shl $05, $05, 0x0002 + cwrite $0e, [$05 + @IB1_BASE], 0x0 + cwrite $0b, [$05 + @IB1_BASE+0x1], 0x0 + cwrite $04, [$05 + @IB1_DWORDS], 0x0 -Register Access -=============== +.. _afuc-read: -The special registers ``$addr`` and ``$data`` can be used to write GPU -registers, for example, to write:: +Reading Memory and Registers +============================ + +The CP accesses memory directly with no caching. This means that except for +very small amounts of data accessed rarely, ``load`` and ``store`` are very +slow. Instead, ME/PFP and later SQE read memory through various queues. Reading +registers also uses a queue, likely because burst reading several registers at +once is faster than reading them one-by-one and reading does not complete +immediately. Queueing up a read involves writing an (address, length) pair to a +control register, and data is read from the queue using one of three special +registers: + +- ``$data`` reads the next PM4 packet word. This comes from the RB, IB1, IB2, + or SDS (Set Draw State) queue, controlled by ``@IB_LEVEL``. It also + decrements ``$rem`` if it isn't already decremented by a rep prefix. +- ``$memdata`` reads the next word from a memory read buffer (MRB) set up by + writing ``@MEM_READ_ADDR``/``@MEM_READ_DWORDS``. It's used by things like + ``CP_MEMCPY`` and reading indirect draw parameters in ``CP_DRAW_INDIRECT``. +- ``$regdata`` reads from a register read buffer (RRB) set up by + ``@REG_READ_ADDR``/``@REG_READ_DWORDS``. + +RB, IB1, IB2, SDS, and MRB make up the Read-Only Queue or ROQ, in addition to +the Visibility Stream Decoder (VSD) which is set up via a similar control +register pair but is read by a fixed-function parser that the CP accesses via a +few control registers. + +.. _afuc-reg-writes: + +Writing Registers +================= + +The same special registers, when used as a destination, can be used to +write GPU registers on ME. 
Because they have a totally different function when +used as a destination, they use different names: + +- ``$addr`` sets the address and disables ``CP_PROTECT`` address checking. +- ``$usraddr`` sets the address and checks it against the ``CP_PROTECT`` access + table. It's used for addresses specified by the PM4 packet stream instead of + internally. +- ``$data`` writes the register and auto-increments the address. + +for example, to write:: mov $addr, CP_SCRATCH_REG[0x2] ; set register to write mov $data, $03 ; CP_SCRATCH_REG[0x2] @@ -201,54 +327,88 @@ registers, for example, to write:: ... subsequent writes to ``$data`` will increment the address of the register -to write, so a sequence of consecutive registers can be written +to write, so a sequence of consecutive registers can be written. On a5xx ME, +this will directly write the register, on a6xx SQE this will instead determine +which cluster(s) the register belongs to and push the write onto the +appropriate per-cluster queue(s) letting the SQE run ahead of the GPU. -To read:: +When bit 18 of ``$addr`` is set, the auto-incrementing is disabled. This is +often used with :ref:`afuc-mem-writes `. + +On a5xx ME, ``$regdata`` can also be used to directly read a register:: mov $addr, CP_SCRATCH_REG[0x2] - mov $03, $addr - mov $04, $addr + mov $03, $regdata + mov $04, $regdata + +This does not exist on a6xx because register reads are not synchronized against +writes any more. Many registers that are updated frequently have two banks, so they can be -updated without stalling for previous draw to finish. These banks are +updated without stalling for previous draw to finish. On a5xx, these banks are arranged so bit 11 is zero for bank 0 and 1 for bank 1. The ME fw (at -least the version I'm looking at) stores this in ``$17``, so to update -these registers from ME:: +least the version I'm looking at) stores this in ``$17``, so to update these +registers from ME:: or $addr, $17, VFD_INDEX_OFFSET mov $data, $03 ... 
-Note that PFP doesn't seem to use this approach, instead it does something -like:: +On a6xx this is handled transparently to the SQE, and the bank to use is stored +separately in the cluster queue. + +Registers can also be written directly, skipping the queue, by writing +``@REG_WRITE_ADDR``/``@REG_WRITE``. This is used on a6xx for certain frontend +registers that have their own queues and on a5xx is used by the PFP:: mov $0c, CP_SCRATCH_REG[0x7] mov $02, 0x789a ; value - cwrite $0c, [$00 + 0x010], 0x8 - cwrite $02, [$00 + 0x011], 0x8 + cwrite $0c, [$00 + @REG_WRITE_ADDR], 0x8 + cwrite $02, [$00 + @REG_WRITE], 0x8 Like with the ``$addr``/``$data`` approach, the destination register address -increments on each write. +increments on each write to ``@REG_WRITE``. -.. _afuc-mem: +.. _afuc-pipe-regs: -Memory Access -============= +Pipe Registers +-------------- -There are no load/store instructions, as such. The microcontrollers -have only indirect memory access via GPU registers. There are two -mechanism possible. +This is yet another private register space, triggered by writing to the high 8 +bits of ``$addr`` and then writing ``$data`` normally. Some pipe registers like +``WAIT_MEM_WRITES`` or ``WAIT_GPU_IDLE`` have no data and a write is triggered +immediately when ``$addr`` is written, for example in ``CP_WAIT_MEM_WRITES``:: -Read/Write via CP_NRT Registers -------------------------------- + mov $addr, 0x0084 << 24 ; |WAIT_MEM_WRITES -This seems to be only used by ME. If PFP were also using it, they would -race with each other. It seems to be primarily used for small reads. +The pipe register is decoded here by the disassembler in a comment. + +The main differences between pipe registers and control registers are: + +- They are always write-only. +- On a6xx they are pipelined together with normal register writes, on a5xx they + are written from ME like normal registers. 
+- Writing them can take an arbitrary amount of time, so they can be used to + wait for some condition without spinning. + +In short, they behave more like normal registers but are not expected to be +read/written by anything other than CP. Over time more and more GPU registers +not touched by the kernel driver have been converted to pipe registers. + +.. _afuc-mem-writes: + +Writing Memory +============== + +Writing memory is done by writing GPU registers: - ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write -- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR`` +- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``. -The address register increments with successive reads or writes. +The address register increments with successive writes. + +On a5xx, this seems to be only used by ME. If PFP were also using it, they would +race with each other. It can also be used for reads, primarily small reads. Memory Write example:: @@ -269,36 +429,179 @@ Memory Read example:: mov $04, $addr mov $05, $addr +On a6xx ``CP_ME_NRT_ADDR`` and ``CP_ME_NRT_DATA`` have been replaced by +:ref:`afuc-pipe-regs ` and they can only be used for writes, but +they otherwise work similarly. -Read via Control Instructions ------------------------------ +Load and Store Instructions +=========================== -This is used by PFP whenever it needs to read memory. Also seems to be -used by ME for streaming reads (larger amounts of data). The DMA access -seems to be done by ROQ. +a6xx adds ``load`` and ``store`` instructions that work similarly to ``cread`` +and ``cwrite``. Because the address is 64 bits but registers are 32-bit, the +high 32 bits come from the ``@LOAD_STORE_HI`` +:ref:`afuc-control `. They are mostly used by the context +switch routine and even then very sparingly, before the memory read/write queue +state is saved and while it is being restored. 
- TODO might also be possible for write access +Modifiers +========= - TODO some of the control commands might be synchronizing access - between PFP and ME?? +There are two modifiers that enable more compact and efficient implementations +of common patterns: -An example from ``CP_DRAW_INDIRECT`` packet handler:: +.. _afuc-rep: - mov $07, 0x0004 ; # of dwords to read from draw-indirect buffer - ; load address of indirect buffer from cmdstream: - cwrite $data, [$00 + 0x0b8], 0x8 - cwrite $data, [$00 + 0x0b9], 0x8 - ; set # of dwords to read: - cwrite $07, [$00 + 0x0ba], 0x8 - ... - ; read parameters from draw-indirect buffer: - mov $09, $addr - mov $07, $addr - cread $12, [$00 + 0x040], 0x8 - ; the start parameter gets written into MEQ, which ME writes - ; to VFD_INDEX_OFFSET register: - mov $data, $addr +Repeat +------ +``(rep)`` repeats the same instruction ``$rem`` times. More precisely, it +decrements ``$rem`` after the instruction executes if it wasn't already +decremented from a read from ``$data`` and re-executes the instruction until +``$rem`` is 0. It can be used with ALU instructions and control instructions. +Usually it is used in conjunction with ``$data`` to read the rest of the packet +in one instruction, but it can also be used freestanding, for example this +snippet clears the control register scratch space:: + + mov $rem, 0x0080 ; clear 0x80 registers + mov $03, 0x00ff ; start at 0xff + 1 = 0x100 + (rep)cwrite $00, [$03 + 0x001], 0x4 + +Note the use of pre-increment mode, so that the first execution clears +``0x100`` and updates ``$03`` to ``0x100``, the second execution clears +``0x101`` and updates ``$03`` to ``0x101``, and so on. + +.. _afuc-xmov: + +eXtra Moves +----------- + +``(xmovN)`` is an optimization which lets the firmware read multiple words from +a queue in the same cycle. Conceptually, it adds "extra" mov instructions to be +executed after a given ALU instruction, although in practice they are likely +executed in parallel. 
``(xmov1)`` adds up to 1 move, ``(xmov2)`` adds up to 2, +and ``(xmov3)`` adds up to 3. The actual number of moves added is the minimum +of the number in the instruction and ``$rem``, so a ``(xmov3)`` instruction +behaves like a ``(xmov1)`` instruction if ``$rem = 1``. Given an instruction:: + + (xmovN) alu $dst, $src1, $src2 + +or a 1-source instruction:: + + (xmovN) alu $dst, $src2 + +then we compute the number of extra moves ``M = min(N, $rem)``. If ``M = 1``, +then we add:: + + mov $data, $src2 + +If ``M = 2``, then we add:: + + mov $data, $src2 + mov $data, $src2 + +Finally, as a special case explained below, if ``M = 3`` then we add:: + + mov $data, $src2 + mov $dst, $src2 ; !!! + mov $data, $src2 + +If ``$dst`` is not one of the "special" registers ``$data``, ``$addr``, +``$usraddr``, then ``$data`` is replaced by ``$00`` in all destinations, i.e. +the results of the subsequent moves are discarded. + +The purpose of the ``M = 3`` special case is mostly to efficiently implement +``CP_CONTEXT_REG_BUNCH``. This is the entire implementation of +``CP_CONTEXT_REG_BUNCH``, which is essentially just one instruction:: + + CP_CONTEXT_REG_BUNCH: + (rep)(xmov3)mov $usraddr, $data + waitin + mov $01, $data + +If there are 4 or more words remaining in the packet, that is if there are at +least two more registers to write, then (ignoring the ``(rep)`` for a moment) +the instruction expands to:: + + mov $usraddr, $data + mov $data, $data + mov $usraddr, $data + mov $data, $data + +This is likely all executed in a single cycle, allowing us to write 2 registers +per cycle. + +``(xmov1)`` can also be added to ``(rep)mov $data, $data``, which is a common +pattern to write the rest of the packet to successive registers, to write up to +2 registers per cycle as well. The firmware does not use ``(xmov3)``, however, +so 2 registers per cycle is likely a hardware limitation. + +Although ``(xmovN)`` is often used in combination with ``(rep)``, it doesn't +have to be. 
For example, ``(xmov1)mov $data, $data`` moves the next 2 packet +words to 2 successive registers. + +Packet Table +============ + +The core of the microprocessor's job is to parse each packet header and jump to +its handler. This is done through a ``waitin`` instruction which waits for the +packet header to become available and then parses the header and jumps to the +handler using a jump table. However, it does *not* actually consume the header. +Like any branch instruction, it has a delay slot, and by convention this delay +slot always contains a ``mov $01, $data`` instruction. This consumes the same +header that ``waitin`` parsed and puts it in ``$01`` so that the packet header +is available in ``$01`` in the next packet. Thus all packet handlers end with +this sequence:: + + waitin + mov $01, $data + +The jump table itself is initialized by the SQE in the bootstrap routine at the +beginning of the firmware. Amongst other tasks, it reads the offset of the jump +table from the NOP payload at the beginning, then uses a jump table embedded at +the end of the firmware to set it up by writing the ``@PACKET_TABLE_WRITE`` +control register. After everything is set up, it does the ``waitin`` sequence +to start handling the first packet (which should be ``CP_ME_INIT``). + +Example Packet +============== + +Let's examine an implementation of ``CP_MEM_WRITE``:: + + CP_MEM_WRITE: + mov $addr, 0x00a0 << 24 ; |NRT_ADDR + +First, we set up the register to write to, which is the ``NRT_ADDR`` +:ref:`afuc-pipe-regs `. 
It turns out that the low 2 bits of +``NRT_ADDR`` are a flag which when 1 disables auto-incrementing ``NRT_ADDR`` +when ``NRT_DATA`` is written, but we don't want this behavior so we have to +make sure they are clear:: + + or $02, $data, 0x0003 ; reading $data reads the next PM4 word + xor $data, $02, 0x0003 ; writing $data writes the register, which is NRT_ADDR + +Writing ``$data`` auto-increments ``$addr``, so now the next write is to +``0xa1`` or ``NRT_ADDR+1`` (``NRT_ADDR`` is a 64-bit register):: + + mov $data, $data + +Now, we have to write ``NRT_DATA``. We want to repeatedly write the same +register, without having to fight the auto-increment by resetting ``$addr`` +each time, which is where the bit 18 that disables auto-increment comes in +handy:: + + mov $addr, 0xa204 << 16 ; |NRT_DATA + +Finally, we have to repeatedly copy the remaining PM4 packet data to the +``NRT_DATA`` register, which we can do in one instruction with +:ref:`afuc-rep <(rep)>`. Furthermore we can use :ref:`afuc-xmov <(xmov1)>` to +squeeze out some more performance:: + + (rep)(xmov1)mov $data, $data + +At the end is the standard go-to-next-packet sequence:: + + waitin + mov $01, $data A6XX NOTES ========== diff --git a/src/freedreno/registers/adreno/adreno_control_regs.xml b/src/freedreno/registers/adreno/adreno_control_regs.xml index 8e14cdee895..ef428117cab 100644 --- a/src/freedreno/registers/adreno/adreno_control_regs.xml +++ b/src/freedreno/registers/adreno/adreno_control_regs.xml @@ -9,6 +9,9 @@ xsi:schemaLocation="http://nouveau.freedesktop.org/ rules-ng.xsd"> --> + + +