mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2026-05-07 07:08:04 +02:00
afuc: Rework and significantly expand README.rst
This hasn't been updated since the a5xx days, and we've learned much more since then. I've tried to expand it from a random collection of notes into a more complete guide explaining how to read the firmware and understand the various tricks it uses to make code more compact.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24125>
parent
426708796c
commit
7220deff90
2 changed files with 382 additions and 76 deletions
|
|
@ -32,9 +32,9 @@ and purpose of the two microcontrollers remains the same.
|
|||
|
||||
For lack of a better name, this new instruction set is called
|
||||
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
|
||||
it internally.
|
||||
it internally).
|
||||
|
||||
With Adreno 6xx, the separate PF and ME are replaced with a single
|
||||
With Adreno 6xx, the separate PFP and ME are replaced with a single
|
||||
SQE microcontroller using the same instruction set as 5xx.
|
||||
|
||||
.. _afuc-overview:
|
||||
|
|
@ -42,20 +42,31 @@ SQE microcontroller using the same instruction set as 5xx.
|
|||
Instruction Set Overview
|
||||
========================
|
||||
|
||||
32bit instruction set with basic arithmatic ops that can take
|
||||
either two source registers or one src and a 16b immediate.
|
||||
The afuc instruction set is heavily inspired by MIPS, but not exactly
|
||||
compatible.
|
||||
|
||||
32 registers, although some are special purpose:
|
||||
Registers
|
||||
=========
|
||||
|
||||
- ``$00`` - always reads zero, otherwise seems to be the PC
|
||||
- ``$01`` - current PM4 packet header
|
||||
- ``$1c`` - alias ``$rem``, remaining data in packet
|
||||
- ``$1d`` - alias ``$addr``
|
||||
- ``$1f`` - alias ``$data``
|
||||
Similar to MIPS, there are 32 registers, and some are special purpose. ``$00``
|
||||
is the same as ``$zero`` on MIPS: it reads 0 and writes are discarded.
|
||||
|
||||
Branch instructions have a delay slot so the following instruction
|
||||
is always executed regardless of whether branch is taken or not.
|
||||
Registers are displayed in the current disassembly with a hexadecimal
|
||||
numbering, e.g. ``$0a`` is encoded as 10.
|
||||
|
||||
The ABI used when processing packets is that ``$01`` contains the current PM4
|
||||
header, registers from ``$02`` up to ``$11`` are temporaries and may be freely
|
||||
clobbered by the packet handler, while ``$12`` and above are used to store
|
||||
global state like the IB level and next visible draw (for draw skipping).
|
||||
|
||||
Unlike in MIPS, there is a special small hardware-managed stack and special
|
||||
instructions ``call``/``ret`` which use it. The stack only contains return
|
||||
addresses, there is no "stack frame" to spill values to. As a result, ``$sp``,
|
||||
``$fp``, and ``$ra`` don't exist as on MIPS. Instead the last 3 registers are
|
||||
used to :ref:`read <afuc-read>` from various queues and
|
||||
:ref:`write GPU registers <afuc-reg-writes>`. In addition there is a ``$rem``
|
||||
register which normally contains the number of words remaining in the packet
|
||||
but can also be used as a normal register in combination with the rep prefix.
|
||||
|
||||
.. _afuc-alu:
|
||||
|
||||
|
|
@ -79,10 +90,10 @@ The following instructions are available:
|
|||
- ``mul8`` - multiply low 8b of two src
|
||||
- ``min`` - minimum
|
||||
- ``max`` - maximum
|
||||
- ``comp`` - compare two values
|
||||
- ``cmp`` - compare two values
|
||||
|
||||
The ALU instructions can take either two src registers, or a src
|
||||
plus 16b immediate as 2nd src, ex::
|
||||
Similar to MIPS, the ALU instructions can take either two src registers, or a
|
||||
src plus 16b immediate as 2nd src, ex::
|
||||
|
||||
add $dst, $src, 0x1234 ; src2 is immed
|
||||
add $dst, $src1, $src2 ; src2 is reg
|
||||
|
|
@ -92,6 +103,14 @@ The ``not`` instruction only takes a single source::
|
|||
not $dst, $src
|
||||
not $dst, 0x1234
|
||||
|
||||
One departure from MIPS is that there is a special immediate-form ``mov``
|
||||
instruction that can shift the 16-bit immediate by a given amount::
|
||||
|
||||
mov $dst, 0x1234 << 2
|
||||
|
||||
This replaces ``lui`` on MIPS (just use a shift of 16) while also allowing the
|
||||
quick construction of small bitfields, which comes in handy in various places.
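The effect of the shifted-immediate ``mov`` can be sketched in Python. This is only a model of the arithmetic described above, not of the instruction encoding; the helper name ``mov_imm`` and the truncation to 32 bits are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def mov_imm(imm16, shift=0):
    """Value produced by `mov $dst, imm16 << shift` (hypothetical helper)."""
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    # Assumption: bits shifted past bit 31 are simply dropped.
    return (imm16 << shift) & MASK32


print(hex(mov_imm(0x1234, 2)))   # 0x48d0
print(hex(mov_imm(0x1234, 16)))  # 0x12340000, the MIPS `lui` case
```

A shift of 16 reproduces what ``lui`` does on MIPS, while other shifts place a small constant directly into a bitfield.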
|
||||
|
||||
.. _afuc-alu-cmp:
|
||||
|
||||
The ``cmp`` instruction returns:
|
||||
|
|
@ -133,6 +152,41 @@ due to the bit pattern it returns, for example::
|
|||
|
||||
will branch if ``$02`` is less than or equal to ``$03``.
|
||||
|
||||
Delay slots
|
||||
-----------
|
||||
|
||||
Branch instructions have a delay slot so the following instruction is always
|
||||
executed regardless of whether branch is taken or not. Unlike MIPS, a branch in
|
||||
the delay slot is legal as long as the original branch and the branch in its
|
||||
delay slot are never both taken. Because jump tables are awkward and slow due
|
||||
to the lack of memory caching, this is often exploited to create dense
|
||||
sequences of branches to implement switch-case constructs::
|
||||
|
||||
breq $02, 0x1, #foo
|
||||
breq $02, 0x2, #bar
|
||||
breq $02, 0x3, #baz
|
||||
...
|
||||
nop
|
||||
jump #default
|
||||
|
||||
Another common use of a branch in a delay slot is a double-jump (jump to one
|
||||
location if a condition is true, and another location if false). In MIPS this
|
||||
requires two delay slots::
|
||||
|
||||
beq $t0, 0x1, #foo
|
||||
nop ; beq delay slot
|
||||
b #bar
|
||||
nop ; b delay slot
|
||||
|
||||
In afuc this only requires a delay slot for the second branch::
|
||||
|
||||
breq $02, 0x1, #foo
|
||||
brne $02, 0x1, #bar
|
||||
nop
|
||||
|
||||
Note that for the second branch we had to use a conditional branch with the
|
||||
opposite condition instead of an unconditional branch as in the MIPS example,
|
||||
to guarantee that at most one is ever taken.
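The control flow of the afuc double-jump can be traced with a small Python model. It is a conceptual sketch only: the function name ``run_double_jump`` is hypothetical, and the model assumes nothing about timing beyond the delay-slot rule stated above.

```python
def run_double_jump(reg02):
    """Trace `breq $02, 0x1, #foo` / `brne $02, 0x1, #bar` / `nop`.

    Returns the instructions executed and the label control ends up at.
    """
    trace = ["breq"]
    first_taken = reg02 == 1

    # The delay slot of `breq` always executes, and it holds `brne`.
    trace.append("brne")
    second_taken = reg02 != 1

    # Opposite conditions guarantee the two branches are never both taken.
    assert not (first_taken and second_taken)

    if first_taken:
        return trace, "#foo"

    # Only the second branch needs a real delay slot.
    trace.append("nop")
    return trace, "#bar"


print(run_double_jump(1))
print(run_double_jump(7))
```

When the first branch is taken, the ``nop`` is never reached, which is where the saving over the MIPS sequence comes from.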
|
||||
|
||||
.. _afuc-call:
|
||||
|
||||
|
|
@ -140,28 +194,49 @@ Call/Return
|
|||
===========
|
||||
|
||||
Simple subroutines can be implemented with ``call``/``ret``. The
|
||||
jump instruction encodes a fixed offset.
|
||||
jump instruction encodes a fixed offset from the SQE instruction base.
|
||||
|
||||
TODO not sure how many levels deep function calls can be nested.
|
||||
There isn't really a stack. Definitely seems to be multiple
|
||||
levels of fxn call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
|
||||
f22.
|
||||
|
||||
.. _afuc-nop:
|
||||
|
||||
NOPs
|
||||
====
|
||||
|
||||
Afuc has a special NOP encoding where the low 24 bits are ignored by the
|
||||
processor. On a5xx the high 8 bits are ``00``, on a6xx they are ``01``
|
||||
(probably to make sure that 0 is not a legal instruction, increasing the
|
||||
chances of halting immediately when something is misconfigured). This is used
|
||||
sometimes to create a "payload" that is ignored when executed. For example, the
|
||||
first 2 instructions of the firmware typically contain the firmware ID and
|
||||
version followed by the packet handling table offset encoded as NOPs. They are
|
||||
skipped when executed but they are later read as data by the bootstrap routine.
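A minimal Python sketch of packing and unpacking such a payload follows. The helper names and the ``gen`` parameter are hypothetical; only the split into 8 high bits and 24 ignored low bits comes from the description above.

```python
# High 8 bits of the NOP encoding per generation, as described above.
NOP_HIGH_BITS = {"a5xx": 0x00, "a6xx": 0x01}


def encode_nop(payload, gen="a6xx"):
    """Build a NOP word carrying a 24-bit payload (hypothetical helper)."""
    if not 0 <= payload < (1 << 24):
        raise ValueError("payload must fit in 24 bits")
    return (NOP_HIGH_BITS[gen] << 24) | payload


def nop_payload(word, gen="a6xx"):
    """Return the 24-bit payload if `word` is a NOP for `gen`, else None."""
    if (word >> 24) & 0xFF != NOP_HIGH_BITS[gen]:
        return None
    return word & 0xFFFFFF


word = encode_nop(0x123456)
print(hex(word), hex(nop_payload(word)))
```

Note that on a5xx an all-zero word decodes as a NOP with an empty payload, whereas on a6xx it does not, which matches the stated motivation for changing the high bits.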
|
||||
|
||||
.. _afuc-control:
|
||||
|
||||
Config Instructions
|
||||
===================
|
||||
Control Registers
|
||||
=================
|
||||
|
||||
These seem to read/write config state in other parts of CP. In at
|
||||
least some cases I expect these map to CP registers (but possibly
|
||||
not directly??)
|
||||
Control registers are a special register space that can only be read/written
|
||||
directly by CP through ``cread``/``cwrite`` instructions::
|
||||
|
||||
- ``cread $dst, [$off + addr], flags``
|
||||
- ``cwrite $src, [$off + addr], flags``
|
||||
|
||||
In cases where no offset is needed, ``$00`` is frequently used as
|
||||
the offset.
|
||||
Control registers ``0x000`` to ``0x0ff`` are private registers used to control
|
||||
the CP, for example to indicate where to read from memory or (normal)
|
||||
registers. ``0x100`` to ``0x17f`` are a private scratch space used by the
|
||||
firmware however it wants, for example as an ad-hoc stack to spill registers
|
||||
when calling a function or to store the scratch used in ``CP_SCRATCH_TO_*``
|
||||
packets.
|
||||
|
||||
In cases where no offset is needed, ``$00`` is frequently used as the offset.
|
||||
|
||||
A value of 4 for ``flags`` is known to be a pre-increment mode that writes the
|
||||
final address ``$off + addr`` to ``$off``. It's not known what other values do.
|
||||
|
||||
For example, the following sequences sets::
|
||||
|
||||
|
|
@ -171,7 +246,7 @@ For example, the following sequences sets::
|
|||
mov $04, $data ; IB size in dwords
|
||||
|
||||
; sanity check # of dwords:
|
||||
breq $04, 0x0, #l23 (#69, 04a2)
|
||||
breq $04, 0x0, #l23
|
||||
|
||||
; this seems something to do with figuring out whether
|
||||
; we are going from RB->IB1 or IB1->IB2 (ie. so the
|
||||
|
|
@ -185,15 +260,66 @@ For example, the following sequences sets::
|
|||
cwrite $03, [$05 + 0x0b1], 0x8
|
||||
cwrite $04, [$05 + 0x0b2], 0x8
|
||||
|
||||
Unlike normal GPU registers, writing control registers seems to always take
|
||||
effect immediately; if writing a control register triggers some complex
|
||||
operation that the firmware needs to wait for, then it typically uses a
|
||||
spinloop with another control register to wait for it to finish.
|
||||
|
||||
Control registers are documented in ``adreno_control_regs.xml``. The
|
||||
disassembler will try to recognize an immediate address as a known control
|
||||
register and print it, for example this sequence similar to the above sequence
|
||||
but on a6xx::
|
||||
|
||||
.. _afuc-reg-access:
|
||||
and $05, $12, 0x0003
|
||||
shl $05, $05, 0x0002
|
||||
cwrite $0e, [$05 + @IB1_BASE], 0x0
|
||||
cwrite $0b, [$05 + @IB1_BASE+0x1], 0x0
|
||||
cwrite $04, [$05 + @IB1_DWORDS], 0x0
|
||||
|
||||
Register Access
|
||||
===============
|
||||
.. _afuc-read:
|
||||
|
||||
The special registers ``$addr`` and ``$data`` can be used to write GPU
|
||||
registers, for example, to write::
|
||||
Reading Memory and Registers
|
||||
============================
|
||||
|
||||
The CP accesses memory directly with no caching. This means that except for
|
||||
very small amounts of data accessed rarely, ``load`` and ``store`` are very
|
||||
slow. Instead, ME/PFP and later SQE read memory through various queues. Reading
|
||||
registers also uses a queue, likely because burst reading several registers at
|
||||
once is faster than reading them one-by-one and reading does not complete
|
||||
immediately. Queueing up a read involves writing a (address, length) pair to a
|
||||
control register, and data is read from the queue using one of three special
|
||||
registers:
|
||||
|
||||
- ``$data`` reads the next PM4 packet word. This comes from the RB, IB1, IB2,
|
||||
or SDS (Set Draw State) queue, controlled by ``@IB_LEVEL``. It also
|
||||
decrements ``$rem`` if it isn't already decremented by a rep prefix.
|
||||
- ``$memdata`` reads the next word from a memory read buffer (MRB) set up by
|
||||
writing ``@MEM_READ_ADDR``/``@MEM_READ_DWORDS``. It's used by things like
|
||||
``CP_MEMCPY`` and reading indirect draw parameters in ``CP_DRAW_INDIRECT``.
|
||||
- ``$regdata`` reads from a register read buffer (RRB) set up by
|
||||
``@REG_READ_ADDR``/``@REG_READ_DWORDS``.
|
||||
|
||||
RB, IB1, IB2, SDS, and MRB make up the Read-Only Queue or ROQ, in addition to
|
||||
the Visibility Stream Decoder (VSD) which is set up via a similar control
|
||||
register pair but is read by a fixed-function parser that the CP accesses via a
|
||||
few control registers.
|
||||
|
||||
.. _afuc-reg-writes:
|
||||
|
||||
Writing Registers
|
||||
=================
|
||||
|
||||
The same special registers, when used as a destination, can be used to
|
||||
write GPU registers on ME. Because they have a totally different function when
|
||||
used as a destination, they use different names:
|
||||
|
||||
- ``$addr`` sets the address and disables ``CP_PROTECT`` address checking.
|
||||
- ``$usraddr`` sets the address and checks it against the ``CP_PROTECT`` access
|
||||
table. It's used for addresses specified by the PM4 packet stream instead of
|
||||
internally.
|
||||
- ``$data`` writes the register and auto-increments the address.
|
||||
|
||||
For example, to write::
|
||||
|
||||
mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
|
||||
mov $data, $03 ; CP_SCRATCH_REG[0x2]
|
||||
|
|
@ -201,54 +327,88 @@ registers, for example, to write::
|
|||
...
|
||||
|
||||
subsequent writes to ``$data`` will increment the address of the register
|
||||
to write, so a sequence of consecutive registers can be written
|
||||
to write, so a sequence of consecutive registers can be written. On a5xx ME,
|
||||
this will directly write the register; on a6xx SQE this will instead determine
|
||||
which cluster(s) the register belongs to and push the write onto the
|
||||
appropriate per-cluster queue(s) letting the SQE run ahead of the GPU.
|
||||
|
||||
To read::
|
||||
When bit 18 of ``$addr`` is set, the auto-incrementing is disabled. This is
|
||||
often used with :ref:`NRT_DATA <afuc-mem-writes>`.
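The ``$addr``/``$data`` write protocol can be modelled in a few lines of Python. This is a sketch under assumptions: the class name ``RegWriter`` is hypothetical, and treating the bits below bit 18 as the register offset is a guess made only so the example is self-contained.

```python
NO_INCREMENT = 1 << 18          # bit 18 of $addr disables auto-increment
OFFSET_MASK = (1 << 18) - 1     # assumption: low 18 bits hold the offset


class RegWriter:
    """Toy model of writing GPU registers through $addr and $data."""

    def __init__(self):
        self.addr = 0
        self.writes = []  # (register offset, value) in program order

    def write_addr(self, value):
        # Models `mov $addr, value`.
        self.addr = value

    def write_data(self, value):
        # Models `mov $data, value`.
        self.writes.append((self.addr & OFFSET_MASK, value))
        if not self.addr & NO_INCREMENT:
            self.addr += 1


w = RegWriter()
w.write_addr(0x100)
w.write_data(1)
w.write_data(2)                    # lands on 0x101
w.write_addr(0x100 | NO_INCREMENT)
w.write_data(3)
w.write_data(4)                    # still lands on 0x100
print(w.writes)
```

The second half shows why the no-increment bit is convenient for a data port that must be written repeatedly.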
|
||||
|
||||
On a5xx ME, ``$regdata`` can also be used to directly read a register::
|
||||
|
||||
mov $addr, CP_SCRATCH_REG[0x2]
|
||||
mov $03, $addr
|
||||
mov $04, $addr
|
||||
mov $03, $regdata
|
||||
mov $04, $regdata
|
||||
|
||||
This does not exist on a6xx because register reads are not synchronized against
|
||||
writes any more.
|
||||
|
||||
Many registers that are updated frequently have two banks, so they can be
|
||||
updated without stalling for previous draw to finish. These banks are
|
||||
updated without stalling for the previous draw to finish. On a5xx, these banks are
|
||||
arranged so bit 11 is zero for bank 0 and 1 for bank 1. The ME fw (at
|
||||
least the version I'm looking at) stores this in ``$17``, so to update
|
||||
these registers from ME::
|
||||
least the version I'm looking at) stores this in ``$17``, so to update these
|
||||
registers from ME::
|
||||
|
||||
or $addr, $17, VFD_INDEX_OFFSET
|
||||
mov $data, $03
|
||||
...
|
||||
|
||||
Note that PFP doesn't seem to use this approach, instead it does something
|
||||
like::
|
||||
On a6xx this is handled transparently to the SQE, and the bank to use is stored
|
||||
separately in the cluster queue.
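The a5xx bank selection amounts to setting one address bit, which a short Python sketch makes explicit. The helper name ``banked_offset`` and the register offset ``0x123`` are placeholders, not real values; the firmware presumably keeps the already-shifted bank bit in ``$17`` so that a single ``or`` suffices.

```python
BANK_BIT = 11  # on a5xx, bit 11 is 0 for bank 0 and 1 for bank 1


def banked_offset(reg_offset, bank):
    """Offset of `reg_offset` within the given bank (hypothetical helper)."""
    if bank not in (0, 1):
        raise ValueError("there are only two banks")
    if reg_offset & (1 << BANK_BIT):
        raise ValueError("offset already has the bank bit set")
    # Equivalent of `or $addr, $17, REG` with $17 == bank << 11.
    return reg_offset | (bank << BANK_BIT)


print(hex(banked_offset(0x123, 0)))  # 0x123
print(hex(banked_offset(0x123, 1)))  # 0x923
```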
|
||||
|
||||
Registers can also be written directly, skipping the queue, by writing
|
||||
``@REG_WRITE_ADDR``/``@REG_WRITE``. This is used on a6xx for certain frontend
|
||||
registers that have their own queues and on a5xx is used by the PFP::
|
||||
|
||||
mov $0c, CP_SCRATCH_REG[0x7]
|
||||
mov $02, 0x789a ; value
|
||||
cwrite $0c, [$00 + 0x010], 0x8
|
||||
cwrite $02, [$00 + 0x011], 0x8
|
||||
cwrite $0c, [$00 + @REG_WRITE_ADDR], 0x8
|
||||
cwrite $02, [$00 + @REG_WRITE], 0x8
|
||||
|
||||
Like with the ``$addr``/``$data`` approach, the destination register address
|
||||
increments on each write.
|
||||
increments on each write to ``@REG_WRITE``.
|
||||
|
||||
.. _afuc-mem:
|
||||
.. _afuc-pipe-regs:
|
||||
|
||||
Memory Access
|
||||
=============
|
||||
Pipe Registers
|
||||
--------------
|
||||
|
||||
There are no load/store instructions, as such. The microcontrollers
|
||||
have only indirect memory access via GPU registers. There are two
|
||||
mechanism possible.
|
||||
This is yet another private register space, triggered by writing to the high 8
|
||||
bits of ``$addr`` and then writing ``$data`` normally. Some pipe registers like
|
||||
``WAIT_MEM_WRITES`` or ``WAIT_GPU_IDLE`` have no data and a write is triggered
|
||||
immediately when ``$addr`` is written, for example in ``CP_WAIT_MEM_WRITES``::
|
||||
|
||||
Read/Write via CP_NRT Registers
|
||||
-------------------------------
|
||||
mov $addr, 0x0084 << 24 ; |WAIT_MEM_WRITES
|
||||
|
||||
This seems to be only used by ME. If PFP were also using it, they would
|
||||
race with each other. It seems to be primarily used for small reads.
|
||||
The pipe register is decoded here by the disassembler in a comment.
|
||||
|
||||
The main differences between pipe registers and control registers are:
|
||||
|
||||
- They are always write-only.
|
||||
- On a6xx they are pipelined together with normal register writes, on a5xx they
|
||||
are written from ME like normal registers.
|
||||
- Writing them can take an arbitrary amount of time, so they can be used to
|
||||
wait for some condition without spinning.
|
||||
|
||||
In short, they behave more like normal registers but are not expected to be
|
||||
read/written by anything other than CP. Over time more and more GPU registers
|
||||
not touched by the kernel driver have been converted to pipe registers.
|
||||
|
||||
.. _afuc-mem-writes:
|
||||
|
||||
Writing Memory
|
||||
==============
|
||||
|
||||
Writing memory is done by writing GPU registers:
|
||||
|
||||
- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
|
||||
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``
|
||||
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``.
|
||||
|
||||
The address register increments with successive reads or writes.
|
||||
The address register increments with successive writes.
|
||||
|
||||
On a5xx, this seems to be only used by ME. If PFP were also using it, they would
|
||||
race with each other. It can also be used for reads, primarily small reads.
|
||||
|
||||
Memory Write example::
|
||||
|
||||
|
|
@ -269,36 +429,179 @@ Memory Read example::
|
|||
mov $04, $addr
|
||||
mov $05, $addr
|
||||
|
||||
On a6xx ``CP_ME_NRT_ADDR`` and ``CP_ME_NRT_DATA`` have been replaced by
|
||||
:ref:`pipe registers <afuc-pipe-regs>` and they can only be used for writes but
|
||||
it otherwise works similarly.
|
||||
|
||||
Read via Control Instructions
|
||||
-----------------------------
|
||||
Load and Store Instructions
|
||||
===========================
|
||||
|
||||
This is used by PFP whenever it needs to read memory. Also seems to be
|
||||
used by ME for streaming reads (larger amounts of data). The DMA access
|
||||
seems to be done by ROQ.
|
||||
a6xx adds ``load`` and ``store`` instructions that work similarly to ``cread``
|
||||
and ``cwrite``. Because the address is 64 bits but registers are 32-bit, the
|
||||
high 32 bits come from the ``@LOAD_STORE_HI``
|
||||
:ref:`control register <afuc-control>`. They are mostly used by the context
|
||||
switch routine and even then very sparingly, before the memory read/write queue
|
||||
state is saved while it is being restored.
|
||||
|
||||
TODO might also be possible for write access
|
||||
Modifiers
|
||||
=========
|
||||
|
||||
TODO some of the control commands might be synchronizing access
|
||||
between PFP and ME??
|
||||
There are two modifiers that enable more compact and efficient implementations
|
||||
of common patterns:
|
||||
|
||||
An example from ``CP_DRAW_INDIRECT`` packet handler::
|
||||
.. _afuc-rep:
|
||||
|
||||
mov $07, 0x0004 ; # of dwords to read from draw-indirect buffer
|
||||
; load address of indirect buffer from cmdstream:
|
||||
cwrite $data, [$00 + 0x0b8], 0x8
|
||||
cwrite $data, [$00 + 0x0b9], 0x8
|
||||
; set # of dwords to read:
|
||||
cwrite $07, [$00 + 0x0ba], 0x8
|
||||
...
|
||||
; read parameters from draw-indirect buffer:
|
||||
mov $09, $addr
|
||||
mov $07, $addr
|
||||
cread $12, [$00 + 0x040], 0x8
|
||||
; the start parameter gets written into MEQ, which ME writes
|
||||
; to VFD_INDEX_OFFSET register:
|
||||
mov $data, $addr
|
||||
Repeat
|
||||
------
|
||||
|
||||
``(rep)`` repeats the same instruction ``$rem`` times. More precisely, it
|
||||
decrements ``$rem`` after the instruction executes if it wasn't already
|
||||
decremented from a read from ``$data`` and re-executes the instruction until
|
||||
``$rem`` is 0. It can be used with ALU instructions and control instructions.
|
||||
Usually it is used in conjunction with ``$data`` to read the rest of the packet
|
||||
in one instruction, but it can also be used freestanding, for example this
|
||||
snippet clears the control register scratch space::
|
||||
|
||||
mov $rem, 0x0080 ; clear 0x80 registers
|
||||
mov $03, 0x00ff ; start at 0xff + 1 = 0x100
|
||||
(rep)cwrite $00, [$03 + 0x001], 0x4
|
||||
|
||||
Note the use of pre-increment mode, so that the first execution clears
|
||||
``0x100`` and updates ``$03`` to ``0x100``, the second execution clears
|
||||
``0x101`` and updates ``$03`` to ``0x101``, and so on.
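The interaction of ``(rep)`` with pre-increment mode can be checked with a small Python model of the snippet above. The function name is hypothetical, and the behaviour when ``$rem`` starts at 0 is not covered by the description, so the model simply does nothing in that case.

```python
def rep_cwrite_preinc(ctrl, rem, off, addr, value=0):
    """Model `(rep)cwrite $00, [$off + addr], 0x4`.

    `ctrl` is a dict standing in for the control register space.
    Returns the final values of $off and $rem.
    """
    while rem > 0:  # assumption: zero iterations when $rem is already 0
        off = off + addr    # pre-increment: final address is written to $off
        ctrl[off] = value   # $00 always reads as zero
        rem -= 1            # (rep) decrements $rem after each execution
    return off, rem


scratch = {reg: 0xDEAD for reg in range(0x100, 0x180)}
final_off, final_rem = rep_cwrite_preinc(scratch, rem=0x80, off=0xFF, addr=0x1)
print(hex(final_off), final_rem, all(v == 0 for v in scratch.values()))
```

Starting ``$03`` one below the first register is what makes the pre-increment land on ``0x100`` in the first iteration.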
|
||||
|
||||
.. _afuc-xmov:
|
||||
|
||||
eXtra Moves
|
||||
-----------
|
||||
|
||||
``(xmovN)`` is an optimization which lets the firmware read multiple words from
|
||||
a queue in the same cycle. Conceptually, it adds "extra" mov instructions to be
|
||||
executed after a given ALU instruction, although in practice they are likely
|
||||
executed in parallel. ``(xmov1)`` adds up to 1 move, ``(xmov2)`` adds up to 2,
|
||||
and ``(xmov3)`` adds up to 3. The actual number of moves added is the minimum
|
||||
of the number in the instruction and ``$rem``, so a ``(xmov3)`` instruction
|
||||
behaves like a ``(xmov1)`` instruction if ``$rem = 1``. Given an instruction::
|
||||
|
||||
(xmovN) alu $dst, $src1, $src2
|
||||
|
||||
or a 1-source instruction::
|
||||
|
||||
(xmovN) alu $dst, $src2
|
||||
|
||||
then we compute the number of extra moves ``M = min(N, $rem)``. If ``M = 1``,
|
||||
then we add::
|
||||
|
||||
mov $data, $src2
|
||||
|
||||
If ``M = 2``, then we add::
|
||||
|
||||
mov $data, $src2
|
||||
mov $data, $src2
|
||||
|
||||
Finally, as a special case explained below, if ``M = 3`` then we add::
|
||||
|
||||
mov $data, $src2
|
||||
mov $dst, $src2 ; !!!
|
||||
mov $data, $src2
|
||||
|
||||
If ``$dst`` is not one of the "special" registers ``$data``, ``$addr``,
|
||||
``$usraddr``, then ``$data`` is replaced by ``$00`` in all destinations, i.e.
|
||||
the results of the subsequent moves are discarded.
|
||||
|
||||
The purpose of the ``M = 3`` special case is mostly to efficiently implement
|
||||
``CP_CONTEXT_REG_BUNCH``. This is the entire implementation of
|
||||
``CP_CONTEXT_REG_BUNCH``, which is essentially just one instruction::
|
||||
|
||||
CP_CONTEXT_REG_BUNCH:
|
||||
(rep)(xmov3)mov $usraddr, $data
|
||||
waitin
|
||||
mov $01, $data
|
||||
|
||||
If there are 4 or more words remaining in the packet, that is if there are at
|
||||
least two more registers to write, then (ignoring the ``(rep)`` for a moment)
|
||||
the instruction expands to::
|
||||
|
||||
mov $usraddr, $data
|
||||
mov $data, $data
|
||||
mov $usraddr, $data
|
||||
mov $data, $data
|
||||
|
||||
This is likely all executed in a single cycle, allowing us to write 2 registers
|
||||
per cycle.
|
||||
|
||||
``(xmov1)`` can be also added to ``(rep)mov $data, $data``, which is a common
|
||||
pattern to write the rest of the packet to successive registers, to write up to
|
||||
2 registers per cycle as well. The firmware does not use ``(xmov3)``, however,
|
||||
so 2 registers per cycle is likely a hardware limitation.
|
||||
|
||||
Although ``(xmovN)`` is often used in combination with ``(rep)``, it doesn't
|
||||
have to be. For example, ``(xmov1)mov $data, $data`` moves the next 2 packet
|
||||
words to 2 successive registers.
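The expansion rules for ``(xmovN)`` can be written out as a Python function that returns the conceptual instruction sequence. It is a sketch of the rules as stated in this section: the name ``expand_xmov`` is hypothetical, ``rem`` stands for whatever value of ``$rem`` the hardware uses in the ``min``, and in the ``M = 3`` case only the ``$data`` destinations are replaced by ``$00`` for non-special ``$dst``, which is one literal reading of the text.

```python
SPECIAL_DSTS = {"$data", "$addr", "$usraddr"}


def expand_xmov(n, rem, op, dst, srcs):
    """Expand `(xmovN) op dst, srcs...` into conceptual instructions."""
    if n not in (1, 2, 3):
        raise ValueError("only (xmov1), (xmov2) and (xmov3) exist")
    m = min(n, rem)          # number of extra moves actually added
    src2 = srcs[-1]          # the extra moves always read the last source
    # Extra results are discarded unless the destination is special.
    extra = "$data" if dst in SPECIAL_DSTS else "$00"
    extra_dsts = {
        0: [],
        1: [extra],
        2: [extra, extra],
        3: [extra, dst, extra],  # the special case: the middle move hits $dst
    }[m]
    out = [f"{op} {dst}, {', '.join(srcs)}"]
    out += [f"mov {d}, {src2}" for d in extra_dsts]
    return out


for line in expand_xmov(3, 5, "mov", "$usraddr", ["$data"]):
    print(line)
```

With ``n = 3`` and enough words remaining this prints the same four-instruction sequence shown for ``CP_CONTEXT_REG_BUNCH``.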
|
||||
|
||||
Packet Table
|
||||
============
|
||||
|
||||
The core of the microprocessor's job is to parse each packet header and jump to
|
||||
its handler. This is done through a ``waitin`` instruction which waits for the
|
||||
packet header to become available and then parses the header and jumps to the
|
||||
handler using a jump table. However it does *not* actually consume the header.
|
||||
Like any branch instruction, it has a delay slot, and by convention this delay
|
||||
slot always contains a ``mov $01, $data`` instruction. This consumes the same
|
||||
header that ``waitin`` parsed and puts it in ``$01`` so that the packet header
|
||||
is available in ``$01`` in the next packet. Thus all packet handlers end with
|
||||
this sequence::
|
||||
|
||||
waitin
|
||||
mov $01, $data
|
||||
|
||||
The jump table itself is initialized by the SQE in the bootstrap routine at the
|
||||
beginning of the firmware. Amongst other tasks, it reads the offset of the jump
|
||||
table from the NOP payload at the beginning, then uses a jump table embedded at
|
||||
the end of the firmware to set it up by writing the ``@PACKET_TABLE_WRITE``
|
||||
control register. After everything is set up, it does the ``waitin`` sequence
|
||||
to start handling the first packet (which should be ``CP_ME_INIT``).
|
||||
|
||||
Example Packet
|
||||
==============
|
||||
|
||||
Let's examine an implementation of ``CP_MEM_WRITE``::
|
||||
|
||||
CP_MEM_WRITE:
|
||||
mov $addr, 0x00a0 << 24 ; |NRT_ADDR
|
||||
|
||||
First, we set up the register to write to, which is the ``NRT_ADDR``
|
||||
:ref:`pipe register <afuc-pipe-regs>`. It turns out that the low 2 bits of
|
||||
``NRT_ADDR`` are a flag which when 1 disables auto-incrementing ``NRT_ADDR``
|
||||
when ``NRT_DATA`` is written, but we don't want this behavior so we have to
|
||||
make sure they are clear::
|
||||
|
||||
or $02, $data, 0x0003 ; reading $data reads the next PM4 word
|
||||
xor $data, $02, 0x0003 ; writing $data writes the register, which is NRT_ADDR
|
||||
|
||||
Writing ``$data`` auto-increments ``$addr``, so now the next write is to
|
||||
``0xa1`` or ``NRT_ADDR+1`` (``NRT_ADDR`` is a 64-bit register)::
|
||||
|
||||
mov $data, $data
|
||||
|
||||
Now, we have to write ``NRT_DATA``. We want to repeatedly write the same
|
||||
register, without having to fight the auto-increment by resetting ``$addr``
|
||||
each time, which is where the bit 18 that disables auto-increment comes in
|
||||
handy::
|
||||
|
||||
mov $addr, 0xa204 << 16 ; |NRT_DATA
|
||||
|
||||
Finally, we have to repeatedly copy the remaining PM4 packet data to the
|
||||
``NRT_DATA`` register, which we can do in one instruction with
|
||||
:ref:`(rep) <afuc-rep>`. Furthermore we can use :ref:`(xmov1) <afuc-xmov>` to
|
||||
squeeze out some more performance::
|
||||
|
||||
(rep)(xmov1)mov $data, $data
|
||||
|
||||
At the end is the standard go-to-next-packet sequence::
|
||||
|
||||
waitin
|
||||
mov $01, $data
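The net effect of the ``CP_MEM_WRITE`` handler walked through above can be summarised as a Python model. It is a sketch, not the firmware's behaviour in detail: the function name is hypothetical, memory is a plain dict, and the assumption that ``NRT_ADDR`` advances by 4 bytes per written dword is not stated in the text.

```python
def cp_mem_write(payload):
    """Model of the handler: payload is [addr_lo, addr_hi, data...].

    Returns a dict mapping byte address to the dword written there.
    """
    words = iter(payload)
    # `or $02, $data, 0x3` then `xor $data, $02, 0x3` clears the low 2 bits,
    # so the no-auto-increment flag in NRT_ADDR is never set by accident.
    addr_lo = (next(words) | 0x3) ^ 0x3
    addr_hi = next(words)               # `mov $data, $data` writes NRT_ADDR+1
    addr = (addr_hi << 32) | addr_lo

    memory = {}
    for word in words:                  # `(rep)(xmov1)mov $data, $data`
        memory[addr] = word
        addr += 4                       # assumption: one dword per write
    return memory


result = cp_mem_write([0x1003, 0x1, 0xAA, 0xBB])
print({hex(a): hex(v) for a, v in result.items()})
```

The ``or``/``xor`` pair is equivalent to masking with ``~0x3``; it is presumably written that way because both operations fit the 16-bit immediate form.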
|
||||
|
||||
A6XX NOTES
|
||||
==========
|
||||
|
|
|
|||
|
|
@ -9,6 +9,9 @@ xsi:schemaLocation="http://nouveau.freedesktop.org/ rules-ng.xsd">
|
|||
-->
|
||||
|
||||
<domain name="A5XX_CONTROL_REG" width="32">
|
||||
<reg32 name="REG_WRITE_ADDR" offset="0x010"/>
|
||||
<reg32 name="REG_WRITE" offset="0x011"/>
|
||||
|
||||
<reg64 name="IB1_BASE" offset="0x0b0"/>
|
||||
<reg32 name="IB1_DWORDS" offset="0x0b2"/>
|
||||
<reg64 name="IB2_BASE" offset="0x0b4"/>
|
||||
|
|
|
|||