doc/freedreno: Add a bunch of docs of the hardware and drivers.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19733>
This commit is contained in:
parent
e284e6ad3c
commit
378f83917c
1 changed file with 277 additions and 3 deletions
Freedreno
=========

Freedreno driver specific docs.

Freedreno is a GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
OpenGL ES 3.2 and desktop OpenGL 4.5.

See the `Freedreno Wiki
<https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more
details.

Turnip
======

Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.

The current set of specific chip versions supported can be found in
:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__.
There are no plans to port to a5xx or earlier GPUs.

Hardware architecture
---------------------

Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
("gmem") and render directly to system memory ("sysmem"). It is UMA, using
mostly write-combined memory but with the ability to map some buffers as cache
coherent with the CPU.

Hardware acronyms
^^^^^^^^^^^^^^^^^

.. glossary::

   Cluster
      A group of hardware registers, often with multiple copies to allow
      pipelining. There is an M:N relationship between the hardware blocks
      that do work and the clusters of registers for the state those blocks
      use.

   CP
      Command Processor. Reads the stream of state changes and draw commands
      generated by the driver.

   PFP
      Prefetch Parser. Adreno 2xx-4xx CP component.

   ME
      Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4
      commands.

   SQE
      a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
      microcode (loaded from Linux) which actually processes the command
      stream and writes to the hardware registers. See `afuc
      <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__.

   ROQ
      DMA engine used by the SQE for reading memory, with some prefetch
      buffering. Mostly reads in the command stream, but also serves for
      ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.

   SP
      Shader Processor. Unified, scalar shader engine. One or more, depending
      on GPU and tier.

   TP
      Texture Processor.

   UCHE
      Unified L2 Cache. 32KB on A330, unclear how big now.

   CCU
      Color Cache Unit.

   VSC
      Visibility Stream Compressor.

   PVS
      Primitive Visibility Stream.

   FE
      Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
      VFD, VPC.

   VFD
      Vertex Fetch and Decode.

   VPC
      Varying/Position Cache? Hardware block that stores shaded vertex data
      for primitive assembly.

   HLSQ
      High Level Sequencer. Manages state for the SPs, batches up PS
      invocations between primitives, and is involved in preemption.

   PC_VS
      Cluster where varyings are read from VPC and assembled into primitives
      to feed GRAS.

   VS
      Vertex Shader. Responsible for generating VS/GS/tess invocations.

   GRAS
      Rasterizer. Responsible for generating PS invocations from primitives;
      also does LRZ.

   PS
      Pixel Shader.

   RB
      Render Backend. Performs both early and late Z testing, blending, and
      attachment stores of the output of the PS.

   GMEM
      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
      attachments during tiled rendering.

   LRZ
      Low Resolution Z. A low-resolution area of the depth buffer that can be
      initialized during the binning pass to contain the worst-case (farthest)
      Z values in a block, and then used to early-reject fragments during
      rasterization.
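
The LRZ reject can be sketched in Python. This is a deliberately simplified
conceptual model, not the hardware algorithm: the block size, the
full-coverage flag, and the assumption of a LESS depth function are all
illustrative, and real LRZ also has to handle depth-test direction changes and
LRZ writes.

.. code-block:: python

   FAR = 1.0  # depth clear value

   def build_lrz(blocks_w, blocks_h, occluders):
       """Binning pass: record a conservative worst-case (farthest) final
       depth per block. Only an occluder that fully covers a block can lower
       its bound.

       occluders: iterable of (bx, by, zmax, fully_covers) tuples."""
       lrz = [[FAR] * blocks_w for _ in range(blocks_h)]
       for bx, by, zmax, fully_covers in occluders:
           if fully_covers:
               lrz[by][bx] = min(lrz[by][bx], zmax)
       return lrz

   def lrz_reject(lrz, px, py, z, block=8):
       """Rasterization: with a LESS depth test, a fragment farther than the
       block's worst-case final depth cannot pass anywhere in the block."""
       return z > lrz[py // block][px // block]

In this model a fragment at z = 0.9 behind a full-block occluder at z = 0.3 is
rejected before shading, while a fragment in a partially covered block is not.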

Cache hierarchy
^^^^^^^^^^^^^^^

The a6xx GPUs have two main caches: CCU and UCHE.

UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
flushes access system memory.

The CCU is the separate cache used by 2D blits and sysmem render target access
(and also for resolves to system memory when in GMEM mode). Its memory comes
from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
reserved based on whether we're in a render pass using GMEM for attachment
storage, or we're doing sysmem rendering. Cache entries have the attachment
number and layer mixed into the cache tag in some way, likely so that a
fragment's accesses are spread through the cache even when the attachments
have the same size and alignment in address space. This means that the cache
must be flushed and invalidated when memory moves from being used for one
attachment to another (notably depth vs. color, but also MRT color).
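
One way to picture the tag mixing: identically-aligned attachments differ only
in high address bits, so a plainly address-indexed cache would keep hitting
the same sets. Mixing the attachment and layer into the index spreads them
out. A toy model (the mixing function and all parameters here are invented for
illustration, not the hardware's):

.. code-block:: python

   def ccu_set_index(addr, attachment, layer, sets=64, line=64):
       """Toy model of mixing attachment/layer into the cache set index so
       that identically-aligned attachments don't contend for the same
       sets."""
       return ((addr // line) ^ (attachment * 13) ^ (layer * 7)) % sets

With this, two attachments at the same offset within their allocations land in
different sets instead of evicting each other.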

The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) in front of UCHE. This cache is used for normal sampling
like ``sam`` and ``isam`` (and the compiler will route read-only storage image
access through it as well). It is not coherent with UCHE (you may get stale
results when you ``sam`` after ``stib``), but it must get flushed per draw or
so, because you don't need a manual invalidate between draws storing to an
image and draws sampling from a texture.

The command processor (CP) does not read from either of these caches; instead
it uses FIFOs in the ROQ to avoid stalls when reading from system memory.

Draw states
^^^^^^^^^^^

The SQE is not a fast processor, and with tiled rendering many draws won't
even be executed in a given bin. Since a5xx, state updates can therefore be
batched up into "draw states" that point to a fragment of CP packets. At draw
time, if the draw call is actually going to execute (some primitive is visible
in the current tile), the SQE goes through the ``GROUP_ID``\s and, for any
with an update since the last time they were executed, executes the
corresponding fragment.

Starting with a6xx, states can be tagged with whether they should be executed
at draw time for any of sysmem, binning, or tile rendering. This allows a
single command stream to be generated which can be executed in any of the
modes, unlike pre-a6xx where we had to generate separate command lists for the
binning and rendering phases.

Note that this means the generated draw state always has to update all of the
state you have chosen to pack into that ``GROUP_ID``, since any of your state
changes in a previous draw state command may have been skipped.
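
The skip-if-clean bookkeeping can be modeled in a few lines of Python. This is
a conceptual sketch, not the real packet encoding (``CP_SET_DRAW_STATE``
details are omitted, and the class and method names are invented):

.. code-block:: python

   class DrawStateTracker:
       """Simplified model of SQE draw-state handling: each GROUP_ID points
       at a fragment of CP packets, executed at draw time only if it was
       updated since the last draw that actually ran."""

       def __init__(self):
           self.groups = {}  # GROUP_ID -> (packets, dirty)

       def set_draw_state(self, group_id, packets):
           # A new fragment replaces the old one and marks the group dirty.
           self.groups[group_id] = (packets, True)

       def draw(self, visible):
           """Returns the packets the SQE would execute for this draw."""
           executed = []
           if not visible:
               # No primitive visible in this bin: skip the draw entirely,
               # leaving the fragments dirty for later bins.
               return executed
           for group_id in sorted(self.groups):
               packets, dirty = self.groups[group_id]
               if dirty:
                   executed.extend(packets)
                   self.groups[group_id] = (packets, False)
           return executed

This also illustrates the note above: because any earlier fragment may have
been skipped, each new fragment must update all of the state packed into its
``GROUP_ID``.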

Pipelining (a6xx+)
^^^^^^^^^^^^^^^^^^

Most CP commands write to registers. On a6xx+, the registers are located in
clusters corresponding to the stage of the pipeline they are used from (see
``enum tu_stage`` for a list). To pipeline state updates and drawing,
registers generally have two copies ("contexts") in their cluster, so previous
draws can be working on the previous set of register state while the next
draw's state is being set up. You can find which registers go into which
clusters by looking at :command:`crashdec` output in the ``regs-name:
CP_MEMPOOL`` section.

As the SQE processes register writes in the command stream, it sends them into
a per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages
to process their streams of register updates and events independently of each
other (so even with just two contexts in a stage, earlier stages can proceed
on to later draws before later stages have caught up).

Each cluster has a per-context bit indicating that the context is done/free.
Register writes will stall until the context is done.

During a 3D draw command, the SQE generates several internal events that flow
through the pipeline:

- ``CP_EVENT_START`` clears the done bit for the context when written to the
  cluster.
- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick
  off the actual event/drawing.
- ``CONTEXT_DONE`` completes after the event/draw is complete and sets the
  done flag.
- ``CP_EVENT_END`` waits for the done flag on the next context, then copies
  all the registers that were dirtied in this context to that one.

The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``, and
``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
rollover.
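
The context rollover can be modeled as a small state machine. A hedged sketch
under the description above (the ``Cluster`` class and its method names are
invented; real register writes are queued through ``CP_MEMPOOL`` and the
stalls happen in hardware):

.. code-block:: python

   class Cluster:
       """Simplified model of a register cluster with two banked contexts."""

       def __init__(self):
           self.ctx = [{}, {}]        # two copies ("contexts") of state
           self.done = [True, True]   # per-context done/free bits
           self.cur = 0               # context being set up for next draw
           self.dirty = set()

       def event_start(self):
           # CP_EVENT_START: claim the current context for the coming draw.
           self.done[self.cur] = False
           self.dirty = set()

       def write(self, reg, value):
           # Register writes land in the current context. (Real hardware
           # would stall here if this context were still busy.)
           self.ctx[self.cur][reg] = value
           self.dirty.add(reg)

       def context_done(self):
           # CONTEXT_DONE: the draw using this context has finished.
           self.done[self.cur] = True

       def event_end(self):
           # CP_EVENT_END: wait for the next context to be done, copy the
           # registers dirtied in this context over to it, and roll over.
           nxt = self.cur ^ 1
           assert self.done[nxt], "would stall until the context is free"
           for reg in self.dirty:
               self.ctx[nxt][reg] = self.ctx[self.cur][reg]
           self.cur = nxt

With two contexts, the SQE can set up draw N+1's state while draw N is still
in flight in this cluster.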

Because the clusters proceed independently of each other even across draws, if
you need to synchronize an earlier cluster to the output of a later one, then
you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
necessary caches.

Also note that some registers are not banked at all, and will require a
``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.

On a2xx-a4xx there weren't per-stage clusters; instead there were two register
banks that were flipped between per draw.

Software Architecture
---------------------

Freedreno and Turnip use a shared core for the shader compiler, image layout,
and register and command stream definitions. They implement separate state
management and command stream generation.

.. toctree::
   :glob:

   freedreno/*

GPU hang debugging
^^^^^^^^^^^^^^^^^^

A kernel message from DRM of "gpu fault" can mean any sort of error reported
by the GPU (including its internal hang detection). If a fault in GPU address
space happened, you should expect to find a message from the IOMMU with the
faulting address and the hardware unit involved:

.. code-block:: console

   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)

On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved
to ``/sys/devices/virtual/devcoredump/**/data``. You can copy that file to a
:file:`crash.devcore` to save it, otherwise the kernel will expire it
eventually. Echo 1 to the file to free the core early, as another core won't
be taken until then.

Once you have your core file, you can use :command:`crashdec -f crash.devcore`
to decode it. The output will have an ``ESTIMATED CRASH LOCATION`` where we
estimate the CP to have stopped. Note that this is expected to be some
distance past whatever state triggered the fault, given GPU pipelining, and
will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
event. You can try running the workload with ``TU_DEBUG=flushall`` or
``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.

You can also find what commands were queued up to each cluster in the
``regs-name: CP_MEMPOOL`` section.

Command Stream Capture
^^^^^^^^^^^^^^^^^^^^^^

During Mesa development, it's often useful to look at the command streams we
send to the kernel. Mesa itself doesn't implement a way to stream them out
(though it maybe should!). Instead, we have an interface for the kernel to
capture all submitted command streams:

.. code-block:: console

   cat /sys/kernel/debug/dri/0/rd > cmdstream &

By default, command stream capture does not capture texture/vertex/etc. data.
You can enable capturing all the BOs with:

.. code-block:: console

   echo Y > /sys/module/msm/parameters/rd_full

Note that, since all command streams get captured, it is easy to run the
system out of memory doing this, so you probably don't want to enable it
during play of a heavyweight game. Instead, to capture a command stream within
a game, you probably want to cause a crash in the GPU during a frame of
interest so that a single GPU core dump is generated. Emitting ``0xdeadbeef``
in the CS should be enough to cause a fault.