mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2026-05-09 04:38:03 +02:00
docs/nvk: Add some developer hardware docs
Reviewed-by: Emma Anholt <emma@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37828>
This commit is contained in:
parent
0afd4bc831
commit
b92521a019
1 changed files with 126 additions and 0 deletions
126
docs/drivers/nvk/hardware_docs.rst
Normal file
126
docs/drivers/nvk/hardware_docs.rst
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
|
||||
Hardware docs
|
||||
=============
|
||||
|
||||
Command buffers
|
||||
---------------
|
||||
|
||||
Command format
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
Each command sent to the GPU contains a method and some data. Method names are
|
||||
documented in corresponding header files copied from NVIDIA, eg. the fermi
|
||||
graphics methods are in src/nouveau/headers/nvidia/classes/cl9097.h
|
||||
|
||||
P_IMMD
|
||||
""""""
|
||||
|
||||
A lot of the time, you will want to issue a single method with its data, which
|
||||
can be done with P_IMMD::
|
||||
|
||||
P_IMMD(p, NV9097, WAIT_FOR_IDLE, 0);
|
||||
|
||||
P_IMMD will emit either a single `immediate-data method`_, which takes a single
|
||||
word, or a pair of words that's equivalent to P_MTHD + the provided data. Code
|
||||
must count P_IMMD as possibly 2 words as a result.
|
||||
|
||||
.. _immediate-data method: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1214
|
||||
|
||||
P_MTHD
|
||||
""""""
|
||||
|
||||
`P_MTHD`_ is a convenient way to execute multiple consecutive methods without
|
||||
repeating the method header. For example, the code::
|
||||
|
||||
P_MTHD(p, NV9097, SET_REPORT_SEMAPHORE_A);
|
||||
P_NV9097_SET_REPORT_SEMAPHORE_A(p, addr >> 32);
|
||||
P_NV9097_SET_REPORT_SEMAPHORE_B(p, addr);
|
||||
P_NV9097_SET_REPORT_SEMAPHORE_C(p, value);
|
||||
|
||||
generates four words - one word for the method header (defaulting to 1INC) and
|
||||
then the next three words for data. 1INC will automatically increment the method
|
||||
id by one word for each data value, which is why the example can advance from
|
||||
SET_REPORT_SEMAPHORE_A to B to C.
|
||||
|
||||
.. _P_MTHD: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1042
|
||||
|
||||
P_0INC
|
||||
""""""
|
||||
|
||||
`0INC`_ will issue the same method repeatedly for each following data word.
|
||||
|
||||
.. _0INC: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1096
|
||||
|
||||
P_1INC
|
||||
""""""
|
||||
|
||||
`1INC`_ will increment after one word and then issue the following method
|
||||
repeatedly. For example, the code::
|
||||
|
||||
P_1INC(p, NV9097, CALL_MME_MACRO(NVK_MME_SET_PRIV_REG));
|
||||
P_INLINE_DATA(p, 0);
|
||||
P_INLINE_DATA(p, BITFIELD_BIT(3));
|
||||
|
||||
issues one NV9097_CALL_MME_MACRO command, then increments the method and issues
|
||||
two NV9097_CALL_MME_DATA commands.
|
||||
|
||||
.. _1INC: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1149
|
||||
|
||||
Execution barriers
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Commands within a command buffer can be synchronized in a few different ways.
|
||||
|
||||
* Explicit WFI - Idles all engines before executing the next command eg. via
|
||||
`NVA16F_WFI` or `NV9097_WAIT_FOR_IDLE`
|
||||
* Semaphores - Delay execution based on values in a memory location. See
|
||||
`open-gpu-doc on semaphores`_
|
||||
* A subchannel switch - Causes the hardware to execute an implied WFI
|
||||
|
||||
.. _open-gpu-doc on semaphores: https://github.com/NVIDIA/open-gpu-doc/blob/master/manuals/turing/tu104/dev_pbdma.ref.txt#L3231
|
||||
|
||||
Subchannel switches
|
||||
"""""""""""""""""""
|
||||
|
||||
A subchannel switch occurs when the hardware receives a command for a different
|
||||
subchannel than the one that it's currently executing. For example, if the
|
||||
hardware is currently executing commands on the 3D engine (`SUBC_NV9097 == 0`), a
|
||||
command executed on the compute engine (`SUBC_NV90C0 == 1`) will cause a
|
||||
subchannel switch. Host methods (class \*6F) are an exception to this - they can
|
||||
be issued to any subchannel and will not trigger a subchannel switch
|
||||
[#fhostsubcswitch]_ [#foverlapnvk]_.
|
||||
|
||||
Subchannel switches act the same way that an explicit WFI does - they fully idle
|
||||
the channel before issuing commands to the next engine [#fnsight]_
|
||||
[#foverlapnvk]_.
|
||||
|
||||
This works the same on Blackwell. Some NVIDIA documentation contradicts this:
|
||||
"On NVIDIA Blackwell Architecture GPUs and newer, subchannel switches do not
|
||||
occur between 3D and compute workloads"[#fnsight]_. This documentation appears
|
||||
to be wrong or inapplicable for some reason - tests do not reproduce this
|
||||
behavior [#foverlapnvk]_ [#foverlapprop]_, and the blob does not change its
|
||||
event implementation for blackwell [#feventprop]_.
|
||||
|
||||
|
||||
.. [#fnsight] https://docs.nvidia.com/nsight-graphics/UserGuide/index.html#subchannel-switch-overlay
|
||||
.. [#foverlapnvk] Based on reverse engineering by running
|
||||
https://gitlab.freedesktop.org/mhenning/re/-/tree/main/vk_test_overlap_exec
|
||||
under nvk, possibly with alterations to the way nvk generates commands for
|
||||
pipeline barriers.
|
||||
.. [#foverlapprop] By running
|
||||
https://gitlab.freedesktop.org/mhenning/re/-/tree/main/vk_test_overlap_exec
|
||||
under the proprietary driver and examining the main output (eg. execution
|
||||
overlaps without pipeline barriers)
|
||||
.. [#feventprop] For example, the event commands generated here only wait on a
|
||||
single engine
|
||||
https://gitlab.freedesktop.org/mhenning/re/-/blob/6ce8c860da65fbf2ab26d124d25f907dea2cf33a/vk_event/blackwell.out#L20612-20793
|
||||
.. [#fhostsubcswitch] https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1009
|
||||
|
||||
Copy engine
|
||||
^^^^^^^^^^^
|
||||
|
||||
The copy engine's `PIPELINED` mode allows a new transfer to start before the
|
||||
previous transfer finishes, while `NON_PIPELINED` acts as an execution barrier
|
||||
between the current copy and the previous one [#fpipelined]_.
|
||||
|
||||
.. [#fpipelined] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/src/common/sdk/nvidia/inc/ctrl/ctrl0050.h#L75-L83
|
||||
Loading…
Add table
Reference in a new issue