docs/nvk: Add some developer hardware docs

Reviewed-by: Emma Anholt <emma@anholt.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37828>
2026-05-09 04:38:03 +02:00 · 2025-10-10 19:49:59 -04:00 · 2025-10-10 19:49:59 -04:00 · b92521a019
commit b92521a019
parent 0afd4bc831
1 changed files with 126 additions and 0 deletions
--- a/docs/drivers/nvk/hardware_docs.rst
+++ b/docs/drivers/nvk/hardware_docs.rst
@ -0,0 +1,126 @@
+
+Hardware docs
+=============
+
+Command buffers
+---------------
+
+Command format
+^^^^^^^^^^^^^^
+
+Each command sent to the GPU contains a method and some data. Method names are
+documented in corresponding header files copied from NVIDIA, eg. the fermi
+graphics methods are in src/nouveau/headers/nvidia/classes/cl9097.h
+
+P_IMMD
+""""""
+
+A lot of the time, you will want to issue a single method with its data, which
+can be done with P_IMMD::
+
+    P_IMMD(p, NV9097, WAIT_FOR_IDLE, 0);
+
+P_IMMD will emit either a single `immediate-data method`_, which takes a single
+word, or a pair of words that's equivalent to P_MTHD + the provided data. Code
+must count P_IMMD as possibly 2 words as a result.
+
+ .. _immediate-data method: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1214
+
+P_MTHD
+""""""
+
+`P_MTHD`_ is a convenient way to execute multiple consecutive methods without
+repeating the method header. For example, the code::
+
+      P_MTHD(p, NV9097, SET_REPORT_SEMAPHORE_A);
+      P_NV9097_SET_REPORT_SEMAPHORE_A(p, addr >> 32);
+      P_NV9097_SET_REPORT_SEMAPHORE_B(p, addr);
+      P_NV9097_SET_REPORT_SEMAPHORE_C(p, value);
+
+generates four words - one word for the method header (defaulting to 1INC) and
+then the next three words for data. 1INC will automatically increment the method
+id by one word for each data value, which is why the example can advance from
+SET_REPORT_SEMAPHORE_A to B to C.
+
+ .. _P_MTHD: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1042
+
+P_0INC
+""""""
+
+`0INC`_ will issue the same method repeatedly for each following data word.
+
+ .. _0INC: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1096
+
+P_1INC
+""""""
+
+`1INC`_ will increment after one word and then issue the following method
+repeatedly. For example, the code::
+
+      P_1INC(p, NV9097, CALL_MME_MACRO(NVK_MME_SET_PRIV_REG));
+      P_INLINE_DATA(p, 0);
+      P_INLINE_DATA(p, BITFIELD_BIT(3));
+
+issues one NV9097_CALL_MME_MACRO command, then increments the method and issues
+two NV9097_CALL_MME_DATA commands.
+
+.. _1INC: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1149
+
+Execution barriers
+^^^^^^^^^^^^^^^^^^
+
+Commands within a command buffer can be synchronized in a few different ways.
+
+ * Explicit WFI - Idles all engines before executing the next command eg. via
+   `NVA16F_WFI` or `NV9097_WAIT_FOR_IDLE`
+ * Semaphores - Delay execution based on values in a memory location. See
+   `open-gpu-doc on semaphores`_
+ * A subchannel switch - Causes the hardware to execute an implied WFI
+
+ .. _open-gpu-doc on semaphores: https://github.com/NVIDIA/open-gpu-doc/blob/master/manuals/turing/tu104/dev_pbdma.ref.txt#L3231
+
+Subchannel switches
+"""""""""""""""""""
+
+A subchannel switch occurs when the hardware receives a command for a different
+subchannel than the one that it's currently executing. For example, if the
+hardware is currently executing commands on the 3D engine (`SUBC_NV9097 == 0`), a
+command executed on the compute engine (`SUBC_NV90C0 == 1`) will cause a
+subchannel switch. Host methods (class \*6F) are an exception to this - they can
+be issued to any subchannel and will not trigger a subchannel switch
+[#fhostsubcswitch]_ [#foverlapnvk]_.
+
+Subchannel switches act the same way that an explicit WFI does - they fully idle
+the channel before issuing commands to the next engine [#fnsight]_
+[#foverlapnvk]_.
+
+This works the same on Blackwell. Some NVIDIA documentation contradicts this:
+"On NVIDIA Blackwell Architecture GPUs and newer, subchannel switches do not
+occur between 3D and compute workloads"[#fnsight]_. This documentation appears
+to be wrong or inapplicable for some reason - tests do not reproduce this
+behavior [#foverlapnvk]_ [#foverlapprop]_, and the blob does not change its
+event implementation for blackwell [#feventprop]_.
+
+
+.. [#fnsight] https://docs.nvidia.com/nsight-graphics/UserGuide/index.html#subchannel-switch-overlay
+.. [#foverlapnvk] Based on reverse engineering by running
+    https://gitlab.freedesktop.org/mhenning/re/-/tree/main/vk_test_overlap_exec
+    under nvk, possibly with alterations to the way nvk generates commands for
+    pipeline barriers.
+.. [#foverlapprop] By running
+    https://gitlab.freedesktop.org/mhenning/re/-/tree/main/vk_test_overlap_exec
+    under the proprietary driver and examining the main output (eg. execution
+    overlaps without pipeline barriers)
+.. [#feventprop] For example, the event commands generated here only wait on a
+    single engine
+    https://gitlab.freedesktop.org/mhenning/re/-/blob/6ce8c860da65fbf2ab26d124d25f907dea2cf33a/vk_event/blackwell.out#L20612-20793
+.. [#fhostsubcswitch] https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1009
+
+Copy engine
+^^^^^^^^^^^
+
+The copy engine's `PIPELINED` mode allows a new transfer to start before the
+previous transfer finishes, while `NON_PIPELINED` acts as an execution barrier
+between the current copy and the previous one [#fpipelined]_.
+
+ .. [#fpipelined] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/src/common/sdk/nvidia/inc/ctrl/ctrl0050.h#L75-L83