Hardware docs
=============

Command buffers
---------------

Command format
^^^^^^^^^^^^^^

Each command sent to the GPU contains a method and some data. Method names are
documented in corresponding header files copied from NVIDIA, eg. the fermi
graphics methods are in src/nouveau/headers/nvidia/classes/cl9097.h

P_IMMD
""""""

A lot of the time, you will want to issue a single method with its data, which
can be done with P_IMMD::

    P_IMMD(p, NV9097, WAIT_FOR_IDLE, 0);

P_IMMD will emit either a single `immediate-data method`_, which takes a single
word, or a pair of words that's equivalent to P_MTHD + the provided data. Code
must count P_IMMD as possibly 2 words as a result.

 .. _immediate-data method: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1214

P_MTHD
""""""

`P_MTHD`_ is a convenient way to execute multiple consecutive methods without
repeating the method header. For example, the code::

      P_MTHD(p, NV9097, SET_REPORT_SEMAPHORE_A);
      P_NV9097_SET_REPORT_SEMAPHORE_A(p, addr >> 32);
      P_NV9097_SET_REPORT_SEMAPHORE_B(p, addr);
      P_NV9097_SET_REPORT_SEMAPHORE_C(p, value);

generates four words - one word for the method header (defaulting to 1INC) and
then the next three words for data. 1INC will automatically increment the method
id by one word for each data value, which is why the example can advance from
SET_REPORT_SEMAPHORE_A to B to C.

 .. _P_MTHD: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1042

P_0INC
""""""

`0INC`_ will issue the same method repeatedly for each following data word.

 .. _0INC: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1096

P_1INC
""""""

`1INC`_ will increment after one word and then issue the following method
repeatedly. For example, the code::

      P_1INC(p, NV9097, CALL_MME_MACRO(NVK_MME_SET_PRIV_REG));
      P_INLINE_DATA(p, 0);
      P_INLINE_DATA(p, BITFIELD_BIT(3));

issues one NV9097_CALL_MME_MACRO command, then increments the method and issues
two NV9097_CALL_MME_DATA commands.

.. _1INC: https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1149

Execution barriers
^^^^^^^^^^^^^^^^^^

Commands within a command buffer can be synchronized in a few different ways.

 * Explicit WFI - Idles all engines before executing the next command eg. via
   `NVA16F_WFI` or `NV9097_WAIT_FOR_IDLE`
 * Semaphores - Delay execution based on values in a memory location. See
   `open-gpu-doc on semaphores`_
 * A subchannel switch - Causes the hardware to execute an implied WFI

 .. _open-gpu-doc on semaphores: https://github.com/NVIDIA/open-gpu-doc/blob/master/manuals/turing/tu104/dev_pbdma.ref.txt#L3231

Subchannel switches
"""""""""""""""""""

A subchannel switch occurs when the hardware receives a command for a different
subchannel than the one that it's currently executing. For example, if the
hardware is currently executing commands on the 3D engine (`SUBC_NV9097 == 0`), a
command executed on the compute engine (`SUBC_NV90C0 == 1`) will cause a
subchannel switch. Host methods (class \*6F) are an exception to this - they can
be issued to any subchannel and will not trigger a subchannel switch
[#fhostsubcswitch]_ [#foverlapnvk]_.

Subchannel switches act the same way that an explicit WFI does - they fully idle
the channel before issuing commands to the next engine [#fnsight]_
[#foverlapnvk]_.

This works the same on Blackwell. Some NVIDIA documentation contradicts this:
"On NVIDIA Blackwell Architecture GPUs and newer, subchannel switches do not
occur between 3D and compute workloads"[#fnsight]_. This documentation appears
to be wrong or inapplicable for some reason - tests do not reproduce this
behavior [#foverlapnvk]_ [#foverlapprop]_, and the blob does not change its
event implementation for blackwell [#feventprop]_.


.. [#fnsight] https://docs.nvidia.com/nsight-graphics/UserGuide/index.html#subchannel-switch-overlay
.. [#foverlapnvk] Based on reverse engineering by running
    https://gitlab.freedesktop.org/mhenning/re/-/tree/main/vk_test_overlap_exec
    under nvk, possibly with alterations to the way nvk generates commands for
    pipeline barriers.
.. [#foverlapprop] By running
    https://gitlab.freedesktop.org/mhenning/re/-/tree/main/vk_test_overlap_exec
    under the proprietary driver and examining the main output (eg. execution
    overlaps without pipeline barriers)
.. [#feventprop] For example, the event commands generated here only wait on a
    single engine
    https://gitlab.freedesktop.org/mhenning/re/-/blob/6ce8c860da65fbf2ab26d124d25f907dea2cf33a/vk_event/blackwell.out#L20612-20793
.. [#fhostsubcswitch] https://github.com/NVIDIA/open-gpu-doc/blob/87ba53e0c385285a3aa304b864dccb975a9a0dd4/manuals/turing/tu104/dev_ram.ref.txt#L1009

Copy engine
^^^^^^^^^^^

The copy engine's `PIPELINED` mode allows a new transfer to start before the
previous transfer finishes, while `NON_PIPELINED` acts as an execution barrier
between the current copy and the previous one [#fpipelined]_.

 .. [#fpipelined] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/src/common/sdk/nvidia/inc/ctrl/ctrl0050.h#L75-L83