mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2026-05-18 02:58:06 +02:00
286 lines
16 KiB
ReStructuredText
286 lines
16 KiB
ReStructuredText
Fragment Density Map
|
|
====================
|
|
|
|
``VK_EXT_fragment_density_map`` is an extension which is intended to allow
|
|
users to render parts of the screen at a lower resolution. It is designed to be
|
|
implemented on tiled rendering GPU architectures such as Adreno, and the
|
|
intention is that it is implemented by rendering some of the tiles at a lower
|
|
resolution and scaling them up when resolving to system memory or when sampling
|
|
the resulting image. This inherently means that it is "all or nothing," that
|
|
is, it must be enabled or disabled for the entire render pass. While the idea is
|
|
simple, the implementation in turnip is very subtle with lots of
|
|
interactions with various different features. This page attempts to document
|
|
the main principles behind the implementation.
|
|
|
|
Coordinate Space Soup
|
|
---------------------
|
|
|
|
In order to render a tile at lower resolution, we have to override the user's
|
|
viewport and scissor for each tile depending on the scaling factor provided by
|
|
the user. This becomes complicated fast, so let's start by defining a few
|
|
coordinate spaces that we'll have to work with.
|
|
|
|
Framebuffer space
|
|
^^^^^^^^^^^^^^^^^
|
|
|
|
This is the space of the final rendered image. From the user's perspective
|
|
everything is specified in this space, and fragments created by the rasterizer
|
|
appear to be larger than 1 pixel. But this is not what actually happens in the
|
|
hardware, it is a fiction created by the driver. The other spaces below are
|
|
what the hardware actually "sees".
|
|
|
|
GMEM Space
|
|
^^^^^^^^^^
|
|
|
|
This space exists whenever tiled rendering/GMEM is used, even without FDM. It
|
|
is the space used to access GMEM, with the origin at the upper left of the
|
|
tile. The hardware automatically transforms rendering space into GMEM space
|
|
whenever GMEM is accessed using the various ``*_WINDOW_OFFSET`` registers. The
|
|
origin of this space in rendering space, or the value of ``*_WINDOW_OFFSET``,
|
|
will be called :math:`b_{cs}`, the common bin start, for reasons that are
|
|
explained below. When using FDM, coordinates in this space must be multiplied
|
|
by the scaling factor :math:`s` derived from the fragment density map, or
|
|
equivalently divided by the fragment area (as defined by the Vulkan
|
|
specification), with the origin still at the upper left of the tile. For
|
|
example, if :math:`s_x = 1/2`, then the bin is half as wide as it would've been
|
|
without FDM and all coordinates in this space must be divided by 2.
|
|
|
|
Rendering space
|
|
^^^^^^^^^^^^^^^
|
|
|
|
This is the space in which the hardware rasterizer operates and produces
|
|
fragments. Normally this is the same as framebuffer space, but with FDM it is
|
|
not. We transform the viewport and scissor from framebuffer space to
|
|
rendering space by patching them per-tile in the driver and then when we
|
|
resolve the tile we scale the resulting tile back to the correct resolution by
|
|
blitting from the rendering space source to the framebuffer space destination.
|
|
|
|
In order to come up with the correct transform from framebuffer space to
|
|
rendering space, it has to shrink the coordinates by :math:`s` while
|
|
mapping the original bin start in framebuffer space :math:`b_s` to
|
|
:math:`b_{cs}`. Since :math:`b_{cs}` is entirely defined by the driver when
|
|
programming ``*_WINDOW_OFFSET``, one tempting way to do this is to just
|
|
multiply by :math:`s` and define :math:`b_{cs} = b_s * s`. It turns out,
|
|
however, that this doesn't work. A key requirement is to handle cases where the
|
|
same scene is rendered in multiple different views at the same time using
|
|
``VK_KHR_multiview``, as in VR use-cases, and in this case we want :math:`s` to
|
|
vary per view, but :math:`b_{cs}` is always the same for every view because
|
|
there is only one ``*_WINDOW_OFFSET`` register for all layers (hence the name).
|
|
|
|
We follow the blob by leaving :math:`b_{cs}` the same regardless of whether FDM
|
|
is enabled or not. This means that normally :math:`b_s = b_{cs}`, although this
|
|
is not the case if ``VK_EXT_fragment_density_map_offset`` is in use and the
|
|
bins are shifted per-view. Since the coordinates need to be scaled by :math:`s`,
|
|
we know that the transform needs to look like :math:`x' = s * x + o`, where
|
|
only the offset :math:`o` is free. Plugging in the constraint that :math:`b_s`
|
|
maps to :math:`b_{cs}`, we get that :math:`b_{cs} = s * b_s + o` or
|
|
:math:`o = b_{cs} - s * b_s`. This is the function computed by
|
|
``tu_fdm_per_bin_offset`` and used to calculate the transform for the viewport,
|
|
scissor, and ``gl_FragCoord``. One critical thing is that the offset must be an
|
|
integer, or in other words the framebuffer space bin start :math:`b_s` must be
|
|
a multiple of :math:`1 / s`. This is a natural constraint anyway, because if
|
|
it wasn't the case then the bin would start in the middle of a fragment which
|
|
isn't possible to handle correctly.
|
|
|
|
Subsampled Space
|
|
^^^^^^^^^^^^^^^^
|
|
|
|
When using subsampled images, this is the space where the bin is stored in the
|
|
underlying subsampled image. When sampling from a subsampled image, the driver
|
|
inserts shader code to transform from framebuffer space to subsampled space
|
|
using metadata written when rendering to the image.
|
|
|
|
Accesses towards the edge of a bin may partially bleed into its neighboring bin
|
|
with linear or bicubic sampling. If its neighbor has a different scale or isn't
|
|
adjacent in subsampled space, we will sample the incorrect data or empty space
|
|
and return a corrupted result. In order to handle this, we need to insert an
|
|
"apron" around problematic edges and corners. This is done by blitting from the
|
|
nearest neighbor of each bin after the renderpass.
|
|
|
|
Subsampled space is normally scaled down similar to rendering space, which is
|
|
the point of subsampled images in the first place, but the origin of the bin
|
|
is up to the driver. The driver chooses the origin of each bin when rendering a
|
|
given render pass and then encodes it in the metadata used when sampling the
|
|
image. Bins that require an apron must be far enough away from each other that
|
|
their aprons don't intersect, and all of the bins must be contained within the
|
|
underlying image.
|
|
|
|
Even when subsampled images are in use, not all bins may be subsampled. For
|
|
example, there may not be enough space to insert aprons around every bin. When
|
|
this is the case, subsampled space is not scaled like rendering space, that is
|
|
we expand the bin when resolving similar to non-subsampled images, however the
|
|
origin of the bin may still differ from framebuffer space origin.
|
|
|
|
The algorithm used by turnip used to calculate the bin layout in subsampled
|
|
space is to start with a "default" layout of the bins and then recursively
|
|
solve conflicts caused by bins whose aprons are too close together. The first
|
|
strategy used is to shift one of the bins over by a certain amount. The second
|
|
fallback strategy is to un-subsample both neighboring bins, making them
|
|
expanded so that they touch each other and there is no apron.
|
|
|
|
One natural choice for the "default" layout is to just use rendering space.
|
|
That is, start each bin at :math:`b_cs` by default. That mostly works, except
|
|
for two problems. The first is easier to solve, and has to do with the border
|
|
when sampling: it is allowed to use border colors with subsampled images, and
|
|
when that happens and the framebuffer covers the entire image, it is expected
|
|
that sampling around the edge correctly blends the border color and the edge
|
|
pixel. In order for that to happen, bins that touch or intersect the edge of
|
|
the framebuffer in framebuffer space have to be shifted over so that their edge
|
|
touches the framebuffer edge in subsampled space too.
|
|
|
|
Doing this also allows an optimization: because we are storing the tile's
|
|
contents one to one from GMEM to system memory instead of scaling it up, we can
|
|
use the dedicated resolve engine instead of GRAS to resolve the tile to system
|
|
memory. Normally GRAS has to be used with non-subsampled images to scale up the
|
|
bin when resolving. However this doesn't work for tiles around the right and
|
|
bottom edge where we have to shift over the tile to align to the edge. This
|
|
also gets a bit tricky when the tile is shifted to avoid apron conflicts, because
|
|
normally the resolve engine would write the tile directly without shifting.
|
|
However there is a trick we can use to avoid falling back to GRAS: by
|
|
overriding ``RB_RESOLVE_WINDOW_OFFSET``, we can effectively apply an offset by
|
|
telling the resolve engine that the tile was rendered somewhere else. This
|
|
means that the shift amount has to be aligned to the alignment of
|
|
``RB_RESOLVE_WINDOW_OFFSET``, which is ``tile_align_*`` in the device info.
|
|
|
|
The other problem with making subsampled space equal rendering space is that
|
|
with an FDM offset, rendering space can be arbitrarily larger than framebuffer
|
|
space, and we may overflow the attachments by up to the size of a tile. The API
|
|
is designed to allow the driver to allocate extra slop space in the image in
|
|
this case, because there are image create flags for subsampled and FDM offset,
|
|
however the maximum tile size is far too large and images would take up
|
|
far too much memory if we allocated enough slop space for the largest
|
|
possible tile. An alternative is to use a hybrid of framebuffer space and
|
|
rendering space: shift over the tiles by :math:`b_o` so that their origin
|
|
is :math:`b_s` instead of :math:`b_cs`, but leave them scaled down. This
|
|
requires no slop space whatsoever, because the bins are shifted inside the
|
|
original image, but we can no longer use the resolve engine as the tile offsets
|
|
are no longer aligned to ``tile_align_*``. So in the driver we combine both
|
|
approaches: we calculate an aligned offset :math:`b_o'` which is :math:`b_o`
|
|
aligned down to ``tile_align_*`` and shift over the tiles by subtracting
|
|
:math:`b_o'` instead of :math:`b_o`. This requires slop space, but only
|
|
:math:`b_o - b_o'` slop space is required, which must be less than
|
|
``tile_align_*``. As usual the first row/column are not shifted over in x/y
|
|
respectively.
|
|
|
|
Here is an example of what a subsampled image looks like in memory, in this
|
|
case without any FDM offset:
|
|
|
|
.. figure:: subsampled_annotated.jpg
|
|
:alt: Example of a subsampled image
|
|
|
|
Note how some of the bins are shifted over to make space for the apron. After
|
|
applying the coordinate transform when sampling, this is the final image:
|
|
|
|
.. figure:: subsampled_final.jpg
|
|
:alt: Example of a subsampled image after coordinate transform
|
|
|
|
When ``VK_EXT_custom_resolve`` and subsampled images are used together, the
|
|
custom resolve subpass writes directly to the subsampled image. This means that
|
|
it needs to use subsampled space instead of rendering space, which in practice
|
|
means replacing :math:`b_{cs}` with the origin of the bin in the subsampled
|
|
image.
|
|
|
|
Viewport and Scissor Patching
|
|
-----------------------------
|
|
|
|
In order to have :math:`s` differ per view, we have to be able to override the
|
|
viewport per view. That is, we need to transform the viewport for each view
|
|
differently. If there is only one viewport, then we duplicate the user's
|
|
viewport for each view and transform it using the :math:`b_s` and :math:`s` for
|
|
that view, and we set a "per-view viewport" bit to select the viewport per view
|
|
instead of using the default viewport 0. When
|
|
``VK_VALVE_fragment_density_map_layered`` is in use, we instead have to insert
|
|
shader code to achieve the same thing.
|
|
|
|
If the user specifies multiple viewports but they are per-view because
|
|
``VK_QCOM_multiview_per_view_viewport`` is enabled, then we can just set the
|
|
per-view viewport bit and transform each user viewport individually by the
|
|
corresponding scale. But if the user explicitly writes ``gl_ViewportIndex``,
|
|
then there is nothing we can do and we have to make :math:`s` the same for all
|
|
views by conservatively taking the minimum. Then we apply :math:`s` to all of
|
|
the user-specified viewports.
|
|
|
|
Because the bin size is now per-view, the usual mechanism of
|
|
``*_WINDOW_SCISSOR`` for clipping fragments outside the bin doesn't work.
|
|
Instead the driver needs to intersect the transformed user-specified scissor
|
|
with the transformed rendering-space bin coordinates, effectively replacing
|
|
``*_WINDOW_SCISSOR``.
|
|
|
|
Fragment density map offset
|
|
---------------------------
|
|
|
|
In order to "properly" implement ``VK_EXT_fragment_density_map_offset``, we
|
|
need to add an extra row/column of bins at the end and then shift the binning
|
|
grid up and to the left by an offset :math:`b_o`. This offset is based on the
|
|
user's offset but has the opposite sign, i.e. when shifting the FDM to the left
|
|
we have to shift the binning grid to the right, and once the user's offset
|
|
becomes large enough then we "wrap around" and shift over the scaling factor
|
|
:math:`s` to the next bin. This has to happen per-view. In turnip the function
|
|
that computes :math:`b_o` is called ``tu_bin_offset``. Each tile then gets an
|
|
offseted start :math:`b_s = b_{cs} - b_o` except for the first row/column which
|
|
only shrink in height/width respectively.
|
|
|
|
If we cannot make :math:`s` per-view, then we also cannot make :math:`b_s`
|
|
per-view and so we cannot shift the bins over. Therefore we fall back to only
|
|
shifting where :math:`s` is sampled from, which produces jittery and jarring
|
|
transitions when a bin suddenly changes resolution.
|
|
|
|
Bin merging
|
|
-----------
|
|
|
|
FDM shrinks the size of the bin in GMEM, which results in a lot of wasteful
|
|
unused extra space in GMEM. a7xx mitigates this by introducing "bin merging".
|
|
If two tiles next to each other have the same scaling for each view, then we
|
|
combine them into one tile, as long as the combined size in rendering space
|
|
isn't larger than the original size of an unscaled bin in framebuffer space. We
|
|
can even merge larger groups of tiles. The only hardware feature needed for
|
|
this to work is the ability to merge the visibility streams for the tiles,
|
|
which was added on a7xx by a new bitmask in ``CP_SET_BIN_DATA5`` and variants.
|
|
Only bins within the same visibility stream/VSC pipe can be merged.
|
|
|
|
Hardware scaling registers and LRZ
|
|
----------------------------------
|
|
|
|
One disadvantage of FDM on a6xx is that low-resolution tiles cannot use
|
|
LRZ, because the LRZ hardware is not aware of the transform between framebuffer
|
|
space and rendering space and applies the framebuffer-space LRZ values to the
|
|
rendering-space fragments. In order to fix this, a740 adds new offset and scale
|
|
registers. The offset :math:`o'` is applied to fragment coordinates during
|
|
rasterization *after* LRZ, so that viewport, scissor, and LRZ are in a
|
|
new "LRZ space" while the other operations (resolves and unresolves, and
|
|
attachment writes) still happen in the rendering space which is now offset.
|
|
:math:`o'` is specified for each layer. The scale :math:`s` is the same as
|
|
before, and it is used to multiply the fragment area covered by each LRZ pixel.
|
|
|
|
Without ``VK_EXT_fragment_density_map_offset``, we can simply make LRZ space
|
|
equal to framebuffer space scaled down by :math:`s`. That is, we can set
|
|
:math:`o'` to what :math:`o` was before and then set :math:`o` to 0, only
|
|
scaling down the viewport but not shifting it and letting the hardware handle
|
|
the shift. Then LRZ pixels will be scaled up appropriately and everything will
|
|
work. However, this doesn't work if there is a bin offset :math:`b_o`. In order
|
|
to make binning work, we shift the viewport and scissor by :math:`b_o` when
|
|
binning. Unfortunately the offset registers do not have any effect when
|
|
binning, so rendering space and LRZ space have to be the same when binning, and
|
|
the visibility stream is generated from rendering space. This means that LRZ
|
|
space also has to be shifted over compared to framebuffer space, and the LRZ
|
|
buffer must be overallocated when FDM offset might be used with it (which is
|
|
signalled by ``VK_IMAGE_CREATE_FRAGMENT_DENSITY_MAP_OFFSET_BIT_EXT``) because
|
|
the LRZ image will be shifted by :math:`b_o`.
|
|
|
|
In order for LRZ to work, LRZ space when rendering must be equal to LRZ space
|
|
when binning scaled down by :math:`s`. The origin of LRZ space when binning is
|
|
:math:`-b_o`, and this must be mapped to 0. The transform from
|
|
framebuffer space to LRZ space is :math:`x' = x * s + o`, and the transform
|
|
from framebuffer space to rendering space is :math:`x'' = x * s + o + o'`.
|
|
We get that :math:`o + o' = b_{cs} - b_s * s`, similar to before, and
|
|
:math:`0 = -b_o * s + o` so that :math:`o = b_o * s` and finally
|
|
:math:`o' = b_{cs} - b_s * s - b_o * s`, or after rearranging
|
|
:math:`o' = b_{cs} - (b_s + b_o) * s`. For all tiles except those in the first
|
|
row or column, this simplifies to :math:`o' = b_{cs} - b_{cs} * s` because
|
|
:math:`b_{cs} = b_s + b_o`. For tiles in the first row or column, :math:`b_s`
|
|
and :math:`b_{cs}` are both 0 in one of the coordinates, so it becomes
|
|
:math:`o' = -b_o * s` in that coordinate. This isn't representable in hardware,
|
|
both because it is negative (which can be worked around by artifically
|
|
shifting :math:`b_{cs}`) but more importantly because it may not meet the
|
|
alignment requirements for the hardware register (which is currently 8 pixels).
|
|
We have to just disable LRZ in this case.
|