mesa/src/amd/vulkan/radv_rt_common.h
Bas Nieuwenhuizen b2972cf410 radv: Add scratch stack to reduce LDS stack in RT traversal.
The current stack size is a significant limiter for occupancy, and
hence we need smaller stacks in LDS.

Rhys earlier had a patch that just put the N entries closest to the
root in LDS and the rest in scratch. However, this is not ideal for
performance as most of the activity is happening away from the root,
near the leaves. Of course we can't just switch it around, as the
leaf activity likely isn't happening all the way at the end of the
stack.

So what we do is make the LDS stack kinda a ringbuffer by always
accessing it using the stack index modulo the buffer size (always
a power of two so we can efficiently mask). If we then do not have
free space in this buffer we evict the entries closest to the root
to scratch and if we hit the "bottom" of the LDS space we load from
scratch.
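
The scheme above can be sketched in plain C as follows. This is a minimal illustration, not the actual NIR-emitting code in radv: the names `rt_stack`, `rt_push`, and `rt_pop` are hypothetical, the arrays merely stand in for LDS and scratch memory, and it evicts one entry at a time, whereas the real implementation may move entries in batches.

```c
#include <assert.h>
#include <stdint.h>

#define LDS_ENTRIES   16u                 /* assumed; must be a power of two */
#define TOTAL_ENTRIES 76u                 /* assumed maximum logical depth */
#define LDS_MASK      (LDS_ENTRIES - 1u)

/* Hypothetical model of the traversal stack. */
struct rt_stack {
   uint32_t lds[LDS_ENTRIES];       /* stands in for the LDS ring buffer */
   uint32_t scratch[TOTAL_ENTRIES]; /* stands in for scratch memory */
   uint32_t top;                    /* number of entries on the logical stack */
   uint32_t low;                    /* lowest logical index still held in lds[] */
   uint32_t stores, loads;          /* scratch traffic counters */
};

static void rt_push(struct rt_stack *s, uint32_t node)
{
   assert(s->top < TOTAL_ENTRIES);
   if (s->top - s->low == LDS_ENTRIES) {
      /* Ring is full: evict the entry closest to the root to scratch. */
      s->scratch[s->low] = s->lds[s->low & LDS_MASK];
      s->low++;
      s->stores++;
   }
   /* Always address the ring with the stack index masked to its size. */
   s->lds[s->top & LDS_MASK] = node;
   s->top++;
}

static uint32_t rt_pop(struct rt_stack *s)
{
   assert(s->top > 0);
   if (s->top == s->low) {
      /* Hit the "bottom" of the ring: reload one entry from scratch. */
      s->low--;
      s->lds[s->low & LDS_MASK] = s->scratch[s->low];
      s->loads++;
   }
   s->top--;
   return s->lds[s->top & LDS_MASK];
}
```

In this sketch, pushing 20 entries onto an empty stack causes 4 scratch stores, and popping all 20 afterwards returns them in LIFO order with 4 scratch loads. As long as the stack only moves up and down within 16 entries of its top, no scratch traffic occurs.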

Some rough perf numbers for indication with Q2RTX:

| evicting | LDS entries | perf |
|----------|-------------|------|
|       no |          76 |  55% |
|       no |          32 | 100% |
|       no |          24 | 105% |
|      yes |          32 |  95% |
|      yes |          16 | 100% |
|      yes |           8 |  90% |
|      yes |           4 |  75% |

(For the case with 4 entries we need to do some extra accounting as
 a full batch may not be available to evict)

So an obvious choice is to use a stack of 16 entries.

One might wonder if Q2RTX perf is mainly good due to BVHs with very
little geometry and hence low depth, so I also did some profiling
with Control. This is done with RGP instruction timing, so the numbers
count instructions executed, not weighted by the enabled lane mask,
i.e. divergence effects are included.

| game    | LDS entries | scratch action | fraction of iterations |
|---------|-------------|----------------|------------------------|
| Control |           8 |          store |                  10.3% |
| Control |           8 |          load  |                  34.8% |
| Control |          16 |          store |                  0.58% |
| Control |          16 |          load  |                  2.62% |
| Q2RTX   |          16 |          store |                  1.00% |
| Q2RTX   |          16 |          load  |                  3.07% |

So Q2RTX doesn't seem like an unreasonably good case for this
algorithm.

On the implementation side, we can always place the scratch stack at
address 0 by just reserving the scratch space and, in the case of a
fixed callstack size, moving the callstack up past it. In the dynamic
case the dynamic stack base already takes any reserved scratch space
into account.

Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18541>
2022-09-20 01:39:20 +00:00


/*
* Copyright © 2021 Google
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#ifndef RADV_RT_COMMON_H
#define RADV_RT_COMMON_H
#include "nir/nir.h"
#include "nir/nir_builder.h"
#include "nir/nir_vulkan.h"
#include "compiler/spirv/spirv.h"
#include "radv_private.h"
void nir_sort_hit_pair(nir_builder *b, nir_variable *var_distances, nir_variable *var_indices,
                       uint32_t chan_1, uint32_t chan_2);
nir_ssa_def *intersect_ray_amd_software_box(struct radv_device *device, nir_builder *b,
                                            nir_ssa_def *bvh_node, nir_ssa_def *ray_tmax,
                                            nir_ssa_def *origin, nir_ssa_def *dir,
                                            nir_ssa_def *inv_dir);
nir_ssa_def *intersect_ray_amd_software_tri(struct radv_device *device, nir_builder *b,
                                            nir_ssa_def *bvh_node, nir_ssa_def *ray_tmax,
                                            nir_ssa_def *origin, nir_ssa_def *dir,
                                            nir_ssa_def *inv_dir);
nir_ssa_def *build_addr_to_node(nir_builder *b, nir_ssa_def *addr);
nir_ssa_def *build_node_to_addr(struct radv_device *device, nir_builder *b, nir_ssa_def *node);
nir_ssa_def *nir_build_vec3_mat_mult(nir_builder *b, nir_ssa_def *vec, nir_ssa_def *matrix[],
                                     bool translation);
nir_ssa_def *nir_build_vec3_mat_mult_pre(nir_builder *b, nir_ssa_def *vec, nir_ssa_def *matrix[]);
void nir_build_wto_matrix_load(nir_builder *b, nir_ssa_def *instance_addr, nir_ssa_def **out);
nir_ssa_def *hit_is_opaque(nir_builder *b, nir_ssa_def *sbt_offset_and_flags, nir_ssa_def *flags,
                           nir_ssa_def *geometry_id_and_flags);
nir_ssa_def *create_bvh_descriptor(nir_builder *b);
/*
* A top-level AS can contain 2^24 children and a bottom-level AS can contain 2^24
* triangles. At a branching factor of 4, that means we may need up to 24 levels of box
* nodes (12 per AS, since 4^12 = 2^24) + 1 triangle node + 1 instance node. Furthermore,
* when processing a box node, worst case we actually push all 4 children and remove one,
* so the DFS stack depth is box nodes * 3 + 2.
*/
#define MAX_STACK_ENTRY_COUNT 76
#define MAX_STACK_LDS_ENTRY_COUNT 16
#define MAX_STACK_SCRATCH_ENTRY_COUNT (MAX_STACK_ENTRY_COUNT - MAX_STACK_LDS_ENTRY_COUNT)
#endif