diff --git a/docs/drivers/panfrost/instancing.rst b/docs/drivers/panfrost/instancing.rst
index bc103a3b326..66d5e4c4f0d 100644
--- a/docs/drivers/panfrost/instancing.rst
+++ b/docs/drivers/panfrost/instancing.rst
@@ -16,11 +16,12 @@ One option would be to do:
    \text{instance id} = \text{linear id} / \text{num vertices}
 
 but this involves a costly division and modulus by an arbitrary number.
-Instead, we could pad num_vertices. We dispatch padded_num_vertices *
-num_instances threads instead of num_vertices * num_instances, which results
-in some "extra" threads with vertex_id >= num_vertices, which we have to
-discard. The more we pad num_vertices, the more "wasted" threads we
-dispatch, but the division is potentially easier.
+Instead, we could pad num_vertices. We dispatch
+:math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead
+of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results
+in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`,
+which we have to discard. The more we pad num_vertices, the more "wasted"
+threads we dispatch, but the division is potentially easier.
 
 One straightforward choice is to pad num_vertices to the next power of two,
 which means that the division and modulus are just simple bit shifts and
@@ -50,14 +51,15 @@ high bits  padded_num_vertices
 111x       :math:`2^{n+4}`
 ========== =======================
 
-For example, if num_vertices = 70 is passed to glDraw(), its binary
-representation is 1000110, so n = 3 and the high bits are 1000, and
-therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
+For example, if :math:`\text{num_vertices} = 70` is passed to glDraw(),
+its binary representation is 1000110, so :math:`n = 3` and the high bits
+are 1000, and therefore
+:math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`.
 
 The attribute unit works in terms of the original linear_id. if
-num_instances = 1, then they are the same, and everything is simple.
-However, with instancing things get more complicated. There are four
-possible modes, two of them we can group together:
+:math:`\text{num_instances} = 1`, then they are the same, and everything
+is simple. However, with instancing things get more complicated. There are
+four possible modes, two of them we can group together:
 
 1. Use the linear_id directly. Only used when there is no instancing.
 
@@ -66,12 +68,14 @@ attributes with instancing enabled by making the constant equal
 padded_num_vertices. Because the modulus is always padded_num_vertices, this
 mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
 The shift field specifies the power of two, while the extra_flags field
-specifies the odd number. If shift = n and extra_flags = m, then the modulus
-is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
-computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
-extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
-algorithm used to get padded_num_vertices in order to correctly implement
-per-vertex attributes.
+specifies the odd number. If :math:`\text{shift} = n` and
+:math:`\text{extra_flags} = m`, then the modulus is
+:math:`(2m + 1) \cdot 2^n`. As an example, if
+:math:`\text{num_vertices} = 70`, then as computed above,
+:math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set
+:math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we
+must exactly follow the hardware algorithm used to get padded_num_vertices
+in order to correctly implement per-vertex attributes.
 
 3. Divide the linear_id by a constant. In order to correctly implement
 instance divisors, we have to divide linear_id by padded_num_vertices times
@@ -94,7 +98,7 @@ The hardware further assumes the multiplier is between :math:`2^{31}` and
 to 0 by the driver -- presumably this simplifies the hardware multiplier a little.
 The hardware first multiplies linear_id by the multiplier and
 takes the high 32 bits, then applies the round-down correction if
-extra_flags = 1, then finally shifts right by the shift field.
+:math:`\text{extra_flags} = 1`, then finally shifts right by the shift field.
 
 There are some differences between ridiculousfish's algorithm and the Mali
 hardware algorithm, which means that the reference code from ridiculousfish
@@ -105,8 +109,9 @@ It also forces the multiplier to be at least :math:`2^{31}`, which means
 that the exponent is entirely fixed, so there is no trial-and-error.
 Altogether, given the divisor d, the algorithm the driver must follow is:
 
-1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
+1. Set :math:`\text{shift} = \lfloor \log_2(d) \rfloor`.
 2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
-3. If :math:`e \leq 2^{shift}`, then we need to use the round-down algorithm. Set
-   magic_divisor = m - 1 and extra_flags = 1.
-4. Otherwise, set magic_divisor = m and extra_flags = 0.
+3. If :math:`e \leq 2^{shift}`, then we need to use the round-down algorithm.
+   Set :math:`\text{magic_divisor} = m - 1` and :math:`\text{extra_flags} = 1`.
+4. Otherwise, set :math:`\text{magic_divisor} = m` and
+   :math:`\text{extra_flags} = 0`.
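
For anyone checking the arithmetic in the four driver steps at the end of the patch, here is a minimal Python sketch. It is not the driver code. The names `compute_magic_divisor` and `emulate_divide` are hypothetical. The returned `magic_divisor` is the full multiplier as written in the steps; any encoding of its top bit is not modelled. The emulation is an assumption: it models the round-down correction the way ridiculousfish's round-down variant does, by incrementing the dividend before the multiply, and it may not match the hardware datapath step for step.

```python
def compute_magic_divisor(d):
    """Follow the four documented steps for a divisor d (1 <= d < 2**32).

    Returns (magic_divisor, shift, extra_flags). magic_divisor is the full
    multiplier from the steps; no top-bit encoding is applied here.
    """
    if not 1 <= d < (1 << 32):
        raise ValueError("divisor must fit in 32 bits and be non-zero")

    # Step 1: shift = floor(log2(d))
    shift = d.bit_length() - 1

    # Step 2: m = ceil(2^(shift + 32) / d), e = 2^(shift + 32) mod d
    t = 1 << (shift + 32)
    m = -(-t // d)  # integer ceiling division
    e = t % d

    # Step 3: round-down algorithm when e <= 2^shift
    if e <= (1 << shift):
        return m - 1, shift, 1

    # Step 4: otherwise keep the rounded-up multiplier
    return m, shift, 0


def emulate_divide(linear_id, magic_divisor, shift, extra_flags):
    """ASSUMPTION: reference model only, not the confirmed hardware behaviour.

    Round-down is modelled as "increment the dividend, then multiply",
    following ridiculousfish's round-down variant.
    """
    n = linear_id + extra_flags
    # High 32 bits of the product, then the final right shift by `shift`.
    return (n * magic_divisor) >> (32 + shift)
```

For example, `compute_magic_divisor(11)` gives `(0xBA2E8BA3, 3, 0)` (the round-up case), while `compute_magic_divisor(3)` gives `(0xAAAAAAAA, 1, 1)` (the round-down case). Under the stated assumption, `emulate_divide(n, *compute_magic_divisor(d))` agrees with `n // d` for 32-bit `n`.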