Backend only recognizes the MUL a*0.5 pattern if there is an immediate
as one of the sources, however by then the source would we already
coverted to RC_FILE_NONE with constant half swizzles. So teach
peephole_omod to recognize this pattern as well.
RV530:
total instructions in shared programs: 128860 -> 128750 (-0.09%)
instructions in affected programs: 11942 -> 11832 (-0.92%)
helped: 106
HURT: 17
total presub in shared programs: 8739 -> 8736 (-0.03%)
presub in affected programs: 32 -> 29 (-9.38%)
helped: 3
HURT: 0
total omod in shared programs: 427 -> 1212 (183.84%)
omod in affected programs: 38 -> 823 (2065.79%)
helped: 0
HURT: 160
total temps in shared programs: 17544 -> 17554 (0.06%)
temps in affected programs: 70 -> 80 (14.29%)
helped: 0
HURT: 10
total lits in shared programs: 3153 -> 3159 (0.19%)
lits in affected programs: 9 -> 15 (66.67%)
helped: 0
HURT: 6
total cycles in shared programs: 191334 -> 191253 (-0.04%)
cycles in affected programs: 21240 -> 21159 (-0.38%)
helped: 101
HURT: 27
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28784>
Consider the following case:
0: MUL temp[1].y, input[0]._x__, input[1]._y__;
1: MOV temp[1].x, input[0].x___;
2: MOV temp[1].z, const[0].__x_;
3: MUL temp[2].xyz, const[1].xxx_, temp[1].yxz_;
...
We correctly recognize that we can convert mul into omod for all three
instructions, however the mul swizzle was not handled correctly:
0: MUL temp[2].y / 2, input[0]._x__, input[1]._y__;
1: MOV temp[2].x / 2, input[0].x___;
2: MOV temp[2].z / 2, const[0].__x_;
...
Just create the conversion swizzle from the initial mul swizzle when rewriting
the original instruction writemasks.
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28784>
We add a cycles penalty when we see a begin tex and than subtract from
it based on when first alu comes that needs the results. However if the
only instruction in the TEX block is just KIL, we don't have to add any
penalty as nothing waits for it.
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28784>
This is faster for 8 samples because it forms a VMEM clause, unlike
the default shader.
It also uses 16-bit types in the shader when possible and averages fewer
components if the format has less than 4.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
The target is 8-16B per lane regardless of the format and number of
samples. This is needed to fully utilize the memory bandwidth instead
of only a small fraction of it. These are optimal numbers identified by
benchmarking.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
This merges the separate MSAA, downsampling, upsampling, and non-MSAA blocks.
It's not meant to change behavior, but some change are necessary:
- disallow 16 samples
- loads only load the number of components that we need
- optimizations barriers are placed optimally and include the sample index
in the same vector as the coordinates, so that LLVM is forced to form VMEM
clauses for loads and stores
- the shader queries the descriptor for the dst image manually and passes
it to the image store instead of the image variable (this is needed to get
latency hiding for scalar loads in the presence of optimization barriers)
This is a prerequisite for blitting multiple pixels per lane.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
to generate less shader code if only one of the axes needs clamping.
Use util_is_box_out_of_bounds instead of doing it manually.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
It's faster and handles more stuff.
This is mostly the same code as the old version, but it calls
si_compute_blit at the end.
A later commit will remove the old version, so that there is no code
duplication.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
The compute blit shader is faster and handles more stuff.
This removes the old clear_render_target shader.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
The compute blit is faster and handles more stuff than
the clear_render_target shader. We can just pass a clear value to it
to replace the source image.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
It makes supporting compute queues on all chips more complicated.
Other uses of si_can_use_compute_blit will be removed, so the function
will be removed too.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
If a blit starts on a coordinate that is not at the beginning of a tile
(e.g. 8x8), launch extra threads before 0,0,0 to make all following blocks
start at the beginning of such tiles. This makes such blits faster.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
it's not useful to have precisely the same behavior as u_blitter,
and D16 image stores are not supported by gfx6-7.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>
It had no effect on correctness and it was very inefficient because all
formats without alpha have SWIZZLE_1 in the last channel.
util_format_get_last_component is the same, but ignores PIPE_SWIZZLE_1.
It improves MSAA compute blit performance.
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28917>