mesa/src/intel/compiler/brw_lower_regioning.cpp

/*
 * Copyright © 2018 Intel Corporation
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
 * IN THE SOFTWARE.
 */

#include "brw_fs.h"
#include "brw_cfg.h"
#include "brw_fs_builder.h"

using namespace brw;

namespace {

/* From the SKL PRM Vol 2a, "Move":
 *
 *    "A mov with the same source and destination type, no source modifier,
 *     and no saturation is a raw move. A packed byte destination region (B
 *     or UB type with HorzStride == 1 and ExecSize > 1) can only be written
 *     using raw move."
 */
bool
is_byte_raw_mov(const fs_inst *inst)
{
   return brw_type_size_bytes(inst->dst.type) == 1 &&
          inst->opcode == BRW_OPCODE_MOV &&
          inst->src[0].type == inst->dst.type &&
          !inst->saturate &&
          !inst->src[0].negate &&
          !inst->src[0].abs;
}
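As a quick illustration of the PRM rule quoted above, the predicate can be exercised against a simplified stand-in for `fs_inst` (the struct and field names below are hypothetical, for illustration only; the real check inspects the actual instruction fields):

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for fs_inst carrying only the
// fields the raw-move check looks at.
struct mini_inst {
   unsigned dst_type_size; /* destination type size in bytes */
   bool is_mov;            /* opcode == BRW_OPCODE_MOV */
   bool same_types;        /* src[0].type == dst.type */
   bool saturate;
   bool negate;
   bool abs;
};

// Mirrors is_byte_raw_mov(): a packed byte destination may only be
// written by a MOV with matching types and no saturation or source
// modifiers.
bool is_byte_raw_mov_like(const mini_inst &inst)
{
   return inst.dst_type_size == 1 && inst.is_mov && inst.same_types &&
          !inst.saturate && !inst.negate && !inst.abs;
}
```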
/*
 * Return an acceptable byte stride for the specified source of an
 * instruction affected by a regioning restriction.
 */
unsigned
required_src_byte_stride(const intel_device_info *devinfo, const fs_inst *inst,
                         unsigned i)
{
   if (has_dst_aligned_region_restriction(devinfo, inst)) {
      return MAX2(brw_type_size_bytes(inst->dst.type),
                  byte_stride(inst->dst));
   } else if (has_subdword_integer_region_restriction(devinfo, inst,
                                                      &inst->src[i], 1)) {
      /* Use a stride of 32 bits if possible, since that will guarantee that
       * the copy emitted to lower this region won't be affected by the
       * sub-dword integer region restrictions.  This may not be possible
       * for the second source of an instruction if we're required to use
       * packed data due to Wa_16012383669.
       */
      return (i == 1 ? brw_type_size_bytes(inst->src[i].type) : 4);
   } else {
      return byte_stride(inst->src[i]);
   }
}
/*
 * Return an acceptable byte sub-register offset for the specified source
 * of an instruction affected by a regioning restriction.
 */
unsigned
required_src_byte_offset(const intel_device_info *devinfo, const fs_inst *inst,
                         unsigned i)
{
   if (has_dst_aligned_region_restriction(devinfo, inst)) {
      return reg_offset(inst->dst) % (reg_unit(devinfo) * REG_SIZE);
   } else if (has_subdword_integer_region_restriction(devinfo, inst,
                                                      &inst->src[i], 1)) {
      const unsigned dst_byte_stride =
         MAX2(byte_stride(inst->dst), brw_type_size_bytes(inst->dst.type));
      const unsigned src_byte_stride = required_src_byte_stride(devinfo, inst, i);
      const unsigned dst_byte_offset =
         reg_offset(inst->dst) % (reg_unit(devinfo) * REG_SIZE);
      const unsigned src_byte_offset =
         reg_offset(inst->src[i]) % (reg_unit(devinfo) * REG_SIZE);

      if (src_byte_stride > brw_type_size_bytes(inst->src[i].type)) {
         assert(src_byte_stride >= dst_byte_stride);

         /* The source is affected by the Xe2+ sub-dword integer regioning
          * restrictions.  For the case of source 0, BSpec#56640 specifies a
          * number of equations relating the source and destination
          * sub-register numbers in all cases where a source stride of
          * 32 bits is allowed.  These equations have the form:
          *
          *    k * Dst.SubReg % m = Src.SubReg / l
          *
          * for some constants k, l and m that differ for each combination
          * of source and destination types and strides.  The expression in
          * the return statement below computes a valid source offset by
          * inverting the equation like:
          *
          *    Src.SubReg = l * k * (Dst.SubReg % m)
          *
          * and then scaling by the element type sizes in order to get an
          * expression in terms of byte offsets instead of sub-register
          * numbers.  It can easily be verified that in all cases listed in
          * the hardware spec where the source has a well-defined uniform
          * stride, the product l*k is equal to the ratio between the source
          * and destination strides.
          */
         const unsigned m = 64 * dst_byte_stride / src_byte_stride;
         return dst_byte_offset % m * src_byte_stride / dst_byte_stride;
      } else {
         assert(src_byte_stride == brw_type_size_bytes(inst->src[i].type));
         /* A packed source is required, likely due to the stricter
          * requirements of the second source region.  The source being
          * packed guarantees that the region of the original instruction
          * will be valid, but the copy may break the regioning
          * restrictions.  Do our best to prevent that from happening by
          * making sure the offset of the temporary matches the original
          * source based on the same equation above.  However, that may not
          * be sufficient if the source had a stride larger than 32 bits,
          * in which case the copy may need to be lowered recursively.
          */
         return src_byte_offset * src_byte_stride / byte_stride(inst->src[i]);
      }
   } else {
      return reg_offset(inst->src[i]) % (reg_unit(devinfo) * REG_SIZE);
   }
}
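The offset arithmetic above can be sketched in isolation. This is a minimal model of the strided-source branch only (a hypothetical helper with illustrative strides, not the real pass), exercising the `m = 64 * dst_stride / src_stride` inversion:

```cpp
#include <cassert>

// Minimal model of the strided-source branch above: given destination
// and source byte strides (src >= dst) and the destination's byte
// offset within the physical register, compute a source byte offset
// satisfying k * Dst.SubReg % m = Src.SubReg / l, using the fact that
// l*k equals the ratio between the source and destination strides.
unsigned src_byte_offset_for(unsigned dst_byte_stride,
                             unsigned src_byte_stride,
                             unsigned dst_byte_offset)
{
   assert(src_byte_stride >= dst_byte_stride);
   const unsigned m = 64 * dst_byte_stride / src_byte_stride;
   return dst_byte_offset % m * src_byte_stride / dst_byte_stride;
}
```

For example, with a packed byte destination (stride 1) and a dword-strided byte source (stride 4), m is 16, so a destination offset of 10 B maps to a source offset of 40 B.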
/*
 * Return an acceptable byte stride for the destination of an instruction
 * that requires it to have some particular alignment.
 */
unsigned
required_dst_byte_stride(const fs_inst *inst)
{
   if (inst->dst.is_accumulator()) {
      /* If the destination is an accumulator, insist that we leave the
       * stride alone.  We cannot "fix" accumulator destinations by writing
       * to a temporary and emitting a MOV into the original destination.
       * For multiply instructions (our one use of the accumulator), the
       * MUL writes the full 66 bits of the accumulator whereas the MOV we
       * would emit only writes 33 bits and leaves the top 33 bits
       * undefined.
       *
       * It's safe to just require the original stride here because the
       * lowering pass will detect the mismatch in has_invalid_src_region
       * and fix the sources of the multiply instead of the destination.
       */
      return inst->dst.hstride * brw_type_size_bytes(inst->dst.type);
   } else if (brw_type_size_bytes(inst->dst.type) < get_exec_type_size(inst) &&
              !is_byte_raw_mov(inst)) {
      return get_exec_type_size(inst);
   } else {
      /* Calculate the maximum byte stride and the minimum/maximum type
       * size across all source and destination operands we are required to
       * lower.
       */
      unsigned max_stride = inst->dst.stride * brw_type_size_bytes(inst->dst.type);
      unsigned min_size = brw_type_size_bytes(inst->dst.type);
      unsigned max_size = brw_type_size_bytes(inst->dst.type);

      for (unsigned i = 0; i < inst->sources; i++) {
         if (!is_uniform(inst->src[i]) && !inst->is_control_source(i)) {
            const unsigned size = brw_type_size_bytes(inst->src[i].type);
            max_stride = MAX2(max_stride, inst->src[i].stride * size);
            min_size = MIN2(min_size, size);
            max_size = MAX2(max_size, size);
         }
      }

      /* All operands involved in lowering need to fit in the calculated
       * stride.
       */
      assert(max_size <= 4 * min_size);

      /* Attempt to use the largest byte stride among all present operands,
       * but never exceed a stride of 4 since that would lead to illegal
       * destination regions during lowering.
       */
      return MIN2(max_stride, 4 * min_size);
   }
}
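The clamp in the fallback branch can be sketched with plain integers. This is a hypothetical helper (not part of the pass) taking (element stride, type size in bytes) pairs, with the first pair standing for the destination:

```cpp
#include <cassert>
#include <algorithm>
#include <utility>
#include <vector>

// Sketch of the fallback branch of required_dst_byte_stride(): take
// the largest byte stride among the destination and the lowered
// sources, but clamp it to 4x the smallest type size so the resulting
// destination region stays legal.
unsigned pick_dst_byte_stride(const std::vector<std::pair<unsigned, unsigned>> &ops)
{
   unsigned max_stride = ops[0].first * ops[0].second;
   unsigned min_size = ops[0].second;
   unsigned max_size = ops[0].second;

   for (size_t i = 1; i < ops.size(); i++) {
      max_stride = std::max(max_stride, ops[i].first * ops[i].second);
      min_size = std::min(min_size, ops[i].second);
      max_size = std::max(max_size, ops[i].second);
   }

   /* All operands must fit in the calculated stride. */
   assert(max_size <= 4 * min_size);
   return std::min(max_stride, 4 * min_size);
}
```

For instance, a packed word destination (stride 1, 2 B) combined with a strided dword source (stride 2, 4 B) yields a byte stride of 8, which the 4 * min_size = 8 clamp just permits.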
/*
 * Return an acceptable byte sub-register offset for the destination of an
 * instruction that requires it to be aligned to the sub-register offset of
 * the sources.
 */
unsigned
required_dst_byte_offset(const intel_device_info *devinfo, const fs_inst *inst)
{
   for (unsigned i = 0; i < inst->sources; i++) {
      if (!is_uniform(inst->src[i]) && !inst->is_control_source(i))
         if (reg_offset(inst->src[i]) % (reg_unit(devinfo) * REG_SIZE) !=
             reg_offset(inst->dst) % (reg_unit(devinfo) * REG_SIZE))
            return 0;
   }

   return reg_offset(inst->dst) % (reg_unit(devinfo) * REG_SIZE);
}
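A sketch of the agreement rule above, where `REG_BYTES` is a hypothetical stand-in for `reg_unit(devinfo) * REG_SIZE` (64 B chosen here purely for illustration):

```cpp
#include <cassert>
#include <initializer_list>

// Illustrative stand-in for reg_unit(devinfo) * REG_SIZE.
constexpr unsigned REG_BYTES = 64;

// Sketch of required_dst_byte_offset(): if every lowered source agrees
// with the destination on its byte offset within the register, that
// common offset is acceptable; any disagreement forces a
// register-aligned destination (offset 0).
unsigned pick_dst_byte_offset(std::initializer_list<unsigned> src_offsets,
                              unsigned dst_offset)
{
   for (unsigned src : src_offsets) {
      if (src % REG_BYTES != dst_offset % REG_BYTES)
         return 0;
   }
   return dst_offset % REG_BYTES;
}
```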
/*
 * Return the closest legal execution type for an instruction on
 * the specified platform.
 */
brw_reg_type
required_exec_type(const intel_device_info *devinfo, const fs_inst *inst)
{
   const brw_reg_type t = get_exec_type(inst);
   const bool has_64bit = brw_type_is_float(t) ?
      devinfo->has_64bit_float : devinfo->has_64bit_int;

   switch (inst->opcode) {
   case SHADER_OPCODE_SHUFFLE:
      /* IVB has an issue (which we found empirically) where it reads
       * two address register components per channel for indirectly
       * addressed 64-bit sources.
       *
       * From the Cherryview PRM Vol 7. "Register Region Restrictions":
       *
       *    "When source or destination datatype is 64b or operation is
       *     integer DWord multiply, indirect addressing must not be
       *     used."
       *
       * Work around both of the above and handle platforms that
       * don't support 64-bit types at all.
       */
      if ((!devinfo->has_64bit_int ||
           intel_device_info_is_9lp(devinfo) ||
           devinfo->ver >= 20) && brw_type_size_bytes(t) > 4)
         return BRW_TYPE_UD;
      else if (has_dst_aligned_region_restriction(devinfo, inst))
         return brw_int_type(brw_type_size_bytes(t), false);
      else
         return t;

   case SHADER_OPCODE_SEL_EXEC:
      if ((!has_64bit || devinfo->has_64bit_float_via_math_pipe) &&
          brw_type_size_bytes(t) > 4)
         return BRW_TYPE_UD;
      else
         return t;

   case SHADER_OPCODE_QUAD_SWIZZLE:
      if (has_dst_aligned_region_restriction(devinfo, inst))
         return brw_int_type(brw_type_size_bytes(t), false);
      else
         return t;

   case SHADER_OPCODE_CLUSTER_BROADCAST:
      /* From the Cherryview PRM Vol 7. "Register Region Restrictions":
       *
       *    "When source or destination datatype is 64b or operation is
       *     integer DWord multiply, indirect addressing must not be
       *     used."
       *
       * For MTL (verx10 == 125), float64 is supported, but int64 is not.
       * Therefore we need to lower cluster broadcast using 32-bit int ops.
       *
       * For gfx12.5+ platforms that support int64, the register regions
       * used by cluster broadcast aren't supported by the 64-bit pipeline.
       *
       * Work around the above and handle platforms that don't
       * support 64-bit types at all.
       */
      if ((!has_64bit || devinfo->verx10 >= 125 ||
           intel_device_info_is_9lp(devinfo) ||
           devinfo->ver >= 20) && brw_type_size_bytes(t) > 4)
         return BRW_TYPE_UD;
      else
         return brw_int_type(brw_type_size_bytes(t), false);
default:
return t;
}
}
/*
* Return whether the instruction has an unsupported channel bit layout
* specified for the i-th source region.
*/
bool
has_invalid_src_region(const intel_device_info *devinfo, const fs_inst *inst,
unsigned i)
{
/* Wa_22016140776:
*
* Scalar broadcast on HF math (packed or unpacked) must not be used.
* Compiler must use a mov instruction to expand the scalar value to
* a vector before using in a HF (packed or unpacked) math operation.
*/
if (inst->is_math() && intel_needs_workaround(devinfo, 22016140776) &&
is_uniform(inst->src[i]) && inst->src[i].type == BRW_TYPE_HF) {
return true;
}
if (is_send(inst) || inst->is_control_source(i) ||
inst->opcode == BRW_OPCODE_DPAS) {
return false;
}
const unsigned dst_byte_offset = reg_offset(inst->dst) % (reg_unit(devinfo) * REG_SIZE);
const unsigned src_byte_offset = reg_offset(inst->src[i]) % (reg_unit(devinfo) * REG_SIZE);
return (has_dst_aligned_region_restriction(devinfo, inst) &&
!is_uniform(inst->src[i]) &&
           (byte_stride(inst->src[i]) != byte_stride(inst->dst) ||
            src_byte_offset != dst_byte_offset)) ||
(has_subdword_integer_region_restriction(devinfo, inst) &&
(byte_stride(inst->src[i]) != required_src_byte_stride(devinfo, inst, i) ||
src_byte_offset != required_src_byte_offset(devinfo, inst, i)));
}
/*
* Return whether the instruction has an unsupported channel bit layout
* specified for the destination region.
*/
bool
has_invalid_dst_region(const intel_device_info *devinfo,
const fs_inst *inst)
{
if (is_send(inst)) {
return false;
} else {
const brw_reg_type exec_type = get_exec_type(inst);
const unsigned dst_byte_offset = reg_offset(inst->dst) % (reg_unit(devinfo) * REG_SIZE);
const bool is_narrowing_conversion = !is_byte_raw_mov(inst) &&
brw_type_size_bytes(inst->dst.type) < brw_type_size_bytes(exec_type);
return (has_dst_aligned_region_restriction(devinfo, inst) &&
(required_dst_byte_stride(inst) != byte_stride(inst->dst) ||
required_dst_byte_offset(devinfo, inst) != dst_byte_offset)) ||
(is_narrowing_conversion &&
required_dst_byte_stride(inst) != byte_stride(inst->dst));
}
}
/**
* Return a non-zero value if the execution type of the instruction is
* unsupported. The destination and sources matching the returned mask
* will be bit-cast to an integer type of appropriate size, lowering any
* source or destination modifiers into separate MOV instructions.
*/
unsigned
has_invalid_exec_type(const intel_device_info *devinfo, const fs_inst *inst)
{
if (required_exec_type(devinfo, inst) != get_exec_type(inst)) {
switch (inst->opcode) {
case SHADER_OPCODE_SHUFFLE:
case SHADER_OPCODE_QUAD_SWIZZLE:
case SHADER_OPCODE_CLUSTER_BROADCAST:
case SHADER_OPCODE_BROADCAST:
case SHADER_OPCODE_MOV_INDIRECT:
return 0x1;
case SHADER_OPCODE_SEL_EXEC:
return 0x3;
default:
unreachable("Unknown invalid execution type source mask.");
}
} else {
return 0;
}
}
/**
* Return whether the instruction has an unsupported type conversion
* that must be handled by expanding the source operand.
*/
bool
has_invalid_src_conversion(const intel_device_info *devinfo,
const fs_inst *inst)
{
/* Scalar byte to float conversion is not allowed on DG2+ */
return devinfo->verx10 >= 125 &&
inst->opcode == BRW_OPCODE_MOV &&
brw_type_is_float(inst->dst.type) &&
brw_type_size_bits(inst->src[0].type) == 8 &&
is_uniform(inst->src[0]);
}
/*
* Return whether the instruction has unsupported source modifiers
* specified for the i-th source region.
*/
bool
has_invalid_src_modifiers(const intel_device_info *devinfo,
const fs_inst *inst, unsigned i)
{
return (!inst->can_do_source_mods(devinfo) &&
(inst->src[i].negate || inst->src[i].abs)) ||
((has_invalid_exec_type(devinfo, inst) & (1u << i)) &&
(inst->src[i].negate || inst->src[i].abs ||
inst->src[i].type != get_exec_type(inst))) ||
has_invalid_src_conversion(devinfo, inst);
}
/*
* Return whether the instruction has an unsupported type conversion
* specified for the destination.
*/
bool
has_invalid_conversion(const intel_device_info *devinfo, const fs_inst *inst)
{
switch (inst->opcode) {
case BRW_OPCODE_MOV:
return false;
case BRW_OPCODE_SEL:
return inst->dst.type != get_exec_type(inst);
default:
/* FIXME: We assume the opcodes not explicitly mentioned before just
* work fine with arbitrary conversions, unless they need to be
* bit-cast.
*/
return has_invalid_exec_type(devinfo, inst) &&
inst->dst.type != get_exec_type(inst);
}
}
/**
* Return whether the instruction has unsupported destination modifiers.
*/
bool
has_invalid_dst_modifiers(const intel_device_info *devinfo, const fs_inst *inst)
{
return (has_invalid_exec_type(devinfo, inst) &&
(inst->saturate || inst->conditional_mod)) ||
has_invalid_conversion(devinfo, inst);
}
/**
* Return whether the instruction has non-standard semantics for the
* conditional mod which don't cause the flag register to be updated with
* the comparison result.
*/
bool
has_inconsistent_cmod(const fs_inst *inst)
{
return inst->opcode == BRW_OPCODE_SEL ||
inst->opcode == BRW_OPCODE_CSEL ||
inst->opcode == BRW_OPCODE_IF ||
inst->opcode == BRW_OPCODE_WHILE;
}
bool
lower_instruction(fs_visitor *v, bblock_t *block, fs_inst *inst);
}
namespace brw {
/**
* Remove any modifiers from the \p i-th source region of the instruction,
* including negate, abs and any implicit type conversion to the execution
* type. Instead any source modifiers will be implemented as a separate
* MOV instruction prior to the original instruction.
*/
bool
lower_src_modifiers(fs_visitor *v, bblock_t *block, fs_inst *inst, unsigned i)
{
assert(inst->components_read(i) == 1);
assert(v->devinfo->has_integer_dword_mul ||
inst->opcode != BRW_OPCODE_MUL ||
brw_type_is_float(get_exec_type(inst)) ||
MIN2(brw_type_size_bytes(inst->src[0].type), brw_type_size_bytes(inst->src[1].type)) >= 4 ||
brw_type_size_bytes(inst->src[i].type) == get_exec_type_size(inst));
const fs_builder ibld(v, block, inst);
const brw_reg tmp = ibld.vgrf(get_exec_type(inst));
lower_instruction(v, block, ibld.MOV(tmp, inst->src[i]));
inst->src[i] = tmp;
return true;
}
}
namespace {
/**
* Remove any modifiers from the destination region of the instruction,
* including saturate, conditional mod and any implicit type conversion
* from the execution type. Instead any destination modifiers will be
* implemented as a separate MOV instruction after the original
* instruction.
*/
bool
lower_dst_modifiers(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
const fs_builder ibld(v, block, inst);
const brw_reg_type type = get_exec_type(inst);
/* Not strictly necessary, but if possible use a temporary with the same
* channel alignment as the current destination in order to avoid
* violating the restrictions enforced later on by lower_src_region()
* and lower_dst_region(), which would introduce additional copy
* instructions into the program unnecessarily.
*/
const unsigned stride =
brw_type_size_bytes(inst->dst.type) * inst->dst.stride <= brw_type_size_bytes(type) ? 1 :
brw_type_size_bytes(inst->dst.type) * inst->dst.stride / brw_type_size_bytes(type);
brw_reg tmp = ibld.vgrf(type, stride);
ibld.UNDEF(tmp);
tmp = horiz_stride(tmp, stride);
/* Emit a MOV taking care of all the destination modifiers. */
fs_inst *mov = ibld.at(block, inst->next).MOV(inst->dst, tmp);
mov->saturate = inst->saturate;
if (!has_inconsistent_cmod(inst))
mov->conditional_mod = inst->conditional_mod;
if (inst->opcode != BRW_OPCODE_SEL) {
mov->predicate = inst->predicate;
mov->predicate_inverse = inst->predicate_inverse;
}
mov->flag_subreg = inst->flag_subreg;
lower_instruction(v, block, mov);
/* Point the original instruction at the temporary, and clean up any
* destination modifiers.
*/
assert(inst->size_written == inst->dst.component_size(inst->exec_size));
inst->dst = tmp;
inst->size_written = inst->dst.component_size(inst->exec_size);
inst->saturate = false;
if (!has_inconsistent_cmod(inst))
inst->conditional_mod = BRW_CONDITIONAL_NONE;
assert(!inst->flags_written(v->devinfo) || !mov->predicate);
return true;
}
/**
* Remove any non-trivial shuffling of data from the \p i-th source region
* of the instruction. Instead implement the region as a series of integer
* copies into a temporary with the same channel layout as the destination.
*/
bool
lower_src_region(fs_visitor *v, bblock_t *block, fs_inst *inst, unsigned i)
{
assert(inst->components_read(i) == 1);
const intel_device_info *devinfo = v->devinfo;
const fs_builder ibld(v, block, inst);
const unsigned stride = required_src_byte_stride(devinfo, inst, i) /
brw_type_size_bytes(inst->src[i].type);
assert(stride > 0);
/* Calculate the size of the temporary allocation manually instead of
* relying on the builder, since we may have to add some amount of
* padding mandated by the hardware for Xe2+ instructions with sub-dword
* integer regions.
*/
const unsigned size =
DIV_ROUND_UP(required_src_byte_offset(v->devinfo, inst, i) +
inst->exec_size * stride *
brw_type_size_bytes(inst->src[i].type),
reg_unit(devinfo) * REG_SIZE) * reg_unit(devinfo);
brw_reg tmp = brw_vgrf(v->alloc.allocate(size), inst->src[i].type);
ibld.UNDEF(tmp);
tmp = byte_offset(horiz_stride(tmp, stride),
required_src_byte_offset(devinfo, inst, i));
/* Emit a series of 32-bit integer copies with any source modifiers
* cleaned up (because their semantics are dependent on the type).
*/
const brw_reg_type raw_type = brw_int_type(MIN2(brw_type_size_bytes(tmp.type), 4),
false);
const unsigned n = brw_type_size_bytes(tmp.type) / brw_type_size_bytes(raw_type);
brw_reg raw_src = inst->src[i];
raw_src.negate = false;
raw_src.abs = false;
for (unsigned j = 0; j < n; j++) {
fs_inst *jnst = ibld.MOV(subscript(tmp, raw_type, j),
subscript(raw_src, raw_type, j));
if (has_subdword_integer_region_restriction(devinfo, jnst)) {
/* In some cases the copy isn't guaranteed to comply with all subdword
* integer regioning restrictions itself. Lower it recursively.
*/
lower_instruction(v, block, jnst);
}
}
/* Point the original instruction at the temporary, making sure to keep
* any source modifiers in the instruction.
*/
brw_reg lower_src = tmp;
lower_src.negate = inst->src[i].negate;
lower_src.abs = inst->src[i].abs;
inst->src[i] = lower_src;
return true;
}
/**
* Remove any non-trivial shuffling of data from the destination region of
* the instruction. Instead implement the region as a series of integer
* copies from a temporary with a channel layout compatible with the
* sources.
*/
bool
lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
/* We cannot replace the result of an integer multiply which writes the
* accumulator because MUL+MACH pairs act on the accumulator as a 66-bit
* value whereas the MOV will act on only 32 or 33 bits of the
* accumulator.
*/
assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||
brw_type_is_float(inst->dst.type));
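/* As an illustration (taken from the change that introduced the assert
* above), a 64-bit integer multiply lowered into a MUL+MACH pair looks
* roughly like:
*
*    mul(8)  acc0<1>UD  g5<8,4,2>UD  0x0004UW
*    mach(8) g6<1>UD    g5<8,4,2>UD  0x00000004UD
*
* If this pass replaced the MUL destination with a temporary and copied
* it back into the accumulator:
*
*    mul(8)  g9<2>UD    g5<8,4,2>UD  0x0004UW
*    mov(8)  acc0<1>UD  g9<8,4,2>UD
*
* the MOV would write the accumulator with less precision than MUL,
* leaving garbage in the upper accumulator bits that MACH then returns.
* Instead, required_dst_byte_stride() demands a tightly packed stride
* for accumulator destinations so the sources get fixed up rather than
* the destination.
*/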
const fs_builder ibld(v, block, inst);
const unsigned stride = required_dst_byte_stride(inst) /
brw_type_size_bytes(inst->dst.type);
assert(stride > 0);
brw_reg tmp = ibld.vgrf(inst->dst.type, stride);
ibld.UNDEF(tmp);
tmp = horiz_stride(tmp, stride);
if (!inst->dst.is_null()) {
/* Emit a series of 32-bit integer copies from the temporary into the
* original destination.
*/
const brw_reg_type raw_type =
brw_int_type(MIN2(brw_type_size_bytes(tmp.type), 4), false);
const unsigned n =
brw_type_size_bytes(tmp.type) / brw_type_size_bytes(raw_type);
if (inst->predicate && inst->opcode != BRW_OPCODE_SEL) {
/* Note that in general we cannot simply predicate the copies on
* the same flag register as the original instruction, since it
* may have been overwritten by the instruction itself. Instead
* initialize the temporary with the previous contents of the
* destination register.
*/
for (unsigned j = 0; j < n; j++)
ibld.MOV(subscript(tmp, raw_type, j),
subscript(inst->dst, raw_type, j));
}
for (unsigned j = 0; j < n; j++) {
fs_inst *jnst = ibld.at(block, inst->next).MOV(subscript(inst->dst, raw_type, j),
subscript(tmp, raw_type, j));
if (has_subdword_integer_region_restriction(v->devinfo, jnst)) {
/* In some cases the copy isn't guaranteed to comply with all subdword
* integer regioning restrictions itself. Lower it recursively.
*/
lower_instruction(v, block, jnst);
}
}
/* If the destination was an accumulator, after lowering it will be a
* GRF. Clear writes_accumulator for the instruction.
*/
if (inst->dst.is_accumulator())
inst->writes_accumulator = false;
}
/* Point the original instruction at the temporary, making sure to keep
* any destination modifiers in the instruction.
*/
assert(inst->size_written == inst->dst.component_size(inst->exec_size));
inst->dst = tmp;
inst->size_written = inst->dst.component_size(inst->exec_size);
return true;
}
/**
* Change sources and destination of the instruction to an
* appropriate legal type, splitting the instruction into multiple
* ones of smaller execution type if necessary, to be used in cases
* where the execution type of an instruction is unsupported.
*/
bool
lower_exec_type(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
assert(inst->dst.type == get_exec_type(inst));
const unsigned mask = has_invalid_exec_type(v->devinfo, inst);
const brw_reg_type raw_type = required_exec_type(v->devinfo, inst);
const unsigned n = get_exec_type_size(inst) / brw_type_size_bytes(raw_type);
const fs_builder ibld(v, block, inst);
brw_reg tmp = ibld.vgrf(inst->dst.type, inst->dst.stride);
ibld.UNDEF(tmp);
tmp = horiz_stride(tmp, inst->dst.stride);
for (unsigned j = 0; j < n; j++) {
fs_inst sub_inst = *inst;
for (unsigned i = 0; i < inst->sources; i++) {
if (mask & (1u << i)) {
assert(inst->src[i].type == inst->dst.type);
sub_inst.src[i] = subscript(inst->src[i], raw_type, j);
}
}
sub_inst.dst = subscript(tmp, raw_type, j);
assert(sub_inst.size_written == sub_inst.dst.component_size(sub_inst.exec_size));
assert(!sub_inst.flags_written(v->devinfo) && !sub_inst.saturate);
ibld.emit(sub_inst);
fs_inst *mov = ibld.MOV(subscript(inst->dst, raw_type, j),
subscript(tmp, raw_type, j));
if (inst->opcode != BRW_OPCODE_SEL) {
mov->predicate = inst->predicate;
mov->predicate_inverse = inst->predicate_inverse;
}
lower_instruction(v, block, mov);
}
inst->remove(block);
return true;
}
/**
* Fast-path for very specific kinds of invalid regions.
*
* Gfx12.5+ does not allow moves of B or UB sources to floating-point
* destinations. This restriction can be resolved more efficiently than by
* the general lowering in lower_src_modifiers or lower_src_region.
*/
void
lower_src_conversion(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
const intel_device_info *devinfo = v->devinfo;
const fs_builder ibld = fs_builder(v, block, inst).scalar_group();
/* We only handle scalar conversions from small types for now. */
assert(is_uniform(inst->src[0]));
brw_reg tmp = ibld.vgrf(brw_type_with_size(inst->src[0].type, 32));
fs_inst *mov = ibld.MOV(tmp, inst->src[0]);
inst->src[0] = component(tmp, 0);
/* Assert that neither the added MOV nor the original instruction will need
* any additional lowering.
*/
assert(!has_invalid_src_region(devinfo, mov, 0));
assert(!has_invalid_src_modifiers(devinfo, mov, 0));
assert(!has_invalid_dst_region(devinfo, mov));
assert(!has_invalid_src_region(devinfo, inst, 0));
assert(!has_invalid_src_modifiers(devinfo, inst, 0));
}
/**
* Legalize the source and destination regioning controls of the specified
* instruction.
*/
bool
lower_instruction(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
const intel_device_info *devinfo = v->devinfo;
bool progress = false;
/* BROADCAST is special. Its destination region is a bit of a lie, and
* it gets lowered in brw_eu_emit. For the purposes of region
* restrictions, let's assume that the final code emission will do the
* right thing. Doing a bunch of shuffling here is only going to make a
* mess of things.
*/
if (inst->opcode == SHADER_OPCODE_BROADCAST)
return false;
if (has_invalid_dst_modifiers(devinfo, inst))
progress |= lower_dst_modifiers(v, block, inst);
if (has_invalid_dst_region(devinfo, inst))
progress |= lower_dst_region(v, block, inst);
if (has_invalid_src_conversion(devinfo, inst)) {
lower_src_conversion(v, block, inst);
progress = true;
}
for (unsigned i = 0; i < inst->sources; i++) {
if (has_invalid_src_modifiers(devinfo, inst, i))
progress |= lower_src_modifiers(v, block, inst, i);
if (has_invalid_src_region(devinfo, inst, i))
progress |= lower_src_region(v, block, inst, i);
}
if (has_invalid_exec_type(devinfo, inst))
progress |= lower_exec_type(v, block, inst);
return progress;
}
}
bool
brw_lower_regioning(fs_visitor &s)
{
bool progress = false;
foreach_block_and_inst_safe(block, fs_inst, inst, s.cfg)
progress |= lower_instruction(&s, block, inst);
if (progress)
s.invalidate_analysis(DEPENDENCY_INSTRUCTIONS | DEPENDENCY_VARIABLES);
return progress;
}