iris: Perform load_constant address math in 32-bit rather than 64-bit

We lower NIR's load_constant to load_global_constant, which uses A64
bindless messages.  As such, we do the following math to produce the
address for each load:

   base_lo@32 <- BRW_SHADER_RELOC_CONST_DATA_ADDR_LOW
   base_hi@32 <- BRW_SHADER_RELOC_CONST_DATA_ADDR_HIGH
   base@64 <- pack_64_2x32_split(base_lo, base_hi)
   addr@64 <- iadd(base@64, u2u64(offset@32))

On platforms that emulate 64-bit math, we have to emit additional code
for the 64-bit iadd to handle the possibility of a carry happening and
affecting the top bits.
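
Roughly speaking, that emulated 64-bit iadd expands to something along
these lines (a sketch of the lowering, not the exact code any
particular backend emits):

   lo@32    <- iadd(base_lo, offset)
   carry@32 <- b2i32(ult(lo, offset))     (did the low add wrap?)
   hi@32    <- iadd(base_hi, carry)
   addr@64  <- pack_64_2x32_split(lo, hi)

Every constant load pays for that extra compare-and-add just to
propagate a carry which, as the next paragraph explains, can never
actually happen here.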

However, NIR constant data is always uploaded adjacent to the shader
assembly, in the same buffer.  These buffers are required to live in a
4GB region of memory starting at Instruction State Base Address.  We
always place that base address on a 4GB-aligned boundary.  So the
constant data always lives in a buffer entirely contained within a
single 4GB-aligned region, which means that adding any offset from the
start of the buffer can never carry into the high 32 bits of the
address.
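
For example (with purely illustrative numbers): if the 4GB-aligned
region were [0x2_0000_0000, 0x3_0000_0000) and a shader's buffer
started at 0x2_4000_0000, then base_lo = 0x4000_0000 and base_hi = 0x2.
Any in-bounds offset keeps base_lo + offset below 0x1_0000_0000,
because the buffer ends before 0x3_0000_0000, so the low 32-bit add can
never carry into base_hi.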

So instead, we can simply do a 32-bit addition between the low bits of
the base and the offset, then pack that with the unchanged high bits.
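
In the same notation as above, that general form would be:

   lo@32   <- iadd(base_lo, offset)
   addr@64 <- pack_64_2x32_split(lo, base_hi)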

On iris, IRIS_MEMZONE_SHADER is at [0, 4GB) so the high bits are always
zero.  We don't even need to patch that portion of the address and can
simply use u2u64 to promote the 32-bit add result to a 64-bit value
where the top bits are 0.
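
In the same notation, the lowering this patch emits on iris is simply:

   lo@32   <- iadd(base_lo, offset)
   addr@64 <- u2u64(lo)

which corresponds to the nir_iadd + nir_u2u64 sequence in the diff
below.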

shader-db on Icelake indicates that this:
- Helps instructions: -1.13% in 135 affected programs
- Helps spills/fills: -4.08% / -4.18% in 4 affected programs
- Gains us 1 SIMD16 compute shader instead of SIMD8

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20999>
Author:    Kenneth Graunke (2023-01-23 11:57:18 -08:00)
Committer: Marge Bot
Parent:    95d06343c6
Commit:    a0e7e7ff41


@@ -532,13 +532,18 @@ iris_setup_uniforms(ASSERTED const struct intel_device_info *devinfo,
             unsigned max_offset = b.shader->constant_data_size - load_size;
             offset = nir_umin(&b, offset, nir_imm_int(&b, max_offset));
 
-            nir_ssa_def *const_data_base_addr = nir_pack_64_2x32_split(&b,
-               nir_load_reloc_const_intel(&b, BRW_SHADER_RELOC_CONST_DATA_ADDR_LOW),
-               nir_load_reloc_const_intel(&b, BRW_SHADER_RELOC_CONST_DATA_ADDR_HIGH));
+            /* Constant data lives in buffers within IRIS_MEMZONE_SHADER
+             * and cannot cross that 4GB boundary, so we can do the address
+             * calculation with 32-bit adds.  Also, we can ignore the high
+             * bits because IRIS_MEMZONE_SHADER is in the [0, 4GB) range.
+             */
+            assert(IRIS_MEMZONE_SHADER_START >> 32 == 0ull);
+
+            nir_ssa_def *const_data_addr =
+               nir_iadd(&b, nir_load_reloc_const_intel(&b, BRW_SHADER_RELOC_CONST_DATA_ADDR_LOW), offset);
 
             nir_ssa_def *data =
-               nir_load_global_constant(&b, nir_iadd(&b, const_data_base_addr,
-                                                         nir_u2u64(&b, offset)),
+               nir_load_global_constant(&b, nir_u2u64(&b, const_data_addr),
                                         load_align,
                                         intrin->dest.ssa.num_components,
                                         intrin->dest.ssa.bit_size);