aco/gfx10+: work around non uniform ds_append wave64 result

In wave64 for hw with native wave32, ds_append seems to be split in a load for
the low half and an atomic for the high half, and other LDS instructions can
be scheduled between the two.
Which means the result of the low half is unusable because it might be out of date.

I was only able to reproduce this issue in WGP mode, but be conservative and
apply the workaround in CU mode too.

Foz-DB Navi31:
Totals from 13 (0.02% of 79395) affected shaders:
Instrs: 7599 -> 7656 (+0.75%)
CodeSize: 39708 -> 39972 (+0.66%)
Latency: 83174 -> 83572 (+0.48%)
InvThroughput: 8271 -> 8357 (+1.04%)
Copies: 718 -> 717 (-0.14%)
VALU: 3689 -> 3703 (+0.38%)
SALU: 935 -> 965 (+3.21%)

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11921
Fixes: 45e935800a ("aco: implement nir_shared_append/consume_amd")

Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31301>
This commit is contained in:
Georg Lehmann 2024-09-21 15:02:23 +02:00 committed by Marge Bot
parent b6b363c478
commit 0e21cd9e15

View file

@ -7595,7 +7595,19 @@ visit_shared_append(isel_context* ctx, nir_intrinsic_instr* instr)
ds = bld.ds(op, Definition(tmp), m, address);
ds->ds().sync = memory_sync_info(storage_shared, semantic_atomicrmw);
bld.pseudo(aco_opcode::p_as_uniform, Definition(get_ssa_temp(ctx, &instr->def)), tmp);
/* In wave64 for hw with native wave32, ds_append seems to be split in a load for the low half
* and an atomic for the high half, and other LDS instructions can be scheduled between the two.
* Which means the result of the low half is unusable because it might be out of date.
*/
if (ctx->program->gfx_level >= GFX10 && ctx->program->wave_size == 64 &&
ctx->program->workgroup_size > 64) {
Temp last_lane = bld.sop1(aco_opcode::s_flbit_i32_b64, bld.def(s1), Operand(exec, s2));
last_lane = bld.sop2(aco_opcode::s_sub_u32, bld.def(s1), bld.def(s1, scc), Operand::c32(63),
last_lane);
bld.readlane(Definition(get_ssa_temp(ctx, &instr->def)), tmp, last_lane);
} else {
bld.pseudo(aco_opcode::p_as_uniform, Definition(get_ssa_temp(ctx, &instr->def)), tmp);
}
}
void