From f227aa7c98ea681b8555efbc4a80233bff7221b5 Mon Sep 17 00:00:00 2001
From: Iago Toral Quiroga <itoral@igalia.com>
Date: Fri, 8 Jul 2022 13:33:11 +0200
Subject: [PATCH] broadcom/compiler: don't try to hide TMU latency at QPU
 scheduling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Based on empirical testing with Sponza and a few UE4 samples, this is
consistently slightly beneficial for performance.

The most likely reason this helps is that thrsw is probably already
quite effective at hiding latency, and we already try to hide latency
at NIR scheduling and via TMU pipelining. Piling more of this on when
scheduling QPU typically provides no latency benefit at all and may
instead prevent us from unblocking critical paths in the shader that
depend on the TMU result, requiring us to execute more cycles to
complete the program.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17451>
---
 src/broadcom/compiler/qpu_schedule.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/broadcom/compiler/qpu_schedule.c b/src/broadcom/compiler/qpu_schedule.c
index 112573dc613..74bd1cd7a9b 100644
--- a/src/broadcom/compiler/qpu_schedule.c
+++ b/src/broadcom/compiler/qpu_schedule.c
@@ -645,19 +645,32 @@ get_instruction_priority(const struct v3d_device_info *devinfo,
                 return next_score;
         next_score++;
 
+        /* Empirical testing shows that using priorities to hide latency of
+         * TMU operations when scheduling QPU leads to slightly worse
+         * performance, even at 2 threads. We think this is because the thread
+         * switching is already quite effective at hiding latency and NIR
+         * scheduling (and possibly TMU pipelining too) are sufficient to hide
+         * TMU latency, so piling up on that here doesn't provide any benefits
+         * and instead may cause us to postpone critical paths that depend on
+         * the TMU results.
+         */
+#if 0
         /* Schedule texture read results collection late to hide latency. */
         if (v3d_qpu_waits_on_tmu(inst))
                 return next_score;
         next_score++;
+#endif
 
         /* Default score for things that aren't otherwise special. */
         baseline_score = next_score;
         next_score++;
 
+#if 0
         /* Schedule texture read setup early to hide their latency better. */
         if (v3d_qpu_writes_tmu(devinfo, inst))
                 return next_score;
         next_score++;
+#endif
 
         /* We should increase the maximum if we assert here */
         assert(next_score < MAX_SCHEDULE_PRIORITY);