otherwise we get garbage in the other lanes. this was a pain to debug.
dEQP-VK.subgroups.clustered.compute.subgroupclusteredand_bvec2
this should be optimized (and maybe reworked/simplified too) but now this should
be /correct/ at least.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
The stack pointer starts out at b.shader->scratch_size, plus per-lane
offsets. Every time we spill/fill, we adjust the stack pointer to
the offset for our desired memory location, and leave it there. Over
the course of each block's spills/fills, we track the current delta from
the original value, and restore it to there at the end of the block.
However, when we started clobbering lane 0 and rematerializing it,
we were recreating it as the original base value (b.shader->scratch_size
+ sizeof(uint32_t) * 0). We need to include sp_delta_B too, or else we
will calculate our deltas incorrectly for that lane, and restore it
incorrectly at the end of the block too.
Found while debugging the issue fixed by the previous commit.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
The idea here is that the scratch surface address is stored in
ADDRESS_REGISTER, while per-lane offsets are stored in `sp', an
array of UGPR[dispatch_width]. When we encounter an opcode that
needs to clobber the address register, we stash it in the first
UGPR of `sp'. This clobbers the first lane offset, but that's
easy to reconstruct since it's lane 0. When we need to spill/fill,
we restore the address register and rematerialize the offset for
lane 0.
This is all good. However, we were saving the address register
every time we found an opcode that clobbered it...even if we'd
_already_ clobbered it. So if you had back to back shuffles,
the first would save the scratch surface address, and the second
would save...some part of the first shuffle. So we'd never get
the scratch address back again. Easy fix, only save if valid.
Fixes misrendering in Baldurs Gate 3 compute shaders.
Fixes: 64acab1d69 ("jay/lower_spill: use 1 less temporary")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
instead of appending instructions like brw_eu helpers, just construct a single
gen_inst at a time. this involves a lot less indirection.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
Equivalent now that the IR allows it.
For the dynamic case:
< (32&W) mov.u16 g0, g38<16,8,2> │ I@1
---
> (32&W) mov.u16 g0, g38<2> │ I@1
For the constant case it's actually better since copyprop can see through it:
< (1&W) mov.u32 u0.0, 0xaaaaaaaa │
< (32&W) mov.u16 g1, u0.0 │ I@1
---
> (32&W) mov.u16 g0, 0xaaaa │
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
Previously we had special ops doing data model breaking things on GPRs. But
there's no real reason for that, we can calculate lane IDs as UGPR vectors
within the Jay data model just fine. Adjust jay_ir/jay_validate to define packed
16-bit UGPR vectors, giving them the natural semantics, then use that to
calculate lane IDs, peeling back all the hacks we added along the way.
This also unfortunately pessimizes inverse_ballot() but only in a corner case
that could be revisited later. Stats are net positive.
In addition to the code clean up, this has 3 other benefits:
* Now that we can rematerialize the lane ID code anywhere we want, we could
theoretically reduce register pressure in some scenarios. Stats show this
doesn't help in the current implementation, though.
* Now that we can calculate lane IDs in control flow, the issues with divergent
function calls all go away. (Well, the lane ID issue. There are other issues.)
* Now that we use UGPRs for this, we don't need a stride=16 GRF in shaders that
don't actually use 16-bit math, meaning less shuffling from bad partitions.
That's reflected in the positive stats here.
SIMD16:
Totals from 1643 (62.07% of 2647) affected shaders:
Instrs: 2227750 -> 2221032 (-0.30%); split: -0.44%, +0.14%
CodeSize: 33138416 -> 33034224 (-0.31%); split: -0.52%, +0.20%
SIMD32:
Totals from 1643 (62.07% of 2647) affected shaders:
Instrs: 2864583 -> 2806217 (-2.04%); split: -2.22%, +0.19%
CodeSize: 43088064 -> 42171504 (-2.13%); split: -2.29%, +0.17%
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
look at what the program actually does instead of hardcoding a worst-case.
SIMD16:
Totals from 1965 (74.23% of 2647) affected shaders:
Instrs: 2603230 -> 2539932 (-2.43%); split: -3.44%, +1.01%
CodeSize: 38826160 -> 37811904 (-2.61%); split: -3.59%, +0.97%
Number of spill instructions: 1206 -> 555 (-53.98%)
Number of fill instructions: 1194 -> 551 (-53.85%)
SIMD32:
Totals from 1974 (74.57% of 2647) affected shaders:
Instrs: 3998126 -> 3033333 (-24.13%); split: -24.18%, +0.05%
CodeSize: 59563952 -> 45580448 (-23.48%); split: -23.52%, +0.05%
Number of spill instructions: 43534 -> 37471 (-13.93%); split: -13.97%, +0.04%
Number of fill instructions: 43118 -> 36412 (-15.55%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
These queries need to be used for partitioning too. And also this degunks the
core RA logic in jay_register_allocate.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41872>
Not sure if any workload uses this. This mostly allows us to document
the functionality of HSD 22011236099 on gfx20+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41712>
These platforms don't support CCS on MCS/HIZ/STC. There's nothing we can
do about this. So, stop warning about it.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41712>
This is a preparation for enabling INTEL_DEBUG=mda in Jay. Since we are
changing the passes to use a new macro, go ahead and use new JAY_NIR_*
macros so we don't have to rename them again when their implementation
gets decoupled from BRW.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41897>
On Release builds, I get this:
../../src/intel/tools/intel_eu_stall_viewer.cpp: In function ‘int main(int, char**)’:
../../src/intel/tools/intel_eu_stall_viewer.cpp:269:27: warning: ‘stall_csv_filename’ may be used uninitialized [-Wmaybe-uninitialized]
269 | if (!shaders_directory || !stall_csv_filename) {
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
../../src/intel/tools/intel_eu_stall_viewer.cpp:244:43: note: ‘stall_csv_filename’ was declared here
244 | const char *shaders_directory = NULL, *stall_csv_filename;
We can't expect gcc to understand that it's a required argument.
This "fixes" b795a1a20c ("intel/tools: add eu stall viewer").
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41946>
I wanted to use another common function, and having to manual import it
felt silly given that we're a drirc_gen user. But we are limited because
we don't have the common code in the system path at startup, so we can't
just import it at module level.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41877>
brw_nir_lower_deferred_urb_writes assumes that constant offsets will be
properly folded. In brw itself we call the big optimization loop which
takes care of this, but jay doesn't do that in-between.
At any rate, nir_lower_io generates a lot of address math that really
ought to get cleaned up, so it seems like a good point to call it here.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41821>
I meant to add these together, not put a random semicolon in the middle
of the expression which meant the offset got tossed on the floor.
Fixes: 6fbe201a12 ("brw: Convert VS/TES/GS outputs to URB intrinsics.")
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41821>
We now calculate it when emitting push input loads at the NIR level,
rather than in the backend.
v2: Fix missing interaction with legacy tesslevel remapping
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com> [v1]
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41821>
We were seeing if the start of the load was within the push range,
rather than the entire load. (We could also split loads, but that
seems needlessly complex.)
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41821>