Just a bit cleaner, and we can unify point size too.
Signed-off-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Christoph Pillmayer <christoph.pillmayer@arm.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40677>
This comes up in lowered load_ubo sequences (observed in OpenCL test
test_api min_max_parameter_size). Hopefully the pack gets coalesced,
it's like nir_op_vec2 on most backends, so it should usually be ok to
sink even though the register pressure heuristic will reject it.
Allowing it to sink allows the UBO load to sink.
Intel's backend scheduler can optimize the relevant sequences locally
but there should still be a win here for global load sinking.
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40267>
Don't sink alu that uses ballot(true), as that can a local system value
and moving the alu then requires a new mov in the old location.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38829>
We add several new intrinsics for accessing URB handles:
- load_urb_output_handle_intel
- load_urb_input_handle_intel
- load_urb_input_handle_intel_indexed
The latter is used by stages like TCS and GS where each input control
point has a unique handle. The index is which ICP to read from. The
others are for most stages, where all inputs or outputs are accessed
via a single handle.
Then we have URB load and store operations, split for Xe2+ (URB via LSC)
and earlier (HDC OWord messages):
- load_urb_vec4_intel
- load_urb_lsc_intel
- store_urb_vec4_intel
- store_urb_lsc_intel
The legacy vec4 variants take a handle and a 128-bit OWord offset as
sources. Additionally, stores take a set of channel enables to mask
off and avoid writing vec4 components. We don't use the WRITE_MASK
const-index as our channel enables are not required to be constant.
The Xe2+ LSC variants are simpler. Handles are byte offsets into the
URB memory region, and offsets are expressed in bytes. So we simply
add them into a single "address" source. We don't support writemasks
here, as they aren't really necessary with the better addressability.
(Plus, the store_cmask operations work significantly differently than
the previous HDC OWord messages). We will lower disjoint writemasks
to multiple stores.
Based on earlier code by Lionel Landwerlin.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38482>
We add a bunch of new helpers to avoid the need to touch >parent_instr,
including the full set of:
* nir_def_is_*
* nir_def_as_*_or_null
* nir_def_as_* [assumes the right instr type]
* nir_src_is_*
* nir_src_as_*
* nir_scalar_is_*
* nir_scalar_as_*
Plus nir_def_instr() where there's no more suitable helper.
Also an existing helper is renamed to unify all the names, while we're
churning the tree:
* nir_src_as_alu_instr -> nir_src_as_alu
..and then we port the tree to use the helpers as much as possible, using
nir_def_instr() where that does not work.
Acked-by: Marek Olšák <maraeo@gmail.com>
---
To eliminate nir_def::parent_instr we need to churn the tree anyway, so I'm
taking this opportunity to clean up a lot of NIR patterns.
Co-authored-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38313>
On a fossil from the blender 4.5.0 vulkan backend, this improves compile
times in nak by about 17%. Compile time of other shaders improves by a
more modest 1.2%.
No stat changes on shader-db.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36184>
We're about to add to nir_metadata_control_flow, and we don't want
passes to require the new metadata.
Via coccinelle:
@@
expression e1;
@@
- nir_metadata_require(e1, nir_metadata_control_flow)
+ nir_metadata_require(e1, nir_metadata_block_index | nir_metadata_dominance)
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36184>
We can just place the set structures inside nir_block.
This reduces the number of ralloc calls by 6.7% when compiling Heaven
shaders with radeonsi+ACO using a release build (i.e. not including
nir_validate set allocations, which are also removed).
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36728>
Usually we can fold most ldc and ldcx into the instruction using it,
however there are a couple of cases where we can't, e.g. when there is an
indirect offset.
Moving the ldc(x) down to the consumer leads to increase value ranges for
uniform registers, but lowering them for normal registers.
Totals:
CodeSize: 914650304 -> 914469536 (-0.02%); split: -0.05%, +0.03%
Number of GPRs: 3879754 -> 3863818 (-0.41%); split: -0.42%, +0.01%
Static cycle count: 1073273107 -> 1073101189 (-0.02%); split: -0.09%, +0.08%
Spills to reg: 67219 -> 67707 (+0.73%); split: -0.10%, +0.83%
Fills from reg: 79733 -> 80456 (+0.91%); split: -0.10%, +1.01%
Max warps/SM: 3666036 -> 3672668 (+0.18%); split: +0.18%, -0.00%
Totals from 24235 (27.66% of 87622) affected shaders:
CodeSize: 444747392 -> 444566624 (-0.04%); split: -0.11%, +0.07%
Number of GPRs: 1360384 -> 1344448 (-1.17%); split: -1.20%, +0.03%
Static cycle count: 806310857 -> 806138939 (-0.02%); split: -0.12%, +0.10%
Spills to reg: 35826 -> 36314 (+1.36%); split: -0.19%, +1.55%
Fills from reg: 31863 -> 32586 (+2.27%); split: -0.26%, +2.53%
Max warps/SM: 911328 -> 917960 (+0.73%); split: +0.74%, -0.01%
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36536>
this lets the backend implement bounded loads (i.e. robust SSBOs) in a way
that's more clever than a full branch. similar idea to
load_global_constant_bound which should eventually be merged into this.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Job Noorman <job@noorman.info>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35720>
On Avalon, this is a bitfield that holds information on what
values a vertex shader should output.
Signed-off-by: Mary Guillemard <mary.guillemard@collabora.com>
Reviewed-by: Benjamin Lee <benjamin.lee@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33910>
This is needed for noperspective lowering, where we need to multiply the
varying value by gl_FragCoord.w at the same barycentric as the varying.
Normal nir_load_frag_coord_zw instructions are lowered to the new
intrinsic on bifrost with the pan_lower_frag_coord_zw pass.
Signed-off-by: Benjamin Lee <benjamin.lee@collabora.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32127>
load_attribute_pan is a panfrost-specific intrinsic for loading
vertex attributes. Takes explicit vertex and instance IDs which
we need in order to implement vertex attribute divisor with
non-zero base instance on v9+.
Passes which are used by panvk are modified to be aware of
load_attribute_pan.
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Mary Guillemard <mary.guillemard@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32039>
Inverse_ballot result is undefined if the input is not dynamically uniform.
And sinking out of loops might make the input divergent.
Fixes: 18a0ff137f ("nir: sink/move inverse_ballot like moves")
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30906>
Same reason as for load_ubo.
Fixes: d199d65c3a ("nir/nir_opt_move,sink: Include load_ubo_vec4 as a load_ubo instr.")
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30906>
First, we need to give the parent_instr field a unique name to be able to
replace with a helper. We have parent_instr fields for both nir_src and
nir_def, so let's rename nir_src::parent_instr in preparation for rework.
This was done with a combination of sed and manual fix-ups.
Then we use semantic patches plus manual fixups:
@@
expression s;
@@
-s->renamed_parent_instr
+nir_src_parent_instr(s)
@@
expression s;
@@
-s.renamed_parent_instr
+nir_src_parent_instr(&s)
@@
expression s;
@@
-s->parent_if
+nir_src_parent_if(s)
@@
expression s;
@@
-s.renamed_parent_if
+nir_src_parent_if(&s)
@@
expression s;
@@
-s->is_if
+nir_src_is_if(s)
@@
expression s;
@@
-s.is_if
+nir_src_is_if(&s)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24671>
In general, sinking ALU instructions can negatively impact register pressure,
since it extends the live ranges of the sources, although it does shrink the live range
of the destination.
However, constants do not usually contribute to register pressure. This is not a
totally true assumption, but it's pretty good in practice, since...
* constants can be rematerialized (backend-dependent)
* constants can often be inlined (ISA-dependent)
* constants can sometimes be promoted to free uniform registers (ISA-dependent)
* constants can live in scalar registers although the ALU destination might need
a vector register (and vector registers are assumed to be much more expensive
than scalar registers, again ISA-dependent)
So, assume that constants have zero effect on register pressure. Now consider an
ALU instruction where all but one source is a constant. Then there are two
cases:
1. The ALU instruction is moved past when its source was otherwise killed. Then
there is no effect on register pressure, since the source live range is
extended exactly as much as the destination live range shrinks.
2. The ALU instruction is moved down but its source is still alive where it's
moved to. Then register pressure is improved, since the source live range is
unchanged while the destination live range shrinks.
So, as a heuristic, we always move ALU instructions where n-1 sources are
constant. As an inevitable special case, this also (necessarily) moves unary ALU
ops, which should be beneficial by the same justification. This is not 100%
perfect but it is well-motivated. Results on AGX are decent:
total instructions in shared programs: 1796101 -> 1795652 (-0.02%)
instructions in affected programs: 326822 -> 326373 (-0.14%)
helped: 800
HURT: 371
Inconclusive result (%-change mean confidence interval includes 0).
total bytes in shared programs: 11805004 -> 11801424 (-0.03%)
bytes in affected programs: 2610630 -> 2607050 (-0.14%)
helped: 912
HURT: 462
Inconclusive result (%-change mean confidence interval includes 0).
total halfregs in shared programs: 525818 -> 515399 (-1.98%)
halfregs in affected programs: 118197 -> 107778 (-8.81%)
helped: 2095
HURT: 804
Halfregs are helped.
total threads in shared programs: 18916608 -> 18917056 (<.01%)
threads in affected programs: 4800 -> 5248 (9.33%)
helped: 7
HURT: 0
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24833>
At the moment, this does nothing. It will prevent problems from the next patch,
however.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24833>
This is the AGX version of load_output, which shaders can use for framebuffer
fetch. It is beneficial to sink framebuffer fetch as late as possible, both to
reduce register pressure but also to reduce serialization of overlapping
fragments.
Results on a collection of ubershaders:
total bytes in shared programs: 1468928 -> 1468550 (-0.03%)
bytes in affected programs: 495300 -> 494922 (-0.08%)
helped: 24
HURT: 0
Bytes are helped.
total halfregs in shared programs: 14162 -> 13946 (-1.53%)
halfregs in affected programs: 5148 -> 4932 (-4.20%)
helped: 27
HURT: 0
Halfregs are helped.
total threads in shared programs: 216896 -> 217664 (0.35%)
threads in affected programs: 6912 -> 7680 (11.11%)
helped: 12
HURT: 0
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24833>
By the time this runs, we will have already lowered load_ubo and load_vbo to
load_constant_agx so we need to handle the backend version.
This is very important for reducing register pressure in monolithic VS+GS
shaders on AGX. Since no other backend has _agx intrinsics, there's no need for
an option to gate this.
The additional instruction count is from more frequent wait instructions due to
fewer instructions grouped together. This should be mitigated in the future with
an ACO-style latency-reducing scheduler in the backend, after register pressure
is reduced by opt_sink.
total instructions in shared programs: 1793385 -> 1796101 (0.15%)
instructions in affected programs: 199816 -> 202532 (1.36%)
helped: 3
HURT: 941
Instructions are HURT.
total bytes in shared programs: 11799628 -> 11805004 (0.05%)
bytes in affected programs: 1345656 -> 1351032 (0.40%)
helped: 34
HURT: 919
Bytes are HURT.
total halfregs in shared programs: 533151 -> 525818 (-1.38%)
halfregs in affected programs: 40335 -> 33002 (-18.18%)
helped: 613
HURT: 42
Halfregs are helped.
total threads in shared programs: 18910464 -> 18916608 (0.03%)
threads in affected programs: 6144 -> 12288 (100.00%)
helped: 12
HURT: 0
Threads are helped.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24833>