I'm going to introduce another call site for this function, and just
handling SCHEDULE_NONE in the scheduler itself makes more sense than
duplicating the logic.
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24707>
The register pressure analysis I wrote in 2013 only considered VGRFs,
and not other GRFs, such as payload registers and push constants. We
need to consider those too, because payload registers definitely occupy
space and add to pressure.
In 2015, Connor already made the scheduler account for this, so the only
real use for this is in shader statistic dumps and optimizer printouts.
But we should make it more accurate. (We will use it in more places
shortly, a few commits from now.)
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24707>
If the NIR_DEBUG_PRINT_INTERNAL flag is not set, don't print debugging
information for internal shaders in INTEL_DEBUG=optimizer dumps.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24684>
This reverts commit 6b494745be.
The logic is not entirely correct: the comparison is between two
static-analysis estimates of a dynamic system with variables that aren't
captured by the shader source, so using ">" will always have greater potential
to cause regressions whenever the performance difference between the two builds
is something not captured by the static model, no matter how much the model is
improved.
Reference: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9262
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24615>
Instead, we replace every use of it with nir_def. Most of this commit
was generated by sed:
sed -i -e 's/dest.ssa/def/g' src/**/*.h src/**/*.c src/**/*.cpp
A few manual fixups were required in lima and the nir_legacy code.
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24674>
It can be useful to compare 2 runs with different compiler changes.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24552>
One particular nice thing to have is the first generated backend IR
before validation. Especially if you made a mistake in the NIR
translation, you can at least look at it before validation tells you
off.
Then the last 2 steps of the optimize() function can be interesting to
look at.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24552>
And split them into UBO and SSBO
v2 (Lionel):
- Get rid of robustness fields in anv_shader_bin
v3 (Lionel):
- Do not pass unused parameters around
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17545>
Purely from the backend point of view it's just an additional
parameter to sampler messages.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23882>
A few titles show max live register reductions, but nothing
significant in instruction count or other stats.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24282>
This is a modified version of a commit originally in !7698. This version
add the changes to brw_fs_copy_propagation. If the address passed to
fs_visitor::swizzle_nir_scratch_addr is a constant, that function will
generate SHL with two constant sources.
DG2 uses a different path to generate those addresses, so the constant
folding can't occur there yet. That will be addressed in the next
commit.
What follows is the commit change history from that older MR.
v2: Previously this commit was after `intel/fs: Combine constants for
integer instructions too`. However, this commit can create invalid
instructions that are only cleaned up by `intel/fs: Combine constants
for integer instructions too`. That would potentially affect the
shader-db results of each commit, but I did not collect new data for
the reordering.
v3: Fix masking for W/UW and for Q/UQ types. Add an assertion for
!saturate. Both suggested by Ken. Also add an assertion that B/UB types
don't matically come back.
v4: Fix sources count. See also ed3c2f73db ("intel/fs: fixup sources
number from opt_algebraic").
v5: Fix typo in comment added in v3. Noticed by Marcin. Fix a typo in a
comment added when pulling this commit out of !7698. Noticed by Ken.
shader-db results:
DG2
No changes.
Tiger Lake, Ice Lake, and Skylake had similar results (Ice Lake shown)
total instructions in shared programs: 20655696 -> 20651648 (-0.02%)
instructions in affected programs: 23125 -> 19077 (-17.50%)
helped: 7 / HURT: 0
total cycles in shared programs: 858436639 -> 858407749 (<.01%)
cycles in affected programs: 8990532 -> 8961642 (-0.32%)
helped: 7 / HURT: 0
Broadwell and Haswell had similar results. (Broadwell shown)
total instructions in shared programs: 18500780 -> 18496630 (-0.02%)
instructions in affected programs: 24715 -> 20565 (-16.79%)
helped: 7 / HURT: 0
total cycles in shared programs: 946100660 -> 946087688 (<.01%)
cycles in affected programs: 5838252 -> 5825280 (-0.22%)
helped: 7 / HURT: 0
total spills in shared programs: 17588 -> 17572 (-0.09%)
spills in affected programs: 1206 -> 1190 (-1.33%)
helped: 2 / HURT: 0
total fills in shared programs: 25192 -> 25156 (-0.14%)
fills in affected programs: 156 -> 120 (-23.08%)
helped: 2 / HURT: 0
No shader-db changes on any older Intel platforms.
fossil-db results:
DG2
Totals:
Instrs: 197780415 -> 197780372 (-0.00%); split: -0.00%, +0.00%
Cycles: 14066412266 -> 14066410782 (-0.00%); split: -0.00%, +0.00%
Totals from 16 (0.00% of 668055) affected shaders:
Instrs: 16420 -> 16377 (-0.26%); split: -0.43%, +0.17%
Cycles: 220133 -> 218649 (-0.67%); split: -0.69%, +0.01%
Tiger Lake, Ice Lake and Skylake had similar results. (Ice Lake shown)
Totals:
Instrs: 153425977 -> 153423678 (-0.00%)
Cycles: 14747928947 -> 14747929547 (+0.00%); split: -0.00%, +0.00%
Subgroup size: 8535968 -> 8535976 (+0.00%)
Send messages: 7697606 -> 7697607 (+0.00%)
Scratch Memory Size: 4380672 -> 4381696 (+0.02%)
Totals from 6 (0.00% of 662749) affected shaders:
Instrs: 13893 -> 11594 (-16.55%)
Cycles: 5386074 -> 5386674 (+0.01%); split: -0.42%, +0.43%
Subgroup size: 80 -> 88 (+10.00%)
Send messages: 675 -> 676 (+0.15%)
Scratch Memory Size: 91136 -> 92160 (+1.12%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23884>
opt_copy_propagation can create invalid instructions like
shl(8) vgrf96:UD, 2d, 8u
These instructions will be cleaned up by opt_algebraic. The irony is
opt_algebraic converts these to simple mov instructions that
opt_copy_propagation should clean up. I don't think we want a loop like
do {
progress = false;
if (OPT(opt_copy_propagation)) {
OPT(opt_algebraic);
OPT(dead_code_eliminate);
}
} while (progress);
But maybe we do?
Maybe this would be sufficient:
while (OPT(opt_copy_propagation))
OPT(opt_algebraic);
OPT(dead_code_eliminate);
No shader-db or fossil-db changes (yet) on any Intel platform. This is
expected.
v2: Do opt_algebraic immediately after every call to
opt_copy_propagation instead of being clever. Suggested by Lionel.
Tested-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23884>
Both per-primitive and per-vertex space is allocated in MUE in 8 dword
chunks and those 8-dword chunks (granularity of
3DSTATE_SBE_MESH.Per[Primitive|Vertex]URBEntryOutputReadLength)
are passed to fragment shaders as inputs (either non-interpolated
for per-primitive and flat vertex attributes or interpolated
for non-flat vertex attributes).
Some attributes have a special meaning and must be placed in separate
8/16-dword slot called Primitive Header or Vertex Header.
Primitive Header contains 4 such attributes (Cull Primitive,
ViewportIndex, RTAIndex, CPS), leaving 4 dwords (the rest of 8-dword
slot) potentially unused.
Vertex Header is similar - it starts with 3 unused dwords, 1 dword for
Point Size (but if we declare that shader doesn't produce Point Size
then we can reuse it), followed by 4 dwords for Position and optionally
8 dwords for clip distances.
This means we have an interesting optimization problem - we can put
some user attributes into holes in Primitive and Vertex Headers, which
may lead to smaller MUE size and potentially more mesh threads running
in parallel, but we have to be careful to use those holes only when
we need it, otherwise we could force HW to pass too much data to
fragment shader.
Example 1:
Let's assume that Primitive Header is enabled and user defined
12 dwords of per-primitive attributes.
Without packing we would consume 8 + ALIGN(12, 8) = 24 dwords of
MUE space and pass ALIGN(12, 8) = 16 dwords to fragment shader.
With packing, we'll consume 4 + 4 + ALIGN(12 - 4, 8) = 16 dwords of
MUE space and pass ALIGN(4, 8) + ALIGN(12 - 4, 8) = 16 dwords to
fragment shader.
16/16 is better than 24/16, so packing makes sense.
Example 2:
Now let's assume that Primitive Header is enabled and user defined
16 dwords of per-primitive attributes.
Without packing we would consume 8 + ALIGN(16, 8) = 24 dwords of
MUE space and pass ALIGN(16, 16) = 16 dwords to fragment shader.
With packing, we'll consume 4 + 4 + ALIGN(16 - 4, 8) = 24 dwords of
MUE space and pass ALIGN(4, 8) + ALIGN(16 - 4, 8) = 24 dwords to
fragment shader.
24/24 is worse than 24/16, so packing doesn't make sense.
This change doesn't affect vk_meshlet_cadscene in default configuration,
but it speeds it up by up to 25% with "-extraattributes N", where
N is some small value divisible by 2 (by default N == 1) and we
are bound by URB size.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20407>
Instead of using 4 dwords for each output slot, use only the amount
of memory actually needed by each variable.
There are some complications from this "obvious" idea:
- flat and non-flat variables can't be merged into the same vec4 slot,
because flat inputs mask has vec4 stride
- multi-slot variables can have different layout:
float[N] requires N 1-dword slots, but
i64vec3 requires 1 fully occupied 4-dword slot followed by 2-dword slot
- some output variables occur both in single-channel/component split
and combined variants
- crossing vec4 boundary requires generating more writes, so avoiding them
if possible is beneficial
This patch fixes some issues with arrays in per-vertex and per-primitive data
(func.mesh.ext.outputs.*.indirect_array.q0 in crucible)
and by reduction in single MUE size it allows spawning more threads at
the same time.
Note: this patch doesn't improve vk_meshlet_cadscene performance because
default layout is already optimal enough.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20407>
Use a struct for various common parameters rather than per stage
structure or arguments to stage specific entrypoints.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Felix DeGrood <felix.j.degrood@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23942>
Delete unnecessary virtual functions, we need just two. Refactor code
so the 'default behavior' logic (stderr and/or creating file) is not
duplicated.
Rename the virtuals so overrides don't hide the common convenience
functions. Finally, provide a variant of dump_instructions() with
a `FILE *` parameter.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23457>
This only impacts ADD3, so at this point it should not have any
affect. As soon as constants are propagated into ADD3 instructions, it
will be a problem.
The worst part is, the ADD3 instrutions that are broken by the old code
aren't even "progress" on this pass.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23262>
In commit 284f0c9a57 I refactored the
handling of the data source to just call a helper rather than special
casing opcodes with 0 or 2 sources. Unfortunately, I also dropped the
"else return 1", creating a fallthrough for all sources other than
SURFACE_LOGICAL_SRC_ADDRESS and SURFACE_LOGICAL_SRC_DATA.
The case below happened to return the correct value for all cases except
SURFACE_LOGICAL_SRC_SURFACE, which has been returning 2 instead of 1
since that commit.
Restore the else case. Thanks to Marcin Ślusarz for catching this.
Fixes: 284f0c9a57 ("intel/compiler: Add an lsc_op_num_data_values() helper")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23347>
CPS is dynamically turned off when per sample interpolation is active.
Update the asserts to reflect this.
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 5644011f06 ("intel/compiler: Convert wm_prog_key::persample_interp to a tri-state")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23103>
With Anv/Zink, the piglit test :
arb_shader_storage_buffer_object-max-ssbo-size -auto -fbo fsexceed
is failing validation after copy propagation :
load_payload(8) vgrf15:F, vgrf1+0.12<0>:F, vgrf1+0.0<0>:F, vgrf1+0.4<0>:F, vgrf1+0.8<0>:F, vgrf1+0.12<0>:F
../src/intel/compiler/brw_fs_validate.cpp:191: A <= B failed
A = inst->src[i].offset / REG_SIZE + regs_read(inst, i) = 2
B = alloc.sizes[inst->src[i].nr] = 1
In most cases it works because src[0] would be at offset 0 and so
reading a full reg passes validation, but Anv/Zink started emitting
slightly different code adding an offset maybe the size read 2 GRFs.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23126>
This reverts commit 5489033fa8.
The problem I was trying to address is that we were programming the
3DSTATE_PS::PositionXYOffsetSelect bit differently with GPL (CENTROID)
than without (NONE).
I failed to understand that this bit also impacts the thread payload
layout. GPL fragment shaders don't know ahead of time if pos_offset is
going to be used. It'll be choosen at runtime base on push constant
bits. So we need to program this bit different just to have a payload
matching the compiled shader code.
This fixes the freedoom replay with GPL FS shader in SIMD32.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22938>
This value depends on the per-sample value which can be unknown at
compile time with graphics pipeline libraries. So we need to have this
dynamic has well and pick the right value when generating the
3DSTATE_PS/3DSTATE_WM packet.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: d8dfd153c5 ("intel/fs: Make per-sample and coarse dispatch tri-state")
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22728>
Every nir_ssa_def is part of a chain of uses, implemented with doubly linked
lists. That means each requires 2 * 64-bit = 16 bytes per def, which is
memory intensive. Together they require 32 bytes per def. Not cool.
To cut that memory use in half, we can combine the two linked lists into a
single use list that contains both regular instruction uses and if-uses. To do
this, we augment the nir_src with a boolean "is_if", and reimplement the
abstract if-uses operations on top of that list. That boolean should fit into
the padding already in nir_src so should not actually affect memory use, and in
the future we sneak it into the bottom bit of a pointer.
However, this creates a new inefficiency: now iterating over regular uses
separate from if-uses is (nominally) more expensive. It turns out virtually
every caller of nir_foreach_if_use(_safe) also calls nir_foreach_use(_safe)
immediately before, so we rewrite most of the callers to instead call a new
single `nir_foreach_use_including_if(_safe)` which predicates the logic based on
`src->is_if`. This should mitigate the performance difference.
There's a bit of churn, but this is largely a mechanical set of changes.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22343>
This pass rarely makes any changes, so work a little harder to preserve
more meta data.
On my Ice Lake laptop (using a locked CPU speed and other measures to
prevent thermal throttling, etc.) using a debugoptimized build, improves
performance of Vulkan CTS "deqp-vk --deqp-case='dEQP-VK.*spir*'" by
-0.2% ± 0.1% (n = 5, pooled s = 0.431885).
v2: Add some parenthesis. Suggested by Lionel.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22299>
Two linked list management changes:
- Use the list head sentinel as the initial cursor. It is, after all, a
proper node in the list.
- Iterate the list of blocks starting with the second block instead of
skipping the first block in the loop.
On my Ice Lake laptop (using a locked CPU speed and other measures to
prevent thermal throttling, etc.) using a release build, improves
performance of compiling shaders from batman_arkham_city_goty.foz by
-0.24% ± 0.09% (n = 5, pooled s = 0.324106).
v2: Use nir_cursor instead of direct list manipultion. Suggested by
Lionel.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22299>
Reduce debug log spam by only logging the shader if a pass made some
changes. This can also elide some nir_validate calls in debug builds.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22299>
Ensure that the register's liveness is not expanded to loops.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21853>
Those SENDs are still doing a full register write. We just inserted
some predication for a workaround.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21853>
There are a number of instances of the dead code elimination pass that
could reduce the count. For some reason this also seems to affect
register allocation itself.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21853>
Most tools looking at shader stats assume that there is only a single
resulting binary shader out of a single input. On Intel HW this is not
always the case. So having a statistic on each variant that reports
the maximum dispatch width helps showing improvement on a single
shader in terms of how large we manage to compile it.
For shaders that can be compiled in multiple SIMD width (like fragment
shaders), this will report the maximum dispatch width in the
statistics of each variants.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22014>
For instance, to load uniform data with the LSC we usually rely on
tranpose messages which have to execute in SIMD1. Those end up being
considered as partial writes so within loops their life span spread to
the whole loop, increasing register pressure.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21867>