A next patch will emit more instructions in video and copy queues
for Gfx 200 and newer but the current code only creates anv_async_submit
if device has aux_map.
Instead we can always create anv_async_submit and only submit it to
hardware if any instruction was emited.
Fixes: 86813c60a4 ("mi-builder: add read/write memory fencing support on Gfx20+")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32680>
The CI_JOB_TIMEOUT variable is the GitLab-defined job timeout in
seconds.
Use this variable in LAVA instead of the separate JOB_TIMEOUT,
which was intended to represent the test phase timeout (job timeout
minus 5 minutes), but was often overlooked.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32609>
Remove bogus failures caused by wrong GPU_VERSION configuration,
delete tests that no longer exist in current CTS versions, and
update expectations.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32681>
Thanks to the speedup achieved by increasing tests_per_group,
nightly jobs are now within reasonable time limits, allowing them
to be re-enabled.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32681>
Due to the slow startup time of deqp-vk, the previous default of
500 tests per group caused the jobs to run up to twice as slowly
compared to using a higher number of tests per group.
Increase the number of tests per group for all subsets of the
deqp-runner suites, which allows decreasing the fractions.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32681>
Due to the slow startup time of deqp-vk, the previous default of
500 tests per group caused the jobs to run up to twice as slowly
compared to using a higher number of tests per group.
Increase the number of tests per group for all subsets of the
deqp-runner suites, which allows decreasing the fractions.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32681>
Due to the slow startup time of deqp-vk, the previous default of
500 tests per group caused the jobs to run up to twice as slowly
compared to using a higher number of tests per group.
Increase the number of tests per group for all subsets of the
deqp-runner suites, which allows decreasing the fractions.
Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32681>
Modified the BVH dumping mechanism to synchronously wait for the command
buffer to complete before saving BVH data to files. This approach is
more robust compared to the previous method of dumping during
acceleration strucutre destruction.
Note: if DEBUG_BVH_ANY is enabled but intel-rt is disabled, we will wait
for nothing.
Signed-off-by: Kevin Chuang <kaiwenjon23@gmail.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32585>
The current python-test job creates and compresses python related
artifacts for use by future jobs. The artifacts are currently named
`mesa-python-test` which is somewhat misleading because they are not
needed for testing python scripts or libraries.
Rename the artifacts generated by the python-test job to be more
descriptive of their purpose.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32340>
According to HSD 14016252163 if compute shader uses the sample
operation, morton walk order and set the thread group batch size to 4 is
expected to increase sampler cache hit rates by increasing sample
address locality within a subslice.
Rework:
* Caio: "||" => "&&" for type checking in instr_uses_sampler()
* Jordan: Use nir's foreach macros rather than
nir_shader_lower_instructions()
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32430>
For PTL, we can have one more additional walk order along with the
"Thread Group Batch Size" field.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32430>
Looks like this one got added accidently for Xe2. Xe2 doesn't support
Morton dispatch walk order.
Thanks to Rohan for bringing up this during review.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Rohan Garg <rohan.garg@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32430>
And the SEND gather variant that uses a scalar register as its only
source.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32236>
SEND instructions of gather variant will use the upcoming ARF scalar
register. They use only Src0 and reuse the bits of Src1.Length (part of
ex_desc). Src1.Length is (implicitly) defined as 0.
Adapt the helper functions to take the new variant into account when
manipulating ex_desc.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32236>
There are a lot of restrictions for bfloat16. The one that prevents this
very useful optimization from being possible is, "Broadcast of bfloat16
scalar is not supported."
Part of the reason this MR exists is to build up to implementing BF
support, and there are a couple more commits that implement
this. However, it fails on both real hardware and simulation:
Instruction is: mad (8|M0) r6.0<1>:f 0xBF80:bf r2.0<8;1>:f r64.0<0>:f
In bfloat/float mixed mode, bfloat src must be packed.
Alas.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
These can't mix with F values, but if the non-constant sources are
already HF, this is allowed in src0.
No shader-db changes on any Intel platform.
fossil-db:
Ice Lake
Totals:
Instrs: 236027458 -> 236027442 (-0.00%)
Cycle count: 24515944704 -> 24515945379 (+0.00%)
Totals from 8 (0.00% of 798454) affected shaders:
Instrs: 10226 -> 10210 (-0.16%)
Cycle count: 58567 -> 59242 (+1.15%)
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
Nothing can generate this currently, but a future commit will.
The Bspec and experimentation support the following limitations:
- Gfx11: Either src0 or src2 can be W or UW.
- Gfx12: Either src0 or src2 can be W or UW.
- Gfx12.5: Both src0 and src2 can be W or UW.
- Gfx20: Both src0 and src2 can be W or UW.
v2: Add missing break statement.
v3: Leave the MAD handling in the case with the other 3 source
instructions. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
This is very unlikely for floating point MAD. At some point I intend
to add internal integer MAD uses, and this could occur there.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
v2: Move the full constant folding part to
brw_constant_fold_instruction. Suggested by Caio. I did this by
extracting the core part of the folding to a helper function.
v3: Delete stale comment. Noticed by Caio.
shader-db:
All Intel platforms had similar results. (Lunar Lake shown)
total instructions in shared programs: 18090847 -> 18090843 (<.01%)
instructions in affected programs: 150 -> 146 (-2.67%)
helped: 1 / HURT: 0
total cycles in shared programs: 919664648 -> 919663210 (<.01%)
cycles in affected programs: 3426 -> 1988 (-41.97%)
helped: 1 / HURT: 0
LOST: 1
GAINED: 0
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Instrs: 220496486 -> 220496403 (-0.00%)
Cycle count: 31610880908 -> 31610879044 (-0.00%); split: -0.00%, +0.00%
Totals from 70 (0.01% of 702439) affected shaders:
Instrs: 47018 -> 46935 (-0.18%)
Cycle count: 6335504 -> 6333640 (-0.03%); split: -0.11%, +0.09%
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
This enables constant combining to do its job.
v2: Restore accidentally deleted line from a comment. Noticed by Caio.
shader-db:
All Intel platforms had similar results. (Lunar Lake shown)
total cycles in shared programs: 919668392 -> 919669310 (<.01%)
cycles in affected programs: 10125264 -> 10126182 (<.01%)
helped: 348 / HURT: 194
fossil-db:
All Intel platforms had similar results. (Lunar Lake shown)
Totals:
Cycle count: 31610720660 -> 31610692748 (-0.00%); split: -0.00%, +0.00%
Totals from 9066 (1.29% of 702433) affected shaders:
Cycle count: 810411934 -> 810384022 (-0.00%); split: -0.01%, +0.00%
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
No shader-db changes on any Intel platform.
fossil-db:
Meteor Lake, DG2, Tiger Lake, and Ice Lake had similar results. (Meteor Lake shown)
Totals:
Cycle count: 25096109024 -> 25096108722 (-0.00%); split: -0.00%, +0.00%
Totals from 4106 (0.51% of 797610) affected shaders:
Cycle count: 63266176 -> 63265874 (-0.00%); split: -0.01%, +0.01%
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
Always propagate into any source. Let commute_immedates and constant
combining sort out the mess. It's literally their job.
No shader-db changes on any Intel platform. The fossil-db changes just
appear to be subtle changes in register allocation if the immediate
source changes from src0 to src2.
v2: Update the comment in commute_immediates. Suggested by Caio.
fossil-db:
Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Cycle count: 31610720510 -> 31610720660 (+0.00%); split: -0.00%, +0.00%
Totals from 8 (0.00% of 702433) affected shaders:
Cycle count: 5522382 -> 5522532 (+0.00%); split: -0.00%, +0.00%
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
No shader-db or fossil-db changes on any Intel platform. This commit
just prevents issues with a later commit, "brw/copy: Don't try to be
clever about ADD3 constant propagation."
v2: Use 'can_promote = true; break;' instead of 'return
true;'. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
Fold the cases where one of the sources is zero or two of the sources
are constants. Both case will result in a regular ADD.
No shader-db or fossil-db changes on any Intel platform. This commit
just prevents issues with a later commit, "brw/copy: Don't try to be
clever about ADD3 constant propagation."
v2: Move the full constant folding part to
brw_constant_fold_instruction. Suggested by Caio.
v3: Eliminate the impossible src.file == BAD_FILE case in
brw_fs_opt_algebraic. Suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
The current assertion fails as soon as a MAD with src0 and src2 being
immediate is detected.
The assertion was supposted to catch, "If it's ADD3, only one of src0
and src2 can be immediate." The detect this, the opcode test should have
been !=.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: c1c09e3c4a ("brw/emit: Add correct 3-source instruction assertions for each platform")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
Some callers of brw_constant_fold_instruction depend on the result being
a MOV of immediate when progress is made. Previously `MUL dst:D src0:D
1:D` would be converted to `MOV dst:D src0:D`. There was also no
handling for `MUL dst:D imm0:D imm1:D`.
This could cause problems if one of the immedate values was -1. The
existing code would convert this to a `MOV dst:D imm0:D` and set the
negate flag on src0. That is not correct.
v2: Fix the is_negative_one case handling of the non-negative-one
source. Add a comment explaining the assertion. Both suggested by Caio.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 2cc1575a31 ("brw/algebraic: Refactor constant folding out of brw_fs_opt_algebraic")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
Some callers of brw_constant_fold_instruction depend on the result being
a MOV of immediate when progress is made. Previously `ADD dst:D src0:D
0:D` would be converted to `MOV dst:D src0:D`. There was also no
handling for `ADD dst:D imm0:D imm1:D`.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 2cc1575a31 ("brw/algebraic: Refactor constant folding out of brw_fs_opt_algebraic")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32436>
Almost all cases now handled with default arguments. The only real
extra work that was being done was pushed to the client code in
debug_optimizer().
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32596>
In this case, num_sources is bigger than this->sources, so if we loop
up to num_sources (instead of this->sources) we'll end up reading past
the end of old_src[]. Only copy up to what we originally had.
This was found by code inspection, I'm not aware of any applications
failing due to the lack of this patch.
Fixes: d9e737212d ("intel/brw: Add a src array for the common case in fs_inst")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32600>
Borderlands 3 (both DX11 and DX12 renderers) have a common pattern
across many shaders:
con 32x4 %510 = (uint32)txf %2 (handle), %1191 (0x10) (coord), %1 (0x0) (lod), 0 (texture)
con 32x4 %512 = (uint32)txf %2 (handle), %1511 (0x11) (coord), %1 (0x0) (lod), 0 (texture)
...
con 32x4 %550 = (uint32)txf %2 (handle), %1549 (0x25) (coord), %1 (0x0) (lod), 0 (texture)
con 32x4 %552 = (uint32)txf %2 (handle), %1551 (0x26) (coord), %1 (0x0) (lod), 0 (texture)
A single basic block contains piles of texelFetches from a 1D buffer
texture, with constant coordinates. In most cases, only the .x channel
of the result is read. So we have something on the order of 28 sampler
messages, each asking for...a single uint32_t scalar value. Because our
sampler doesn't have any support for convergent block loads (like the
untyped LSC transpose messages for SSBOs)...this means we were emitting
SIMD8/16 (or SIMD16/32 on Xe2) sampler messages for every single scalar,
replicating what's effectively a SIMD1 value to the entire register.
This is hugely wasteful, both in terms of register pressure, and also in
back-and-forth sending and receiving memory messages.
The good news is we can take advantage of our explicit SIMD model to
handle this more efficiently. This patch adds a new optimization pass
that detects a series of SHADER_OPCODE_TXF_LOGICAL, in the same basic
block, with constant offsets, from the same texture. It constructs a
new divergent coordinate where each channel is one of the constants
(i.e <10, 11, 12, ..., 26> in the above example). It issues a new
NoMask divergent texel fetch which loads N useful channels in one go,
and replaces the rest with expansion MOVs that splat the SIMD1 result
back to the full SIMD width. (These get copy propagated away.)
We can pick the SIMD size of the load independently of the native shader
width as well. On Xe2, those 28 convergent loads become a single SIMD32
ld message. On earlier hardware, we use 2 SIMD16 messages. Or we can
use a smaller size when there aren't many to combine.
In fossil-db, this cuts 27% of send messages in affected shaders, 3-6%
of cycles, 2-3% of instructions, and 8-12% of live registers. On A770,
this improves performance of Borderlands 3 by roughly 2.5-3.5%.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32573>
This updates entry for 14017823839 which fixes issues on BMG with:
dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.1
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32550>