This commit adds a new debug option to dump the parent-child relationship
map using INTEL_DEBUG=bvh_pcrel_map.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Felix DeGrood <felix.j.degrood@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39617>
Track where each leaf_id is encoded in the final BVH.
It's a map of leaf_id -> final_bvh_offset. This will help us navigate
the BVH layout in the update pass.
The leaf block offset map gives us: leaf id -> bvh block,
and the parent-child map can be used for: bvh block -> parent offset.
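A minimal sketch of how the two maps could be chained in the update pass. The struct, the array names, and the lookup_parent_of_leaf() helper are hypothetical CPU-side stand-ins that only illustrate the lookup order; the real maps and their layout are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two maps described above. */
struct bvh_maps {
   const uint32_t *leaf_block_offset; /* indexed by leaf id -> bvh block */
   const uint32_t *parent_of_block;   /* indexed by bvh block -> parent offset */
};

/* Resolve a leaf id to the offset of the node that owns it. */
static uint32_t
lookup_parent_of_leaf(const struct bvh_maps *maps, uint32_t leaf_id)
{
   uint32_t block = maps->leaf_block_offset[leaf_id];
   return maps->parent_of_block[block];
}
```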
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Felix DeGrood <felix.j.degrood@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39617>
This map stores, for each child, the BVH offset of its parent. This will
help us walk the BVH layout later in the update pass.
Since we are tracking block indices, even with a BVH as large as 2^32 bytes
we have at most 2^26 indices (each block is 64B wide). That leaves us 6 bits
in which we can track child slot index occupancy in the parent.
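The bit budget above can be sketched as follows. The field order (block index in the low 26 bits, slot field in the top 6) and all helper names are assumptions made for illustration; only the 64B block size and the 26/6 split come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define BVH_BLOCK_SIZE   64u                      /* bytes per block */
#define BLOCK_INDEX_BITS 26u                      /* 2^32 / 64 = 2^26 indices */
#define BLOCK_INDEX_MASK ((1u << BLOCK_INDEX_BITS) - 1u)

/* Assumed layout: block index in bits 0..25, child slot field in bits 26..31. */
static uint32_t
pack_parent_entry(uint32_t parent_byte_offset, uint32_t child_slot)
{
   uint32_t block_index = parent_byte_offset / BVH_BLOCK_SIZE;
   return (child_slot << BLOCK_INDEX_BITS) | (block_index & BLOCK_INDEX_MASK);
}

/* Recover the parent's byte offset from a packed entry. */
static uint32_t
unpack_parent_offset(uint32_t entry)
{
   return (entry & BLOCK_INDEX_MASK) * BVH_BLOCK_SIZE;
}

/* Recover the 6-bit child slot field from a packed entry. */
static uint32_t
unpack_child_slot(uint32_t entry)
{
   return entry >> BLOCK_INDEX_BITS;
}
```

Even the last block of a 2^32-byte BVH (byte offset 0xFFFFFFC0) maps to index 2^26 - 1, so the slot field never collides with the index.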
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Felix DeGrood <felix.j.degrood@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39617>
Extract leaf encoding into encode.h and move some of the helpers into
anv_build_helper.h.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Felix DeGrood <felix.j.degrood@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39617>
Previously, we were also counting invalid nodes in the child block
count, which inserted holes in the BVH memory.
These holes in memory would trigger HW traversal hangs.
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Iván Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40858>
When a base is larger than the supported [min, max] bounds, we were
clamping the base to that range, and adding the rest. This works,
but it leaves us with a bunch of loads/stores with the same maximum
base, and different iadds for addresses. This isn't ideal, because
it means that every access has a different iadd.
Instead, flip it around: now we calculate the largest multiple of
(max + 1) which does not exceed base, and iadd that. Then the new base
becomes the remaining portion, which is guaranteed to be <= max.
With that, all loads/stores within a maximum-offset window share a
common iadd which can be CSE'd, and use the immediate offset field
for small deltas from there.
Note that this should work for negative offsets beyond the minimum
too; we do calculate a larger negative addition and then flip to
positive immediate offsets.
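A minimal sketch of the split, assuming a hypothetical split_base() helper and an immediate field whose valid range includes [0, max]. The real pass only rewrites bases outside the supported bounds; this sketch applies the split unconditionally to keep the arithmetic visible.

```c
#include <assert.h>
#include <stdint.h>

struct split_offset {
   int64_t add; /* shared amount folded into the address with an iadd */
   int64_t imm; /* remainder that goes in the immediate offset field */
};

/* Split base so that add is a multiple of (max + 1) and 0 <= imm <= max. */
static struct split_offset
split_base(int64_t base, int64_t max)
{
   int64_t window = max + 1;
   int64_t rem = base % window;
   if (rem < 0)
      rem += window; /* C's % truncates toward zero; we want floor */
   struct split_offset s = { base - rem, rem };
   return s;
}
```

With a made-up max of 1023, bases 2050 and 3000 both get add = 2048, so the iadd is shared, and a base of -5 becomes add = -1024 with a positive immediate of 1019.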
Cuts 11% of instructions from the first compute shader of
dEQP-VK.ray_query.builtin.rayqueryterminate.comp.aabbs.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42237>
These take a base offset that we can plug into the LSC extended
descriptor immediate. This is essentially the same improvement that we
made by switching to the ssbo_intel intrinsics.
Eliminates spilling in dEQP-VK.ray_query.builtin.rayqueryterminate.comp.aabbs
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42237>
The Anv driver doesn't ever set opts->softfp64 for the preprocess
stage (anv_shader_preprocess_nir()). The Vulkan preprocess stage is a
"physical device" stage, and softfp64 requires the actual anv_device:
see the comments for the preprocess_nir function pointer inside the
definition of struct vk_device_shader_ops, and the definition of
anv_ensure_fp64_shader().
It is only during anv_shader_compile() that we call
anv_ensure_fp64_shader(), where we actually build and store the
nir_shader we name fp64_nir. Then we have everything ready and we can
call the nir_lower_doubles pass.
To account for all that, just have brw check if opts->softfp64 is
actually set, and disable the full_software lowering if we don't have
it: otherwise we'll either segfault or hit the assert(softfp64) that
is in lower_doubles_instr_to_soft() in nir_lower_double_ops.c.
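The guard can be sketched like this. The enum, struct, and pick_lower_doubles_flags() helper are hypothetical stand-ins for the NIR and brw types, written only to show the shape of the check, not the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real options live in NIR and brw. */
enum lower_doubles_flags {
   LOWER_FP64_BASIC         = 1 << 0,
   LOWER_FP64_FULL_SOFTWARE = 1 << 1,
};

struct compile_opts {
   const void *softfp64; /* NULL while we are still in the preprocess stage */
};

/* Drop full software lowering when the softfp64 shader is not available. */
static unsigned
pick_lower_doubles_flags(const struct compile_opts *opts, unsigned wanted)
{
   if (opts->softfp64 == NULL)
      wanted &= ~(unsigned)LOWER_FP64_FULL_SOFTWARE;
   return wanted;
}
```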
This prevents a segfault (or an assertion failure when in debug mode)
when running DIRT 5 on Tiger Lake.
Fixes: 7d3b62e13d ("anv: only load fp64 software shader when needed")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42105>
If the descriptor is allowed to be non-uniform, we don't have to
force helpers to keep it uniform.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42232>
Previously we only did this for loads, but the same logic applies here too.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42232>
We might as well make sure that those backends don't break on
future use. At least jay will probably use this pass.
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42232>
A future GPU will have a larger sampler state, so make the necessary
adjustments to support a sampler state size that is only known at run time.
For now ANV_SAMPLER_STATE_GPU_SIZE does a dumb check on device, because without
it the compiler complains that device is unused.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42023>
This issue shows up in a couple of places, but here is the main problem:
ANV_SAMPLER_STATE_SIZE is 32 bytes long (no idea why), but SAMPLER_STATE in GPU
is 16 bytes long.
anv_sampler_state::state and anv_sampler_state::state_no_bc have 16 bytes of
storage, but in some places we do a memcpy of ANV_SAMPLER_STATE_SIZE bytes, like
in anv_GetDescriptorEXT():
memcpy(pDescriptor, sampler->state.state[0], ANV_SAMPLER_STATE_SIZE);
So let's replace the magic numbers with macros, give the CPU data
ANV_SAMPLER_STATE_SIZE bytes of storage, and copy only the exact size the GPU
expects, ANV_SAMPLER_STATE_GPU_SIZE, when copying to the GPU.
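A minimal sketch of the intended split between CPU storage and GPU copy size. The two sizes come from the text above; the struct and the write_sampler_descriptor() helper are hypothetical and do not mirror the driver's actual types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values from the text; redefined here only to keep the sketch standalone. */
#define ANV_SAMPLER_STATE_SIZE     32 /* CPU-side storage */
#define ANV_SAMPLER_STATE_GPU_SIZE 16 /* what the hardware SAMPLER_STATE uses */

/* Hypothetical CPU-side copy of a sampler state. */
struct sampler_cpu_state {
   uint8_t data[ANV_SAMPLER_STATE_SIZE];
};

/* Copy only the bytes the GPU expects, never the full CPU-side size. */
static void
write_sampler_descriptor(void *gpu_dst, const struct sampler_cpu_state *src)
{
   memcpy(gpu_dst, src->data, ANV_SAMPLER_STATE_GPU_SIZE);
}
```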
Cc: stable
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42023>
Cleans up the final halt in
dEQP-VK.rasterization.frag_side_effects.color_at_beginning.terminate_invocation
with the terminate lowering.
O(1) for the function so that's pretty good.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42219>
* implement terminate
* fix HALT brokenness on all shader stages (we need a real end block)
* optimize demote codegen a ton
* optimize gl_HelperInvocation/gl_SampleMask
* optimize "all lanes demoted" via HALT.any
* optimize scheduling of stores/atomics/demotes in FS
* optimize some texturing with helper invocations
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
Even if nothing is being written, we still need to avoid generating
fragments for occlusion query purposes.
Fixes dEQP-GLES31.functional.fbo.no_attachments.maximums.height as well
as misrendering in Baldur's Gate 3 and Elder Scrolls Online.
Fixes: e7cfcf41f4 ("jay: Ignore RT store condition if there are no outputs")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
In SIMD16, map acc2/acc3 as extra GPRs. This gets us a pressure reduction. We
leave acc0/acc1 reserved for mul_32 lowering and for parallel copy lowering;
changing this would be very challenging due to the possibility of SIMD1
multiplies leading to uniform access on the accumulator => stuff blows up. But
this is an easy win on select platforms.
Note we still use acc2/acc3 for post-RA accumulator substitution, this just lets
us also use them as panic registers.
SIMD16:
Totals from 784 (29.62% of 2647) affected shaders:
Instrs: 1686724 -> 1686700 (-0.00%); split: -0.15%, +0.15%
CodeSize: 23406952 -> 23409432 (+0.01%); split: -0.16%, +0.17%
Number of spill instructions: 224 -> 174 (-22.32%)
Number of fill instructions: 546 -> 382 (-30.04%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
I did this for AGX for some Blender shader but apparently it's not doing all
that much for Jay!
SIMD16:
Totals from 0 (0.00% of 2647) affected shaders:
SIMD32:
Totals from 8 (0.30% of 2647) affected shaders:
Instrs: 29566 -> 29255 (-1.05%); split: -1.08%, +0.03%
CodeSize: 432672 -> 427408 (-1.22%); split: -1.24%, +0.02%
Number of spill instructions: 799 -> 658 (-17.65%)
Number of fill instructions: 1010 -> 906 (-10.30%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
Braun-Hack has a complex algorithm to insert spills on-demand when pressure
exceeds the limit, with fix ups along control flow to ensure we spill along each
execution path. Faith implemented a slightly different version in NAK which is
what Jay did, with some nonobvious tradeoffs between the two.
But actually... why are we doing this at all?
We can alternatively spill immediately after the definition, which guarantees
(by the usual properties of dominance) that the spill executes before any
reloads. Then we don't need any tricky bookkeeping or control flow fixups.
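A minimal sketch of the placement rule on a flat instruction list. The instr struct, the op enum, and spill_after_defs() are hypothetical; the real spiller works on the backend IR with control flow and also inserts fills, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op { OP_DEF, OP_USE, OP_SPILL };

struct instr {
   enum op op;
   int value; /* SSA value that is defined, used, or spilled */
};

/* Copy in[] to out[], inserting a spill right after every definition of a
 * value marked in spilled[]. out[] needs room for 2 * n entries.
 * Returns the number of instructions written.
 */
static size_t
spill_after_defs(const struct instr *in, size_t n, const bool *spilled,
                 struct instr *out)
{
   size_t w = 0;
   for (size_t i = 0; i < n; i++) {
      out[w++] = in[i];
      if (in[i].op == OP_DEF && spilled[in[i].value])
         out[w++] = (struct instr){ OP_SPILL, in[i].value };
   }
   return w;
}
```

Because the spill sits directly after the definition, it dominates every later use of the value, so no per-path fixups are needed.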
Beyond the simplification, this has a couple advantages:
* Lower register pressure throughout more of the program. This doesn't affect
the /amount/ of spilling, but it gives RA more freedom so should reduce
shuffling. This in itself probably justifies doing this.
* Less SSA repair needed around memory definitions, which can reduce memory
traffic due to our naive memory definition handling. (I think if we
implemented Braun-Hack properly with CSSA this would be less of a concern, but
whatever.)
* Less reliance on the "no critical edges" property which will come in handy for
UGPR spilling, this is a yak shave for that.
* No risk of executing the same spill twice due to divergence (if we spill
inside of a divergent IF). This means this commit is probably a better idea
for GPUs than CPUs in practice.
This also has a couple of disadvantages explaining why the paper didn't do this:
* If a value only needs to be spilled/filled in conditional control flow, this
executes extra spills. But, spills are cheaper than fills (they burn bandwidth
but have basically no latency since they're stores), so I'm not super
concerned about this corner case.
* If a value is defined in a loop and needs to be spilled due to a use outside
the loop, it spills N times instead of 1. This is a more compelling reason to do
the paper's thing. But we demand the input program is LCSSA so this shouldn't
actually happen for us.
* The spill stalls on the definition value. That's probably not a big deal and
the eventual post-RA scheduler should be able to cope.
Overall I think this is a reasonable direction. We can revisit later but I don't
want to add more complexity to the spiller than absolutely necessary, and it's
about to be necessary to add complexity for UGPRs.
SIMD16:
Totals from 17 (0.64% of 2647) affected shaders:
Instrs: 250304 -> 249221 (-0.43%); split: -0.44%, +0.01%
CodeSize: 3476640 -> 3461312 (-0.44%); split: -0.45%, +0.01%
Number of spill instructions: 555 -> 223 (-59.82%)
Number of fill instructions: 551 -> 543 (-1.45%)
SIMD32:
Totals from 420 (15.87% of 2647) affected shaders:
Instrs: 1779193 -> 1698683 (-4.53%); split: -4.53%, +0.01%
CodeSize: 25455456 -> 24198416 (-4.94%); split: -4.95%, +0.01%
Number of spill instructions: 36900 -> 14440 (-60.87%)
Number of fill instructions: 36550 -> 35103 (-3.96%)
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42097>
This saves instructions on Jay. We probably could teach backend predication to
chew thru the mess, but I don't see a reason not to just do this everywhere.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Acked-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42197>
Right now demote_if only works with 32-bit registers, but in NIR it can
also have 16-bit sources, and we have a couple of bugs with those. The first is
in NIR->BIR translation (the h0 swizzle was not set), the second is in the
discard_b32 to discard_f32 lowering (bifrost has restrictions).
Signed-off-by: Lorenzo Rossi <lorenzo.rossi@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@intel.com>
Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42197>
Our get_buffer/image_memory_requirements() pad TRANSFER_SRC resources
with V3D_TFU_READAHEAD_SIZE, so allocating the reported requirements of
a resource of exactly maxMemoryAllocationSize failed with
VK_ERROR_OUT_OF_DEVICE_MEMORY.
Accept up to one extra page over the limit: since the allocation size
is page-aligned, that covers any sub-page padding.
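A minimal sketch of the relaxed check, assuming a 4096-byte page and a page-aligned limit. The helper names and the page size constant are hypothetical and do not mirror the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size */

/* Round a size up to the next page boundary. */
static uint64_t
align_to_page(uint64_t size)
{
   return (size + PAGE_SIZE_BYTES - 1) & ~(uint64_t)(PAGE_SIZE_BYTES - 1);
}

/* Accept page-aligned sizes up to one page beyond the advertised limit. */
static bool
allocation_size_ok(uint64_t requested, uint64_t max_allocation_size)
{
   return align_to_page(requested) <= max_allocation_size + PAGE_SIZE_BYTES;
}
```

With a page-aligned limit, any padding smaller than a page rounds up to exactly one extra page, so a resource of the maximum size plus readahead padding is accepted while anything larger is still rejected.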
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42179>
This prevents holding open file descriptors after physical devices
are enumerated. This also prevents potential (and unknown) multithreading
issues with the winsys being shared between more than one logical device.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41824>
The winsys will be moved to logical devices. This creates an
ac_drm_device on-demand to make the call faster because otherwise it's
too slow for a function that can be called every frame.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41824>