Between this and the previous few commits, the Box<> has moved from
outside of Instr to inside the Op enum. This provides a few benefits:
1. We no longer need to allocate for the worst-case Op on every
instruction. For example, OpIAdd3X is 232 bytes and OpBra is 40
bytes, which means we can save some memory on OpBra if we only
allocate those 40 bytes.
2. The Op discriminant is available without chasing a pointer. The type
of op is probably the most frequently used field, and this should
benefit most passes that care about what type of Op they're
handling.
3. Small Ops don't need any indirection at all. For example, OpPBk is
only 4 bytes, which means we can just store it directly in less
space than a pointer.
Compared to Box<Instr>, this is around a 1.4% shader compile time
improvement.
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Seán de Búrca <leftmostcat@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37315>
We can skip creating an &dyn and instead use a match to dispatch into
the inner type's functions.
Acked-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Seán de Búrca <leftmostcat@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37315>
This reduces the size of ir::Src from 40 bytes down to 32 bytes. This
makes the size of ir::Op fall from 272 bytes down to 232 bytes, meaning
we save 40 bytes per instruction.
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130>
Nothing actually wants to mix register files in a SSARef so in practice
no callers really handled the None return case. Panic on that case
instead.
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37130>
The sampled and color attachment bits don't actually affect image layout
in any meaningful way. They just cause us to create extra descriptors
in cases where we may not need them. However, now that meta always sets
view usage, we always create the usages meta needs, even if the client
doesn't request them.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36957>
It alwways comes in through the create flags now.
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36957>
The shared memory settings in the QMD affect occupancy as the hardware
needs to manage the available shared memory across all workgroups.
We should set the target to the amount of shared memory used by all the
blocks that can run concurrently taking GPR usage and the local size into
account.
E.g. a shader using 88 gprs, 256 threads and a shared memory size of 18944
can have 2 blocks running concurrently, therefore on an Ampere we need to
set the target to 64kB to properly utilize the hardware.
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135>
It doesn't change the reported values, but it will allow us to easily
advertise real hardware limits in the future.
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135>
We can allocate more than 48k of shared memory, but the limits differ
across hardware, so we need to take it all into account to create the
shared memory splits the hardware can accept.
This does change behavior on Turing, but the assumption is, that the
hardware has simply rounded up. Might need performance testing on Turing
to verify nothing regresses here.
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135>
It's a bit of a disaster, but each generation supports a different set of
shared memory configurations.
Knowing the maximum is important for compute shader performance, knowing
all the legal sizes for QMD generation.
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135>
On hw we have up to 228k of available Shared memory so a 16 bit int isn't
enough for that.
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135>
Now that we're guaranteed the upper 32 bits are zero initialized,
there's no reason we need to do a 64-bit write here.
This is a 0.3% performance improvement on the Sascha Willems
conditionalrender demo with all rendering disabled (638 fps -> 640 fps)
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37187>
We can initialize this just once from the CPU side instead of
overwriting it each time using the copy engine.
This is a 5% performance improvement on the Sascha Willems
conditionalrender demo with all rendering disabled (607 fps -> 638 fps)
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37187>
Within a single command buffer, we know that our operations will happen
sequentially so we don't need to allocate a unique address per
vkCmdBeginConditionalRenderingEXT - we can re-use the same address
instead.
Improves perf on the Sascha Willems conditionalrender demo with all
rendering disabled by about 2% (595 fps -> 607 fps)
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37187>
This is a 41% performance improvement on the Sascha Willems
conditionalrender demo with all rendering disabled (422 fps -> 595 fps)
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37187>
The hardware actually compares a pair of 64-bit values, rather than
comparing a single value against zero like we previously assumed.
This wasn't an issue in most cases before because if the buffer is
zero-initialized the previous code happens to work. If we get a
buffer with garbage in it though we would run into issues.
Fixes: 80eac1337d ("nvk: Always copy conditional rendering value before compare")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13821
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37153>
These buffers get recycled, so we can't rely on their contents.
Unfortunately, conditional rendering was relying on these buffers being
all zero, implicitly, because we didn't fully understand the hardware.
By supporting this debug flag here, we add a way to check for code that
accidentally assumes zero-initialization.
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37153>
nvk_cmd_pool_free_gart_mem_list frees this buffer, so we need to clear
the pointers to it in order to avoid a use after free.
Fixes: 07c70c77de ("nvk: add cond render upload buffer.")
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Reviewed-by: Mary Guillemard <mary@mary.zone>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37153>
My first attempt to fix it ended up stomping classes to zero because it
happened too eary. Now it happens after we have the whole method parsed
but before we go searching for strings.
Fixes: 8e93a763a3 ("nouveau/push: Handle more recent versions of 6F")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37180>
With the existing union setup, only the first 8 bytes are initialized
properly for UBOs, yet the UBO size is 16, and all 16 bytes are copied
to applications. This leads to broken capture-replay since the
descriptor payload is no longer invariant.
Fix this by ensuring all union members are 16 bytes, which then get
properly initialized with the designated initializers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Fixes: 8b5835af31 ("nvk: Use bindless cbufs on Turing+")
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37053>
This was copied+pasted from the old GL driver but we never use any of
those PTE kinds since they're only for compressed depth pre-Turing.
We've also never provied that this is actually good for anything. Just
delete it for now.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37094>
The original lower_phis_to_regs_block() is a little too clever. It
crawls up the predecessor tree until it finds a cross edge and places
the register writes as deep as it can. This breaks nak_nir_lower_cf().
Say you have a shader like...
con %0 = load_uniform()
con loop {
if div {
} else {
}
break;
}
con %1 = phi %0
The original lower_phis_to_regs_block() will turn it into
con %0 = load_uniform()
con %r = decl_reg();
con loop {
if div {
reg_store(%r, %0)
} else {
reg_store(%r, %0)
}
break;
}
con %1 = reg_load(%r)
We then convert it into unstructured control-flow and run regs_to_ssa()
to get our phis back, which lowers each of the registers we inserted to
a phi tree. When we try to recover divergence information on phis by
looking at their sources, this works fine if each source maps directly
to a reg_store() whic maps directly to a phi in the original IR.
However, because the reg_store() instructions are placed deeper, it may
introduce false divergence.
Switch to the simple version of nir_lower_phis_to_regs_block() which
places reg writes directly in phi predecessor blocks. We could probably
be more conservative and just avoid placing writes to uniform regs in
divergent control-flow but it's more robust to make the load/store_reg
intrinsics match the original phis directly.
This fixes some shaders in Horizon: Zero Dawn Remastered
Fixes: b013d54e4f ("nak/lower_cf: Flag phis as convergent when possible")
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36914>
Right now it tries to place reg_write instructions as far up the
predecessor chain as possible. This is useful for a bunch of the passes
that call it since it ensures they don't get placed in dead blocks or in
single successors and things like that. But it screws up NAK's control
flow lowering so we need the option to turn it off and make the pass
place the reg_write instructions in the most obvious place possible.
Fixes: b013d54e4f ("nak/lower_cf: Flag phis as convergent when possible")
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36914>
If implementation does not actually replay the VA, it must return 0
to not violate:
"If the memory object was allocated with a non-zero value of
opaqueCaptureAddress, the return value must be the same address."
Fixes RenderDoc capture replay, which asserts on the this spec rule
being followed.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Fixes: ed6d5c33 ("nvk: Implement VK_EXT/KHR_buffer_device_address")
Reviewed-by: Mohamed Ahmed <mohamedahmedegypt2001@gmail.com>
Closes#13784
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37047>