fdo-mirrors/mesa

mirror of https://gitlab.freedesktop.org/mesa/mesa.git synced 2025-12-21 15:50:11 +01:00

Author	SHA1	Message	Date
Caio Marcelo de Oliveira Filho	0edb58a84e	intel/fs: Clean up variable group size handling in backend Just use the information from NIR shader_info. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>	2020-05-01 12:50:28 -07:00
Francisco Jerez	6579f562c3	intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats. These should be more accurate than the current cycle counts, since among other things they consider the effect of post-scheduling passes like the software scoreboard on TGL. In addition it will enable us to clean up some of the now redundant cycle-count estimation functionality in the instruction scheduler. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:01:27 -07:00
Francisco Jerez	65342be3ae	intel/fs: Add INTEL_DEBUG=no32 debugging flag. This is useful in order to identify codegen issues caused by SIMD32. It doesn't currently have any effect on compute shaders since SIMD32 dispatch is only enabled for CS when it's strictly necessary to do so in order to support the workgroup size requested for the shader -- That might change in the future though when we hook up the SIMD32 heuristic to CS compilation. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:01:27 -07:00
Francisco Jerez	14f0a5cf64	intel/fs: Implement performance analysis-based SIMD32 heuristic for fragment shaders. The heuristic enables the SIMD32 fragment shader based on whether the IR performance modeling pass predicts it to have greater throughput than the SIMD16 and SIMD8 variants of the same shader. It would be straightforward to do the same thing in order to control whether SIMD16 dispatch is enabled, but it's pending additional performance evaluation. The INTEL_DEBUG=do32 option is left around in order to force the SIMD32 shader to be used regardless of the result of the heuristic, since it's useful as a debugging aid e.g. in order to identify SIMD32-specific codegen issues which may be masked by the SIMD32 heuristic, or cases where the heuristic is incorrectly disabling SIMD32 shaders that offer a performance advantage. Currently this is only enabled on Gen6+, since SIMD32 codegen support is incomplete on earlier platforms. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:01:27 -07:00
Francisco Jerez	d6aa0c261f	intel/fs: Heap-allocate fs_visitors in brw_compile_fs(). This makes brw_compile_fs() look a bit more similar to brw_compile_cs(). It saves us three v*_shader_stats local variables, and will save us additional triplicated declarations as we start tracking IR performance analysis results. The triplicated cfg pointers are left around because they're set to NULL to mark specific dispatch modes as disabled (e.g. in order to enforce hardware restrictions). Doing the same thing with the visitor pointers would cause data leaks. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:01:27 -07:00
Francisco Jerez	6310a05f68	intel/fs: Rename half() helpers to quarter(), allow index up to 3. Makes more sense considering SIMD32. Relaxing the assertion in brw_ir_fs.h will be required in order to avoid assertion failures on SNB with SIMD32 fragment shaders. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:00:29 -07:00
Francisco Jerez	e549e4f6c0	intel/fs/gen12: Fix Render Target Read header setup for new thread payload layout. In Gen12 the Poly 0 Info DWORD containing the Viewport Index and Render Target Index fields were moved from r0.0 to r1.1 in order to make room for dual-polygon dispatch. The render target message format was updated to expect that information in the same location, so we didn't need to make any changes for framebuffer fetch to work with SIMD8 and SIMD16 dispatch. Unfortunately that won't work with SIMD32, since the render target message header is assembled from r0 and r2 instead of r1, and the r2 thread payload wasn't updated with an additional copy of the same information. We need to fix things up manually instead. This avoids a handful of EXT_shader_framebuffer_fetch regressions in combination with SIMD32 fragment shaders. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:00:29 -07:00
Francisco Jerez	d6ae079771	intel/fs/gen12: Fix hangs with per-sample SIMD32 fragment shader dispatch. The Gen12 docs are rather contradictory regarding the dispatch configurations supported by the fragment shader -- The same table present in previous generations seems to imply that only one dispatch mode can be enabled when doing per-sample shading, but a restriction documented in the 3DSTATE_PS_BODY page implies the opposite: That SIMD32 can only be used in combination with some other dispatch mode. The latter seems to match the behavior of real hardware as I could tell from my testing: A bunch of multisample test-cases that do per-sample shading hang if we only provide a SIMD32 shader. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-04-28 23:00:28 -07:00
Dylan Baker	f8e4542bad	replace _mesa_logbase2 with util_logbase2 Reviewed-by: Marek Olšák <marek.olsak@amd.com> Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com> Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>	2020-04-21 11:09:03 -07:00
Jason Ekstrand	d0d039a4d3	anv: Emit pushed UBO bounds checking code in the back-end compiler This commit fixes performance regressions introduced by `e03f965280` in which we started bounds checking our push constants. This added a LOT of shader code to shaders which use the robustBufferAccess feature and led to substantial spilling. The checking we just added to the FS back-end is far more efficient for two reasons: 1. It can be done at a whole register granularity rather than per- scalar and so we emit one SIMD8 SEL per 32B GRF rather than one SIMD16 SEL (executed as two SELs) for each component loaded. 2. Because we do it with NoMask instructions, we can do it on whole pushed GRFs without splatting them out to SIMD8 or SIME16 values. This means that robust buffer access no longer explodes our register pressure for no good reason. As a tiny side-benefit, we're now using can use AND instead of SEL which means no need for the flag and better scheduling. Vulkan pipeline database results on ICL: Instructions in all programs: 293586059 -> 238009118 (-18.9%) SENDs in all programs: 13568515 -> 13568515 (+0.0%) Loops in all programs: 149720 -> 149720 (+0.0%) Cycles in all programs: 88499234498 -> 84348917496 (-4.7%) Spills in all programs: 1229018 -> 184339 (-85.0%) Fills in all programs: 1348397 -> 246061 (-81.8%) This also improves the performance of a few apps: - Shadow of the Tomb Raider: +4% - Witcher 3: +3.5% - UE4 Shooter demo: +2% Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4447>	2020-04-17 14:48:06 +00:00
Caio Marcelo de Oliveira Filho	db74ad0696	intel/compiler: Remove cs_prog_data->threads At this point all drivers are doing this math on their own -- since most of them need to cover the variable group size case, in which at compile time the group size (and number of threads) is not defined. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>	2020-04-09 19:23:20 -07:00
Plamena Manolova	c77dc51203	intel/compiler: Add support for variable workgroup size Add new builtin parameters that are used to keep track of the group size. This will be used to implement ARB_compute_variable_group_size. The compiler will use the maximum group size supported to pick a suitable SIMD variant. A later improvement will be to keep all SIMD variants (like FS) so the driver can select the best one at dispatch time. When variable workgroup size is used, the small workgroup optimization is disabled as it we can't prove at compile time that the barriers won't be needed. Extracted from original i965 patch with additional changes by Caio Marcelo de Oliveira Filho. Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>	2020-04-09 19:23:12 -07:00
Caio Marcelo de Oliveira Filho	c54fc0d07b	intel/compiler: Replace cs_prog_data->push.total with a helper The push.total field had three values but only one was directly used (size). Replace it with a helper function that explicitly takes the cs_prog_data and the number of threads -- and use that in the drivers. This is a preparation for ARB_compute_variable_group_size where the number of threads (hence the total size for push constants) is not defined at compile time (not cs_prog_data->threads). Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4504>	2020-04-09 19:23:12 -07:00
Caio Marcelo de Oliveira Filho	395de69b1f	intel/fs: Allow multiple slots for position Change brw_compute_vue_map() to also take the number of pos slots. If more than one slot is used, the VARYING_SLOT_POS is treated as an array. When using Primitive Replication, instead of a single position, the VUE must contain an array of positions. Padding might be necessary (after clip distance) to ensure rest of attributes start aligned. v2: Add note about array in the commit message and assert that pos_slots >= 1 to make clear 0 is invalid. (Jason) Move padding to be after the clip distance. v3: Apply the correct offset when gathering the sources from outputs. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> [v2] Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2313>	2020-04-07 17:16:09 +00:00
Juan A. Suarez Romero	460de2159e	intel/compiler: store the FS inputs in WM prog data Store the fragment shader inputs in the program data so we can use them later when required without needing the NIR shader. Cc: mesa-stable@lists.freedesktop.org Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Ivan Briano <ivan.briano@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/2010>	2020-04-01 23:36:28 +00:00
Ian Romanick	ba88e95187	intel/fs: Fix NULL destinations on 3-source instructions again after late DCE We considered moving this down near the call to insert_gen4_send_dependency_workarounds. By that point it's too late for a couple reasons. One, we're potentially increasing resiter pressure that may lead to anoter spill. Two, fixup_3src_null_dest tries to allocate a VGRF, but the post-register allocation shader uses physical registers. Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/2621 Fixes: `ba2fa1ceaf` ("intel/fs: Do cmod prop again after scheduling") Reviewed-by: Matt Turner <mattst88@gmail.com> Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4155> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4155>	2020-03-12 08:22:43 -07:00
Mathias Fröhlich	630154e77b	i965: Move down genX_upload_sbe in profiles. Avoid looping over all VARYING_SLOT_MAX urb_setup array entries from genX_upload_sbe. Prepare an array indirection to the active entries of urb_setup already in the compile step. On upload only walk the active arrays. v2: Use uint8_t to store the attribute numbers. v3: Change loop to build up the array indirection. v4: Rebase. v5: Style fix. Reviewed-by: Matt Turner <mattst88@gmail.com> Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/308>	2020-03-10 14:28:36 +00:00
Ian Romanick	ba2fa1ceaf	intel/fs: Do cmod prop again after scheduling Pre-RA scheduling can create more opportunities for CMOD propagation. This takes advantage of that. It may be worth doing this again in post-RA scheduling, but there are additional problems there. I'm a little torn about the use of the OPT() macro. On the one hand, it would be confusing to see dumps from INTEL_DEBUG=optimizer that don't match the final output. On the other hand, since register allocation can fail, the same pass can be run multiple times. Each time one or both passes might or might not make progress. This would also lead to incongruous, confusing output. Ice Lake total instructions in shared programs: 14549808 -> 14548529 (<.01%) instructions in affected programs: 231985 -> 230706 (-0.55%) helped: 632 HURT: 0 helped stats (abs) min: 1 max: 32 x̄: 2.02 x̃: 1 helped stats (rel) min: 0.05% max: 2.56% x̄: 0.57% x̃: 0.41% 95% mean confidence interval for instructions value: -2.25 -1.79 95% mean confidence interval for instructions %-change: -0.61% -0.54% Instructions are helped. total cycles in shared programs: 203770850 -> 203776599 (<.01%) cycles in affected programs: 2495653 -> 2501402 (0.23%) helped: 282 HURT: 197 helped stats (abs) min: 1 max: 242 x̄: 20.37 x̃: 16 helped stats (rel) min: <.01% max: 11.65% x̄: 0.91% x̃: 0.64% HURT stats (abs) min: 2 max: 609 x̄: 58.35 x̃: 20 HURT stats (rel) min: <.01% max: 10.97% x̄: 1.35% x̃: 0.66% 95% mean confidence interval for cycles value: 5.27 18.73 95% mean confidence interval for cycles %-change: -0.16% 0.21% Inconclusive result (%-change mean confidence interval includes 0). LOST: 0 GAINED: 2 Skylake total instructions in shared programs: 13447708 -> 13446594 (<.01%) instructions in affected programs: 216813 -> 215699 (-0.51%) helped: 623 HURT: 0 helped stats (abs) min: 1 max: 32 x̄: 1.79 x̃: 1 helped stats (rel) min: 0.06% max: 2.86% x̄: 0.59% x̃: 0.42% 95% mean confidence interval for instructions value: -1.99 -1.59 95% mean confidence interval for instructions %-change: -0.63% -0.55% Instructions are helped. total cycles in shared programs: 193759224 -> 193762726 (<.01%) cycles in affected programs: 2540035 -> 2543537 (0.14%) helped: 249 HURT: 190 helped stats (abs) min: 2 max: 196 x̄: 16.67 x̃: 14 helped stats (rel) min: <.01% max: 4.71% x̄: 0.66% x̃: 0.62% HURT stats (abs) min: 2 max: 614 x̄: 40.27 x̃: 14 HURT stats (rel) min: 0.02% max: 5.78% x̄: 0.86% x̃: 0.37% 95% mean confidence interval for cycles value: 2.57 13.39 95% mean confidence interval for cycles %-change: -0.11% 0.11% Inconclusive result (%-change mean confidence interval includes 0). LOST: 0 GAINED: 1 Broadwell total instructions in shared programs: 13418631 -> 13417393 (<.01%) instructions in affected programs: 243192 -> 241954 (-0.51%) helped: 694 HURT: 0 helped stats (abs) min: 1 max: 31 x̄: 1.78 x̃: 1 helped stats (rel) min: 0.06% max: 2.86% x̄: 0.59% x̃: 0.44% 95% mean confidence interval for instructions value: -1.95 -1.62 95% mean confidence interval for instructions %-change: -0.62% -0.55% Instructions are helped. total cycles in shared programs: 200822940 -> 200829128 (<.01%) cycles in affected programs: 2128651 -> 2134839 (0.29%) helped: 251 HURT: 226 helped stats (abs) min: 1 max: 200 x̄: 14.32 x̃: 12 helped stats (rel) min: <.01% max: 3.56% x̄: 0.60% x̃: 0.50% HURT stats (abs) min: 2 max: 611 x̄: 43.28 x̃: 18 HURT stats (rel) min: 0.02% max: 7.03% x̄: 0.93% x̃: 0.54% 95% mean confidence interval for cycles value: 7.44 18.50 95% mean confidence interval for cycles %-change: 0.02% 0.23% Cycles are HURT. Haswell and Ivy Bridge had similar results. (Haswell shown) total instructions in shared programs: 11569710 -> 11568829 (<.01%) instructions in affected programs: 147862 -> 146981 (-0.60%) helped: 487 HURT: 0 helped stats (abs) min: 1 max: 34 x̄: 1.81 x̃: 1 helped stats (rel) min: 0.12% max: 4.75% x̄: 0.57% x̃: 0.45% 95% mean confidence interval for instructions value: -2.03 -1.59 95% mean confidence interval for instructions %-change: -0.61% -0.54% Instructions are helped. total cycles in shared programs: 187079425 -> 187079437 (<.01%) cycles in affected programs: 1088494 -> 1088506 (<.01%) helped: 234 HURT: 124 helped stats (abs) min: 2 max: 282 x̄: 22.66 x̃: 16 helped stats (rel) min: 0.03% max: 7.88% x̄: 0.93% x̃: 0.75% HURT stats (abs) min: 1 max: 276 x̄: 42.86 x̃: 20 HURT stats (rel) min: 0.03% max: 6.70% x̄: 0.99% x̃: 0.53% 95% mean confidence interval for cycles value: -5.54 5.61 95% mean confidence interval for cycles %-change: -0.41% -0.11% Inconclusive result (value mean confidence interval includes 0). total spills in shared programs: 7746 -> 7740 (-0.08%) spills in affected programs: 6 -> 0 helped: 1 HURT: 0 total fills in shared programs: 6264 -> 6258 (-0.10%) fills in affected programs: 6 -> 0 helped: 1 HURT: 0 Sandy Bridge total instructions in shared programs: 10688576 -> 10688177 (<.01%) instructions in affected programs: 137875 -> 137476 (-0.29%) helped: 358 HURT: 0 helped stats (abs) min: 1 max: 9 x̄: 1.11 x̃: 1 helped stats (rel) min: 0.15% max: 1.43% x̄: 0.35% x̃: 0.28% 95% mean confidence interval for instructions value: -1.18 -1.05 95% mean confidence interval for instructions %-change: -0.37% -0.32% Instructions are helped. total cycles in shared programs: 153397144 -> 153393046 (<.01%) cycles in affected programs: 1220713 -> 1216615 (-0.34%) helped: 255 HURT: 31 helped stats (abs) min: 1 max: 304 x̄: 16.71 x̃: 16 helped stats (rel) min: <.01% max: 6.70% x̄: 0.41% x̃: 0.31% HURT stats (abs) min: 1 max: 41 x̄: 5.29 x̃: 3 HURT stats (rel) min: 0.02% max: 0.65% x̄: 0.16% x̃: 0.11% 95% mean confidence interval for cycles value: -17.44 -11.22 95% mean confidence interval for cycles %-change: -0.40% -0.29% Cycles are helped. Iron Lake total instructions in shared programs: 8106894 -> 8105529 (-0.02%) instructions in affected programs: 287197 -> 285832 (-0.48%) helped: 1099 HURT: 0 helped stats (abs) min: 1 max: 10 x̄: 1.24 x̃: 1 helped stats (rel) min: 0.16% max: 4.55% x̄: 0.67% x̃: 0.61% 95% mean confidence interval for instructions value: -1.29 -1.19 95% mean confidence interval for instructions %-change: -0.70% -0.64% Instructions are helped. total cycles in shared programs: 188347022 -> 188344266 (<.01%) cycles in affected programs: 3740632 -> 3737876 (-0.07%) helped: 758 HURT: 10 helped stats (abs) min: 2 max: 38 x̄: 3.68 x̃: 2 helped stats (rel) min: <.01% max: 1.00% x̄: 0.12% x̃: 0.08% HURT stats (abs) min: 2 max: 4 x̄: 3.20 x̃: 4 HURT stats (rel) min: 0.03% max: 0.07% x̄: 0.06% x̃: 0.07% 95% mean confidence interval for cycles value: -3.82 -3.35 95% mean confidence interval for cycles %-change: -0.13% -0.11% Cycles are helped. GM45 total instructions in shared programs: 4985449 -> 4984768 (-0.01%) instructions in affected programs: 145154 -> 144473 (-0.47%) helped: 547 HURT: 0 helped stats (abs) min: 1 max: 10 x̄: 1.24 x̃: 1 helped stats (rel) min: 0.16% max: 2.86% x̄: 0.66% x̃: 0.61% 95% mean confidence interval for instructions value: -1.31 -1.18 95% mean confidence interval for instructions %-change: -0.69% -0.62% Instructions are helped. total cycles in shared programs: 128835062 -> 128833144 (<.01%) cycles in affected programs: 2720650 -> 2718732 (-0.07%) helped: 517 HURT: 1 helped stats (abs) min: 2 max: 38 x̄: 3.71 x̃: 2 helped stats (rel) min: <.01% max: 0.89% x̄: 0.11% x̃: 0.07% HURT stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2 HURT stats (rel) min: 0.04% max: 0.04% x̄: 0.04% x̃: 0.04% 95% mean confidence interval for cycles value: -4.02 -3.39 95% mean confidence interval for cycles %-change: -0.12% -0.10% Cycles are helped. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3965>	2020-03-09 16:46:19 -07:00
Matt Turner	bb3e7b0fe3	intel/compiler: Pass shader_stats for each SIMD mode Passing shader_stats to the fs_generator constructor means that the SIMD8 shader stats from the visitor (such as the scheduler mode) will be reported out for the SIMD16/SIMD32 versions as well. As you can see, we are now passing 'shader_stats' and 'stats' to generate_code(), which is obviously odd looking. Ian rebased and committed an old patch of mine which added the shader_stats struct on July 30 in commit `dabb5d4bee` (i965/fs: Add a shader_stats struct.) and shortly after on August 12 Jason added the brw_compile_stats struct in commit `134607760a` (intel/compiler: Fill a compiler statistics struct). I'd like to combine the two, but I'm not sure how. shader_stats is an input to generate_code() while brw_compile_stats is an output and is only used by the Vulkan driver. Leave it as is for now... Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>	2020-03-09 04:44:12 +00:00
Matt Turner	75a33e268e	intel/compiler: Mark some methods and parameters const Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4093>	2020-03-09 04:44:11 +00:00
Francisco Jerez	70349a2252	intel/compiler: Calculate num_instructions in O(1) during register pressure calculation And mark the variable declaration as const. Reviewed-by: Matt Turner <mattst88@gmail.com> Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:21:13 -08:00
Francisco Jerez	e5e4d016b9	intel/compiler: Move register pressure calculation into IR analysis object This defines a new BRW_ANALYSIS object which wraps the register pressure computation code along with its result. For the rationale see the previous commits converting the liveness and dominance analysis passes to the IR analysis framework. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:21:10 -08:00
Francisco Jerez	ea44de6d8c	intel/compiler/fs: Switch liveness analysis to IR analysis framework This involves wrapping fs_live_variables in a BRW_ANALYSIS object and hooking it up to invalidate_analysis() so it's properly invalidated. Seems like a lot of churn but it's fairly straightforward. The fs_visitor invalidate_ and calculate_live_intervals() methods are no longer necessary after this change. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:57 -08:00
Francisco Jerez	ba73e606f6	intel/compiler: Move all live interval analysis results into fs_live_variables This moves the following methods that are currently defined in fs_visitor (even though they are side products of the liveness analysis computation) and are already implemented in brw_fs_live_variables.cpp: > bool virtual_grf_interferes(int a, int b) const; > int virtual_grf_start; > int virtual_grf_end; It makes sense for them to be part of the fs_live_variables object, because they have the same lifetime as other liveness analysis results and because this will allow some extra validation to happen wherever they are accessed in order to make sure that we only ever use up-to-date liveness analysis results. This shortens the virtual_grf prefix in order to compensate for the slightly increased lexical overhead from the live_intervals pointer dereference. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:43 -08:00
Francisco Jerez	ab6d792986	intel/compiler: Pass detailed dependency classes to invalidate_analysis() Have fun reading through the whole back-end optimizer to verify whether I've missed any dependency flags -- Or alternatively, just trust that any mistake here will trigger an assertion failure during analysis pass validation if it ever poses a problem for the consistency of any of the analysis passes managed by the framework. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:39 -08:00
Francisco Jerez	d966a6b4c4	intel/compiler: Introduce backend_shader method to propagate IR changes to analysis passes The invalidate_analysis() method knows what analysis passes there are in the back-end and calls their invalidate() method to report changes in the IR. For the moment it just calls invalidate_live_intervals() (which will eventually be fully replaced by this function) if anything changed. This makes all optimization passes invalidate DEPENDENCY_EVERYTHING, which is clearly far from ideal -- The dependency classes passed to invalidate_analysis() will be refined in a future commit. Reviewed-by: Matt Turner <mattst88@gmail.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4012>	2020-03-06 10:20:32 -08:00
Jordan Justen	cf12faef61	intel/compiler: Restrict cs_threads to 64 Our current GPGPU_WALKER code only supports up to 64 threads. On HSW we could use up to 70 and TGL up to 112, but only if the walker is adjusted so the width does not exceed 64. Work to support this is in progress. Previous to this change, we might try to downgrade to SIMD8 if the SIMD16 shader spilled. Since HSW and TGL have the max number of threads above 64, we would then try to emit an invalid GPGPU walker command. Fixes: `932045061b` ("i965/cs: Emit compute shader code and upload programs") Signed-off-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Tested-by: Paulo Zanoni <paulo.r.zanoni@intel.com>	2020-02-28 14:45:43 -08:00
Danylo Piliaiev	d596795d4d	brw_fs: Avoid zero size vla ../src/intel/compiler/brw_fs.cpp:2247:46: runtime error: variable length array bound evaluates to non-positive value 0 #0 0x7f78f5697678 in fs_visitor::assign_constant_locations() ../src/intel/compiler/brw_fs.cpp:2247 #1 0x7f78f571d29e in fs_visitor::optimize() ../src/intel/compiler/brw_fs.cpp:7361 #2 0x7f78f574eb84 in fs_visitor::run_fs(bool, bool) ../src/intel/compiler/brw_fs.cpp:8022 #3 0x7f78f575641b in brw_compile_fs ../src/intel/compiler/brw_fs.cpp:8408 #4 0x7f78f255c8e4 in brw_codegen_wm_prog ../src/mesa/drivers/dri/i965/brw_wm.c:123 #5 0x7f78f2565571 in brw_fs_precompile ../src/mesa/drivers/dri/i965/brw_wm.c:608 #6 0x7f78f24edd2c in brw_shader_precompile ../src/mesa/drivers/dri/i965/brw_link.cpp:56 #7 0x7f78f24f3af8 in brw_link_shader ../src/mesa/drivers/dri/i965/brw_link.cpp:381 #8 0x7f78f39a302a in _mesa_glsl_link_shader ../src/mesa/program/ir_to_mesa.cpp:3119 #9 0x7f78f3a43826 in create_new_program ../src/mesa/main/ff_fragment_shader.cpp:1133 #10 0x7f78f3a43d00 in _mesa_get_fixed_func_fragment_program ../src/mesa/main/ff_fragment_shader.cpp:1163 #11 0x7f78f325ddcd in update_program ../src/mesa/main/state.c:134 #12 0x7f78f325fe64 in _mesa_update_state_locked ../src/mesa/main/state.c:360 #13 0x7f78f32600f1 in _mesa_update_state ../src/mesa/main/state.c:394 #14 0x7f78f2b3e587 in clear ../src/mesa/main/clear.c:169 #15 0x7f78f2b3e587 in _mesa_Clear ../src/mesa/main/clear.c:242 Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com> Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3825>	2020-02-19 12:07:24 +02:00
Francisco Jerez	8d3b86e34a	intel/fs/gen7+: Implement discard/demote for SIMD32 programs. At this point this simply involves fixing the initialization of the sample mask flag register to take the right dispatch mask from the thread payload, and fixing sample_mask_reg() to return f1.1 for the second half of a SIMD32 thread. This improves Manhattan 3.1 performance by 2.4%±0.31% (N>40) on my ICL with SIMD32 enabled relative to falling back to SIMD16 for the shaders that use discard. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-02-14 14:31:49 -08:00
Francisco Jerez	04c7d3d4b1	intel/fs: Return consistent UW types from sample_mask_reg() in fragment shaders. In SIMD32 programs that don't use discard, the upper 16 bits of the UD result of sample_mask_reg() don't contain the sample mask of the upper 16 channels as one would expect. Stop pretending we are returning a valid 32-bit mask. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-02-14 14:31:49 -08:00
Francisco Jerez	1c6853a9be	intel/fs: Refactor predication on sample mask into helper function. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-02-14 14:31:48 -08:00
Francisco Jerez	a792e11f5c	intel/fs/gen7+: Swap sample mask flag register and FIND_LIVE_CHANNEL temporary. FIND_LIVE_CHANNEL was using f1.0-f1.1 as temporary flag register on Gen7, instead use f0.0-f0.1. In order to avoid collision with the discard sample mask, move the latter to f1.0-f1.1. This makes room for keeping track of the sample mask of the second half of SIMD32 programs that use discard. Note that some MOVs of the sample mask into f1.0 become redundant now in lower_surface_logical_send() and lower_a64_logical_send(). Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>x	2020-02-14 14:31:48 -08:00
Francisco Jerez	083fd96a97	intel/fs: Use helper for discard sample mask flag subregister number. Use it instead of hard-coding f0.1 for the sample mask of programs that use discard. This will make the task easier when we replace f0.1 with another flag register location in order to support discard with SIMD32 shaders. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-02-14 14:31:48 -08:00
Francisco Jerez	a6bc11a789	intel/fs: Make sample_mask_reg() local to brw_fs.cpp and use it in more places. It's only really useful there. This will avoid confusion with another helper with a similar purpose I'm about to introduce that will be useful in multiple files from the FS back-end. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-02-14 14:31:48 -08:00
Francisco Jerez	57dee58c82	intel/fs: Set src0 alpha present bit in header when provided in message payload. Currently the "Source0 Alpha Present to RenderTarget" bit of the RT write message header is derived from brw_wm_prog_data::replicate_alpha. However the src0_alpha payload is provided anytime it's specified to the logical message. This could theoretically lead to an inconsistency if somebody provided a src0_alpha value while brw_wm_prog_data::replicate_alpha was false, as I'm planning to do in a future commit in order to implement a hardware workaround. Instead calculate the header bit based on whether a src0_alpha value was provided to the logical message, which guarantees the same behavior on pre-ICL and ICL+ (the latter used an extended descriptor bit for this which didn't suffer from the same issue). Remove the brw_wm_prog_data::replicate_alpha flag. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-02-14 14:31:48 -08:00
Francisco Jerez	a8ac0bd759	intel/fs/gen12: Workaround unwanted SEND execution due to broken NoMask control flow. This is a less invasive alternative to the workaround documented in the hardware spec for GEN:BUG:1407528679, which doesn't involve disabling structured control flow (it's unlikely that switching to GOTO/JOIN would have actually fixed the problem anyway). Under some conditions Gen12 hardware can end up executing a BB with all channels disabled, which will lead to the execution of any NoMask instructions in it, even though any execution-masked instructions will be correctly shot down. This may break assumptions of some NoMask SEND messages whose descriptor depends on data generated by live invocations of the shader. This avoids the problem by predicating certain instructions on an ANY horizontal predicate that makes sure that their execution is omitted when all channels of the program are disabled. The shader-db impact of this patch seems to be minimal: total instructions in shared programs: 17169833 -> 17169913 (0.00%) instructions in affected programs: 30663 -> 30743 (0.26%) helped: 0 HURT: 42 total cycles in shared programs: 336966176 -> 336968568 (0.00%) cycles in affected programs: 2367290 -> 2369682 (0.10%) helped: 0 HURT: 13 Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Cc: 20.0 <mesa-stable@lists.freedesktop.org>	2020-02-14 14:31:48 -08:00
Francisco Jerez	008f95a043	intel/fs: Add virtual instruction to load mask of live channels into flag register. Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Cc: 20.0 <mesa-stable@lists.freedesktop.org>	2020-02-14 14:31:48 -08:00
Francisco Jerez	b8b509fb92	intel/fs/gen7: Fix fs_inst::flags_written() for SHADER_OPCODE_FIND_LIVE_CHANNEL. We need to pass a width of 32 since the opcode bashes the whole f1.0 register on IVB. This is unlikely to have caused problems since f1.0 is largely unused currently. That's likely to change soon though, even on platforms other than Gen7. Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Cc: 20.0 <mesa-stable@lists.freedesktop.org>	2020-02-14 14:31:48 -08:00
Ian Romanick	58907568ec	intel/fs: Add SHADER_OPCODE_[IU]SUB_SAT pseudo-ops v2: Add a big comment explaining the [IU]SUB_SAT lowering. Suggested by Caio. v3: Use get_fpu_lowered_simd_width in get_lowered_simd_width. Suggested by Ken on IRC. v4: Fix a typo in a comment. Noticed by Caio. Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>	2020-01-23 00:18:57 +00:00
Ian Romanick	74cd0964d6	intel/fs: Don't lower integer multiplies that don't need lowering v2: Move the check to fs_visitor::lower_integer_multiplication. Previously the cases where lowering was skipped, the original instruction was removed by fs_visitor::lower_integer_multiplication. Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/767>	2020-01-23 00:18:57 +00:00
Matt Turner	49c21802cb	intel/compiler: Split has_64bit_types into float/int Gen7 has 64-bit floats but not 64-bit ints. Acked-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2635>	2020-01-22 00:19:20 +00:00
Caio Marcelo de Oliveira Filho	ff5b74ef32	intel/fs: Add workgroup_size() helper Reviewed-by: Francisco Jerez <currojerez@riseup.net> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>	2020-01-21 23:41:35 +00:00
Francisco Jerez	b54b67e067	intel/fs: Switch to standard vector layout for barycentrics at optimization time. This involves permuting the registers of barycentric vectors to have the standard X[0-n] Y[0-n] layout at NIR translation time. Barycentrics are converted to the format expected by the PLN instruction in the lower_barycentrics() pass run after the optimization loop. Main reason is correctness of SIMD32 fragment shaders. The shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used during NIR translation are busted for SIMD32. This leads to serious corruption at present with INTEL_DEBUG=do32, especially on Gen11+ where these helpers are hit more frequently due to the lack of a hardware PLN instruction. Of course one could have chosen to fix those helpers instead, but there is another far more subtle issue that was reported during review of the SIMD32 fragment shader codegen changes: The SIMD splitting pass currently handles SIMD32 barycentric vectors as if they had the standard X[0-n] Y[0-n] layout, even though they are interleaved for the PLN instruction, which causes incorrect execution masks to be applied to the MOVs unzipping barycentric vectors in cases where a LINTERP instruction occurs under non-uniform control flow. I'm not aware of any conformance regressions due to the latter issue at present, but for our peace of mind let's move the conversion to the PLN layout into the lower_barycentrics() pass run after lower_simd_width(). This leads to the following shader-db improvements (including SIMD32 shaders) in combination with the previous back-end preparation changes -- Without them (especially the copy propagation changes) this would lead to a massive number of regressions. On ICL: total instructions in shared programs: 20662316 -> 20466903 (-0.95%) instructions in affected programs: 10538474 -> 10343061 (-1.85%) helped: 68775 HURT: 6 total spills in shared programs: 8938 -> 8748 (-2.13%) spills in affected programs: 376 -> 186 (-50.53%) helped: 9 HURT: 5 total fills in shared programs: 8965 -> 8663 (-3.37%) fills in affected programs: 965 -> 663 (-31.30%) helped: 9 HURT: 6 LOST: 146 GAINED: 43 On SKL: total instructions in shared programs: 18725867 -> 18614912 (-0.59%) instructions in affected programs: 3876590 -> 3765635 (-2.86%) helped: 27492 HURT: 2 LOST: 191 GAINED: 417 On SNB: total instructions in shared programs: 14573613 -> 13980646 (-4.07%) instructions in affected programs: 5199074 -> 4606107 (-11.41%) helped: 29998 HURT: 0 LOST: 21 GAINED: 30 Results are somewhat less impressive but still significant without SIMD32 fragment shaders enabled. On ICL: total instructions in shared programs: 16148728 -> 16061659 (-0.54%) instructions in affected programs: 6114788 -> 6027719 (-1.42%) helped: 42046 HURT: 6 total spills in shared programs: 8218 -> 8028 (-2.31%) spills in affected programs: 376 -> 186 (-50.53%) helped: 9 HURT: 5 total fills in shared programs: 8953 -> 8651 (-3.37%) fills in affected programs: 965 -> 663 (-31.30%) helped: 9 HURT: 6 LOST: 0 GAINED: 3 On SKL: total instructions in shared programs: 14927994 -> 14926738 (-0.01%) instructions in affected programs: 168850 -> 167594 (-0.74%) helped: 711 HURT: 2 On SNB: total instructions in shared programs: 10770538 -> 10734403 (-0.34%) instructions in affected programs: 2702172 -> 2666037 (-1.34%) helped: 17818 HURT: 0 All of the hurt shaders are either spilling slightly more or emitting additional NOP instructions due to the SIMD16 POW workaround for Gen8-9 combined with differences in scheduling. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:23:12 -08:00
Francisco Jerez	79bd252d6e	intel/fs: Introduce barycentric layout lowering pass. The goal is to represent barycentrics with the standard vector layout during optimization and particularly SIMD lowering. Instead of emitting the barycentric layout conversions at NIR translation time, do it later as a lowering pass. For the moment this is only applied to PI messages, but we'll give the same treatment to LINTERP instructions too. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:22:59 -08:00
Francisco Jerez	ab0d1b3b3d	intel/fs: Rework fs_inst::is_copy_payload() into multiple classification helpers. This reworks the current fs_inst::is_copy_payload() method into a number of classification helpers with well-defined semantics. This will be useful later on in order to optimize LOAD_PAYLOAD instructions more aggressively in cases where we can determine it's safe to do so. The closest equivalent of the present fs_inst::is_copy_payload() method is the is_coalescing_payload() helper introduced here. No functional nor shader-db changes. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:21:19 -08:00
Francisco Jerez	1873202f44	intel/fs: Generalize fs_reg::is_contiguous() to register files other than VGRF. No functional nor shader-db changes. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:20:59 -08:00
Francisco Jerez	d9a57c85cc	intel/fs: Try to vectorize header setup in lower_load_payload(). In cases where LOAD_PAYLOAD is provided a pair of contiguous registers as header sources, try to use a single SIMD16 instruction in order to initialize them. This is unlikely to affect the overall cycle count of the shader, since the compressed instruction has twice the issue time, except due to the reduced pressure on the instruction cache. Main motivation is avoiding instruction-count regressions in combination with the following copy propagation improvements, which will allow the SIMD16 g0-1 header setup emitted for framebuffer writes to be copy-propagated into its LOAD_PAYLOAD, leading to the emission of two SIMD8 MOV instructions instead of a single SIMD16 MOV. Reverting this commit on top of the copy propagation changes would lead to the following shader-db regressions on SKL and other platforms: total instructions in shared programs: 14926738 -> 14935415 (0.06%) instructions in affected programs: 1892445 -> 1901122 (0.46%) helped: 0 HURT: 8676 Without the following copy propagation changes this doesn't have any effect on shader-db on Gen7+, because we would typically set up the FB write header with a separate SIMD16 MOV that isn't currently copy-propagated into the LOAD_PAYLOAD, so the individual SIMD8 MOVs result of LOAD_PAYLOAD lowering would get register-coalesced away under normal circumstances. However that wasn't the case for MRF LOAD_PAYLOAD destinations on Gen6 and earlier, because register coalesce only kicks in for GRFs, leaving a number of redundant SIMD8 MOVs lying around. On SNB this leads to the following shader-db improvements: total instructions in shared programs: 10770538 -> 10734681 (-0.33%) instructions in affected programs: 2700655 -> 2664798 (-1.33%) helped: 17791 HURT: 0 Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-17 13:20:46 -08:00
Eric Anholt	3d9a3d0be0	i965: Reuse the new core glsl_count_dword_slots(). The only difference I could see was treating interfaces like structs. Maintain that case. Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com> Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3297> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3297>	2020-01-14 23:55:00 +00:00
Francisco Jerez	c20dc9b836	intel/fs: Make implied_mrf_writes() an fs_inst method. This will be convenient in a later commit enabling SIMD32 fragment shaders, and happens to fix the calculation for MATH instructions which is currently inaccurate for SIMD-lowered instructions on Gen4-5 platforms (all of them on Gen4 in SIMD16 mode), since it was based on the shader's dispatch width rather than on the actual execution size of the instruction. This causes some shader-db noise on Gen4 due to the more compact register allocation interacting with the SEND dependency workarounds, but otherwise no major changes. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-10 11:02:30 -08:00
Francisco Jerez	0a6e46d44d	intel/fs/gen11+: Handle ROR/ROL in lower_simd_width(). Prevents invalid code from being emitted for ROR/ROL instructions in SIMD32 shaders. The problem can be reproduced with the following tests while forcing SIMD32 to be used for fragment shaders: piglit.shaders.glsl-rotate-left piglit.shaders.glsl-rotate-right However the issue could occur in production already with compute shaders and a workgroup size large enough to trigger SIMD32 dispatch. Fixes: `83fdec0f0d` "intel/compiler: Enable the emission of ROR/ROL instructions" Cc: Sagar Ghuge <sagar.ghuge@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2020-01-10 11:00:24 -08:00

1 2 3 4 5 ...

293 commits