fdo-mirrors/mesa

mirror of https://gitlab.freedesktop.org/mesa/mesa.git synced 2025-12-21 15:50:11 +01:00

Author	SHA1	Message	Date
Jason Ekstrand	0b830081f0	intel/fs: Rework KSP data to be SIMD width-based Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Francisco Jerez	5b6e91dd35	intel/fs: Remove program key argument from generator. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Jason Ekstrand	a14fb0184a	intel/fs: Set up FB write message headers in the visitor Doing instruction header setup in the generator is awful for a number of reasons. For one, we can't schedule the header setup at all. For another, it means lots of implied writes which the instruction scheduler and other passes can't properly read about. The second isn't a huge problem for FB writes since they always happen at the end. We made a similar change to sampler handling in `ff4726077d`. Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Francisco Jerez	dda31a7bbc	intel/fs: Fix implied_mrf_writes() for headerless FB writes. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Francisco Jerez	90643689aa	intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Jason Ekstrand	b1cc9a9ae1	intel/fs: Properly track implied header regs read by FB writes The FB write opcode on gen4-5 does implied copies from g0 and g1 to the message payload. With this commit, we start tracking that as part of the IR by having the FB write read from g0-1. Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-06-28 13:19:38 -07:00
Jose Maria Casanova Crespo	a0891eabca	intel/fs: Use shuffle_from_32bit_read at VARYING_PULL_CONSTANT_LOAD shuffle_from_32bit_read can manage the shuffle/unshuffle needed for different 8/16/32/64 bit-sizes at VARYING PULL CONSTANT LOAD. To get the specific component the first_component parameter is used. In the case of the previous 16-bit shuffle, the shuffle operation was generating not needed MOVs where its results where never used. This behaviour passed unnoticed on SIMD16 because dead_code_eliminate pass removed the generated instructions but for SIMD8 they cound't be removed because of being partial writes. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-06-16 22:39:08 +02:00
Ian Romanick	284b563fb0	i965/fs: Optimize OR with 0 into a MOV fs_visitor::set_gs_stream_control_data_bits generates some code like "control_data_bits \| stream_id << ((2 * (vertex_count - 1)) % 32)" as part of EmitVertex. The first time this (dynamically) occurs in the shader, control_data_bits is zero. Many times we can determine this statically and various optimizations will collaborate to make one of the OR operands literal zero. Converting the OR to a MOV usually allows it to be copy-propagated away. However, this does not happen in at least some shaders (in the assembly output of shaders/closed/UnrealEngine4/EffectsCaveDemo/301.shader_test, search for shl). All of the affected shaders are geometry shaders. Broadwell and Skylake had similar results. (Skylake shown) total instructions in shared programs: 14375452 -> 14375413 (<.01%) instructions in affected programs: 6422 -> 6383 (-0.61%) helped: 39 HURT: 0 helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1 helped stats (rel) min: 0.14% max: 2.56% x̄: 1.91% x̃: 2.56% 95% mean confidence interval for instructions value: -1.00 -1.00 95% mean confidence interval for instructions %-change: -2.26% -1.57% Instructions are helped. total cycles in shared programs: 531981179 -> 531980555 (<.01%) cycles in affected programs: 27493 -> 26869 (-2.27%) helped: 39 HURT: 0 helped stats (abs) min: 16 max: 16 x̄: 16.00 x̃: 16 helped stats (rel) min: 0.60% max: 7.92% x̄: 5.94% x̃: 7.92% 95% mean confidence interval for cycles value: -16.00 -16.00 95% mean confidence interval for cycles %-change: -6.98% -4.90% Cycles are helped. No changes on earlier platforms. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>	2018-06-15 17:22:27 -07:00
Francisco Jerez	4bd2047dee	intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot. When using multiple RT write messages to the same RT such as for dual-source blending or all RT writes in SIMD32, we have to set the "Last Render Target Select" bit on all write messages that target the last RT but only set EOT on the last RT write in the shader. Special-casing for dual-source blend works today because that is the only case which requires multiple RT write messages per RT. When we start doing SIMD32, this will become much more common so we add a dedicated bit for it. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-05-29 15:44:50 -07:00
Francisco Jerez	39de901a96	intel/fs: Use the ATTR file for FS inputs This replaces the special magic opcodes which implicitly read inputs with explicit use of the ATTR file. v2 (Jason Ekstrand): - Break into multiple patches - Change the units of the FS ATTR to be in logical scalars Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-05-29 15:44:50 -07:00
Ian Romanick	bdb15c2344	intel/compiler: Silence unused parameter warning in compile_cs_to_nir src/intel/compiler/brw_fs.cpp: In function ‘nir_shader* compile_cs_to_nir(const brw_compiler, void, const brw_cs_prog_key, brw_cs_prog_data, const nir_shader, unsigned int)’: src/intel/compiler/brw_fs.cpp:7205:44: warning: unused parameter ‘prog_data’ [-Wunused-parameter] struct brw_cs_prog_data prog_data, ^~~~~~~~~ Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-04-24 14:31:21 -04:00
Jason Ekstrand	de1f22d595	i965/fs: Return mlen * 8 for size_read() for INTERPOLATE_AT_* They are send messages and this makes size_read() and mlen agree. For both of these opcodes, the payload is just a dummy so mlen == 1 and this should decrease register pressure a bit. Reviewed-by: Francisco Jerez <currojerez@riseup.net> Cc: mesa-stable@lists.freedesktop.org	2018-04-23 14:04:42 -07:00
Ian Romanick	22fbb5c594	util: Add and use util_is_power_of_two_nonzero Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>	2018-03-29 14:09:28 -07:00
Ian Romanick	d76c204d05	util: Move util_is_power_of_two to bitscan.h and rename to util_is_power_of_two_or_zero The new name make the zero-input behavior more obvious. The next patch adds a new function with different zero-input behavior. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Suggested-by: Matt Turner <mattst88@gmail.com> Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>	2018-03-29 14:09:23 -07:00
Ian Romanick	8f83eea71e	i965: Add negative_equals methods This method is similar to the existing ::equals methods. Instead of testing that two src_regs are equal to each other, it tests that one is the negation of the other. v2: Simplify various checks based on suggestions from Matt. Use src_reg::type instead of fixed_hw_reg.type in a check. Also suggested by Matt. v3: Rebase on 3 years. Fix some problems with negative_equals with VF constants. Add fs_reg::negative_equals. v4: Replace the existing default case with BRW_REGISTER_TYPE_UB, BRW_REGISTER_TYPE_B, and BRW_REGISTER_TYPE_NF. Suggested by Matt. Expand the FINISHME comment to better explain why it isn't already finished. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com> [v3] Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-03-26 08:50:43 -07:00
Eric Anholt	d25640c3a3	i965: Silence compiler warning about promoted_constants. We only have a cfg != NULL if we went through one of the paths that set it, but my compiler doesn't figure that out. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Fixes: `6411defdcd` ("intel/cs: Re-run final NIR optimizations for each SIMD size")	2018-03-16 15:09:55 -07:00
Karol Herbst	b617bfcccf	compiler: int8/uint8 support OpenCL kernels also have int8/uint8. v2: remove changes in nir_search as Jason posted a patch for that Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> Signed-off-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Karol Herbst <kherbst@redhat.com>	2018-03-14 10:08:42 -04:00
Ian Romanick	52c7df1643	i965/fs: Merge CMP and SEL into CSEL on Gen8+ v2: Fix several problems handling inverted predicates. Add a much bigger comment around the BRW_CONDITIONAL_NZ case. v3: Allow uniforms and shader inputs as sources for the original SEL and CMP instructions. This enables a LOT more shaders to receive CSEL merging (5816 vs 8564 on SKL). v4: Report progress. Broadwell and Skylake had similar results. (Broadwell shown) helped: 8527 HURT: 0 helped stats (abs) min: 1 max: 27 x̄: 2.44 x̃: 1 helped stats (rel) min: 0.03% max: 17.80% x̄: 1.12% x̃: 0.70% 95% mean confidence interval for instructions value: -2.51 -2.36 95% mean confidence interval for instructions %-change: -1.15% -1.10% Instructions are helped. total cycles in shared programs: 559442317 -> 558288357 (-0.21%) cycles in affected programs: 372699860 -> 371545900 (-0.31%) helped: 6748 HURT: 1450 helped stats (abs) min: 1 max: 32000 x̄: 182.41 x̃: 12 helped stats (rel) min: <.01% max: 66.08% x̄: 3.42% x̃: 0.70% HURT stats (abs) min: 1 max: 2538 x̄: 53.08 x̃: 14 HURT stats (rel) min: <.01% max: 96.72% x̄: 3.32% x̃: 0.90% 95% mean confidence interval for cycles value: -179.01 -102.51 95% mean confidence interval for cycles %-change: -2.37% -2.08% Cycles are helped. LOST: 0 GAINED: 6 No changes on earlier platforms. Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> [v1] Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> [v3] Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-03-08 15:26:26 -08:00
Kenneth Graunke	70de61594d	i965/fs: Add infrastructure for generating CSEL instructions. v2 (idr): Don't allow CSEL with a non-float src2. v3 (idr): Add CSEL to fs_inst::flags_written. Suggested by Matt. v4 (idr): Only set BRW_ALIGN_16 on Gen < 10 (suggested by Matt). Don't reset the access mode afterwards (suggested by Samuel and Matt). Add support for CSEL not modifying the flags to more places (requested by Matt). Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Signed-off-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> [v3] Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-03-08 15:26:26 -08:00
Jason Ekstrand	8b4a5e641b	intel/fs: Add support for subgroup quad operations NIR has code to lower these away for us but we can do significantly better in many cases with register regioning and SIMD4x2. Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2018-03-07 12:13:47 -08:00
Jason Ekstrand	b0858c1cc6	intel/fs: Add a couple of simple helper opcodes Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2018-03-07 12:13:47 -08:00
Jason Ekstrand	90c9f29518	i965/fs: Add support for nir_intrinsic_shuffle Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2018-03-07 12:13:47 -08:00
Kenneth Graunke	9fa95359df	intel: Drop program size pointer from vec4/fs assembly getters. These days, we're just passing a pointer to a prog_data field, which we already have access to. We can just use it directly. (In the past, it was a pointer to a separate value.) Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2018-03-02 14:20:22 -08:00
Francisco Jerez	c063e88909	intel/fs: Handle surface opcode sample masks via predication. The main motivation is to enable HDC surface opcodes on ICL which no longer allows the sample mask to be provided in a message header, but this is enabled all the way back to IVB when possible because it decreases the instruction count of some shaders using HDC messages significantly, e.g. one of the SynMark2 CSDof compute shaders decreases instruction count by about 40% due to the removal of header setup boilerplate which in turn makes a number of send message payloads more easily CSE-able. Shader-db results on SKL: total instructions in shared programs: 15325319 -> 15314384 (-0.07%) instructions in affected programs: 311532 -> 300597 (-3.51%) helped: 491 HURT: 1 Shader-db results on BDW where the optimization needs to be disabled in some cases due to hardware restrictions: total instructions in shared programs: 15604794 -> 15598028 (-0.04%) instructions in affected programs: 220863 -> 214097 (-3.06%) helped: 351 HURT: 0 The FPS of SynMark2 CSDof improves by 5.09% ±0.36% (n=10) on my SKL laptop with this change. According to Eero this improves performance of the same test by 9% on BYT and by 7-8% on BXT J4205 and on SKL GT2 desktop. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Tested-By: Eero Tamminen <eero.t.tamminen@intel.com>	2018-03-02 11:28:56 -08:00
Francisco Jerez	6edb332b44	intel/ir: Allow arbitrary scratch flag registers for SHADER_OPCODE_FIND_LIVE_CHANNEL. This shouldn't cause any functional change at this point, it changes SHADER_OPCODE_FIND_LIVE_CHANNEL to use the flag register specified at the IR level instead of the hard-coded f1.0, now that it can be represented in backend_instruction::flag_subreg. This will be necessary for scheduling to behave correctly once more things start making use of f1.0. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-03-02 11:28:56 -08:00
Francisco Jerez	cc0fc8b8ac	intel/ir: Allow representing additional flag subregisters in the IR. This allows representing conditional mods and predicates on f1.0-f1.1 at the IR level by adding an extra bit to the flag_subreg backend_instruction field. Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-03-02 11:28:56 -08:00
Jason Ekstrand	ff4726077d	intel/fs: Set up sampler message headers in the visitor on gen7+ This gives the scheduler visibility into the headers which should improve scheduling. More importantly, however, it lets the scheduler know that the header gets written. As-is, the scheduler thinks that a texture instruction only reads it's payload and is unaware that it may write to the first register so it may reorder it with respect to a read from that register. This is causing issues in a couple of Dota 2 vertex shaders. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104923 Cc: mesa-stable@lists.freedesktop.org Reviewed-by: Francisco Jerez <currojerez@riseup.net>	2018-03-01 15:11:01 -08:00
Jose Maria Casanova Crespo	2dd94f462b	i965/fs: shuffle_32bit_load_result_to_16bit_data now skips components This helper used to load 16bit components from 32-bits read now allows skipping components with the new parameter first_component. The semantics now skip components until we reach the first_component, and then reads the number of components passed to the function. All previous uses of the helper are updated to use 0 as first_component. This will allow read 16-bit components when the first one is not aligned 32-bit. Enabling more usages of untyped_reads with 16-bit types. v2: (Jason Ektrand) Change parameters order to first_component, num_components Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-02-28 21:37:40 -08:00
Matt Turner	3a584a15c0	intel/compiler/fs: Don't generate integer DWord multiply on Gen11 Like CHV et al., Gen11 does not support 32x32 -> 32/64-bit integer multiplies. Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2018-02-28 11:15:47 -08:00
Eric Anholt	afa7b2f199	i965: Fix compiler warning about write being undefined. This looks like it should be protected by the assume() about nr_color_regions, but my compiler warns anyway. Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-02-20 20:23:57 -08:00
Rafael Antognolli	bcfd78e448	i965/gen10: Re-enable push constants. The GPU hang caused by push constants is apparently fixed, so let's enable them again. Signed-off-by: Rafael Antognolli <rafael.antognolli@intel.com> Cc: "18.0" <mesa-stable@lists.freedesktop.org> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2018-01-26 10:07:44 -08:00
Jason Ekstrand	db682b8f0e	i965/fs: Reset the register file to VGRF in lower_integer_multiplication `18fde36ced` changed the way temporary registers were allocated in lower_integer_multiplication so that we allocate regs_written(inst) space and keep the stride of the original destination register. This was to ensure that any MUL which originally followed the CHV/BXT integer multiply regioning restrictions would continue to follow those restrictions even after lowering. This works fine except that I forgot to reset the register file to VGRF so, even though they were assigned a number from alloc.allocate(), they had the wrong register file. This caused some GLES 3.0 CTS tests to start failing on Sandy Bridge due to attempted reads from the MRF: ES3-CTS.functional.shaders.precision.int.highp_mul_fragment.snbm64 ES3-CTS.functional.shaders.precision.int.mediump_mul_fragment.snbm64 ES3-CTS.functional.shaders.precision.int.lowp_mul_fragment.snbm64 ES3-CTS.functional.shaders.precision.uint.highp_mul_fragment.snbm64 ES3-CTS.functional.shaders.precision.uint.mediump_mul_fragment.snbm64 ES3-CTS.functional.shaders.precision.uint.lowp_mul_fragment.snbm64 This commit remedies this problem by, instead of copying inst->dst and overwriting nr, just make a new register and set the region to match inst->dst. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103626 Fixes: `18fde36ced` Cc: "17.3" <mesa-stable@lists.freedesktop.org> Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-01-25 13:58:55 -08:00
Jason Ekstrand	c3d802d68e	i965: Use UD types for gl_SampleID setup We already had to switch all of the W types to UW to prevent issues with vector immediates on gen10. We may as well use unsigned types everywhere. Reviewed-by: Matt Turner <mattst88@gmail.com>	2018-01-11 14:31:47 -08:00
Jason Ekstrand	3d2b157e23	i965/fs: Use UW types when using V immediates Gen 10 has a strange hardware bug involving V immediates with W types. It appears that a mov(8) g2<1>W 0x76543210V will actually result in g2 getting the value {3, 2, 1, 0, 3, 2, 1, 0}. In particular, the bottom four nibbles are repeated instead of the top four being taken. (A mov of 0x00003210V yields the same result.) This bug does not appear in any hardware documentation as far as we can tell and the simulator does not implement the bug either. Commit `6132992cdb` was mostly a no-op except that it changed the type of the subgroup invocation from UW to W and caused us to tickle this bug with basically every compute shader that uses any sort of invocation ID (which is most of them). This is also potentially an issue for geometry shader input pulls and SampleID setup. The easy solution is just to change the few places where we use a vector integer immediate with a W type to use a UW type. Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: mesa-stable@lists.freedesktop.org Fixes: `6132992cdb`	2018-01-11 14:31:38 -08:00
Kenneth Graunke	a1afef8de0	i965: Combine {VS,FS}_OPCODE_GET_BUFFER_SIZE opcodes. These are the same, we don't need a separate opcode enum per backend. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-30 20:30:34 -08:00
Rafael Antognolli	85789831b4	intel/compiler/gen10: Disable push constants. We still have gpu hangs on Cannonlake when using push constants, so disable them for now until we have a proper fix for these hangs. v2: Add warning message when creating context too. Signed-off-by: Rafael Antognolli <rafael.antognolli@intel.com> Cc: Ben Widawsky <ben@bwidawsk.net> Cc: Kenneth Graunke <kenneth@whitecape.org> Reviewed-by: Ben Widawsky <ben@bwidawsk.net>	2017-12-19 12:32:24 -08:00
Francisco Jerez	acab52f520	intel/fs/bank_conflicts: Don't touch Gen7 MRF hack registers. Fixes: `af2c320190` "intel/fs: Implement GRF bank conflict mitigation pass." Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104199 Reported-by: Darius Spitznagel <d.spitznagel@goodbytez.de> Reviewed-by: Matt Turner <mattst88@gmail.com>	2017-12-12 12:05:45 -08:00
Jason Ekstrand	f1ce0b905a	i965/fs: Handle !supports_pull_constants and push UBOs properly In Vulkan, we don't support classic pull constants and everything the client asks us to push, we push. However, for pushed UBOs, we still want to fall back to conventional pulls if we run out of space.	2017-12-08 15:43:25 -08:00
Jason Ekstrand	3b34ed79f1	i965/fs: Rewrite assign_constant_locations This rewires the logic for assigning uniform locations to work in terms of "complex alignments". The basic idea is that, as we walk the list of instructions, we keep track of the alignment and continuity requirements of each slot and assert that the alignments all match up. We then use those alignments in the compaction stage to ensure that everything gets placed at a properly aligned register. The old mechanism handled alignments by special-casing each of the bit sizes and placing 64-bit values first followed by 32-bit values. The old scheme had the advantage of never leaving a hole since all the 64-bit values could be tightly packed and so could the 32-bit values. However, the new scheme has no type size special cases so it handles not only 32 and 64-bit types but should gracefully extend to 16 and 8-bit types as the need arises. Tested-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>	2017-12-08 15:43:25 -08:00
Francisco Jerez	af2c320190	intel/fs: Implement GRF bank conflict mitigation pass. Unnecessary GRF bank conflicts increase the issue time of ternary instructions (the overwhelmingly most common of which is MAD) by roughly 50%, leading to reduced ALU throughput. This pass attempts to minimize the number of bank conflicts by rearranging the layout of the GRF space post-register allocation. It's in general not possible to eliminate all of them without introducing extra copies, which are typically more expensive than the bank conflict itself. In a shader-db run on SKL this helps roughly 46k shaders: total conflicts in shared programs: 1008981 -> 600461 (-40.49%) conflicts in affected programs: 816222 -> 407702 (-50.05%) helped: 46234 HURT: 72 The running time of shader-db itself on SKL seems to be increased by roughly 2.52%±1.13% with n=20 due to the additional work done by the compiler back-end. On earlier generations the pass is somewhat less effective in relative terms because the hardware incurs a bank conflict anytime the last two sources of the instruction are duplicate (e.g. while trying to square a value using MAD), which is impossible to avoid without introducing copies. E.g. for a shader-db run on SNB: total conflicts in shared programs: 944636 -> 623185 (-34.03%) conflicts in affected programs: 853258 -> 531807 (-37.67%) helped: 31052 HURT: 19 And on BDW: total conflicts in shared programs: 1418393 -> 987539 (-30.38%) conflicts in affected programs: 1179787 -> 748933 (-36.52%) helped: 47592 HURT: 70 On SKL GT4e this improves performance of GpuTest Volplosion by 3.64% ±0.33% with n=16. NOTE: This patch intentionally disregards some i965 coding conventions for the sake of reviewability. This is addressed by the next squash patch which introduces an amount of (for the most part boring) boilerplate that might distract reviewers from the non-trivial algorithmic details of the pass. The following patch is squashed in: SQUASH: intel/fs/bank_conflicts: Roll back to the nineties. Acked-by: Matt Turner <mattst88@gmail.com>	2017-12-07 15:56:06 -08:00
Jason Ekstrand	3282309f74	i965/fs: Enables 16-bit load_ubo with sampler load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit surface format defined. So when reading 16-bit components with the sampler we need to unshuffle two 16-bit components from each 32-bit component. Using the sampler avoids the use of the byte_scattered_read message that needs one message for each component and is supposed to be slower. v2: (Jason Ekstrand) - Simplify component selection and unshuffling for different bitsizes - Remove SKL optimization of reading only two 32-bit components when reading 16-bits types. Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	c57a3f200d	i965/fs: Add byte scattered read message and fs support v2: Fix alignment style (Topi Pohjolainen) (Jason Ekstrand) - Enable bit_size parameter to scattered messages to enable different bitsizes byte/word/dword. - Remove use of brw_send_indirect_scattered_message in favor of brw_send_indirect_surface_message. - Move scattered messages to surface messages namespace. - Assert align1 for scattered messages and assume Gen8+. - Inline brw_set_dp_byte_scattered_read. v3: (Jason Ekstrand) - Use renamed brw_byte_scattered_data_element_from_bit_size method - Assert scattered read for Gen8+ and Haswell. - Use conditional expresion at components_read. - Include comment about params for scattered opcodes. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	f1a9936ee1	i965/fs: Add byte scattered write message and fs support v2: (Jason Ekstrand) - Enable bit_size parameter to scattered messages to enable different bitsizes byte/word/dword. - Remove use of brw_send_indirect_scattered_message in favor of brw_send_indirect_surface_message. - Move scattered messages to surface messages namespace. - Assert align1 for scattered messages and assume Gen8+. - Inline brw_set_dp_byte_scattered_write. v3: - Remove leftover newline (Topi Pohjolainen) - Rename brw_data_size to brw_scattered_data_element and use defines instead of an enum (Jason Ekstrand) - Assert scattered write for Gen8+ and Haswell (Jason Ekstrand) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	d038deaa40	i965/fs: Add remove_extra_rounding_modes optimization Although from SPIR-V point of view, rounding modes are attached to the operation/destination, on i965 it is a status, so we don't need to explicitly set the rounding mode if the one we want is already set. Taking into account that the default mode is RTE, one possible optimization would be optimize out the first RTE set for each block. For in order to work, we would need to take into account block interrelationships. At this point, it is not worth to complicate the optimization for such small gain. v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate with the rounding mode (Curro) v3: Reset optimization for every block. (Jason Ekstrand) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	75a88d8567	i965: Support for 16-bit base types in helper functions v2: Fixed calculation of scalar size for 16-bit types. (Jason Ekstrand) v3: Fix coding style (Topi Pohjolainen) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Eduardo Lima <elima@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jason Ekstrand	dc4cf11dfc	intel/fs: Explicitly set EXECUTE_1 where needed Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	295605c930	intel/cs: Push subgroup ID instead of base thread ID We're going to want subgroup ID for SPIR-V subgroups eventually anyway. We really only want to push one and calculate the other from it. It makes a bit more sense to push the subgroup ID because it's simpler to calculate and because it's a real API thing. The only advantage to pushing the base thread ID is to avoid a single SHL in the shader. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	6411defdcd	intel/cs: Re-run final NIR optimizations for each SIMD size With the advent of SPIR-V subgroup operations, compute shaders will have to be slightly different depending on the SIMD size at which they execute. In order to allow us to do dispatch-width specific things in NIR, we re-run the final NIR stages for each sIMD width. One side-effect of this change is that we start rallocing fs_visitors which means we need DECLARE_RALLOC_CXX_OPERATORS. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	16ada419d7	i965/fs: Get rid of the early return in brw_compile_cs Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	80ddfab2f5	intel/cs: Rework the way thread local ID is handled Previously, brw_nir_lower_intrinsics added the param and then emitted a load_uniform intrinsic to load it directly. This commit switches things over to use a specific NIR intrinsic for the thread id. The one thing I don't like about this approach is that we have to copy thread_local_id over to the new visitor in import_uniforms. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00

1 2 3 4

200 commits