fdo-mirrors/mesa

mirror of https://gitlab.freedesktop.org/mesa/mesa.git synced 2025-12-22 13:30:12 +01:00

Author	SHA1	Message	Date
Francisco Jerez	af2c320190	intel/fs: Implement GRF bank conflict mitigation pass. Unnecessary GRF bank conflicts increase the issue time of ternary instructions (the overwhelmingly most common of which is MAD) by roughly 50%, leading to reduced ALU throughput. This pass attempts to minimize the number of bank conflicts by rearranging the layout of the GRF space post-register allocation. It's in general not possible to eliminate all of them without introducing extra copies, which are typically more expensive than the bank conflict itself. In a shader-db run on SKL this helps roughly 46k shaders: total conflicts in shared programs: 1008981 -> 600461 (-40.49%) conflicts in affected programs: 816222 -> 407702 (-50.05%) helped: 46234 HURT: 72 The running time of shader-db itself on SKL seems to be increased by roughly 2.52%±1.13% with n=20 due to the additional work done by the compiler back-end. On earlier generations the pass is somewhat less effective in relative terms because the hardware incurs a bank conflict anytime the last two sources of the instruction are duplicate (e.g. while trying to square a value using MAD), which is impossible to avoid without introducing copies. E.g. for a shader-db run on SNB: total conflicts in shared programs: 944636 -> 623185 (-34.03%) conflicts in affected programs: 853258 -> 531807 (-37.67%) helped: 31052 HURT: 19 And on BDW: total conflicts in shared programs: 1418393 -> 987539 (-30.38%) conflicts in affected programs: 1179787 -> 748933 (-36.52%) helped: 47592 HURT: 70 On SKL GT4e this improves performance of GpuTest Volplosion by 3.64% ±0.33% with n=16. NOTE: This patch intentionally disregards some i965 coding conventions for the sake of reviewability. This is addressed by the next squash patch which introduces an amount of (for the most part boring) boilerplate that might distract reviewers from the non-trivial algorithmic details of the pass. 
The following patch is squashed in: SQUASH: intel/fs/bank_conflicts: Roll back to the nineties. Acked-by: Matt Turner <mattst88@gmail.com>	2017-12-07 15:56:06 -08:00
Jose Maria Casanova Crespo	a1e257a5bf	i965/fs: Use untyped_surface_read for 16-bit load_ssbo SSBO loads were using byte_scattered read messages as they allow reading 16-bit size components. byte_scattered messages can only operate on one component at a time so we needed to emit as many messages as components. But for vec2 and vec4 of 16-bit, being a multiple of 32-bit, we can use the untyped_surface_read message to read pairs of 16-bit components using only one message. Once each pair is read it is unshuffled to return the proper 16-bit components. The vec3 case is handled as vec4 but the 4th component is ignored. 16-bit scalars are read using one byte_scattered_read message. v2: Removed use of stride = 2 on sources (Jason Ekstrand) Rework optimization using unshuffle 16 reads (Chema Casanova) v3: Use W and D types instead of HF and F in shuffle to avoid rounding errors (Jason Ekstrand) Use untyped_surface_read for 16-bit vec3. (Jason Ekstrand) v4: Use subscript instead of changing type and stride (Jason Ekstrand) Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	ce2e572c4c	i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg Currently, we use byte-scattered write messages for storing 16-bit into an SSBO. This is because untyped surface messages have a fixed 32-bit size. This patch optimizes these 16-bit writes by combining 2 values (e.g., two consecutive components aligned to 32 bits) into a 32-bit register, packing the two 16-bit words. 16-bit single component values will continue to use byte-scattered write messages. The same will happen when the first consecutive component is not aligned to 32 bits. This optimization reduces the number of SEND messages used for storing 16-bit values potentially by 2 or 4, which cuts down execution time significantly because byte-scattered writes are an expensive operation as they only write one component per message. v2: Removed use of stride = 2 on sources (Jason Ekstrand) Rework optimization using shuffle 16 write and enable writes of 16bit vec4 with only one message of 32-bits. (Chema Casanova) v3: - Fix coding style (Eduardo Lima) - Reorganize code to avoid duplication. (Jason Ekstrand) - Include new comments to explain the length calculations to fix alignment issues of components. (Jason Ekstrand) - Fix issues with writemask yz with 16-bit writes. (Jason Ekstrand) v4: (Jason Ekstrand) - Reorganize 64-bit ssbo-writes to avoid using slots_per_component. - Comment about why shuffle is needed when using byte_scattered_write. Signed-off-by: Eduardo Lima <elima@igalia.com> Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jason Ekstrand	3282309f74	i965/fs: Enables 16-bit load_ubo with sampler load_ubo is using 32-bit loads as uniform surfaces have a 32-bit surface format defined. So when reading 16-bit components with the sampler we need to unshuffle two 16-bit components from each 32-bit component. Using the sampler avoids the use of the byte_scattered_read message that needs one message for each component and is supposed to be slower. v2: (Jason Ekstrand) - Simplify component selection and unshuffling for different bitsizes - Remove SKL optimization of reading only two 32-bit components when reading 16-bit types. Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	3db31c0b06	i965/fs: Helpers for un/shuffle 16-bit pairs in 32-bit components These helpers are used to load/store 16-bit types from/to 32-bit components. The functions shuffle_32bit_load_result_to_16bit_data and shuffle_16bit_data_for_32bit_write are implemented similarly to the analogous functions for handling 64-bit types. v1: Explain need of temporary in shuffle operations. (Jason Ekstrand) Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	fa4a9d63bb	i965/fs: Use byte scattered read for 16-bit load_ssbo Used to enable 16-bit reads at do_untyped_vector_read, that is used on the following intrinsics: * nir_intrinsic_load_shared * nir_intrinsic_load_ssbo v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand) v3: - Add bitsize to scattered read operation (Jason Ekstrand) - Remove implementation of 16-bit UBO read from this patch. - Avoid assertion at opt_algebraic caused by ADD of two IMM with offset with BRW_REGISTER_TYPE_UD type found on matrix tests. (Jose Maria Casanova) v4: (Jason Ekstrand) - Put if case for 16-bits at the beginning of the if ladder. - Use type_sz(dest.type) * 8 as bit_size parameter for scattered read. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	c57a3f200d	i965/fs: Add byte scattered read message and fs support v2: Fix alignment style (Topi Pohjolainen) (Jason Ekstrand) - Enable bit_size parameter to scattered messages to enable different bitsizes byte/word/dword. - Remove use of brw_send_indirect_scattered_message in favor of brw_send_indirect_surface_message. - Move scattered messages to surface messages namespace. - Assert align1 for scattered messages and assume Gen8+. - Inline brw_set_dp_byte_scattered_read. v3: (Jason Ekstrand) - Use renamed brw_byte_scattered_data_element_from_bit_size method - Assert scattered read for Gen8+ and Haswell. - Use conditional expression at components_read. - Include comment about params for scattered opcodes. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	a4031bdfa9	i965/fs: Predicate byte scattered writes if needed While on Untyped Surface messages the bits of the execution mask are ANDed with the corresponding bits of the Pixel/Sample Mask, that is not the case for byte scattered writes. That is needed to avoid ssbo stores writing on helper invocations. So when that can have an effect, we load the sample mask, and predicate the send message. Note: the need for this patch was tested with a custom test. Right now the 16 bit storage CTS tests don't need this path in order to get a full pass. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	96f1926aab	i965/fs: Use byte_scattered_write on 16-bit store_ssbo We need to rely on byte scattered writes as untyped writes are 32-bit size. We could try to keep using 32-bit messages when we have two or four 16-bit elements, but for simplicity's sake, we use the same message for any component number. We revisit this approach in the following patches. v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand) v3: (Jason Ekstrand) - Include bit_size to scattered write message and remove namespace specific for scattered messages. - Move comment to proper place. - Squashed with i965/fs: Adjust type_size/type_slots on store_ssbo. (Jose Maria Casanova) - Take into account that get_nir_src returns now WORD types for 16-bit sources instead of DWORD. v4: (Jason Ekstrand) - Rename length variable to num_components. - Include assertions before emit_untyped_write. - Remove type_slot in favor of num_slot and first_slot. Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	f1a9936ee1	i965/fs: Add byte scattered write message and fs support v2: (Jason Ekstrand) - Enable bit_size parameter to scattered messages to enable different bitsizes byte/word/dword. - Remove use of brw_send_indirect_scattered_message in favor of brw_send_indirect_surface_message. - Move scattered messages to surface messages namespace. - Assert align1 for scattered messages and assume Gen8+. - Inline brw_set_dp_byte_scattered_write. v3: - Remove leftover newline (Topi Pohjolainen) - Rename brw_data_size to brw_scattered_data_element and use defines instead of an enum (Jason Ekstrand) - Assert scattered write for Gen8+ and Haswell (Jason Ekstrand) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	d038deaa40	i965/fs: Add remove_extra_rounding_modes optimization Although from SPIR-V point of view, rounding modes are attached to the operation/destination, on i965 it is a status, so we don't need to explicitly set the rounding mode if the one we want is already set. Taking into account that the default mode is RTE, one possible optimization would be to optimize out the first RTE set for each block. For this to work, we would need to take into account block interrelationships. At this point, it is not worth complicating the optimization for such a small gain. v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate with the rounding mode (Curro) v3: Reset optimization for every block. (Jason Ekstrand) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	82fa4d45e7	i965/fs: Enable rounding mode on f2f16 ops By default we don't set the rounding mode. We only set round-to-nearest-even or round-to-zero mode if explicitly set from nir. v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate with the rounding mode (Curro) v3: Use new helper brw_rnd_mode_from_nir_op (Jason Ekstrand) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	d6cd14f213	i965/fs: Define new shader opcode to set rounding modes Although it is possible to emit them directly as AND/OR on brw_fs_nir, having a specific opcode makes it easier to remove duplicate settings later. v2: (Curro) - Set thread control to 'switch' when using the control register - Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate with the rounding mode. - Avoid magic numbers setting rounding mode field at control register. v3: (Curro) - Remove redundant and add missing whitespace lines. - Match printing instruction to IR opcode "rnd_mode" v4: (Topi Pohjolainen) - Fix code style. Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	ac8d4734f6	i965: Add support for control register Control register cr0 in i965 can be used to change the rounding modes in 32-bit to 16-bit floating-point conversions. From Intel Skylake PRM, vol 07, section "Register and Register Regions", subsection "Control Register" (page 754): "Subregister cr0.0:ud contains normal operation control fields such as the floating-point mode ... " Floating-point Rounding mode is changed at bits 5:4 of cr0.0: "Rounding Mode. This field specifies the FPU rounding mode. It is initialized by Thread Dispatch. 00b = Round to Nearest or Even (RTNE) 01b = Round Up, toward +inf (RU) 10b = Round Down, toward -inf (RD) 11b = Round Toward Zero (RTZ)" Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	5d5ee507fb	i965/fs: Handle 32-bit to 16-bit conversions Conversions to 16-bit need alignment between the 16-bit and 32-bit types, so the conversion operations unpack 16-bit types with a stride of 2 and then apply a MOV with the conversion. v2 (Jason Ekstrand): - Avoid the general use of stride=2 for 16-bit register types. v3 (Topi Pohjolainen) - Code style fix (Jason Ekstrand) - Now nir_op_f2f16 was renamed to nir_op_f2f16_undef because conversion to f16 with undefined rounding is explicit Signed-off-by: Eduardo Lima <elima@igalia.com> Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com> Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	a05b6f25bf	i965/fs: Remove BRW_REGISTER_TYPE_HF assert at get_exec_type Note that we don't remove the assert at i965/vec4. At this point half float support is only for the scalar backend. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo	75a88d8567	i965: Support for 16-bit base types in helper functions v2: Fixed calculation of scalar size for 16-bit types. (Jason Ekstrand) v3: Fix coding style (Topi Pohjolainen) Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com> Signed-off-by: Eduardo Lima <elima@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Alejandro Piñeiro	2d28ca7000	i965/vec4: Handle 16-bit types at type_size_xvec4 These types have similar vec4 sizes as their 32-bit counterparts. The vec4 backend doesn't support 16-bit types and probably never will, but this method is called by the scalar backend at fs_visitor::nir_setup_outputs(), so we still need to provide valid vec4 sizes for 16-bit types. In the future, something different should be implemented to avoid this dependency. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-12-06 08:57:18 +01:00
Rafael Antognolli	2919adffe9	intel/compiler: Implement WaClearTDRRegBeforeEOTForNonPS. The bspec describes: "WA: Clear tdr register before send EOT in all non-PS shader kernels mov(8) tdr0:ud 0x0:ud {NoMask}" Signed-off-by: Rafael Antognolli <rafael.antognolli@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-12-01 11:27:27 -08:00
Iago Toral Quiroga	8620f7ebbc	i965/vec4: use a temp register to compute offsets for pull loads 64-bit pull loads are implemented by emitting 2 separate 32-bit pull load messages, where the second message loads from an offset at +16B. That addition of 16B to the original offset should not alter the original offset register used as source for the pull load instruction though, since the compiler might use that same offset register in other instructions (for example, for other pull loads in the shader code that take that same offset as reference). If the pull load is 32-bit then we only need to emit one message and we don't need to do offset calculations, but in that case the optimizer should be able to drop the redundant MOV. Fixes the following test on Haswell: KHR-GL45.gpu_shader_fp64.fp64.max_uniform_components Reviewed-by: Matt Turner <mattst88@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103007	2017-11-30 07:57:53 +01:00
Iago Toral Quiroga	f1873956db	i965/vec4: fix splitting of interleaved attributes When we split an instruction that reads a uniform value (vstride 0) we need to respect the vstride on the second half of the instruction (that is, the second half should read the same region as the first). We were doing this already, but we didn't account for stages that have interleaved input attributes which also have a vstride of 0 and need the same treatment. Fixes the following on Haswell: KHR-GL45.enhanced_layouts.varying_locations KHR-GL45.enhanced_layouts.varying_array_locations KHR-GL45.enhanced_layouts.varying_structure_locations Reviewed-by: Matt Turner <mattst88@gmail.com> Acked-by: Andres Gomez <agomez@igalia.com>	2017-11-24 09:24:06 +01:00
Matt Turner	beaea7abfa	i965/fs: Check ADD/MAD with immediates in satprop unit test The gen had to be changed from 4 to 6 so that we could test MAD, which is new on Gen6. mad_imm_float_neg_mov_sat tests the case fixed by the previous commit. Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>	2017-11-21 10:13:07 -08:00
Matt Turner	a05af1f7b8	i965/fs: Handle negating immediates on MADs when propagating saturates MADs don't take immediate sources, but we allow them in the IR since it simplifies a lot of things. I neglected to consider that case. Fixes: `4009a9ead4` ("i965/fs: Allow saturate propagation to propagate negations into MADs.") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103616 Reported-and-Tested-by: Ruslan Kabatsayev <b7.10110111@gmail.com> Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>	2017-11-21 10:13:07 -08:00
Tapani Pälli	6236ffeb83	intel: fix disasm_info memory leaks Fixes: `4f82b17287` ("i965: Rewrite disassembly annotation code") Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Tapani Pälli <tapani.palli@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Reviewed-by: Matt Turner <mattst88@gmail.com>	2017-11-21 08:36:43 +02:00
Jason Ekstrand	1eab327ba7	i965: Stop including brw_cfg.h in brw_disasm_info.h The brw_disasm_info header is included by certain tools in order to get shader assembly from binaries so it's a semi-external header. Including brw_cfg.h also pulls in brw_shader.h so you end up getting quite a bit of our back-end compiler internals. Instead, make the couple of forward declarations we need and make the header more stand-alone. This fixes the meson build. Reviewed-by: Matt Turner <mattst88@gmail.com> Fixes: `4f82b17287`	2017-11-17 21:51:16 -08:00
Andres Gomez	1866f7aee5	i965: Correct disasm_info usage in eu_validate test Fixes: `4f82b17287` ("i965: Rewrite disassembly annotation code") Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Andres Gomez <agomez@igalia.com> Reviewed-by: Matt Turner <mattst88@gmail.com>	2017-11-18 03:07:06 +02:00
Matt Turner	821ec473a8	i965: Rename intel_asm_annotation -> brw_disasm_info It was the only file named intel_* in the compiler. Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-11-17 12:14:38 -08:00
Matt Turner	4f82b17287	i965: Rewrite disassembly annotation code The old code used an array to store each "instruction group" (the new, better name than the old overloaded "annotation"), and required a memmove() to shift elements over in the array when we needed to split a group so that we could add an error message. This was confusing and difficult to get right, not the least of which was because the array has a tail sentinel not included in .ann_count. Instead use a linked list, a data structure made for efficient insertion. Acked-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-11-17 12:14:38 -08:00
Matt Turner	f80e97346b	i965: Simplify annotation_insert_error() Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-11-17 12:14:38 -08:00
Matt Turner	f4276ef7ef	i965: Move common code out of #ifdef I'm going to change the call in a later patch and with the difference in indentation level it wasn't immediately obvious that the calls were identical. Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>	2017-11-17 12:14:38 -08:00
Kenneth Graunke	e48cc01be9	intel: Drop mtypes.h include from brw_compiler.h. This isn't necessary and causes trouble for a project I'm working on.	2017-11-15 09:37:32 -08:00
Kenneth Graunke	ff964916dc	i965: Use nir_lower_atomics_to_ssbos and delete ABO compiler code. We use the same hardware mechanism for both atomic counters and SSBO atomics, so there's really no benefit to maintaining separate code to handle each case. Instead, we can just use Rob's shiny new NIR pass to convert atomic_uints to SSBOs, and delete piles of code. The ssbo_start section of the binding table becomes a combined ABO and SSBO section, with ABOs first, then SSBOs. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-11-15 09:37:32 -08:00
Matt Turner	a31d038208	Revert "intel/fs: Use a pure vertical stride for large register strides" This reverts commit `e8c9e65185`. With the actual bug fixed (by commit `6ac2d16901`), this is not necessary. I'm doubtful of its correctness in any case.	2017-11-14 11:24:08 -08:00
Matt Turner	6ac2d16901	i965/fs: Fix extract_i8/u8 to a 64-bit destination The MOV instruction can extract bytes to words/double words, and words/double words to quadwords, but not bytes to quadwords. For unsigned byte to quadword, we can read them as words and AND off the high byte and extract to quadword in one instruction. For signed bytes, we need to first sign extend to word and then sign extend that word to a quadword. Fixes the following test on CHV, BXT, and GLK: KHR-GL46.shader_ballot_tests.ShaderBallotBitmasks Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103628 Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>	2017-11-14 10:56:18 -08:00
Matt Turner	cfcfa0b9cd	i965/fs: Split all 32->64-bit MOVs on CHV, BXT, GLK Fixes the following tests on CHV, BXT, and GLK: KHR-GL46.shader_ballot_tests.ShaderBallotFunctionBallot dEQP-VK.spirv_assembly.instruction.compute.uconvert.uint32_to_int64 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103115	2017-11-14 10:56:18 -08:00
Jason Ekstrand	951a5dc4cc	intel/nir: Use the correct indirect lowering masks in link_shaders Previously, if we were linking a vec4 VS with a SIMD8/16 FS, we wouldn't lower indirects on the fragment shader which is wrong. Instead of using a single indirect mask, take advantage of our new little helper. Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com> Cc: mesa-stable@lists.freedesktop.org	2017-11-08 20:10:04 -08:00
Jason Ekstrand	3e63cf893f	intel/nir: Break the linking code into a helper in brw_nir.c Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com> Cc: mesa-stable@lists.freedesktop.org	2017-11-08 14:09:51 -08:00
Jason Ekstrand	7364f080f9	intel/nir: Add a helper for getting the NoIndirect mask Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com> Cc: mesa-stable@lists.freedesktop.org	2017-11-08 14:09:49 -08:00
Jason Ekstrand	d002950e54	intel/fs/nir: Return Q types from brw_reg_type_for_bit_size Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>	2017-11-07 10:41:24 -08:00
Jason Ekstrand	dee58ecd2e	intel/fs/nir: Use Q immediates for load_const on gen8+ Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>	2017-11-07 10:41:24 -08:00
Jason Ekstrand	9bb34892bf	intel/fs/nir: Setup immediates based on type in i2b and f2b Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>	2017-11-07 10:41:24 -08:00
Jason Ekstrand	1cb210f4bc	intel/reg: Add helpers for 64-bit integer immediates Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>	2017-11-07 10:41:24 -08:00
Jason Ekstrand	ab9220edd6	nir,intel/compiler: Use a fixed subgroup size The GL_ARB_shader_ballot spec says that gl_SubGroupSizeARB is declared as a uniform. This means that it cannot change across an invocation such as a draw call or a compute dispatch. For compute shaders, we're ok because we only ever use one dispatch size. For fragment, however, the hardware dynamically chooses between SIMD8 and SIMD16 which violates the spec. Instead, let's just pick a subgroup size based on the shader stage. The fixed size we choose for compute shaders is a bit higher than strictly needed but there's no real harm in that. The advantage is that, if they do anything interesting with the value, NIR will see it as an immediate and can optimize better. Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	a026458020	nir/lower_subgroups: Lower ballot intrinsics to the specified bit size Ballot intrinsics return a bitfield of subgroups. In GLSL and some SPIR-V extensions, they return a uint64_t. In SPV_KHR_shader_ballot, they return a uvec4. Also, some back-ends would rather pass around 32-bit values because it's easier than messing with 64-bit all the time. To solve this mess, we make nir_lower_subgroups take a new parameter called ballot_bit_size and it lowers whichever thing it gets in from the source language (uint64_t or uvec4) to a scalar with the specified number of bits. This replaces a chunk of the old lowering code. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	28da82f978	nir: Add a new subgroups lowering pass This commit pulls nir_lower_read_invocations_to_scalar along with most of the guts of nir_opt_intrinsics (which mostly does subgroup lowering) into a new nir_lower_subgroups pass. There are various other bits of subgroup lowering that we're going to want to do so it makes a bit more sense to keep it all together in one pass. We also move it in i965 to happen after nir_lower_system_values because we want to handle the subgroup mask system value intrinsics here. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	1ca3a94427	intel/fs: Don't use automatic exec size inference The automatic exec size inference can accidentally mess things up if we're not careful. For instance, if we have add(4) g38.2<4>D g38.1<8,2,4>D g38.2<8,2,4>D then the destination register will end up having a width of 2 with a horizontal stride of 4 and a vertical stride of 8. The EU emit code sees the width of 2 and decides that we really wanted an exec size of 2 which doesn't do what we wanted. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	dc4cf11dfc	intel/fs: Explicitly set EXECUTE_1 where needed Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	ab378734f5	intel/eu: Explicitly set EXECUTE_1 where needed Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	8280560705	intel/eu: Make automatic exec sizes a configurable option We have had a feature in codegen for some time that tries to automatically infer the execution size of an instruction from the width of its destination. For things such as fixed function GS, clipper, and SF programs, this is very useful because they tend to have lots of hand-rolled register setup and trying to specify the exec size all the time would be prohibitive. For things that come from a higher-level IR, however, it's easier to just set the right size all the time and the automatic exec sizes can, in fact, cause problems. This commit makes it optional while enabling it by default. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>	2017-11-07 10:37:52 -08:00
Jason Ekstrand	7a82ad54bb	intel/fs: Rework zero-length URB write handling Originally we tried to handle this case based on slots_valid. However, there are a number of ways that this can go wrong. For one, we throw away any trailing slots which either aren't written or are set to VARYING_SLOT_PAD. Second, even if PSIZ is a valid slot, we may not actually write anything there. Between the lot of these, it was possible to end up in a case where we tried to do a regular URB write but ended up with a length of 1 which is invalid. This commit moves it to the end and makes it based on a new boolean flag urb_written. Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Cc: mesa-stable@lists.freedesktop.org	2017-11-07 10:37:52 -08:00
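
Note on ac8d4734f6 and d6cd14f213 above: the control-register commit places the rounding mode in bits 5:4 of cr0.0, and the opcode commit mentions that the mode could be set with plain AND/OR. The following is a minimal sketch of that bit manipulation on an ordinary 32-bit value. All names here (`rnd_mode`, `CR0_RND_MODE_SHIFT`, `cr0_set_rnd_mode`, `cr0_get_rnd_mode`) are hypothetical and chosen for illustration; they are not Mesa's actual defines or helpers, and the code does not touch real hardware state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration only; the encodings follow the
 * PRM excerpt quoted in the commit message (bits 5:4 of cr0.0). */
enum rnd_mode {
   RND_MODE_RTNE = 0, /* 00b: round to nearest or even */
   RND_MODE_RU   = 1, /* 01b: round up, toward +inf */
   RND_MODE_RD   = 2, /* 10b: round down, toward -inf */
   RND_MODE_RTZ  = 3, /* 11b: round toward zero */
};

#define CR0_RND_MODE_SHIFT 4
#define CR0_RND_MODE_MASK  (0x3u << CR0_RND_MODE_SHIFT)

/* Return cr0_0 with bits 5:4 replaced by mode: AND clears the old
 * field, OR sets the new one. All other bits are left unchanged. */
static inline uint32_t
cr0_set_rnd_mode(uint32_t cr0_0, enum rnd_mode mode)
{
   assert((unsigned)mode <= 3u);
   return (cr0_0 & ~CR0_RND_MODE_MASK) |
          ((uint32_t)mode << CR0_RND_MODE_SHIFT);
}

/* Read back the rounding mode field from a cr0.0 value. */
static inline enum rnd_mode
cr0_get_rnd_mode(uint32_t cr0_0)
{
   return (enum rnd_mode)((cr0_0 & CR0_RND_MODE_MASK) >> CR0_RND_MODE_SHIFT);
}
```

Clearing the field before ORing in the new value matters because a bare OR could only ever turn bits on, so switching from RTZ (11b) back to RTNE (00b) would silently fail.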


1355 commits