tu: Implement subsampled images

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39868>
This commit is contained in:
Connor Abbott 2026-02-06 16:59:08 -05:00 committed by Marge Bot
parent cc710283a7
commit 4b87df29b3
20 changed files with 2137 additions and 194 deletions


@ -36,11 +36,12 @@ This space exists whenever tiled rendering/GMEM is used, even without FDM. It
is the space used to access GMEM, with the origin at the upper left of the
tile. The hardware automatically transforms rendering space into GMEM space
whenever GMEM is accessed using the various ``*_WINDOW_OFFSET`` registers. The
origin of this space will be called :math:`b_{cs}`, the common bin start, for
reasons that are explained below. When using FDM, coordinates in this space
must be multiplied by the scaling factor :math:`s` derived from the fragment
density map, or equivalently divided by the fragment area (as defined by the
Vulkan specification), with the origin still at the upper left of the tile. For
origin of this space in rendering space, or the value of ``*_WINDOW_OFFSET``,
will be called :math:`b_{cs}`, the common bin start, for reasons that are
explained below. When using FDM, coordinates in this space must be multiplied
by the scaling factor :math:`s` derived from the fragment density map, or
equivalently divided by the fragment area (as defined by the Vulkan
specification), with the origin still at the upper left of the tile. For
example, if :math:`s_x = 1/2`, then the bin is half as wide as it would've been
without FDM and all coordinates in this space must be divided by 2.
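The scaling rule above can be sketched in C. This is an illustrative helper, not the actual turnip API: it converts a framebuffer-space coordinate to rendering space by dividing by the fragment area (equivalently, multiplying by the scale :math:`s`).

```c
#include <assert.h>

/* Hypothetical helper: a fragment area of 2 corresponds to s = 1/2, so a
 * framebuffer-space coordinate maps to rendering space by integer division
 * by the fragment area. */
static unsigned
fb_to_rendering_coord(unsigned fb_coord, unsigned frag_area)
{
   return fb_coord / frag_area;
}
```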
@ -81,6 +82,104 @@ a multiple of :math:`1 / s`. This is a natural constraint anyway, because if
it wasn't the case then the bin would start in the middle of a fragment which
isn't possible to handle correctly.
Subsampled Space
^^^^^^^^^^^^^^^^
When using subsampled images, this is the space where the bin is stored in the
underlying subsampled image. When sampling from a subsampled image, the driver
inserts shader code to transform from framebuffer space to subsampled space
using metadata written when rendering to the image.
Accesses towards the edge of a bin may partially bleed into its neighboring bin
with linear or bicubic sampling. If its neighbor has a different scale or isn't
adjacent in subsampled space, we will sample the incorrect data or empty space
and return a corrupted result. In order to handle this, we need to insert an
"apron" around problematic edges and corners. This is done by blitting from the
nearest neighbor of each bin after the renderpass.
Subsampled space is normally scaled down similarly to rendering space, which is
the point of subsampled images in the first place, but the origin of the bin
is up to the driver. The driver chooses the origin of each bin when rendering a
given render pass and then encodes it in the metadata used when sampling the
image. Bins that require an apron must be far enough away from each other that
their aprons don't intersect, and all of the bins must be contained within the
underlying image.
Even when subsampled images are in use, not all bins may be subsampled. For
example, there may not be enough space to insert aprons around every bin. When
this is the case, subsampled space is not scaled like rendering space; that is,
we expand the bin when resolving, as with non-subsampled images. However, the
origin of the bin may still differ from the framebuffer space origin.
The algorithm used by turnip to calculate the bin layout in subsampled space
is to start with a "default" layout of the bins and then recursively resolve
conflicts caused by bins whose aprons are too close together. The first
strategy used is to shift one of the bins over by a certain amount. The second,
fallback strategy is to un-subsample both neighboring bins, expanding them
so that they touch each other and there is no apron.
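The two conflict-resolution strategies can be sketched in one dimension. This is an illustrative model, not the actual turnip implementation; the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* One-dimensional model of two neighboring bins in subsampled space. Each
 * bin needs an apron of `apron` pixels past its edge, so at least 2*apron
 * pixels must separate adjacent bins. */
struct bin1d {
   int start;        /* origin in subsampled space */
   int size;         /* scaled-down bin size */
   bool subsampled;
};

enum resolution { NO_CONFLICT, SHIFTED, UNSUBSAMPLED };

static enum resolution
resolve_conflict(struct bin1d *a, struct bin1d *b, int apron, int max_shift)
{
   int gap = b->start - (a->start + a->size);
   int needed = 2 * apron; /* room for both bins' aprons */
   if (gap >= needed)
      return NO_CONFLICT;
   /* First strategy: shift the second bin over. */
   if (needed - gap <= max_shift) {
      b->start += needed - gap;
      return SHIFTED;
   }
   /* Fallback: un-subsample both bins so they touch and need no apron. */
   a->subsampled = false;
   b->subsampled = false;
   return UNSUBSAMPLED;
}
```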
One natural choice for the "default" layout is to just use rendering space.
That is, start each bin at :math:`b_{cs}` by default. That mostly works, except
for two problems. The first is easier to solve, and has to do with the border
when sampling: it is allowed to use border colors with subsampled images, and
when that happens and the framebuffer covers the entire image, it is expected
that sampling around the edge correctly blends the border color and the edge
pixel. In order for that to happen, bins that touch or intersect the edge of
the framebuffer in framebuffer space have to be shifted over so that their edge
touches the framebuffer edge in subsampled space too.
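The edge rule above can be sketched as follows. This is a hypothetical one-dimensional helper, not the driver's code: a bin whose right edge touches the framebuffer edge in framebuffer space is shifted so that its right edge also touches the (scaled) framebuffer edge in subsampled space.

```c
/* Hypothetical sketch: choose a bin's origin in subsampled space. If the
 * bin touches the framebuffer's right edge, pin its scaled right edge to
 * the subsampled framebuffer's right edge; otherwise keep the default
 * origin. */
static int
subsampled_bin_start(int bin_start, int bin_width, int fb_width,
                     int subsampled_fb_width, int scaled_bin_width)
{
   if (bin_start + bin_width >= fb_width) /* touches the right edge */
      return subsampled_fb_width - scaled_bin_width;
   return bin_start; /* default layout: same origin as rendering space */
}
```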
Doing this also allows an optimization: because we are storing the tile's
contents one to one from GMEM to system memory instead of scaling it up, we can
use the dedicated resolve engine instead of GRAS to resolve the tile to system
memory. Normally GRAS has to be used with non-subsampled images to scale up the
bin when resolving. However this doesn't work for tiles around the right and
bottom edge where we have to shift over the tile to align to the edge. This
also gets a bit tricky when the tile is shifted to avoid apron conflicts, because
normally the resolve engine would write the tile directly without shifting.
However there is a trick we can use to avoid falling back to GRAS: by
overriding ``RB_RESOLVE_WINDOW_OFFSET``, we can effectively apply an offset by
telling the resolve engine that the tile was rendered somewhere else. This
means that the shift amount has to be aligned to the alignment of
``RB_RESOLVE_WINDOW_OFFSET``, which is ``tile_align_*`` in the device info.
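The alignment constraint implied above can be expressed as a simple check. This is an illustrative sketch, not the driver's actual code; the function name is hypothetical and ``tile_align_w``/``tile_align_h`` stand in for the ``tile_align_*`` device-info values.

```c
#include <stdbool.h>

/* The resolve engine can only absorb a shift via RB_RESOLVE_WINDOW_OFFSET
 * when the shift is a multiple of the window-offset alignment; otherwise
 * the driver must fall back to GRAS. */
static bool
can_use_resolve_engine(unsigned shift_x, unsigned shift_y,
                       unsigned tile_align_w, unsigned tile_align_h)
{
   return (shift_x % tile_align_w) == 0 && (shift_y % tile_align_h) == 0;
}
```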
The other problem with making subsampled space equal rendering space is that
with an FDM offset, rendering space can be arbitrarily larger than framebuffer
space, and we may overflow the attachments by up to the size of a tile. The API
is designed to allow the driver to allocate extra slop space in the image in
this case, because there are image create flags for subsampled and FDM offset,
however the maximum tile size is far too large and images would take up
far too much memory if we allocated enough slop space for the largest
possible tile. An alternative is to use a hybrid of framebuffer space and
rendering space: shift over the tiles by :math:`b_o` so that their origin
is :math:`b_s` instead of :math:`b_{cs}`, but leave them scaled down. This
requires no slop space whatsoever, because the bins are shifted inside the
original image, but we can no longer use the resolve engine as the tile offsets
are no longer aligned to ``tile_align_*``. So in the driver we combine both
approaches: we calculate an aligned offset :math:`b_o'` which is :math:`b_o`
aligned down to ``tile_align_*`` and shift over the tiles by subtracting
:math:`b_o'` instead of :math:`b_o`. This requires slop space, but only
:math:`b_o - b_o'` slop space is required, which is always less than
``tile_align_*``. As usual the first row/column are not shifted over in x/y
respectively.
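The hybrid offset split described above reduces to two lines of arithmetic. A minimal sketch, assuming ``tile_align`` is a power of two; the function name is illustrative:

```c
/* Split the FDM offset b_o into an aligned part b_o' (applied via
 * RB_RESOLVE_WINDOW_OFFSET) and a remainder (covered by slop space in the
 * image). The remainder is always strictly less than tile_align. */
static void
split_fdm_offset(unsigned b_o, unsigned tile_align,
                 unsigned *b_o_aligned, unsigned *slop)
{
   *b_o_aligned = b_o & ~(tile_align - 1); /* align down */
   *slop = b_o - *b_o_aligned;             /* < tile_align */
}
```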
Here is an example of what a subsampled image looks like in memory, in this
case without any FDM offset:
.. figure:: subsampled_annotated.jpg
:alt: Example of a subsampled image
Note how some of the bins are shifted over to make space for the apron. After
applying the coordinate transform when sampling, this is the final image:
.. figure:: subsampled_final.jpg
:alt: Example of a subsampled image after coordinate transform
When ``VK_EXT_custom_resolve`` and subsampled images are used together, the
custom resolve subpass writes directly to the subsampled image. This means that
it needs to use subsampled space instead of rendering space, which in practice
means replacing :math:`b_{cs}` with the origin of the bin in the subsampled
image.
Viewport and Scissor Patching
-----------------------------

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.6 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.9 MiB


@ -48,6 +48,7 @@ libtu_files = files(
'tu_rmv.cc',
'tu_shader.cc',
'tu_suballoc.cc',
'tu_subsampled_image.cc',
'tu_tile_config.cc',
'tu_util.cc',
)


@ -647,6 +647,51 @@ build_blit_vs_shader(void)
return b->shader;
}
static nir_shader *
build_multi_blit_vs_shader(void)
{
nir_builder _b =
nir_builder_init_simple_shader(MESA_SHADER_VERTEX, NULL, "multi blit vs");
nir_builder *b = &_b;
nir_variable *out_pos =
nir_create_variable_with_location(b->shader, nir_var_shader_out,
VARYING_SLOT_POS,
glsl_vec4_type());
b->shader->info.num_ubos = 1;
nir_def *vertex = nir_load_vertex_id(b);
nir_def *pos_and_coords =
nir_load_ubo(b, 4, 32, nir_imm_int(b, 0),
nir_ishl_imm(b, vertex, 4),
.align_mul = 16,
.align_offset = 0,
.range = 1 << 16);
nir_def *pos = nir_channels(b, pos_and_coords, 0x3);
nir_def *coords = nir_channels(b, pos_and_coords, 0xc);
pos = nir_vec4(b, nir_channel(b, pos, 0),
nir_channel(b, pos, 1),
nir_imm_float(b, 0.0),
nir_imm_float(b, 1.0));
nir_store_var(b, out_pos, pos, 0xf);
nir_variable *out_coords =
nir_create_variable_with_location(b->shader, nir_var_shader_out,
VARYING_SLOT_VAR0,
glsl_vec_type(3));
coords = nir_vec3(b, nir_channel(b, coords, 0), nir_channel(b, coords, 1),
nir_imm_float(b, 0));
nir_store_var(b, out_coords, coords, 0x7);
return b->shader;
}
static nir_shader *
build_clear_vs_shader(void)
{
@ -823,6 +868,7 @@ tu_init_clear_blit_shaders(struct tu_device *dev)
{
unsigned offset = 0;
compile_shader(dev, build_blit_vs_shader(), 3, &offset, GLOBAL_SH_VS_BLIT);
compile_shader(dev, build_multi_blit_vs_shader(), 3, &offset, GLOBAL_SH_VS_MULTI_BLIT);
compile_shader(dev, build_clear_vs_shader(), 2, &offset, GLOBAL_SH_VS_CLEAR);
compile_shader(dev, build_blit_fs_shader(false), 0, &offset, GLOBAL_SH_FS_BLIT);
compile_shader(dev, build_blit_fs_shader(true), 0, &offset, GLOBAL_SH_FS_BLIT_ZSCALE);
@ -846,6 +892,7 @@ tu_destroy_clear_blit_shaders(struct tu_device *dev)
enum r3d_type {
R3D_CLEAR,
R3D_BLIT,
R3D_MULTI_BLIT,
};
template <chip CHIP>
@ -855,7 +902,8 @@ r3d_common(struct tu_cmd_buffer *cmd, struct tu_cs *cs, enum r3d_type type,
VkSampleCountFlagBits dst_samples)
{
enum global_shader vs_id =
type == R3D_CLEAR ? GLOBAL_SH_VS_CLEAR : GLOBAL_SH_VS_BLIT;
type == R3D_CLEAR ? GLOBAL_SH_VS_CLEAR :
(type == R3D_MULTI_BLIT ? GLOBAL_SH_VS_MULTI_BLIT : GLOBAL_SH_VS_BLIT);
struct ir3_shader_variant *vs = cmd->device->global_shader_variants[vs_id];
uint64_t vs_iova = cmd->device->global_shader_va[vs_id];
@ -1056,6 +1104,49 @@ r3d_coords(struct tu_cmd_buffer *cmd,
r3d_coords_raw(cmd, cs, coords);
}
static void
r3d_coords_multi(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
const VkRect2D *dst,
const tu_rect2d_float *src,
unsigned count)
{
struct tu_cs sub_cs;
VkResult result =
tu_cs_begin_sub_stream_aligned(&cmd->sub_cs, count * 2, 4, &sub_cs);
if (result != VK_SUCCESS) {
vk_command_buffer_set_error(&cmd->vk, result);
return;
}
for (unsigned i = 0; i < count; i++) {
tu_cs_emit(&sub_cs, fui(dst[i].offset.x));
tu_cs_emit(&sub_cs, fui(dst[i].offset.y));
tu_cs_emit(&sub_cs, fui(src[i].x_start));
tu_cs_emit(&sub_cs, fui(src[i].y_start));
tu_cs_emit(&sub_cs, fui(dst[i].offset.x + dst[i].extent.width));
tu_cs_emit(&sub_cs, fui(dst[i].offset.y + dst[i].extent.height));
tu_cs_emit(&sub_cs, fui(src[i].x_end));
tu_cs_emit(&sub_cs, fui(src[i].y_end));
}
struct tu_draw_state coords_ubo = tu_cs_end_draw_state(&cmd->sub_cs,
&sub_cs);
tu_cs_emit_pkt7(cs, CP_LOAD_STATE6_GEOM, 5);
tu_cs_emit(cs,
CP_LOAD_STATE6_0_DST_OFF(0) |
CP_LOAD_STATE6_0_STATE_TYPE(ST6_UBO) |
CP_LOAD_STATE6_0_STATE_SRC(SS6_DIRECT) |
CP_LOAD_STATE6_0_STATE_BLOCK(SB6_VS_SHADER) |
CP_LOAD_STATE6_0_NUM_UNIT(1));
tu_cs_emit(cs, CP_LOAD_STATE6_1_EXT_SRC_ADDR(0));
tu_cs_emit(cs, CP_LOAD_STATE6_2_EXT_SRC_ADDR_HI(0));
tu_cs_emit_qw(cs,
coords_ubo.iova |
(uint64_t)A6XX_UBO_1_SIZE(count * 2) << 32);
}
static void
r3d_clear_value(struct tu_cmd_buffer *cmd, struct tu_cs *cs, enum pipe_format format, const VkClearValue *val)
{
@ -1290,6 +1381,7 @@ r3d_src_load(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
const struct tu_image_view *iview,
uint32_t layer,
VkFilter filter,
bool override_swap)
{
uint32_t desc[FDL6_TEX_CONST_DWORDS];
@ -1321,7 +1413,7 @@ r3d_src_load(struct tu_cmd_buffer *cmd,
r3d_src_common<CHIP>(cmd, cs, desc,
iview->view.layer_size * layer,
iview->view.ubwc_layer_size * layer,
VK_FILTER_NEAREST);
filter);
}
template <chip CHIP>
@ -1331,7 +1423,7 @@ r3d_src_gmem_load(struct tu_cmd_buffer *cmd,
const struct tu_image_view *iview,
uint32_t layer)
{
r3d_src_load<CHIP>(cmd, cs, iview, layer, true);
r3d_src_load<CHIP>(cmd, cs, iview, layer, VK_FILTER_NEAREST, true);
}
template <chip CHIP>
@ -1339,9 +1431,10 @@ static void
r3d_src_sysmem_load(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
const struct tu_image_view *iview,
uint32_t layer)
uint32_t layer,
VkFilter filter)
{
r3d_src_load<CHIP>(cmd, cs, iview, layer, false);
r3d_src_load<CHIP>(cmd, cs, iview, layer, filter, false);
}
template <chip CHIP>
@ -1594,6 +1687,9 @@ enum r3d_blit_param {
R3D_Z_SCALE = 1 << 0,
R3D_DST_GMEM = 1 << 1,
R3D_COPY = 1 << 2,
R3D_USE_MULTI_BLIT = 1 << 3,
R3D_OUTSIDE_PASS = 1 << 4,
R3D_OVERLAPPING = 1 << 5,
};
template <chip CHIP>
@ -1617,7 +1713,7 @@ r3d_setup(struct tu_cmd_buffer *cmd,
blit_param & R3D_DST_GMEM);
fixup_dst_format(src_format, &dst_format, &fmt);
if (!cmd->state.pass) {
if (!cmd->state.pass || (blit_param & R3D_OUTSIDE_PASS)) {
tu_emit_cache_flush_ccu<CHIP>(cmd, cs, TU_CMD_CCU_SYSMEM);
tu6_emit_window_scissor<CHIP>(cs, 0, 0, 0x3fff, 0x3fff);
if (cmd->device->physical_device->info->props.has_hw_bin_scaling) {
@ -1651,7 +1747,8 @@ r3d_setup(struct tu_cmd_buffer *cmd,
}
}
const enum r3d_type type = (clear) ? R3D_CLEAR : R3D_BLIT;
const enum r3d_type type = (clear) ? R3D_CLEAR :
((blit_param & R3D_USE_MULTI_BLIT) ? R3D_MULTI_BLIT : R3D_BLIT);
r3d_common<CHIP>(cmd, cs, type, 1, blit_param & R3D_Z_SCALE, src_samples,
dst_samples);
@ -1696,7 +1793,17 @@ r3d_setup(struct tu_cmd_buffer *cmd,
tu_cs_emit_regs(cs, GRAS_VRS_CONFIG(CHIP));
}
tu_cs_emit_regs(cs, GRAS_SC_CNTL(CHIP, .ccusinglecachelinesize = 2));
/* We need to handle overlapping blits the same as feedback loops, which
* means setting this bit to avoid corruption due to UBWC flag caches
* becoming desynchronized. On a7xx+ UBWC caches are coherent.
*/
enum a6xx_single_prim_mode prim_mode =
CHIP == A6XX && (blit_param & R3D_OVERLAPPING) && ubwc ?
FLUSH_PER_OVERLAP_AND_OVERWRITE : NO_FLUSH;
tu_cs_emit_regs(cs, GRAS_SC_CNTL(CHIP,
.single_prim_mode = prim_mode,
.ccusinglecachelinesize = 2));
/* Disable sample counting in order to not affect occlusion query. */
tu_cs_emit_regs(cs, A6XX_RB_SAMPLE_COUNTER_CNTL(.disable = true));
@ -1738,6 +1845,17 @@ r3d_run_vis(struct tu_cmd_buffer *cmd, struct tu_cs *cs)
tu_cs_emit(cs, 2); /* vertex count */
}
static void
r3d_run_multi(struct tu_cmd_buffer *cmd, struct tu_cs *cs, unsigned count)
{
tu_cs_emit_pkt7(cs, CP_DRAW_INDX_OFFSET, 3);
tu_cs_emit(cs, CP_DRAW_INDX_OFFSET_0_PRIM_TYPE(DI_PT_RECTLIST) |
CP_DRAW_INDX_OFFSET_0_SOURCE_SELECT(DI_SRC_SEL_AUTO_INDEX) |
CP_DRAW_INDX_OFFSET_0_VIS_CULL(IGNORE_VISIBILITY));
tu_cs_emit(cs, 1); /* instance count */
tu_cs_emit(cs, count * 2); /* vertex count */
}
template <chip CHIP>
static void
r3d_teardown(struct tu_cmd_buffer *cmd, struct tu_cs *cs)
@ -3620,12 +3738,6 @@ tu_CmdResolveImage2(VkCommandBuffer commandBuffer,
}
TU_GENX(tu_CmdResolveImage2);
#define for_each_layer(layer, layer_mask, layers) \
for (uint32_t layer = 0; \
layer < ((layer_mask) ? (util_logbase2(layer_mask) + 1) : layers); \
layer++) \
if (!layer_mask || (layer_mask & BIT(layer)))
template <chip CHIP>
static void
resolve_sysmem(struct tu_cmd_buffer *cmd,
@ -3673,7 +3785,7 @@ resolve_sysmem(struct tu_cmd_buffer *cmd,
}
} else {
if (ops == &r3d_ops<CHIP>) {
r3d_src_sysmem_load<CHIP>(cmd, cs, src, i);
r3d_src_sysmem_load<CHIP>(cmd, cs, src, i, VK_FILTER_NEAREST);
} else {
ops->src(cmd, cs, &src->view, i, VK_FILTER_NEAREST, dst_format);
}
@ -4984,6 +5096,124 @@ tu7_generic_clear_attachment(struct tu_cmd_buffer *cmd,
trace_end_generic_clear(&cmd->rp_trace, cs);
}
/* Transform the render area from framebuffer space to subsampled space. Be
* conservative if the render area partially covers a fragment.
*/
static VkRect2D
transform_render_area(VkRect2D render_area, const struct tu_tile_config *tile,
const VkRect2D *bins, unsigned view)
{
/* Calculate transform from framebuffer space to subsampled space.
*/
VkExtent2D frag_area = (tile->subsampled_views & (1u << view)) ?
tile->frag_areas[view] : (VkExtent2D) { 1, 1 };
VkOffset2D offset = {
tile->subsampled_pos[view].offset.x -
bins[view].offset.x / frag_area.width,
tile->subsampled_pos[view].offset.y -
bins[view].offset.y / frag_area.height,
};
/* In the unlikely case subsampling was disabled due to running out of
* tiles, don't transform the render area.
*/
if (!tile->subsampled)
offset = (VkOffset2D) { 0, 0 };
unsigned x1 =
render_area.offset.x / frag_area.width + offset.x;
unsigned x2 =
DIV_ROUND_UP(render_area.offset.x + render_area.extent.width,
frag_area.width) + offset.x;
unsigned y1 =
render_area.offset.y / frag_area.height + offset.y;
unsigned y2 =
DIV_ROUND_UP(render_area.offset.y + render_area.extent.height,
frag_area.height) + offset.y;
return (VkRect2D) {
{ x1, y1 }, { x2 - x1, y2 - y1 }
};
}
struct apply_blit_scissor_state {
unsigned view;
VkRect2D render_area;
};
template <chip CHIP>
static void
fdm_apply_blit_scissor(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
void *data,
VkOffset2D common_bin_offset,
const VkOffset2D *hw_viewport_offsets,
unsigned views,
const struct tu_tile_config *tile,
const VkRect2D *bins,
bool binning)
{
struct tu_physical_device *phys_dev = cmd->device->physical_device;
const struct apply_blit_scissor_state *state =
(const struct apply_blit_scissor_state *)data;
unsigned view = MIN2(state->view, views - 1);
VkRect2D subsampled_render_area =
transform_render_area(state->render_area, tile, bins, view);
VkOffset2D pos = tile->subsampled ?
tile->subsampled_pos[view].offset : common_bin_offset;
VkRect2D scissor = subsampled_render_area;
if (tile->subsampled) {
/* Intersect the render area with the subsampled tile. We don't want to
* store the whole unscaled tile, and the unscaled tile may jut into the
* next tile.
*/
scissor.offset.x = MAX2(scissor.offset.x, tile->subsampled_pos[view].offset.x);
scissor.offset.y = MAX2(scissor.offset.y, tile->subsampled_pos[view].offset.y);
scissor.extent.width =
MIN2(subsampled_render_area.offset.x +
subsampled_render_area.extent.width,
tile->subsampled_pos[view].offset.x +
tile->subsampled_pos[view].extent.width) - scissor.offset.x;
scissor.extent.height =
MIN2(subsampled_render_area.offset.y +
subsampled_render_area.extent.height,
tile->subsampled_pos[view].offset.y +
tile->subsampled_pos[view].extent.height) - scissor.offset.y;
}
if (bins[view].extent.width == 0 && bins[view].extent.height == 0) {
tu_cs_emit_regs(cs,
A6XX_RB_RESOLVE_CNTL_1(.x = 1, .y = 1),
A6XX_RB_RESOLVE_CNTL_2(.x = 0, .y = 0));
tu_cs_emit_regs(cs,
A6XX_RB_RESOLVE_WINDOW_OFFSET(.x = 0, .y = 0));
} else {
/* Note: we will not dynamically enable CCU_RESOLVE for stores unless the
* offset is aligned, but this patchpoint will be executed anyway so we
* have to do something and not assert in the builder.
*/
uint32_t x1 = scissor.offset.x &
~(phys_dev->info->gmem_align_w - 1);
uint32_t y1 = scissor.offset.y &
~(phys_dev->info->gmem_align_h - 1);
uint32_t x2 = ALIGN_POT(scissor.offset.x +
scissor.extent.width,
phys_dev->info->gmem_align_w) - 1;
uint32_t y2 = ALIGN_POT(scissor.offset.y +
scissor.extent.height,
phys_dev->info->gmem_align_h) - 1;
tu_cs_emit_regs(cs,
A6XX_RB_RESOLVE_CNTL_1(.x = x1, .y = y1),
A6XX_RB_RESOLVE_CNTL_2(.x = x2, .y = y2));
tu_cs_emit_regs(cs,
A6XX_RB_RESOLVE_WINDOW_OFFSET(.x = pos.x, .y = pos.y));
}
}
template <chip CHIP>
static void
tu_emit_blit(struct tu_cmd_buffer *cmd,
@ -5041,8 +5271,17 @@ tu_emit_blit(struct tu_cmd_buffer *cmd,
event_blit_setup(cs, buffer_id, attachment, blit_event_type, clear_mask);
for_each_layer(i, attachment->used_views, cmd->state.framebuffer->layers) {
if (scissor_per_layer)
if (cmd->state.pass->has_fdm && cmd->state.fdm_subsampled) {
struct apply_blit_scissor_state state = {
.view = i,
.render_area = scissor_per_layer ?
cmd->state.render_areas[i] : cmd->state.render_areas[0],
};
tu_create_fdm_bin_patchpoint(cmd, cs, 5, TU_FDM_SKIP_BINNING,
fdm_apply_blit_scissor<CHIP>, state);
} else if (scissor_per_layer) {
tu6_emit_blit_scissor(cmd, cs, i, align_scissor);
}
event_blit_dst_view blt_view = blt_view_from_tu_view(iview, i);
event_blit_run<CHIP>(cmd, cs, attachment, &blt_view, separate_stencil);
}
@ -5331,7 +5570,8 @@ store_cp_blit(struct tu_cmd_buffer *cmd,
{
r2d_setup_common<CHIP>(cmd, cs, src_format, dst_format,
VK_IMAGE_ASPECT_COLOR_BIT, 0, false,
dst_iview->view.ubwc_enabled, true);
dst_iview->view.ubwc_enabled,
true);
if (dst_iview->image->vk.format == VK_FORMAT_D32_SFLOAT_S8_UINT) {
if (!separate_stencil) {
@ -5509,13 +5749,16 @@ tu_attachment_store_unaligned(struct tu_cmd_buffer *cmd, uint32_t a)
if (TU_DEBUG(UNALIGNED_STORE))
return true;
/* We always use the unaligned store path when scaling rendering. */
if (cmd->state.pass->has_fdm)
return true;
unsigned render_area_count =
cmd->state.per_layer_render_area ? cmd->state.pass->num_views : 1;
/* With subsampling, the formula below doesn't work, but we already
* conditionally use A2D for the unaligned blits at the edge. Just return
* false here.
*/
if (cmd->state.fdm_subsampled)
return false;
for (unsigned i = 0; i < render_area_count; i++) {
const VkRect2D *render_area = &cmd->state.render_areas[i];
uint32_t x1 = render_area->offset.x;
@ -5564,6 +5807,9 @@ tu_choose_gmem_layout(struct tu_cmd_buffer *cmd)
{
cmd->state.gmem_layout = TU_GMEM_LAYOUT_FULL;
if (cmd->state.pass->has_fdm)
cmd->state.gmem_layout = TU_GMEM_LAYOUT_AVOID_CCU;
for (unsigned i = 0; i < cmd->state.pass->attachment_count; i++) {
if (!cmd->state.attachments[i])
continue;
@ -5620,8 +5866,9 @@ fdm_apply_store_coords(struct tu_cmd_buffer *cmd,
{
const struct apply_store_coords_state *state =
(const struct apply_store_coords_state *)data;
VkExtent2D frag_area = tile->frag_areas[MIN2(state->view, views - 1)];
VkRect2D bin = bins[MIN2(state->view, views - 1)];
unsigned view = MIN2(state->view, views - 1);
VkExtent2D frag_area = tile->frag_areas[view];
VkRect2D bin = bins[view];
/* The bin width/height must be a multiple of the frag_area to make sure
* that the scaling happens correctly. This means there may be some
@ -5643,10 +5890,22 @@ fdm_apply_store_coords(struct tu_cmd_buffer *cmd,
GRAS_A2D_SRC_YMIN(CHIP, 1),
GRAS_A2D_SRC_YMAX(CHIP, 0));
} else {
tu_cs_emit_regs(cs,
GRAS_A2D_DEST_TL(CHIP, .x = bin.offset.x, .y = bin.offset.y),
GRAS_A2D_DEST_BR(CHIP, .x = bin.offset.x + bin.extent.width - 1,
.y = bin.offset.y + bin.extent.height - 1));
VkOffset2D start =
tile->subsampled ? tile->subsampled_pos[view].offset : bin.offset;
if (tile->subsampled_views & (1u << view)) {
/* Subsampled blits don't scale up the bin, and go to the subsampled
* destination.
*/
tu_cs_emit_regs(cs,
GRAS_A2D_DEST_TL(CHIP, .x = start.x, .y = start.y),
GRAS_A2D_DEST_BR(CHIP, .x = start.x + scaled_width - 1,
.y = start.y + scaled_height - 1));
} else {
tu_cs_emit_regs(cs,
GRAS_A2D_DEST_TL(CHIP, .x = start.x, .y = start.y),
GRAS_A2D_DEST_BR(CHIP, .x = start.x + bin.extent.width - 1,
.y = start.y + bin.extent.height - 1));
}
tu_cs_emit_regs(cs,
GRAS_A2D_SRC_XMIN(CHIP, common_bin_offset.x),
GRAS_A2D_SRC_XMAX(CHIP, common_bin_offset.x + scaled_width - 1),
@ -5655,6 +5914,45 @@ fdm_apply_store_coords(struct tu_cmd_buffer *cmd,
}
}
struct apply_render_area_state {
unsigned view;
VkRect2D render_area;
};
template <chip CHIP>
static void
fdm_apply_render_area(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
void *data,
VkOffset2D common_bin_offset,
const VkOffset2D *hw_viewport_offsets,
unsigned views,
const struct tu_tile_config *tile,
const VkRect2D *bins,
bool binning)
{
struct apply_render_area_state *state =
(struct apply_render_area_state *)data;
unsigned view = MIN2(state->view, views - 1);
VkRect2D subsampled_render_area =
transform_render_area(state->render_area, tile, bins, view);
unsigned x1 = subsampled_render_area.offset.x;
unsigned x2 = subsampled_render_area.offset.x +
subsampled_render_area.extent.width - 1;
unsigned y1 = subsampled_render_area.offset.y;
unsigned y2 = subsampled_render_area.offset.y +
subsampled_render_area.extent.height - 1;
tu_cs_emit_regs(cs,
GRAS_A2D_SCISSOR_TL(CHIP, .x = x1,
.y = y1,),
GRAS_A2D_SCISSOR_BR(CHIP, .x = x2,
.y = y2,));
}
template <chip CHIP>
void
tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
@ -5703,7 +6001,10 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
bool use_fast_path = !unaligned && !mismatched_mutability &&
!resolve_d24s8_s8 &&
(a == gmem_a || blit_can_resolve(dst->format));
(a == gmem_a || blit_can_resolve(dst->format)) &&
(!cmd->state.pass->has_fdm || CHIP >= A7XX);
bool fast_path_conditional = use_fast_path && cmd->state.pass->has_fdm;
trace_start_gmem_store(&cmd->rp_trace, cs, cmd, dst->format, use_fast_path, unaligned);
@ -5717,6 +6018,11 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
/* use fast path when render area is aligned, except for unsupported resolve cases */
if (use_fast_path) {
if (fast_path_conditional) {
tu_cond_exec_start(cs, CP_COND_REG_EXEC_0_MODE(PRED_TEST) |
CP_COND_REG_EXEC_0_PRED_BIT(TU_PREDICATE_FAST_STORE));
}
if (store_common)
tu_emit_blit<CHIP>(cmd, cs, resolve_group, dst_iview, src, clear_value,
BLIT_EVENT_STORE, per_layer_render_area, true, false);
@ -5724,16 +6030,25 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
tu_emit_blit<CHIP>(cmd, cs, resolve_group, dst_iview, src, clear_value,
BLIT_EVENT_STORE, per_layer_render_area, true, true);
if (cond_exec) {
tu_end_load_store_cond_exec(cmd, cs, false);
}
if (fast_path_conditional) {
tu_cond_exec_end(cs);
} else {
if (cond_exec) {
tu_end_load_store_cond_exec(cmd, cs, false);
}
trace_end_gmem_store(&cmd->rp_trace, cs);
return;
trace_end_gmem_store(&cmd->rp_trace, cs);
return;
}
}
assert(cmd->state.gmem_layout == TU_GMEM_LAYOUT_AVOID_CCU);
if (fast_path_conditional) {
tu_cond_exec_start(cs, CP_COND_REG_EXEC_0_MODE(PRED_TEST) |
CP_COND_REG_EXEC_0_PRED_BIT(TU_PREDICATE_NO_FAST_STORE));
}
enum pipe_format src_format = vk_format_to_pipe_format(src->format);
if (src_format == PIPE_FORMAT_Z32_FLOAT_S8X24_UINT)
src_format = PIPE_FORMAT_Z32_FLOAT;
@ -5773,7 +6088,7 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
if (!cmd->state.pass->has_fdm) {
r2d_coords<CHIP>(cmd, cs, render_area->offset, render_area->offset,
render_area->extent);
} else {
} else if (!cmd->state.fdm_subsampled) {
/* Usually GRAS_2D_RESOLVE_CNTL_* clips the destination to the bin
* area and the coordinates span the entire render area, but for
* FDM we need to scale the coordinates so we need to take the
@ -5795,7 +6110,7 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
if (!cmd->state.pass->has_fdm) {
r2d_coords<CHIP>(cmd, cs, render_area->offset, render_area->offset,
render_area->extent);
} else {
} else if (!cmd->state.fdm_subsampled) {
tu_cs_emit_regs(cs,
GRAS_A2D_SCISSOR_TL(CHIP, .x = render_area->offset.x,
.y = render_area->offset.y,),
@ -5805,6 +6120,17 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
}
if (cmd->state.pass->has_fdm) {
if (cmd->state.fdm_subsampled) {
struct apply_render_area_state state {
.view = i,
.render_area =
per_layer_render_area ? cmd->state.render_areas[i] :
cmd->state.render_areas[0],
};
tu_create_fdm_bin_patchpoint(cmd, cs, 3, TU_FDM_SKIP_BINNING,
fdm_apply_render_area<CHIP>,
state);
}
struct apply_store_coords_state state = {
.view = i,
};
@ -5822,6 +6148,9 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
}
}
if (fast_path_conditional)
tu_cond_exec_end(cs);
if (cond_exec) {
tu_end_load_store_cond_exec(cmd, cs, false);
}
@ -5829,3 +6158,71 @@ tu_store_gmem_attachment(struct tu_cmd_buffer *cmd,
trace_end_gmem_store(&cmd->rp_trace, cs);
}
TU_GENX(tu_store_gmem_attachment);
template <chip CHIP>
static void
blit_subsampled_apron(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
const struct tu_image_view *iview,
enum VkFormat vk_format,
unsigned layer,
const VkRect2D *dst_coord,
const tu_rect2d_float *src_coord,
unsigned count)
{
enum pipe_format format = vk_format_to_pipe_format(vk_format);
r3d_setup<CHIP>(cmd, cs, format, format, VK_IMAGE_ASPECT_COLOR_BIT,
R3D_USE_MULTI_BLIT | R3D_OUTSIDE_PASS | R3D_OVERLAPPING,
false, iview->image->layout[0].ubwc,
VK_SAMPLE_COUNT_1_BIT, VK_SAMPLE_COUNT_1_BIT);
for (unsigned i = 0; i < count; i++) {
assert(dst_coord[i].offset.x + dst_coord[i].extent.width <=
iview->image->layout[0].width0);
assert(dst_coord[i].offset.y + dst_coord[i].extent.height <=
iview->image->layout[0].height0);
}
r3d_coords_multi(cmd, cs, dst_coord, src_coord, count);
if (iview->image->vk.format == VK_FORMAT_D32_SFLOAT_S8_UINT) {
if (vk_format == VK_FORMAT_D32_SFLOAT) {
r3d_src_stencil<CHIP>(cmd, cs, iview, layer, VK_FILTER_NEAREST);
r3d_dst_stencil<CHIP>(cs, iview, layer);
} else {
r3d_src_depth<CHIP>(cmd, cs, iview, layer, VK_FILTER_NEAREST);
r3d_dst_depth<CHIP>(cs, iview, layer);
}
} else {
r3d_src_sysmem_load<CHIP>(cmd, cs, iview, layer, VK_FILTER_NEAREST);
r3d_dst<CHIP>(cs, &iview->view, layer, format);
}
r3d_run_multi(cmd, cs, count);
r3d_teardown<CHIP>(cmd, cs);
}
template <chip CHIP>
void
tu_blit_subsampled_apron(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
const struct tu_image_view *iview,
unsigned layer,
const VkRect2D *dst_coord,
const tu_rect2d_float *src_coord,
unsigned count)
{
if (iview->image->vk.format == VK_FORMAT_D32_SFLOAT_S8_UINT) {
blit_subsampled_apron<CHIP>(cmd, cs, iview, VK_FORMAT_D32_SFLOAT, layer,
dst_coord, src_coord, count);
blit_subsampled_apron<CHIP>(cmd, cs, iview, VK_FORMAT_S8_UINT, layer,
dst_coord, src_coord, count);
} else {
blit_subsampled_apron<CHIP>(cmd, cs, iview, iview->vk.format, layer,
dst_coord, src_coord, count);
}
}
TU_GENX(tu_blit_subsampled_apron);


@ -100,4 +100,14 @@ tu_cmd_fill_buffer_addr(VkCommandBuffer commandBuffer,
VkDeviceSize fillSize,
uint32_t data);
template <chip CHIP>
void
tu_blit_subsampled_apron(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
const struct tu_image_view *iview,
unsigned layer,
const VkRect2D *dst_coord,
const tu_rect2d_float *src_coord,
unsigned count);
#endif /* TU_CLEAR_BLIT_H */


@ -22,6 +22,7 @@
#include "tu_knl.h"
#include "tu_tile_config.h"
#include "tu_tracepoints.h"
#include "tu_subsampled_image.h"
#include "common/freedreno_gpu_event.h"
#include "common/freedreno_lrz.h"
@ -1733,6 +1734,29 @@ tu6_emit_tile_select(struct tu_cmd_buffer *cmd,
}
}
if (CHIP >= A7XX) {
/* Without FDM offset, b_s = b_cs, which is always aligned. With FDM
offset, it may not be aligned. However, with FDM offset and subsampled
images, we shift the
* subsampled coordinates to align the bins, so we can enable the
* fast path except for the last row/column where the end has to be
* aligned to the framebuffer end.
*
* We don't just directly check for aligned-ness because that depends
* on the actual offset, and significantly changing the performance
* could result in jank between frames as the offset changes.
*/
bool use_fast_store = (!fdm_offsets && !bin_scale_en) ||
(tile->subsampled_views == tile->visible_views &&
!tile->subsampled_border);
tu7_set_pred_mask(cs, (1u << TU_PREDICATE_FAST_STORE) |
(1u << TU_PREDICATE_NO_FAST_STORE),
(1u << (use_fast_store ?
TU_PREDICATE_FAST_STORE :
TU_PREDICATE_NO_FAST_STORE)));
}
util_dynarray_foreach (&cmd->fdm_bin_patchpoints,
struct tu_fdm_bin_patchpoint, patch) {
tu_cs_emit_pkt7(cs, CP_MEM_WRITE, 2 + patch->size);
@ -2951,6 +2975,16 @@ tu_renderpass_begin(struct tu_cmd_buffer *cmd)
MESA_VK_DYNAMIC_IA_PRIMITIVE_RESTART_ENABLE);
cmd->state.fdm_enabled = cmd->state.pass->has_fdm;
cmd->state.fdm_subsampled = false;
for (unsigned i = 0; i < cmd->state.framebuffer->attachment_count; i++) {
const struct tu_image_view *iview = cmd->state.attachments[i];
if (iview && (iview->image->vk.create_flags &
VK_IMAGE_CREATE_SUBSAMPLED_BIT_EXT)) {
cmd->state.fdm_subsampled = true;
}
}
}
static inline bool
@ -3169,6 +3203,18 @@ tu6_sysmem_render_end(struct tu_cmd_buffer *cmd, struct tu_cs *cs,
tu_cs_emit_pkt7(cs, CP_SKIP_IB2_ENABLE_GLOBAL, 1);
tu_cs_emit(cs, 0x0);
if (cmd->state.fdm_subsampled) {
for (unsigned i = 0; i < cmd->state.pass->attachment_count; i++) {
if (i != cmd->state.pass->fragment_density_map.attachment &&
cmd->state.pass->attachments[i].store) {
/* emit dummy subsampled metadata since we didn't use FDM */
tu_emit_subsampled_metadata(cmd, &cmd->cs, i,
NULL, NULL, NULL,
cmd->state.framebuffer, NULL);
}
}
}
tu_lrz_sysmem_end<CHIP>(cmd, cs);
/* Clear the resource list for any LRZ resources we emitted at the
@ -3651,6 +3697,73 @@ tu_allocate_transient_attachments(struct tu_cmd_buffer *cmd, bool sysmem)
return VK_SUCCESS;
}
template <chip CHIP>
static void
tu_emit_subsampled(struct tu_cmd_buffer *cmd,
const struct tu_tile_config *tiles,
const struct tu_tiling_config *tiling,
const struct tu_vsc_config *vsc,
const struct tu_framebuffer *fb,
const VkOffset2D *fdm_offsets)
{
struct tu_cs *cs = &cmd->cs;
for (unsigned i = 0; i < cmd->state.pass->attachment_count; i++) {
if (i != cmd->state.pass->fragment_density_map.attachment &&
cmd->state.pass->attachments[i].store) {
tu_emit_subsampled_metadata(cmd, cs, i,
tiles, tiling, vsc,
cmd->state.framebuffer,
fdm_offsets);
}
}
/* We may have subsampled images without FDM if FDM is disabled due to
* multisampled loads/stores, in which case we only need to emit the
* metadata.
*/
if (!tiles)
return;
/* Flush for GMEM -> UCHE */
cmd->state.cache.pending_flush_bits |=
TU_CMD_FLAG_CACHE_INVALIDATE |
TU_CMD_FLAG_WAIT_FOR_IDLE;
VkRect2D *dst =
(VkRect2D *)malloc(8 * vsc->tile_count.width * vsc->tile_count.height *
(sizeof(VkRect2D) + sizeof(struct tu_rect2d_float)));
struct tu_rect2d_float *src =
(struct tu_rect2d_float *)(dst + 8 * vsc->tile_count.width * vsc->tile_count.height);
unsigned count;
/* Iterate over layers and then attachments so that we don't recompute the
* list of areas to copy for each attachment.
*/
for (unsigned layer = 0; layer < MAX2(cmd->state.pass->num_views,
fb->layers); layer++) {
unsigned view = fb->layers > 1 ?
(cmd->state.fdm_per_layer ? layer : 0) : layer;
count = tu_calc_subsampled_aprons(dst, src, view, tiles, tiling, vsc, fb,
fdm_offsets);
if (count != 0) {
for (unsigned i = 0; i < cmd->state.pass->attachment_count; i++) {
if (i != cmd->state.pass->fragment_density_map.attachment &&
cmd->state.pass->attachments[i].store &&
(cmd->state.pass->num_views == 0 ||
(cmd->state.pass->attachments[i].used_views & (1u << layer)) ||
(cmd->state.pass->attachments[i].resolve_views & (1u << layer)))) {
tu_blit_subsampled_apron<CHIP>(cmd, cs, cmd->state.attachments[i],
layer, dst, src, count);
}
}
}
}
free(dst);
}
template <chip CHIP>
static void
tu_cmd_render_tiles(struct tu_cmd_buffer *cmd,
@ -3750,6 +3863,17 @@ tu_cmd_render_tiles(struct tu_cmd_buffer *cmd,
tu6_tile_render_end<CHIP>(cmd, &cmd->cs, autotune_result);
/* Outside of renderpasses we assume all draw states are disabled. We do
* this outside the draw CS for the normal case where 3d gmem stores aren't
* used. Do this before emitting subsampled blits.
*/
tu_disable_draw_states(cmd, &cmd->cs);
if (cmd->state.fdm_subsampled) {
tu_emit_subsampled<CHIP>(cmd, tiles, tiling, vsc, cmd->state.framebuffer,
fdm_offsets);
}
tu_trace_end_render_pass<CHIP>(cmd, true);
/* We have trashed the dynamically-emitted viewport, scissor, and FS params
@ -3791,6 +3915,9 @@ tu_cmd_render_sysmem(struct tu_cmd_buffer *cmd,
tu6_sysmem_render_end<CHIP>(cmd, &cmd->cs, autotune_result);
/* Outside of renderpasses we assume all draw states are disabled. */
tu_disable_draw_states(cmd, &cmd->cs);
tu_clone_trace_range(cmd, &cmd->cs, &cmd->trace,
cmd->trace_renderpass_start,
u_trace_end_iterator(&cmd->rp_trace));
@ -3811,13 +3938,6 @@ tu_cmd_render(struct tu_cmd_buffer *cmd_buffer,
tu_cmd_render_sysmem<CHIP>(cmd_buffer, autotune_result);
else
tu_cmd_render_tiles<CHIP>(cmd_buffer, autotune_result, fdm_offsets);
/* Outside of renderpasses we assume all draw states are disabled. We do
* this outside the draw CS for the normal case where 3d gmem stores aren't
* used.
*/
tu_disable_draw_states(cmd_buffer, &cmd_buffer->cs);
}
static void tu_reset_render_pass(struct tu_cmd_buffer *cmd_buffer)
@ -5907,7 +6027,8 @@ tu_restore_suspended_pass(struct tu_cmd_buffer *cmd,
memcpy(cmd->state.render_areas,
suspended->state.suspended_pass.render_areas,
sizeof(cmd->state.render_areas));
cmd->state.per_layer_render_area = suspended->state.per_layer_render_area;
cmd->state.per_layer_render_area = suspended->state.suspended_pass.per_layer_render_area;
cmd->state.fdm_subsampled = suspended->state.suspended_pass.fdm_subsampled;
cmd->state.gmem_layout = suspended->state.suspended_pass.gmem_layout;
cmd->state.tiling = &cmd->state.framebuffer->tiling[cmd->state.gmem_layout];
cmd->state.lrz = suspended->state.suspended_pass.lrz;
@ -6903,6 +7024,7 @@ tu_CmdBeginRendering(VkCommandBuffer commandBuffer,
tu_lrz_begin_renderpass<CHIP>(cmd);
}
tu_renderpass_begin(cmd);
if (suspending) {
cmd->state.suspended_pass.pass = cmd->state.pass;
@ -6912,6 +7034,8 @@ tu_CmdBeginRendering(VkCommandBuffer commandBuffer,
cmd->state.render_areas, sizeof(cmd->state.render_areas));
cmd->state.suspended_pass.per_layer_render_area =
cmd->state.per_layer_render_area;
cmd->state.suspended_pass.fdm_subsampled =
cmd->state.fdm_subsampled;
cmd->state.suspended_pass.attachments = cmd->state.attachments;
cmd->state.suspended_pass.clear_values = cmd->state.clear_values;
cmd->state.suspended_pass.gmem_layout = cmd->state.gmem_layout;
@ -6919,8 +7043,6 @@ tu_CmdBeginRendering(VkCommandBuffer commandBuffer,
tu_fill_render_pass_state(&cmd->state.vk_rp, cmd->state.pass, cmd->state.subpass);
tu_renderpass_begin(cmd);
if (!resuming) {
cmd->patchpoints_ctx = ralloc_context(NULL);
tu_emit_subpass_begin<CHIP>(cmd);
@ -7676,41 +7798,53 @@ fdm_apply_fs_params(struct tu_cmd_buffer *cmd,
* in which case views will be 1 and we have to replicate the one view
* to all of the layers.
*/
VkExtent2D area = config->frag_areas[MIN2(i, views - 1)];
unsigned view = MIN2(i, views - 1);
VkExtent2D tile_frag_area = config->frag_areas[view];
VkRect2D bin = bins[MIN2(i, views - 1)];
VkOffset2D offset = tu_fdm_per_bin_offset(area, bin, common_bin_offset);
/* For custom resolve, we switch to rendering directly to sysmem and so
* the fragment size becomes 1x1. This means we have to scale down
* FragCoord when accessing GMEM input attachments.
/* HW FragCoord (as well as the viewport and scissor) is in one of the
* following spaces:
* - Without custom resolve, rendering space as usual.
* - With custom resolve to non-subsampled images, framebuffer
* space.
* - With custom resolve to subsampled images, subsampled space. Its
* origin is subsampled_pos.offset, and it may or may not be scaled
* down depending on whether the view is subsampled.
*
* TODO: When we support subsampled images, this should also only happen
* for non-subsampled images.
* For user FragCoord, we need to transform from this space to
* framebuffer space. However the transform in the shader performs the
* opposite, so we actually need to transform from framebuffer space to
* this "custom rendering space". For GMEM FragCoord, we need to
* transform this space to rendering space.
*/
VkOffset2D tile_start = common_bin_offset;
VkExtent2D rendering_frag_area = tile_frag_area;
VkExtent2D gmem_frag_area = (VkExtent2D) { 1, 1 };
if (state->custom_resolve) {
tu_cs_emit(cs, 1 /* width */);
tu_cs_emit(cs, 1 /* height */);
tu_cs_emit(cs, fui(0.0));
tu_cs_emit(cs, fui(0.0));
} else {
tu_cs_emit(cs, area.width);
tu_cs_emit(cs, area.height);
tu_cs_emit(cs, fui(offset.x));
tu_cs_emit(cs, fui(offset.y));
if (config->subsampled)
tile_start = config->subsampled_pos[view].offset;
else
tile_start = bin.offset;
if (!(config->subsampled_views & (1u << view))) {
rendering_frag_area = (VkExtent2D){ 1, 1 };
gmem_frag_area = tile_frag_area;
}
}
VkRect2D gmem_bin = bin;
gmem_bin.offset = tile_start;
VkOffset2D offset = tu_fdm_per_bin_offset(rendering_frag_area, bin, tile_start);
VkOffset2D gmem_offset = tu_fdm_per_bin_offset(gmem_frag_area, gmem_bin,
common_bin_offset);
tu_cs_emit(cs, rendering_frag_area.width);
tu_cs_emit(cs, rendering_frag_area.height);
tu_cs_emit(cs, fui(offset.x));
tu_cs_emit(cs, fui(offset.y));
if (i * 2 + 1 < num_consts) {
if (state->custom_resolve) {
tu_cs_emit(cs, fui(1. / area.width));
tu_cs_emit(cs, fui(1. / area.height));
tu_cs_emit(cs, fui(offset.x));
tu_cs_emit(cs, fui(offset.y));
} else {
tu_cs_emit(cs, fui(1.0));
tu_cs_emit(cs, fui(1.0));
tu_cs_emit(cs, fui(0.0));
tu_cs_emit(cs, fui(0.0));
}
tu_cs_emit(cs, fui(1. / gmem_frag_area.width));
tu_cs_emit(cs, fui(1. / gmem_frag_area.height));
tu_cs_emit(cs, fui(gmem_offset.x));
tu_cs_emit(cs, fui(gmem_offset.y));
}
}
}


@ -551,6 +551,7 @@ struct tu_cmd_state
const struct tu_framebuffer *framebuffer;
VkRect2D render_areas[MAX_VIEWS];
bool per_layer_render_area;
bool fdm_subsampled;
enum tu_gmem_layout gmem_layout;
const struct tu_image_view **attachments;
@ -560,6 +561,7 @@ struct tu_cmd_state
} suspended_pass;
bool fdm_enabled;
bool fdm_subsampled;
bool tessfactor_addr_set;
bool predication_active;


@ -156,6 +156,8 @@ enum tu_predicate_bit {
TU_PREDICATE_VTX_STATS_RUNNING = 3,
TU_PREDICATE_VTX_STATS_NOT_RUNNING = 4,
TU_PREDICATE_FIRST_TILE = 5,
TU_PREDICATE_FAST_STORE = 6,
TU_PREDICATE_NO_FAST_STORE = 7,
};
/* Onchip timestamp register layout. */
@ -176,6 +178,11 @@ enum tu_onchip_addr {
*/
};
struct tu_rect2d_float {
float x_start, y_start;
float x_end, y_end;
};
#define TU_GENX(FUNC_NAME) FD_GENX(FUNC_NAME)
@ -213,4 +220,13 @@ struct tu_suballocator;
struct tu_subpass;
struct tu_u_trace_submission_data;
/* Helper for iterating over layers of an attachment that handles both
* multiview and layered rendering cases.
*/
#define for_each_layer(layer, layer_mask, layers) \
for (uint32_t layer = 0; \
layer < ((layer_mask) ? (util_logbase2(layer_mask) + 1) : (layers)); \
layer++) \
if (!(layer_mask) || ((layer_mask) & BIT(layer)))
#endif /* TU_COMMON_H */


@ -32,6 +32,7 @@
#include "tu_image.h"
#include "tu_formats.h"
#include "tu_rmv.h"
#include "tu_subsampled_image.h"
#include "bvh/tu_build_interface.h"
static inline uint8_t *
@ -43,7 +44,8 @@ pool_base(struct tu_descriptor_pool *pool)
static uint32_t
descriptor_size(struct tu_device *dev,
const VkDescriptorSetLayoutBinding *binding,
VkDescriptorType type)
VkDescriptorType type,
bool subsampled)
{
switch (type) {
case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER:
@ -54,7 +56,7 @@ descriptor_size(struct tu_device *dev,
* descriptors which are less than 16 dwords. However combined images
* and samplers are actually two descriptors, so they have size 2.
*/
return FDL6_TEX_CONST_DWORDS * 4 * 2;
return FDL6_TEX_CONST_DWORDS * 4 * (subsampled ? 3 : 2);
case VK_DESCRIPTOR_TYPE_STORAGE_BUFFER:
case VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC:
/* isam.v allows using a single 16-bit descriptor for both 16-bit and
@ -80,7 +82,8 @@ mutable_descriptor_size(struct tu_device *dev,
uint32_t max_size = 0;
for (uint32_t i = 0; i < list->descriptorTypeCount; i++) {
uint32_t size = descriptor_size(dev, NULL, list->pDescriptorTypes[i]);
uint32_t size = descriptor_size(dev, NULL, list->pDescriptorTypes[i],
false);
max_size = MAX2(max_size, size);
}
@ -194,6 +197,7 @@ tu_CreateDescriptorSetLayout(
set_layout->binding[b].dynamic_offset_offset = dynamic_offset_size;
set_layout->binding[b].shader_stages = binding->stageFlags;
bool has_subsampled_sampler = false;
if ((binding->descriptorType == VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER ||
binding->descriptorType == VK_DESCRIPTOR_TYPE_SAMPLER) &&
binding->pImmutableSamplers) {
@ -208,8 +212,12 @@ tu_CreateDescriptorSetLayout(
bool has_ycbcr_sampler = false;
for (unsigned i = 0; i < pCreateInfo->pBindings[j].descriptorCount; ++i) {
if (tu_sampler_from_handle(binding->pImmutableSamplers[i])->vk.ycbcr_conversion)
VK_FROM_HANDLE(tu_sampler, sampler,
binding->pImmutableSamplers[i]);
if (sampler->vk.ycbcr_conversion)
has_ycbcr_sampler = true;
if (sampler->vk.flags & VK_SAMPLER_CREATE_SUBSAMPLED_BIT_EXT)
has_subsampled_sampler = true;
}
if (has_ycbcr_sampler) {
@ -236,7 +244,8 @@ tu_CreateDescriptorSetLayout(
mutable_descriptor_size(device, &mutable_info->pMutableDescriptorTypeLists[j]);
} else {
set_layout->binding[b].size =
descriptor_size(device, binding, binding->descriptorType);
descriptor_size(device, binding, binding->descriptorType,
has_subsampled_sampler);
}
if (binding->descriptorType == VK_DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK)
@ -365,7 +374,19 @@ tu_GetDescriptorSetLayoutSupport(
descriptor_sz =
mutable_descriptor_size(device, &mutable_info->pMutableDescriptorTypeLists[i]);
} else {
descriptor_sz = descriptor_size(device, binding, binding->descriptorType);
bool has_subsampled_sampler = false;
if (binding->pImmutableSamplers) {
for (unsigned i = 0; i < binding->descriptorCount; i++) {
VK_FROM_HANDLE(tu_sampler, sampler,
binding->pImmutableSamplers[i]);
if (sampler->vk.flags & VK_SAMPLER_CREATE_SUBSAMPLED_BIT_EXT) {
has_subsampled_sampler = true;
break;
}
}
}
descriptor_sz = descriptor_size(device, binding, binding->descriptorType,
has_subsampled_sampler);
}
uint64_t descriptor_alignment = 4 * FDL6_TEX_CONST_DWORDS;
@ -453,6 +474,9 @@ sha1_update_descriptor_set_binding_layout(struct mesa_sha1 *ctx,
SHA1_UPDATE_VALUE(ctx, layout->dynamic_offset_offset);
SHA1_UPDATE_VALUE(ctx, layout->immutable_samplers_offset);
const struct tu_sampler *samplers =
tu_immutable_samplers(set_layout, layout);
const struct vk_ycbcr_conversion_state *ycbcr_samplers =
tu_immutable_ycbcr_samplers(set_layout, layout);
@ -460,6 +484,16 @@ sha1_update_descriptor_set_binding_layout(struct mesa_sha1 *ctx,
for (unsigned i = 0; i < layout->array_size; i++)
sha1_update_ycbcr_sampler(ctx, ycbcr_samplers + i);
}
if (samplers) {
for (unsigned i = 0; i < layout->array_size; i++) {
if (samplers[i].vk.flags & VK_SAMPLER_CREATE_SUBSAMPLED_BIT_EXT) {
SHA1_UPDATE_VALUE(ctx, i);
SHA1_UPDATE_VALUE(ctx, samplers[i].vk.address_mode_u);
SHA1_UPDATE_VALUE(ctx, samplers[i].vk.address_mode_v);
}
}
}
}
@ -721,7 +755,7 @@ tu_CreateDescriptorPool(VkDevice _device,
switch (pool_size->type) {
case VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC:
case VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC:
dynamic_size += descriptor_size(device, NULL, pool_size->type) *
dynamic_size += descriptor_size(device, NULL, pool_size->type, false) *
pool_size->descriptorCount;
break;
case VK_DESCRIPTOR_TYPE_MUTABLE_EXT:
@ -740,7 +774,11 @@ tu_CreateDescriptorPool(VkDevice _device,
bo_size += pool_size->descriptorCount;
break;
default:
bo_size += descriptor_size(device, NULL, pool_size->type) *
/* We don't know whether this pool will be used with subsampled
* images, so we have to assume it may be.
*/
bo_size += descriptor_size(device, NULL, pool_size->type,
device->vk.enabled_features.fragmentDensityMap) *
pool_size->descriptorCount;
break;
}
@ -1084,15 +1122,35 @@ static void
write_combined_image_sampler_descriptor(uint32_t *dst,
VkDescriptorType descriptor_type,
const VkDescriptorImageInfo *image_info,
bool has_sampler)
bool write_sampler,
const struct tu_sampler *immutable_sampler)
{
write_image_descriptor(dst, descriptor_type, image_info);
/* copy over sampler state */
if (has_sampler) {
VK_FROM_HANDLE(tu_sampler, sampler, image_info->sampler);
/* copy over sampler state */
if (write_sampler) {
VK_FROM_HANDLE(tu_sampler, sampler, image_info->sampler);
memcpy(dst + FDL6_TEX_CONST_DWORDS, sampler->descriptor, sizeof(sampler->descriptor));
}
/* It's technically legal to sample from a mismatched descriptor (i.e. only
* the sampler or only the image has SUBSAMPLED_BIT) but it gives undefined
* results. So we have to make sure not to crash or disturb other
* descriptors. Therefore we check the sampler, because that's what
* triggers allocating extra space in the descriptor set.
*/
if (immutable_sampler &&
(immutable_sampler->vk.flags & VK_SAMPLER_CREATE_SUBSAMPLED_BIT_EXT)) {
VK_FROM_HANDLE(tu_image_view, iview, image_info->imageView);
VkDescriptorAddressInfoEXT info = {
.address = iview->image->iova +
iview->image->subsampled_metadata_offset +
iview->vk.base_array_layer * sizeof(struct tu_subsampled_metadata),
.range =
iview->vk.layer_count * sizeof(struct tu_subsampled_metadata),
};
write_ubo_descriptor_addr(dst + 2 * FDL6_TEX_CONST_DWORDS, &info);
}
}
static void
@ -1156,12 +1214,15 @@ tu_GetDescriptorEXT(
write_image_descriptor(dest, VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
pDescriptorInfo->data.pStorageImage);
break;
case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER:
case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER: {
VK_FROM_HANDLE(tu_sampler, sampler,
pDescriptorInfo->data.pCombinedImageSampler->sampler);
write_combined_image_sampler_descriptor(dest,
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
pDescriptorInfo->data.pCombinedImageSampler,
true);
true, sampler);
break;
}
case VK_DESCRIPTOR_TYPE_SAMPLER:
write_sampler_descriptor(dest, *pDescriptorInfo->data.pSampler);
break;
@ -1285,7 +1346,8 @@ tu_update_descriptor_sets(const struct tu_device *device,
write_combined_image_sampler_descriptor(ptr,
writeset->descriptorType,
writeset->pImageInfo + j,
!binding_layout->immutable_samplers_offset);
!samplers,
samplers ? &samplers[writeset->dstArrayElement + j] : NULL);
if (copy_immutable_samplers)
write_sampler_push(ptr + FDL6_TEX_CONST_DWORDS, &samplers[writeset->dstArrayElement + j]);
@ -1636,7 +1698,8 @@ tu_update_descriptor_set_with_template(
write_combined_image_sampler_descriptor(ptr,
templ->entry[i].descriptor_type,
(const VkDescriptorImageInfo *) src,
!samplers);
!samplers,
samplers ? &samplers[j] : NULL);
if (templ->entry[i].copy_immutable_samplers)
write_sampler_push(ptr + FDL6_TEX_CONST_DWORDS, &samplers[j]);
break;


@ -1411,7 +1411,7 @@ tu_get_properties(struct tu_physical_device *pdevice,
props->samplerDescriptorBufferAddressSpaceSize = ~0ull;
props->resourceDescriptorBufferAddressSpaceSize = ~0ull;
props->descriptorBufferAddressSpaceSize = ~0ull;
props->combinedImageSamplerDensityMapDescriptorSize = 2 * FDL6_TEX_CONST_DWORDS * 4;
props->combinedImageSamplerDensityMapDescriptorSize = 3 * FDL6_TEX_CONST_DWORDS * 4;
/* VK_EXT_legacy_vertex_attributes */
props->nativeUnalignedPerformance = true;


@ -44,6 +44,7 @@
enum global_shader {
GLOBAL_SH_VS_BLIT,
GLOBAL_SH_VS_MULTI_BLIT,
GLOBAL_SH_VS_CLEAR,
GLOBAL_SH_FS_BLIT,
GLOBAL_SH_FS_BLIT_ZSCALE,


@ -29,6 +29,7 @@
#include "tu_formats.h"
#include "tu_lrz.h"
#include "tu_rmv.h"
#include "tu_subsampled_image.h"
#include "tu_wsi.h"
uint32_t
@ -538,6 +539,15 @@ tu_image_update_layout(struct tu_device *device, struct tu_image *image,
/* no UBWC for separate stencil */
image->ubwc_enabled = false;
/* Subsampled images with FDM offset require extra space for adjusting
* the offset to make the tiles aligned.
*/
if ((image->vk.create_flags & VK_IMAGE_CREATE_SUBSAMPLED_BIT_EXT) &&
(image->vk.create_flags & VK_IMAGE_CREATE_FRAGMENT_DENSITY_MAP_OFFSET_BIT_EXT)) {
width0 += device->physical_device->info->tile_align_w;
height0 += device->physical_device->info->tile_align_h;
}
struct fdl_explicit_layout plane_layout;
if (plane_layouts) {
@ -634,6 +644,12 @@ tu_image_update_layout(struct tu_device *device, struct tu_image *image,
image->lrz_layout.lrz_total_size = 0;
}
if (image->vk.create_flags & VK_IMAGE_CREATE_SUBSAMPLED_BIT_EXT) {
image->subsampled_metadata_offset = align64(image->total_size, 16);
image->total_size = image->subsampled_metadata_offset +
image->vk.array_layers * sizeof(struct tu_subsampled_metadata);
}
return VK_SUCCESS;
}
TU_GENX(tu_image_update_layout);


@ -34,6 +34,7 @@ struct tu_image
struct vk_image vk;
struct fdl_layout layout[3];
uint64_t subsampled_metadata_offset;
uint64_t total_size;
/* Set when bound */


@ -2732,32 +2732,46 @@ fdm_apply_viewports(struct tu_cmd_buffer *cmd, struct tu_cs *cs, void *data,
* renderpass, views will be 1 and we also have to replicate the 0'th
* view to every view.
*/
VkExtent2D frag_area =
(state->share_scale || views == 1) ? tile->frag_areas[0] : tile->frag_areas[i];
VkRect2D bin =
(state->share_scale || views == 1) ? bins[0] : bins[i];
VkOffset2D hw_viewport_offset =
(state->share_scale || views == 1) ? hw_viewport_offsets[0] :
hw_viewport_offsets[i];
unsigned view = (state->share_scale || views == 1) ? 0 : i;
VkExtent2D frag_area = tile->frag_areas[view];
VkRect2D bin = bins[view];
VkOffset2D hw_viewport_offset = hw_viewport_offsets[view];
/* Implement fake_single_viewport by replicating viewport 0 across all
* views.
*/
VkViewport viewport =
state->fake_single_viewport ? state->vp.viewports[0] : state->vp.viewports[i];
if ((frag_area.width == 1 && frag_area.height == 1 &&
common_bin_offset.x == bin.offset.x &&
common_bin_offset.y == bin.offset.y) ||
/* When in a custom resolve operation (TODO: and using
* non-subsampled images) we switch to framebuffer coordinates so we
* shouldn't apply the transform. However the binning pass isn't
* aware of this, so we have to keep applying the transform for
* binning.
*/
(state->custom_resolve && !binning)) {
if (frag_area.width == 1 && frag_area.height == 1 &&
common_bin_offset.x == bin.offset.x &&
common_bin_offset.y == bin.offset.y) {
vp.viewports[i] = viewport;
continue;
}
/* When custom resolve is enabled, we need to apply the viewport
* transform so that we render to where we would've blitted the tile to.
* Without subsampled images, this is the framebuffer space bin (so
* there is effectively no transform). With subsampled images, this is
* subsampled space, which may not be the same as rendering space if we
* had to shift the tile or an FDM offset is in use.
*/
VkOffset2D tile_start = common_bin_offset;
if (state->custom_resolve && !binning) {
if (tile->subsampled)
tile_start = tile->subsampled_pos[view].offset;
else
tile_start = bin.offset;
}
/* When in a custom resolve operation without subsampling we shouldn't
* scale the viewport down. However the binning pass isn't aware of
* this, so we have to keep applying the transform for binning.
*/
if (state->custom_resolve &&
!(tile->subsampled_views & (1u << view)) && !binning) {
frag_area = (VkExtent2D) {1, 1};
}
float scale_x = (float) 1.0f / frag_area.width;
float scale_y = (float) 1.0f / frag_area.height;
@ -2767,9 +2781,12 @@ fdm_apply_viewports(struct tu_cmd_buffer *cmd, struct tu_cs *cs, void *data,
vp.viewports[i].height = viewport.height * scale_y;
VkOffset2D offset = tu_fdm_per_bin_offset(frag_area, bin,
common_bin_offset);
offset.x -= hw_viewport_offset.x;
offset.y -= hw_viewport_offset.y;
tile_start);
/* FDM offsets are disabled with custom resolve. */
if (!state->custom_resolve) {
offset.x -= hw_viewport_offset.x;
offset.y -= hw_viewport_offset.y;
}
vp.viewports[i].x = scale_x * viewport.x + offset.x;
vp.viewports[i].y = scale_y * viewport.y + offset.y;
@ -2861,15 +2878,33 @@ fdm_apply_scissors(struct tu_cmd_buffer *cmd, struct tu_cs *cs, void *data,
struct vk_viewport_state vp = state->vp;
for (unsigned i = 0; i < vp.scissor_count; i++) {
VkExtent2D frag_area =
(state->share_scale || views == 1) ? tile->frag_areas[0] : tile->frag_areas[i];
VkRect2D bin =
(state->share_scale || views == 1) ? bins[0] : bins[i];
unsigned view = (state->share_scale || views == 1) ? 0 : i;
VkExtent2D frag_area = tile->frag_areas[view];
VkRect2D bin = bins[view];
VkRect2D scissor =
state->fake_single_viewport ? state->vp.scissors[0] : state->vp.scissors[i];
VkOffset2D hw_viewport_offset =
(state->share_scale || views == 1) ? hw_viewport_offsets[0] :
hw_viewport_offsets[i];
VkOffset2D hw_viewport_offset = hw_viewport_offsets[view];
VkOffset2D tile_start = common_bin_offset;
if (state->custom_resolve && !binning) {
if (tile->subsampled)
tile_start = tile->subsampled_pos[view].offset;
else
tile_start = bin.offset;
}
/* Disable scaling when doing a custom resolve to a non-subsampled image
* and not in the binning pass, because we use framebuffer coordinates.
*/
if (state->custom_resolve &&
!(tile->subsampled_views & (1u << view)) && !binning) {
frag_area = (VkExtent2D) {1, 1};
}
if (!state->custom_resolve) {
tile_start.x -= hw_viewport_offset.x;
tile_start.y -= hw_viewport_offset.y;
}
/* Transform the scissor following the viewport. It's unclear how this
* is supposed to handle cases where the scissor isn't aligned to the
@ -2878,22 +2913,7 @@ fdm_apply_scissors(struct tu_cmd_buffer *cmd, struct tu_cs *cs, void *data,
* isn't aligned to the fragment area.
*/
VkOffset2D offset = tu_fdm_per_bin_offset(frag_area, bin,
common_bin_offset);
offset.x -= hw_viewport_offset.x;
offset.y -= hw_viewport_offset.y;
/* Disable scaling and offset when doing a custom resolve to a
* non-subsampled image and not in the binning pass, because we
* use framebuffer coordinates.
*
* TODO: When we support subsampled images, only do this for
* non-subsampled images.
*/
if (state->custom_resolve && !binning) {
offset = (VkOffset2D) {};
frag_area = (VkExtent2D) {1, 1};
}
tile_start);
VkOffset2D min = {
scissor.offset.x / frag_area.width + offset.x,
scissor.offset.y / frag_area.height + offset.y,
@ -2904,26 +2924,17 @@ fdm_apply_scissors(struct tu_cmd_buffer *cmd, struct tu_cs *cs, void *data,
};
/* Intersect scissor with the scaled bin, this essentially replaces the
* window scissor. With custom resolve (TODO: and non-subsampled images)
* we have to use the unscaled bin instead.
* window scissor. With custom resolve we have to use the unscaled bin
* instead.
*/
uint32_t scaled_width = bin.extent.width / frag_area.width;
uint32_t scaled_height = bin.extent.height / frag_area.height;
int32_t bin_x;
int32_t bin_y;
if (state->custom_resolve && !binning) {
bin_x = bin.offset.x;
bin_y = bin.offset.y;
} else {
bin_x = common_bin_offset.x - hw_viewport_offset.x;
bin_y = common_bin_offset.y - hw_viewport_offset.y;
}
vp.scissors[i].offset.x = MAX2(min.x, bin_x);
vp.scissors[i].offset.y = MAX2(min.y, bin_y);
vp.scissors[i].offset.x = MAX2(min.x, tile_start.x);
vp.scissors[i].offset.y = MAX2(min.y, tile_start.y);
vp.scissors[i].extent.width =
MIN2(max.x, bin_x + scaled_width) - vp.scissors[i].offset.x;
MIN2(max.x, tile_start.x + scaled_width) - vp.scissors[i].offset.x;
vp.scissors[i].extent.height =
MIN2(max.y, bin_y + scaled_height) - vp.scissors[i].offset.y;
MIN2(max.y, tile_start.y + scaled_height) - vp.scissors[i].offset.y;
}
TU_CALLX(cs->device, tu6_emit_scissor)(cs, &vp);


@ -21,6 +21,7 @@
#include "tu_lrz.h"
#include "tu_pipeline.h"
#include "tu_rmv.h"
#include "tu_subsampled_image.h"
#include <initializer_list>
@ -506,7 +507,7 @@ lower_ssbo_ubo_intrinsic(struct tu_device *dev,
static nir_def *
build_bindless(struct tu_device *dev, nir_builder *b,
nir_deref_instr *deref, bool is_sampler,
nir_deref_instr *deref, unsigned combined_descriptor_offset,
struct tu_shader *shader,
const struct tu_pipeline_layout *layout,
uint32_t read_only_input_attachments,
@ -568,9 +569,8 @@ build_bindless(struct tu_device *dev, nir_builder *b,
/* Samplers come second in combined image/sampler descriptors, see
* write_combined_image_sampler_descriptor().
*/
if (is_sampler && bind_layout->type ==
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER) {
offset = 1;
if (bind_layout->type == VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER) {
offset = combined_descriptor_offset;
}
desc_offset =
nir_imm_int(b, (bind_layout->offset / (4 * FDL6_TEX_CONST_DWORDS)) +
@ -594,7 +594,7 @@ lower_image_deref(struct tu_device *dev, nir_builder *b,
const struct tu_pipeline_layout *layout)
{
nir_deref_instr *deref = nir_src_as_deref(instr->src[0]);
nir_def *bindless = build_bindless(dev, b, deref, false, shader, layout, 0, false);
nir_def *bindless = build_bindless(dev, b, deref, 0, shader, layout, 0, false);
nir_rewrite_image_intrinsic(instr, bindless, true);
}
@ -697,42 +697,93 @@ lower_intrinsic(nir_builder *b, nir_intrinsic_instr *instr,
}
static void
lower_tex_ycbcr(const struct tu_pipeline_layout *layout,
lower_tex_subsampled(const struct tu_sampler *sampler,
struct tu_device *dev,
struct tu_shader *shader,
const struct tu_pipeline_layout *layout,
nir_builder *b,
nir_tex_instr *tex)
{
/* Only these ops are allowed with subsampled images */
if (tex->op != nir_texop_tex &&
tex->op != nir_texop_txl)
return;
b->cursor = nir_before_instr(&tex->instr);
int tex_src_idx = nir_tex_instr_src_index(tex, nir_tex_src_texture_deref);
assert(tex_src_idx >= 0);
nir_deref_instr *deref = nir_src_as_deref(tex->src[tex_src_idx].src);
nir_def *bindless = build_bindless(dev, b, deref, 2, shader, layout,
0, /* read_only_input_attachments (not used) */
false /* dynamic_renderpass (not used)*/
);
nir_def *coord = nir_steal_tex_src(tex, nir_tex_src_coord);
nir_def *coord_xy = nir_channels(b, coord, 0x3);
nir_def *layer = NULL;
if (coord->num_components > 2)
layer = nir_channel(b, coord, 2);
/* In order to avoid problems in the math for finding the bin with
* an x or y coordinate of exactly 1.0, where we would overflow into the
* next bin, we have to clamp to some 1.0 - epsilon. The largest possible
* framebuffer is 2^14 pixels currently, and we cannot shift the coordinate
* to before the pixel center, so we use 2^-15.
*/
const float epsilon = 0x1p-15f;
nir_def *clamped_coord_xy =
nir_fmax(b, nir_fmin(b, coord_xy, nir_imm_float(b, 1.0f - epsilon)),
nir_imm_float(b, 0.0));
nir_def *clamped_coord = clamped_coord_xy;
if (layer) {
clamped_coord = nir_vec3(b, nir_channel(b, clamped_coord_xy, 0),
nir_channel(b, clamped_coord_xy, 1),
layer);
}
nir_def *transformed_coord_xy =
tu_get_subsampled_coordinates(b, clamped_coord, bindless);
/* Due to VUID-VkSamplerCreateInfo-flags-02577 we only have to handle
* CLAMP_TO_EDGE and CLAMP_TO_BORDER. We implicitly do CLAMP_TO_EDGE to
* prevent OOB accesses to the metadata anyway, so we just fixup the
* coordinates to pass the original coordinates if OOB.
*/
if (sampler->vk.address_mode_u == VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER) {
nir_def *x = nir_channel(b, coord, 0);
nir_def *oob = nir_fneu(b, nir_fsat(b, x), x);
transformed_coord_xy =
nir_vec2(b, nir_bcsel(b, oob, x,
nir_channel(b, transformed_coord_xy, 0)),
nir_channel(b, transformed_coord_xy, 1));
}
if (sampler->vk.address_mode_v == VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_BORDER) {
nir_def *y = nir_channel(b, coord, 1);
nir_def *oob = nir_fneu(b, nir_fsat(b, y), y);
transformed_coord_xy =
nir_vec2(b, nir_channel(b, transformed_coord_xy, 0),
nir_bcsel(b, oob, y,
nir_channel(b, transformed_coord_xy, 1)));
}
nir_def *transformed_coord = transformed_coord_xy;
if (layer) {
transformed_coord = nir_vec3(b, nir_channel(b, transformed_coord_xy, 0),
nir_channel(b, transformed_coord_xy, 1),
layer);
}
nir_tex_instr_add_src(tex, nir_tex_src_coord, transformed_coord);
}
static void
lower_tex_ycbcr(const struct vk_ycbcr_conversion_state *ycbcr_sampler,
nir_builder *builder,
nir_tex_instr *tex)
{
int deref_src_idx = nir_tex_instr_src_index(tex, nir_tex_src_texture_deref);
assert(deref_src_idx >= 0);
nir_deref_instr *deref = nir_src_as_deref(tex->src[deref_src_idx].src);
nir_variable *var = nir_deref_instr_get_variable(deref);
const struct tu_descriptor_set_layout *set_layout =
layout->set[var->data.descriptor_set].layout;
const struct tu_descriptor_set_binding_layout *binding =
&set_layout->binding[var->data.binding];
const struct vk_ycbcr_conversion_state *ycbcr_samplers =
tu_immutable_ycbcr_samplers(set_layout, binding);
if (!ycbcr_samplers)
return;
/* For the following instructions, we don't apply any change */
if (tex->op == nir_texop_txs ||
tex->op == nir_texop_query_levels ||
tex->op == nir_texop_lod)
return;
assert(tex->texture_index == 0);
unsigned array_index = 0;
if (deref->deref_type != nir_deref_type_var) {
assert(deref->deref_type == nir_deref_type_array);
if (!nir_src_is_const(deref->arr.index))
return;
array_index = nir_src_as_uint(deref->arr.index);
array_index = MIN2(array_index, binding->array_size - 1);
}
const struct vk_ycbcr_conversion_state *ycbcr_sampler = ycbcr_samplers + array_index;
if (ycbcr_sampler->ycbcr_model == VK_SAMPLER_YCBCR_MODEL_CONVERSION_RGB_IDENTITY)
return;
@ -756,6 +807,55 @@ lower_tex_ycbcr(const struct tu_pipeline_layout *layout,
builder->cursor = nir_before_instr(&tex->instr);
}
static void
lower_tex_immutable(struct tu_device *dev,
struct tu_shader *shader,
const struct tu_pipeline_layout *layout,
nir_builder *builder,
nir_tex_instr *tex)
{
int deref_src_idx = nir_tex_instr_src_index(tex, nir_tex_src_texture_deref);
assert(deref_src_idx >= 0);
nir_deref_instr *deref = nir_src_as_deref(tex->src[deref_src_idx].src);
nir_variable *var = nir_deref_instr_get_variable(deref);
const struct tu_descriptor_set_layout *set_layout =
layout->set[var->data.descriptor_set].layout;
const struct tu_descriptor_set_binding_layout *binding =
&set_layout->binding[var->data.binding];
/* For the following instructions, we don't apply any change */
if (tex->op == nir_texop_txs ||
tex->op == nir_texop_query_levels ||
tex->op == nir_texop_lod)
return;
assert(tex->texture_index == 0);
unsigned array_index = 0;
if (deref->deref_type != nir_deref_type_var) {
assert(deref->deref_type == nir_deref_type_array);
if (!nir_src_is_const(deref->arr.index))
return;
array_index = nir_src_as_uint(deref->arr.index);
array_index = MIN2(array_index, binding->array_size - 1);
}
const struct vk_ycbcr_conversion_state *ycbcr_samplers =
tu_immutable_ycbcr_samplers(set_layout, binding);
if (ycbcr_samplers) {
const struct vk_ycbcr_conversion_state *ycbcr_sampler = ycbcr_samplers + array_index;
lower_tex_ycbcr(ycbcr_sampler, builder, tex);
}
const struct tu_sampler *samplers =
tu_immutable_samplers(set_layout, binding);
if (samplers) {
const struct tu_sampler *sampler = samplers + array_index;
if (sampler->vk.flags & VK_SAMPLER_CREATE_SUBSAMPLED_BIT_EXT)
lower_tex_subsampled(sampler, dev, shader, layout, builder, tex);
}
}
static bool
lower_tex_impl(nir_builder *b, nir_tex_instr *tex, struct tu_device *dev,
struct tu_shader *shader, const struct tu_pipeline_layout *layout,
@@ -765,7 +865,7 @@ lower_tex_impl(nir_builder *b, nir_tex_instr *tex, struct tu_device *dev,
int sampler_src_idx = nir_tex_instr_src_index(tex, ref ? nir_tex_src_sampler_2_deref : nir_tex_src_sampler_deref);
if (sampler_src_idx >= 0) {
nir_deref_instr *deref = nir_src_as_deref(tex->src[sampler_src_idx].src);
nir_def *bindless = build_bindless(dev, b, deref, true, shader, layout,
nir_def *bindless = build_bindless(dev, b, deref, 1, shader, layout,
read_only_input_attachments,
dynamic_renderpass);
nir_src_rewrite(&tex->src[sampler_src_idx].src, bindless);
@@ -775,7 +875,7 @@ lower_tex_impl(nir_builder *b, nir_tex_instr *tex, struct tu_device *dev,
int tex_src_idx = nir_tex_instr_src_index(tex, ref ? nir_tex_src_texture_2_deref : nir_tex_src_texture_deref);
if (tex_src_idx >= 0) {
nir_deref_instr *deref = nir_src_as_deref(tex->src[tex_src_idx].src);
nir_def *bindless = build_bindless(dev, b, deref, false, shader, layout,
nir_def *bindless = build_bindless(dev, b, deref, 0, shader, layout,
read_only_input_attachments,
dynamic_renderpass);
nir_src_rewrite(&tex->src[tex_src_idx].src, bindless);
@@ -800,7 +900,7 @@ lower_tex(nir_builder *b, nir_tex_instr *tex, struct tu_device *dev,
lower_tex_impl(b, tex, dev, shader, layout, read_only_input_attachments, dynamic_renderpass, false);
lower_tex_impl(b, tex, dev, shader, layout, read_only_input_attachments, dynamic_renderpass, true);
} else {
lower_tex_ycbcr(layout, b, tex);
lower_tex_immutable(dev, shader, layout, b, tex);
lower_tex_impl(b, tex, dev, shader, layout, read_only_input_attachments, dynamic_renderpass, false);
}


@@ -0,0 +1,584 @@
/*
* Copyright © 2026 Valve Corporation.
* SPDX-License-Identifier: MIT
*/
#include "tu_cmd_buffer.h"
#include "tu_subsampled_image.h"
#include "nir_builder.h"
/* If a tile is not subsampled, we treat it as if its fragment area is (1,1)
* for the purposes of subsampling.
*/
static VkExtent2D
get_effective_frag_area(const struct tu_tile_config *tile, unsigned view)
{
return (tile->subsampled_views & (1u << view)) ?
tile->frag_areas[view] : (VkExtent2D) {1, 1};
}
void
tu_emit_subsampled_metadata(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
unsigned a,
const struct tu_tile_config *tiles,
const struct tu_tiling_config *tiling,
const struct tu_vsc_config *vsc,
const struct tu_framebuffer *fb,
const VkOffset2D *fdm_offsets)
{
const struct tu_image_view *iview = cmd->state.attachments[a];
float size_ratio_x = (float)iview->image->vk.extent.width /
iview->image->layout[0].width0;
float size_ratio_y = (float)iview->image->vk.extent.height /
iview->image->layout[0].height0;
for_each_layer (i, cmd->state.pass->attachments[a].used_views |
cmd->state.pass->attachments[a].resolve_views,
fb->layers) {
struct tu_subsampled_metadata metadata;
metadata.hdr.pad0[0] = metadata.hdr.pad0[1] = metadata.hdr.pad0[2] = 0;
unsigned tile_count;
if (!tiles || vsc->tile_count.width * vsc->tile_count.height >
TU_SUBSAMPLED_MAX_BINS) {
tile_count = 1;
metadata.hdr.scale_x = 1.0;
metadata.hdr.scale_y = 1.0;
metadata.hdr.offset_x = 0.0;
metadata.hdr.offset_y = 0.0;
metadata.hdr.bin_stride = 1;
metadata.bins[0].scale_x = size_ratio_x;
metadata.bins[0].scale_y = size_ratio_y;
metadata.bins[0].offset_x = 0.0;
metadata.bins[0].offset_y = 0.0;
} else {
unsigned view = MIN2(i, tu_fdm_num_layers(cmd) - 1);
VkOffset2D bin_offset = {};
if (fdm_offsets)
bin_offset = tu_bin_offset(fdm_offsets[view], tiling);
tile_count = vsc->tile_count.width * vsc->tile_count.height;
metadata.hdr.scale_x = (float)iview->vk.extent.width / tiling->tile0.width;
metadata.hdr.scale_y = (float)iview->vk.extent.height / tiling->tile0.height;
metadata.hdr.offset_x = (float)bin_offset.x / tiling->tile0.width;
metadata.hdr.offset_y = (float)bin_offset.y / tiling->tile0.height;
metadata.hdr.bin_stride = vsc->tile_count.width;
for (unsigned j = 0; j < tile_count; j++) {
const struct tu_tile_config *tile = &tiles[j];
while (tile->merged_tile)
tile = tile->merged_tile;
if (!(tile->visible_views & (1u << view)) ||
!tile->subsampled) {
metadata.bins[j].scale_x = metadata.bins[j].scale_y = 1.0;
metadata.bins[j].offset_x = metadata.bins[j].offset_y = 0.0;
continue;
}
VkExtent2D frag_area = get_effective_frag_area(tile, view);
VkOffset2D fb_bin_start = (VkOffset2D) {
MAX2(tile->pos.x * (int32_t)tiling->tile0.width - bin_offset.x, 0),
MAX2(tile->pos.y * (int32_t)tiling->tile0.height - bin_offset.y, 0),
};
metadata.bins[j].scale_x = 1.0 / frag_area.width * size_ratio_x;
metadata.bins[j].scale_y = 1.0 / frag_area.height * size_ratio_y;
metadata.bins[j].offset_x =
(float)(tile->subsampled_pos[view].offset.x -
fb_bin_start.x / frag_area.width) /
iview->image->layout[0].width0;
metadata.bins[j].offset_y =
(float)(tile->subsampled_pos[view].offset.y -
fb_bin_start.y / frag_area.height) /
iview->image->layout[0].height0;
}
}
uint64_t iova = iview->image->iova +
iview->image->subsampled_metadata_offset +
sizeof(struct tu_subsampled_metadata) *
(iview->vk.base_array_layer + i);
tu_cs_emit_pkt7(cs, CP_MEM_WRITE,
2 + (sizeof(struct tu_subsampled_header) +
tile_count * sizeof(struct tu_subsampled_bin)) / 4);
tu_cs_emit_qw(cs, iova);
tu_cs_emit_array(cs, (const uint32_t *)&metadata.hdr,
sizeof(struct tu_subsampled_header) / 4);
tu_cs_emit_array(cs, (const uint32_t *)&metadata.bins,
sizeof(struct tu_subsampled_bin) * tile_count / 4);
}
/* The cache-tracking infrastructure can't be aware of subsampled images,
 * so manually make sure the writes land. Sampling the image should
* already insert a CACHE_INVALIDATE + WFI.
*/
cmd->state.cache.pending_flush_bits |=
TU_CMD_FLAG_WAIT_MEM_WRITES;
}
nir_def *
tu_get_subsampled_coordinates(nir_builder *b,
nir_def *coords,
nir_def *descriptor)
{
nir_def *layer;
if (coords->num_components > 2)
layer = nir_f2u16(b, nir_channel(b, coords, 2));
else
layer = nir_imm_intN_t(b, 0, 16);
nir_def *layer_offset =
nir_imul_imm_nuw(b, layer, sizeof(struct tu_subsampled_metadata) / 16);
nir_def *hdr0 =
nir_load_ubo(b, 4, 32, descriptor,
nir_ishl_imm(b, nir_u2u32(b, layer_offset), 4),
.align_mul = 16,
.align_offset = 0,
.range = TU_SUBSAMPLED_MAX_LAYERS * sizeof(struct tu_subsampled_metadata));
nir_def *bin_stride =
nir_load_ubo(b, 1, 32, descriptor, nir_ishl_imm(b, nir_u2u32(b, nir_iadd_imm(b, layer_offset, 1)), 4),
.align_mul = 16,
.align_offset = 0,
.range = TU_SUBSAMPLED_MAX_LAYERS * sizeof(struct tu_subsampled_metadata));
nir_def *hdr_scale = nir_channels(b, hdr0, 0x3);
nir_def *hdr_offset = nir_channels(b, hdr0, 0xc);
nir_def *bin = nir_f2u16(b, nir_ffma(b, coords, hdr_scale, hdr_offset));
nir_def *bin_idx = nir_iadd(b, nir_imul(b, nir_channel(b, bin, 1),
nir_u2u16(b, bin_stride)),
nir_channel(b, bin, 0));
bin_idx = nir_iadd_imm(b, nir_iadd(b, bin_idx, layer_offset),
sizeof(struct tu_subsampled_header) / 16);
nir_def *bin_data =
nir_load_ubo(b, 4, 32, descriptor, nir_ishl_imm(b, nir_u2u32(b, bin_idx), 4),
.align_mul = 16,
.align_offset = 0,
.range = TU_SUBSAMPLED_MAX_LAYERS * sizeof(struct tu_subsampled_metadata));
nir_def *bin_scale = nir_channels(b, bin_data, 0x3);
nir_def *bin_offset = nir_channels(b, bin_data, 0xc);
return nir_ffma(b, coords, bin_scale, bin_offset);
}
/* Calculate the y coordinate in subsampled space of the position a given
 * number of tiles below the start of "tile".
 */
static void
calc_tile_vert_pos(const struct tu_tile_config *tile,
const struct tu_tiling_config *tiling,
const struct tu_framebuffer *fb,
unsigned view,
VkOffset2D bin_offset,
unsigned tile_offset,
unsigned *pos_y_out)
{
int offset_px = 0;
if (tile->pos.y == 0 && tile_offset > 0) {
/* The first row is a partial row with FDM offset. */
offset_px += tiling->tile0.height - bin_offset.y;
tile_offset--;
}
offset_px += tiling->tile0.height * tile_offset;
unsigned pos_y = tile->subsampled_pos[view].offset.y +
offset_px / get_effective_frag_area(tile, view).height;
/* The last tile is along the framebuffer edge, so clamp to the framebuffer
* height.
*/
*pos_y_out = MIN2(pos_y, tile->subsampled_pos[view].offset.y +
tile->subsampled_pos[view].extent.height);
}
static void
calc_tile_horiz_pos(const struct tu_tile_config *tile,
const struct tu_tiling_config *tiling,
const struct tu_framebuffer *fb,
unsigned view,
VkOffset2D bin_offset,
unsigned tile_offset,
unsigned *pos_x_out)
{
int offset_px = 0;
if (tile->pos.x == 0 && tile_offset > 0) {
/* The first column is a partial column with FDM offset. */
offset_px += tiling->tile0.width - bin_offset.x;
tile_offset--;
}
offset_px += tiling->tile0.width * tile_offset;
unsigned pos_x = tile->subsampled_pos[view].offset.x +
offset_px / get_effective_frag_area(tile, view).width;
/* The last tile is along the framebuffer edge, so clamp to the framebuffer
* width.
*/
*pos_x_out = MIN2(pos_x, tile->subsampled_pos[view].offset.x +
tile->subsampled_pos[view].extent.width);
}
/* Given two tiles "tile" and "other_tile", calculate the y coordinates of
* their shared vertical edge in subsampled space relative to "tile". That is,
* calculate the y coordinates along the edge of "tile" where "other_tile"
* will touch it after scaling up to framebuffer coordinates. The start and
* end may be the same coordinate if "tile" and "other_tile" only share a
* corner, but this will be extended when handling corners.
*/
static void
calc_shared_vert_edge(const struct tu_tile_config *tile,
const struct tu_tile_config *other_tile,
const struct tu_tiling_config *tiling,
const struct tu_framebuffer *fb,
unsigned view,
VkOffset2D bin_offset,
unsigned *out_start,
unsigned *out_end)
{
int other_start_tile = MAX2(other_tile->pos.y - tile->pos.y, 0);
assert(other_start_tile <= tile->sysmem_extent.height);
calc_tile_vert_pos(tile, tiling, fb, view, bin_offset,
other_start_tile, out_start);
int other_end_tile =
MIN2(tile->pos.y + tile->sysmem_extent.height,
other_tile->pos.y + other_tile->sysmem_extent.height) - tile->pos.y;
assert(other_end_tile >= 0);
calc_tile_vert_pos(tile, tiling, fb, view, bin_offset,
other_end_tile, out_end);
}
static void
calc_shared_horiz_edge(const struct tu_tile_config *tile,
const struct tu_tile_config *other_tile,
const struct tu_tiling_config *tiling,
const struct tu_framebuffer *fb,
unsigned view,
VkOffset2D bin_offset,
unsigned *out_start,
unsigned *out_end)
{
int other_start_tile = MAX2(other_tile->pos.x - tile->pos.x, 0);
assert(other_start_tile <= tile->sysmem_extent.width);
calc_tile_horiz_pos(tile, tiling, fb, view, bin_offset,
other_start_tile, out_start);
int other_end_tile =
MIN2(tile->pos.x + tile->sysmem_extent.width,
other_tile->pos.x + other_tile->sysmem_extent.width) - tile->pos.x;
assert(other_end_tile >= 0);
calc_tile_horiz_pos(tile, tiling, fb, view, bin_offset,
other_end_tile, out_end);
}
/* Extend vertical-edge blit start and end for apron corners. */
static void
handle_vertical_corners(const struct tu_tile_config *tile,
const struct tu_tile_config *other_tile,
unsigned view,
VkRect2D *tile_dst,
struct tu_rect2d_float *other_src)
{
float other_apron_height =
(float)APRON_SIZE * get_effective_frag_area(tile, view).height /
get_effective_frag_area(other_tile, view).height;
if ((unsigned)other_src->y_start > other_tile->subsampled_pos[view].offset.y) {
tile_dst->offset.y -= APRON_SIZE;
tile_dst->extent.height += APRON_SIZE;
other_src->y_start -= other_apron_height;
}
if ((unsigned)other_src->y_end <
other_tile->subsampled_pos[view].offset.y +
other_tile->subsampled_pos[view].extent.height) {
tile_dst->extent.height += APRON_SIZE;
other_src->y_end += other_apron_height;
}
}
static void
handle_horizontal_corners(const struct tu_tile_config *tile,
const struct tu_tile_config *other_tile,
unsigned view,
VkRect2D *tile_dst,
struct tu_rect2d_float *other_src)
{
float other_apron_width =
(float)APRON_SIZE * get_effective_frag_area(tile, view).width /
get_effective_frag_area(other_tile, view).width;
if ((unsigned)other_src->x_start > other_tile->subsampled_pos[view].offset.x) {
tile_dst->offset.x -= APRON_SIZE;
tile_dst->extent.width += APRON_SIZE;
other_src->x_start -= other_apron_width;
}
if ((unsigned)other_src->x_end <
other_tile->subsampled_pos[view].offset.x +
other_tile->subsampled_pos[view].extent.width) {
tile_dst->extent.width += APRON_SIZE;
other_src->x_end += other_apron_width;
}
}
unsigned
tu_calc_subsampled_aprons(VkRect2D *dst,
struct tu_rect2d_float *src,
unsigned view,
const struct tu_tile_config *tiles,
const struct tu_tiling_config *tiling,
const struct tu_vsc_config *vsc,
const struct tu_framebuffer *fb,
const VkOffset2D *fdm_offsets)
{
unsigned count = 0;
VkOffset2D bin_offset = {};
if (fdm_offsets)
bin_offset = tu_bin_offset(fdm_offsets[view], tiling);
for (unsigned y = 0; y < vsc->tile_count.height; y++) {
for (unsigned x = 0; x < vsc->tile_count.width; x++) {
const struct tu_tile_config *tile = &tiles[y * vsc->tile_count.width + x];
if (tile->merged_tile || !(tile->visible_views & (1u << view)))
continue;
int x_neighbor = tile->pos.x + tile->sysmem_extent.width;
int y_neighbor = tile->pos.y + tile->sysmem_extent.height;
/* Start with tiles sharing a vertical edge, i.e. neighbors to the
 * right. For a given neighbor, produce aprons for both this tile
 * and its neighbor along their shared edge. We handle tiles that
 * share a full edge:
*
* -------- -------
* | | |
* | tile | other |
* | | |
* -------- -------
*
* Tiles that only share a corner:
*
* -------
* | |
* | other |
* | |
* -------- -------
* | |
* | tile |
* | |
* --------
*
* And tiles where the corner of one tile comes from the edge of
* another:
*
* -------
* | |
* | |
* | |
* --------| other |
* | | |
* | tile | |
* | | |
* -------- -------
*
*/
if (x_neighbor < vsc->tile_count.width) {
int y_start = MAX2(tile->pos.y - 1, 0);
int y_end = MIN2(tile->pos.y + tile->sysmem_extent.height,
vsc->tile_count.height - 1);
const struct tu_tile_config *other_tile;
/* Sweep all tiles directly to the right, keeping in mind
* merged tiles.
*/
for (int y = y_start; y <= y_end;
y = other_tile->pos.y + other_tile->sysmem_extent.height) {
other_tile = tu_get_merged_tile_const(&tiles[y * vsc->tile_count.width + x_neighbor]);
if (!(other_tile->visible_views & (1u << view)))
continue;
/* If they are next to each other then neither needs an apron. */
if (tile->subsampled_pos[view].offset.x +
tile->subsampled_pos[view].extent.width ==
other_tile->subsampled_pos[view].offset.x)
continue;
/* If other_tile isn't entirely to the right of tile, it is not
* vertically adjacent and will be handled below instead.
*/
if (other_tile->pos.x < tile->pos.x + tile->sysmem_extent.width)
continue;
VkExtent2D frag_area = get_effective_frag_area(tile, view);
VkExtent2D other_frag_area =
get_effective_frag_area(other_tile, view);
unsigned tile_start, tile_end;
calc_shared_vert_edge(tile, other_tile, tiling, fb, view,
bin_offset, &tile_start, &tile_end);
unsigned other_tile_start, other_tile_end;
calc_shared_vert_edge(other_tile, tile, tiling, fb, view,
bin_offset, &other_tile_start,
&other_tile_end);
VkRect2D tile_dst;
tile_dst.offset.y = tile_start;
tile_dst.extent.height = tile_end - tile_start;
tile_dst.offset.x = tile->subsampled_pos[view].offset.x +
tile->subsampled_pos[view].extent.width;
tile_dst.extent.width = APRON_SIZE;
struct tu_rect2d_float other_src;
other_src.x_start = other_tile->subsampled_pos[view].offset.x;
other_src.x_end = other_src.x_start +
(float)APRON_SIZE * frag_area.width / other_frag_area.width;
other_src.y_start = other_tile_start;
other_src.y_end = other_tile_end;
/* Extend start and end for apron corners. */
handle_vertical_corners(tile, other_tile, view, &tile_dst,
&other_src);
/* Add other_tile -> tile blit to the list. */
dst[count] = tile_dst;
src[count] = other_src;
count++;
VkRect2D other_dst;
other_dst.offset.y = other_tile_start;
other_dst.extent.height = other_tile_end - other_tile_start;
other_dst.offset.x =
other_tile->subsampled_pos[view].offset.x - APRON_SIZE;
other_dst.extent.width = APRON_SIZE;
struct tu_rect2d_float tile_src;
tile_src.x_end = tile->subsampled_pos[view].offset.x
+ tile->subsampled_pos[view].extent.width;
tile_src.x_start = tile_src.x_end -
(float)APRON_SIZE * other_frag_area.width / frag_area.width;
tile_src.y_start = tile_start;
tile_src.y_end = tile_end;
handle_vertical_corners(other_tile, tile, view, &other_dst,
&tile_src);
/* Add tile -> other_tile blit to the list. */
dst[count] = other_dst;
src[count] = tile_src;
count++;
}
}
/* Now do the same thing for horizontally adjacent tiles (neighbors
 * below). Because the loop above already handled tiles that only
 * share a corner, we only have to handle neighbors below that share
 * an edge. However, these neighbors may also share a corner if they
 * are merged tiles.
 */
if (y_neighbor < vsc->tile_count.height) {
const struct tu_tile_config *other_tile;
/* Sweep all tiles directly below, keeping in mind merged tiles.
*/
for (int x = tile->pos.x;
x < tile->pos.x + tile->sysmem_extent.width;
x = other_tile->pos.x + other_tile->sysmem_extent.width) {
other_tile = tu_get_merged_tile_const(&tiles[y_neighbor * vsc->tile_count.width + x]);
if (!(other_tile->visible_views & (1u << view)))
continue;
/* If both are next to each other then neither needs an apron. */
if (tile->subsampled_pos[view].offset.y +
tile->subsampled_pos[view].extent.height ==
other_tile->subsampled_pos[view].offset.y)
continue;
VkExtent2D frag_area = get_effective_frag_area(tile, view);
VkExtent2D other_frag_area =
get_effective_frag_area(other_tile, view);
unsigned tile_start, tile_end;
calc_shared_horiz_edge(tile, other_tile, tiling, fb, view,
bin_offset, &tile_start, &tile_end);
unsigned other_tile_start, other_tile_end;
calc_shared_horiz_edge(other_tile, tile, tiling, fb, view,
bin_offset, &other_tile_start,
&other_tile_end);
VkRect2D tile_dst;
tile_dst.offset.x = tile_start;
tile_dst.extent.width = tile_end - tile_start;
tile_dst.offset.y = tile->subsampled_pos[view].offset.y +
tile->subsampled_pos[view].extent.height;
tile_dst.extent.height = APRON_SIZE;
struct tu_rect2d_float other_src;
other_src.y_start = other_tile->subsampled_pos[view].offset.y;
other_src.y_end = other_src.y_start +
(float)APRON_SIZE * frag_area.height / other_frag_area.height;
other_src.x_start = other_tile_start;
other_src.x_end = other_tile_end;
/* Extend start and end for apron corners. */
handle_horizontal_corners(tile, other_tile, view, &tile_dst,
&other_src);
/* Add other_tile -> tile blit to the list. */
dst[count] = tile_dst;
src[count] = other_src;
assert(tile_dst.offset.x >= 0);
assert(tile_dst.offset.y >= 0);
count++;
VkRect2D other_dst;
other_dst.offset.x = other_tile_start;
other_dst.extent.width = other_tile_end - other_tile_start;
other_dst.offset.y =
other_tile->subsampled_pos[view].offset.y - APRON_SIZE;
other_dst.extent.height = APRON_SIZE;
struct tu_rect2d_float tile_src;
tile_src.y_end = tile->subsampled_pos[view].offset.y
+ tile->subsampled_pos[view].extent.height;
tile_src.y_start = tile_src.y_end -
(float)APRON_SIZE * other_frag_area.height / frag_area.height;
tile_src.x_start = tile_start;
tile_src.x_end = tile_end;
handle_horizontal_corners(other_tile, tile, view, &other_dst,
&tile_src);
/* Add tile -> other_tile blit to the list. */
dst[count] = other_dst;
src[count] = tile_src;
assert(other_dst.offset.x >= 0);
assert(other_dst.offset.y >= 0);
count++;
}
}
}
}
return count;
}


@@ -0,0 +1,88 @@
/*
* Copyright © 2026 Valve Corporation.
* SPDX-License-Identifier: MIT
*/
#include <stdint.h>
#include "tu_common.h"
/* Describe the format used for subsampled image metadata. This is attached to
* subsampled images, via a separate UBO descriptor after the image
* descriptor. It is written after the render pass which writes to the image,
* and is read via code injected into the shader when sampling from a
* subsampled image.
*/
/* The maximum number of bins a subsampled image can have before we disable
* subsampling.
*/
#define TU_SUBSAMPLED_MAX_BINS 512
/* The maximum number of layers a view of a subsampled image can have.
*
* There is one metadata structure per layer, and the view uses a UBO for the
* metadata, so this is bounded by the maximum UBO size.
*
* TODO: When we implement fdm2, we should expose this as
* maxSubsampledArrayLayers. The Vulkan spec says that the minimum value for
* maxSubsampledArrayLayers is 2, so users can only rely on 2 layers even
* though we support more.
*/
#define TU_SUBSAMPLED_MAX_LAYERS 6
/* This is 2 to allow for floating-point precision errors and in case the user
* uses bicubic filtering.
*/
#define APRON_SIZE 2
struct tu_subsampled_bin {
float scale_x;
float scale_y;
float offset_x;
float offset_y;
};
struct tu_subsampled_header {
/* The bin coordinate to use is calculated as:
* bin = int(coord * scale + offset)
*/
float scale_x;
float scale_y;
float offset_x;
float offset_y;
uint32_t bin_stride;
uint32_t pad0[3];
};
struct tu_subsampled_metadata {
struct tu_subsampled_header hdr;
struct tu_subsampled_bin bins[TU_SUBSAMPLED_MAX_BINS];
};
void
tu_emit_subsampled_metadata(struct tu_cmd_buffer *cmd,
struct tu_cs *cs,
unsigned a,
const struct tu_tile_config *tiles,
const struct tu_tiling_config *tiling,
const struct tu_vsc_config *vsc,
const struct tu_framebuffer *fb,
const VkOffset2D *fdm_offsets);
unsigned
tu_calc_subsampled_aprons(VkRect2D *dst,
struct tu_rect2d_float *src,
unsigned view,
const struct tu_tile_config *tiles,
const struct tu_tiling_config *tiling,
const struct tu_vsc_config *vsc,
const struct tu_framebuffer *fb,
const VkOffset2D *fdm_offsets);
nir_def *
tu_get_subsampled_coordinates(nir_builder *b,
nir_def *coords,
nir_def *descriptor);


@@ -10,6 +10,9 @@
#include "tu_cmd_buffer.h"
#include "tu_tile_config.h"
#include "tu_subsampled_image.h"
#include "util/u_worklist.h"
static void
tu_calc_frag_area(struct tu_cmd_buffer *cmd,
@@ -369,6 +372,370 @@ tu_merge_tiles(struct tu_cmd_buffer *cmd, const struct tu_vsc_config *vsc,
}
}
/* Get the default position of the tile in subsampled space. It may be shifted
 * over later, but it has to stay within the non-subsampled rectangle (i.e.
 * the rectangle this function returns for frag_area = (1, 1)). If the tile is
 * made non-subsampled then its frag_area becomes (1, 1).
 */
static VkRect2D
get_default_tile_pos(const struct tu_physical_device *phys_dev,
struct tu_tile_config *tile,
unsigned view,
const struct tu_framebuffer *fb,
const struct tu_tiling_config *tiling,
const VkOffset2D *fdm_offsets,
VkExtent2D frag_area)
{
VkOffset2D offset = {};
if (fdm_offsets)
offset = tu_bin_offset(fdm_offsets[view], tiling);
VkOffset2D aligned_offset = {};
aligned_offset.x = offset.x / phys_dev->info->tile_align_w *
phys_dev->info->tile_align_w;
aligned_offset.y = offset.y / phys_dev->info->tile_align_h *
phys_dev->info->tile_align_h;
int32_t fb_start_x =
MAX2(tile->pos.x * (int32_t)tiling->tile0.width - offset.x, 0);
int32_t fb_end_x =
(tile->pos.x + tile->sysmem_extent.width) * tiling->tile0.width - offset.x;
int32_t fb_start_y =
MAX2(tile->pos.y * (int32_t)tiling->tile0.height - offset.y, 0);
int32_t fb_end_y =
(tile->pos.y + tile->sysmem_extent.height) * tiling->tile0.height - offset.y;
/* For tiles in the last row/column, we cannot create an apron for their
* right/bottom edges because we don't know what addressing mode the
* sampler will use. If the edge of the framebuffer is the same as the edge
* of the image, then when sampling the image near the edge we'd expect the
* sampler border handling to kick in, but that doesn't work unless the
* tile is shifted to the end of the framebuffer. Because the images are
* made larger, we have to shift it over by the same amount, which is
* currently gmem_align_w/gmem_align_h, so that if the framebuffer is the
* same size as the original API image then the border works correctly.
*
* For tiles not in the first row/column, we align the FDM offset down so
* that we can use the faster tile store method. This means that the
* subsampled space tile start may be shifted compared to framebuffer
* space. This will create a gap between the first and second tiles, which
* will require an apron even if neither is subsampled. This works because
* gmem_align_w/gmem_align_h is always at least the apron size times two.
*/
bool stick_to_end_x = fb_end_x >= fb->width;
bool stick_to_end_y = fb_end_y >= fb->height;
unsigned fb_offset_x = fdm_offsets ?
phys_dev->info->tile_align_w : 0;
unsigned fb_offset_y = fdm_offsets ?
phys_dev->info->tile_align_h : 0;
int32_t start_x, end_x, start_y, end_y;
if (stick_to_end_x) {
end_x = fb->width + fb_offset_x;
start_x = end_x - DIV_ROUND_UP(fb->width - fb_start_x, frag_area.width);
} else if (tile->pos.x == 0) {
start_x = 0;
end_x = fb_end_x / frag_area.width;
} else {
start_x = tile->pos.x * tiling->tile0.width - aligned_offset.x;
end_x = start_x + tile->sysmem_extent.width * tiling->tile0.width / frag_area.width;
}
if (stick_to_end_y) {
end_y = fb->height + fb_offset_y;
start_y = end_y - DIV_ROUND_UP(fb->height - fb_start_y, frag_area.height);
} else if (tile->pos.y == 0) {
start_y = 0;
end_y = fb_end_y / frag_area.height;
} else {
start_y = tile->pos.y * tiling->tile0.height - aligned_offset.y;
end_y = start_y + tile->sysmem_extent.height * tiling->tile0.height / frag_area.height;
}
if (stick_to_end_x || stick_to_end_y)
tile->subsampled_border = true;
return (VkRect2D) {
.offset = { start_x, start_y },
.extent = { end_x - start_x, end_y - start_y },
};
}
static void
make_non_subsampled(const struct tu_physical_device *phys_dev,
struct tu_tile_config *tile,
unsigned view,
const struct tu_framebuffer *fb,
const struct tu_tiling_config *tiling,
const VkOffset2D *fdm_offsets)
{
tile->subsampled_views &= ~(1u << view);
tile->subsampled_pos[view] =
get_default_tile_pos(phys_dev, tile, view, fb, tiling, fdm_offsets,
(VkExtent2D) { 1, 1 });
}
static bool
aprons_intersect(struct tu_tile_config *a, struct tu_tile_config *b,
unsigned view)
{
if (a->subsampled_pos[view].offset.x +
a->subsampled_pos[view].extent.width + APRON_SIZE * 2 <=
b->subsampled_pos[view].offset.x)
return false;
if (b->subsampled_pos[view].offset.x +
b->subsampled_pos[view].extent.width + APRON_SIZE * 2 <=
a->subsampled_pos[view].offset.x)
return false;
if (a->subsampled_pos[view].offset.y +
a->subsampled_pos[view].extent.height + APRON_SIZE * 2 <=
b->subsampled_pos[view].offset.y)
return false;
if (b->subsampled_pos[view].offset.y +
b->subsampled_pos[view].extent.height + APRON_SIZE * 2 <=
a->subsampled_pos[view].offset.y)
return false;
return true;
}
/*
* Calculate the location of each bin in the subsampled image and whether we
* need to avoid subsampling it. The constraint we have to deal with here is
* that for any two tiles sharing an edge, either both must not be subsampled
* (so that we do not need to insert an apron) or they must be at least 4
* pixels apart along that edge to create an apron of 2 pixels around each
* tile. The apron includes the corner of the tile, so tiles that only touch
* corners also count as touching along both edges. The two strategies
* available to us to deal with this are disabling subsampling and shifting
* over the origin of the tile, which only works when there is enough free
 * space to shift it. This is complicated by the fact that one or both of the
 * neighboring tiles may be a merged tile, so a tile may have several
 * neighbors along a single edge instead of just one.
 *
 * By default, we make each bin start at an aligned version of its start in
 * framebuffer space, b_s. This means that the tile grid is shifted up and to
 * the right when an FDM offset is applied, making sure the last row/column of
 * tiles always fits within the image and that we only need a small fixed
 * amount of extra space to hold the overflow.
*/
static void
tu_calc_subsampled(struct tu_tile_config *tiles,
const struct tu_physical_device *phys_dev,
const struct tu_tiling_config *tiling,
const struct tu_framebuffer *fb,
const struct tu_vsc_config *vsc,
const VkOffset2D *fdm_offsets)
{
u_worklist worklist;
u_worklist_init(&worklist, vsc->tile_count.width * vsc->tile_count.height,
NULL);
for (unsigned y = 0; y < vsc->tile_count.height; y++) {
for (unsigned x = 0; x < vsc->tile_count.width; x++) {
struct tu_tile_config *tile = &tiles[y * vsc->tile_count.width + x];
if (!tile->visible_views || tile->merged_tile)
continue;
u_foreach_bit (view, tile->visible_views) {
VkOffset2D offset = {};
if (fdm_offsets)
offset = tu_bin_offset(fdm_offsets[view], tiling);
tile->subsampled_pos[view] =
get_default_tile_pos(phys_dev, tile, view, fb, tiling, fdm_offsets,
tile->frag_areas[view]);
if (tile->frag_areas[view].width != 1 ||
tile->frag_areas[view].height != 1)
tile->subsampled_views |= 1u << view;
}
tile->subsampled = true;
tile->worklist_idx = y * vsc->tile_count.width + x;
u_worklist_push_tail(&worklist, tile, worklist_idx);
}
}
while (!u_worklist_is_empty(&worklist)) {
struct tu_tile_config *tile =
u_worklist_pop_head(&worklist, struct tu_tile_config, worklist_idx);
/* First, iterate over the vertically adjacent tiles and check for
* vertical issues.
*/
for (unsigned i = 0; i < 2; i++) {
int x_offset = i == 0 ? -1 : tile->sysmem_extent.width;
int x_pos = tile->pos.x + x_offset;
if (x_pos < 0 || x_pos >= vsc->tile_count.width)
continue;
int y_start = MAX2(tile->pos.y - 1, 0);
int y_end = MIN2(tile->pos.y + tile->sysmem_extent.height,
vsc->tile_count.height - 1);
struct tu_tile_config *other_tile =
tu_get_merged_tile(&tiles[y_start * vsc->tile_count.width + x_pos]);
/* Sweep from (x_pos, y_start) to (x_pos, y_end), keeping in mind
* merged tiles.
*/
for (int y = y_start; y <= y_end;
y = other_tile->pos.y + other_tile->sysmem_extent.height) {
other_tile = tu_get_merged_tile(&tiles[y * vsc->tile_count.width + x_pos]);
uint32_t common_views = tile->visible_views &
other_tile->visible_views;
if (common_views == 0)
continue;
if (((tile->subsampled_views | other_tile->subsampled_views) &
common_views) == 0)
continue;
struct tu_tile_config *left_tile = (i == 0) ? other_tile : tile;
struct tu_tile_config *right_tile = (i == 0) ? tile : other_tile;
/* Due to bin merging, the right tile may not actually be to the
 * right of the left tile, instead extending over it, for example
 * if other_tile includes (0, 0) and (1, 0) and the current tile
 * is (0, 1) or vice versa. The tiles are then also horizontally
 * adjacent, and we can skip the pair here because it will be
 * handled by the horizontal pass below: once the tiles no longer
 * touch horizontally, they cannot touch vertically either.
 */
if (right_tile->pos.x < left_tile->pos.x +
left_tile->sysmem_extent.width)
continue;
u_foreach_bit (view, common_views) {
if (!((tile->subsampled_views | other_tile->subsampled_views) &
(1u << view)))
continue;
if (!aprons_intersect(tile, other_tile, view))
continue;
/* Try shifting the right tile to the right. */
if (right_tile->subsampled_views & (1u << view)) {
VkRect2D right_unsubsampled =
get_default_tile_pos(phys_dev, right_tile, view, fb,
tiling, fdm_offsets,
(VkExtent2D) { 1, 1 });
const unsigned shift_amount =
MAX2(APRON_SIZE * 2, phys_dev->info->tile_align_w);
if (right_tile->subsampled_pos[view].offset.x +
right_tile->subsampled_pos[view].extent.width +
shift_amount <= right_unsubsampled.offset.x +
right_unsubsampled.extent.width) {
right_tile->subsampled_pos[view].offset.x +=
shift_amount;
u_worklist_push_tail(&worklist, right_tile,
worklist_idx);
continue;
}
}
/* Now we have to make both tiles non-subsampled. */
if (tile->subsampled_views & (1u << view)) {
make_non_subsampled(phys_dev, tile, view, fb, tiling, fdm_offsets);
u_worklist_push_tail(&worklist, tile, worklist_idx);
}
if (other_tile->subsampled_views & (1u << view)) {
make_non_subsampled(phys_dev, other_tile, view, fb, tiling, fdm_offsets);
u_worklist_push_tail(&worklist, other_tile, worklist_idx);
}
}
}
}
/* Do the identical thing for vertically adjacent tiles.
 */
for (unsigned i = 0; i < 2; i++) {
int y_offset = i == 0 ? -1 : tile->sysmem_extent.height;
int y_pos = tile->pos.y + y_offset;
if (y_pos < 0 || y_pos >= vsc->tile_count.height)
continue;
int x_start = MAX2(tile->pos.x - 1, 0);
int x_end = MIN2(tile->pos.x + tile->sysmem_extent.width,
vsc->tile_count.width - 1);
struct tu_tile_config *other_tile =
tu_get_merged_tile(&tiles[y_pos * vsc->tile_count.width + x_start]);
/* Sweep from (x_start, y_pos) to (x_end, y_pos), keeping in mind
* merged tiles.
*/
for (int x = x_start; x <= x_end;
x = other_tile->pos.x + other_tile->sysmem_extent.width) {
other_tile = tu_get_merged_tile(&tiles[y_pos * vsc->tile_count.width + x]);
uint32_t common_views = tile->visible_views &
other_tile->visible_views;
if (common_views == 0)
continue;
if (((tile->subsampled_views | other_tile->subsampled_views) &
common_views) == 0)
continue;
struct tu_tile_config *top_tile = (i == 0) ? other_tile : tile;
struct tu_tile_config *bottom_tile = (i == 0) ? tile : other_tile;
/* Due to bin merging, the bottom tile may not actually be
 * below the top tile, instead extending below it, for example
 * if other_tile includes (0, 0) and (0, 1) and the current
 * tile is (1, 0) or vice versa. The tiles are then also
 * horizontally adjacent, and we can skip them here because any
 * conflict between them will have been handled by the horizontal
 * pass above.
 */
if (bottom_tile->pos.y < top_tile->pos.y +
top_tile->sysmem_extent.height)
continue;
u_foreach_bit (view, common_views) {
if (!((tile->subsampled_views | other_tile->subsampled_views) &
(1u << view)))
continue;
if (!aprons_intersect(tile, other_tile, view))
continue;
/* Try shifting the bottom tile down. */
if (bottom_tile->subsampled_views & (1u << view)) {
VkRect2D bottom_unsubsampled =
get_default_tile_pos(phys_dev, bottom_tile, view, fb,
tiling, fdm_offsets,
(VkExtent2D) { 1, 1 });
const unsigned shift_amount =
MAX2(APRON_SIZE * 2, phys_dev->info->tile_align_h);
if (bottom_tile->subsampled_pos[view].offset.y +
bottom_tile->subsampled_pos[view].extent.height +
shift_amount <= bottom_unsubsampled.offset.y +
bottom_unsubsampled.extent.height) {
bottom_tile->subsampled_pos[view].offset.y +=
shift_amount;
u_worklist_push_tail(&worklist, bottom_tile,
worklist_idx);
continue;
}
}
/* Now we have to make both tiles non-subsampled. One or both
* may be shifted so we have to un-shift them.
*/
if (tile->subsampled_views & (1u << view)) {
make_non_subsampled(phys_dev, tile, view, fb, tiling, fdm_offsets);
u_worklist_push_tail(&worklist, tile, worklist_idx);
}
if (other_tile->subsampled_views & (1u << view)) {
make_non_subsampled(phys_dev, other_tile, view, fb, tiling, fdm_offsets);
u_worklist_push_tail(&worklist, other_tile, worklist_idx);
}
}
}
}
}
u_worklist_fini(&worklist);
}
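The convergence argument behind the worklist loop above can be seen in isolation: demoting a tile to non-subsampled is the only state change, so the state is monotone and the fixpoint iteration must terminate. Below is a minimal standalone model of that pattern (all names are hypothetical, not the driver's API; a simple FIFO stands in for `u_worklist`, and "adjacent subsampled tiles conflict" stands in for `aprons_intersect()`):

```c
#include <assert.h>
#include <stdbool.h>

#define N 4

/* Hypothetical simplified model: sub[i] says whether tile i of a 1-D
 * strip is stored subsampled. Whenever two adjacent tiles are both
 * subsampled they "conflict" and both are demoted and re-queued,
 * mirroring make_non_subsampled() + u_worklist_push_tail(). Because a
 * tile only ever transitions subsampled -> non-subsampled, the loop
 * terminates. Returns the number of demotions performed. */
static int
relax_tiles(bool sub[N])
{
   int queue[N * N];            /* generous FIFO for this tiny example */
   int head = 0, tail = 0, demotions = 0;

   for (int i = 0; i < N; i++)
      queue[tail++] = i;

   while (head != tail) {
      int t = queue[head++];
      for (int d = -1; d <= 1; d += 2) {
         int o = t + d;
         if (o < 0 || o >= N)
            continue;
         if (sub[t] && sub[o]) {
            /* Demote both tiles and revisit them later. */
            sub[t] = false;
            sub[o] = false;
            demotions += 2;
            queue[tail++] = t;
            queue[tail++] = o;
         }
      }
   }
   return demotions;
}
```

For example, starting from `{true, false, true, true}` only the adjacent pair (2, 3) conflicts, so both are demoted and the isolated subsampled tile 0 survives.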
struct tu_tile_config *
tu_calc_tile_config(struct tu_cmd_buffer *cmd, const struct tu_vsc_config *vsc,
@@ -420,6 +787,13 @@ tu_calc_tile_config(struct tu_cmd_buffer *cmd, const struct tu_vsc_config *vsc,
}
}
if (cmd->state.fdm_subsampled &&
vsc->tile_count.width * vsc->tile_count.height <= TU_SUBSAMPLED_MAX_BINS) {
tu_calc_subsampled(tiles, cmd->device->physical_device,
cmd->state.tiling, cmd->state.framebuffer,
vsc, fdm_offsets);
}
return tiles;
}
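The shift test in the loops above reduces to simple interval arithmetic: a subsampled bin may slide within its full-resolution footprint, but its far edge must not pass the footprint's far edge. A hedged distillation of that check (`can_shift_right` is a made-up name, not a driver function):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical distillation of the shift check in tu_calc_subsampled():
 * a subsampled bin of width sub_w starting at sub_x may be shifted right
 * by `shift` pixels only while it stays inside its unsubsampled
 * footprint [unsub_x, unsub_x + unsub_w). */
static bool
can_shift_right(int sub_x, int sub_w, int unsub_x, int unsub_w, int shift)
{
   return sub_x + sub_w + shift <= unsub_x + unsub_w;
}
```

In the driver the shift amount itself is `MAX2(APRON_SIZE * 2, tile_align_w)` (or `tile_align_h` for the vertical pass), so for instance a half-resolution 32-pixel bin inside a 64-pixel footprint has 32 pixels of slack before shifting fails and both tiles must fall back to non-subsampled storage.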


@@ -18,10 +18,37 @@ struct tu_tile_config {
uint32_t pipe;
uint32_t slot_mask;
uint32_t visible_views;
/* Whether to use subsampled_pos instead of the normal origin in
* framebuffer space when storing this tile.
*/
bool subsampled;
/* If subsampled is true, whether this is a border tile that may not be
* aligned.
*/
bool subsampled_border;
/* If subsampled is true, a bitmask of which views to store subsampled.
 * If a view's bit is set, the view is stored low-resolution as-is; if it
 * is clear, the view is expanded to its full size in sysmem when
 * resolving. However, the origin of the tile in subsampled space is
 * always subsampled_pos when subsampled is true, regardless of the value
 * of this field.
 */
uint32_t subsampled_views;
/* Used internally. */
unsigned worklist_idx;
/* The tile this tile was merged with. */
struct tu_tile_config *merged_tile;
/* For subsampled images, the start of the tile in the final subsampled
* image for each view. This may or may not be the start of the tile in
* framebuffer space, due to the need to shift tiles over.
*/
VkRect2D subsampled_pos[MAX_VIEWS];
/* For merged tiles, the extent in tiles when resolved to system memory.
*/
VkExtent2D sysmem_extent;
@@ -34,6 +61,25 @@ struct tu_tile_config {
VkExtent2D frag_areas[MAX_VIEWS];
};
/* After merging, follow the trail of merged_tile pointers back to the tile
* this tile was ultimately merged with.
*/
static inline struct tu_tile_config *
tu_get_merged_tile(struct tu_tile_config *tile)
{
while (tile->merged_tile)
tile = tile->merged_tile;
return tile;
}
static inline const struct tu_tile_config *
tu_get_merged_tile_const(const struct tu_tile_config *tile)
{
while (tile->merged_tile)
tile = tile->merged_tile;
return tile;
}
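The `merged_tile` chain above behaves like a union-find "find" without path compression: merging tile B into tile A sets B's pointer to A, and a lookup follows the chain to the final representative. A minimal standalone model (the `struct tile` and `find_merged` names are illustrative, not the driver's types):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tu_tile_config's merged_tile link:
 * a merged tile points at the tile it was merged into, possibly through
 * several hops if that tile was itself merged later. */
struct tile {
   struct tile *merged_tile;   /* NULL for a representative tile */
};

/* Follow the chain to the representative, as tu_get_merged_tile() does. */
static struct tile *
find_merged(struct tile *t)
{
   while (t->merged_tile)
      t = t->merged_tile;
   return t;
}
```

So if C was merged into B and B was later merged into A, looking up any of the three yields A. Skipping path compression keeps the helper trivially usable on `const` tiles too, which is presumably why the driver provides both a mutable and a `const` variant rather than a compressing find.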
struct tu_tile_config *
tu_calc_tile_config(struct tu_cmd_buffer *cmd, const struct tu_vsc_config *vsc,
const struct tu_image_view *fdm, const VkOffset2D *fdm_offsets);