Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding

Hang Yin; Xiaomin He; PeiWen Yuan; Yiwei Li; Jiayi Shi; Wenxiao Fan; Shaoxiong Feng; Kan Li

arXiv:2512.06769·cs.CV·December 16, 2025

TL;DR

This paper introduces SiTe, a simple data augmentation method that improves spatial understanding in vision-language models by stitching images and generating spatially-aware captions without extra annotations.

Contribution

It proposes a novel, annotation-free data augmentation technique called Stitch and Tell (SiTe) that enhances spatial reasoning in vision-language models by injecting structured spatial supervision.

Findings

1. SiTe improves scores on spatial understanding benchmarks by over 4%.

2. It maintains or improves performance on general vision-language benchmarks.

3. The method is simple and plug-and-play, and it requires neither costly annotations nor advanced models.

Abstract

Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions of the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named Stitch and Tell (abbreviated as SiTe), which injects structured spatial supervision into data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures (LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B, and HALVA-7B), two training datasets, and eight benchmarks.…
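
The stitching idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: images are modeled as plain lists of pixel rows, and the helper names (`stitch`, `stitched_caption`, `spatial_qa`), the position words, and the caption and question templates are all assumptions made for illustration. The paper's actual templates and preprocessing are not given in the text above.

```python
from typing import Dict, List

# Toy stand-in for an image: a list of pixel rows (grayscale ints).
Image = List[List[int]]

# Hypothetical position words per stitching axis (not taken from the paper).
POSITIONS = {"horizontal": ("left", "right"), "vertical": ("top", "bottom")}


def stitch(img_a: Image, img_b: Image, axis: str = "horizontal") -> Image:
    """Concatenate two non-empty images along the given axis."""
    if axis not in POSITIONS:
        raise ValueError(f"unknown axis: {axis!r}")
    if axis == "horizontal":
        # Side by side: both images must have the same number of rows.
        if len(img_a) != len(img_b):
            raise ValueError("horizontal stitching needs equal heights")
        return [row_a + row_b for row_a, row_b in zip(img_a, img_b)]
    # Stacked: both images must have the same row width.
    if len(img_a[0]) != len(img_b[0]):
        raise ValueError("vertical stitching needs equal widths")
    return [list(row) for row in img_a + img_b]


def stitched_caption(cap_a: str, cap_b: str, axis: str = "horizontal") -> str:
    """Build a layout-aware caption from the two source captions.

    The template is an assumption; the spatial words follow directly from
    how the images were stitched, so no extra annotation is needed.
    """
    first, second = POSITIONS[axis]
    return f"On the {first}: {cap_a} On the {second}: {cap_b}"


def spatial_qa(cap_a: str, cap_b: str, axis: str = "horizontal") -> Dict[str, str]:
    """Build one hypothetical question-answer pair about the layout."""
    _, second = POSITIONS[axis]
    return {
        "question": f"Which part of the image shows this: {cap_b}",
        "answer": f"The {second} part.",
    }


if __name__ == "__main__":
    dog = [[1, 2], [3, 4]]
    cat = [[5, 6], [7, 8]]
    print(stitch(dog, cat, "horizontal"))
    print(stitched_caption("A dog on grass.", "A cat on a sofa.", "horizontal"))
    print(spatial_qa("A dog on grass.", "A cat on a sofa.", "vertical"))
```

The point of the sketch is that the spatial labels are known by construction: because the code decides where each source image goes, the caption or answer can state the position without a human annotator or a stronger model.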

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Language, Metaphor, and Cognition