Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding
Hang Yin, Xiaomin He, PeiWen Yuan, Yiwei Li, Jiayi Shi, Wenxiao Fan, Shaoxiong Feng, Kan Li

TL;DR
This paper introduces SiTe, a simple data augmentation method that improves spatial understanding in vision-language models by stitching images and generating spatially-aware captions without extra annotations.
Contribution
It proposes a novel, annotation-free data augmentation technique called Stitch and Tell (SiTe) that enhances spatial reasoning in vision-language models by injecting structured spatial supervision.
Findings
SiTe improves scores on spatial understanding benchmarks by over 4%.
It maintains or enhances performance on general vision-language benchmarks.
The method is simple, plug-and-play, and does not require costly annotations or advanced models.
Abstract
Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions about the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named Stitch and Tell (abbreviated as SiTe), which injects structured spatial supervision into data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures including LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B and HALVA-7B, two training datasets, and eight benchmarks.…
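The stitch-and-caption idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy image representation (nested lists of pixel values), the function names `stitch`, `tell_caption`, and `tell_qa`, and the caption and question templates are all assumptions made for illustration. It only shows how a layout-aware caption or question-answer pair could be derived from the stitching axis alone, with no extra annotation.

```python
from typing import List, Tuple

# Toy image: a list of rows, each row a list of pixel values (assumed representation).
Image = List[List[int]]

# Position words implied by each stitching axis (hypothetical templates).
POSITIONS = {"horizontal": ("left", "right"), "vertical": ("top", "bottom")}


def stitch(img_a: Image, img_b: Image, axis: str) -> Image:
    """Concatenate two non-empty images along the given spatial axis."""
    if axis == "horizontal":
        if len(img_a) != len(img_b):
            raise ValueError("horizontal stitching needs equal heights")
        # Place img_b to the right of img_a, row by row.
        return [row_a + row_b for row_a, row_b in zip(img_a, img_b)]
    if axis == "vertical":
        if len(img_a[0]) != len(img_b[0]):
            raise ValueError("vertical stitching needs equal widths")
        # Place img_b below img_a.
        return [list(row) for row in img_a] + [list(row) for row in img_b]
    raise ValueError(f"unknown axis: {axis}")


def tell_caption(cap_a: str, cap_b: str, axis: str) -> str:
    """Build a caption whose spatial words follow from the layout."""
    first, second = POSITIONS[axis]
    return f"On the {first}: {cap_a} On the {second}: {cap_b}"


def tell_qa(cap_a: str, cap_b: str, axis: str) -> Tuple[str, str]:
    """Build one question-answer pair about where the first image sits."""
    first, _ = POSITIONS[axis]
    question = f"Which part of the image shows the following? {cap_a}"
    answer = f"The {first} part."
    return question, answer


if __name__ == "__main__":
    dog = [[1, 2], [3, 4]]
    cat = [[5, 6], [7, 8]]
    print(stitch(dog, cat, "horizontal"))
    print(tell_caption("A dog on grass.", "A cat on a sofa.", "horizontal"))
    print(tell_qa("A dog on grass.", "A cat on a sofa.", "vertical"))
```

Because the spatial words are determined entirely by the chosen axis and the order of the two inputs, the supervision is correct by construction, which is what makes an annotation-free pipeline of this kind plausible.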
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Language, Metaphor, and Cognition
