SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning
Andrew Li, Rahul Thapa, Rahul Chalamala, Qingyang Wu, Kezhen Chen, James Zou

TL;DR
This paper introduces SMiR, a synthetic data pipeline and benchmark for multi-image reasoning, enabling cost-effective dataset creation and comprehensive evaluation of vision-language models' reasoning abilities.
Contribution
The paper presents a novel synthetic data-generation pipeline and benchmark for multi-image reasoning, addressing resource challenges and evaluation gaps in the field.
Findings
SMiR generates 160K synthetic training samples efficiently.
SMiR-Bench provides a diverse, multi-turn reasoning benchmark with 200 examples.
Fine-tuned models show improved multi-image reasoning performance.
Abstract
Vision-Language Models (VLMs) excel at understanding single images, aided by high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community due to two key challenges: (1) scaling datasets with correlated images and complex reasoning instructions is resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address this, we introduce SMiR, a synthetic data-generation pipeline for multi-image reasoning, along with a high-quality dataset generated using this pipeline. SMiR efficiently extracts correlated images via multimodal embeddings, integrates visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this approach, we produce 160K synthetic training samples, offering a cost-effective alternative to closed-source solutions. Additionally, we present…
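The abstract says SMiR extracts correlated images via multimodal embeddings but gives no procedure in this excerpt. As a rough illustration of the general idea only, the sketch below greedily groups items whose precomputed embedding vectors have high cosine similarity. The function name `group_correlated_images`, the `group_size` parameter, and the greedy strategy are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def group_correlated_images(embeddings, group_size=3):
    """Greedily group items whose embeddings have high cosine similarity.

    Hypothetical helper for illustration; not the procedure used by SMiR.
    Assumes every embedding has a non-zero norm.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so that dot products equal cosine similarities.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T

    unused = set(range(len(emb)))
    groups = []
    while len(unused) >= group_size:
        # Take the lowest-index unused item as the seed of a new group.
        seed = min(unused)
        unused.remove(seed)
        # Rank the remaining items by similarity to the seed, most similar first.
        candidates = sorted(unused, key=lambda j: sim[seed, j], reverse=True)
        members = candidates[: group_size - 1]
        unused.difference_update(members)
        groups.append([seed] + members)
    return groups

# Toy usage: four 2-D vectors standing in for real multimodal embeddings.
toy = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
pairs = group_correlated_images(toy, group_size=2)
```

In practice the embeddings would come from a multimodal encoder, and each resulting group would be passed on to an instruction-generating LLM; both of those steps are outside this sketch.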
Taxonomy
Topics: Image Retrieval and Classification Techniques
