Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan

TL;DR
This paper introduces SPRITE, a framework that uses simulators and large models to generate high-quality, diverse spatial reasoning data, significantly improving the spatial understanding of vision-language models.
Contribution
SPRITE reframes ground-truth generation as a code-generation task, enabling scalable, diverse, and precise spatial reasoning data synthesis using simulators and large language models.
Findings
A VLM trained on SPRITE data outperforms counterparts trained on other datasets on spatial benchmarks.
The SPRITE dataset includes over 300k instruction pairs drawn from 3 simulators.
Scalability analysis confirms the importance of diverse data for robust spatial reasoning.
Abstract
Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene…
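To make the code-generation idea concrete, here is a minimal sketch of what an executable program for one spatial question might look like. It is not SPRITE's actual pipeline: the scene format, the object names, and the helper functions `distance` and `closer_object` are all hypothetical, and the coordinates are made up for illustration. The point is only that once a question is expressed as a program, its answer is computed from scene coordinates rather than written by hand.

```python
import math

# Hypothetical scene metadata: object name -> 3D center position in metres.
# The layout and the values are assumptions for illustration, not SPRITE's format.
scene = {
    "chair": (1.0, 0.0, 2.0),
    "table": (4.0, 0.0, 6.0),
    "lamp": (1.0, 0.0, 3.0),
}


def distance(scene, a, b):
    """Euclidean distance between the centers of two named objects."""
    return math.dist(scene[a], scene[b])


def closer_object(scene, anchor, candidates):
    """Return the candidate whose center is nearest to the anchor object."""
    return min(candidates, key=lambda name: distance(scene, anchor, name))


# Program for the question:
# "Which is closer to the chair, the table or the lamp?"
answer = closer_object(scene, "chair", ["table", "lamp"])
print(answer)
```

In a setup like this, a language model would only need to emit the last few lines for each new question, while the geometry itself stays in small, checkable helpers.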
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
