Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning

Yihong Tang; Ao Qu; Zhaokai Wang; Dingyi Zhuang; Zhaofeng Wu; Wei Ma; Shenhao Wang; Yunhan Zheng; Zhan Zhao; Jinhua Zhao

arXiv:2410.16162 · cs.CV · October 3, 2025

TL;DR

This paper introduces Sparkle, a framework that enhances vision language models' basic 2D spatial skills through synthetic training, leading to improved performance on complex and real-world spatial reasoning tasks.

Contribution

The paper proposes a systematic approach to improve VLMs' spatial reasoning by training on disentangled basic skills using synthetic data, enabling better generalization.

Findings

01. VLMs trained with Sparkle outperform baseline models on complex spatial tasks.

02. Enhanced basic spatial skills lead to significant improvements in real-world spatial reasoning.

03. Training on synthetic data generalizes effectively to out-of-distribution spatial reasoning scenarios.

Abstract

Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning, which is essential for navigation and interaction with physical environments. Many spatial reasoning tasks depend on fundamental two-dimensional (2D) skills, yet our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems, including simple pathfinding tasks that humans solve effortlessly. To address this, we enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities. We first disentangle 2D spatial reasoning into three core components: direction comprehension, distance estimation, and localization. We hypothesize that mastering these skills substantially improves performance on complex spatial tasks that require advanced reasoning and combinatorial problem solving, while also generalizing to real-world…
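
To make the three basic capabilities concrete, the following is a minimal sketch of how ground-truth labels for direction, distance, and localization might be computed for two random points on a 2D canvas. It is not the paper's data pipeline: the function names, the 3x3 grid used for localization, the 100-unit canvas, and the convention that y grows upward are all assumptions made for illustration.

```python
import math
import random


def direction(a, b):
    """Coarse direction of b relative to a (assumes y grows upward)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    horiz = "right" if dx > 0 else "left" if dx < 0 else ""
    vert = "up" if dy > 0 else "down" if dy < 0 else ""
    # Join the non-empty parts; identical points have no direction.
    return "-".join(p for p in (vert, horiz) if p) or "same"


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def localize(p, size, grid=3):
    """(row, col) of the grid cell containing p on a size x size canvas."""
    col = min(int(p[0] / size * grid), grid - 1)
    row = min(int(p[1] / size * grid), grid - 1)
    return row, col


def make_sample(rng, size=100):
    """Draw two random points and label them with all three basic skills."""
    a = (rng.randrange(size), rng.randrange(size))
    b = (rng.randrange(size), rng.randrange(size))
    return {
        "a": a,
        "b": b,
        "direction": direction(a, b),
        "distance": distance(a, b),
        "cell_a": localize(a, size),
    }


if __name__ == "__main__":
    print(make_sample(random.Random(0)))
```

Keeping each skill in its own function mirrors the idea of disentangling the capabilities: each label can be generated and queried independently, or combined when a composite task needs it.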

Taxonomy

Topics: Constraint Satisfaction and Optimization