InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, Wenhai Wang

TL;DR
InternSpatial introduces the largest open-source dataset for spatial reasoning in vision-language models, together with a companion benchmark, enabling improved understanding of spatial queries across diverse visual environments and instruction formats.
Contribution
It provides a large-scale, diverse dataset and a novel benchmark with multi-view reasoning tasks, advancing spatial understanding in vision-language models.
Findings
Models trained on InternSpatial improved by 12.1% on InternSpatial-Bench.
The same models achieved a 10.7% improvement on VSI-Bench.
Maintained strong performance on general benchmarks.
Abstract
Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that…
Peer Reviews
Decision·ICLR 2026 Poster
### Reasonable Dataset Design
- The dataset covers diverse visual domains (indoor, outdoor, object-centric, embodied, urban) and both single-view and multi-view reasoning setups.
- It supports a wide variety of instruction modalities, text, bounding boxes, masks, numeric indicators, coordinate-based prompts, totaling 19 instruction types, a major advancement over prior datasets like SpatialVLM or OSD.
### Data Generation Process
- The data pipeline integrates multiple pretrained modules for dept
### The limit of Template-Driven QA Generation
While efficient, the template-based QA generation may lead to limited linguistic diversity and potential overfitting to templated phrasing. The authors acknowledge this, but do not quantify how template rigidity affects generalization to natural human queries.
### Lack of Qualitative Error Analysis
The evaluation focuses almost exclusively on quantitative metrics. There is little qualitative examination of failure modes (e.g., reasoning about o
In general, I think a 12M dataset is quite a significant improvement over previous QA datasets in terms of data quantity. The dataset draws on a wide variety of sources, as shown in Figure 4. Results on InternSpatial-Bench, VSI-Bench, and other general benchmarks show that training on this dataset brings substantial improvements.
It would be great to understand how much the image datasets are helping with the training in general. The alignment to view space from 2D images requires depth estimation followed by camera estimation, both of which could potentially introduce significant errors. I wonder if it would be possible to see the improvements based on InternVL-Spatial-8B trained with only 3D datasets and/or only 2D datasets. Also, this would give more insights on whether there are domain gaps within the training datase
1. The proposed dataset is large-scale, especially regarding the number of QA pairs.
2. The data generation pipeline is sound.
3. It shows promising performance leveraging the curated data.
1. Could the authors compare the proposed dataset with previous ones in terms of scenarios, question types, etc.?
2. Could the authors validate the effectiveness of the proposed data on more open-source frameworks?
3. Could the authors explain why performance shows limited gains on rotation estimation and object counting?
4. Could the generated QA pairs reflect the complexity or ambiguity of human spatial questions?
5. Will the data generation pipeline and data be publicly available?
