Visual Spatial Tuning
Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

TL;DR
This paper introduces Visual Spatial Tuning (VST), a new framework that significantly improves the spatial perception and reasoning abilities of vision-language models using large-scale datasets and a progressive training pipeline.
Contribution
The paper presents VST, a comprehensive approach with new datasets and training methods to enhance human-like visuospatial abilities in VLMs without compromising their general capabilities.
Findings
VST achieves 34.8% on MMSI-Bench and 61.2% on VSI-Bench.
VST improves spatial reasoning without harming general model performance.
The paper reports state-of-the-art results on multiple spatial benchmarks.
Abstract
Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline:…
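The reviews below describe the progressive pipeline as SFT → CoT → RL and name GRPO as the RL stage. As a rough illustration only, the following is a minimal sketch of the group-relative advantage that GRPO is generally built on: several answers are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. It is not the authors' implementation; the function name, the use of the population standard deviation, the `eps` term, and the binary reward values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same prompt.

    Hypothetical helper: answers that score above the group mean get a
    positive advantage, answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every answer receives the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers: 1.0 = correct, 0.0 = wrong
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every answer in a group receives the same reward, all advantages are zero, so that group contributes no learning signal under this normalization.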
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Strengths:
- The paper is clearly written.
- The constructed dataset seems large and comprehensive.
Weaknesses:
- The datasets are not released.
- The choice of tasks and data seems arbitrary; I do not see a clear guideline or rationale for why a dataset or task is or is not used.
- In Figure 2, how is the percentage of each data source determined?
- In addition to Qwen, the authors need to demonstrate that the datasets can also enhance other VLMs.
The benchmark is holistic:
- The two-stage decomposition, VST-Perception (VST-P) for foundational 3D knowledge and VST-Reasoning (VST-R) for higher-order reasoning, is conceptually clean and mirrors cognitive-development theory (Piaget).
- The progressive training pipeline (SFT → CoT → RL) is consistent and thoughtfully motivated.
Strong Engineering and Benchmarking
- Comprehensive experiments on CV-Bench, MMSI-Bench, VSI-Bench, and general-purpose multimodal benchmarks.
- Extensive ablation ta
Lack of Human Validation or Quality Assurance
- None of the CoT or RL data appear to be manually verified, and the paper does not quantify the noise level or logical accuracy of the synthetic reasoning.
- This raises concerns about the trustworthiness of the training signal, especially since spatial reasoning is fragile to geometric hallucinations.
- Even partial human spot-checking would have dramatically strengthened credibility.
Limited Analysis of Model Behavior
- The paper reports aggregat
1. The VST-Perception dataset (4.1M samples) and VST-Reasoning subset (135K samples) represent one of the most comprehensive spatial datasets to date, spanning single-image, multi-view, and video inputs and covering a broad range of perception and reasoning tasks.
2. The framework enhances 3D spatial understanding in a standard VLM without introducing any additional 3D or geometry-specific encoders, making the method lightweight, widely applicable, and easy to integrate with existing models.
1. The paper follows a well-known recipe (dataset curation plus progressive fine-tuning) already explored in prior works such as SpatialVLM and SpatialRGPT. The contribution lies mainly in scale rather than in any methodological or theoretical innovation.
2. The GRPO stage adds only negligible improvement over supervised fine-tuning, and the paper provides minimal analysis of how RL alters model behavior or reasoning quality. This raises doubts about whether the third stage meaningfully contribut
1. The authors clearly demonstrate gaps in existing work related to spatial perception and reasoning. The proposed dataset and training framework directly address the identified problem.
2. Clever tricks for utilizing 3D information in BEV format for teacher VLM prompting.
3. Detailed overview of the data generation process.
4. Extensive evaluation across numerous tasks to assess the benefits of the proposed dataset and framework.
1. In each table, please report numbers for the corresponding base model (e.g. in Table 3 include Qwen2.5-VL-3B as the base model for VST-3B).
2. Across several tasks in Tables 2 and 3, the RL fine-tuning appears to reduce performance (e.g. MMMU drops from 50.6 -> 49.4 for VST-7B-RL). Also for MMMU, the Qwen2.5-VL-7B performance drops drastically for both VST-7B models. Please discuss this behavior in detail.
3. "Stage 2: CoT Cold Start" is identical to SFT except with longer conversations involving
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
