Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Xiaoyu Zhan; Wenxuan Huang; Hao Sun; Xinyu Fu; Changfeng Ma; Shaosheng Cao; Bohan Jia; Shaohui Lin; Zhenfei Yin; Lei Bai; Wanli Ouyang; Yuanqi Li; Jie Guo; Yanwen Guo

arXiv:2511.01618·cs.CV·November 4, 2025

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo

PDF

Open Access

TL;DR

This paper introduces Viewpoint Learning and a new dataset to enhance the spatial reasoning abilities of Multimodal Large Language Models, enabling better 3D understanding and cross-view consistency.

Contribution

It presents a novel two-stage fine-tuning approach and a hybrid initialization method to significantly improve MLLMs' spatial reasoning capabilities.

Findings

01

Enhanced performance on 3D reasoning tasks

02

Improved cross-view consistency in MLLMs

03

Effective generalization to out-of-domain data

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected to the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Topic Modeling · Constraint Satisfaction and Optimization