Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
Jiahua Chen, Qihong Tang, Weinong Wang, Qi Fan

TL;DR
This paper introduces a training-free framework that improves 3D spatial reasoning in Multimodal Large Language Models by reconstructing 3D scenes and synthesizing novel views for better perspective understanding.
Contribution
It presents a training-free approach that combines explicit 3D reconstruction with novel-view synthesis to enhance spatial reasoning in MLLMs, and reports outperforming GPT-5.2 and Gemini-2.5-Flash on 3DSRBench and Rel3D.
Findings
Outperforms GPT-5.2 and Gemini-2.5-Flash on 3DSRBench and Rel3D benchmarks.
Significantly improves spatial comprehension in MLLMs.
Utilizes explicit 3D reconstruction and external knowledge for viewpoint synthesis.
Abstract
Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and…
