MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan

TL;DR
MindJourney introduces a test-time scaling framework that enhances vision-language models with a controllable world model for improved 3D spatial reasoning without fine-tuning, demonstrated by performance gains on the SAT benchmark.
Contribution
The paper presents a novel test-time scaling approach coupling VLMs with a video diffusion-based world model for robust 3D spatial reasoning.
Findings
Achieves an average performance boost of over 7.7% on the SAT benchmark.
Improves VLM reasoning without fine-tuning.
Demonstrates effectiveness of world models for test-time scaling.
Abstract
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning…
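The explore-then-answer loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `explore_and_answer`, the three callables (`propose_action`, `imagine_view`, `answer`), the string-valued actions, and the `max_steps` budget are all hypothetical stand-ins for the VLM and the video-diffusion world model, chosen only to make the control flow concrete.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical types: an action is a short camera-motion label,
# and a view is whatever image representation the models exchange.
Action = str
View = Any


def explore_and_answer(
    initial_view: View,
    question: str,
    propose_action: Callable[[str, List[View], List[Action]], Optional[Action]],
    imagine_view: Callable[[View, Action], View],
    answer: Callable[[str, List[View]], Any],
    max_steps: int = 3,
) -> Tuple[Any, List[Action]]:
    """Gather imagined views along a camera trajectory, then answer.

    propose_action stands in for the VLM choosing the next camera move
    (returning None to stop early); imagine_view stands in for the world
    model synthesizing the view after that move; answer stands in for the
    VLM reasoning over all collected views.
    """
    views: List[View] = [initial_view]
    trajectory: List[Action] = []
    current = initial_view

    for _ in range(max_steps):
        action = propose_action(question, views, trajectory)
        if action is None:  # the proposer decided it has enough evidence
            break
        current = imagine_view(current, action)
        views.append(current)
        trajectory.append(action)

    # No model weights are updated anywhere; all extra work happens at test time.
    return answer(question, views), trajectory
```

Under these assumptions, increasing `max_steps` is the test-time scaling knob: a larger budget lets the answering model see more imagined viewpoints before it commits to an answer.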
Taxonomy
Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies · Data Management and Algorithms
