Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang, Haoze Sun, Yijun Yang, Shenghe Zheng, Lin Song, Haoyang Huang, Nan Duan, Wenbo Li

TL;DR
This paper introduces Thinking with Novel Views (TwNV), a paradigm that improves spatial reasoning in Large Multimodal Models (LMMs) by integrating generative novel-view synthesis into the reasoning loop, with consistent accuracy improvements across tasks and architectures.
Contribution
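The paper presents a systematic study of how to couple generative novel-view synthesis with LMM reasoning, organized around three questions: instruction format, generation fidelity, and inference-time visual scaling. It reports accuracy gains on spatial understanding tasks.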
Findings
Numerical camera-pose instructions outperform free-form language for view control.
Synthesized view quality directly impacts spatial reasoning accuracy.
Iterative multi-turn view refinement further enhances model performance.
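To make the first finding concrete, the sketch below contrasts a free-form view request with a numerical camera-pose instruction. The `CameraPose` class, its field names (`azimuth_deg`, `elevation_deg`, `distance_scale`), and the serialized text format are assumptions for illustration only; the paper's actual instruction schema is not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical numerical view request, relative to the input camera."""
    azimuth_deg: float           # rotation around the vertical axis
    elevation_deg: float         # rotation above (+) or below (-) the horizon
    distance_scale: float = 1.0  # 1.0 keeps the original camera distance

    def to_instruction(self) -> str:
        # Serialize to a fixed, unambiguous text format (assumed, not the paper's).
        return (
            f"azimuth={self.azimuth_deg:.1f}, "
            f"elevation={self.elevation_deg:.1f}, "
            f"distance={self.distance_scale:.2f}"
        )


# Free-form request: the amount of rotation is left open to interpretation.
free_form = "show the scene from a bit to the right and slightly above"

# Numerical request: every degree of freedom is stated explicitly.
numeric = CameraPose(azimuth_deg=30.0, elevation_deg=15.0).to_instruction()
print(numeric)
```

The numerical form leaves no room for the view generator to guess what "a bit to the right" means, which is one plausible reading of why it gives more reliable view control.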
Abstract
Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling…
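The loop described in the abstract (the Reasoner detects ambiguity, asks the Painter for a new view, then re-examines the scene) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ReasonerStep`, `think_with_novel_views`, the callable signatures, and the toy stand-ins are all hypothetical, and images are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ReasonerStep:
    """One Reasoner turn: either a final answer or a request for a new view."""
    answer: Optional[str] = None        # set when the Reasoner is confident
    pose_request: Optional[str] = None  # numerical camera-pose instruction


# Hypothetical interfaces: a real system would wrap an LMM and a view generator.
Reasoner = Callable[[str, List[str]], ReasonerStep]
Painter = Callable[[str, str], str]  # (source image, pose instruction) -> new view


def think_with_novel_views(
    question: str,
    image: str,
    reasoner: Reasoner,
    painter: Painter,
    max_turns: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Run the multi-turn loop; returns (answer or None, all views examined)."""
    views = [image]
    for _ in range(max_turns):
        step = reasoner(question, views)
        if step.answer is not None:
            return step.answer, views
        if step.pose_request is None:
            break  # Reasoner neither answered nor asked for a view
        # Synthesize the requested viewpoint and add it to the evidence.
        views.append(painter(image, step.pose_request))
    return None, views


# Toy stand-ins so the sketch runs end to end.
def toy_reasoner(question: str, views: List[str]) -> ReasonerStep:
    if len(views) < 2:
        return ReasonerStep(pose_request="azimuth=90.0, elevation=0.0")
    return ReasonerStep(answer="left")


def toy_painter(image: str, pose: str) -> str:
    return f"{image}@[{pose}]"


answer, views = think_with_novel_views(
    "Is the cup left of the lamp from the sofa?", "scene.png",
    toy_reasoner, toy_painter,
)
print(answer, len(views))
```

Raising `max_turns` corresponds to the iterative multi-turn refinement setting: each extra turn lets the Reasoner request another view before committing to an answer.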
