Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao

TL;DR
Video2Layout introduces a continuous, metric-grounded approach to reconstruct spatial layouts from video, enhancing fine-grained spatial reasoning in multimodal models and outperforming grid-based methods on benchmarks.
Contribution
The paper presents a novel framework that reconstructs continuous spatial layouts from video, improving spatial reasoning and generalization over traditional grid-based methods.
Findings
Achieves 3.24% higher accuracy on spatial reasoning benchmarks.
Effectively reduces ambiguity in natural language spatial descriptions.
Demonstrates improved real-world generalization through reinforcement fine-tuning.
Abstract
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, prior studies attempt to construct spatial understanding via grid-based cognitive maps. However, current grid-based map methods rely on discretized representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework uses continuous object boundary coordinates to enable quantitative spatial computation, which effectively reduces ambiguity in natural language descriptions of spatial relationships. Specifically, our method comprises two stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR…
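To illustrate why continuous boundary coordinates permit quantitative spatial computation where a grid map does not, here is a minimal sketch. It is not the authors' implementation: the `Box` schema, the metric floor-plane convention, the 1 m cell size, and all object values are assumptions chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned object footprint on the floor plane, in meters (hypothetical schema)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        # Midpoint of the footprint rectangle.
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def center_distance(a, b):
    """Euclidean distance between the two footprint centers."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(bx - ax, by - ay)


def edge_gap(a, b):
    """Shortest distance between the two rectangles; 0.0 if they touch or overlap."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def grid_cell(box, cell_size):
    """Discretize a box to the grid cell that contains its center."""
    cx, cy = box.center
    return (math.floor(cx / cell_size), math.floor(cy / cell_size))


# Illustrative values only; not taken from the paper's dataset.
lamp = Box("lamp", 0.1, 0.1, 0.3, 0.3)
chair = Box("chair", 0.6, 0.6, 0.9, 0.9)

# With 1 m cells, both objects fall into the same cell, so the grid map
# cannot say how far apart they are.
print(grid_cell(lamp, 1.0), grid_cell(chair, 1.0))

# The continuous boundaries still yield metric answers.
print(round(center_distance(lamp, chair), 3), round(edge_gap(lamp, chair), 3))
```

In this sketch, a question such as "how far is the lamp from the chair?" is answered from the coordinates themselves, whereas the discretized map places both objects in one cell and loses the distinction.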
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
