Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Meng Cao; Haokun Lin; Haoyuan Li; Haoran Tang; Rongtao Xu; Dong An; Xue Liu; Ian Reid; Xiaodan Liang

arXiv:2512.01821 · cs.CV · December 9, 2025

Open Access

TL;DR

This paper presents MILO, a novel spatial reasoning framework for multimodal large language models that incorporates implicit 3D world modeling and visual feedback to improve understanding of spatial structures.

Contribution

It introduces MILO, an implicit spatial world modeling paradigm, and RePE, a new relative positional encoding scheme, along with a large-scale dataset GeoGen for training.

Findings

01. Enhanced spatial reasoning performance across multiple benchmarks.

02. Improved understanding of 3D space in multimodal models.

03. Significant gains over existing methods in spatial tasks.

Abstract

Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute…
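The abstract contrasts relative camera-pose transformations with absolute ones. The sketch below is not RePE itself, whose formulation is not given in the text above; it only illustrates the underlying geometric idea of a relative pose between two views. It assumes each camera pose is a 4x4 rigid camera-to-world matrix, and every function name here (`pose_from_yaw`, `invert_rigid`, `relative_pose`) is hypothetical and chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_from_yaw(yaw_deg, position):
    """Build a camera-to-world pose: rotation about z plus a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    x, y, z = position
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def relative_pose(cam_a_to_world, cam_b_to_world):
    """Transform that maps points from camera a's frame into camera b's frame."""
    return matmul(invert_rigid(cam_b_to_world), cam_a_to_world)


# Two views of the same scene (illustrative values, not from the paper).
cam_a = pose_from_yaw(0.0, (0.0, 0.0, 0.0))
cam_b = pose_from_yaw(90.0, (1.0, 0.0, 0.0))
rel = relative_pose(cam_a, cam_b)

# Re-express both cameras in a different, arbitrary world frame.
world_change = pose_from_yaw(30.0, (5.0, -2.0, 1.0))
rel_moved = relative_pose(matmul(world_change, cam_a),
                          matmul(world_change, cam_b))
```

Up to floating-point error, `rel` and `rel_moved` agree: the relative transform does not depend on where the world origin is placed, whereas the absolute poses `cam_a` and `cam_b` do. This invariance is one plausible motivation for preferring a relative encoding, stated here as an interpretation rather than as the authors' argument.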

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Human Motion and Animation