Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu; Dingkang Liang; Tianrui Feng; Kui Xia; Yumeng Zhang; Xiaofan Li; Xiao Tan; Xiang Bai

arXiv:2603.19235·cs.CV·March 20, 2026

TL;DR

This paper introduces VEGA-3D, a novel framework that leverages implicit 3D priors learned by video generation models to enhance scene understanding and spatial reasoning in multimodal large language models without explicit 3D data.

Contribution

The paper proposes a plug-and-play method for extracting implicit 3D priors from pre-trained video diffusion models and using them to improve 3D scene understanding.

Findings

01. Outperforms state-of-the-art baselines on 3D understanding benchmarks.

02. Enriches multimodal models with dense geometric cues without explicit 3D supervision.

03. Demonstrates the effectiveness of generative priors for physical-world reasoning.

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic…
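The idea of reading features at an intermediate noise level can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the DDPM-style linear beta schedule, the step count, the flattened 16-value "latent", and the hypothetical `toy_backbone_features` function that stands in for a frozen video diffusion backbone. The abstract is truncated before it describes how the features are combined with semantic features, so that step is left out.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_latent(latent, t, alpha_bar, rng):
    """Forward-diffuse a clean latent to step t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in latent
    ]


def toy_backbone_features(noisy_latent, t, num_steps):
    """Hypothetical stand-in for a frozen video diffusion backbone.

    A real backbone would return intermediate activations; this toy version
    applies a timestep-dependent nonlinearity so the sketch can run.
    """
    scale = 1.0 - t / num_steps
    return [math.tanh(scale * x) for x in noisy_latent]


rng = random.Random(0)
num_steps = 1000                      # assumed schedule length
alpha_bar = linear_alpha_bar(num_steps)

latent = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy flattened video latent
t_mid = num_steps // 2                # an "intermediate" noise level (assumed)

noisy = noise_latent(latent, t_mid, alpha_bar, rng)
features = toy_backbone_features(noisy, t_mid, num_steps)
```

In this sketch the choice of `t_mid` controls how much of the clean latent survives in the input to the backbone. Which noise levels and which layers are actually used is a design decision of the paper that this excerpt does not specify.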

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning