Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
Dongxu Wei, Qi Xu, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu

TL;DR
This paper performs 3D scene generation directly in an implicit 3D latent space, enabling efficient, spatially consistent scene synthesis from arbitrary views and addressing the redundancy and consistency limits of 2D multi-view and video diffusion models.
Contribution
The paper proposes a 3D representation autoencoder and a diffusion transformer that both operate directly in a 3D latent space, with the aim of improving the quality and flexibility of 3D scene generation.
Findings
Enables representation of 3D scenes from arbitrary views with fixed complexity.
Achieves spatially consistent 3D scene generation.
Supports decoding to images and point maps without per-trajectory diffusion.
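The fixed-complexity finding can be made concrete with a small token-budget comparison. The sketch below is illustrative arithmetic only, not the paper's implementation: the per-view token count (256, e.g. a 16x16 patch grid) and the exact latent size of 1024 (read from the "1K tokens" in the title) are assumptions, and both function names are hypothetical.

```python
# Illustrative token-budget comparison (assumed numbers, hypothetical names).

def multiview_token_count(num_views: int, tokens_per_view: int = 256) -> int:
    """Tokens needed when a scene is represented as a stack of 2D views.

    tokens_per_view=256 is an assumption (e.g. a 16x16 patch grid per view).
    """
    return num_views * tokens_per_view


def fixed_latent_token_count(num_views: int, latent_tokens: int = 1024) -> int:
    """Tokens needed when a scene is represented by a fixed-size 3D latent.

    latent_tokens=1024 is an assumption based on the "1K tokens" in the title.
    The count does not depend on how many views were observed.
    """
    return latent_tokens


if __name__ == "__main__":
    for views in (4, 8, 64):
        print(
            f"views={views:3d}  "
            f"multi-view tokens={multiview_token_count(views):6d}  "
            f"fixed latent tokens={fixed_latent_token_count(views):5d}"
        )
```

Under these assumptions the 2D representation grows linearly with the number of views, while the 3D latent stays at the same size however many views are supplied.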
Abstract
3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D…
