Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
Quankai Gao, Iliyan Georgiev, Tuanfeng Y. Wang, Krishna Kumar Singh, Ulrich Neumann, Jae Shin Yoon

TL;DR
Can3Tok introduces a novel 3D scene-level VAE that effectively encodes complex scene data into low-dimensional representations, enabling scalable and generalizable 3D scene generation and cross-modal applications.
Contribution
It is the first 3D scene-level VAE capable of encoding numerous Gaussian primitives into a low-dimensional latent space, addressing scale inconsistency and enabling scene-level generative modeling.
Findings
Can3Tok successfully generalizes to novel 3D scenes.
Compared methods fail to converge or generalize.
Demonstrates effective image-to-3D and text-to-3D scene generation.
Abstract
3D generation has made significant progress, however, it still largely remains at the object-level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling-up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Topics3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
