Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians

Quankai Gao; Iliyan Georgiev; Tuanfeng Y. Wang; Krishna Kumar Singh; Ulrich Neumann; Jae Shin Yoon

arXiv:2508.01464·cs.CV·August 5, 2025

Open Access

TL;DR

Can3Tok introduces a novel 3D scene-level VAE that effectively encodes complex scene data into low-dimensional representations, enabling scalable and generalizable 3D scene generation and cross-modal applications.

Contribution

Can3Tok is the first 3D scene-level VAE capable of encoding a large number of Gaussian primitives into a low-dimensional latent space, addressing scale inconsistency across scenes and enabling scene-level generative modeling.

Findings

01. Can3Tok successfully generalizes to novel 3D scenes.

02. The compared methods fail to converge or generalize.

03. Can3Tok enables effective image-to-3D and text-to-3D scene generation.

Abstract

3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has rarely been explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generation operates on scenes represented by 3D Gaussian Splatting (3DGS), which are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs.…
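
The scale inconsistency mentioned in the abstract can be illustrated with a small sketch. The abstract as shown here does not say how Can3Tok handles it, so the code below is not the authors' procedure. It is one generic way to map an unbounded set of Gaussians into a bounded canonical space, assuming each Gaussian has a center and linear (not log) per-axis extents. The function name `normalize_scene` and the `radius` parameter are hypothetical.

```python
import math

def normalize_scene(means, scales, radius=1.0):
    """Center a scene and uniformly rescale it to fit inside a sphere.

    Hypothetical helper for illustration only (not from the paper).
    means:  list of (x, y, z) Gaussian centers
    scales: list of (sx, sy, sz) linear per-axis Gaussian extents
    """
    n = len(means)
    # Centroid of all Gaussian centers.
    centroid = tuple(sum(m[i] for m in means) / n for i in range(3))
    centered = [tuple(m[i] - centroid[i] for i in range(3)) for m in means]
    # The farthest center from the centroid sets the scene extent.
    extent = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    factor = radius / extent if extent > 0 else 1.0
    new_means = [tuple(c * factor for c in p) for p in centered]
    # A uniform rescale multiplies every Gaussian extent by the same factor;
    # rotations and opacities would be left unchanged.
    new_scales = [tuple(s * factor for s in sc) for sc in scales]
    return new_means, new_scales, factor

# The same layout captured at two different world scales and offsets.
means_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
scales_a = [(0.1, 0.1, 0.1)] * 3
means_b = [(10 * x + 5, 10 * y - 3, 10 * z + 1) for x, y, z in means_a]
scales_b = [(1.0, 1.0, 1.0)] * 3

norm_a, ext_a, factor_a = normalize_scene(means_a, scales_a)
norm_b, ext_b, factor_b = normalize_scene(means_b, scales_b)
```

After normalization, both captures land on the same coordinates up to floating-point error, which is the property a shared latent space would need from its inputs.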

Taxonomy

Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis