TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic

TL;DR
TokenGS introduces a novel Transformer-based approach for 3D Gaussian prediction that directly regresses 3D means with learnable tokens, improving robustness and efficiency in 3D scene reconstruction.
Contribution
TokenGS replaces the regression of Gaussian means as depths along camera rays with direct regression of 3D mean coordinates from learnable Gaussian tokens, so the number of predicted primitives no longer depends on input image resolution or number of views.
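To make the difference between the two parameterizations concrete, here is a minimal sketch in plain Python. It is an illustration, not the authors' implementation. All function names are hypothetical, and the "one Gaussian per input pixel" count for the ray-based case is an assumption about typical pixel-aligned predictors rather than something stated in the source.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def mean_from_ray_depth(origin: Vec3, direction: Vec3, depth: float) -> Vec3:
    """Ray-based parameterization: the network predicts only a depth,
    so the Gaussian mean is constrained to lie on the pixel's camera ray."""
    return tuple(o + depth * d for o, d in zip(origin, direction))


def mean_from_token(token_output: Vec3) -> Vec3:
    """Token-based parameterization: the three regressed numbers are
    the 3D mean itself, with no dependence on any camera ray."""
    return tuple(float(c) for c in token_output)


def pixel_aligned_count(num_views: int, height: int, width: int) -> int:
    """Assumed count for a pixel-aligned predictor: one Gaussian per input pixel."""
    return num_views * height * width


def token_count(num_tokens: int, num_views: int, height: int, width: int) -> int:
    """With learnable tokens, the count is a free choice and ignores the inputs."""
    return num_tokens


if __name__ == "__main__":
    # A pixel whose ray starts at the origin and points along +z.
    print(mean_from_ray_depth((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 2.0))
    # A token is free to place its Gaussian anywhere in the scene.
    print(mean_from_token((0.3, -1.2, 4.5)))
    # The primitive count grows with views and resolution in one case only.
    print(pixel_aligned_count(4, 256, 256), token_count(4096, 4, 256, 256))
```

In the ray-based case an error in the camera pose moves the ray and therefore every mean predicted along it, which is one intuition for why direct 3D regression could be less sensitive to pose noise.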
Findings
Achieves state-of-the-art feed-forward reconstruction on static and dynamic scenes.
Demonstrates improved robustness to pose noise and multiview inconsistencies.
Supports efficient test-time optimization without degrading learned priors.
Abstract
In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on…
