Versatile Video Tokenization with Generative 2D Gaussian Splatting

Zhenghao Chen; Zicong Chen; Lei Liu; Yiming Wu; Dong Xu

arXiv:2508.11183·cs.CV·August 18, 2025

TL;DR

The paper introduces GVT, a versatile video tokenizer using generative 2D Gaussian Splatting that adaptively encodes spatial and temporal content, improving video reconstruction, recognition, and compression.

Contribution

It proposes a novel Gaussian Video Transformer with generative 2D Gaussians and strategies for spatial adaptability and temporal separation, enhancing versatility over fixed-grid methods.

Findings

01 Achieves state-of-the-art video reconstruction quality.

02 Outperforms the baseline in action recognition tasks.

03 Provides comparable compression performance.

Abstract

The video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher…
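
To make the idea of representing content with a set of 2D Gaussians concrete, the following is a minimal sketch of generic 2D Gaussian splatting in plain Python. It is not GVT's implementation or its STGE mechanism: the dictionary keys (`x`, `y`, `sx`, `sy`, `theta`, `feat`) and the functions `gaussian_weight` and `splat` are hypothetical names chosen for illustration, and the simple additive blending of a single-channel map is a simplifying assumption.

```python
import math

def gaussian_weight(px, py, g):
    """Unnormalised weight of one 2D Gaussian at pixel (px, py)."""
    dx, dy = px - g["x"], py - g["y"]
    c, s = math.cos(g["theta"]), math.sin(g["theta"])
    # Express the offset in the Gaussian's own (rotated) axes.
    u = c * dx + s * dy
    v = -s * dx + c * dy
    return math.exp(-0.5 * ((u / g["sx"]) ** 2 + (v / g["sy"]) ** 2))

def splat(gaussians, height, width):
    """Render a single-channel map as a weighted sum of Gaussians.

    Additive blending is an assumption made for this sketch.
    """
    out = [[0.0] * width for _ in range(height)]
    for g in gaussians:
        for y in range(height):
            for x in range(width):
                out[y][x] += g["feat"] * gaussian_weight(x, y, g)
    return out

# Hypothetical layout: one broad Gaussian for a flat region and two
# narrow ones clustered where more detail is assumed to be present.
gaussians = [
    {"x": 8.0, "y": 8.0, "sx": 6.0, "sy": 6.0, "theta": 0.0, "feat": 0.5},
    {"x": 2.0, "y": 2.0, "sx": 0.8, "sy": 0.8, "theta": 0.0, "feat": 1.0},
    {"x": 3.5, "y": 2.0, "sx": 0.8, "sy": 1.5, "theta": 0.4, "feat": -0.7},
]
feature_map = splat(gaussians, 16, 16)
```

In a layout like this, the number and size of the Gaussians can vary across the frame, which is the kind of spatial adaptability that a fixed grid of patch tokens lacks.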
