Versatile Video Tokenization with Generative 2D Gaussian Splatting
Zhenghao Chen, Zicong Chen, Lei Liu, Yiming Wu, Dong Xu

TL;DR
The paper introduces GVT, a versatile video tokenizer using generative 2D Gaussian Splatting that adaptively encodes spatial and temporal content, improving video reconstruction, recognition, and compression.
Contribution
It proposes a novel Gaussian Video Transformer with generative 2D Gaussians and strategies for spatial adaptability and temporal separation, enhancing versatility over fixed-grid methods.
Findings
Achieves state-of-the-art video reconstruction quality.
Outperforms the baseline in action recognition tasks.
Provides comparable compression performance.
Abstract
The video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher…
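To make the idea of representing features with a set of 2D Gaussians concrete, here is a minimal sketch of generic 2D Gaussian splatting onto a pixel grid. It is not the paper's STGE mechanism or the GVT architecture: the function name `splat_2d_gaussians`, the parameterization (mean, per-axis scale, rotation angle, feature vector), and the simple additive blending are all assumptions chosen for illustration.

```python
import numpy as np

def splat_2d_gaussians(means, scales, angles, feats, height, width):
    """Render a (height, width, C) map as a sum of anisotropic 2D Gaussians.

    Hypothetical parameterization (not taken from the paper):
      means  : (N, 2) Gaussian centres as (x, y) pixel coordinates
      scales : (N, 2) standard deviations along each principal axis
      angles : (N,)   rotation of the principal axes, in radians
      feats  : (N, C) feature vector carried by each Gaussian
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2), (x, y) order
    out = np.zeros((height, width, feats.shape[1]))
    for mu, s, theta, f in zip(means, scales, angles, feats):
        c, sn = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -sn], [sn, c]])
        cov = rot @ np.diag(s ** 2) @ rot.T       # covariance = R S^2 R^T
        inv_cov = np.linalg.inv(cov)
        d = grid - mu                             # offset of every pixel from the centre
        maha = np.einsum("hwi,ij,hwj->hw", d, inv_cov, d)
        weight = np.exp(-0.5 * maha)              # unnormalized Gaussian falloff
        out += weight[..., None] * f              # additive blending (an assumption)
    return out
```

In a sketch like this, a region with little detail can be covered by a few wide Gaussians while a detailed region gets many narrow ones, which is the general intuition behind moving away from a fixed token grid.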
