Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
Aniruddha Mahapatra, Long Mai, David Bourgin, Yitian Zhang, Feng Liu

TL;DR
This paper introduces a progressive training method for video tokenizers that enhances temporal compression and reconstruction quality, enabling efficient high-quality video generation with fewer tokens.
Contribution
We propose a bootstrapped, progressive training approach with cross-level feature mixing that pushes the temporal compression of video tokenizers beyond the 4x ratio at which existing tokenizers become hard to extend.
Findings
Significant improvement in reconstruction quality at higher temporal compression ratios.
Effective reduction in token budget for high-quality video generation.
Enhanced latent space compactness for efficient diffusion modeling.
Abstract
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a…
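To make the compression arithmetic in the abstract concrete, here is a minimal Python sketch. It is not the authors' code. The function names, the 8x spatial ratio, the 64-frame 256x256 clip, and the plain ceiling division are all illustrative assumptions (real tokenizers often handle the first frame separately). It shows two things: how the latent position count shrinks as the temporal ratio grows, and why a 4x encoder applied to a video subsampled by a stride of 2 yields as many latent frames as an 8x encoder applied to the original video, which is the comparison behind the abstract's observation.

```python
def latent_positions(frames, height, width, temporal_ratio, spatial_ratio=8):
    """Count latent positions for one clip (hypothetical helper).

    Assumes each axis is simply divided by its compression ratio and
    rounded up; the 8x spatial ratio is an illustrative default.
    """
    t = -(-frames // temporal_ratio)   # ceiling division
    h = -(-height // spatial_ratio)
    w = -(-width // spatial_ratio)
    return t * h * w


def temporally_subsample(frame_indices, stride):
    """Keep every `stride`-th frame of a clip (hypothetical helper)."""
    return frame_indices[::stride]


# Illustrative clip: 64 frames at 256x256 (assumed values, not from the paper).
clip = list(range(64))

budget_4x = latent_positions(len(clip), 256, 256, temporal_ratio=4)
budget_8x = latent_positions(len(clip), 256, 256, temporal_ratio=8)
budget_16x = latent_positions(len(clip), 256, 256, temporal_ratio=16)

# A 4x encoder on the stride-2 subsampled clip covers the same time span
# with the same number of latent positions as an 8x encoder on the full clip.
subsampled = temporally_subsample(clip, stride=2)
budget_4x_on_subsampled = latent_positions(len(subsampled), 256, 256, temporal_ratio=4)

print(budget_4x, budget_8x, budget_16x, budget_4x_on_subsampled)
```

Under these assumptions, each doubling of the temporal ratio halves the number of latent positions a downstream diffusion model has to process, as long as the channel count stays fixed.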
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Diffusion
