Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

Aniruddha Mahapatra; Long Mai; David Bourgin; Yitian Zhang; Feng Liu

arXiv:2501.05442 · cs.CV · August 5, 2025

TL;DR

This paper introduces a progressive training method for video tokenizers that enhances temporal compression and reconstruction quality, enabling efficient high-quality video generation with fewer tokens.

Contribution

We propose a bootstrapped, progressive training approach with cross-level feature mixing to improve temporal compression in video tokenizers beyond existing limits.

Findings

01. Significant improvement in reconstruction quality at higher temporal compression ratios.

02. Effective reduction in token budget for high-quality video generation.

03. Enhanced latent space compactness for efficient diffusion modeling.

Abstract

Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a…
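
The comparison in the abstract sets a low-compression encoder applied to temporally subsampled video against a high-compression encoder applied to the original video. The sketch below only illustrates why the two settings can be compared at an equal latent budget. It assumes a 4x encoder on every second frame versus an 8x encoder on all frames, and it uses plain floor division for the latent length. The function name `latent_frames`, the 64-frame clip, the stride of 2, and the 8x ratio are illustrative assumptions, not values taken from the paper.

```python
def latent_frames(num_frames: int, temporal_stride: int, temporal_compression: int) -> int:
    """Count latent time steps after temporal subsampling, then encoding.

    Assumption: latent length is a plain floor division of the kept frames.
    Real tokenizers may treat the first frame separately; this sketch does not.
    """
    # Frames that survive subsampling with the given stride.
    kept = len(range(0, num_frames, temporal_stride))
    return kept // temporal_compression


frames = 64  # hypothetical clip length

# Low-compression (4x) encoder applied to every second frame.
subsampled_low = latent_frames(frames, temporal_stride=2, temporal_compression=4)

# High-compression (8x) encoder applied to the original video.
full_high = latent_frames(frames, temporal_stride=1, temporal_compression=8)

print(subsampled_low, full_high)
```

Under these assumptions both settings yield the same number of latent time steps, so any difference in reconstruction quality is not explained by token count alone.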

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Diffusion