VidTok: A Versatile and Open-Source Video Tokenizer
Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian

TL;DR
VidTok is a versatile, open-source video tokenizer that achieves state-of-the-art performance in both continuous and discrete tokenizations through innovative architecture, quantization, and training strategies, advancing video generation and understanding.
Contribution
The paper introduces VidTok, a novel video tokenizer that combines architectural improvements, Finite Scalar Quantization, and advanced training methods to outperform existing approaches.
Findings
Reports better PSNR, SSIM, LPIPS, and FVD scores than existing video tokenizers.
Demonstrates robustness across multiple evaluation settings.
Provides an open-source tool for the research community.
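For readers unfamiliar with the reconstruction metrics listed above, here is a minimal sketch of PSNR, the simplest of the four. It uses the standard definition, 10 · log10(MAX² / MSE), on flat lists of pixel values. The function name `psnr` and the 8-bit default `max_value=255.0` are illustrative assumptions, not taken from the VidTok code base, and the paper's exact evaluation protocol is not described in this summary.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    Hypothetical helper for illustration; assumes pixel values in [0, max_value].
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)

    # Identical inputs have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [10, 10, 200, 200]
    decoded = [11, 9, 201, 199]  # every pixel is off by one
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```

Higher PSNR means a closer reconstruction. SSIM, LPIPS, and FVD capture structural, perceptual, and distribution-level quality respectively, and need more machinery than fits in a short sketch.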
Abstract
Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy in pixel-level representations. Consequently, there is a growing demand for high-performance, open-source video tokenizers as video-centric research gains prominence. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches: 1) model architecture such as convolutional layers and up/downsampling modules; 2) to address the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ), we integrate Finite Scalar Quantization (FSQ) into discrete video tokenization; 3) improved training strategies, including a two-stage training…
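The idea behind Finite Scalar Quantization, which the abstract names as the replacement for a learned VQ codebook, can be sketched as follows. Each latent channel is squashed into a bounded range and rounded to one of a few integer levels, so the codebook is implicit: it is the grid of all level combinations. This is a simplified illustration of the general FSQ technique, not VidTok's implementation. The function names, the restriction to odd level counts, and the `levels` values are assumptions made for the example. Training normally pairs the rounding step with a straight-through gradient estimator, which is omitted here.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each channel of latent vector z to one of levels[i] integers.

    Simplified sketch: supports odd level counts only, so the grid is
    symmetric around zero (codes in -(L // 2) .. L // 2).
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")

    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = n_levels // 2
        # tanh bounds the value to (-1, 1); scaling maps it to (-half, half).
        bounded = half * math.tanh(value)
        codes.append(round(bounded))
    return codes


def fsq_index(codes, levels):
    """Map per-channel codes to one token index in [0, product of levels)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        # Shift each code to 0 .. n_levels - 1, then accumulate in mixed radix.
        index = index * n_levels + (code + n_levels // 2)
    return index


if __name__ == "__main__":
    levels = [5, 5, 5]  # hypothetical choice: implicit codebook of 5 * 5 * 5 = 125 entries
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, fsq_index(codes, levels))
```

Because every grid point is reachable by construction and there are no codebook vectors to learn, there is nothing that can collapse, which is the motivation the abstract gives for choosing FSQ over conventional VQ.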
Taxonomy
Topics: Video Analysis and Summarization · Video Coding and Compression Technologies · Generative Adversarial Networks and Image Synthesis
