VidTok: A Versatile and Open-Source Video Tokenizer

Anni Tang; Tianyu He; Junliang Guo; Xinle Cheng; Li Song; Jiang Bian

arXiv:2412.13061 · cs.CV · December 18, 2024

TL;DR

VidTok is a versatile, open-source video tokenizer that achieves state-of-the-art performance in both continuous and discrete tokenizations through innovative architecture, quantization, and training strategies, advancing video generation and understanding.

Contribution

The paper introduces VidTok, a novel video tokenizer that combines architectural improvements, Finite Scalar Quantization, and advanced training methods to outperform existing approaches.

Findings

01. Achieves better PSNR, SSIM, LPIPS, and FVD scores than existing video tokenizers.

02. Remains robust across multiple evaluation settings.

03. Provides an open-source tool for the research community.
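
Of the four metrics named in the first finding, PSNR is the simplest to state: it is ten times the base-10 logarithm of the squared peak pixel value divided by the mean squared error between the reference and the reconstruction. The snippet below is a minimal sketch of that textbook definition. The function name `psnr`, the 8-bit peak value of 255, and computing one score over the whole array are assumptions made for illustration; this page does not state the paper's evaluation protocol (resolution, frame count, per-frame averaging).

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape.

    Illustrative helper (not taken from the VidTok code base). Works for a
    single frame (H, W, C) or a whole clip (T, H, W, C), because the mean
    squared error is taken over every element.
    """
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher PSNR means a closer reconstruction. SSIM, LPIPS, and FVD measure structural, perceptual, and distributional similarity respectively and need dedicated implementations.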

Abstract

Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy in pixel-level representations. Consequently, there is a growing demand for high-performance, open-source video tokenizers as video-centric research gains prominence. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches: 1) model architecture such as convolutional layers and up/downsampling modules; 2) to address the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ), we integrate Finite Scalar Quantization (FSQ) into discrete video tokenization; 3) improved training strategies, including a two-stage training…
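
To make the second point concrete, here is a minimal NumPy sketch of the general idea behind Finite Scalar Quantization: each latent channel is squashed into a bounded range and rounded to one of a small, fixed number of levels, so the codebook is implicit (the product of the per-channel level counts) and there is no learned codebook that could collapse. This is a simplified illustration, not VidTok's implementation. The function names, the `tanh` bounding, and the level configuration used in the usage note are assumptions, and the straight-through gradient needed for training is omitted.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Quantize latents of shape (..., d) with d per-channel level counts.

    Returns (codes, indices): `codes` holds one integer in [0, L_i - 1] per
    channel, `indices` holds one flat token id per latent vector.
    Illustrative sketch only; names and bounding function are assumptions.
    """
    levels = np.asarray(levels, dtype=np.int64)
    # Squash each channel into [0, L_i - 1], then round to the nearest level.
    scaled = (np.tanh(z) + 1.0) / 2.0 * (levels - 1)
    codes = np.rint(scaled).astype(np.int64)
    # Mixed-radix basis (1, L_0, L_0 * L_1, ...) turns codes into a single id.
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    indices = (codes * basis).sum(axis=-1)
    return codes, indices


def fsq_codebook_size(levels):
    """Size of the implicit codebook: the product of the level counts."""
    return int(np.prod(levels))
```

With a hypothetical configuration `levels = [8, 5, 5, 5]`, the implicit codebook has 1000 entries, and `fsq_quantize(np.random.randn(16, 4), levels)` returns 16 token ids in the range 0 to 999. In a trainable model, the rounding step would be wrapped in a straight-through estimator so gradients reach the encoder.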

Code & Models

Repositories

microsoft/vidtok (PyTorch, official)

Models

microsoft/VidTok (Hugging Face model, ♡ 42)

Taxonomy

Topics: Video Analysis and Summarization · Video Coding and Compression Technologies · Generative Adversarial Networks and Image Synthesis