DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo

TL;DR
DiCoDe introduces a scalable autoregressive video generation method using diffusion-compressed deep tokens, enabling efficient training and high-quality video synthesis with language models, and demonstrating promising results across various model sizes.
Contribution
The paper presents DiCoDe, a novel approach that leverages diffusion-trained deep tokens for scalable autoregressive video generation with language models, achieving high compression and quality.
Findings
Achieves a 1000x reduction in token count, enabling efficient training.
Performs comparably to existing methods in video quality.
Scaling up model size improves performance consistently.
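To give a rough sense of what a 1000x reduction in token count means for a language model, the following minimal sketch works through the arithmetic. The baseline of 16 frames with 1024 low-level tokens each is a hypothetical figure chosen for illustration and is not taken from the paper; the function names are likewise illustrative. The quadratic cost model assumes vanilla self-attention, whose cost grows with the square of sequence length.

```python
def compressed_token_count(baseline_tokens: int, compression: int = 1000) -> int:
    """Tokens remaining after an idealized `compression`-fold reduction, rounded up."""
    return -(-baseline_tokens // compression)  # ceiling division


def attention_cost_ratio(baseline_tokens: int, compressed_tokens: int) -> float:
    """Relative self-attention cost, assuming cost scales with sequence length squared."""
    return (baseline_tokens / compressed_tokens) ** 2


if __name__ == "__main__":
    # Hypothetical baseline: 16 frames x 1024 low-level tokens per frame.
    baseline = 16 * 1024
    deep = compressed_token_count(baseline)
    print(f"baseline tokens: {baseline}, deep tokens: {deep}")
    print(f"approximate attention cost ratio: {attention_cost_ratio(baseline, deep):.0f}x")
```

Under these assumptions, a sequence that would be impractical for a vanilla AR language model shrinks to a length comparable to a short sentence, which is the intuition behind the efficiency claim.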
Abstract
Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe utilizes deep tokens with a considerable compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained by leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Video Analysis and Summarization
Methods: Diffusion
