DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Yizhuo Li; Yuying Ge; Yixiao Ge; Ying Shan; Ping Luo

arXiv:2412.04446·cs.CV·March 17, 2026

TL;DR

DiCoDe is a scalable autoregressive video generation method built on diffusion-compressed deep tokens. It enables efficient training and high-quality video synthesis with language models, and its results improve across the model sizes tested.

Contribution

The paper presents DiCoDe, which uses deep tokens from a diffusion-trained tokenizer for scalable autoregressive video generation with language models. It reduces token count by about 1000x while keeping video quality comparable to existing methods.

Findings

01. Achieves 1000x token compression, enabling efficient training.

02. Performs comparably to existing methods in video quality.

03. Scaling up model size improves performance consistently.

Abstract

Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe utilizes deep tokens with a considerable compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained by leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating…

Code & Models

Models

liyz/DiCoDe (Hugging Face model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Video Analysis and Summarization

Methods: Diffusion