ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Kaifeng Gao; Jiaxin Shi; Hanwang Zhang; Chunping Wang; Jun Xiao

arXiv:2406.10981·cs.CV·June 18, 2024

TL;DR

ViD-GPT brings GPT-style causal, autoregressive generation to video diffusion models and uses a kv-cache to speed up inference, with the goal of improving both the quality and the efficiency of long video generation.

Contribution

It pioneers the integration of GPT-style causal attention and prompt-based conditioning into video diffusion models for long video synthesis.

Findings

01. Achieves state-of-the-art results in long video generation.

02. Significantly improves inference speed with a kv-cache mechanism.

03. Demonstrates superior qualitative and quantitative performance.
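
The kv-cache finding relies on a general property of causal attention: because a frame never attends to later frames, the keys and values of past frames can be stored and reused instead of recomputed. The sketch below illustrates only that property in NumPy. It is not the authors' implementation; the names `KVCache`, `step`, and `full_causal_attention`, and the one-token-per-frame layout, are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(q, k, v):
    # Reference: recompute attention for all frames at once with a causal mask.
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class KVCache:
    # Hypothetical helper: stores keys/values of frames processed so far.
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q_t, k_t, v_t):
        # Append the new frame's key/value, then attend over the cache only.
        self.keys.append(k_t)
        self.values.append(v_t)
        k = np.stack(self.keys)      # (frames_so_far, dim)
        v = np.stack(self.values)    # (frames_so_far, dim)
        scores = k @ q_t / np.sqrt(q_t.shape[0])
        return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))

cache = KVCache()
cached = np.stack([cache.step(q[i], k[i], v[i]) for i in range(5)])
reference = full_causal_attention(q, k, v)
```

Here `cached` and `reference` agree up to floating-point error, while the cached path computes only one new row of attention scores per frame.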

Abstract

With the advance of diffusion models, today's video generation has achieved impressive quality, but generating temporally consistent long videos remains challenging. Most video diffusion models (VDMs) generate long videos autoregressively, i.e., they generate each subsequent clip conditioned on the last frames of the previous clip. However, existing approaches all involve bidirectional computations, which restrict the receptive context of each autoregression step and leave the model without long-term dependencies. Inspired by the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs and use past frames as a prompt to generate future frames. For Causal Generation, we introduce causal temporal attention into the VDM, which forces each generated frame to depend on its previous…
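
As a rough illustration of the causal temporal attention mentioned in the abstract, the NumPy sketch below masks attention scores so that frame i attends only to frames j ≤ i. It is a minimal sketch under stated assumptions, not the paper's architecture: it uses a single head, one feature vector per frame, and no learned projections, and the function name `causal_temporal_attention` is hypothetical.

```python
import numpy as np

def causal_temporal_attention(q, k, v):
    """q, k, v: arrays of shape (num_frames, dim), one vector per frame."""
    num_frames, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)                      # (T, T) frame-to-frame scores
    # Mask out future frames: entry (i, j) is blocked when j > i.
    future = np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; the diagonal is never masked, so each row is well defined.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 8))                     # 6 frames, 8 features each
out = causal_temporal_attention(frames, frames, frames)

# Changing later frames must leave the outputs for earlier frames untouched.
perturbed = frames.copy()
perturbed[4:] += 1.0
out_perturbed = causal_temporal_attention(perturbed, perturbed, perturbed)
```

Because of the mask, `out[:4]` and `out_perturbed[:4]` are identical even though frames 4 and 5 were modified, which is the unidirectional dependency the abstract describes. A bidirectional variant (no mask) would not have this property.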

Code & Models

Repositories

dawn-lx/causal-videogen
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Attention Dropout · Weight Decay · Dropout · Adam · Linear Warmup With Cosine Annealing · Linear Layer · Dense Connections