Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu; Yixiao Ge; Xintao Wang; Weixian Lei; Yuchao Gu; Yufei Shi; Wynne Hsu; Ying Shan; Xiaohu Qie; Mike Zheng Shou

arXiv:2212.11565 · cs.CV · March 20, 2023 · 27 cites

TL;DR

Tune-A-Video introduces a one-shot tuning approach for text-to-video generation using pre-trained image diffusion models, enabling efficient and consistent video synthesis from a single text-video pair.

Contribution

The paper proposes a novel one-shot tuning method for text-to-video generation that leverages pre-trained image diffusion models with a new spatio-temporal attention mechanism.

Findings

1. Effective generation of videos from a single text-video pair
2. High content consistency in generated videos
3. Significant reduction in computational cost compared with training on large-scale video datasets

Abstract

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure…
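
The abstract does not spell out the spatio-temporal attention mechanism, so the following is only a minimal NumPy sketch of one plausible form of cross-frame attention: each frame's queries attend to keys and values taken from the first frame and the immediately preceding frame. The function name `cross_frame_attention`, the choice of context frames, and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(frames, w_q, w_k, w_v):
    """Hypothetical cross-frame attention over a short clip.

    frames: array of shape (F, N, D) with F frames, N tokens per frame,
            and D channels per token.
    w_q, w_k, w_v: projection matrices of shape (D, D_h).

    Assumption: frame i attends to tokens of frame 0 and frame i-1
    (frame 0 attends to itself twice, which reduces to self-attention).
    """
    num_frames, num_tokens, _ = frames.shape
    q = frames @ w_q  # (F, N, D_h)
    k = frames @ w_k  # (F, N, D_h)
    v = frames @ w_v  # (F, N, D_h)
    scale = 1.0 / np.sqrt(q.shape[-1])

    out = np.empty((num_frames, num_tokens, v.shape[-1]))
    for i in range(num_frames):
        prev = max(i - 1, 0)
        # Context keys/values: first frame followed by previous frame.
        k_ctx = np.concatenate([k[0], k[prev]], axis=0)  # (2N, D_h)
        v_ctx = np.concatenate([v[0], v[prev]], axis=0)  # (2N, D_h)
        attn = softmax((q[i] @ k_ctx.T) * scale, axis=-1)  # (N, 2N)
        out[i] = attn @ v_ctx
    return out
```

Under these assumptions the cost per frame grows with two frames of context rather than the whole clip, which is one way a design like this could keep attention cheap while still tying every frame to a shared reference.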

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: ALIGN · Diffusion