Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

TL;DR
Tune-A-Video introduces a one-shot tuning approach for text-to-video generation using pre-trained image diffusion models, enabling efficient and consistent video synthesis from a single text-video pair.
Contribution
The paper proposes a novel one-shot tuning method for text-to-video generation that leverages pre-trained image diffusion models with a new spatio-temporal attention mechanism.
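As a rough illustration of what a spatio-temporal attention layer can look like, the sketch below lets every token in frame i attend to the tokens of the first frame and of the previous frame, which is the sparse, causal pattern the paper describes. It is a minimal sketch under simplifying assumptions, not the authors' implementation: the function name `sparse_causal_attention` is hypothetical, there are no learned query/key/value projections (tokens are used directly), and frames are plain Python lists rather than tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_causal_attention(frames):
    """Toy spatio-temporal attention (hypothetical helper).

    frames: list of frames, each a list of token vectors (lists of floats).
    Queries come from frame i; keys and values come from frame 0 and
    frame i-1 (frame 0 is reused as the "previous" frame when i == 0).
    Assumption: identity projections, so tokens serve as q, k, and v.
    """
    out = []
    for i, frame in enumerate(frames):
        prev = frames[max(i - 1, 0)]
        context = frames[0] + prev  # keys/values: first + previous frame
        scale = 1.0 / math.sqrt(len(frame[0]))
        new_frame = []
        for q in frame:
            weights = softmax([dot(q, k) * scale for k in context])
            token = [
                sum(w * v[d] for w, v in zip(weights, context))
                for d in range(len(q))
            ]
            new_frame.append(token)
        out.append(new_frame)
    return out


if __name__ == "__main__":
    video = [
        [[1.0, 0.0], [0.0, 1.0]],  # frame 0, two tokens
        [[0.5, 0.5], [1.0, 1.0]],  # frame 1
        [[2.0, 0.0], [0.0, 2.0]],  # frame 2
    ]
    for frame in sparse_causal_attention(video):
        print(frame)
```

Because each frame only looks at two reference frames, the cost grows linearly with the number of frames instead of quadratically, and the output for a frame never depends on later frames.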
Findings
Effective generation of videos from a single text-video pair
High content consistency in generated videos
Significant reduction in computational cost compared with training a T2V generator on large-scale video datasets
Abstract
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure…
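The abstract mentions DDIM inversion at inference. The sketch below shows the general idea of deterministic DDIM inversion on plain vectors: the usual DDIM update is run in the noising direction to map a clean sample to a latent, and the same update run in the denoising direction maps it back. Everything here is an illustrative assumption rather than the paper's code: the helper names (`ddim_step`, `ddim_invert`, `ddim_sample`), the four-value alpha-bar schedule, and the toy noise predictor, which stands in for a trained text-conditioned U-Net.

```python
import math


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update from noise level a_from to a_to.

    a_from and a_to are cumulative alpha-bar values in (0, 1].
    """
    x0_pred = [
        (xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
        for xi, ei in zip(x, eps)
    ]
    return [
        math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * ei
        for x0, ei in zip(x0_pred, eps)
    ]


def ddim_invert(x0, eps_model, alphas):
    """Map a clean sample to a noisy latent (alphas run clean -> noisy)."""
    x = list(x0)
    for t in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t + 1])
    return x


def ddim_sample(x_T, eps_model, alphas):
    """Map a noisy latent back toward a clean sample (noisy -> clean)."""
    x = list(x_T)
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x


def toy_eps_model(x, t):
    """Hypothetical stand-in for a trained noise predictor.

    It ignores its inputs, which makes the round trip exact.
    """
    return [0.3, -0.2]


if __name__ == "__main__":
    alphas = [0.999, 0.9, 0.5, 0.1]  # assumed alpha-bar schedule
    clean = [1.0, -2.0]
    latent = ddim_invert(clean, toy_eps_model, alphas)
    print(latent)
    print(ddim_sample(latent, toy_eps_model, alphas))
```

With a real noise predictor the prediction changes between steps, so inversion followed by sampling only approximately reconstructs the input; the round trip is exact here only because the toy predictor is constant.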
Code & Models
- 🤗 Tune-A-Video-library/a-man-is-surfing · model · 8 dl · ♡ 10
- 🤗 Tune-A-Video-library/mo-di-bear-guitar · model · 19 dl · ♡ 22
- 🤗 Tune-A-Video-library/birdgif-test · model · ♡ 2
- 🤗 Tune-A-Video-library/redshift-man-skiing · model · 19 dl · ♡ 13
- 🤗 Tune-A-Video-library/df-cpt-mo-di-bear-guitar · model · 22 dl · ♡ 2
- 🤗 kyujinpy/Tune-A-VideKO-v1-5 · model · 19 dl · ♡ 1
- 🤗 kyujinpy/Tune-A-VideKO-anything · model · 18 dl · ♡ 1
- 🤗 kyujinpy/Tune-A-VideKO-disney · model · 18 dl · ♡ 3
- 🤗 please-go-faster/baseline · model · 1 dl
- 🤗 rarun/product-ads · model
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: ALIGN · Diffusion
