SimDA: Simple Diffusion Adapter for Efficient Video Generation
Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang

TL;DR
SimDA is a parameter-efficient method for adapting a large text-to-image model to video generation: it fine-tunes only 24 million of the model's 1.1 billion parameters, using lightweight spatial and temporal adapters.
Contribution
The paper presents a lightweight adaptation framework that converts a T2I model into a T2V model by tuning a small set of adapter parameters and replacing the original spatial attention with Latent-Shift Attention (LSA) for temporal consistency.
Findings
Achieves high-definition video generation with minimal fine-tuning.
Utilizes a lightweight spatial and temporal adapter design.
Enables quick one-shot video editing with only 2 minutes of tuning.
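To make the "lightweight adapter" idea concrete, here is a minimal sketch of how the parameter budget of a generic bottleneck adapter (a down-projection followed by an up-projection) can be counted, and how small the tuned share is. The helper names and the layer sizes are assumptions chosen for illustration; the summary does not give the actual adapter dimensions. Only the 24M and 1.1B figures come from the paper.

```python
# Hypothetical helpers; layer sizes are illustrative and not taken from the paper.

def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Count weights and biases of a down-projection plus an up-projection."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # W_down and its bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # W_up and its bias
    return down + up


def trainable_fraction(trainable: float, total: float) -> float:
    """Share of the model's parameters that are updated during tuning."""
    return trainable / total


# Figures stated in the paper: 24M tuned parameters out of 1.1B.
share = trainable_fraction(24e6, 1.1e9)
print(f"trainable share: {share:.2%}")

# An assumed adapter on a 1280-wide layer with a 160-wide bottleneck.
print(bottleneck_adapter_params(hidden_dim=1280, bottleneck_dim=160))
```

Under these assumptions a single adapter costs a few hundred thousand parameters, which is why many of them can be attached to a frozen backbone while the tuned share stays near two percent.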
Abstract
The recent wave of AI-generated content has seen great progress and success in Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations despite attracting increasing interest. Existing works either train from scratch or adapt a large T2I model to videos, both of which are expensive in computation and resources. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we adapt the T2I model to T2V by designing lightweight spatial and temporal adapters for transfer learning. In addition, we replace the original spatial attention with the proposed Latent-Shift Attention (LSA) for temporal consistency. With a similar model architecture, we further train a video super-resolution model to generate…
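The abstract names Latent-Shift Attention but this excerpt does not define it. As a rough illustration of the general idea of shifting features along the time axis so that each frame sees information from a neighbouring frame, the sketch below implements a generic temporal shift on plain Python lists. It is not the paper's LSA: the function name, the choice to shift from the previous frame, and the zero padding for the first frame are all assumptions.

```python
from typing import List

# A frame is a list of channels; each channel is a flat list of values.
Frame = List[List[float]]


def temporal_shift(frames: List[Frame], num_shifted: int) -> List[Frame]:
    """Generic temporal shift (assumed illustration, not the paper's LSA).

    The first `num_shifted` channels of frame t are replaced by the same
    channels of frame t-1; the first frame receives zeros in those channels.
    """
    out: List[Frame] = []
    for t, frame in enumerate(frames):
        new_frame: Frame = []
        for c, channel in enumerate(frame):
            if c < num_shifted:
                if t == 0:
                    new_frame.append([0.0] * len(channel))  # zero padding
                else:
                    new_frame.append(list(frames[t - 1][c]))  # from previous frame
            else:
                new_frame.append(list(channel))  # untouched channel
        out.append(new_frame)
    return out


# Three frames, two channels each, two positions per channel.
video = [
    [[1.0, 1.0], [10.0, 10.0]],
    [[2.0, 2.0], [20.0, 20.0]],
    [[3.0, 3.0], [30.0, 30.0]],
]
shifted = temporal_shift(video, num_shifted=1)
print(shifted)
```

A shift of this kind adds no trainable parameters, which fits the parameter-efficient goal stated in the abstract, though the actual mechanism in the paper may differ.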
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging
Methods: Adapter · Diffusion
