Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin

TL;DR
Latent-Shift introduces a parameter-free temporal shift module for efficient text-to-video generation in latent space, leveraging a pretrained image diffusion model to produce videos with comparable or better quality while reducing computational complexity.
Contribution
The paper presents a parameter-free temporal shift module that lets a pretrained image diffusion U-Net be used as is for video generation, without the extra temporal convolution or temporal attention layers that prior work adds.
Findings
Latent-Shift achieves comparable or better video quality than existing methods.
The approach significantly reduces computational costs compared to traditional video diffusion models.
It can still generate images despite being fine-tuned for text-to-video generation.
Abstract
We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward…
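The channel-shifting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `temporal_shift`, the `shift_fraction` value of 0.25, the zero-filling at the first and last frames, and the assumption that the shift runs along the frame (temporal) axis are all illustrative choices made here. Pure Python lists stand in for latent tensors so the example has no dependencies; a tensor library would do the same thing with slicing.

```python
def temporal_shift(frames, shift_fraction=0.25):
    """Shift two groups of channels across neighbouring frames.

    frames[t][c] is channel c of frame t, stored as a flat list of floats.
    shift_fraction is a hypothetical setting, not a value from the paper.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    k = int(num_channels * shift_fraction)  # channels per shifted group

    shifted = []
    for t in range(num_frames):
        new_frame = []
        for c in range(num_channels):
            if c < k:
                src = t - 1  # forward shift: frame t receives frame t-1
            elif c < 2 * k:
                src = t + 1  # backward shift: frame t receives frame t+1
            else:
                src = t      # remaining channels are left in place
            if 0 <= src < num_frames:
                new_frame.append(list(frames[src][c]))
            else:
                # Assumption: out-of-range neighbours are filled with zeros.
                new_frame.append([0.0] * len(frames[t][c]))
        shifted.append(new_frame)
    return shifted


# Three frames, four channels, one value per channel for readability.
frames = [[[float(10 * t + c + 1)] for c in range(4)] for t in range(3)]
shifted = temporal_shift(frames)
for t, frame in enumerate(shifted):
    print(t, frame)
```

The operation only moves existing values between frames, so it has no learnable weights. After the shift, each frame's feature map carries some channels from its neighbours, so the spatial layers that follow see information from adjacent frames.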
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Diffusion
