Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Jie An; Songyang Zhang; Harry Yang; Sonal Gupta; Jia-Bin Huang; Jiebo Luo; Xi Yin

arXiv:2304.08477·cs.CV·April 19, 2023·27 cites

Open Access

TL;DR

Latent-Shift introduces a parameter-free temporal shift module for efficient text-to-video generation in latent space, leveraging a pretrained image diffusion model to produce videos with comparable or better quality while reducing computational complexity.

Contribution

The paper presents a novel parameter-free temporal shift module that enables efficient video generation using a pretrained image diffusion model without additional parameters.

Findings

01. Latent-Shift achieves comparable or better video quality than existing methods.

02. The approach significantly reduces computational costs compared to traditional video diffusion models.

03. It can generate videos effectively even when fine-tuned primarily for text-to-video tasks.

Abstract

We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward…
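The shifting idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `temporal_shift`, the `(T, C, H, W)` layout, the `shift_fraction` value, and the zero padding at the first and last frames are all assumptions made for illustration, and the shift is assumed to run along the frame axis, as the module's name suggests.

```python
import numpy as np


def temporal_shift(x, shift_fraction=0.25):
    """Parameter-free shift of channel groups along the frame axis.

    x: array of shape (T, C, H, W), one feature map per video frame.
    shift_fraction: share of channels moved in each direction
        (hypothetical default, not taken from the paper).
    """
    t, c, h, w = x.shape
    k = int(c * shift_fraction)  # channels per shifted group
    out = np.zeros_like(x)       # vacated positions stay zero (assumed padding)

    # Group 1: frame i receives channels from frame i - 1.
    out[1:, :k] = x[:-1, :k]
    # Group 2: frame i receives channels from frame i + 1.
    out[:-1, k:2 * k] = x[1:, k:2 * k]
    # Remaining channels are left untouched.
    out[:, 2 * k:] = x[:, 2 * k:]
    return out


# Usage: 3 frames, 4 channels, 1x1 spatial size; value at (t, c) is 4*t + c.
frames = np.arange(12, dtype=float).reshape(3, 4, 1, 1)
shifted = temporal_shift(frames, shift_fraction=0.25)
print(shifted[:, :, 0, 0])
```

Because the operation only moves existing activations between neighbouring frames, it adds no learnable weights, which is consistent with the "parameter-free" description; each frame's features then carry some information from the previous and the next frame before the spatial U-Net layers process them.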

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Diffusion