Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions

Ashkan Taghipour; Morteza Ghahremani; Mohammed Bennamoun; Aref Miri Rekavandi; Zinuo Li; Hamid Laga; Farid Boussaid

arXiv:2407.19205·cs.CV·July 30, 2024·1 cites

Open Access

TL;DR

This paper explores the impact of CLIP image embeddings on video generation within the SVD framework, proposing a new efficient method that reduces computational load by replacing cross-attention with a linear layer, without sacrificing quality.

Contribution

The paper introduces VCUT, a training-free approach that avoids recomputing cross-attention at every diffusion inference step, which reduces computational cost and latency.

Findings

01

CLIP embeddings are crucial for aesthetic quality but not for subject/background consistency.

02

Replacing cross-attention with a linear layer maintains quality while improving efficiency.

03

VCUT reduces multiply-accumulate operations (MACs) by up to 322T, decreases model parameters by 50M, and lowers latency by 20%.
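
Finding 02 is easiest to see in the special case where the conditioning context is a single token. With only one key, the softmax weight is exactly 1 for every query, so the attention output reduces to the value projection of that token and no longer depends on the query. The sketch below checks this numerically in plain Python. The single-token assumption, the toy dimensions, the random weights, and all function names are illustrative and are not taken from the paper; the output projection is omitted for brevity.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention without an output projection."""
    keys = [matvec(w_k, c) for c in context]
    values = [matvec(w_v, c) for c in context]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        q = matvec(w_q, query)
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


random.seed(0)
DIM = 4  # toy feature size, chosen only for illustration


def rand_matrix():
    """Random DIM x DIM weight matrix (hypothetical weights)."""
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(DIM)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
queries = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(3)]
image_token = [random.gauss(0.0, 1.0) for _ in range(DIM)]  # stands in for one image embedding

# Cross-attention over a context that contains exactly one token.
attn_out = cross_attention(queries, [image_token], w_q, w_k, w_v)
# The same result from a query-independent linear map of that token.
linear_out = matvec(w_v, image_token)

max_err = max(abs(a - b) for row in attn_out for a, b in zip(row, linear_out))
print(max_err)
```

Every query row of `attn_out` matches `linear_out`, so under this assumption the query and key projections do no useful work and can be dropped.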

Abstract

This paper investigates the role of CLIP image embeddings within the Stable Video Diffusion (SVD) framework, focusing on their impact on video generation quality and computational efficiency. Our findings indicate that CLIP embeddings, while crucial for aesthetic quality, do not significantly contribute towards the subject and background consistency of video outputs. Moreover, the computationally expensive cross-attention mechanism can be effectively replaced by a simpler linear layer. This layer is computed only once at the first diffusion inference step, and its output is then cached and reused throughout the inference process, thereby enhancing efficiency while maintaining high-quality outputs. Building on these insights, we introduce the VCUT, a training-free approach optimized for efficiency within the SVD architecture. VCUT eliminates temporal cross-attention and replaces spatial…
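
The compute-once-and-cache idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `CachedLinearConditioning`, the function `run_inference`, the toy weights, and the placeholder latent update are all hypothetical. It only shows the caching pattern, in which the linear projection of the image embedding runs at the first inference step and is reused at every later step.

```python
class CachedLinearConditioning:
    """Hypothetical linear conditioning layer whose output is cached after first use."""

    def __init__(self, weight):
        self.weight = weight        # list of rows, shape (out_dim, in_dim)
        self._cached = None         # projection computed at the first step
        self.num_projections = 0    # counts how often the projection really runs

    def __call__(self, image_embedding):
        if self._cached is None:
            self._cached = [
                sum(w * x for w, x in zip(row, image_embedding))
                for row in self.weight
            ]
            self.num_projections += 1
        return self._cached

    def reset(self):
        """Clear the cache before conditioning on a new image."""
        self._cached = None


def run_inference(layer, image_embedding, latent, num_steps):
    """Toy denoising loop; the update rule is a placeholder, not a real sampler."""
    layer.reset()
    for _ in range(num_steps):
        cond = layer(image_embedding)  # projected at step 0, cached afterwards
        latent = [0.9 * z + 0.1 * c for z, c in zip(latent, cond)]
    return latent


layer = CachedLinearConditioning([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
final_latent = run_inference(layer, [2.0, 4.0], [0.0, 0.0, 0.0], num_steps=25)
print(layer.num_projections)
```

Because the image embedding does not change between steps, the cached projection stays valid for the whole run; calling `reset()` at the start of each run keeps a stale projection from leaking into the next video.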

Taxonomy

Topics: Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Diffusion · Contrastive Language-Image Pre-training