RepVideo: Rethinking Cross-Layer Representation for Video Generation

Chenyang Si; Weichen Fan; Zhengyao Lv; Ziqi Huang; Yu Qiao; Ziwei Liu

arXiv:2501.08994 · cs.CV · January 16, 2025

TL;DR

RepVideo accumulates features from neighboring layers in text-to-video diffusion models to stabilize semantic representations, which improves spatial accuracy and temporal coherence in generated videos.

Contribution

The paper proposes RepVideo, a framework that improves semantic stability and temporal consistency in diffusion-based video generation by aggregating features across layers.

Findings

01. Improves spatial appearance accuracy in generated videos.
02. Enhances temporal consistency across frames.
03. Captures complex spatial relationships effectively.

Abstract

Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to…
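As a rough illustration of the idea described in the abstract, the Python sketch below averages each layer's features with those of its preceding neighbors, and includes a cosine-similarity helper of the kind that could be used to compare features of adjacent frames. The abstract excerpt does not specify the aggregation operator or how many layers are combined, so the mean, the `window` parameter, and the function names `accumulate_neighboring_layers` and `cosine_similarity` are assumptions made for illustration, not the authors' implementation.

```python
import math


def accumulate_neighboring_layers(layer_features, window=3):
    """Average each layer's feature vector with up to `window - 1` preceding layers.

    layer_features: list of equal-length lists of floats, one per layer.
    The mean and the trailing window are illustrative assumptions; the
    source abstract does not state the exact aggregation rule.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    aggregated = []
    for i in range(len(layer_features)):
        # Collect the current layer and its preceding neighbors.
        start = max(0, i - window + 1)
        group = layer_features[start:i + 1]
        dim = len(group[0])
        # Element-wise mean over the layers in the group.
        aggregated.append(
            [sum(f[d] for f in group) / len(group) for d in range(dim)]
        )
    return aggregated


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example: three "layers" with 2-dimensional features.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(accumulate_neighboring_layers(layers, window=2))
# → [[1.0, 0.0], [0.5, 0.5], [0.5, 1.0]]
```

In this toy setting, the averaged features change less abruptly from one layer to the next than the raw ones, which is the intuition behind stabilizing representations by accumulation. A real model would operate on token-by-channel tensors inside the transformer blocks rather than on plain lists.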

Code & Models

Models

Vchitect/RepVideo (Hugging Face model)

Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Diffusion