RepVideo: Rethinking Cross-Layer Representation for Video Generation
Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu

TL;DR
RepVideo introduces a novel cross-layer feature accumulation method for text-to-video diffusion models, significantly improving spatial accuracy and temporal coherence in generated videos by stabilizing semantic representations.
Contribution
The paper proposes RepVideo, a framework that aggregates features across neighboring layers of a diffusion-based video generation model to stabilize semantic representations and improve temporal consistency.
Findings
Improves spatial appearance accuracy in generated videos.
Enhances temporal consistency across frames.
Captures complex spatial relationships effectively.
Abstract
Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to…
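The abstract is truncated here, so the exact aggregation rule is not stated on this page. As a rough illustration of the idea of accumulating features from neighboring layers, the sketch below averages each layer's output with the outputs of a few preceding layers, and measures adjacent-frame similarity with cosine similarity. The function names, the `window` parameter, the plain mean, and the tensor layout `(frames, tokens, channels)` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def accumulate_neighboring_layers(layer_features, window=3):
    """Average each layer's features with those of its preceding neighbors.

    layer_features: list of arrays, one per layer, each shaped
        (frames, tokens, channels). This layout is an assumption.
    window: hypothetical number of consecutive layers to average.
    """
    stacked = np.stack(layer_features)  # (layers, frames, tokens, channels)
    accumulated = []
    for i in range(stacked.shape[0]):
        start = max(0, i - window + 1)  # clip the window at the first layer
        accumulated.append(stacked[start:i + 1].mean(axis=0))
    return accumulated


def adjacent_frame_similarity(features):
    """Mean cosine similarity between consecutive frames.

    features: array shaped (frames, tokens, channels).
    """
    flat = features.reshape(features.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / (norms + 1e-8)  # small epsilon avoids division by zero
    return float(np.mean(np.sum(flat[:-1] * flat[1:], axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Eight hypothetical layers, 4 frames, 16 tokens, 32 channels.
    layers = [rng.normal(size=(4, 16, 32)) for _ in range(8)]
    smoothed = accumulate_neighboring_layers(layers, window=3)
    print("raw last layer:", adjacent_frame_similarity(layers[-1]))
    print("accumulated last layer:", adjacent_frame_similarity(smoothed[-1]))
```

Comparing the two printed values on real intermediate features would be one way to check the abstract's claim that layer-to-layer variation reduces similarity between adjacent frames; on the random inputs used here the numbers carry no meaning.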
Taxonomy
Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Diffusion
