Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo

TL;DR
This paper introduces CREPA, a novel regularization technique for fine-tuning Video Diffusion Models that enhances both visual quality and temporal semantic consistency across frames.
Contribution
The paper adapts the Representation Alignment (REPA) method to VDMs and proposes CREPA to improve cross-frame semantic coherence during fine-tuning.
Findings
CREPA improves visual fidelity in fine-tuned VDMs.
CREPA enhances cross-frame semantic consistency.
Empirical results on large-scale VDMs validate CREPA's effectiveness.
Abstract
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external…
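As a rough illustration of the alignment idea in the abstract, the following minimal sketch computes a per-frame alignment term (hidden states of each frame pulled toward that frame's external features) and a hypothetical cross-frame variant. It is not the authors' implementation. The function names, the `offsets` and `weight` parameters, and the use of cosine similarity are all assumptions. The abstract is truncated before it says which external features CREPA aligns to, so the neighboring-frame term is a guess based on the method's name. The sketch also assumes the hidden states have already been projected to the same dimension as the external features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero

def per_frame_alignment_loss(hidden, target):
    """Negative mean cosine similarity between each frame's hidden tokens
    and the external features of the SAME frame.

    hidden, target: list of frames; each frame is a list of token vectors.
    """
    total, count = 0.0, 0
    for h_frame, y_frame in zip(hidden, target):
        for h, y in zip(h_frame, y_frame):
            total += cosine(h, y)
            count += 1
    return -total / count

def cross_frame_alignment_loss(hidden, target, offsets=(-1, 1), weight=0.5):
    """Hypothetical cross-frame variant (an assumption, not the paper's formula):
    each frame's hidden tokens are also aligned with the external features of
    neighboring frames at the given offsets, scaled by `weight`.
    """
    num_frames = len(hidden)
    total, count = 0.0, 0
    for f in range(num_frames):
        for p, h in enumerate(hidden[f]):
            sim = cosine(h, target[f][p])  # same-frame term
            for k in offsets:
                g = f + k
                if 0 <= g < num_frames:  # skip neighbors outside the clip
                    sim += weight * cosine(h, target[g][p])
            total += sim
            count += 1
    return -total / count

# Toy clip: 2 frames, 2 tokens per frame, 2-dimensional features.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 0.0], [0.0, 1.0]]]
same_frame = per_frame_alignment_loss(features, features)
cross_frame = cross_frame_alignment_loss(features, features)
```

In a fine-tuning setup, a term like this would typically be added to the diffusion training loss with a scalar coefficient. How the two are weighted is not stated in the text above.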
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Advanced Neuroimaging Techniques and Applications
Methods: Diffusion
