Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

Sungwon Hwang; Hyojin Jang; Kinam Kim; Minho Park; Jaegul Choo

arXiv:2506.09229·cs.CV·June 26, 2025

TL;DR

This paper introduces CREPA, a novel regularization technique for fine-tuning Video Diffusion Models that enhances both visual quality and temporal semantic consistency across frames.

Contribution

The paper adapts Representation Alignment (REPA) to VDMs and proposes CREPA to improve cross-frame semantic coherence during fine-tuning.

Findings

1. CREPA improves visual fidelity in fine-tuned VDMs.
2. CREPA enhances cross-frame semantic consistency.
3. Empirical results on large-scale VDMs validate CREPA's effectiveness.

Abstract

Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of the training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external…
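
For illustration, the "straightforward adaptation of REPA for VDMs" mentioned above can be sketched as a frame-wise alignment loss. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes that each frame's patch-token hidden states from an intermediate DiT layer are passed through a trainable projection head (here any callable `proj`, standing in for REPA's small MLP) and compared, patch by patch, with features that a frozen pretrained visual encoder extracts from the corresponding clean frame. The loss is the negative mean cosine similarity. The names `framewise_repa_loss`, `cosine_similarity`, and `proj` are hypothetical, and the cross-frame term that defines CREPA is deliberately omitted.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity along the last axis; eps guards against zero-norm vectors."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_n * b_n, axis=-1)


def framewise_repa_loss(hidden, target, proj):
    """Hypothetical frame-wise REPA-style alignment loss (a sketch, not the paper's code).

    hidden: (F, N, D) DiT hidden states for F frames with N patch tokens each
    target: (F, N, K) frozen pretrained encoder features of the clean frames
    proj:   callable mapping (..., D) -> (..., K); stands in for a trainable MLP head
    Returns the negative mean per-patch cosine similarity, a value in [-1, 1].
    """
    projected = proj(hidden)                     # (F, N, K)
    sim = cosine_similarity(projected, target)   # (F, N): one score per frame and patch
    return -float(sim.mean())                    # each frame is aligned only with itself


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.standard_normal((4, 16, 8))    # 4 frames, 16 patches, 8-dim hidden states
    W = rng.standard_normal((8, 6))             # assumed linear projection to a 6-dim feature space
    proj = lambda x: x @ W
    target = rng.standard_normal((4, 16, 6))    # stand-in for pretrained encoder features
    print(framewise_repa_loss(hidden, target, proj))
```

In a REPA-style setup this term would be added, with a weighting coefficient, to the usual denoising objective. Because each frame is aligned only with its own features, nothing in this loss ties one frame's representation to another's, which is consistent with the abstract's observation that the plain adaptation is suboptimal for cross-frame semantic consistency.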

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Advanced Neuroimaging Techniques and Applications

Methods: Diffusion