VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Xiangdong Zhang; Jiaqi Liao; Shaofeng Zhang; Fanqing Meng; Xiangpeng Wan; Junchi Yan; Yu Cheng

arXiv:2505.23656·cs.CV·May 30, 2025

TL;DR

VideoREPA introduces a novel method to enhance physics understanding in text-to-video models by distilling knowledge from foundation models through token relation alignment, leading to more physically plausible video generation.

Contribution

VideoREPA is the first representation alignment (REPA) approach designed specifically for finetuning text-to-video (T2V) models, injecting physical knowledge to improve the physics consistency of generated videos.

Findings

01. Significant improvement in physics commonsense benchmarks.

02. Enhanced physical plausibility in generated videos.

03. Effective distillation of physics understanding from foundation models.

Abstract

Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content because of their limited inherent ability to understand physics accurately. We find that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance…
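The idea of aligning token-level relations, rather than the token features themselves, can be illustrated with a small sketch. This is not the paper's TRD loss, whose exact form is cut off in the abstract above. It assumes, purely for illustration, that a "relation" is the cosine similarity between two tokens and that the mismatch is measured as a mean absolute difference. The function names `relation_matrix` and `relation_alignment_loss` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relation_matrix(tokens: List[Vector]) -> List[List[float]]:
    """N x N matrix of pairwise cosine similarities among N tokens."""
    return [[cosine_similarity(a, b) for b in tokens] for a in tokens]


def relation_alignment_loss(student_tokens: List[Vector],
                            teacher_tokens: List[Vector]) -> float:
    """Mean absolute difference between student and teacher relation matrices.

    Illustrative assumption only: the choice of cosine similarity and an
    L1-style average is ours, not taken from the paper.
    """
    n = len(student_tokens)
    if n == 0 or n != len(teacher_tokens):
        raise ValueError("student and teacher need the same non-zero token count")
    student_rel = relation_matrix(student_tokens)
    teacher_rel = relation_matrix(teacher_tokens)
    total = sum(abs(student_rel[i][j] - teacher_rel[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)


# Toy data: two tokens each; the feature widths differ (3 vs. 2) on purpose.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student_aligned = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student_collapsed = [[1.0, 0.0], [1.0, 0.0]]

loss_aligned = relation_alignment_loss(student_aligned, teacher)
loss_collapsed = relation_alignment_loss(student_collapsed, teacher)
```

Because both relation matrices are N x N, the student and teacher features may have different widths, so a comparison of this kind needs no projection layer to match dimensions. In the toy data, the student whose tokens are mutually orthogonal reproduces the teacher's relations exactly, while the student whose two tokens have collapsed onto one direction does not.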

Code & Models

Repositories

aHapBean/VideoREPA (Official)

Taxonomy

Methods: Diffusion