TL;DR
VideoREPA introduces a novel method to enhance physics understanding in text-to-video models by distilling knowledge from foundation models through token relation alignment, leading to more physically plausible video generation.
Contribution
VideoREPA is the first REPA (representation alignment) approach designed specifically for finetuning T2V models, with the goal of injecting physical knowledge and improving the physics consistency of generated videos.
Findings
Significant improvement in physics commonsense benchmarks.
Enhanced physical plausibility in generated videos.
Effective distillation of physics understanding from foundation models.
Abstract
Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content because of their limited inherent ability to understand physics accurately. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance…
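To make the idea of aligning token-level relations concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's TRD loss: the abstract is truncated before the formulation is given, so the use of cosine similarity, the mean absolute difference, and the names `relation_matrix` and `token_relation_loss` are all hypothetical choices made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def relation_matrix(tokens):
    """Pairwise cosine similarity between all tokens, as an N x N matrix."""
    unit = [normalize(t) for t in tokens]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def token_relation_loss(student_tokens, teacher_tokens):
    """Mean absolute difference between student and teacher relation matrices.

    Assumption: both inputs hold the same number of tokens, already matched
    across space and time. Feature widths may differ, because only the
    N x N relation matrices are compared, never the features themselves.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    s = relation_matrix(student_tokens)
    t = relation_matrix(teacher_tokens)
    n = len(s)
    return sum(abs(s[i][j] - t[i][j]) for i in range(n) for j in range(n)) / (n * n)


# Toy features: the teacher has 2-dim tokens, the student 3-dim tokens.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
collapsed_student = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = token_relation_loss(aligned_student, teacher)
loss_collapsed = token_relation_loss(collapsed_student, teacher)
```

In this toy case the aligned student reproduces the teacher's relation structure despite having a different feature width and scale, so its loss is zero, while the collapsed student (two identical tokens) is penalized. Comparing relations rather than raw features is what lets the two models keep different embedding dimensions.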
Taxonomy
Methods: Diffusion
