DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation
Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim

TL;DR
DiReCT is a regularization framework that disentangles semantic and physical information in contrastive flow matching for text-conditioned video generation, improving physical realism without extra training cost.
Contribution
The paper proposes DiReCT, a post-training method that separates semantic and physical signals in contrastive learning for physics-aware video generation.
Findings
Improves the physical commonsense score on VideoPhy by 16.7% over the baseline.
Effectively separates semantic and physical information in contrastive trajectories.
Enhances physics consistency in generated videos without increasing training time.
Abstract
Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus…
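The gradient conflict described in the abstract can be illustrated with a small numerical sketch. It assumes a commonly used form of the contrastive flow-matching loss, a squared-error term toward the positive target velocity minus a weighted squared-error term toward a negative target velocity; the exact objective, the weight `lam`, and all function and variable names below are illustrative assumptions, not taken from the paper. The cosine computed here is only a diagnostic for this toy setup and is not the paper's alignment condition.

```python
import numpy as np

def contrastive_fm_loss(v_pred, u_pos, u_neg, lam=0.05):
    """Assumed loss form: ||v - u_pos||^2 - lam * ||v - u_neg||^2."""
    return float(np.sum((v_pred - u_pos) ** 2) - lam * np.sum((v_pred - u_neg) ** 2))

def flow_matching_grad(v_pred, u_pos):
    """Gradient of ||v_pred - u_pos||^2 with respect to v_pred."""
    return 2.0 * (v_pred - u_pos)

def contrastive_grad(v_pred, u_neg, lam=0.05):
    """Gradient of -lam * ||v_pred - u_neg||^2 with respect to v_pred."""
    return -2.0 * lam * (v_pred - u_neg)

def cosine(a, b):
    """Cosine similarity between two flattened vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
v_pred = rng.normal(size=16)   # hypothetical predicted velocity
u_pos = rng.normal(size=16)    # hypothetical target velocity of the positive sample

# Negative whose target velocity nearly coincides with the positive one,
# mimicking a prompt that shares most of its content with the positive prompt.
u_neg_overlap = u_pos + 1e-3 * rng.normal(size=16)
# Negative whose target velocity is unrelated to the positive one.
u_neg_distinct = rng.normal(size=16)

g_fm = flow_matching_grad(v_pred, u_pos)
cos_overlap = cosine(g_fm, contrastive_grad(v_pred, u_neg_overlap))
cos_distinct = cosine(g_fm, contrastive_grad(v_pred, u_neg_distinct))

print("cosine with overlapping negative:", cos_overlap)
print("cosine with distinct negative:   ", cos_distinct)
```

When the negative target almost equals the positive target, the contrastive gradient is close to a negatively scaled copy of the flow-matching gradient, so the two terms pull the prediction in nearly opposite directions. A negative with a clearly different target velocity leaves a component that does not simply cancel the reconstruction signal.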
