PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai; Kunpeng Li; Menglin Jia; Jialiang Wang; Junzhe Sun; Feng Liang; Weifeng Chen; Felix Juefei-Xu; Chu Wang; Ali Thabet; Xiaoliang Dai; Xuan Ju; Alan Yuille; Ji Hou

arXiv:2512.24551 · cs.CV · March 6, 2026

TL;DR

This paper introduces PhyGDPO, a physics-aware groupwise direct preference optimization approach for text-to-video generation. It aims to improve physical consistency by combining a large-scale physics-augmented training dataset (PhyVidGen-135K) with physics-guided rewards, and reports outperforming existing methods.

Contribution

It presents PhyGDPO, a physics-aware preference optimization framework, and PhyAugPipe, a physics-augmented data construction pipeline, both aimed at more physically consistent video synthesis.

Findings

01

Outperforms state-of-the-art methods on PhyGenBench and VideoPhy2.

02

Effectively captures complex physical phenomena in generated videos.

03

Utilizes physics-guided rewards to improve physical accuracy.

Abstract

Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that uses real-world videos as winning cases to guarantee correct physics learning and builds upon the groupwise Plackett-Luce…
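The abstract is cut off before it defines the PhyGDPO objective. As a rough illustration of the groupwise Plackett-Luce preference idea it builds on, the sketch below implements the generic Plackett-Luce form of DPO, not necessarily the paper's exact loss. Each candidate video gets a DPO-style implicit reward `beta * (log pi_theta - log pi_ref)`, and the loss is the negative log-probability of a full ranking. Following the abstract, the real-world video is assumed to sit at index 0 and be ranked first. The function name `plackett_luce_dpo_loss`, the `beta` default, and the idea of ordering generated videos by some physics score are hypothetical. For a video diffusion model these log-likelihoods are not directly available (Diffusion-DPO, for example, approximates them via denoising losses), so the sketch simply takes them as inputs.

```python
import math
from typing import Sequence


def plackett_luce_dpo_loss(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    ranking: Sequence[int],
    beta: float = 0.1,
) -> float:
    """Hypothetical sketch: negative log-likelihood of `ranking` under a
    Plackett-Luce model scored by DPO implicit rewards
    beta * (log pi_theta(y) - log pi_ref(y)).

    `ranking` lists candidate indices from best to worst. In this sketch the
    real-world video is assumed to be index 0 and ranked first.
    """
    n = len(policy_logps)
    if len(ref_logps) != n or sorted(ranking) != list(range(n)):
        raise ValueError("ranking must be a permutation of the candidates")
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    ordered = [rewards[i] for i in ranking]
    loss = 0.0
    # The last position contributes log(1) = 0, so it is skipped.
    for k in range(n - 1):
        tail = ordered[k:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= ordered[k] - log_denom
    return loss


if __name__ == "__main__":
    # Made-up log-likelihoods: one real video (index 0) plus three generated ones.
    policy = [-10.0, -11.0, -12.5, -13.0]
    ref = [-10.5, -11.2, -12.0, -12.8]
    # Real video first, then generated videos in a hypothetical physics-score order.
    print(plackett_luce_dpo_loss(policy, ref, ranking=[0, 1, 3, 2]))
```

With only two candidates this reduces to the standard pairwise DPO loss, `-log sigmoid(r_w - r_l)`; the groupwise form lets one real video be contrasted against several generated ones in a single term.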

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition