PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation
Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

TL;DR
PhyPrompt introduces a reinforcement learning framework that automatically refines prompts to generate physically plausible videos, significantly improving physical realism and semantic fidelity across multiple models without requiring expert input.
Contribution
The paper presents a novel RL-based prompt refinement method that integrates physics principles into text-to-video generation, outperforming larger models and general-purpose approaches.
Findings
40.8% joint success rate on VideoPhy2
Improved physical commonsense by 11 percentage points
Zero-shot transfer across diverse T2V architectures
Abstract
State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization:…
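The second stage described in the abstract can be illustrated with a minimal sketch. The abstract only says that the reward curriculum starts by prioritizing semantic fidelity and progressively shifts toward physical commonsense; the linear schedule, the 0.2 and 0.8 weight endpoints, and all function names below are assumptions made for illustration, not the authors' implementation. The group-relative advantage follows the usual GRPO recipe of normalizing each reward by the mean and standard deviation of its sampled group.

```python
import statistics


def curriculum_weights(step, total_steps, physics_start=0.2, physics_end=0.8):
    """Return (semantic_weight, physics_weight) for a training step.

    Assumption: a linear ramp between hypothetical endpoints. The source
    does not specify the schedule shape or the weight values.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_physics = physics_start + (physics_end - physics_start) * progress
    return 1.0 - w_physics, w_physics


def combined_reward(semantic_score, physics_score, step, total_steps):
    """Blend the two scores with the curriculum weights for this step."""
    w_semantic, w_physics = curriculum_weights(step, total_steps)
    return w_semantic * semantic_score + w_physics * physics_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled prompt refinements.

    Each advantage is (reward - group mean) / (group std + eps), so a
    refinement is only rewarded for beating its siblings in the group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical (semantic, physics) scores for four refinements of one prompt.
    scores = [(0.9, 0.3), (0.7, 0.6), (0.8, 0.5), (0.6, 0.9)]
    for step in (0, 500, 1000):
        rewards = [combined_reward(s, p, step, 1000) for s, p in scores]
        print(step, [round(a, 2) for a in group_relative_advantages(rewards)])
```

Under these assumed weights, the refinement with the highest semantic score ranks best early in training, while the one with the highest physics score ranks best at the end, which is the intended effect of shifting the curriculum.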
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation
