PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

Shang Wu; Chenwei Xu; Zhuofan Xia; Weijian Li; Lie Lu; Pranav Maneriker; Fan Du; Manling Li; Han Liu

arXiv:2603.03505 · cs.CV · March 5, 2026

TL;DR

PhyPrompt introduces a reinforcement learning framework that automatically refines prompts to generate physically plausible videos, significantly improving physical realism and semantic fidelity across multiple models without requiring expert input.

Contribution

The paper presents a novel RL-based prompt refinement method that integrates physics principles into text-to-video generation, outperforming larger models and general-purpose approaches.

Findings

01. 40.8% joint success rate on VideoPhy2

02. Improved physical commonsense by 11 percentage points

03. Zero-shot transfer across diverse T2V architectures

Abstract

State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization:…
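
The second stage described in the abstract combines two ideas: a reward whose weighting moves from semantic fidelity toward physical commonsense over training, and the group-relative advantage used by Group Relative Policy Optimization. The sketch below is one way these pieces could fit together, not the authors' implementation. The linear schedule, the 0.2 to 0.8 weight range, and all function names are assumptions made for illustration; the abstract does not specify the schedule or the reward models.

```python
import statistics


def curriculum_weights(step, total_steps, phys_start=0.2, phys_end=0.8):
    """Return (w_semantic, w_physics) for a training step.

    Assumption: a linear schedule. The abstract only says the reward
    starts by prioritizing semantic fidelity and shifts progressively
    toward physical commonsense; the endpoints here are hypothetical.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_physics = phys_start + (phys_end - phys_start) * progress
    return 1.0 - w_physics, w_physics


def combined_reward(semantic_score, physics_score, step, total_steps):
    """Blend the two scores with the curriculum weights for this step."""
    w_semantic, w_physics = curriculum_weights(step, total_steps)
    return w_semantic * semantic_score + w_physics * physics_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no separate value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical (semantic, physics) scores for four refined prompts
    # sampled from the policy for a single user prompt.
    scores = [(0.9, 0.3), (0.7, 0.8), (0.8, 0.6), (0.5, 0.9)]
    for step in (0, 500, 1000):
        rewards = [combined_reward(s, p, step, 1000) for s, p in scores]
        print(step, group_relative_advantages(rewards))
```

Early in training the first candidate (high semantic score, low physics score) receives a comparatively high advantage, and late in training candidates with higher physics scores overtake it. That ordering change is the intended effect of such a curriculum under these assumed weights.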

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation