MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas

TL;DR
MoReGen is a physics-grounded multi-agent framework for generating intent-aligned, physically accurate videos from text prompts, addressing the challenge of physical validity in text-to-video synthesis.
Contribution
We introduce MoReGen, a novel multi-agent, physics-aware T2V framework, and MoReSet, a benchmark dataset for evaluating physical validity in generated videos.
Findings
State-of-the-art T2V models often lack physical validity.
MoReGen achieves more physically coherent video synthesis.
MoReSet provides a new standard for evaluating physics in T2V.
Abstract
While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet,…
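The abstract names object-trajectory correspondence as the evaluation metric but does not define it here. As a rough illustration only, the sketch below scores a candidate trajectory against a closed-form Newtonian projectile reference using mean per-frame Euclidean distance. The function names, the choice of distance, and the sample parameters are all assumptions made for this example and are not taken from MoReGen or MoReSet.

```python
import math

# Hypothetical helper: closed-form projectile motion under constant gravity.
# Not part of MoReGen; used only to produce a reference trajectory.
def projectile_trajectory(x0, y0, vx, vy, g, dt, steps):
    """Return (x, y) positions sampled every dt seconds for `steps` frames."""
    return [
        (x0 + vx * (k * dt), y0 + vy * (k * dt) - 0.5 * g * (k * dt) ** 2)
        for k in range(steps)
    ]

# Hypothetical metric: mean per-frame Euclidean distance between trajectories.
# The paper's actual correspondence metric may differ.
def mean_trajectory_error(predicted, reference):
    """Average distance between matching frames of two equal-length trajectories."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equal length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)

# Reference path, plus a copy shifted upward by a constant 0.5 units.
reference = projectile_trajectory(0.0, 0.0, 3.0, 4.0, 9.81, 0.1, 10)
drifted = [(x, y + 0.5) for x, y in reference]

print(mean_trajectory_error(reference, reference))
print(mean_trajectory_error(drifted, reference))
```

A lower score means the candidate object path stays closer to the physically expected one. In practice the predicted points would come from tracking an object in a generated video, which this sketch does not attempt.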
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications
