MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Xiangyu Bai; He Liang; Bishoy Galoaa; Utsav Nandi; Shayda Moezzi; Yuhang He; Sarah Ostadabbas

arXiv:2512.04221 · cs.CV · December 11, 2025

TL;DR

MoReGen is a physics-grounded multi-agent framework for generating intent-aligned, physically accurate videos from text prompts, addressing the challenge of physical validity in text-to-video synthesis.

Contribution

We introduce MoReGen, a novel multi-agent, physics-aware T2V framework, and MoReSet, a benchmark dataset for evaluating physical validity in generated videos.

Findings

01 State-of-the-art T2V models often lack physical validity.

02 MoReGen achieves more physically coherent video synthesis.

03 MoReSet provides a new standard for evaluating physics in T2V.

Abstract

While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet,…
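The abstract names object-trajectory correspondence as the evaluation metric but does not define it here. The sketch below is only an illustration of the general idea, not the paper's metric: it assumes 2D per-frame object positions and scores a track by its mean per-frame Euclidean distance to a ground-truth track. The function names, the drag-free projectile model, and all parameter values are hypothetical.

```python
import math

def projectile_trajectory(v0, angle_deg, n_frames, fps=30.0, g=9.81):
    """Ideal (drag-free) projectile positions, sampled once per video frame.

    Hypothetical helper: stands in for a ground-truth Newtonian trajectory.
    """
    theta = math.radians(angle_deg)
    vx, vy = v0 * math.cos(theta), v0 * math.sin(theta)
    points = []
    for i in range(n_frames):
        t = i / fps
        points.append((vx * t, vy * t - 0.5 * g * t * t))
    return points

def mean_trajectory_error(pred, truth):
    """Mean per-frame Euclidean distance between two equal-length 2D tracks.

    Assumed stand-in for a trajectory-correspondence score; lower is better.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equal length")
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(pred)

# Ground-truth track for a 45-degree throw at 5 m/s over 30 frames.
truth = projectile_trajectory(v0=5.0, angle_deg=45.0, n_frames=30)

# A hypothetical "generated" track that ignores gravity entirely.
no_gravity = projectile_trajectory(v0=5.0, angle_deg=45.0, n_frames=30, g=0.0)

print(mean_trajectory_error(truth, truth))       # identical tracks
print(mean_trajectory_error(no_gravity, truth))  # physically invalid track
```

A track that matches the ground truth scores zero, while the gravity-free track accumulates error that grows with time. This is the kind of physical violation a trajectory-based metric can expose even when individual frames look plausible.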

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications