Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning
Nicolae Cudlenco, Mihai Masala, Marius Leordeanu

TL;DR
This paper introduces a novel multi-agent system that constructs formal event graphs for video generation, ensuring semantic reliability and executability in a 3D game engine, outperforming neural generators.
Contribution
It presents a new architecture separating narrative planning from execution, using a formal event graph and tool constraints to guarantee executable video specifications.
Findings
Agentic narratives outperform procedural baselines in quality.
Engine-generated videos surpass neural generators in physical validity.
Structured event graphs enable reliable, executable video synthesis.
Abstract
Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
