AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang

TL;DR
AniMaker is a multi-agent framework that generates coherent storytelling animations from text by using MCTS-driven clip generation and a specialized evaluation system, improving quality and efficiency over existing methods.
Contribution
This paper introduces AniMaker, the first multi-agent system combining MCTS-based clip generation and a novel animation evaluation framework for story-coherent video creation from text.
Findings
Produces higher-quality animations, as measured by the VBench and AniEval metrics.
Significantly improves multi-candidate clip generation efficiency.
Creates more coherent and story-consistent animations from text input.
Abstract
Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video…
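The idea of generating several candidate clips per shot and choosing among them with a tree search can be illustrated with a generic Monte Carlo Tree Search (UCT) loop. The sketch below is not the authors' implementation: `generate_clip`, `score_story`, and `mcts_select_clips` are hypothetical names, the generator and evaluator are toy stand-ins, and all candidates for a shot are expanded at once, which a real system with expensive video generation would likely avoid.

```python
import math
import random


def generate_clip(shot_idx, cand_idx):
    """Hypothetical stand-in for a video generator: returns a clip id."""
    return (shot_idx, cand_idx)


def score_story(clips):
    """Hypothetical stand-in for a story-aware evaluator, in [0, 1].

    Toy proxy for cross-clip consistency: fraction of adjacent clips
    that share the same candidate index.
    """
    if len(clips) < 2:
        return 1.0
    same = sum(a[1] == b[1] for a, b in zip(clips, clips[1:]))
    return same / (len(clips) - 1)


class Node:
    def __init__(self, clips, parent=None):
        self.clips = clips      # clips chosen so far, one per shot
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total = 0.0        # sum of rewards backed up through this node


def uct(child, parent_visits, c=1.4):
    """Standard UCT score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts_select_clips(num_shots, num_candidates, iterations, seed=0):
    rng = random.Random(seed)
    root = Node([])
    for _ in range(iterations):
        node = root
        # Selection: follow the best UCT child until reaching a leaf.
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # Expansion: add one child per candidate clip for the next shot.
        if len(node.clips) < num_shots:
            shot = len(node.clips)
            node.children = [
                Node(node.clips + [generate_clip(shot, k)], node)
                for k in range(num_candidates)
            ]
            node = rng.choice(node.children)
        # Rollout: complete the story with random candidates, then score it.
        clips = list(node.clips)
        while len(clips) < num_shots:
            clips.append(generate_clip(len(clips), rng.randrange(num_candidates)))
        reward = score_story(clips)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    # Read out the most-visited path as the selected clip sequence.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.clips
```

Calling `mcts_select_clips(num_shots=3, num_candidates=3, iterations=200)` returns a list of `(shot, candidate)` pairs along the most-visited path. With a small iteration budget the path may stop before the last shot, since deeper nodes may not have been expanded yet.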
Taxonomy
Topics: Human Motion and Animation · Artificial Intelligence in Games · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
