MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu

TL;DR
MAViS is a multi-agent framework that enhances long-sequence video storytelling by improving assistive capabilities, visual quality, and expressiveness through modular, collaborative stages and optimized script-tool compatibility.
Contribution
MAViS introduces a modular multi-agent framework, together with the 3E Principle and the Script Writing Guidelines, for improved long-sequence video generation.
Findings
Achieves state-of-the-art performance in assistive capability, visual quality, and expressiveness.
Enables rapid exploration of diverse visual storytelling with high-quality, complete videos.
Provides multimodal outputs including narratives and background music.
Abstract
Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools.…
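The staged pipeline and the 3E Principle described in the abstract can be pictured with a minimal Python sketch. Only the six stage names and the Explore, Examine, Enhance ordering come from the abstract. Everything else is an assumption made for illustration, not the authors' implementation: the `StageAgent` class, the `max_rounds` retry limit, the string-based toy agents, and the `run_pipeline` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage names are taken from the abstract; all other names are hypothetical.
STAGES = [
    "script writing",
    "shot designing",
    "character modeling",
    "keyframe generation",
    "video animation",
    "audio generation",
]


@dataclass
class StageAgent:
    """One pipeline stage that follows an Explore / Examine / Enhance loop."""
    name: str
    explore: Callable[[str], List[str]]  # propose candidate outputs
    examine: Callable[[str], bool]       # check whether a candidate is complete
    enhance: Callable[[str], str]        # refine a candidate that failed the check

    def run(self, upstream: str, max_rounds: int = 3) -> str:
        candidates = self.explore(upstream)
        if not candidates:
            raise ValueError(f"stage {self.name!r} produced no candidates")
        best = candidates[0]
        for candidate in candidates:
            for _ in range(max_rounds):
                if self.examine(candidate):
                    return candidate
                candidate = self.enhance(candidate)
            best = candidate  # keep the most refined attempt as a fallback
        return best


def make_toy_agent(name: str) -> StageAgent:
    """Build a stand-in agent that works on plain strings (assumption)."""
    return StageAgent(
        name=name,
        explore=lambda upstream: [f"{upstream} | {name}: draft"],
        examine=lambda c: c.endswith("final"),
        enhance=lambda c: c[: -len("draft")] + "final" if c.endswith("draft") else c,
    )


def run_pipeline(idea: str) -> str:
    """Pass the output of each stage to the next, in the order listed above."""
    output = idea
    for stage in STAGES:
        output = make_toy_agent(stage).run(output)
    return output


if __name__ == "__main__":
    print(run_pipeline("a fox"))
```

In a real system each callable would wrap a generative model or a checker rather than a string operation; the sketch only shows how a per-stage loop keeps an incomplete intermediate output from reaching the next stage.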
Taxonomy
Topics: Artificial Intelligence in Games · Video Analysis and Summarization · Human Motion and Animation
