MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Qian Wang; Ziqi Huang; Ruoxi Jia; Paul Debevec; Ning Yu

arXiv:2508.08487 · cs.CV · January 27, 2026

TL;DR

MAViS is a multi-agent framework for long-sequence video storytelling. It targets three weaknesses of existing frameworks (assistive capability, visual quality, and expressiveness) by splitting generation into modular, collaborative stages and by keeping scripts compatible with the generative tools that consume them.

Contribution

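MAViS introduces a modular multi-agent framework in which agents follow the 3E Principle (Explore, Examine, and Enhance), together with Script Writing Guidelines that keep scripts within the capabilities of current generative models for long-sequence video generation.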

Findings

01. Achieves state-of-the-art performance in assistive capability, visual quality, and expressiveness.

02. Enables rapid exploration of diverse visual storytelling with high-quality, complete videos.

03. Provides multimodal outputs including narratives and background music.

Abstract

Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools.…
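
The staged pipeline and the 3E loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the stage names and the Explore, Examine, Enhance ordering come from the abstract. The function names (`run_stage`, `run_pipeline`), the callable signatures, the `max_rounds` budget, and the assumption that each stage consumes the previous stage's output are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage names as listed in the abstract, in the same order.
STAGES = [
    "script writing",
    "shot designing",
    "character modeling",
    "keyframe generation",
    "video animation",
    "audio generation",
]

# Hypothetical agent signatures (assumptions, not from the paper).
Explore = Callable[[str, str], str]             # (stage, prior output) -> draft
Examine = Callable[[str, str], List[str]]       # (stage, draft) -> list of issues
Enhance = Callable[[str, str, List[str]], str]  # (stage, draft, issues) -> revised draft


@dataclass
class StageResult:
    stage: str
    output: str
    rounds: int  # number of Enhance passes that were applied


def run_stage(stage: str, prior: str, explore: Explore, examine: Examine,
              enhance: Enhance, max_rounds: int = 3) -> StageResult:
    """Explore a draft, then alternate Examine and Enhance until the draft
    has no reported issues or the (assumed) revision budget is spent."""
    draft = explore(stage, prior)
    rounds = 0
    while rounds < max_rounds:
        issues = examine(stage, draft)
        if not issues:
            break
        draft = enhance(stage, draft, issues)
        rounds += 1
    return StageResult(stage, draft, rounds)


def run_pipeline(idea: str, explore: Explore, examine: Examine,
                 enhance: Enhance) -> List[StageResult]:
    """Run every stage in order, feeding each stage the previous output."""
    results: List[StageResult] = []
    prior = idea
    for stage in STAGES:
        result = run_stage(stage, prior, explore, examine, enhance)
        results.append(result)
        prior = result.output
    return results


# Toy stand-ins for the agents, used only to exercise the control flow.
def toy_explore(stage: str, prior: str) -> str:
    return f"{stage} draft from [{prior}]"


def toy_examine(stage: str, draft: str) -> List[str]:
    return [] if draft.endswith("(revised)") else ["needs revision"]


def toy_enhance(stage: str, draft: str, issues: List[str]) -> str:
    return draft + " (revised)"


if __name__ == "__main__":
    for r in run_pipeline("a fox finds a lantern", toy_explore, toy_examine, toy_enhance):
        print(r.stage, r.rounds)
```

In a real system the three callables would wrap model calls or generative tools. The toy versions here only show that every stage's draft is examined before it is passed downstream, which is the property the abstract attributes to the 3E Principle.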

Taxonomy

Topics: Artificial Intelligence in Games · Video Analysis and Summarization · Human Motion and Animation