AutoMV: An Automatic Multi-Agent System for Music Video Generation

Xiaoxuan Tang; Xinping Lei; Chaoran Zhu; Shiyun Chen; Ruibin Yuan; Yizhi Li; Changjae Oh; Ge Zhang; Wenhao Huang; Emmanouil Benetos; Yang Liu; Jiaheng Liu; Yinghao Ma

arXiv:2512.12196·cs.MM·December 16, 2025

TL;DR

AutoMV is a multi-agent system that automates the creation of full-length music videos by integrating music analysis, script generation, scene creation, and quality evaluation; the authors report that it significantly outperforms existing methods and commercial products.

Contribution

This paper introduces AutoMV, the first comprehensive multi-agent framework for automatic full-length music video generation, including a new benchmark for evaluation and insights into AI-based video judging.

Findings

01

AutoMV outperforms existing methods and commercial products in quality.

02

The benchmark effectively differentiates between various M2V generation approaches.

03

Large multimodal models show potential but still underperform compared to human judges.

Abstract

Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips that fail to align visuals with musical structure, beats, or lyrics and that lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and passes these features as contextual inputs to the downstream agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling…
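The stages named in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Segment` and `Shot` schemas, the function names `plan_shots`, `render`, and `verify`, and the rule that segments with lyrics become "singer" scenes are all hypothetical. The generators and the verifier check are placeholder callables, and because the excerpt is cut off before it says what the Verifier Agent's evaluation enables, the sketch only records the verdict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    """One time-aligned slice of the song (hypothetical schema)."""
    start: float  # seconds
    end: float    # seconds
    label: str    # structural label, e.g. "verse" or "chorus"
    lyrics: str = ""


@dataclass
class Shot:
    """A planned shot for one segment (hypothetical schema)."""
    segment: Segment
    scene_type: str  # "story" or "singer"
    script: str
    camera: str
    verified: bool = False


def plan_shots(segments: List[Segment], character_bank: Dict[str, str]) -> List[Shot]:
    """Stand-in for the screenwriter and director agents."""
    lead = character_bank.get("lead", "unnamed lead")
    shots = []
    for seg in segments:
        # Assumption for illustration only: sung segments become "singer"
        # scenes and instrumental segments become "story" scenes.
        scene_type = "singer" if seg.lyrics else "story"
        camera = "medium shot" if scene_type == "singer" else "wide shot"
        shots.append(Shot(seg, scene_type, f"{lead} during the {seg.label}", camera))
    return shots


def render(shots: List[Shot], generators: Dict[str, Callable[[Shot], str]]) -> List[str]:
    """Dispatch each shot to the generator registered for its scene type."""
    return [generators[shot.scene_type](shot) for shot in shots]


def verify(shots: List[Shot], clips: List[str],
           check: Callable[[Shot, str], bool]) -> List[Shot]:
    """Stand-in for the verifier: store the check's verdict on each shot."""
    for shot, clip in zip(shots, clips):
        shot.verified = check(shot, clip)
    return shots


# Toy run with placeholder generators that return strings instead of video.
segments = [
    Segment(0.0, 12.5, "intro"),
    Segment(12.5, 40.0, "verse", "first line of lyrics"),
]
shots = plan_shots(segments, {"lead": "a singer in a red coat"})
generators = {
    "story": lambda s: f"story-clip[{s.segment.start}-{s.segment.end}]",
    "singer": lambda s: f"singer-clip[{s.segment.start}-{s.segment.end}]",
}
clips = render(shots, generators)
verify(shots, clips, lambda shot, clip: clip.startswith(shot.scene_type))
```

Keeping the character profiles in one dictionary mirrors the "shared external bank" idea only loosely: every planned shot reads the same entry, which is one simple way to keep a character's description consistent across scenes.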

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis