YingVideo-MV: Music-Driven Multi-Stage Video Generation

Jiahui Chen; Weida Wang; Runhua Shi; Huan Yang; Chaofan Ding; Zihao Chen

arXiv:2512.02492·cs.CV·December 3, 2025

Open Access

TL;DR

YingVideo-MV is a novel cascaded framework that automatically generates high-quality, music-driven long videos with synchronized camera motions, leveraging audio analysis, shot planning, and diffusion architectures for improved coherence.

Contribution

The paper introduces what its authors describe as the first cascaded approach to music-driven long-video generation, integrating camera motion control and adaptive denoising strategies to improve video quality.

Findings

01. Achieves high-quality, synchronized music videos with camera motions.

02. Outperforms existing methods in coherence and expressiveness.

03. Demonstrates effective long-sequence consistency and synchronization.

Abstract

While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we…
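The cascaded structure named in the abstract (audio analysis, then shot planning, then clip generation with long-sequence consistency) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the fixed-length segmentation, the "verse"/"chorus" labels, the camera-motion tags, and the use of trailing frames as context are hypothetical stand-ins and are not taken from the paper. Frames are plain strings rather than images.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MusicSegment:
    start: float  # seconds
    end: float    # seconds
    label: str    # hypothetical semantic tag


@dataclass
class Shot:
    start: float
    end: float
    camera_motion: str  # hypothetical motion tag


def analyze_audio(duration: float, segment_len: float = 8.0) -> List[MusicSegment]:
    """Stage 1 stand-in: fixed windows with alternating labels (an assumption)."""
    segments, start, index = [], 0.0, 0
    while start < duration:
        end = min(start + segment_len, duration)
        label = "chorus" if index % 2 else "verse"
        segments.append(MusicSegment(start, end, label))
        start, index = end, index + 1
    return segments


def plan_shots(segments: List[MusicSegment]) -> List[Shot]:
    """Stage 2 stand-in: one shot per segment, camera motion chosen by label."""
    motion_for = {"verse": "static", "chorus": "push_in"}
    return [Shot(s.start, s.end, motion_for[s.label]) for s in segments]


def generate_clip(shot: Shot, context: List[str], fps: int) -> List[str]:
    """Stage 3 stand-in: emit frame descriptors instead of pixels.

    `context` holds the last frames of the previous clip. A real generator
    would condition on it for continuity; this placeholder ignores it.
    """
    n_frames = int(round((shot.end - shot.start) * fps))
    return [f"{shot.camera_motion}@{shot.start + i / fps:.2f}s" for i in range(n_frames)]


def render_music_video(duration: float, fps: int = 2, overlap: int = 2) -> List[str]:
    """Run the three stages in sequence, passing trailing frames forward."""
    frames: List[str] = []
    for shot in plan_shots(analyze_audio(duration)):
        context = frames[-overlap:]  # empty for the first shot
        frames.extend(generate_clip(shot, context, fps))
    return frames


if __name__ == "__main__":
    video = render_music_video(20.0)
    print(len(video), video[0], video[16])
```

The point of the sketch is only the data flow: each stage consumes the previous stage's output, and each clip receives the tail of the preceding clip as context, which is one simple way a pipeline could keep long sequences coherent across shot boundaries.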

Taxonomy

Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing