VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

TL;DR
VidMuse is a framework for generating high-quality music that is semantically aligned with video, using a dataset of 360K video-music pairs and long-short-term modeling to improve coherence and diversity.
Contribution
We introduce VidMuse, a simple yet effective video-to-music generation framework that leverages a large dataset and long-short-term modeling for improved audio-visual alignment.
Findings
Outperforms existing models in audio quality and diversity
Produces music that is semantically aligned with the video content
Utilizes a large-scale dataset of 360K video-music pairs
Abstract
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets are available at https://vidmuse.github.io/.
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Multimedia Communication and Technology
